Extracting Text

JustHTML gives you a few ways to get text out of a parsed document, depending on whether you want a fast concatenation, or something structured.

1) `to_text()` (concatenated text)

Use to_text() when you want the concatenated text from a whole subtree:

Traverses descendants.
Joins text nodes using separator (default: a single space).
Strips each text node by default (strip=True) and drops empty segments.
Includes <template> contents (via template_content).
Sanitizes untrusted HTML by default (safe-by-default at construction).

from justhtml import JustHTML

doc = JustHTML("<div><h1>Title</h1><p>Hello <b>world</b></p></div>", fragment=True)
print(doc.to_text())
# => Title Hello world

from justhtml import JustHTML

untrusted = JustHTML("<p>Hello<script>alert(1)</script>World</p>", fragment=True)
print(untrusted.to_text())
# => Hello World

from justhtml import JustHTML

untrusted = JustHTML("<p>Hello<script>alert(1)</script>World</p>", fragment=True, sanitize=False)
print(untrusted.to_text())
# => Hello alert(1) World

from justhtml import JustHTML

doc = JustHTML("<p>Hello <b>world</b></p>", fragment=True)
print(doc.to_text(separator="", strip=False))
# => Hello world

The default separator=" " avoids accidentally smashing words together when the HTML splits text across nodes:

from justhtml import JustHTML

doc = JustHTML("<p>Hello<b>world</b></p>")

print(doc.to_text())
print(doc.to_text(separator="", strip=True))
# => Hello world
# => Helloworld

Block-only separators

If you use a separator like "\n" to get one “line” per block element, inline elements can split text into multiple nodes and produce extra separators:

from justhtml import JustHTML

doc = JustHTML("<p>hi</p><p>Hello <b>world</b></p>")

print(doc.to_text(separator="\n"))
# => hi
# => Hello
# => world

Use separator_blocks_only=True to apply separator only between block-level elements:

from justhtml import JustHTML

doc = JustHTML("<p>hi</p><p>Hello <b>world</b></p>")

print(doc.to_text(separator="\n", separator_blocks_only=True))
# => hi
# => Hello world

2) `to_markdown()` (GitHub Flavored Markdown)

to_markdown() outputs a pragmatic subset of GitHub Flavored Markdown (GFM) that aims to be readable and stable for common HTML.

Converts common elements like headings, paragraphs, lists, emphasis, links, and code.
Keeps tables (<table>) and images (<img>) as raw HTML.
Drops <script>, <style>, and <textarea> by default; pass html_passthrough=True to include them and their contents.

from justhtml import JustHTML

doc = JustHTML("<h1>Title</h1><p>Hello <b>world</b></p>")
print(doc.to_markdown())
# => # Title
# =>
# => Hello **world**

Example:

from justhtml import JustHTML

html = """
<div>
  <h1>Title</h1>
  <p>Hello <b>world</b> and <a href="https://example.com">links</a>.</p>
  <ul>
    <li>First item</li>
    <li>Second item</li>
  </ul>
  <pre>code block</pre>
</div>
"""

doc = JustHTML(html)
print(doc.to_markdown())

Output:

# Title

Hello **world** and [links](https://example.com).

- First item
- Second item

```
code block
```

Which should I use?

Use to_text() for the raw concatenated text of a subtree (textContent semantics).
Use to_markdown() when you want readable, structured Markdown.

This site is open source. Improve this page.

Extracting Text

1) to_text() (concatenated text)

Block-only separators

2) to_markdown() (GitHub Flavored Markdown)

Which should I use?

1) `to_text()` (concatenated text)

2) `to_markdown()` (GitHub Flavored Markdown)