← Back to docs

Extracting Text

JustHTML gives you a few ways to get text out of a parsed document, depending on whether you want a fast concatenation, or something structured.

1) to_text() (concatenated text)

Use to_text() when you want the concatenated text from a whole subtree:

from justhtml import JustHTML

doc = JustHTML("<div><h1>Title</h1><p>Hello <b>world</b></p></div>", fragment=True)
print(doc.to_text())
# => Title Hello world
from justhtml import JustHTML

untrusted = JustHTML("<p>Hello<script>alert(1)</script>World</p>", fragment=True)
print(untrusted.to_text())
# => Hello World
from justhtml import JustHTML

untrusted = JustHTML("<p>Hello<script>alert(1)</script>World</p>", fragment=True, sanitize=False)
print(untrusted.to_text())
# => Hello alert(1) World
from justhtml import JustHTML

doc = JustHTML("<p>Hello <b>world</b></p>", fragment=True)
print(doc.to_text(separator="", strip=False))
# => Hello world

The default separator=" " avoids accidentally smashing words together when the HTML splits text across nodes:

from justhtml import JustHTML

doc = JustHTML("<p>Hello<b>world</b></p>")

print(doc.to_text())
print(doc.to_text(separator="", strip=True))
# => Hello world
# => Helloworld

2) to_markdown() (GitHub Flavored Markdown)

to_markdown() outputs a pragmatic subset of GitHub Flavored Markdown (GFM) that aims to be readable and stable for common HTML.

from justhtml import JustHTML

doc = JustHTML("<h1>Title</h1><p>Hello <b>world</b></p>")
print(doc.to_markdown())
# => # Title
# =>
# => Hello **world**

Example:

from justhtml import JustHTML

html = """
<div>
  <h1>Title</h1>
  <p>Hello <b>world</b> and <a href="https://example.com">links</a>.</p>
  <ul>
    <li>First item</li>
    <li>Second item</li>
  </ul>
  <pre>code block</pre>
</div>
"""

doc = JustHTML(html)
print(doc.to_markdown())

Output:

# Title

Hello **world** and [links](https://example.com).

- First item
- Second item

```
code block
```

Which should I use?