For large files or when you don’t need the full DOM tree, use the streaming API.
The streaming parser is:

    from justhtml import stream

    html = "<html><body><p>Hello, world!</p></body></html>"

    # Each iteration yields an (event, data) pair.
    for event, data in stream(html):
        print(event, data)

`stream()` also accepts bytes-like input (`bytes`, `bytearray`, `memoryview`). In that case, the input is decoded using HTML encoding sniffing (including a windows-1252 fallback for legacy documents).

    from pathlib import Path

    from justhtml import stream

    raw = Path("page.html").read_bytes()  # bytes input triggers encoding sniffing

    for event, data in stream(raw):
        ...

To override decoding when you already know the correct encoding:

    from pathlib import Path

    from justhtml import stream

    raw = Path("page.html").read_bytes()

    # An explicit encoding skips sniffing.
    for event, data in stream(raw, encoding="utf-8"):
        ...

See Encoding & Byte Input for details.
Output of the first example (the string input `<html><body><p>Hello, world!</p></body></html>`):

    start ('html', {})
    start ('head', {})
    end head
    start ('body', {})
    start ('p', {})
    text Hello, world!
    end p
    end body
    end html

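The nesting implied by that event sequence can be made visible by tracking depth. The following is a minimal sketch, not part of justhtml: `format_outline` is a hypothetical helper, and it assumes every `"start"` event is eventually matched by an `"end"` event (void elements such as `<br>` may not follow that pattern). It works on any iterable of `(event, data)` pairs, so the sample uses hand-written events in the shape shown above.

```python
def format_outline(events, indent="  "):
    """Render (event, data) pairs as an indented outline, one line per event."""
    lines = []
    depth = 0
    for event, data in events:
        if event == "start":
            tag, _attrs = data
            lines.append(f"{indent * depth}<{tag}>")
            depth += 1
        elif event == "end":
            depth = max(depth - 1, 0)  # guard against unbalanced input
            lines.append(f"{indent * depth}</{data}>")
        elif event == "text":
            lines.append(f"{indent * depth}{data!r}")
    return lines


# Hand-written events in the same shape as the output above.
sample_events = [
    ("start", ("html", {})),
    ("start", ("head", {})),
    ("end", "head"),
    ("start", ("body", {})),
    ("start", ("p", {})),
    ("text", "Hello, world!"),
    ("end", "p"),
    ("end", "body"),
    ("end", "html"),
]

print("\n".join(format_outline(sample_events)))
```

With justhtml installed, `format_outline(stream(html))` should accept the real event stream in place of `sample_events`.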
| Event | Data Type | Description |
|---|---|---|
| `"start"` | `(tag_name, attrs_dict)` | Opening tag encountered |
| `"end"` | `tag_name` | Closing tag encountered |
| `"text"` | `str` | Text content |
| `"comment"` | `str` | HTML comment content |
| `"doctype"` | `str` | DOCTYPE name (usually `"html"`) |
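Because the shape of `data` depends on the event type, a consumer usually branches on `event` before unpacking. The sketch below illustrates one way to do that for all five event types in the table; `describe` is a hypothetical helper, not a justhtml function, and it is exercised here with hand-written events rather than real parser output.

```python
def describe(event, data):
    """Return a one-line, human-readable description of a single event."""
    if event == "start":
        tag, attrs = data  # data is a (tag_name, attrs_dict) tuple
        return f"open <{tag}> with {len(attrs)} attribute(s)"
    if event == "end":
        return f"close </{data}>"  # data is the tag name
    if event == "text":
        return f"text {data!r}"
    if event == "comment":
        return f"comment {data!r}"
    if event == "doctype":
        return f"doctype {data}"
    raise ValueError(f"unexpected event type: {event!r}")


# Hand-written events, one per row of the table.
examples = [
    ("doctype", "html"),
    ("start", ("a", {"href": "/docs"})),
    ("text", "Docs"),
    ("end", "a"),
    ("comment", " footer "),
]

for event, data in examples:
    print(describe(event, data))
```

Raising on an unknown event type is a design choice for this sketch: it surfaces surprises early instead of silently dropping events.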
Extract all links:

    from pathlib import Path

    from justhtml import stream

    html = Path("page.html").read_text()

    for event, data in stream(html):
        if event == "start":
            tag, attrs = data
            # Only anchors that actually carry an href
            if tag == "a" and "href" in attrs:
                print(attrs["href"])

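Extracted `href` values are often relative, so a common next step is resolving them against the page URL. The following sketch uses the standard library's `urllib.parse.urljoin`; `absolute_links` and the example URLs are hypothetical, and the function does not account for a `<base href>` element in the document. It accepts any iterable of `(event, data)` pairs.

```python
from urllib.parse import urljoin


def absolute_links(events, base_url):
    """Yield absolute URLs for every <a href=...> start event."""
    for event, data in events:
        if event != "start":
            continue
        tag, attrs = data
        href = attrs.get("href")
        if tag == "a" and href:  # skip anchors with a missing or empty href
            yield urljoin(base_url, href)


# Hand-written events standing in for parser output.
events = [
    ("start", ("a", {"href": "/docs"})),
    ("text", "Docs"),
    ("end", "a"),
    ("start", ("a", {})),  # anchor without href
    ("end", "a"),
    ("start", ("a", {"href": "https://other.example/x"})),
    ("end", "a"),
]

for url in absolute_links(events, "https://example.com/base/page.html"):
    print(url)
```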
Count the most common tags:

    from collections import Counter
    from pathlib import Path

    from justhtml import stream

    html = Path("page.html").read_text()
    counts = Counter()

    for event, data in stream(html):
        if event == "start":
            tag, _attrs = data
            counts[tag] += 1

    print(counts.most_common(10))

Extract all text:

    from pathlib import Path

    from justhtml import stream

    html = Path("page.html").read_text()
    text_parts = []

    for event, data in stream(html):
        if event == "text":
            text_parts.append(data)

    full_text = " ".join(text_parts)

Skip text inside `<script>` elements:

    from pathlib import Path

    from justhtml import stream

    html = Path("page.html").read_text()
    in_script = False

    for event, data in stream(html):
        if event == "start" and data[0] == "script":
            in_script = True
        elif event == "end" and data == "script":
            in_script = False
        elif event == "text" and not in_script:
            print(data)  # Only non-script text

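The same flag-based idea extends to several tag names at once. The sketch below is one possible generalization, not a justhtml feature: `visible_text` and `SKIPPED_TAGS` are hypothetical names, and a depth counter replaces the boolean so that nested skipped elements are handled. It takes any iterable of `(event, data)` pairs.

```python
SKIPPED_TAGS = frozenset({"script", "style"})


def visible_text(events, skipped=SKIPPED_TAGS):
    """Yield text events that are not inside any of the skipped tags."""
    skip_depth = 0
    for event, data in events:
        if event == "start" and data[0] in skipped:
            skip_depth += 1
        elif event == "end" and data in skipped:
            skip_depth = max(skip_depth - 1, 0)  # tolerate stray end tags
        elif event == "text" and skip_depth == 0:
            yield data


# Hand-written events standing in for parser output.
events = [
    ("start", ("p", {})),
    ("text", "Hello"),
    ("end", "p"),
    ("start", ("script", {})),
    ("text", "var x = 1;"),
    ("end", "script"),
    ("start", ("style", {})),
    ("text", "p { color: red }"),
    ("end", "style"),
    ("text", "world"),
]

print(" ".join(visible_text(events)))
```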
Use the streaming API when:
Use the DOM API (JustHTML) when:
The streaming API is faster than building a full DOM:
| API | Time (100 files) | Memory |
|---|---|---|
| `JustHTML()` | ~1.0s | Higher |
| `stream()` | ~0.7s | Lower |
For most use cases, the difference is negligible. Use whichever API fits your needs.
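To check how the two APIs compare on your own documents, a small harness built on the standard library's `time.perf_counter` and `tracemalloc` is usually enough. The sketch below is an assumption-laden illustration: `measure` is a hypothetical helper, and the two workloads are stand-ins (eager list building versus lazy iteration), not justhtml calls.

```python
import time
import tracemalloc


def measure(func, *args, repeat=3):
    """Return (best_seconds, peak_bytes) for calling func(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)

    # Measure memory in a separate run so tracing overhead does not skew timing.
    tracemalloc.start()
    func(*args)
    _current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return best, peak


def build_list(n):
    """Stand-in for an eager workload that keeps everything in memory."""
    return [str(i) for i in range(n)]


def consume_lazily(n):
    """Stand-in for a lazy workload that processes one item at a time."""
    return sum(1 for _ in (str(i) for i in range(n)))


for workload in (build_list, consume_lazily):
    seconds, peak = measure(workload, 100_000)
    print(f"{workload.__name__}: {seconds:.4f}s, peak {peak} bytes")
```

When measuring the real APIs, remember that a stream only does work as it is iterated, so the timed callable should consume it fully, for example `lambda: sum(1 for _ in stream(html))`.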