
Streaming API

For large files or when you don’t need the full DOM tree, use the streaming API.

Overview

The streaming parser is:

Basic Usage

from justhtml import stream

html = "<html><body><p>Hello, world!</p></body></html>"

for event, data in stream(html):
    print(event, data)

Byte Input and Encodings

stream() also accepts bytes (bytes, bytearray, memoryview). In that case, the input is decoded using HTML encoding sniffing (including a windows-1252 fallback for legacy documents).

from justhtml import stream
from pathlib import Path

raw = Path("page.html").read_bytes()  # bytes input, so the encoding is sniffed
for event, data in stream(raw):
    ...

To override decoding when you already know the correct encoding:

from justhtml import stream
from pathlib import Path

raw = Path("page.html").read_bytes()
for event, data in stream(raw, encoding="utf-8"):  # skip sniffing, decode as UTF-8
    ...

See Encoding & Byte Input for details.
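
To see why a legacy fallback matters, the sketch below decodes the same markup from two different byte encodings using only Python's built-in codecs. It illustrates the general idea and is not justhtml's actual sniffing algorithm; `decode_with_fallback` is a hypothetical helper, not part of the library.

```python
def decode_with_fallback(raw: bytes) -> tuple[str, str]:
    """Try UTF-8 first, then fall back to windows-1252 (hypothetical helper)."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # errors="replace" covers the few byte values windows-1252 leaves undefined.
        return raw.decode("windows-1252", errors="replace"), "windows-1252"


legacy = "<p>café</p>".encode("windows-1252")  # contains the single byte 0xE9
modern = "<p>café</p>".encode("utf-8")         # contains the two bytes 0xC3 0xA9

print(decode_with_fallback(legacy))
print(decode_with_fallback(modern))
```

Both inputs decode to the same text, but only because the legacy bytes were retried with windows-1252 after failing as UTF-8. When you already know the encoding, passing encoding= to stream() avoids any guessing.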

Output of the Basic Usage example:

start ('html', {})
start ('head', {})
end head
start ('body', {})
start ('p', {})
text Hello, world!
end p
end body
end html

Events

Event      Data Type                Description
"start"    (tag_name, attrs_dict)   Opening tag encountered
"end"      tag_name                 Closing tag encountered
"text"     str                      Text content
"comment"  str                      HTML comment content
"doctype"  str                      DOCTYPE name (usually "html")
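
Because the shape of data depends on the event, a handler usually branches on the event name before unpacking. The following is a minimal sketch of that pattern, assuming only the five events listed in the table; `describe` is a hypothetical helper, and the sample list is hand-written in the documented shape so the snippet runs without parsing anything.

```python
def describe(event, data):
    """Return a one-line description of a single (event, data) pair."""
    if event == "start":
        tag, attrs = data            # "start" carries a (tag_name, attrs_dict) tuple
        return f"<{tag}> with {len(attrs)} attribute(s)"
    if event == "end":
        return f"</{data}>"          # "end" carries just the tag name
    if event in ("text", "comment", "doctype"):
        return f"{event}: {data!r}"  # these three carry a plain string
    return f"unknown event {event!r}"


# Hand-written sample events; in real use, iterate over stream(html) instead.
sample = [
    ("doctype", "html"),
    ("start", ("a", {"href": "/docs"})),
    ("text", "Docs"),
    ("end", "a"),
    ("comment", " footer "),
]
for event, data in sample:
    print(describe(event, data))
```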

Examples

Extract Links

from justhtml import stream
from pathlib import Path

html = Path("page.html").read_text()

for event, data in stream(html):
    if event == "start":
        tag, attrs = data
        if tag == "a" and "href" in attrs:
            print(attrs["href"])
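
Extracted href values are often relative. One possible follow-up, sketched here with the standard library's urllib.parse.urljoin, resolves them against the page URL. `absolute_links` and the example URLs are hypothetical, and the sample events are hand-written in the documented shape; passing stream(html) as the first argument should work the same way.

```python
from urllib.parse import urljoin


def absolute_links(events, base_url):
    """Yield each <a href> value resolved against base_url."""
    for event, data in events:
        if event == "start":
            tag, attrs = data
            href = attrs.get("href")
            if tag == "a" and href:   # skip anchors with a missing or empty href
                yield urljoin(base_url, href)


sample = [
    ("start", ("a", {"href": "/about"})),
    ("end", "a"),
    ("start", ("a", {"href": "https://other.example/x"})),
    ("end", "a"),
    ("start", ("a", {"name": "top"})),   # no href, so it is skipped
    ("end", "a"),
]
links = list(absolute_links(sample, "https://example.com/docs/index.html"))
print(links)
```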

Count Elements

from justhtml import stream
from collections import Counter

counts = Counter()

for event, data in stream(html):
    if event == "start":
        tag, attrs = data
        counts[tag] += 1

print(counts.most_common(10))

Extract Text Content

from justhtml import stream

text_parts = []
for event, data in stream(html):
    if event == "text":
        text_parts.append(data)

full_text = " ".join(text_parts)
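
Text events keep the whitespace found in the source, so the joined string can contain newlines and runs of spaces. A small sketch of one way to tidy it is shown below; `collapse_whitespace` is a hypothetical helper and the sample chunks are made up for illustration.

```python
def collapse_whitespace(parts):
    """Join text chunks and squeeze every run of whitespace to a single space."""
    return " ".join(" ".join(parts).split())


sample_parts = ["Hello,\n   world!", "  ", "Second\tparagraph. "]
clean_text = collapse_whitespace(sample_parts)
print(clean_text)
```

This is only appropriate when whitespace carries no meaning; text inside pre elements, for example, would lose its layout.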

Filter by Tag

from justhtml import stream

in_script = False
for event, data in stream(html):
    if event == "start" and data[0] == "script":
        in_script = True
    elif event == "end" and data == "script":
        in_script = False
    elif event == "text" and not in_script:
        print(data)  # Only non-script text
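
The boolean flag above handles a single tag name. A possible generalization, sketched below, skips several tags at once and uses a counter instead of a flag. It assumes each skipped "start" event is eventually followed by a matching "end" event; `visible_text` is a hypothetical helper and the sample events are hand-written in the documented shape.

```python
def visible_text(events, skip=("script", "style")):
    """Yield text chunks that are not inside any tag named in skip."""
    depth = 0  # number of skipped elements currently open
    for event, data in events:
        if event == "start" and data[0] in skip:
            depth += 1
        elif event == "end" and data in skip:
            depth = max(depth - 1, 0)  # tolerate a stray end event
        elif event == "text" and depth == 0:
            yield data


sample = [
    ("start", ("p", {})),
    ("text", "Hi"),
    ("end", "p"),
    ("start", ("script", {})),
    ("text", "var x = 1;"),
    ("end", "script"),
    ("start", ("style", {})),
    ("text", "p { color: red }"),
    ("end", "style"),
    ("text", "Bye"),
]
visible = list(visible_text(sample))
print(visible)
```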

When to Use Streaming

Use the streaming API when:

Use the DOM API (JustHTML) when:

Performance

The streaming API is faster than building a full DOM:

API         Time (100 files)  Memory
JustHTML()  ~1.0s             Higher
stream()    ~0.7s             Lower

For most use cases, the difference is negligible. Use whichever API fits your needs.