Encoding & Byte Input

JustHTML can parse both Unicode strings (str) and raw byte streams (bytes, bytearray, memoryview).

If you pass bytes, JustHTML will sniff and decode the input using the HTML Standard’s encoding rules.

When Encoding Sniffing Happens

If html is a str: no sniffing/decoding happens (it’s already decoded).
If html is bytes-like: JustHTML decodes it into a str before tokenization.

The chosen encoding is exposed as doc.encoding when you use JustHTML(...).

Why the Default Is `windows-1252`

If no encoding information is found, HTML parsing defaults to Windows-1252 (often called “cp1252”). This can be surprising if you expect UTF-8 everywhere, but it’s important for legacy HTML:

Many older documents were authored as “Latin-1” without an explicit encoding.
Browsers historically treated this as Windows-1252, not ISO-8859-1.
Using the same default makes JustHTML behave like browsers on real-world old documents.

What JustHTML Looks At (High Level)

For byte input, JustHTML follows the standard precedence:

Transport encoding override (what you pass as encoding=)
BOM (byte order mark)
**<meta charset=...> / <meta http-equiv=... content=...> in the initial bytes
Fallback to windows-1252

JustHTML also treats utf-7 labels as unsafe and falls back to windows-1252.

How To Control It

1) Let JustHTML Sniff (recommended for unknown/legacy HTML)

from justhtml import JustHTML
from pathlib import Path

data = Path("page.html").read_bytes()

doc = JustHTML(data)
print(doc.encoding)

2) Override With a Known Encoding

If you already know the correct encoding (e.g. from HTTP headers, file metadata, or your application protocol), pass it as encoding=.

from justhtml import JustHTML
from pathlib import Path

data = Path("page.html").read_bytes()

doc = JustHTML(data, encoding="utf-8")

3) Decode Yourself (when you want full control)

from justhtml import JustHTML
from pathlib import Path

data = Path("page.html").read_bytes()
html = data.decode("utf-8", errors="replace")

doc = JustHTML(html)

Streaming API

The streaming API supports the same byte-input behavior:

from justhtml import stream
from pathlib import Path

for event, data in stream(Path("page.html").read_bytes()):
    ...

To override the encoding:

from justhtml import stream
from pathlib import Path

for event, data in stream(Path("page.html").read_bytes(), encoding="utf-8"):
    ...

This site is open source. Improve this page.