
Encoding & Byte Input

JustHTML can parse both Unicode strings (str) and raw byte streams (bytes, bytearray, memoryview).

If you pass bytes, JustHTML will sniff and decode the input using the HTML Standard’s encoding rules.

When Encoding Sniffing Happens

Sniffing applies only to byte input; a str is already decoded text. The chosen encoding is exposed as doc.encoding when you use JustHTML(...).

Why the Default Is windows-1252

If no encoding information is found, HTML parsing defaults to Windows-1252 (often called “cp1252”). This can be surprising if you expect UTF-8 everywhere, but it matters for legacy HTML, much of which was written in that encoding without ever declaring it.
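To make the difference concrete, here is a small sketch that uses only Python's built-in codecs (not JustHTML itself). It shows that legacy bytes are not valid UTF-8, and that UTF-8 bytes read as windows-1252 decode without error but produce mojibake.

```python
# The same text, stored in two different encodings.
legacy_bytes = "café".encode("windows-1252")  # b"caf\xe9"
modern_bytes = "café".encode("utf-8")         # b"caf\xc3\xa9"

# Legacy bytes are not valid UTF-8, so a strict UTF-8 decode fails...
try:
    legacy_bytes.decode("utf-8")
    legacy_is_valid_utf8 = True
except UnicodeDecodeError:
    legacy_is_valid_utf8 = False

# ...but they decode cleanly as windows-1252.
legacy_text = legacy_bytes.decode("windows-1252")

# UTF-8 bytes read as windows-1252 give mojibake rather than an error.
mojibake = modern_bytes.decode("windows-1252")

print(legacy_is_valid_utf8, legacy_text, mojibake)
```

The second case is the one to watch for: a UTF-8 page passed as bytes with no BOM, no meta declaration, and no override will not fail loudly under a windows-1252 fallback.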

What JustHTML Looks At (High Level)

For byte input, JustHTML follows the standard precedence:

  1. Transport encoding override (what you pass as encoding=)
  2. BOM (byte order mark)
  3. <meta charset=...> / <meta http-equiv=... content=...> in the initial bytes
  4. Fallback to windows-1252
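The precedence above can be sketched in plain Python. This is a simplified illustration, not JustHTML's implementation: pick_encoding is a hypothetical helper, the meta scan is a naive regular expression rather than the HTML Standard's prescan algorithm, and the 1024-byte window is an assumption.

```python
import codecs
import re
from typing import Optional

# BOMs checked in this sketch, paired with the encoding each one implies.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16le"),
    (codecs.BOM_UTF16_BE, "utf-16be"),
)

# Deliberately naive: matches charset=... anywhere inside a <meta ...> tag.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)",
    re.IGNORECASE,
)


def pick_encoding(data: bytes, transport: Optional[str] = None) -> str:
    """Hypothetical helper: return an encoding label for `data`."""
    # 1. An explicit override wins.
    if transport:
        return transport.lower()
    # 2. Byte order mark.
    for bom, label in _BOMS:
        if data.startswith(bom):
            return label
    # 3. <meta> declaration in the initial bytes (window size is assumed).
    match = _META_CHARSET.search(data[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Nothing found: fall back.
    return "windows-1252"


print(pick_encoding(b"<p>hi</p>"))
print(pick_encoding(b'<meta charset="UTF-8"><p>hi</p>'))
```

Each step returns as soon as it finds an answer, which is what gives earlier sources priority over later ones.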

JustHTML also treats utf-7 labels as unsafe and falls back to windows-1252.
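As an illustration of why utf-7 is considered unsafe, the following sketch uses Python's built-in utf-7 codec (not JustHTML). The payload is plain ASCII with no angle brackets, so a byte-level filter looking for tags would miss it, yet it decodes to markup.

```python
# ASCII-only bytes that contain no "<" or ">" characters at all.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

has_angle_brackets = b"<" in payload or b">" in payload

# Decoded as UTF-7, the same bytes turn into a script element.
decoded = payload.decode("utf-7")

print(has_angle_brackets)
print(decoded)
```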

How To Control It

1) Let JustHTML Sniff the Encoding (default)

from justhtml import JustHTML
from pathlib import Path

data = Path("page.html").read_bytes()

doc = JustHTML(data)
print(doc.encoding)

2) Override With a Known Encoding

If you already know the correct encoding (e.g. from HTTP headers, file metadata, or your application protocol), pass it as encoding=.

from justhtml import JustHTML
from pathlib import Path

data = Path("page.html").read_bytes()

doc = JustHTML(data, encoding="utf-8")
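When the encoding comes from an HTTP Content-Type header, the charset parameter has to be extracted first. A minimal sketch using only the standard library follows; charset_from_content_type is a hypothetical helper, and skipping encoding= when no charset is present is an assumption about how you might want to handle that case.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Hypothetical helper: extract the charset parameter, if any."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lowercased label, or None


charset = charset_from_content_type("text/html; charset=ISO-8859-2")
print(charset)

# With the response body bytes in `data`, the label can be passed as above:
#     doc = JustHTML(data, encoding=charset) if charset else JustHTML(data)
```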

3) Decode Yourself (when you want full control)

from justhtml import JustHTML
from pathlib import Path

data = Path("page.html").read_bytes()
html = data.decode("utf-8", errors="replace")

doc = JustHTML(html)
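Decoding yourself also lets you choose your own fallback policy. The sketch below tries strict UTF-8 first and falls back to windows-1252 only if that fails. decode_html is a hypothetical helper, and this policy is an assumption for illustration, not what JustHTML does for byte input.

```python
def decode_html(data: bytes) -> str:
    """Hypothetical helper: strict UTF-8 first, windows-1252 as a fallback."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # A few bytes (e.g. 0x81) are unassigned in Python's cp1252 codec,
        # so errors="replace" keeps this path from raising.
        return data.decode("windows-1252", errors="replace")


print(decode_html("café".encode("utf-8")))
print(decode_html(b"caf\xe9"))
```

The resulting str can then be passed to JustHTML(html) exactly as in the example above.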

Streaming API

The streaming API supports the same byte-input behavior:

from justhtml import stream
from pathlib import Path

for event, data in stream(Path("page.html").read_bytes()):
    ...

To override the encoding:

from justhtml import stream
from pathlib import Path

for event, data in stream(Path("page.html").read_bytes(), encoding="utf-8"):
    ...