JustHTML can parse both Unicode strings (str) and raw byte streams (bytes, bytearray, memoryview).
If you pass bytes, JustHTML will sniff and decode the input using the HTML Standard’s encoding rules.
- If html is a str: no sniffing/decoding happens (it’s already decoded).
- If html is bytes-like: JustHTML decodes it into a str before tokenization.
- The chosen encoding is exposed as doc.encoding when you use JustHTML(...).
If no encoding information is found, HTML parsing defaults to Windows-1252 (often called “cp1252”). This can be surprising if you expect UTF-8 everywhere, but it’s important for legacy HTML.
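To make the fallback concrete, the following standard-library sketch (it does not call JustHTML) decodes the same kind of text two ways. Python's cp1252 codec is used here as a stand-in for windows-1252, and the sample strings are purely illustrative.

```python
# Bytes as a legacy page might store "café": 0xE9 is "é" in windows-1252.
legacy = b"caf\xe9"

# Decoding with the windows-1252 fallback recovers the intended text.
as_cp1252 = legacy.decode("cp1252")

# The same bytes are not valid UTF-8, so a strict UTF-8 decode fails.
try:
    legacy.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# The reverse mistake: UTF-8 bytes read as windows-1252 produce mojibake.
mojibake = "café".encode("utf-8").decode("cp1252")

print(as_cp1252, utf8_ok, mojibake)
```

If a UTF-8 page carries no encoding information at all, the last case is what the fallback would produce, which is why declaring or passing the encoding matters.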
For byte input, JustHTML follows the standard precedence:
- An explicit encoding override (encoding=)
- <meta charset=...> / <meta http-equiv=... content=...> in the initial bytes
- windows-1252 as the fallback

JustHTML also treats utf-7 labels as unsafe and falls back to windows-1252.
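The precedence can be modelled in a few lines. This is a simplified sketch under stated assumptions, not JustHTML's implementation: pick_encoding and its regex-based meta scan are hypothetical, the 1024-byte window is an assumed size for "the initial bytes", and the HTML Standard's actual prescan algorithm is more involved.

```python
import re
from typing import Optional

# Hypothetical, simplified meta scan: finds charset=... inside a <meta> tag.
_META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)

FALLBACK = "windows-1252"


def pick_encoding(data: bytes, override: Optional[str] = None) -> str:
    """Model the precedence: override, then <meta>, then the fallback."""
    label = override
    if label is None:
        # Assumption: only the first 1024 bytes are scanned.
        match = _META_CHARSET.search(data[:1024])
        if match:
            label = match.group(1).decode("ascii")
    if label is None:
        return FALLBACK
    label = label.strip().lower()
    # utf-7 labels are treated as unsafe and replaced by the fallback.
    if label.replace("_", "-") in ("utf-7", "utf7"):
        return FALLBACK
    return label


print(pick_encoding(b'<meta charset="utf-8"><p>hi</p>'))
print(pick_encoding(b"<p>no declaration</p>"))
print(pick_encoding(b'<meta charset="utf-7">'))
```

The sketch returns labels lowercased but otherwise unnormalized; mapping labels to canonical encoding names is left out for brevity.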
from justhtml import JustHTML
from pathlib import Path
data = Path("page.html").read_bytes()
doc = JustHTML(data)
print(doc.encoding)
If you already know the correct encoding (e.g. from HTTP headers, file metadata, or your application protocol), pass it as encoding=.
from justhtml import JustHTML
from pathlib import Path
data = Path("page.html").read_bytes()
doc = JustHTML(data, encoding="utf-8")
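One common source for that value is the charset parameter of an HTTP Content-Type header. Below is a minimal sketch using only the standard library; charset_from_content_type is a hypothetical helper, not part of JustHTML, and the header value is made up for illustration.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


charset = charset_from_content_type("text/html; charset=ISO-8859-1")
missing = charset_from_content_type("text/html")
print(charset, missing)

# The result could then be passed along as JustHTML(data, encoding=charset).
# When it is None, leaving encoding= out lets the sniffing rules apply.
```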
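Alternatively, decode the bytes yourself and pass the resulting str, in which case no sniffing happens:
from justhtml import JustHTML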
from pathlib import Path
data = Path("page.html").read_bytes()
html = data.decode("utf-8", errors="replace")
doc = JustHTML(html)
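The errors= argument used here is Python's own bytes.decode parameter, not a JustHTML option. Here is a short sketch of how its two most common values behave on a byte that is invalid in UTF-8 (the sample bytes are illustrative):

```python
data = b"caf\xe9 au lait"  # 0xE9 followed by a space is not valid UTF-8

# errors="replace" substitutes U+FFFD for the bad byte and keeps decoding.
replaced = data.decode("utf-8", errors="replace")

# The default, errors="strict", raises and reports where decoding failed.
try:
    data.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

print(repr(replaced))
print(strict_error.start if strict_error else "decoded cleanly")
```

Replacement keeps parsing going at the cost of losing the original character, so it is best reserved for cases where the encoding is known and only stray bytes are expected.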
The streaming API supports the same byte-input behavior:
from justhtml import stream
from pathlib import Path
for event, data in stream(Path("page.html").read_bytes()):
    ...
To override the encoding:
from justhtml import stream
from pathlib import Path
for event, data in stream(Path("page.html").read_bytes(), encoding="utf-8"):
    ...