JustHTML is the only pure-Python HTML5 parser that passes 100% of the official html5lib test suite. This page explains how we verify and maintain that compliance.
The html5lib-tests repository is the gold standard for HTML5 parsing compliance. It’s used by browser vendors to verify their implementations against the WHATWG HTML5 specification.
The suite contains thousands of test cases split across tokenizer, tree-construction, serializer, and encoding fixtures.

The tests verify correct handling of:

- Implied tags: `<html>`, `<head>`, and `<body>` are auto-inserted when omitted
- Foster parenting of content misplaced inside a `<table>`
- Character references: named (e.g. `&amp;` → &), numeric (e.g. `&#65;` → A), and edge cases
- Encoding detection: `<meta charset=...>`, transport overrides (`encoding=`), and the windows-1252 fallback

Here’s what a test case looks like (from `tests1.dat`):
```
#data
<b><p></b></i>
#errors
(1:9) Unexpected end tag </i>
#document
| <html>
|   <head>
|   <body>
|     <b>
|     <p>
|       <b>
```
This tests the adoption agency algorithm: when `</b>` is encountered inside `<p>`, the parser doesn’t simply close the open `<b>`. Instead, it splits the formatting across the block element boundary, closing the original `<b>` and reopening a new one inside the `<p>`.
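You can observe this split with any compliant parser. Here is a minimal sketch using html5lib (one of the parsers compared below; requires `pip install html5lib`), shown purely for illustration rather than JustHTML’s own API:

```python
# Illustration only: html5lib exhibits the same adoption-agency split
# for this input.
import html5lib

tree = html5lib.parse("<b><p></b></i>", namespaceHTMLElements=False)
body = tree.find("body")
for child in body:
    print(child.tag, [grandchild.tag for grandchild in child])
# b []     -> the original <b>, closed by </b>
# p ['b']  -> a fresh <b> reopened inside <p>
```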
We run the same test suite against other Python parsers to compare compliance:
| Parser | Tests Passed | Compliance | Notes |
|---|---|---|---|
| JustHTML | 1743/1743 | 100% | Full spec compliance |
| html5lib | 1538/1743 | 88% | Reference implementation, but incomplete |
| html5_parser | 1462/1743 | 84% | C-based (Gumbo), mostly correct |
| selectolax | 1187/1743 | 68% | C-based (Lexbor), fast but less compliant |
| BeautifulSoup | 78/1743 | 4% | Uses html.parser, not HTML5 compliant |
| html.parser | 77/1743 | 4% | Python stdlib, basic error recovery only |
| lxml | 13/1743 | 1% | XML-based, not HTML5 compliant |
Run `python benchmarks/correctness.py` to reproduce these results.
These numbers come from a strict tree comparison against the expected output in the html5lib-tests tree-construction fixtures (excluding `#script-on`/`#script-off` cases). They will not match the html5lib project’s own reported totals, because html5lib runs the suite in multiple configurations and also has its own skip/xfail lists.
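For orientation, here is a minimal reader for the `#data`/`#errors`/`#document` layout shown above. It is a sketch only; the real harness in `run_tests.py` also handles `#script-on`/`#script-off`, fragment cases, and multi-line text nodes:

```python
def parse_dat(path):
    """Split a .dat fixture file into records keyed by section name."""
    cases, current, section = [], {}, None
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line == "#data" and current:
                cases.append(current)          # a new record begins
                current, section = {}, None
            if line.startswith("#"):
                section = line[1:]
                current[section] = []
            elif section is not None and line:
                current[section].append(line)
    if current:
        cases.append(current)
    return cases

cases = parse_dat("tests/html5lib-tests-tree/tests1.dat")
print("\n".join(cases[0]["data"]))        # the input markup
print("\n".join(cases[0]["document"]))    # the expected serialized tree
```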
We run the complete html5lib test suite on every commit:

```
python run_tests.py
```

To run only a single suite (useful for faster iteration), use `--suite`:

```
python run_tests.py --suite tree
python run_tests.py --suite justhtml
python run_tests.py --suite tokenizer
python run_tests.py --suite serializer
python run_tests.py --suite encoding
python run_tests.py --suite unit
```
Output:

```
PASSED: 9k+ tests (100%), a few skipped
```
The skipped tests are scripted (`#script-on`) cases that require JavaScript execution during parsing.
Per-file results are also written to `test-summary.txt`, with suite prefixes like `html5lib-tests-tree/...`, `html5lib-tests-tokenizer/...`, `html5lib-tests-serializer/...`, `html5lib-tests-encoding/...`, and `justhtml-tests/...`.
The encoding coverage comes from both:

- The `html5lib-tests/encoding` fixtures (exposed in this repo as `tests/html5lib-tests-encoding/...`).
- Our own unit tests (`tests/test_encoding.py`), which exercise byte input, encoding label normalization, BOM handling, and meta charset prescanning.
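The sniffing precedence these tests pin down can be sketched in a few lines. This is a simplification under our own names, not JustHTML’s implementation; in particular, the real `<meta>` prescan tokenizes attributes properly:

```python
import codecs

def sniff_encoding(data, transport=None):
    """Sketch of the HTML5 encoding-sniffing order (simplified)."""
    # 1. A byte order mark wins over everything else.
    for bom, name in ((codecs.BOM_UTF8, "utf-8"),
                      (codecs.BOM_UTF16_LE, "utf-16-le"),
                      (codecs.BOM_UTF16_BE, "utf-16-be")):
        if data.startswith(bom):
            return name
    # 2. Then an encoding from the transport layer (e.g. an HTTP header).
    if transport:
        return transport
    # 3. Then a prescan of the first bytes for <meta charset=...>
    #    (grossly simplified here).
    head = data[:1024].lower()
    i = head.find(b"charset=")
    if i != -1:
        rest = head[i + len(b"charset="):]
        if rest[:1] in (b'"', b"'"):
            value = rest[1:].split(rest[:1])[0]
        else:
            value = rest.split(b">")[0].split(b";")[0].strip()
        return value.decode("ascii", "replace")
    # 4. Otherwise fall back to windows-1252.
    return "windows-1252"

print(sniff_encoding(b'<meta charset="utf-8"><p>hi'))  # utf-8
print(sniff_encoding(b"<p>plain"))                     # windows-1252
```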
Every line and branch of code is covered by tests. We enforce this in CI:

```
coverage run run_tests.py && coverage report --fail-under=100
```
This isn’t just vanity: during development, we discovered that uncovered code was often dead code. Removing it made the parser faster and cleaner.
We generate random malformed HTML to find crashes and hangs:

```
python benchmarks/fuzz.py -n 3000000
```
Output:

```
============================================================
FUZZING RESULTS: justhtml
============================================================
Total tests: 3000000
Successes: 3000000
Crashes: 0
Hangs (>5s): 0
Total time: 928s
Tests/second: 3232
```
The fuzzer generates truly nasty edge cases: raw replacement characters (`�`), misnested formatting tags (`<b><p></b></i>`), and similar deliberately malformed input.
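In spirit, the loop looks like the sketch below. The generator alphabet and import path are illustrative assumptions; the real `benchmarks/fuzz.py` is more elaborate:

```python
import random
import signal

from justhtml import JustHTML  # import path assumed

PIECES = ["<", ">", "&", ";", "/", "=", '"', "'", "\ufffd",
          "<b>", "</b>", "<p>", "<table>", "<!--", "&#x41"]

def random_html(rng, max_pieces=64):
    """Glue random fragments together into deliberately malformed markup."""
    return "".join(rng.choice(PIECES) for _ in range(rng.randrange(max_pieces)))

def handle_alarm(signum, frame):
    raise TimeoutError("parser exceeded the 5s hang budget")

signal.signal(signal.SIGALRM, handle_alarm)  # POSIX-only hang detection

def fuzz(n=10_000):
    rng = random.Random(0)  # fixed seed keeps failures reproducible
    for _ in range(n):
        snippet = random_html(rng)
        signal.alarm(5)
        try:
            JustHTML(snippet)  # must never raise, whatever the input
        finally:
            signal.alarm(0)

fuzz()
```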
We also maintain additional tests in `tests/justhtml-tests/` for cases the upstream suite doesn’t cover.

To reproduce the full results from a fresh checkout:

```
# Clone the test suite (one-time setup)
cd ..
git clone https://github.com/html5lib/html5lib-tests.git
cd justhtml

# Create symlinks
cd tests
ln -s ../../html5lib-tests/tokenizer html5lib-tests-tokenizer
ln -s ../../html5lib-tests/tree-construction html5lib-tests-tree
ln -s ../../html5lib-tests/serializer html5lib-tests-serializer
ln -s ../../html5lib-tests/encoding html5lib-tests-encoding
cd ..
```
```
# Run all tests
python run_tests.py

# Verbose output with diffs
python run_tests.py -v

# Run specific test file
python run_tests.py --test-specs test2.test:5,10

# Stop on first failure
python run_tests.py -x

# Check for regressions against baseline
python run_tests.py --regressions
```
Compare against other parsers:

```
python benchmarks/correctness.py
```
HTML5 parsing is notoriously complex. The spec describes an intricate state machine with dozens of tokenizer states, 23 tree-construction insertion modes, and special-case machinery such as the adoption agency algorithm and foster parenting.
Getting 99% compliance means you’re still breaking on real-world edge cases. Browsers pass 100% because they have to, and now JustHTML does too.
Beyond tree construction, we’re working to standardize parse error reporting. The HTML5 spec defines specific error codes for malformed input, but those codes mostly cover tokenizer-level errors, and few parsers surface them at all.
JustHTML uses kebab-case error codes matching the WHATWG spec where possible:
```python
from justhtml import JustHTML  # import path assumed

doc = JustHTML("<p>Hello", collect_errors=True)
for error in doc.errors:
    print(f"{error.line}:{error.column} {error.code}")
# Output: 1:9 expected-closing-tag-but-got-eof
```
Our error codes are centralized in `src/justhtml/errors.py` with human-readable messages. This makes it possible to document every code in one place and to match on codes rather than message strings.
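For example, stable codes make it easy to aggregate diagnostics without string-matching messages (the import path is an assumption; `doc.errors` and `error.code` are shown above):

```python
from collections import Counter

from justhtml import JustHTML  # import path assumed

# Tally how often each error code appears in a document.
doc = JustHTML("<table><p>misplaced", collect_errors=True)
print(Counter(error.code for error in doc.errors))
```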
See Error Codes for the complete list.