JustHTML is the only pure-Python HTML5 parser that passes 100% of the official html5lib test suite. This page explains how we verify and maintain that compliance.
The html5lib-tests repository is the gold standard for HTML5 parsing compliance. It’s used by browser vendors to verify their implementations against the WHATWG HTML5 specification.
Our checked-in test inputs contain:
The tests verify correct handling of:
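- Implicit tags: <html>, <head>, and <body> are auto-inserted
- Table handling (<table>)
- Character references: named (such as &amp;), numeric (such as &#65; for A), and edge cases
- Encoding detection: <meta charset=...>, transport overrides (encoding=), and windows-1252 fallback

Here’s what a test case looks like (from tests1.dat):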
#data
<b><p></b></i>
#errors
(1:9) Unexpected end tag </i>
#document
| <html>
|   <head>
|   <body>
|     <b>
|     <p>
|       <b>
This tests the adoption agency algorithm - when </b> is encountered inside <p>, the browser doesn’t just close <b>. Instead, it splits the formatting across the block element boundary.
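To make the fixture layout concrete, here is a minimal sketch of a reader for this format. It is not the loader used by run_tests.py; the name parse_dat and the returned dictionary shape are assumptions made for illustration, and the sketch ignores the corner case of an input line that is itself identical to a section marker.

```python
# Section markers recognized by this sketch; "#data" always starts a new case.
MARKERS = ("#data", "#errors", "#new-errors", "#document-fragment",
           "#script-on", "#script-off", "#document")


def parse_dat(text):
    """Split .dat text into a list of {marker: [lines]} dictionaries."""
    cases = []
    section = None
    for line in text.split("\n"):
        if line == "#data":
            cases.append({marker: [] for marker in MARKERS})
            section = line
        elif line in MARKERS and cases:
            section = line
        elif cases:
            cases[-1][section].append(line)
    for case in cases:
        # Blank lines after the tree only separate cases, so drop them.
        while case["#document"] and case["#document"][-1] == "":
            case["#document"].pop()
    return cases


sample = """#data
<b><p></b></i>
#errors
(1:9) Unexpected end tag </i>
#document
| <html>
|   <head>
|   <body>
|     <b>
|     <p>
|       <b>
"""

case = parse_dat(sample)[0]
print(case["#data"])           # the input markup, one entry per line
print(len(case["#document"]))  # number of expected tree lines
```

In the expected tree, each extra level of nesting adds two spaces after the leading pipe, which is how the second <b> is shown as a child of <p>.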
We run the same test suite against other Python parsers to compare compliance. The cross-parser snapshot below used the 1,743 cases available when it was recorded; the current JustHTML gate covers 1,791 enabled cases.
| Parser | Tests Passed | Compliance | Notes |
|---|---|---|---|
| JustHTML | 1743/1743 | 100% | Full spec compliance in this comparison snapshot; current gate: 1791/1791 |
| selectolax | 1743/1743 | 100% | C-based (Lexbor), fast and spec-compliant with dev html5test output API |
| markupever | 1545/1743 | 89% | Rust-based (html5ever), mostly correct |
| html5lib | 1496/1743 | 86% | Reference implementation, but incomplete |
| html5_parser | 862/1743 | 49% | C-based (Gumbo), fast but loses exposed tree information |
| BeautifulSoup | 6/1743 | <1% | Uses html.parser, not HTML5 compliant |
| html.parser | 6/1743 | <1% | Python stdlib, basic error recovery only |
| lxml | 5/1743 | <1% | XML-based, not HTML5 compliant |
Run python benchmarks/correctness.py to reproduce these results. The selectolax score requires its dev html5test output and fragment-context APIs. These scores were refreshed against html5lib-tests e446320.
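The Compliance column is simple arithmetic over the pass counts. The sketch below reproduces it under an assumed rounding rule (nearest whole percent, with anything above zero but below 1% shown as "<1%"); the helper name compliance is hypothetical and is not taken from benchmarks/correctness.py.

```python
def compliance(passed, total):
    """Format a pass count as a whole-percent compliance string."""
    pct = 100 * passed / total
    if 0 < pct < 1:
        return "<1%"
    return f"{round(pct)}%"


# Pass counts copied from the comparison table above.
snapshot = {
    "JustHTML": 1743,
    "markupever": 1545,
    "html5lib": 1496,
    "html5_parser": 862,
    "lxml": 5,
}

for name, passed in snapshot.items():
    print(f"{name}: {passed}/1743 = {compliance(passed, 1743)}")
```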
These numbers come from a strict tree comparison against the expected output in the html5lib-tests tree-construction fixtures (excluding #script-on / #script-off cases). Unsupported parser capabilities count as failures for this compliance table. The numbers will not match the html5lib project’s own reported totals, because html5lib runs the suite in multiple configurations and also has its own skip/xfail lists.
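A strict tree comparison can be sketched as serializing the parsed tree into the fixture's text format and requiring an exact match. The Element class, dump, and passes below are hypothetical names for illustration only; they are not JustHTML's node type or the benchmark's code, and the sketch handles elements only (real fixtures also contain attributes, text, comments, and doctypes).

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    # Hypothetical minimal node type, not JustHTML's node class.
    name: str
    children: list = field(default_factory=list)


def dump(node, depth=0):
    """Serialize a tree as html5lib-style lines, two spaces per level."""
    lines = ["| " + "  " * depth + f"<{node.name}>"]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


def passes(root, expected_text):
    """Strict comparison: every line must match exactly."""
    return dump(root) == expected_text.rstrip("\n").split("\n")


tree = Element("html", [
    Element("head"),
    Element("body", [Element("b"), Element("p", [Element("b")])]),
])

expected = "\n".join([
    "| <html>",
    "|   <head>",
    "|   <body>",
    "|     <b>",
    "|     <p>",
    "|       <b>",
])

print(passes(tree, expected))
```

Because the comparison is on the whole serialized tree, a parser that builds the right elements but nests one of them differently still counts as a failure.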
We run the complete html5lib test suite on every commit:
python run_tests.py
To run only a single suite (useful for faster iteration), use --suite:
python run_tests.py --suite tree
python run_tests.py --suite justhtml
python run_tests.py --suite serializer
python run_tests.py --suite encoding
python run_tests.py --suite unit
Output:
PASSED: 3464/3464 passed (100.0%)
There are also 6 expected skips, including scripted (#script-on) cases that
require JavaScript execution during parsing.
Per-file results are also written to test-summary.txt, with suite prefixes like html5lib-tests-tree/..., html5lib-tests-serializer/..., html5lib-tests-encoding/..., and justhtml-tests/....
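To give a sense of what the encoding suite exercises, here is a simplified sketch of the first and last steps of encoding detection: a byte order mark takes priority, then a transport-level override, and windows-1252 is the fallback. The function name sniff_encoding and its return shape are assumptions for illustration, not JustHTML's API, and the <meta charset=...> prescan that sits between the override and the fallback is deliberately left out.

```python
import codecs


def sniff_encoding(data, transport_encoding=None):
    """Return (encoding, bom_length) for a byte string.

    Simplified: BOM first, then a transport override, then the
    windows-1252 fallback. The meta charset prescan is omitted.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", len(codecs.BOM_UTF8)
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le", len(codecs.BOM_UTF16_LE)
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be", len(codecs.BOM_UTF16_BE)
    if transport_encoding:
        # Minimal label normalization: trim whitespace and lowercase.
        return transport_encoding.strip().lower(), 0
    return "windows-1252", 0


def decode(data, transport_encoding=None):
    encoding, skip = sniff_encoding(data, transport_encoding)
    return data[skip:].decode(encoding, errors="replace")


print(decode(b"\xef\xbb\xbfcaf\xc3\xa9"))          # BOM selects UTF-8
print(decode(b"caf\xe9"))                          # no hints: windows-1252
print(decode(b"caf\xc3\xa9", " UTF-8 "))           # transport override
```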
The encoding coverage comes from both:
- html5lib-tests/encoding fixtures (exposed in this repo as tests/html5lib-tests-encoding/...).
- Tests in tests/test_encoding.py, which exercise byte input, encoding label normalization, BOM handling, and meta charset prescanning.

The test suite enforces 100% combined line and branch coverage, including the parser engine:
coverage run run_tests.py && coverage report --fail-under=100
The parser engine is additionally checked behaviorally:
PYTHONPATH=src python benchmarks/html5lib_engine_diff.py \
--fail-under-rate 1.0 \
--fail-on-current-exceptions
This requires exact agreement with the reference parser path across every scored html5lib tree-construction case.
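The gate implied by --fail-under-rate 1.0 can be sketched as an agreement rate with a threshold. The functions below are hypothetical and only illustrate the idea; they are not the code in benchmarks/html5lib_engine_diff.py, and the input is assumed to be pairs of serialized trees.

```python
def agreement_rate(pairs):
    """Fraction of (reference_tree, engine_tree) pairs that match exactly."""
    if not pairs:
        return 1.0
    matching = sum(1 for reference, engine in pairs if reference == engine)
    return matching / len(pairs)


def gate(pairs, fail_under_rate=1.0):
    """Return a process-style exit code: 0 on success, 1 on failure."""
    return 0 if agreement_rate(pairs) >= fail_under_rate else 1


results = [
    ("| <html>", "| <html>"),
    ("| <html>\n|   <head>", "| <html>\n|   <head>"),
]
print(gate(results))
```

With a threshold of 1.0, a single disagreement among the scored cases is enough to fail the check.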
We generate random malformed HTML to find crashes and hangs:
python benchmarks/fuzz.py -n 3000000
Output:
============================================================
FUZZING RESULTS: justhtml
============================================================
Total tests: 3000000
Successes: 3000000
Crashes: 0
Hangs (>5s): 0
Total time: 928s
Tests/second: 3232
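The core loop of such a fuzzer can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of benchmarks/fuzz.py: the fragment list, the function names, and the parse callable are all hypothetical, and hang detection (the 5 second timeout in the report above) is omitted.

```python
import random

# Hypothetical building blocks for malformed documents.
FRAGMENTS = [
    "<b>", "</b>", "<p>", "</i>", "<table>", "<td>", "<!--", "-->",
    "&amp", "&#x0;", "<script>", "text", "\x00", "<", ">", "=", '"',
]


def random_html(rng, max_parts=20):
    """Concatenate a random number of fragments into one document."""
    count = rng.randint(1, max_parts)
    return "".join(rng.choice(FRAGMENTS) for _ in range(count))


def fuzz(parse, n, seed=0):
    """Feed n random documents to parse and collect any exceptions."""
    rng = random.Random(seed)  # seeded so failures can be replayed
    crashes = []
    for _ in range(n):
        doc = random_html(rng)
        try:
            parse(doc)
        except Exception as exc:
            crashes.append((doc, exc))
    return crashes


def toy_parse(doc):
    # Stand-in for a real parser: rejects one specific construct.
    if "<table><!--" in doc:
        raise ValueError("toy parser cannot handle a comment in a table")


crashes = fuzz(toy_parse, 1000, seed=42)
print(f"Total tests: 1000, crashes: {len(crashes)}")
```

Seeding the generator means any crashing input can be regenerated and turned into a regression test.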
The fuzzer generates truly nasty edge cases:
- Invalid or replacement characters (�)
- Mis-nested formatting tags (<b><p></b></i>)

We maintain additional tests in tests/justhtml-tests/ for:
# Clone the test suite (one-time setup)
cd ..
git clone https://github.com/html5lib/html5lib-tests.git
cd justhtml
# Create symlinks
cd tests
ln -s ../../html5lib-tests/tree-construction html5lib-tests-tree
ln -s ../../html5lib-tests/serializer html5lib-tests-serializer
ln -s ../../html5lib-tests/encoding html5lib-tests-encoding
cd ..
# Run all tests
python run_tests.py
# Verbose output with diffs
python run_tests.py -v
# Run specific test file
python run_tests.py --test-specs test2.test:5,10
# Stop on first failure
python run_tests.py -x
# Check for regressions against baseline
python run_tests.py --regressions
Compare against other parsers:
python benchmarks/correctness.py
HTML5 parsing is notoriously complex. The spec describes intricate parsing behavior with:
Getting 99% compliance means you’re still breaking on real-world edge cases. Browsers pass 100% because they have to - and now JustHTML does too.
The html5lib suite verifies tree output, not a standardized diagnostic stream. JustHTML therefore reports a small set of high-value errors instead of duplicating the parser to reproduce every detailed recovery diagnostic:
from justhtml import JustHTML

doc = JustHTML("<!doctype html><!--", collect_errors=True)
for error in doc.errors:
    print(f"{error.line}:{error.column} {error.code}")
# Output: 1:19 eof-in-comment
Error collection is optional and adds work. Strict mode raises on the earliest supported diagnostic, but is not a complete HTML conformance validator.
See Error Codes for the supported set and stability contract.