
Sanitization & Security

JustHTML includes a built-in, policy-driven HTML sanitizer intended for rendering untrusted HTML safely.

JustHTML’s sanitizer is validated against the justhtml-xss-bench suite (a headless-browser harness), currently covering 7,000+ real-world XSS vectors. The benchmark can be used to compare output with established sanitizers like nh3 and bleach.

The sanitizer is DOM-based (it runs on the parsed JustHTML tree), and JustHTML is safe-by-default at construction time.

Guides

Quickstart

Most real-world untrusted HTML is a snippet (a fragment) rather than a full document. In that case, pass fragment=True to avoid implicit document wrappers.

If you are sanitizing a full HTML document, the default policy keeps the document structure (it preserves <html>, <head>, and <body> wrappers).

Construction sanitizes by default, so no separate sanitize call is needed:

from justhtml import JustHTML

doc = JustHTML('<p>Hello <b>world</b> <script>alert(1)</script></p>', fragment=True)
print(doc.to_html())

Output:

<p>Hello <b>world</b></p>

For a deeper dive, continue in HTML Cleaning and URL Cleaning.

Sanitizing the in-memory DOM with Sanitize(...)

Safe-by-default construction (JustHTML(..., sanitize=True)) sanitizes the in-memory tree once, after parsing and transforms run.

By default, the sanitization step is appended after any custom transforms. To run transforms after sanitization instead, add Sanitize(...) to the transform list yourself and place the later transforms after it:

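from justhtml import JustHTML, PruneEmpty, Sanitize

# user_html is the untrusted input string.
# Sanitize() runs first; PruneEmpty() then runs on the sanitized tree.
doc = JustHTML(user_html, fragment=True, transforms=[Sanitize(), PruneEmpty()])
clean_root = doc.root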

See also: Transforms (especially Sanitize(...) and Stage([...])).
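
The ordering rule described in this section can be illustrated with a small toy model. This is a sketch of the rule only, not JustHTML's implementation: Step, SANITIZE, and plan are hypothetical names invented for illustration, and the sketch assumes that an explicit Sanitize in the list keeps its position and suppresses the automatic append.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    """Hypothetical stand-in for a transform; only its name matters here."""
    name: str


SANITIZE = Step("Sanitize")


def plan(transforms: List[Step], sanitize: bool = True) -> List[Step]:
    """Return the order in which steps would run under the assumed rule."""
    steps = list(transforms)
    # Assumption: an explicit Sanitize keeps its position;
    # otherwise sanitization is appended as the final step.
    if sanitize and SANITIZE not in steps:
        steps.append(SANITIZE)
    return steps


def names(steps: List[Step]) -> List[str]:
    return [s.name for s in steps]


# Only a custom transform given: sanitization is appended last.
custom_only = names(plan([Step("Custom")]))

# Explicit Sanitize followed by another step: the later step runs after it.
explicit = names(plan([SANITIZE, Step("PruneEmpty")]))

print(custom_only)
print(explicit)
```

Under this model, any transform placed after an explicit Sanitize sees already-sanitized input, so it is up to that transform not to reintroduce unsafe markup.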

Why Sanitize(...) is reviewable

Sanitize(...) is not a hidden “black box” pass. Internally, it compiles the sanitization policy into a concrete, readable pipeline of smaller transforms (drop content tags, unwrap disallowed tags, drop dangerous attributes, validate URLs, sanitize styles, enforce link rel, …).

This is a strong security property:

If you’re evaluating or auditing sanitization behavior, the Sanitize(...) transform documentation summarizes the pipeline at a high level.
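
To make the idea of "compiling a policy into a pipeline of smaller transforms" concrete, here is a minimal, self-contained sketch on a toy tree. It does not use JustHTML and does not reflect its internals: Node, Policy, walk, and compile_policy are hypothetical names, and only three of the steps mentioned above are modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Node:
    tag: str  # "#text" marks a text node in this toy model
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    text: str = ""


@dataclass(frozen=True)
class Policy:
    allowed_tags: frozenset
    allowed_attrs: frozenset
    drop_content_tags: frozenset


Step = Tuple[str, Callable[[List[Node]], List[Node]]]


def walk(nodes: List[Node], fn: Callable[[Node], List[Node]]) -> List[Node]:
    """Apply fn bottom-up; fn returns the list of nodes replacing its input."""
    out: List[Node] = []
    for node in nodes:
        node.children = walk(node.children, fn)
        out.extend(fn(node))
    return out


def compile_policy(policy: Policy) -> List[Step]:
    """Turn a declarative policy into an ordered list of named, small steps."""

    def drop_content(node: Node) -> List[Node]:
        # Remove the element together with everything inside it.
        return [] if node.tag in policy.drop_content_tags else [node]

    def unwrap_disallowed(node: Node) -> List[Node]:
        if node.tag == "#text" or node.tag in policy.allowed_tags:
            return [node]
        return node.children  # keep the children, drop the wrapper

    def drop_attrs(node: Node) -> List[Node]:
        node.attrs = {k: v for k, v in node.attrs.items()
                      if k in policy.allowed_attrs}
        return [node]

    # Order matters: content tags are dropped before unwrapping, so their
    # text never surfaces as ordinary content.
    return [
        ("drop content tags", lambda ns: walk(ns, drop_content)),
        ("unwrap disallowed tags", lambda ns: walk(ns, unwrap_disallowed)),
        ("drop disallowed attributes", lambda ns: walk(ns, drop_attrs)),
    ]


policy = Policy(frozenset({"p", "b"}), frozenset({"title"}), frozenset({"script"}))
pipeline = compile_policy(policy)

tree = [Node("p", {"onclick": "x()"}, [
    Node("#text", text="Hello "),
    Node("span", children=[Node("b", children=[Node("#text", text="world")])]),
    Node("script", children=[Node("#text", text="alert(1)")]),
])]

for name, step in pipeline:
    tree = step(tree)

step_names = [name for name, _ in pipeline]
print(step_names)
```

Because each step has a name and a single responsibility, the compiled pipeline can be listed and read one step at a time, which is the reviewability property this section describes.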

Threat model

The goal of sanitization is to take untrusted HTML and clean it into output that is safe enough to be embedded as markup into a normal (safe) HTML page.

In scope:

Out of scope (you must handle these separately):

See HTML Cleaning for tag/attribute rules and unsafe handling, and URL Cleaning for URL handling (default_handling) and URL validation rules.