← Back to docs

Transforms

JustHTML supports optional transforms to modify the parsed DOM tree right after parsing.

This is intended as a migration path for Bleach/html5lib filter pipelines, but implemented as DOM transforms (tree-aware and HTML5-treebuilder-correct).

Transforms are the recommended way to mutate the DOM. Direct node edits are supported, but transforms provide clearer ordering guarantees and make it explicit when sanitization should run.

If you’re migrating an existing Bleach setup, see Migrating from Bleach.

Transforms are applied during construction via the transforms keyword argument:

from justhtml import JustHTML

doc = JustHTML("<p>Hello</p>", transforms=[...])

Quick example

from justhtml import JustHTML, Drop, SetAttrs

doc = JustHTML(
    "<p>Hello</p><script>alert(1)</script>",
    transforms=[
        SetAttrs("p", id="greeting"),
        Drop("script"),
    ],
)

# The tree is transformed in memory
print(doc.to_html())

# Output is still safe by default
print(doc.to_html(pretty=False))

Safety model

Important: When sanitize=True, JustHTML ensures the in-memory tree is sanitized by running a Sanitize(...) step after parsing and after your custom transforms.

This means your transforms see the unsanitized tree, and sanitization may rewrite it afterwards (for example, stripping unsafe href/src values). If you want a transform to operate on the sanitized tree, include Sanitize() explicitly in your transform list and place later transforms after it:

from justhtml import JustHTML, Sanitize, Unwrap

doc = JustHTML(
    'Hello <a href="javascript:alert(1)">x</a>',
    transforms=[
        Sanitize(),
        Unwrap("a:not([href])"),
    ],
)
print(doc.to_html())
# => Hello

Raw output is available by disabling sanitization:

doc = JustHTML("<p>Hello</p><script>alert(1)</script>", fragment=True, sanitize=False)
print(doc.to_html(pretty=False))
# => <p>Hello</p><script>alert(1)</script>

Sanitization can remove or rewrite transform results (for example, unsafe tags, event handler attributes, or unsafe URLs in href).

Ordering

Transforms run left-to-right, but JustHTML may batch compatible transforms into a single tree walk for performance.

Batching preserves left-to-right ordering, but it is still a single walk with a moving cursor. If a transform inserts or moves nodes before the current cursor, later transforms in the same walk may not visit those nodes.

If you need explicit pass boundaries (to make multi-pass pipelines easier to read, or to avoid cross-transform batching effects), use Stage([...]) (see “Advanced: Stages” below).

from justhtml import JustHTML, Drop, SetAttrs

doc = JustHTML(
    "<p>Hello</p>",
    transforms=[
        SetAttrs("p", id="x"),
        Drop("p"),
    ],
)

Advanced: Stages

Stage([...]) lets you explicitly split transforms into separate passes.

Use it when you want to make a multi-pass pipeline clearer, or when you want to avoid cross-transform batching effects.

Stages also matter for semantics when earlier transforms insert/move nodes “behind” the current walk position. Splitting into stages forces a new walk, so later transforms see the updated tree.

Example: Let’s a Edit() transform create new nodes and then set attributes

from justhtml import Edit, JustHTML, SetAttrs, Stage
from justhtml.node import Node, Text


def insert_marker(p):
    # Insert a new sibling *before* the current node.
    # Without an explicit stage boundary, later transforms in the same walk
    # may not visit nodes inserted before the current cursor.
    marker = Node("span")
    marker.append_child(Text("NEW "))
    # If this was insert_after, SetAttrs would have seen the node.
    p.parent.insert_before(marker, p)

doc = JustHTML(
    "<p>one</p><p>two</p>",
    fragment=True,
    transforms=[
        # Without Stage, SetAttrs will miss the inserted <span>.
        Edit("p:first-child", insert_marker),
        SetAttrs("span", id="marker"),
    ],
)

# With Stage, the second pass sees the inserted <span>:
doc2 = JustHTML(
    "<p>one</p><p>two</p>",
    fragment=True,
    sanitize=False,
    transforms=[
        Stage([Edit("p:first-child", insert_marker)]),
        Stage([SetAttrs("span", id="marker")]),
    ],
)

print(doc.to_html(pretty=False))
print(doc2.to_html(pretty=False))

Output:

<span>NEW </span><p>one</p><p>two</p>
<span id="marker">NEW </span><p>one</p><p>two</p>

Tree shape

Transforms operate on the HTML5 treebuilder result, not the original token stream.

This means elements may already be inserted, moved, or normalized according to HTML parsing rules (for example, <template> elements end up in <head> in a full document).

Performance

JustHTML compiles transforms before applying them:

Transforms are applied with a single in-place traversal that supports structural edits.

Transforms are optional; omitting transforms keeps constructor behavior unchanged.

Validation

Transform selectors are validated during construction. Invalid selectors raise SelectorError early, before the document is exposed.

Only the built-in transform objects are supported. Unsupported transform objects raise TypeError.

Scope

Selector-based transforms (SetAttrs, Drop, Unwrap, Empty, Edit) apply only to element nodes. They never match document containers, text nodes, comments, or doctypes.

Linkify is different: it scans text nodes and wraps detected URLs/emails in <a> elements. It never touches attributes, existing tags, comments, or doctypes.

Enabling/disabling transforms

All built-in transforms have an enabled flag.

Hooks

All built-in transforms share the same optional keyword parameters:

Some transforms require an additional function argument (for example Edit(..., func)), which is documented in their signatures below.

Built-in transforms

Core selector transforms:

Advanced building blocks (useful for policy-driven pipelines):

Linkify(...)

See Linkify(...) for full documentation and examples.

CollapseWhitespace(skip_tags=(...), enabled=True, callback=None, report=None)

Collapses runs of HTML whitespace characters in text nodes to a single space.

This is similar to html5lib.filters.whitespace.Filter.

By default it skips <pre>, <textarea>, <code>, <title>, <script>, and <style>.

from justhtml import CollapseWhitespace, JustHTML

doc = JustHTML(
    "<p>Hello \n\t world</p><pre>a  b</pre>",
    fragment=True,
    transforms=[CollapseWhitespace()],
)

print(doc.to_html(pretty=False))
# => <p>Hello world</p><pre>a  b</pre>

Sanitize(policy=None, enabled=True, callback=None, report=None)

Sanitizes the in-memory DOM tree using the same sanitizer as construction-time sanitization.

Reviewability (security argument)

Sanitize(...) is implemented as an explicit pipeline of smaller, focused transforms (like DropComments, DropAttrs, DropUrlAttrs, …). This makes it easier to audit: the sanitizer behavior is a readable list of operations rather than a monolithic “magic” pass.

The Sanitize(...) pipeline compiles to this ordered list of transforms (some may be disabled by policy):

For policy details, see Sanitization & Security.

PruneEmpty(...) after Sanitize(...) is useful if sanitization removes unsafe children (for example <script>) and leaves a now-empty wrapper element.

CollapseWhitespace(...) after Sanitize(...) is useful if you want an already-sanitized in-memory tree, but still normalize whitespace (similar to html5lib.filters.whitespace.Filter).

PruneEmpty(selector, strip_whitespace=True, enabled=True, callback=None, report=None)

Recursively drops elements that are empty after transforms have run.

“Empty” means there are no element children and no non-whitespace text.

Note: HTML void elements (like img, br, hr, …) are never considered empty by PruneEmpty(...) and will not be removed by pruning. If you want to remove “empty images”, use Drop(...) (for example Drop('img:not([src]), img[src=""]')) and then prune any now-empty wrapper elements.

If you want whitespace-only text nodes to count as content (so <p> </p> is kept), pass strip_whitespace=False.

PruneEmpty(...) runs as a post-order walk over the tree and removes elements that are empty at that point in the transform pipeline.

If you want to prune after all other transforms, put PruneEmpty(...) at the end (or immediately before Sanitize(...)).

Example: remove empty paragraphs after dropping unwanted tags:

from justhtml import Drop, JustHTML, PruneEmpty

doc = JustHTML(
    "<p></p><p><img></p><p><img src=\"/x\"></p>",
    fragment=True,
    transforms=[
        Drop('img:not([src]), img[src=""]'),
        PruneEmpty("p"),
    ],
)

print(doc.to_html(pretty=False))

Output:

<p><img src="/x"></p>

SetAttrs(selector, enabled=True, callback=None, report=None, attributes=None, **attrs)

Sets/overwrites attributes on matching elements.

Attribute values are converted to strings. Passing None creates a boolean attribute (serialized in minimized form by default).

enabled controls whether the transform runs.

If you need to set an attribute that collides with a keyword-only parameter name (like enabled, callback, or report), use the attributes= dict.

SetAttrs("a", rel="nofollow", target="_blank")
SetAttrs("input", disabled=None)
SetAttrs("div", attributes={"enabled": "true"})

Drop(selector, enabled=True, callback=None, report=None)

Removes matching elements and their contents.

Optional: pass callback(node) / report(msg, node=...) to run hooks right before the node is dropped.

Drop(...) supports any selector that the JustHTML selector engine supports, including comma-separated selectors and attribute selectors.

Example: remove scripts and styles:

from justhtml import JustHTML, Drop

doc = JustHTML(
    "<p>Hello</p><script>alert(1)</script><style>p{}</style>",
    fragment=True,
    transforms=[Drop("script, style")],
)

print(doc.to_html(pretty=False))

Output:

<p>Hello</p>

Example: drop elements by class:

from justhtml import JustHTML, Drop

doc = JustHTML(
    '<p>One</p><div class="ad">Buy</div><p>Two</p>',
    fragment=True,
    transforms=[Drop(".ad")],
)

print(doc.to_html(pretty=False))

Output:

<p>One</p><p>Two</p>

Example: drop only some elements based on attributes:

from justhtml import JustHTML, Drop

doc = JustHTML(
    '<p><img><img src=""><img src="/x"></p>',
    fragment=True,
    transforms=[Drop('img:not([src]), img[src=""]')],
)

print(doc.to_html(pretty=False))

Output:

<p><img src="/x"></p>

Unwrap(selector, enabled=True, callback=None, report=None)

Removes the element but keeps its children (hoists contents).

Optional: pass callback(node) / report(msg, node=...) to run hooks right before the node is unwrapped.

Unwrap("span")
Unwrap("div.wrapper")

Escape(selector, enabled=True, callback=None, report=None)

Escapes a matching element’s start/end tags by turning them into text nodes, but keeps the element’s children.

Example:

from justhtml import Escape, JustHTML

doc = JustHTML(
    "<p><x>hi</x></p>",
    fragment=True,
    sanitize=False,
    transforms=[Escape("x")],
)
print(doc.to_html(pretty=False))
# => <p>&lt;x&gt;hi&lt;/x&gt;</p>

Empty(selector, enabled=True, callback=None, report=None)

Keeps the element but removes its children.

This also clears <template> contents.

Optional: pass callback(node) / report(msg, node=...) to run hooks when a node is emptied.

Empty("pre")
Empty("template")

Edit(selector, func, enabled=True, callback=None, report=None)

Escape hatch for custom logic. Runs func(node) for each matching element.

Optional: pass callback(node) / report(msg, node=...) to run hooks when the transform is applied.

Advanced transform building blocks

These are lower-level transforms that are primarily useful for building policy-driven pipelines (including the built-in Sanitize(...) pipeline).

EditDocument(func, enabled=True, callback=None, report=None)

Runs func(root) exactly once with the document root (#document or #document-fragment).

Use this for transforms that need access to the root container node (which selector-based transforms do not visit).

Decide(selector, func, enabled=True, callback=None, report=None)

General-purpose structural transform.

The callback must return one of:

EditAttrs(selector, func, enabled=True, callback=None, report=None)

Rewrite element attributes using a callback.

RewriteAttrs is a backwards-compatible alias for EditAttrs.

DropComments(enabled=True, callback=None, report=None)

Drops comment nodes (#comment).

DropDoctype(enabled=True, callback=None, report=None)

Drops doctype nodes (!doctype).

DropForeignNamespaces(enabled=True, callback=None, report=None)

Drops elements in non-HTML namespaces (for example SVG/MathML) when enabled.

If provided, callback(node) / report(msg, node=...) is called when a foreign element is dropped.

DropAttrs(selector, patterns=(), enabled=True, callback=None, report=None)

Drops attributes whose names match patterns like "on*" or "*:*".

Patterns support * and ? wildcards. If provided, callback(node) / report(msg, node=...) is called for each dropped attribute.

AllowlistAttrs(selector, allowed_attributes=..., enabled=True, callback=None, report=None)

Keeps only allowlisted attributes.

allowed_attributes is a mapping like:

If provided, callback(node) / report(msg, node=...) is called when an attribute is dropped for not being allowlisted.

DropUrlAttrs(selector, url_policy=..., enabled=True, callback=None, report=None)

Validates and rewrites/drops URL-valued attributes (href, src, srcset, …) according to a UrlPolicy.

If provided, callback(node) / report(msg, node=...) is called when a URL is dropped.

AllowStyleAttrs(selector, allowed_css_properties=..., enabled=True, callback=None, report=None)

Sanitizes inline style attributes using an allowlist of CSS properties.

If provided, callback(node) / report(msg, node=...) is called when a style attribute is dropped.

MergeAttrs(tag, attr=..., tokens=..., enabled=True, callback=None, report=None)

Merges tokens into a whitespace-delimited attribute without removing existing tokens.

This is used by the sanitizer to enforce rel tokens on links (e.g. noopener).

Edit is useful for transformations that don’t fit the built-ins, such as removing attributes, rewriting URLs, or conditionally dropping nodes.

from justhtml import JustHTML, Edit


def strip_tracking_params(a):
    href = a.attrs.get("href")
    if href:
        a.attrs["href"] = href.split("?", 1)[0]


doc = JustHTML(
    "<a href=\"https://e.com/?utm_source=x\">x</a>",
    transforms=[Edit("a", strip_tracking_params)],
)

Removing attributes is typically done via Edit:

from justhtml import JustHTML, Edit


def drop_inline_handlers(node):
    node.attrs.pop("onclick", None)
    node.attrs.pop("onload", None)


doc = JustHTML("<a onclick=\"x()\">x</a>", transforms=[Edit("a", drop_inline_handlers)])

Recipes

from justhtml import JustHTML, SetAttrs

doc = JustHTML(
    "<p><a href=\"https://example.com\">Example</a></p>",
    transforms=[
        SetAttrs("a", rel="nofollow noreferrer"),
    ],
)

Strip elements but keep their text

from justhtml import JustHTML, Unwrap

doc = JustHTML(
    "<p>Hello <span class=\"x\">world</span></p>",
    transforms=[Unwrap("span.x")],
)

Drop unsafe elements early, while still keeping safe-by-default sanitization

from justhtml import JustHTML, Drop

doc = JustHTML(
    "<p>ok</p><script>alert(1)</script>",
    transforms=[Drop("script")],
)
print(doc.to_html(pretty=False))

Acknowledgements

JustHTML’s Linkify behavior is validated against the upstream linkify-it fixture suite (MIT licensed).