This page focuses on URL cleaning: how JustHTML validates and rewrites URL-valued attributes like a[href] or img[src].
For tag/attribute allowlists, inline styles, and unsafe-handling modes, see HTML Cleaning.
JustHTML treats a set of attributes as URL-like (including href, src, srcset, action, and a few others).
For safety, these attributes are only kept if there is an explicit matching rule in UrlPolicy(allow_rules=...) for the (tag, attr) pair.
The reason is that these attributes can trigger navigation or resource loading (and in some cases script execution via unsafe schemes like javascript:). Different attributes also carry different security expectations: for example, allowing a[href] is often fine, while allowing img[src] can cause remote requests and tracking. Requiring an explicit (tag, attr) rule forces you to opt in and to define what counts as a valid URL for that specific attribute.
from justhtml import JustHTML, SanitizationPolicy

# img[src] is allowed as an attribute, but no url_policy is configured.
policy = SanitizationPolicy(
    allowed_tags=["img"],
    allowed_attributes={"img": ["src"]},
)
print(JustHTML("""
<img src="https://example.com">
<img src="https://attacker.com">
""", fragment=True, policy=policy).to_html())
Output:
<img>
<img>
Since no url_policy was set, the default kicked in and dropped every URL-like attribute. For a URL-like attribute, listing it in allowed_attributes is not enough; you also need a url_policy with a rule that matches what you want to allow:
from justhtml import JustHTML, SanitizationPolicy, UrlPolicy, UrlRule

policy = SanitizationPolicy(
    allowed_tags=["img"],
    allowed_attributes={"img": ["src"]},
    url_policy=UrlPolicy(
        allow_rules={
            # Keep img[src] only for https URLs on example.com.
            ("img", "src"): UrlRule(
                allowed_schemes={"https"},
                allowed_hosts=["example.com"],
            ),
        }
    )
)
print(JustHTML("""
<img src="https://example.com">
<img src="http://example.com">
<img src="https://attacker.com">
""", fragment=True, policy=policy).to_html())
Output:
<img src="https://example.com">
<img>
<img>
For a URL-like attribute (like img[src] or a[href]), JustHTML applies these steps:
1. The tag must be allowed by SanitizationPolicy.allowed_tags.
2. The attribute must be allowed by SanitizationPolicy.allowed_attributes.
3. The (tag, attr) pair must have a matching rule in UrlPolicy(allow_rules=...).
4. (Optional: UrlPolicy.url_filter runs and can rewrite or drop the value here).
5. The URL is validated against the matching UrlRule.
6. If UrlRule.handling is set, it is applied.

URL behavior is controlled by UrlPolicy:

- default_handling: the default action for URL-like attributes.
- default_allow_relative: whether relative URLs (like /path, ./path, ../path, ?q) are allowed by default.

For URL-like attributes that match an explicit (tag, attr) rule in allow_rules, validated URLs are kept by default. To strip or proxy a specific attribute, set UrlRule.handling.
Note: URL validation is always enforced by UrlRule.
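The order of the checks listed above can be illustrated with a small standalone sketch. This is not JustHTML's implementation: `clean_url_attr`, its parameters, and the use of a plain callable as a stand-in for a UrlRule are all hypothetical, and the handling step (strip/proxy) is left out to keep the focus on the ordering.

```python
from __future__ import annotations

from typing import Callable, Optional

# Hypothetical stand-ins: a "rule" is just a validator callable here.
Validator = Callable[[str], bool]
UrlFilter = Callable[[str, str, str], Optional[str]]


def clean_url_attr(
    tag: str,
    attr: str,
    value: str,
    *,
    allowed_tags: set[str],
    allowed_attributes: dict[str, set[str]],
    allow_rules: dict[tuple[str, str], Validator],
    url_filter: Optional[UrlFilter] = None,
) -> Optional[str]:
    """Return the value to keep, or None if the attribute should be dropped."""
    if tag not in allowed_tags:
        return None
    if attr not in allowed_attributes.get(tag, set()):
        return None
    validate = allow_rules.get((tag, attr))
    if validate is None:
        return None  # no explicit (tag, attr) rule: drop
    if url_filter is not None:
        filtered = url_filter(tag, attr, value)
        if filtered is None:
            return None  # the filter asked to drop the attribute
        value = filtered  # the filter may have rewritten the value
    if not validate(value):
        return None
    return value


rules = {("img", "src"): lambda url: url.startswith("https://example.com")}
kept = clean_url_attr(
    "img", "src", "https://example.com/a.png",
    allowed_tags={"img"},
    allowed_attributes={"img": {"src"}},
    allow_rules=rules,
)
print(kept)  # https://example.com/a.png
```

Note how an attribute that passes the tag and attribute allowlists is still dropped when `allow_rules` has no entry for it, which mirrors the first example on this page.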
from justhtml import UrlPolicy

UrlPolicy(
    default_handling="strip",  # or "allow" / "proxy"
    default_allow_relative=True,
    allow_rules={},
    url_filter=None,
    proxy=None,
)

Handling: "allow"
This is the “keep validated URLs” behavior.
For URL-like attributes that match an explicit (tag, attr) rule in UrlPolicy(allow_rules=...), a validated URL is kept by default unless you override handling with UrlRule.handling.

Handling: "strip"
Some renderers (notably email clients) want to avoid loading remote resources by default.
The built-in DEFAULT_POLICY already blocks remote image loads by default (img[src] only allows relative URLs).
To strip URL-valued attributes, either omit the (tag, attr) rule (so the attribute is dropped), or set UrlRule(handling="strip") for that attribute.
from justhtml import JustHTML, SanitizationPolicy, UrlPolicy, UrlRule

policy = SanitizationPolicy(
    allowed_tags=["img"],
    allowed_attributes={"*": [], "img": ["src"]},
    url_policy=UrlPolicy(
        allow_rules={
            # handling="strip" drops the attribute even when the URL validates.
            ("img", "src"): UrlRule(handling="strip", allowed_schemes={"http", "https"}),
        },
    ),
)
print(JustHTML('<img src="https://example.com/x">', fragment=True, policy=policy).to_html())
print(JustHTML('<img src="/x">', fragment=True, policy=policy).to_html())
Output:
<img>
<img>
If you instead want to block remote loads but allow relative image loads, configure the rule:
from justhtml import JustHTML, SanitizationPolicy, UrlPolicy, UrlRule

policy = SanitizationPolicy(
    allowed_tags=["img"],
    allowed_attributes={"*": [], "img": ["src"]},
    url_policy=UrlPolicy(
        allow_rules={
            ("img", "src"): UrlRule(
                allow_relative=True,             # keep /x, ./x, ?q
                allowed_schemes=set(),           # reject every absolute URL
                resolve_protocol_relative=None,  # reject //example.com
            )
        },
    ),
)
print(JustHTML('<img src="https://example.com/x">', fragment=True, policy=policy).to_html())
print(JustHTML('<img src="/x">', fragment=True, policy=policy).to_html())
Output:
<img>
<img src="/x">

Handling: "proxy"
Instead of keeping URLs, you can rewrite them through a proxy endpoint:
from justhtml import JustHTML, SanitizationPolicy, UrlPolicy, UrlProxy, UrlRule

policy = SanitizationPolicy(
    allowed_tags=["a"],
    allowed_attributes={"*": [], "a": ["href"]},
    url_policy=UrlPolicy(
        # Global proxy config, used by every rule with handling="proxy".
        proxy=UrlProxy(url="/proxy", param="url"),
        allow_rules={
            ("a", "href"): UrlRule(handling="proxy", allowed_schemes={"https"}),
        },
    ),
)
print(JustHTML('<a href="https://example.com/?a=1&b=2">link</a>', policy=policy).to_html())
Output:
<a href="/proxy?url=https%3A%2F%2Fexample.com%2F%3Fa%3D1%26b%3D2">link</a>
Notes:
- [...] allow_relative=True.
- The proxy can be configured globally (UrlPolicy.proxy) or per rule (UrlRule.proxy).

Example: using a per-rule proxy override:
from justhtml import JustHTML, SanitizationPolicy, UrlPolicy, UrlProxy, UrlRule

policy = SanitizationPolicy(
    allowed_tags=["a"],
    allowed_attributes={"*": [], "a": ["href"]},
    url_policy=UrlPolicy(
        allow_rules={
            ("a", "href"): UrlRule(
                handling="proxy",
                allowed_schemes={"https"},
                # Per-rule proxy config; no UrlPolicy.proxy is set here.
                proxy=UrlProxy(url="/p", param="u"),
            )
        },
    ),
)
print(JustHTML('<a href="https://example.com/?a=1&b=2">link</a>', policy=policy).to_html())
Output:
<a href="/p?u=https%3A%2F%2Fexample.com%2F%3Fa%3D1%26b%3D2">link</a>
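The proxied values in the two outputs above are consistent with percent-encoding the whole original URL into a single query parameter. The following standalone sketch reproduces that encoding with the standard library. `proxy_url` is a hypothetical helper, not part of JustHTML, and it assumes the proxy endpoint has no query string of its own.

```python
from urllib.parse import quote


def proxy_url(proxy_base: str, param: str, target: str) -> str:
    """Embed `target` as one percent-encoded query parameter of `proxy_base`."""
    # safe="" also encodes "/", so reserved characters such as ":", "/", "?",
    # "&" and "=" cannot be mistaken for part of the proxy URL itself.
    return f"{proxy_base}?{param}={quote(target, safe='')}"


print(proxy_url("/proxy", "url", "https://example.com/?a=1&b=2"))
# /proxy?url=https%3A%2F%2Fexample.com%2F%3Fa%3D1%26b%3D2
print(proxy_url("/p", "u", "https://example.com/?a=1&b=2"))
# /p?u=https%3A%2F%2Fexample.com%2F%3Fa%3D1%26b%3D2
```

Encoding the full target matters: without it, the `&b=2` part of the original URL would be read as a second parameter of the proxy endpoint.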
Protocol-relative URLs start with // and are not widely known. Browsers resolve them using the scheme of the embedding page: https on an HTTPS page, http otherwise.
By default, JustHTML resolves them to https before validation. This ensures they are checked against the allowed schemes and prevents them from inheriting an insecure protocol from the embedding page.
You can configure this behavior per rule:
from justhtml import UrlRule
# Default behavior: resolve to https
rule = UrlRule(allowed_schemes=["https"], resolve_protocol_relative="https")
# Resolve to http
rule = UrlRule(allowed_schemes=["http", "https"], resolve_protocol_relative="http")
# Disallow protocol-relative URLs entirely
rule = UrlRule(allowed_schemes=["https"], resolve_protocol_relative=None)
There is currently no way to leave protocol-relative URLs untouched. If you need this, open an issue with a description of your use case.
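As a rough illustration of the three resolve_protocol_relative settings, here is a standalone sketch of the resolution step. The function below is hypothetical and only mirrors the behavior described in this section; it is not JustHTML's code.

```python
from typing import Optional


def resolve_protocol_relative(value: str, scheme: Optional[str] = "https") -> Optional[str]:
    """Resolve //host/path to scheme://host/path, or return None to reject it."""
    if not value.startswith("//"):
        return value  # not protocol-relative: leave it for normal validation
    if scheme is None:
        return None  # protocol-relative URLs are disallowed entirely
    return f"{scheme}:{value}"


print(resolve_protocol_relative("//example.com/x"))          # https://example.com/x
print(resolve_protocol_relative("//example.com/x", "http"))  # http://example.com/x
print(resolve_protocol_relative("//example.com/x", None))    # None
```

Because the resolved URL carries an explicit scheme, it can then be checked against allowed_schemes like any other absolute URL.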
srcset contains multiple URLs, so it requires special care.
JustHTML parses the comma-separated candidates and sanitizes each candidate URL using the matching UrlRule for (tag, "srcset").
If any candidate is unsafe, the entire attribute is dropped.
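The all-or-nothing behavior can be sketched as follows. This is a simplified illustration, not JustHTML's parser: `clean_srcset` and `is_safe` are hypothetical names, and splitting on commas ignores the corner case of URLs that themselves contain commas, which the real srcset grammar permits.

```python
from typing import Callable, Optional


def clean_srcset(value: str, is_safe: Callable[[str], bool]) -> Optional[str]:
    """Return a normalized srcset, or None if any candidate URL is unsafe."""
    candidates = []
    for raw in value.split(","):
        parts = raw.split()
        if not parts:
            continue  # skip empty candidates, e.g. from a trailing comma
        url, descriptors = parts[0], parts[1:]
        if not is_safe(url):
            return None  # one unsafe candidate drops the whole attribute
        candidates.append(" ".join([url, *descriptors]))
    return ", ".join(candidates) if candidates else None


def only_example(url: str) -> bool:
    return url.startswith("https://example.com/")


print(clean_srcset("https://example.com/a.png 1x, https://example.com/b.png 2x", only_example))
# https://example.com/a.png 1x, https://example.com/b.png 2x
print(clean_srcset("https://example.com/a.png 1x, https://attacker.com/b.png 2x", only_example))
# None
```

Dropping the whole attribute, instead of only the offending candidate, avoids silently changing which image the browser would pick.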
UrlPolicy.url_filter lets you apply a last-mile filter/rewrite (or drop) based on (tag, attr, value).
Return the (possibly rewritten) value to keep the attribute, or return None to drop the attribute. This runs before validation.
Example: drop URLs to a blocked host:
from justhtml import JustHTML, SanitizationPolicy, UrlPolicy, UrlRule

def url_filter(tag: str, attr: str, value: str) -> str | None:
    # Returning None drops the attribute; returning a string keeps it.
    if "attacker.com" in value:
        return None
    return value

policy = SanitizationPolicy(
    allowed_tags=["a"],
    allowed_attributes={"*": [], "a": ["href"]},
    url_policy=UrlPolicy(
        url_filter=url_filter,
        allow_rules={
            ("a", "href"): UrlRule(
                allowed_schemes={"https"},
            )
        },
    ),
)
html = '<a href="https://example.com/">ok</a>\n<a href="https://attacker.com/">bad</a>'
print(JustHTML(html, fragment=True, policy=policy).to_html())
Output:
<a href="https://example.com/">ok</a>
<a>bad</a>
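The substring check in the example above also matches URLs that merely mention the blocked host in a path or query string. A stricter variant, sketched here with only the standard library, compares the parsed hostname instead. The function keeps the (tag, attr, value) signature shown above; `BLOCKED_HOSTS` and the choice to also block subdomains are assumptions made for illustration.

```python
from __future__ import annotations

from urllib.parse import urlsplit

# Hypothetical blocklist; subdomains of these hosts are blocked as well.
BLOCKED_HOSTS = {"attacker.com"}


def url_filter(tag: str, attr: str, value: str) -> str | None:
    """Drop URLs whose host is a blocked host or one of its subdomains."""
    try:
        host = urlsplit(value).hostname
    except ValueError:
        return None  # unparsable URL: drop it
    if host is None:
        return value  # relative URL, no host to check
    for blocked in BLOCKED_HOSTS:
        if host == blocked or host.endswith("." + blocked):
            return None
    return value


print(url_filter("a", "href", "https://attacker.com/"))                 # None
print(url_filter("a", "href", "https://example.com/?ref=attacker.com"))
# https://example.com/?ref=attacker.com
```

Since the filter runs before validation, anything it lets through is still checked against the matching UrlRule.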
A UrlRule controls how a single URL-valued attribute is validated:
from justhtml import UrlRule

UrlRule(
    allow_fragment=True,
    resolve_protocol_relative="https",
    allowed_schemes=set(),
    allowed_hosts=None,
    handling=None,
    allow_relative=None,
    proxy=None,
)
Field reference:
- allow_fragment (default: True): allow same-document fragments like #section.
- resolve_protocol_relative (default: "https"): how to resolve protocol-relative URLs like //example.com before validation; set to None to reject them.
- allowed_schemes (default: set()): allowed schemes for absolute URLs (lowercased), e.g. {"https"}; empty means disallow all absolute URLs.
- allowed_hosts (default: None): optional host allowlist for absolute URLs; if set, the parsed host must be in this set.
- handling (default: None): optional handling override for an allowlisted attribute; "strip" drops it, "proxy" rewrites it, and None keeps it after validation.
- allow_relative (default: None): optional override for UrlPolicy.default_allow_relative (relative URLs like /x, ./x, ?q).
- proxy (default: None): optional per-rule proxy config used when effective handling is "proxy" (overrides UrlPolicy.proxy).

These are small UrlRule(...) building blocks that you can use in UrlPolicy(allow_rules={...}) for a specific (tag, attr) pair.
Allow https as the only scheme for absolute URLs:
UrlRule(allowed_schemes={"https"})

Allow http and https:
UrlRule(allowed_schemes={"http", "https"})

Allow https on a single host:
UrlRule(allowed_schemes={"https"}, allowed_hosts={"example.com"})

Allow https on a set of hosts:
UrlRule(
    allowed_schemes={"https"},
    allowed_hosts={"example.com", "static.example.com"},
)

Allow relative URLs (allowed_schemes defaults to empty, so absolute URLs are rejected):
UrlRule(allow_relative=True)

Allow only fragments (#section) and drop everything else:
UrlRule(allow_fragment=True)

Allow https but reject same-document fragments:
UrlRule(allowed_schemes={"https"}, allow_fragment=False)

Allow relative URLs plus https on a single host:
UrlRule(
    allow_relative=True,
    allowed_schemes={"https"},
    allowed_hosts={"example.com"},
)

Disallow protocol-relative URLs (//example.com) entirely:
UrlRule(
    allowed_schemes={"https"},
    resolve_protocol_relative=None,
)

Allow mailto: links:
UrlRule(allowed_schemes={"mailto"})

Allow tel: links:
UrlRule(allowed_schemes={"tel"})

Allow https: and mailto: (common for a[href]):
UrlRule(allowed_schemes={"https", "mailto"}, resolve_protocol_relative="https")

Drop the attribute with handling="strip":
UrlRule(handling="strip", allowed_schemes={"https"})

Proxy https URLs:
# Uses UrlPolicy.proxy (global proxy config)
UrlRule(handling="proxy", allowed_schemes={"https"})

Proxy https URLs with a per-rule proxy:
from justhtml import UrlProxy

# Uses a per-rule proxy override (UrlRule.proxy takes precedence over UrlPolicy.proxy)
UrlRule(
    handling="proxy",
    allowed_schemes={"https"},
    proxy=UrlProxy(url="/proxy", param="url"),
)
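To make the interplay of the UrlRule fields more concrete, the sketch below shows one plausible order in which checks of this kind could be applied. It is a standalone approximation under stated assumptions, not JustHTML's validator: `RuleSketch` and `is_valid` are hypothetical, allow_relative is a plain boolean here (no fallback to UrlPolicy.default_allow_relative), and protocol-relative URLs are simply rejected instead of being resolved first.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class RuleSketch:
    """Hypothetical stand-in for a subset of the UrlRule fields."""
    allowed_schemes: set[str] = field(default_factory=set)
    allowed_hosts: set[str] | None = None
    allow_relative: bool = True
    allow_fragment: bool = True


def is_valid(rule: RuleSketch, value: str) -> bool:
    value = value.strip()
    if value.startswith("#"):
        return rule.allow_fragment  # same-document fragment
    if value.startswith("//"):
        return False  # protocol-relative: out of scope for this sketch
    try:
        parts = urlsplit(value)
    except ValueError:
        return False  # unparsable URL
    if not parts.scheme:
        return rule.allow_relative  # /x, ./x, ?q
    if parts.scheme.lower() not in rule.allowed_schemes:
        return False  # also rejects javascript: unless explicitly allowed
    if rule.allowed_hosts is not None:
        return (parts.hostname or "") in rule.allowed_hosts
    return True


rule = RuleSketch(allowed_schemes={"https"}, allowed_hosts={"example.com"})
print(is_valid(rule, "https://example.com/x"))   # True
print(is_valid(rule, "http://example.com/x"))    # False
print(is_valid(rule, "https://attacker.com/x"))  # False
print(is_valid(rule, "javascript:alert(1)"))     # False
```

The empty default for `allowed_schemes` is what makes the rule fail closed: an absolute URL is only accepted when its scheme has been listed explicitly.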