Get up and running with JustHTML in 5 minutes.
pip install justhtml
from justhtml import JustHTML
html = "<html><body><div id='main'><p>Hello, <b>world</b>!</p></div></body></html>"
doc = JustHTML(html)
If your input is an HTML snippet (such as user-generated content from a WYSIWYG editor), you usually want fragment parsing, which avoids the implicit insertion of <html>, <head>, and <body>:
from justhtml import JustHTML
snippet = "<p>Hello <b>world</b></p>"
doc = JustHTML(snippet, fragment=True)
print(doc.to_html())
# => <p>Hello <b>world</b></p>
If you pass bytes (for example from a file), JustHTML decodes them using HTML encoding sniffing. If no encoding is found, it falls back to windows-1252 for browser compatibility.
from justhtml import JustHTML
from pathlib import Path
data = Path("page.html").read_bytes()
doc = JustHTML(data)
print(doc.encoding)
See Encoding & Byte Input for details and how to override with encoding=....
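To make the fallback concrete, here is a small sketch that uses only Python's standard library, not JustHTML. It shows what reading bytes as windows-1252 means for input that is not valid UTF-8. The sample bytes are an assumption chosen for illustration, and real encoding sniffing considers more than this.

```python
# Standard library only; this illustrates the fallback encoding, not JustHTML's sniffing.
raw = b"<p>caf\xe9</p>"  # 0xE9 encodes "é" in windows-1252

try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # 0xE9 starts a multi-byte UTF-8 sequence, but "<" is not a continuation byte.
    valid_utf8 = False

text = raw.decode("windows-1252")

print(valid_utf8)  # => False
print(text)        # => <p>café</p>
```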
The parser returns a tree of Node objects:
from justhtml import JustHTML
html = "<html><body><div id='main'><p>Hello, <b>world</b>!</p></div></body></html>"
doc = JustHTML(html)
root = doc.root # #document
html_node = root.children[0] # <html>
body = html_node.children[1] # <body> (children[0] is <head>)
div = body.children[0] # <div>
# Each node has:
print(div.name) # => div
print(div.attrs) # => {'id': 'main'}
print([child.name for child in div.children]) # => ['p']
print(div.parent.name) # => body
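Because every node exposes name and children, a recursive walk is enough to print an outline of the tree. The sketch below is an illustration under assumptions: `outline` is a hypothetical helper, and it runs against stand-in objects (not real JustHTML nodes) so that it works without the library installed.

```python
from types import SimpleNamespace

def outline(node, depth=0):
    """Return one indented line per node, using only .name and .children."""
    lines = ["  " * depth + node.name]
    # Some node types may not carry children; treat a missing list as empty.
    for child in getattr(node, "children", None) or []:
        lines.extend(outline(child, depth + 1))
    return lines

# Stand-in objects that mimic the attributes shown above (assumption, not library types).
b = SimpleNamespace(name="b", children=[])
p = SimpleNamespace(name="p", children=[b])
div = SimpleNamespace(name="div", children=[p])

print("\n".join(outline(div)))
# => div
# =>   p
# =>     b
```

With a parsed document, the same helper would presumably be called as outline(doc.root); non-element nodes such as text would then appear under whatever name the library assigns them.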
Use familiar CSS syntax to find elements:
# Find all paragraphs
paragraphs = doc.query("p")
# Find by ID
main_div = doc.query("#main")[0]
# Complex selectors
links = doc.query("nav > ul li a.active")
# Multiple selectors
headings = doc.query("h1, h2, h3")
Convert any node back to HTML:
from justhtml import JustHTML
html = "<html><body><div id='main'><p>Hello, <b>world</b>!</p></div></body></html>"
doc = JustHTML(html)
div = doc.query("#main")[0]
print(div.to_html(indent_size=4))
Output:
<div id="main">
    <p>Hello, <b>world</b>!</p>
</div>
Reject malformed HTML instead of silently fixing it:
from justhtml import JustHTML
JustHTML("<!doctype html><p></div>", strict=True) # doctest: skip
Output:
Traceback (most recent call last):
  File "snippet.py", line 3, in <module>
    JustHTML("<!doctype html><p></div>", strict=True)
    ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "parser.py", line 127, in __init__
    raise StrictModeError(self.errors[0])
  File "<html>", line 1
    <!doctype html><p></div>
                      ^^^^^^
justhtml.parser.StrictModeError: Unexpected </div> end tag
For large files, or when you don’t need the full DOM, iterate over parse events instead:
from justhtml import stream
html = "<p>Hello</p><p>world</p>"
for event, data in stream(html):
    if event == "start":
        tag, attrs = data
        print(f"Start: {tag}")
    elif event == "text":
        print(f"Text: {data}")
    elif event == "end":
        print(f"End: {data}")
Output:
Start: p
Text: Hello
End: p
Start: p
Text: world
End: p
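Since the stream yields plain (event, data) pairs, consumers can be written as ordinary functions over any iterable of such pairs. The sketch below counts start tags and collects text. `summarize` is a hypothetical helper, and the event list is hand-written to match the output above, with attrs assumed to be a dict, so the example runs without the library.

```python
from collections import Counter

def summarize(events):
    """Count start tags and join text from (event, data) pairs."""
    tags = Counter()
    text_parts = []
    for event, data in events:
        if event == "start":
            tag, _attrs = data  # start events carry (tag, attrs)
            tags[tag] += 1
        elif event == "text":
            text_parts.append(data)
        # "end" events are ignored in this summary.
    return tags, " ".join(text_parts)

# Hand-written events mirroring the output above (assumed shapes, not library output).
events = [
    ("start", ("p", {})),
    ("text", "Hello"),
    ("end", "p"),
    ("start", ("p", {})),
    ("text", "world"),
    ("end", "p"),
]

tags, text = summarize(events)
print(dict(tags))  # => {'p': 2}
print(text)        # => Hello world
```

With the library installed, the same function would be fed stream(html) directly, which keeps memory use tied to the summary rather than to the size of the document.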