JustHTML ships with a small CLI for parsing HTML and extracting HTML/text/Markdown from selected parts of a document.
If you installed JustHTML (for example with pip install justhtml or pip install -e .), you can use the justhtml command.
If you don't have it available, use the equivalent python -m justhtml ... form.
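When calling the CLI from a script, the same fallback can be automated. The following is a minimal sketch, not part of JustHTML itself: the helper name justhtml_command is hypothetical, and it only assumes the two invocation forms described above.

```python
import shutil
import sys


def justhtml_command() -> list[str]:
    """Return an argv prefix for invoking the JustHTML CLI.

    Hypothetical helper: prefers the `justhtml` console script and
    falls back to the `python -m justhtml` form described above.
    """
    # shutil.which returns None when the script is not on PATH.
    if shutil.which("justhtml"):
        return ["justhtml"]
    # Use the interpreter running this script for the module form.
    return [sys.executable, "-m", "justhtml"]
```

The returned list can be extended with the input path and flags before being passed to subprocess.run.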
# Pretty-print an HTML file
justhtml page.html
# Read HTML from stdin
curl -s https://example.com | justhtml -
Use --selector to choose which nodes to extract.
# Extract text from all paragraphs
justhtml page.html --selector "p" --format text
# Only output the first match
justhtml page.html --selector "main p" --format text --first
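For scripted use, the selector options above can be assembled into an argument list. This is an illustrative sketch only: extract_args is a hypothetical helper, and the set of accepted format names is taken from the --format values this page documents (html, text, markdown).

```python
# Format names as documented on this page for --format.
KNOWN_FORMATS = ("html", "text", "markdown")


def extract_args(path, selector, fmt="html", first=False):
    """Build CLI arguments for extracting nodes that match a selector.

    Hypothetical helper; it only arranges flags shown in the examples above.
    """
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    args = [path, "--selector", selector, "--format", fmt]
    if first:
        # Only output the first match.
        args.append("--first")
    return args
```

For example, extract_args("page.html", "main p", fmt="text", first=True) yields the same arguments as the second command above.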
Use --fragment to parse the input as an HTML fragment (instead of a full document). This avoids implicit <html>, <head>, and <body> insertion.
echo '<li>Hi</li>' | justhtml - --fragment
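The same stdin pattern works from Python by passing "-" as the input path and supplying the markup through subprocess. This is a sketch under assumptions: run_on_string and its command parameter are hypothetical names, and the default command uses the python -m justhtml form mentioned earlier.

```python
import subprocess
import sys


def run_on_string(html, extra_args=(), command=(sys.executable, "-m", "justhtml")):
    """Send `html` to the CLI on stdin and return what it prints.

    Hypothetical helper. `command` is the argv prefix used to start the CLI;
    "-" tells the CLI to read the document from stdin.
    """
    proc = subprocess.run(
        [*command, "-", *extra_args],
        input=html,
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError on any non-zero exit code
    )
    return proc.stdout
```

Calling run_on_string("<li>Hi</li>", ["--fragment"]) would mirror the echo example above, assuming JustHTML is installed for the current interpreter.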
--format controls what is printed:
- html (default): pretty-printed HTML for each match
- text: concatenated text (same semantics as to_text(separator=" ", strip=True); sanitized by default)
- markdown: a pragmatic subset of GitHub Flavored Markdown (GFM)
Notes:
- markdown keeps tables (<table>) and images (<img>) as raw HTML.
- html and text print one result per line.
- markdown prints matches separated by a blank line.
By default, the CLI sanitizes output (same safe-by-default behavior as JustHTML(..., sanitize=True)).
To disable sanitization for trusted input, pass --unsafe.
In safe mode, you can allow additional tags via --allow-tags (comma-separated). The listed tags are added to the default policy for the current parsing mode (document or fragment).
Example:
justhtml page.html --selector "article" --allow-tags article,section --format markdown
--cleanup removes common unhelpful output artifacts:
- <a> tags that have no href
- <img> tags that have no src (or src="")
This is useful when sanitization has stripped attributes and left behind empty tags.
curl -s https://example.com | justhtml - --format html --cleanup
When using --format text, you can control whitespace handling:
- --separator "..." (default: a single space) joins text nodes
- --strip / --no-strip controls whether each text node is stripped and empty segments dropped
Example:
justhtml page.html --selector "main" --format text --separator "" --no-strip
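The effect of these two options can be pictured with a small pure-Python model. This is only a sketch of the behavior as described above, not JustHTML's implementation; join_text is a hypothetical name and the real tool may differ in details.

```python
def join_text(segments, separator=" ", strip=True):
    """Join text segments the way --separator and --strip are described.

    Hypothetical model: with strip=True each segment is stripped and
    empty segments are dropped; with strip=False segments are joined as-is.
    """
    if strip:
        stripped = (segment.strip() for segment in segments)
        segments = [segment for segment in stripped if segment]
    return separator.join(segments)
```

With the defaults, join_text(["  Hello ", "\n", "world  "]) gives "Hello world". With separator="" and strip=False, as in the command above, the segments are concatenated unchanged.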
Exit codes:
- 0: success
- 1: missing input path or no matches for the selector
- 2: invalid selector
curl -s https://github.com/EmilStenstrom/justhtml/ | justhtml - --selector '.markdown-body' --format markdown | head -n 15
Output:
# JustHTML
[](#justhtml)
A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.
**[📖 Read the full documentation here](/EmilStenstrom/justhtml/blob/main/docs/index.md)**
## Why use JustHTML?
[](#why-use-justhtml)
### 1. Just... Correct ✅
[](#1-just-correct-)
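When a pipeline like the one above runs inside a script, the documented exit codes (0, 1, 2) can be turned into readable messages. A minimal sketch follows; describe_exit and EXIT_MEANINGS are hypothetical names, and the handling of any other code is an assumption.

```python
# Exit codes as documented for the CLI on this page.
EXIT_MEANINGS = {
    0: "success",
    1: "missing input path or no matches for the selector",
    2: "invalid selector",
}


def describe_exit(code: int) -> str:
    """Translate a CLI exit code into a human-readable message."""
    # Codes outside the documented set are reported generically (assumption).
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")
```

A caller would typically pass the returncode attribute of a completed subprocess to describe_exit and log the result.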