Command Line Interface

JustHTML ships with a small CLI for parsing HTML and extracting HTML/text/Markdown from selected parts of a document.

Running

If you installed JustHTML (for example with pip install justhtml or pip install -e .), you can use the justhtml command.

If you don’t have it available, use the equivalent python -m justhtml ... form.

Basic usage

# Pretty-print an HTML file
justhtml page.html

# Read HTML from stdin
curl -s https://example.com | justhtml -

Selecting nodes

Use --selector to choose which nodes to extract.

# Extract text from all paragraphs
justhtml page.html --selector "p" --format text

# Only output the first match
justhtml page.html --selector "main p" --format text --first

# In Python, this corresponds to: doc.query_one("main p")

Fragments

Use --fragment to parse the input as an HTML fragment (instead of a full document). This avoids implicit <html>, <head>, and <body> insertion.

echo '<li>Hi</li>' | justhtml - --fragment

Strict Mode

Use --strict to fail immediately and print an error to stderr if the input HTML contains parsing errors or is otherwise malformed according to the WHATWG specification.

echo '</b>' | justhtml - --strict --fragment
# Exits with code 2 and prints:
# StrictModeError: Element 'b' has no matching start tag

Output formats

--format controls what is printed:

html (default): pretty-printed HTML for each match
text: concatenated text (same semantics as to_text(separator=" ", strip=True); sanitized by default)
markdown: a pragmatic subset of GitHub Flavored Markdown (GFM)

Notes:

markdown keeps tables (<table>) and images (<img>) as raw HTML.
With the default sanitization on, that raw HTML comes from the sanitized DOM. Use --format text if you need output with no HTML at all.
For multiple matches:
- html and text print one result per line.
- markdown prints matches separated by a blank line.

Sanitization

By default, the CLI sanitizes output (same safe-by-default behavior as JustHTML(..., sanitize=True)).

To disable sanitization for trusted input, pass --unsafe.

Allow extra tags

In safe mode, you can allow additional tags via --allow-tags (comma-separated). This augments the default policy (document vs fragment).

Example:

justhtml page.html --selector "article" --allow-tags article,section --format markdown

Cleanup

--cleanup removes common unhelpful output artifacts:

unwrap <a> tags that have no href
drop <img> tags that have no src (or src="")
prune empty tags

This is useful when sanitization has stripped attributes and left behind empty tags.

curl -s https://example.com | justhtml - --format html --cleanup

Text options

When using --format text, you can control whitespace handling:

--separator "..." (default: a single space) joins text nodes
--separator-blocks-only applies --separator only between block-level elements (avoid separators inside inline tags)
--strip / --no-strip controls whether each text node is stripped and empty segments dropped

Example:

justhtml page.html --selector "main" --format text --separator "" --no-strip

Exit codes

0: success
1: missing input path or no matches for the selector
2: invalid selector, or a parse error when --strict is enabled

Real-world example

curl -s https://github.com/EmilStenstrom/justhtml/ | justhtml - --selector '.markdown-body' --format markdown | head -n 15

# JustHTML

[](#justhtml)

A pure Python HTML5 parser that just works. No C extensions to compile. No system dependencies to install. No complex API to learn.

**[📖 Read the full documentation here](/EmilStenstrom/justhtml/blob/main/docs/index.md)**

## Why use JustHTML?

[](#why-use-justhtml)

### 1. Just... Correct ✅

[](#1-just-correct-)

This site is open source. Improve this page.