Content Block Extractor

Use cases

Identifying common template patterns at scale Finding content blocks for scraping or migration Analysing page structure consistency Generating XPaths for content extraction

Uses Claude Haiku to analyse filtered HTML and identify major content blocks with XPath expressions.

Removes script, style, noscript, meta, link, header, footer, and nav tags to reduce token usage.

Aggregates results across URLs to find common XPath patterns with frequency counts.

Requires API Key

Platform

Python script (requires Python 3.x)

Input

URL list

Claude API key

Output

CSV with content blocks and XPath patterns

View Source

Features

Claude Haiku API for block detection
XPath expression generation for each block
HTML filtering removes nav/header/footer/scripts
Frequency analysis of XPath patterns across pages
Incremental CSV saves every 50 rows
Token usage reporting (input, cache, output)

How to use

1 Add URLs to input/urls.txt file
2 Configure Claude API key
3 Run script (1-second delay between requests)
4 Review combined_output_with_frequency.csv
5 Analyse top 10 XPath patterns by frequency

Want me to run this for you?

I offer this as a managed service. You get the insights without touching the tool.

Get in Touch

Competitor Content Gap Finder

Content

Discover which descriptive words competitors use in titles that you are missing.

Content Consolidation Analyser

Content

Find cannibalising pages by clustering URLs that share SERP overlap.

Content Hub Classifier

Content

Classify articles into content hubs using GPT-4o-mini with structured JSON output.

Let's work together

Monthly retainers or one-off projects. No lengthy reports that sit in a drawer.

Let's Talk

Content Block Extractor

Features

How to use

Want me to run this for you?

Related Tools

Competitor Content Gap Finder

Content Consolidation Analyser

Content Hub Classifier

Let's work together