Content Block Extractor
Use cases
Uses Claude Haiku to analyse filtered HTML and identify major content blocks with XPath expressions.
Removes script, style, noscript, meta, link, header, footer, and nav tags to reduce token usage.
Aggregates results across URLs to find common XPath patterns with frequency counts.
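The tag-stripping step described above can be sketched with Python's stdlib `html.parser` (a minimal illustration, not the tool's actual code; the tag list follows the description above):

```python
from html.parser import HTMLParser

VOID_STRIP = {"meta", "link"}  # void elements: dropped, no closing tag expected
BLOCK_STRIP = {"script", "style", "noscript", "header", "footer", "nav"}

class ContentFilter(HTMLParser):
    """Re-emits HTML, skipping stripped tags and everything inside them."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.skip_depth = 0  # >0 while inside a stripped element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_STRIP:
            return  # drop the tag itself, nothing nests inside it
        if tag in BLOCK_STRIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            parts = [f' {k}="{v}"' if v is not None else f" {k}" for k, v in attrs]
            self.out.append(f"<{tag}{''.join(parts)}>")

    def handle_endtag(self, tag):
        if tag in VOID_STRIP:
            return
        if tag in BLOCK_STRIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

def filter_html(html: str) -> str:
    parser = ContentFilter()
    parser.feed(html)
    return "".join(parser.out)
```

Dropping these tags before the API call is what cuts token usage: scripts, styles, and chrome elements carry no extractable content.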
Platform
Python script (requires Python 3.x)
Input
URL list
Claude API key
Output
CSV with content blocks and XPath patterns
Features
- Claude Haiku API for block detection
- XPath expression generation for each block
- HTML filtering removes nav/header/footer/scripts
- Frequency analysis of XPath patterns across pages
- Incremental CSV saves every 50 rows
- Token usage reporting (input, cache, output)
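The frequency analysis feature can be sketched with `collections.Counter` (a hedged sketch; the real script's data shapes may differ):

```python
from collections import Counter

def aggregate_xpaths(per_url_blocks: dict[str, list[str]]) -> list[tuple[str, int]]:
    """per_url_blocks maps each URL to the XPath expressions found on it.
    Returns (xpath, count) pairs, most common first, so patterns shared
    across many pages rank highest."""
    counts = Counter()
    for xpaths in per_url_blocks.values():
        # Count each pattern once per URL so a block repeated on a
        # single page doesn't dominate the ranking.
        counts.update(set(xpaths))
    return counts.most_common()
```

A pattern that appears on most URLs is a strong candidate for a site-wide content template.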
How to use
1. Add URLs to the input/urls.txt file.
2. Configure your Claude API key.
3. Run the script (a 1-second delay is applied between requests).
4. Review combined_output_with_frequency.csv.
5. Analyse the top 10 XPath patterns by frequency.
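The incremental-save behaviour (step 4's output file) can be sketched as follows; the column names and `flush_every=50` batch size are assumptions based on the feature list, not the tool's exact schema:

```python
import csv

def write_rows_incrementally(path: str, rows, flush_every: int = 50) -> None:
    """Write rows to a CSV, flushing to disk every `flush_every` rows
    so a crash mid-run loses at most one batch."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Hypothetical header; the real script's columns may differ.
        writer.writerow(["url", "block_label", "xpath"])
        for i, row in enumerate(rows, start=1):
            writer.writerow(row)
            if i % flush_every == 0:
                f.flush()
```

Flushing periodically matters for long URL lists: with a 1-second delay per request, a run over thousands of pages can take hours, and partial output is better than none.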
Want me to run this for you?
I offer this as a managed service. You get the insights without touching the tool.
Related Tools
Competitor Content Gap Finder
Content: Discover which descriptive words competitors use in titles that you are missing.
Content Consolidation Analyser
Content: Find cannibalising pages by clustering URLs that share SERP overlap.
Content Hub Classifier
Content: Classify articles into content hubs using GPT-4o-mini with structured JSON output.
Let's work together
Monthly retainers or one-off projects. No lengthy reports that sit in a drawer.
Let's Talk