
Content Block Extractor

Use cases

  • Identifying common template patterns at scale
  • Finding content blocks for scraping or migration
  • Analysing page structure consistency
  • Generating XPaths for content extraction

Uses Claude Haiku to analyse filtered HTML and identify major content blocks with XPath expressions.

Removes script, style, noscript, meta, link, header, footer, and nav tags to reduce token usage.
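The filtering step can be illustrated with a minimal sketch using only the Python standard library. This is not the tool's actual code: the class name `TagFilter` and the function `filter_html` are hypothetical, and the real script may use a different parser. The sketch drops the listed tags together with everything inside them, and also drops comments and the doctype as a side effect.

```python
from html.parser import HTMLParser

# Tags named in the description above; their contents are discarded too.
STRIP_TAGS = {"script", "style", "noscript", "meta", "link",
              "header", "footer", "nav"}
# meta and link are void elements (no closing tag), so they must not
# change the nesting depth.
VOID_STRIP_TAGS = {"meta", "link"}


class TagFilter(HTMLParser):
    """Hypothetical helper: re-emits HTML without the stripped tags."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a stripped container

    def handle_starttag(self, tag, attrs):
        if tag in STRIP_TAGS:
            if tag not in VOID_STRIP_TAGS:
                self.skip_depth += 1
            return
        if self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>; keep it unless stripped.
        if tag not in STRIP_TAGS and self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in STRIP_TAGS:
            if tag not in VOID_STRIP_TAGS and self.skip_depth > 0:
                self.skip_depth -= 1
            return
        if self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.skip_depth == 0:
            self.out.append(f"&#{name};")


def filter_html(html):
    """Return html with the STRIP_TAGS elements removed."""
    parser = TagFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Calling `filter_html(page_source)` before sending the page to the model keeps the main content markup while dropping the navigation, scripts, and styling that would otherwise consume tokens.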

Aggregates results across URLs to find common XPath patterns with frequency counts.
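The aggregation step might look something like the sketch below. The function name `xpath_frequencies` and the input shape (a mapping of URL to the XPaths found on that page) are assumptions for illustration. Counting each XPath once per page is also an assumption; the actual script may count every occurrence.

```python
from collections import Counter


def xpath_frequencies(results):
    """Count how many pages each XPath appears on.

    results: dict mapping URL -> list of XPath strings (assumed shape).
    Returns a list of (xpath, page_count) pairs, most frequent first.
    """
    counts = Counter()
    for xpaths in results.values():
        # De-duplicate within a page so a block repeated on one page
        # does not inflate its cross-page frequency.
        counts.update(set(xpaths))
    return counts.most_common()
```

An XPath that appears on most of the sampled URLs is a likely template block; one that appears only once is more likely page-specific content.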

Requires API Key

Platform

Python script (requires Python 3.x)

Input

URL list

Claude API key

Output

CSV with content blocks and XPath patterns


Features

  • Claude Haiku API for block detection
  • XPath expression generation for each block
  • HTML filtering removes nav/header/footer/scripts
  • Frequency analysis of XPath patterns across pages
  • Incremental CSV saves every 50 rows
  • Token usage reporting (input, cache, output)
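The incremental save feature can be sketched as below, assuming rows arrive as dictionaries and are flushed to disk in batches of 50 so that an interrupted run loses at most one batch. The function names and column names are hypothetical and not taken from the tool's source.

```python
import csv


def _flush(path, rows, fieldnames, first):
    # The first batch creates the file and writes the header;
    # later batches append to it.
    mode = "w" if first else "a"
    with open(path, mode, newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if first:
            writer.writeheader()
        writer.writerows(rows)


def save_incrementally(rows, path, fieldnames, batch_size=50):
    """Write rows to a CSV file in batches; return the number written."""
    buffer, written = [], 0
    for row in rows:
        buffer.append(row)
        if len(buffer) >= batch_size:
            _flush(path, buffer, fieldnames, first=(written == 0))
            written += len(buffer)
            buffer = []
    if buffer:  # final partial batch
        _flush(path, buffer, fieldnames, first=(written == 0))
        written += len(buffer)
    return written
```

Because `rows` can be a generator, each batch is written as soon as it fills, without holding the full result set in memory.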

How to use

  1. Add URLs to the input/urls.txt file
  2. Configure your Claude API key
  3. Run the script (1-second delay between requests)
  4. Review combined_output_with_frequency.csv
  5. Analyse the top 10 XPath patterns by frequency
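The run loop behind steps 1 and 3 could be sketched roughly as follows. This assumes one URL per line in input/urls.txt. The `process` callback stands in for the fetch-filter-analyse work on each URL, and both function names are hypothetical.

```python
import time
from pathlib import Path


def read_urls(path="input/urls.txt"):
    """Return the non-blank lines of the URL file, stripped of whitespace."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def run_all(urls, process, delay_seconds=1.0):
    """Call process(url) for each URL, pausing between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every request after the first
        results[url] = process(url)
    return results
```

The fixed delay keeps the request rate gentle on both the target sites and the API; passing `delay_seconds=0` is convenient for dry runs against a stub `process` function.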
