Does it extract only the main article content?

No. It strips script, style, nav, footer and header elements and returns everything else in the body, so sidebars, cookie notices, related-product widgets and other page furniture stay in the Content column. If you need boilerplate-free main content, use a dedicated extractor (the Reading Score Analyser on this site uses Trafilatura for that) or post-filter the output.
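To make the "strip a few tags, keep the rest" behaviour concrete, here is a minimal sketch. The tool itself uses BeautifulSoup; this version swaps in the standard library's html.parser so it runs without dependencies. The names BodyTextExtractor, extract_text and BLOCK_TAGS are illustrative rather than taken from the tool's source, and the input is assumed to be the body markup only.

```python
import re
from html.parser import HTMLParser

# Elements whose contents are dropped, per the description above.
SKIPPED_TAGS = {"script", "style", "nav", "footer", "header"}
# Assumption: a small set of block-level tags that should start a new line.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "section", "article", "aside", "main", "tr"}


class BodyTextExtractor(HTMLParser):
    """Collects text while ignoring anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0 and (tag == "br" or tag in BLOCK_TAGS):
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif self.skip_depth == 0 and tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of spaces and tabs, trim each line, drop empty lines.
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = (
    "<header>Site name</header><nav>Home | Shop</nav>"
    "<main><h1>Blue   widget</h1><p>Line one<br>Line two</p></main>"
    "<aside>Related products</aside>"
    "<script>var x = 1;</script><footer>Copyright</footer>"
)
cleaned = extract_text(sample)
```

In this sample the header, nav, script and footer text is dropped, but the aside ("Related products") survives, which is the kind of page furniture the answer above warns about.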

Will it work on JavaScript-rendered pages?

Only partially. Pages are fetched with plain HTTP requests and parsed as raw HTML, so content injected by JavaScript after load is never seen. Single-page apps typically return an empty or near-empty Content column with a Success status; check Content_Length to spot them.
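One way to act on the Content_Length hint is to scan the exported results for rows that succeeded but returned almost no text. This is a sketch under assumptions: the 200-character threshold, the "Failed" status value and the helper name likely_js_rendered are illustrative, and the inline CSV stands in for a real export.

```python
import csv
import io


def likely_js_rendered(rows, min_chars=200):
    """Return URLs that fetched successfully but yielded little or no text."""
    flagged = []
    for row in rows:
        length = int(row.get("Content_Length") or 0)
        if row.get("Status") == "Success" and length < min_chars:
            flagged.append(row["URL"])
    return flagged


# Stand-in for the downloaded CSV; a real export would be read from disk.
export = io.StringIO(
    "URL,Status,Content_Length\n"
    "https://example.com/article,Success,5230\n"
    "https://example.com/app,Success,12\n"
    "https://example.com/missing,Failed,0\n"
)
suspects = likely_js_rendered(csv.DictReader(export))
```

Flagged URLs are candidates for a headless-browser fetch rather than proof of JavaScript rendering; a genuinely short page would be flagged too.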

How do the delay and concurrency settings interact?

Each worker sleeps for a random 0.5x to 1.5x of the configured delay before its request, but workers run in parallel, so the overall request rate is at most roughly workers divided by delay. Three workers with a 1 second delay send up to about 3 requests per second to the target site, and somewhat fewer once response time is counted. Lower the worker count or raise the delay for fragile servers.
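The arithmetic can be sketched in a few lines. The function names are illustrative, and the response_time parameter is an addition of this sketch, included to show that the rate drops when the server is slow to answer.

```python
import random


def jittered_delay(delay: float) -> float:
    """Random pause between 0.5x and 1.5x of the configured delay."""
    return random.uniform(0.5 * delay, 1.5 * delay)


def approx_requests_per_second(workers: int, delay: float,
                               response_time: float = 0.0) -> float:
    """Each worker completes one request per (average pause + response time).

    The jittered pause averages out to the configured delay.
    """
    return workers / (delay + response_time)


ceiling = approx_requests_per_second(workers=3, delay=1.0)
with_latency = approx_requests_per_second(workers=3, delay=1.0, response_time=0.5)
gentle = approx_requests_per_second(workers=1, delay=2.0)
```

With these inputs, three workers and a 1 second delay give a ceiling of 3 requests per second, a 0.5 second response time brings that down to 2, and one worker with a 2 second delay sends about one request every two seconds.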

How does it handle pages with wrong or missing charset declarations?

It sets the response encoding from the content itself (requests' apparent_encoding) rather than trusting the HTTP header, which fixes most mojibake from mislabelled pages. Results are also re-sorted back to your input order after the concurrent fetches complete, so the output rows line up with your source list.
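The sketch below shows why decoding from the content helps. It is a deliberately crude stand-in: requests' apparent_encoding relies on a statistical charset detector, whereas this version only checks whether the bytes are valid UTF-8 before falling back to the declared charset. The function name decode_body is hypothetical.

```python
def decode_body(raw: bytes, declared: str = "iso-8859-1") -> str:
    """Prefer what the bytes look like over what the header claims."""
    try:
        # Text in another encoding rarely happens to be valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(declared, errors="replace")


raw = "café £5".encode("utf-8")

# Header wrongly says Latin-1: every multi-byte character turns into mojibake.
trusting_header = raw.decode("iso-8859-1")

# Decoding from the content recovers the original text.
from_content = decode_body(raw)

# Bytes that really are Latin-1 still decode correctly via the fallback.
genuine_latin1 = decode_body("café".encode("iso-8859-1"))
```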

What do the output columns actually contain?

Title is the text of the title tag. H1 is the first h1 on the page only; additional h1s are ignored. Content_Length is a character count, not a word count. Failed URLs keep their row, with the error message in the Error column, so nothing silently disappears.
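As a rough illustration of those column semantics, a result row could be assembled like this. The build_row helper, the exact column keys and the "Failed" status value are assumptions made for the example, not taken from the tool's source.

```python
def build_row(url, title="", h1_tags=(), content="", error=""):
    """Assemble one output row; a failed URL still gets a row."""
    return {
        "URL": url,
        "Title": title,
        "H1": h1_tags[0] if h1_tags else "",  # first h1 only
        "Content": content,
        "Content_Length": len(content),       # characters, not words
        "Status": "Failed" if error else "Success",
        "Error": error,
    }


ok = build_row("https://example.com/", "Home",
               ["Welcome", "Second heading"], "Hello world")
failed = build_row("https://example.com/slow", error="Read timed out")
```

Here "Hello world" gives a Content_Length of 11 even though it is two words, and the failed URL keeps its row with the message in Error.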

Content Extractor

Use cases

Content audits
Striking distance keyword analysis
Competitor content extraction
Bulk content gathering for analysis

Uses BeautifulSoup to remove script/style/nav/footer/header elements, replaces <br> with newlines, and normalises whitespace.

ThreadPoolExecutor for concurrent requests (1-10 workers).

Randomised rate limiting (0.5-1.5x configured delay) to avoid blocks.

Customisable User-Agent header (Chrome 120 default).
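The concurrency and rate-limiting pieces described above could fit together roughly as follows. This is a sketch, not the app's code: fetch_all and its fetch argument are hypothetical, fetch stands in for whatever performs the HTTP request and cleaning, and the row keys and "Failed" status are assumed. Each task carries its input index so rows can be put back in the original order after the concurrent fetches finish.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(urls, fetch, workers=3, delay=1.0):
    """Fetch urls concurrently, then return rows in the original input order."""

    def task(index, url):
        # Jittered pause: 0.5x to 1.5x of the configured delay.
        time.sleep(random.uniform(0.5 * delay, 1.5 * delay))
        try:
            return index, {"URL": url, "Status": "Success",
                           "Content": fetch(url), "Error": ""}
        except Exception as exc:  # keep the row so the failure stays visible
            return index, {"URL": url, "Status": "Failed",
                           "Content": "", "Error": str(exc)}

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task, i, url) for i, url in enumerate(urls)]
        finished = [f.result() for f in as_completed(futures)]  # completion order

    finished.sort(key=lambda pair: pair[0])  # back to input order
    return [row for _, row in finished]


def fake_fetch(url):
    """Stand-in for a real HTTP fetch, used only for this example."""
    if "bad" in url:
        raise ValueError("simulated failure")
    return f"text of {url}"


urls = ["https://example.com/a", "https://example.com/bad", "https://example.com/c"]
rows = fetch_all(urls, fake_fetch, workers=3, delay=0.01)
```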

Streamlit App

Platform

Browser-based (no installation required)

Input

URLs via text area (one per line) or CSV upload

Output

CSV or Excel file with the columns URL, Title, H1, Content, Content_Length, Status and Error. The display shows extraction progress.

Features

BeautifulSoup HTML cleaning (removes scripts, nav, footer)
ThreadPoolExecutor concurrent requests (1-10 workers)
Randomised rate limiting (0.5-1.5x delay)
Request timeout slider (5-30 seconds)
Customisable User-Agent header
CSV and Excel (.xlsx) export via openpyxl

How to use

1. Enter URLs, or upload a CSV and select the URL column
2. Configure the request delay (0.5-5.0 seconds)
3. Set concurrent workers (1-10) and timeout (5-30 seconds)
4. Optionally customise the User-Agent
5. Run the extraction
6. Download the results as CSV or Excel

Want me to run this for you?

I run this tool as a managed service, or build something custom around your data. You get the insights without touching the code.

Reading Score Analyser (App, Content)

Analyse content readability from sitemaps or URLs using Flesch scores.

Page Intent Classifier (App, Content)

Use OpenAI to classify page intent and expected user actions.

Review Sentiment Extractor (App, Content)

Use OpenAI to extract positive and negative sentiments from product reviews.

Keyword-to-Page Mapper (App, Content)

Semantically match keywords to existing pages using ML embeddings.

Category Title Suggester (App, Content)

Analyse category pages and suggest high-performing keywords for titles.

Percentage Change Calculator (Browser, Reporting)

Calculate percentage changes between two values quickly.

Need something built for your business?

This tool started as bespoke client work. I build custom scripts, data pipelines, and full apps for SEO and product data problems that off-the-shelf tools don't solve.