Web Media Scraper

A FastAPI-based web application that extracts embedded media URLs (images, videos, streams, iframes, API endpoints) from any webpage. It uses multiple extraction strategies — HTML parsing, JavaScript analysis, iframe recursion, API endpoint discovery, and (optionally) headless Chromium rendering — to find media URLs that simple scrapers miss.

Features

Multi-phase extraction pipeline: HTML tag parsing, inline script analysis, iframe/embed following (up to 3 levels deep), and API endpoint probing
Optional headless Chromium mode (opt-in per scan): Renders the page in real Chromium, follows iframe chains across origins, and captures XHR/fetch responses. This is useful for streaming sites that build the player with JavaScript and only expose their underlying API endpoint after a second request from inside the player iframe
Ad-network filtering: A bundled denylist of ~80 ad/tracker domains (doubleclick, googlesyndication, taboola, outbrain, …) drops ad URLs from results before they hit the database. User-extensible via AD_BLOCKLIST env var
Politeness controls: Configurable per-request delays (MIN_DELAY/MAX_DELAY), per-host connection cap, concurrent-scan semaphore, and robots.txt respect
SSRF guard: Outbound URL safety check (private IPs, link-local, non-HTTP schemes) gates both the plain HTTP client and the headless browser
URL classification: Automatically categorizes discovered URLs as image, video, stream, iframe, API, or other
API discovery: Detects WordPress AJAX endpoints, common embed API patterns (getSources, getStream), and synthesizes endpoints from HTML data attributes
Base64 decoding: Finds media URLs hidden inside base64-encoded strings in JavaScript
Network capture: Records and classifies URLs from all HTTP responses during the scan
SQLite storage: All scan results persisted asynchronously via SQLAlchemy + aiosqlite
Web UI: HTML interface with real-time scan progress, category filtering, scan history, JSON export, and a "JS-rendered" badge for headless scans
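
As an illustration of the base64 decoding feature listed above, here is a minimal sketch of how URLs hidden in base64 strings might be recovered from script text. It is not the code in js_analyzer.py: the function name `find_base64_urls`, the 16-character minimum token length, and the regular expression are all assumptions made for this example.

```python
import base64
import binascii
import re

# Runs of base64 alphabet long enough to plausibly hold a URL.
# The 16-character threshold is an assumption of this sketch.
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
URL_PREFIXES = ("http://", "https://")


def find_base64_urls(script_text: str) -> list[str]:
    """Return URLs recovered from base64-looking substrings of script_text."""
    found = []
    for match in B64_CANDIDATE.finditer(script_text):
        token = match.group(0)
        # Pad to a multiple of 4 so tokens written without '=' still decode.
        padded = token + "=" * (-len(token) % 4)
        try:
            decoded = base64.b64decode(padded, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid base64, or not text
        if decoded.startswith(URL_PREFIXES) and decoded not in found:
            found.append(decoded)
    return found
```

Tokens that decode to something other than an http(s) URL are silently skipped, so ordinary long identifiers in the script do not produce results.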

Project Structure

webscraper/
├── run.py                          # Entry point
├── pytest.ini                      # pytest configuration
├── requirements.txt
├── requirements-dev.txt            # Test-only deps
├── .env.example                    # Environment variable template
├── app/
│   ├── main.py                     # FastAPI app setup + /static mount
│   ├── config.py                   # Settings from env vars
│   ├── database.py                 # Async SQLAlchemy engine + idempotent migrations
│   ├── models.py                   # Scan, MediaURL, ScanStatus, MediaCategory
│   ├── routers/
│   │   ├── scan.py                 # Routes: /, /scan, /scan/{id}, /history, /scan/{id}/cancel
│   │   └── export.py               # Route: /scan/{id}/export/json
│   ├── scraper/
│   │   ├── engine.py               # Main scan pipeline (background task)
│   │   ├── browser.py              # httpx HTTP client
│   │   ├── headless.py             # Playwright Chromium driver (opt-in)
│   │   ├── ad_filter.py            # Bundled ad-network denylist
│   │   ├── html_parser.py          # Extract media from HTML tags
│   │   ├── js_analyzer.py          # Extract media from <script> blocks
│   │   ├── url_classifier.py       # URL normalization and classification
│   │   ├── url_hash.py             # Dedup key normalization
│   │   ├── api_discoverer.py       # API endpoint discovery + probing
│   │   ├── network_catcher.py      # Captures URLs from HTTP responses
│   │   ├── discovery.py            # Sitemap / RSS / feed probe
│   │   ├── robots.py               # robots.txt cache + respect
│   │   └── safety.py               # SSRF guard (is_safe_url)
│   ├── templates/                  # Jinja2 HTML templates
│   │   ├── base.html
│   │   ├── index.html
│   │   ├── history.html
│   │   └── components/
│   │       ├── scan_form.html      # Standalone form with js_render checkbox
│   │       └── results_body.html
│   └── static/                     # CSS assets (headless.css additive overlay)
└── scraper.db                      # SQLite database (created at runtime)

Prerequisites

Python 3.11+
pip
(Optional, for headless mode) Playwright + Chromium binary

Installation

# Clone the repo
git clone <repo-url>
cd webscraper

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

# (Optional, only if you want headless mode)
pip install playwright
playwright install chromium

Configuration

Copy the example env file and adjust as needed:

cp .env.example .env

Available environment variables:

Variable	Default	Description
`HOST`	`0.0.0.0`	Server bind address
`PORT`	`8000`	Server port
`DATABASE_URL`	`sqlite+aiosqlite:///./scraper.db`	Async database URL
`MAX_CONCURRENT_SCANS`	`2`	Max scans running in parallel
`MIN_DELAY`	`1.0`	Minimum delay between requests (seconds)
`MAX_DELAY`	`5.0`	Maximum delay between requests (seconds)
`HEADLESS_MAX_IFRAME_DEPTH`	`3`	How many iframe levels the headless driver will follow
`HEADLESS_BROWSER_TIMEOUT_MS`	`20000`	Per-page navigation timeout in headless mode
`AD_BLOCKLIST`	(empty)	Comma-separated extra ad domains to drop (e.g. `tracker.foo.com,ads.bar.com`)
`LOG_LEVEL`	`INFO`	Application log level
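
For orientation, the sketch below shows one way the variables in the table could be read with their documented defaults. It is not the contents of app/config.py: the `Settings` dataclass, `load_settings`, and `pick_delay` are hypothetical names, only a subset of the variables is covered, and sampling the delay uniformly between MIN_DELAY and MAX_DELAY is an assumption.

```python
import os
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    max_concurrent_scans: int
    min_delay: float
    max_delay: float


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        max_concurrent_scans=int(env.get("MAX_CONCURRENT_SCANS", "2")),
        min_delay=float(env.get("MIN_DELAY", "1.0")),
        max_delay=float(env.get("MAX_DELAY", "5.0")),
    )


def pick_delay(settings: Settings) -> float:
    """Pick a per-request delay in seconds (uniform sampling is an assumption)."""
    low, high = sorted((settings.min_delay, settings.max_delay))
    return random.uniform(low, high)
```

Passing a plain dict to `load_settings` instead of relying on `os.environ` keeps the function easy to exercise in tests.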

Running Locally

# Option 1: Using run.py
python run.py

# Option 2: Using uvicorn directly
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

The server starts at http://localhost:8000. The SQLite database (scraper.db) is created automatically on first launch, and any pending schema migrations (idempotent ALTER TABLE statements) are applied on startup.
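
To make "idempotent ALTER TABLE" concrete, here is a minimal sketch using the standard-library sqlite3 module rather than the project's async SQLAlchemy engine. The helper name `add_column_if_missing` and the table name `scans` are assumptions; `blocked_count` is borrowed from the Ad Filtering section purely as an example column.

```python
import sqlite3


def add_column_if_missing(conn, table, column, ddl_type):
    """Add a column only when it is absent, so repeated startups are harmless.

    The identifiers are interpolated into SQL, so they must be trusted constants.
    """
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (id INTEGER PRIMARY KEY, url TEXT)")
first = add_column_if_missing(conn, "scans", "blocked_count", "INTEGER DEFAULT 0")
second = add_column_if_missing(conn, "scans", "blocked_count", "INTEGER DEFAULT 0")
print(first, second)  # prints: True False
```

The second call is a no-op, which is the property that lets the migration run on every startup.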

Usage

Web UI

Open http://localhost:8000 in your browser
Enter a URL and click Scan
(Optional) Open Advanced and tick Use headless browser to render the page in Chromium
Watch real-time progress as the scraper runs through its pipeline
View results filtered by category (image, video, stream, iframe, API, other)
Export results as JSON from the scan results page

API Endpoints

Method	Path	Description
`GET`	`/`	Home page with scan form
`POST`	`/scan`	Start a new scan. Form fields: `url`, `max_pages`, `max_depth`, `js_render` (1/0)
`GET`	`/scan/{id}`	View scan results (query param: `category`)
`GET`	`/scan/{id}/status`	HTMX partial for live scan status
`POST`	`/scan/{id}/cancel`	Cooperatively cancel a running scan
`GET`	`/history`	List of recent scans
`GET`	`/scan/{id}/export/json`	Download results as JSON

Example: Start a plain scan via curl

curl -X POST http://localhost:8000/scan -d "url=https://example.com" -L

Example: Start a headless scan via curl

curl -X POST http://localhost:8000/scan \
  -d "url=https://example.com" \
  -d "js_render=1" \
  -L

Example: Export results as JSON

curl http://localhost:8000/scan/1/export/json -o results.json
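
The scan requests above can also be built from Python. The following is a sketch using only the standard library; `build_scan_request` is a hypothetical helper, and it only constructs the form-encoded POST (with the documented `url` and `js_render` fields) without sending it.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_scan_request(base_url: str, target: str, js_render: bool = False) -> Request:
    """Build the POST /scan request with the documented form fields."""
    fields = {"url": target}
    if js_render:
        fields["js_render"] = "1"  # opt in to headless rendering
    body = urlencode(fields).encode("ascii")
    return Request(
        f"{base_url.rstrip('/')}/scan",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Passing the returned object to `urllib.request.urlopen` would send it to a running server.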

How It Works

The scan pipeline runs as a FastAPI background task. Plain (default) and headless scans share the same downstream phases; the only difference is whether the optional headless render step (Phase 1.5, between the fetch and HTML parsing) runs.

Fetch — Downloads the target page over standard HTTPS with an honest User-Agent
Headless render (optional) — If js_render=1, launches Playwright Chromium, waits for networkidle, and captures every XHR/fetch response plus the post-render DOM. Iframes are followed up to HEADLESS_MAX_IFRAME_DEPTH levels by re-fetching the iframe src in a fresh top-level page (cross-origin-safe)
HTML parsing — Extracts URLs from <img>, <video>, <source>, <iframe>, <embed>, <a>, <meta> (Open Graph), and CSS background-image
JavaScript analysis — Parses inline <script> blocks for media URL patterns, JSON blobs, quoted URLs, base64-encoded URLs, and API endpoint patterns
Sitemap / feed probe — Discovers sitemap.xml, RSS, and Atom feeds linked from the page
Iframe recursion — Follows iframe/embed sources up to max_depth levels to find media in nested player pages
API discovery — Probes common API patterns on discovered player domains, synthesizes endpoints from HTML data attributes, and detects WordPress AJAX configurations
Network capture — Classifies URLs from all HTTP responses and walks JSON response bodies for embedded media URLs
Ad filter — Drops URLs matching the bundled denylist (plus anything in AD_BLOCKLIST); the dropped count is recorded on the Scan row
Merge & deduplicate — Combines all results, normalizes URLs, removes duplicates, and saves to the database
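
The final merge step can be pictured roughly as follows. This is a sketch under stated assumptions, not the logic in url_classifier.py or url_hash.py: the extension table, the normalization rules (lowercase scheme and host, drop the fragment), and the names `normalize`, `classify`, and `merge` are all illustrative.

```python
from urllib.parse import urlsplit, urlunsplit

# Extension-to-category table; the extensions listed are illustrative assumptions.
EXTENSIONS = {
    "image": (".jpg", ".jpeg", ".png", ".gif", ".webp"),
    "video": (".mp4", ".webm", ".mkv"),
    "stream": (".m3u8", ".mpd"),
}


def normalize(url: str) -> str:
    """Lowercase scheme and host and drop the fragment to form a dedup key."""
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


def classify(url: str) -> str:
    """Classify by path extension only; anything unrecognized is 'other'."""
    path = urlsplit(url).path.lower()
    for category, suffixes in EXTENSIONS.items():
        if path.endswith(suffixes):
            return category
    return "other"


def merge(*phases):
    """Combine per-phase URL lists into {normalized_url: category}, first seen wins."""
    merged = {}
    for phase in phases:
        for url in phase:
            key = normalize(url)
            merged.setdefault(key, classify(key))
    return merged
```

Categories such as iframe and API cannot be inferred from a file extension, so a fuller classifier would also need to know where each URL was found.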

Headless Mode

Modern streaming sites (gogoanime, animepahe, megacloud embeds, JW Player hosts) build the page in the browser. The server response is just a shell — the "underlying API endpoint" you actually want, typically a JSON call returning {"sources": [{"file": "https://...m3u8"}]}, only fires after the page boots, the player iframe is followed, and a second XHR goes out from inside the iframe.
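
As a small illustration of what is being looked for in such a response, the sketch below walks a decoded JSON body and collects string values that look like media URLs. The function name `walk_media_urls` and the suffix list are assumptions for this example, and the sample body simply mirrors the shape quoted above.

```python
import json


def walk_media_urls(node, suffixes=(".m3u8", ".mpd", ".mp4")):
    """Yield string values anywhere in a decoded JSON tree that look like media URLs."""
    if isinstance(node, dict):
        for value in node.values():
            yield from walk_media_urls(value, suffixes)
    elif isinstance(node, list):
        for item in node:
            yield from walk_media_urls(item, suffixes)
    elif isinstance(node, str):
        # Ignore the query string when checking the extension.
        path = node.split("?", 1)[0].lower()
        if node.startswith(("http://", "https://")) and path.endswith(suffixes):
            yield node


body = '{"sources": [{"file": "https://cdn.example.com/hls/master.m3u8?t=1"}], "tracks": []}'
print(list(walk_media_urls(json.loads(body))))
# prints: ['https://cdn.example.com/hls/master.m3u8?t=1']
```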

To get those URLs, tick Use headless browser in the scan form (or send js_render=1 via curl). The scraper will:

Launch Playwright Chromium in headless mode
Navigate to the page, wait for networkidle
Capture every XHR/fetch response (URL, method, status, content-type, request/response headers, body preview)
BFS through cross-origin iframes, re-fetching each iframe src in a fresh top-level page (this is the only way to read the XHR callsites from inside a cross-origin iframe)
Feed every captured URL — and the post-render DOM at every iframe level — through the same extract_from_html / extract_from_scripts / api_discoverer pipeline used by plain scans
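
The iframe traversal in the steps above can be sketched as a depth-limited breadth-first search. This is not the code in headless.py: `crawl_iframes` is a hypothetical name, and `get_iframe_srcs` is a stand-in callback for "open this URL in a fresh top-level page and return the iframe src values found there".

```python
from collections import deque


def crawl_iframes(start_url, get_iframe_srcs, max_depth=3):
    """Breadth-first walk of iframe sources, visiting each URL once.

    The start page is depth 0; pages at max_depth are recorded but not expanded.
    """
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for src in get_iframe_srcs(url):
            if src not in seen:  # guards against iframe cycles
                seen.add(src)
                visited.append(src)
                queue.append((src, depth + 1))
    return visited
```

With `max_depth` set from HEADLESS_MAX_IFRAME_DEPTH, a chain of nested players is followed only as far as the configured number of levels, and a page that embeds itself is not revisited.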

Result pages from headless scans show a small JS-rendered badge next to the scan title.

One-time setup:

pip install playwright
playwright install chromium   # downloads ~150MB browser binary

Without the binary, plain scans still work — headless mode gracefully falls back to an empty result set and logs a warning.

Safety: every page navigation is gated by the same is_safe_url SSRF guard the plain HTTP client uses, via a page.route() handler that aborts unsafe requests before they leave the browser.
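
A minimal sketch of this kind of guard is shown below. It is not the project's is_safe_url: `looks_safe_url` and `guard` are hypothetical names, and the check only inspects literal IP addresses, whereas a fuller guard would also resolve hostnames and check every resolved address. The handler relies on the `route.request.url`, `route.continue_()`, and `route.abort()` members of Playwright's Route object.

```python
import ipaddress
from urllib.parse import urlsplit


def looks_safe_url(url: str) -> bool:
    """Reject non-HTTP schemes and literal private, loopback, or link-local IPs."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False  # unparseable URL
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        ip = ipaddress.ip_address(parts.hostname)
    except ValueError:
        # A hostname, not an IP literal; this sketch does no DNS resolution.
        return parts.hostname != "localhost"
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)


async def guard(route):
    """Handler for Playwright's page.route(): abort unsafe requests, pass the rest."""
    if looks_safe_url(route.request.url):
        await route.continue_()
    else:
        await route.abort()
```

With Playwright installed, the handler would be registered with `await page.route("**/*", guard)` before navigating.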

Resource cost: headless scans take roughly 5-10x as long as plain scans. Run them on a desktop machine, not on a phone.

Ad Filtering

The bundled denylist in app/scraper/ad_filter.py covers ~80 ad/tracker/analytics domains (Google AdSense, DoubleClick, Taboola, Outbrain, MGID, Amazon A9, etc.). URLs matching the denylist are dropped before the result is saved, and the dropped count is recorded on the Scan.blocked_count column, surfaced on the result page as Blocked: N.

The denylist is domain-only — it matches tracker.foo.com against any URL whose host ends in tracker.foo.com (so eu.us.tracker.foo.com is also matched). Numeric IP URLs and malformed URLs are never matched (they pass through).

User override: set AD_BLOCKLIST to a comma-separated list of extra domains to drop. Entries are trimmed, lowercased, and stripped of leading dots. Malformed entries (no dot) are ignored.

# In .env
AD_BLOCKLIST=tracker.example.com,ads.foo.com
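
The parsing and matching rules described above can be sketched as follows. This is an illustration, not the code in ad_filter.py: `parse_blocklist` and `is_blocked` are hypothetical names, and matching on a label boundary (the host equals the entry or ends with a dot followed by the entry) is this sketch's reading of "host ends in".

```python
import ipaddress
from urllib.parse import urlsplit


def parse_blocklist(raw: str) -> set[str]:
    """Trim, lowercase, strip leading dots, and drop entries without a dot."""
    domains = set()
    for entry in raw.split(","):
        domain = entry.strip().lower().lstrip(".")
        if "." in domain:
            domains.add(domain)
    return domains


def is_blocked(url: str, domains: set[str]) -> bool:
    """Return True when the URL's host is a listed domain or one of its subdomains."""
    try:
        host = (urlsplit(url).hostname or "").lower()
    except ValueError:
        return False  # malformed URLs pass through
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so compare against the domain list.
        return any(host == d or host.endswith("." + d) for d in domains)
    return False  # numeric IP hosts are never matched
```

For example, `parse_blocklist(" Tracker.Example.com , .ads.foo.com, localhost")` keeps the first two entries and drops `localhost` because it contains no dot.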

AD_ALLOWLIST is not currently implemented — the simpler mental model is "the bundled denylist + your extras."

Testing

# Install test deps
pip install -r requirements-dev.txt

# Run the full suite
pytest

# Run only the ad-filter unit tests
pytest tests/test_ad_filter.py -v

# Run only the headless unit tests (pure-function ones; browser-gated tests skip automatically
# if PLAYWRIGHT_BROWSERS_PATH is unset)
pytest tests/test_headless.py -v

# Run the headless browser tests against an installed Chromium
PLAYWRIGHT_BROWSERS_PATH=/path/to/browsers pytest tests/test_headless.py -v

Test layout:

tests/test_ad_filter.py — denylist matching, env override, combined filter
tests/test_headless.py — pure-function helpers (URL classification, JSON peek) + browser-gated integration tests
tests/test_* — existing endpoint / pipeline / format tests

Tech Stack

FastAPI — Async web framework
SQLAlchemy 2.0 + aiosqlite — Async ORM with SQLite
BeautifulSoup 4 — HTML parsing
httpx — Async HTTP client
Jinja2 — Server-side HTML templates
HTMX — Partial page swaps for live status
Playwright — Headless Chromium (optional, for js_render=1)
yt-dlp — Media extraction utilities
Rich — CLI output formatting
pytest + pytest-asyncio + respx — Test stack
