hishantik/media-scraper

Web Media Scraper

A FastAPI-based web application that extracts embedded media URLs (images, videos, streams, iframes, API endpoints) from any webpage. It uses multiple extraction strategies — HTML parsing, JavaScript analysis, iframe recursion, API endpoint discovery, and (optionally) headless Chromium rendering — to find media URLs that simple scrapers miss.

Features

  • Multi-phase extraction pipeline: HTML tag parsing, inline script analysis, iframe/embed following (up to 3 levels deep), and API endpoint probing
  • Optional headless Chromium mode (opt-in per scan): Renders the page in real Chromium, follows iframe chains across origins, and captures XHR/fetch callsites — useful for streaming sites that build the player with JavaScript and only expose their underlying API endpoint after a second request from inside the player iframe
  • Ad-network filtering: A bundled denylist of ~80 ad/tracker domains (doubleclick, googlesyndication, taboola, outbrain, …) drops ad URLs from results before they hit the database. User-extensible via AD_BLOCKLIST env var
  • Politeness controls: Configurable per-request delays (MIN_DELAY/MAX_DELAY), per-host connection cap, concurrent-scan semaphore, and robots.txt respect
  • SSRF guard: Outbound URL safety check (private IPs, link-local, non-HTTP schemes) gates both the plain HTTP client and the headless browser
  • URL classification: Automatically categorizes discovered URLs as image, video, stream, iframe, API, or other
  • API discovery: Detects WordPress AJAX endpoints, common embed API patterns (getSources, getStream), and synthesizes endpoints from HTML data attributes
  • Base64 decoding: Finds media URLs hidden inside base64-encoded strings in JavaScript
  • Network capture: Records and classifies URLs from all HTTP responses during the scan
  • SQLite storage: All scan results persisted asynchronously via SQLAlchemy + aiosqlite
  • Web UI: HTML interface with real-time scan progress, category filtering, scan history, JSON export, and a "JS-rendered" badge for headless scans
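
As an illustration of the URL classification feature listed above, here is a minimal sketch that maps a URL to one of the six categories. The extension tables, the `API_HINTS` substrings, and the `classify_url` name are assumptions made for this example; the real rules live in app/scraper/url_classifier.py and may differ.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical lookup tables for illustration; not the project's actual rules.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg"}
VIDEO_EXTS = {".mp4", ".webm", ".mkv", ".mov"}
STREAM_EXTS = {".m3u8", ".mpd"}
API_HINTS = ("/api/", "/ajax/", "getsources", "getstream")


def classify_url(url: str, from_iframe: bool = False) -> str:
    """Return one of: image, video, stream, iframe, api, other."""
    # Only the path is inspected for the extension, so query strings are ignored.
    path = urlparse(url).path.lower()
    ext = posixpath.splitext(path)[1]
    if ext in STREAM_EXTS:
        return "stream"
    if ext in VIDEO_EXTS:
        return "video"
    if ext in IMAGE_EXTS:
        return "image"
    # The caller knows whether the URL came from an <iframe>/<embed> src.
    if from_iframe:
        return "iframe"
    if any(hint in url.lower() for hint in API_HINTS):
        return "api"
    return "other"
```

Checking extensions before the iframe and API hints means a direct media file is labeled as media even when it was found inside an embed.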

Project Structure

webscraper/
├── run.py                          # Entry point
├── pytest.ini                      # pytest configuration
├── requirements.txt
├── requirements-dev.txt            # Test-only deps
├── .env.example                    # Environment variable template
├── app/
│   ├── main.py                     # FastAPI app setup + /static mount
│   ├── config.py                   # Settings from env vars
│   ├── database.py                 # Async SQLAlchemy engine + idempotent migrations
│   ├── models.py                   # Scan, MediaURL, ScanStatus, MediaCategory
│   ├── routers/
│   │   ├── scan.py                 # Routes: /, /scan, /scan/{id}, /history, /scan/{id}/cancel
│   │   └── export.py               # Route: /scan/{id}/export/json
│   ├── scraper/
│   │   ├── engine.py               # Main scan pipeline (background task)
│   │   ├── browser.py              # httpx HTTP client
│   │   ├── headless.py             # Playwright Chromium driver (opt-in)
│   │   ├── ad_filter.py            # Bundled ad-network denylist
│   │   ├── html_parser.py          # Extract media from HTML tags
│   │   ├── js_analyzer.py          # Extract media from <script> blocks
│   │   ├── url_classifier.py       # URL normalization and classification
│   │   ├── url_hash.py             # Dedup key normalization
│   │   ├── api_discoverer.py       # API endpoint discovery + probing
│   │   ├── network_catcher.py      # Captures URLs from HTTP responses
│   │   ├── discovery.py            # Sitemap / RSS / feed probe
│   │   ├── robots.py               # robots.txt cache + respect
│   │   └── safety.py               # SSRF guard (is_safe_url)
│   ├── templates/                  # Jinja2 HTML templates
│   │   ├── base.html
│   │   ├── index.html
│   │   ├── history.html
│   │   └── components/
│   │       ├── scan_form.html      # Standalone form with js_render checkbox
│   │       └── results_body.html
│   └── static/                     # CSS assets (headless.css additive overlay)
└── scraper.db                      # SQLite database (created at runtime)

Prerequisites

  • Python 3.11+
  • pip
  • (Optional, for headless mode) Playwright + Chromium binary

Installation

# Clone the repo
git clone <repo-url>
cd webscraper

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/macOS
# venv\Scripts\activate   # Windows

# Install dependencies
pip install -r requirements.txt

# (Optional, only if you want headless mode)
pip install playwright
playwright install chromium

Configuration

Copy the example env file and adjust as needed:

cp .env.example .env

Available environment variables:

Variable                     Default                           Description
HOST                         0.0.0.0                           Server bind address
PORT                         8000                              Server port
DATABASE_URL                 sqlite+aiosqlite:///./scraper.db  Async database URL
MAX_CONCURRENT_SCANS         2                                 Max scans running in parallel
MIN_DELAY                    1.0                               Minimum delay between requests (seconds)
MAX_DELAY                    5.0                               Maximum delay between requests (seconds)
HEADLESS_MAX_IFRAME_DEPTH    3                                 How many iframe levels the headless driver will follow
HEADLESS_BROWSER_TIMEOUT_MS  20000                             Per-page navigation timeout in headless mode (milliseconds)
AD_BLOCKLIST                 (empty)                           Comma-separated extra ad domains to drop (e.g. tracker.foo.com,ads.bar.com)
LOG_LEVEL                    INFO                              Application log level
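
For readers who want to see how these variables and defaults fit together, the sketch below reads them from an environment mapping into a typed object. The `Settings` dataclass and `load_settings` function are hypothetical names for this example; app/config.py may be organized differently, and this sketch does not parse the .env file itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    database_url: str
    max_concurrent_scans: int
    min_delay: float
    max_delay: float
    headless_max_iframe_depth: int
    headless_browser_timeout_ms: int
    ad_blocklist: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, using the documented defaults."""
    return Settings(
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
        database_url=env.get("DATABASE_URL", "sqlite+aiosqlite:///./scraper.db"),
        max_concurrent_scans=int(env.get("MAX_CONCURRENT_SCANS", "2")),
        min_delay=float(env.get("MIN_DELAY", "1.0")),
        max_delay=float(env.get("MAX_DELAY", "5.0")),
        headless_max_iframe_depth=int(env.get("HEADLESS_MAX_IFRAME_DEPTH", "3")),
        headless_browser_timeout_ms=int(env.get("HEADLESS_BROWSER_TIMEOUT_MS", "20000")),
        ad_blocklist=env.get("AD_BLOCKLIST", ""),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Passing the mapping as a parameter keeps the loader easy to test: `load_settings({"PORT": "9000"})` overrides only the port and leaves every other value at its default.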

Running Locally

# Option 1: Using run.py
python run.py

# Option 2: Using uvicorn directly
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

The server starts at http://localhost:8000. The SQLite database (scraper.db) is created automatically on first launch, and any pending schema migrations (idempotent ALTER TABLE statements) are applied on startup.
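
An idempotent ALTER TABLE migration can be sketched as "add the column only if it is missing". The example below uses the standard-library `sqlite3` module for brevity, whereas the project itself uses async SQLAlchemy with aiosqlite. The `add_column_if_missing` helper and the `scans` table name are assumptions for illustration; `blocked_count` is the column described in the Ad Filtering section.

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str, column: str, ddl_type: str) -> bool:
    """Add a column unless it already exists. Returns True if the table was altered.

    Identifiers are interpolated directly, so only call this with trusted,
    hard-coded table and column names.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    return True


# Usage: running the same migration twice is harmless.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (id INTEGER PRIMARY KEY)")
first = add_column_if_missing(conn, "scans", "blocked_count", "INTEGER DEFAULT 0")
second = add_column_if_missing(conn, "scans", "blocked_count", "INTEGER DEFAULT 0")
```

Here `first` is `True` and `second` is `False`, which is what makes it safe to run on every startup.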

Usage

Web UI

  1. Open http://localhost:8000 in your browser
  2. Enter a URL and click Scan
  3. (Optional) Open Advanced and tick Use headless browser to render the page in Chromium
  4. Watch real-time progress as the scraper runs through its pipeline
  5. View results filtered by category (image, video, stream, iframe, API, other)
  6. Export results as JSON from the scan results page

API Endpoints

Method  Path                    Description
GET     /                       Home page with scan form
POST    /scan                   Start a new scan. Form fields: url, max_pages, max_depth, js_render (1/0)
GET     /scan/{id}              View scan results (query param: category)
GET     /scan/{id}/status       HTMX partial for live scan status
POST    /scan/{id}/cancel       Cooperatively cancel a running scan
GET     /history                List of recent scans
GET     /scan/{id}/export/json  Download results as JSON

Example: Start a plain scan via curl

curl -X POST http://localhost:8000/scan -d "url=https://example.com" -L

Example: Start a headless scan via curl

curl -X POST http://localhost:8000/scan \
  -d "url=https://example.com" \
  -d "js_render=1" \
  -L

Example: Export results as JSON

curl http://localhost:8000/scan/1/export/json -o results.json
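
The same scan request can be built from Python with only the standard library. This is a sketch: `build_scan_request` is a hypothetical helper, and it assumes the `url` and `js_render` form fields behave as documented in the table above.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_scan_request(base_url: str, target_url: str, js_render: bool = False) -> Request:
    """Build (but do not send) the form-encoded POST /scan request."""
    fields = {"url": target_url}
    if js_render:
        fields["js_render"] = "1"
    body = urlencode(fields).encode("ascii")
    return Request(f"{base_url.rstrip('/')}/scan", data=body, method="POST")


req = build_scan_request("http://localhost:8000", "https://example.com", js_render=True)
# To actually send it against a running server:
#   from urllib.request import urlopen
#   with urlopen(req) as resp:
#       html = resp.read()
```

Keeping request construction separate from sending makes the encoding easy to inspect before any network traffic happens.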

How It Works

The scan pipeline runs as a FastAPI background task. Plain (default) and headless scans share the same downstream phases; the only difference is whether the optional headless render step (step 2 below) runs.

  1. Fetch — Downloads the target page over standard HTTPS with an honest User-Agent
  2. Headless render (optional) — If js_render=1, launches Playwright Chromium, waits for networkidle, and captures every XHR/fetch response plus the post-render DOM. Iframes are followed up to HEADLESS_MAX_IFRAME_DEPTH levels by re-fetching the iframe src in a fresh top-level page (cross-origin-safe)
  3. HTML parsing — Extracts URLs from <img>, <video>, <source>, <iframe>, <embed>, <a>, <meta> (Open Graph), and CSS background-image
  4. JavaScript analysis — Parses inline <script> blocks for media URL patterns, JSON blobs, quoted URLs, base64-encoded URLs, and API endpoint patterns
  5. Sitemap / feed probe — Discovers sitemap.xml, RSS, and Atom feeds linked from the page
  6. Iframe recursion — Follows iframe/embed sources up to max_depth levels to find media in nested player pages
  7. API discovery — Probes common API patterns on discovered player domains, synthesizes endpoints from HTML data attributes, and detects WordPress AJAX configurations
  8. Network capture — Classifies URLs from all HTTP responses and walks JSON response bodies for embedded media URLs
  9. Ad filter — Drops URLs matching the bundled denylist (plus anything in AD_BLOCKLIST); the dropped count is recorded on the Scan row
  10. Merge & deduplicate — Combines all results, normalizes URLs, removes duplicates, and saves to the database
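
To make the base64 part of the JavaScript analysis step concrete, the sketch below looks for quoted base64 literals in a script and keeps those that decode to an HTTP(S) URL. The regular expression, the 16-character minimum, and the `find_base64_urls` name are assumptions for this example, not the behavior of app/scraper/js_analyzer.py.

```python
import base64
import binascii
import re

# Quoted runs of base64 characters; 16+ chars avoids most short identifiers.
B64_LITERAL = re.compile(r"""["']([A-Za-z0-9+/]{16,}={0,2})["']""")


def find_base64_urls(script: str) -> list[str]:
    """Return URLs hidden in base64-encoded string literals of a script."""
    found = []
    for match in B64_LITERAL.finditer(script):
        token = match.group(1)
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            # Not valid base64, or not text: ignore this literal.
            continue
        if decoded.startswith(("http://", "https://")):
            found.append(decoded)
    return found


encoded = base64.b64encode(b"https://cdn.example.com/v/master.m3u8").decode("ascii")
script = f'var src = atob("{encoded}"); var label = "plain text here";'
urls = find_base64_urls(script)
```

Most long alphanumeric literals either fail strict decoding or decode to bytes that are not a URL, so the `startswith` check acts as the final filter.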

Headless Mode

Modern streaming sites (gogoanime, animepahe, megacloud embeds, JW Player hosts) build the page in the browser. The server response is just a shell — the "underlying API endpoint" you actually want, typically a JSON call returning {"sources": [{"file": "https://...m3u8"}]}, only fires after the page boots, the player iframe is followed, and a second XHR goes out from inside the iframe.
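
Once such a JSON response has been captured, pulling the media URLs out of it is a recursive walk over the parsed body. The sketch below is illustrative only: `walk_media_urls` and the `MEDIA_SUFFIXES` tuple are assumed names, and the project's own JSON walker may use different rules.

```python
import json
from typing import Any, Iterator

# Hypothetical suffix list for illustration.
MEDIA_SUFFIXES = (".m3u8", ".mpd", ".mp4", ".webm")


def walk_media_urls(node: Any) -> Iterator[str]:
    """Yield media-looking URLs found anywhere inside a parsed JSON value."""
    if isinstance(node, dict):
        for value in node.values():
            yield from walk_media_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from walk_media_urls(item)
    elif isinstance(node, str) and node.startswith(("http://", "https://")):
        # Compare the part before the query string so tokens do not hide the suffix.
        path = node.split("?", 1)[0].lower()
        if path.endswith(MEDIA_SUFFIXES):
            yield node


body = (
    '{"sources": [{"file": "https://cdn.example.com/hls/master.m3u8?token=abc"}],'
    ' "tracks": [{"file": "https://cdn.example.com/subs/en.vtt"}]}'
)
media = list(walk_media_urls(json.loads(body)))
```

In this example only the `.m3u8` source is returned; the subtitle track is skipped because `.vtt` is not in the suffix list.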

To get those URLs, tick Use headless browser in the scan form (or send js_render=1 via curl). The scraper will:

  1. Launch Playwright Chromium in headless mode
  2. Navigate to the page, wait for networkidle
  3. Capture every XHR/fetch response (URL, method, status, content-type, request/response headers, body preview)
  4. BFS through cross-origin iframes, re-fetching each iframe src in a fresh top-level page (this is the only way to read the XHR callsites from inside a cross-origin iframe)
  5. Feed every captured URL — and the post-render DOM at every iframe level — through the same extract_from_html / extract_from_scripts / api_discoverer pipeline used by plain scans
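
The breadth-first iframe traversal in step 4 can be sketched without a browser by abstracting "load this page and list its iframe src values" into a callback. `crawl_iframes` and `get_iframe_srcs` are hypothetical names, and the depth semantics (frames at `max_depth` are visited but not expanded) are an assumption about how HEADLESS_MAX_IFRAME_DEPTH is applied.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_iframes(
    start_url: str,
    get_iframe_srcs: Callable[[str], Iterable[str]],
    max_depth: int = 3,
) -> list[tuple[str, int]]:
    """Visit pages breadth-first, returning (url, depth) in visit order."""
    visited = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # deepest allowed level: visit it, but do not expand it
        for src in get_iframe_srcs(url):
            if src not in visited:  # guards against iframe loops
                visited.add(src)
                queue.append((src, depth + 1))
    return order


# Usage with a fake page graph standing in for real navigations.
graph = {
    "https://site.example/watch": ["https://player.example/embed/1", "https://ads.example/frame"],
    "https://player.example/embed/1": ["https://cdn.example/inner", "https://site.example/watch"],
    "https://cdn.example/inner": ["https://deep.example/x"],
}
visits = crawl_iframes("https://site.example/watch", lambda u: graph.get(u, []), max_depth=2)
```

With `max_depth=2`, the inner frame is visited at depth 2 but `https://deep.example/x` is never loaded, and the back-reference to the start page is ignored by the `visited` set.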

Result pages from headless scans show a small JS-rendered badge next to the scan title.

One-time setup:

pip install playwright
playwright install chromium   # downloads ~150MB browser binary

Without the binary, plain scans still work — headless mode gracefully falls back to an empty result set and logs a warning.

Safety: every page navigation is gated by the same is_safe_url SSRF guard the plain HTTP client uses, via a page.route() handler that aborts unsafe requests before they leave the browser.
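
For orientation, an SSRF check of this kind usually rejects non-HTTP schemes and any host that resolves to a private, loopback, or link-local address. The sketch below shares the `is_safe_url` name with app/scraper/safety.py, but its body is an assumption written for this example, not a copy of the project's implementation. The `resolve` parameter is a hypothetical hook so the check can be exercised without DNS.

```python
import ipaddress
import socket
from typing import Callable, List
from urllib.parse import urlparse


def resolve_host(host: str) -> List[str]:
    """Resolve a hostname (or IP literal) to its address strings."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_safe_url(url: str, resolve: Callable[[str], List[str]] = resolve_host) -> bool:
    """Return True only for http(s) URLs whose host resolves to public addresses."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:
        return False
    if parsed.scheme not in ("http", "https") or not host:
        return False
    try:
        addresses = [ipaddress.ip_address(raw) for raw in resolve(host)]
    except (OSError, ValueError):
        return False  # resolution failed or returned something unparseable
    if not addresses:
        return False
    return not any(
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
        for ip in addresses
    )
```

A check like this is only as strong as the guarantee that the connection later goes to the same address that was validated, which is one reason to apply it to every outbound request rather than only the first.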

Resource cost: a headless scan is 5-10x slower than a plain scan. Run headless scans on a desktop machine, not on a phone.

Ad Filtering

The bundled denylist in app/scraper/ad_filter.py covers ~80 ad/tracker/analytics domains (Google AdSense, DoubleClick, Taboola, Outbrain, MGID, Amazon A9, etc.). URLs matching the denylist are dropped before the result is saved, and the dropped count is recorded on the Scan.blocked_count column, surfaced on the result page as Blocked: N.

The denylist is domain-only: an entry such as tracker.foo.com matches any URL whose host ends in tracker.foo.com, so eu.us.tracker.foo.com is also matched. URLs with a numeric IP host and malformed URLs never match and pass through unfiltered.

User override: set AD_BLOCKLIST to a comma-separated list of extra domains to drop. Entries are trimmed, lowercased, and stripped of leading dots. Malformed entries (no dot) are ignored.

# In .env
AD_BLOCKLIST=tracker.example.com,ads.foo.com

AD_ALLOWLIST is not currently implemented — the simpler mental model is "the bundled denylist + your extras."
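
The matching and override rules described in this section can be sketched as follows. `BUNDLED` is a four-entry stand-in for the real ~80-domain list, and `parse_blocklist` and `is_ad_url` are hypothetical names. The sketch also assumes matching happens on a dot boundary (so nottracker.foo.com would not match tracker.foo.com); the actual code in app/scraper/ad_filter.py may differ.

```python
import ipaddress
from typing import AbstractSet, Set
from urllib.parse import urlparse

# Stand-in subset of the bundled denylist, for illustration only.
BUNDLED = frozenset({"doubleclick.net", "googlesyndication.com", "taboola.com", "outbrain.com"})


def parse_blocklist(raw: str) -> Set[str]:
    """Parse an AD_BLOCKLIST value: trim, lowercase, strip leading dots, drop dotless entries."""
    entries = set()
    for item in raw.split(","):
        domain = item.strip().lower().lstrip(".")
        if "." in domain:
            entries.add(domain)
    return entries


def is_ad_url(url: str, denylist: AbstractSet[str]) -> bool:
    """Return True if the URL's host is a denylisted domain or a subdomain of one."""
    try:
        host = urlparse(url).hostname
    except ValueError:
        return False  # malformed URLs pass through
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
        return False  # numeric IP hosts are never matched
    except ValueError:
        pass
    return any(host == d or host.endswith("." + d) for d in denylist)


denylist = BUNDLED | parse_blocklist(" Tracker.Foo.com, .ads.bar.com ,localhost,")
```

With this denylist, `https://eu.us.tracker.foo.com/pixel.gif` is dropped, while `http://192.0.2.10/ad.js` and the dotless `localhost` entry have no effect.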

Testing

# Install test deps
pip install -r requirements-dev.txt

# Run the full suite
pytest

# Run only the ad-filter unit tests
pytest tests/test_ad_filter.py -v

# Run only the headless unit tests (pure-function ones; browser-gated tests skip automatically
# if PLAYWRIGHT_BROWSERS_PATH is unset)
pytest tests/test_headless.py -v

# Run the headless browser tests against an installed Chromium
PLAYWRIGHT_BROWSERS_PATH=/path/to/browsers pytest tests/test_headless.py -v

Test layout:

  • tests/test_ad_filter.py — denylist matching, env override, combined filter
  • tests/test_headless.py — pure-function helpers (URL classification, JSON peek) + browser-gated integration tests
  • tests/test_* — existing endpoint / pipeline / format tests

Tech Stack

  • FastAPI — Async web framework
  • SQLAlchemy 2.0 + aiosqlite — Async ORM with SQLite
  • BeautifulSoup 4 — HTML parsing
  • httpx — Async HTTP client
  • Jinja2 — Server-side HTML templates
  • HTMX — Partial page swaps for live status
  • Playwright — Headless Chromium (optional, for js_render=1)
  • yt-dlp — Media extraction utilities
  • Rich — CLI output formatting
  • pytest + pytest-asyncio + respx — Test stack
