A FastAPI-based web application that extracts embedded media URLs (images, videos, streams, iframes, API endpoints) from any webpage. It uses multiple extraction strategies — HTML parsing, JavaScript analysis, iframe recursion, API endpoint discovery, and (optionally) headless Chromium rendering — to find media URLs that simple scrapers miss.
- Multi-phase extraction pipeline: HTML tag parsing, inline script analysis, iframe/embed following (up to 3 levels deep), and API endpoint probing
- Optional headless Chromium mode (opt-in per scan): Renders the page in real Chromium, follows iframe chains across origins, and captures XHR/fetch callsites — useful for streaming sites that build the player with JavaScript and only expose their underlying API endpoint after a second request from inside the player iframe
- Ad-network filtering: A bundled denylist of ~80 ad/tracker domains (doubleclick, googlesyndication, taboola, outbrain, …) drops ad URLs from results before they hit the database. User-extensible via `AD_BLOCKLIST` env var
- Politeness controls: Configurable per-request delays (`MIN_DELAY`/`MAX_DELAY`), per-host connection cap, concurrent-scan semaphore, and `robots.txt` respect
- SSRF guard: Outbound URL safety check (private IPs, link-local, non-HTTP schemes) gates both the plain HTTP client and the headless browser
- URL classification: Automatically categorizes discovered URLs as image, video, stream, iframe, API, or other
- API discovery: Detects WordPress AJAX endpoints, common embed API patterns (`getSources`, `getStream`), and synthesizes endpoints from HTML data attributes
- Base64 decoding: Finds media URLs hidden inside base64-encoded strings in JavaScript
- Network capture: Records and classifies URLs from all HTTP responses during the scan
- SQLite storage: All scan results persisted asynchronously via SQLAlchemy + aiosqlite
- Web UI: HTML interface with real-time scan progress, category filtering, scan history, JSON export, and a "JS-rendered" badge for headless scans
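
As a rough illustration of the URL classification feature listed above, the sketch below sorts URLs by file extension and path hints. The extension tables and the `classify_url` name are assumptions made for this example, not the contents of `url_classifier.py`; the iframe category is left out because it presumably depends on the tag a URL was found in rather than on the URL itself.

```python
import posixpath
from urllib.parse import urlparse

# Hypothetical extension tables; the real classifier may use different rules.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg"}
VIDEO_EXTS = {".mp4", ".webm", ".mkv", ".mov"}
STREAM_EXTS = {".m3u8", ".mpd"}


def classify_url(url: str) -> str:
    """Return one of: image, video, stream, api, other."""
    # urlparse separates the query string, so "?w=200" does not hide the extension.
    path = urlparse(url).path.lower()
    ext = posixpath.splitext(path)[1]
    if ext in IMAGE_EXTS:
        return "image"
    if ext in VIDEO_EXTS:
        return "video"
    if ext in STREAM_EXTS:
        return "stream"
    # Assumed API hints: a JSON resource, an /api/ path segment, or WordPress AJAX.
    if ext == ".json" or "/api/" in path or path.endswith("admin-ajax.php"):
        return "api"
    return "other"
```

Checking streams before falling through to the API hints keeps a manifest such as `/api/hls/master.m3u8` in the stream category.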
webscraper/
├── run.py                        # Entry point
├── pytest.ini                    # pytest configuration
├── requirements.txt
├── requirements-dev.txt          # Test-only deps
├── .env.example                  # Environment variable template
├── app/
│   ├── main.py                   # FastAPI app setup + /static mount
│   ├── config.py                 # Settings from env vars
│   ├── database.py               # Async SQLAlchemy engine + idempotent migrations
│   ├── models.py                 # Scan, MediaURL, ScanStatus, MediaCategory
│   ├── routers/
│   │   ├── scan.py               # Routes: /, /scan, /scan/{id}, /history, /scan/{id}/cancel
│   │   └── export.py             # Route: /scan/{id}/export/json
│   ├── scraper/
│   │   ├── engine.py             # Main scan pipeline (background task)
│   │   ├── browser.py            # httpx HTTP client
│   │   ├── headless.py           # Playwright Chromium driver (opt-in)
│   │   ├── ad_filter.py          # Bundled ad-network denylist
│   │   ├── html_parser.py        # Extract media from HTML tags
│   │   ├── js_analyzer.py        # Extract media from <script> blocks
│   │   ├── url_classifier.py     # URL normalization and classification
│   │   ├── url_hash.py           # Dedup key normalization
│   │   ├── api_discoverer.py     # API endpoint discovery + probing
│   │   ├── network_catcher.py    # Captures URLs from HTTP responses
│   │   ├── discovery.py          # Sitemap / RSS / feed probe
│   │   ├── robots.py             # robots.txt cache + respect
│   │   └── safety.py             # SSRF guard (is_safe_url)
│   ├── templates/                # Jinja2 HTML templates
│   │   ├── base.html
│   │   ├── index.html
│   │   ├── history.html
│   │   └── components/
│   │       ├── scan_form.html    # Standalone form with js_render checkbox
│   │       └── results_body.html
│   └── static/                   # CSS assets (headless.css additive overlay)
└── scraper.db                    # SQLite database (created at runtime)
- Python 3.11+
- pip
- (Optional, for headless mode) Playwright + Chromium binary
# Clone the repo
git clone <repo-url>
cd webscraper
# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# (Optional, only if you want headless mode)
pip install playwright
playwright install chromium

Copy the example env file and adjust as needed:

cp .env.example .env

Available environment variables:

| Variable | Default | Description |
|---|---|---|
| `HOST` | `0.0.0.0` | Server bind address |
| `PORT` | `8000` | Server port |
| `DATABASE_URL` | `sqlite+aiosqlite:///./scraper.db` | Async database URL |
| `MAX_CONCURRENT_SCANS` | `2` | Max scans running in parallel |
| `MIN_DELAY` | `1.0` | Minimum delay between requests (seconds) |
| `MAX_DELAY` | `5.0` | Maximum delay between requests (seconds) |
| `HEADLESS_MAX_IFRAME_DEPTH` | `3` | How many iframe levels the headless driver will follow |
| `HEADLESS_BROWSER_TIMEOUT_MS` | `20000` | Per-page navigation timeout in headless mode (milliseconds) |
| `AD_BLOCKLIST` | (empty) | Comma-separated extra ad domains to drop (e.g. `tracker.foo.com,ads.bar.com`) |
| `LOG_LEVEL` | `INFO` | Application log level |
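
To make the table concrete, here is a minimal sketch of reading a few of these variables with their documented defaults. The `Settings` dataclass, the `load_settings` helper, and the `MIN_DELAY`/`MAX_DELAY` sanity check are assumptions for illustration; the actual `app/config.py` may be structured differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    host: str
    port: int
    max_concurrent_scans: int
    min_delay: float
    max_delay: float


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    source = os.environ if env is None else env
    settings = Settings(
        host=source.get("HOST", "0.0.0.0"),
        port=int(source.get("PORT", "8000")),
        max_concurrent_scans=int(source.get("MAX_CONCURRENT_SCANS", "2")),
        min_delay=float(source.get("MIN_DELAY", "1.0")),
        max_delay=float(source.get("MAX_DELAY", "5.0")),
    )
    # Assumed sanity check: a delay window only makes sense if min <= max.
    if settings.min_delay > settings.max_delay:
        raise ValueError("MIN_DELAY must not exceed MAX_DELAY")
    return settings
```

Passing a plain dict instead of relying on `os.environ` keeps the loader easy to exercise in tests.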
# Option 1: Using run.py
python run.py
# Option 2: Using uvicorn directly
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

The server starts at http://localhost:8000. The SQLite database (scraper.db) is created automatically on first launch, and any pending schema migrations (idempotent ALTER TABLE statements) are applied on startup.
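
One common way to make an `ALTER TABLE` migration idempotent in SQLite is to check the existing columns before adding a new one. The sketch below uses the synchronous standard-library `sqlite3` module for brevity, whereas the project runs on async SQLAlchemy; the `add_column_if_missing` helper and the `scans` table name are hypothetical (only the `blocked_count` column is named elsewhere in this README).

```python
import sqlite3


def add_column_if_missing(conn, table, column, ddl_type):
    """Run ALTER TABLE ... ADD COLUMN only when the column is absent.

    Identifiers are interpolated directly, so pass trusted constants only.
    Returns True if the column was added, False if it already existed.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl_type}")
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE scans (id INTEGER PRIMARY KEY, url TEXT)")
    # Safe to call on every startup: the second call is a no-op.
    add_column_if_missing(conn, "scans", "blocked_count", "INTEGER DEFAULT 0")
    add_column_if_missing(conn, "scans", "blocked_count", "INTEGER DEFAULT 0")
```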
- Open `http://localhost:8000` in your browser
- Enter a URL and click Scan
- (Optional) Open Advanced and tick Use headless browser to render the page in Chromium
- Watch real-time progress as the scraper runs through its pipeline
- View results filtered by category (image, video, stream, iframe, API, other)
- Export results as JSON from the scan results page
| Method | Path | Description |
|---|---|---|
| `GET` | `/` | Home page with scan form |
| `POST` | `/scan` | Start a new scan. Form fields: `url`, `max_pages`, `max_depth`, `js_render` (1/0) |
| `GET` | `/scan/{id}` | View scan results (query param: `category`) |
| `GET` | `/scan/{id}/status` | HTMX partial for live scan status |
| `POST` | `/scan/{id}/cancel` | Cooperatively cancel a running scan |
| `GET` | `/history` | List of recent scans |
| `GET` | `/scan/{id}/export/json` | Download results as JSON |
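
The `POST /scan` endpoint takes ordinary form fields, so it can also be driven from Python. The sketch below only builds the request object with the standard library and does not send it; the `build_scan_request` helper is hypothetical, and it assumes the server accepts `js_render=0` as the explicit "off" value listed in the table.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_scan_request(base_url, target_url, js_render=False):
    """Build (but do not send) a form-encoded POST to /scan."""
    form = {"url": target_url, "js_render": "1" if js_render else "0"}
    return Request(
        f"{base_url.rstrip('/')}/scan",
        data=urlencode(form).encode("ascii"),
        method="POST",
    )


if __name__ == "__main__":
    req = build_scan_request("http://localhost:8000", "https://example.com", js_render=True)
    print(req.get_method(), req.full_url, req.data)
```

Sending it is then a matter of passing the request to `urllib.request.urlopen` while the server is running.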
# Start a plain scan
curl -X POST http://localhost:8000/scan -d "url=https://example.com" -L

# Start a headless (JS-rendered) scan
curl -X POST http://localhost:8000/scan \
  -d "url=https://example.com" \
  -d "js_render=1" \
  -L

# Export the results of scan 1 as JSON
curl http://localhost:8000/scan/1/export/json -o results.json

The scan pipeline runs as a FastAPI background task. Plain (default) and headless scans share the same downstream phases; the only difference is whether Phase 1.5, the optional headless render, runs.
- Fetch: Downloads the target page over standard HTTPS with an honest `User-Agent`
- Headless render (optional): If `js_render=1`, launches Playwright Chromium, waits for `networkidle`, and captures every XHR/fetch response plus the post-render DOM. Iframes are followed up to `HEADLESS_MAX_IFRAME_DEPTH` levels by re-fetching the iframe `src` in a fresh top-level page (cross-origin-safe)
- HTML parsing: Extracts URLs from `<img>`, `<video>`, `<source>`, `<iframe>`, `<embed>`, `<a>`, `<meta>` (Open Graph), and CSS `background-image`
- JavaScript analysis: Parses inline `<script>` blocks for media URL patterns, JSON blobs, quoted URLs, base64-encoded URLs, and API endpoint patterns
- Sitemap / feed probe: Discovers sitemap.xml, RSS, and Atom feeds linked from the page
- Iframe recursion: Follows iframe/embed sources up to `max_depth` levels to find media in nested player pages
- API discovery: Probes common API patterns on discovered player domains, synthesizes endpoints from HTML data attributes, and detects WordPress AJAX configurations
- Network capture: Classifies URLs from all HTTP responses and walks JSON response bodies for embedded media URLs
- Ad filter: Drops URLs matching the bundled denylist (plus anything in `AD_BLOCKLIST`); the dropped count is recorded on the `Scan` row
- Merge & deduplicate: Combines all results, normalizes URLs, removes duplicates, and saves to the database
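
The base64 step of the JavaScript analysis phase can be sketched as follows: scan the script text for long base64-looking tokens, decode each one, and keep those that decode to an HTTP(S) URL. The regular expression, the 16-character minimum, and the `find_base64_urls` name are assumptions for this example, not the logic in `js_analyzer.py`.

```python
import base64
import binascii
import re

# Runs of 16+ base64 alphabet characters, with optional "=" padding.
B64_CANDIDATE = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def find_base64_urls(script_text):
    """Return URLs hidden in base64 tokens, in order of first appearance."""
    found = []
    for match in B64_CANDIDATE.finditer(script_text):
        token = match.group(0)
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            # Not valid base64, or not text: most long identifiers end up here.
            continue
        if decoded.startswith(("http://", "https://")) and decoded not in found:
            found.append(decoded)
    return found
```

Requiring the decoded text to start with `http://` or `https://` filters out the many ordinary identifiers that happen to be valid base64.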
Modern streaming sites (gogoanime, animepahe, megacloud embeds, JW Player hosts) build the page in the browser. The server response is just a shell — the "underlying API endpoint" you actually want, typically a JSON call returning {"sources": [{"file": "https://...m3u8"}]}, only fires after the page boots, the player iframe is followed, and a second XHR goes out from inside the iframe.
To get those URLs, tick Use headless browser in the scan form (or send js_render=1 via curl). The scraper will:
- Launch Playwright Chromium in headless mode
- Navigate to the page, wait for `networkidle`
- Capture every XHR/fetch response (URL, method, status, content-type, request/response headers, body preview)
- BFS through cross-origin iframes, re-fetching each iframe `src` in a fresh top-level page (this is the only way to read the XHR callsites from inside a cross-origin iframe)
- Feed every captured URL, and the post-render DOM at every iframe level, through the same `extract_from_html` / `extract_from_scripts` / `api_discoverer` pipeline used by plain scans

Result pages from headless scans show a small JS-rendered badge next to the scan title.
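
Once a JSON response such as the `{"sources": [{"file": ...}]}` payload described above has been captured, its body can be walked recursively for media URLs. This is a minimal sketch; the suffix list and the `walk_json_for_media` name are assumptions, and the project's own walker may apply different rules.

```python
import json

# Assumed set of suffixes that mark a string as a media URL.
MEDIA_SUFFIXES = (".m3u8", ".mpd", ".mp4", ".webm")


def walk_json_for_media(node):
    """Recursively collect string values that look like media URLs."""
    urls = []
    if isinstance(node, dict):
        for value in node.values():
            urls.extend(walk_json_for_media(value))
    elif isinstance(node, list):
        for item in node:
            urls.extend(walk_json_for_media(item))
    elif isinstance(node, str):
        # Ignore the query string when checking the suffix.
        path = node.split("?", 1)[0].lower()
        if node.startswith(("http://", "https://")) and path.endswith(MEDIA_SUFFIXES):
            urls.append(node)
    return urls


if __name__ == "__main__":
    body = '{"sources": [{"file": "https://cdn.example.com/hls/master.m3u8?token=abc"}]}'
    print(walk_json_for_media(json.loads(body)))
```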
One-time setup:
pip install playwright
playwright install chromium # downloads ~150MB browser binary

Without the binary, plain scans still work; headless mode gracefully falls back to an empty result set and logs a warning.
Safety: every page navigation is gated by the same is_safe_url SSRF guard the plain HTTP client uses, via a page.route() handler that aborts unsafe requests before they leave the browser.
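
For illustration, a check in the spirit of that guard might look like the sketch below. It is deliberately named `is_safe_url_sketch` because it is not the project's `is_safe_url`: it only inspects the scheme and literal IP hosts, and it assumes a fuller guard would also resolve hostnames and re-check every resolved address.

```python
import ipaddress
from urllib.parse import urlparse


def is_safe_url_sketch(url):
    """Reject non-HTTP schemes and hosts that are internal IP literals."""
    try:
        parsed = urlparse(url)
        host = parsed.hostname
    except ValueError:
        return False  # e.g. a malformed IPv6 literal
    if parsed.scheme not in ("http", "https"):
        return False
    if not host or host == "localhost":
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. DNS resolution is omitted here to stay offline;
        # a real guard would need it to catch names pointing at private IPs.
        return True
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```

In headless mode, a function like this would be called from the request-interception handler so unsafe requests are aborted before they are sent.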
Resource cost: headless scans are roughly 5-10x slower than plain scans. Run them on a desktop machine, not on a phone.
The bundled denylist in app/scraper/ad_filter.py covers ~80 ad/tracker/analytics domains (Google AdSense, DoubleClick, Taboola, Outbrain, MGID, Amazon A9, etc.). URLs matching the denylist are dropped before the result is saved, and the dropped count is recorded on the Scan.blocked_count column, surfaced on the result page as Blocked: N.
The denylist is domain-only — it matches tracker.foo.com against any URL whose host ends in tracker.foo.com (so eu.us.tracker.foo.com is also matched). Numeric IP URLs and malformed URLs are never matched (they pass through).
User override: set AD_BLOCKLIST to a comma-separated list of extra domains to drop. Entries are trimmed, lowercased, and stripped of leading dots. Malformed entries (no dot) are ignored.
# In .env
AD_BLOCKLIST=tracker.example.com,ads.foo.com

AD_ALLOWLIST is not currently implemented; the simpler mental model is "the bundled denylist + your extras."
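
The parsing and matching rules described above can be sketched as follows. The function names are hypothetical, and the suffix match is assumed to respect label boundaries (so `tracker.foo.com` matches `eu.tracker.foo.com` but not `nottracker.foo.com`); `ad_filter.py` may behave differently on that point.

```python
import ipaddress
from urllib.parse import urlparse


def parse_blocklist(raw):
    """Parse an AD_BLOCKLIST value: trim, lowercase, strip leading dots."""
    entries = set()
    for part in raw.split(","):
        domain = part.strip().lower().lstrip(".")
        if "." in domain:  # entries without a dot are treated as malformed
            entries.add(domain)
    return entries


def is_ad_url(url, denylist):
    """Return True when the URL's host is a denylisted domain or a subdomain of one."""
    try:
        host = urlparse(url).hostname
    except ValueError:
        return False  # malformed URLs pass through
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
        return False  # numeric IP hosts pass through
    except ValueError:
        pass
    return any(host == d or host.endswith("." + d) for d in denylist)
```

For example, `parse_blocklist(" Tracker.Example.com , .ads.foo.com, localhost")` keeps the first two entries and drops `localhost` because it has no dot.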
# Install test deps
pip install -r requirements-dev.txt
# Run the full suite
pytest
# Run only the ad-filter unit tests
pytest tests/test_ad_filter.py -v
# Run only the headless unit tests (pure-function ones; browser-gated tests skip automatically
# if PLAYWRIGHT_BROWSERS_PATH is unset)
pytest tests/test_headless.py -v
# Run the headless browser tests against an installed Chromium
PLAYWRIGHT_BROWSERS_PATH=/path/to/browsers pytest tests/test_headless.py -v

Test layout:
- `tests/test_ad_filter.py`: denylist matching, env override, combined filter
- `tests/test_headless.py`: pure-function helpers (URL classification, JSON peek) + browser-gated integration tests
- `tests/test_*`: existing endpoint / pipeline / format tests

- FastAPI — Async web framework
- SQLAlchemy 2.0 + aiosqlite — Async ORM with SQLite
- BeautifulSoup 4 — HTML parsing
- httpx — Async HTTP client
- Jinja2 — Server-side HTML templates
- HTMX — Partial page swaps for live status
- Playwright — Headless Chromium (optional, for `js_render=1`)
- yt-dlp — Media extraction utilities
- Rich — CLI output formatting
- pytest + pytest-asyncio + respx — Test stack