Context
Unstructured handles document parsing beautifully, but getting live web content into the pipeline is still a pain, especially for Cloudflare-protected sites.
Suggestion
anybrowse could work as a web source connector: it fetches URLs with real browser rendering and returns clean markdown that Unstructured can then partition or chunk.
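import requests
from unstructured.partition.text import partition_text

def fetch_url(url: str) -> str:
    # POST the target URL to anybrowse and return the rendered page as markdown
    r = requests.post("https://anybrowse.dev/scrape", json={"url": url}, timeout=30)
    r.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return r.json().get("markdown", "")

# Then pass to unstructured partition
elements = partition_text(text=fetch_url("https://example.com"))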
Works on Cloudflare-protected sites, JS-rendered pages, and standard HTML.
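For a connector that ingests many URLs, a single failed or empty fetch probably should not abort the whole run. Below is a minimal sketch of that idea using only the standard library. The endpoint and the "markdown" response field are taken from the snippet above and not otherwise verified; `fetch_markdown`, `fetch_many`, and the injectable `fetch` parameter are hypothetical names introduced here for illustration.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable, Tuple

# Endpoint as given in the snippet above; the response shape is an assumption.
ANYBROWSE_ENDPOINT = "https://anybrowse.dev/scrape"


def fetch_markdown(url: str, timeout: float = 30.0) -> str:
    """POST one URL to the scrape endpoint and return its "markdown" field."""
    body = json.dumps({"url": url}).encode("utf-8")
    req = urllib.request.Request(
        ANYBROWSE_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return payload.get("markdown", "")


def fetch_many(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_markdown,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Fetch each URL, collecting results and failures separately.

    Returns (docs, errors): docs maps URL -> markdown, errors maps
    URL -> a short reason string. One bad URL does not stop the batch.
    """
    docs: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for url in urls:
        try:
            text = fetch(url)
        except Exception as exc:  # network, HTTP, or JSON decoding failure
            errors[url] = str(exc)
            continue
        if text.strip():
            docs[url] = text
        else:
            # Missing or blank "markdown" field is treated as a failure
            errors[url] = "empty markdown"
    return docs, errors
```

Each value in `docs` could then be handed to `partition_text(text=...)` as in the snippet above, while `errors` gives the caller something to log or retry. Passing `fetch` as a parameter also makes the batch logic testable without network access.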