Connector suggestion: anybrowse for live web content ingestion with Cloudflare bypass #4288

@kc23go

Context

Unstructured handles document parsing beautifully, but getting live web content into the pipeline is still a pain, especially for Cloudflare-protected sites.

Suggestion

anybrowse could work as a web source connector: it fetches URLs with real browser rendering and returns clean markdown that Unstructured can then partition or chunk.

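import requests
from unstructured.partition.text import partition_text

def fetch_url(url: str) -> str:
    """Fetch a URL through anybrowse and return the page as markdown."""
    r = requests.post("https://anybrowse.dev/scrape", json={"url": url}, timeout=60)
    r.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return r.json().get("markdown", "")

# Then pass the markdown to unstructured partition
elements = partition_text(text=fetch_url("https://example.com"))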

Works on Cloudflare-protected sites, JS-rendered pages, and standard HTML.
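
For a connector, the single-URL call would probably need to run over a batch and keep each page's source URL next to its markdown. Here is a minimal sketch of that loop, not a proposal for the actual connector interface. `WebDocument` and `ingest_urls` are hypothetical names, and the fetcher is passed in as a plain callable so that `fetch_url` from the snippet above (or any other fetcher) can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class WebDocument:
    """Hypothetical container pairing fetched markdown with its source URL."""
    url: str
    markdown: str


def ingest_urls(
    urls: Iterable[str],
    fetch: Callable[[str], str],
) -> tuple[list[WebDocument], dict[str, str]]:
    """Fetch each URL and return (documents, errors keyed by URL)."""
    documents: list[WebDocument] = []
    errors: dict[str, str] = {}
    for url in urls:
        try:
            markdown = fetch(url)
        except Exception as exc:
            # A single failed page should not stop the rest of the batch.
            errors[url] = str(exc)
            continue
        if not markdown.strip():
            # Treat an empty body as a failure so blank documents are not partitioned.
            errors[url] = "empty markdown"
            continue
        documents.append(WebDocument(url=url, markdown=markdown))
    return documents, errors
```

Calling `ingest_urls(["https://example.com"], fetch_url)` would return the fetched pages plus a dict of per-URL failures. Each `WebDocument.markdown` could then be handed to `partition_text(text=...)`, with `WebDocument.url` carried along as source metadata.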
