feat: add Tavily Extract as parallel scraping option alongside Firecrawl #29

Open
tavily-integrations wants to merge 1 commit into federicodeponte:master from Tavily-FDE:feat/tavily-migration/add-tavily-extract-scraping
@tavily-integrations

Summary

Adds Tavily Extract as a configurable page content extraction option alongside the existing Firecrawl provider. When TAVILY_API_KEY is set, Tavily Extract is tried first for page scraping; if it fails, Firecrawl is tried next, and the simple requests-based scraper is the last resort.

Why: Tavily Extract provides an alternative content extraction API that can improve reliability and reduce dependency on a single scraping provider.

Changes

  • engine/utils/tavily_extract_client.py (new): TavilyExtractClient wrapping TavilyClient.extract(), returning the same {success, content, url, metadata} dict as FirecrawlClient.scrape_url()
  • engine/utils/fallback_services.py: Added scrape_page_with_tavily() function; updated scrape_page_with_firecrawl() to attempt Tavily Extract first when TAVILY_API_KEY is set
  • engine/requirements.txt: Added tavily-python>=0.5.0
  • .env.example: Documented TAVILY_API_KEY (also enables Extract)
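
For reviewers who want a concrete picture of the drop-in contract, here is a minimal sketch of a wrapper that returns the same {success, content, url, metadata} dict. It is not the code in this PR: the `client` injection parameter, the `_failure` helper, and the keys inside `metadata` are assumptions made for illustration. The only library call used is `TavilyClient.extract(urls=[...])` from tavily-python, whose response carries `results` and `failed_results` lists.

```python
from typing import Any, Dict, Optional


def _failure(url: str, reason: str) -> Dict[str, Any]:
    """Hypothetical helper: failure dict with the same four keys as a success."""
    return {"success": False, "content": "", "url": url, "metadata": {"error": reason}}


class TavilyExtractClient:
    """Illustrative sketch only; the class in the PR may differ."""

    def __init__(self, api_key: Optional[str] = None, client: Any = None):
        # `client` is an assumed injection point that makes the sketch testable
        # without network access or the tavily-python package.
        self._client = client
        if self._client is None and api_key:
            try:
                from tavily import TavilyClient  # provided by tavily-python

                self._client = TavilyClient(api_key=api_key)
            except ImportError:
                # Package missing: stay disabled so callers can fall back.
                self._client = None

    def scrape_url(self, url: str) -> Dict[str, Any]:
        if self._client is None:
            return _failure(url, "tavily client unavailable")
        try:
            response = self._client.extract(urls=[url])
        except Exception as exc:  # network errors, auth errors, rate limits
            return _failure(url, str(exc))

        results = response.get("results") or []
        if results and results[0].get("raw_content"):
            return {
                "success": True,
                "content": results[0]["raw_content"],
                "url": url,
                "metadata": {"provider": "tavily_extract"},
            }

        # No usable result: report the first failure reason if one was given.
        failed = response.get("failed_results") or []
        reason = failed[0].get("error", "unknown error") if failed else "no content extracted"
        return _failure(url, reason)
```

Returning a failure dict instead of raising keeps the caller's fallback logic to a single `if result["success"]` check.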

Dependency changes

  • Added tavily-python>=0.5.0 to engine/requirements.txt

Environment variable changes

  • TAVILY_API_KEY: When set, enables Tavily Extract as the primary page scraping method (also used for web search if that migration unit is applied)

Notes for reviewers

  • Firecrawl and FIRECRAWL_API_KEY remain fully functional and unchanged
  • Tavily Extract only activates when TAVILY_API_KEY is present in the environment
  • The fallback chain is: Tavily Extract → Firecrawl → simple requests scraper
  • TavilyExtractClient gracefully handles missing tavily-python package via ImportError
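
The fallback order described in these notes can be sketched as follows. This is an illustration under stated assumptions, not the code in engine/utils/fallback_services.py: `build_chain` and `scrape_with_fallback` are hypothetical names, the three scrapers are passed in as plain callables, and Firecrawl is assumed to always be in the chain (the PR does not say whether it is gated on FIRECRAWL_API_KEY).

```python
import os
from typing import Any, Callable, Dict, List, Mapping, Optional, Sequence, Tuple

Scraper = Callable[[str], Dict[str, Any]]


def build_chain(
    tavily: Scraper,
    firecrawl: Scraper,
    simple: Scraper,
    env: Optional[Mapping[str, str]] = None,
) -> List[Tuple[str, Scraper]]:
    """Hypothetical helper: order the scrapers, adding Tavily only when its key is set."""
    env = os.environ if env is None else env
    chain: List[Tuple[str, Scraper]] = []
    if env.get("TAVILY_API_KEY"):
        chain.append(("tavily_extract", tavily))
    chain.append(("firecrawl", firecrawl))  # assumption: always attempted
    chain.append(("simple_requests", simple))
    return chain


def scrape_with_fallback(url: str, chain: Sequence[Tuple[str, Scraper]]) -> Dict[str, Any]:
    """Try each scraper in order and return the first successful result."""
    for name, scraper in chain:
        try:
            result = scraper(url)
        except Exception:
            continue  # a crashing provider should not stop the chain
        if result.get("success"):
            metadata = dict(result.get("metadata") or {})
            metadata["provider"] = name  # record which provider answered
            return {**result, "metadata": metadata}
    return {
        "success": False,
        "content": "",
        "url": url,
        "metadata": {"error": "all scrapers failed"},
    }
```

Passing `env` explicitly keeps the ordering logic testable without touching the real process environment.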

Automated Review

  • Passed after 1 attempt
  • Final review: The implementation correctly adds Tavily Extract as a primary page-scraping option with Firecrawl as fallback. The new TavilyExtractClient is well-structured, returns a drop-in compatible response dict, handles all error cases (empty results, failed_results, exceptions), and the fallback chain in scrape_page_with_firecrawl is implemented cleanly. All four planned files are changed, the tavily-python dependency is declared with a minimum version (>=0.5.0), and TAVILY_API_KEY is documented in .env.example. Note: the migration plan stated TAVILY_API_KEY would already be in .env.example from the prerequisite unit, but neither that unit's PR description nor the git diff supports this; this unit adds it correctly regardless. Only minor issues found; no critical or major blockers.
