🔄 Data Extraction & Transformation

Parsing, ETL pipelines, format conversion, data wrangling, and transformation utilities.

⭐ Top Starred

Skill	Stars
Create, repair, and recalculate spreadsheet workbooks without breaking formulas	⭐ 116.9k
MarkItDown Document-to-Markdown Converter by Microsoft	⭐ 93.2k
PostgreSQL MCP Server	⭐ 85.3k
SQLite MCP Server	⭐ 85.3k
Build document-grounded agent context workflows with RAGFlow	⭐ 79.8k
Use RAGFlow as a retrieval and context layer for agent workflows	⭐ 79.8k
Elasticsearch MCP	⭐ 76.5k
PaddleOCR Multilingual Document OCR and Structured Data Toolkit	⭐ 73.7k
Tesseract OCR Data Extractor	⭐ 73.6k
Tesseract OCR Document Extractor	⭐ 73.6k

📦 Top Downloaded

Skill	Downloads
Metabase Open Source Business Intelligence and Embedded Analytics	⬇ 15/wk
Cheerio DOM Extraction Pipeline	⬇ 19.6M/wk
Cheerio HTML and XML Parsing Library for Node.js Extraction Workflows	⬇ 19.6M/wk
fx Terminal JSON Viewer and Processor	⬇ 206k/wk
GraphQL Data Federation Agent	⬇ 34.2M/wk
GraphQL Schema Introspection Mapper	⬇ 34.2M/wk
CSV Schema Validator & Auto-Fixer	⬇ 291.1M/wk
Stripe Revenue Analytics Dashboard Builder	⬇ 9.3M/wk
Apache Kafka Schema Registry Extractor	⬇ 2.5M/wk
Apache Kafka Schema Registry Validator	⬇ 2.5M/wk

Full Skill List

Skill	Stars	Downloads
Create, repair, and recalculate spreadsheet workbooks without breaking formulas	116.9k	—
MarkItDown Document-to-Markdown Converter by Microsoft	93.2k	—
PostgreSQL MCP Server	85.3k	—
SQLite MCP Server	85.3k	—
Build document-grounded agent context workflows with RAGFlow	79.8k	—
Use RAGFlow as a retrieval and context layer for agent workflows	79.8k	—
Elasticsearch MCP	76.5k	—
PaddleOCR Multilingual Document OCR and Structured Data Toolkit	73.7k	—
Tesseract OCR Data Extractor	73.6k	—
Tesseract OCR Document Extractor	73.6k	—
Apache Superset Dashboard and SQL Exploration Skill	72.3k	—
Protocol Buffer Schema Generator	71.2k	—
Scrapy Spider Data Pipeline	61.3k	—
MinerU PDF-to-Markdown Document Parser	57.8k	—
Docling Document Parsing and Conversion	57.8k	—
Docling Document Conversion and Extraction Toolkit	57.6k	—
Docling Document Parsing and Conversion Toolkit	57.6k	—
Docling AI Document Intelligence Pipeline	56.9k	—
Pandas DataFrame Pipeline Builder	48.5k	—
Pandas DataFrame Pipeline Orchestrator	48.5k	—
Pandas DataFrame Schema Enforcer	48.5k	—
Pandas DataFrame Schema Validator	48.5k	—
Pandas Profiling Report Generator	48.5k	—
ClickHouse Query Agent	46.9k	—
Metabase Open Source Business Intelligence and Embedded Analytics	46.8k	15/wk
Apache Airflow MCP	45k	—
Apache Spark Job Manager	43.1k	—
Apache Spark DataFrame ETL Pipeline	43.1k	—
Paperless-ngx Document OCR and Archive Management System	38.1k	—
Polars Blazing-Fast DataFrame Query Engine	37.9k	—
DuckDB SQL Analytics Agent	37.1k	—
LangExtract LLM-Powered Structured Text Extraction	35k	—
jq JSON Stream Transformer	34.5k	—
jq Pipeline Builder Agent	34.5k	—
Marker PDF-to-Markdown Converter	33.2k	—
LightRAG Graph-Based Retrieval-Augmented Generation Framework	33.2k	—
Apache Kafka Schema Extractor	32.5k	—
Apache Kafka Stream Transformer	32.4k	—
Apache Kafka Stream Processor	32.4k	—
Cheerio DOM Extraction Pipeline	30.3k	19.6M/wk
Cheerio HTML and XML Parsing Library for Node.js Extraction Workflows	30.3k	19.6M/wk
Turn mixed local folders into a queryable knowledge graph with Graphify	25.7k	—
Typesense Typo-Tolerant Search Engine	25.5k	—
Airbyte Connector Config Generator	21.1k	—
Teable No-Code Postgres Database Platform and Airtable Alternative	21.1k	—
fx Terminal JSON Viewer and Processor	20.4k	206k/wk
GraphQL Data Federation Agent	20.3k	34.2M/wk
GraphQL Schema Introspection Mapper	20.3k	34.2M/wk
Surya Document OCR with Layout Analysis and Table Recognition	19.5k	—
Extract structured markdown, JSON, and tagged-PDF-ready outputs from PDFs with OpenDataLoader PDF	19.1k	—
gallery-dl Image Gallery and Collection Downloader	17.5k	—
Convert dense PDFs into LLM-ready text and page-aligned markdown with olmOCR	17.1k	—
Maxun No-Code Web Data Extraction Platform	15.3k	—
Dagster Data Pipeline Orchestrator	15.3k	—
yq YAML and Structured Data Processor	15.1k	—
CSV Schema Validator & Auto-Fixer	14.7k	291.1M/wk
Unstructured Document ETL Toolkit	14.5k	—
Unstructured Document ETL for LLM Pipelines	14.4k	—
gron Greppable JSON Flattener	14.4k	—
Unstructured Document Partitioning and ETL Library for LLM Pipelines	14.4k	—
Gitingest Repository-to-Prompt Codebase Extraction Tool	14.3k	—
Generate LLM fine-tuning, RAG, and eval datasets from source material with easy-dataset	14k	—
Instructor Structured Data Extraction from LLMs	12.7k	—
dbt MCP Server	12.6k	—
dbt Cloud MCP	12.6k	—
dbt Data Transform Orchestrator	12.6k	—
dbt Data Transformation Orchestrator	12.6k	—
dbt Model Dependency Analyzer	12.6k	—
dbt Model Dependency Resolver	12.6k	—
dbt Model Lineage & Test Coverage Checker	12.6k	—
dbt Model Lineage Analyzer	12.6k	—
dbt Model Lineage Extractor	12.6k	—
dbt Model Lineage Mapper	12.6k	—
dbt Model Transformation Architect	12.6k	—
Datasette Data Exploration and Publishing Tool	10.9k	—
Grist Self-Hosted Relational Spreadsheet and Database Platform	10.8k	—
xsv High-Performance CSV Toolkit	10.8k	—
Jina Reader URL-to-Markdown Converter and Web Search API	10.6k	—
Orama Embeddable Search Engine and RAG Pipeline for JavaScript	10.3k	—
pdfplumber Python PDF Text and Table Extraction Library	10.1k	—
Miller CSV TSV JSON Data Processor	9.8k	—
Gorse AI-Powered Open Source Recommender System Engine	9.6k	—
Translate and validate SQL across dialects with SQLGlot	9.1k	—
Profile and triage messy tabular files from the terminal with VisiData	9k	—
WeasyPrint HTML and CSS to PDF Document Generator	8.8k	—
Redpanda Connect Declarative Stream Processor	8.6k	—
Normalize raw CLI output into JSON for reliable downstream parsing and automation	8.6k	—
Dasel Multi-Format Data Selector and Modifier	7.9k	—
Steampipe Zero-ETL SQL Cloud API Query Engine	7.7k	—
Extract structured text, metadata, tables, and images from mixed documents through an MCP server with Kreuzberg	7.6k	—
htmlq Command-Line HTML Content Extractor with CSS Selectors	7.5k	—
Migrate MySQL, SQLite, or CSV data into PostgreSQL with repeatable load files before cutover with pgloader	6.4k	—
Sync cloud and SaaS inventory into SQL tables for audits with CloudQuery	6.4k	—
csvkit Python CSV Utility Suite	6.4k	—
Apache Camel Route Data Mapper	6.2k	—
Convert DOCX documents into clean HTML for publishing workflows with Mammoth	6.2k	—
Evidence BI-as-Code SQL and Markdown Analytics Framework	6.1k	—
jnv Interactive JSON Navigator and jq Filter Editor	6k	—
dlt Python Data Load Tool	5.2k	—
ExifTool Metadata Reader and Writer for Images and Files	4.6k	—
franc Natural Language Detection Library and CLI	4.4k	—
Stripe Revenue Analytics Dashboard Builder	4.4k	9.3M/wk
Apache Kafka Schema Registry Extractor	4k	2.5M/wk
Apache Kafka Schema Registry Validator	4k	2.5M/wk
xan SIMD-Powered CSV Processing and Analysis CLI	3.9k	—
Newsboat Terminal RSS and Atom Feed Reader	3.8k	—
Inspect large CSV files interactively before cleanup, mapping, or downstream transforms with csvlens	3.7k	56.9k/wk
Turn messy document collections into structured rows with DocETL	3.7k	—
Apache Tika Content Extraction Hub	3.7k	—
Apache Tika Document Parser	3.7k	—
Apache Tika Document Parser Agent	3.7k	—
Apache Tika Document Extractor	3.7k	—
Camelot Advanced PDF Table Intelligence	3.7k	—
Camelot PDF Stream Parser	3.7k	—
PDF Table Extraction with Camelot	3.7k	—
Profile and clean large CSV datasets from the terminal with qsv	3.6k	—
qsv Blazing-Fast CSV Data Wrangling Toolkit	3.6k	—
Ingestr Cross-Database Data Copier	3.4k	—
Apache Avro Schema Evolution Agent	3.3k	—
JSON-to-Avro Schema Transformer	3.3k	—
Plan and preview warehouse SQL model changes before rollout with SQLMesh	3k	—
Postgres MCP Pro	2.7k	—
Diff nested JSON, API responses, and config snapshots before approving changes	2.5k	—
Meltano Declarative ELT Data Integration Engine	2.4k	—
Enrich Paperless-ngx documents with AI-generated titles, tags, and correspondents using paperless-gpt	2.3k	—
rehype Plugin-Based HTML Processor by the Unified Collective	2.2k	—
trdsql SQL Query Engine for CSV JSON and YAML Files	2.2k	—
Extract invoice fields from vendor PDFs into structured records	2.1k	—
markdownify Python HTML to Markdown Conversion Library	2.1k	—
sqlite-utils Python CLI for SQLite Database Manipulation	2k	—
Tabula PDF Table Extraction Agent	2k	—
Tabula PDF Table Extractor	2k	—
Query and rewrite Markdown structure with mdq	1.7k	—
Anyquery Universal SQL Engine with MCP Integration	1.7k	—
Repair, split, merge, and normalize PDFs with qpdf before downstream processing	1.5k	—
Documind AI-Powered Structured Data Extraction from Documents	1.5k	14/wk
Salesforce Bulk API Data Loader	1.5k	936.6k/wk
Infer and normalize broken CSV dialects before import with CleverCSV	1.3k	—
Export Obsidian vaults into clean Markdown trees for publishing or downstream processing	1.3k	—
xq Command-Line XML and HTML Beautifier and Content Extractor	1.1k	—
Extract structured fields from HTML, XML, and JSON endpoints with Xidel selectors	835	—
Give agents governed semantic data context with Wren Engine	661	—
dbt MCP Server for Data Pipeline Context	526	—
Compare dbt models and warehouse relations before trusting migration parity with dbt-audit-helper	402	—
Parquet Column Mapper	387	170.7k/wk
Parquet Column Pruning Optimizer	387	170.7k/wk
Parquet Column Statistics Profiler	387	170.7k/wk
Parquet Schema Extractor for S3	387	170.7k/wk
Operate Airflow and warehouse workflows through agent-safe data engineering skills with Astronomer Agents	337	—
Compare recurring CSV, TSV, or JSON exports and emit row-level change sets before syncs	330	—
Weaviate MCP Server	161	—
Turn documents into validated knowledge graphs with Docling Graph	134	—
Crawl4AI MCP Server	84	—
Turn captured WARC pages into clean text and language-tagged records with warc2text	23	—
Search large PDFs and read only the relevant pages before answering	17	—
Process, redact, OCR, and sign documents with Nutrient Agent Skill	5	—
Convert HTML emails and web fragments into clean plain text for downstream agents	—	8.2M/wk
Metabase Dashboard Snapshot & Alerting	—	—
Parquet to PostgreSQL Loader	—	—
QuickBooks Online Invoice Reconciliation Agent	—	—
Reddit Subreddit Sentiment Tracker	—	—
Snowflake MCP	—	—
Snowflake MCP Server	—	—
Snowflake Query History Extractor	—	—
Snowflake Query Optimizer Agent	—	—
Snowflake Query Profiler	—	—
Weights & Biases Run Monitor	—	—
XML XSLT Transform Pipeline	—	—

