🔄 Data Extraction & Transformation

Parsing, ETL pipelines, format conversion, data wrangling, and transformation utilities.

⭐ Top Starred

| Skill | Stars |
| --- | --- |
| Create, repair, and recalculate spreadsheet workbooks without breaking formulas | ⭐ 116.9k |
| MarkItDown Document-to-Markdown Converter by Microsoft | ⭐ 93.2k |
| PostgreSQL MCP Server | ⭐ 85.3k |
| SQLite MCP Server | ⭐ 85.3k |
| Build document-grounded agent context workflows with RAGFlow | ⭐ 79.8k |
| Use RAGFlow as a retrieval and context layer for agent workflows | ⭐ 79.8k |
| Elasticsearch MCP | ⭐ 76.5k |
| PaddleOCR Multilingual Document OCR and Structured Data Toolkit | ⭐ 73.7k |
| Tesseract OCR Data Extractor | ⭐ 73.6k |
| Tesseract OCR Document Extractor | ⭐ 73.6k |

📦 Top Downloaded

| Skill | Downloads |
| --- | --- |
| Metabase Open Source Business Intelligence and Embedded Analytics | ⬇ 15/wk |
| Cheerio DOM Extraction Pipeline | ⬇ 19.6M/wk |
| Cheerio HTML and XML Parsing Library for Node.js Extraction Workflows | ⬇ 19.6M/wk |
| fx Terminal JSON Viewer and Processor | ⬇ 206k/wk |
| GraphQL Data Federation Agent | ⬇ 34.2M/wk |
| GraphQL Schema Introspection Mapper | ⬇ 34.2M/wk |
| CSV Schema Validator & Auto-Fixer | ⬇ 291.1M/wk |
| Stripe Revenue Analytics Dashboard Builder | ⬇ 9.3M/wk |
| Apache Kafka Schema Registry Extractor | ⬇ 2.5M/wk |
| Apache Kafka Schema Registry Validator | ⬇ 2.5M/wk |

Full Skill List

| Skill | Stars | Downloads |
| --- | --- | --- |
| Create, repair, and recalculate spreadsheet workbooks without breaking formulas | 116.9k | |
| MarkItDown Document-to-Markdown Converter by Microsoft | 93.2k | |
| PostgreSQL MCP Server | 85.3k | |
| SQLite MCP Server | 85.3k | |
| Build document-grounded agent context workflows with RAGFlow | 79.8k | |
| Use RAGFlow as a retrieval and context layer for agent workflows | 79.8k | |
| Elasticsearch MCP | 76.5k | |
| PaddleOCR Multilingual Document OCR and Structured Data Toolkit | 73.7k | |
| Tesseract OCR Data Extractor | 73.6k | |
| Tesseract OCR Document Extractor | 73.6k | |
| Apache Superset Dashboard and SQL Exploration Skill | 72.3k | |
| Protocol Buffer Schema Generator | 71.2k | |
| Scrapy Spider Data Pipeline | 61.3k | |
| MinerU PDF-to-Markdown Document Parser | 57.8k | |
| Docling Document Parsing and Conversion | 57.8k | |
| Docling Document Conversion and Extraction Toolkit | 57.6k | |
| Docling Document Parsing and Conversion Toolkit | 57.6k | |
| Docling AI Document Intelligence Pipeline | 56.9k | |
| Pandas DataFrame Pipeline Builder | 48.5k | |
| Pandas DataFrame Pipeline Orchestrator | 48.5k | |
| Pandas DataFrame Schema Enforcer | 48.5k | |
| Pandas DataFrame Schema Validator | 48.5k | |
| Pandas Profiling Report Generator | 48.5k | |
| ClickHouse Query Agent | 46.9k | |
| Metabase Open Source Business Intelligence and Embedded Analytics | 46.8k | 15/wk |
| Apache Airflow MCP | 45k | |
| Apache Spark Job Manager | 43.1k | |
| Apache Spark DataFrame ETL Pipeline | 43.1k | |
| Paperless-ngx Document OCR and Archive Management System | 38.1k | |
| Polars Blazing-Fast DataFrame Query Engine | 37.9k | |
| DuckDB SQL Analytics Agent | 37.1k | |
| LangExtract LLM-Powered Structured Text Extraction | 35k | |
| jq JSON Stream Transformer | 34.5k | |
| jq Pipeline Builder Agent | 34.5k | |
| Marker PDF-to-Markdown Converter | 33.2k | |
| LightRAG Graph-Based Retrieval-Augmented Generation Framework | 33.2k | |
| Apache Kafka Schema Extractor | 32.5k | |
| Apache Kafka Stream Transformer | 32.4k | |
| Apache Kafka Stream Processor | 32.4k | |
| Cheerio DOM Extraction Pipeline | 30.3k | 19.6M/wk |
| Cheerio HTML and XML Parsing Library for Node.js Extraction Workflows | 30.3k | 19.6M/wk |
| Turn mixed local folders into a queryable knowledge graph with Graphify | 25.7k | |
| Typesense Typo-Tolerant Search Engine | 25.5k | |
| Airbyte Connector Config Generator | 21.1k | |
| Teable No-Code Postgres Database Platform and Airtable Alternative | 21.1k | |
| fx Terminal JSON Viewer and Processor | 20.4k | 206k/wk |
| GraphQL Data Federation Agent | 20.3k | 34.2M/wk |
| GraphQL Schema Introspection Mapper | 20.3k | 34.2M/wk |
| Surya Document OCR with Layout Analysis and Table Recognition | 19.5k | |
| Extract structured markdown, JSON, and tagged-PDF-ready outputs from PDFs with OpenDataLoader PDF | 19.1k | |
| gallery-dl Image Gallery and Collection Downloader | 17.5k | |
| Convert dense PDFs into LLM-ready text and page-aligned markdown with olmOCR | 17.1k | |
| Maxun No-Code Web Data Extraction Platform | 15.3k | |
| Dagster Data Pipeline Orchestrator | 15.3k | |
| yq YAML and Structured Data Processor | 15.1k | |
| CSV Schema Validator & Auto-Fixer | 14.7k | 291.1M/wk |
| Unstructured Document ETL Toolkit | 14.5k | |
| Unstructured Document ETL for LLM Pipelines | 14.4k | |
| gron Greppable JSON Flattener | 14.4k | |
| Unstructured Document Partitioning and ETL Library for LLM Pipelines | 14.4k | |
| Gitingest Repository-to-Prompt Codebase Extraction Tool | 14.3k | |
| Generate LLM fine-tuning, RAG, and eval datasets from source material with easy-dataset | 14k | |
| Instructor Structured Data Extraction from LLMs | 12.7k | |
| dbt MCP Server | 12.6k | |
| dbt Cloud MCP | 12.6k | |
| dbt Data Transform Orchestrator | 12.6k | |
| dbt Data Transformation Orchestrator | 12.6k | |
| dbt Model Dependency Analyzer | 12.6k | |
| dbt Model Dependency Resolver | 12.6k | |
| dbt Model Lineage & Test Coverage Checker | 12.6k | |
| dbt Model Lineage Analyzer | 12.6k | |
| dbt Model Lineage Extractor | 12.6k | |
| dbt Model Lineage Mapper | 12.6k | |
| dbt Model Transformation Architect | 12.6k | |
| Datasette Data Exploration and Publishing Tool | 10.9k | |
| Grist Self-Hosted Relational Spreadsheet and Database Platform | 10.8k | |
| xsv High-Performance CSV Toolkit | 10.8k | |
| Jina Reader URL-to-Markdown Converter and Web Search API | 10.6k | |
| Orama Embeddable Search Engine and RAG Pipeline for JavaScript | 10.3k | |
| pdfplumber Python PDF Text and Table Extraction Library | 10.1k | |
| Miller CSV TSV JSON Data Processor | 9.8k | |
| Gorse AI-Powered Open Source Recommender System Engine | 9.6k | |
| Translate and validate SQL across dialects with SQLGlot | 9.1k | |
| Profile and triage messy tabular files from the terminal with VisiData | 9k | |
| WeasyPrint HTML and CSS to PDF Document Generator | 8.8k | |
| Redpanda Connect Declarative Stream Processor | 8.6k | |
| Normalize raw CLI output into JSON for reliable downstream parsing and automation | 8.6k | |
| Dasel Multi-Format Data Selector and Modifier | 7.9k | |
| Steampipe Zero-ETL SQL Cloud API Query Engine | 7.7k | |
| Extract structured text, metadata, tables, and images from mixed documents through an MCP server with Kreuzberg | 7.6k | |
| htmlq Command-Line HTML Content Extractor with CSS Selectors | 7.5k | |
| Migrate MySQL, SQLite, or CSV data into PostgreSQL with repeatable load files before cutover with pgloader | 6.4k | |
| Sync cloud and SaaS inventory into SQL tables for audits with CloudQuery | 6.4k | |
| csvkit Python CSV Utility Suite | 6.4k | |
| Apache Camel Route Data Mapper | 6.2k | |
| Convert DOCX documents into clean HTML for publishing workflows with Mammoth | 6.2k | |
| Evidence BI-as-Code SQL and Markdown Analytics Framework | 6.1k | |
| jnv Interactive JSON Navigator and jq Filter Editor | 6k | |
| dlt Python Data Load Tool | 5.2k | |
| ExifTool Metadata Reader and Writer for Images and Files | 4.6k | |
| franc Natural Language Detection Library and CLI | 4.4k | |
| Stripe Revenue Analytics Dashboard Builder | 4.4k | 9.3M/wk |
| Apache Kafka Schema Registry Extractor | 4k | 2.5M/wk |
| Apache Kafka Schema Registry Validator | 4k | 2.5M/wk |
| xan SIMD-Powered CSV Processing and Analysis CLI | 3.9k | |
| Newsboat Terminal RSS and Atom Feed Reader | 3.8k | |
| Inspect large CSV files interactively before cleanup, mapping, or downstream transforms with csvlens | 3.7k | 56.9k/wk |
| Turn messy document collections into structured rows with DocETL | 3.7k | |
| Apache Tika Content Extraction Hub | 3.7k | |
| Apache Tika Document Parser | 3.7k | |
| Apache Tika Document Parser Agent | 3.7k | |
| Apache Tika Document Extractor | 3.7k | |
| Camelot Advanced PDF Table Intelligence | 3.7k | |
| Camelot PDF Stream Parser | 3.7k | |
| PDF Table Extraction with Camelot | 3.7k | |
| Profile and clean large CSV datasets from the terminal with qsv | 3.6k | |
| qsv Blazing-Fast CSV Data Wrangling Toolkit | 3.6k | |
| Ingestr Cross-Database Data Copier | 3.4k | |
| Apache Avro Schema Evolution Agent | 3.3k | |
| JSON-to-Avro Schema Transformer | 3.3k | |
| Plan and preview warehouse SQL model changes before rollout with SQLMesh | 3k | |
| Postgres MCP Pro | 2.7k | |
| Diff nested JSON, API responses, and config snapshots before approving changes | 2.5k | |
| Meltano Declarative ELT Data Integration Engine | 2.4k | |
| Enrich Paperless-ngx documents with AI-generated titles tags and correspondents using paperless-gpt | 2.3k | |
| rehype Plugin-Based HTML Processor by the Unified Collective | 2.2k | |
| trdsql SQL Query Engine for CSV JSON and YAML Files | 2.2k | |
| Extract invoice fields from vendor PDFs into structured records | 2.1k | |
| markdownify Python HTML to Markdown Conversion Library | 2.1k | |
| sqlite-utils Python CLI for SQLite Database Manipulation | 2k | |
| Tabula PDF Table Extraction Agent | 2k | |
| Tabula PDF Table Extractor | 2k | |
| Query and rewrite Markdown structure with mdq | 1.7k | |
| Anyquery Universal SQL Engine with MCP Integration | 1.7k | |
| Repair, split, merge, and normalize PDFs with qpdf before downstream processing | 1.5k | |
| Documind AI-Powered Structured Data Extraction from Documents | 1.5k | 14/wk |
| Salesforce Bulk API Data Loader | 1.5k | 936.6k/wk |
| Infer And Normalize Broken CSV Dialects Before Import With Clevercsv | 1.3k | |
| Export Obsidian vaults into clean Markdown trees for publishing or downstream processing | 1.3k | |
| xq Command-Line XML and HTML Beautifier and Content Extractor | 1.1k | |
| Extract structured fields from HTML XML and JSON endpoints with Xidel selectors | 835 | |
| Give agents governed semantic data context with Wren Engine | 661 | |
| dbt MCP Server for Data Pipeline Context | 526 | |
| Compare dbt models and warehouse relations before trusting migration parity with dbt-audit-helper | 402 | |
| Parquet Column Mapper | 387 | 170.7k/wk |
| Parquet Column Pruning Optimizer | 387 | 170.7k/wk |
| Parquet Column Statistics Profiler | 387 | 170.7k/wk |
| Parquet Schema Extractor for S3 | 387 | 170.7k/wk |
| Operate Airflow and warehouse workflows through agent-safe data engineering skills with Astronomer Agents | 337 | |
| Compare recurring CSV, TSV, or JSON exports and emit row-level change sets before syncs | 330 | |
| Weaviate MCP Server | 161 | |
| Turn documents into validated knowledge graphs with Docling Graph | 134 | |
| Crawl4AI MCP Server | 84 | |
| Turn captured WARC pages into clean text and language-tagged records with warc2text | 23 | |
| Search large PDFs and read only the relevant pages before answering | 17 | |
| Process, redact, OCR, and sign documents with Nutrient Agent Skill | 5 | |
| Convert HTML emails and web fragments into clean plain text for downstream agents | | 8.2M/wk |
| Metabase Dashboard Snapshot & Alerting | | |
| Parquet to PostgreSQL Loader | | |
| QuickBooks Online Invoice Reconciliation Agent | | |
| Reddit Subreddit Sentiment Tracker | | |
| Snowflake MCP | | |
| Snowflake MCP Server | | |
| Snowflake Query History Extractor | | |
| Snowflake Query Optimizer Agent | | |
| Snowflake Query Profiler | | |
| Weights & Biases Run Monitor | | |
| XML XSLT Transform Pipeline | | |