A curated list of awesome Arabic Artificial Intelligence resources
LLMs · Datasets · Benchmarks · Speech · OCR · Tools · Research
قائمة منسقة لأفضل موارد الذكاء الاصطناعي العربي
نماذج لغوية · بيانات · تقييمات · صوت · رؤية · أدوات · أبحاث
The Arabic language is spoken by over 400 million people worldwide, yet it remains significantly underrepresented in the AI ecosystem. This repository aims to be a single reference point for Arabic AI resources, bringing together state-of-the-art models, datasets, benchmarks, tools, and research from across the Middle East and North Africa (MENA) region and beyond.
اللغة العربية يتحدث بها أكثر من 400 مليون شخص حول العالم، لكنها لا تزال ممثَّلة بشكل ضعيف في منظومة الذكاء الاصطناعي. يهدف هذا المستودع إلى أن يكون المرجع الموحَّد لموارد الذكاء الاصطناعي العربي، جامعاً أحدث النماذج والبيانات وأدوات التقييم والأبحاث من المنطقة العربية والعالم.
💡 Found this useful? Give it a ⭐ to help others discover it!
💡 هل أفادك المحتوى؟ ضع نجمة ⭐ لمساعدة الآخرين على اكتشافه!
- 🤖 Large Language Models (LLMs)
- 📊 Datasets
- 🏆 Benchmarks & Leaderboards
- 🔊 Speech & Audio
- 👁️ Vision & OCR
- 🛠️ Libraries & Tools
- 📚 Research Papers & Surveys
- 🎓 Courses & Tutorials
- 🏢 Companies & Startups
- 🌐 Communities & Conferences
- 📰 Blogs & Newsletters
- 🤝 Contributing
- 📜 License
## 🤖 Large Language Models (LLMs)

State-of-the-art Arabic and multilingual language models with strong Arabic capabilities.
- Jais — Family of bilingual Arabic-English LLMs (590M to 70B parameters) developed by Inception (G42), Cerebras, and MBZUAI. Apache 2.0 license. One of the most influential Arabic LLMs.
- Jais 2 — Next-generation Arabic open-weight LLM (8B and 70B) released December 2025 by Inception, Cerebras, and MBZUAI. Described by its developers as trained on the largest Arabic-first dataset assembled to date.
- Fanar-1 — 9B parameter Arabic-focused LLM by QCRI (Qatar). Open weights under Apache 2.0. Strong performance on cultural and dialectal tasks.
- Fanar 2.0 — 27B parameter upgraded Arabic-centric multimodal model by QCRI (2025), supporting language, speech, and image generation.
- ALLaM-7B — 7B model by SDAIA (Saudi Data and AI Authority). Apache 2.0. Optimized for Modern Standard Arabic and Saudi dialects.
- ALLaM-34B — 34B parameter Arabic-centric LLM by HUMAIN. Top-ranked on Stanford HELM for Arabic accuracy.
- AceGPT — 7B/13B Arabic LLM by KAUST with RLAIF-based instruction tuning. Strong cultural alignment.
- AceGPT-v2 — Improved Arabic LLMs (8B, 32B, 70B) with alignment at pre-training. Available on Hugging Face.
- Falcon-Arabic — TII's first Arabic model in the Falcon series (May 2025). Excels at Arabic grammar, reasoning, and code-switching. Falcon LLM Site
- Falcon-H1 — TII's hybrid-architecture model family (attention combined with state-space layers) with strong Arabic capabilities.
- SILMA-9B-Instruct-v1.0 — Top-ranked open-weights Arabic LLM (until Feb 2025) based on Gemma. Apache 2.0 license.
- Mistral Saba — 24B parameter model by Mistral AI (Feb 2025) specifically designed for Arabic and South Asian languages.
- Command R7B Arabic — 7B open-weight Arabic-optimized model by Cohere Labs (Feb 2025). Technical Report
- Aya & Aya Expanse — Cohere's multilingual model families with strong Arabic support: Aya 101 covers 101 languages, while Aya Expanse focuses on 23.
- Yehia-7B — Fine-tuned Arabic LLM by Navid AI, based on ALLaM. Top-performing on multiple Arabic benchmarks.
- Pronoia LLM — Tarjama's enterprise-focused Arabic LLM (7B and 14B variants). Ranked #1 on Open Arabic LLM Leaderboard.
- Mulhem — First Saudi domain-specific LLM by Watad, trained exclusively on Saudi datasets.
- Noor — Large-scale Arabic NLP model by TII (10B parameters, 2022).
- Arabic-Sahm — Arabic language model by NAMAA-Space.
- Atlas-Chat — First family of LLMs specifically developed for Moroccan Darija. Available in 2B, 9B, and 27B by MBZUAI-Paris. Paper
- Nile-Chat — Family of LLMs for Egyptian dialect handling both Arabic and Latin scripts. Available in 4B, 12B, and 3x4B-A6B (MoE). Paper
- AL Atlas — Model pretrained on Moroccan Darija by AtlasIA.
- Lahjawi — First cross-dialect translation model, covering 15 distinct Arabic dialects.
- Cohere Command-R / Command-A — Strong multilingual models with excellent Arabic support and improved Arabic dialect matching.
- GPT-4 / GPT-4o — Leading commercial models with strong Arabic capabilities.
- Claude (Anthropic) — Excellent Arabic comprehension and generation.
- Gemini — Google's multilingual model with native Arabic support.
- Fanar (Closed) — Qatar's flagship Arabic GenAI platform.
- AraBERT — BERT-based Arabic language understanding model by AUB MIND Lab. Foundational work, still widely used.
- AraGPT2 — GPT-2 variants pre-trained on Arabic by AUB MIND Lab.
- AraT5 — T5-style text-to-text Transformer for Arabic language generation. Available on Hugging Face. Paper
- AraELECTRA — ELECTRA-based Arabic language model by AUB MIND Lab.
- CAMeLBERT — Suite of BERT models for Arabic variants (Modern Standard Arabic, dialectal Arabic, and Classical Arabic) by NYU Abu Dhabi's CAMeL Lab.
- MARBERT / ARBERT — Arabic-focused BERT variants by UBC-NLP.
- Arabic-Orca — Arabic instruction-tuned variants of Orca.
- Arabic-Mistral — Community fine-tunes of Mistral on Arabic data.
- Arabic-Llama variants — Various Llama models fine-tuned for Arabic.
## 📊 Datasets

High-quality datasets for training, fine-tuning, and evaluating Arabic AI models.
- 101 Billion Arabic Words Dataset — Massive web-scale Arabic corpus, one of the largest publicly available.
- ArabicWeb24 — Curated, deduplicated web crawl of Arabic content from 2024.
- FineWeb2 Arabic — Comprehensive curated Arabic portions of FineWeb2.
- OSCAR Arabic — Multilingual web corpus with a substantial Arabic subset.
- mC4 Arabic — Arabic portion of Google's multilingual C4.
- Arabic Wikipedia Dumps — Clean Arabic encyclopedia content.
- CIDAR — High-quality Arabic instruction-tuning dataset, culturally aligned.
- Arabic Alpaca — Arabic translations of the Alpaca instruction dataset.
- InstAr-500k — 500k Arabic instructions for SFT.
- Arabic Dolly — Arabic version of the Dolly instruction dataset.
- Egyptian-SFT-Mixture — 1.85M Egyptian Arabic SFT examples by MBZUAI-Paris.
- Darija-SFT-Mixture — Moroccan Darija SFT collection.
- MADAR — Multi-Arabic Dialect Applications and Resources covering 25 cities.
- NADI — Nuanced Arabic Dialect Identification shared task datasets. NADI 2025 focuses on multidialectal Arabic speech.
- QADI — QCRI Arabic Dialect Identification dataset of tweets labeled with country-level dialects.
- Arabic Online Commentary (AOC) — Newspaper comments labeled by dialect.
- Arabizi-Egypt — Arabizi Egyptian dataset for LLM pre-training.
- Tashkeela — Large corpus for Arabic diacritization (75M+ words). Also on Kaggle.
- Sadeed_Tashkeela — Cleaned Tashkeela corpus for training diacritization models.
- ArabicaQA — Comprehensive Arabic Question Answering dataset (89,095 questions). GitHub
- AraSum — Arabic text summarization datasets.
- Arabic Sentiment Analysis datasets — Multiple sentiment-labeled corpora.
- Quran QA Datasets — Religious-domain QA over the Holy Quran. GitLab
- Arabic NLI & Semantic Similarity — Arabic SNLI and MultiNLI datasets.
- Alexandria — Dialectal Arabic Machine Translation Dataset for Real Conversations.
- Masader — First online catalog of Arabic NLP datasets by ARBML, covering 600+ datasets with 25+ metadata annotations each.
- Adawat — Aggregated catalog of Arabic NLP tools and resources.
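
Diacritization corpora such as Tashkeela are commonly turned into training pairs by stripping the diacritics from fully vocalized text, so that the bare text is the model input and the vocalized text is the target. The following is a minimal standard-library sketch of that preprocessing step. The names `strip_tashkeel` and `make_pair` are hypothetical and do not come from any dataset or tool listed here, and the set of marks is an assumption that covers only the common harakat.

```python
from typing import Tuple

# Assumed set of diacritics: fathatan (U+064B) through sukun (U+0652),
# plus the superscript (dagger) alef (U+0670).
TASHKEEL = {chr(code) for code in range(0x064B, 0x0653)} | {"\u0670"}


def strip_tashkeel(text: str) -> str:
    """Return the text with the assumed diacritic marks removed."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


def make_pair(diacritized: str) -> Tuple[str, str]:
    """Build an (input, target) pair for a diacritization model."""
    return strip_tashkeel(diacritized), diacritized


if __name__ == "__main__":
    # "kataba" written with a fatha on each letter.
    source, target = make_pair("\u0643\u064E\u062A\u064E\u0628\u064E")
    print(source, target)
```

Restricting the set to explicit code points, rather than dropping every Unicode combining mark, keeps Quranic annotation signs and other marks untouched; extend the set if a corpus needs them removed as well.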
## 🏆 Benchmarks & Leaderboards

Standardized evaluation suites for measuring Arabic AI model performance.
- Open Arabic LLM Leaderboard (OALL) — Official Hugging Face leaderboard for Arabic LLMs, by Hugging Face, Inception, and MBZUAI. Blog
- OALL v2 — Second version of the leaderboard with 14 benchmarks across diverse Arabic tasks.
- QIMMA Leaderboard — Quality-First Arabic LLM Leaderboard by TII with code evaluation (Arabic HumanEval+ and MBPP+). GitHub
- AraGen Leaderboard — Generative-task benchmark using the novel 3C3H metric (correctness, completeness, conciseness + helpfulness, honesty, harmlessness) by Inception/MBZUAI.
- BALSAM Benchmark — Community-driven Arabic LLM benchmark by KSGAAL with 78 NLP tasks. Saudipedia
- AlGhafa Benchmark — Multiple-choice benchmark for zero-shot and few-shot evaluation of Arabic LLMs.
- ArabicMMLU — Arabic version of MMLU sourced from school exams across 40+ subjects. GitHub | Paper
- HELM Arabic — Stanford CRFM's Holistic Evaluation of Language Models for Arabic (v1.0.0, 2025). Blog
- Arabic Broad Benchmark (ABB) — Comprehensive benchmark by SILMA.AI covering diverse Arabic tasks.
- Arabic LLM Leaderboard - ABL — Advanced visualizations and in-depth analytics by SILMA AI.
- ARB - Arabic Multimodal Reasoning Benchmark — First benchmark for step-by-step reasoning in Arabic across textual and multimodal tasks.
- ArabicaQA Benchmark — Question answering benchmark with extensive LLM evaluations.
- CamelEval — Culturally aligned evaluation benchmark for Arabic models.
- AraSTS — Arabic Semantic Textual Similarity benchmark.
- ARCD — Arabic Reading Comprehension Dataset.
- CIDAR-EVAL — Cultural alignment evaluation for Arabic LLMs.
- AraLingBench — Human-annotated benchmark for evaluating Arabic linguistic abilities of LLMs.
- Arabic-LLM-Benchmarks Repository — Comprehensive curated repository by TII.
- Arabic AI Benchmarks & Leaderboards (SILMA) — Comprehensive record of all benchmarks in the Arabic AI ecosystem.
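
Several of the benchmarks above are multiple-choice suites evaluated in zero-shot or few-shot settings. As a rough illustration of what that involves, the sketch below builds a k-shot prompt and scores predictions by accuracy. It is not the harness of any listed benchmark: the `MCQ` class, the prompt template, and the function names are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence

LETTERS = "ABCDEFGH"  # assumed answer labels; supports up to eight choices


@dataclass
class MCQ:
    question: str
    choices: List[str]
    answer: int  # index into `choices`


def format_item(item: MCQ, with_answer: bool) -> str:
    """Render one question; solved examples include the gold letter."""
    lines = [f"Question: {item.question}"]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(item.choices)]
    lines.append(f"Answer: {LETTERS[item.answer]}" if with_answer else "Answer:")
    return "\n".join(lines)


def build_prompt(shots: Sequence[MCQ], query: MCQ) -> str:
    """Concatenate k solved examples followed by the unsolved query."""
    blocks = [format_item(shot, with_answer=True) for shot in shots]
    blocks.append(format_item(query, with_answer=False))
    return "\n\n".join(blocks)


def accuracy(predictions: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of predicted choice indices that match the gold indices."""
    if len(predictions) != len(gold) or len(gold) == 0:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Passing an empty `shots` list yields a zero-shot prompt; real harnesses usually compare model log-likelihoods over the answer letters rather than parsing generated text.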
## 🔊 Speech & Audio

Arabic speech recognition, synthesis, and audio processing resources.
- Whisper Arabic Fine-tunes — Collection of Whisper models fine-tuned for Arabic ASR.
- ArTST — Arabic Text and Speech Transformer by MBZUAI. Unified speech model for Arabic. Collection | Paper
- MGB-2 Dataset — Large-scale Arabic broadcast news corpus for ASR. Paper
- MGB-3 Dataset — Egyptian dialect ASR challenge dataset. Paper
- MGB-5 Dataset — Moroccan Arabic ASR challenge.
- Common Voice Arabic — Mozilla's crowdsourced Arabic speech dataset.
- QASR — Largest transcribed Arabic speech corpus (2000+ hours).
- klaam — Arabic speech recognition, classification and TTS by ARBML.
- NADI 2025 ASR — Shared task on Spoken Dialect Identification and Multidialectal ASR.
- ClArTTS — Classical Arabic text-to-speech corpus and models.
- Arabic SpeechT5 — SpeechT5 adapted for Arabic TTS.
- Tacotron2-Arabic — Arabic implementation of Tacotron 2.
- XTTS Arabic — Coqui XTTS supporting Arabic voice cloning.
- ElevenLabs Arabic — Commercial multilingual TTS with high-quality Arabic voices.
- Munsit — Accurate Arabic STT/TTS platform with smart assistants and meeting transcription.
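
ASR systems trained on corpora like those above are usually compared by word error rate (WER): the word-level edit distance between reference and hypothesis, divided by the number of reference words. Here is a minimal standard-library sketch; the function names are illustrative, and whitespace tokenization is a simplifying assumption (published Arabic WER figures often apply text normalization first).

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate of a hypothesis transcript against a reference."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not ref_tokens:
        raise ValueError("reference transcript must contain at least one word")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)
```

WER can exceed 1.0 when the hypothesis contains many insertions, so it should not be read as a percentage of words that are wrong.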
## 👁️ Vision & OCR

Optical Character Recognition and computer vision for Arabic script.
- QARI-OCR — High-accuracy open-source Arabic OCR model collection by NAMAA-Space, with multiple versions (v0.1, v0.2, v0.3, v0.4). Paper
- Kraken OCR (Arabic models) — Open-source OCR engine with trained Arabic models.
- EasyOCR Arabic — Easy-to-use OCR library with Arabic support.
- Tesseract Arabic — Open-source Tesseract OCR engine (long sponsored by Google) with Arabic trained data.
- KHATT Database — Arabic handwritten text database for research.
- Arabic Document Layout Analysis — Tools for understanding Arabic document structure.
- Falcon Perception — Multimodal AI model enabling systems to see, read, and understand images (including Arabic content) by TII.
## 🛠️ Libraries & Tools

Production-ready libraries for Arabic text processing and NLP.
- CAMeL Tools — Comprehensive Python toolkit for Arabic NLP by NYU Abu Dhabi's CAMeL Lab. Includes tokenization, morphology, dialect ID, NER, and more. Paper
- Farasa — Fast and accurate Arabic NLP toolkit by QCRI. Segmentation, POS, NER, diacritization. Python wrapper (farasapy)
- MADAMIRA — Morphological analyzer and disambiguator for Arabic.
- PyArabic — Python library for Arabic text processing utilities.
- Tashaphyne — Arabic light stemmer and root extractor.
- Tnkeeh — Arabic text preprocessing library with normalization tools.
- Maha — Text processing library with rich Arabic support.
- nmatheg — Simple strategy for training and finetuning NLP models for Arabic by ARBML.
- ARBML Library — Implementation of many Arabic NLP and CV projects with multiple interfaces.
- ar-corrector — Arabic spell-checker and corrector.
- Arabic-Stopwords — Comprehensive Arabic stopwords list.
- arabic-reshaper — Reshape Arabic text for correct display.
- python-bidi — Bidirectional text handling for Arabic.
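
Several of the libraries above provide text normalization utilities. To show what such a step usually involves, here is a standard-library sketch. It does not use or reproduce the API of PyArabic, Tnkeeh, or any other listed library; the name `normalize_arabic` and the specific rules chosen are assumptions for illustration.

```python
import re

_TATWEEL = "\u0640"                                   # elongation character (kashida)
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")   # alef with madda / hamza above / hamza below
_WHITESPACE = re.compile(r"\s+")


def normalize_arabic(text: str) -> str:
    """Apply a small, lossy set of orthographic normalizations."""
    text = text.replace(_TATWEEL, "")            # drop elongation characters
    text = _ALEF_VARIANTS.sub("\u0627", text)    # fold alef variants to bare alef
    text = text.replace("\u0649", "\u064A")      # alef maqsura -> ya
    return _WHITESPACE.sub(" ", text).strip()    # collapse runs of whitespace


if __name__ == "__main__":
    # "Ahmad" written with a hamza and two tatweel characters.
    print(normalize_arabic("\u0623\u062D\u0645\u0640\u0640\u062F"))
```

Letter folding of this kind helps retrieval and classification but discards information, so it is generally a poor fit for generation or diacritization targets.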
## 📚 Research Papers & Surveys

Foundational and recent academic work on Arabic AI and NLP.
- The Landscape of Arabic Large Language Models — Comprehensive ACM survey of Arabic LLM development, architectures, and challenges.
- Evaluating Arabic Large Language Models: A Survey — Systematic review of 40+ Arabic LLM benchmarks and evaluation methodologies.
- A Review of Arabic Post-Training Datasets and Their Limitations — Critical analysis of instruction-tuning datasets for Arabic.
- A Panoramic Survey of Natural Language Processing in the Arab World — Comprehensive CACM survey.
- Jais Technical Report — Original paper introducing the Jais family of Arabic LLMs.
- ALLaM Paper — Large Language Models for Arabic and English by SDAIA.
- AceGPT Paper — Technical details of AceGPT and its RLAIF approach.
- AraBERT Paper — Foundational paper on transformer-based Arabic language understanding.
- AraT5 Paper — Text-to-Text Transformers for Arabic.
- CAMeLBERT Paper — Variant-aware BERT models for MSA, DA, and Classical Arabic.
- ArabicaQA Paper — Large-scale Arabic question answering dataset and benchmark.
- ArTST Paper — Arabic Text and Speech Transformer (won Best Paper at ArabicNLP 2023).
- Fanar Paper — Fanar: An Arabic-Centric Multimodal Generative AI Platform.
- Atlas-Chat Paper — Adapting LLMs for Moroccan Darija.
- Nile-Chat Paper — Egyptian Language Models for Arabic and Latin Scripts.
- Command R7B Arabic Paper — Small enterprise-focused multilingual model.
- BALSAM Paper — A Platform for Benchmarking Arabic LLMs.
- Lahjawi Paper — Arabic Cross-Dialect Translator.
- Cross-dialectal Arabic translation — Frontiers in AI 2025 comparative analysis.
- Emerging Techniques in Arabic NLP — Frontiers in AI editorial 2025.
- ArXiv Arabic NLP Papers — Live feed of latest Arabic NLP research on ArXiv.
- ACL Anthology - ArabicNLP 2025 — Proceedings of the Third Arabic NLP Conference.
## 🎓 Courses & Tutorials

Learning resources for getting started with Arabic AI development.
- HuggingFace NLP Course — Free course covering NLP fundamentals applicable to Arabic.
- CAMeL Tools Tutorials — Hands-on tutorials for Arabic text processing.
- Arabic NLP Series (YouTube) — Episode series exploring Arabic NLP tools and resources.
- Fine-tuning Whisper for Arabic ASR — General Whisper fine-tuning guide that can be adapted to Arabic.
- SILMA AI Blog Series — Arabic LLM development blog series.
- Arabic LLM Models List — Continuously updated guide by SILMA AI.
- Arabic NLP Resources by ARBML — Open-source organization with Arabic NLP tutorials and tools.
- Stanford CS224N Lectures — General NLP course; concepts transfer well to Arabic work.
## 🏢 Companies & Startups

Leading organizations driving Arabic AI innovation in the MENA region and beyond.
- TII - Technology Innovation Institute (UAE) — Creators of Falcon, Falcon-Arabic, Noor, and the QIMMA leaderboard.
- MBZUAI (UAE) — Leading AI research university, behind Jais, ArTST, ArabicMMLU.
- MBZUAI-Paris — Institute of Foundation Models behind Atlas-Chat and Nile-Chat.
- QCRI - Qatar Computing Research Institute (Qatar) — Creators of Fanar and Farasa.
- SDAIA (Saudi Arabia) — National AI authority; creators of the original ALLaM.
- KAUST AI Initiative (Saudi Arabia) — Research behind AceGPT.
- KSGAAL (Saudi Arabia) — King Salman Global Academy for Arabic Language; creators of the BALSAM benchmark.
- Inception (G42) (UAE) — Behind Jais and Jais 2; one of the largest AI companies in MENA.
- HUMAIN (Saudi Arabia) — PIF-backed full-stack AI company launched May 2025; behind ALLaM-34B.
- SILMA AI (Saudi Arabia) — Arabic LLM and benchmark provider; creators of SILMA-9B.
- Watad (Saudi Arabia) — Creators of the Mulhem Arabic-English LLM.
- Navid AI (Saudi Arabia) — Behind the Yehia-7B Arabic LLM.
- Mozn (Saudi Arabia) — AI solutions and Arabic NLP.
- Intella (Saudi Arabia/Egypt) — Arabic voice AI and ASR. Raised $12.5M Series A in 2025.
- Tarjama& / Arabic.AI (UAE) — MENA's leading language AI company; creators of Pronoia LLM. Raised $15M Series A.
- NAMAA-Space — Network for Advancing Modern Arabic NLP & AI; creators of QARI-OCR.
- Clusterlab AI — Behind the 101 Billion Arabic Words Dataset and InstAr-500k.
- AtlasIA (Morocco) — Behind the AL Atlas Moroccan Darija models.
- Synapse Analytics (Egypt) — AI platform with Arabic capabilities.
- Bayzat (UAE) — HR-Tech leveraging Arabic NLP.
- Cequens (Egypt) — Communications platform with Arabic AI features.
- Cerebras Systems — Compute partner for Jais 2 training.
## 🌐 Communities & Conferences

Connect with the Arabic AI community.
- ArabicNLP 2026 — The Fourth Arabic Natural Language Processing Conference.
- ArabicNLP 2025 — Third edition, co-located with EMNLP 2025 (Suzhou, China).
- ArabicNLP 2024 — Second edition, co-located with ACL 2024 (Bangkok, Thailand).
- WANLP Workshop — Workshop on Arabic Natural Language Processing.
- OSACT Workshop — Open-Source Arabic Corpora and Processing Tools workshop.
- NADI Shared Tasks — Nuanced Arabic Dialect Identification (ongoing).
- Qur'an QA Shared Task — Question answering on the Holy Quran.
- SIGARAB — ACL Special Interest Group on Arabic NLP. Google Group
- HuggingFace OALL — Open Arabic LLM Leaderboard organization.
- Cohere Labs Arabic — Arabic AI models, datasets, and discussions.
- ARBML on GitHub — Open-source Arabic ML organization (700+ researchers). Website
- NAMAA-Space — Network for Advancing Modern Arabic NLP & AI.
- OpenReview SIGARAB — Open peer review for Arabic NLP papers.
- r/LocalLLaMA — Active discussions including Arabic LLMs.
## 📰 Blogs & Newsletters

Stay updated on Arabic AI developments.
- SILMA AI Blog — Regular posts on Arabic LLMs, benchmarks, and best practices.
- Middle East AI News — News and analysis on AI in the MENA region.
- Africa AI News — AI developments across Africa, including Arabic-speaking countries.
- TII Blog — Updates from the Technology Innovation Institute.
- Inception Blog — Insights from the team behind Jais.
- MBZUAI News — Research news and announcements.
- MAGNiTT — MENA startup ecosystem coverage including AI.
- Hub71 Insights — Abu Dhabi tech ecosystem and AI startup news.
- Wamda — MENA technology and entrepreneurship news.
## 🤝 Contributing

Contributions are warmly welcomed! This list grows stronger with community input.
How to contribute:
- Found a great resource we missed? Open a Pull Request.
- Spotted a broken link or outdated info? Open an Issue.
- Want to suggest a new category? Start a Discussion.
Please read our Contribution Guidelines before submitting.
Quick rules:
- ✅ Resource must be related to Arabic AI/NLP/Speech/Vision.
- ✅ Add resources in alphabetical or logical order within their section.
- ✅ Include a clear, concise one-line description.
- ✅ Prefer open-source and actively maintained projects.
- ❌ No duplicate entries.
- ❌ No paid courses or affiliate links without disclosure.
If this list helped you, please consider:
- ⭐ Starring the repository
- 🔄 Sharing it with your network
- 🤝 Contributing a resource or improvement
- 💬 Joining the discussion to shape the future of Arabic AI
Every star helps more Arabic AI builders discover these resources! 🚀
This list stands on the shoulders of giants. Special thanks to:
- The Arabic NLP research community for decades of foundational work.
- MBZUAI, TII, QCRI, SDAIA, KAUST, HUMAIN, KSGAAL, and other institutions advancing Arabic AI.
- Open-source contributors who make Arabic AI accessible to everyone.
- Sindre Sorhus for creating the Awesome List standard.
- Every contributor who helps keep this list current and comprehensive.
## 📜 License

This work is released under CC0 1.0 Universal, a public-domain dedication: you may freely use, modify, and distribute this content without restriction.
Osama AL Hajj
## ⭐ If you find this list valuable, please star it to help others discover it!
**Made with ❤️ for the Arabic AI community**
صُنع بكل حب لمجتمع الذكاء الاصطناعي العربي
