OsamaALHajj/awesome-arabic-ai

🌟 Awesome Arabic AI

A curated list of awesome Arabic Artificial Intelligence resources
LLMs · Datasets · Benchmarks · Speech · OCR · Tools · Research


📖 About

The Arabic language is spoken by over 400 million people worldwide, yet remains significantly underrepresented in the AI ecosystem. This repository aims to be the single source of truth for Arabic AI resources — bringing together state-of-the-art models, datasets, benchmarks, tools, and research from across the MENA region and beyond.

💡 Found this useful? Give it a ⭐ to help others discover it!


📑 Contents

🤖 Large Language Models (LLMs)

State-of-the-art Arabic and multilingual language models with strong Arabic capabilities.

Open-Source General Models

  • Jais — Family of bilingual Arabic-English LLMs (590M to 70B parameters) developed by Inception (G42), Cerebras, and MBZUAI. Apache 2.0 license. One of the most influential Arabic LLMs.
  • Jais 2 — Next-generation Arabic open-weight LLM (8B and 70B) released December 2025 by Inception, Cerebras, and MBZUAI. Trained on the largest Arabic-first dataset ever assembled.
  • Fanar-1 — 9B parameter Arabic-focused LLM by QCRI (Qatar). Open weights under Apache 2.0. Strong performance on cultural and dialectal tasks.
  • Fanar 2.0 — 27B parameter upgraded Arabic-centric multimodal model by QCRI (2025), supporting language, speech, and image generation.
  • ALLaM-7B — 7B model by SDAIA (Saudi Data and AI Authority). Apache 2.0. Optimized for Modern Standard Arabic and Saudi dialects.
  • ALLaM-34B — 34B parameter Arabic-centric LLM by HUMAIN. Top-ranked on Stanford HELM for Arabic accuracy.
  • AceGPT — 7B/13B Arabic LLM by KAUST with RLAIF-based instruction tuning. Strong cultural alignment.
  • AceGPT-v2 — Improved Arabic LLMs (8B, 32B, 70B) with alignment at pre-training. Available on Hugging Face.
  • Falcon-Arabic — TII's first Arabic model in the Falcon series (May 2025). Excels at Arabic grammar, reasoning, and code-switching. Falcon LLM Site
  • Falcon-H1 — TII's best-in-class high-performance hybrid model with strong Arabic capabilities.
  • SILMA-9B-Instruct-v1.0 — Top-ranked open-weights Arabic LLM (until Feb 2025) based on Gemma. Apache 2.0 license.
  • Mistral Saba — 24B parameter model by Mistral AI (Feb 2025) specifically designed for Arabic and South Asian languages.
  • Command R7B Arabic — 7B open-weight Arabic-optimized model by Cohere Labs (Feb 2025). Technical Report
  • Aya & Aya Expanse — Cohere's multilingual models supporting 101+ languages, with strong Arabic support.
  • Yehia-7B — Fine-tuned Arabic LLM by Navid AI, based on ALLaM. Top-performing on multiple Arabic benchmarks.
  • Pronoia LLM — Tarjama's enterprise-focused Arabic LLM (7B and 14B variants). Ranked #1 on Open Arabic LLM Leaderboard.
  • Mulhem — First Saudi domain-specific LLM by Watad, trained exclusively on Saudi datasets.
  • Noor — Large-scale Arabic NLP model by TII (10B parameters, 2022).
  • Arabic-Sahm — Arabic language model by NAMAA-Space.

Dialectal Arabic LLMs

  • Atlas-Chat — First family of LLMs specifically developed for Moroccan Darija. Available in 2B, 9B, and 27B by MBZUAI-Paris. Paper
  • Nile-Chat — Family of LLMs for Egyptian dialect handling both Arabic and Latin scripts. Available in 4B, 12B, and 3x4B-A6B (MoE). Paper
  • AL Atlas — Moroccan Darija pretraining by AtlasIA.
  • Lahjawi — First cross-dialect translation model, covering 15 distinct Arabic dialects.

Commercial & Closed Models

  • Cohere Command-R / Command-A — Strong multilingual models with excellent Arabic support and improved Arabic dialect matching.
  • GPT-4 / GPT-4o — Leading commercial models with strong Arabic capabilities.
  • Claude (Anthropic) — Excellent Arabic comprehension and generation.
  • Gemini — Google's multilingual model with native Arabic support.
  • Fanar (Closed) — Qatar's flagship Arabic GenAI platform.

Foundational & Encoder Models

  • AraBERT — BERT-based Arabic language understanding model by AUB MIND Lab. Foundational work, still widely used.
  • AraGPT2 — GPT-2 variants pre-trained on Arabic by AUB MIND Lab.
  • AraT5 — T5-style text-to-text Transformer for Arabic language generation. Available on Hugging Face. Paper
  • AraELECTRA — ELECTRA-based Arabic language model by AUB MIND Lab.
  • CAMeLBERT — Suite of BERT models for Arabic variants (MSA, DA, CA) by NYU Abu Dhabi's CAMeL Lab.
  • MARBERT / ARBERT — Arabic-focused BERT variants by UBC-NLP.

Fine-tuned & Specialized Models


📊 Datasets

High-quality datasets for training, fine-tuning, and evaluating Arabic AI models.

Pre-training Datasets

Instruction & SFT Datasets

  • CIDAR — High-quality Arabic instruction-tuning dataset, culturally aligned.
  • Arabic Alpaca — Arabic translations of the Alpaca instruction dataset.
  • InstAr-500k — 500k Arabic instructions for SFT.
  • Arabic Dolly — Arabic version of the Dolly instruction dataset.
  • Egyptian-SFT-Mixture — 1.85M Egyptian Arabic SFT examples by MBZUAI-Paris.
  • Darija-SFT-Mixture — Moroccan Darija SFT collection.

Dialectal Arabic Datasets

  • MADAR — Multi-Arabic Dialect Applications and Resources covering 25 cities.
  • NADI — Nuanced Arabic Dialect Identification shared task datasets. NADI 2025 focuses on multidialectal Arabic speech.
  • QADI — QCRI Arabic Dialect Identification dataset, with country-level dialect labels for tweets.
  • Arabic Online Commentary (AOC) — Newspaper comments labeled by dialect.
  • Arabizi-Egypt — Arabizi Egyptian dataset for LLM pre-training.

Domain-Specific Datasets

Dataset Catalogs

  • Masader — First online catalog of Arabic NLP datasets by ARBML — 600+ datasets with 25+ metadata annotations each.
  • Adawat — Aggregated catalog of Arabic NLP tools and resources.

🏆 Benchmarks & Leaderboards

Standardized evaluation suites for measuring Arabic AI model performance.


🔊 Speech & Audio

Arabic speech recognition, synthesis, and audio processing resources.

Speech Recognition (ASR)

Text-to-Speech (TTS)

  • ClArTTS — Classical Arabic text-to-speech corpus and models.
  • Arabic SpeechT5 — SpeechT5 adapted for Arabic TTS.
  • Tacotron2-Arabic — Arabic implementation of Tacotron 2.
  • XTTS Arabic — Coqui XTTS supporting Arabic voice cloning.
  • ElevenLabs Arabic — Commercial multilingual TTS with high-quality Arabic voices.
  • Munsit — Accurate Arabic STT/TTS platform with smart assistants and meeting transcription.

👁️ Vision & OCR

Optical Character Recognition and computer vision for Arabic script.


🛠️ Libraries & Tools

Production-ready libraries for Arabic text processing and NLP.

  • CAMeL Tools — Comprehensive Python toolkit for Arabic NLP by NYU Abu Dhabi's CAMeL Lab. Includes tokenization, morphology, dialect ID, NER, and more. Paper
  • Farasa — Fast and accurate Arabic NLP toolkit by QCRI. Segmentation, POS, NER, diacritization. Python wrapper (farasapy)
  • MADAMIRA — Morphological analyzer and disambiguator for Arabic.
  • PyArabic — Python library for Arabic text processing utilities.
  • Tashaphyne — Arabic light stemmer and root extractor.
  • Tnkeeh — Arabic text preprocessing library with normalization tools.
  • Maha — Text processing library with rich Arabic support.
  • nmatheg — Simple strategy for training and finetuning NLP models for Arabic by ARBML.
  • ARBML Library — Implementation of many Arabic NLP and CV projects with multiple interfaces.
  • ar-corrector — Arabic spell-checker and corrector.
  • Arabic-Stopwords — Comprehensive Arabic stopwords list.
  • arabic-reshaper — Reshape Arabic text for correct display.
  • python-bidi — Bidirectional text handling for Arabic.
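
Several of the tools above, such as Tnkeeh, handle text normalization. As a rough illustration of what that step usually involves, here is a minimal standard-library sketch. It does not use or reproduce the API of any library listed here: the function name `normalize_arabic` is hypothetical, and the specific choices (stripping diacritics and tatweel, folding hamza-carrying alef variants to bare alef) are assumptions made for this example that some pipelines deliberately avoid.

```python
import re
import unicodedata

# Harakat and tanween (U+064B to U+0652) plus the superscript alef (U+0670).
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character, carries no lexical meaning

# Assumption: fold alef with madda / hamza above / hamza below to bare alef.
ALEF_FOLD = str.maketrans({"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"})


def normalize_arabic(text: str) -> str:
    """Return a search-friendly form of Arabic text (hypothetical helper)."""
    # NFKC maps Arabic presentation forms (U+FB50.., U+FE70..) to base letters.
    text = unicodedata.normalize("NFKC", text)
    text = DIACRITICS.sub("", text)      # drop short vowels, shadda, sukun
    text = text.replace(TATWEEL, "")     # drop elongation
    return text.translate(ALEF_FOLD)     # unify alef variants


if __name__ == "__main__":
    print(normalize_arabic("العَرَبِيَّة"))  # العربية
```

Normalization of this kind is lossy (diacritics can disambiguate words), so it is usually applied for search, deduplication, or classification rather than for text that will be shown to readers.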

📚 Research Papers & Surveys

Foundational and recent academic work on Arabic AI and NLP.


🎓 Courses & Tutorials

Learning resources for getting started with Arabic AI development.


🏢 Companies & Startups

Leading organizations driving Arabic AI innovation in the MENA region and beyond.

Research Institutes & Government-Backed

Startups & Companies

  • Inception (G42) (UAE) — Behind Jais and Jais 2; one of the largest AI companies in MENA.
  • HUMAIN (Saudi Arabia) — PIF-backed full-stack AI company launched May 2025; behind ALLaM-34B.
  • SILMA AI (Saudi Arabia) — Arabic LLM and benchmark provider; creators of SILMA-9B.
  • Watad (Saudi Arabia) — Creators of Mulhem Arabic-English LLM.
  • Navid AI (Saudi Arabia) — Behind Yehia-7B Arabic LLM.
  • Mozn (Saudi Arabia) — AI solutions and Arabic NLP.
  • Intella (Saudi Arabia/Egypt) — Arabic voice AI and ASR. Raised $12.5M Series A in 2025.
  • Tarjama& / Arabic.AI (UAE) — MENA's leading language AI company; creators of Pronoia LLM. Raised $15M Series A.
  • NAMAA-Space — Network for Advancing Modern Arabic NLP & AI; creators of QARI-OCR.
  • Clusterlab AI — Behind 101 Billion Arabic Words Dataset and InstAr-500k.
  • AtlasIA (Morocco) — Behind AL Atlas Moroccan Darija models.
  • Synapse Analytics (Egypt) — AI platform with Arabic capabilities.
  • Bayzat (UAE) — HR-Tech leveraging Arabic NLP.
  • Cequens (Egypt) — Communications platform with Arabic AI features.
  • Cerebras Systems — Compute partner for Jais 2 training.

🌐 Communities & Conferences

Connect with the Arabic AI community.

Conferences & Workshops

Communities


📰 Blogs & Newsletters

Stay updated on Arabic AI developments.

  • SILMA AI Blog — Regular posts on Arabic LLMs, benchmarks, and best practices.
  • Middle East AI News — News and analysis on AI in the MENA region.
  • Africa AI News — AI developments across Africa, including Arabic-speaking countries.
  • TII Blog — Updates from the Technology Innovation Institute.
  • Inception Blog — Insights from the team behind Jais.
  • MBZUAI News — Research news and announcements.
  • MAGNiTT — MENA startup ecosystem coverage including AI.
  • Hub71 Insights — Abu Dhabi tech ecosystem and AI startup news.
  • Wamda — MENA technology and entrepreneurship news.

🤝 Contributing

Contributions are warmly welcomed! This list grows stronger with community input.

How to contribute:

  1. Found a great resource we missed? Open a Pull Request.
  2. Spotted a broken link or outdated info? Open an Issue.
  3. Want to suggest a new category? Start a Discussion.

Please read our Contribution Guidelines before submitting.

Quick rules:

  • ✅ Resource must be related to Arabic AI/NLP/Speech/Vision.
  • ✅ Add resources in alphabetical or logical order within their section.
  • ✅ Include a clear, concise one-line description.
  • ✅ Prefer open-source and actively maintained projects.
  • ❌ No duplicate entries.
  • ❌ No paid courses or affiliate links without disclosure.
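
The "no duplicate entries" rule can be checked mechanically before opening a Pull Request. The sketch below is one possible way to do it, not a tool shipped with this repository. It assumes entries look like `- [Name](url) — description` or `• Name — description`; the helper names `entry_name` and `find_duplicates` are hypothetical.

```python
import re
from collections import Counter

# Assumed entry shape: a bullet, a name (optionally a markdown link),
# an em dash (U+2014), then the description.
BULLET = re.compile(r"^\s*[-*\u2022]\s+(.*)$")
LINK = re.compile(r"^\[([^\]]+)\]\([^)]*\)$")


def entry_name(line: str):
    """Return the resource name of a list entry, or None for other lines."""
    m = BULLET.match(line)
    if m is None or "\u2014" not in m.group(1):
        return None
    head = m.group(1).split("\u2014", 1)[0].strip()
    link = LINK.match(head)
    return link.group(1) if link else head


def find_duplicates(lines):
    """Return case-insensitively duplicated entry names, sorted."""
    names = (entry_name(line) for line in lines)
    counts = Counter(name.casefold() for name in names if name)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    sample = [
        "- [Jais](https://example.org/jais) \u2014 Bilingual LLM.",
        "\u2022 AraBERT \u2014 Encoder model.",
        "  \u2022 jais \u2014 Same resource listed twice.",
    ]
    print(find_duplicates(sample))  # ['jais']
```

Matching on the name alone will miss duplicates listed under different names, so a comparison of link targets would be a natural extension.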

🌟 Show Your Support

If this list helped you, please consider:

  • ⭐ Starring the repository
  • 🔄 Sharing it with your network
  • 🤝 Contributing a resource or improvement
  • 💬 Joining the discussion to shape the future of Arabic AI

Every star helps more Arabic AI builders discover these resources! 🚀


🙏 Acknowledgments

This list stands on the shoulders of giants. Special thanks to:

  • The Arabic NLP research community for decades of foundational work.
  • MBZUAI, TII, QCRI, SDAIA, KAUST, HUMAIN, KSGAAL, and other institutions advancing Arabic AI.
  • Open-source contributors who make Arabic AI accessible to everyone.
  • Sindre Sorhus for creating the Awesome List standard.
  • Every contributor who helps keep this list current and comprehensive.

📜 License

This work is licensed under CC0 1.0 Universal — you may freely use, modify, and distribute this content without restriction.


👤 Maintainer

Osama AL Hajj

GitHub


⭐ If you find this list valuable, please star it to help others discover it!

Made with ❤️ for the Arabic AI community
