
I am Mohammed Nasser Jaber. I built Arabic AI that goes beyond translation models

Mohammed Fathy


Arabic Is Not One Language, And That Is Where Most Models Fail

When asked what Western-first models misunderstand, Mohammed Nasser Jaber does not start with performance metrics or benchmarks. He goes straight to framing. The issue, he argues, is not capability but assumption.

Most systems treat Arabic as a single, standardised language. That assumption collapses quickly in real use. Arabic operates in layers, where Modern Standard Arabic coexists with more than twenty dialects, often within the same sentence. Models trained on clean, formal text struggle the moment they encounter this switching.

But he pushes further. The deeper failure is not just geographic variation, but time itself. Arabic, in his view, suffers from what he calls “linguistic temporal drift”. Words that were common seventy or a hundred years ago still appear in legal and historical records, yet no longer exist in modern datasets. For most systems, these terms register as noise.

His response has been to treat that “noise” as signal. By building a referenced corpus of historical dialectal vocabularies, his team is trying to restore continuity across time, not just across regions. The focus is not on matching strings, but on understanding morphology, roots, and intent. It is a shift from translation to interpretation.
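
As a rough sketch of what that shift might look like, consider comparing terms by a shared morphological root rather than by surface string. The transliterated forms and the hand-written lexicon below are illustrative assumptions, not his team’s actual pipeline, which would rely on a proper morphological analyser.

```python
# Illustrative sketch: matching terms by morphological root rather than by
# surface string. The transliterated forms and this tiny hand-written lexicon
# are hypothetical; a real pipeline would use a proper morphological analyser.

# Map surface forms (historical and modern) to a shared triliteral root.
ROOT_LEXICON = {
    "kitab":   "k-t-b",  # "book" (modern)
    "maktub":  "k-t-b",  # "letter / something written" (common in older records)
    "katib":   "k-t-b",  # "scribe"
    "hukm":    "h-k-m",  # "ruling"
    "mahkama": "h-k-m",  # "court"
}

def same_intent(term_a: str, term_b: str) -> bool:
    """Two terms match if they share a root, even when the strings differ."""
    root_a = ROOT_LEXICON.get(term_a)
    root_b = ROOT_LEXICON.get(term_b)
    return root_a is not None and root_a == root_b

# A string comparison misses the link; a root comparison recovers it.
assert "maktub" != "kitab"
assert same_intent("maktub", "kitab")    # historical term maps to modern vocabulary
assert not same_intent("kitab", "hukm")  # different roots, no match
```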


Turning Invisible Archives Into Usable Systems

On the question of “dark data”, Jaber brings the conversation into a more applied space. He points to fragmented legal and historical documents from Yemen and the Gulf, many written in non-standard scripts and effectively invisible to modern systems.

These are not edge cases. They are core records, inaccessible to search, policy, or analysis.

The challenge is not only technical. It is about trust. His approach relies on a human-in-the-loop system where every digitised word carries a confidence score. Uncertainty is not hidden. It is surfaced and escalated to human experts when needed.
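
A minimal sketch of that confidence-gated escalation, assuming per-token confidence scores from the OCR model; the threshold and field names are invented for illustration.

```python
# Illustrative sketch of confidence-gated review: each digitised token carries
# a confidence score, and anything below a threshold is escalated to a human
# expert rather than silently accepted.
from dataclasses import dataclass

@dataclass
class OcrToken:
    text: str
    confidence: float  # 0.0 to 1.0, as emitted by the OCR model

REVIEW_THRESHOLD = 0.85  # hypothetical cut-off for automatic acceptance

def route(tokens: list[OcrToken]) -> tuple[list[OcrToken], list[OcrToken]]:
    """Split tokens into auto-accepted output and a human review queue."""
    accepted = [t for t in tokens if t.confidence >= REVIEW_THRESHOLD]
    escalated = [t for t in tokens if t.confidence < REVIEW_THRESHOLD]
    return accepted, escalated

page = [OcrToken("al-mahkama", 0.97), OcrToken("wathiqa", 0.62)]
accepted, escalated = route(page)
print(f"auto-accepted: {[t.text for t in accepted]}")
print(f"sent to expert review: {[t.text for t in escalated]}")
```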

He also rejects the default move to cloud infrastructure. Instead, he emphasises local, air-gapped systems to ensure that sensitive historical and legal data remains controlled. The system is designed to be accountable at every step, not just accurate.


Why OCR Is the Real Bottleneck

When the conversation turns to where the real leverage sits, Jaber does not hesitate. It is not agents or retrieval systems. It is OCR.

Without reliable document intelligence, everything else is built on sand. You cannot retrieve or reason over data that has never been properly digitised.

For Arabic, this problem is compounded by handwriting, dialectal variation, and degraded historical materials. Solving OCR in this context is not just a preprocessing step. It is the unlock that makes higher-level systems viable.

He frames it bluntly. OCR is the bottleneck of Arabic knowledge. Remove it, and the rest of the stack becomes meaningful.


The Non-Negotiables of High-Stakes Systems

Pressed on what it takes to deploy AI in legal or corporate environments, Jaber draws a hard line. Hallucination is not a minor flaw. It is a liability.

His requirements are strict. Every output must be tied back to an exact source, down to the page and line. The system must understand dialectal variation in queries and map it correctly to formal legal language. And before any response reaches a user, a secondary agent must challenge it against the source material.
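
A simplified sketch of that pattern: every answer carries an exact citation, and a second check confirms the cited span actually supports it before anything is released. The data structures and the string-containment test below are stand-ins for a real challenger agent.

```python
# Illustrative sketch of source-grounded verification: reject any answer whose
# quoted span cannot be found at the cited document, page, and line.
from dataclasses import dataclass

@dataclass
class Citation:
    document: str
    page: int
    line: int

@dataclass
class Answer:
    text: str
    quoted_span: str
    citation: Citation

def fetch_line(doc_store: dict, c: Citation) -> str:
    """Look up the exact source line the answer claims to rest on."""
    return doc_store[c.document][c.page][c.line]

def verify(answer: Answer, doc_store: dict) -> bool:
    """A stand-in for the secondary agent: challenge the answer against its source."""
    source_line = fetch_line(doc_store, answer.citation)
    return answer.quoted_span in source_line

doc_store = {"contract_1952.pdf": {3: {14: "the lease term shall be ten years"}}}
ans = Answer(
    text="The lease runs for ten years.",
    quoted_span="lease term shall be ten years",
    citation=Citation("contract_1952.pdf", page=3, line=14),
)
assert verify(ans, doc_store)  # only verified answers reach the user
```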

This is not about improving accuracy in the abstract. It is about building systems that can be trusted in environments where mistakes carry real consequences.


What Linguistic Sovereignty Looks Like in Practice

When asked to unpack “linguistic sovereignty”, Jaber avoids rhetoric and moves into engineering choices.

It starts with data. Instead of relying on generic, web-scraped corpora, his approach prioritises curated, region-specific datasets. In evaluation, it means abandoning Western-centric benchmarks like BLEU scores in favour of metrics that reflect Arabic linguistic structure.
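
As one hypothetical example of a structure-aware metric, overlap could be scored at the level of morphological roots rather than surface n-grams, so inflected variants of the same root are rewarded where word-level BLEU would penalise them. The stub lexicon below is invented for illustration.

```python
# Illustrative root-overlap metric: score shared morphological roots instead of
# surface n-gram matches. The root lookup is a stub; a real metric would use a
# proper morphological analyser.
ROOTS = {"kitab": "k-t-b", "kutub": "k-t-b", "maktaba": "k-t-b", "qalam": "q-l-m"}

def root_overlap(reference: list[str], hypothesis: list[str]) -> float:
    """Fraction of reference roots recovered by the hypothesis."""
    ref_roots = {ROOTS.get(w, w) for w in reference}
    hyp_roots = {ROOTS.get(w, w) for w in hypothesis}
    return len(ref_roots & hyp_roots) / len(ref_roots)

# Surface forms differ ("kutub" vs "kitab"), but the root is preserved, so the
# score rewards the match where word-level overlap would penalise it.
print(root_overlap(["kitab", "qalam"], ["kutub", "qalam"]))  # 1.0
```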

Privacy is treated as a structural concern, not a feature. Data remains local, controlled, and governed within the region it belongs to.

The underlying idea is simple. A language should not require external systems to interpret its own history and laws.


Building Adab, And Choosing Depth Over Efficiency

When the conversation shifts to model development, Jaber is quick to reframe. The real challenge, he says, was not the foundation model itself but the data pipeline behind it.

Creating usable datasets required manual labelling of dialectal vocabularies stretching back nearly a century. This is work that cannot be automated because the data does not exist in any structured form.

The key technical decision came down to tokenisation. Standard tokenisers fragment Arabic words into meaningless pieces, losing the structure embedded in roots and morphology. His team chose to build a custom tokeniser designed specifically for Arabic.
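
A toy contrast makes the stakes concrete: naive fixed-width fragmentation scatters a word into arbitrary pieces, while a morphology-aware scheme keeps root and pattern as reusable units. The segmentation table is invented for this example; the interview does not describe the actual tokeniser internals.

```python
# Illustrative contrast between character-fragment tokenisation and a
# morphology-aware scheme that emits root and pattern as separate tokens.
SEGMENTS = {
    # surface form -> (root token, pattern token); hypothetical entries
    "maktaba": ("k-t-b", "ma--a-a"),  # "library": root k-t-b in a place-noun pattern
    "kitab":   ("k-t-b", "-i-a-"),    # "book": same root, different pattern
}

def bpe_like(word: str, piece_len: int = 3) -> list[str]:
    """Naive fixed-width fragmentation, losing morphological structure."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]

def morph_tokenise(word: str) -> list[str]:
    """Emit root + pattern tokens so related words share a token."""
    root, pattern = SEGMENTS[word]
    return [root, pattern]

print(bpe_like("maktaba"))         # ['mak', 'tab', 'a']  - arbitrary pieces
print(morph_tokenise("maktaba"))   # ['k-t-b', 'ma--a-a'] - root surfaces as a unit
print(morph_tokenise("kitab"))     # ['k-t-b', '-i-a-']   - same root token reused
```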

It was a costly decision. Training became slower and more resource-intensive. But the trade-off was intentional. He chose depth of understanding over efficiency of computation.


Choosing Applied Research Over Academic Novelty

Asked to reflect on the tension between academia and product, Jaber describes a familiar split. Academic work rewards novelty and complexity. Products demand reliability and speed.

At a critical moment, he had to choose between publishing a high-theory paper and delivering a functional system to a client. He chose the latter.

The decision reshaped his approach. Instead of pursuing cutting-edge ideas in isolation, he focuses on applying modern techniques like retrieval-augmented generation (RAG) and agentic systems to long-standing problems, particularly the digitisation of archives.

For him, the value of research lies in its ability to make old data usable.


Reframing Arabic’s “Complexity” as an Advantage

When the conversation turns to misconceptions, Jaber points to one he hears repeatedly: the idea that Arabic is too complex for AI.

He rejects it outright. The issue is not the language. It is the frameworks imposed on it.

Most systems try to force Arabic into structures designed for Latin-based languages. When those systems fail, the language gets blamed. His view is the opposite. When model architecture aligns with Arabic’s native structure, its complexity becomes an advantage, offering richer meaning and denser information.


The Breakthrough Came From Humans, Not Just Models

Asked about his biggest success, Jaber points to an OCR pilot on traditional Yemeni handwritten documents.

The technical component mattered. Using language models to infer context from damaged or unclear text was a step forward. But it was not sufficient on its own.


The real leverage came from collaboration. By working with people who understood regional handwriting styles, his team was able to label data that would otherwise remain inaccessible.

It is a pattern that runs through his work. Progress does not come from replacing humans, but from pairing their expertise with systems that can scale it.
