Researchers at the Allen Institute for AI and UC Berkeley have built EMO, a mixture-of-experts model that develops modular structures during pre-training. The model can be stripped down to a small fraction of its experts with barely any drop in performance.

Mixture-of-experts (MoE) architectures are now standard in language models like DeepSeek-V4 or Qwen3.5. They activate only a handful of experts per token, which lets them scale to hundreds of billions of parameters without blowing up compute costs. But the full model still has to sit in memory because different tokens within a task call on different experts. If you only want to do math or code, you can’t just load a slice of the model and call it a day.
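To make the memory point concrete, here's a minimal sketch of standard top-k token routing in plain PyTorch (sizes and names are illustrative, not from the paper). Because every token picks its experts independently, even a single document's tokens typically touch nearly the whole pool:

```python
import torch

num_experts, top_k, num_tokens, d_model = 128, 8, 512, 64

router = torch.nn.Linear(d_model, num_experts, bias=False)
tokens = torch.randn(num_tokens, d_model)        # one document's hidden states

logits = router(tokens)                          # (num_tokens, num_experts)
top_experts = logits.topk(top_k, dim=-1).indices # (num_tokens, top_k)

used = torch.unique(top_experts)
print(f"{used.numel()} of {num_experts} experts needed for this document")
# Typically close to 128, so every expert's weights must stay in memory.
```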

According to the paper, that’s because experts in standard MoEs tend to latch onto shallow language patterns. They respond to things like prepositions, punctuation, or articles instead of higher-level domains like math or code. That makes it impossible to carve out a useful subset.

Document boundaries as a training signal

EMO tackles this with a simple trick. Instead of sorting training data into fixed domains like math or biology ahead of time—the way projects like BTX or Ai2’s own FlexOlmo do—the authors use document boundaries. Tokens within a document usually belong to the same domain.

EMO forces all tokens in a document to pick their active experts from a shared pool. The model decides which experts belong in that pool by averaging its router preferences across all tokens in a document and keeping the most frequently selected ones.
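In code, the pooling step could look roughly like this. It's our own reconstruction from the paper's description, not Ai2's released implementation; the function name and the exact masking are assumptions:

```python
import torch

def route_with_document_pool(logits, pool_size, top_k):
    """Restrict every token's top-k choice to a document-wide expert pool.

    logits: (num_tokens, num_experts) router logits for one document.
    Requires pool_size >= top_k so each token can still pick k experts.
    """
    probs = logits.softmax(dim=-1)
    # Average router preferences over the document and keep the most
    # frequently preferred experts as the shared pool.
    pool = probs.mean(dim=0).topk(pool_size).indices       # (pool_size,)

    # Experts outside the pool get -inf, so top-k can never leave it.
    masked = torch.full_like(logits, float("-inf"))
    masked[:, pool] = logits[:, pool]
    top_experts = masked.topk(top_k, dim=-1).indices       # (tokens, top_k)
    return pool, top_experts
```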

Diagram showing how domain-specific subsets of experts (math, code, biomedical) can be selected from the full EMO model.
EMO trains modularity as a first-order goal. You can select an arbitrary subset of experts for a given domain without hurting the full model’s performance. | Image: Allen Institute

Two adjustments were needed to keep training stable. First, the authors changed how load balancing, which aims to spread work evenly across experts, is calculated: instead of locally per training batch, they compute it globally across many documents. Otherwise, the two training goals would fight each other: one bundles a document's tokens onto a shared pool, while the other spreads them across as many experts as possible.
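For illustration, here is one common formulation of the balancing loss (the Switch Transformer auxiliary loss; the paper's exact loss may differ). EMO's change is not to the formula but to which tokens the statistics are aggregated over:

```python
import torch

def load_balance_loss(probs, top_experts, num_experts):
    """Switch-style auxiliary loss over whatever token set `probs` covers.

    probs:       (num_tokens, num_experts) router probabilities.
    top_experts: (num_tokens, top_k) selected expert indices.
    """
    # f_i: fraction of routing slots assigned to expert i.
    counts = torch.bincount(top_experts.flatten(), minlength=num_experts)
    f = counts.float() / counts.sum()
    # P_i: mean router probability mass on expert i.
    P = probs.mean(dim=0)
    # Computed per batch, this fights the document pool; computed over
    # tokens pooled from many documents, the two objectives coexist.
    return num_experts * (f * P).sum()
```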

Second, the researchers randomly vary the size of the document pool during training instead of fixing it. This teaches the model to work with expert subgroups of different sizes at inference time.
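Sketched on top of route_with_document_pool from above (the sampling bounds are our assumption; the paper only says the size is varied):

```python
import random
import torch

# Illustrative training loop: draw a fresh pool size per document so the
# model learns to work with expert subsets of many sizes at inference.
for doc_logits in [torch.randn(256, 128), torch.randn(300, 128)]:
    pool_size = random.randint(16, 128)  # assumed bounds, top_k <= min size
    pool, top_experts = route_with_document_pool(doc_logits, pool_size, top_k=8)
```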

A quarter of the experts, one percent performance loss

The team trained an MoE with 1 billion active and 14 billion total parameters (128 experts, eight active per token) on 1 trillion tokens from the OLMoE pre-training corpus. As a full model, EMO matches an identically trained standard MoE, and the authors say it beats OLMoE even though OLMoE was trained on five times more data.

The researchers then started removing experts to see how far they could go. With just 25 percent of them left (32 out of 128), EMO loses about one percentage point of absolute performance averaged across several benchmarks. At 12.5 percent (16 experts), the drop is around three points.

A standard MoE collapses in the same setup, losing 10 to 15 percentage points and, in some cases, falling below the level of a dense model with the same number of active parameters. On the math benchmark GSM8K, subsets with just 12.5 percent of the experts match full-model performance again after fine-tuning.

Bar chart comparing benchmark results for EMO and a standard MoE at full and reduced expert counts (base model).
On the base models, EMO stays close to full performance even with only 16 out of 128 active experts (12.5 percent), while the standard MoE drops sharply. | Image: Allen Institute
Bar chart of GSM8K results for EMO and a standard MoE at different expert counts per layer.
On the math task GSM8K, EMO holds steady at full-model level (12.0) even with 16 experts (12.2), while the standard MoE drops to 4.9 with half the experts and falls below random guessing in the smallest setting. | Image: Allen Institute

Finding the right experts doesn’t take much data, the authors say. A single few-shot example is enough to pick a subgroup that performs comparably to one selected on a full validation dataset. EMO works with both simple router-based selection and the more specialized Easy-EP method.
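A minimal version of router-based selection could look like the sketch below, assuming access to the recorded router probabilities (Easy-EP is a separate published method and not shown here):

```python
import torch

def select_expert_subset(router_probs, subset_size):
    """Rank experts by average router preference on a tiny sample and
    keep the top ones; only these need to be loaded for the domain."""
    scores = router_probs.mean(dim=0)               # (num_experts,)
    return scores.topk(subset_size).indices

# Stand-in for probabilities recorded on a single few-shot example.
probs = torch.randn(40, 128).softmax(dim=-1)
keep = select_expert_subset(probs, subset_size=32)  # 32 of 128 experts
```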

Experts learn topics, not punctuation

To understand what EMO actually learned, the researchers analyzed how the model distributes tokens to experts internally. For each token, they recorded the probability with which the router sends it to each expert. These patterns create a kind of fingerprint per token. They then grouped tokens with similar fingerprints into clusters.
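A rough sketch of that analysis pipeline, with random stand-in data in place of real router activations (cluster count and shapes are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in fingerprints: one router probability vector per token.
num_tokens, num_experts = 10_000, 128
fingerprints = np.random.dirichlet(np.ones(num_experts), size=num_tokens)

# Group tokens with similar routing fingerprints. Inspecting each
# cluster's tokens shows whether it tracks a topic or a surface pattern.
labels = KMeans(n_clusters=50, n_init=10, random_state=0).fit_predict(fingerprints)
```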

Routing comparison: standard MoE with independently selected experts per token (left) vs. EMO with a document-wide restricted expert pool (right).
In a standard MoE (left), each token independently picks its top-k experts, so nearly all experts get used over the course of a document. EMO (right) defines a shared pool per document that all tokens must route through. This enforces consistent expert usage and drives domain specialization. | Image: Allen Institute

The difference is clear-cut. In a standard MoE, expert clusters correspond to shallow linguistic categories: prepositions, proper names, definite articles. EMO’s clusters map to actual topics: health, medical and wellness, US politics and elections, film, music, TV & book reviews. Tokens from the same document converge on a single cluster in EMO; in a standard MoE, they scatter across many. An interactive visualization of the clusters is available online.

Visualization of token clusters from router activations: EMO with semantic domains like health and politics (left) vs. standard MoE with syntactic clusters like prepositions (right).
EMO forms clusters for meaningful domains (health, politics, film, and music), and tokens from the same document usually land in the same cluster. Standard MoE creates clusters at the word-type level—prepositions, articles—and scatters a document’s tokens across many of them. | Image: Allen Institute

On a sample of 20 million documents from the WebOrganizer dataset, which carries 24 human-assigned domain labels, the authors checked whether related domains also activate similar experts. In EMO, the patterns separate much more cleanly, especially in the model's deeper layers; in the standard MoE, they overlap more.

Use cases go beyond memory savings

The most obvious application is running models in memory-constrained settings where only domain-relevant experts get loaded. In a head-to-head comparison, EMO expert subgroups match or beat both a standard MoE with 32 experts and a dense model with the same number of active parameters, each trained from scratch.

Line chart of average MMLU accuracy vs. memory budget for EMO, standard MoE, and fixed baselines.
In the smaller 130-billion-token setting, EMO expert subsets push the Pareto frontier for memory-to-accuracy: they beat both standard MoEs and dense models trained from scratch. | Image: Allen Institute

The researchers also discuss tailoring models at runtime. A child-friendly app, for example, could switch off expert clusters that respond to spam, gambling, or adult content. In an initial test, the authors retrained a 32-expert subgroup of EMO and plugged it back into the 128-expert model. This improved the full model but didn't reach the level of the standalone subgroup. EMO could also help with monitoring, since the experts make it visible which parts of the model a given input is using.
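Mechanically, switching experts off could be as simple as suppressing their router logits before the usual top-k selection. This is our sketch of the idea, not the paper's procedure, and the expert ids are hypothetical:

```python
import torch

def block_experts(router_logits, blocked_ids):
    """Prevent routing to unwanted experts by forcing their logits to
    -inf before the usual top-k selection."""
    masked = router_logits.clone()
    masked[..., blocked_ids] = float("-inf")
    return masked

blocked = torch.tensor([3, 17, 88])             # hypothetical expert ids
safe_logits = block_experts(torch.randn(10, 128), blocked)
```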

Ai2 is releasing the EMO model, a comparably trained standard MoE baseline, and the training code on Hugging Face and GitHub. The researchers have also published an interactive demo of the token activations. Open questions remain: how best to select and combine expert subgroups, how to retrain individual modules for specific tasks, and how the modular structure can be used to make models more interpretable.
