The Transformer’s dominance is at stake! A model called SubQ, built on the SSA architecture, has emerged with a 12-million-token context, a cost of less than 5% of Claude Opus, and a computational load cut by nearly a thousand times.
Is the Transformer’s throne in jeopardy?!
Today, an AI model called SubQ has emerged, shocking the world.
It is the world’s first model built on a fully subquadratic sparse attention (SSA) architecture, with a context window of up to 12 million tokens.
SubQ’s core advantage lies in its SSA architecture, which “dynamically selects” where to attend based on content instead of blindly computing the associations between all tokens.
Compared with a standard Transformer, its computational load is cut by up to 1,000 times.
Experimental results show that at a 1-million-token context, SubQ is 52.2 times faster than FlashAttention-2, at less than 5% of the cost of Claude Opus.
The company behind the architecture, Subquadratic, is based in Miami and has just 13 employees.
AI expert Bindu Reddy commented sharply, “If all this is true, the valuations of Anthropic and OpenAI will go straight to zero!”
Others called it the real path for scaling LLMs going forward.
Nine years on, the Transformer’s “original sin” remains unsolved
In 2017, Google’s paper “Attention is All You Need” established the dominant position of the Transformer architecture.
In the nine years since, from GPT to Claude to Gemini, every cutting-edge large model has been built on the same foundation: the dense attention mechanism.
For all that time, the Transformer has worked in a brute-force way: every token must be compared against every other token in the sequence.
This mechanism traps it in the quagmire of “quadratic complexity”: every time the context doubles, the computational cost quadruples.
That means the longer the input, the more expensive and slower the model becomes, and the more prone it is to falling over.
This explains why almost all LLMs cap their context at around 1 million tokens. It’s not that longer contexts are technically impossible; they are simply too costly to be usable.
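A quick back-of-the-envelope calculation makes the quadratic wall concrete. The sketch below is purely illustrative (the 2·n²·d score-matrix FLOP count is standard attention accounting; the head dimension of 128 is our assumption, not a SubQ figure):

```python
# Rough FLOP count for dense attention scores: the QK^T matmul alone
# costs about 2 * n^2 * d multiply-adds per head.

def dense_score_flops(n_tokens: int, head_dim: int = 128) -> float:
    """Approximate FLOPs for one head's n x n score matrix."""
    return 2.0 * n_tokens**2 * head_dim

for n in (128_000, 256_000, 512_000, 1_000_000):
    print(f"{n:>9,} tokens -> {dense_score_flops(n):.2e} FLOPs per head")

# Each doubling of context multiplies the cost by 4:
#   128,000 tokens -> 4.19e+12
#   256,000 tokens -> 1.68e+13
#   512,000 tokens -> 6.71e+13
# 1,000,000 tokens -> 2.56e+14
```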
The birth of SubQ has fundamentally changed this equation.
The emergence of the SSA architecture: aiming for “less” rather than “faster”
The core breakthrough of SubQ is called SSA: Subquadratic Sparse Attention.
Its idea is surprisingly simple: it no longer compares each token with all other tokens.
In a trained model, the vast majority of attention weights are close to zero, so why bother computing them?
For each query, SSA selects the positions in the sequence that are genuinely worth attending to, based on content, and then computes exact attention only at those positions.
It computes only the meaningful interactions and skips over 99% of the useless work.
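Subquadratic hasn’t published SSA’s internals, but the description matches a well-known pattern: per-query top-k selection over the keys. The NumPy sketch below illustrates that general pattern only; the function name, the budget k, and the scoring scheme are our assumptions, not SubQ’s actual method:

```python
import numpy as np

def topk_sparse_attention(q, K, V, k=64):
    """Attend from one query to only its k highest-scoring keys.

    q: (d,) query vector; K, V: (n, d) key/value matrices.
    The softmax and weighted sum cover k positions instead of all n.
    """
    scores = K @ q / np.sqrt(q.shape[0])    # content-based scores, shape (n,)
    idx = np.argpartition(scores, -k)[-k:]  # indices of the top-k keys
    sel = scores[idx]
    w = np.exp(sel - sel.max())
    w /= w.sum()                            # softmax over the selected keys only
    return w @ V[idx]                       # (d,) attention output

# Toy usage: 10,000 keys in the sequence, but only 64 enter the softmax.
rng = np.random.default_rng(0)
d, n = 128, 10_000
out = topk_sparse_attention(rng.normal(size=d),
                            rng.normal(size=(n, d)),
                            rng.normal(size=(n, d)))
print(out.shape)  # (128,)
```

Note that this naive sketch still scores every key, so it is not subquadratic by itself; a real system must also make the selection step cheap, for example via hierarchical indexing or hashing, and that is presumably where SSA’s actual engineering lives.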
Here are the three key features of SSA:
Linear scalability
The computational load grows with the number of selected positions, not with the length of the entire sequence. When the context doubles, the cost merely doubles instead of quadrupling (see the arithmetic sketch after this list).
Content-dependent routing
The model decides where to look based on semantics, not position. Whether the key information sits at token 3 or token 11 million, it can be found.
Precise retrieval
Unlike recurrent models that compress information into a fixed state, SSA retains the ability to precisely retrieve information from any position.
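Putting the first feature in numbers: the comparison below assumes a fixed per-query selection budget k, which SubQ has not disclosed; 16,384 is our round-number placeholder.

```python
# Illustrative cost model: dense attention ~ n^2 operations,
# sparse attention ~ n * k for a fixed per-query budget k (our assumption).
k = 16_384  # hypothetical selection budget

for n in (1_000_000, 2_000_000, 4_000_000):
    dense, sparse = n**2, n * k
    print(f"n={n:>9,}  dense={dense:.1e}  sparse={sparse:.1e}  "
          f"advantage={dense / sparse:,.0f}x")

# Doubling n quadruples the dense cost but only doubles the sparse cost,
# so the relative advantage itself doubles with every doubling of context.
```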
To put it simply, SSA is not about “computing dense attention faster” but about “having the model do less attention computation in the first place”.
That reduction in computation translates directly into speed.
Speed soars 52.2 times, and the cost is less than 5% of Opus
The data released by SubQ is shocking:
At a length of 1 million tokens, SSA is 52.2 times faster than standard dense attention with FlashAttention-2.
It is 7.2 times faster at 128,000 tokens, 13.2 times faster at 256,000, and 23 times faster at 512,000.
Clearly, the longer the context, the greater the advantage.
This is SSA’s linear scalability made visible: dense attention gets slower as length grows, while SSA gets more cost-effective.
On compute, attention FLOPs drop by 62.5 times at 1 million tokens; at 12 million tokens, the reduction soars to nearly 1,000 times.
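Those two FLOP figures are roughly self-consistent under the same fixed-budget model (our own sanity check, not Subquadratic’s published accounting): if the reduction ratio is about n / k, then 62.5x at 1 million tokens implies a budget of roughly 16,000 attended positions, which predicts about 750x at 12 million tokens, the same order of magnitude as “nearly 1,000 times”.

```python
# Sanity-check the published FLOP reductions with a fixed-budget model.
# Assumption (ours): dense FLOPs ~ n^2, sparse FLOPs ~ n * k, so ratio = n / k.
ratio_at_1m = 62.5
implied_budget = 1_000_000 / ratio_at_1m          # ~16,000 positions per query
predicted_at_12m = 12_000_000 / implied_budget    # ~750x

print(f"implied selection budget: {implied_budget:,.0f} positions")
print(f"predicted reduction at 12M tokens: {predicted_at_12m:,.0f}x")
# The claimed "nearly 1,000x" would require the effective budget to shrink
# a bit at longer contexts, which is plausible but undisclosed.
```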
As for the cost, Subquadratic gave a very intuitive comparison:
On the RULER 128K benchmark, SubQ costs $8 where Opus costs $2,600, a cost gap of over 300 times.
Most importantly, these speed and cost advantages do not come at the expense of accuracy.
On RULER 128K: SubQ scores 95%, versus 94.8% for Opus 4.6.
On SWE-Bench Verified (code engineering): SubQ scores 81.8, beating Opus 4.6’s 80.8.
On MRCR v2 (long-context retrieval): SubQ scores 65.9%; that trails Opus 4.6’s 78% but far exceeds GPT 5.4 (39%) and Gemini 3.1 Pro (23%).
Taken together, the figures are thought-provoking:
A seed-stage company, at less than 5% of Opus’s cost, has matched or beaten the flagship models of Anthropic and OpenAI on multiple core benchmarks.
With a single prompt, SubQ can take in ultra-long inputs of 12 million tokens:
Whether it’s an entire codebase, months of PR history, or the state of a long-running AI agent, it handles them all with ease, at less than 5% of the cost.
It has to be said: if all this holds up, it will be the most important architectural breakthrough since the Transformer itself.
A 13-person startup aims to overthrow the Transformer
Subquadratic was founded in 2024, raised $29 million in seed funding, and has a valuation of $500 million.
It has two co-founders: CEO Justin Dangel and CTO Alexander Whedon.
The 11-person research team, all PhDs, hails from Meta, Google, the University of Oxford, the University of Cambridge, and Adobe.
Notably, the company was previously called Aldea and worked on speech models before pivoting to research on attention architectures.
The launch spans three product lines, released simultaneously:
SubQ API: A full-context interface for 12M tokens
SubQ Code: A command-line coding agent that can take on an entire codebase at once
SubQ Search: A deep-research tool, free at launch
The whole internet is in an uproar: Transformer terminator, or the Theranos of AI?
Within hours of SubQ’s release, the AI community split into two camps.
AI expert Dan McAteer summed up the mood in one sentence:
SubQ is either the biggest breakthrough since Transformer…
Or it’s the Theranos of the AI world.