Billion-Parameter Theories

Written by Sean Linehan. Published on Mar 9, 2026.
For most of human history, the things we couldn't explain, we called mystical. The movement of stars, the trajectories of projectiles, the behavior of gases. Then, over the course of a few centuries, we pulled these phenomena into the domain of human inquiry. We called it science.
What's remarkable, in retrospect, is how terse those explanations turned out to be. F=ma. E=mc². PV=nRT.
The universe, or at least vast swaths of it, submitted to compression ratios that seem almost unreasonable. You could capture the behavior of every falling object on Earth in three variables and describe the relationship between matter and energy in five characters.
The deepest truths fit on a napkin.
They had to. When your tools are pencils, chalkboards, and human working memory, a theory has to be small or you can't use it. The decompression happens in a human brain in real time. So theories needed to be not just correct, but operable at human scale. A physicist scribbling equations on paper needs to be able to hold the model in her head while she works through implications.
And so we developed an implicit belief that good theories are small. If a theory was elegant, we learned to trust it. If you couldn't express it concisely, you probably didn't understand it well enough.
This worked extraordinarily well for a certain class of problems. Call them the complicated.
A complicated system is one with many parts that interact in structured ways, but that ultimately yields to decomposition. A jet engine is complicated, and so are orbital mechanics and the circuit board in your laptop. You can break these systems into components, study each one, and reassemble your understanding into a coherent picture. The picture might be intricate, but it is, in principle, completable.
The Enlightenment and its intellectual descendants gave us a powerful toolkit for taming the complicated. And then we made the natural mistake of assuming that toolkit would scale to everything.

The Complex

Poverty is not complicated. It is complex.
So is climate change. So are drug addiction, mental health, immune response, urban decay, ecosystem collapse, and the behavior of financial markets.
These are systems where the interactions between dimensions are themselves dynamic. Feedback loops create emergent behavior that isn't derivable from studying the components in isolation. Interventions in one area produce non-obvious cascading effects in others. And in many cases, like markets or public health, studying the system can cause changes to the system itself through reflexivity.
We've known about this distinction for decades. The Santa Fe Institute, founded in 1984 by scientists who realized their own disciplines couldn't speak to each other about the problems that actually mattered, was built around precisely this insight.
Researchers there, working across physics, biology, economics, and computer science, identified recurring features of complex systems, from power law distributions and self-organized criticality to sensitivity to initial conditions and phase transitions. They created a vocabulary and a set of concepts that advanced our understanding.
But they also ran into a wall.
The concepts they developed were descriptive rather than prescriptive. Knowing that a system exhibits power law behavior tells you the shape of what will happen without telling you the specifics. You couldn't pick these principles up and use them to intervene in the world with precision.
There's a parallel in linguistics. Chomsky showed that all human languages share deep recursive structure. True, and essentially irrelevant to the language modeling that actually learned to do something with language. The universal principles were correct, but too general to be operable.
Complex systems remained resistant to science. But we tried anyway. Economics attempted to become the physics of human markets. We built elegant mathematical models with perfectly rational agents and perpetual equilibrium. The models were so mathematically pristine that physicists who encountered them marveled at the technique while questioning whether any of it described the actual world.
Pharmacology tried to treat the body as a complicated machine, targeting individual pathways with individual molecules. Sometimes it works brilliantly. Sometimes it works partially. And often it doesn't work at all, because the body is a web of interactions that doesn't respect the boundaries we draw around individual mechanisms.
The pattern repeated everywhere we applied Enlightenment tools to complex problems. Partial success, persistent failure, and the lingering sense that we were missing something fundamental.

Practice Before Theory

There's an old pattern in science. Practice comes first.
Blacksmiths worked metal for millennia before metallurgy existed as a discipline. Medieval architects built Gothic cathedrals that still stand today without any formal understanding of structural engineering. Farmers selectively bred crops for thousands of years before anyone had heard of genetics.
In each case, practitioners developed reliable and useful capabilities without any theoretical understanding of the underlying mechanisms. And then, when theory finally caught up, it didn't just explain what practitioners were already doing. It blew the doors open. Metallurgy didn't just explain blacksmithing, it gave us titanium alloys and semiconductors. Structural engineering didn't just explain cathedrals, it gave us skyscrapers.
I think we're in an analogous moment with complexity.
The tools of modern AI, from deep neural networks to transformer architectures, let us build compressed models of complex systems that actually work. We can do things with them. But we are, in a meaningful sense, the blacksmiths. We make improvements through intuition and experiment. We know what works without fully understanding why.
The Santa Fe Institute spent the late 1980s building early prototypes of exactly these tools. Researchers there created artificial stock markets with adaptive agents that spontaneously produced bubbles and crashes. They built self-organizing networks and genetic algorithms. But the models remained too small to be operable, and the elegant law of self-organization they hoped to discover never materialized.

The Missing Medium

So why do today's models work when SFI's didn't?
Not because we found better equations (though we have). Because the theory these problems require is simply very large, and we finally have tools that can hold it.
Elegant equations might not exist for complex systems. The most compressed possible representation of how a complex system behaves might still be billions of parameters large. Larger than anything a human brain can hold in working memory. For as long as our only tool for operationalizing theories was the human mind armed with pencil and paper, these problems were simply beyond our reach.
They aren't anymore.
Take large language models. Fundamentally, a large language model is a compressed model of an extraordinarily complex system, the totality of human language use, which itself reflects human thought, culture, social dynamics, and reasoning. The compression ratio is enormous. The model is unimaginably smaller than the system it represents. That makes it a theory of that system in every sense that matters: a lossy but useful representation that lets you make predictions and run counterfactuals.
It's just not a theory that fits on a t-shirt.
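To put rough numbers on that compression ratio, here's a back-of-the-envelope calculation. Every figure below is an illustrative round number, not the spec of any real model or corpus.

```python
# Back-of-the-envelope compression ratio for a large language model.
# All numbers are illustrative round figures, not specs of any real system.

corpus_tokens = 10e12          # ~10 trillion training tokens (assumed)
bytes_per_token = 4            # rough average for English text (assumed)
corpus_bytes = corpus_tokens * bytes_per_token

params = 70e9                  # a hypothetical 70B-parameter model
bytes_per_param = 2            # 16-bit weights
model_bytes = params * bytes_per_param

ratio = corpus_bytes / model_bytes
print(f"corpus ≈ {corpus_bytes/1e12:.0f} TB, model ≈ {model_bytes/1e9:.0f} GB")
print(f"compression ratio ≈ {ratio:.0f}x")  # ≈ 286x under these assumptions
```

Even with generous numbers for the model, the weights come out hundreds of times smaller than the text they were distilled from, and the text is itself a thin sample of the system that produced it.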

Good Explanations Have Reach

There's a reasonable objection to everything I've argued so far, and it comes from the physicist and philosopher David Deutsch. Deutsch holds that good explanations are compact and general, hard to vary without breaking. The more caveats and carve-outs a theory requires, the worse it smells. E=mc² has reach because it applies universally and you can't tinker with it. A lookup table of experimental results does not.
By this standard, a billion-parameter neural network doesn't look like a theory. It might give you useful predictions about a particular complex system, but it offers no portable understanding. You can't pick it up and carry it to a new problem.
Deutsch would look at "the model is the theory" and see capitulation.
This objection is fair to some extent. But it rests on a conflation.
When we talk about a trained model, we're talking about the weights. Billions of numerical parameters encoding what the model learned from a specific dataset. Those weights are large and parochial.
But the architecture of the model, the structure that made learning possible in the first place, is something else entirely.
The architecture of a transformer can be described on a few sheets of paper. Attention mechanisms, feed-forward layers, residual connections, layer normalization. And this same compact structure, when trained on language, learns language. Trained on protein structures, it learns protein folding. Trained on weather patterns, it learns weather.
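The compactness claim can be made concrete. Below is a sketch of a single transformer block in plain NumPy: single-head attention, a feed-forward layer, residual connections, and layer normalization. It's a toy, with no training loop, no multi-head splitting, and no causal mask, but the entire structure fits in a few dozen lines.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Each token queries every other token and mixes in what it finds.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def feed_forward(x, W1, W2):
    # Position-wise MLP with a ReLU nonlinearity.
    return np.maximum(x @ W1, 0) @ W2

def transformer_block(x, p):
    # Pre-norm residual layout: x + sublayer(norm(x)).
    x = x + attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])
    x = x + feed_forward(layer_norm(x), p["W1"], p["W2"])
    return x

# Toy usage: 5 tokens, 8-dimensional embeddings, random weights.
rng = np.random.default_rng(0)
d, seq = 8, 5
p = {name: rng.normal(scale=0.1, size=shape) for name, shape in
     [("Wq", (d, d)), ("Wk", (d, d)), ("Wv", (d, d)),
      ("W1", (d, 4 * d)), ("W2", (4 * d, d))]}
x = rng.normal(size=(seq, d))
y = transformer_block(x, p)
print(y.shape)  # (5, 8)
```

Everything domain-specific lives in the weights. The code above is the same whether the tokens are words, amino acids, or weather readings.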
In Deutsch's terms, the architecture has reach.
So perhaps there are two layers of theory here. The system-specific layer, the trained weights, is large and particular to its domain. This will likely always be true. The theory of this economy or this climate will always be vast.
But the meta-layer, the minimal architecture that can learn to represent arbitrary complex systems, might be compact and universal. It might be exactly the kind of good explanation Deutsch would champion.
If that's right, the physics of complexity would look different from what anyone at the Santa Fe Institute expected. It would not be a law about how complex systems behave. It would be a description of what structure can learn them.
Andrej Karpathy's work on nanoGPT is, in a practical sense, a search for exactly this. The smallest possible implementation that can still be trained to model complex phenomena. Strip away everything that isn't load-bearing. What's left?
We haven't found it yet. The transformer might not be the final answer. But for the first time, we have candidate architectures that demonstrably work across wildly different domains of complexity.

Interpretability as Complexity Science

The architecture might be compact, but the trained models remain vast and opaque. And there's a tempting conclusion to draw from this. We've built useful oracles, but oracles aren't science.
The emerging field of mechanistic interpretability suggests otherwise. Researchers are developing tools to understand how neural networks do what they do, from network ablation and selective activation to feature visualization and circuit tracing. These techniques let you study a trained model the way a biologist studies an organism, through careful experimentation and observation.
By studying how these models internally represent complex phenomena, we may extract more compressible truths about the phenomena themselves. If a neural network trained on climate data develops internal representations that cluster certain variables together in unexpected ways, that's a clue about the structure of the underlying system.
The model becomes not just a tool for prediction, but a specimen for study.
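A toy version of the ablation technique mentioned above: zero out one hidden unit at a time and measure how far the output moves. The network here is random, standing in for a trained model, and the probe is crude; real interpretability work is far more careful, but the experimental logic is the same.

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny random two-layer network standing in for a trained model.
W1 = rng.normal(size=(4, 16))   # input -> 16 hidden units
W2 = rng.normal(size=(16, 1))   # hidden -> output

def forward(x, mask):
    # mask zeroes out ("ablates") selected hidden units.
    h = np.maximum(x @ W1, 0) * mask
    return h @ W2

x = rng.normal(size=(32, 4))            # a batch of probe inputs
baseline = forward(x, np.ones(16))

# Ablate each hidden unit in turn; large output shifts flag
# units the computation actually depends on.
effects = []
for i in range(16):
    mask = np.ones(16)
    mask[i] = 0.0
    effects.append(np.abs(forward(x, mask) - baseline).mean())

ranking = np.argsort(effects)[::-1]
print("most load-bearing hidden unit:", ranking[0])
```

The output is an experimental result about the model, obtained the way a biologist lesions a tissue and watches what breaks, not a derivation from first principles.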
In this light, mechanistic interpretability might be the actual emerging science of complexity. The method is different from anything in the Enlightenment toolkit. You don't start with first principles and derive equations. You train a model that captures the behavior of a complex system, and then you study the model to discover what structure it found.
The theory is extracted from the compression, rather than the compression being derived from the theory.
It's early, but the direction is promising.

What This Changes

If this framing is right, many of the hardest problems facing humanity, from chronic disease and addiction to poverty and climate, were never fundamentally intractable. They were just too complex for the only medium of theory we had.
And now we have a new medium.
The problems remain hard. Building a sufficiently rich model of a complex system is an enormous undertaking. And the epistemology shifts in ways that might be uncomfortable. Instead of "I understand the causal mechanism and can predict what happens if I change X," you get something more like "I have a sufficiently rich model that I can simulate what happens if I change X, with probabilistic confidence." The answers are distributions, not deterministic outputs. That's a different kind of knowing.
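That distributional kind of knowing can be sketched with a toy Monte Carlo simulation. The "system" below is just a noisy growth process with a made-up intervention parameter; the point is the shape of the answer, an interval rather than a number.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(growth, n_steps=50, n_runs=10_000):
    # A toy stochastic process: compounding growth plus random shocks.
    shocks = rng.normal(loc=growth, scale=0.05, size=(n_runs, n_steps))
    return np.prod(1 + shocks, axis=1)   # final value of each run

baseline = simulate(growth=0.01)
intervened = simulate(growth=0.02)       # "what happens if I change X"

# The answer is a distribution, not a point estimate.
lo, hi = np.percentile(intervened, [5, 95])
print(f"median outcome: {np.median(intervened):.2f}")
print(f"90% interval: [{lo:.2f}, {hi:.2f}]")
print(f"P(run beats baseline median): "
      f"{(intervened > np.median(baseline)).mean():.2f}")
```

No single run tells you what will happen. The ensemble tells you what tends to happen, and with what confidence, which is the honest form of the claim.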
But it might be the kind of knowing these problems actually admit.
We spent centuries wishing complex systems would yield to terse, elegant theories. The models that capture any particular complex system will probably always be large. But the structure that can learn them all might yet prove to be small.
It's remarkable how much of reality turned out to be modelable by theories that fit in a few symbols. Perhaps it shouldn't be remarkable at all that not everything can be.