Specification Engineering
... is a bet on better code gen and more complexity
AI engineering wants to be declarative. Ultimately, the algorithms in language-model attention heads that devs interleave with their Python and TypeScript are fuzzy and inscrutable. Their natural abstractions are guarantees: predicates on outputs. We can’t know what they do, only what they will have done.
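To make “predicates on outputs” concrete, here’s a minimal sketch. Everything in it - `Guarantee`, `call_model`, `checked_call` - is a hypothetical illustration, not a real library: the spec says nothing about how the model computes, only what must be true of whatever it returns.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Guarantee:
    name: str
    predicate: Callable[[str], bool]  # must hold over the model's output

def call_model(prompt: str) -> str:
    # Stand-in for the opaque part: an attention-head algorithm we
    # cannot inspect, only observe after the fact.
    return '{"answer": 42}'

def checked_call(prompt: str, guarantees: list[Guarantee]) -> str:
    output = call_model(prompt)
    for g in guarantees:
        # Declarative: we don't know what the model did, we only
        # check what it will have done.
        assert g.predicate(output), f"violated guarantee: {g.name}"
    return output

def is_json(s: str) -> bool:
    try:
        json.loads(s)
        return True
    except ValueError:
        return False

returns_json = Guarantee("output parses as JSON", is_json)
print(checked_call("summarize the ticket", [returns_json]))
```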
This is fine when AI software is 99% Python logic and 1% AI. Most software accesses SQL databases - another declarative, opaque dependency - but that’s not a problem: the interaction is tightly scoped and managed. But what happens when the software is 99% AI and 1% Python, at least in terms of complexity and headspace?
At that point, the software is poorly served by the imperative paradigm. There may be plenty of LoC, but as far as its owner is concerned, there’s nothing imperative about it. Much of the actual logic of the program lives in … the heads of its developers. The prompts won’t speak for themselves: each maintainer, as a byproduct of hours of whiteboarding and painful trial and error, carries reams of knowledge about how the different AI components interrelate - knowledge that cannot be safely deduced from the code itself, possibly even by a superintelligence.
Storing critical system information solely in human minds - and doing so more often as AI becomes a bigger part of software - is not a good idea. The I/O bandwidth is low, the information degrades quickly, and collaboration scales poorly. It’s a structural trend running directly counter to the massive productivity gains the rest of software is seeing - and it’s holding AI software development back. Evals can sometimes introduce some structure and legibility, but they’re too fragmented: the requirements your engineers care about end up distributed across hundreds of test cases you will never read and will struggle to version and update.
As time goes on, teams and engineers will want AI systems like Synth to help them - and, to be most effective in controlling and intervening on the software, those systems will need a legible and durable source of truth. Finding the right abstractions will take time, but now is the time to start.
Every abstraction is leaky, so directly maintaining imperative Python in AI software will remain a necessity for the foreseeable future. But great engineering teams will use processes and tools to ensure that the system specification stays synchronized and takes precedence. PRs and prompt updates can be compiled up into spec diffs, and rejected if they introduce breaking changes. Synchronization in the other direction is where the abstraction starts paying for itself.
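As a sketch of what that could look like - assuming a hypothetical `guarantees_satisfied` helper that runs the eval suite against a set of staged prompt files, and a `prompts/` directory; nothing here is a real tool - a pre-commit hook could compile the staged change into a spec diff and reject it if any guarantee disappears:

```python
import subprocess
import sys

def staged_prompt_files() -> list[str]:
    # Prompt files staged in this commit (assumes prompts live in prompts/).
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [f for f in out.splitlines() if f.startswith("prompts/")]

def guarantees_satisfied(staged: list[str]) -> set[str]:
    # Hypothetical stand-in: in practice, run the eval suite with the
    # staged prompt versions overriding the committed ones, and return
    # the set of system-level guarantees that still hold.
    return {"answers cite a source", "refuses out-of-scope requests"}

def main() -> int:
    staged = staged_prompt_files()
    if not staged:
        return 0
    before = guarantees_satisfied([])      # spec as committed
    after = guarantees_satisfied(staged)   # spec after this change
    broken = before - after                # the spec diff, as lost guarantees
    if broken:
        print(f"commit rejected, breaking spec changes: {sorted(broken)}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Dropped into `.git/hooks/pre-commit`, a check like this makes the spec, not the prompt text, the thing a change is judged against.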
Add a requirement -> AI spools up 1k LoC and Synth stress-tests two new prompts and a sub-agent, with 5 new evals to boot. Evals just become a way to check guarantees and create impetus for the compiler to update prompts/code/LoRAs. Naturally, synchronization will sometimes go both ways. Swapping in a better model might call for simpler code with fewer prompts, depending on how preferences are specified, so we might go models -> evals -> spec -> code -> evals -> spec. Suddenly equilibrium becomes a more apt description than compilation. But don’t let that scare you away.
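Here’s a sketch of that equilibrium loop, with the same caveat: `Requirement` and the `resynchronize` step are illustrative shapes, not an existing API. The spec declares what must hold, evals verify it, and the “compiler” keeps regenerating prompts/code/evals until everything passes.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Requirement:
    guarantee: str                          # declarative, system-level
    evals: list[Callable[[], bool]] = field(default_factory=list)

spec = [
    Requirement("tickets are triaged into exactly one queue"),
    Requirement("escalations include a customer-visible summary"),
]

def resynchronize(spec: list[Requirement]) -> None:
    # Hypothetical compile step: regenerate prompts/code/LoRAs, then
    # attach fresh generated checks to each requirement. Abstract here.
    for r in spec:
        r.evals = [lambda: True]  # placeholder for generated eval cases

def converged(spec: list[Requirement]) -> bool:
    # A requirement counts only once it has evals and they all pass.
    return all(r.evals and all(e() for e in r.evals) for r in spec)

# Equilibrium, not one-shot compilation: adding a requirement (or
# swapping in a better model) just reruns the loop until every
# guarantee checks out again.
while not converged(spec):
    resynchronize(spec)
```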
Declarative specs with guarantees aren’t new - they’ve been used as long as software’s been written. Engineering teams have always benefited from clearly communicated system-level guarantees, and maintained them even when it took precious human-hours to do so. What is new is intelligence that can consistently and cheaply transpile between spec and software in a git commit hook. Let’s use it.