
Every now and then, I run a simple test on the latest frontier vision-language models (VLMs): ChatGPT, Claude, and Gemini. I show them an image of a hand with six fingers and ask how many fingers it has. Every single time, they say five, even with the sixth finger fully visible and unambiguous. These models are not counting what is in the image; they are merely confirming what they expect to be there. A recent Stanford study introduces the concept of “mirage reasoning,” finding that VLMs generate detailed image descriptions and confident reasoning traces even for images that were never provided to them. What my finger test reveals, and what this work confirms, is that what looks like visual understanding is often an illusion.
Impressive as these frontier models are, their failure on something as simple as counting fingers should make us deeply skeptical of their ability to handle the far harder spatial problems the physical world presents: occlusion, depth, sizing, and the precise geometric and physical relations between objects and their parts that determine whether a system can function as intended. VLMs are extraordinarily powerful at describing scenes in language and capturing semantic content, but spatial reasoning, the true geometric understanding of where things are, how they relate in three dimensions, and what fits and what does not, remains a profound weakness. The gap between text and the 3D world is structural, and scaling alone will not fully close it.
Two strong convictions follow directly from this realization. First, the most important and most difficult frontiers of AI lie not in the digital world of text-to-image or video generation, but in the physical world, where outputs must be geometrically valid, functionally correct, and constructible. Second, nowhere is the gap between what current VLMs can do and what is actually required more severe than in the AEC (architecture, engineering, and construction) domain, where training data is scarce, domain knowledge is deep and specialized, and the margin for error is not aesthetic but physical and financial, measured not in perceptual plausibility but in fractions of an inch. This understanding ultimately brought me to Augmenta.
My research career did not set out to automate building design. But looking back, several of its major threads were rehearsals for exactly this. My early work on structural analysis and synthesis of 3D shapes and scenes established that geometry carries rich semantic structure: arrangements of parts and objects encode relationships, hierarchies, and functional intent, and the same applies to rooms in a building. My work on symmetry hierarchies (Eurographics 2010) and symmetry maximization for building facades (SIGGRAPH 2013) showed that regularity makes complex designs manageable, a principle that applies as directly to a building's MEP (mechanical, electrical, and plumbing) systems as to a furniture assembly.
My foray into deep learning a decade ago began with generative models for 3D shapes and structures: GRASS (SIGGRAPH 2017) was the first deep generative network for 3D shape structures, operating on hierarchical rather than raw geometric representations; IM-Net (CVPR 2019) helped catalyze the neural implicit representation revolution that resulted in neural radiance fields (NeRFs); and BSP-Net (CVPR 2020 Best Student Paper) grounded neural 3D generation in classical spatial data structures to produce compact, physically interpretable mesh representations. Lately, my work has sought to inject relational inductive bias into state-of-the-art neural reconstructive and generative models, pushing toward AI that does not merely generate geometry, but understands how its parts connect, coordinate, and function.
Through it all, I was always drawn back to functionality, beyond the geometry and spatial structures designed to realize it. The ultimate question in shape understanding is not “what does this look like?” but “what does it do?” In an era of generative AI, this distinction is easy to lose sight of and costly to ignore. A floorplan or a 3D building that merely looks right is not a design; it is a picture. A real design must work: every system, inside and out, must function as intended; every component must be physically valid; and ultimately, the whole must be constructible. How do you build AI that is spatially, physically, and functionally aware? The answer is not AI that describes geometry in words, but AI that reasons in 3D space and generates outputs that are physically coherent and functionally correct.
Spatial and functional AI are the subjects of my talk at CDFAM Barcelona 2026.
By market size, the construction industry exceeds energy, healthcare, transportation, and entertainment, yet it has the lowest AI adoption rate. Computationally, construction is one of the hardest instances of physical AI. A hospital floor contains hundreds of thousands of physical elements. Designing the systems running through it involves millions of simultaneous spatial constraints across continuous 3D space, with hard requirements on clearance, constructibility, and code compliance. Building construction is also harder than generic product design in the one way that hurts most economically: there is no volume over which to amortize the cost of design. I still vividly remember my CEO Francesco Iorio holding up an iPhone and saying: “If you design one of these, you can make a million copies. But no two buildings are ever meant to be exactly the same.” Indeed, the cost of human engineering effort amortizes across millions of identical manufactured units, so the per-unit design cost shrinks toward zero with scale. In building construction, there is no such luxury. Every project is bespoke, bearing the full weight of its own design labor from the first line to the last. There is no volume over which to spread the cost, and no template that survives contact with a new site, new program, or new constraints.
While the need is urgent, and despite the scale of the industry, construction has so far operated without a computational design engine. Every routing decision, every rack placement, every trade coordination remains predominantly manual. The prerequisites for automation, including digital building models, computational power, mature algorithms, and sufficient AI capability, have only recently converged to make it tractable. Against this backdrop, I came to Augmenta because I saw a rare alignment: a hard computational problem, a genuine market gap, and a team that understood exactly where the boundary between AI and deterministic computation lies. My latest passion and research in spatial AI had found its natural home, a place where functional AI is not merely an aspiration but an operational requirement, and where getting the relationship between geometry and intelligence right is not an academic question but the work itself.
Effective spatial AI for the built environment requires two fundamentally distinct layers. The first is a computational engine built for geometric and physical precision: a purpose-built system that solves massive-scale 3D geometry generation, routing, layout, constraint satisfaction, and functional rectification with deterministic correctness. This layer takes a building information model as input and computes geometrically valid, constructible 3D designs. The algorithms and data structures required are among the most sophisticated in visual and spatial computing; they are not the kind of software that emerges from prompting an LLM. To this end, geometry processing is elevated from a supporting role in the AI stack to a foundational one. Every system that must act in the real world, whether an autonomous vehicle, a surgical robot, or a construction platform, depends on geometry processing as its interface with physical reality. My belief is that the most consequential research agenda should ask not how to replace geometry processing with learned models, but how to make the two work together to empower one another.
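To make the flavor of this first layer concrete, here is a minimal sketch of one such deterministic check, clearance verification between routed elements and nearby obstacles. Everything in it is a hypothetical illustration, not Augmenta's engine: axis-aligned boxes stand in for real BIM solids, and the names (Box, separation, clearance_violations) are invented for this sketch. A production engine would operate on exact solids, swept volumes, and spatial indices, but the contract is the same: every violation is found, and none is invented.

```python
# Hypothetical sketch of a deterministic clearance check (not Augmenta's engine).
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in meters, a placeholder for real BIM geometry."""
    name: str
    lo: tuple[float, float, float]
    hi: tuple[float, float, float]


def separation(a: Box, b: Box) -> float:
    """Euclidean distance between two boxes; 0.0 if they touch or overlap."""
    gaps = [max(a_lo - b_hi, b_lo - a_hi, 0.0)
            for a_lo, a_hi, b_lo, b_hi in zip(a.lo, a.hi, b.lo, b.hi)]
    return sqrt(sum(g * g for g in gaps))


def clearance_violations(routes: list[Box], obstacles: list[Box],
                         min_clearance: float) -> list[tuple[str, str, float]]:
    """Report every route/obstacle pair closer than the required clearance."""
    return [(r.name, o.name, d)
            for r in routes for o in obstacles
            if (d := separation(r, o)) < min_clearance]


if __name__ == "__main__":
    # A cable tray running 5 cm below a supply duct, against a 15 cm clearance rule.
    tray = Box("cable_tray_A", (0.0, 0.0, 3.0), (12.0, 0.3, 3.2))
    duct = Box("supply_duct_7", (5.0, 0.1, 3.25), (9.0, 0.9, 3.8))
    for route, obstacle, dist in clearance_violations([tray], [duct], 0.15):
        print(f"{route} is {dist:.3f} m from {obstacle}; 0.150 m required")
```

The point is not the geometry, which is elementary here, but the guarantee: the answer is computed rather than predicted, and it is right or wrong in a way that can be checked.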
The second layer is AI reasoning: interpreting design intent, navigating trade-offs, and learning from human corrections, with LLMs and geometric deep learning serving as the intelligent interface to the computational engine. Critically, this interface must be fast. Building design is not a one-shot process; it is inherently iterative, with designers refining intent, evaluating alternatives, and responding to constraints as they emerge. An AI layer that cannot return meaningful feedback at interactive speeds does not enable this workflow; it breaks it. The training signal itself depends on this loop: high-level user intent flows in, a candidate design flows back, and the designer's corrections become the supervision that makes the next iteration better.
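In schematic form, the loop might look like the sketch below. All of the names (design_loop, propose, verify, ask_designer, TrainingPair) are hypothetical placeholders, not the ACP API; the point is the shape of the interaction, in which every round of human correction is recorded as supervision.

```python
# Hypothetical sketch of the intent -> design -> correction loop (invented names).
from dataclasses import dataclass, field


@dataclass
class DesignIntent:
    """High-level goals plus hard constraints, e.g. a minimum clearance in meters."""
    goals: list[str]
    constraints: dict[str, float]


@dataclass
class CandidateDesign:
    """Generated 3D elements plus the engine's verdict on them."""
    elements: list[dict]
    violations: list[str] = field(default_factory=list)


@dataclass
class TrainingPair:
    """One unit of the supervision flywheel: what was proposed vs. what was kept."""
    intent: DesignIntent
    proposed: CandidateDesign
    corrected: CandidateDesign


def design_loop(intent, propose, verify, ask_designer, max_rounds=3):
    """Iterate: the AI layer proposes, the engine verifies, the designer corrects."""
    pairs = []
    candidate = propose(intent)                   # AI layer: intent -> candidate geometry
    for _ in range(max_rounds):
        candidate.violations = verify(candidate)  # engine: deterministic geometric checks
        corrected = ask_designer(candidate)       # human-in-the-loop correction
        pairs.append(TrainingPair(intent, candidate, corrected))
        if not candidate.violations and corrected == candidate:
            break                                 # accepted as proposed
        candidate = corrected
    return candidate, pairs                       # pairs supervise the next training run
```

The recorded pairs are exactly the paired data discussed next: the automated proposal and the human-corrected result, side by side.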
The relationship between the two layers is additive and compounding. The engine generates both data and the structured action space within which AI can operate. Every project produces paired data, automated output alongside human correction, that continuously supervises and improves the AI layer. At the same time, this two-layer architecture also raises a deeper question: what kind of AI model should belong in that second layer? A general-purpose model trained on the breadth of human knowledge, or a specialized model trained deeply on the specific geometry, constraints, and construction intelligence of the built environment, one that does not merely know about buildings, but truly understands how they work?
Twenty years at the frontier of geometry processing and, over the past decade, geometric deep learning have given me a clear view of something easy to miss from the outside: the capabilities of general-purpose foundation models and the requirements of spatial design automation are separated by a gap that prompt engineering and scaling alone cannot close. I had the opportunity to argue this position directly at a panel on 3D foundation models at 3DV 2026: there is a widening gap between what general-purpose foundation models can achieve and what domain-specific systems must deliver when the objective shifts from linguistic or visual plausibility to physical and functional precision.
The obstacle is not only one of model capability. It is one of data. There is no publicly available, high-quality dataset of fully engineered 3D buildings, particularly for complex non-residential types such as hospitals, schools, and data centers. This is not merely data scarcity; it is a structural obstacle encompassing scale, ownership, controllability, and the near-total absence of labeled design intent. Physical scanning alone cannot solve it: scanned data are expensive to acquire, constrained by ownership, and do not capture the engineering reasoning embedded in a design. The data must instead be actively created through generative design, a principle embedded in Augmenta since its founding. In turn, these generative models must evolve through continual learning, driven by real-world deployment and signals of user intent, creating a compounding data and capability flywheel that general-purpose AI cannot access or replicate.
At Augmenta, we are addressing these challenges by targeting the hardest layer of the stack first: MEP systems, where precision, scale, and domain expertise intersect most intensely. The Augmenta Construction Platform (ACP) can already generate fully constructible electrical designs, with every generated solution becoming training data for the next. The long-term goal is a foundation model for 3D building design: one that embeds construction intelligence and can synthesize fully engineered 3D buildings from multimodal inputs. Reaching this goal requires co-evolving the computational engine, the data pipeline, and the domain intelligence, three things that can only mature through deployment in the real world, on real buildings, at real scale. This is, in the end, what separates functional AI from the kind that merely looks right. That six-fingered hand still stumps every model I test, a small but clarifying reminder that the gap between visual plausibility and physical understanding remains wide, and that bridging it, in the domain that matters most economically and computationally, is precisely what we are here to do.