"Vibe coding," a term coined by Andrej Karpathy in early 2025 to describe a workflow where developers describe intent to an AI assistant and let the model generate the implementation, has earned its popularity. It has produced real, shipped products. It has lowered the barrier to building software for people who never could before, and it has accelerated experienced developers across a wide range of projects. For greenfield applications, internal tools, prototypes, and many production systems, the workflow genuinely works.

There is one category of project, though, where it consistently goes wrong: rebuilding an existing Excel spreadsheet as an application.

The reason has nothing to do with the quality of the AI or the skill of the person prompting it. It has to do with what a business spreadsheet actually is. A working Excel model is not a sketch. It is a validated artifact that has been used, corrected, audited, and refined over years. The numbers it produces are trusted because real people have stress-tested them in real situations. When a pricing analyst, an actuary, or a finance lead opens that file, they are looking at the most reliable artifact in the organization for the decisions it supports.

Vibe coding is fundamentally a generative process. It produces a new implementation based on a description of intent. That is exactly the right approach when the intent is the source of truth. It is the wrong approach when the source of truth is a complex, working artifact whose correctness has already been established. In that case, what the team needs is preservation, not regeneration, and the gap between those two things is where the trouble starts.

The remainder of this post lays out where that gap matters most, and why teams whose business runs on spreadsheet models should think carefully before applying a generative workflow to them.

The Dual-System Maintenance Problem

The moment a spreadsheet is vibe-coded into an application, the organization is running two systems instead of one. Both of them claim to represent the same business logic. Neither one is going to fully replace the other.

Spreadsheet and app drift

The spreadsheet does not stop being used after the app is generated. It is too useful. It is where finance does scenario analysis, where operations tests new assumptions, where leadership asks "what if" without filing a ticket. The spreadsheet is the model layer for the business, and that role does not transfer cleanly to a rewrite.

So the spreadsheet keeps evolving. New tax brackets get added. A rate gets adjusted. A new product line gets folded into the calculation. Meanwhile, the generated application sits frozen at the moment it was produced. Every spreadsheet change is now a maintenance event the app does not know about, and within a few weeks the two systems are producing different answers for the same question. Nobody can tell which one is correct without re-deriving the logic from scratch.

This is the same spreadsheet sprawl problem teams have faced for decades, except now the sprawl includes a piece of generated code that nobody on the team fully understands.

Synchronization and debugging issues

Keeping the two systems in sync is harder than it sounds, because generated code from an iterative prompting workflow is rarely structured for line-by-line maintenance. The model produces something that works on the inputs it was shown. It does not naturally produce a clean mapping back to the spreadsheet cells it was generated from. When a formula in the workbook changes, there is no obvious one-to-one update path on the application side.

Most teams respond by regenerating the code. They paste the new spreadsheet back into the AI assistant and ask for an updated version. This is when the real damage accumulates: every regeneration is a fresh translation, with a fresh set of subtle differences from the previous one. Variable names shift. Edge cases are handled differently. Rounding behavior changes between runs. The application becomes a moving target, and debugging a discrepancy means debugging a system that was re-invented on each release.

The spreadsheet, meanwhile, has been right the whole time. It is the application that keeps drifting.

Spreadsheets Contain Hidden Business Logic

Excel files are not just calculators. They are living records of how a business actually operates. Much of that logic is undocumented, embedded in cell references, named ranges, and conditional structures that have accumulated meaning over years of use. Most enterprises do not have a handful of these files; they have thousands, each one carrying its own embedded policy.

AI misunderstands business intent

When an AI assistant reads a spreadsheet, it sees syntax. It does not see the policy that produced the syntax. It does not know that a particular constant is a regulatory floor, that a specific lookup table reflects a negotiated commission structure, or that an apparently redundant IF branch exists because of a one-off legal settlement that the team has been told never to remove.
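To make that concrete, here is a hypothetical translation (the formula, the function name, and the constant are all invented for illustration). The arithmetic survives; the policy does not:

```python
# What the model sees: =MAX(B2*C2, 14.25). What it produces:
def line_total(quantity: float, unit_rate: float) -> float:
    # 14.25: a regulatory floor? A negotiated minimum? A fix for a
    # one-off settlement? The workbook's comment column knew.
    # The generated code does not.
    return max(quantity * unit_rate, 14.25)
```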

This is not a knock on the model. A human developer reading the same spreadsheet cold would miss the same context. The difference is that a human translation usually involves conversations with the people who built the workbook, and those conversations surface the unwritten rules. A generative workflow optimized for speed tends to skip that step.

The result is code that mirrors the formulas mathematically while quietly stripping out the reasoning behind them. The application produces numbers that look correct under normal conditions, but the moment the business logic needs to change, the team is left guessing which lines of code correspond to which policies. The audit trail that lived inside the spreadsheet, sometimes literally in the comments column, did not survive the translation.

For organizations that depend on transparent, inspectable calculations, this is a serious regression. The spreadsheet is auditable by design. Every formula is visible, every input is traceable, every step can be examined by a non-developer. Generated code, useful as it is in other contexts, is structured for the convenience of the system that produced it rather than the humans who will eventually need to govern, audit, or amend it.

Small formula mistakes create massive damage

The risk is not that the original formulas are wrong. The risk is that the translation introduces mistakes the spreadsheet never had.

A generated function might overflow on large values where Excel, which stores every numeric value as an IEEE 754 double, would not. It might apply a different rounding mode by default. It might evaluate a sequence of operations in a different order than the workbook does, producing results that match for nine inputs out of ten and quietly diverge on the tenth. It might implement a helper with the wrong sign convention because the model interpreted a negative value in a cell differently than the spreadsheet's author did.
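The rounding difference alone is easy to demonstrate. Excel's ROUND rounds ties away from zero, while Python's built-in round uses banker's rounding (ties to even), so a mechanical translation of =ROUND(A1, 0) into round(a1) changes answers on exact halves. A minimal sketch:

```python
from decimal import Decimal, ROUND_HALF_UP

def excel_round(value: float, digits: int = 0) -> float:
    """Replicate Excel's ROUND: ties round away from zero."""
    exp = Decimal(10) ** -digits
    return float(Decimal(str(value)).quantize(exp, rounding=ROUND_HALF_UP))

for x in (0.5, 1.5, 2.5, -2.5):
    print(x, round(x), excel_round(x))
# round(2.5) == 2, excel_round(2.5) == 3.0: most inputs match,
# and the exact halves quietly diverge.
```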

None of these are problems in the spreadsheet. They are problems introduced during translation, and they often only show up under specific combinations of inputs that the original workbook handled correctly without anyone noticing it was doing so. In finance, payroll, pricing, insurance, and any regulated workflow, the cost of a single quiet miscalculation can be enormous. The spreadsheet you trusted is no longer the system producing the answer, and the system that replaced it has a defect profile nobody has mapped.

Why Massive Testing Is Still Required

In many domains where vibe coding shines, the cost of a small bug is low. A landing page that renders slightly off, a prototype that miscounts items in an admin view, a script that needs a small fix after the first run: these are recoverable. Spreadsheet-driven business logic is not that domain. The cost of a small bug is potentially enormous, and the verification step that other projects can sometimes get away with skipping is not optional here.

Validation against the original spreadsheet

The only credible way to verify that a generated application matches the spreadsheet it was built from is to treat the original workbook as the source of truth and check the new application's output against it across every input scenario that matters. This means running both systems in parallel, comparing results cell by cell, and identifying every divergence.
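In practice, that comparison becomes a harness. The sketch below assumes the team has exported input/expected-output pairs from the workbook into a CSV, one scenario per row; the compute argument stands in for the generated function under test, and the column layout is hypothetical:

```python
import csv
import math

def find_divergences(csv_path, compute, rel_tol=1e-9):
    """Replay workbook scenarios against the generated function.

    Each CSV row is one scenario; the 'expected' column holds the
    spreadsheet's answer, and every other column is a named input.
    """
    divergences = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            expected = float(row.pop("expected"))
            inputs = {k: float(v) for k, v in row.items()}
            actual = compute(**inputs)
            if not math.isclose(actual, expected, rel_tol=rel_tol):
                divergences.append((inputs, expected, actual))
    return divergences
```

Every divergence the harness surfaces is a translation defect that has to be triaged by someone who understands the model, which is where the hours go.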

For a moderately complex model, this validation suite will involve thousands of test cases. It needs to cover normal operating ranges, boundary conditions, historical scenarios, and the rare combinations that show up only at quarter-end or under specific regulatory triggers. The work cannot be delegated to the AI that wrote the code, because the AI is the system under test. It has to be supervised by someone who understands what the spreadsheet is supposed to do and can recognize when the generated application diverges from it.

The point is not that this work is impossible. The point is that it is the same effort a traditional engineering team would have spent, which means the speed advantage of the generative workflow is largely consumed by the validation it forces back onto the team.

Why "looks correct" is not enough

Plausible output is the trickiest signal in this kind of project. A generated application will compile, render a clean interface, and return reasonable-looking numbers on the inputs the developer happened to try. It will not announce the cases where it diverges from the spreadsheet. There is no exception, no log line, no visible failure. The application simply returns a different answer than the workbook would have returned, and unless someone is actively comparing the two, the discrepancy goes undetected.

This is the inverse of how spreadsheets fail. Spreadsheet errors are usually visible: a formula breaks, a reference shows #REF!, a column does not sum the way it should. The spreadsheet wears its mistakes on its face. A translated implementation hides them. By the time someone notices the application has been miscalculating for six months, the damage has already propagated into invoices, reports, and decisions.

A polished UI and confident output are evidence of fluency, not correctness. In most projects, the two are correlated closely enough that the distinction does not matter. In a spreadsheet conversion, they come apart, and the gap between them is where the bugs live.

The Real Cost of Vibe Coding Excel Spreadsheets

None of this is an argument against vibe coding in general. It is an argument about domain fit.

Vibe coding is genuinely transformative for projects where the goal is to create something new, where the cost of a small error is low, and where the system being built does not have to match a pre-existing source of truth to the cell. Most software projects in the world look like that, and the productivity gains in those categories are real.

Spreadsheet-driven business logic does not look like that. The model already exists. It is already correct. It is the artifact the business has been relying on. The job is not to generate a new implementation. The job is to preserve the existing one and put a better workflow around it.

The teams who get the most value out of their AI-augmented development workflows are the ones who recognize this distinction. They use generative tools where generation is the point, and they preserve the spreadsheet as the calculation engine in cases where the spreadsheet is the asset. The web layer handles user input, validation, permissions, persistence, and integration with the rest of the stack. When the business logic changes, it changes in one place: the workbook. The application picks up the new logic automatically, because the workbook is still doing the work.
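One minimal sketch of that shape, assuming xlwings and a local Excel installation are available (the file name, sheet name, and cell addresses are invented; a headless deployment would swap in a server-side spreadsheet-calculation service instead):

```python
import xlwings as xw

def price_quote(volume: float, discount: float) -> float:
    """The web layer validates inputs; the workbook does the math."""
    app = xw.App(visible=False)
    try:
        wb = app.books.open("pricing_model.xlsx")
        sheet = wb.sheets["Model"]
        sheet.range("B2").value = volume    # input cells owned by the workbook
        sheet.range("B3").value = discount
        return sheet.range("B10").value     # Excel recalculates the output formula
    finally:
        app.quit()
```

When finance edits the workbook's formulas, this function's answers change with them, with no regeneration step in between.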

This is the operating model that avoids the dual-system maintenance burden, the hidden-logic risk, and the massive validation requirement described above. It also frees the AI to do what it actually does best: accelerate the parts of the project that benefit from generation, while leaving the trusted artifact intact.

Vibe coding has a real and growing place in modern software development. Excel spreadsheets simply are not where that place is.