Why Extracting Data Is Only Half the Job

Spend five minutes on any document extraction vendor's website and you'll hear the same story: upload a PDF, get a spreadsheet. The narrative ends at the moment structured data appears in Excel. But anyone who has actually processed invoices for a living knows that getting the numbers into a grid is the easy part. The work that eats afternoons — the work that generates the errors that surface three months later during a reconciliation — happens after the extraction finishes. It happens in the formula bar.

What document extraction actually delivers — and what it doesn't

The pitch is straightforward: a 40-line invoice arrives as a PDF. You upload it. The AI reads every charge line — description, quantity, unit price, line total — and outputs a spreadsheet with columns already labeled. In marketing terms, this is "end-to-end automation." In accounting terms, it's the starting gun.

Because here is what the spreadsheet actually contains after extraction: raw values, as they appeared on the page. The quantity column has numbers. The unit price column has numbers. The line total column has numbers. But nobody, not the AI, not the extraction engine, has verified that Quantity × Unit Price actually equals the Line Total printed on the invoice. Nobody has summed all forty line totals and compared the result against the Subtotal on the last page. Nobody has checked whether the tax percentage applied to the subtotal produces the tax amount the vendor wrote, or flagged the invoice as "needs review" when the numbers don't reconcile.

The extraction tool gave you data. It did not give you verified data. And the gap between those two things — between "the numbers are in Excel" and "the numbers are correct and ready for the general ledger" — is where the real hours disappear.

Extraction converts unstructured documents into structured data. That is format conversion — a solved problem. What remains unsolved for most teams is computation on that data: line totals, cross-row aggregation, conditional flags, and variance detection. These are not extraction tasks. They are post-extraction tasks. And they are almost entirely manual.

The spreadsheet formula that secretly costs more than manual data entry

Invoice data extraction tools have cut the "keying in numbers" step from 3 minutes per page to roughly 5-10 seconds. That is a genuine improvement. But put a stopwatch on the full workflow — from PDF arrival to "ready to post" — and the distribution of time shifts in a way most tool comparisons don't capture.

A typical invoice processing workflow after AI extraction involves at least four categories of formula work. Each one is individually small — a column here, a SUM there — but collectively they form a repetitive spreadsheet assembly line that nobody budgets for:

Line total verification. For every row of the invoice, you need =C2*D2 in column E — quantity times unit price — and a comparison against the printed line total in column F. A single invoice with 15 line items means 15 multiplication formulas and 15 comparison formulas. Across 200 invoices per month, that's 6,000 formula cells created, dragged, and spot-checked.
Subtotal reconciliation. After verifying individual lines, you sum the computed line totals and compare against the printed subtotal. Then you apply the tax rate (which may differ per jurisdiction or per line item — some items are taxable, some aren't) and compare against the printed tax amount. Then you sum subtotal plus tax and compare against the invoice total. For a multi-page invoice with split tax rates, this is not one SUM formula. It's a chain of interdependent calculations that breaks if any upstream value is wrong.
Conditional flags. Does the invoice total exceed the PO amount? Is the payment due within 7 days (flag for urgent approval)? Does the vendor appear on the preferred supplier list? Each of these is a conditional formula — =IF(F2>G2,"OVER BUDGET","") — that someone writes, formats, and drags across every row.
Standardization formulas. Dates arrive in every conceivable format: 06/15/2026, 15-Jun-2026, 20260615. Currency amounts mix decimal commas and periods depending on the vendor's country. Someone writes =DATEVALUE() wrappers and =SUBSTITUTE() chains to normalize everything before it can touch the accounting system.
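The standardization step in the list above can be sketched in a few lines of Python. This is a minimal illustration, not any tool's actual behavior: the function names normalize_date and normalize_amount are hypothetical, the date formats are only the three quoted in the text, slash dates are assumed to be month-first, and the amount parser assumes that whichever of "," or "." appears last is the decimal separator (so a bare "1,234" would be misread and needs vendor-specific handling).

```python
from datetime import date, datetime
from decimal import Decimal

# Only the three formats quoted in the text; a real pipeline would need more.
# Note: %b matches month abbreviations in the current locale ("Jun" in English).
DATE_FORMATS = ("%m/%d/%Y", "%d-%b-%Y", "%Y%m%d")


def normalize_date(raw: str) -> date:
    """Try each known format in order and return the first successful parse."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def normalize_amount(raw: str) -> Decimal:
    """Convert '1.234,56' or '1,234.56' style strings to a Decimal.

    Assumption: the separator that appears last is the decimal separator.
    """
    text = raw.strip().replace(" ", "")
    if text.rfind(",") > text.rfind("."):
        # Decimal comma: drop thousands dots, then turn the comma into a dot.
        text = text.replace(".", "").replace(",", ".")
    else:
        # Decimal point, or no separator at all: drop thousands commas.
        text = text.replace(",", "")
    return Decimal(text)


if __name__ == "__main__":
    for sample in ("06/15/2026", "15-Jun-2026", "20260615"):
        print(sample, "->", normalize_date(sample).isoformat())
    for sample in ("1.234,56", "1,234.56"):
        print(sample, "->", normalize_amount(sample))
```

All three date strings from the text resolve to the same calendar day, and both amount spellings resolve to the same value, which is the point of the normalization step.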

None of this work is extraction. The AI already extracted the right numbers. But the numbers aren't usable until these calculations are done — and at most organizations, the calculation workload is invisible. It happens in Excel, in 15-minute bursts between meetings, by people whose job descriptions don't include "spreadsheet formula technician." The work gets done, but nobody tracks how long it takes — and nobody asks whether it's necessary.

If a mid-market AP clerk processes 200 invoices per month and spends an average of 8 minutes per invoice on post-extraction formula work (writing verification columns, dragging formulas, reconciling subtotals), that's roughly 27 hours per month spent on calculations the extraction tool left undone. At the BLS median wage for bookkeeping clerks of $23.33/hour, the cost is over $600 per month in formula-writing labor alone. For a team of three clerks, that's more than $1,800 per month, or more than $21,600 per year, spent on Excel formulas that would be unnecessary if the calculations happened during extraction.
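The arithmetic behind that estimate is easy to reproduce. The short script below uses only the figures quoted in the paragraph above; the variable names are illustrative. The unrounded results come out slightly higher than the round numbers in the text, which are based on $600 per clerk per month.

```python
# Inputs as quoted in the text.
invoices_per_month = 200
minutes_per_invoice = 8
hourly_wage = 23.33
clerks = 3

# Hours of formula work per clerk, then cost per clerk and per team.
hours_per_clerk = invoices_per_month * minutes_per_invoice / 60
monthly_cost_per_clerk = hours_per_clerk * hourly_wage
monthly_cost_team = monthly_cost_per_clerk * clerks
annual_cost_team = monthly_cost_team * 12

print(f"Hours per clerk per month: {hours_per_clerk:.1f}")
print(f"Monthly cost per clerk:    ${monthly_cost_per_clerk:,.2f}")
print(f"Monthly cost, {clerks} clerks:   ${monthly_cost_team:,.2f}")
print(f"Annual cost, {clerks} clerks:    ${annual_cost_team:,.2f}")
```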

The extraction tool saved the team 3 minutes per page. But the formula work that followed — the line totals, the cross-checks, the conditional columns — consumed 8 more minutes that the tool never touched. The real bottleneck didn't move. It just became more visible.

Why the document extraction industry treats extraction as the finish line

The tools that dominate the market — template-based OCR, machine-learning classifiers, large vision models — are all built around a single engineering problem: "given a document image, output structured text." This is a hard problem that took decades to solve well. The teams building these tools are, understandably, organized around the problem they know how to solve.

But the engineer's definition of "done" — "the text is in a database row" — does not match the accountant's definition of done — "the numbers have been verified, calculated, and are ready for the general ledger." The extraction output is a data artifact. The accounting output is a financial artifact. The transformation from one to the other requires computation, and the extraction industry has largely left that computation to the user.

This is not a failure of individual tools. It's a structural gap in how the problem has been defined. The software industry looked at document processing and saw "OCR needs to be better." It built better OCR. Then it saw "formats are unpredictable" and built layout-agnostic AI. Each iteration made extraction faster and more accurate — but each iteration also made the post-extraction formula work more conspicuous by its absence. When extraction takes 10 seconds and formula work still takes 8 minutes, the extraction speed stops being the headline. The formula gap becomes the headline.

The most revealing evidence of this gap is how AP teams actually use their extraction tools. They extract. They export to Excel. And then they add columns — not because the extraction missed data, but because the tool doesn't compute. They add the Quantity × Unit Price column. They add the variance column. They add the approval-flag column. They add the standardized-date column. The spreadsheet they send to the accounting system has twice as many columns as the extraction tool produced. Half the columns are extraction output. The other half are formulas someone wrote at 4:00 PM on a Tuesday.


The calculation gap in practice: when your invoice total doesn't match

To see why post-extraction formulas aren't just tedious but structurally risky, consider the most common reconciliation failure in AP: the invoice total mismatch.

A vendor sends an invoice with twelve line items. The extraction tool captures every field correctly: twelve descriptions, twelve quantities, twelve unit prices, twelve line totals, one subtotal, one tax amount, one invoice total. All numbers are accurate to the original document. But when you sum the twelve extracted line totals, they come to $3,847. The printed subtotal on the invoice says $3,812. The difference is $35.

The error is not in the extraction. It's in the vendor's invoice — a line item was mispriced, or a discount was applied inconsistently, or a rounding decision produced a discrepancy. But the extraction tool has no mechanism for detecting this. It faithfully reproduced the vendor's numbers without verifying them. The detection happens in Excel, when someone writes =SUM(F2:F13) and compares it against cell F15. If nobody writes that formula — or if the formula is written correctly but only applied to the first page of a multi-page invoice — the $35 discrepancy survives. It enters the general ledger. It becomes a reconciliation item three months later, at which point finding the source invoice and verifying the line arithmetic costs more in labor than the $35 itself.
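The check described above, summing every extracted line total and comparing against the printed subtotal, can be sketched as follows. The function name reconcile_subtotal and the tolerance parameter are hypothetical, and the twelve line totals are invented for illustration so that they sum to the $3,847 in the scenario; they are not taken from a real invoice.

```python
from decimal import Decimal


def reconcile_subtotal(line_totals, printed_subtotal, tolerance=Decimal("0.01")):
    """Sum every line total (all pages) and compare with the printed subtotal."""
    computed = sum(line_totals, Decimal("0"))
    variance = computed - printed_subtotal
    return {
        "computed": computed,
        "printed": printed_subtotal,
        "variance": variance,
        # Flag anything beyond the tolerance for human review.
        "needs_review": abs(variance) > tolerance,
    }


# Hypothetical line totals chosen to match the scenario in the text.
LINE_TOTALS = [
    Decimal(v) for v in (
        "420.00", "315.50", "289.00", "512.25", "198.75", "305.00",
        "276.50", "410.00", "330.00", "245.00", "295.00", "250.00",
    )
]

if __name__ == "__main__":
    result = reconcile_subtotal(LINE_TOTALS, Decimal("3812.00"))
    print(
        f"Sum of lines: ${result['computed']}. "
        f"Printed subtotal: ${result['printed']}. "
        f"Variance: ${result['variance']}. "
        f"Needs review: {result['needs_review']}"
    )
```

Because the function takes the full list of line totals rather than a cell range, it cannot silently stop at the end of the first page; the $35 variance is reported and flagged for review.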

This scenario is not rare. It is the default condition of any extraction workflow that doesn't include computation. Every invoice becomes a math problem that someone has to set up and solve manually in a spreadsheet. At low volumes, the math is manageable. At 200 invoices per month, the math becomes a full-time task that nobody is officially assigned to. At 500 invoices per month, the math becomes a risk — because errors that are caught 95% of the time are not caught the other 5%, and the 5% that slip through are the ones that matter.

The extraction error rate for modern AI tools is under 1% for printed text on standard documents. The post-extraction calculation error rate — formula mistakes, missed rows, misaligned SUM ranges — has no published benchmark, because nobody measures it. But every AP manager knows it's higher than 1%.

Moving the calculation step from Excel back into extraction

If the problem is that extraction produces raw values and computation happens afterward in a separate tool, the logical solution is to collapse the two steps into one. Instead of "extract first, calculate later in Excel," the calculation runs at the moment of extraction — while the AI is reading the document and writing the output table.

This is the mechanism behind what ImageToTable.ai calls Computed Columns. When you define the columns you want extracted from a document, you don't have to limit yourself to fields that exist on the page. You can define columns whose values are derived from other extracted fields through calculation. The AI reads the document, extracts the source values, performs the computation, and writes the result directly into the output — all in a single pass. No separate spreadsheet. No formula bar. No dragging cells.

For an invoice, the practical applications are immediate:

Line Total verification. Define a computed column Computed Line Total (Qty × Unit Price). For every line item on the invoice, the AI multiplies quantity by unit price and outputs the result. Compare it against the printed line total column — any discrepancy is visible in the output, not in a formula you forgot to write.
Subtotal reconciliation. Define a computed column that sums all extracted line totals and compares the result against the printed subtotal. The output is not a raw number — it's a reconciliation: "Sum of lines: $3,847. Printed subtotal: $3,812. Variance: $35." The calculation that used to require a chain of Excel formulas is baked into the extraction itself.
Tax verification. Define a computed column Expected Tax (Subtotal × 0.0825) using a fixed tax rate parameter. Compare against the printed tax amount. If the vendor applied the wrong rate, the variance is flagged before the data ever reaches Excel.
Budget flags. Define a computed column that checks whether the invoice total exceeds a reference value: Budget Check (Invoice Total > PO Amount). The output is "Over Budget" or "OK" — a conditional flag generated during extraction, not added afterward.
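To make the four applications above concrete, here is a plain Python sketch of the same calculations applied to already-extracted values. It is not ImageToTable.ai's implementation or API; the LineItem class, the add_computed_columns function, and the half-up rounding to cents are assumptions made for illustration. The 0.0825 tax rate is the fixed parameter used in the example above.

```python
from dataclasses import dataclass
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


@dataclass
class LineItem:
    description: str
    quantity: Decimal
    unit_price: Decimal
    printed_total: Decimal


def add_computed_columns(items, printed_subtotal, printed_tax,
                         invoice_total, po_amount,
                         tax_rate=Decimal("0.0825")):
    """Return per-row computed columns plus invoice-level checks."""
    rows = []
    for item in items:
        # Computed Line Total (Qty x Unit Price), rounded to cents.
        computed = (item.quantity * item.unit_price).quantize(CENT, ROUND_HALF_UP)
        rows.append({
            "description": item.description,
            "printed_total": item.printed_total,
            "computed_line_total": computed,
            "line_variance": computed - item.printed_total,
        })

    # Subtotal reconciliation: sum of extracted line totals vs printed subtotal.
    sum_of_lines = sum((item.printed_total for item in items), Decimal("0"))
    # Expected Tax (Subtotal x rate), rounded to cents.
    expected_tax = (printed_subtotal * tax_rate).quantize(CENT, ROUND_HALF_UP)

    summary = {
        "sum_of_lines": sum_of_lines,
        "subtotal_variance": sum_of_lines - printed_subtotal,
        "expected_tax": expected_tax,
        "tax_variance": printed_tax - expected_tax,
        # Budget Check (Invoice Total > PO Amount).
        "budget_check": "Over Budget" if invoice_total > po_amount else "OK",
    }
    return rows, summary
```

A reviewer reading this output sees the variances and the flag directly; deciding what a nonzero variance means is still their call.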

Computed columns don't eliminate the need to verify. They eliminate the need to calculate in order to verify. The AI does the arithmetic. The AP clerk reviews the result. The distinction matters because calculation is rote work — error-prone when done manually at scale — and review is judgment work, which humans do better. Moving the calculation upstream means the human spends their 8 minutes per invoice on the part machines can't do: deciding what the variance means and what action to take.

This capability exists in two forms. For quick use, you can write the computation directly into the column name, for example Line Total (Qty × Unit Price), and the AI parses the logic from natural language. For more complex, multi-step derivations, logged-in users can define the computation in a structured JSON rule format, keeping column names clean while the calculation logic is expressed precisely. Both approaches produce the same result: a column in your output table whose values were computed during extraction, not added later. For teams processing invoices in volume, batch invoice data extraction with computed columns turns what used to be hours of post-processing formula work into something that finishes as part of the extraction run itself.


Frequently asked questions

How much time does post-extraction formula work actually consume?

For a mid-market AP team processing 200 invoices per month, post-extraction calculations — line total verification, subtotal reconciliation, conditional flags, date standardization — consume approximately 25-30 hours per month, based on an average of 8 minutes of formula work per invoice. This is formula work that exists after the extraction tool has already done its job. The extraction itself takes seconds per page. The formulas take minutes per invoice. As extraction speed improves, the formula gap becomes proportionally larger, not smaller.

Can't I just use Excel templates to automate these formulas?

Pre-built Excel templates reduce the setup time per batch but don't eliminate the manual steps. The template still needs to be applied to each extraction output: importing data, ensuring column alignment hasn't shifted, verifying that formulas reference the correct rows. Templates help with the writing of formulas but not with the validation. A SUM formula that captures rows 2 through 13 works perfectly for an invoice with 12 line items, but when an invoice has 13, row 14 is silently excluded. Templates reduce formula labor but don't eliminate the need for formula review, and the review is what consumes the real time.
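The fixed-range failure is easy to reproduce outside Excel. The sketch below mimics a template's =SUM(F2:F13) with a hypothetical fixed_range_sum function and contrasts it with a sum over every row; both function names and the sample values are illustrative only.

```python
from decimal import Decimal


def fixed_range_sum(line_totals, last_row=13):
    """Mimic =SUM(F2:F<last_row>): row 1 is the header, data starts at row 2."""
    max_items = last_row - 1  # rows 2..13 hold at most 12 line items
    return sum(line_totals[:max_items], Decimal("0"))


def full_sum(line_totals):
    """Sum every line item, however many rows the invoice has."""
    return sum(line_totals, Decimal("0"))


if __name__ == "__main__":
    # Thirteen hypothetical line totals of 10.00 each; the last lands in row 14.
    totals = [Decimal("10.00")] * 13
    print("Template range:", fixed_range_sum(totals))
    print("All rows:      ", full_sum(totals))
```

Nothing in the template's result signals that a row was dropped, which is why the range itself has to be reviewed on every invoice that is longer than the template expects.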

Does ImageToTable.ai's Computed Columns work with handwritten invoices?

Yes — Computed Columns operate on whatever values the AI extracts from the document, whether the source is printed or handwritten. If the AI can read the quantity and unit price from a handwritten invoice, it can multiply them during extraction just as it would for a printed invoice. The accuracy of the computation depends on the accuracy of the underlying extraction; if a handwritten number is misread, the computed result will inherit that error. The AI's handwriting accuracy varies with legibility — clearly written numbers on standard forms are reliably extracted; dense, cursive scrawl on unstructured layouts may require review.

What kinds of calculations can Computed Columns handle?

Computed Columns support row-level arithmetic (multiply, divide, add, subtract between fields on the same row), cross-row aggregation (sum all line totals within a document), conditional logic (output "Over Budget" if invoice total exceeds a threshold, otherwise "OK"), fixed parameter references (embed a tax rate or reference value in the calculation rule without needing the document to contain it), and multi-step derivations (compute a subtotal from line items, then apply tax, then compare against the printed total). For simple computations, write the logic directly in the column name. For complex, multi-step calculations, use the JSON Rule Format available to logged-in users.

Does this replace the need for a human to review invoices?

No — and that's not the goal. Computed Columns replace the calculation step, not the review step. A human still needs to look at the output and decide what a variance means: is a $35 discrepancy an acceptable rounding artifact or a billing error that needs a credit memo? The value of Computed Columns is that the human arrives at that decision faster because the arithmetic has already been done. Instead of spending 5 minutes setting up formulas to discover the $35 discrepancy, the reviewer sees it immediately in the output and spends their 5 minutes deciding what to do about it.

What if I need a calculation that Computed Columns doesn't support?

Computed Columns cover the most common post-extraction calculations: arithmetic, summation, comparison, and conditional logic. For highly specialized calculations — actuarial formulas, multi-currency forex conversions at live rates, depreciation schedules — Excel or a dedicated financial system remains the appropriate tool. Computed Columns are designed to handle the 90% of post-extraction work that is repetitive and formulaic, not to replace every spreadsheet function in existence. For most invoice processing workflows, that 90% represents the bulk of the time spent.

See how your next invoice processes with computed totals

Upload an invoice. Add a computed column. Watch the calculations happen during extraction — not after.
