Scanned PDF to Excel: Extract the Columns You Name, Not Everything on the Page
Layout converters dump your scan's visual structure into broken rows and merged cells. Generic data extractors pull every field — then you spend time filtering. Column-name extraction gives you exactly the rows and columns you asked for, in a spreadsheet that's ready to use.
5-10s per page · Up to 99% accuracy on printed text
What You Can Extract from Any Scanned PDF
Type the column names you want. The AI locates those values anywhere on the scanned page by understanding what they mean, not where they happen to sit. Works across vendor formats and a wide range of scan quality; severely degraded scans still warrant a spot-check.
These are examples of column names you type. The AI finds matching values on every scanned page — output is one clean spreadsheet.
Two Problems Stack Up in a Scanned PDF — Most Tools Solve One and Ignore the Other
Scanned PDFs have no text layer — just an image. That creates two compounding problems: recognizing characters from pixels, then understanding which value belongs to which field. Here's where common approaches break down, and where column-name extraction starts from a different premise entirely.
Where Standard Approaches Break Down
Layout converters treat scans like digital PDFs. They reconstruct the visual grid — which looks right on screen but stores amounts as text strings, breaks multi-line rows, and produces merged cells. The output requires manual cleanup before any data can be filtered or summed.
Generic data extractors pull everything — you still filter manually. A scanned invoice might surface 40+ detected values: vendor header, all 14 line items, three tax rows, footer notes, and page numbers. You get a complete dump, then spend time deleting rows you didn't need.
Template-based tools fail across vendors. A template built for one supplier's invoice format produces wrong output the moment another supplier uses a different layout — which is always. Multi-vendor batches require a separate template per format.
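To make the first point above concrete, here is a minimal sketch of the cleanup a layout converter leaves you with when amounts arrive as text strings. It assumes US-style formatting (dollar sign, comma thousands separators, parentheses for negatives); the sample cells and the parse_amount helper are made up for illustration and are not part of any tool described here.

```python
from decimal import Decimal, InvalidOperation

def parse_amount(text):
    """Convert a text amount such as '$1,234.50' to Decimal, or None if it cannot be parsed."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    # Accounting-style negatives: "(120.00)" means -120.00
    if cleaned.startswith("(") and cleaned.endswith(")"):
        cleaned = "-" + cleaned[1:-1]
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None

# Made-up cells as a layout converter might store them: all text, not numbers
cells = ["$1,234.50", " 89.00", "(120.00)", "N/A"]
amounts = [parse_amount(c) for c in cells]

# Only after conversion can the column be summed
total = sum(a for a in amounts if a is not None)
```

Until every cell has gone through a step like this, a SUM over the column in a spreadsheet either ignores the text cells or returns zero.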
How Column-Name Extraction Works
You define the output shape before extraction begins. Type the column names you want — Vendor Name, Invoice #, Line Item, Amount, Due Date — and the AI treats those as the target. It doesn't reconstruct the page layout; it reads for meaning and fills only what you asked for.
Vision models read semantically, not positionally. "Invoice Number" is understood as a concept. Whether it appears top-right, bottom-left, or mid-page — and whether the scan is slightly skewed or the font is non-standard — the model finds the value next to that label because it understands what an invoice number is.
One column definition handles every vendor in the batch. Upload 50 scanned invoices from 30 different suppliers. Your six column names apply to all of them — no per-vendor templates, no per-format setup. Processing takes 5-10 seconds per page (vs ~3 minutes manual entry per page).
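The timing figures above can be worked through for the 50-invoice batch. This is a back-of-the-envelope sketch that assumes one page per invoice, which the example does not state; the variable names are illustrative.

```python
MANUAL_SECONDS_PER_PAGE = 3 * 60   # ~3 minutes of manual entry per page
TOOL_SECONDS_PER_PAGE = (5, 10)    # fast and slow ends of the 5-10 second range
pages = 50                         # assumption: one page per scanned invoice

# Total time for the batch, in minutes
manual_minutes = pages * MANUAL_SECONDS_PER_PAGE / 60
tool_minutes = tuple(pages * s / 60 for s in TOOL_SECONDS_PER_PAGE)

# Speedup at the fast and slow ends of the range
speedup = tuple(MANUAL_SECONDS_PER_PAGE / s for s in TOOL_SECONDS_PER_PAGE)
```

Under these assumptions the batch takes about 4 to 8 minutes instead of 150, a speedup between 18x (at 10 seconds per page) and 36x (at 5 seconds per page).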
How to Extract Specific Fields from a Batch of Scanned Invoices
Upload Your Scanned PDFs
You have a folder of scanned invoices: some high-resolution flatbed scans, some photographed with a phone, a few that came through fax. Formats can be PDF, JPG, PNG, or WebP — mixed formats in one batch are fine. No pre-processing or de-skewing needed before upload.
Type Your Column Names Once
Enter Vendor Name, Invoice Number, Invoice Date, Line Item Description, Amount, Tax, Total. The AI applies these column definitions to every document in the batch — it doesn't need to know the layout of each vendor's format. It reads each scan and locates those values by understanding their meaning.
Download One Merged Excel File
Each scanned page becomes a row. The columns are exactly the ones you defined — no extra columns, no blank rows from failed layout reconstruction. If a field wasn't found on a particular page, the cell is empty rather than filled with a wrong value. Export as XLSX, CSV, or JSON.
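The shape of the merged output can be sketched with Python's standard csv module. This is an illustration of the row-per-page, fixed-column idea only, not the tool's actual implementation; the per-page dictionaries and vendor values are made-up sample data.

```python
import csv
import io

# The column names you typed define the output headers, in order
COLUMNS = ["Vendor Name", "Invoice Number", "Invoice Date", "Total"]

# Hypothetical per-page results (made-up data). The second page has no
# Invoice Date and carries an extra field nobody asked for.
page_results = [
    {"Vendor Name": "Acme Supply", "Invoice Number": "INV-1042",
     "Invoice Date": "2024-03-01", "Total": "1234.50"},
    {"Vendor Name": "Northwind", "Invoice Number": "7781",
     "Total": "89.00", "Footer Note": "Thank you"},
]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=COLUMNS,
    restval="",             # a field that was not found becomes an empty cell
    extrasaction="ignore",  # fields outside the requested columns are dropped
)
writer.writeheader()
writer.writerows(page_results)

merged_csv = buffer.getvalue()
rows = merged_csv.splitlines()
```

The missing Invoice Date shows up as an empty cell in the second row, and the unrequested footer note never reaches the file.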
When It Works — and When to Expect Lower Accuracy
Scanned documents vary widely in quality. Understanding where accuracy holds and where it degrades helps you decide when to spot-check results.
When It Works Best
Clear scans of printed documents. Flatbed scans at 150 DPI or above, or phone photos taken straight-on in good light. Up to 99% accuracy on printed text — amounts, dates, and reference numbers read reliably.
Field-value layouts with recognizable labels. Invoices, purchase orders, forms, and statements where data appears next to labeled fields like "Invoice No." or "Total Due". The AI identifies values by their labels, not by position.
Multi-vendor batches with consistent column targets. If you need the same six fields from 50 scanned invoices across 30 suppliers, one batch with one set of column names produces a merged spreadsheet without per-vendor template setup.
When to Be Cautious
Severely degraded source material. Photocopies of photocopies, fax output below ~100 DPI, or documents with heavy ink bleed will reduce accuracy. The model reads context to compensate for noise, but there's a floor — spot-check results from poor-quality sources.
Dense handwritten annotations on printed forms. Printed text on scans achieves up to 99% accuracy. Accuracy on handwriting is lower and varies by legibility: neat handwritten entries are read well, but heavy cursive or faint pencil marks need manual review.
Values embedded in unlabeled paragraphs. If the data you need is a number buried inside a sentence with no surrounding label — "the total obligation shall not exceed forty-two thousand dollars" — the AI may not reliably extract it. Field-value layouts with clear labels work best.
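For batches that include poor-quality sources, a spot-check can be narrowed to the rows most likely to need it. The sketch below is one possible approach run on the exported rows after download, not a feature of the tool: it flags rows with an empty required cell or a Total that does not look like a plain amount. The column choices, the amount pattern, and the sample rows are all assumptions for illustration.

```python
import re

REQUIRED = ["Invoice Number", "Invoice Date", "Total"]
# Plain amounts such as 89.00, 1234.50 or 1,234.50 (assumed US-style format)
AMOUNT_PATTERN = re.compile(r"-?(\d{1,3}(,\d{3})*|\d+)(\.\d{2})?")

def review_reasons(row):
    """Return a list of reasons this row should be checked by hand."""
    reasons = []
    for column in REQUIRED:
        if not row.get(column, "").strip():
            reasons.append(f"missing {column}")
    total = row.get("Total", "").strip()
    if total and not AMOUNT_PATTERN.fullmatch(total):
        reasons.append("Total is not a plain amount")
    return reasons

# Made-up exported rows: one clean, one with an empty cell, one misread digit
rows = [
    {"Invoice Number": "INV-1042", "Invoice Date": "2024-03-01", "Total": "1,234.50"},
    {"Invoice Number": "7781", "Invoice Date": "", "Total": "89.00"},
    {"Invoice Number": "A-17", "Invoice Date": "2024-03-04", "Total": "8S.00"},
]

flagged = {r["Invoice Number"]: review_reasons(r) for r in rows if review_reasons(r)}
```

A check like this catches empty cells and obviously malformed values; it cannot catch a wrong value that is well-formed, so degraded sources still call for comparing a sample of rows against the original scans.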
Frequently Asked Questions
What's the difference between a scanned PDF and a digital PDF — and does it affect how extraction works?
A digital PDF has an embedded text layer, so standard tools can select and copy text directly. A scanned PDF is a photograph of a document with no text layer, just pixels. Standard tools run OCR to guess the characters, then attempt to reconstruct the layout: two separate steps, each introducing errors. This tool uses a large vision model that reads the scan the way a person would, handling recognition and structure understanding in one pass. Accuracy reaches up to 99% on clearly printed text.
Can I choose which columns to extract — like Invoice Number and Total — or does it pull everything?
You choose the columns. Type the field names you want — Invoice Number, Vendor Name, Line Item Description, Amount — and the AI extracts only those values from each scanned page. The column names you enter become the exact headers in the output Excel file. If you don't specify columns, the AI automatically identifies the document's key fields and generates a structured table on its own — useful as a starting point to see what's extractable.
How accurate is extraction on low-quality or faded scans?
Accuracy depends on source quality. Clear flatbed scans or straight-on phone photos of printed text achieve up to 99% accuracy. Faded text, heavy compression, or scans taken at significant angles will be lower — the vision model uses surrounding context to compensate for noise, but there's a practical floor. For degraded sources, plan to spot-check the output. A clear scan taken directly from the original document is always your best input.
Can I batch process scanned PDFs from different vendors and get one merged spreadsheet?
Yes. Upload scanned PDFs from any number of vendors in one batch: different layouts, different formats, even different file types (PDF, JPG, PNG mixed). Define one set of column names and the AI applies it to every document. Each page becomes a row in the output. Processing takes 5-10 seconds per page, roughly 18x to 36x faster than manual entry (based on ~3 minutes of manual entry per page vs ~5-10 seconds here). The output is a single merged XLSX or CSV file.
What happens when my scanned PDF has both printed fields and handwritten entries?
Mixed documents (printed forms with handwritten fill-ins) are handled well when the handwriting is reasonably legible. The AI reads both printed labels and handwritten values together, treating the document as a whole rather than running separate OCR passes for print and handwriting. Neat block handwriting extracts reliably. Heavy cursive, faint pencil marks, or annotations layered over printed text will reduce accuracy on those specific fields and should be reviewed manually.