Can AI Extract Data from NF-e XML? Yes: Smart Parsing, Not OCR

Yes. AI can extract data from Brazilian NF-e (Nota Fiscal Eletrônica) XML files — reading supplier CNPJ, product NCM codes, ICMS/IPI tax values, and line-item details. But NF-e is a special case: the data is already structured in XML. Extraction here means intelligently parsing the XML schema and mapping fields to readable spreadsheet columns, not OCR. Each supplier's NF-e follows the same government schema yet contains different optional fields, tax configurations, and version-specific elements that make manual consolidation across dozens of suppliers a recurring headache.

Key Takeaways

  1. Government-standardized NF-e XML data should be trivially machine-readable — yet most Brazilian finance teams still spend two days per month manually consolidating fields from 30 suppliers who each use a different ERP.
  2. An NF-e parsing script that works perfectly on version 4.0 can break silently on version 2.0 because a field it expects does not exist there: the XML is valid, the field is absent, and a script built around fixed paths returns an empty value instead of reporting the gap.
  3. Semantic extraction reads fields by what they mean — Supplier CNPJ or ICMS Value — not by where they sit in the XML tree, so one set of column definitions extracts the same data from every NF-e regardless of which supplier sent it or which version they used.

How NF-e XML Extraction Works — and Why You Still Need "Extraction"

If NF-e data is already in XML, why not write an XSLT stylesheet and be done with it? Because you never receive just one NF-e format.

Brazil's NF-e system — created by Ajuste SINIEF 07/05 and now mandatory for virtually all B2B transactions — defines a government-standard XML schema (currently at version 4.0). Every electronic invoice carries the same root structure: issuer CNPJ and company name, recipient data, line items with NCM classification and CFOP codes, and four separate tax blocks for ICMS (state VAT), IPI (federal excise tax), PIS, and COFINS.

The problem surfaces when you receive XML from 30 suppliers in one month. Each uses a different ERP — TOTVS, Sankhya, Omie, SAP Business One — and each populates different optional fields. One includes freight details; another leaves them out. One uses NF-e 4.0 with expanded totalization; another still operates on 3.10.

Traditional XML-parsing approaches — XSLT, Python scripts, Power Query imports — break when fields are absent or namespaces shift. AI reads the XML semantically, identifying fields by what they represent, not where they sit in the tree. This is Custom Column Extraction applied to structured data — you define the output columns you want ("Supplier CNPJ," "NCM Code," "ICMS Value"), and AI locates matching data regardless of optional fields or version differences.
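To make the failure mode concrete, here is a minimal Python sketch, not the product's implementation. It assumes the standard NF-e namespace and a tiny hand-written sample; the helper names local_name and find_in_block are hypothetical. A fixed path that ignores the namespace finds nothing and raises no error, while a lookup keyed on block and field names still locates the value.

```python
import xml.etree.ElementTree as ET

NFE_NS = "http://www.portalfiscal.inf.br/nfe"

# Hand-written sample for illustration only, not a complete NF-e.
SAMPLE = f"""<nfeProc xmlns="{NFE_NS}" versao="4.00">
  <NFe><infNFe versao="4.00">
    <emit><CNPJ>11222333000181</CNPJ><xNome>Fornecedor Exemplo Ltda</xNome></emit>
  </infNFe></NFe>
</nfeProc>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_in_block(root, block, field):
    """Return the text of `field` inside the first `block` that has it, else None."""
    for elem in root.iter():
        if local_name(elem.tag) == block:
            for child in elem.iter():
                if local_name(child.tag) == field:
                    return child.text
    return None


root = ET.fromstring(SAMPLE)

# A fixed path without the namespace matches nothing and fails silently.
brittle = root.find("NFe/infNFe/emit/CNPJ")

# Matching on block and field names works regardless of wrapper or namespace.
robust = find_in_block(root, "emit", "CNPJ")
```

Name-based matching is still far simpler than semantic extraction, but it shows why lookups tied to one exact tree shape are the fragile part.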

What AI Gets Right on NF-e XML

The structured nature of NF-e XML means AI extraction accuracy is higher than on image-based documents — often exceeding 99% for core standardized fields. The format constraints work in AI's favor in three ways.

CNPJ and CPF Tax IDs

Every NF-e XML contains the issuer's CNPJ (Cadastro Nacional da Pessoa Jurídica, the 14-digit federal tax ID) at a fixed position within the <emit> block; individuals are identified by an 11-digit <CPF> tag instead. In the XML the CNPJ is stored as 14 unformatted digits, and the XX.XXX.XXX/XXXX-XX mask appears only on the printed DANFE. The fixed length and predictable XML path make extraction essentially error-free. CNPJ extraction accuracy on NF-e 3.10 and 4.0 XML exceeds 99.5%, because the structured format eliminates the character-recognition ambiguity that plagues scanned paper invoices.
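Because the last two digits of a CNPJ are check digits, an extracted value can be verified independently of the extraction step. The sketch below assumes the traditional all-numeric CNPJ and the standard modulo-11 weights; the function names are illustrative.

```python
import re

# Modulo-11 weights for the first and second check digit.
WEIGHTS_FIRST = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
WEIGHTS_SECOND = [6] + WEIGHTS_FIRST


def _check_digit(numbers: str, weights) -> str:
    remainder = sum(int(n) * w for n, w in zip(numbers, weights)) % 11
    return "0" if remainder < 2 else str(11 - remainder)


def cnpj_check_digits(base12: str) -> str:
    """Compute the two check digits for the first 12 digits of a CNPJ."""
    first = _check_digit(base12, WEIGHTS_FIRST)
    second = _check_digit(base12 + first, WEIGHTS_SECOND)
    return first + second


def is_valid_cnpj(value: str) -> bool:
    """Accept masked or unmasked input; reject wrong length or repeated digits."""
    digits = re.sub(r"[^0-9]", "", value)
    if len(digits) != 14 or len(set(digits)) == 1:
        return False
    return digits[12:] == cnpj_check_digits(digits[:12])
```

Running this over the "Supplier CNPJ" column catches a truncated or mistyped ID before it reaches the ERP import.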

NCM Codes

NCM (Nomenclatura Comum do Mercosul) codes — the 8-digit product classification used across Mercosur countries — sit in their own <NCM> tag within each line item. For businesses filing SPED Fiscal (Sistema Público de Escrituração Digital — Brazil's digital tax bookkeeping system), accurate NCM extraction from incoming purchase NF-e is critical: wrong codes trigger audit flags. AI achieves 98-99% accuracy because the code follows a rigid 8-digit numeric pattern in a dedicated XML tag.
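A small format check is usually enough to catch a damaged NCM value. This sketch assumes the code arrives as eight plain digits and splits it into its hierarchical parts (the first six digits follow the Harmonized System). The function name split_ncm and the output keys are illustrative.

```python
import re


def split_ncm(code: str):
    """Return the hierarchical parts of an 8-digit NCM, or None if malformed."""
    if not re.fullmatch(r"[0-9]{8}", code or ""):
        return None
    return {
        "chapter": code[:2],        # HS chapter
        "heading": code[:4],        # HS heading
        "hs_subheading": code[:6],  # HS subheading
        "ncm": code,                # full Mercosur code
    }
```

The check only confirms the shape of the code; whether the classification is correct for the product still needs a lookup against the official NCM table.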

Tax Values (ICMS, IPI, PIS, COFINS)

A single NF-e can carry four separate taxes, each with its own calculation basis, rate, and final value — unusually tax-heavy compared to invoices from other countries. The tax sections are cleanly separated XML blocks, and AI maps each to its output column with high reliability. On NF-e where all tax sections are populated, ICMS value accuracy reaches 99%+ — higher than manual data entry, which introduces transposition errors.
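The relationship between basis, rate, and value can be checked arithmetically. The sketch below assumes a plain ICMS calculation (value = base × rate / 100, rounded to centavos) and a one-centavo tolerance. Regimes such as reduced base, ICMS-ST, or Simples Nacional compute the tax differently, so a mismatch is a prompt for review, not proof of an error. The function name is illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def icms_matches(base: str, rate_percent: str, declared: str,
                 tolerance: Decimal = CENT) -> bool:
    """True when declared ICMS is within tolerance of base * rate / 100."""
    expected = (Decimal(base) * Decimal(rate_percent) / Decimal(100)).quantize(
        CENT, rounding=ROUND_HALF_UP
    )
    return abs(expected - Decimal(declared)) <= tolerance
```

Decimal is used instead of float so that monetary values read from the XML as text compare exactly.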

Where AI Struggles with NF-e XML

The structure that makes NF-e extraction accurate also creates edge cases. Three scenarios reduce reliability.

Cross-Version Schema Differences

NF-e has evolved through multiple versions — 1.0, 2.0, 3.10, and 4.0 (current). Each revision added, removed, or renamed XML tags. When AI encounters an older NF-e 2.0 XML where a field simply doesn't exist, it correctly leaves the cell empty — but that empty cell can break downstream spreadsheet formulas expecting a value. The fix: batch older-version XML separately and apply post-extraction validation to flag missing fields.
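A post-extraction validation pass might look like the following sketch. It reads the versao attribute from <infNFe> and lists which expected columns have no value. The column-to-path mapping is an illustrative choice; <vFCP> is used as an example of a totals field that, to my understanding, was introduced with layout 4.00, so treat that detail as an assumption.

```python
import xml.etree.ElementTree as ET

NFE_NS = "http://www.portalfiscal.inf.br/nfe"
NS = {"nfe": NFE_NS}

# Output column -> path relative to <infNFe> (illustrative selection).
EXPECTED = {
    "Supplier CNPJ": "nfe:emit/nfe:CNPJ",
    "NF-e Total": "nfe:total/nfe:ICMSTot/nfe:vNF",
    "FCP Total": "nfe:total/nfe:ICMSTot/nfe:vFCP",
}


def audit(xml_text: str) -> dict:
    """Report the schema version and any expected columns with no value."""
    root = ET.fromstring(xml_text)
    inf = next(root.iter(f"{{{NFE_NS}}}infNFe"), None)
    if inf is None:
        return {"version": None, "missing": list(EXPECTED)}
    missing = [
        column
        for column, path in EXPECTED.items()
        if not (inf.findtext(path, default="", namespaces=NS) or "").strip()
    ]
    return {"version": inf.get("versao"), "missing": missing}


# Hand-written 3.10-style fragment for illustration only.
OLD_SAMPLE = f"""<NFe xmlns="{NFE_NS}"><infNFe versao="3.10">
  <emit><CNPJ>11222333000181</CNPJ></emit>
  <total><ICMSTot><vNF>100.00</vNF></ICMSTot></total>
</infNFe></NFe>"""

report = audit(OLD_SAMPLE)
```

Grouping files by the reported version and reviewing the "missing" list per group separates expected gaps from genuine extraction problems.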

Optional Fields and Service-Only NF-e

Many NF-e fields are optional. Service invoices omit product-related fields entirely — no NCM codes, no IPI. When AI processes a mixed batch, it correctly leaves inapplicable columns empty, but if your spreadsheet assumes every row has an NCM code, service rows appear incomplete. Define columns that cover both scenarios — "NCM Code (product NF-e only)" — to set expectations.

XML + DANFE Mixed Workflows

The DANFE (Documento Auxiliar da NF-e) is the printed companion PDF. Many smaller Brazilian suppliers send only the DANFE, not the underlying XML. DANFE PDFs require image-based AI extraction at 90-95% accuracy — lower than the 99%+ from direct XML parsing. The best practice: request XML from every supplier and treat DANFE-only files as a separate, lower-confidence batch.
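Separating the two batches can be as simple as routing by file extension before anything is uploaded. This is a minimal sketch that assumes suppliers name files conventionally (.xml for the NF-e, .pdf for the DANFE). The function name and batch labels are hypothetical.

```python
from pathlib import Path


def split_batches(paths):
    """Group file names into XML, DANFE PDF, and everything else."""
    batches = {"xml": [], "danfe_pdf": [], "other": []}
    for path in map(Path, paths):
        suffix = path.suffix.lower()
        if suffix == ".xml":
            batches["xml"].append(path.name)
        elif suffix == ".pdf":
            batches["danfe_pdf"].append(path.name)
        else:
            batches["other"].append(path.name)
    return batches
```

Files landing in the PDF batch are the ones to review more carefully, or to replace by asking the supplier for the XML.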

How to Get Best Results from NF-e XML Extraction

Five steps that make a measurable difference when working with Brazilian electronic invoices.

  1. Define semantic column names, not XML paths. Use "Supplier CNPJ," "NCM Code," "ICMS Value" rather than XPath strings like /nfeProc/NFe/infNFe/emit/CNPJ. AI resolves these semantically, finding the CNPJ whether it's at the NF-e 4.0 position or a slightly different NF-e 3.10 location. This is Custom Column Extraction applied to structured data.
  2. Request XML, not DANFE PDFs. This single habit change produces a 5-10 percentage point accuracy improvement. Brazilian law requires suppliers to provide the XML, so send new suppliers: "Por favor, enviar o arquivo XML da NF-e juntamente com o DANFE." ("Please send the NF-e XML file together with the DANFE.")
  3. Group NF-e by version when batch-processing. Separate NF-e 4.0 XML from older 3.10 or 2.0 files. The current schema version populates more fields, so processing them together means older-version rows have more empty cells, which can look like extraction failures. Grouping by version lets you review each batch with the right expectations.
  4. Use computed columns for tax validation. Brazilian taxes create built-in audit checks. Define a computed column verifying ICMS value ≈ ICMS base × ICMS rate, so AI flags discrepancies during extraction and you do not discover them later in your accounting system.
  5. Spot-check the totals block. The <total> section contains definitive summed values. After extraction, verify that line-item totals match the XML's declared total; a mismatch flags an error faster than reviewing every field. On clean XML, fewer than 2% of NF-e fail this check.
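The totals spot-check in the last step can be sketched as follows. It assumes the usual NF-e layout where each line item carries <prod>/<vProd> and the declared product total sits in <total>/<ICMSTot>/<vProd>. It ignores special cases such as items flagged as not counting toward the total. The helper names and sample values are illustrative.

```python
from decimal import Decimal
import xml.etree.ElementTree as ET

NFE_NS = "http://www.portalfiscal.inf.br/nfe"
NS = {"nfe": NFE_NS}


def totals_match(xml_text: str, tolerance: Decimal = Decimal("0.01")):
    """Compare the sum of line-item vProd with the declared total.

    Returns True or False, or None when the totals block is absent.
    """
    root = ET.fromstring(xml_text)
    declared = root.findtext(".//nfe:total/nfe:ICMSTot/nfe:vProd", namespaces=NS)
    if declared is None:
        return None
    items = root.iterfind(".//nfe:det/nfe:prod/nfe:vProd", NS)
    line_sum = sum((Decimal(item.text) for item in items), Decimal("0"))
    return abs(line_sum - Decimal(declared)) <= tolerance


def sample_nfe(declared_total: str) -> str:
    """Build a tiny two-item fragment for illustration only."""
    return f"""<NFe xmlns="{NFE_NS}"><infNFe versao="4.00">
  <det nItem="1"><prod><vProd>60.00</vProd></prod></det>
  <det nItem="2"><prod><vProd>40.00</vProd></prod></det>
  <total><ICMSTot><vProd>{declared_total}</vProd></ICMSTot></total>
</infNFe></NFe>"""
```

A False result points at one file to inspect, which is quicker than reading every extracted row.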

Real-World Scenarios

Multi-Supplier NF-e Consolidation for SPED Fiscal

A mid-sized manufacturer in São Paulo receives 30-50 NF-e XMLs monthly from raw material suppliers — steel from Gerdau, electrical components from WEG, packaging from local vendors. Each NF-e carries different ICMS rates (7% to 18% depending on the originating state) and varying field completeness. Manual entry took an AP clerk two full days per month.

With AI extraction, uploading all XML files into a batch produces a consolidated spreadsheet with columns: Supplier CNPJ, NF-e Number, Issue Date, NCM Code, Product Description, Quantity, Unit Price, ICMS Base, ICMS Value, NF-e Total — ready for import into the company's TOTVS ERP. Two days of work becomes three minutes, and ICMS values cross-validate against the XML totals block, catching errors before they reach SPED.

NCM Extraction for Import Duties

A logistics company handling imports needs NCM codes and product values from supplier NF-e to calculate import duties. Each NF-e contains 5-20 line items with different classifications. AI extracts one row per line item in seconds — formatted for the customs broker's declaration template.
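As a rough picture of the one-row-per-line-item output, the sketch below walks each <det> element and emits a dictionary per item, repeating the supplier CNPJ on every row. The column names, sample values, and the line_item_rows helper are illustrative assumptions, and absent fields become empty strings instead of raising errors.

```python
import xml.etree.ElementTree as ET

NFE_NS = "http://www.portalfiscal.inf.br/nfe"
NS = {"nfe": NFE_NS}

# Output column -> tag inside <prod> (illustrative selection).
ITEM_COLUMNS = {
    "NCM Code": "nfe:NCM",
    "Product Description": "nfe:xProd",
    "Quantity": "nfe:qCom",
    "Line Total": "nfe:vProd",
}


def line_item_rows(xml_text: str):
    """Return one dict per <det> line item, with empty strings for gaps."""
    root = ET.fromstring(xml_text)
    cnpj = root.findtext(".//nfe:emit/nfe:CNPJ", default="", namespaces=NS)
    rows = []
    for det in root.iterfind(".//nfe:det", NS):
        row = {"Supplier CNPJ": cnpj, "Item": det.get("nItem", "")}
        for column, tag in ITEM_COLUMNS.items():
            row[column] = det.findtext(f"nfe:prod/{tag}", default="", namespaces=NS)
        rows.append(row)
    return rows


# Hand-written fragment for illustration only.
SAMPLE = f"""<NFe xmlns="{NFE_NS}"><infNFe versao="4.00">
  <emit><CNPJ>11222333000181</CNPJ></emit>
  <det nItem="1"><prod><xProd>Vergalhao</xProd><NCM>72142000</NCM>
    <qCom>10.0000</qCom><vProd>600.00</vProd></prod></det>
  <det nItem="2"><prod><xProd>Frete interno</xProd>
    <vProd>40.00</vProd></prod></det>
</infNFe></NFe>"""

rows = line_item_rows(SAMPLE)
```

The resulting list can be handed to csv.DictWriter or a spreadsheet library to produce the broker's template.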

FAQ

Can AI distinguish between ICMS, IPI, PIS, and COFINS on the same NF-e?

Yes. Each tax has its own XML block with its own child elements: ICMS has <orig> and <CST>, IPI has <cEnq>. AI maps them to separate output columns cleanly because the XML structure disambiguates them. This is easier for AI than image-based extraction, where taxes appear as undifferentiated rows of numbers.
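The disambiguation can be illustrated by scoping each lookup to its own tax block. In this sketch the value tags (vICMS, vIPI, vPIS, vCOFINS) follow the NF-e layout as I understand it, the sample item is hand-written, and tax_columns is a hypothetical helper. Note that <CST> appears in several blocks, which is why a lookup that ignores the enclosing block would be ambiguous.

```python
import xml.etree.ElementTree as ET

NFE_NS = "http://www.portalfiscal.inf.br/nfe"
NS = {"nfe": NFE_NS}

# Tax block -> tag holding that tax's final value.
TAX_VALUE_TAGS = {"ICMS": "vICMS", "IPI": "vIPI", "PIS": "vPIS", "COFINS": "vCOFINS"}

# Hand-written line item without an IPI block, for illustration only.
ITEM = f"""<det xmlns="{NFE_NS}" nItem="1"><imposto>
  <ICMS><ICMS00><orig>0</orig><CST>00</CST><vBC>100.00</vBC>
    <pICMS>18.00</pICMS><vICMS>18.00</vICMS></ICMS00></ICMS>
  <PIS><PISAliq><CST>01</CST><vPIS>1.65</vPIS></PISAliq></PIS>
  <COFINS><COFINSAliq><CST>01</CST><vCOFINS>7.60</vCOFINS></COFINSAliq></COFINS>
</imposto></det>"""


def tax_columns(det) -> dict:
    """Map each tax block of one line item to its own output column."""
    columns = {}
    for block, value_tag in TAX_VALUE_TAGS.items():
        group = det.find(f"nfe:imposto/nfe:{block}", NS)
        if group is None:
            columns[f"{block} Value"] = ""  # block absent on this item
        else:
            # The value sits one level down (ICMS00, PISAliq, ...), so search descendants.
            columns[f"{block} Value"] = group.findtext(
                f".//nfe:{value_tag}", default="", namespaces=NS
            )
    return columns


columns = tax_columns(ET.fromstring(ITEM))
```

Searching descendants of each block, not a fixed child, keeps the lookup working across the different ICMS sub-groups such as ICMS00 or ICMSSN102.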

Does AI work with NF-e from different Brazilian states with different ICMS rates?

Yes. The ICMS rate (alíquota) is stated inside each NF-e's <ICMS> block. Whether an NF-e carries São Paulo's 18% or Rio de Janeiro's 19%, AI reads the rate directly from the XML. Cross-state ICMS-ST (Substituição Tributária) scenarios are also captured because the XML explicitly tags ICMS-ST amounts.

Can AI extract data from NF-e XML in Portuguese to an English-column spreadsheet?

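Yes. Define output columns in English, such as "Supplier CNPJ" and "Invoice Total," and AI maps the NF-e fields to those headers. NF-e tag names are fixed schema identifiers (abbreviated Portuguese such as xNome and vProd), so the mapping is the same for every file, and semantic mapping works across languages for free-text fields like product descriptions. For more, see how AI handles multilingual extraction.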

What about NFS-e (municipal service invoices)?

NFS-e (Nota Fiscal de Serviços Eletrônica) is a separate municipal-level document — each city (prefeitura) has its own schema. Unlike NF-e's federal standardization, NFS-e formats vary by municipality. AI can extract from NFS-e XML too, but per-city schema variation means more verification is needed. NF-e (federal, for goods) is the reliable one; NFS-e (municipal, for services) introduces more variables.

Is AI extraction from NF-e XML compliant with Brazilian tax record-keeping?

Extraction is a data transformation step — it doesn't alter the original XML, which remains your legal tax record. Brazilian tax authorities require retaining digitally signed NF-e XML for 5 years (prazo decadencial, CTN Art. 173). AI extraction creates a derived spreadsheet; the original, digitally signed XML stays intact.

What's the accuracy difference between NF-e XML and DANFE PDF extraction?

It's an entirely different category. NF-e XML extraction achieves 99%+ on core fields because data lives in unambiguous XML tags. DANFE PDF extraction — reading the printed representation — drops to 90-95% because it becomes an image-understanding problem: font variations, print quality, and column alignments introduce the same errors as any scanned document. Always prefer XML over DANFE when both are available.

The Bottom Line

NF-e XML extraction isn't an AI capability question — it's a workflow decision. The structured format makes extraction more accurate than any image-based document could be, but that structure can be deceptive: "it's just XML" makes the consolidation problem look simpler than it is. The real work — mapping inconsistent fields across 30 suppliers, four NF-e versions, and multiple tax configurations — is repetitive pattern-matching that AI automates better than any XSLT script or Excel macro.

The question isn't whether AI can extract NF-e XML. It's whether you want to spend your afternoon tracing <ICMS><ICMSSN102><orig> paths through 200 files or let AI map CNPJ, NCM codes, and ICMS values to a spreadsheet in under a minute.
