Why Clinical Data, Already Digital, Still Gets Extracted by Hand
Walk into any US hospital and the clinical documentation is, for all practical purposes, paperless. Radiology reports are generated as PDFs inside the PACS. Discharge summaries are typed into the EHR. Operative notes are dictated, transcribed, and uploaded. And yet — somewhere in the same building, a registered nurse is sitting at a computer, reading those exact same reports line by line, and manually typing data points into a clinical registry entry form, field by field, for the third time this week from the same patient's chart. The documents are digital. The data extraction is not.
Key Takeaways
- A single CABG (coronary artery bypass graft) patient generates over 200 manual data points for just one cardiac registry — and the same patient's chart typically feeds five separate registries, each demanding its own independent abstraction session from identical source documents.
- The problem is not that clinical documents are on paper — over 90% of US hospitals use an EHR (electronic health record) — it is that those records export as narrative PDFs whose fields are visible to humans but invisible to every hospital database.
- The clinical abstraction workforce — thousands of nurses and health information professionals retyping the same reports into different registries daily — is not a temporary gap in the technology stack but has become the technology stack itself, at a hidden annual cost in the billions of dollars.
The Two Parallel Worlds of Clinical Documentation
Clinical documentation exists in two information ecosystems that barely speak to each other. The first is structured data: checkboxes, dropdown menus, radio buttons. ICD-10 codes that classify a diagnosis. CPT codes that describe a procedure. Lab values that slot into a database field — hemoglobin 12.3 g/dL, creatinine 0.9 mg/dL. This is the world EHRs were built to manage. It is searchable, queryable, reportable. It is also, by volume and by clinical richness, the shallow end of the pool.
The second world is unstructured data: the narrative prose that clinicians actually generate when they describe what they saw, what they thought, and what they did. The radiology report that says "there is a 1.2 cm spiculated nodule in the right upper lobe, with associated pleural retraction — recommend CT-guided biopsy." The discharge summary that narrates a 12-day hospitalization, from presenting symptoms through complications to post-discharge instructions. The operative note that describes, in 800 words of surgical detail, exactly which vessels were bypassed, with which grafts, under what conditions. The progress note that captures a clinician's evolving assessment over three shifts.
This second world — the narrative world — contains an estimated 80% of all clinically meaningful data in the health record. It carries the reasoning behind the decision-making, the nuance that diagnostic codes flatten, the context that makes a lab value actionable rather than just a number. And it is, by default, invisible to every reporting tool, analytics platform, and automated workflow in the hospital.
The structured world answers "what happened" in shorthand. The unstructured world answers "why it happened, what it means, and what should happen next." The problem is that the machines can only read the first.
Why the EHR Didn't Fix This
There is a reasonable assumption that has persisted since the HITECH Act of 2009 drove EHR adoption from 9% to over 90% of US hospitals: electronic health records should have solved the data accessibility problem. If clinical information is digital, it should be machine-readable. If it's machine-readable, it should be queryable. If it's queryable, manual extraction should be obsolete.
The assumption breaks at the first step. EHRs are not clinical knowledge systems. They are transactional databases optimized for billing, built in an era when the primary use case for digitizing a patient encounter was generating a claim. The core engineering decision embedded in every major EHR platform — Epic, Cerner, Meditech, Allscripts — is that clinical narratives are stored as unstructured attachments, not as structured fields. A radiology report generated inside the hospital's PACS gets attached to the patient record. A discharge summary typed into a free-text box gets saved as a text blob. An operative note gets uploaded as a PDF.
The EHR stores these documents. It does not parse them. It does not index their contents. It does not map the phrase "1.2 cm spiculated nodule in the right upper lobe" to a structured data element that a query could retrieve. From the perspective of a database, the radiology report and the discharge summary and the operative note sit in the same category as a scanned copy of a 1998 paper chart: digitized but not structured, stored but not searchable.
A study published in the Journal of Medical Internet Research (2025) examined the information overlap between structured codes and free-text notes across 1.8 million patients and found that structured data alone — ICD codes, procedure codes, lab values — captured only a fraction of the clinical picture. Free-text notes contained "detailed descriptions capturing the nuances of patient care." The EHR's structured fields told you the patient had a CABG. The operative note told you how the CABG happened — which matters enormously for quality measurement, risk adjustment, and clinical research.
This is not a failure of any particular EHR vendor. It is a consequence of what EHRs were designed to do. They were built to capture structured data for billing and regulatory reporting. They were not built to extract meaning from narrative. That an estimated 80% of clinical data lives in free text is not a bug. It is the natural consequence of clinicians documenting care the way humans communicate complex information: in sentences, not in dropdowns.
An EHR makes clinical documentation digital. It does not make it structured. Extracting data from a radiology narrative stored inside Epic requires the same cognitive labor as extracting it from a typed report in a manila folder — reading, interpreting, and transcribing the relevant information into a separate system. The medium changed. The manual labor did not.
The Abstraction Workforce Nobody Talks About
Because EHRs store clinical narratives as unsearchable blobs, hospitals employ an entire professional class whose full-time job is reading those narratives and manually entering specific data points into other systems. They are called clinical data abstractors, and they represent one of the largest hidden labor costs in American healthcare.
Clinical data abstractors are typically registered nurses (RNs), Registered Health Information Technicians (RHITs), or Certified Tumor Registrars (CTRs) — licensed clinicians or credentialed health information professionals who review patient charts and extract key data elements for quality reporting, clinical registries, research, and regulatory compliance. The work requires clinical knowledge: you cannot abstract a surgical registry without understanding operative anatomy, and you cannot abstract a cardiac registry without interpreting hemodynamic data. The American Data Network, one of the largest clinical abstraction outsourcing firms, describes the abstractor's core task as reviewing "clinical notes, test results, imaging reports, and medications" and translating "those details into structured fields."
The scale of this workforce is difficult to measure precisely because abstraction is not a standardized job title; it is embedded inside quality departments, registry teams, and clinical research units. But the economics are visible at the hospital level. A 2018 presentation from Massachusetts General Hospital's registry operations team broke down the staffing for 11 surgical specialty society registries at a single academic medical center. Five of them are shown here, with cost estimates derived from the reported FTEs:
| Registry | FTEs Required | Patients/Year | Annual Staffing Cost |
|---|---|---|---|
| STS-Cardiac (Adult Cardiac Surgery) | 3 RN FTEs + 0.5 PSC | 1,300 | ~$250,000–$300,000 |
| ACS-NSQIP (Surgical Quality) | 1.5 RN FTE + analyst + manager | 1,800 | ~$120,000–$180,000 |
| ACS-NTDB and ACS-TQIP (Trauma) | 3.5 FTE staff + 0.3 manager | 2,500 | ~$250,000–$350,000 |
| STS-Thoracic | 1 RN FTE + manager | 1,000 | ~$80,000–$120,000 |
| SRTR (Solid Organ Transplant) | 7.0–10.0 RN FTEs + 1.5 manager | 750 | ~$500,000–$700,000 |
Source: Massachusetts General Hospital, CMSS presentation (2018). Estimates based on reported FTE ranges.
That is five registries at one hospital, consuming roughly $1.2 to $1.7 million in annual staffing costs — and these are just the registries for which MGH publicly shared FTE data. Most academic medical centers participate in 8 to 15 registries. The Society of Thoracic Surgeons National Database alone covers 95% of adult cardiac surgeries across the US, with each CABG case requiring abstraction of 200+ data elements spanning preoperative risk factors, intraoperative details, and 30-day post-discharge outcomes. The NCDR network — operated by the American College of Cardiology — includes over 2,400 hospitals across six registries covering cardiac catheterization, ICD implantation, valve procedures, and more.
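The range quoted above can be checked by summing the low and high ends of the five table rows. This is a minimal sketch: the variable names are illustrative, and the figures are the table's own estimates rather than audited costs.

```python
# Annual staffing cost ranges (USD), copied from the table above.
staffing_cost_ranges = {
    "STS-Cardiac": (250_000, 300_000),
    "ACS-NSQIP": (120_000, 180_000),
    "ACS-NTDB / ACS-TQIP": (250_000, 350_000),
    "STS-Thoracic": (80_000, 120_000),
    "SRTR": (500_000, 700_000),
}

# Sum the low ends and the high ends separately.
total_low = sum(low for low, _ in staffing_cost_ranges.values())
total_high = sum(high for _, high in staffing_cost_ranges.values())

print(f"${total_low:,} to ${total_high:,}")  # $1,200,000 to $1,650,000
```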
A summer 2024 survey by Carta Healthcare of clinical data abstractors across US hospitals found that 50% of respondents spend more than half their time on manual data entry and abstraction. The survey described "a troubling paradox: while clinicians view registries as vital for quality and process improvement, the burdensome task of manual data abstraction is pushing these professionals to their limits." When asked about automation, 45% believed automated tools would make abstraction faster for their organization, 30% believed they would improve data quality, and 20% said they would reduce costs. The demand for automation is coming from the abstractors themselves — the people whose jobs, in theory, automation might threaten. In practice, the volume of data to abstract is growing faster than the workforce can keep up.
On Reddit, the sentiment is blunter. A clinical research professional posted: "I just spent hours trying to enter patient data into a registry only to find out there are almost 100 patients in this registry (all behind)." Another thread on r/clinicalresearch asks, simply: "How much time is generally spent on data entry or looking into patient records for info?" — the kind of question that signals a workflow problem so baked into daily operations that nobody has a baseline answer, because the answer is "most of the day."
The economic scale becomes visible when you extrapolate. A single clinical data abstractor earning $75,000 a year who spends 50% of their time reading a report, finding a specific value, and typing it into another system represents roughly $37,500 in annual labor spent on de facto transcription. Multiplied across the abstraction workforce at a single multi-registry academic hospital, 10 to 20 FTEs, that is $375,000 to $750,000 per hospital per year. Applied to the 2,400 NCDR-participating hospitals alone, the same per-hospital range implies roughly $0.9 to $1.8 billion annually. That is a rough extrapolation, since it assumes academic-center staffing at every hospital, and it does not account for the opportunity cost of trained clinical professionals performing data transcription instead of patient-facing work.
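The extrapolation above can be reproduced step by step. Every input in this sketch is one of the illustrative assumptions stated in the paragraph (salary, time share, FTE range, hospital count), not measured data.

```python
# All inputs are the article's illustrative assumptions, not measured data.
ANNUAL_SALARY = 75_000        # USD per abstractor
TRANSCRIPTION_SHARE = 0.50    # share of time spent reading and retyping
FTES_PER_HOSPITAL = (10, 20)  # multi-registry academic hospital, low and high
NCDR_HOSPITALS = 2_400

per_abstractor = ANNUAL_SALARY * TRANSCRIPTION_SHARE
per_hospital = tuple(n * per_abstractor for n in FTES_PER_HOSPITAL)
aggregate = tuple(NCDR_HOSPITALS * cost for cost in per_hospital)

print(per_abstractor)  # 37500.0
print(per_hospital)    # (375000.0, 750000.0)
print(aggregate)       # (900000000.0, 1800000000.0)
```

The aggregate applies the academic-hospital FTE range to every NCDR hospital, so it is best read as an order-of-magnitude figure rather than a measured total.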
Clinical data abstractors are the human structuring layer between EHRs and registries. Their job exists because two systems that both hold clinical data — the EHR and the registry — cannot exchange that data without a person reading one and typing into the other. The abstraction workforce is not a temporary gap in the technology stack. It is the technology stack.
One Patient, One Chart, Five Registries — and Five Separate Data Entry Sessions
The economics of abstraction are multiplied by a structural feature of clinical registries that has no equivalent in other industries: multiple registries pull from the same source documents, and they do not share data with each other.
Consider a patient who undergoes coronary artery bypass graft surgery. The Society of Thoracic Surgeons (STS) Adult Cardiac Surgery Database requires 200+ data elements for this patient: preoperative risk factors (diabetes status, ejection fraction, prior PCI), intraoperative details (number of grafts, cross-clamp time, use of internal mammary artery), and 30-day outcomes (mortality, stroke, deep sternal wound infection, renal failure, prolonged ventilation).
The same patient's chart contains the same operative note. But this patient may also be abstracted into the NCDR CathPCI Registry — because they had a preoperative catheterization — and that registry has its own data dictionary with its own field definitions. If the surgery involved a transcatheter valve procedure, the STS/ACC TVT Registry adds another set of variables. If the patient had a complication requiring a return to the operating room, the ACS NSQIP surgical quality registry may apply. If the hospital participates in a Get With The Guidelines (GWTG) program for the patient's cardiovascular condition, that is a fifth registry with its own abstraction requirements.
All five registries read the same source documents. The same radiology report. The same operative note. The same discharge summary. The same lab values. And in nearly every hospital in the United States, five different data-abstracting workflows — often split across different abstractors, sometimes the same person doing the same work five times — manually extract overlapping data points into five separate registry submission platforms.
The MGH data makes this visible. A single hospital manages 11 surgical registries with staffing requirements ranging from 0.5 FTE (small registries with ≤500 cases/year) to 10 FTEs (transplant registry with 750 cases/year). The variable definitions often differ across registries even for the same clinical concept — one registry defines "postoperative renal failure" with one creatinine threshold, another with a different threshold or time window. The abstraction time per case ranges from 15 minutes to 4 hours, depending on the registry's complexity and the patient's clinical course.
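To see why differing definitions force separate abstraction passes, consider one patient evaluated against two rules. Both rules in this sketch are hypothetical: the thresholds, time windows, and the names `registry_a` and `registry_b` are invented for illustration and are not taken from any real registry's data dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenalFailureRule:
    """Hypothetical registry definition, not from any real data dictionary."""
    creatinine_multiple: float  # postoperative peak relative to baseline
    window_days: int            # postoperative days that count


def meets_definition(rule, baseline, peaks_by_day):
    """Return True if any value inside the window reaches the threshold."""
    return any(
        day <= rule.window_days and value >= rule.creatinine_multiple * baseline
        for day, value in peaks_by_day.items()
    )


registry_a = RenalFailureRule(creatinine_multiple=3.0, window_days=30)
registry_b = RenalFailureRule(creatinine_multiple=2.0, window_days=7)

# One patient: baseline creatinine 1.0 mg/dL, values keyed by postoperative day.
peaks = {1: 1.3, 3: 2.4, 10: 1.6}

print(meets_definition(registry_a, 1.0, peaks))  # False
print(meets_definition(registry_b, 1.0, peaks))  # True
```

The same chart yields a different answer for the same clinical concept, so a value abstracted for one registry cannot simply be copied into another.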
This is not a technology interoperability problem that HL7 FHIR can solve. FHIR can standardize the transport of data between systems — making sure that when System A sends a lab value to System B, both systems agree on the format of the transmission. What FHIR cannot do is turn a narrative paragraph into a structured field. It cannot read a radiology report that says "1.2 cm spiculated nodule" and populate a registry field for "tumor size in greatest dimension." That transformation — from prose to structured data — still requires a human reader, or an AI system capable of semantic extraction. The interoperability standards solved the transmission problem. They did not solve the structuring problem.
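The difference between transport and structuring shows up in the payloads themselves. The two dictionaries below are heavily trimmed fragments shaped like FHIR `Observation` and `DiagnosticReport` resources; they omit required elements and are not validated against the specification. The helper `numeric_value` is a hypothetical name used only for this illustration.

```python
# A lab value travels as a coded, typed quantity (trimmed fragment).
creatinine_observation = {
    "resourceType": "Observation",
    "status": "final",
    # LOINC 2160-0: creatinine, mass/volume, in serum or plasma.
    "code": {"coding": [{"system": "http://loinc.org", "code": "2160-0"}]},
    "valueQuantity": {"value": 0.9, "unit": "mg/dL"},
}

# A radiology report travels too, but its finding is still a sentence.
radiology_report = {
    "resourceType": "DiagnosticReport",
    "status": "final",
    "conclusion": "1.2 cm spiculated nodule in the right upper lobe",
}


def numeric_value(resource):
    """Return a machine-usable number if the resource carries one, else None."""
    quantity = resource.get("valueQuantity")
    return quantity["value"] if quantity else None


print(numeric_value(creatinine_observation))  # 0.9
print(numeric_value(radiology_report))        # None
```

Both payloads arrive intact, but only the first contains a number a registry field could accept. The nodule size is still locked inside a string.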
A single patient's clinical documentation can feed five or more registries, each demanding its own abstraction session from the same source material. The duplicate labor is not a rounding error — it is a structural feature of a system where registries were built as independent data collection silos, each with its own data dictionary, field definitions, and submission protocol.
The Irony: Digital Already, Just Not Structured
There is a persistent narrative in health IT that the challenge is "digitization": getting paper records into computers. This narrative made sense in 2005, when most hospitals ran on paper charts and the HITECH Act had not yet been passed. It makes no sense today. Over 90% of US hospitals use an EHR. Radiology departments have been filmless for over a decade; most radiology reports are generated, signed, and distributed entirely within digital PACS-to-EHR workflows. Discharge summaries are typed, not dictated onto cassette tapes. Operative notes are entered into templated EHR modules. The clinical documentation that matters most, the narratives that contain the richest clinical information, is already digital.
The bottleneck is not digitization. The bottleneck is structuring.
And the structuring bottleneck has a specific, measurable shape. It is the gap between "this patient had a CABG" — a structured fact the EHR can report — and the 200 individual data points the STS registry requires about how that CABG happened. Every one of those 200 data points exists somewhere in the clinical documentation: the preoperative ejection fraction is in the echocardiogram report, the number of grafts is in the operative note, the postoperative ventilation duration is in the ICU flow sheet, the 30-day mortality status comes from a post-discharge follow-up phone call documented as a free-text note. The information is in the chart. It is simply not in a format that machines can read.
This reframes the entire automation conversation. The question is not "can we digitize clinical documentation?" — that ship sailed. The question is "can we extract structured data from clinical narratives that are already digital, without hiring more people to read and type?"
The distinction matters because it changes what kind of technology actually addresses the problem. Template-based OCR — the kind that reads "where" a field sits on a page — was designed for documents with fixed layouts: standardized forms, printed tables, structured invoices. A clinical operative note has no fixed layout. It is a narrative paragraph, written by a surgeon, describing a procedure that may play out differently every time. You cannot template a narrative. You can only understand it.
This is where the current generation of AI extraction tools — built on vision language models (VLMs) rather than template OCR — enters the conversation. A VLM does not need to know where on the page the ejection fraction is written. It needs to know what an ejection fraction is — that it is a percentage value, typically expressed as "EF 45%" or "LVEF estimated at 40-45%" — and find it in the narrative wherever it appears. This is semantic extraction, not coordinate-based extraction. It works on the principle that clinical concepts have consistent semantic signatures across differently worded narratives, and that a model trained to understand language can find "the ejection fraction" regardless of whether the cardiologist wrote "EF 40%" or "LV systolic function moderately reduced, estimated EF 40-45%."
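The target of that extraction is easy to show, even if the method is not. The sketch below uses a plain regular expression as a deliberately simple stand-in for a language model: it is not a VLM and does not understand anything. It only illustrates how differently worded inputs should map to one structured output. The pattern and the function name `extract_ef` are hypothetical.

```python
import re

# Matches phrasings such as "EF 40%", "EF 40-45%", "LVEF estimated at 40-45%".
EF_PATTERN = re.compile(
    r"\b(?:LV)?EF\b[^0-9%]{0,20}?(\d{1,2})(?:\s*-\s*(\d{1,2}))?\s*%",
    re.IGNORECASE,
)


def extract_ef(text):
    """Return (low, high) ejection fraction percentages, or None if not found."""
    match = EF_PATTERN.search(text)
    if match is None:
        return None
    low = int(match.group(1))
    # A single value such as "EF 40%" is reported as a range of width zero.
    high = int(match.group(2)) if match.group(2) else low
    return low, high


print(extract_ef("EF 40%"))  # (40, 40)
print(extract_ef("LV systolic function moderately reduced, estimated EF 40-45%"))  # (40, 45)
print(extract_ef("LVEF estimated at 40-45%"))  # (40, 45)
print(extract_ef("Left ventricular function is moderately reduced."))  # None
```

The last call shows the limit of pattern matching: a note that describes reduced function without the literal token "EF" returns nothing, which is the gap a semantic model is meant to close.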
The core inefficiency in clinical data abstraction is not that documents are on paper. It is that documents exist as prose — rich, nuanced, clinically valuable prose — and the systems that need the data from those documents demand structured fields. The digitization problem is solved. The structuring problem is where the billions in manual labor live.
What Structuring Clinical Data Actually Means
If the bottleneck is structuring — not digitization — then the solution is not a better scanner and not a faster typist. It is a system that can read clinical narratives the way a human abstractor reads them: understanding what each sentence means, identifying which concepts map to which registry fields, and producing structured output that a human can then validate.
This is a fundamentally different task from what most document automation tools were built to do. Traditional document extraction tools — the ones that handle invoices and purchase orders — work by learning the layout of a form. They memorize that "Invoice Number" appears in the top-right corner and "Total" appears at the bottom of the last page. When a new invoice arrives from the same vendor, the tool reads the same coordinates and pulls the same fields. When a different vendor sends a differently formatted invoice, the tool needs a new template.
Clinical narratives defeat this approach on two fronts. First, there is no fixed layout — a discharge summary from Hospital A and a discharge summary from Hospital B are both narratives, but they organize information differently, use different headings, and express clinical concepts with different vocabulary. Second, and more fundamentally, the data itself is not positional. You do not find "cross-clamp time 47 minutes" in a specific box on the operative note. You find it embedded in a paragraph, surrounded by other surgical details, written in whatever prose style the surgeon prefers.
Semantic extraction solves this by operating on meaning, not position. The VLM reads the entire document, understands which clinical concepts are present, and extracts the values that correspond to each concept — regardless of where on the page the concept appears, what phrasing the author used, or whether the document is a typed PDF, a scanned report, or a screenshot of the EHR interface. The extractor does not need to be retrained for each new hospital's documentation format, because it is not learning formats — it is recognizing concepts.
The practical workflow is not "AI replaces the abstractor." It is "AI handles the reading step, and the abstractor handles the validation step." The AI populates the 200+ fields of the STS cardiac registry entry from the operative note, the discharge summary, the echo report, and the follow-up note. The abstractor, an RN with cardiac surgery experience, reviews the populated fields, corrects any extraction errors, applies clinical judgment to ambiguous cases, and submits the validated entry. The abstractor's time shifts from finding data (scrolling through 80 pages of EHR documentation, the kind of manual work that takes more than half the workday for 50% of respondents in the Carta survey) to validating data (the part that requires clinical expertise and cannot be automated).
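One way to picture that division of labor is as a draft record in which each populated field carries its source and a score, so the abstractor sees the least certain values first. This is a sketch under assumptions: the field names, the confidence scores, the 0.90 threshold, and the `triage` function are all hypothetical, and a real extraction tool may not expose a confidence value at all.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    name: str          # registry field name (hypothetical)
    value: object      # value proposed by the extraction step
    source: str        # document the value was read from
    confidence: float  # assumed self-reported score, 0.0 to 1.0
    status: str = "pending"


def triage(fields, review_threshold=0.90):
    """Flag uncertain fields and order the list so they are reviewed first."""
    for f in fields:
        f.status = "needs_review" if f.confidence < review_threshold else "prefilled"
    return sorted(fields, key=lambda f: f.confidence)


draft = [
    ExtractedField("cross_clamp_time_minutes", 47, "operative note", 0.97),
    ExtractedField("ejection_fraction_percent", 40, "echo report", 0.72),
    ExtractedField("number_of_grafts", 3, "operative note", 0.95),
]

for f in triage(draft):
    print(f.name, f.value, f.status)
```

Nothing here submits a value automatically. Every field, including those marked `prefilled`, still passes in front of the abstractor; the ordering only decides where attention goes first.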
For a CABG patient whose abstraction currently takes 45 to 90 minutes, spanning preoperative, intraoperative, and postoperative documentation across multiple EHR modules, a semantic extraction tool that handles the initial data pull could cut the abstractor's per-case time substantially. The math is straightforward: if RN abstractors earning $40/hour process 1,300 cardiac surgery cases per year (the volume MGH reported for its STS-Cardiac registry), and AI-assisted extraction saves 30 minutes per case (a third to two-thirds of the current per-case time), that is 650 hours of RN labor reclaimed annually, or $26,000 in recovered salary cost, redirected from transcription to validation and quality improvement work. Across five registries, across 2,400 hospitals, the aggregate is not a rounding error.
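The savings arithmetic is short enough to write out. The hourly rate and the 30-minute saving are the paragraph's assumptions; the case volume is the STS-Cardiac figure from the MGH table.

```python
HOURLY_RATE = 40             # USD per hour, RN abstractor (assumed)
CASES_PER_YEAR = 1_300       # STS-Cardiac volume reported by MGH
MINUTES_SAVED_PER_CASE = 30  # assumed saving from AI-assisted extraction

hours_reclaimed = CASES_PER_YEAR * MINUTES_SAVED_PER_CASE / 60
salary_recovered = hours_reclaimed * HOURLY_RATE

print(hours_reclaimed)   # 650.0
print(salary_recovered)  # 26000.0
```

Changing `MINUTES_SAVED_PER_CASE` is the quickest way to test how sensitive the result is to the one input that has not been measured.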
Frequently Asked Questions
Why don't EHRs just make clinical documentation structured by default?
Because structured data entry — dropdowns, checkboxes, constrained vocabularies — is fundamentally at odds with how clinicians think and communicate. A checkbox can capture "chest pain: present" but cannot capture "patient describes intermittent substernal chest pressure radiating to the left shoulder, worse with exertion, relieved by rest, onset approximately 2 weeks ago, frequency increasing." The checkbox captures a billing code. The narrative captures the clinical reasoning. Forcing clinicians to document exclusively in structured fields would produce data that machines can read but that other clinicians cannot use. The tradeoff is real, and the medical community has — correctly — opted for clinically useful documentation over machine-friendly documentation.
How many clinical registries does a typical hospital participate in?
A community hospital may participate in 3 to 5 registries — typically covering stroke (GWTG), cardiac procedures (NCDR CathPCI), and surgical quality (ACS NSQIP). A large academic medical center typically participates in 10 to 15 registries, spanning cardiac surgery (STS), trauma (TQIP), transplant (SRTR), oncology (NCDB), and multiple subspecialty registries. MGH's published data covers 11 registries; many academic centers exceed this number. Each registry adds abstraction FTEs, and the FTEs compound because registries do not share data.
What types of clinical documents need manual abstraction?
The documents that generate the most abstraction labor are radiology reports, discharge summaries, operative notes, progress notes, and pathology reports — the narrative-heavy documents where the clinically richest information lives. Lab values, medication orders, and vital signs are structured data that EHRs can export directly. The manual labor concentrates overwhelmingly on the free-text documents that contain the clinical reasoning and nuance that structured fields were never designed to capture.
Can AI actually read a radiology report accurately enough for registry use?
Vision language models can extract discrete data points from radiology narratives — tumor dimensions, laterality, imaging modality, follow-up recommendations — with accuracy that makes them viable as a first-pass tool for an abstractor to validate. They are not a replacement for clinical review, because radiology reports contain ambiguity (impressions that hedge, measurements that are qualified as "approximately") that requires human interpretation. The appropriate architecture is AI-assisted abstraction: the model populates fields, the abstractor validates. This is the same model that the Carta survey found abstractors wanted — tools that reduce manual hunting time without replacing clinical judgment.
What is the difference between digitization and structuring in clinical documentation?
Digitization means converting a document from physical to electronic form — scanning a paper chart, generating a PDF from an EHR, storing an image in a PACS. The document is now a file. Structuring means converting the content of that document from narrative prose to discrete, queryable data fields — extracting "cross-clamp time: 47 minutes" from a paragraph in an operative note and populating a database field called "cross_clamp_time_minutes" with the value "47." Digitization creates a file a human can read. Structuring creates data a machine can use. The problem in clinical documentation is that digitization happened, but structuring did not follow — which is why hospitals still employ people to do it manually.
The structural truth of clinical documentation: EHRs made clinical data digital but not structured. Registries demand structured data but cannot extract it from narratives. Between these two incompatible systems sits a workforce of thousands of nurses and health information professionals, manually bridging the gap one report at a time, one field at a time, one registry at a time — often reading the same documents and extracting the same data points for five different systems in five separate sessions. The cost is not just the salaries of the abstractors. It is the clinical talent diverted from patient care to data transcription. It is the registry participation that hospitals cannot afford and therefore skip — leaving quality gaps unmeasured. It is the research questions that go unasked because the data exists in prose that nobody has the budget to structure. AI extraction does not solve every layer of this problem — clinical judgment, registry field definitions, and payer-specific rules remain human domains. What it solves is the layer that should never have been human in the first place: reading a paragraph and typing the answer into a box.