When we discuss using language models and classification systems to assist with MedDRA coding in ICSR processing, we're talking about a problem that has a clear accuracy ceiling and a minimum acceptable standard that most implementations don't actually meet. The honest version of what these tools do well and what they fail at consistently is more useful to PV teams than a general claim that automation reduces coding time. It does, but only under conditions that don't hold everywhere.
I've spent considerable time evaluating how MedDRA auto-coding performs across different narrative types and will try to give a technically precise account of both the successes and the systematic failures.
What MedDRA Coding Actually Requires
Before evaluating AI-assisted coding, it helps to be precise about what manual MedDRA coding requires from a trained coder. The task is not simply "find the closest matching term in the MedDRA dictionary." It involves:
- Reading the adverse event narrative for clinical meaning, distinguishing the primary event from symptoms, sequelae, and diagnostic findings that should be coded as separate PTs
- Applying the "one concept per code" principle — a single narrative phrase may contain multiple distinct clinical concepts requiring separate PTs
- Navigating the MedDRA hierarchy (PT → HLT → HLGT → SOC) to ensure the most specific appropriate PT is selected, not a broader term that loses clinical resolution
- Applying causality judgment: whether an event described in the narrative meets the standard for coding as a drug-related adverse event or should be coded as a medical history item, concurrent condition, or indication
- Handling temporal ambiguity: events described as occurring "shortly after" starting drug may warrant different coding treatment than events with clear onset timing
Most AI-assisted coding tools perform well on the first and third tasks when the narrative is clean and unambiguous. They perform inconsistently on the second task and poorly on the fourth and fifth.
Where Auto-Coding Genuinely Helps
For straightforward ICSRs — a consumer report describing a single unambiguous adverse event in plain language, or a structured manufacturer expedited report where the clinical team has already pre-coded the primary event and the auto-coder needs to validate and populate the structured fields — classification accuracy is high. In internal testing on these case types, we consistently see auto-coding agreement with expert human coders in the range of 88-93% at the PT level.
These cases are not rare — a substantial fraction of ICSR volume at most PV departments consists of relatively clean single-event reports. If your ICSR processing queue is high-volume and the primary bottleneck is the coding step, auto-coding on these case types is a reasonable efficiency gain with manageable quality risk, provided you have a quality review process that samples coded cases and catches systematic errors.
Auto-coding also helps with dictionary lookup tasks that are purely mechanical: identifying the current MedDRA version's PT for a clearly described clinical concept, suggesting synonym mapping when a reporter uses non-standard terminology, and flagging cases where the described event doesn't cleanly match any single PT (which is a signal to escalate for manual review). These augmentation functions are often more practically valuable than full automation because they don't require the tool to make causality or ambiguity judgments it isn't reliable at.
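The "no clean match" flag can be sketched as a simple rule on the auto-coder's ranked candidates. This is a minimal illustration, not any vendor's implementation: the `Candidate` structure, the score scale, and both thresholds are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    pt_name: str   # candidate Preferred Term
    score: float   # hypothetical model score in [0, 1]


def needs_manual_review(candidates, min_score=0.80, min_margin=0.15):
    """Return True when no single PT clearly stands out.

    min_score and min_margin are illustrative thresholds, not validated values.
    """
    if not candidates:
        return True  # nothing matched at all
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    top = ranked[0]
    if top.score < min_score:
        return True  # best match is weak
    if len(ranked) > 1 and top.score - ranked[1].score < min_margin:
        return True  # two PTs compete for the same phrase
    return False


clean = [Candidate("Headache", 0.95), Candidate("Migraine", 0.03)]
ambiguous = [
    Candidate("Hepatic function abnormal", 0.48),
    Candidate("Transaminases increased", 0.41),
]
print(needs_manual_review(clean), needs_manual_review(ambiguous))
```

The point of the design is that the tool only has to recognise its own uncertainty, which is a much easier job than resolving it.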
Complex Narratives: Where the Models Consistently Fail
The failure modes of AI-assisted MedDRA coding cluster around narrative complexity in predictable ways. Understanding the failure modes is more useful than knowing the aggregate accuracy number, because it tells you which case types should never leave the auto-coding queue without human review.
Multi-event narratives with causal relationships between events. A narrative that describes "the patient developed rash, which was treated with an antihistamine, following which they experienced drowsiness and difficulty concentrating" contains three potentially codable events: rash (the primary AE), plus drowsiness and difficulty concentrating, either of which could be an effect of the antihistamine, a symptom of the underlying condition, or a continuation of the primary event. Current models routinely mishandle this hierarchical causal structure: they either code all events as primary AEs or undercode by treating the secondary events as noise. Expert human coders know to code the antihistamine's sedation as a separate AE requiring its own causal assessment.
Ambiguous causality between the study drug and a concurrent medical event. Narratives involving patients with significant comorbidities frequently present events where the drug-causality assessment is clinically non-trivial. "Patient with known atrial fibrillation developed palpitations and increased ventricular rate three days after starting Drug X" — is this a drug-related adverse event, a manifestation of the underlying cardiac condition, or an expected pharmacodynamic effect consistent with the drug's mechanism? The clinical judgment required to assign an appropriate causality category and select the appropriate PT (cardiac arrhythmia NOS vs. atrial fibrillation aggravated vs. ventricular rate increased) is exactly the kind of task models trained on historical coded cases do poorly on, because the training distribution doesn't provide signal about the clinical reasoning behind the code choices.
Negative or conditional phrasing. "No evidence of hepatotoxicity was observed" states the absence of an adverse event and should not produce an AE code. "Hepatotoxicity could not be excluded" is a different clinical statement, an unresolved possibility, which some models incorrectly code as a confirmed event. This is a well-documented failure mode in clinical NLP generally, not specific to MedDRA coding, but it's particularly consequential in PV because false positive AE codes inflate apparent signal rates and false negatives miss events that should be in the safety database.
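To make the distinction concrete, here is a deliberately small cue-based sketch that separates negated, uncertain, and affirmed statements before any PT is assigned. The cue lists are illustrative assumptions and far from complete, and the check works on a whole sentence rather than on the scope of each clinical concept, which a production system would need.

```python
import re

# Illustrative cue lists only; real narratives need far broader coverage.
UNCERTAINTY_CUES = [
    r"\bcould not be excluded\b",
    r"\bcannot be excluded\b",
    r"\bcould not be ruled out\b",
    r"\bcannot be ruled out\b",
    r"\bsuspected\b",
]
NEGATION_CUES = [
    r"\bno evidence of\b",
    r"\bno signs? of\b",
    r"\bruled out\b",
    r"\bdenies\b",
]


def assertion_status(sentence):
    """Classify a sentence as 'uncertain', 'negated', or 'affirmed'."""
    text = sentence.lower()
    # Uncertainty is checked first because "could not be ruled out"
    # also contains the negation cue "ruled out".
    if any(re.search(p, text) for p in UNCERTAINTY_CUES):
        return "uncertain"
    if any(re.search(p, text) for p in NEGATION_CUES):
        return "negated"
    return "affirmed"


for s in (
    "No evidence of hepatotoxicity was observed",
    "Hepatotoxicity could not be excluded",
    "Patient developed hepatotoxicity",
):
    print(s, "->", assertion_status(s))
```

In a workflow, an "uncertain" result is a natural trigger for routing to a human coder rather than something the tool should resolve on its own.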
Events described at the wrong level of clinical specificity. A reporter who writes "the patient had liver problems" has provided a clinically insufficient description for precise PT assignment. A well-trained human coder will code this as "hepatic function abnormal" or a similar non-specific hepatic PT and flag the case for follow-up with the reporter to obtain more specific information. Many auto-coding tools will assign a more specific PT based on probabilistic association with the drug class or therapeutic area — oncology hepatotoxicity reports are more likely to involve transaminase elevation than hepatitis, so the model may assign a transaminase PT — which is speculative coding that introduces systematic bias rather than appropriately coding uncertainty.
The Training Data Problem
Most AI-assisted MedDRA coding systems are trained on existing coded ICSR databases. This is a methodological constraint that shapes what these tools can and cannot do. They learn to reproduce the coding decisions that human coders made in the past, including the errors and inconsistencies those coders made. MedDRA coding accuracy in historical ICSR databases is not uniform — inter-coder agreement studies consistently find significant disagreement rates, particularly for complex multi-event narratives and cases requiring causality judgment.
A model trained on imperfectly coded historical data learns the modal coding choice for a given narrative type, not the correct coding choice. For straightforward narratives where expert coders agree reliably, this distinction doesn't matter. For complex narratives where expert coders disagree significantly, the model is learning from noise and its outputs on those case types are correspondingly unreliable.
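A toy calculation shows why the modal choice is only as good as the agreement behind it. The coder assignments below are invented for illustration, and raw agreement with the mode is a simplification; a real inter-coder study would normally use a chance-corrected statistic.

```python
from collections import Counter


def modal_code_and_agreement(codes):
    """Return the most frequent PT and the share of coders who chose it."""
    counts = Counter(codes)
    modal_code, modal_count = counts.most_common(1)[0]
    return modal_code, modal_count / len(codes)


# Hypothetical PT assignments from five coders for two narratives.
simple_narrative = ["Headache"] * 5
complex_narrative = [
    "Hepatic function abnormal",
    "Transaminases increased",
    "Hepatic function abnormal",
    "Hepatitis",
    "Liver disorder",
]

print(modal_code_and_agreement(simple_narrative))
print(modal_code_and_agreement(complex_narrative))
```

In the second case the "label" a model would learn is backed by two of five coders, so reproducing it perfectly still says little about whether the code is right.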
This is one reason we're cautious about the headline accuracy numbers that vendors report for MedDRA auto-coding systems. Those numbers typically reflect agreement with the existing coded database, which is not the same as agreement with truly expert coding. If the test set contains the same proportion of complex narratives as the live ICSR queue, the accuracy numbers may be informative. If the test set was selected to show the model in favorable conditions — clean narratives, common event types, consistent coding in the training data — the operational accuracy on your actual case mix may be significantly lower.
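The case-mix effect is simple arithmetic, sketched below. The per-type accuracies and both case mixes are hypothetical numbers chosen only to show the mechanism; substitute figures measured on your own queue.

```python
def expected_accuracy(case_counts, accuracy_by_type):
    """Weight per-type accuracy by the share of each case type in a queue."""
    total = sum(case_counts.values())
    return sum(
        case_counts[t] / total * accuracy_by_type[t] for t in case_counts
    )


# Hypothetical per-type accuracies (not measured values).
accuracy_by_type = {"clean": 0.90, "complex": 0.60}

# Hypothetical case mixes: a favorable test set vs. a live queue.
vendor_test_set = {"clean": 950, "complex": 50}
live_queue = {"clean": 600, "complex": 400}

print(expected_accuracy(vendor_test_set, accuracy_by_type))
print(expected_accuracy(live_queue, accuracy_by_type))
```

With these assumed inputs the same model scores about 0.885 on the favorable test set and about 0.78 on the live queue, without any change to the model itself.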
Practical Implementation Guidance
For PV teams considering or evaluating auto-coding tools, a few implementation principles have proven consistently important in our experience:
Define a clear review stratification at intake. Not all cases need equal human review after auto-coding. Single-event narratives that the model codes with high confidence can reasonably be reviewed at a lower sampling rate than cases where the model indicates uncertainty, cases with complex polypharmacy context, and cases from therapeutic areas with high MedDRA version change frequency.
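One way to express that stratification is a small routing rule applied at intake. Everything specific here is an assumption for illustration: the field names, the confidence and polypharmacy thresholds, and the two sampling rates would all need to come from your own validation data and quality documentation.

```python
import random
from dataclasses import dataclass


@dataclass
class CodedCase:
    case_id: str
    confidence: float              # hypothetical auto-coder confidence in [0, 1]
    n_events: int                  # distinct events detected in the narrative
    n_concomitant_drugs: int       # proxy for polypharmacy context
    high_version_churn_area: bool  # therapeutic area with frequent MedDRA changes


# Illustrative sampling rates per stratum, not recommended values.
SAMPLING_RATES = {"routine": 0.05, "enhanced": 0.50}


def review_stratum(case, min_confidence=0.90, polypharmacy_threshold=5):
    """Assign a case to the routine or enhanced review stratum."""
    if (
        case.confidence >= min_confidence
        and case.n_events == 1
        and case.n_concomitant_drugs < polypharmacy_threshold
        and not case.high_version_churn_area
    ):
        return "routine"
    return "enhanced"


def select_for_review(case, rng, rates=SAMPLING_RATES):
    """Randomly sample the case for human review at its stratum's rate."""
    return rng.random() < rates[review_stratum(case)]


simple_case = CodedCase("A-001", 0.96, 1, 1, False)
complex_case = CodedCase("A-002", 0.96, 3, 7, False)
rng = random.Random(42)  # seeded so the sampling is reproducible
print(review_stratum(simple_case), review_stratum(complex_case))
print(select_for_review(simple_case, rng), select_for_review(complex_case, rng))
```

Keeping the stratum assignment separate from the sampling step makes it easier to document the rule and to change the rates without touching the routing logic.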
Track model error patterns over time by case type. If the model consistently miscodes complex causality narratives from a specific therapeutic area, that's a workflow signal: either route those cases to manual coding, or implement a targeted model fine-tuning step on that case type. The aggregate accuracy number is less operationally useful than the error distribution by case type.
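Tracking errors by case type needs nothing more than the reviewed sample and a label for each case. The sketch below assumes each reviewed record carries a case-type label, the auto-coded PT, and the reviewer's PT; the labels and records are invented for illustration.

```python
from collections import defaultdict


def error_rate_by_case_type(reviewed):
    """Compute the share of auto-coded PTs overturned by review, per case type.

    reviewed: iterable of (case_type, auto_pt, reviewer_pt) tuples.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for case_type, auto_pt, reviewer_pt in reviewed:
        totals[case_type] += 1
        if auto_pt != reviewer_pt:
            errors[case_type] += 1
    return {t: errors[t] / totals[t] for t in totals}


# Hypothetical QC sample.
reviewed_sample = [
    ("single_event", "Headache", "Headache"),
    ("single_event", "Nausea", "Nausea"),
    ("single_event", "Rash", "Rash"),
    ("single_event", "Dizziness", "Vertigo"),
    ("complex_causality", "Palpitations", "Atrial fibrillation"),
    ("complex_causality", "Somnolence", "Somnolence"),
]

for case_type, rate in error_rate_by_case_type(reviewed_sample).items():
    print(case_type, rate)
```

Run on each review cycle, the per-type rates give a trend line that an aggregate accuracy figure would hide.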
Don't use auto-coded data for training without quality review. If auto-coded ICSRs re-enter the training corpus without human validation, the model's errors become self-reinforcing. This is an obvious principle but one that's easy to violate when the operational pressure is to process cases quickly.
Be explicit in your quality management documentation about which case types are auto-coded without human review and which require mandatory QC. ICH E6(R2) for clinical trials and EU GVP Module VI for post-marketing reports don't prescribe specific auto-coding practices, but both expect documented quality processes for data handling, which covers ICSR data entry and coding. The documentation trail for auto-coding decisions is a regulatory audit consideration that's worth getting right from the start rather than retrofitting.
What This Means for Signal Detection Downstream
ICSR coding quality is a direct input to signal detection quality. Systematic auto-coding errors don't cancel out in the disproportionality calculation; they introduce directional bias. If a model consistently miscodes a specific event type in one direction (under-coding complex multi-event narratives, for example), the reporting odds ratio (ROR) and proportional reporting ratio (PRR) estimates for PTs that capture those events will be systematically deflated. This is the kind of error that's hard to detect in signal detection QC because the suppressed signal never appears in your flagged output.
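The deflation can be seen directly in the standard 2x2 disproportionality table. The counts below are invented, and the sketch assumes the simplest differential case: a fraction of event reports for the drug of interest is missed by the auto-coder and ends up counted among that drug's other reports, while the comparator rows are unaffected.

```python
def prr(a, b, c, d):
    """Proportional reporting ratio.

    a: drug of interest, event of interest    b: drug of interest, other events
    c: other drugs, event of interest         d: other drugs, other events
    """
    return (a / (a + b)) / (c / (c + d))


def ror(a, b, c, d):
    """Reporting odds ratio from the same 2x2 table."""
    return (a * d) / (b * c)


def undercode(a, b, missed_fraction):
    """Move a fraction of event reports from cell a to cell b (assumed model)."""
    missed = round(a * missed_fraction)
    return a - missed, b + missed


# Hypothetical counts.
a, b, c, d = 40, 960, 200, 18800
a_uc, b_uc = undercode(a, b, missed_fraction=0.25)

print("PRR:", prr(a, b, c, d), "->", prr(a_uc, b_uc, c, d))
print("ROR:", ror(a, b, c, d), "->", ror(a_uc, b_uc, c, d))
```

With these assumed counts, missing a quarter of the event reports moves the PRR from 3.8 to 2.85 and the ROR from about 3.92 to about 2.91. If the under-coding affected all drugs equally the effect on the ratios would be much smaller, which is why the error distribution by case type and product matters more than the overall error rate.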
For PV teams that use automated coding as input to signal detection, the implication is that signal detection accuracy is bounded above by coding accuracy. High-quality, carefully validated auto-coding with good human review on complex cases is a foundation that downstream analysis can build on. Poorly validated auto-coding that systematically miscodes complex narratives will undermine signal detection regardless of how sophisticated the detection method is.
We think there's real utility in AI-assisted MedDRA coding, applied to the right case types with appropriate human oversight. The useful framing isn't "can AI code MedDRA?" — for straightforward narratives it clearly can, with high accuracy. The useful framing is "which cases can you responsibly route to auto-coding without human validation, and which cases require human judgment that the current tools don't reliably replicate?" Getting that boundary right is where the operational value actually lives.