FDA FAERS for Safety Scientists: What the Public Database Actually Contains

FAERS is publicly available, extensively documented by FDA, and has been the subject of countless analyses in the pharmacovigilance literature. Yet many of the safety scientists I've talked with over the past several years — people who run quarterly signal detection cycles for a living — are working with a partial understanding of what's actually in the database. The gaps usually aren't in the high-level structure; those are well-covered in FDA's technical documentation. They're in the practical details that only become apparent when you've spent significant time with the raw ASCII files.

This is a practical guide oriented toward working safety scientists, not a regulatory overview. I'm going to focus on what the data actually contains, where it's systematically incomplete, and what the structural characteristics of FAERS mean for analytical decisions.

The File Structure: What Comes in Each Quarterly Release

FAERS quarterly downloads consist of seven "$"-delimited ASCII files. The files that matter most for signal detection analysis are DEMO (demographic and administrative information, one row per report), DRUG (one row per drug per report, including drug role coding), REAC (one row per adverse event per report, coded as MedDRA Preferred Terms), OUTC (patient outcomes: hospitalization, disability, death, etc.), and RPSR (report source codes such as health professional, consumer, literature, and foreign).

The INDI (drug indication) and THER (therapy dates) files are included but have severe completeness problems that limit their analytical usefulness. Drug indication is recorded as a MedDRA Preferred Term (INDI_PT) but is populated inconsistently across reporters: manufacturer reports tend to have reasonably complete indication data, while consumer reports almost never do. Therapy dates (start date, end date) are populated for a minority of reports and should never be relied upon for duration-of-exposure analysis at scale.
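Before relying on either file, it is worth measuring completeness in your own extract. The sketch below is one way to do that, assuming THER rows have already been parsed into dictionaries keyed by lowercase column name (primaryid, start_dt). The function name and the sample rows are illustrative, not part of any FDA tooling.

```python
def share_of_cases_with_field(all_primaryids, rows, field):
    """Fraction of cases that have at least one row with a non-blank `field`."""
    ids = set(all_primaryids)
    if not ids:
        return 0.0
    # A case counts as populated if any of its rows carries a non-blank value.
    populated = {r["primaryid"] for r in rows if (r.get(field) or "").strip()}
    return len(populated & ids) / len(ids)

# Illustrative data: four cases in DEMO, partial therapy dates in THER.
demo_ids = ["1", "2", "3", "4"]
ther_rows = [
    {"primaryid": "1", "start_dt": "20200101"},
    {"primaryid": "1", "start_dt": ""},
    {"primaryid": "2", "start_dt": " "},
    {"primaryid": "3", "start_dt": "2019"},   # partial dates still count as populated
]
share = share_of_cases_with_field(demo_ids, ther_rows, "start_dt")
print(f"{share:.0%} of cases have a therapy start date")
```

Running the same check separately for manufacturer and consumer reports makes the completeness gap between reporter types visible for your specific drug.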

The primary key across files is ISR (Individual Safety Report number) in legacy AERS data and PRIMARYID in FAERS data (2012 Q4 onward). The CASEID field (named CASE in the legacy files) links successive versions of the same case: a single case has multiple ISR or PRIMARYID values if the initial report was subsequently amended or followed up. Deduplication on CASEID with retention of the most recent version (highest PRIMARYID) is standard practice, but the exact rule deserves explicit documentation in your analytical protocols.
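A minimal sketch of that CASEID-level deduplication, assuming a DEMO extract with lowercase column names primaryid and caseid. The sample rows and the function name latest_version_per_case are illustrative. The rule implemented is "highest PRIMARYID per CASEID"; if your protocol uses the latest FDA receipt date first, the comparison key would need to change accordingly.

```python
import csv
import io

# Illustrative "$"-delimited DEMO extract; real files have many more columns.
SAMPLE_DEMO = """primaryid$caseid$fda_dt
100000011$10000001$20150102
100000012$10000001$20150410
100000021$10000002$20150215
"""

def latest_version_per_case(rows):
    """Keep one row per CASEID: the one with the numerically highest PRIMARYID."""
    latest = {}
    for row in rows:
        case = row["caseid"]
        # Compare as integers; string comparison misorders IDs of different length.
        if case not in latest or int(row["primaryid"]) > int(latest[case]["primaryid"]):
            latest[case] = row
    return list(latest.values())

# QUOTE_NONE keeps stray quote characters in the data from being interpreted.
reader = csv.DictReader(io.StringIO(SAMPLE_DEMO), delimiter="$", quoting=csv.QUOTE_NONE)
deduped = latest_version_per_case(reader)
print(sorted(r["primaryid"] for r in deduped))  # ['100000012', '100000021']
```

The retained PRIMARYID set can then be used to filter DRUG, REAC, and the other files, so every table reflects the same case versions.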

Report Volume and Temporal Coverage

As of the most recent complete fiscal year of FAERS data, the cumulative database contains over 20 million ICSRs spanning from 2004 forward, with the historical AERS data extending back to 1969 in a separate legacy file set. For practical signal detection purposes, most teams work with a defined time window rather than the full historical database. A rolling 10-year window is common, though the appropriate window depends on the drug's approval date and the stability of reporting patterns over time.

Report volume has grown substantially since FAERS launched in 2012 and accelerated further after 2016 with increased manufacturer reporting requirements and the expansion of electronic reporting standards. This growth is not uniform: certain therapeutic areas (oncology biologics, immunosuppressants, newer antidiabetic agents) have seen especially large reporting volume increases. When you're computing disproportionality statistics on a 10-year window, the earlier years are structurally different from the more recent years — not just in volume, but in reporter mix and coding practices. Treating the 10-year cumulative database as a homogeneous population for signal detection purposes introduces bias that's worth acknowledging.

Reporter Type Composition and What It Means for Signal Interpretation

The RPSR file records the report source codes for each case, and DEMO adds the reporter's occupation (OCCP_COD) and the report type (REPT_COD: expedited, periodic, or direct). Taken together, these fields let you group reports into broad reporter categories: manufacturer (MAH), healthcare provider (physician, pharmacist, other HCP), consumer/patient, and regulatory authority (foreign regulatory submissions). The composition of reporter types varies enormously by drug and therapeutic area, and it matters analytically in ways that aren't always appreciated.

Manufacturer reports (MAH-submitted) represent the bulk of total FAERS volume, roughly 80-85% of reports in recent years. These reports are submitted under the postmarketing safety reporting obligations in 21 CFR 314.80 (drugs) and 21 CFR 600.80 (biologics). They are typically more complete in terms of structured fields, but they also reflect the reporting decisions of the manufacturer's pharmacovigilance department. The rate of "does not meet serious reporting criteria" determinations, the speed of expedited vs. periodic submission, and the drug role coding choices (primary suspect vs. concomitant) all carry the fingerprint of the manufacturer's internal processes.

Healthcare provider reports tend to have better clinical narrative quality but are often less complete in structured fields. Consumer/patient reports have the lowest structured-field completion rates but may capture adverse events that HCP reports miss — particularly events that patients experience but don't report to their physician, or events that occur after the physician relationship has ended.

Why does reporter type matter for signal detection? Because the reporting rate relative to exposure is different across reporter types, and because the drug role coding practices differ systematically. A consumer report where the patient lists all their medications is likely to code most drugs as suspect; a manufacturer report processed through a PV department's workflow may apply more conservative causality assessment. If your signal detection analysis doesn't stratify or at minimum account for reporter type composition, you may be computing disproportionality on a population that mixes very different reporting behaviors.
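One way to see whether reporter mix is driving a result is to compute the reporting odds ratio separately within each reporter stratum. The sketch below assumes each deduplicated case has already been reduced to a tuple of (stratum label, whether the drug of interest is listed, whether the event of interest is coded). How you derive the stratum label from the source fields is your own design decision. The counts are made up for illustration.

```python
from collections import defaultdict

def ror_by_stratum(reports):
    """reports: iterable of (stratum, has_drug, has_event), one per deduplicated case.

    Returns {stratum: ROR}, or None where the ROR is undefined.
    """
    tables = defaultdict(lambda: [0, 0, 0, 0])  # cells a, b, c, d
    for stratum, has_drug, has_event in reports:
        # a: drug+event, b: drug only, c: event only, d: neither
        index = (0 if has_drug else 2) + (0 if has_event else 1)
        tables[stratum][index] += 1
    result = {}
    for stratum, (a, b, c, d) in tables.items():
        # ROR = (a*d)/(b*c); undefined when b or c is zero.
        result[stratum] = (a * d) / (b * c) if b and c else None
    return result

def expand(stratum, a, b, c, d):
    """Helper for the example: turn 2x2 counts back into case-level tuples."""
    return ([(stratum, True, True)] * a + [(stratum, True, False)] * b
            + [(stratum, False, True)] * c + [(stratum, False, False)] * d)

# Hypothetical counts, for illustration only.
reports = expand("manufacturer", 10, 90, 50, 850) + expand("consumer", 4, 16, 20, 160)
strata = ror_by_stratum(reports)
for name, ror in strata.items():
    print(name, None if ror is None else round(ror, 2))
```

Materially different estimates across strata are a prompt to look at coding practices before interpreting the pooled number. The same function works with calendar period as the stratum label, which is a simple check on the homogeneity concern raised for multi-year windows.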

What FAERS Does Not Contain

This is where I find the most gaps in practical understanding. FAERS is a spontaneous reporting database — it contains adverse event reports submitted to FDA, not a population-level pharmacovigilance record. The implications of this are significant:

No denominator data. FAERS does not contain prescription volume, patient-years of exposure, or any information about how many patients took a drug without experiencing a reported adverse event. ROR and PRR estimates are based entirely on the report count distribution within FAERS, not on incidence rates in the exposed population. This is why disproportionality statistics are not incidence estimates and should never be presented as such.
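To make the point concrete, here is a sketch of how the reporting odds ratio (ROR) and proportional reporting ratio (PRR) are computed from a 2x2 table of report counts, using the usual log-scale standard errors for the 95% intervals. Every input is a count of reports inside the database, and nothing in the calculation refers to exposed patients. The counts in the example are invented for illustration.

```python
import math

def disproportionality(a, b, c, d, z=1.96):
    """ROR and PRR with confidence intervals from a 2x2 table of report counts.

    a: drug of interest, event of interest
    b: drug of interest, all other events
    c: all other drugs, event of interest
    d: all other drugs, all other events
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("this sketch requires all four cells to be positive")

    ror = (a * d) / (b * c)
    se_ln_ror = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)

    prr = (a / (a + b)) / (c / (c + d))
    se_ln_prr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))

    def interval(estimate, se):
        # The interval is symmetric on the log scale.
        return (estimate * math.exp(-z * se), estimate * math.exp(z * se))

    return {
        "ror": ror, "ror_ci": interval(ror, se_ln_ror),
        "prr": prr, "prr_ci": interval(prr, se_ln_prr),
    }

# Hypothetical counts, not taken from FAERS.
result = disproportionality(a=20, b=980, c=100, d=98900)
print({k: v if isinstance(v, tuple) else round(v, 2) for k, v in result.items()})
```

Doubling every cell leaves both point estimates unchanged and only narrows the intervals, which is another way of seeing that these statistics describe the shape of the reporting distribution rather than risk in patients.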

Selective underreporting. The spontaneous reporting system captures a fraction of actual adverse events — commonly estimated in the range of 1-10% for serious events, considerably lower for non-serious events. This underreporting is not random. It is concentrated in adverse events that are common, non-distinctive, or expected given the drug's known profile. An adverse event that everyone already knows about will be underreported relative to a novel unexpected event, regardless of actual incidence. This means FAERS systematically overstates the apparent signal-to-noise ratio for novel associations relative to established ones.

Limited patient history. FAERS does not contain the patient's full medication history, comorbidity list, or prior adverse events. The concomitant medications field captures what the reporter chose to list, which is typically the drugs prescribed at the time of the adverse event — not all drugs the patient has taken, not drugs recently discontinued, not OTC medications unless the reporter mentioned them. For pharmacokinetic interaction analysis, the absence of recently-discontinued drugs is a meaningful limitation because some interactions persist beyond the co-administration window.

No outcome follow-up. Once a report is submitted, FAERS contains only the information in that report and any subsequent follow-up submissions for the same case. There is no systematic outcome tracking — whether the patient recovered, the dose was adjusted, re-challenge occurred, or the event was ultimately attributed to a different cause.

The Deduplication Problem Is Worse Than You Think

Duplicate reports are a known limitation of FAERS and FDA's technical documentation acknowledges them. What the documentation doesn't fully convey is how variable the duplicate rate is across drug types and reporter populations. For high-profile drugs under active safety surveillance, the same adverse event may be reported by the patient, by their physician, by the hospital pharmacist, and by the manufacturer's PV department — all as separate ICSRs that FAERS will link by CASEID if the submitters used the same case number, but which may remain as independent duplicate reports if they didn't.

Recent quarterly releases include a list of deleted cases (CASEIDs that have been removed from the database, for example after cases were consolidated), but applying it does not catch all cross-reporter duplicates. For high-volume drugs with multiple mandatory-reporting MAHs (e.g., branded plus generic manufacturers), the duplicate rate can meaningfully inflate apparent report counts. Signal detection analyses that use raw PRIMARYID counts without de-duplication will overcount for these drugs.

Our standard practice is to apply CASEID-level deduplication as a baseline, then apply a secondary heuristic check on demographic fields (approximate age, sex, country) and event dates for cases where CASEID matching is absent. This doesn't fully solve the problem but reduces the most systematic inflation.
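A simplified sketch of that secondary check follows. It assumes cases have already been deduplicated on CASEID, that age has been normalized to years, and that each case is a dictionary with the keys shown. The one-year age tolerance and the function name are illustrative choices, not a validated rule. In practice you would also require overlap in suspect drugs and coded events before treating a pair as a likely duplicate.

```python
from collections import defaultdict
from itertools import combinations

def candidate_duplicates(cases, age_tolerance=1.0):
    """Return sorted pairs of CASEIDs that agree on sex, country and event date
    and whose ages differ by at most `age_tolerance` years.

    These are candidates for review, not confirmed duplicates.
    """
    blocks = defaultdict(list)
    for case in cases:
        # Skip cases missing any matching field; blanks would match each other.
        if not (case["sex"] and case["country"] and case["event_dt"]) or case["age"] is None:
            continue
        blocks[(case["sex"], case["country"], case["event_dt"])].append(case)

    pairs = []
    for block in blocks.values():
        for left, right in combinations(block, 2):
            close_age = abs(left["age"] - right["age"]) <= age_tolerance
            if left["caseid"] != right["caseid"] and close_age:
                pairs.append(tuple(sorted((left["caseid"], right["caseid"]))))
    return sorted(pairs)

# Illustrative cases only.
cases = [
    {"caseid": "A1", "sex": "F", "country": "US", "event_dt": "20200314", "age": 67.0},
    {"caseid": "B7", "sex": "F", "country": "US", "event_dt": "20200314", "age": 67.5},
    {"caseid": "C3", "sex": "F", "country": "US", "event_dt": "20200314", "age": 45.0},
    {"caseid": "D9", "sex": "M", "country": "US", "event_dt": "", "age": 67.0},
]
duplicates = candidate_duplicates(cases)
print(duplicates)  # [('A1', 'B7')]
```

Grouping on exact sex, country, and event date first keeps the pairwise comparison small. The trade-off is that reports with a missing or partial event date are never flagged, which is a real gap given how often that field is blank.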

Practical Notes for Query Design

If you're querying FAERS for signal detection, whether through FDA's public-facing FAERS dashboard, the full quarterly downloads, or a licensed signal detection platform, keep a few practical points in mind.

MedDRA version changes across years. A search on a specific Preferred Term will miss reports coded in older MedDRA versions that used different PTs for the same clinical concept. For any analysis spanning more than 2-3 years of data, consider including MedDRA hierarchy lookups at the HLT and HLGT level to catch version-era differences in coding granularity.

Drug name standardization in FAERS is imperfect. Generic names, brand names, chemical names, and misspellings all appear in the drug name field. Any analysis relying on drug name string matching should include a curated synonym list and ideally a UNII or RxNorm normalization step.

The DRUG file drug role coding (PS/SS/C/I) is reporter-assigned and reflects clinical judgment that may not align with the signal you're investigating. An analysis limited to primary suspect drugs will miss interaction signals where the drug of interest was coded as concomitant.

The REAC file contains only MedDRA PTs as coded by the reporter. Free-text narrative is not included in the public FAERS data. If the coded PT doesn't capture the clinical detail that's relevant to your analysis, you don't have access to the underlying narrative to verify it.
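The drug name and drug role points can be combined in one small sketch. It assumes DRUG rows parsed into dictionaries with caseid, drugname, and role_cod keys. The synonym list is a hypothetical hand-curated example and is far from complete. A production pipeline would more likely map names to UNII or RxNorm identifiers, which is not shown here.

```python
import re

# Hypothetical curated synonym list: canonical name -> known variants.
SYNONYMS = {
    "metformin": {
        "metformin", "metformin hydrochloride", "metformin hcl",
        "glucophage", "metformine",
    },
}

def normalize(raw_name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", raw_name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()

# Reverse lookup from normalized variant to canonical name.
LOOKUP = {normalize(v): canon for canon, variants in SYNONYMS.items() for v in variants}

def canonical_drug(raw_name):
    """Return the canonical name, or None if the string is not in the list."""
    return LOOKUP.get(normalize(raw_name))

def cases_mentioning(drug_rows, canon, roles=("PS", "SS", "C", "I")):
    """CASEIDs where the drug appears in any of the given role codes."""
    return {
        row["caseid"] for row in drug_rows
        if row["role_cod"] in roles and canonical_drug(row["drugname"]) == canon
    }

# Illustrative rows only.
drug_rows = [
    {"caseid": "1", "drugname": "GLUCOPHAGE.", "role_cod": "PS"},
    {"caseid": "2", "drugname": "Metformin  HCl", "role_cod": "C"},
    {"caseid": "3", "drugname": "LISINOPRIL", "role_cod": "PS"},
]
all_roles = cases_mentioning(drug_rows, "metformin")
suspect_only = cases_mentioning(drug_rows, "metformin", roles=("PS",))
print(sorted(all_roles), sorted(suspect_only))  # ['1', '2'] ['1']
```

Making the role filter an explicit parameter forces the choice between suspect-only and all-role counting to be recorded in the analysis, rather than buried in a query.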

FAERS is, despite its limitations, an irreplaceable pharmacovigilance resource. No other public database provides comparable breadth of post-market adverse event data across the US drug market. The goal of understanding its structural characteristics is not to dismiss its value — it's to use it accurately, communicate its limitations clearly in signal assessments, and design analytical approaches that account for what it cannot show you as precisely as for what it can.