The Sensitivity-Specificity Tradeoff in Pharmacovigilance Signal Detection

The signal detection threshold problem in pharmacovigilance is genuinely hard, and I think it's underappreciated how much of the difficulty is not statistical but operational. The statistics are well understood. The receiver operating characteristic curve, the sensitivity-specificity tradeoff, the relationship between your detection threshold and your false positive rate — this is standard material. The hard part is what happens when you move the threshold in either direction and what it costs your team.

I'm going to walk through the threshold decision in some detail, because I think the abstract framing, "balance sensitivity and specificity," obscures the actual tradeoffs pharmacovigilance (PV) scientists are navigating every time they calibrate their signal detection parameters.

The Asymmetry of Errors in Pharmacovigilance

In most classification problems, false positives and false negatives have roughly symmetric costs, or the asymmetry runs clearly in one direction and the system can be designed around it. In pharmacovigilance signal detection, the cost asymmetry is more complex than it appears.

The obvious framing is that false negatives are catastrophic (you miss a real safety signal and patients are harmed) and false positives are merely costly (you investigate a non-signal and waste time). From a pure patient safety standpoint, this framing suggests setting thresholds aggressively low — maximize sensitivity, accept whatever false positive rate follows.

But this logic has a practical failure mode. PV teams have finite capacity. If you lower your PRR (proportional reporting ratio) threshold from 2 to 1.5, or the threshold on the lower bound of the EBGM (empirical Bayes geometric mean) 95% credible interval from the conventional 2.0 to 1.5, the number of candidate signals your team needs to evaluate can increase by a factor of 3 to 5, depending on your product portfolio. If your team can handle 15 signals per review cycle and you've just generated 45, you have three choices: defer assessment, do shallower assessment, or expand resources. All three have consequences.

Deferred assessment means signals sit in a queue past the point where action might have been timely. Shallow assessment means scientists are reviewing candidates for 20 minutes instead of 4 hours, which means real signals may be dismissed on inadequate evidence. Expanding resources requires headcount and budget decisions that operate on a different timescale than signal detection parameter adjustment.
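To make the capacity arithmetic concrete, here is a minimal sketch. It assumes a thorough review costs 4 hours and that capacity means 15 thorough reviews per cycle, which combines the figures above into a fixed hour budget. The function name `triage_options` and that budget framing are illustrative assumptions, not a description of any team's actual process.

```python
def triage_options(candidates: int, capacity: int, thorough_hours: float) -> dict:
    """Quantify the three responses to a candidate list that exceeds capacity.

    Assumes the review budget is capacity * thorough_hours per cycle
    and that candidates is greater than zero.
    """
    budget_hours = capacity * thorough_hours
    return {
        # Option 1: keep reviews thorough and defer the overflow.
        "deferred": max(0, candidates - capacity),
        # Option 2: review everything and split the fixed budget evenly.
        "minutes_per_candidate": 60 * budget_hours / candidates,
        # Option 3: keep every review thorough and add reviewer hours.
        "extra_hours_needed": max(0.0, candidates * thorough_hours - budget_hours),
    }


options = triage_options(candidates=45, capacity=15, thorough_hours=4.0)
print(options)
```

With these numbers, 30 candidates are deferred, or each candidate gets 80 minutes instead of 4 hours, or the team needs 120 additional reviewer hours per cycle.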

The false negative risk doesn't disappear when you lower your threshold. It shifts from the statistical screen to the human review that follows: a team drowning in false positives is less likely to give adequate attention to the real signals mixed among them.

Where Conventional Thresholds Come From

The commonly used thresholds — PRR ≥ 2 with N ≥ 3 and chi-squared ≥ 4 (Evans et al.), EBGM lower 95% CI ≥ 2, IC lower 95% CI > 0 (WHO) — were established through empirical calibration against known pharmacological signals in large historical datasets. They represent reasonable operating points on the sensitivity-specificity curve for a general signal detection application across a broad drug portfolio.
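For readers who want the Evans criteria in executable form, here is a minimal sketch computed from a 2x2 contingency table of report counts. The class name `TwoByTwo`, the function names, and the example counts are illustrative choices. The chi-squared statistic is the uncorrected Pearson version; some implementations apply Yates's continuity correction instead, which gives slightly smaller values.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Report counts for one drug-event pair (hypothetical container)."""
    a: int  # drug of interest, event of interest
    b: int  # drug of interest, all other events
    c: int  # all other drugs, event of interest
    d: int  # all other drugs, all other events


def prr(t: TwoByTwo) -> float:
    """Proportional reporting ratio: [a / (a + b)] / [c / (c + d)].

    Assumes c > 0 and non-empty rows.
    """
    return (t.a / (t.a + t.b)) / (t.c / (t.c + t.d))


def chi_squared(t: TwoByTwo) -> float:
    """Pearson chi-squared for the 2x2 table, no continuity correction."""
    n = t.a + t.b + t.c + t.d
    numerator = n * (t.a * t.d - t.b * t.c) ** 2
    denominator = (t.a + t.b) * (t.c + t.d) * (t.a + t.c) * (t.b + t.d)
    return numerator / denominator


def passes_evans(t: TwoByTwo, prr_min: float = 2.0, n_min: int = 3,
                 chi2_min: float = 4.0) -> bool:
    """True when the pair meets all three screening criteria."""
    return t.a >= n_min and prr(t) >= prr_min and chi_squared(t) >= chi2_min


# Hypothetical counts for a single drug-event pair.
example = TwoByTwo(a=12, b=488, c=300, d=49200)
print(f"PRR = {prr(example):.2f}, chi2 = {chi_squared(example):.1f}, "
      f"flagged = {passes_evans(example)}")
```

Keeping the thresholds as parameters makes it easy to see what a change such as `prr_min=1.5` does to the flagged set before adopting it.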

What they don't represent is an optimal threshold for your specific product portfolio, your specific patient population, your team's review capacity, or the current quality of your background ICSR (individual case safety report) data. They are defensible starting points, not ground truth.

Different product classes have systematically different background reporting rates. Oncology products in a heavily pre-treated population will generate case reports with complex, multi-symptom presentations and high concomitant medication rates that produce high background disproportionality scores across many drug-event pairs. Applying the same threshold to an oncology product and a first-line antihypertensive will generate very different signal candidate volumes even at comparable true signal density.

This is one reason we think product-class-specific threshold calibration deserves more attention than it typically receives in standard signal detection training. We're not saying the conventional thresholds are wrong — we're saying they're a starting point, not a destination.

The Multi-Drug Case and Why Standard Thresholds Break Down

Here I want to be precise about a specific failure mode that matters for modern FAERS (FDA Adverse Event Reporting System) data.

Traditional disproportionality methods — PRR, ROR, EBGM, IC — compute a drug-event association for each drug independently, controlling for the overall reporting rate of the event and the overall reporting rate of the drug. What they do not control for, in their standard implementations, is the co-medication structure of the underlying cases.

Consider a case series where Drug A appears with a nominally elevated PRR for event E. When you pull the underlying cases, 58% of them also list Drug B as a concomitant medication, and Drug B is known to carry a documented risk for event E. The PRR for Drug A may be entirely a product of this co-medication enrichment in your case series rather than a true Drug A signal.

At conventional thresholds, this Drug A signal will pass your screen. Your team will expend assessment effort on it. And depending on your product portfolio and patient population, this type of confounding may be endemic to your FAERS data, not occasional.
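A small numeric sketch shows how this can happen. The counts below are invented so that 58% of the Drug A cases with event E also list Drug B; they are not drawn from FAERS. Stratifying on Drug B and pooling with a Mantel-Haenszel ratio is a classical adjustment that I am using here only to expose the confound, and it handles only a co-medication you already suspect.

```python
def prr(a: int, b: int, c: int, d: int) -> float:
    """PRR for a 2x2 table: a, b = Drug A with/without E; c, d = other drugs."""
    return (a / (a + b)) / (c / (c + d))


def mantel_haenszel_ratio(tables) -> float:
    """Mantel-Haenszel pooled ratio across strata of (a, b, c, d) tables."""
    num = sum(a * (c + d) / (a + b + c + d) for a, b, c, d in tables)
    den = sum(c * (a + b) / (a + b + c + d) for a, b, c, d in tables)
    return num / den


# Hypothetical report counts, split by whether Drug B is co-reported.
strata = {
    "drug_b_present": (29, 116, 100, 400),
    "drug_b_absent": (21, 2079, 1000, 99000),
}

# Collapsing the strata gives the table a standard pairwise screen sees.
crude_table = tuple(sum(cells) for cells in zip(*strata.values()))
crude_prr = prr(*crude_table)
stratum_prr = {name: prr(*table) for name, table in strata.items()}
adjusted = mantel_haenszel_ratio(list(strata.values()))
share_with_b = strata["drug_b_present"][0] / crude_table[0]

print(f"crude PRR: {crude_prr:.2f}")
print(f"stratum PRRs: {stratum_prr}")
print(f"Mantel-Haenszel adjusted ratio: {adjusted:.2f}")
print(f"Drug A + E cases listing Drug B: {share_with_b:.0%}")
```

In this constructed example the crude PRR clears the conventional threshold of 2, while the ratio within each stratum is 1.0: Drug A reports the event no more often than other drugs once Drug B is held fixed.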

Standard threshold-lowering does nothing to address this problem — in fact, it worsens it by introducing more confounded signals into your assessment queue. The solution isn't threshold calibration; it's a different analytical approach that models the co-medication structure rather than marginalizing over it. That's what n-drug graph analysis is designed to do: evaluate the drug-event association within the full co-reporting context rather than collapsing it to a pairwise computation.

Practical Threshold-Setting in a Portfolio Context

When we think about how to actually set thresholds for a specific product, we work through a few practical questions.

First: what is the base rate of adverse event reporting for this drug class in FAERS, and how does it compare to the general database background? A drug with high media attention, an active patient advocacy community, or a recently issued FDA safety communication will have inflated reporting rates that affect all disproportionality statistics. You need to account for this when interpreting threshold-based flags.

Second: what is the expected signal density? A drug with a well-characterized safety profile and no newly identified risks should generate few genuine signals. If you're seeing 20 candidate signals per quarterly analysis, your threshold may be miscalibrated for the product, your background rates may be shifting, or there may be a genuine new safety signal in your portfolio. These are different situations requiring different responses, and your threshold-setting philosophy should allow you to distinguish them.

Third: what is the team's realistic review capacity, and what does a shallow assessment look like versus a thorough one? I'd argue that for most teams, 10-15 signals per quarterly review cycle with thorough assessment is more protective than 30-40 signals with a quick triage. The statistics of the detection method matter less than the quality of the human analysis that follows.
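One way to ground the capacity question is to count how many candidates each threshold would have produced on a past cycle's scores before committing to it. The sketch below does that for any list of disproportionality scores. The function name `candidates_at` and the sample scores are hypothetical.

```python
from bisect import bisect_left


def candidates_at(scores, thresholds):
    """Return {threshold: number of scores at or above it}."""
    ordered = sorted(scores)
    # bisect_left finds the first position whose score is >= the threshold.
    return {t: len(ordered) - bisect_left(ordered, t) for t in thresholds}


# Hypothetical PRR values for one product's drug-event pairs.
past_scores = [0.8, 1.1, 1.5, 1.6, 1.9, 2.0, 2.4, 3.1]
volume = candidates_at(past_scores, thresholds=[1.5, 2.0, 2.5])
print(volume)
```

Comparing these counts with the number of candidates the team can assess thoroughly shows which operating points are realistic for that product.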

Specificity Gains Without Sensitivity Loss: The Case for Better Upstream Analysis

The honest answer to the sensitivity-specificity tradeoff in practice is that you cannot improve both at once by adjusting a single threshold parameter: moving the threshold only trades one for the other. What you can do is improve the analytical method so that the signal candidate list has higher positive predictive value before it reaches the assessment queue. You screen more candidates out not by raising the threshold but by running a more discriminating analysis.
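Positive predictive value is what links specificity to the review queue. Here is a minimal sketch of the standard formula, PPV = TP / (TP + FP), expressed in terms of sensitivity, specificity, and prevalence. The input values are made-up illustrations, not measurements from any dataset or method.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value: share of flagged pairs that are true signals."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical: 2% of screened drug-event pairs are genuine signals.
baseline = ppv(sensitivity=0.80, specificity=0.95, prevalence=0.02)
improved = ppv(sensitivity=0.80, specificity=0.98, prevalence=0.02)
print(f"PPV at 95% specificity: {baseline:.2f}")
print(f"PPV at 98% specificity: {improved:.2f}")
```

Under these assumed values, the same sensitivity yields roughly 25% versus 45% genuine signals among the flagged candidates, because at low prevalence a small gain in specificity removes a large share of the false positives.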

A graph-aware approach to signal detection — one that identifies drug-event associations that are anomalous within the full co-medication context of reported cases rather than in pairwise isolation — generates candidate signals with higher prior probability of being genuine, because the analysis has already controlled for the most common confound in polypharmacy FAERS data. The threshold question doesn't disappear, but the distribution of candidates above the threshold shifts toward genuine signals and away from co-medication artifacts.

This is the direction we've pushed our own analysis at TrialVyx. The 3-month early detection window we see relative to conventional disproportionality analysis does not come from lower thresholds. It comes from the graph structure capturing interaction patterns that pairwise methods register as noise until case counts become large enough to overcome the confounding. Getting earlier, more specific signal candidates at the same threshold is a better outcome than getting earlier, noisier candidates at a lower threshold.

The goal is a detection system calibrated to your product, your population, and your team's review capacity — not one that optimizes a single statistic in isolation from the operational reality of who reads the output and what they can do with it.