What Makes a Good Patient-Trial Match? A Look at Eligibility Criteria Mapping

Eligibility criteria in clinical protocols are written for regulators, not algorithms. Making them machine-interpretable is an unsolved problem — here is how we approach it.


The Gap Between Regulatory Language and Computational Logic

Clinical trial eligibility criteria are written to satisfy two audiences simultaneously: regulators who need precise scientific and safety justification for every restriction, and clinicians who need to apply those restrictions in a real patient encounter. Neither of those audiences is an algorithm. The language reflects this: criteria are written in natural language, with embedded clinical assumptions, implicit temporal context, and terminology that depends on shared clinical knowledge to interpret correctly.

"No evidence of active hepatic impairment" is a reasonable eligibility criterion. It is not, however, a computable specification. "Evidence" is an epistemic concept. "Active" is a temporal concept with no defined boundary in the criterion itself. "Hepatic impairment" has multiple definitions depending on whether you're using Child-Pugh scoring, CTCAE grading, or a set of specific lab value thresholds. An algorithm that treats this criterion as a keyword lookup will produce results that are systematically inaccurate in ways that are difficult to detect from outcomes data alone.

This language-to-logic translation problem is where most AI patient matching tools fail quietly — not catastrophically, but persistently, in ways that inflate screen failure rates and erode sponsor confidence in algorithmic pre-screening outputs.

The Four Layers of Eligibility Criteria Complexity

Eligibility criteria span four distinct layers of computational complexity, and a patient-trial matching system needs to handle all of them reliably to perform at clinical standards.

Layer 1: Structured Field Matching

The simplest criteria map directly to coded EHR fields: age within a range, diagnosis code present, specific lab value above a threshold. These criteria can be evaluated by SQL-style queries against structured EHR data with high reliability, assuming the underlying data is complete and accurately coded. In practice, structured completeness rates for key clinical fields vary substantially by site, health system, and therapeutic area — but when the data exists, structured matching is fast and reliable.
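As a rough illustration of Layer 1, the sketch below evaluates three structured criteria against a flattened patient record. The PatientRecord shape, the field names, and the thresholds (age 18 to 75, ICD-10 code I50.9, eGFR of at least 30) are assumptions made for the example, not values from any protocol or from our data model.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class PatientRecord:
    """Hypothetical flattened view of structured EHR fields."""
    birth_date: date
    diagnosis_codes: set[str] = field(default_factory=set)       # e.g. ICD-10 codes
    latest_labs: dict[str, float] = field(default_factory=dict)  # lab name -> value


def age_on(birth_date: date, on: date) -> int:
    """Age in whole years on a given date."""
    before_birthday = (on.month, on.day) < (birth_date.month, birth_date.day)
    return on.year - birth_date.year - int(before_birthday)


def meets_structured_criteria(p: PatientRecord, screening_date: date) -> bool:
    # Illustrative thresholds only; a real protocol supplies its own.
    age_ok = 18 <= age_on(p.birth_date, screening_date) <= 75
    dx_ok = "I50.9" in p.diagnosis_codes
    egfr = p.latest_labs.get("egfr")
    # A missing lab is treated as "not met" here for simplicity.
    lab_ok = egfr is not None and egfr >= 30.0
    return age_ok and dx_ok and lab_ok
```

Treating a missing lab as a failed criterion keeps this sketch short, but it hides the difference between "value too low" and "value never recorded," which matters once structured completeness varies by site.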

Layer 2: Temporal Logic

Many criteria include temporal constraints that require sequence reasoning: "diagnosis of condition X at least 6 months prior to screening," "no use of medication Y within 3 months," "ECOG performance status documented within 28 days." These cannot be resolved by a point-in-time lookup. They require reasoning across the patient's longitudinal record, with correct date arithmetic that accounts for the specific anchor date (typically the screening visit date) specified in the protocol.

Temporal logic is where many matching systems make systematic errors. The anchor date is often not explicitly specified in the data model, requiring the matching algorithm to infer it. Date-shifted records (de-identified using HIPAA-compliant date shifting) maintain relative temporal intervals but not absolute dates, which complicates anchor-date-dependent temporal logic. These are solvable engineering problems, but they require deliberate design rather than generic clinical NLP application.
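The three example criteria above can be sketched as date arithmetic against an explicit anchor date. This is a minimal illustration: the function names are hypothetical, and the boundary conventions (whether the window endpoints count) are assumptions that a real protocol would have to settle. Because every check depends only on intervals relative to the anchor, the same logic applies to date-shifted records as long as the anchor is shifted by the same offset.

```python
from collections.abc import Iterable
from datetime import date, timedelta


def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping to the month's last day."""
    year, month_index = divmod(d.year * 12 + (d.month - 1) + months, 12)
    month = month_index + 1
    if month == 12:
        last_day = 31
    else:
        last_day = (date(year, month + 1, 1) - timedelta(days=1)).day
    return date(year, month, min(d.day, last_day))


def diagnosed_at_least_months_before(dx_date: date, anchor: date, months: int) -> bool:
    """'Diagnosis at least N months prior to screening' (boundary date counts)."""
    return dx_date <= add_months(anchor, -months)


def no_exposure_within_months(exposures: Iterable[date], anchor: date, months: int) -> bool:
    """'No use of medication within N months' before the anchor date."""
    window_start = add_months(anchor, -months)
    return not any(window_start < d <= anchor for d in exposures)


def documented_within_days(doc_date: date, anchor: date, days: int) -> bool:
    """'Documented within N days' before (or on) the anchor date."""
    return timedelta(0) <= anchor - doc_date <= timedelta(days=days)
```

Passing the anchor in explicitly, rather than letting each check infer it, is the design choice that matters: it forces the screening date to be resolved once, in one place.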

Layer 3: Negation and History Attribution

Criteria involving negation, such as "no prior malignancy in the last 5 years" or "no contraindicated medications," require not just identifying whether a concept appears in the record, but correctly attributing it as applying to the patient (rather than a family member), as current (rather than historical), and as confirmed (rather than suspected or ruled out). Clinical NLP research has established that negation detection is one of the most challenging and consequential tasks in clinical text processing. Negation scope varies widely by documentation style, and errors compound: when a negated mention (for example, "denies history of malignancy") is misread as an affirmed finding, the system treats the exclusion as triggered and systematically excludes eligible patients.
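One way to picture the attribution problem is to assume an upstream NLP step has already labeled each mention with an experiencer, a certainty, and an event date, and then to ask which mentions should actually count against a "no prior X within N years" criterion. The Mention structure and its labels below are hypothetical; producing those labels accurately is the hard part and is not shown.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Experiencer(Enum):
    PATIENT = "patient"
    FAMILY = "family"


class Certainty(Enum):
    CONFIRMED = "confirmed"
    SUSPECTED = "suspected"
    NEGATED = "negated"  # "denies", "ruled out", "no evidence of"


@dataclass
class Mention:
    """Hypothetical output of an upstream assertion-classification step."""
    concept: str
    experiencer: Experiencer
    certainty: Certainty
    event_date: date


def years_before(anchor: date, years: int) -> date:
    try:
        return anchor.replace(year=anchor.year - years)
    except ValueError:  # 29 Feb anchor, non-leap target year
        return anchor.replace(year=anchor.year - years, day=28)


def is_qualifying_mention(m: Mention, concept: str, anchor: date, lookback_years: int) -> bool:
    """A mention counts only if it is about the patient, confirmed, and in the window."""
    return (
        m.concept == concept
        and m.experiencer is Experiencer.PATIENT
        and m.certainty is Certainty.CONFIRMED
        and years_before(anchor, lookback_years) <= m.event_date <= anchor
    )


def violates_no_prior(mentions: list[Mention], concept: str, anchor: date, lookback_years: int) -> bool:
    """True if the 'no prior <concept> in the last N years' criterion is violated."""
    return any(is_qualifying_mention(m, concept, anchor, lookback_years) for m in mentions)
```

A keyword lookup would flag all of the family, negated, and out-of-window mentions; the filter above flags none of them.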

Layer 4: Derived and Composite Criteria

Some eligibility criteria cannot be evaluated from any single data element; they require deriving a clinical state from a combination of inputs. "Adequate bone marrow function defined as ANC ≥ 1.5 × 10⁹/L, platelet count ≥ 75 × 10⁹/L, hemoglobin ≥ 9 g/dL" requires three separate lab values, each with its own unit-conversion and threshold check. Organ function criteria in oncology trials often have 6-8 component lab value checks. The composite result ("adequate" function) must be derived from all components simultaneously, and missing any single component should not automatically disqualify the patient; it should trigger a data completeness flag rather than a match failure.
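The bone marrow example can be sketched with a three-valued result, so that a missing component surfaces as a completeness flag instead of a failure. The thresholds come from the criterion quoted above; the input shape, the unit labels, and the rule that a definite failure on one component outweighs a gap on another are assumptions of this sketch.

```python
from enum import Enum


class Result(Enum):
    MET = "met"
    NOT_MET = "not_met"
    INCOMPLETE = "incomplete"  # data gap: flag for review, do not fail


# component -> (minimum value, canonical unit), per the quoted criterion
BONE_MARROW_THRESHOLDS = {
    "anc": (1.5, "10^9/L"),
    "platelets": (75.0, "10^9/L"),
    "hemoglobin": (9.0, "g/dL"),
}

# (reported unit, canonical unit) -> multiplicative factor
TO_CANONICAL = {
    ("10^9/L", "10^9/L"): 1.0,
    ("cells/uL", "10^9/L"): 0.001,  # 1000 cells/uL equals 1.0 x 10^9/L
    ("g/dL", "g/dL"): 1.0,
    ("g/L", "g/dL"): 0.1,
}


def evaluate_component(name: str, labs: dict[str, tuple[float, str]]) -> Result:
    minimum, canonical_unit = BONE_MARROW_THRESHOLDS[name]
    reading = labs.get(name)
    if reading is None:
        return Result.INCOMPLETE
    value, reported_unit = reading
    factor = TO_CANONICAL.get((reported_unit, canonical_unit))
    if factor is None:
        return Result.INCOMPLETE  # unknown unit: flag rather than guess
    return Result.MET if value * factor >= minimum else Result.NOT_MET


def adequate_bone_marrow(labs: dict[str, tuple[float, str]]) -> tuple[Result, dict[str, Result]]:
    """Return the composite result plus the per-component breakdown."""
    parts = {name: evaluate_component(name, labs) for name in BONE_MARROW_THRESHOLDS}
    if Result.NOT_MET in parts.values():
        overall = Result.NOT_MET      # a definite failure decides the composite
    elif Result.INCOMPLETE in parts.values():
        overall = Result.INCOMPLETE   # otherwise a gap triggers a completeness flag
    else:
        overall = Result.MET
    return overall, parts
```

Returning the per-component breakdown alongside the composite lets a coordinator see which lab is missing rather than only that the composite could not be resolved.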

The Pre-Screening Accuracy Problem in Practice

When matching algorithms mishandle any of these four layers, the downstream effect shows up in screen failure rates. Consider a Phase III cardiovascular outcomes trial with 23 eligibility criteria. If the matching algorithm correctly handles 20 of the 23 criteria but systematically misclassifies patients on two criteria involving negation and one criterion involving temporal logic, the resulting candidate list will contain a predictable fraction of patients who will fail screening on those three criteria.

If the mismatch rate per criterion is modest (say, a 15% error rate on each of the three problematic criteria), the errors compound across criteria and meaningfully inflate the screen failure rate of the full candidate list. The sponsor sees a pre-screening tool that generates candidates; what the sponsor does not see is that the tool is selectively inaccurate on a subset of criteria. The failure mode is invisible without criterion-level accuracy auditing.
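To put a number on the compounding, here is a back-of-the-envelope calculation. It assumes the three errors are statistically independent and that the other 20 criteria are handled perfectly, neither of which is guaranteed in practice. It also counts any misclassification, so it is an upper bound on the share of candidates who go on to fail screening, since only false inclusions produce a screen failure.

```python
def fraction_with_any_error(per_criterion_error_rates: list[float]) -> float:
    """Probability that at least one criterion is misclassified for a patient,
    assuming errors on different criteria are independent."""
    p_all_correct = 1.0
    for error_rate in per_criterion_error_rates:
        p_all_correct *= 1.0 - error_rate
    return 1.0 - p_all_correct


# Three problematic criteria at 15% error each: 1 - 0.85**3 = 0.385875
affected = fraction_with_any_error([0.15, 0.15, 0.15])
print(f"{affected:.1%} of candidates affected by at least one error")
```

Under those assumptions, roughly 39% of candidates carry at least one misclassified criterion, even though each individual error rate looks modest.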

This is why criterion-level performance evaluation, which measures sensitivity and positive predictive value separately for each eligibility criterion, is more informative than aggregate match accuracy reporting. A system that achieves 82% overall match accuracy but shows 55% PPV on negation criteria poses a different operational risk than a system with 78% aggregate accuracy spread uniformly across criterion types.

Protocol-Specific Model Tuning vs. Generic Clinical NLP

There is an important distinction between generic clinical NLP models and protocol-specific matching models. Generic clinical NLP pipelines — including several well-validated open-source and commercial options — are designed to extract clinical entities, relationships, and assertions from clinical text at a general level. They perform well on standard entity types (disease names, medication names, lab values) and are appropriate for population health analytics, cohort identification at the population level, and clinical decision support where moderate precision is acceptable.

Protocol-specific matching requires a higher standard. The criteria in a specific protocol have specific thresholds, specific temporal logic, and specific clinical definitions that may not align with how the same clinical concept is handled in a generic NLP pipeline. A matching system that applies a generic NLP pipeline to protocol criteria interpretation without protocol-specific tuning will systematically mishandle the edge cases that matter most — the cases where a patient is borderline eligible, where the temporal context changes the eligibility determination, or where the protocol's definition of a clinical state diverges from standard clinical usage.

We're not saying generic NLP has no role in recruitment — it's a reasonable starting point and handles high-volume straightforward criteria well. We're saying that the performance gap between a protocol-tuned model and a generic clinical NLP application is most significant precisely in the high-value pre-screening scenarios — complex protocols with stringent biomarker criteria — where accurate matching matters most.

What Good Looks Like: Matching Quality Standards

A patient-trial matching system operating at clinical decision-support quality standards should meet several benchmarks. Sensitivity (the fraction of eligible patients correctly identified) should be evaluated separately from specificity (the fraction of ineligible patients correctly excluded). In the pre-screening context, high sensitivity is prioritized — it is worse to miss an eligible patient than to include a patient who will fail formal screening, because the cost of a missed eligible patient is enrollment delay, not just wasted coordinator time.

Systems should provide criterion-level match reasoning, not just a binary eligible/ineligible output. Coordinators need to know which criteria were matched with high confidence, which were matched with moderate confidence requiring verification, and which were flagged as unable to evaluate due to data gaps. This reasoning output is what converts a matching algorithm from a black box into a clinical decision-support tool — and it's what allows coordinators to prioritize their review effort on the criteria that actually matter rather than re-reviewing confident matches.
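As a sketch of what criterion-level reasoning output might look like, the structure below carries a status, a confidence, and a pointer to the supporting evidence for each criterion, then orders findings so that data gaps and lower-confidence matches reach the coordinator first. The field names, the three tiers, and the 0.9 cutoff are illustrative assumptions, not a description of our output schema.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    MET = "met"
    NOT_MET = "not_met"
    UNABLE_TO_EVALUATE = "unable_to_evaluate"  # data gap


@dataclass
class CriterionFinding:
    criterion_id: str
    status: Status
    confidence: float  # 0.0 to 1.0, as reported by the matching model
    evidence: str      # where the determination came from


def review_tier(finding: CriterionFinding, high: float = 0.9) -> str:
    """Bucket a finding by how much coordinator attention it needs."""
    if finding.status is Status.UNABLE_TO_EVALUATE:
        return "data_gap"
    return "high_confidence" if finding.confidence >= high else "verify"


def review_queue(findings: list[CriterionFinding]) -> list[CriterionFinding]:
    """Order findings: data gaps first, then lowest-confidence matches."""
    tier_order = {"data_gap": 0, "verify": 1, "high_confidence": 2}
    return sorted(findings, key=lambda f: (tier_order[review_tier(f)], f.confidence))
```

With this ordering, confident matches sink to the bottom of the queue, so review effort goes to the criteria that actually need a human decision.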

Finally, matching accuracy should be tracked longitudinally against screen failure outcomes. The cases where a matched patient fails formal screening should be systematically analyzed to determine whether the failure was a data quality issue, a matching algorithm error, or a genuine edge case in eligibility interpretation. That feedback loop is what drives continuous model improvement and what distinguishes a maturing clinical matching system from a static lookup tool.
