The Investigator CV Problem
Ask any clinical operations team to walk through their site selection process and you'll hear roughly the same answer: we look at investigator experience, historical enrollment performance on prior trials, site infrastructure capacity, regulatory compliance history, and therapeutic area expertise. These are valid inputs. The problem is that they are collectively backward-looking in a way that systematically underweights the single most important forward-looking factor: does this site have the right patient population right now, for this protocol?
An investigator with a strong Phase III oncology curriculum vitae may have excellent capabilities and a historically high enrollment velocity. But if their current patient panel skews toward treatment-experienced patients while the protocol requires treatment-naive first-line candidates, that CV-based advantage evaporates. The historical performance signal and the current population fit signal can point in opposite directions — and conventional site selection frameworks have no reliable way to distinguish between them.
AI-derived signals from EHR and patient population data offer a different class of information: prospective, protocol-specific, population-level. The question isn't just "has this site enrolled well before?" but "does this site have the specific patient phenotype this protocol requires, in sufficient volume, right now?"
What AI-Based Site Feasibility Scoring Measures
Site feasibility scoring, as applied by EHR-integrated platforms, typically combines several data categories into a composite score:
- Phenotype prevalence: The estimated count of patients in the site's patient population who match the target phenotype, derived from de-identified EHR data. This is the most direct signal for enrollment potential and the one most commonly absent from traditional feasibility questionnaires.
- Eligibility conversion rate estimate: Based on the site's historical ratio of enrolled to screened patients in comparable prior trials. This captures site-specific factors (coordinator quality, investigator rigor, patient proximity) that affect conversion independent of patient population size.
- Competitive trial exposure: How many concurrent trials in overlapping patient populations is this site currently running? Sites with high competitive exposure show lower enrollment velocities per trial even when their patient populations are large, because coordinator bandwidth and patient referral channels are allocated across multiple studies.
- Phenotype data completeness: What is the structured data completeness rate for the key eligibility criteria at this site? A site with a large target population but poorly structured EHR data — missing lab values, incomplete medication records — will be harder to match against systematically, regardless of the actual patient volume.
The composite of these signals produces a feasibility score that correlates with enrollment velocity more reliably than investigator experience alone. That doesn't make investigator experience irrelevant — site quality matters enormously for protocol compliance, patient safety, and data quality. It means that population fit and site quality should be evaluated as separate dimensions, not collapsed into a single "investigator track record" score.
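As a minimal sketch of how the four signals above might be combined, the following Python computes a weighted composite per site. The field names, the caps used for normalization, and the weights are all illustrative assumptions, not calibrated values from any platform.

```python
from dataclasses import dataclass


@dataclass
class SiteSignals:
    """Hypothetical per-site inputs; field names are illustrative."""
    site_id: str
    phenotype_count: int      # estimated patients matching the target phenotype
    conversion_rate: float    # historical enrolled / screened, in [0, 1]
    competing_trials: int     # concurrent trials in overlapping populations
    data_completeness: float  # share of key eligibility fields populated, in [0, 1]


def feasibility_score(s: SiteSignals, count_cap: int = 200, trial_cap: int = 5) -> float:
    """Weighted composite in [0, 1]. Weights and caps are assumptions."""
    prevalence = min(s.phenotype_count, count_cap) / count_cap
    # More competing trials lowers the score; zero competing trials gives 1.0.
    competition = 1.0 - min(s.competing_trials, trial_cap) / trial_cap
    return (0.40 * prevalence
            + 0.25 * s.conversion_rate
            + 0.15 * competition
            + 0.20 * s.data_completeness)


# Toy inputs, invented for illustration only.
sites = [
    SiteSignals("site_a", phenotype_count=150, conversion_rate=0.30,
                competing_trials=4, data_completeness=0.60),
    SiteSignals("site_b", phenotype_count=60, conversion_rate=0.50,
                competing_trials=0, data_completeness=0.90),
]
ranked = sorted(sites, key=feasibility_score, reverse=True)
for s in ranked:
    print(s.site_id, round(feasibility_score(s), 3))
```

In this toy case the site with the smaller phenotype count ranks first, because low competitive exposure and better data completeness outweigh raw volume. Site quality factors such as protocol compliance are deliberately left out of the score so they can be assessed as a separate dimension.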
A Worked Example: Site Activation in a CNS Trial
Consider a scenario in the early-stage CNS therapeutic space: a Phase II trial in mild-to-moderate Alzheimer's disease requiring a specific cognitive score range (MMSE 18-26) at baseline, no use of investigational treatments in the prior 12 months, and MRI-confirmed absence of significant white matter disease. These three criteria together describe a patient population that is genuinely difficult to find through passive referral channels.
A mid-size biopharma sponsor running this program activated 12 sites based on standard feasibility questionnaire responses. Six months in, three sites had enrolled no patients. Four more sites were tracking significantly below their projected enrollment rates. The underperforming sites had all indicated on their feasibility questionnaires that they had "high patient flow" in memory care and neurology. That was technically accurate, but not specific to the narrow phenotype the protocol required.
A retrospective EHR phenotype analysis of the underperforming sites showed why. Three requirements compounded. MMSE scores had to be documented, which depends on a structured cognitive assessment at baseline and at regular intervals, and not all sites performed one consistently. Absence of investigational treatment history had to be established from medication records that documented it inconsistently. MRI findings needed specific white matter characterization. As a result, fewer than 40% of patients who appeared eligible on diagnosis codes alone actually met the full phenotype. The feasibility questionnaires captured none of this specificity.
Three sites that the feasibility questionnaire process had not initially prioritized showed better phenotype prevalence when the EHR data was analyzed, because they served patient populations that had been longitudinally followed in memory clinics with structured cognitive assessment protocols. Those three sites, added four months into the program, outperformed the original activation cohort on a per-site basis.
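The gap between patients who look eligible on diagnosis codes and patients who meet the full phenotype can be sketched as a simple funnel. The record fields below are hypothetical, the sample records are invented for illustration, and the handling of missing values (an undocumented MMSE or uncharacterized MRI counts as not confirmed; an undocumented treatment history counts as no known use) is an assumption that a real analysis would need to revisit.

```python
from datetime import date, timedelta
from typing import Optional, TypedDict


class PatientRecord(TypedDict):
    """Hypothetical de-identified record; field names are illustrative."""
    has_ad_diagnosis_code: bool
    mmse: Optional[int]                      # None if no structured score on file
    last_investigational_tx: Optional[date]  # None if none documented
    mri_significant_wmd: Optional[bool]      # None if MRI finding not characterized


def meets_full_phenotype(p: PatientRecord, as_of: date) -> bool:
    """Apply the three protocol criteria from the example."""
    if not p["has_ad_diagnosis_code"]:
        return False
    if p["mmse"] is None or not (18 <= p["mmse"] <= 26):
        return False
    last_tx = p["last_investigational_tx"]
    if last_tx is not None and last_tx > as_of - timedelta(days=365):
        return False
    # Only an explicit "no significant white matter disease" finding passes.
    return p["mri_significant_wmd"] is False


def phenotype_funnel(records: list[PatientRecord], as_of: date) -> tuple[int, int]:
    """Return (count matching diagnosis code, count matching full phenotype)."""
    by_code = [p for p in records if p["has_ad_diagnosis_code"]]
    full = [p for p in by_code if meets_full_phenotype(p, as_of)]
    return len(by_code), len(full)


as_of = date(2024, 1, 1)
records: list[PatientRecord] = [
    {"has_ad_diagnosis_code": True, "mmse": 22, "last_investigational_tx": None, "mri_significant_wmd": False},
    {"has_ad_diagnosis_code": True, "mmse": None, "last_investigational_tx": None, "mri_significant_wmd": False},
    {"has_ad_diagnosis_code": True, "mmse": 20, "last_investigational_tx": date(2023, 6, 1), "mri_significant_wmd": False},
    {"has_ad_diagnosis_code": True, "mmse": 24, "last_investigational_tx": None, "mri_significant_wmd": None},
    {"has_ad_diagnosis_code": True, "mmse": 15, "last_investigational_tx": None, "mri_significant_wmd": False},
    {"has_ad_diagnosis_code": False, "mmse": 23, "last_investigational_tx": None, "mri_significant_wmd": False},
]
print(phenotype_funnel(records, as_of))
```

Each record that drops out does so for a different reason: a missing score, a recent investigational treatment, an uncharacterized MRI, or a score out of range. This is why a "high patient flow" answer on a questionnaire says little about the narrow phenotype.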
The Limits of AI-Derived Feasibility Signals
It is worth being direct about where AI-based site selection signals have real limitations, because overstating the case leads to bad operational decisions.
First, phenotype prevalence from EHR data reflects patients who have already been in that health system's care. It doesn't capture the full addressable patient population in a geographic catchment, particularly at community sites with lower referral network breadth. A site with a small recorded phenotype population may still be an effective enrollment site if it sits at a referral hub for a specialty with low documentation rates in the EHR.
Second, EHR data quality varies substantially across sites and health systems. At a site where relevant lab values, vital signs, and medication fields are systematically missing or incomplete, a phenotype query will undercount the eligible population. This is a data quality problem, not a patient population problem, and a feasibility score built on incomplete data will understate the site's true potential.
Third, AI signals don't capture investigator relationships, sub-investigator training quality, IRB review speed, site coordinator tenure, or the dozens of other operational factors that determine whether an enrolled patient becomes a protocol-compliant, retained participant. These factors still require direct feasibility assessment and cannot be derived from EHR population data.
We're not saying AI-based feasibility scoring should replace site assessment — we're saying it should augment it as a distinct layer of evidence that addresses the population-fit question, which traditional assessment methods address poorly.
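The data quality limitation above can be checked before a phenotype count is trusted. The sketch below computes per-field completeness for the key eligibility fields and flags fields that fall under a threshold. The field names, the sample records, and the 0.8 threshold are assumptions chosen for illustration.

```python
from typing import Any, Mapping, Sequence


def field_completeness(records: Sequence[Mapping[str, Any]],
                       key_fields: Sequence[str]) -> dict[str, float]:
    """Share of records with a non-missing value for each key eligibility field."""
    if not records:
        return {f: 0.0 for f in key_fields}
    return {
        f: sum(r.get(f) is not None for r in records) / len(records)
        for f in key_fields
    }


def undercount_risk(completeness: Mapping[str, float],
                    threshold: float = 0.8) -> list[str]:
    """Fields whose completeness is below the (assumed) threshold."""
    return sorted(f for f, rate in completeness.items() if rate < threshold)


# Toy records, invented for illustration only.
site_records = [
    {"mmse": 22, "mri_finding": "no significant WMD"},
    {"mmse": None, "mri_finding": "no significant WMD"},
    {"mmse": 19, "mri_finding": "no significant WMD"},
    {"mri_finding": "no significant WMD"},
]
completeness = field_completeness(site_records, ["mmse", "mri_finding"])
print(completeness)
print(undercount_risk(completeness))
```

A site flagged this way is a candidate for direct follow-up rather than exclusion, since a low count there may reflect missing documentation instead of missing patients.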
Integrating Population Signals Into the Site Selection Process
The most operationally tractable approach is a two-stage feasibility process. In the first stage, a population-fit analysis based on EHR phenotype signals ranks a longer list of candidate sites by estimated patient volume and phenotype data quality. This typically happens before outreach to investigators, using de-identified population data available through health system data use agreements or through clinical data network partnerships.
In the second stage, the top-ranked sites from the population analysis receive full feasibility assessment: investigator outreach, site visit, IRB timeline assessment, competitive trial inventory, and coordinator capacity review. Concentrating assessment resources on sites that have already been validated for population fit is significantly more efficient than applying the full feasibility process to a broad, undifferentiated list.
The time to meaningful enrollment data under this approach — the point at which sponsors have enough site-level velocity to identify underperformers and reallocate — typically compresses, because the sites that activate are more likely to have the patient population to sustain enrollment from the outset. The back-end work of identifying and replacing poor performers shrinks accordingly.
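The first stage of the process described above could look roughly like the following sketch. It ranks candidate sites by estimated phenotype count and takes the top `top_k` for full assessment. Sites with low data completeness are routed to a manual-review list instead of being dropped, since their counts are likely understated. The `Candidate` fields, the sample values, and the 0.7 completeness cutoff are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """Hypothetical stage-one inputs per candidate site."""
    site_id: str
    est_phenotype_count: int
    data_completeness: float  # in [0, 1]


def stage_one(candidates: list[Candidate], top_k: int,
              min_completeness: float = 0.7) -> tuple[list[Candidate], list[Candidate]]:
    """Return (shortlist for full feasibility assessment, sites needing manual review)."""
    reliable = [c for c in candidates if c.data_completeness >= min_completeness]
    review = [c for c in candidates if c.data_completeness < min_completeness]
    ranked = sorted(reliable, key=lambda c: c.est_phenotype_count, reverse=True)
    return ranked[:top_k], review


# Toy candidates, invented for illustration only.
candidates = [
    Candidate("s1", est_phenotype_count=120, data_completeness=0.90),
    Candidate("s2", est_phenotype_count=40, data_completeness=0.95),
    Candidate("s3", est_phenotype_count=200, data_completeness=0.40),
    Candidate("s4", est_phenotype_count=85, data_completeness=0.80),
]
shortlist, needs_review = stage_one(candidates, top_k=2)
print([c.site_id for c in shortlist])
print([c.site_id for c in needs_review])
```

The shortlist then goes through the second stage (investigator outreach, site visit, IRB timeline, and so on), which this sketch does not attempt to model.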
The Selection Decision Is the Enrollment Decision
There is a common tendency in clinical operations to treat enrollment and site selection as separate problems — selection happens once at the beginning, enrollment is an ongoing operational challenge to be managed throughout. This framing understates how thoroughly selection determines enrollment outcomes. The patient population at a site is fixed at the time of selection; the competitive trial landscape at a site is largely fixed; the investigator's current patient panel is what it is.
Adding population-level intelligence to site selection doesn't make enrollment guaranteed — protocols are complex, patients are people with their own lives and constraints, and operational execution still matters. What it does is ensure that the enrollment challenge you face is resource allocation and retention, not a fundamental mismatch between where you've put your sites and where the patients are. That's a much more solvable problem.