Researchers at EMBL-EBI in Cambridge and the German Cancer Research Center (DKFZ) have introduced Delphi-2M, a generative transformer trained on health records that estimates individual disease risks over time and simulates possible future trajectories.
According to the study published in Nature, the model learns sequences of diagnoses and basic health information from large databases and predicts both the probability of more than 1,000 conditions and when they might occur.
Delphi-2M was first trained on a little over 400,000 participants from the UK Biobank and then evaluated, without fine-tuning, on data from 1.93 million people in Denmark’s national health registers. The team emphasizes that the system delivers probabilities and trends, not medical certainties or causal explanations. For now, it should be seen mainly as proof of concept.
How the model works
Technically, Delphi-2M adapts a GPT-style transformer to medical timelines: instead of a sequence of words, it processes a sequence of life events. Importantly, it predicts not just what might happen (the next diagnosis), but also when.
Inputs include a person’s medical history as a list of ICD-10 diagnoses with age at first occurrence, plus basic demographic and lifestyle factors such as sex, BMI, smoking, and alcohol use. The model processes these along a timeline and outputs daily hazard rates: for each of more than 1,000 conditions, and for death, the probability per day that the event occurs.
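To make the idea of a daily hazard rate concrete, the sketch below converts a rate into the probability of at least one event within a given horizon. It assumes the rate stays constant over that horizon, which is a simplification for illustration and not something the study states; the function name and the example rate are hypothetical.

```python
import math

def risk_over_horizon(daily_hazard: float, days: int) -> float:
    """Probability of at least one event within `days`, assuming a constant daily hazard."""
    if daily_hazard < 0 or days < 0:
        raise ValueError("hazard and horizon must be non-negative")
    return 1.0 - math.exp(-daily_hazard * days)

# Hypothetical rate, chosen only for illustration (not from the study).
one_year = risk_over_horizon(1e-4, 365)
ten_years = risk_over_horizon(1e-4, 3650)
print(f"1-year risk: {one_year:.3f}, 10-year risk: {ten_years:.3f}")
```

Under this assumption, risk does not grow linearly with the horizon: the longer the window, the more the cumulative probability flattens toward 1.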
It also estimates the expected time until the next event and can use these estimates to simulate complete possible future trajectories. Predictions are updated whenever new patient information is added.
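One common way to turn per-condition rates into a sampled next event is a competing-exponentials scheme: the waiting time is drawn from an exponential distribution with the summed rate, and the event type is drawn in proportion to each rate. The sketch below assumes that scheme with fixed, made-up rates. In a model like Delphi-2M the rates would be recomputed after every new event, and the article does not say whether it samples in exactly this way; all names here are hypothetical.

```python
import random

def sample_next_event(daily_rates: dict[str, float], rng: random.Random) -> tuple[str, float]:
    """Draw (event, days_until_event) from competing exponential clocks."""
    total = sum(daily_rates.values())
    if total <= 0:
        raise ValueError("at least one rate must be positive")
    wait_days = rng.expovariate(total)  # time until the first event of any kind
    names = list(daily_rates)
    event = rng.choices(names, weights=[daily_rates[n] for n in names], k=1)[0]
    return event, wait_days

def simulate_trajectory(age_years, daily_rates, max_age_years, rng):
    """Sample events until death or the age limit; rates are held fixed for simplicity."""
    trajectory = []
    age = age_years
    remaining = dict(daily_rates)
    while remaining:
        event, wait_days = sample_next_event(remaining, rng)
        age += wait_days / 365.25
        if age >= max_age_years:
            break
        trajectory.append((round(age, 1), event))
        if event == "death":
            break
        del remaining[event]  # each diagnosis is recorded at first occurrence only
    return trajectory

# Hypothetical daily rates, not taken from the study.
rates = {"hypertension": 2e-4, "type 2 diabetes": 1e-4, "death": 5e-5}
print(simulate_trajectory(60.0, rates, 80.0, random.Random(0)))
```

Running the simulation many times with different seeds yields a distribution of possible futures rather than a single forecast, which matches the article's framing of probabilities and trends.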
To handle long gaps in medical records, the model inserts neutral placeholders. Code and documentation are available on GitHub, though the model itself is restricted under UK Biobank data access rules.
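The placeholder idea can be sketched as follows. This version assumes a fixed five-year threshold and a token called NO_EVENT; both are hypothetical, since the article does not say how the placeholders are named or spaced.

```python
NO_EVENT = "NO_EVENT"  # hypothetical placeholder token name

def fill_gaps(events, max_gap_years=5.0):
    """Insert placeholders so consecutive tokens are at most max_gap_years apart.

    `events` is a list of (age_years, code) pairs sorted by age.
    """
    if max_gap_years <= 0:
        raise ValueError("max_gap_years must be positive")
    filled = []
    for age, code in events:
        if filled:
            last_age = filled[-1][0]
            # Step forward in fixed increments until the gap is short enough.
            while age - last_age > max_gap_years:
                last_age += max_gap_years
                filled.append((last_age, NO_EVENT))
        filled.append((age, code))
    return filled

# ICD-10 codes: I10 (essential hypertension), E11 (type 2 diabetes).
history = [(41.0, "I10"), (58.0, "E11")]
print(fill_gaps(history))
```

With placeholders in the sequence, the model sees explicit evidence that time passed without a recorded diagnosis, instead of two events that merely appear next to each other.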
How accurate are the predictions?
In internal testing, Delphi-2M outperformed chance by a wide margin for nearly all diseases, with particularly strong results in predicting short-term mortality. Accuracy decreased somewhat over longer horizons, but underlying trends held up even after 10 years.
External validation on Danish health data showed only a slight performance drop compared to UK Biobank results, suggesting the approach could generalize to larger populations and more diverse datasets, particularly with more training data or larger models.
Potential applications and timeline
The researchers see immediate potential in public health planning. Aggregated predictions could help estimate regional or demographic disease burdens more accurately. For use with individual patients, they expect a five-to-ten-year timeline given regulatory hurdles, according to the Financial Times.
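The aggregation step can be illustrated with a short sketch: summing individual risk estimates within a group gives the expected number of new cases for that group. The regions and probabilities below are invented for illustration and do not come from the study.

```python
from collections import defaultdict

def expected_cases_by_group(predictions):
    """Sum individual risks per group; each sum is the expected number of cases."""
    totals = defaultdict(float)
    for group, risk in predictions:
        if not 0.0 <= risk <= 1.0:
            raise ValueError(f"risk must be a probability, got {risk}")
        totals[group] += risk
    return dict(totals)

# Hypothetical (region, individual risk) pairs.
predictions = [
    ("North", 0.10), ("North", 0.25),
    ("South", 0.05), ("South", 0.40), ("South", 0.15),
]
for region, cases in expected_cases_by_group(predictions).items():
    print(f"{region}: {cases:.2f} expected cases")
```

Because individual errors partly average out in the sum, group-level estimates of this kind can be useful even when single-patient predictions are too uncertain to act on.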
The model performed best on conditions with clearer progression patterns, such as cardiovascular disease, diabetes, and sepsis. It was less reliable for rare congenital disorders and for diagnoses strongly shaped by external factors. The team is exploring ways to integrate other data layers such as genomics and proteomics. Core methods for combining risk and time modeling have been patented.
Limits and open questions
The study also makes clear the system’s limitations. The UK Biobank skews toward healthier, better-educated participants aged 40 to 70. Deaths before enrollment are absent, and very old age groups are underrepresented. Diagnoses also come from mixed sources, including self-reports, primary care, hospitals, and registers.
Such gaps can bias predictions. For instance, if a patient’s records lack hospital data, the model underpredicts conditions that are mostly diagnosed in hospitals. Conversely, for individuals with a hospitalization history, it predicts such conditions far more often. Sepsis, for which 93 percent of diagnoses are coded in hospitals, was predicted about eight times more frequently in these cases. These patterns partly reflect real care pathways, but they also introduce artifacts.
For this reason, the authors explicitly caution against causal interpretations. Models like Delphi-2M should be seen as a complement to, not a replacement for, clinical judgment.