Leveraging real-world data and machine learning to identify pre-diagnosis lung cancer patients

Case study

Overview

In today’s evolving oncology landscape, early detection of cancer remains one of the most critical factors influencing patient outcomes. However, many cancers - particularly lung cancer - are often diagnosed at later stages, limiting treatment options and survival rates. Life sciences organisations are increasingly turning to real-world data and advanced analytics to uncover subtle, pre-diagnostic signals hidden within patient journeys. By leveraging large-scale longitudinal claims data and machine learning, it’s now possible to identify patients who exhibit early indicators of cancer before a formal diagnosis is made.

Challenge

A leading life sciences organisation approached Symphony Health, an ICON plc company, with a critical need: to identify potential lung cancer patients before their confirmatory diagnosis. The objective was to use this insight to support targeted deployment of their proprietary cancer diagnostic tool, helping clinicians intervene earlier and improve outcomes. 

Despite advancements in oncology, lung cancer continues to present a unique challenge due to its late-stage diagnosis in a majority of patients. The client recognised that traditional retrospective data was insufficient to support their forward-looking strategy. They needed a data-driven solution to detect subtle clinical patterns and flag high-risk patients earlier in the care continuum - specifically, those who may be eligible for diagnostic testing but had not yet been definitively diagnosed. 

They turned to Symphony Health to leverage the vast, longitudinal data available within the Integrated Dataverse (IDV®), a comprehensive and de-identified claims dataset covering over 300 million patient lives across the United States. The goal: unlock early indicators of lung cancer and build a predictive pipeline to identify patients who were likely harboring the disease but had yet to be diagnosed.

Solution

To meet this challenge, Symphony deployed a three-pronged approach integrating real-world data science, clinical domain expertise, and machine learning innovation.

  • Defining the “end event” & pre-diagnostic journey: The team first established the “end event” as the formal lung cancer diagnosis code observed in the IDV® dataset. From there, the team reconstructed each patient’s pre-diagnostic journey, mapping claims data for clinical activity, diagnostic procedures, prescription fills, comorbidities, and other markers that could serve as early red flags.
  • Constructing cancer-specific business rules: With input from oncology experts, Symphony developed lung cancer-specific business rules, including tailored lookback periods to capture relevant activity leading up to a diagnosis. This rule set included codes for common signs (e.g., persistent cough, unexplained weight loss), imaging orders (e.g., chest X-rays, CT scans), referrals to pulmonologists, and prescribed treatments that could suggest emerging disease patterns.
  • Building & validating machine learning models: Multiple machine learning algorithms were then applied - including random forest, gradient boosting machines, and logistic regression - to model patients likely to have undiagnosed lung cancer. These models were trained on large cohorts and evaluated against real-world patient samples to ensure both predictive performance and clinical relevance. 
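The first two steps above can be sketched in code. This is a minimal illustration, not Symphony's implementation: it assumes each patient's claims arrive as (service date, code) pairs, treats the earliest claim carrying an ICD-10 C34 code as the end event, and counts rule-matched activity within a 365-day lookback window. The code-to-category mapping, the function name build_features, and the window length are hypothetical; the actual business rules are not given in this case study.

```python
from collections import Counter
from datetime import date, timedelta

# Assumption: the end event is any diagnosis code under ICD-10 C34
# (malignant neoplasm of bronchus and lung).
END_EVENT_PREFIX = "C34"

# Illustrative rule set only; the real lung cancer rules are not published.
RULE_CATEGORIES = {
    "R05": "symptom_cough",          # ICD-10: cough
    "R63.4": "symptom_weight_loss",  # ICD-10: abnormal weight loss
    "71046": "imaging_chest_xray",   # CPT: chest X-ray, 2 views
    "71250": "imaging_chest_ct",     # CPT: chest CT without contrast
}


def build_features(claims, lookback_days=365):
    """Return (end_event_date, category counts) for one patient.

    claims is a list of (service_date, code) pairs. Returns None when the
    patient has no end event. Only claims dated inside the lookback window
    and strictly before the end event are counted.
    """
    dx_dates = [d for d, code in claims if code.startswith(END_EVENT_PREFIX)]
    if not dx_dates:
        return None
    end_event = min(dx_dates)
    window_start = end_event - timedelta(days=lookback_days)
    counts = Counter()
    for service_date, code in claims:
        if window_start <= service_date < end_event and code in RULE_CATEGORIES:
            counts[RULE_CATEGORIES[code]] += 1
    return end_event, counts


# Hypothetical patient journey.
claims = [
    (date(2022, 1, 10), "R05"),     # before the lookback window, ignored
    (date(2023, 3, 1), "R05"),
    (date(2023, 5, 20), "71046"),
    (date(2023, 6, 2), "71250"),
    (date(2023, 6, 15), "C34.90"),  # end event
    (date(2023, 7, 1), "71250"),    # after diagnosis, ignored
]
end_event, features = build_features(claims)
print(end_event, dict(features))
```

Counts like these would form one row of a training table per patient. Undiagnosed patients would need a different anchor date, such as the scoring date, a choice the source does not describe.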

Importantly, the team did not stop at prediction accuracy. They sought to deconstruct model outputs to surface the most impactful clinical variables driving predictions. This helped ensure transparency and potential future regulatory alignment.
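The case study does not say which technique was used to deconstruct the model outputs. One common, model-agnostic option is permutation importance: shuffle one input variable at a time and measure how much predictive accuracy drops. The sketch below is a minimal illustration under that assumption; the toy model, the two features, and the data are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_features, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled.

    rows is a list of tuples. A larger drop means the model relies more
    heavily on that feature.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical features: (chest CT count, unrelated prescription count).
rows = [(1, 0), (0, 1), (2, 1), (0, 0), (1, 1), (0, 0)]
labels = [1, 0, 1, 0, 1, 0]


def toy_model(row):
    """Stand-in classifier: flags any patient with at least one chest CT."""
    return int(row[0] >= 1)


importances = permutation_importance(toy_model, rows, labels, n_features=2)
print(importances)
```

Here the second feature scores exactly zero because the toy model never reads it, while shuffling the first feature degrades accuracy. Ranking variables by such scores is one way to produce a list of the most influential indicators.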

Outcome

The results provided the client with a powerful and actionable intelligence framework: 

  • Key variable insights: Symphony identified and reported a comprehensive list of signs, symptoms, procedures, and prescribing patterns that the models utilised most frequently in classifying patients as likely to be diagnosed. This not only enhanced the client’s understanding of disease emergence but also provided a roadmap for future model refinement.
  • Provider attribution: Patients flagged as “likely” lung cancer cases were attributed to HCPs most actively practicing within the lung cancer therapeutic area. This allowed the client to prioritise outreach to key physician segments such as pulmonologists, oncologists, and high-referring primary care physicians (PCPs).
  • Scalable framework: The methodology used for lung cancer was replicable and scalable. In fact, Symphony successfully applied the same process to nine additional cancers, including pancreatic, ovarian, colorectal, and prostate cancers - each with custom-tailored rules and modeling logic.
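The provider attribution step above could take many forms, and the source does not specify the rule used. As a minimal sketch, assume each flagged patient is attributed to whichever of their treating providers has the highest volume of lung cancer-related claims, and providers are then ranked by the number of patients attributed to them. All identifiers, volumes, and function names below are hypothetical.

```python
from collections import Counter


def attribute_patients(flagged_visits, provider_activity):
    """Map each flagged patient to their most active treating provider.

    flagged_visits: {patient_id: set of provider_ids the patient has seen}
    provider_activity: {provider_id: lung cancer-related claim volume}
    Ties go to the alphabetically first provider ID so results are stable.
    """
    attribution = {}
    for patient, providers in flagged_visits.items():
        if providers:
            attribution[patient] = max(
                sorted(providers), key=lambda p: provider_activity.get(p, 0)
            )
    return attribution


def rank_providers(attribution):
    """Rank providers by the number of flagged patients attributed to them."""
    return Counter(attribution.values()).most_common()


# Hypothetical inputs.
flagged_visits = {
    "p1": {"pulm_A", "pcp_B"},
    "p2": {"pcp_B"},
    "p3": {"pulm_A", "onc_C"},
    "p4": {"pulm_A"},
}
provider_activity = {"pulm_A": 120, "pcp_B": 15, "onc_C": 300}

attribution = attribute_patients(flagged_visits, provider_activity)
ranking = rank_providers(attribution)
print(attribution)
print(ranking)
```

A ranked list of this kind is what would let outreach be prioritised by specialty or geography, once provider IDs are joined to specialty and location data.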

Value delivered

By combining the breadth of real-world claims data with tailored machine learning workflows, Symphony enabled the client to identify thousands of high-risk, undiagnosed cancer patients across multiple tumor types. These individuals represent prime candidates for diagnostic testing, expanding the client’s addressable market and enhancing the potential for earlier cancer detection. 

In the lung cancer use case alone, Symphony delivered: 

  • A validated list of predictive indicators of undiagnosed lung cancer
  • A ranked attribution of healthcare providers managing likely patients
  • An interactive dashboard highlighting geographic and specialty distribution of potential patients
  • A scalable model architecture ready for integration into the client’s commercial or clinical operations 

Ultimately, this initiative not only supported the client’s go-to-market efforts but also contributed to the broader goal of improving early detection and patient outcomes in some of the most challenging cancers.
