Clinical Use Case

ER-Positive Breast Cancer Recurrence Prediction

ER-positive breast cancer is the most common breast cancer subtype, yet patients with identical diagnoses experience markedly different recurrence rates. MEFINDER builds a multimodal recurrence predictor by fusing imaging, pathology, and clinical text from one of the largest annotated breast imaging cohorts in the US.

Clinical Context

ER-positive breast cancer accounts for the majority of breast cancer diagnoses, yet current standard-of-care tools are unable to reliably distinguish patients who will experience late recurrence (5–20 years after diagnosis) from those who will remain disease-free. This uncertainty leads to over-treatment in some patients and under-treatment in others.

Molecular assays such as Oncotype DX (approximately $3,400 per test) are expensive, offer only partial prognostic guidance, and may not capture the full biological complexity of the disease. MEFINDER integrates imaging, pathology, and clinical text to build affordable, equity-aware alternatives, with the aim of providing more accurate prognostication across diverse patient populations.

The NLP pipeline extracts recurrence events, treatment timelines, and patient-centered outcomes directly from free-text clinical notes, which enables large-scale retrospective cohort construction without manual chart review.

260,815 patients in EMBED v2

~1M imaging exams, cross-modality

NLP validated on: Mayo Clinic · Stanford · Emory · UC Davis

Data Modalities

Mammography

Full-field digital mammography (FFDM) and digital breast tomosynthesis (DBT)

Breast MRI

Dynamic contrast-enhanced MRI for high-risk screening and staging

Digital Pathology

H&E whole-slide images from core needle biopsies

Clinical Text

Radiology and pathology reports processed by NLP pipelines for recurrence labeling

EHR

Structured electronic health records including ICD-10 codes, medications, and visit history
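
One way to picture how these modalities come together for a single patient is a record with one optional list per modality, since not every patient will have every kind of exam. The sketch below is illustrative only: `PatientRecord`, its field names, and the example values are hypothetical and are not taken from the MEFINDER or EMBED schemas.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatientRecord:
    """Hypothetical container for one patient's multimodal data (illustrative only)."""

    patient_id: str
    mammogram_paths: List[str] = field(default_factory=list)        # FFDM / DBT files
    mri_paths: List[str] = field(default_factory=list)              # DCE-MRI series
    pathology_slide_paths: List[str] = field(default_factory=list)  # H&E whole-slide images
    report_texts: List[str] = field(default_factory=list)           # radiology / pathology reports
    icd10_codes: List[str] = field(default_factory=list)            # structured EHR codes

    def available_modalities(self) -> List[str]:
        """Return the names of modalities that have at least one item."""
        candidates = {
            "mammography": self.mammogram_paths,
            "mri": self.mri_paths,
            "pathology": self.pathology_slide_paths,
            "clinical_text": self.report_texts,
            "ehr": self.icd10_codes,
        }
        return [name for name, items in candidates.items() if items]


# Example: a patient with one mammogram and one diagnosis code (made-up values).
record = PatientRecord(
    patient_id="demo-001",
    mammogram_paths=["exam1.dcm"],
    icd10_codes=["C50.911"],
)
print(record.available_modalities())  # ['mammography', 'ehr']
```

Keeping every modality optional makes missing data explicit, so any downstream model has to decide how to handle patients with incomplete records.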

Key Dataset

EMBED v2

The EMBED (EMory BrEast imaging Dataset) v2 cohort is hosted at Emory University and comprises 260,815 breast cancer patients with approximately one million imaging exams across multiple modalities. It is one of the largest and most diverse annotated breast imaging datasets available for research, enabling equity-aware model development and validation across demographic subgroups.

Patients: 260,815

Imaging exams: ~1 million

Imaging modalities: FFDM, DBT, breast MRI

Institution: Emory University

Use: Multimodal recurrence prediction, NLP pipeline development

Relevant Tools

VLM for Mammography

Knowledge-grounded strategy for adapting vision-language models (VLMs) to screening mammography.

Builds unique case sets for screening mammography using mini-batch selective sampling for VLM adaptation. Evaluated with two VLMs: MedCLIP (in-domain) and ALBEF (out-of-domain). Validated in zero-shot, few-shot, and supervised settings on UW Madison datasets, with external validation on Mayo Clinic data. Authors: Aisha Urooj Khan et al. Model checkpoints are available for download.

BreastRecurrence_Transformer

Transformer-based NLP model for identifying the occurrence and timing of breast cancer recurrence from EMRs.

Adaptable to other cancer sites. Validated on Mayo, Stanford, Emory, and UC Davis. Released with an academic open-source license and packaged in Docker. Model weights available via Google Drive.

Breast Cancer Treatment Extraction

Hybrid UMLS parser + fine-tuned LLM for extracting longitudinal treatment timelines from free-text clinical notes.

Combines a UMLS-based parser with fine-tuned language models (GPT-2, BioGPT, LLaMA) to extract structured treatment timelines from unstructured clinical notes. Validated on Mayo, Stanford, Emory, and UC Davis. Released with an academic open-source license and packaged in Docker.
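
To make the idea of a structured treatment timeline concrete, the sketch below uses a tiny hand-written keyword lexicon as a stand-in for the UMLS-based parser and leaves out the language-model step entirely. The lexicon, the `extract_timeline` function, and the assumption that each note arrives with its own date are all hypothetical; this is not the released tool's code.

```python
import re
from datetime import date
from typing import Dict, List, Tuple

# Toy lexicon mapping surface terms to treatment categories.
# Assumption: a stand-in for UMLS concept lookup, not real UMLS content.
TREATMENT_LEXICON: Dict[str, str] = {
    "tamoxifen": "endocrine therapy",
    "letrozole": "endocrine therapy",
    "lumpectomy": "surgery",
    "mastectomy": "surgery",
    "radiation": "radiation therapy",
    "doxorubicin": "chemotherapy",
}

# One case-insensitive pattern that matches any lexicon term as a whole word.
_TERM_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, TREATMENT_LEXICON)) + r")\b",
    re.IGNORECASE,
)


def extract_timeline(notes: List[Tuple[date, str]]) -> List[Tuple[date, str, str]]:
    """Return unique (note_date, term, category) tuples in chronological order."""
    events = set()
    for note_date, text in notes:
        for match in _TERM_PATTERN.finditer(text):
            term = match.group(1).lower()
            events.add((note_date, term, TREATMENT_LEXICON[term]))
    return sorted(events)


# Made-up notes, deliberately out of order.
notes = [
    (date(2019, 6, 3), "Started tamoxifen 20 mg daily."),
    (date(2019, 3, 12), "Patient underwent left lumpectomy."),
]
for event in extract_timeline(notes):
    print(event)
```

A keyword match like this cannot tell "started tamoxifen" from "declined tamoxifen"; resolving that kind of context is presumably part of what the fine-tuned language models contribute in the hybrid design.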

PCO Extraction

Fine-tuning framework for LLMs to extract patient-centered outcomes from breast cancer clinical notes.

Extracts treatment-related side effects including fatigue, depression, anxiety, nausea, and lymphedema from breast cancer clinical notes. Validated on Mayo, Stanford, Emory, and UC Davis. Released with an academic open-source license and packaged in Docker.

Recurrence Site Extraction (BioLinkBERT)

Fine-tuned BioLinkBERT model for extracting sites of distant recurrence from clinical, radiology, and pathology notes.

Trained on annotated notes of all three types. Validated on Mayo, Stanford, Emory, and UC Davis. Released with an academic open-source license and packaged in Docker.

Mammogram Implant Identifier

ResNet18 CNN that identifies breast implants in mammograms without relying on DICOM tags.

Developed on 6,250 mammograms (5,000 for training and validation, 1,250 for testing). Model weights are available in the repository.
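
The stated split works out to 80% for training and validation and 20% for testing (5,000 / 6,250 = 0.8). A reproducible split of that shape could be sketched as follows. The integer IDs, the seed, and the `split_ids` helper are assumptions for illustration; the source does not say how the actual split was drawn (for example, whether it was grouped by patient).

```python
import random
from typing import List, Sequence, Tuple


def split_ids(ids: Sequence[int], n_test: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle IDs with a fixed seed and hold out n_test of them for testing."""
    if not 0 <= n_test <= len(ids):
        raise ValueError("n_test must be between 0 and len(ids)")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)
    # The first n_test shuffled IDs form the test set; the rest are train/validate.
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical mammogram IDs 0..6249 standing in for the real image identifiers.
train_val_ids, test_ids = split_ids(range(6250), n_test=1250)
print(len(train_val_ids), len(test_ids))  # 5000 1250
```

If several mammograms come from the same patient, splitting by patient rather than by image would avoid leakage between the two sets.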