Clinical Use Case
ER-Positive Breast Cancer Recurrence Prediction
ER-positive breast cancer is the most common breast cancer subtype, yet patients with identical diagnoses experience markedly different recurrence rates. MEFINDER builds a multimodal recurrence predictor by fusing imaging, pathology, and clinical text from one of the largest annotated breast imaging cohorts in the US.
Clinical Context
ER-positive breast cancer accounts for the majority of breast cancer diagnoses, yet current standard-of-care tools are unable to reliably distinguish patients who will experience late recurrence (5–20 years after diagnosis) from those who will remain disease-free. This uncertainty leads to over-treatment in some patients and under-treatment in others.
Expensive molecular assays (such as Oncotype DX, approximately $3,400 per test) offer partial prognostic guidance but may not capture the full biological complexity of the disease. MEFINDER integrates imaging, pathology, and clinical text with the goal of building affordable, equity-aware alternatives that prognosticate more accurately across diverse patient populations.
The NLP pipeline extracts recurrence events, treatment timelines, and patient-centered outcomes directly from free-text clinical notes, which enables large-scale retrospective cohort construction without manual chart review.
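To make the cohort-construction step concrete, here is a minimal sketch of how an extracted diagnosis date and recurrence date could be turned into a recurrence label. It uses the 5 to 20 year late-recurrence window described above. The class, function, and label names are hypothetical and are not part of any MEFINDER tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Late-recurrence window taken from the text above (5 to 20 years after diagnosis).
LATE_MIN_YEARS = 5
LATE_MAX_YEARS = 20


@dataclass
class PatientTimeline:
    """Hypothetical container for dates extracted from clinical notes."""
    patient_id: str
    diagnosis_date: date
    recurrence_date: Optional[date] = None  # None when no recurrence was found


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (Feb 29 falls back to Feb 28)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def recurrence_label(t: PatientTimeline) -> str:
    """Classify a timeline as no recurrence, early, late, or beyond the window."""
    if t.recurrence_date is None:
        return "no_recurrence_recorded"
    if t.recurrence_date < t.diagnosis_date:
        raise ValueError("recurrence date precedes diagnosis date")
    if t.recurrence_date < add_years(t.diagnosis_date, LATE_MIN_YEARS):
        return "early_recurrence"
    if t.recurrence_date <= add_years(t.diagnosis_date, LATE_MAX_YEARS):
        return "late_recurrence"
    return "beyond_window"


example = PatientTimeline("P001", date(2005, 3, 1), date(2012, 6, 15))
print(recurrence_label(example))  # late_recurrence
```

Anniversary dates are compared instead of dividing day counts by 365.25, so a recurrence on the exact five-year anniversary is counted as late.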
260,815 patients in EMBED v2
~1M imaging exams across modalities
NLP validated on Mayo Clinic · Stanford · Emory · UC Davis
Data Modalities
Mammography: Full-field digital mammography (FFDM) and digital breast tomosynthesis (DBT)
Breast MRI: Dynamic contrast-enhanced MRI for high-risk screening and staging
Digital Pathology: H&E whole-slide images from core needle biopsies
Clinical Text: Radiology and pathology reports processed by NLP pipelines for recurrence labeling
EHR: Structured electronic health records including ICD-10 codes, medications, and visit history
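One way to picture how these modalities come together is a per-patient record in which any modality may be missing. The sketch below is illustrative only: the class and field names are assumptions and do not reflect the actual EMBED v2 schema or MEFINDER code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PatientRecord:
    """Hypothetical per-patient bundle of the five modalities listed above."""
    patient_id: str
    mammogram_paths: List[str] = field(default_factory=list)        # FFDM / DBT files
    mri_paths: List[str] = field(default_factory=list)              # DCE-MRI series
    pathology_slide_paths: List[str] = field(default_factory=list)  # H&E whole-slide images
    report_texts: List[str] = field(default_factory=list)           # radiology / pathology reports
    icd10_codes: List[str] = field(default_factory=list)            # structured EHR codes

    def available_modalities(self) -> List[str]:
        """List the modalities that have at least one item for this patient."""
        present = {
            "mammography": self.mammogram_paths,
            "breast_mri": self.mri_paths,
            "digital_pathology": self.pathology_slide_paths,
            "clinical_text": self.report_texts,
            "ehr": self.icd10_codes,
        }
        return [name for name, items in present.items() if items]


record = PatientRecord(
    patient_id="P001",
    mammogram_paths=["P001/ffdm_2015.dcm"],
    icd10_codes=["C50.911"],
)
print(record.available_modalities())  # ['mammography', 'ehr']
```

Tracking which modalities are present per patient matters for a fusion model, since few patients will have every exam type.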
Key Dataset
EMBED v2
The EMBED (EMory BrEast imaging Dataset) v2 cohort is hosted at Emory University and comprises 260,815 patients with approximately one million imaging exams across multiple modalities. It is one of the largest and most diverse annotated breast imaging datasets available for research, enabling equity-aware model development and validation across demographic subgroups.
Patients: 260,815
Imaging exams: ~1 million
Imaging modalities: FFDM, DBT, breast MRI
Institution: Emory University
Use: Multimodal recurrence prediction, NLP pipeline development
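Validation across demographic subgroups, as mentioned above, usually means reporting a metric per subgroup instead of a single pooled number. The sketch below computes sensitivity per subgroup from (subgroup, true label, predicted label) rows. The function name, the row layout, and the choice of sensitivity as the metric are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sensitivity_by_subgroup(rows: Iterable[Tuple[str, int, int]]) -> Dict[str, float]:
    """Compute sensitivity (true positives / actual positives) for each subgroup.

    Each row is (subgroup, true_label, predicted_label) with labels 0 or 1.
    Subgroups with no actual positives are left out of the result.
    """
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for group, y_true, y_pred in rows:
        if y_true == 1:
            positives[group] += 1
            if y_pred == 1:
                true_positives[group] += 1
    return {group: true_positives[group] / positives[group] for group in positives}


rows = [
    ("group_a", 1, 1),
    ("group_a", 1, 0),
    ("group_b", 1, 1),
    ("group_b", 0, 1),
]
print(sensitivity_by_subgroup(rows))  # {'group_a': 0.5, 'group_b': 1.0}
```

A large gap between subgroups in a table like this is the kind of signal an equity-aware validation is meant to surface.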
Relevant Tools
01
VLM for Mammography
A knowledge-grounded strategy for adapting vision-language models (VLMs) to screening mammography.
Builds unique case sets for screening mammography using mini-batch selective sampling for VLM adaptation. Evaluated with two VLMs: MedCLIP (in-domain) and ALBEF (out-of-domain). Validated in zero-shot, few-shot, and supervised settings on UW Madison datasets, with external validation on Mayo Clinic data. Authors: Aisha Urooj Khan et al. Model checkpoints are available via a download link.
02
BreastRecurrence_Transformer
Transformer-based NLP model that identifies breast cancer recurrence and its timing from EMRs.
Adaptable to other cancer sites. Validated on data from Mayo Clinic, Stanford, Emory, and UC Davis. Released under an academic open-source license and packaged with Docker. Model weights are available via Google Drive.
03
Breast Cancer Treatment Extraction
Hybrid UMLS parser + fine-tuned LLM for extracting longitudinal treatment timelines from free-text clinical notes.
Pairs the UMLS-based parser with fine-tuned language models (GPT-2, BioGPT, LLaMA) to turn unstructured clinical notes into structured treatment timelines. Validated on data from Mayo Clinic, Stanford, Emory, and UC Davis. Released under an academic open-source license and packaged with Docker.
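The description above does not say how the parser output and the language-model output are combined. As a rough illustration, the sketch below assumes the simplest possible rule: take the union of events from both sources, drop duplicates that share a treatment name and start date, and sort by date. The `TreatmentEvent` class and `merge_timelines` function are hypothetical and are not the tool's actual interface.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class TreatmentEvent:
    """Hypothetical structured event extracted from a clinical note."""
    treatment: str
    start: date
    source: str  # "parser" or "llm"


def merge_timelines(
    parser_events: Iterable[TreatmentEvent],
    llm_events: Iterable[TreatmentEvent],
) -> List[TreatmentEvent]:
    """Union both event lists, de-duplicate, and sort chronologically.

    Assumption: two events are duplicates when the treatment name
    (case-insensitive) and start date match; the parser's copy is kept.
    """
    unique: Dict[Tuple[str, date], TreatmentEvent] = {}
    for event in list(parser_events) + list(llm_events):
        key = (event.treatment.strip().lower(), event.start)
        unique.setdefault(key, event)  # first occurrence wins, so parser beats LLM
    return sorted(unique.values(), key=lambda e: (e.start, e.treatment.lower()))


parser_out = [
    TreatmentEvent("tamoxifen", date(2015, 3, 1), "parser"),
    TreatmentEvent("lumpectomy", date(2015, 1, 10), "parser"),
]
llm_out = [
    TreatmentEvent("Tamoxifen", date(2015, 3, 1), "llm"),
    TreatmentEvent("radiation", date(2015, 2, 1), "llm"),
]
for event in merge_timelines(parser_out, llm_out):
    print(event.start, event.treatment, event.source)
```

A real pipeline would likely need fuzzier matching, for example normalizing drug names to UMLS concepts and tolerating small date differences.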
04
PCO Extraction
Fine-tuning framework for LLMs to extract patient-centered outcomes from breast cancer clinical notes.
Extracts treatment-related side effects, including fatigue, depression, anxiety, nausea, and lymphedema, from breast cancer clinical notes. Validated on data from Mayo Clinic, Stanford, Emory, and UC Davis. Released under an academic open-source license and packaged with Docker.
05
Recurrence Site Extraction (BioLinkBERT)
Fine-tuned BioLinkBERT model for extracting sites of distant recurrence from clinical, radiology, and pathology notes.
Trained on annotated notes of all three types. Validated on data from Mayo Clinic, Stanford, Emory, and UC Davis. Released under an academic open-source license and packaged with Docker.
06
Mammogram Implant Identifier
ResNet18 CNN that identifies breast implants in mammograms without relying on DICOM tags.
Developed on 6,250 mammograms: 5,000 for training and validation, and 1,250 held out for testing. Model weights are available in the repository.
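The counts above correspond to an 80/20 split (1,250 of 6,250 held out). The sketch below shows one reproducible way to produce a split of that size with a seeded shuffle. How the original split was actually drawn is not stated, so the function name, the seed, and the shuffle-based approach are assumptions.

```python
import random
from typing import List, Tuple


def split_indices(n_total: int, n_test: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle indices 0..n_total-1 with a fixed seed and hold out n_test of them."""
    if not 0 < n_test < n_total:
        raise ValueError("n_test must be between 1 and n_total - 1")
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # local RNG, so global random state is untouched
    return indices[n_test:], indices[:n_test]


train_val_idx, test_idx = split_indices(n_total=6250, n_test=1250)
print(len(train_val_idx), len(test_idx))  # 5000 1250
print(len(test_idx) / 6250)               # 0.2
```

For mammography data, a split like this is normally done at the patient level so that images from one patient do not land on both sides.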