Scientific Research
Research Overview
Two cancer use cases, one unified multimodal framework, and an open-source ecosystem designed for reproducibility and clinical impact.
Plain-Language Summary
We are studying why some patients with ER-positive breast cancer, the most common type, see their cancer come back years later while others do not. By combining mammograms, MRI scans, biopsy slides, and medical records from over 260,000 patients, we aim to build a free computer tool that can identify high-risk patients much earlier and without expensive genetic tests.
ER+ Breast Cancer Recurrence Prediction
Estrogen receptor-positive (ER+) breast cancer accounts for the majority of breast cancer cases. Despite a generally favorable prognosis, a significant proportion of patients experience late recurrence, sometimes 5–10 years or more after diagnosis, and current tools have limited ability to identify at-risk patients early.
MEFINDER develops NLP-based recurrence labeling from clinical notes, integrates causal feature analysis in histopathology, and applies compartment-aware prognostication approaches that jointly model tumor, stromal, and immune microenvironment compartments.
Patients: 260,815+
Exams: ~1M
Dataset: EMBED v2
Data Modalities
Key Methodologies
Plain-Language Summary
After treatment for prostate cancer (surgery or radiation), some patients see their PSA blood level rise again — a sign the cancer may be returning. Current tests to predict who is at risk can cost thousands of dollars. We are building a free AI tool that uses standard MRI scans and pathology slides — tests patients already receive — to make the same prediction at no extra cost.
Prostate Cancer Biochemical Recurrence
Biochemical recurrence (BCR) — PSA rise following radical prostatectomy or radiotherapy — affects up to 40% of patients within 10 years of definitive therapy. Identifying who will recur guides decisions on adjuvant therapy and surveillance intensity.
MEFINDER applies APIC (AI-based Pathology Image Classifier) to predict treatment benefit from standard H&E slides, develops a prostate biological age estimation model using MRI radiomics, and addresses MRI batch effect harmonization across heterogeneous multi-site cohorts.
Cost Impact
Molecular assays such as Decipher cost approximately $3,400 per patient and may not be covered by insurance. APIC achieves comparable performance using routine pathology slides at no incremental cost.
Data Modalities
Key Tools
Technical Architecture
Multimodal Fusion Framework
Four complementary fusion strategies, each addressing a different aspect of multimodal learning in clinical oncology.
Graph-Based Multimodal Fusion
Represents heterogeneous imaging and EHR features as nodes in a hypergraph, enabling message passing across modality boundaries.
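As a rough illustration of message passing across modality boundaries, the sketch below runs one round of node-to-hyperedge-to-node mean aggregation on a toy hypergraph. The function name, the node and hyperedge labels, and the choice of plain averaging are all assumptions made for this example. The framework's actual aggregation scheme is not specified here and is likely learned rather than fixed.

```python
from typing import Dict, List


def hypergraph_message_pass(
    node_feats: Dict[str, List[float]],
    hyperedges: Dict[str, List[str]],
) -> Dict[str, List[float]]:
    """One round of node -> hyperedge -> node mean aggregation (illustrative)."""
    dim = len(next(iter(node_feats.values())))

    # Step 1: each hyperedge averages the features of its member nodes.
    edge_msgs = {
        edge: [
            sum(node_feats[n][d] for n in members) / len(members)
            for d in range(dim)
        ]
        for edge, members in hyperedges.items()
    }

    # Step 2: each node averages the messages of the hyperedges it belongs to.
    updated = {}
    for node, feats in node_feats.items():
        incident = [e for e, members in hyperedges.items() if node in members]
        if not incident:
            updated[node] = list(feats)  # isolated nodes keep their features
            continue
        updated[node] = [
            sum(edge_msgs[e][d] for e in incident) / len(incident)
            for d in range(dim)
        ]
    return updated


# Hypothetical toy inputs: one node per modality, two hyperedges.
nodes = {
    "mri_lesion": [1.0, 0.0],
    "path_tile": [0.0, 1.0],
    "ehr_psa": [2.0, 2.0],
}
edges = {
    "same_patient": ["mri_lesion", "path_tile", "ehr_psa"],
    "imaging_only": ["mri_lesion", "path_tile"],
}
updated_nodes = hypergraph_message_pass(nodes, edges)
```

Because a hyperedge can join any number of nodes, a single edge such as `same_patient` lets an EHR value influence imaging and pathology nodes in one step, which an ordinary pairwise graph would need several edges to express.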
Spatio-Temporal Fusion
Extends graph-based fusion with temporal edges, capturing longitudinal dynamics from serial imaging and repeated lab measurements.
Vision-Language Contrastive Training
Aligns imaging features with clinical text embeddings using CLIP-style contrastive objectives, with medical knowledge grounding via SNOMED and RadLex.
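A CLIP-style objective is usually written as a symmetric cross-entropy over a matrix of image-text similarities, where matched pairs lie on the diagonal. The sketch below computes that loss in plain Python for small batches. The function names and the temperature value of 0.07 are assumptions for illustration, embeddings are assumed to be non-zero vectors, and the SNOMED and RadLex grounding step is not shown.

```python
import math
from typing import List


def _normalize(v: List[float]) -> List[float]:
    # Scale a (non-zero) vector to unit length so dot products are cosines.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _diag_cross_entropy(logits: List[List[float]]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(
    image_emb: List[List[float]],
    text_emb: List[List[float]],
    temperature: float = 0.07,
) -> float:
    """Symmetric contrastive loss over paired image and text embeddings."""
    img = [_normalize(v) for v in image_emb]
    txt = [_normalize(v) for v in text_emb]
    # logits[i][j] = cosine similarity of image i and text j, scaled.
    logits = [
        [sum(a * b for a, b in zip(i_vec, t_vec)) / temperature for t_vec in txt]
        for i_vec in img
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))


# Hypothetical 2-D embeddings: matched pairs versus swapped pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = clip_style_loss(images, [[1.0, 0.0], [0.0, 1.0]])
loss_swapped = clip_style_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each image is most similar to its own report text and grows when the pairing is wrong, which is the signal that pulls the two embedding spaces into alignment during training.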
Co-Attention with Causality
Combines cross-modal attention (identifying which image regions correlate with EHR features) with structural causal models to remove confounding from demographic variables.
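The attention half of this strategy can be pictured as scaled dot-product attention in which an EHR feature embedding acts as the query and image-region embeddings act as the keys. The sketch below is a minimal version under those assumptions. The function name and toy vectors are hypothetical, and the structural causal model used for deconfounding is outside the scope of this example.

```python
import math
from typing import List


def cross_attention_weights(
    query: List[float], keys: List[List[float]]
) -> List[float]:
    """Softmax attention weights of one query over a list of key vectors."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale for key in keys
    ]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings: one EHR feature and three image regions.
ehr_query = [1.0, 0.0]
region_keys = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
weights = cross_attention_weights(ehr_query, region_keys)
most_attended_region = weights.index(max(weights))
```

The weights sum to one, so they can be read as a distribution over regions that indicates where the model looks when a given EHR feature is considered.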
Model Interpretability
SHAP Values
Feature importance attribution across imaging, pathology, and EHR inputs
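To make the attribution idea concrete, the sketch below computes exact Shapley values for a toy three-input risk function by averaging each feature's marginal contribution over every ordering. This is the quantity that SHAP libraries approximate. Exhaustive enumeration is only practical for a handful of features, and the model, feature names, and all-zero baseline here are assumptions invented for the example.

```python
from itertools import permutations
from typing import Callable, Dict


def exact_shapley(
    model: Callable[[Dict[str, float]], float],
    x: Dict[str, float],
    baseline: Dict[str, float],
) -> Dict[str, float]:
    """Exact Shapley values by enumerating all feature orderings."""
    names = list(x)
    contrib = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)  # start from the reference input
        prev = model(current)
        for name in order:
            current[name] = x[name]  # switch this feature to its real value
            new = model(current)
            contrib[name] += new - prev
            prev = new
    return {name: c / len(orderings) for name, c in contrib.items()}


def toy_risk(f: Dict[str, float]) -> float:
    # Hypothetical risk score with one cross-modal interaction term.
    return (
        0.5 * f["imaging_score"]
        + 0.3 * f["path_score"]
        + 0.2 * f["imaging_score"] * f["ehr_flag"]
    )


patient = {"imaging_score": 2.0, "path_score": 1.0, "ehr_flag": 1.0}
reference = {"imaging_score": 0.0, "path_score": 0.0, "ehr_flag": 0.0}
phi = exact_shapley(toy_risk, patient, reference)
```

The attributions add up to the difference between the patient's score and the baseline score, and the interaction term is split evenly between the imaging and EHR inputs that jointly produce it.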
Attention Maps
Spatial visualization of which image regions drive model predictions
Transcriptome Analysis
Correlating model-derived phenotypes with gene expression data for biological validation
Infrastructure
Data Harmonization
Robust preprocessing and quality control pipelines ensure that data from five institutions is analysis-ready and comparable.
DICOM Preprocessing
Standardized pipeline for DICOM ingestion, windowing, and conversion
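Windowing maps a chosen intensity range onto display values and clips everything outside it. The sketch below shows a simplified linear window driven by a center and a width, assuming pixel values have already been rescaled into modality units. The function name is hypothetical, and the formula omits the small offsets in the DICOM standard's linear VOI function, so treat it as an approximation rather than the pipeline's actual code.

```python
from typing import List


def apply_window(
    pixels: List[List[float]], center: float, width: float
) -> List[List[int]]:
    """Clip to [center - width/2, center + width/2] and scale to 0..255."""
    if width <= 0:
        raise ValueError("width must be positive")
    lo = center - width / 2
    hi = center + width / 2
    windowed = []
    for row in pixels:
        windowed.append([
            0 if v <= lo
            else 255 if v >= hi
            else round((v - lo) / (hi - lo) * 255)
            for v in row
        ])
    return windowed


# Hypothetical values: one far below, one at the center, one far above.
display = apply_window([[-1000.0, 40.0, 1000.0]], center=40, width=400)
```

Values below the window collapse to black, values above it to white, and the window center lands near mid-gray.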
2D-to-3D ROI Mapping
Maps 2D pathology annotations to 3D volumetric imaging coordinates
Implant Detection
Automatically identifies breast implants in mammography for downstream handling
MQUAL
MRI quality scoring for SNR, motion artifacts, and protocol completeness
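One common way to quantify SNR and CNR is to compare mean intensity in tissue regions of interest against the standard deviation of a background region. The sketch below follows that convention. It is not MQUAL's scoring method, which is not described here. The function names and ROI values are assumptions, and corrections for non-Gaussian background noise in magnitude images are left out.

```python
from statistics import mean, pstdev
from typing import Sequence


def snr(signal_roi: Sequence[float], background_roi: Sequence[float]) -> float:
    """Mean signal divided by the standard deviation of background noise."""
    noise = pstdev(background_roi)
    if noise == 0:
        raise ValueError("background ROI has zero variance")
    return mean(signal_roi) / noise


def cnr(
    roi_a: Sequence[float],
    roi_b: Sequence[float],
    background_roi: Sequence[float],
) -> float:
    """Absolute difference of two tissue means over background noise."""
    noise = pstdev(background_roi)
    if noise == 0:
        raise ValueError("background ROI has zero variance")
    return abs(mean(roi_a) - mean(roi_b)) / noise


# Hypothetical ROI samples.
background = [0.0, 2.0, 0.0, 2.0]  # standard deviation of 1
lesion = [100.0, 100.0]
normal_tissue = [60.0, 60.0]
snr_value = snr(lesion, background)
cnr_value = cnr(lesion, normal_tissue, background)
```

A quality pipeline could compare such values against protocol-specific thresholds to flag scans for review, with the thresholds chosen per site and sequence.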
Beaks
Cross-modality quality assessment for radiology and pathology image sets
HistoQC
Whole-slide pathology QC with artifact masking and tissue segmentation
F-SYN
Fourier-domain stain normalization for digital pathology slides
Radiology Quality
- DICOM header validation
- SNR / CNR assessment
- Motion artifact scoring
- Protocol compliance check
Pathology Quality
- Tissue/background segmentation
- Artifact detection (blur, fold)
- Stain variability metrics
- Tile-level quality masks
EHR Quality
- HL7 FHIR conformance
- ICD-10 code validation
- Missing data imputation
- Recurrence label NLP verification
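The ICD-10 code validation step listed above could begin with a structural check before any lookup against an official code table. The sketch below tests only the shape of a dotted ICD-10-CM style code: a letter, a digit, one alphanumeric character, then optionally a dot and up to four more alphanumerics. The function name is hypothetical, and passing this check does not mean the code exists in the current release.

```python
import re

# Letter, digit, alphanumeric, then optional ".x" to ".xxxx" extension.
_ICD10_SHAPE = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")


def is_plausible_icd10(code: str) -> bool:
    """Return True if the string has the shape of an ICD-10-CM code."""
    return bool(_ICD10_SHAPE.match(code.strip().upper()))


# Example inputs: two well-formed codes and two malformed strings.
checks = {
    "C50.911": is_plausible_icd10("C50.911"),
    "C61": is_plausible_icd10("C61"),
    "50.911": is_plausible_icd10("50.911"),
    "C50.": is_plausible_icd10("C50."),
}
```

Records that fail the shape check can be routed to manual review, while those that pass still need verification against the code set version used by each institution.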