Scientific Research

Research Overview

Two cancer use cases, one unified multimodal framework, and an open-source ecosystem designed for reproducibility and clinical impact.

Plain-Language Summary

We are studying why some patients with ER-positive breast cancer (the most common type) see their cancer come back years later while others do not. By combining mammograms, MRI scans, biopsy slides, and medical records from over 260,000 patients, we aim to build a free computer tool that can identify high-risk patients much earlier and without expensive genetic tests.

Use Case 1

ER+ Breast Cancer Recurrence Prediction

Estrogen receptor-positive (ER+) breast cancer accounts for the majority of breast cancer cases. Despite a generally favorable prognosis, a significant proportion of patients experience late recurrence, sometimes 5 to 10 years or more after diagnosis, and current tools have limited ability to identify at-risk patients early.

MEFINDER develops NLP-based recurrence labeling from clinical notes, integrates causal feature analysis in histopathology, and applies compartment-aware prognostication approaches that jointly model tumor, stromal, and immune microenvironment compartments.

Patients: 260,815+
Exams: ~1M
Dataset: EMBED v2

Data Modalities

Full-Field Digital Mammography (FFDM)
Digital Breast Tomosynthesis (DBT)
Breast MRI
Digital Pathology (H&E slides)
Clinical Text / NLP
Electronic Health Records (EHR)

Key Methodologies

NLP-based recurrence labeling from unstructured clinical notes
Causal feature integration in histopathology (tumor + stroma + immune)
Compartment-aware prognostication across imaging modalities
Federated contrastive learning (MamoCLIP) across sites
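
As a rough illustration of the first methodology above, recurrence labeling from unstructured notes, the sketch below flags sentences that mention recurrence unless a negation cue precedes the mention. It is a deliberately simple rule-based stand-in, not the project's NLP pipeline; the cue lists and function names are hypothetical and would need validation against chart-reviewed labels.

```python
import re

# Hypothetical cue lists, for illustration only.
RECURRENCE_CUES = re.compile(
    r"\b(recurren(?:ce|t)|relapsed?|metastatic disease|local failure)\b", re.I
)
NEGATION_CUES = re.compile(
    r"\b(no|without|negative for|no evidence of|free of)\b", re.I
)


def label_sentence(sentence: str) -> str:
    """Return 'recurrence', 'negated', or 'none' for a single sentence."""
    match = RECURRENCE_CUES.search(sentence)
    if match is None:
        return "none"
    # Only text before the recurrence cue is searched for negation.
    preceding = sentence[: match.start()]
    if NEGATION_CUES.search(preceding):
        return "negated"
    return "recurrence"


def label_note(note: str) -> bool:
    """Flag a note if any sentence asserts recurrence without negation."""
    sentences = re.split(r"(?<=[.!?])\s+", note)
    return any(label_sentence(s) == "recurrence" for s in sentences)
```

A rule set like this is easy to audit but misses hedged or historical mentions ("history of recurrence in 2012"), which is one reason a learned model with human verification is usually preferred for final labels.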

Plain-Language Summary

After treatment for prostate cancer (surgery or radiation), some patients see their PSA blood level rise again — a sign the cancer may be returning. Current tests to predict who is at risk can cost thousands of dollars. We are building a free AI tool that uses standard MRI scans and pathology slides — tests patients already receive — to make the same prediction at no extra cost.

Use Case 2

Prostate Cancer Biochemical Recurrence

Biochemical recurrence (BCR) — PSA rise following radical prostatectomy or radiotherapy — affects up to 40% of patients within 10 years of definitive therapy. Identifying who will recur guides decisions on adjuvant therapy and surveillance intensity.
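
To make the BCR definition concrete, the sketch below derives a recurrence label from a chronological list of post-treatment PSA values. The thresholds are assumptions, since this page does not state which definition the project uses: 0.2 ng/mL with a confirmatory value after prostatectomy, and nadir plus 2.0 ng/mL after radiotherapy, both of which are widely used conventions. The function names are hypothetical.

```python
from typing import List, Optional


def bcr_index_after_surgery(psa: List[float], threshold: float = 0.2) -> Optional[int]:
    """Index of the first PSA >= threshold that the next value confirms."""
    for i in range(len(psa) - 1):
        if psa[i] >= threshold and psa[i + 1] >= threshold:
            return i
    return None


def bcr_index_after_radiotherapy(psa: List[float], rise: float = 2.0) -> Optional[int]:
    """Index of the first PSA at least `rise` ng/mL above the lowest earlier value."""
    if not psa:
        return None
    nadir = psa[0]
    for i, value in enumerate(psa):
        if value >= nadir + rise:
            return i
        # Track the running minimum (nadir) seen so far.
        nadir = min(nadir, value)
    return None
```

Both functions return `None` when no recurrence is detected, so the same output can serve as an event indicator and an event time in a survival analysis.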

MEFINDER applies APIC (AI-based Pathology Image Classifier) to predict treatment benefit from standard H&E slides, develops a prostate biological age estimation model using MRI radiomics, and addresses MRI batch effect harmonization across heterogeneous multi-site cohorts.

Cost Impact

Molecular assays such as Decipher cost approximately $3,400 per patient and may not be covered by insurance. APIC achieves comparable performance using routine pathology slides at no incremental cost.

Data Modalities

Biparametric MRI (bpMRI)
H&E Pathology Slides
Clinical Lab Values (PSA)
Treatment History (EHR)
Gleason Score / Grade Group
Social Determinants of Health

Key Tools

APIC: Treatment benefit prediction from H&E pathology slides
ProstateNet: TZ/PZ segmentation for zone-specific radiomics
MQUAL: MRI quality assessment and protocol compliance
PyComBatch: Batch effect harmonization across scanner vendors
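
The idea behind batch effect harmonization can be sketched with a per-site location and scale adjustment: each site's values for a feature are shifted and rescaled to match the pooled mean and spread. This is a simplified stand-in that omits the empirical Bayes shrinkage and covariate preservation used by ComBat-style methods, and it is not the interface of the harmonization tool listed above; the function name and site labels are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def harmonize(values_by_site: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Align each site's feature distribution to the pooled mean and spread."""
    pooled = [v for values in values_by_site.values() for v in values]
    pooled_mean, pooled_std = mean(pooled), pstdev(pooled)
    harmonized = {}
    for site, values in values_by_site.items():
        site_mean, site_std = mean(values), pstdev(values)
        # Leave the spread untouched when a site has no variation.
        scale = pooled_std / site_std if site_std > 0 else 1.0
        harmonized[site] = [pooled_mean + (v - site_mean) * scale for v in values]
    return harmonized
```

Because this adjustment removes every between-site difference in mean and variance, it would also erase real biological differences between cohorts; methods that model covariates explicitly exist to avoid that.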

Technical Architecture

Multimodal Fusion Framework

Four complementary fusion strategies, each addressing a different aspect of multimodal learning in clinical oncology.

Imaging + EHR

Graph-Based Multimodal Fusion

Represents heterogeneous imaging and EHR features as nodes in a hypergraph, enabling message passing across modality boundaries.
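
A minimal sketch of one round of hypergraph message passing follows, assuming mean aggregation in both directions: each hyperedge averages the features of its member nodes, then each node averages the features of the hyperedges it belongs to. The aggregation rule, the absence of learned weights, and all names are illustrative assumptions rather than the framework's actual design.

```python
from typing import List

Vector = List[float]


def mean_vectors(vectors: List[Vector]) -> Vector:
    """Component-wise mean of equally sized vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def hypergraph_step(node_features: List[Vector], hyperedges: List[List[int]]) -> List[Vector]:
    """One round of node -> hyperedge -> node mean aggregation."""
    # Each hyperedge summarizes the nodes it connects.
    edge_features = [
        mean_vectors([node_features[n] for n in members]) for members in hyperedges
    ]
    updated = []
    for node, features in enumerate(node_features):
        incident = [
            edge_features[e] for e, members in enumerate(hyperedges) if node in members
        ]
        # Isolated nodes keep their original features.
        updated.append(mean_vectors(incident) if incident else list(features))
    return updated
```

For example, with an imaging node, a shared node, and an EHR node connected by two hyperedges, the shared node ends up with information from both modalities after a single step.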

Imaging + EHR + Time

Spatio-Temporal Fusion

Extends graph-based fusion with temporal edges, capturing longitudinal dynamics from serial imaging and repeated lab measurements.
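
One way to picture temporal edges is to link each patient's consecutive exams and weight the link by how close together they are. The sketch below does this with an exponential decay over the time gap; the decay form, the one-year constant, and the visit identifiers are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple


def temporal_edges(
    visit_days: Dict[str, float], decay_days: float = 365.0
) -> List[Tuple[str, str, float]]:
    """Link consecutive visits; the weight shrinks as the time gap grows."""
    ordered = sorted(visit_days, key=visit_days.get)
    edges = []
    for earlier, later in zip(ordered, ordered[1:]):
        gap = visit_days[later] - visit_days[earlier]
        edges.append((earlier, later, math.exp(-gap / decay_days)))
    return edges
```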

Image + Text

Vision-Language Contrastive Training

Aligns imaging features with clinical text embeddings using CLIP-style contrastive objectives, with medical knowledge grounding via SNOMED and RadLex.
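
The CLIP-style objective can be sketched as a symmetric cross-entropy over a similarity matrix in which matching image and text pairs sit on the diagonal. The version below is written in plain Python for readability and assumes non-zero embeddings; the temperature of 0.07 and the function names are assumptions, and the knowledge grounding via SNOMED and RadLex is not shown.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits: List[Vector]) -> float:
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_loss(image_emb: List[Vector], text_emb: List[Vector], temperature: float = 0.07) -> float:
    """Symmetric contrastive loss over paired image and text embeddings."""
    images = [normalize(v) for v in image_emb]
    texts = [normalize(v) for v in text_emb]
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(column) for column in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))
```

The loss is small when each image is most similar to its own report and large when pairs are mismatched, which is what pulls the two embedding spaces into alignment during training.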

Causal AI

Co-Attention with Causality

Combines cross-modal attention (identifying which image regions correlate with EHR features) with structural causal models to remove confounding from demographic variables.
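
The cross-modal attention half of this strategy can be illustrated with scaled dot-product attention, using an EHR feature vector as the query and image-region embeddings as keys and values. This sketch covers only the attention step; the structural causal model used for deconfounding is not shown, and the names and shapes are assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    query: Vector, region_keys: List[Vector], region_values: List[Vector]
) -> Tuple[Vector, Vector]:
    """Return per-region attention weights and the weighted sum of region values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in region_keys]
    weights = softmax(scores)
    dim = len(region_values[0])
    attended = [
        sum(w * value[d] for w, value in zip(weights, region_values)) for d in range(dim)
    ]
    return weights, attended
```

The returned weights indicate which image regions align most strongly with the EHR query, which is the quantity an attention map would visualize.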

Model Interpretability

SHAP Values

Feature importance attribution across imaging, pathology, and EHR inputs
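
To show what a SHAP value measures, the sketch below computes exact Shapley values for a small model by enumerating every feature subset, replacing absent features with a baseline value. Exact enumeration grows exponentially with the number of features, so practical SHAP implementations rely on approximations or model-specific algorithms; the function name and the baseline convention here are assumptions.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    predict: Callable[[List[float]], float],
    instance: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley attribution of predict(instance) relative to predict(baseline)."""
    n = len(instance)

    def value(subset: set) -> float:
        # Features outside the subset are replaced by their baseline value.
        mixed = [instance[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    result = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi += weight * (value(members | {i}) - value(members))
        result.append(phi)
    return result
```

For a linear model the attributions reduce to coefficient times the difference from baseline, and in general they sum to the gap between the prediction and the baseline prediction.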

Attention Maps

Spatial visualization of which image regions drive model predictions

Transcriptome Analysis

Correlating model-derived phenotypes with gene expression data for biological validation

Infrastructure

Data Harmonization

Robust preprocessing and quality control pipelines ensure that data from five institutions is analysis-ready and comparable.

DICOM Preprocessing

Standardized pipeline for DICOM ingestion, windowing, and conversion
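
The windowing step can be sketched as a linear mapping from a chosen intensity window to an 8-bit display range. This version uses the simplified bounds center ± width/2 rather than the exact formula in the DICOM standard, and it assumes pixel values have already been read and rescaled; the function name is hypothetical.

```python
from typing import List


def apply_window(pixels: List[List[float]], center: float, width: float) -> List[List[int]]:
    """Clip intensities to [center - width/2, center + width/2] and scale to 0..255."""
    if width <= 0:
        raise ValueError("window width must be positive")
    lower = center - width / 2.0
    upper = center + width / 2.0
    windowed = []
    for row in pixels:
        windowed.append(
            [round(255 * (min(max(p, lower), upper) - lower) / (upper - lower)) for p in row]
        )
    return windowed
```

Values below the window map to 0 and values above it map to 255, so contrast is spent only on the intensity range of interest.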

2D-to-3D ROI Mapping

Maps 2D pathology annotations to 3D volumetric imaging coordinates
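
Assuming a registration step has already produced a 4x4 homogeneous transform from the slide plane to volume coordinates, the mapping itself is a matrix multiplication applied to each annotation point. The sketch below shows only that final step; how the transform is estimated is outside its scope, and the function name and example matrix are hypothetical.

```python
from typing import List, Sequence, Tuple


def map_roi_to_volume(
    points_2d: Sequence[Tuple[float, float]], affine: Sequence[Sequence[float]]
) -> List[Tuple[float, ...]]:
    """Apply a 4x4 homogeneous transform to 2D points lying in the z = 0 plane."""
    mapped = []
    for x, y in points_2d:
        source = (x, y, 0.0, 1.0)
        # Only the first three rows produce the x, y, z output coordinates.
        mapped.append(
            tuple(sum(affine[r][c] * source[c] for c in range(4)) for r in range(3))
        )
    return mapped
```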

Implant Detection

Automatically identifies breast implants in mammography for downstream handling

MQUAL

MRI quality scoring for SNR, motion artifacts, and protocol completeness
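
One common way to estimate SNR is to divide the mean intensity in a tissue region by the standard deviation of a background region. The sketch below uses that definition as an assumption; it is not necessarily the formula MQUAL applies, and the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def snr(signal_roi: Sequence[float], background_roi: Sequence[float]) -> float:
    """Mean signal intensity divided by the background standard deviation."""
    noise = pstdev(background_roi)
    if noise == 0:
        raise ValueError("background has zero variance; SNR is undefined")
    return mean(signal_roi) / noise
```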

Beaks

Cross-modality quality assessment for radiology and pathology image sets

HistoQC

Whole-slide pathology QC with artifact masking and tissue segmentation

F-SYN

Fourier-domain stain normalization for digital pathology slides

Radiology Quality

  • DICOM header validation
  • SNR / CNR assessment
  • Motion artifact scoring
  • Protocol compliance check

Pathology Quality

  • Tissue/background segmentation
  • Artifact detection (blur, fold)
  • Stain variability metrics
  • Tile-level quality masks
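
As one illustration of the blur detection and tile-level masking steps listed above, the sketch below scores each grayscale tile by the variance of its Laplacian response, a common sharpness heuristic, and keeps tiles that meet a threshold. The threshold value, the 4-neighbour Laplacian, and the function names are assumptions, and tiles are assumed to be at least 3x3 pixels.

```python
from statistics import pvariance
from typing import List, Sequence

Tile = Sequence[Sequence[float]]


def laplacian_variance(tile: Tile) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels; low means blurry."""
    responses = []
    for r in range(1, len(tile) - 1):
        for c in range(1, len(tile[0]) - 1):
            responses.append(
                tile[r - 1][c] + tile[r + 1][c] + tile[r][c - 1] + tile[r][c + 1]
                - 4 * tile[r][c]
            )
    return pvariance(responses)


def quality_mask(tiles: Sequence[Tile], threshold: float) -> List[bool]:
    """True for tiles considered sharp enough to keep."""
    return [laplacian_variance(tile) >= threshold for tile in tiles]
```

A flat or out-of-focus tile produces near-zero Laplacian responses everywhere, while a tile with crisp edges produces large positive and negative responses and therefore a high variance.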

EHR Quality

  • HL7 FHIR conformance
  • ICD-10 code validation
  • Missing data imputation
  • Recurrence label NLP verification
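
A first-pass ICD-10 check can be a format test before any lookup against a code table. The sketch below assumes the ICD-10-CM shape (a letter, a digit, one alphanumeric character, then an optional dot and up to four more alphanumeric characters) and only verifies that shape; it cannot tell whether a well-formed code actually exists. The function name is hypothetical.

```python
import re

# Shape of an ICD-10-CM code, for example "C61" or "C50.911".
ICD10_CM_PATTERN = re.compile(r"[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?")


def has_icd10_cm_shape(code: str) -> bool:
    """Check the format only; a matching code may still be absent from the code set."""
    return ICD10_CM_PATTERN.fullmatch(code.strip().upper()) is not None
```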