Medical multimodal multitask foundation model for lung cancer screening

Multimodal multitask datasets
Figure 2a shows the general data curation pipeline, including medical task definition, task-specific multimodal data collection, multimodal data processing and alignment, and multimodal question-answering (MQA) dataset construction. We target 17 (sub-)tasks in the LCS process, including 5 tasks for lung nodule detection and characterization, 1 task for cardiovascular disease (CVD) diagnosis, 1 task for CVD mortality risk prediction, 1 task for lung cancer risk prediction over multiple years, 7 tasks for other chest abnormality examinations, 1 task for COVID-19 detection, and 1 task for Lung CT Screening Reporting and Data System (Lung-RADS) categorization per American College of Radiology (ACR) guidelines. COVID-19 detection from CT is included since it remains a global threat40 and was reported in the LCS radiology reports collected from Massachusetts General Hospital (MGH) and Wake Forest University School of Medicine (WFUSM). The ground-truth labels come from different information sources, including radiology reports, disease history, pathology test results, follow-up data, death reports, and laboratory test results, as described in Fig. 3a.

a General data construction workflow consists of four steps: medical task definition, task-specific multimodal data collection, multimodal data processing and alignment, and multimodal question-answering construction. b The data used in this study were collected from two data centers, National Lung Screening Trial (NLST) and Medical Imaging and Data Resource Center (MIDRC), and two medical institutes, Wake Forest University School of Medicine (WFUSM) and Massachusetts General Hospital (MGH), with the key characteristics summarized, based on which a large volumetric Computed Tomography (CT) pretraining dataset and a simulated clinical dataset were constructed. The detailed configuration can be found in Supplementary Table 3. The blue boxes indicate the OpenM3Chest dataset that is publicly available. c The patient sex and age distributions of the collected data from the involved data centers, where the age data represent mean age ± standard deviation. d Distributions of the training, validation, and test datasets over all tasks. e Distributions of independent evaluation datasets from MGH. f Distributions of independent evaluation, full-dose (FD) CT, and fine-tuning datasets from WFUSM. CVD Cardiovascular Disease, Reticular/…/scar reticular/reticulonodular opacities/honeycombing/fibrosis/scar, where / means or, COVID-19 Coronavirus Disease 2019, Lung-RADS Lung CT Screening Reporting and Data System, CAC Coronary Artery Calcification. Source data are provided as a Source Data file.

a Alignment among text input, image input, example questions, and candidate answers. The black bounding boxes on lungs and heart illustrate input region sizes, including 2.5-dimensional left or right lung regions (2.5D L/R Lung), three-dimensional heart regions, three-dimensional left and right lung regions (3D L&R Lungs), and three-dimensional left or right lung regions (3D L/R Lungs). b Multimodal data elements involved in this work including three-dimensional (3D) computed tomography (CT). Patient information on race and ethnicity is self-reported. COVID-19 Coronavirus Disease 2019, Lung-RADS Lung CT Screening Reporting and Data System.
To curate the multimodal datasets, multiple data sources were aligned, including volumetric CT scans, demographics, smoking history, disease history, cancer history, family cancer history, and other task-specific clinical data. Race and ethnicity of NLST data are self-reported by participants using standardized questionnaires provided during the NLST enrollment process. In total, 49 different clinical data types were integrated into the multimodal datasets for LCS, as described in Fig. 3b. For each task, one training, one validation, and one or more testing datasets were constructed. Our multimodal multitask dataset is summarized in Fig. 2b. The data were collected from different data centers and institutes, including NLST, MIDRC, WFUSM, and MGH. In total, we curated 17 training, 17 validation, and 34 testing datasets for the 17 tasks, with detailed information in Fig. 2c–e. We also collected an out-of-distribution multimodal dataset from WFUSM for transfer learning. To inspect the ability to model textual clinical data, we simulated a dataset for clinical information retrieval, as illustrated in Fig. 1e. Since we unify multimodal multitask learning in an MQA framework, each dataset consists of task-specific multimodal inputs, questions, and answers. The details for all tasks are summarized in Fig. 3a.
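Since each dataset pairs task-specific multimodal inputs with questions and candidate answers, one record of an MQA dataset could be organized as sketched below. This is an illustrative sketch only; all field names and values are hypothetical rather than taken from the actual OpenM3Chest schema.

```python
from dataclasses import dataclass

@dataclass
class MQARecord:
    """One example in a multimodal question-answering dataset (hypothetical schema)."""
    ct_series_path: str        # path to the volumetric CT series
    region: str                # input region, e.g. "3D Heart" or "3D L&R Lungs"
    clinical_text: str         # serialized demographics, smoking/disease history, etc.
    question: str              # task-specific question
    candidate_answers: list    # closed set of candidate answers
    answer_index: int          # index of the ground-truth answer

# A made-up CVD diagnosis example in this format:
record = MQARecord(
    ct_series_path="ct/case_0001.nii.gz",
    region="3D Heart",
    clinical_text="Age: 61; Sex: Male; Pack-years: 30; Hypertension: Yes",
    question="Does the patient have significant cardiovascular abnormality?",
    candidate_answers=["No", "Yes"],
    answer_index=1,
)
```

In such a format, every task reduces to selecting (or generating) the correct answer given the aligned image region, clinical text, and question.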
As the first data source, we were granted access to all recorded data in NLST, a randomized trial evaluating LCS with 3D LDCT versus 2D chest radiography, which demonstrated that screening with LDCT lowered lung cancer mortality by 20%. The NLST data were collected from 33 medical institutions, which were randomly indexed without publicly revealing their identities. The 26,722 participants in the LDCT screening arm were enrolled from August 2002 through April 2004. The participants underwent three screenings at 1-year intervals from August 2002 through September 2007. The follow-up data were collected until December 31, 2009. During the whole process, diverse data were recorded, including demographics, smoking history, disease history, multiple CT series with different reconstruction algorithms and associated imaging parameters, key abnormalities in fully structured reports, pathology test results for lung cancer, follow-up data, and vital status. Consistent with NLST clinical practice, we constructed 15 multimodal datasets for 15 tasks, including 5 datasets for predicting the presence of lung nodules and estimating the location, size, margin, and attenuation properties of lung nodules; 7 datasets for identifying chest abnormalities, including atelectasis, pleural thickening/effusion, non-calcified hilar/mediastinal adenopathy/mass, chest wall abnormality (bone destruction, metastasis, etc.), consolidation, emphysema, and reticular/reticulonodular opacities/honeycombing/fibrosis/scar; 1 dataset for CVD diagnosis; 1 dataset for CVD mortality risk prediction following16, where the intervals between screening CT and CVD mortality range from 11 days to 2619 days (within 8 years); and 1 dataset for lung cancer risk prediction within 1 to 6 years as in14. For the CVD mortality risk prediction task, we further stratified the binary risk into 1-6 cut-off year risks following14.
Each dataset was randomly split into training, validation, and test datasets. Across all tasks, no patient in a test dataset appears in the corresponding training or validation datasets. From NLST, we included 125,090 usable volumetric chest CT scans from the 26,254 received patient cases.
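The leakage-free split described above amounts to partitioning at the patient level rather than the scan level, so that all scans of one patient land in the same subset. A minimal sketch of such a split, with hypothetical identifiers and ratios:

```python
import random
from collections import defaultdict

def patient_level_split(scan_ids, patient_of, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split scans into train/val/test so no patient spans two subsets.

    scan_ids:   iterable of scan identifiers
    patient_of: mapping from scan id to patient id
    """
    by_patient = defaultdict(list)
    for s in scan_ids:
        by_patient[patient_of[s]].append(s)
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)  # shuffle patients, not scans
    n = len(patients)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    groups = (patients[:n_train],
              patients[n_train:n_train + n_val],
              patients[n_train + n_val:])
    # expand each patient group back into its scans
    return [[s for p in g for s in by_patient[p]] for g in groups]
```

Because the shuffle operates on patient identifiers, a patient with several screening CTs contributes all of them to a single subset, which is what prevents test-set leakage.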
The second data source is the Medical Imaging and Data Resource Center (MIDRC)41, a collaboration of leading medical imaging organizations launched in August 2020 as part of NIBIB’s response to the COVID-19 pandemic. We were granted access to all CT series with the associated clinical data. The ground-truth labels for COVID-19 were determined by either the Reverse Transcription Polymerase Chain Reaction (RT-PCR) test or the Rapid Antigen Test (RAT). From MIDRC, we retrieved 35,730 volumetric chest CT series of 7609 patients scanned from 2011 to 2021. The patient data were randomly split into training, validation, and test datasets.
All CT scans from NLST and MIDRC excluding those in any test datasets were combined as a CT pretraining dataset, comprised of 128,693 CT scans in total. To inspect if the clinical data are effectively encoded, we constructed a clinical question-answering dataset to retrieve key information from the textual clinical data. The integration of all the above-curated datasets is called OpenM3Chest.
To test the generalizability of M3FM, we independently collected two multimodal multitask datasets from the third and fourth data sources, i.e., WFUSM and MGH, respectively. These multimodal LCS datasets include CT scans, radiology reports, demographics, smoking history, disease history, personal cancer history, family lung cancer history, and pathology test results for lung cancer. Race and ethnicity data of the MGH and WFUSM datasets were collected from the MGH and WFUSM electronic health record systems and self-reported by participants. The radiology reports from WFUSM and MGH follow a structured reporting template with sub-headers, with free text under each sub-header. We also collected a full-dose CT dataset with the associated radiology reports from WFUSM to evaluate the generalizability to full-dose CT scans. The MGH and WFUSM review boards approved the analysis of all these multimodal data and tasks. Based on the radiology reports and the pathology test results, we constructed 7 datasets from WFUSM and 6 datasets from MGH for independent evaluation, with the detailed information shown in Fig. 2b, c, e, f. Specifically, at WFUSM we collected data from 8053 patients from 2015 to 2023, all with radiology reports, 1800 of whom (from September 7, 2021 to December 30, 2022) had LDCT and multimodal information. We collected data from 1000 patients with full-dose CT scans and the associated radiology reports from September 22, 2022, to December 31, 2022, at WFUSM. We collected multimodal data from 904 patients at MGH from 2016 to 2021. The Lung-RADS dataset from WFUSM was randomly split into training, validation, and test datasets to classify the text descriptions into the Lung-RADS categories. All other datasets of WFUSM and MGH were used for testing.
To evaluate the adaptability of our M3FM, we collected an out-of-distribution multimodal dataset for non-small cell lung cancer (NSCLC) immunotherapy prognosis from WFUSM. This dataset consists of data from 90 patients, including the target label indicating if the patient was diagnosed with immune checkpoint-inhibitor-induced pneumonitis after immunotherapy, the CT scans before immunotherapy, and clinical variables including the total cycles of Immuno-Oncology (IO) therapy, smoking history in pack-years, Body Mass Index (BMI) at diagnosis, age, and whether the patient received radiation prior to immunotherapy. Among the 90 patients, 49 developed immune checkpoint-inhibitor-induced pneumonitis, and the remaining patients served as the control group.
Further details on the multimodal data processing and alignment and the MQA dataset construction are described in the Methods section.
M3FM performance
Figure 4a summarizes the key results of M3FM against the previous SOTA models14,16,34,42,43,44,45 and the most powerful generalist AI model GPT-4o46 on the OpenM3Chest dataset. The competing models are summarized in Supplementary Table 1. We used the Area Under the receiver operating characteristic Curve (AUC) and the 95% two-sided Confidence Intervals (CI) of AUC as the evaluation metrics47.
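For reference, the AUC is equivalent to the Mann-Whitney U statistic (the probability that a random positive case is ranked above a random negative case), and a two-sided 95% CI can be approximated by bootstrap resampling. The sketch below illustrates the idea; the paper follows ref. 47, which may use a different CI estimator.

```python
import numpy as np

def auc_mann_whitney(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC."""
    rng = np.random.default_rng(seed)
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    n, aucs = len(y_true), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # resample cases with replacement
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample lacks one class; AUC undefined, skip
        aucs.append(auc_mann_whitney(y_true[idx], scores[idx]))
    lo, hi = np.quantile(aucs, [alpha / 2, 1 - alpha / 2])
    return lo, hi
```

The pairwise comparison is O(n_pos × n_neg); for large test sets a rank-based formulation is cheaper, but the bootstrap structure is the same.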

a Comparison of the best M3FMs with previous state-of-the-art (SoTA) models, including Generative Pre-trained Transformer 4 Omni (GPT-4o), in terms of Area Under the Curve (AUC) relative improvement. The compared models are summarized in Supplementary Table 1. The AUC values and 95% confidence intervals of all models can be found in Supplementary Table 2. b AUC results with 95% confidence intervals for M3FM models of three scales: Base, Large, and Huge. The AUC value and two-sided 95% confidence interval for each task were calculated from its entire test dataset. Error bars in b indicate the two-sided 95% confidence intervals. CVD Cardiovascular Disease, Reticular/…/scar reticular/reticulonodular opacities/honeycombing/fibrosis/scar, where / means or, COVID-19 Coronavirus Disease 2019, Lung-RADS Lung CT Screening Reporting and Data System. Source data are provided as a Source Data file.
With the detailed comparative results summarized in Supplementary Table 2, M3FM outperformed the previous SOTA models across all tasks, with significant improvements on most of them. Specifically, for a fair comparison, we retrained the Sybil model14, denoted as Sybil*, for lung cancer risk prediction without using costly bounding box annotations, predicting lung cancer risk by merging the separate results for the left and right lungs. Sybil* achieved inferior results for 1-2-year risk prediction but superior results for 3-6-year risk prediction in comparison with the original Sybil model. Without using any bounding box, our M3FM achieved AUCs of 0.9400 (95% CI = 0.9119–0.9698), 0.8881 (95% CI = 0.8567–0.9195), 0.8599 (95% CI = 0.8288–0.8910), 0.8604 (95% CI = 0.8310–0.8898), 0.8392 (95% CI = 0.8098–0.8685), and 0.8232 (95% CI = 0.7936–0.8529) for lung cancer risk prediction over the six years, outperforming the Sybil* and original Sybil models by margins of 5% to 9% and 2% to 11%, respectively. For CVD diagnosis and CVD mortality prediction, we compared the results on both the original dataset16 and our OpenM3Chest dataset. M3FM achieved an AUC of 0.9284 (95% CI = 0.9136–0.9433) for CVD diagnosis and an AUC of 0.8904 (95% CI = 0.8427–0.9381) for CVD mortality prediction on the OpenM3Chest dataset, outperforming the previous model (Tri2D-Net16) by 5% and 9% respectively, and achieved an AUC of 0.9304 (95% CI = 0.9150–0.9458) for CVD diagnosis and 0.8606 (95% CI = 0.8063–0.9150) for CVD mortality prediction on the datasets constructed in16, outperforming Tri2D-Net by ~5% and ~10% respectively.
On average, M3FM enhanced the 1-6-year CVD mortality risk prediction performance by 14.22% in AUC compared to the previous best model (see Supplementary Table 2). For several tasks, including nodule detection, nodule localization, nodule size prediction, and emphysema detection, M3FM improved the results by up to 3% in AUC. For all the other tasks, M3FM significantly improved the performance, from ~5% to ~10%. To study the scalability of M3FM, we trained three versions with 257M (M3FM-Base), 502M (M3FM-Large), and 865M (M3FM-Huge) trainable parameters, respectively. The results obtained with these three models are summarized in Fig. 4b. Overall, performance improved with model size, especially from M3FM-Base to M3FM-Large. This trend is consistent with the well-known scaling law48 in the field of foundation models.
M3FM encoding multimodal data and synergizing multiple clinical tasks
Table 1 compares the results of the single-modality single-task, multi-modality single-task, and multi-modality multitask M3FM-Large models. First, the single-modality single-task models, denoted M3FM-SM-ST, were trained and evaluated on LDCT data only, while the multi-modality single-task models, denoted M3FM-MM-ST, were trained and evaluated on multimodal data. Overall, the multimodal information improved the prediction results for multiple tasks. In particular, M3FM-SM-ST achieved an AUC of 0.8163 (95% CI = 0.7585–0.8741) for CVD mortality prediction while M3FM-MM-ST achieved an AUC of 0.8709 (95% CI = 0.8200–0.9219), a 5.46% improvement. Similarly, for multi-year CVD mortality risk prediction, the multimodal model outperformed the single-modality model by 5% on average, as shown in Supplementary Table 2. While M3FM-SM-ST achieved an AUC of 0.8924 (95% CI = 0.8745–0.9104) for CVD diagnosis, M3FM-MM-ST achieved an AUC of 0.9238 (95% CI = 0.9084–0.9392), a 3.14% improvement. Similarly, M3FM-SM-ST achieved an AUC of 0.6515 (95% CI = 0.5939–0.7092) for consolidation detection, and M3FM-MM-ST achieved an AUC of 0.6895 (95% CI = 0.6326–0.7464), a 3.80% improvement. Also, M3FM-SM-ST achieved an AUC of 0.7676 (95% CI = 0.7573–0.7779) for reticular/reticulonodular opacities/honeycombing/fibrosis/scar detection, and M3FM-MM-ST achieved an AUC of 0.7929 (95% CI = 0.7830–0.8027), a 2.53% improvement. For the other tasks, the M3FM-MM-ST models produced slightly improved or comparable results relative to M3FM-SM-ST. Then, we compared the multimodal multitask model (M3FM-MM-MT) with the multimodal single-task models (M3FM-MM-ST). Impressively, by training on multiple tasks, M3FM-MM-MT outperformed M3FM-MM-ST on 17 out of 22 (sub-)tasks.
In reference to the label distributions of the multiple tasks in Supplementary Table 3, the five tasks that did not benefit from multitask learning have the largest balance ratios, i.e., the number of minority-class samples divided by the number of majority-class samples. In other words, multitask learning is more beneficial for tasks with more imbalanced datasets or a much smaller number of positive/minority-class labels.
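The balance ratio referred to here can be computed directly from the label counts; a minimal sketch:

```python
from collections import Counter

def balance_ratio(labels):
    """Minority-over-majority class count ratio; 1.0 means perfectly balanced."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())

# e.g. 10 positives among 100 cases gives a ratio of 10/90 ≈ 0.11
```

By this measure, the observation above says that tasks with ratios close to 1.0 gained little from multitask learning, while low-ratio (imbalanced) tasks gained the most.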
M3FM identifying clinically informational elements
Since M3FM accommodates any combination of multimodal inputs in both the training and inference stages, we applied it to analyze the synergy between multimodal data elements and clinical tasks by observing how different input combinations affect the model outputs. Table 2 presents the ablation results using different combinations of multimodal data for CVD diagnosis and mortality prediction. M3FM using all multimodal inputs improved the AUC by 3%~4% relative to using LDCT only, and by 12% and 5% relative to using clinical data only, for CVD diagnosis and mortality prediction respectively. Furthermore, the M3FM results show that the disease histories of heart disease or heart attack, hypertension, stroke, and diabetes consistently boosted the AUC as they were gradually added to the input combination for CVD diagnosis and mortality prediction. Supplementary Table 4 shows the lung cancer risk prediction results using different inputs, indicating that demographic information slightly improved the AUC.
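Such an ablation enumerates input combinations, always retaining LDCT while toggling the clinical fields on and off; a hypothetical sketch of the enumeration (the field list mirrors the histories named above, but the exact serialization is illustrative):

```python
from itertools import combinations

CLINICAL_FIELDS = [
    "heart disease or heart attack",
    "hypertension",
    "stroke",
    "diabetes",
]

def ablation_inputs(fields):
    """Yield every input combination: LDCT alone, then LDCT plus each subset of fields."""
    for r in range(len(fields) + 1):
        for combo in combinations(fields, r):
            yield ("LDCT",) + combo

combos = list(ablation_inputs(CLINICAL_FIELDS))  # 2**4 = 16 combinations
```

Each combination would then be fed to the model and scored separately, which is how a table like Table 2 can be populated without retraining per configuration, given a model that accepts arbitrary input subsets.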
Then, we evaluated if M3FMs could effectively encode the physical size information. The ablation results in Fig. 5a show that the embedded physical size of LDCT improved the AUC results for multiple tasks. The physical size information boosted the AUC of 1 ~ 6-year lung cancer risk prediction by 5%, 4%, 4%, 7%, 8% and 12%, respectively. The physical size information also improved AUC results of the nodule size characterization, CVD diagnosis, and CVD mortality prediction, by 0.71%, 0.47%, and 1.11% respectively.
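One common way to make a Transformer aware of physical voxel size, sketched below with a sinusoidal encoding, is to embed the (z, y, x) spacing in millimeters as an extra token or additive bias. This is an assumption-laden illustration of the general technique, not the exact M3FM embedding.

```python
import numpy as np

def voxel_size_embedding(spacing_mm, dim=96):
    """Sinusoidal embedding of (z, y, x) voxel spacing in millimeters.

    Each spacing component receives dim // 3 channels (half sin, half cos),
    analogous to standard positional encodings.
    """
    assert dim % 6 == 0, "dim must split evenly into 3 components of sin/cos pairs"
    d = dim // 3
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    parts = []
    for s in spacing_mm:
        angles = s * freqs
        parts.append(np.concatenate([np.sin(angles), np.cos(angles)]))
    return np.concatenate(parts)  # shape: (dim,)
```

Because the embedding varies smoothly with spacing, scans reconstructed at, say, 1.0 mm versus 2.5 mm slice thickness map to distinct vectors, letting the model relate patch counts to physical extent.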

a Evaluation of voxel size embedding in computed tomography (CT) imaging. The Area Under the Curve (AUC) values and 95% confidence intervals for Medical Multimodal Multitask Foundation Model (M3FM) models are reported with and without embedding CT voxel sizes across various tasks. The AUC value and two-sided 95% confidence interval for each task were calculated from its entire test dataset. The error bars indicate the two-sided 95% confidence intervals. Source data are provided as a Source Data file. b The attention maps of the task encoder for two cardiovascular disease (CVD) diagnosis examples, where the two cases were reported with significant CVD abnormalities. c The attention maps of the task encoder for two lung cancer risk prediction examples, where the pathology test results confirmed the lung cancer within one year following their low-dose CT lung cancer screenings.
We quantitatively evaluated the relevance of different clinical elements to the model outputs by visualizing the attention maps of the last task attention block in M3FM. Figure 5b visualizes the attention heat maps on selected CT slices and text tokens of individual patients with CVD or lung cancer risks. In CVD diagnosis, the coronary artery calcification areas are highlighted in the LDCT attention heat maps, and the patients’ disease histories of diabetes, heart disease or heart attack, hypertension, and stroke are the most relevant text tokens, consistent with the quantitative results in Table 2. Furthermore, the ablation inference in Supplementary Fig. 1 explicitly shows how information from multiple sources combines to affect the model prediction. In a positive CVD case, M3FM failed when taking LDCT only or LDCT plus uninformative clinical data as inputs, yet successfully diagnosed CVD when using LDCT plus the relevant clinical data, including the diabetes/heart disease history. This agrees with the ablation results summarized in Table 2, where multimodal inputs are statistically beneficial for M3FM inference. In predicting lung cancer risks, the lung nodules in LDCT images are localized in the heat maps, and the text tokens related to demographics and family lung cancer history correlate more strongly with the model outputs, as shown in Fig. 5c. Although visualizing attention maps provides a window into the behavior of the Transformer model, it does not always reliably reveal the correlation between model predictions and input tokens, e.g., less relevant tokens were also highlighted in Fig. 5b, consistent with prior findings49,50.
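Extracting such a heat map typically means averaging, over heads, the attention weights from the task (query) token to all input tokens in the last task attention block, then renormalizing and reshaping the image-token weights onto CT slices. A minimal sketch under that assumption (the task-token index and tensor layout are illustrative):

```python
import numpy as np

def task_token_attention(attn, task_index=0):
    """Head-averaged attention from the task token to every input token.

    attn: attention weights of shape (heads, tokens, tokens) from the
          last task attention block, rows softmax-normalized per head.
    Returns a (tokens,) relevance vector that sums to 1.
    """
    weights = attn[:, task_index, :].mean(axis=0)  # average the heads
    return weights / weights.sum()                 # renormalize
```

The resulting vector can be split into its image-token and text-token segments; the former is upsampled to the CT volume for overlay, while the latter is displayed directly over the clinical text.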
M3FM improving generalizability
We evaluated the generalizability of M3FMs on the multimodal datasets independently collected from MGH and WFUSM, with the comparative results shown in Fig. 6a, b, respectively. For the CVD diagnosis task, we constructed two datasets, which regard (1) moderate and severe CVD as positive and (2) severe CVD only as positive, respectively. On the two MGH CVD datasets, the multimodal multitask model (M3FM-MM-MT) improved the AUC by 10.60% and 6.57% relative to the previous model, improved the AUC by 4.85% and 2.36% relative to the single-modality single-task model (M3FM-SM-ST), and also achieved slight AUC improvements relative to the multi-modality single-task model (M3FM-MM-ST). Relative to M3FM-SM-ST, the M3FM-MM-ST model improved the AUC by 4.39% and 1.75% on the two MGH CVD datasets respectively. For the 1-year lung cancer risk prediction on the MGH dataset, the M3FM-MM-MT model improved the AUC by 20.80% against the previous model under the same experimental setting without using any bounding box annotations, by 4.85% over M3FM-SM-ST, and by 6.89% over M3FM-MM-ST. On the MGH emphysema, atelectasis, and reticular opacities/honeycombing/fibrosis/scar datasets, M3FM improved the AUC by 5.23%, 14.34%, and 12.91% relative to the previous model, and also achieved AUC improvements of 0.24%~4.96% over M3FM-SM-ST and M3FM-MM-ST. For the CVD tasks on the WFUSM datasets, M3FM-MM-MT improved the AUC by 12% and 6.29% against the previous model, improved the AUC by 6.46% and 3.77% relative to M3FM-SM-ST, and yielded essentially the same results as M3FM-MM-ST. For the 1-year lung cancer risk prediction on the WFUSM dataset, the M3FM-MM-MT model improved the AUC by 18.54% relative to the previous model under the same experimental setting without using any bounding box annotations, by 5.91% over M3FM-SM-ST, and by 2.57% over M3FM-MM-ST; M3FM-MM-ST itself improved the AUC by 3.24% over M3FM-SM-ST.
On the WFUSM emphysema, atelectasis, and reticular opacities/honeycombing/fibrosis/scar datasets, the M3FM-MM-MT model improved the AUC by 0.78%, 9.65%, and 14.24% against the previous model. We further evaluated the generalizability of M3FM on the full-dose CT scans in Fig. 6c. It is observed that M3FM (M3FM-SM-ST) models trained with LDCT scans performed similarly on the diagnosis tasks of scar, atelectasis, and emphysema abnormalities, but had an evident performance drop for CVD diagnosis on full-dose CT scans. M3FMs still outperformed the competing models by 1% ~ 10% on all the compared tasks.

Evaluation results of the M3FM variants including single-modality (SM), multimodality (MM), single task (ST), and multitask (MT), and competing models on the (a) MGH, (b) WFUSM datasets, and (c) WFUSM full-dose CT datasets in terms of Area Under the Curve (AUC) and 95% confidence intervals. The AUC value and two-sided 95% confidence interval for each task were calculated from its entire test dataset. The error bars indicate the two-sided 95% confidence intervals. CAC Coronary Artery Calcification, which is a type of cardiovascular disease. Source data are provided as a Source Data file.
M3FM enhancing out-of-distribution multimodal analysis
We further evaluated whether M3FM, as a foundation model, facilitates out-of-distribution multimodal modeling, as shown in Fig. 7. For this purpose, we fine-tuned M3FM to predict immunotherapy-induced pneumonitis from the pre-treatment volumetric CT and the selected immunotherapy-related clinical data described above. We used the method developed at WFUSM as the reference method51 and compared different fine-tuned variants of M3FM in terms of the mean AUC and its standard deviation in five-fold cross-validation. The reference model is based on a nomogram that predicts immunotherapy outcomes using features extracted by radiomic algorithms, a pre-trained ViT-Base model, and clinical records; after feature selection, 20 radiomic features, 20 deep features, and 17 clinical features were used for the nomogram, yielding an AUC of 0.894 ± 0.075 when merging all radiomic and clinical features. The best of our fine-tuned M3FMs achieved an AUC of 0.941 ± 0.026, a 4.7% improvement over the reference model. The M3FM-CT model using CT data only achieved an AUC of 0.919 ± 0.026. The M3FM-Clinical model using clinical text only achieved an AUC of 0.911 ± 0.029. M3FM-Scratch, trained without pretraining, achieved an AUC of 0.925 ± 0.025, still comparing favorably with the reference model.
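The five-fold protocol above partitions the patients into five disjoint folds and reports the mean and standard deviation of the per-fold AUCs. A minimal sketch follows; whether the paper uses the sample or population standard deviation is not stated, so the sample version is assumed here.

```python
import random
import statistics

def five_fold_indices(n, seed=0):
    """Partition patient indices 0..n-1 into five disjoint, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::5] for i in range(5)]

def cv_summary(fold_aucs):
    """Mean and sample standard deviation across cross-validation folds."""
    return statistics.mean(fold_aucs), statistics.stdev(fold_aucs)
```

In each round, one fold serves as the test split and the remaining four as training data; the five resulting test AUCs are then summarized as mean ± standard deviation, matching the "0.941 ± 0.026" reporting style.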

a The same M3FM architecture was fine-tuned to perform the out-of-distribution immunotherapy prognosis task with three-dimensional (3D) computed tomography (CT) and clinical inputs. b Results on immunotherapy-induced pneumonitis using different methods, including the reference method, M3FM-CT which only takes CT as input, M3FM-Clinical which only takes the clinical data as input, M3FM-Scratch which was trained from scratch without utilizing the pre-trained model, and M3FM which takes both CT and clinical data as inputs and was fine-tuned from the pre-trained model. The error bars represent mean AUC ± standard deviation from five-fold cross-validation (n = 5), with each fold corresponding to a distinct train/test split of the same dataset. CTViT Computed Tomography Vision Transformer. Source data are provided as a Source Data file.