Papers | Siqi Li

See the most current and complete list of works here.

2025

Leveraging AI and transfer learning to enhance out-of-hospital cardiac arrest outcome prediction in diverse setting

Siqi Li, Yohei Okada, Wenjun Gu, Michael Hao Chen, and 5 more authors

npj Digital Medicine, Nov 2025

Access to trustworthy artificial intelligence (AI) for clinical applications is uneven, especially in low-resource settings with limited and inconsistent data. Models from high-resource settings often fail to generalize. Transfer learning (TL) can adapt established models to new settings. Using neurological outcome prediction for out-of-hospital cardiac arrest (OHCA) as a proof of concept, we adapted a model trained on a large cohort to Vietnam (243 patients) and Singapore (15,916 patients) using the Pan-Asian Resuscitation Outcomes Study registry. The external model performed poorly on the Vietnam cohort, with an area under the receiver operating characteristic curve (AUROC) of 0.467 (95% CI: 0.141–0.785), but TL markedly improved performance (AUROC = 0.807, 95% CI: 0.626–0.948). In Singapore, TL yielded modest gains (AUROC = 0.955 vs. 0.945). These findings highlight the potential of TL to improve prediction accuracy across diverse healthcare contexts and to support equitable and safe global AI adoption.

Developing federated time-to-event scores using heterogeneous real-world survival data

Siqi Li, Ziwen Wang, Yuqing Shang, Qiming Wu, and 6 more authors

Computers in Biology and Medicine, Sep 2025

Objective: Survival analysis serves as a fundamental component in numerous healthcare applications, where the determination of the time to specific events (such as the onset of a certain disease or death) for patients is crucial for clinical decision-making. Scoring systems are widely used for swift and efficient risk prediction. However, existing methods for constructing survival scores presume that data originates from a single source, posing privacy challenges in collaborations with multiple data owners. Materials and methods: We propose a novel framework for building federated scoring systems for multi-site survival outcomes, ensuring both privacy and communication efficiency. We applied our approach to sites with heterogeneous survival data originating from emergency departments in Singapore and the United States. Additionally, we independently developed local scores at each site. Results: In testing datasets from each participant site, our proposed federated scoring system consistently outperformed all local models, evidenced by higher integrated area under the receiver operating characteristic curve (iAUC) values, with a maximum improvement of 11.6%. Additionally, the federated score’s time-dependent AUC(t) values showed advantages over local scores, exhibiting narrower confidence intervals (CIs) across most time points. Discussion: The model developed through our proposed method showed good local performance and is promising for future healthcare research. Sites participating in our proposed federated scoring model training can develop survival models with enhanced prediction accuracy and efficiency. Conclusion: This study demonstrates the effectiveness of our privacy-preserving federated survival score generation framework and its applicability to real-world heterogeneous survival data.

Bridging Data Gaps in Healthcare: A Scoping Review of Transfer Learning in Structured Data Analysis

Siqi Li, Xin Li, Kunyu Yu, Qiming Wu, and 11 more authors

Health Data Science, Sep 2025

Background: Clinical and biomedical research in low-resource settings often faces significant challenges due to the need for high-quality data with sufficient sample sizes to construct effective models. These constraints hinder robust model training and prompt researchers to seek methods for leveraging existing knowledge from related studies to support new research efforts. Transfer learning (TL), a machine learning technique, emerges as a powerful solution by utilizing knowledge from pre-trained models to enhance the performance of new models, offering promise across various healthcare domains. Despite its conceptual origins in the 1990s, the application of TL in medical research has remained limited, especially beyond image analysis. This review aims to analyze TL applications, highlight overlooked techniques, and suggest improvements for future healthcare research. Methods: Following the PRISMA-ScR guidelines, we conducted a search for published articles that employed TL with structured clinical or biomedical data by searching the SCOPUS, MEDLINE, Web of Science, Embase, and CINAHL databases. Results: We screened 5,080 papers, with 86 meeting the inclusion criteria. Among these, only 2% (two out of 86) utilized external studies, and 5% (four out of 86) addressed scenarios involving multi-site collaborations with privacy constraints. Conclusions: To achieve actionable TL with structured medical data while addressing regional disparities, inequality, and privacy constraints in healthcare research, we advocate for the careful identification of appropriate source data and models, the selection of suitable TL frameworks, and the validation of TL models with proper baselines.

FairFML: fair federated machine learning with a case study on reducing gender disparities in cardiac arrest outcome prediction

Siqi Li, Qiming Wu, Doudou Zhou, Xin Li, and 10 more authors

npj Health Systems, Aug 2025

Health equity is a critical concern in clinical research and practice, as biased predictive models can exacerbate disparities in clinical decision-making and patient outcomes. As healthcare systems increasingly rely on data-driven models, ensuring fairness in these systems is essential to prevent perpetuating existing disparities. While large-scale healthcare data exists across multiple institutions, cross-institutional collaborations often face privacy constraints, highlighting the need for privacy-preserving solutions that also promote fairness. We present Fair Federated Machine Learning (FairFML), a model-agnostic solution designed to reduce algorithmic bias in cross-institutional healthcare collaborations while preserving patient privacy. Validated through a real-world case study on reducing gender disparities in cardiac arrest outcome prediction, FairFML improved fairness metrics by up to 90% without compromising predictive performance. FairFML is flexible and compatible with various FL frameworks and models, from traditional statistical methods to deep learning, offering a robust and scalable solution for equitable model development in clinical settings.

FedIMPUTE: Privacy-preserving missing value imputation for multi-site heterogeneous electronic health records

Siqi Li, Mengying Yan, Ruizhi Yuan, Molei Liu, and 2 more authors

Journal of Biomedical Informatics, Mar 2025

Objective: We propose FedIMPUTE, a communication-efficient federated learning (FL) based approach for missing value imputation (MVI). Our method enables multiple sites to collaboratively perform MVI in a privacy-preserving manner, addressing challenges of data-sharing constraints and population heterogeneity. Methods: We begin by conducting MVI locally at each participating site, followed by the application of various FL strategies, ranging from basic to advanced, to federate local MVI models without sharing site-specific data. The federated model is then broadcast and used by each site for MVI. We evaluate FedIMPUTE using both simulation studies and a real-world application on electronic health records (EHRs) to predict emergency department (ED) outcomes as a proof of concept. Results: Simulation studies show that FedIMPUTE outperforms all baseline MVI methods under comparison, improving downstream prediction performance and effectively handling data heterogeneity across sites. By using ED datasets from three hospitals within the Duke University Health System (DUHS), FedIMPUTE achieves the lowest mean squared error (MSE) among benchmark MVI methods, indicating superior imputation accuracy. Additionally, FedIMPUTE provides good downstream prediction performance, outperforming or matching other benchmark methods. Conclusion: FedIMPUTE enhances the performance of downstream risk prediction tasks, particularly for sites with high missing data rates and small sample sizes. It is easy to implement and communication-efficient, requiring sites to share only non-patient-level summary statistics.

2024

Federated Learning in Healthcare: A Benchmark Comparison of Engineering and Statistical Approaches for Structured Data Analysis

Siqi Li, Di Miao, Qiming Wu, Chuan Hong, and 10 more authors

Health Data Science, Dec 2024

Background: Federated learning (FL) holds promise for safeguarding data privacy in healthcare collaborations. While the term “FL” was originally coined by the engineering community, the statistical field has also developed privacy-preserving algorithms, though these are less recognized. Our goal was to bridge this gap with the first comprehensive comparison of FL frameworks from both domains. Methods: We assessed seven FL frameworks, encompassing both engineering-based and statistical FL algorithms, and compared them against local and centralized modeling of logistic regression and least absolute shrinkage and selection operator (Lasso). Our evaluation utilized both simulated data and real-world emergency department data, focusing on comparing both estimated model coefficients and the performance of model predictions. Results: The findings reveal that statistical FL algorithms produce much less biased estimates of model coefficients. Conversely, engineering-based methods can yield models with slightly better prediction performance, occasionally outperforming both centralized and statistical FL models. Conclusion: This study underscores the relative strengths and weaknesses of both types of methods, providing recommendations for their selection based on distinct study characteristics. Furthermore, we emphasize the critical need to raise awareness of and integrate these methods into future applications of FL within the healthcare domain.

Variable importance analysis with interpretable machine learning for fair risk prediction

Yilin Ning, Siqi Li, Yih Yng Ng, Michael Yih Chong Chia, and 8 more authors

PLOS Digital Health, Jul 2024

Machine learning (ML) methods are increasingly used to assess variable importance, but such black box models lack stability when limited in sample sizes, and do not formally indicate non-important factors. The Shapley variable importance cloud (ShapleyVIC) addresses these limitations by assessing variable importance from an ensemble of regression models, which enhances robustness while maintaining interpretability, and estimates uncertainty of overall importance to formally test its significance. In a clinical study, ShapleyVIC reasonably identified important variables when the random forest and XGBoost failed to, and generally reproduced the findings from smaller subsamples (n = 2500 and 500) when statistical power of the logistic regression became attenuated. Moreover, ShapleyVIC reasonably estimated non-significant importance of race to justify its exclusion from the final prediction model, as opposed to the race-dependent model from the conventional stepwise model building. Hence, ShapleyVIC is robust and interpretable for variable importance assessment, with potential contribution to fairer clinical risk prediction.

Federated machine learning in healthcare: A systematic review on clinical applications and technical architecture

Zhen Ling Teo, Liyuan Jin, Nan Liu, Siqi Li, and 10 more authors

Cell Reports Medicine, Feb 2024

Summary: Federated learning (FL) is a distributed machine learning framework that is gaining traction in view of increasing health data privacy protection needs. By conducting a systematic review of FL applications in healthcare, we identify relevant articles in scientific, engineering, and medical journals in English up to August 31st, 2023. Out of a total of 22,693 articles under review, 612 articles are included in the final analysis. The majority of articles are proof-of-concept studies, and only 5.2% are studies with real-life application of FL. Radiology and internal medicine are the most common specialties involved in FL. FL is robust to a variety of machine learning models and data types, with neural networks and medical imaging being the most common, respectively. We highlight the need to address the barriers to clinical translation and to assess its real-world impact in this new digital data-driven healthcare scene.

CanVaxKB: a web-based cancer vaccine knowledgebase

Eliyas Asfaw*, Asiyah Yu Lin*, Anthony Huffman*, Siqi Li*, and 22 more authors

NAR Cancer, Jan 2024

Cancer vaccines have been increasingly studied and developed to prevent or treat various types of cancers. To systematically survey and analyze different reported cancer vaccines, we developed CanVaxKB (https://violinet.org/canvaxkb), the first web-based cancer vaccine knowledgebase that compiles over 670 therapeutic or preventive cancer vaccines that have been experimentally verified to be effective at various stages. Vaccine construction and host response data are also included. These cancer vaccines are developed against various cancer types such as melanoma, hematological cancer, and prostate cancer. CanVaxKB has stored 263 genes or proteins that serve as cancer vaccine antigen genes, which we have collectively termed ‘canvaxgens’. The three most frequently used canvaxgens are PMEL, MLANA and CTAG1B, often targeting multiple cancer types. A total of 193 canvaxgens are also reported in cancer-related ONGene, Network of Cancer Genes and/or Sanger Cancer Gene Consensus databases. Enriched functional annotations and clusters of canvaxgens were identified and analyzed. Searchable, user-friendly web interfaces support querying and comparing cancer vaccines. CanVaxKB cancer vaccines are also semantically represented by the community-based Vaccine Ontology to support data exchange. Overall, CanVaxKB is a timely and vital cancer vaccine source that facilitates efficient collection and analysis, further helping researchers and physicians to better understand cancer mechanisms.

2023

FedScore: A privacy-preserving framework for federated scoring system development

Siqi Li, Yilin Ning, Marcus Eng Hock Ong, Bibhas Chakraborty, and 7 more authors

Journal of Biomedical Informatics, Oct 2023

Objective: We propose FedScore, a privacy-preserving federated learning framework for scoring system generation across multiple sites to facilitate cross-institutional collaborations. Materials and methods: The FedScore framework includes five modules: federated variable ranking, federated variable transformation, federated score derivation, federated model selection and federated model evaluation. To illustrate usage and assess FedScore’s performance, we built a hypothetical global scoring system for mortality prediction within 30 days after a visit to an emergency department using 10 simulated sites divided from a tertiary hospital in Singapore. We employed a pre-existing score generator to construct 10 local scoring systems independently at each site and we also developed a scoring system using centralized data for comparison. Results: We compared the acquired FedScore model’s performance with that of other scoring models using the receiver operating characteristic (ROC) analysis. The FedScore model achieved an average area under the curve (AUC) value of 0.763 across all sites, with a standard deviation (SD) of 0.020. We also calculated the average AUC values and SDs for each local model, and the FedScore model showed promising accuracy and stability, with a high average AUC value that was closest to that of the pooled model and an SD that was lower than that of most local models. Conclusion: This study demonstrates that FedScore is a privacy-preserving scoring system generator with potentially good generalizability.

Federated and distributed learning applications for electronic health records and structured medical data: a scoping review

Siqi Li, Pinyan Liu, Gustavo G Nascimento, Xinru Wang, and 11 more authors

Journal of the American Medical Informatics Association, Aug 2023

Federated learning (FL) has gained popularity in clinical research in recent years to facilitate privacy-preserving collaboration. Structured data, one of the most prevalent forms of clinical data, has experienced significant growth in volume concurrently, notably with the widespread adoption of electronic health records in clinical practice. This review examines FL applications on structured medical data, identifies contemporary limitations, and discusses potential innovations. We searched 5 databases, SCOPUS, MEDLINE, Web of Science, Embase, and CINAHL, to identify articles that applied FL to structured medical data and reported results following the PRISMA guidelines. Each selected publication was evaluated from 3 primary perspectives, including data quality, modeling strategies, and FL frameworks. Out of the 1193 papers screened, 34 met the inclusion criteria, with each article consisting of one or more studies that used FL to handle structured clinical/medical data. Of these, 24 utilized data acquired from electronic health records, with clinical predictions and association studies being the most common clinical research tasks that FL was applied to. Only one article exclusively explored the vertical FL setting, while the remaining 33 explored the horizontal FL setting, with only 14 discussing comparisons between single-site (local) and FL (global) analysis. The existing FL applications on structured medical data lack sufficient evaluations of clinically meaningful benefits, particularly when compared to single-site analyses. Therefore, it is crucial for future FL applications to prioritize clinical motivations and develop designs and methodologies that can effectively support and aid clinical practice and research.

Handling missing values in healthcare data: A systematic review of deep learning-based imputation techniques

Mingxuan Liu*, Siqi Li*, Han Yuan, Marcus Eng Hock Ong, and 7 more authors

Artificial Intelligence in Medicine, Aug 2023

Objective: The proper handling of missing values is critical to delivering reliable estimates and decisions, especially in high-stakes fields such as clinical research. In response to the increasing diversity and complexity of data, many researchers have developed deep learning (DL)-based imputation techniques. We conducted a systematic review to evaluate the use of these techniques, with a particular focus on the types of data, intending to assist healthcare researchers from various disciplines in dealing with missing data. Materials and methods: We searched five databases (MEDLINE, Web of Science, Embase, CINAHL, and Scopus) for articles published prior to February 8, 2023 that described the use of DL-based models for imputation. We examined selected articles from four perspectives: data types, model backbones (i.e., main architectures), imputation strategies, and comparisons with non-DL-based methods. Based on data types, we created an evidence map to illustrate the adoption of DL models. Results: Out of 1822 articles, a total of 111 were included, of which tabular static data (29%, 32/111) and temporal data (40%, 44/111) were the most frequently investigated. Our findings revealed a discernible pattern in the choice of model backbones and data types, for example, the dominance of autoencoder and recurrent neural networks for tabular temporal data. The discrepancy in imputation strategy usage among data types was also observed. The “integrated” imputation strategy, which solves the imputation task simultaneously with downstream tasks, was most popular for tabular temporal data (52%, 23/44) and multi-modal data (56%, 5/9). Moreover, DL-based imputation methods yielded a higher level of imputation accuracy than non-DL methods in most studies. Conclusion: The DL-based imputation models are a family of techniques, with diverse network structures. Their designation in healthcare is usually tailored to data types with different characteristics. Although DL-based imputation models may not be superior to conventional approaches across all datasets, it is highly possible for them to achieve satisfactory results for a particular data type or dataset. There are, however, still issues with regard to portability, interpretability, and fairness associated with current DL-based imputation models.

A universal AutoScore framework to develop interpretable scoring systems for predicting common types of clinical outcomes

Feng Xie, Yilin Ning, Mingxuan Liu, Siqi Li, and 9 more authors

STAR Protocols, Jun 2023

Summary: The AutoScore framework can automatically generate data-driven clinical scores in various clinical applications. Here, we present a protocol for developing clinical scoring systems for binary, survival, and ordinal outcomes using the open-source AutoScore package. We describe steps for package installation, detailed data processing and checking, and variable ranking. We then explain how to iterate through steps for variable selection, score generation, fine-tuning, and evaluation to generate understandable and explainable scoring systems using data-driven evidence and clinical knowledge. For complete details on the use and execution of this protocol, please refer to Xie et al. (2020), Xie et al. (2022), Saffari et al. (2022), and the online tutorial https://nliulab.github.io/AutoScore/.

2022

Benchmarking emergency department prediction models with machine learning and public electronic health records

Feng Xie, Jun Zhou, Jin Wee Lee, Mingrui Tan, and 9 more authors

Scientific Data, Oct 2022

The demand for emergency department (ED) services is increasing across the globe, particularly during the current COVID-19 pandemic. Clinical triage and risk assessment have become increasingly challenging due to the shortage of medical resources and the strain on hospital infrastructure caused by the pandemic. As a result of the widespread use of electronic health records (EHRs), we now have access to a vast amount of clinical data, which allows us to develop prediction models and decision support systems to address these challenges. To date, there is no widely accepted clinical prediction benchmark related to the ED based on large-scale public EHRs. An open-source benchmark data platform would streamline research workflows by eliminating cumbersome data preprocessing, and facilitate comparisons among different studies and methodologies. Based on the Medical Information Mart for Intensive Care IV Emergency Department (MIMIC-IV-ED) database, we created a benchmark dataset and proposed three clinical prediction benchmarks. This study provides future researchers with insights, suggestions, and protocols for managing data and developing predictive tools for emergency care.

A novel interpretable machine learning system to generate clinical risk scores: An application for predicting early mortality or unplanned readmission in a retrospective cohort study

Yilin Ning, Siqi Li, Marcus Eng Hock Ong, Feng Xie, and 3 more authors

PLOS Digital Health, Jun 2022

Risk scores are widely used for clinical decision making and commonly generated from logistic regression models. Machine-learning-based methods may work well for identifying important predictors to create parsimonious scores, but such ‘black box’ variable selection limits interpretability, and variable importance evaluated from a single model can be biased. We propose a robust and interpretable variable selection approach using the recently developed Shapley variable importance cloud (ShapleyVIC) that accounts for variability in variable importance across models. Our approach evaluates and visualizes overall variable contributions for in-depth inference and transparent variable selection, and filters out non-significant contributors to simplify model building steps. We derive an ensemble variable ranking from variable contributions across models, which is easily integrated with an automated and modularized risk score generator, AutoScore, for convenient implementation. In a study of early death or unplanned readmission after hospital discharge, ShapleyVIC selected 6 variables from 41 candidates to create a well-performing risk score, which had similar performance to a 16-variable model from machine-learning-based ranking. Our work contributes to the recent emphasis on interpretability of prediction models for high-stakes decision making, providing a disciplined solution to detailed assessment of variable importance and transparent development of parsimonious clinical risk scores.

Development and validation of an interpretable clinical score for early identification of acute kidney injury at the emergency department

Yukai Ang*, Siqi Li*, Marcus Eng Hock Ong, Feng Xie, and 6 more authors

Scientific Reports, May 2022

Acute kidney injury (AKI) in hospitalised patients is a common syndrome associated with poorer patient outcomes. Clinical risk scores can be used for the early identification of patients at risk of AKI. We conducted a retrospective study using electronic health records of Singapore General Hospital emergency department patients who were admitted from 2008 to 2016. The primary outcome was inpatient AKI of any stage within 7 days of admission based on the Kidney Disease Improving Global Outcome (KDIGO) 2012 guidelines. A machine learning-based framework AutoScore was used to generate clinical scores from the study sample which was randomly divided into training, validation and testing cohorts. Model performance was evaluated using area under the curve (AUC). Among the 119,468 admissions, 10,693 (9.0%) developed AKI. 8491 were stage 1 (79.4%), 906 stage 2 (8.5%) and 1296 stage 3 (12.1%). The AKI Risk Score (AKI-RiSc) was a summation of the integer scores of 6 variables: serum creatinine, serum bicarbonate, pulse, systolic blood pressure, diastolic blood pressure, and age. AUC of AKI-RiSc was 0.730 (95% CI 0.714–0.747), outperforming an existing AKI Prediction Score model which achieved AUC of 0.665 (95% CI 0.646–0.679) on the testing cohort. At a cut-off of 4 points, AKI-RiSc had a sensitivity of 82.6% and specificity of 46.7%. AKI-RiSc is a simple clinical score that can be easily implemented on the ground for early identification of AKI and potentially be applied in international settings.