Featured Research

See what investigators can do within Information Commons


De-identifying Clinical Notes at Scale

This project provides an automated clinical text de-identification pipeline that protects patient information in clinical notes and reports. It is one of the few professionally certified de-identification tools for unstructured data available today, and it is being adopted by other medical institutions, including UC Irvine Health and UC Davis Health.

This project also won the prestigious 2022 Larry L. Sautter Golden Award for Innovation in Information Technology.

Read More

Lakshmi Radhakrishnan
Gundolf Schenk
Kathleen Muenzen
Boris Oskotsky
Sharat Israni
Atul Butte


Similarities and differences in Alzheimer's dementia comorbidities in racialized populations identified from electronic medical records

(Published in Communications Medicine)  

Black- and Latine-identified individuals in the United States are more likely to have Alzheimer’s dementia (AD) than Asian- and White-identified individuals. Despite this, Black- and Latine-identified individuals are less likely to be included in studies that attempt to understand and treat AD.

Patients’ medical information, electronically recorded by healthcare providers, was used to explore whether patients with AD were more likely than patients without AD to have particular conditions. We did this analysis separately for Asian-, Non-Latine Black-, Latine-, and Non-Latine White-identified individuals, for a total of four analyses. While we found many conditions that were shared by all individuals, a few, such as lung-related diseases, may be more common in specific identified race and ethnicity categories.

Read More

Sarah R. Woldemariam
Alice S. Tang
Tomiko T. Oskotsky
Kristine Yaffe
Marina Sirota


Development and Internal Validation of an Interpretable Machine Learning Model to Predict Readmissions in a United States Healthcare System

We developed and internally validated a supervised ML model to predict 30-day readmissions in a US-based healthcare system. The final model selected was XGBoost, which had an area under the receiver operating characteristic curve of 0.783 and an area under the precision-recall curve of 0.434. This model has several advantages including state-of-the-art performance metrics, the use of clinical data, the use of features available within 24 h of discharge, and generalizability to multiple disease states.
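
The two metrics quoted above can be computed from nothing more than a list of true labels and predicted risk scores. The sketch below is illustrative only: the function names and the toy data are assumptions, not part of the published model, and average precision is used as one common step-wise estimate of the area under the precision-recall curve.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the pairwise-ranking (Mann-Whitney) formulation."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count positive/negative pairs that are ranked correctly; ties count one half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision values at each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / true_positives


# Hypothetical toy data: 1 = readmitted within 30 days, 0 = not readmitted.
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

print(round(auroc(toy_labels, toy_scores), 3))              # 0.733
print(round(average_precision(toy_labels, toy_scores), 3))  # 0.722
```

A random classifier scores about 0.5 on the first metric, while its expected value on the second is roughly the outcome prevalence, so a precision-recall figure is best read against the readmission rate in the cohort.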

This research was done in collaboration with data science master's students from the University of San Francisco. The work was done entirely with the data assets, software resources, computational resources, and environments within the IC app server and the Wynton PHI app server.

Read More

Xinran “Leo” Liu
Shan Wang
Amanda Luo
Akshay Ravi
Simone Arvisais-Anhalt
Anoop Muniyappa


Data-driven longitudinal characterization of neonatal health and morbidity

(Published in Science Translational Medicine)

Although prematurity is the single largest cause of death in children under 5 years of age, the current definition of prematurity, based on gestational age, lacks the precision needed for guiding care decisions. Here, we propose a longitudinal risk assessment for adverse neonatal outcomes in newborns based on a deep learning model that uses electronic health records (EHRs) to predict a wide range of outcomes over a period starting shortly before conception and ending months after birth. By linking the EHRs of the Lucile Packard Children’s Hospital and the Stanford Healthcare Adult Hospital, we developed a cohort of 22,104 mother-newborn dyads delivered between 2014 and 2018. Maternal and newborn EHRs were extracted and used to train a multi-input multitask deep learning model, featuring a long short-term memory neural network, to predict 24 different neonatal outcomes. An additional cohort of 10,250 mother-newborn dyads delivered at the same Stanford Hospitals from 2019 to September 2020 was used to validate the model. Areas under the receiver operating characteristic curve at delivery exceeded 0.9 for 10 of the 24 neonatal outcomes considered and were between 0.8 and 0.9 for 7 additional outcomes. Moreover, comprehensive association analysis identified multiple known associations between various maternal and neonatal features and specific neonatal outcomes. This study used linked EHRs from more than 30,000 mother-newborn dyads and would serve as a resource for the investigation and prediction of neonatal outcomes. An interactive website is available for independent investigators to leverage this unique dataset: https://maternal-child-health-associations.shinyapps.io/shiny_app/.

Read More

De Francesco D
Reiss JD
Roger J
Tang AS
Chang AL
Becker M
Phongpreecha T
Espinosa C
Morin S
Berson E
Thuraiappah M
Ravindra NG
Payrovnaziri SN
Mataraso S
Kim Y
Xue L
Rosenstein MG
Oskotsky T
Marić I
Gaudilliere B
Carvalho B
Bateman BT
Angst MS
Prince LS
Blumenfeld YJ
Benitz WE
Fuerch JH
Shaw GM
Sylvester KG
Stevenson DK
Sirota M
Aghaeepour N

Deep phenotyping of Alzheimer's Disease Leveraging Electronic Medical Records Data Identifies Sex-Specific Clinical Associations

(Published in Nature Communications)

Alzheimer’s Disease (AD) is a neurodegenerative disorder that is still not fully understood. Sex modifies AD vulnerability, but the reasons for this are largely unknown. We utilize two independent electronic medical record (EMR) systems across 44,288 patients to perform deep clinical phenotyping and network analysis to gain insight into clinical characteristics and sex-specific clinical associations in AD. Embeddings and network representation of patient diagnoses demonstrate greater comorbidity interactions in AD in comparison to matched controls. Enrichment analysis identifies multiple known and new diagnostic, medication, and lab result associations across the whole cohort and in a sex-stratified analysis. With this data-driven method of phenotyping, we can represent AD complexity and generate hypotheses of clinical factors that can be followed-up for further diagnostic and predictive analyses, mechanistic understanding, or drug repurposing and therapeutic approaches.
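
As a rough illustration of what an enrichment analysis asks, the sketch below computes a one-sided hypergeometric tail probability: how surprising is it to see a diagnosis this often among AD patients if it were spread evenly across the whole cohort? The abstract does not name the statistical test used, so the choice of test, the function name, and the counts are all assumptions made for illustration.

```python
from math import comb


def enrichment_p_value(cases_with_dx, n_cases, total_with_dx, n_total):
    """One-sided hypergeometric tail: P(X >= cases_with_dx).

    n_total patients, of whom total_with_dx carry the diagnosis;
    n_cases of the patients are cases (e.g., AD patients).
    """
    upper = min(total_with_dx, n_cases)
    denominator = comb(n_total, n_cases)
    numerator = sum(
        comb(total_with_dx, i) * comb(n_total - total_with_dx, n_cases - i)
        for i in range(cases_with_dx, upper + 1)
    )
    return numerator / denominator


# Hypothetical counts, not taken from the study:
# 10 patients, 4 with the diagnosis, 5 cases, and all 4 diagnosed patients are cases.
p = enrichment_p_value(cases_with_dx=4, n_cases=5, total_with_dx=4, n_total=10)
print(round(p, 4))  # 0.0238
```

In a real analysis this would be repeated for every diagnosis, medication, and lab result, followed by a multiple-testing correction.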

Read More

Tang AS
Oskotsky T
Havaldar S
Mantyh W
Bicak M
Warly-Solsberg C
Woldemariam S
Zeng B
Hu Z
Oskotsky B
Dubal D
Allen IE
Glicksberg B
Sirota M 


Early detection of Parkinson’s disease through enriching the electronic health record using a biomedical knowledge graph

(Accepted for publication in the journal Frontiers in Medicine) 

This study used the DEID OMOP EHR database.

Early diagnosis of Parkinson’s disease (PD) is important for identifying treatments to slow neurodegeneration. People who develop PD often have symptoms before the disease manifests, and these symptoms may be coded as diagnoses in the electronic health record (EHR). To predict PD diagnosis, we embedded patients’ EHR data onto a biomedical knowledge graph called the Scalable Precision medicine Open Knowledge Engine (SPOKE) and created patient embedding vectors. We trained and validated a classifier using these vectors from 3,004 PD patients, restricting records to 1, 3, and 5 years before diagnosis, and from a non-PD group of 457,197 patients. The classifier predicted PD diagnosis with moderate accuracy (AUC = 0.77 ± 0.06, 0.74 ± 0.05, 0.72 ± 0.05 at 1, 3, and 5 years) and performed better than other benchmark methods. Nodes in the SPOKE graph, among cases, revealed novel associations, while SPOKE patient vectors revealed the basis for individual risk classification. The proposed method was able to explain the clinical predictions using the knowledge graph, thereby making the predictions clinically interpretable. Through enriching EHR data with biomedical associations, SPOKE may be a cost-efficient and personalized way to predict PD diagnosis years before its occurrence.

Read More - Paper  Knowledge Graph

First Author:
Karthik Soman

Middle Authors:
Charlotte A. Nelson
Gabriel Cerono
Samuel M. Goldman
Sergio E. Baranzini

Corresponding Author:
Ethan G. Brown


Leveraging electronic health records to identify risk factors for recurrent pregnancy loss across two medical centers: a case-control study

(Preprint, under review in Nature Communications)

Recurrent pregnancy loss (RPL), defined as 2 or more pregnancy losses, affects 5-6% of ever-pregnant individuals. Approximately half of these cases have no identifiable explanation. To generate hypotheses about RPL etiologies, we implemented a case-control study comparing the history of over 1,600 diagnoses between RPL and live-birth patients, leveraging the University of California San Francisco (UCSF) and Stanford University electronic health record databases. In total, our study included 8,496 RPL (UCSF: 3,840, Stanford: 4,656) and 53,278 Control (UCSF: 17,259, Stanford: 36,019) patients. Menstrual abnormalities and infertility-associated diagnoses were significantly positively associated with RPL in both medical centers. Age-stratified analysis revealed that the majority of RPL-associated diagnoses had higher odds ratios for patients <35 compared with 35+ patients. While Stanford results were sensitive to control for healthcare utilization, UCSF results were stable across analyses with and without utilization. Intersecting significant results between medical centers was an effective filter to identify associations that are robust across center-specific utilization patterns.
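
For readers unfamiliar with case-control comparisons, the sketch below shows how an odds ratio and a Wald 95% confidence interval are obtained from a single 2x2 table of diagnosis by case status. It is an unadjusted illustration: the function name and the counts are hypothetical, and the study's own models (for example, those controlling for healthcare utilization) are not reproduced here.

```python
from math import exp, log, sqrt


def odds_ratio_ci(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table (all cells must be > 0)."""
    odds_ratio = (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)
    # Standard error of the log odds ratio.
    se_log_or = sqrt(
        1 / exposed_cases + 1 / unexposed_cases + 1 / exposed_controls + 1 / unexposed_controls
    )
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, not taken from the study:
# 40 of 200 cases and 50 of 800 controls carry the diagnosis.
or_value, ci_low, ci_high = odds_ratio_ci(40, 160, 50, 750)
print(or_value)  # 3.75
```

An interval that excludes 1 suggests an association. Repeating the calculation per diagnosis and per center, then keeping only the diagnoses significant in both, mirrors the intersection filter described above.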

Read More

Jacquelyn Roger
Feng Xie
Jean Costello
Alice Tang
Jay Liu
Tomiko Oskotsky
Sarah Woldemariam
Idit Kosti
Brian Le
Michael Snyder
Dara Torgerson
Gary Shaw
David Stevenson
Aleksandar Rajkovic
M. Glymour
Nima Aghaeepour
Hakan Cakmak
Ruth Lathi
Marina Sirota 


Leveraging Electronic Medical Records and Knowledge Networks to Predict Disease Onset and Gain Biological Insight Into Alzheimer's Disease

(Preprint, revision in Nature Aging)

Early identification of Alzheimer’s Disease (AD) risk can aid in interventions before disease progression. We demonstrate that electronic health records (EHRs) combined with heterogeneous knowledge networks (e.g., SPOKE) allow for (1) prediction of AD onset and (2) generation of biological hypotheses linking phenotypes with AD. We trained random forest models that predict AD onset with mean AUROC of 0.72 (-7 years) to 0.81 (-1 day). Top identified conditions from matched cohort trained models include phenotypes with importance across time, early in time, or closer to AD onset. SPOKE networks highlight shared genes between top predictors and AD (e.g., APOE, IL6, TNF, and INS). Survival analysis of top predictors (hyperlipidemia and osteoporosis) in external EHRs validates an increased risk of AD. Genetic colocalization confirms hyperlipidemia and AD association at the APOE locus, and AD with osteoporosis colocalize at a locus close to MS4A6A with a stronger female association.

Read More

Alice Tang
Katherine P. Rankin
Gabriel Cerono
Silvia Miramontes
Hunter Mills
Jacquelyn Roger
Billy Zeng
Charlotte Nelson
Karthik Soman
Sarah Woldemariam
Yaqiao Li
Albert Lee
Riley Bove
Maria Glymour
Tomiko Oskotsky
Zachary Miller
Isabel Allen
Stephan J. Sanders
Sergio Baranzini
Marina Sirota


Applying a computational transcriptomics-based drug repositioning pipeline to identify therapeutic candidates for endometriosis

(Preprint, under review in Nature Women’s Health)

Endometriosis is a common, inflammatory pain disorder comprised of disease in the pelvis and abnormal uterine lining and ovarian function that affects ∼200 million women of reproductive age worldwide and up to 50% of those with pelvic pain and/or infertility. Existing medical treatments for endometriosis-related pain are often ineffective, with individuals experiencing minimal or transient pain relief or intolerable side effects limiting long-term use - thus underscoring the pressing need for new drug treatment strategies. In this study, we applied a computational drug repurposing pipeline to endometrial gene expression data in the setting of endometriosis and controls in an unstratified manner as well as stratified by disease stage and menstrual cycle phase in order to identify potential therapeutics from existing drugs, based on expression reversal. Out of the 3,131 unique genes differentially expressed by at least one of six endometriosis signatures, only 308, or 9.8%, were in common. Similarities were more pronounced when looking at therapeutic predictions: 221 out of 299 drugs identified across the six signatures, or 73.9%, were shared, and the majority of predicted compounds were concordant across disease stage-stratified and cycle phase-stratified signatures. Our pipeline returned many known treatments as well as novel candidates. We selected the NSAID fenoprofen, the top therapeutic candidate for the unstratified signature and among the top-ranked drugs for the stratified signatures, for further investigation. Our drug target network analysis shows that fenoprofen targets PPARG and PPARA which affect the growth of endometrial tissue, as well as PTGS2 (i.e., COX2), an enzyme induced by inflammation with significantly increased gene expression demonstrated in patients with endometriosis who experience severe dysmenorrhea. NSAIDs are widely prescribed for endometriosis-related dysmenorrhea and nonmenstrual pelvic pain. 
Our analysis of clinical records across University of California healthcare systems revealed that while NSAIDs have been commonly prescribed to the 61,306 patients identified with diagnoses of endometriosis, dysmenorrhea, or chronic pelvic pain (36,543, 59.61%), fenoprofen was infrequently prescribed to those with these conditions (5, 0.008%). We tested the effect of fenoprofen in an established rat model of endometriosis and determined that it successfully alleviated endometriosis-associated vaginal hyperalgesia, a surrogate marker for endometriosis-related pain. These findings validate fenoprofen as a potential endometriosis therapeutic and suggest the utility of future investigation into additional drug targets identified.
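
The idea of ranking drugs by expression reversal can be sketched in a few lines: a candidate scores well when the genes that are up in disease go down under the drug, and vice versa. The abstract does not specify the scoring method, so the rank correlation used below is an assumed stand-in, and the gene names, values, and function names are hypothetical placeholders.

```python
def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def reversal_score(disease_signature, drug_profile):
    """Spearman correlation over shared genes; values near -1 indicate strong reversal."""
    genes = sorted(set(disease_signature) & set(drug_profile))
    disease = [disease_signature[g] for g in genes]
    drug = [drug_profile[g] for g in genes]
    return pearson(ranks(disease), ranks(drug))


# Hypothetical log fold changes (disease vs. control, and drug vs. vehicle).
disease_signature = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": -1.5, "GENE_D": -0.5}
reversing_drug = {"GENE_A": -1.8, "GENE_B": -0.7, "GENE_C": 1.2, "GENE_D": 0.3}
mimicking_drug = {"GENE_A": 1.0, "GENE_B": 0.5, "GENE_C": -2.0, "GENE_D": -0.1}

print(round(reversal_score(disease_signature, reversing_drug), 3))  # -1.0
print(round(reversal_score(disease_signature, mimicking_drug), 3))  # 1.0
```

Scoring every drug profile against each of the six signatures and sorting from most negative upward would give per-signature candidate rankings that can then be compared for overlap.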

Read More

Tomiko T Oskotsky
Arohee Bhoja
Daniel Bunis
Brian L Le
Idit Kosti
Christine Li
Sahar Houshdaran
Sushmita Sen
Júlia Vallvé-Juanico
Wanxin Wang
Erin Arthurs
Lauren Mahoney
Lindsey Lang
Brice Gaudilliere
David K Stevenson
Juan C Irwin
Linda C Giudice
Stacy McAllister
Marina Sirota 


Automatic Hip Fracture Identification and Functional Classification with Deep Learning

This research aims to automatically identify hip fractures from hip radiographs using neural networks in order to prioritize radiologists’ reading queues and to provide early notification of important injuries to emergency room and orthopedic physicians.

Read More


Enhancing the Brain Tumor Center Database

The Brain Tumor Center Database is augmented by extracting key attributes (e.g., diagnosis grade and type, genetic markers, and resection information) from the Electronic Medical Records system. These attributes exist only as elements in unstructured notes and are currently accessible only through painstaking manual review of patient records. The Information Commons provides a workflow to extract this data en masse using advanced Natural Language Processing techniques that ensure a high level of accuracy. This will greatly reduce the time it takes to extract this data from the clinical record and provide access to much more data than was previously accessible.

Read More