We are continually developing tools to enable wider and deeper clinical data exploration

We have implemented a set of best-of-breed data exploration and computational analysis tools, listed below. These are built on a Spark AI environment originally developed at UC Berkeley, which uses a “compute to the data” model for very large-scale data mining and pattern discovery, and on Presto, a high-performance distributed SQL query engine that originated at Facebook. Both operate on secure data hosted on premises or in the cloud. We plan to add more tools as we launch support for imaging exploration (a.k.a. Imaging Commons), molecular 'omics (a.k.a. Omics Commons), and new clinical text analysis tools as they become available.

User-Friendly Data Science Tools (no technical training required)

  • PatientExploreR
    User-friendly tool for patient search and cohort selection from de-identified UCSF EHR data. Users can select and visualize patient cohorts and explore individual patient data through interactive dynamic reports, accessible via an easy-to-use web interface. Developed by the Butte Lab and BCHSI.
  • cTAKES As-A-Service
    Natural language processing system for extracting information from clinical free text in medical records. Originally developed by Mayo Clinic, it is now an Apache open-source project. Our version lets UCSF users extract medical concepts such as diagnoses, medications, and procedures from their own sets of clinical documents.
  • EMERSE at UCSF (Electronic Medical Record Search Engine)
    Enables users to search UCSF machine-redacted clinical notes through a user-friendly interface. This software was developed at the University of Michigan, and implemented here in partnership with CTSI.
  • UCSFPhilter
    Open-source software for user-friendly de-identification of clinical text. Developed by the Butte Lab and BCHSI.

Tools for Hands-On Data Scientists

The Information Commons (IC) Shared Cluster is our cloud-based computational environment. It is an AWS EMR cluster running Spark, an open-source distributed computing framework for big data processing. The IC Shared Cluster features the following data science tools:

  • New! HUE
    A user-friendly web-based application for visually constructing and running SQL queries against any data hosted on the Information Commons.
  • Spark, SparkML, PySpark, SparkR, SparklyR, SparkSQL
    Spark-based versions of popular language and AI tools that support distributed computing.
  • JupyterHub
    A multi-user version of Jupyter Notebook, an open-source platform for developing and sharing documents that combine live code, data output, equations, visualizations, and narrative text. IC JupyterHub provides shared cluster users with a pre-configured data science environment.