pypmi: An API for the Parkinson’s Progression Markers Initiative (PPMI)

The PPMI is an ongoing longitudinal study that began in early 2010 with the primary goal of identifying biomarkers of Parkinson’s disease (PD) progression. To date, the PPMI has collected data from over 400 individuals with de novo PD and nearly 200 age-matched healthy participants, in addition to large cohorts of individuals genetically at risk for PD. Data, made available on the PPMI website, include comprehensive clinical-behavioral assessments, biological assays, single-photon emission computed tomography (SPECT) images, and magnetic resonance imaging (MRI) scans.

While accessing this data is straightforward (researchers must simply sign a data usage agreement and provide information on the purpose of their research), the sheer amount of data made available can be overwhelming to work with. Thus, the primary goal of this package is to provide a Python interface that makes working with the data provided by the PPMI easier.

Although this project is still under active development, it is nevertheless functional. Please note, however, that its functionality is liable to change dramatically until an initial release is made, so be careful! Check out our reference API for some of the current capabilities of pypmi while our user guide is under construction.

Usage

Getting the data

First things first: you need to get the data! Once you have access to the PPMI database, log in to the database and follow these instructions:

  1. Select Download from the navigation bar at the top
  2. Select Study Data from the options that appear in the navigation bar
  3. Select ALL at the bottom of the left-hand navigation bar on the new page
  4. Click Select ALL tabular data (csv) format and then press Download>> in the top right-hand corner of the page
  5. Unzip the downloaded directory and save it somewhere on your computer
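Step 5 can also be done from Python. The following is a minimal sketch that uses only the standard library; the function name unzip_studydata and the archive and directory names in the usage note are hypothetical placeholders, not names that the PPMI website or pypmi guarantee.

```python
import zipfile
from pathlib import Path


def unzip_studydata(archive, dest):
    """Extract a downloaded archive into `dest` and return the CSV paths."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    # Collect the extracted tabular files, wherever they sit in the archive
    return sorted(dest.rglob('*.csv'))
```

For instance, unzip_studydata('ppmi_download.zip', 'ppmi_data') would extract the (hypothetically named) archive into a ppmi_data directory and return the paths of the CSV files it contains.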

Alternatively, you can use the pypmi module to download the data programmatically:

>>> import pypmi
>>> files = pypmi.fetch_studydata('all', user='username', password='password')  
Fetching authentication key for data download...
Requesting 113 datasets for download...
Downloading PPMI data: 17.3MB [00:33, 519kB/s]

By default, the data will be downloaded to your current directory, making it easy to load them in the future, but you can optionally provide a path argument to pypmi.fetch_studydata() to specify where the data should go. (Alternatively, you can set an environment variable $PPMI_PATH to specify where they should be downloaded; this takes precedence over the current directory.)
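The lookup order described above can be pictured with a small stand-alone sketch. Note that resolve_ppmi_path is a hypothetical helper written for illustration, not a pypmi function, and the choice to let an explicit path win over $PPMI_PATH is an assumption; the text above only states that $PPMI_PATH takes precedence over the current directory.

```python
import os


def resolve_ppmi_path(path=None):
    """Pick a data directory: explicit `path`, then $PPMI_PATH, then cwd."""
    # Assumption: an explicitly supplied path overrides everything else
    if path is not None:
        return os.path.abspath(path)
    # $PPMI_PATH, when set, takes precedence over the current directory
    env_path = os.environ.get('PPMI_PATH')
    if env_path:
        return os.path.abspath(env_path)
    return os.getcwd()
```

With no argument and no $PPMI_PATH set, this sketch falls back to the current working directory, mirroring the default described above.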

Loading and working with the data

Once you have the data downloaded, you can use the pypmi.load_X() functions to load various portions of it into tidy data frames.

For example, we can generate a number of clinical-behavioral measures:

>>> behavior = pypmi.load_behavior()
>>> behavior.columns
Index(['participant', 'visit', 'date', 'benton', 'epworth', 'gds',
       'hvlt_recall', 'hvlt_recognition', 'hvlt_retention', 'lns', 'moca',
       'pigd', 'quip', 'rbd', 'scopa_aut', 'se_adl', 'semantic_fluency',
       'stai_state', 'stai_trait', 'symbol_digit', 'systolic_bp_drop',
       'tremor', 'updrs_i', 'updrs_ii', 'updrs_iii', 'updrs_iii_a', 'updrs_iv',
       'upsit'],
      dtype='object')

The call to pypmi.load_behavior() may take a few seconds to run—there’s a lot of data to import and wrangle!

If we want to query the data with regard to, say, subject diagnosis, it might be useful to load in some demographic information:

>>> demographics = pypmi.load_demographics()
>>> demographics.columns
Index(['participant', 'diagnosis', 'date_birth', 'date_diagnosis',
       'date_enroll', 'status', 'family_history', 'age', 'gender', 'race',
       'site', 'handedness', 'education'],
      dtype='object')

Now we can perform some interesting queries! As an example, let’s just ask how many individuals with Parkinson’s disease have a baseline UPDRS III score. We’ll have to use information from both data frames to answer the question:

>>> import pandas as pd
>>> updrs = (behavior.query('visit == "BL" & ~updrs_iii.isna()')
...                  .get(['participant', 'updrs_iii']))
>>> parkinsons = demographics.query('diagnosis == "pd"').get('participant')
>>> len(pd.merge(parkinsons, updrs, on='participant'))
423

And the same for healthy individuals:

>>> healthy = demographics.query('diagnosis == "hc"').get('participant')
>>> len(pd.merge(healthy, updrs, on='participant'))
195
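The same filter-and-merge pattern can be tried without any PPMI data. The sketch below builds two small frames that only mimic the column names shown above (every value is invented for illustration and none of it is PPMI data), then counts the participants in each diagnosis group who have a baseline UPDRS III score. It uses boolean masks and groupby rather than query(), which is a stylistic substitution and not how the examples above are written.

```python
import numpy as np
import pandas as pd

# Invented toy data that only mimics the column names used above
behavior = pd.DataFrame({
    'participant': [1, 1, 2, 3, 4],
    'visit': ['BL', 'V01', 'BL', 'BL', 'BL'],
    'updrs_iii': [20.0, 22.0, np.nan, 3.0, 18.0],
})
demographics = pd.DataFrame({
    'participant': [1, 2, 3, 4],
    'diagnosis': ['pd', 'pd', 'hc', 'pd'],
})

# Keep baseline visits that have a non-missing UPDRS III score
baseline = behavior[(behavior['visit'] == 'BL') & behavior['updrs_iii'].notna()]
updrs = baseline[['participant', 'updrs_iii']]

# Attach the diagnosis to each scored participant and count per group
merged = pd.merge(demographics[['participant', 'diagnosis']], updrs,
                  on='participant')
counts = merged.groupby('diagnosis')['participant'].nunique()
```

In this toy data, participant 2 has no baseline score and is dropped by the merge, so counts holds 2 for 'pd' and 1 for 'hc'.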

There’s a lot of power to be gained by leveraging pandas DataFrame objects, so take a look at the pandas documentation to see what more you can do!

Reference API

This is the primary reference for pypmi. Please refer to the user guide for more information on how best to use these functions in your own workflows.

pypmi - Dataset fetchers and loaders

Functions for listing and downloading datasets from the PPMI database:

fetchable_studydata()                       Lists study data available to download from the PPMI
fetchable_genetics(projects)                Lists genetics data available to download from the PPMI
fetch_studydata(*datasets, path, user, …)   Downloads specified study data datasets from the PPMI database
fetch_genetics(*datasets, path, user, …)    Downloads specified genetics data datasets from the PPMI database

Functions for loading data from the PPMI database into tidy dataframes:

load_behavior(path, measures)       Loads clinical-behavioral data into tidy dataframe
load_biospecimen(path, measures)    Loads biospecimen data into tidy dataframe
load_datscan(path, measures)        Loads DaT scan data into tidy dataframe
load_demographics(path, measures)   Loads demographic data into tidy dataframe

Functions for listing measures available from relevant pypmi.load_X() commands:

available_behavior(path)       Lists measures available in pypmi.load_behavior()
available_biospecimen(path)    Lists measures available in pypmi.load_biospecimen()
available_datscan(path)        Lists measures available in pypmi.load_datscan()
available_demographics(path)   Lists measures available in pypmi.load_demographics()