Getting Started with Orca Datasets
This guide will help you quickly access and start working with datasets from the Orca platform.
Overview
Orca datasets are stored in G-Trac and processed through Dagster pipelines. Each dataset goes through standardized cleaning, transformation, and validation steps before being made available for analysis.
Available Datasets
GI-DAMPs
- Study Name: Investigation into the inflammatory mechanism of gut damage-associated molecular patterns in Inflammatory Bowel Disease
- Study Type: Cross-sectional IBD sampling study (with optional longitudinal sampling)
- Data Structure: Sampling visits with optional repeated measures
- Key Features: Comprehensive biomarker data, detailed medication tracking
- More Info: GI-DAMPs Overview
MUSIC
- Study Name: Mitochondrial DAMPs as mechanistic biomarkers of gut mucosal inflammation
- Study Type: Longitudinal adult IBD cohort study
- Data Structure: Fixed timepoints (timepoint_1 through timepoint_5)
- Key Features: Mucosal healing outcomes, longitudinal follow-up
- More Info: MUSIC Overview
Mini-MUSIC
- Study Name: Mitochondrial DAMPs as mechanistic biomarkers of gut mucosal inflammation in children
- Study Type: Paediatric IBD cohort study
- Data Structure: Similar to MUSIC, but with paediatric-specific assessments (PUCAI, PCDAI, Paris classification)
- Key Features: Age-appropriate disease activity scores, exclusive enteral nutrition (EEN) tracking
- More Info: Mini-MUSIC Overview
Accessing Datasets
Step 1: Understand the Data Structure
Before accessing data, familiarize yourself with:
- Unified Data Dictionary - Common variables across all datasets
- Study-Specific Columns - Variables unique to each study
- Dataset Governance - Access policies and requirements
Step 2: Load and Explore Data
Datasets are typically provided as CSV files. Request and download them through the access workflow below before loading them for analysis.
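Once a file has been downloaded, a first look at it might be sketched as follows. This assumes pandas is installed. The inline sample stands in for a real file, and its rows are made-up illustrative values rather than study data.

```python
import io

import pandas as pd

# Made-up rows standing in for a downloaded CSV. With a real file, pass
# its path instead, e.g. pd.read_csv("music_main_2025-01-15.csv").
sample_csv = io.StringIO(
    "study_id,study_group,age,crp\n"
    "MID-001,cd,34,12.5\n"
    "MID-002,uc,51,\n"
    "MID-003,hc,29,1.0\n"
)

df = pd.read_csv(sample_csv)

# First look: dimensions, column types, and group sizes.
print(df.shape)
print(df.dtypes)
print(df["study_group"].value_counts())
```

Checking `df.dtypes` early helps catch columns that were read as text when numbers were expected.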
Access Workflow
- Review the Dataset Governance policy to confirm you meet any prerequisite training and to identify the correct data steward for the study.
- Email the governance lead and the relevant steward (see the contacts listed in the policy) with a short summary of your project, the datasets you need, and the timeframe for access.
- The governance lead records the request, coordinates steward approval, and replies with confirmation plus the G-Trac folder path for the authorised datasets. Orca datasets are stored within the Orca workspace on G-Trac, and each CSV filename includes the extraction date (for example, music_main_2025-01-15.csv).
- Sign in to G-Trac with your institutional credentials to download the most recent dated files. When pipelines publish a refresh, the governance lead shares release notes by email and keeps the folder-level README up to date, so monitor those messages for cadence updates.
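Because filenames carry the extraction date, the most recent file in a folder can be picked programmatically. The sketch below assumes every dataset file follows the `<name>_YYYY-MM-DD.csv` pattern seen in the single example above; the folder listing and the helper names `extraction_date` and `latest_file` are hypothetical.

```python
import re
from datetime import date

# Hypothetical folder listing; real names come from the G-Trac folder.
filenames = [
    "music_main_2024-11-02.csv",
    "music_main_2025-01-15.csv",
    "README.md",
]

# Assumed naming pattern: anything ending in _YYYY-MM-DD.csv
DATE_PATTERN = re.compile(r"_(\d{4}-\d{2}-\d{2})\.csv$")


def extraction_date(filename):
    """Return the extraction date embedded in a filename, or None."""
    match = DATE_PATTERN.search(filename)
    return date.fromisoformat(match.group(1)) if match else None


def latest_file(names):
    """Return the name with the most recent extraction date, or None."""
    dated = [(extraction_date(n), n) for n in names]
    dated = [(d, n) for d, n in dated if d is not None]
    return max(dated)[1] if dated else None


print(latest_file(filenames))
```

Files without a recognisable date, such as the README, are ignored rather than treated as errors.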
Understanding Variable Names
All variables follow the snake_case naming convention. Key patterns:
Demographics
- study_id: Unique participant identifier (prefix: GID-, MID-, MINI-)
- study_group: Disease classification (cd, uc, ibdu, non_ibd, hc)
- age, sex, height, weight, bmi
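The identifier prefixes and group codes listed here lend themselves to a quick sanity check. The following is a minimal sketch: the function name `check_participant` is hypothetical, and only the prefix is checked because the rest of the identifier format is not described in this guide.

```python
# Prefixes and group codes as listed in this guide.
VALID_PREFIXES = ("GID-", "MID-", "MINI-")
VALID_GROUPS = {"cd", "uc", "ibdu", "non_ibd", "hc"}


def check_participant(study_id, study_group):
    """Return a list of problems found in one participant's identifiers."""
    problems = []
    # str.startswith accepts a tuple and matches any of its entries.
    if not study_id.startswith(VALID_PREFIXES):
        problems.append(f"unexpected study_id prefix: {study_id!r}")
    if study_group not in VALID_GROUPS:
        problems.append(f"unexpected study_group: {study_group!r}")
    return problems


print(check_participant("GID-001", "cd"))
print(check_participant("XYZ-001", "crohns"))
```

An empty list means both values matched the documented patterns.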
Laboratory Values
- Prefix pattern: Variable name indicates the measurement
- nhs_bloods_date: Date of blood sample
- haemoglobin, crp, albumin: Test results
- calprotectin_date, calprotectin: Faecal calprotectin
Medications
- sampling_*: Medications at time of sampling (1 = yes, 0 = no)
  - Examples: sampling_asa, sampling_ifx, sampling_ada
- baseline_*: Historical medication exposure
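The shared prefix makes it easy to select all medication flags at once. The sketch below assumes pandas is installed and uses made-up rows; only the three column names come from the examples above.

```python
import io

import pandas as pd

# Made-up rows for illustration only.
sample_csv = io.StringIO(
    "study_id,sampling_asa,sampling_ifx,sampling_ada\n"
    "GID-001,1,0,0\n"
    "GID-002,0,1,0\n"
    "GID-003,1,0,1\n"
)
df = pd.read_csv(sample_csv)

# Select every medication-at-sampling flag by its shared prefix.
sampling_cols = [c for c in df.columns if c.startswith("sampling_")]

# Flags are coded 1 = yes / 0 = no, so column sums count participants
# per medication and row sums count medications per participant.
n_per_medication = df[sampling_cols].sum()
n_per_participant = df[sampling_cols].sum(axis=1)

print(n_per_medication)
print(n_per_participant.tolist())
```

The same prefix filter works for the baseline_* columns by changing the string passed to `startswith`.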
Disease Activity Scores
- hbi_total: Harvey-Bradshaw Index total score
- sccai_total: Simple Clinical Colitis Activity Index
- mayo_total: Mayo Score
- sescd, uceis: Endoscopic scores
Common Workflows
Comparing Across Studies
When working with multiple datasets, focus on variables documented in the Unified Data Dictionary, which covers the variables common to all datasets.
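One way to combine two studies is to keep only the columns they share and stack the rows. This is a sketch under assumptions: pandas is installed, the two frames hold made-up values, and the `source` column is a hypothetical label added here to record where each row came from.

```python
import pandas as pd

# Made-up frames standing in for two loaded datasets.
gidamps = pd.DataFrame(
    {"study_id": ["GID-001", "GID-002"], "study_group": ["cd", "hc"], "crp": [8.0, 1.2]}
)
music = pd.DataFrame(
    {"study_id": ["MID-001"], "study_group": ["uc"], "crp": [15.0], "mayo_total": [6]}
)

# Keep only columns present in both frames, preserving the first frame's order.
shared = [c for c in gidamps.columns if c in music.columns]

combined = pd.concat(
    [
        gidamps[shared].assign(source="gidamps"),
        music[shared].assign(source="music"),
    ],
    ignore_index=True,
)

print(combined)
```

A shared column name is only a starting point. Confirm in the Unified Data Dictionary that the variable is defined the same way in both studies before pooling it.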
Handling Missing Data
- Missing values are typically represented as NaN or empty strings
- Some categorical variables use specific codes for missing/unknown (see data dictionary)
- Review Known Issues for dataset-specific considerations
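The points above can be handled at load time. The sketch below assumes pandas is installed; the `smoking_status` column and the code `unknown` are hypothetical stand-ins for whatever the data dictionary specifies, and the rows are made up.

```python
import io

import pandas as pd

# Made-up rows: one empty crp, one literal "NaN", one coded "unknown".
sample_csv = io.StringIO(
    "study_id,crp,smoking_status\n"
    "GID-001,4.2,never\n"
    "GID-002,,unknown\n"
    "GID-003,NaN,\n"
)

# read_csv already treats empty fields and the literal "NaN" as missing;
# na_values adds further placeholder codes on top of those defaults.
df = pd.read_csv(sample_csv, na_values=["unknown"])

# Count missing entries per column before deciding how to handle them.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Passing a list to `na_values` applies the extra codes to every column, so use it only for codes that never carry a real meaning elsewhere in the file.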
Date Variables
Date columns use the YYYY-MM-DD (ISO 8601) format.
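Parsing date columns explicitly makes date arithmetic possible. The following sketch assumes pandas is installed and uses made-up rows; the two date column names come from the variable list earlier in this guide, while `days_bloods_to_calpro` is a hypothetical derived column.

```python
import io

import pandas as pd

# Made-up rows; the second participant has no calprotectin date.
sample_csv = io.StringIO(
    "study_id,nhs_bloods_date,calprotectin_date\n"
    "MID-001,2024-03-01,2024-03-08\n"
    "MID-002,2024-05-20,\n"
)
df = pd.read_csv(sample_csv)

date_cols = ["nhs_bloods_date", "calprotectin_date"]
for col in date_cols:
    # errors="coerce" turns missing or unparseable entries into NaT.
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d", errors="coerce")

# Days between the blood sample and the calprotectin sample.
df["days_bloods_to_calpro"] = (
    df["calprotectin_date"] - df["nhs_bloods_date"]
).dt.days

print(df)
```

Note that `errors="coerce"` hides malformed dates by converting them to NaT, so compare missing counts before and after parsing if you need to detect them.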
Important Considerations
Study-Specific Differences
Disease Activity Classifications: Vary between studies (see Known Issues)
Timepoints:
- GI-DAMPs: Sampling visits (no fixed schedule)
- MUSIC: Fixed timepoints (1-5)
- Mini-MUSIC: Fixed timepoints (1-3)
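The fixed ranges above can be used to flag unexpected timepoint labels. This is a sketch only: it assumes Mini-MUSIC uses the same `timepoint_N` label format described for MUSIC, and the helper names are hypothetical.

```python
import re

# Highest fixed timepoint per study, as listed above. GI-DAMPs is omitted
# because its sampling visits follow no fixed schedule.
MAX_TIMEPOINT = {"MUSIC": 5, "Mini-MUSIC": 3}

LABEL = re.compile(r"timepoint_(\d+)")


def timepoint_number(label):
    """Return the integer in a 'timepoint_N' label, or None if malformed."""
    match = LABEL.fullmatch(label)
    return int(match.group(1)) if match else None


def is_valid_timepoint(study, label):
    """True if the label is well formed and within the study's range."""
    number = timepoint_number(label)
    return number is not None and 1 <= number <= MAX_TIMEPOINT[study]


print(is_valid_timepoint("MUSIC", "timepoint_5"))
print(is_valid_timepoint("Mini-MUSIC", "timepoint_4"))
```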
Scoring Systems:
- Adults: HBI, SCCAI, Mayo
- Paediatrics: PCDAI, PUCAI, Paris classification
Data Quality
- Always review Known Issues before analysis
- Check for data version/date stamps in filenames
- Verify variable meanings in the data dictionary
- Contact data stewards for clarifications
Next Steps
- Explore Dataset Overviews - Detailed information about each dataset
- Review Data Dictionary - Complete variable reference
- Check Known Issues - Important limitations and considerations
- Understand Pipelines - How data is transformed
Getting Help
- Data questions: Contact study data stewards (see Dataset Governance)
- Technical issues: shaun.chuah@glasgow.ac.uk
- Documentation updates: Submit issues or pull requests