Getting Started with Orca Datasets

This guide will help you quickly access and start working with datasets from the Orca platform.

Overview

Orca datasets are stored in G-Trac and processed through Dagster pipelines. Each dataset goes through standardized cleaning, transformation, and validation steps before being made available for analysis.

Available Datasets

GI-DAMPs

  • Study Name: Investigation into the inflammatory mechanism of gut damage-associated molecular patterns in Inflammatory Bowel Disease
  • Study Type: Cross-sectional IBD sampling study (with optional longitudinal sampling)
  • Data Structure: Sampling visits with optional repeated measures
  • Key Features: Comprehensive biomarker data, detailed medication tracking
  • More Info: GI-DAMPs Overview

MUSIC

  • Study Name: Mitochondrial DAMPs as mechanistic biomarkers of gut mucosal inflammation
  • Study Type: Longitudinal adult IBD cohort study
  • Data Structure: Fixed timepoints (timepoint_1 through timepoint_5)
  • Key Features: Mucosal healing outcomes, longitudinal follow-up
  • More Info: MUSIC Overview

Mini-MUSIC

  • Study Name: Mitochondrial DAMPs as mechanistic biomarkers of gut mucosal inflammation in children
  • Study Type: Paediatric IBD cohort study
  • Data Structure: Similar to MUSIC, but with paediatric-specific assessments (PUCAI, PCDAI, Paris classification)
  • Key Features: Age-appropriate disease activity scores, exclusive enteral nutrition (EEN) tracking
  • More Info: Mini-MUSIC Overview

Accessing Datasets

Step 1: Understand the Data Structure

Before accessing data, familiarize yourself with:

  1. Unified Data Dictionary - Common variables across all datasets
  2. Study-Specific Columns - Variables unique to each study
  3. Dataset Governance - Access policies and requirements

Step 2: Load and Explore Data

Datasets are typically provided as CSV files. Here's a basic workflow:

Python
import pandas as pd

# Load a dataset
df = pd.read_csv('gidamps_sampling_2025-10-30.csv')

# Check basic structure
print(f"Shape: {df.shape}")
print(f"Columns: {df.columns.tolist()[:10]}...")  # First 10 columns

# Explore common variables
print(df[['study_id', 'study_group', 'age', 'sex', 'crp', 'calprotectin']].head())

Access Workflow

  1. Review the Dataset Governance policy to confirm you meet any prerequisite training and to identify the correct data steward for the study.
  2. Email the governance lead and the relevant steward (see the contacts listed in the policy) with a short summary of your project, the datasets you need, and the timeframe for access.
  3. The governance lead records the request, coordinates steward approval, and replies with confirmation plus the G-Trac folder path for the authorised datasets. Orca datasets are stored within the Orca workspace on G-Trac and each CSV filename includes the extraction date (for example, music_main_2025-01-15.csv).
  4. Sign in to G-Trac with your institutional credentials to download the most recent dated files. When pipelines publish a refresh, the governance lead shares release notes by email and keeps the folder-level README up to date, so monitor those messages for cadence updates.
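Because each filename embeds its extraction date, the newest extract can be selected programmatically rather than by eye. A minimal sketch (the directory listing below is illustrative, but the `<prefix>_YYYY-MM-DD.csv` naming follows the example above):

```python
import re
from datetime import date

def latest_extract(filenames, prefix):
    """Return the most recently dated CSV for a dataset prefix.

    Assumes names follow <prefix>_YYYY-MM-DD.csv, as in
    music_main_2025-01-15.csv.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d{{4}})-(\d{{2}})-(\d{{2}})\.csv$")
    dated = []
    for name in filenames:
        m = pattern.match(name)
        if m:
            dated.append((date(*map(int, m.groups())), name))
    if not dated:
        raise ValueError(f"no files matching prefix {prefix!r}")
    return max(dated)[1]

# Hypothetical folder contents
files = ["music_main_2024-11-02.csv", "music_main_2025-01-15.csv",
         "gidamps_sampling_2025-10-30.csv"]
print(latest_extract(files, "music_main"))  # music_main_2025-01-15.csv
```

Pairing this with the release-note emails helps confirm you are analysing the current refresh.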

Understanding Variable Names

All variables follow the snake_case naming convention. Key patterns:

Demographics

  • study_id: Unique participant identifier (prefix: GID-, MID-, MINI-)
  • study_group: Disease classification (cd, uc, ibdu, non_ibd, hc)
  • age, sex, height, weight, bmi
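When a frame mixes records from several extracts, the study_id prefix identifies the source study. A small helper built from the prefix list above:

```python
# Prefix-to-study mapping taken from the demographics notes above
PREFIX_TO_STUDY = {"GID": "GI-DAMPs", "MID": "MUSIC", "MINI": "Mini-MUSIC"}

def study_from_id(study_id):
    """Infer the source study from an identifier such as 'GID-123'."""
    prefix = study_id.split("-", 1)[0]
    try:
        return PREFIX_TO_STUDY[prefix]
    except KeyError:
        raise ValueError(f"unrecognised study_id prefix: {study_id!r}")

print(study_from_id("MID-042"))   # MUSIC
print(study_from_id("MINI-007"))  # Mini-MUSIC
```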

Laboratory Values

  • Naming pattern: the variable name indicates the measurement; a paired *_date column records when the sample was taken
  • nhs_bloods_date: Date of blood sample
  • haemoglobin, crp, albumin: Test results
  • calprotectin_date, calprotectin: Faecal calprotectin

Medications

  • sampling_*: Medications at time of sampling (1 = yes, 0 = no)
  • Examples: sampling_asa, sampling_ifx, sampling_ada
  • baseline_*: Historical medication exposure
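Because the sampling_* columns are 0/1 flags, filtering on medication exposure is a simple boolean operation. A sketch using a toy frame (reading sampling_ifx as infliximab and sampling_ada as adalimumab is an assumption about the abbreviations; check the data dictionary):

```python
import pandas as pd

# Toy frame standing in for a real extract
df = pd.DataFrame({
    "study_id": ["GID-1", "GID-2", "GID-3"],
    "sampling_ifx": [1, 0, 0],
    "sampling_ada": [0, 1, 0],
})

# Participants on either biologic at the time of sampling
on_anti_tnf = df[(df["sampling_ifx"] == 1) | (df["sampling_ada"] == 1)]
print(on_anti_tnf["study_id"].tolist())  # ['GID-1', 'GID-2']
```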

Disease Activity Scores

  • hbi_total: Harvey-Bradshaw Index total score
  • sccai_total: Simple Clinical Colitis Activity Index
  • mayo_total: Mayo Score
  • sescd, uceis: Endoscopic scores
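Score totals can be turned into categorical labels for analysis. The sketch below uses a commonly cited Harvey-Bradshaw cut-off (HBI < 5 for clinical remission); confirm the thresholds your analysis should use with the study documentation before relying on them:

```python
import pandas as pd

df = pd.DataFrame({"study_id": ["MID-1", "MID-2"], "hbi_total": [3, 8]})

# Commonly cited cut-off: HBI < 5 taken as clinical remission
# (verify against the study's own definitions)
df["hbi_remission"] = df["hbi_total"] < 5
print(df)
```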

Common Workflows

Comparing Across Studies

When working with multiple datasets, focus on variables documented in the Unified Data Dictionary:

Python
# Example: Compare CRP levels across studies
common_vars = ['study_id', 'study_group', 'age', 'crp', 'calprotectin']

# Filter to common variables
gidamps_subset = gidamps_df[common_vars]
music_subset = music_df[common_vars]
mini_music_subset = mini_music_df[common_vars]

# Combine for analysis
combined = pd.concat([
    gidamps_subset.assign(source='GI-DAMPs'),
    music_subset.assign(source='MUSIC'),
    mini_music_subset.assign(source='Mini-MUSIC')
])
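Once combined, per-study summaries follow the usual pandas patterns. A self-contained sketch with a toy frame in the shape produced above:

```python
import pandas as pd

# Toy combined frame with a 'source' column, as built above
combined = pd.DataFrame({
    "source": ["GI-DAMPs", "GI-DAMPs", "MUSIC", "Mini-MUSIC"],
    "crp": [12.0, 3.5, 7.2, 1.1],
})

# Median CRP per study
summary = combined.groupby("source")["crp"].median()
print(summary)
```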

Handling Missing Data

  • Missing values are typically represented as NaN or empty strings
  • Some categorical variables use specific codes for missing/unknown (see data dictionary)
  • Review Known Issues for dataset-specific considerations
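A quick way to make empty strings behave like proper missing values and audit missingness per column (the toy frame and its columns are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "study_id": ["GID-1", "GID-2", "GID-3"],
    "crp": [4.2, np.nan, 9.1],
    "sex": ["F", "", "M"],  # empty string standing in for a missing value
})

# Normalise empty strings to NaN so they count as missing
df = df.replace("", np.nan)

# Missing count per column
print(df.isna().sum())
```

Coded missing/unknown categories from the data dictionary would need the same treatment before counting.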

Date Variables

Date columns follow the YYYY-MM-DD format:

Python
# Convert date columns to datetime
df['nhs_bloods_date'] = pd.to_datetime(df['nhs_bloods_date'])
df['date_of_diagnosis'] = pd.to_datetime(df['date_of_diagnosis'])
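With dates parsed, derived intervals such as disease duration at sampling are simple subtractions. A sketch assuming both date columns are present, as in the example above:

```python
import pandas as pd

# Toy frame with ISO-format date strings, as in the extracts
df = pd.DataFrame({
    "date_of_diagnosis": ["2020-03-01"],
    "nhs_bloods_date": ["2025-03-01"],
})
df["date_of_diagnosis"] = pd.to_datetime(df["date_of_diagnosis"])
df["nhs_bloods_date"] = pd.to_datetime(df["nhs_bloods_date"])

# Disease duration in whole days at the time of blood sampling
df["disease_duration_days"] = (
    df["nhs_bloods_date"] - df["date_of_diagnosis"]
).dt.days
print(df["disease_duration_days"].iloc[0])  # 1826
```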

Important Considerations

Study-Specific Differences

Disease Activity Classifications: Vary between studies (see Known Issues)

Timepoints:

  • GI-DAMPs: Sampling visits (no fixed schedule)
  • MUSIC: Fixed timepoints (1-5)
  • Mini-MUSIC: Fixed timepoints (1-3)

Scoring Systems:

  • Adults: HBI, SCCAI, Mayo
  • Paediatrics: PCDAI, PUCAI, Paris classification

Data Quality

  • Always review Known Issues before analysis
  • Check for data version/date stamps in filenames
  • Verify variable meanings in the data dictionary
  • Contact data stewards for clarifications

Next Steps

  1. Explore Dataset Overviews - Detailed information about each dataset
  2. Review Data Dictionary - Complete variable reference
  3. Check Known Issues - Important limitations and considerations
  4. Understand Pipelines - How data is transformed

Getting Help