Welcome to Orca

Meet our friendly Orca, your companion in navigating the data waves. Orca is a collaborative data platform that unifies diverse scientific datasets. Just as orcas work together in pods to achieve their goals, our platform harnesses the power of teamwork to accelerate scientific discovery and innovation. This documentation explains how to access and use the integrated datasets effectively.

What is Orca?

Orca is a Dagster-based data engineering platform that aggregates data from multiple REDCap study databases, processes it, and stores the final datasets in G-Trac. The platform currently manages data from three primary studies focused on Inflammatory Bowel Disease (IBD) research:

GI-DAMPs: Investigation into the inflammatory mechanism of gut damage-associated molecular patterns (DAMPs) in Inflammatory Bowel Disease
MUSIC: Mitochondrial DAMPs as mechanistic biomarkers of gut mucosal inflammation
Mini-MUSIC: Pediatric IBD cohort study

Quick Start

Getting Started - Learn how to access and work with Orca datasets
Dataset Overviews - Explore each dataset's features and statistics (with tabs for quick comparison)
Data Dictionary - Understand the standardized variables and study-specific fields
Pipelines - Review the data transformation code and lineage

Key Features

Unified Data Standards

All datasets follow standardized naming conventions (snake_case) and share common variables for demographics, laboratory values, medications, and clinical assessments. This enables seamless cross-study analysis while preserving study-specific details.
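As a rough illustration of why a shared snake_case convention helps, the sketch below normalizes column names and finds the variables two datasets have in common. The helper names (`to_snake_case`, `shared_columns`) and the example column names are hypothetical and are not taken from the Orca pipelines; the Data Dictionary remains the reference for the actual variable names.

```python
from __future__ import annotations

import re


def to_snake_case(name: str) -> str:
    """Convert a column name such as 'StudyID' or 'Date of Birth' to snake_case."""
    # Insert an underscore at lowercase/digit -> uppercase boundaries ("StudyID" -> "Study_ID").
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of non-alphanumeric characters into a single underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def shared_columns(*tables: list[str]) -> list[str]:
    """Return the sorted column names present in every table after normalization."""
    normalized = [{to_snake_case(column) for column in columns} for columns in tables]
    return sorted(set.intersection(*normalized)) if normalized else []


# Hypothetical raw column names from two study exports (illustrative only).
study_a_columns = ["StudyID", "Sex", "CRP (mg/L)"]
study_b_columns = ["study_id", "sex", "HBI Total"]

print(to_snake_case("Date of Birth"))                    # date_of_birth
print(shared_columns(study_a_columns, study_b_columns))  # ['sex', 'study_id']
```

Once names are normalized this way, the intersection gives a starting set of variables that can be compared across studies, while the remaining columns stay available as study-specific fields.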

Comprehensive Research Data

Datasets include:

Demographics and baseline characteristics
Disease activity scores (HBI, SCCAI, Mayo, UCEIS, SES-CD)
Laboratory values (bloods, assay values, drug levels)
Medication history and sampling status
Endoscopic and radiological findings
Patient-reported outcomes

Data Quality Assurance

Automated validation checks on each pipeline execution
Comprehensive data dictionaries with clear variable definitions
Documentation of known data issues and limitations
Clear lineage tracking through Dagster assets
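The specific checks Orca runs are not listed on this page. As a minimal sketch of what an automated validation step can look like, the plain-Python example below applies three common checks (missing values, duplicate identifiers, out-of-range values) to a small table. It does not use the Dagster API, and the field names `study_id` and `age`, the 0 to 120 range, and all function names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_not_null(rows, key):
    """Fail if any row is missing a value for `key`."""
    missing = sum(1 for row in rows if row.get(key) is None)
    return CheckResult(f"not_null_{key}", missing == 0, f"{missing} missing value(s)")


def check_unique(rows, key):
    """Fail if any value of `key` appears more than once."""
    seen, duplicates = set(), set()
    for row in rows:
        value = row.get(key)
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    detail = f"duplicates: {sorted(map(str, duplicates))}"
    return CheckResult(f"unique_{key}", not duplicates, detail)


def check_range(rows, key, low, high):
    """Fail if any non-missing value of `key` lies outside [low, high]."""
    out_of_range = [
        row[key]
        for row in rows
        if row.get(key) is not None and not (low <= row[key] <= high)
    ]
    return CheckResult(f"range_{key}", not out_of_range, f"out of range: {out_of_range}")


def run_checks(rows):
    """Run every check and return the results in a fixed order."""
    return [
        check_not_null(rows, "study_id"),
        check_unique(rows, "study_id"),
        check_range(rows, "age", 0, 120),  # hypothetical plausibility bounds
    ]


# Hypothetical records (illustrative only, not real study data).
rows = [
    {"study_id": "A001", "age": 34},
    {"study_id": "A002", "age": 150},
    {"study_id": "A002", "age": None},
]
results = run_checks(rows)
failed = [result.name for result in results if not result.passed]
print(failed)  # ['unique_study_id', 'range_age']
```

Returning a structured result per check, rather than raising on the first failure, lets a pipeline run report every problem found in a single execution.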

Documentation Structure

Section	Description
Dataset Overviews	Key statistics, features, and characteristics of each dataset (tabbed view)
Data Dictionary	Complete variable reference with types, values, and descriptions
Pipelines	Code documentation for data extraction and transformation (tabbed view)
Policies	Dataset governance, attribution policy, and access requirements
Known Issues	Important data quality considerations and limitations

Getting Help

For questions about:

Dataset access: Contact the data steward listed in Dataset Governance (under Policies)
Data interpretation: Review the Data Dictionary and Known Issues
Technical support: Contact the Orca engineering team at shaun.chuah@glasgow.ac.uk