Data Issues

Here is a list of data issues to be aware about:

Disease activity definitions vary between studies

The disease_activity and physician_global_assessment variables differ between GI-DAMPs, MUSIC, and Mini-MUSIC.

Current Variations in Disease Activity Classifications by Study

Study	Field Name	Categories
GI-DAMPs	ibd_status	- Biochemical remission (normal CRP AND FC< 250) - Remission - Active - Highly active (admission for IV steroids) - Not applicable
MUSIC	physician_global_assessment	- Remission - Mildly active - Moderately active - Severely active
	disease_activity	- Biochemical remission - Remission - Active - Biochemically Active - Not applicable
Mini-MUSIC	physician_global_assessment	- Biochemical remission - Clinical remission - Mildly active - Moderately active - Severely active
	disease_activity	- Remission - Mild - Moderate - Severe - Not applicable

Potential Standardized Definition

Disease activity could be standardised using these common variables:

has_active_symptoms (Boolean)
crp (mg/L, threshold > 5)
calprotectin (μg/g, threshold > 250)

Example Implementation Logic

Python
def calculate_disease_activity(row):
    """
    Returns: 'biochem_active' if has_active_symptoms and CRP > 5 and calprotectin > 250, 
            'active' if has_active_symptoms but CRP or calprotectin do not meet above threshold,
            'remission' if not has_active_symptoms but CRP > 5 or calprotectin > 250,
            'biochem_remission' if not has_active_symptoms and CRP < 5 and calprotectin < 250.
    """
    has_active_symptoms = row['has_active_symptoms']
    crp = row['crp']
    calprotectin = row['calprotectin']

    if has_active_symptoms:
        if crp > 5 and calprotectin > 250:
            return 'biochem_active'
        else:
            return 'active'
    else:
        if crp > 5 or calprotectin > 250:
            return 'remission'
        else:
            return 'biochem_remission'

Example Usage with DataFrame

Python
import pandas as pd

data = {
    'has_active_symptoms': [True, False, True, False],
    'crp': [6.2, 2.1, 7.8, 1.5],
    'calprotectin': [300, 150, 180, 100]
}
df = pd.DataFrame(data)

# Apply function to DataFrame
df['standardised_disease_activity'] = df.apply(calculate_disease_activity, axis=1)
print(df)

GI-DAMPs Study ID Format Evolution

Historical Format

Initial format: GID-xxx-P or GID-xxx-HC
xxx: integer with inconsistent leading zeros (e.g., GID-001-P vs GID-1-P)
P: Patient, HC: Healthy Control

Multi-Center Expansion

Center-specific formats introduced:
Edinburgh: GID-x
Glasgow: GID-136-x
Dundee: GID-138-x

Known Issues

Potential conflicts with Edinburgh legacy IDs GID-136-P and GID-138-P
Inconsistent formats at Glasgow/Dundee sites (e.g., GID-136-x-P)

Current Standard (December 2024)

Remove -P and -HC suffixes (use study_group column instead)
Use GID- prefix only
No leading zeros
Center-specific formats:
Edinburgh: GID-x
Glasgow: GID-136-x
Dundee: GID-138-x

⚠️ Important: When merging GI-DAMPs data, carefully check study_id column for legacy formats.