DIMP Pseudonymization

DIMP (Data Integration and Management Platform) provides de-identification and pseudonymization services for FHIR healthcare data using the FHIR Pseudonymizer, protecting patient privacy while preserving data utility for research.

Overview

What is DIMP?

DIMP integration in Aether enables you to:

De-identify sensitive patient information (names, addresses, birthdates, etc.)
Pseudonymize records with consistent, reversible identifiers
Maintain data utility for research purposes
Support compliance with healthcare privacy regulations (GDPR, HIPAA, etc.)
Generate audit trails of all modifications
Scale pseudonymization for large datasets automatically

Prerequisites

DIMP service running and accessible
FHIR data in NDJSON format
DIMP HTTP endpoint configured in Aether

Configuration

1. Configure DIMP Endpoint

Add the DIMP service URL to aether.yaml:

yaml

services:
  dimp:
    url: "http://localhost:32861/fhir"
    bundle_split_threshold_mb: 10

For production environments:

yaml

services:
  dimp:
    url: "https://dimp.healthcare.org/fhir"
    bundle_split_threshold_mb: 50

The bundle_split_threshold_mb setting (range: 1-100 MB) sets the size above which Aether automatically splits large FHIR Bundles, which prevents HTTP 413 (Payload Too Large) errors when sending data to DIMP.
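
To make the threshold concrete, here is a minimal Python sketch of size-based splitting. It is not Aether's actual implementation: the function name split_bundle and the greedy packing strategy are assumptions for illustration, and only the serialized size of the entries is measured.

```python
import json


def split_bundle(bundle: dict, threshold_mb: float) -> list[dict]:
    """Greedily pack entries into smaller Bundles that stay under the threshold."""
    limit = int(threshold_mb * 1024 * 1024)  # threshold in bytes
    chunks, current, current_size = [], [], 0
    for entry in bundle.get("entry", []):
        size = len(json.dumps(entry).encode("utf-8"))
        # Start a new chunk when adding this entry would exceed the limit.
        if current and current_size + size > limit:
            chunks.append(current)
            current, current_size = [], 0
        current.append(entry)
        current_size += size
    if current:
        chunks.append(current)
    # Re-wrap each group of entries as its own Bundle of the same type.
    return [
        {"resourceType": "Bundle", "type": bundle.get("type", "collection"), "entry": chunk}
        for chunk in chunks
    ]
```

In this sketch a single entry larger than the threshold still ends up in a chunk of its own, so it could still trigger HTTP 413; a lower threshold only helps when the Bundle is large because it has many entries.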

2. Enable DIMP in Pipeline

yaml

pipeline:
  enabled_steps:
    - local_import  # Import FHIR data (or torch or http_import)
    - dimp          # Apply pseudonymization

Important: One of the import step types (torch, local_import, or http_import) must always be first. DIMP runs after the import step.
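
The ordering rule can be expressed as a small check. The following Python sketch is illustrative only: validate_enabled_steps is a hypothetical helper, not part of Aether, and it encodes just the rule stated above (an import step must come first).

```python
IMPORT_STEPS = {"torch", "local_import", "http_import"}


def validate_enabled_steps(enabled_steps: list[str]) -> None:
    """Raise ValueError unless the first enabled step is an import step."""
    if not enabled_steps:
        raise ValueError("enabled_steps must not be empty")
    first = enabled_steps[0]
    if first not in IMPORT_STEPS:
        raise ValueError(
            f"first step must be one of {sorted(IMPORT_STEPS)}, got {first!r}"
        )


validate_enabled_steps(["local_import", "dimp"])  # valid order, no exception
```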

3. Full Configuration Example

yaml

services:
  dimp:
    url: "http://localhost:32861/fhir"
    bundle_split_threshold_mb: 10

pipeline:
  enabled_steps:
    - local_import  # or torch or http_import
    - dimp

retry:
  max_attempts: 5
  initial_backoff_ms: 1000
  max_backoff_ms: 30000

jobs_dir: "./jobs"
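
To get a feel for the retry settings, the sketch below computes the waits between attempts. It assumes the backoff doubles after each failed attempt and is capped at max_backoff_ms; the configuration above does not state the growth factor, so treat the doubling as an assumption.

```python
def backoff_schedule(max_attempts: int, initial_backoff_ms: int, max_backoff_ms: int) -> list[int]:
    """Waits in milliseconds before each retry, doubling and capped at the maximum."""
    # max_attempts includes the first try, so there are max_attempts - 1 waits.
    return [
        min(initial_backoff_ms * 2 ** retry, max_backoff_ms)
        for retry in range(max_attempts - 1)
    ]


print(backoff_schedule(5, 1000, 30000))  # [1000, 2000, 4000, 8000]
```

Under that assumption, the example configuration waits at most 15 seconds in total before giving up on a request.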

How Pseudonymization Works

1. Data Processing Flow

Raw FHIR Data
    ↓
Import Step (torch/local_import/http_import)
    ↓
DIMP Pseudonymization
  - Extract identifiable elements
  - Generate pseudonyms
  - Create mapping
  - Apply transformations
    ↓
De-identified Data

2. What Gets Pseudonymized

DIMP typically de-identifies:

Patient names → Pseudonyms (PT_00001, PT_00002, etc.)
Birth dates → Year of birth or age ranges
Contact information → Removed
Street addresses → Postal codes or general location
Medical record numbers → New identifiers
Social security numbers → Removed
Phone numbers → Removed
Email addresses → Removed
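
As an illustration of these transformations, the Python sketch below applies a few of them to a single Patient resource. It is a toy model, not the FHIR Pseudonymizer's logic: the helper names, the in-memory mapping, and the sequential PT_ numbering are assumptions chosen to mirror the list above.

```python
import copy


def pseudonym_for(patient_id: str, mapping: dict[str, str]) -> str:
    """Return a stable pseudonym, creating a new one on first sight."""
    if patient_id not in mapping:
        mapping[patient_id] = f"PT_{len(mapping) + 1:05d}"
    return mapping[patient_id]


def deidentify_patient(patient: dict, mapping: dict[str, str]) -> dict:
    """Return a de-identified copy; the input resource is left untouched."""
    result = copy.deepcopy(patient)
    pseudonym = pseudonym_for(patient["id"], mapping)
    result["id"] = pseudonym                 # record number -> new identifier
    result["name"] = [{"text": pseudonym}]   # name -> pseudonym
    if "birthDate" in result:
        result["birthDate"] = result["birthDate"][:4]  # keep year only
    result.pop("telecom", None)              # phone, email -> removed
    if "address" in result:
        # street address -> postal code only
        result["address"] = [
            {"postalCode": a["postalCode"]} for a in result["address"] if "postalCode" in a
        ]
    return result
```

Because the mapping is keyed by the original identifier, processing the same patient twice yields the same pseudonym, which is what keeps records for one patient linked.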

3. What Is Preserved

DIMP preserves:

Patient demographics needed for research (age, gender)
Diagnosis codes (ICD-10, SNOMED CT)
Procedure codes
Medication information
Laboratory values
Clinical narratives (with identifying terms removed)
Relationships between records for the same patient

Using DIMP Pseudonymization

Basic Pseudonymization Workflow

Prepare your data:

bash

# Ensure FHIR NDJSON files are ready
ls /data/fhir/*.ndjson
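
Beyond listing the files, it can help to confirm that each line really is one JSON resource. The following Python sketch is an optional pre-flight check and not an Aether feature; check_ndjson is a hypothetical helper, and it only verifies JSON syntax and the presence of resourceType, not FHIR validity.

```python
import json
from pathlib import Path


def check_ndjson(path: Path) -> list[str]:
    """Return a list of problems; an empty list means the file looks usable."""
    problems = []
    with path.open(encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                resource = json.loads(line)
            except json.JSONDecodeError as err:
                problems.append(f"line {number}: invalid JSON ({err.msg})")
                continue
            if not isinstance(resource, dict) or "resourceType" not in resource:
                problems.append(f"line {number}: missing resourceType")
    return problems
```

It could be run over every file matched by the ls command above, for example by looping over Path("/data/fhir").glob("*.ndjson") and printing any problems reported.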

Configure aether.yaml:

yaml

services:
  dimp:
    url: "http://localhost:32861/fhir"
    bundle_split_threshold_mb: 10

pipeline:
  enabled_steps:
    - local_import  # or torch or http_import
    - dimp

jobs_dir: "./jobs"

Run the pipeline:

bash

aether pipeline start /data/fhir/

Monitor progress:

bash

# Check job status
aether job list

# Get details
aether pipeline status <job-id>

# View logs
aether job logs <job-id>

Access results:

De-identified data is stored in the jobs directory:

jobs/
└── <job-id>/
    ├── status.json           # Job metadata
    ├── import_results.ndjson # Imported data
    └── dimp_results.ndjson   # De-identified data
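
A quick sanity check on a finished job is to compare resource counts before and after de-identification. The Python sketch below assumes the file names shown in the tree above; the helper names are hypothetical, and a difference in counts is only a prompt to look closer, since nothing here states that DIMP must keep counts identical.

```python
import json
from collections import Counter
from pathlib import Path


def resource_counts(ndjson_path: Path) -> Counter:
    """Count resources per resourceType in an NDJSON file."""
    counts = Counter()
    with ndjson_path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                counts[json.loads(line).get("resourceType", "unknown")] += 1
    return counts


def compare_job_outputs(job_dir: Path) -> dict[str, tuple[int, int]]:
    """Map each resource type whose counts differ to (imported, de-identified)."""
    imported = resource_counts(job_dir / "import_results.ndjson")
    deidentified = resource_counts(job_dir / "dimp_results.ndjson")
    return {
        rtype: (imported[rtype], deidentified[rtype])
        for rtype in sorted(set(imported) | set(deidentified))
        if imported[rtype] != deidentified[rtype]
    }
```

An empty result means both files contain the same number of resources of every type.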

Understanding TORCH vs DIMP

TORCH (Data extraction service):

Extracts FHIR data from TORCH servers based on CRTDL queries
Applies TORCH minimization (extracts only needed fields)
Returns minimized but still identifiable data

DIMP (De-identification service):

Applies DIMP pseudonymization (removes/replaces PII)
De-identifies already-extracted data
Returns pseudonymized, de-identified data

Combined workflow: Extract with TORCH → Pseudonymize with DIMP

TORCH + DIMP Workflow

Combine TORCH extraction with pseudonymization:

yaml

services:
  torch:
    base_url: "https://torch.hospital.org"
    username: "researcher"
    password: "secret"
    extraction_timeout_minutes: 30
    polling_interval_seconds: 5
    max_polling_interval_seconds: 30
  dimp:
    url: "http://localhost:32861/fhir"
    bundle_split_threshold_mb: 10

pipeline:
  enabled_steps:
    - torch   # Extract and import TORCH data (minimized but identifiable)
    - dimp    # Apply DIMP pseudonymization (de-identify)

retry:
  max_attempts: 5
  initial_backoff_ms: 1000
  max_backoff_ms: 30000

jobs_dir: "./jobs"

Run:

bash

aether pipeline start my_cohort.crtdl

This automatically:

Extracts data from TORCH using the CRTDL query
Applies TORCH minimization (extracts only specified fields)
Imports the minimized data
Applies DIMP pseudonymization (removes/replaces PII)
Outputs de-identified, pseudonymized data

This provides defense-in-depth privacy: TORCH minimization reduces the initial exposure, and DIMP pseudonymization then removes or replaces the remaining identifiers.

Best Practices

1. Always Test First

Test with a small sample before processing large datasets:

bash

# Use a sample of data
aether pipeline start /data/fhir/sample/

2. Preserve Mappings

Keep pseudonym mappings in secure storage:

bash

# The job directory contains mappings
# Back them up securely
cp -r jobs/<job-id>/ /secure/backup/

3. Version Control

Track your pseudonymization configurations:

bash

git add aether.yaml
git commit -m "Update DIMP configuration"

4. Audit Trail

Review logs for compliance auditing:

bash

# View full audit trail
aether job logs <job-id> | grep -i "audit\|processed\|error"

5. Data Retention

Plan data lifecycle:

yaml

# Example: Keep job data for 90 days
jobs:
  jobs_dir: "./jobs"
  retention_days: 90  # Clean up after retention period

Privacy Considerations

1. Secure Storage

Store pseudonym mappings in encrypted storage
Restrict access to sensitive files
Use file permissions: chmod 600 mappings.json
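
The permission step can also be applied and verified from a script. This Python sketch assumes a POSIX file system; restrict_to_owner and is_owner_only are hypothetical helper names, and the path you pass in is whatever file holds your mappings.

```python
import os
import stat
from pathlib import Path


def restrict_to_owner(path: Path) -> None:
    """Equivalent of chmod 600: read and write for the owner only."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


def is_owner_only(path: Path) -> bool:
    """True when neither group nor others have any permission bits set."""
    mode = stat.S_IMODE(path.stat().st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

Running is_owner_only over a backup directory before and after copying is a simple way to catch files that were written with a permissive default mode.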

2. Secure Transmission

Use HTTPS for DIMP communication
Enable TLS/SSL verification
Use strong authentication

3. Regulatory Compliance

DIMP supports compliance with:

GDPR: Right to be forgotten, data minimization
HIPAA: Safe harbor de-identification standards
FHIR: Security and privacy profiles
HIPAA Breach Notification Rule: Protect against re-identification

4. Secondary Use

Even with pseudonymization, secondary use requires:

Explicit research protocol approval
IRB/Ethics committee review
Data use agreements
Publication restrictions

Troubleshooting

"DIMP service unavailable"

Verify DIMP is running, for example: curl http://localhost:8083/health (substitute the host and port your DIMP service listens on)
Check services.dimp.url in aether.yaml
Check network connectivity and firewall rules
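
The same checks can be scripted. The Python sketch below derives a health URL from the configured services.dimp.url and probes it with the standard library. The /health path is taken from the curl example above; whether your DIMP deployment serves it on the same host and port as the FHIR endpoint is an assumption, and both function names are hypothetical.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def health_url(dimp_url: str, path: str = "/health") -> str:
    """Swap the path of the configured DIMP URL for the health path."""
    parts = urlsplit(dimp_url)
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def is_reachable(url: str, timeout: float = 5.0) -> bool:
    """True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx response.
        return False
```

For example, is_reachable(health_url("http://localhost:32861/fhir")) returning False points to a service that is down, a wrong URL, or a blocked port.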

"Pseudonymization failed"

Check DIMP logs for errors
Verify FHIR data is valid: Use a FHIR validator first
Check DIMP has sufficient resources (disk space, memory)

"Performance is slow"

DIMP may need tuning for large datasets
Consider processing in batches
Check system resources (CPU, RAM, disk I/O)
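
If you split the input yourself, a generator keeps memory use flat. This Python sketch groups NDJSON lines into fixed-size batches; the batches helper and the batch size are illustrative assumptions, and nothing here implies that Aether exposes a batching option.

```python
from itertools import islice
from typing import Iterable, Iterator


def batches(lines: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most batch_size lines."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(lines)
    # islice pulls the next batch_size items; an empty list ends the loop.
    while batch := list(islice(iterator, batch_size)):
        yield batch
```

Passing an open file handle as lines streams the file without loading it whole, and each yielded batch can be written to its own smaller NDJSON file.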

"Inconsistent pseudonyms"

Ensure the same mapping is used consistently
Use the same job for all related data
Do not re-process the same data with different settings
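
Why settings matter can be seen with a keyed hash. The Python sketch below is a generic illustration of deterministic pseudonyms using HMAC-SHA256, not a description of how DIMP is configured in your deployment; keyed_pseudonym and the example keys are assumptions.

```python
import hashlib
import hmac


def keyed_pseudonym(identifier: str, key: bytes, length: int = 16) -> str:
    """Derive a deterministic pseudonym from an identifier and a secret key."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


first_run = keyed_pseudonym("patient-42", b"key-A")
second_run = keyed_pseudonym("patient-42", b"key-A")   # same key: same pseudonym
other_key = keyed_pseudonym("patient-42", b"key-B")    # new key: different pseudonym
```

With this kind of scheme, the same identifier maps to the same pseudonym only while the key and settings stay unchanged, which is why re-processing data with different settings breaks the link between records.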

Next Steps

Configuration Guide - Full configuration reference
TORCH Integration - Combine TORCH + DIMP
Pipeline Steps - Understand the pipeline
CLI Commands - Available commands
