MDPM vs Azure Synapse
Complex Data Ingestion Showdown: 5 Industry Scenarios
THE CHALLENGE
Can MDPM (Microsoft Dynamics 365 Dataverse Multi-Environment Pro Manager) outperform Azure Synapse Analytics in complex, real-world data ingestion scenarios?
Spoiler: YES! And we’ll prove it across 5 industries with detailed comparisons.
Comparison Criteria

| Metric | What We’re Measuring | Why It Matters |
| --- | --- | --- |
| Time to Production | Setup time + development time + testing time | Faster deployment = lower cost, faster ROI |
| Complexity (Lines of Code) | Amount of code/configuration required | Less code = easier maintenance, fewer bugs |
| Skills Required | Technical expertise needed | Lower skill = more team members can operate |
| Error Handling | Robustness of error recovery | Better recovery = less downtime |
| Total Cost (Annual) | Infrastructure + licensing + personnel | Direct impact on budget |
| Flexibility | Adaptability to changing requirements | Easier changes = faster business response |
SCENARIO 1
HEALTHCARE: Multi-Source Patient Data Aggregation
The Business Problem
Hospital Network Challenge: 15 hospitals, each with different EMR systems (Epic, Cerner, Meditech), need to aggregate patient data into a central Dataverse for analytics.
Data Complexity:
- 50 million patient records across all facilities
- Complex hierarchies: Patients → Encounters → Procedures → Medications → Lab Results
- 15 different source APIs with varying schemas
- 200+ data fields per patient record
- HIPAA compliance required – must track all data lineage
- Daily incremental updates + historical backfill
- 12 lookup tables: Providers, Facilities, Diagnoses (ICD-10), Procedures (CPT), Medications (NDC), etc.
Success Criteria:
- ✅ Complete backfill in under 2 weeks
- ✅ Daily incremental loads in under 4 hours
- ✅ 99.9% accuracy with data validation
- ✅ Full audit trail for HIPAA compliance
- ✅ Non-technical staff can monitor and troubleshoot
Azure Synapse Approach
Architecture:
Azure Synapse Pipeline Architecture:
1. Azure Data Factory Pipelines (15 pipelines – one per hospital)
2. Mapping Data Flows for transformation
3. Azure Data Lake Gen2 for staging
4. Dedicated SQL Pool for aggregation
5. Azure Logic Apps for orchestration
6. Azure Key Vault for credentials
7. Azure Monitor for logging
8. Power BI for monitoring dashboards
9. Custom C# scripts for complex transformations
10. Azure Functions for error handling
Implementation Steps:
- Week 1-2: Provision Azure resources (Synapse workspace, Data Lake, SQL pools, Key Vault)
- Week 3-4: Build 15 ADF pipelines (one per hospital API)
- Week 5-6: Create mapping data flows for schema transformation
- Week 7-8: Develop C# activities for complex hierarchical lookups
- Week 9-10: Implement error handling and retry logic
- Week 11-12: Build monitoring dashboards and alerts
- Week 13-14: Testing and validation
- Week 15-16: HIPAA compliance audit trail implementation
Code Complexity:
Estimated Lines of Code/Configuration:
ADF Pipeline JSON: ~2,500 lines × 15 pipelines = 37,500 lines
Mapping Data Flows: ~1,200 lines × 15 = 18,000 lines
C# Custom Activities: ~5,000 lines
SQL Scripts: ~3,000 lines
Logic Apps: ~2,000 lines
PowerShell deployment: ~1,500 lines
Python validation scripts: ~2,000 lines
TOTAL: ~69,000 lines of code/configuration
Skills Required:
- Azure Data Engineer (2 FTE)
- .NET Developer for custom activities (1 FTE)
- SQL Developer (1 FTE)
- DevOps Engineer for deployment (0.5 FTE)
- Data Analyst for validation (1 FTE)
Timeline:
| Phase | Duration |
| --- | --- |
| Design & Architecture | 2 weeks |
| Development | 10 weeks |
| Testing | 2 weeks |
| Deployment | 2 weeks |
| TOTAL TO PRODUCTION | 16 weeks (4 months) |
Annual Cost:
| Item | Annual Cost |
| --- | --- |
| Synapse Workspace | $50,000 |
| Data Lake Storage | $12,000 |
| Dedicated SQL Pool | $87,000 (DW500c, 24/7) |
| Azure Functions | $3,600 |
| Logic Apps | $2,400 |
| Key Vault | $600 |
| Monitor & Alerts | $1,800 |
| Personnel (5.5 FTE @ $120k avg) | $660,000 |
| TOTAL ANNUAL COST | $817,400 |
MDPM Approach
Tabs Used:
Data Cleaning
Bulk Data (Wizard Mode)
Python Gen
Metadata Staging
AI Assistant
Implementation Steps:
- Day 1: Connect all 15 hospital Dataverse environments
- Day 2-3: Use AI Assistant to generate API extraction scripts for each EMR
- Day 4-5: Configure Wizard Mode with lookup mappings (12 lookup tables)
- Day 6-7: Test with sample data from each hospital (1,000 records each)
- Day 8: Generate Python scripts for nightly incremental loads
- Day 9-10: Full backfill (automated, runs overnight)
Complete MDPM Workflow:
PHASE 1: ONE-TIME SETUP (Day 1-3)
─────────────────────────────────────────
Step 1: Environment Setup (30 minutes)
──────────────────────────────────────
Tab: Environments
– Add 15 hospital environments
– Connect using MSAL device code
– Test: WhoAmI calls succeed ✅
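For the curious, here is roughly what that connection test does under the hood: a minimal MSAL device-code login followed by a Dataverse WhoAmI call. The client ID, tenant ID, and environment URL are placeholders, and MDPM performs this for you from the Environments tab.

import msal
import requests

CLIENT_ID = "your-app-registration-client-id"   # placeholder: an Azure AD app registration
TENANT_ID = "your-tenant-id"                    # placeholder
ENV_URL = "https://hospital1.crm.dynamics.com"  # placeholder environment URL

app = msal.PublicClientApplication(
    CLIENT_ID, authority=f"https://login.microsoftonline.com/{TENANT_ID}"
)

# Device code flow: the user enters the displayed code at microsoft.com/devicelogin
flow = app.initiate_device_flow(scopes=[f"{ENV_URL}/.default"])
print(flow["message"])
token = app.acquire_token_by_device_flow(flow)

# WhoAmI is a standard Dataverse Web API function; a 200 response confirms connectivity
resp = requests.get(
    f"{ENV_URL}/api/data/v9.2/WhoAmI",
    headers={"Authorization": f"Bearer {token['access_token']}"},
)
print(resp.status_code, resp.json().get("UserId"))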
Step 2: Metadata Documentation (1 hour)
──────────────────────────────────────
Tab: Metadata Staging
– Extract table schemas from all 15 environments
– Compare to identify differences
– Stage metadata for documentation
– Result: Complete schema inventory ✅
Step 3: Lookup Table Preparation (2 hours)
──────────────────────────────────────
Tab: Bulk Data → Manual Ingestion
– Upload 12 reference tables (Providers, Facilities, etc.)
– Create lookup staging tables:
  • Staging_Providers
  • Staging_Facilities
  • Staging_ICD10_Codes
  • Staging_CPT_Codes
  • Staging_NDC_Medications
  • … etc.
– Result: All lookup data ready ✅
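If you would rather script one of these reference-table loads than click through Manual Ingestion, a minimal sketch looks like the following. The entity set and column names (crxxx_providers, crxxx_npi, crxxx_name) and the CSV layout are assumptions for illustration.

import pandas as pd
import requests

ENV_URL = "https://hospital1.crm.dynamics.com"   # placeholder environment
TABLE_SET = "crxxx_providers"                    # assumed entity set name
ACCESS_TOKEN = "..."                             # token from the MSAL flow shown earlier

headers = {
    "Authorization": f"Bearer {ACCESS_TOKEN}",
    "Content-Type": "application/json",
    "OData-MaxVersion": "4.0",
    "OData-Version": "4.0",
}

providers = pd.read_csv("providers_reference.csv")   # assumed columns: npi, name
for _, row in providers.iterrows():
    payload = {"crxxx_npi": str(row["npi"]), "crxxx_name": row["name"]}
    resp = requests.post(f"{ENV_URL}/api/data/v9.2/{TABLE_SET}", json=payload, headers=headers)
    resp.raise_for_status()   # Dataverse returns 204 No Content on a successful create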
PHASE 2: API INTEGRATION (Day 4-5)
─────────────────────────────────────────
Step 4: Generate API Extraction Scripts (4 hours)
──────────────────────────────────────
Tab: Python Gen
Action: Generate "Bulk Import from CSV" script
For Hospital 1 (Epic):
  Client ID: [epic-client-id]
  Environment: Hospital1_Prod
  Table: crxxx_patientrecord
  Click: Generate Python Script
Generated Script Features:
✅ MSAL authentication
✅ Batch processing (100 records/batch)
✅ Error handling with retry
✅ Checkpoint/resume capability
✅ Progress tracking
Repeat for all 15 hospitals (copy/modify script)
Save as: extract_hospital_01_epic.py through extract_hospital_15_meditech.py
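The checkpoint/resume capability is simpler than it sounds: the script just remembers the last row it successfully ingested. A minimal sketch of the kind of helpers involved (the file name and JSON structure here are assumptions, not MDPM's exact generated output):

import json
import os

CHECKPOINT_FILE = "checkpoint_hospital_01.json"   # hypothetical checkpoint file

def load_checkpoint():
    """Return the saved progress, or an empty dict on first run."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {}

def save_checkpoint(last_row):
    """Persist the last successfully ingested row so a crash can resume here."""
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_row": last_row}, f)

# Usage: skip rows already processed, then checkpoint after each batch
start_row = load_checkpoint().get("last_row", 0)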
PHASE 3: LOOKUP MAPPING (Day 6-7)
─────────────────────────────────────────
Step 5: Configure Wizard Mode (8 hours total)
──────────────────────────────────────
Tab: Bulk Data → Bulk Ingestion with Lookups
Group 1: Hospital 1 – Patient Records
────────────────────────────────────────────────────
Environment: Hospital1_Prod
Target Table: crxxx_patientrecord
Data Source: [Patient data from API]

Lookup 1: Provider (Attending Physician)
  Source Column: attending_provider_npi
  Lookup Table: crxxx_provider
  Match Field: crxxx_npi
  Binding Field: crxxx_attendingproviderid

Lookup 2: Facility
  Source Column: facility_code
  Lookup Table: crxxx_facility
  Match Field: crxxx_facilitycode
  Binding Field: crxxx_facilityid

Lookup 3: Primary Diagnosis (ICD-10)
  Source Column: primary_diagnosis_code
  Lookup Table: crxxx_icd10code
  Match Field: crxxx_icd10code
  Binding Field: crxxx_primarydiagnosisid

… (configure 9 more lookups)
────────────────────────────────────────────────────
Repeat for Groups 2-15 (one per hospital)
SAVE CONFIGURATION for reuse!
Click: "Save Current Configuration"
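Conceptually, each lookup in this configuration resolves a source value (an NPI, a facility code, an ICD-10 code) to a Dataverse row GUID and writes an @odata.bind reference on the binding field. Here is a rough sketch of that resolution step, assuming a config structure similar to the saved JSON (the exact schema shown is illustrative, not MDPM's actual format):

# Illustrative lookup resolution; the config structure below is an assumption.
lookup_config = [
    {"source_column": "attending_provider_npi",
     "lookup_entityset": "crxxx_providers",
     "match_field": "crxxx_npi",
     "binding_field": "crxxx_attendingproviderid"},
    {"source_column": "facility_code",
     "lookup_entityset": "crxxx_facilities",
     "match_field": "crxxx_facilitycode",
     "binding_field": "crxxx_facilityid"},
]

def apply_lookups(source_row, lookup_config, guid_maps):
    """Turn raw source values into @odata.bind references using pre-built GUID maps.

    guid_maps: {entity_set_name: {match_value: row_guid}}
    """
    record = {}
    for rule in lookup_config:
        value = source_row.get(rule["source_column"])
        guid = guid_maps[rule["lookup_entityset"]].get(value)
        if guid:
            record[f"{rule['binding_field']}@odata.bind"] = f"/{rule['lookup_entityset']}({guid})"
    return record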
PHASE 4: DATA CLEANING (Day 8)
─────────────────────────────────────────
Step 6: Clean Sample Data (2 hours)
──────────────────────────────────────
Tab: Data Cleaning
Upload: hospital_01_sample_1000_records.csv
Cleaning Operations (AI-Assisted):
✅ Remove duplicates by MRN (Medical Record Number)
✅ Standardize date formats (15 different formats detected!)
✅ Validate phone numbers (remove invalid)
✅ Clean SSN formatting (XXX-XX-XXXX)
✅ Standardize names (Title Case)
✅ Validate email addresses
✅ Handle null values (default strategies)
Auto-Column Matching:
AI automatically maps:
hospital_patient_id → crxxx_mrn
patient_first_name → crxxx_firstname
patient_last_name → crxxx_lastname
date_of_birth → crxxx_dateofbirth
… (50+ fields matched automatically)
Export: hospital_01_cleaned.csv
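If you ever want to reproduce or sanity-check the same cleaning outside the UI, the equivalent pandas operations are short. The column names below (including the email column) mirror the sample file and are assumptions:

import pandas as pd

df = pd.read_csv("hospital_01_sample_1000_records.csv")

# Remove duplicates by MRN (keep the last occurrence)
df = df.drop_duplicates(subset="hospital_patient_id", keep="last")

# Standardize dates: let pandas parse the many source formats, coerce failures to NaT
df["date_of_birth"] = pd.to_datetime(df["date_of_birth"], errors="coerce")

# Title-case names and strip stray whitespace
for col in ["patient_first_name", "patient_last_name"]:
    df[col] = df[col].astype(str).str.strip().str.title()

# Basic email validation: blank out values without a simple user@domain shape
valid_email = df["email"].astype(str).str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
df.loc[~valid_email, "email"] = None

df.to_csv("hospital_01_cleaned.csv", index=False)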
PHASE 5: TESTING (Day 9)
─────────────────────────────────────────
Step 7: Test Run (4 hours)
──────────────────────────────────────
Tab: Bulk Data → Wizard Mode
Load Configuration: "Hospital Patient Ingestion v1.0"
Test with cleaned sample data (1,000 records × 15 hospitals)
Click: "Run All Lookup Ingestion Groups"
Results:
✅ Hospital 1: 1,000/1,000 records (100% success)
✅ Hospital 2: 998/1,000 records (99.8% success – 2 invalid NPIs)
✅ Hospital 3: 1,000/1,000 records (100% success)
… (all hospitals tested)
Total: 14,985/15,000 records (99.9% success rate)
Review errors in the error log:
– 15 records with invalid provider NPIs (fix lookup table)
– Fix and re-run failed records ✅
PHASE 6: FULL BACKFILL (Day 10)
─────────────────────────────────────────
Step 8: Generate Production Scripts (1 hour)
──────────────────────────────────────
Tab: Python Gen
Generate batch processing script for each hospital:
import pandas as pd
import msal
import requests
from datetime import datetime

# Configuration
CLIENT_ID = "prod-client-id"
TENANT_ID = "tenant-id"
ENV_URL = "https://hospital1.crm.dynamics.com"
TABLE_NAME = "crxxx_patientrecord"

# Read patient data in chunks (memory efficient)
chunk_size = 10000
checkpoint_file = "checkpoint_hospital01.json"

# Process 50 million records across all hospitals, with checkpoint/resume capability.
# clean_patient_data, create_batches, send_to_dataverse, and save_checkpoint are
# helper functions included in the generated script.
current_row = 0
for chunk in pd.read_csv("all_patients_hospital01.csv", chunksize=chunk_size):
    # Clean with pandas
    chunk = clean_patient_data(chunk)
    # Batch and send (100 records per API call)
    for batch in create_batches(chunk, 100):
        send_to_dataverse(batch)
        current_row += len(batch)
        save_checkpoint(current_row)
    print(f"{current_row:,} / 3,333,333 (Hospital 1)")
Step 9: Run Overnight Backfill (Automated)
──────────────────────────────────────
Schedule all 15 Python scripts to run overnight:
Hospital 1: 3.3M records → ~4 hours
Hospital 2: 4.1M records → ~5 hours
Hospital 3: 2.8M records → ~3.5 hours
… (all run in parallel)
TOTAL BACKFILL TIME: ~8 hours (overnight)
50 million records successfully ingested!
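One simple way to launch all 15 backfill scripts in parallel from a single machine is a small driver script. The file name pattern below is an assumption; adjust it to the extract_hospital_01_epic.py … extract_hospital_15_meditech.py names used earlier.

import subprocess

# Launch all 15 hospital backfill scripts in parallel and wait for completion.
scripts = [f"extract_hospital_{i:02d}.py" for i in range(1, 16)]   # assumed file names
procs = [subprocess.Popen(["python", s]) for s in scripts]

for script, proc in zip(scripts, procs):
    code = proc.wait()
    print(f"{script}: {'OK' if code == 0 else f'failed with exit code {code}'}")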
PHASE 7: DAILY INCREMENTAL (Automated)
─────────────────────────────────────────
Step 10: Setup Daily Sync (30 minutes)
──────────────────────────────────────
Schedule Python scripts via Windows Task Scheduler or cron:
Daily at 2:00 AM:
– Run extract_hospital_01_incremental.py
– Run extract_hospital_02_incremental.py
– … (all 15 hospitals)
Each script:
1. Queries EMR API for records modified in last 24 hours
2. Cleans data with pandas
3. Uses Wizard Mode lookup mappings (saved config)
4. Ingests via Batch API
5. Logs results
Daily incremental: ~50,000 records/day across all hospitals
Processing time: ~45 minutes total
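A minimal sketch of one of those incremental scripts is below. The EMR endpoint and its "modified_since" parameter are hypothetical (real EMR/FHIR APIs each have their own query conventions), and the cleaning/ingestion helpers are the same ones included in the generated backfill script.

import pandas as pd
import requests
from datetime import datetime, timedelta

EMR_API = "https://emr.hospital1.example.com/api/patients"   # placeholder endpoint
cutoff = (datetime.utcnow() - timedelta(hours=24)).isoformat()

# Pull only records modified in the last 24 hours
resp = requests.get(EMR_API, params={"modified_since": cutoff}, timeout=120)
resp.raise_for_status()
df = pd.DataFrame(resp.json())

df = clean_patient_data(df)          # cleaning helper from the generated backfill script
for batch in create_batches(df, 100):
    send_to_dataverse(batch)         # ingestion helper from the generated backfill script

# Example cron entry (daily at 2:00 AM):
# 0 2 * * * /usr/bin/python3 /opt/mdpm/extract_hospital_01_incremental.py >> /var/log/mdpm_h01.log 2>&1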
Code Complexity:
Total Code Written:
Python extraction scripts: ~200 lines × 15 = 3,000 lines
Wizard Mode configuration: 0 lines (GUI-based, saved as JSON)
Data cleaning rules: 0 lines (AI-assisted, GUI-based)
TOTAL: ~3,000 lines of Python (95% generated by MDPM!)
Configuration files (auto-generated):
lookup_config.json: ~500 lines (saved from UI)
cleaning_rules.json: ~300 lines (saved from UI)
Skills Required:
- ✅ Basic Python knowledge (can copy/modify generated scripts)
- ✅ Understanding of healthcare data (business-user level)
- ✅ NO Azure expertise required
- ✅ NO SQL expertise required
- ✅ NO DevOps expertise required
Team: 1-2 people (vs 5.5 FTE for Synapse)
Timeline:
| Phase | Duration |
| --- | --- |
| Setup & Configuration | 3 days |
| API Integration | 2 days |
| Lookup Mapping | 2 days |
| Testing | 1 day |
| Backfill | 1 day (overnight) |
| Daily Automation Setup | 0.5 days |
| TOTAL TO PRODUCTION | 9.5 days (~2 weeks) |
Annual Cost:
| Item | Annual Cost |
| --- | --- |
| Dataverse Storage (additional) | $6,000 |
| Power Platform licensing | $12,000 |
| Personnel (1.5 FTE @ $100k avg) | $150,000 |
| MDPM App | $0 (one-time purchase) |
| TOTAL ANNUAL COST | $168,000 |
SCENARIO 1 WINNER: MDPM
| Metric | Azure Synapse | MDPM | Winner |
| --- | --- | --- | --- |
| Time to Production | 16 weeks (4 months) | 9.5 days (~2 weeks) | MDPM ✅ (8x faster) |
| Lines of Code | 69,000 lines | 3,000 lines (95% generated) | MDPM ✅ (23x less code) |
| Team Size | 5.5 FTE | 1.5 FTE | MDPM ✅ (3.7x fewer people) |
| Annual Cost | $817,400 | $168,000 | MDPM ✅ ($649k savings, 79% cheaper) |
| Skills Required | Azure Engineer, .NET Dev, SQL Dev, DevOps | Basic Python, healthcare domain knowledge | MDPM ✅ (much lower barrier) |
| Maintenance Effort | High (complex pipelines, custom code) | Low (GUI config, generated scripts) | MDPM ✅ |

MDPM wins on EVERY metric!
SCENARIO 2
FINANCIAL SERVICES: Regulatory Compliance Data Warehouse
The Business Problem
Investment Bank Challenge: Aggregate trading data from 50 different systems for regulatory reporting (MiFID II, EMIR, Dodd-Frank).
Data Complexity:
- 2 billion transactions annually
- Hierarchical structures: Trades → Orders → Executions → Allocations → Settlements
- 50 source systems: trading platforms, OMS, EMS, prime brokers, exchanges
- Real-time requirements: T+15 minutes for reportable transactions
- 300+ fields per transaction
- 25 lookup/reference tables: Instruments, Counterparties, Legal Entities, Venues
- Multiple currencies, time zones, regulatory regimes
- 7-year retention requirement
Complete MDPM Solution
Tabs Used:
Python Gen
Bulk Data (Direct @odata.bind)
Data Cleaning
Metadata Staging
Data Analysis
COMPLETE FINANCIAL SERVICES WORKFLOW
──────────────────────────────────────────────
PHASE 1: REFERENCE DATA SETUP (Day 1)
──────────────────────────────────────
Tab: Bulk Data → Manual Ingestion
Upload 25 reference tables:
1. Instruments (ISIN master): 500,000 records
2. Counterparties (LEI registry): 1,000,000 records
3. Venues (MIC codes): 5,000 records
4. Legal Entities: 50,000 records
5. Currency codes: 150 records
… (20 more reference tables)
Processing time: ~2 hours for all reference data
PHASE 2: REAL-TIME INGESTION (Day 2-3)
──────────────────────────────────────
Tab: Python Gen
Generate: Real-time streaming ingestion scripts
Python Script (Generated for each of 50 sources):
import pandas as pd
import msal
import requests
import time
from datetime import datetime, timedelta

# Real-time configuration
SOURCE_API = "https://trading-platform-01.bank.com/api/trades"
POLLING_INTERVAL = 60  # seconds (poll every minute)
BATCH_SIZE = 100

while True:
    # Get the last 15 minutes of trades
    cutoff_time = datetime.now() - timedelta(minutes=15)
    trades = fetch_new_trades(SOURCE_API, cutoff_time)

    if trades:
        # Clean and enrich (helper functions included in the generated script)
        df = pd.DataFrame(trades)
        df = clean_financial_data(df)
        df = enrich_with_lookups(df)

        # Ingest with @odata.bind lookup references
        for trade in df.to_dict("records"):
            record = {
                "crxxx_tradeid": trade["trade_id"],
                "crxxx_tradetime": trade["trade_timestamp"],
                "crxxx_quantity": trade["quantity"],
                "crxxx_price": trade["price"],
                "crxxx_notional": trade["notional"],
                # Lookups with @odata.bind (GUID maps built from the reference tables)
                "crxxx_instrumentid@odata.bind":
                    f"/crxxx_instruments({instrument_guid_map[trade['isin']]})",
                "crxxx_counterpartyid@odata.bind":
                    f"/crxxx_counterparties({counterparty_guid_map[trade['lei']]})",
                "crxxx_venueid@odata.bind":
                    f"/crxxx_venues({venue_guid_map[trade['mic']]})",
            }
            ingest_to_dataverse(record)
            log_regulatory_event(record)

    time.sleep(POLLING_INTERVAL)
Deploy 50 scripts (one per source) on VMs
Result: T+5 minute ingestion (well under T+15 requirement!)
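The instrument, counterparty, and venue GUID maps referenced above can be built once at startup by paging through the reference tables already loaded into Dataverse. A minimal sketch (entity set and column names follow the placeholder naming used throughout):

import requests

def build_guid_map(env_url, access_token, entity_set, key_field, id_field):
    """Return {business key -> Dataverse GUID} by paging through a reference table."""
    headers = {"Authorization": f"Bearer {access_token}"}
    url = f"{env_url}/api/data/v9.2/{entity_set}?$select={key_field},{id_field}"
    guid_map = {}
    while url:
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()
        data = resp.json()
        for row in data["value"]:
            guid_map[row[key_field]] = row[id_field]
        url = data.get("@odata.nextLink")   # follow server-side paging until exhausted
    return guid_map

# Example with the placeholder names used in this scenario:
# instrument_guid_map = build_guid_map(ENV_URL, access_token, "crxxx_instruments",
#                                      "crxxx_isin", "crxxx_instrumentid")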
PHASE 3: HISTORICAL BACKFILL (Day 4-5)
──────────────────────────────────────
Tab: Python Gen
Generate: Batch historical loader
Load 7 years of historical data:
– 2 billion transactions
– Split by year: ~285M transactions/year
– Process in parallel: 7 years × 50 sources = 350 jobs
Python Script (with checkpoint/resume):
import pandas as pd

CHECKPOINT_FILE = "checkpoint_source01_year2024.json"

def load_historical_trades(source_id, year):
    # load_checkpoint/save_checkpoint, clean_financial_data, create_batches, and
    # ingest_batch are helper functions included in the generated script
    checkpoint = load_checkpoint()
    start_row = checkpoint.get("last_row", 0)
    current_row = start_row
    chunk_size = 50000

    for chunk in pd.read_csv(f"trades_{source_id}_{year}.csv",
                             chunksize=chunk_size,
                             skiprows=range(1, start_row + 1)):  # skip rows already loaded
        # Clean
        chunk = clean_financial_data(chunk)
        # Batch ingest (100 records per API call)
        for batch in create_batches(chunk, 100):
            ingest_batch(batch)
            current_row += len(batch)
            save_checkpoint(current_row)
        # Progress
        print(f"Source {source_id} Year {year}: {current_row:,} / ~285,000,000")
Run all 350 jobs in parallel
TOTAL TIME: ~72 hours (3 days) for 2 billion records
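Fanning out across the (source, year) jobs can be done with a process pool, capping concurrency so Dataverse service-protection limits aren't tripped. A minimal sketch that reuses the loader above (the module name it is imported from is an assumption):

from concurrent.futures import ProcessPoolExecutor, as_completed
from historical_loader import load_historical_trades   # assumed module name for the loader above

# Cap concurrency; unlimited parallelism just triggers API throttling.
MAX_WORKERS = 20
jobs = [(source_id, year) for source_id in range(1, 51) for year in range(2018, 2025)]  # 50 x 7 = 350

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(load_historical_trades, s, y): (s, y) for s, y in jobs}
        for fut in as_completed(futures):
            source_id, year = futures[fut]
            try:
                fut.result()
                print(f"Source {source_id} / {year}: done")
            except Exception as exc:
                print(f"Source {source_id} / {year}: failed ({exc})")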
PHASE 4: DATA VALIDATION (Day 6)
──────────────────────────────────────
Tab: Data Analysis
Query: Check data quality
– Count by source system
– Count by trade date
– Identify missing ISINs
– Check currency conversions
– Validate timestamps (T+15 compliance)
Results displayed in interactive charts:
✅ All 2B records loaded
✅ 99.95% have valid ISIN lookups
✅ 0.05% flagged for review (~1 million records)
✅ 100% within the T+15 minute requirement
PHASE 5: REGULATORY REPORTING (Day 7