Synthea (Synthetic Patient Generator)
Open-source software that generates realistic but entirely fake patient records — FHIR, C-CDA, CSV — so you can build and test health systems without touching real PHI.
In one line
Synthea makes lifelike patients who never existed — full medical histories you can export as FHIR or C-CDA — so engineers can develop and test against real-shaped data with zero privacy risk. Licence: Apache 2.0 (fully open).
The problem it solves
You can't build or test a health system without data — but you can't use real patient data freely, because it's PHI: regulated, risky, and slow to get approved. That's a chicken-and-egg trap for every new project. Synthea breaks it by generating patients who are statistically realistic but completely fictional — safe to share, commit to a repo, and seed into a demo at any scale.
How it generates patients
Synthea (from MITRE) simulates each person's whole life, from birth to death or the present day:
- Disease-progression modules — clinically-informed state machines (the Generic Module Framework) model how conditions develop and are treated over time, so a synthetic diabetic accrues a believable history of encounters, HbA1c results, medications and complications — not random noise.
- Demographics & geography — populations are generated against real demographic and prevalence statistics (US-based by default; configurable).
- Standards-conformant output — exports as FHIR R4 Bundles, C-CDA, CSV, and bulk-FHIR (NDJSON), so the data drops straight into real tools.
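To make the state-machine idea concrete, here is a minimal Python sketch of a toy module interpreter. It is not Synthea's implementation (Synthea is written in Java and its real modules are JSON files). The state names, the simplified `years` field, the 30% onset probability, and the helpers `run_module` and `next_state` are all assumptions made up for illustration. Only the general shape (named states joined by direct or weighted transitions) follows the description above.

```python
import random

# Hypothetical toy module: states and probabilities are invented for this sketch.
TOY_MODULE = {
    "Initial": {"type": "Initial", "direct_transition": "Wait_For_Onset"},
    "Wait_For_Onset": {
        "type": "Delay",
        "years": 40,  # simplified delay field, not the real module syntax
        "distributed_transition": [
            {"distribution": 0.3, "transition": "Diabetes_Onset"},
            {"distribution": 0.7, "transition": "Terminal"},
        ],
    },
    "Diabetes_Onset": {
        "type": "ConditionOnset",
        "label": "Diabetes",
        "direct_transition": "Checkup",
    },
    "Checkup": {
        "type": "Encounter",
        "label": "HbA1c check",
        "direct_transition": "Terminal",
    },
    "Terminal": {"type": "Terminal"},
}


def next_state(state, rng):
    """Pick the next state name: fixed, or weighted-random."""
    if "direct_transition" in state:
        return state["direct_transition"]
    options = state["distributed_transition"]
    weights = [option["distribution"] for option in options]
    return rng.choices(options, weights=weights, k=1)[0]["transition"]


def run_module(module, seed):
    """Walk one simulated patient through the module and return their history."""
    rng = random.Random(seed)  # a fixed seed makes the walk repeatable
    age = 0
    history = []
    name = "Initial"
    while True:
        state = module[name]
        kind = state["type"]
        if kind == "Terminal":
            return history
        if kind == "Delay":
            age += state["years"]
        elif kind in ("ConditionOnset", "Encounter"):
            history.append((age, kind, state["label"]))
        name = next_state(state, rng)
```

Calling `run_module(TOY_MODULE, seed=42)` twice returns the same history, which is the property that lets a synthetic dataset be versioned and regenerated. Because the encounter can only follow the onset state, the history stays coherent rather than random.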
Why synthetic data matters
Because no real person is involved, the data is not PHI and falls outside privacy rules such as HIPAA or DPDP, so it can be shared, versioned, and, given a fixed random seed and configuration, regenerated. This is the cleanest answer to the de-identification problem: rather than risk re-identifying anonymised real data, use data that was never real. It is the principle this whole platform runs on: synthetic data only in the labs.
Where it shows up in digital health
- Populating a FHIR server — seeding HAPI or Medplum for development; a real deployment of the FHIR Sandbox or FHIR Mapper lab would load Synthea Bundles.
- Demos & sales — showing an app with a believable patient panel, no privacy review.
- Load & integration testing — generate thousands of patients to stress a pipeline.
- Research & ML prototyping — develop methods on synthetic data before applying for access to real cohorts (community tools exist to convert Synthea output to the OMOP Common Data Model).
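As a sketch of the server-seeding use case, the Python below inspects a Bundle and prepares an upload using only the standard library. It assumes the Bundle is a FHIR transaction Bundle, which the FHIR specification says is submitted by POSTing to the server's base URL. The helper names `summarize_bundle` and `build_upload_request`, the tiny `SAMPLE_BUNDLE`, and the localhost URL in the usage note are hypothetical; a real Synthea export is far larger.

```python
import json
from collections import Counter
from urllib import request

# Hand-written stand-in for a Synthea export, kept tiny for illustration.
SAMPLE_BUNDLE = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"},
         "request": {"method": "POST", "url": "Patient"}},
        {"resource": {"resourceType": "Condition", "id": "c1"},
         "request": {"method": "POST", "url": "Condition"}},
        {"resource": {"resourceType": "Observation", "id": "o1"},
         "request": {"method": "POST", "url": "Observation"}},
        {"resource": {"resourceType": "Observation", "id": "o2"},
         "request": {"method": "POST", "url": "Observation"}},
    ],
}


def summarize_bundle(bundle):
    """Count the resources in a Bundle by resourceType."""
    return Counter(
        entry["resource"]["resourceType"] for entry in bundle.get("entry", [])
    )


def build_upload_request(bundle, base_url):
    """Build, but do not send, a POST of a transaction Bundle to the server base."""
    if bundle.get("type") != "transaction":
        raise ValueError("expected a transaction Bundle")
    return request.Request(
        base_url.rstrip("/"),
        data=json.dumps(bundle).encode("utf-8"),
        headers={"Content-Type": "application/fhir+json"},
        method="POST",
    )
```

To send it, pass the result to `request.urlopen`, for example `request.urlopen(build_upload_request(bundle, "http://localhost:8080/fhir"))`, where the URL is whatever base your development server exposes. Checking `summarize_bundle` first is a cheap way to confirm a file contains the resource types you expect before loading thousands of them.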
Limits to keep in mind
- It's a model, not reality — distributions are US-centric by default and won't capture every local pattern; conclusions drawn only from Synthea don't transfer to real populations.
- Not for statistical inference — it's for building and testing software, not for estimating true disease epidemiology.
- Correlations are simplified — the modules are realistic but not exhaustive; rare multi-morbidity interactions may be missing.
Key takeaways
- Synthea generates realistic, fully synthetic patient histories — FHIR, C-CDA, CSV, bulk.
- It removes the PHI barrier from development, demos, and testing entirely.
- Built on clinically-informed state machines, so histories are coherent, not random.
- Use it to seed FHIR servers and labs — but remember it models, and doesn't measure, reality.
Check your recall
Active recall beats re-reading: try to answer each question before checking.
What problem does Synthea solve?
Why is synthetic data the cleanest answer to the de-identification problem?