HealthAtoms
Interoperability & Standards · concept · 6 min · updated Jun 30, 2026

Synthea (Synthetic Patient Generator)

By Rajendra Sharma, RN, CPC, CPB · Reviewed by Rajendra Sharma, RN, CPC, CPB · Jun 29, 2026

Open-source software that generates realistic but entirely fake patient records — FHIR, C-CDA, CSV — so you can build and test health systems without touching real PHI.

FHIR · C-CDA

In one line

Synthea makes lifelike patients who never existed — full medical histories you can export as FHIR or C-CDA — so engineers can develop and test against real-shaped data with zero privacy risk. Licence: Apache 2.0 (fully open).

Figure: Synthea disease modules → FHIR bundle, CSV / C-CDA (synthetic, no real PHI).
Synthea simulates realistic lifelong patient histories and exports them as FHIR; the data is synthetic and safe for labs and testing.

The problem it solves

You can't build or test a health system without data — but you can't use real patient data freely, because it's PHI: regulated, risky, and slow to get approved. That's a chicken-and-egg trap for every new project. Synthea breaks it by generating patients who are statistically realistic but completely fictional — safe to share, commit to a repo, and seed into a demo at any scale.

How it generates patients

Synthea (from MITRE) simulates each person's whole life, birth to death:

  • Disease-progression modules — clinically-informed state machines (the Generic Module Framework) model how conditions develop and are treated over time, so a synthetic diabetic accrues a believable history of encounters, HbA1c results, medications and complications — not random noise.
  • Demographics & geography — populations are generated against real demographic and prevalence statistics (US-based by default; configurable).
  • Standards-conformant output — exports as FHIR R4 Bundles, C-CDA, CSV, and bulk-FHIR (NDJSON), so the data drops straight into real tools.
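
The state-machine idea can be sketched in a few lines of Python. This is a toy illustration only, not Synthea's Generic Module Framework: the module layout, state names, and transition probabilities below are all invented for the example. It shows why a history produced this way is coherent (a diagnosis always precedes its follow-up checkups and prescriptions) and why a fixed seed yields the same history every time.

```python
import random

# Toy module (hypothetical, not a real Synthea module): each state names the
# event it records, if any, and its possible next states with probabilities.
TOY_DIABETES_MODULE = {
    "Initial": {"event": None, "next": [("Healthy", 1.0)]},
    "Healthy": {"event": None,
                "next": [("Healthy", 0.9), ("Diabetes_Onset", 0.1)]},
    "Diabetes_Onset": {"event": "condition: diabetes",
                       "next": [("Checkup", 1.0)]},
    "Checkup": {"event": "encounter + HbA1c observation",
                "next": [("Prescribe", 0.5), ("Checkup", 0.4), ("Terminal", 0.1)]},
    "Prescribe": {"event": "medication order", "next": [("Checkup", 1.0)]},
    "Terminal": {"event": None, "next": []},
}


def simulate(module, seed, max_steps=50):
    """Walk the state machine and return a list of (step, event) tuples."""
    rng = random.Random(seed)  # private generator: same seed, same walk
    state = "Initial"
    history = []
    for step in range(max_steps):
        node = module[state]
        if node["event"] is not None:
            history.append((step, node["event"]))
        if not node["next"]:  # no outgoing transitions: the walk ends here
            break
        names = [name for name, _ in node["next"]]
        weights = [weight for _, weight in node["next"]]
        state = rng.choices(names, weights=weights, k=1)[0]
    return history


if __name__ == "__main__":
    for step, event in simulate(TOY_DIABETES_MODULE, seed=42):
        print(step, event)
```

Because events are only recorded when a state is entered, a medication order can never appear before the onset state has been visited. Real modules are far richer, but the ordering guarantee comes from the same mechanism.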

Why synthetic data matters

Because no real person is involved, the data is not PHI and carries no HIPAA/DPDP constraints: it can be shared, versioned, and reproduced exactly. This is the cleanest answer to the de-identification problem: rather than accept the risk that anonymised real data gets re-identified, use data that was never real. It is the principle this whole platform runs on: synthetic data only in the labs.
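
One way to make "reproduced exactly" concrete is to pin every input to the generator run. The sketch below only builds a command line; it does not run anything. The flag names (-p, -s, -cs, -r, -e) are taken from the usage line in the Synthea README, while the jar path and the function name are hypothetical. Whether this set of flags is sufficient for identical output on a given Synthea version is an assumption to verify.

```python
def synthea_command(population, seed, clinician_seed, reference_date, end_date,
                    jar_path="synthea-with-dependencies.jar"):
    """Build an argv list for a pinned Synthea run (hypothetical helper).

    Dates are strings in YYYYMMDD form. jar_path is a placeholder for
    wherever the Synthea jar lives on your machine.
    """
    return [
        "java", "-jar", jar_path,
        "-p", str(population),       # population size
        "-s", str(seed),             # patient seed
        "-cs", str(clinician_seed),  # clinician seed
        "-r", reference_date,        # reference date
        "-e", end_date,              # end date
    ]


if __name__ == "__main__":
    print(" ".join(synthea_command(100, 42, 42, "20250101", "20260101")))
```

Committing a command like this next to the code gives teammates a way to regenerate the dataset instead of passing large export files around.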

Where it shows up in digital health

  • Populating a FHIR server — seeding HAPI or Medplum for development; a real deployment of the FHIR Sandbox or FHIR Mapper lab would load Synthea Bundles.
  • Demos & sales — showing an app with a believable patient panel, no privacy review.
  • Load & integration testing — generate thousands of patients to stress a pipeline.
  • Research & ML prototyping: develop methods on synthetic data before applying for access to real cohorts (converters from Synthea output to the OMOP Common Data Model exist, such as OHDSI's ETL-Synthea).
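
As a rough sketch of the server-seeding case, the Python below inspects a FHIR Bundle and prepares a POST to a server's base URL. It assumes the export is a transaction Bundle in JSON. The base URL, the helper names, and the tiny hand-written bundle are all hypothetical stand-ins; a real Synthea file is far larger. The request is built but deliberately not sent.

```python
import json
from collections import Counter
from urllib import request


def count_resource_types(bundle):
    """Count the resources in a FHIR Bundle dict by resourceType."""
    return Counter(
        entry["resource"]["resourceType"]
        for entry in bundle.get("entry", [])
        if "resource" in entry
    )


def build_transaction_request(bundle, base_url):
    """Build (but do not send) a POST of a transaction Bundle to a FHIR base URL."""
    body = json.dumps(bundle).encode("utf-8")
    return request.Request(
        base_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/fhir+json"},
    )


# Hand-written stand-in for one exported patient file (not real Synthea output).
sample_bundle = {
    "resourceType": "Bundle",
    "type": "transaction",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "example-1"},
         "request": {"method": "POST", "url": "Patient"}},
        {"resource": {"resourceType": "Encounter", "id": "example-2"},
         "request": {"method": "POST", "url": "Encounter"}},
        {"resource": {"resourceType": "Observation", "id": "example-3"},
         "request": {"method": "POST", "url": "Observation"}},
        {"resource": {"resourceType": "Observation", "id": "example-4"},
         "request": {"method": "POST", "url": "Observation"}},
    ],
}

if __name__ == "__main__":
    print(count_resource_types(sample_bundle))
    # Hypothetical local server address; replace with your own FHIR base URL.
    req = build_transaction_request(sample_bundle, "http://localhost:8080/fhir")
    print(req.get_method(), req.full_url)
    # To actually load the data: request.urlopen(req)
```

Counting resource types before loading is a cheap sanity check that a file contains what the demo or test expects. For load testing, the same two functions can be looped over a directory of exported files.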

Limits to keep in mind

  • It's a model, not reality: distributions are US-centric by default and won't capture every local pattern, so conclusions drawn only from Synthea should not be assumed to transfer to real populations.
  • Not for statistical inference — it's for building and testing software, not for estimating true disease epidemiology.
  • Correlations are simplified — the modules are realistic but not exhaustive; rare multi-morbidity interactions may be missing.

Key takeaways

  • Synthea generates realistic, fully synthetic patient histories — FHIR, C-CDA, CSV, bulk.
  • It removes the PHI barrier from development, demos, and testing entirely.
  • Built on clinically-informed state machines, so histories are coherent, not random.
  • Use it to seed FHIR servers and labs — but remember it models, and doesn't measure, reality.

Check your recall

Active recall beats re-reading — try to answer, then reveal.

  1. What problem does Synthea solve?

  2. Why is synthetic data the cleanest answer to the de-identification problem?

References

  1. Walonoski et al. — Synthea (JAMIA 2018)
  2. Synthea project (MITRE)