Compliance, Privacy & Ethics · article · 5 min · updated Jun 30, 2026

De-identification of health data

By Rajendra Sharma, RN, CPC, CPB · Reviewed by Rajendra Sharma, RN, CPC, CPB · Jun 29, 2026

Turning patient data into data you can safely analyse or share — Safe Harbor vs Expert Determination, and why 'anonymous' is harder than deleting the name.

HIPAA

In one line

De-identification removes or obscures the details that tie health data to a person, so it can be used for research, analytics or sharing without exposing the patient. Done properly it is the difference between a lawful dataset and a breach.

HIPAA gives two routes to de-identified data: the prescriptive Safe Harbor method, or Expert Determination by a person with appropriate statistical expertise.

The two HIPAA methods

Safe Harbor: remove 18 specified identifiers (names, geography smaller than a state, all date elements finer than year, contact details, MRNs, device IDs, biometrics, full-face photos, etc.) and have no actual knowledge that the remaining data could be used to re-identify the person. Prescriptive and checkable; the Audit & Compliance lab practises exactly this.
Expert Determination: a person with appropriate statistical and scientific expertise determines that the re-identification risk is very small, given the data and who will receive it. More flexible, retains more analytic value, and requires documented methods and results.
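The Safe Harbor idea of dropping listed fields and coarsening what remains can be sketched in a few lines. This is a minimal illustration, not a compliant implementation: the field names (name, mrn, zip and so on) are hypothetical, the set covers only some of the 18 identifier categories, and it does nothing about free text or the "no actual knowledge" condition.

```python
from datetime import date

# Hypothetical field names for illustration; a real schema will differ,
# and this set does NOT cover all 18 Safe Harbor identifier categories.
DIRECT_IDENTIFIER_FIELDS = {
    "name", "phone", "email", "mrn", "ssn", "device_id", "street_address",
}

# Geography smaller than a state is removed in this sketch.
SUB_STATE_GEOGRAPHY_FIELDS = {"zip", "city", "county"}


def scrub_record(record: dict) -> dict:
    """Return a copy with listed identifiers dropped and dates reduced to year."""
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIER_FIELDS or field in SUB_STATE_GEOGRAPHY_FIELDS:
            continue  # drop the field entirely
        if isinstance(value, date):
            out[field + "_year"] = value.year  # keep only the year element
        else:
            out[field] = value
    return out


if __name__ == "__main__":
    example = {
        "name": "Test Patient",
        "mrn": "12345",
        "zip": "02139",
        "state": "MA",
        "birth_date": date(1980, 5, 17),
        "diagnosis": "E11.9",
    }
    print(scrub_record(example))
```

A deny-list like this only removes what it knows about; an allow-list of fields that are explicitly safe to release is often the more cautious design.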

Why "anonymous" is hard

Stripping the name isn't enough. Quasi-identifiers (ZIP code, birth date and sex together are famously unique for most people) and linkage attacks against other datasets can unmask "anonymous" records. Stronger tools push further: k-anonymity, which requires every record to share its quasi-identifier values with at least k-1 others, and differential privacy, which adds calibrated noise so that a result reveals very little about whether any one individual's data was included.
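Both ideas can be made concrete with a small sketch. The first function measures k as the size of the smallest group of rows sharing the same quasi-identifier values; the second adds Laplace noise to a count, the textbook mechanism for a counting query. The function names, column names and the epsilon value are assumptions chosen for illustration, and a production system would use a vetted privacy library rather than hand-rolled noise.

```python
import random
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing one quasi-identifier combination.

    Raises ValueError on an empty dataset.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


def noisy_count(true_count, epsilon, rng=random):
    """Laplace mechanism for a counting query (sensitivity 1).

    Smaller epsilon means more noise and stronger privacy.
    """
    # The difference of two independent exponentials with rate epsilon
    # is Laplace-distributed with scale 1 / epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


if __name__ == "__main__":
    rows = [
        {"zip3": "021", "birth_year": 1980, "sex": "F"},
        {"zip3": "021", "birth_year": 1980, "sex": "F"},
        {"zip3": "021", "birth_year": 1975, "sex": "M"},
    ]
    # The third row is alone in its group, so the table is only 1-anonymous.
    print(k_anonymity(rows, ["zip3", "birth_year", "sex"]))
    print(noisy_count(true_count=3, epsilon=1.0))
```

A k of 1 means at least one person is unique on those columns, which is exactly the situation a linkage attack exploits.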

The cleanest option: never use real data

For teaching, demos and many tests, synthetic data (Synthea) sidesteps the problem entirely — realistic patients that were never real people. That's exactly why every HealthAtoms lab runs on synthetic data only.

Common pitfalls

Free-text & pixel leakage: names and IDs hide in narrative notes and (in imaging) burned-in pixels; structured-field scrubbing alone misses them.
Dates done wrong: leaving exact dates in place can re-identify, and deleting them loses analytic value; a consistent per-patient date shift hides the real dates while preserving intervals.
Small cells: a count of 1 in a rare-disease × small-region cross-tab is effectively an identifier; suppression or differential privacy is needed.
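Two of the pitfalls above lend themselves to a short sketch: a consistent per-patient date shift and small-cell suppression. Everything named here is an assumption for illustration: the secret key, the 365-day window and the threshold of 5 are arbitrary choices, not values taken from HIPAA or from any particular policy.

```python
import hashlib
import hmac
from datetime import date, timedelta


def shift_date(d, patient_id, secret, max_days=365):
    """Shift a date by an offset that is fixed per patient.

    The offset comes from a keyed hash of the patient ID, so every date for
    the same patient moves by the same number of days and intervals survive.
    """
    digest = hmac.new(secret, patient_id.encode("utf-8"), hashlib.sha256).digest()
    offset = int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days
    return d + timedelta(days=offset)


def suppress_small_cells(counts, threshold=5):
    """Replace any count below the threshold with None (suppressed)."""
    return {cell: (n if n >= threshold else None) for cell, n in counts.items()}


if __name__ == "__main__":
    secret = b"demo-secret"  # hypothetical key; keep a real one out of the dataset
    admit = shift_date(date(2024, 1, 10), "patient-1", secret)
    discharge = shift_date(date(2024, 1, 25), "patient-1", secret)
    print((discharge - admit).days)  # the 15-day stay is preserved

    table = {("rare disease", "small region"): 1, ("asthma", "large region"): 40}
    print(suppress_small_cells(table))
```

Suppressing a cell is not always enough on its own: if row or column totals are also published, the hidden value may be recoverable by subtraction.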

Key takeaways

De-identification makes data safe to analyse/share by removing what ties it to a person.
HIPAA gives two routes: prescriptive Safe Harbor (18 identifiers) and Expert Determination (certified low risk).
"Anonymous" is hard: quasi-identifiers and linkage attacks re-identify; k-anonymity and differential privacy push further.
The cleanest answer is synthetic data (Synthea) — which is why every lab here uses it.

Check your recall

Active recall beats re-reading — try to answer, then reveal.

What are HIPAA's two routes to de-identified data?
Why isn't removing the name enough to anonymise health data?

References

HHS — Methods for De-identification of PHI