AI-Native Systems · concept · 6 min · updated Jun 30, 2026

MedCAT (Medical Concept Annotation Toolkit)

By Rajendra Sharma, RN, CPC, CPB · Reviewed by Rajendra Sharma, RN, CPC, CPB · Jun 29, 2026

A Python toolkit that extracts and links clinical concepts to UMLS/SNOMED using self-supervised learning — strong on messy, real EHR free text.

UMLS · SNOMED CT

In one line

MedCAT extracts clinical concepts from real-world notes and links them to UMLS or SNOMED, learning the vocabulary self-supervised so it adapts to a hospital's own language and abbreviations. Licence: source-available (Elastic License 2.0) — free for research and most internal use; review the terms before offering it as a hosted service.

MedCAT spots clinical entities in free text and links each to a SNOMED/UMLS concept — turning notes into coded, queryable data.

The problem it solves

Real EHR notes are messy — local abbreviations, shorthand, inconsistent phrasing — and the same word can mean different concepts ("MS" = multiple sclerosis or mitral stenosis?). Dictionary-only tools struggle here. MedCAT's bet is that the best way to read one hospital's notes is to learn from that hospital's notes — without anyone hand-labelling them.

How it works — two stages

NER + Linking — a fast, dictionary-style pass recognises spans and links them to a concept database (UMLS/SNOMED), with disambiguation when a term is ambiguous.
Meta-annotation — a supervised layer captures context: negation, experiencer (patient vs family), temporality (current vs historical).
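The two stages above can be sketched as a small data flow. This is a toy illustration, not MedCAT's API: the `CDB` dictionary, the placeholder concept IDs, and both function names are made up for this example, and the cue-word rules in `meta_annotate` are only a stand-in for the trained supervised layer.

```python
import re

# Toy concept database: lower-cased surface form -> candidate concept IDs.
# The IDs are made-up placeholders, not real UMLS/SNOMED codes.
CDB = {
    "cancer": ["CONCEPT_CANCER"],
    "ms": ["CONCEPT_MULTIPLE_SCLEROSIS", "CONCEPT_MITRAL_STENOSIS"],
}


def link_spans(text, cdb):
    """Stage 1 (toy): find dictionary terms and attach candidate concepts."""
    entities = []
    for match in re.finditer(r"[A-Za-z]+", text):
        candidates = cdb.get(match.group().lower())
        if candidates:
            entities.append({
                "text": match.group(),
                "start": match.start(),
                "end": match.end(),
                "candidates": candidates,
                # More than one candidate means disambiguation is still needed.
                "ambiguous": len(candidates) > 1,
            })
    return entities


# Cue words for the rule-based stand-in. A real meta-annotation layer is a
# trained classifier, not a word list.
FAMILY_CUES = {"family", "mother", "father"}
NEGATION_CUES = {"no", "denies", "without"}


def meta_annotate(text, entity):
    """Stage 2 (stand-in): label context from the words that precede the
    span within the same sentence."""
    sentence_start = text.rfind(".", 0, entity["start"]) + 1
    left = text[sentence_start:entity["start"]].lower()
    left_words = set(re.findall(r"[a-z]+", left))
    return {
        "experiencer": "family" if left_words & FAMILY_CUES else "patient",
        "negated": bool(left_words & NEGATION_CUES),
    }


note = "Family history of cancer. Patient denies MS symptoms."
entities = link_spans(note, CDB)
for ent in entities:
    ent["meta"] = meta_annotate(note, ent)
    print(ent["text"], ent["candidates"], ent["meta"])
```

The point of the split is that stage 1 answers "which concept is this?" while stage 2 answers "does it apply to this patient, now?". Downstream code needs both answers before treating a mention as a diagnosis.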

Its distinctive trick is self-supervised training: it learns concept disambiguation from unlabelled local text, tuning to the messy reality of one site's notes rather than assuming clean, textbook phrasing. It runs in Python and scales to millions of documents.
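To make the self-supervised idea concrete, here is a minimal sketch under simplifying assumptions. It learns a context profile for each concept from notes where the concept is written out unambiguously, then uses those profiles to resolve the ambiguous "MS". Bag-of-words counts and cosine similarity stand in for whatever learned representation MedCAT uses internally; the function names, example notes, and concept IDs are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Unambiguous long forms -> placeholder concept IDs (not real codes).
UNAMBIGUOUS = {
    "multiple sclerosis": "CONCEPT_MULTIPLE_SCLEROSIS",
    "mitral stenosis": "CONCEPT_MITRAL_STENOSIS",
}


def tokens(text):
    """Lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def learn_contexts(notes):
    """Build one context word-count vector per concept, using only notes
    that mention the concept by an unambiguous name. No labels are needed:
    the unambiguous name itself acts as the training signal."""
    contexts = defaultdict(Counter)
    for note in notes:
        lowered = note.lower()
        for name, concept in UNAMBIGUOUS.items():
            if name in lowered:
                # The rest of the note (name removed) is the context.
                contexts[concept].update(tokens(lowered.replace(name, " ")))
    return contexts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(note, abbreviation, contexts):
    """Pick the concept whose learned context best matches this note."""
    context = Counter(w for w in tokens(note) if w != abbreviation.lower())
    return max(contexts, key=lambda concept: cosine(context, contexts[concept]))


# Made-up, unlabelled "local" notes.
unlabelled_notes = [
    "Multiple sclerosis with optic neuritis, neurology follow-up for relapse.",
    "Echo shows mitral stenosis, cardiology to review valve area.",
    "Known multiple sclerosis, MRI brain shows new lesions.",
    "Mitral stenosis with murmur on auscultation, echo booked.",
]
contexts = learn_contexts(unlabelled_notes)

neuro = disambiguate("MS relapse, neurology review, MRI requested", "MS", contexts)
cardio = disambiguate("MS on echo, murmur noted, cardiology referral", "MS", contexts)
print(neuro, cardio)
```

Because the context profiles come from the site's own notes, local shorthand that co-occurs with a concept ends up in that concept's profile. That is the sense in which the approach adapts to one hospital's language.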

Licensing — read before you resell

The nuance that matters: MedCAT is source-available (Elastic License 2.0), not OSI "open source." That's fine for learning and internal use, but there are conditions on offering it as a hosted service. Check the terms for a commercial product.

Where it sits among the engines

MedCAT is the machine-learning, locally-adaptive option, alongside the precise-and-auditable cTAKES and the lightweight scispaCy. It's the third backend behind the Clinical NLP lab's extraction interface — choose it when local language and scale matter most.

Where it shows up in digital health

MedCAT is widely used, notably across the UK NHS via the CogStack project, to turn NHS-scale clinical narrative into structured, coded data for research and service evaluation. This is exactly the kind of secondary-use pipeline that feeds registries and OMOP.

Common pitfalls

Skipping the meta-annotation training: without it you get concepts but not negation or experiencer, so "family history of cancer" looks like a cancer diagnosis for the patient.
Assuming OSI-open — it's source-available; respect the licence on resale.
Expecting zero tuning — the self-supervised step needs local text to shine.
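The first pitfall is easiest to see in post-processing code. The sketch below filters extracted entities down to affirmed, patient-level concepts. The input shape (an `entities` mapping whose items carry a `cui` and a `meta_anns` dictionary with `value` and `confidence`) is an assumption made for this example, as are the task names "Negation" and "Experiencer" and their labels; in practice these depend on how the meta-annotation models were trained, so check your own model's output before reusing anything like this.

```python
# Required label per meta-annotation task. Task names and labels are
# hypothetical: they depend on how the meta-annotation models were trained.
REQUIRED = {"Negation": "Affirmed", "Experiencer": "Patient"}


def patient_diagnoses(result, required=REQUIRED, min_confidence=0.8):
    """Return concept IDs that are affirmed and about the patient.

    Entities with missing meta-annotations are excluded rather than assumed
    to be positive, which is the safe default when stage 2 was skipped.
    """
    kept = []
    for ent in result.get("entities", {}).values():
        meta = ent.get("meta_anns") or {}
        passes = all(
            task in meta
            and meta[task].get("value") == wanted
            and meta[task].get("confidence", 0.0) >= min_confidence
            for task, wanted in required.items()
        )
        if passes:
            kept.append(ent["cui"])
    return kept


# Hand-built example output with placeholder concept IDs (not real codes).
result = {
    "entities": {
        0: {  # "family history of cancer": affirmed, but not the patient
            "cui": "CONCEPT_CANCER",
            "meta_anns": {
                "Negation": {"value": "Affirmed", "confidence": 0.95},
                "Experiencer": {"value": "Family", "confidence": 0.91},
            },
        },
        1: {  # "patient has asthma": affirmed and about the patient
            "cui": "CONCEPT_ASTHMA",
            "meta_anns": {
                "Negation": {"value": "Affirmed", "confidence": 0.97},
                "Experiencer": {"value": "Patient", "confidence": 0.93},
            },
        },
        2: {  # no meta-annotations at all: context unknown
            "cui": "CONCEPT_DIABETES",
            "meta_anns": {},
        },
    }
}

print(patient_diagnoses(result))
```

Without the meta-annotation layer every entity would look like the third one, and a pipeline that kept them all would count the family-history cancer mention as a patient diagnosis.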

Key takeaways

MedCAT extracts + links clinical concepts and adapts to local language self-supervised.
Two stages: NER/linking + meta-annotation (negation, experiencer, temporality).
Source-available licence — great for internal use; review terms for a hosted service.
The locally-adaptive ML option next to cTAKES (auditable) and scispaCy (light).

Check your recall


Active recall beats re-reading — try to answer, then reveal.

What is MedCAT's distinctive trick for messy real-world EHR text?
What are MedCAT's two stages?
