
PII Sanitisation in Legal AI: What Solicitors Need to Know

Understanding how personally identifiable information is protected when using AI tools for medical record analysis. Learn about PII sanitisation techniques, GDPR compliance, encryption standards, and what to look for when evaluating AI platforms for clinical negligence work.

TL;DR

PII sanitisation removes personally identifiable information from medical records before AI processing, using a 4-layer detection pipeline including Microsoft Presidio, spaCy NER, custom regex patterns for UK-specific identifiers like NHS numbers, and contextual validation. MedCase AI protects all data with AES-256-GCM encryption, UK-only hosting, and GDPR-compliant retention policies — ensuring the AI model only ever sees clinical content, never patient identities.

Artificial intelligence is transforming how solicitors handle clinical negligence cases, but adopting AI tools for medical record analysis raises a critical question: how is patient data protected? When you upload thousands of pages of medical records to an AI platform, you need absolute confidence that personally identifiable information (PII) is handled with the same rigour you would apply to physical case files — if not more.

This guide explains what PII sanitisation means in practice, why it matters for legal professionals working with sensitive medical data, and what to look for when evaluating AI tools for your firm.

Why PII Protection Matters in Medical-Legal AI

Medical records are classified as special category data under UK GDPR, carrying the highest level of data protection requirements. A breach involving medical records can result in ICO fines of up to £17.5 million or 4% of annual global turnover, SRA disciplinary proceedings, and civil liability. PII sanitisation ensures that AI models only ever process clinical content — never patient or practitioner identities.

Medical records are among the most sensitive categories of personal data recognised under UK data protection law. They contain not only clinical details about a patient's health, treatment history, and diagnoses, but also a wealth of identifying information that links those details to a specific individual.

For solicitors, the stakes are particularly high. You have a professional duty of confidentiality to your clients, regulatory obligations under the SRA Standards and Regulations, and strict legal requirements under the UK General Data Protection Regulation (UK GDPR) and the Data Protection Act 2018. A data breach involving medical records could result in:

  • Regulatory action from the Information Commissioner's Office (ICO), including fines of up to £17.5 million or 4% of annual global turnover
  • Professional disciplinary proceedings via the SRA
  • Loss of client trust and reputational damage
  • Potential civil liability to affected data subjects

When you introduce AI into your workflow, you are effectively sharing data with a third-party processor. The question is not whether to use AI — the efficiency gains for medical record analysis are too significant to ignore — but whether the platform you choose treats data protection as a core architectural principle rather than an afterthought.

What PII Means in the Context of UK Medical Records

PII in medical records encompasses 6 categories of identifiers: patient identifiers (name, DOB, NHS number), contact details (address, postcode, phone), practitioner identifiers (GMC/NMC numbers), institutional references (Trust names, ward identifiers), relational identifiers (next-of-kin details), and indirect identifiers (age-postcode-condition combinations). These appear across unstructured text in inconsistent formats, making automated multi-method detection essential.

PII in medical records goes well beyond names and addresses. A comprehensive sanitisation system must identify and protect a wide range of identifiers that appear throughout NHS and private healthcare documentation. These include:

  • Patient identifiers: full name, date of birth, NHS number, hospital number, case reference numbers
  • Contact details: home address, postcode, telephone numbers, email addresses
  • Practitioner identifiers: clinician names, GMC numbers, NMC registration numbers, consultant codes
  • Institutional references: GP surgery names, hospital ward names, NHS Trust identifiers
  • Relational identifiers: next-of-kin names, family member details referenced in clinical notes
  • Indirect identifiers: combinations of age, postcode, and rare conditions that could identify an individual even without a name

The challenge is that these identifiers appear in unstructured text — handwritten notes that have been scanned, typed clinical letters, discharge summaries, and GP records — often with inconsistent formatting. An NHS number might appear as 123 456 7890, 1234567890, or embedded within a sentence without any clear label. A robust sanitisation system must catch every variation.
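Handling those format variations is a good illustration of why rule-based detection needs more than a bare pattern. The sketch below (illustrative only, not MedCase AI's implementation) matches the common spacing variants and then verifies the Modulus 11 check digit that all genuine NHS numbers carry:

```python
import re

# Matches 10-digit NHS numbers written as "123 456 7890", "123-456-7890",
# or "1234567890" (illustrative pattern, not a production recogniser).
NHS_PATTERN = re.compile(r"\b(\d{3})[ -]?(\d{3})[ -]?(\d{4})\b")

def is_valid_nhs_number(candidate: str) -> bool:
    """Verify the Modulus 11 check digit used by NHS numbers."""
    digits = [int(d) for d in re.sub(r"\D", "", candidate)]
    if len(digits) != 10:
        return False
    # Weights 10 down to 2 are applied to the first nine digits.
    total = sum(d * w for d, w in zip(digits[:9], range(10, 1, -1)))
    check = 11 - (total % 11)
    if check == 11:
        check = 0
    # A computed check digit of 10 means the number is invalid.
    return check != 10 and check == digits[9]

def find_nhs_numbers(text: str) -> list[str]:
    """Return normalised (digits-only) NHS numbers that pass the checksum."""
    return ["".join(m.groups()) for m in NHS_PATTERN.finditer(text)
            if is_valid_nhs_number("".join(m.groups()))]
```

The checksum step doubles as built-in validation: an arbitrary 10-digit string (an invoice number, say) almost never passes the Modulus 11 check, so false positives drop sharply without any manual review.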

How PII Sanitisation Works

PII sanitisation detects and replaces personally identifiable information with labelled placeholders (e.g. [PATIENT_1], [NHS_NUMBER_1]) before any AI model processes the text. Three core technologies work together: Named Entity Recognition (NER) using medical-trained NLP models, pattern matching with regular expressions targeting structured identifiers like 10-digit NHS numbers, and enterprise PII engines such as Microsoft Presidio that combine multiple detection strategies with confidence scoring.

PII sanitisation (sometimes called de-identification or anonymisation) is the process of detecting and removing or replacing personally identifiable information before data is processed by an AI model. The goal is straightforward: the AI should only ever see the clinical content it needs to perform its analysis, never the identity of the patient or practitioners involved.

Named Entity Recognition (NER)

NER is a natural language processing technique that identifies entities within text — names, locations, organisations, dates — and classifies them by type. Modern NER models trained on medical text can distinguish between a person's name and a drug name, or between a place name and a medical condition, with high accuracy. However, NER alone is not sufficient for the level of protection medical-legal data requires.

Pattern Matching with Regular Expressions

Regex patterns target structured identifiers that follow known formats: NHS numbers (10-digit sequences validated with a Modulus 11 check digit), UK postcodes, telephone numbers, email addresses, and dates of birth. These rule-based systems are highly reliable for catching formatted data that NER models might occasionally miss.

Enterprise PII Engines

Dedicated PII detection engines such as Microsoft Presidio combine multiple detection strategies into a single pipeline. These engines use pre-trained recognisers for dozens of PII categories, support custom recognisers for domain-specific patterns (such as GMC numbers), and provide confidence scoring to flag uncertain detections for review.

Why One Detection System Is Not Enough

No single PII detection method achieves the reliability required for medical-legal data. A 4-layer approach combines an enterprise PII engine (Microsoft Presidio), NLP-based NER (spaCy with medical models), custom regex for UK-specific identifiers (NHS numbers, GMC numbers, UK postcodes), and contextual validation against medical terminology databases. Each layer catches what others miss, reducing the risk of any identifier reaching the AI model.

No single PII detection method is perfect. NER models can miss unusual name spellings or confuse medical terminology with personal names. Regex patterns cannot catch identifiers that deviate from expected formats. Enterprise engines, while comprehensive, may not include recognisers for every UK-specific identifier type out of the box.

This is why a multi-layer approach is essential. By combining several detection systems in sequence, each layer catches what the others might miss:

| Detection Layer | Technology | Primary Targets | Strength | Limitation |
| --- | --- | --- | --- | --- |
| Layer 1 | Enterprise PII Engine (Microsoft Presidio) | Names, addresses, dates, common identifiers | Broad coverage across 30+ PII categories | May miss UK-specific formats |
| Layer 2 | NLP-based NER (spaCy with medical models) | Names vs clinical terms in complex sentences | Contextual understanding of medical text | Can miss unusual name spellings |
| Layer 3 | Custom regex and rule-based patterns | NHS numbers, GMC/NMC numbers, UK postcodes | Highly reliable for formatted data | Cannot handle unstructured identifiers |
| Layer 4 | Contextual validation | Ambiguous entities, false-positive reduction | Cross-references medical terminology databases | Depends on quality of reference data |
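The layered approach can be sketched as a simple detector pipeline in which each layer only contributes spans that earlier, higher-precision layers have not already claimed. The toy regex layers below stand in for Presidio and spaCy; all names are illustrative:

```python
import re
from typing import Callable

# Each layer is a function returning (start, end, entity_type) spans.
Detector = Callable[[str], list[tuple[int, int, str]]]

def regex_layer(pattern: str, entity_type: str) -> Detector:
    """Build a rule-based detection layer from a single regex."""
    compiled = re.compile(pattern)
    def detect(text: str) -> list[tuple[int, int, str]]:
        return [(m.start(), m.end(), entity_type) for m in compiled.finditer(text)]
    return detect

def run_pipeline(text: str, layers: list[Detector]) -> list[tuple[int, int, str]]:
    """Run every layer in order; keep a span unless an earlier layer covers it."""
    found: list[tuple[int, int, str]] = []
    for layer in layers:
        for start, end, label in layer(text):
            overlaps = any(start < e and end > s for s, e, _ in found)
            if not overlaps:
                found.append((start, end, label))
    return sorted(found)

# Illustrative layers only: a real pipeline would chain Presidio, spaCy NER,
# UK-specific rules, and contextual validation rather than two toy regexes.
layers = [
    regex_layer(r"\b\d{3}[ -]?\d{3}[ -]?\d{4}\b", "NHS_NUMBER"),
    regex_layer(r"\b[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}\b", "POSTCODE"),
]
```

Ordering the layers by precision means the most reliable detector wins any overlap; production systems typically also merge per-layer confidence scores rather than using a simple first-wins rule.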

At MedCase AI, this multi-layer pipeline is a foundational part of the platform architecture. Every document passes through all detection layers before any AI analysis begins, ensuring that the large language model never has access to identifiable patient or practitioner data.

What the AI Actually Sees After Sanitisation

After sanitisation, the AI receives text with all 18+ identifier types replaced by labelled placeholders such as [PATIENT_1], [PRACTITIONER_1], [NHS_NUMBER_1], and [HOSPITAL_1]. All clinical terminology — symptoms, diagnoses, procedures, drug names, dosages — is preserved. Consistent placeholder labels maintain document coherence across the entire record, and original identifiers are mapped back only in the final report delivered to the solicitor.

After PII sanitisation, the AI model receives text where all identifying information has been replaced with labelled placeholders. A clinical letter that originally read:

Dear Dr James Thornton, I reviewed Mrs Sarah Mitchell (DOB: 14/03/1965, NHS: 432 891 0567) at Royal Hampshire County Hospital on 12 January 2024 regarding her ongoing complaints of lower back pain following the procedure performed on 03/11/2023.

would be sanitised to something like:

Dear [PRACTITIONER_1], I reviewed [PATIENT_1] (DOB: [DATE_1], NHS: [NHS_NUMBER_1]) at [HOSPITAL_1] on [DATE_2] regarding her ongoing complaints of lower back pain following the procedure performed on [DATE_3].

The critical point is that all clinical terminology is preserved. The AI can still analyse the medical content — symptoms, diagnoses, procedures, treatment timelines, drug names, dosages, and clinical decision-making — without ever knowing who the patient or clinician is. The labelled placeholders maintain document coherence so the AI understands that [PRACTITIONER_1] refers to the same person throughout the record.

When results are returned to the solicitor, the original identifiers can be mapped back in for the final report, meaning you receive a fully contextualised analysis without the AI ever having processed the raw personal data.
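The consistent-placeholder and mapping-back behaviour described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not MedCase AI's implementation; the class and method names are hypothetical:

```python
class PlaceholderMapper:
    """Assigns consistent placeholders per entity type and remembers the mapping."""

    def __init__(self) -> None:
        self.forward: dict[str, str] = {}   # original value -> placeholder
        self.reverse: dict[str, str] = {}   # placeholder -> original value
        self.counters: dict[str, int] = {}  # next index per entity type

    def placeholder_for(self, entity_type: str, value: str) -> str:
        """Return the same placeholder every time the same value appears."""
        if value not in self.forward:
            self.counters[entity_type] = self.counters.get(entity_type, 0) + 1
            tag = f"[{entity_type}_{self.counters[entity_type]}]"
            self.forward[value] = tag
            self.reverse[tag] = value
        return self.forward[value]

    def restore(self, text: str) -> str:
        """Map placeholders back to original identifiers for the final report."""
        for tag, value in self.reverse.items():
            text = text.replace(tag, value)
        return text
```

Because the same value always maps to the same tag, the model can track [PATIENT_1] coherently across thousands of pages, and the reverse map is applied only when the solicitor's report is generated — the mapping never travels to the AI model.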

Encryption Standards: Protecting Data at Rest and in Transit

MedCase AI uses AES-256-GCM encryption (256-bit keys in Galois/Counter Mode) for all data at rest, TLS 1.3 for all data in transit, and per-record unique nonces that prevent pattern analysis across records. Encryption keys are stored separately from encrypted data, rotated regularly, and managed through hardware security modules (HSMs). This multi-level encryption protects uploaded documents, extracted text, analysis results, and audit logs.

PII sanitisation addresses what the AI model sees, but encryption protects the data everywhere else in the system. For medical-legal data, the encryption standard that meets regulatory expectations is AES-256-GCM (Advanced Encryption Standard with 256-bit keys in Galois/Counter Mode).

Why AES-256-GCM Specifically?

  • 256-bit key length: Provides a level of security that is computationally infeasible to break with current or foreseeable technology. Even known quantum attacks (Grover's algorithm) would at most halve the effective key strength, leaving the equivalent of 128-bit security, which remains well beyond practical reach.
  • Galois/Counter Mode (GCM): An authenticated encryption mode that simultaneously encrypts data and generates an authentication tag. This means any tampering with encrypted data is detected automatically — the system knows if even a single bit has been altered.
  • Per-record nonces: Each individual record is encrypted with a unique nonce (number used once), ensuring that identical plaintext produces different ciphertext. This prevents pattern analysis across records.
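These three properties fit together naturally in code. The sketch below uses the widely deployed Python `cryptography` package (an assumption about tooling, not a statement about MedCase AI's stack) to show a fresh 96-bit nonce per record and authenticated decryption that fails loudly on tampering:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_record(key: bytes, plaintext: bytes, record_id: str) -> tuple[bytes, bytes]:
    """Encrypt one record with a fresh 96-bit nonce; bind the record ID as AAD."""
    nonce = os.urandom(12)                      # unique per record, never reused
    ciphertext = AESGCM(key).encrypt(nonce, plaintext, record_id.encode())
    return nonce, ciphertext                    # ciphertext includes the GCM auth tag

def decrypt_record(key: bytes, nonce: bytes, ciphertext: bytes, record_id: str) -> bytes:
    # Raises InvalidTag if the ciphertext, nonce, or bound record ID was altered.
    return AESGCM(key).decrypt(nonce, ciphertext, record_id.encode())
```

Binding the record identifier as associated data (AAD) is a common hardening step: an attacker cannot swap ciphertexts between records even with the same key, because the authentication tag covers the identifier as well as the content.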

A well-architected system applies encryption at multiple levels:

  • In transit: TLS 1.3 encryption for all data moving between your browser and the platform, and between internal services.
  • At rest: AES-256-GCM encryption for all stored data, including uploaded documents, extracted text, analysis results, and audit logs.
  • Key management: Encryption keys stored separately from encrypted data, rotated regularly, and managed through hardware security modules (HSMs) or equivalent cloud key management services.

You can learn more about the specific security measures used in our platform on the MedCase AI privacy page.

GDPR Compliance Requirements for AI Processing

Processing medical records with AI engages UK GDPR Article 6 (lawful basis), Article 9 (special category conditions for health data), and data minimisation principles. For clinical negligence solicitors, the most common lawful bases are legitimate interests under Article 6(1)(f) and the legal claims condition under Article 9(2)(f). PII sanitisation directly supports the data minimisation principle by ensuring only clinical content reaches the AI model.

Using AI to process medical records engages several provisions of the UK GDPR. Solicitors should understand these requirements both for their own compliance obligations and to evaluate whether an AI provider meets them.

Lawful Basis for Processing

Processing medical data with AI requires a lawful basis under Article 6 of the UK GDPR and, because health data is a special category, a condition under Article 9. For solicitors handling clinical negligence claims, the most common bases are:

  • Legitimate interests (Article 6(1)(f)): Processing is necessary for the legitimate interests of the client in pursuing their legal claim, balanced against the data subject's rights.
  • Legal claims (Article 9(2)(f)): Processing is necessary for the establishment, exercise, or defence of legal claims.
  • Explicit consent (Article 9(2)(a)): The client has given explicit consent for their medical records to be processed using AI tools as part of case preparation.

Data Minimisation

The UK GDPR requires that personal data processed is adequate, relevant, and limited to what is necessary. PII sanitisation directly supports this principle by ensuring the AI model only receives the minimum data required to perform its function — the clinical content — while stripping away unnecessary personal identifiers.

Data Retention and Right to Erasure

Any AI platform processing medical data should have clear policies on:

  • Retention periods: How long uploaded documents and analysis results are stored, with automatic deletion after a defined period.
  • Right to erasure: The ability for data subjects (or solicitors acting on their behalf) to request deletion of all data associated with a specific case.
  • Deletion verification: Confirmation that data is permanently deleted from all systems, including backups, within a reasonable timeframe.

Consent Tracking and Audit Trails

Firms should maintain records of the lawful basis relied upon for each case, any client consent obtained for AI processing, and a clear audit trail of what data was uploaded, when, and by whom. A good AI platform provides this audit functionality as a built-in feature rather than leaving it to the firm to manage manually.
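An audit entry of the kind described above needs surprisingly little: who uploaded what, when, and a fingerprint of the exact file. A minimal sketch, with hypothetical field names, using only the Python standard library:

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditEntry:
    """One immutable audit-trail record for an uploaded document (fields illustrative)."""
    case_ref: str
    uploaded_by: str
    document_sha256: str
    uploaded_at: str

def record_upload(case_ref: str, user: str, document: bytes) -> AuditEntry:
    # Hashing the document proves later exactly which file was uploaded,
    # without the audit log itself retaining any document content.
    return AuditEntry(
        case_ref=case_ref,
        uploaded_by=user,
        document_sha256=hashlib.sha256(document).hexdigest(),
        uploaded_at=datetime.now(timezone.utc).isoformat(),
    )
```

Storing a hash rather than the document keeps the audit trail useful for SAR and deletion-verification purposes while ensuring the log itself never becomes a second copy of special category data.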

UK Data Hosting and ICO Registration

MedCase AI hosts all data exclusively within the United Kingdom, eliminating cross-border transfer concerns under UK GDPR. The platform is registered with the ICO as a data processor — a legal requirement for any organisation processing personal data. UK-only hosting, combined with ICO registration and a published Data Processing Agreement (DPA), provides solicitors with the regulatory clarity needed for confident adoption.

Where your data is physically stored and processed matters. Under the UK GDPR, transferring personal data outside the UK requires additional safeguards unless the destination country has received an adequacy decision from the UK government.

For solicitors handling sensitive medical records, the simplest and most protective approach is to use a platform that hosts all data within the United Kingdom. This eliminates cross-border transfer concerns entirely and provides clarity for clients and regulators about where their data resides.

You should also verify that any AI provider is registered with the ICO as a data processor. ICO registration is a legal requirement for organisations processing personal data, and its absence is a significant red flag. The registration number should be readily available — typically published on the provider's website and verifiable through the ICO's public register.

Questions Solicitors Should Ask When Evaluating AI Tools

Solicitors evaluating AI tools for medical record analysis should seek answers across 4 areas: PII and data protection (sanitisation methods, encryption standards, per-record nonces), GDPR compliance (ICO registration, UK hosting, retention policies, DPA availability), architecture and security (SOC 2 Type II certification, data training policies, access controls), and practical considerations (audit trails, case isolation, SAR support). Any reputable provider should answer these transparently.

Before adopting any AI platform for medical record analysis, solicitors should seek clear answers to the following questions. Use this as a practical checklist during your evaluation process:

PII and Data Protection

  • Does the platform sanitise PII before AI processing? What methods are used?
  • Is a multi-layer detection approach employed, or does the system rely on a single method?
  • Can you demonstrate what the AI model actually sees after sanitisation?
  • What encryption standard is used for data at rest and in transit?
  • Are per-record encryption nonces used to prevent pattern analysis?

GDPR and Regulatory Compliance

  • Is the platform registered with the ICO? What is the registration number?
  • Where is data hosted? Is all processing and storage within the UK?
  • What is the data retention policy? Can data be deleted on request?
  • Is there a Data Processing Agreement (DPA) available for review?
  • Has the provider completed a Data Protection Impact Assessment (DPIA)?

Architecture and Security

  • Is the AI model fine-tuned on, or does it retain, any client data?
  • Are uploaded documents used to train or improve the AI model?
  • What access controls and authentication mechanisms are in place?
  • Is there SOC 2 Type II certification or equivalent security accreditation?
  • What is the incident response process in the event of a data breach?

Practical Considerations

  • Does the platform provide audit trails for uploaded documents and analyses?
  • Can individual case data be isolated and deleted independently?
  • What support is available if you need to respond to a Subject Access Request (SAR)?
  • Is the platform designed specifically for UK legal and medical contexts?

Any reputable provider should be able to answer these questions transparently. Vague responses or an unwillingness to discuss data protection architecture in detail should give you pause.


Taking a Confident Step Forward

PII sanitisation is the foundational requirement for trustworthy legal AI, not an optional feature. The combination of 4-layer PII detection, AES-256-GCM encryption, UK-only data hosting, and GDPR-compliant data handling represents the minimum standard that medical-legal AI tools should meet. Anything less introduces unnecessary risk to clients, firms, and the individuals whose records are entrusted to solicitors' care.

PII sanitisation is not a peripheral feature — it is the foundation upon which trustworthy legal AI is built. For solicitors handling clinical negligence cases, understanding how a platform protects patient data is just as important as evaluating its analytical capabilities.

The combination of multi-layer PII detection, AES-256-GCM encryption, UK data hosting, and GDPR-compliant data handling practices represents the standard that medical-legal AI tools should meet. Anything less introduces unnecessary risk to your clients, your firm, and the individuals whose records you are trusted to protect.

If you would like to see how MedCase AI approaches PII sanitisation and data protection in practice, explore our features overview or review our privacy and security documentation. We are always happy to discuss our architecture in detail with firms evaluating AI tools for clinical negligence work — book a demo to learn more.

Ready to Transform Your Case Preparation?

See how MedCase AI analyses medical records against clinical protocols in minutes.