|Updated March 23, 2026

Split-Knowledge Architecture: Separating Identity from AI Processing

No single system component should hold both identity information and AI-derived insights. This is the core invariant of split-knowledge architecture — and it changes how you think about privacy, security, and compliance.

Tags: split-knowledge, architecture, privacy, pseudonymisation, security

The Core Invariant

No single sandbox holds sufficient information to both identify an individual and reason about their behaviour, preferences, or attributes. Every design decision flows from this constraint.

Sandbox A knows WHO: real name, email, national ID, account numbers, billing details, physical address, biometric references. Sandbox B knows WHAT: behavioural patterns, AI model outputs, preference vectors, risk scores, recommendations, classification labels, interaction logs. They share no network path. The ID Bridge — a separate, isolated component — is the only system that can correlate the two, using opaque pseudonymous tokens.

This is not access control. Access control is a policy that can be misconfigured, bypassed, or socially engineered. Split-knowledge is infrastructure-level enforcement: Kubernetes NetworkPolicies and Docker network segmentation prevent Sandbox A and Sandbox B from communicating directly. The AI model in Sandbox B literally cannot query the identity database because no route exists.
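The "no route exists" enforcement described above can be sketched as a Kubernetes NetworkPolicy. This is a minimal illustration, not the project's actual manifest: the namespace names (dsa-ai, dsa-edge) are borrowed from the service names in this article, and a real cluster would also need an egress rule for DNS.

```yaml
# Default-deny isolation for Sandbox B: only the Gateway may talk to it.
# Namespace names are illustrative assumptions based on the service names
# used in this article (dsa-ai, dsa-edge).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sandbox-b-isolation
  namespace: dsa-ai
spec:
  podSelector: {}                  # applies to every pod in Sandbox B
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: dsa-edge   # Gateway only
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: dsa-edge   # responses via Gateway only
```

With an empty podSelector and both policy types listed, everything not explicitly allowed is dropped: there is simply no path from dsa-ai to dsa-identity to misconfigure.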

Data Flow: Eight Steps from Request to Response

A user submits a request — "Show me my recommendations." Here is what happens:

  1. The request hits the Gateway (dsa-edge), the only externally facing service.
  2. The Gateway strips identity data and forwards an authentication request to Sandbox A (dsa-identity).
  3. Sandbox A authenticates the user and issues an opaque token (e.g., tok_a7f2x9k4) via the ID Bridge (dsa-bridge). The token has no derivable relationship to the real identity — it is generated via HMAC-SHA256(master_key, identity_id || purpose || rotation_epoch).
  4. The Gateway forwards the request to Sandbox B (dsa-ai) with only the token and a context payload. The payload contains behavioural attributes. Never identity.
  5. Sandbox B runs AI inference on the token and context. The model cannot determine that this is "Marc from Fuerth" — it sees tok_a7f2x9k4 and a set of behavioural features.
  6. Sandbox B returns the result tagged to the token.
  7. The Gateway uses the ID Bridge to map the token back to the user's session.
  8. The user sees a personalised result.
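Step 3's token derivation can be sketched in a few lines. Only the HMAC-SHA256 construction and its three inputs come from the protocol above; the separator, the tok_ prefix, and the base32 encoding are illustrative assumptions.

```python
import base64
import hashlib
import hmac

def issue_token(master_key: bytes, identity_id: str, purpose: str,
                rotation_epoch: int) -> str:
    """Derive an opaque token per the article's construction:
    HMAC-SHA256(master_key, identity_id || purpose || rotation_epoch).

    The "|" separator and the tok_ prefix/encoding are assumptions;
    the text specifies only the HMAC-SHA256 inputs.
    """
    message = f"{identity_id}|{purpose}|{rotation_epoch}".encode()
    digest = hmac.new(master_key, message, hashlib.sha256).digest()
    # Truncate and base32-encode for a compact, opaque identifier.
    return "tok_" + base64.b32encode(digest[:8]).decode().rstrip("=").lower()

key = b"demo-master-key"  # in production: HSM-protected, never in source
t1 = issue_token(key, "user-4711", "recommendations", rotation_epoch=12)
t2 = issue_token(key, "user-4711", "recommendations", rotation_epoch=13)
assert t1 != t2  # rotating the epoch invalidates old correlations
```

Because the derivation is keyed, the token has no relationship to the identity that an attacker without the master key could compute, yet the Bridge can regenerate it deterministically for lookups.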

The Gateway is the only service that sees both the user's identity and the inference token — but it never forwards both to the same downstream service. This is enforced in middleware: PII stripping runs before any inference request leaves the Gateway.
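A sketch of that middleware invariant, with a hypothetical field list drawn from the Sandbox A categories earlier in this article; the function and field names are assumptions, the rule (no identity field leaves the Gateway towards Sandbox B) is from the text:

```python
# Hypothetical Gateway middleware: every outbound inference request passes
# through strip_pii() before it can leave for Sandbox B.
IDENTITY_FIELDS = {  # illustrative; mirrors the Sandbox A data categories
    "name", "email", "national_id", "account_number",
    "billing_details", "address", "biometric_ref",
}

def strip_pii(payload: dict) -> dict:
    """Return a copy of the payload with every identity field removed."""
    return {k: v for k, v in payload.items() if k not in IDENTITY_FIELDS}

request = {
    "token": "tok_a7f2x9k4",
    "email": "user@example.com",          # must never reach Sandbox B
    "recent_categories": ["books", "cycling"],
}
inference_request = strip_pii(request)
assert "email" not in inference_request
assert inference_request["token"] == "tok_a7f2x9k4"
```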

Token Design: Five Properties That Matter

Tokens are the connective tissue of the architecture. They must satisfy five properties simultaneously:

  1. Opaque: the token value itself reveals nothing about the person behind it.
  2. Non-derivable: without the Bridge's master key, neither the identity can be computed from the token nor the token from the identity.
  3. Purpose-bound: because the purpose is an input to the derivation, tokens issued for different purposes cannot be correlated with each other.
  4. Rotatable: incrementing the rotation epoch invalidates every existing token-to-identity correlation at once.
  5. Revocable: the Bridge can withdraw an individual mapping without touching Sandbox A or Sandbox B.

Data Classification Matrix

| Data Category | Sandbox | Examples | Retention |
| --- | --- | --- | --- |
| Direct identifiers | A only | Name, email, national ID, account number | As required by law or contract |
| Quasi-identifiers | A only | Date of birth, postcode, gender, job title | Must not leak to B |
| Behavioural raw data | B only | Clickstreams, transaction amounts (without account refs), sensor readings | Purpose-limited retention |
| AI model inputs | B only | Feature vectors, embeddings, prompt context | Ephemeral or short-term |
| AI model outputs | B only | Scores, predictions, generated text, classifications | Logged for audit, not linked to A |
| Pseudonymous tokens | Bridge | Opaque IDs mapping A to B | Rotatable, revocable |
| Audit logs | Both + Bridge | Access records, re-linkage events | Immutable, long-term |

Quasi-identifiers deserve attention. Date of birth, postcode, gender, and profession — individually harmless — can enable statistical re-identification when combined. Sandbox B enforces k-anonymity on ingestion: if a combination of quasi-identifiers describes fewer than five individuals (k<5), attributes are generalised (age becomes age bracket, postcode becomes region) or suppressed entirely.
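The ingestion check could look like the following sketch. The k<5 threshold and the generalise-or-suppress rule come from the text; the bracket sizes, the specific quasi-identifier set, and the single-pass design are assumptions (a production check would iterate until every group reaches k, suppressing records that never do):

```python
from collections import Counter

K = 5  # minimum group size required at ingestion

def generalise(record: dict) -> dict:
    """Coarsen quasi-identifiers: exact age -> age bracket, postcode -> region."""
    out = dict(record)
    decade = (record["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"            # e.g. 41 -> "40-49"
    out["postcode"] = record["postcode"][:2] + "xxx"  # keep region prefix only
    return out

def enforce_k_anonymity(records: list[dict]) -> list[dict]:
    """Generalise any record whose quasi-identifier combination occurs < K times."""
    key = lambda r: (r["age"], r["postcode"], r["gender"])
    counts = Counter(key(r) for r in records)
    return [r if counts[key(r)] >= K else generalise(r) for r in records]
```

The point of the check is that Sandbox B never stores a quasi-identifier combination precise enough to single out an individual, even though it holds no direct identifiers at all.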

Breach Scenario Analysis: What an Attacker Gets

Breach of Sandbox A: The attacker obtains identity data — names, emails, account numbers. Damaging, but they get zero behavioural data, zero AI-derived insights, zero risk scores. They know WHO your customers are but nothing about WHAT they do or how the AI characterises them. Standard identity breach response applies.

Breach of Sandbox B: The attacker obtains behavioural patterns, model outputs, risk scores — all tagged to opaque tokens. They know WHAT happened but not WHO it happened to. A transaction risk score of 0.87 attached to tok_a7f2x9k4 is useless without the Bridge mapping. Under GDPR Art. 33-34, the breach risk assessment changes materially: where pseudonymisation and other technical measures render the data unintelligible to the attacker, Art. 34(3)(a) can exempt the controller from notifying individuals.

Breach of the ID Bridge: The worst case. The attacker obtains the mapping table and can correlate Sandbox A identities with Sandbox B tokens. Immediate response: rotate all tokens, invalidating old correlations. The Bridge's HSM-protected encryption, multi-party key management, and network isolation (no internet-facing access, no direct connections from A or B) make this the hardest component to reach.
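Why epoch rotation neutralises a stolen mapping table can be shown with a self-contained toy derivation (restating the HMAC construction from earlier; key and user names are illustrative):

```python
import hashlib
import hmac

def token(master_key: bytes, identity_id: str, purpose: str, epoch: int) -> str:
    """Toy version of the article's keyed token derivation."""
    msg = f"{identity_id}|{purpose}|{epoch}".encode()
    return hmac.new(master_key, msg, hashlib.sha256).hexdigest()[:16]

# Attacker exfiltrates the Bridge's mapping table at epoch 41...
stolen = {token(b"k", u, "risk", 41): u for u in ("user-1", "user-2")}
# ...the Bridge responds by bumping the epoch; live tokens no longer match.
fresh = {token(b"k", u, "risk", 42) for u in ("user-1", "user-2")}
assert not fresh & set(stolen)  # old mapping correlates nothing current
```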

Contrast with monolithic architecture: A single breach yields everything — names, emails, transaction histories, AI risk scores, behavioural profiles, recommendations — fully linked, fully identifiable. Every breach is worst-case.

A Growing Market for Infrastructure-Level Privacy

The demand for this approach is not theoretical. Gartner projects the AI governance platform market at $492M in 2026, surpassing $1B by 2030. In a 2025 survey of 360 organisations, 52% cited sensitive data exposure as their #1 AI risk. The confidential computing market — hardware-enforced data protection during processing — is projected at $59.4B by 2028.

The infrastructure-level approach is attracting serious investment. Companies building hardware-enforced and network-enforced data isolation are raising significant rounds, including a $90M Series C for a confidential computing leader with 150+ enterprise customers across banking, government, and healthcare. The market consensus from cloud security teams and academic literature: redaction and infrastructure isolation are complementary layers. Choosing only one leaves gaps.

Split-knowledge architecture combines both: application-level PII stripping at the Gateway, infrastructure-level network isolation between sandboxes, and cryptographic tokenisation through the Bridge. This layered approach is what industry research recommends — and what regulations are increasingly demanding as EU AI Act Article 10 enforcement begins August 2, 2026.

Monolithic vs. Split-Knowledge: The Structural Difference

In a traditional monolithic AI system, the user database, feature store, model serving layer, and application logic share a network. An application server that renders a recommendation can query both the user table (identity) and the model output table (AI insights) in the same request. This is convenient for developers. It is also the reason that every data breach in a monolithic system exposes fully linked profiles.

Split-knowledge eliminates this by construction. The recommendation model in Sandbox B returns a result for tok_a7f2x9k4. The Gateway maps that token to a session. The user sees their recommendation. At no point did a single service hold both the user's name and the model's output in memory simultaneously — except the Gateway, which enforces PII stripping before forwarding.

This matters for GDPR Art. 25 (Data Protection by Design and Default). A monolithic system can claim DPDD through policy. Split-knowledge architecture *is* the DPDD implementation. The architecture is the proof.

Re-Linkage: The Controlled Exception

Sometimes identity and AI output must be joined — a fraud investigator needs to know who tok_a7f2x9k4 is. The re-linkage protocol handles this:

  1. An authorised human explicitly requests re-linkage, stating a legal basis (fraud investigation case FR-2026-0412).
  2. The Bridge's policy engine verifies: the analyst holds the fraud_investigator role, the case exists in the case management system, the token is flagged with a risk score above threshold, and DPO approval exists.
  3. The Bridge returns identity information with a 30-minute expiry.
  4. An immutable audit log entry records who requested, why, which tokens, the approval chain, the timestamp, and the duration.

Re-linkage is not a backdoor. It is an audited, time-limited, purpose-bound, multi-party-approved process. It exists because legitimate use cases (regulatory reporting, DSAR responses, fraud investigation) require it. The architecture makes re-linkage the deliberate exception, not the default state.