Split-Knowledge Architecture: Separating Identity from AI Processing
No single system component should hold both identity information and AI-derived insights. This is the core invariant of split-knowledge architecture — and it changes how you think about privacy, security, and compliance.
The Core Invariant
No single sandbox holds sufficient information to both identify an individual and reason about their behaviour, preferences, or attributes. Every design decision flows from this constraint.
Sandbox A knows WHO: real name, email, national ID, account numbers, billing details, physical address, biometric references. Sandbox B knows WHAT: behavioural patterns, AI model outputs, preference vectors, risk scores, recommendations, classification labels, interaction logs. They share no network path. The ID Bridge — a separate, isolated component — is the only system that can correlate the two, using opaque pseudonymous tokens.
This is not access control. Access control is a policy that can be misconfigured, bypassed, or socially engineered. Split-knowledge is infrastructure-level enforcement: Kubernetes NetworkPolicies and Docker network segmentation prevent Sandbox A and Sandbox B from communicating directly. The AI model in Sandbox B literally cannot query the identity database because no route exists.
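What "no route exists" means can be made concrete with a Kubernetes NetworkPolicy. The following is a minimal sketch, assuming the sandboxes run in namespaces named after the services (dsa-identity, dsa-ai, dsa-edge); the namespace and label names are illustrative, not taken from a real deployment:

```yaml
# Illustrative default-deny ingress policy for the AI sandbox namespace.
# Pods in dsa-ai accept traffic only from the Gateway namespace (dsa-edge);
# no rule admits dsa-identity, so no route from Sandbox A to Sandbox B exists.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: dsa-ai-isolation
  namespace: dsa-ai
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes: [Ingress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: dsa-edge
```

A mirror-image policy on dsa-identity would close the path from the other side as well.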
Data Flow: Eight Steps from Request to Response
A user submits a request — "Show me my recommendations." Here is what happens:
- The request hits the Gateway (dsa-edge), the only externally facing service.
- The Gateway strips identity data and forwards an authentication request to Sandbox A (dsa-identity).
- Sandbox A authenticates the user and issues an opaque token (e.g., tok_a7f2x9k4) via the ID Bridge (dsa-bridge). The token has no derivable relationship to the real identity; it is generated via HMAC-SHA256(master_key, identity_id || purpose || rotation_epoch).
- The Gateway forwards the request to Sandbox B (dsa-ai) with only the token and a context payload. The payload contains behavioural attributes. Never identity.
- Sandbox B runs AI inference on the token and context. The model cannot determine that this is "Marc from Fuerth"; it sees tok_a7f2x9k4 and a set of behavioural features.
- Sandbox B returns the result tagged to the token.
- The Gateway uses the ID Bridge to map the token back to the user's session.
- The user sees a personalised result.
The Gateway is the only service that sees both the user's identity and the inference token — but it never forwards both to the same downstream service. This is enforced in middleware: PII stripping runs before any inference request leaves the Gateway.
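The PII-stripping step can be sketched as a small Gateway middleware function. This is an illustrative sketch, not the actual Gateway code; the field names in IDENTITY_FIELDS and the payload shape are assumptions:

```python
# Hypothetical sketch of the Gateway's PII-stripping middleware: identity
# fields are removed before any request is forwarded to Sandbox B.
IDENTITY_FIELDS = {"name", "email", "national_id", "account_number",
                   "address", "date_of_birth"}

def strip_pii(request: dict, token: str) -> dict:
    """Build the Sandbox B payload: opaque token plus behavioural context only."""
    context = {k: v for k, v in request.items() if k not in IDENTITY_FIELDS}
    return {"token": token, "context": context}

inference_request = strip_pii(
    {"name": "Marc", "email": "marc@example.com", "recent_views": [101, 202]},
    token="tok_a7f2x9k4",
)
# inference_request now carries only the token and the behavioural context;
# Sandbox B never receives a field from IDENTITY_FIELDS.
```

Because this runs in the Gateway, it is the last point where identity and behavioural data coexist in one process.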
Token Design: Five Properties That Matter
Tokens are the connective tissue of the architecture. They must satisfy five properties simultaneously:
- Opaque: A token carries zero information about the underlying identity. No encoded names, no sequential IDs, no timestamps. Given tok_a7f2x9k4, you learn nothing about who it represents.
- Non-reversible without the Bridge: Deriving the real identity from a token requires access to the Bridge's mapping table, which is encrypted at rest with HSM-protected keys and requires quorum access (2-of-3 key custodians) for administrative operations.
- Rotatable: Tokens rotate on a schedule (daily, weekly) or on demand. After rotation, old tokens are invalidated. This limits the window for correlation attacks — an attacker who captures a token today cannot use it to correlate data captured next week.
- Scoped: Different tokens for different purposes. The same person gets one token for recommendations (tok_a7f2x9k4, purpose: reco) and a different token for fraud detection (tok_m2p8q1r7, purpose: fraud). No mathematical relationship exists between them. This prevents cross-purpose profiling within Sandbox B.
- Revocable: When a user exercises their right to erasure (GDPR Art. 17), all tokens for that identity are revoked in the Bridge. Sandbox B data becomes orphaned: behavioural records tagged to tokens that resolve to nothing.
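Most of these properties fall directly out of the HMAC construction. A minimal Python sketch, using a hypothetical derive_token helper and an in-memory key (a real deployment would keep the master key inside the HSM):

```python
import hashlib
import hmac

def derive_token(master_key: bytes, identity_id: str, purpose: str,
                 rotation_epoch: int) -> str:
    # HMAC-SHA256(master_key, identity_id || purpose || rotation_epoch):
    # opaque (no identity content), non-reversible without the key,
    # scoped per purpose, and rotatable via the epoch counter.
    msg = f"{identity_id}|{purpose}|{rotation_epoch}".encode()
    digest = hmac.new(master_key, msg, hashlib.sha256).hexdigest()
    return f"tok_{digest[:12]}"

KEY = b"illustrative-master-key"   # real key material lives in the HSM
epoch = 100
reco    = derive_token(KEY, "user-42", "reco", epoch)       # recommendation scope
fraud   = derive_token(KEY, "user-42", "fraud", epoch)      # fraud scope
rotated = derive_token(KEY, "user-42", "reco", epoch + 1)   # after rotation
# Same person, yet reco != fraud (scoped) and reco != rotated (rotatable).
```

Revocability is the one property that lives outside the derivation: it happens in the Bridge's mapping table, where a revoked token simply resolves to nothing.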
Data Classification Matrix
| Data Category | Sandbox | Examples | Retention |
|---|---|---|---|
| Direct identifiers | A only | Name, email, national ID, account number | As required by law or contract |
| Quasi-identifiers | A only | Date of birth, postcode, gender, job title | Must not leak to B |
| Behavioural raw data | B only | Clickstreams, transaction amounts (without account refs), sensor readings | Purpose-limited retention |
| AI model inputs | B only | Feature vectors, embeddings, prompt context | Ephemeral or short-term |
| AI model outputs | B only | Scores, predictions, generated text, classifications | Logged for audit, not linked to A |
| Pseudonymous tokens | Bridge | Opaque IDs mapping A to B | Rotatable, revocable |
| Audit logs | Both + Bridge | Access records, re-linkage events | Immutable, long-term |
Quasi-identifiers deserve attention. Date of birth, postcode, gender, and profession — individually harmless — can enable statistical re-identification when combined. Sandbox B enforces k-anonymity on ingestion: if a combination of quasi-identifiers describes fewer than five individuals (k<5), attributes are generalised (age becomes age bracket, postcode becomes region) or suppressed entirely.
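The ingestion-time check can be sketched as follows. This is an illustrative sketch, assuming hypothetical record fields (dob_year, postcode, gender) and a two-digit postcode prefix as the region; real generalisation rules would be domain-specific:

```python
from collections import Counter

K = 5  # minimum group size: combinations rarer than this are suppressed

def age_bracket(dob_year: int, current_year: int = 2026) -> str:
    """Generalise an exact birth year to a ten-year age bracket."""
    age = current_year - dob_year
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def generalise(records: list[dict]) -> list[dict]:
    """Coarsen quasi-identifiers, then drop combinations shared by fewer than K people."""
    coarse = [{"age": age_bracket(r["dob_year"]),
               "region": r["postcode"][:2],      # postcode -> region prefix
               "gender": r["gender"]} for r in records]
    counts = Counter((c["age"], c["region"], c["gender"]) for c in coarse)
    return [c for c in coarse
            if counts[(c["age"], c["region"], c["gender"])] >= K]
```

Generalisation runs first because it merges rare combinations into larger groups; suppression is the fallback for whatever remains unique.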
Breach Scenario Analysis: What an Attacker Gets
Breach of Sandbox A: The attacker obtains identity data — names, emails, account numbers. Damaging, but they get zero behavioural data, zero AI-derived insights, zero risk scores. They know WHO your customers are but nothing about WHAT they do or how the AI characterises them. Standard identity breach response applies.
Breach of Sandbox B: The attacker obtains behavioural patterns, model outputs, risk scores — all tagged to opaque tokens. They know WHAT happened but not WHO it happened to. A transaction risk score of 0.87 attached to tok_a7f2x9k4 is useless without the Bridge mapping. Under GDPR Art. 33-34, the breach risk assessment changes materially: pseudonymised data with adequate safeguards may not require individual notification (Recital 26).
Breach of the ID Bridge: The worst case. The attacker obtains the mapping table and can correlate Sandbox A identities with Sandbox B tokens. Immediate response: rotate all tokens, invalidating old correlations. The Bridge's HSM-protected encryption, multi-party key management, and network isolation (no internet-facing access, no direct connections from A or B) make this the hardest component to reach.
Contrast with monolithic architecture: A single breach yields everything — names, emails, transaction histories, AI risk scores, behavioural profiles, recommendations — fully linked, fully identifiable. Every breach is worst-case.
A Growing Market for Infrastructure-Level Privacy
The demand for this approach is not theoretical. Gartner projects the AI governance platform market at $492M in 2026, surpassing $1B by 2030. In a 2025 survey of 360 organisations, 52% cited sensitive data exposure as their #1 AI risk. The confidential computing market — hardware-enforced data protection during processing — is projected at $59.4B by 2028.
The infrastructure-level approach is attracting serious investment. Companies building hardware-enforced and network-enforced data isolation are raising significant rounds, including a $90M Series C for a confidential computing leader with 150+ enterprise customers across banking, government, and healthcare. The market consensus from cloud security teams and academic literature: redaction and infrastructure isolation are complementary layers. Choosing only one leaves gaps.
Split-knowledge architecture combines both: application-level PII stripping at the Gateway, infrastructure-level network isolation between sandboxes, and cryptographic tokenisation through the Bridge. This layered approach is what industry research recommends — and what regulations are increasingly demanding as EU AI Act Article 10 enforcement begins August 2, 2026.
Monolithic vs. Split-Knowledge: The Structural Difference
In a traditional monolithic AI system, the user database, feature store, model serving layer, and application logic share a network. An application server that renders a recommendation can query both the user table (identity) and the model output table (AI insights) in the same request. This is convenient for developers. It is also the reason that every data breach in a monolithic system exposes fully linked profiles.
Split-knowledge eliminates this by construction. The recommendation model in Sandbox B returns a result for tok_a7f2x9k4. The Gateway maps that token to a session. The user sees their recommendation. At no point did a single service hold both the user's name and the model's output in memory simultaneously — except the Gateway, which enforces PII stripping before forwarding.
This matters for GDPR Art. 25 (Data Protection by Design and Default). A monolithic system can claim DPDD through policy. Split-knowledge architecture *is* the DPDD implementation. The architecture is the proof.
Re-Linkage: The Controlled Exception
Sometimes identity and AI output must be joined — a fraud investigator needs to know who tok_a7f2x9k4 is. The re-linkage protocol handles this:
- An authorised human explicitly requests re-linkage, stating a legal basis (fraud investigation case FR-2026-0412).
- The Bridge's policy engine verifies: the analyst holds the fraud_investigator role, the case exists in the case management system, the token is flagged with a risk score above threshold, and DPO approval exists.
- The Bridge returns identity information with a 30-minute expiry.
- An immutable audit log entry records who requested, why, which tokens, the approval chain, the timestamp, and the duration.
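The policy checks above can be condensed into a single authorisation function. A sketch under stated assumptions: the function name, argument shapes, and the 0.8 risk threshold are illustrative, not the actual Bridge API:

```python
from datetime import datetime, timedelta, timezone

RISK_THRESHOLD = 0.8  # illustrative cut-off for "risk score above threshold"

def authorise_relinkage(requester_roles: set, case_exists: bool,
                        token_risk: float, dpo_approved: bool):
    """Grant a time-limited re-linkage only if every policy condition holds."""
    if ("fraud_investigator" in requester_roles and case_exists
            and token_risk >= RISK_THRESHOLD and dpo_approved):
        return {"granted_at": datetime.now(timezone.utc),
                "expires_in": timedelta(minutes=30)}   # 30-minute expiry
    return None  # any single missing condition denies the request

grant  = authorise_relinkage({"fraud_investigator"}, True, 0.87, True)
denied = authorise_relinkage({"analyst"}, True, 0.87, True)  # wrong role
```

The conjunction is the point: role, case, risk, and DPO approval must all hold, and the grant itself expires rather than persisting.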
Re-linkage is not a backdoor. It is an audited, time-limited, purpose-bound, multi-party-approved process. It exists because legitimate use cases (regulatory reporting, DSAR responses, fraud investigation) require it. The architecture makes re-linkage the deliberate exception, not the default state.