There is a gap in how most enterprises think about AI inference security — and it sits at the bottom of the stack, in hardware you never see.

The standard security model for cloud AI inference is built on a series of assumptions: that the cloud provider's hypervisor is trustworthy, that co-tenant workloads on shared hardware cannot observe your data, that the firmware running on the servers hasn't been compromised, and that the infrastructure management plane is secure. These assumptions are reasonable for most use cases. They are not reasonable when you're running inference on proprietary models with significant commercial value, on healthcare data subject to HIPAA, on financial data regulated under GLBA, or on workloads where a nation-state adversary has the motivation to target your cloud infrastructure.

Confidential computing is the technology category that directly addresses these assumptions. It does this by changing the threat model itself: rather than requiring you to trust the cloud provider's software stack, it creates a hardware-isolated execution environment where the cloud provider's software — including the hypervisor, the host OS, and infrastructure management tools — is excluded from the trust boundary.

I wrote previously about hardware root-of-trust as the foundation for cloud AI security. Confidential computing is the primary instantiation of that principle in production cloud infrastructure today. This post goes deeper on the specific technology: how it works, what's actually protected, where it fits in a real deployment, and what it doesn't solve by itself.

The Problem with the Default Threat Model

To understand why confidential computing matters, you need to see what's wrong with the default.

In a conventional virtualized AI inference deployment, model weights are encrypted at rest and decrypted into GPU memory when the inference workload loads them. At that point — while the model is performing inference on your data — the weights and the input data are in cleartext in hardware memory that is controlled by the cloud provider's hypervisor. A compromised hypervisor, a malicious insider with hypervisor-level access, or a vulnerability in the virtualization layer could, in principle, read that memory.

This isn't a hypothetical vulnerability class. Hypervisor escape vulnerabilities have been demonstrated and patched repeatedly. Cloud provider infrastructure has been compromised in high-profile incidents. Insider threat is a documented risk in every major data center operator's threat model. And nation-state actors have demonstrated the willingness and capability to target cloud infrastructure at the hardware level — including reported cases of hardware-level implants in supply chains. Hardware trojans in AI accelerators represent a supply chain threat that confidential computing addresses by raising the bar for post-deployment exploitation, even if it cannot prevent trojan insertion at the fabrication stage.

The deeper issue is that the cloud inference stack is extraordinarily complex, and that complexity creates a large attack surface. The software components in a typical cloud AI inference deployment include: the cloud provider's virtualization platform, the GPU driver stack, the container runtime, the model serving framework, and the inference application itself. Each of these is maintained by a different team, updated on a different schedule, and has its own vulnerability history. A compromise at any layer in that stack can potentially access your model weights and inference inputs.

The traditional response to this complexity is to try to secure each layer independently — patch management, access controls, monitoring, least-privilege IAM. This is necessary but insufficient. Layered security is valuable, but when all layers share a common vulnerability — the fact that they all run on hardware managed by a third party — you don't have defense in depth. You have defense in breadth. The distinction matters when the attacker is a sophisticated actor who can compromise a critical shared component.

Confidential computing addresses this by shrinking the trust boundary to a single, hardware-anchored component. Instead of trusting the entire cloud stack, you trust only the CPU and its hardware-enforced isolation capabilities.

How Confidential Computing Works: The Core Mechanics

Confidential computing refers to a set of hardware technologies that create a Trusted Execution Environment (TEE) — an isolated region of CPU memory and, critically, of GPU memory that is encrypted and access-controlled at the hardware level.

The core idea is not new. The ARM TrustZone technology introduced the concept of hardware-isolated secure worlds in mobile processors over a decade ago. What has changed is the application to cloud-server and GPU workloads, and the maturation of the protocols for remote attestation — the mechanism by which a workload can prove to a remote party that it's running inside a genuine TEE.

There are two primary server-class implementations:

AMD SEV-SNP (Secure Encrypted Virtualization — Secure Nested Paging) extends AMD's existing memory encryption architecture (SEV) with strong integrity protection and a Reverse Map Table (RMP) that prevents hypervisor-level memory remapping attacks. In SEV-SNP, each virtual machine's memory is encrypted with a unique key that is generated by the CPU and never exposed to the hypervisor or host OS. The hypervisor can see that a VM is running and can manage its scheduling, but it cannot read or modify the VM's memory contents.

Intel TDX (Trust Domain Extensions) provides similar functionality for Intel Xeon processors. TDX introduces the concept of a Trust Domain (TD), which is an isolated VM that runs with encrypted memory and a protected execution environment. The host OS and hypervisor are outside the TD's trust boundary.

Both technologies include attestation capabilities: the CPU can produce a signed report — using an attestation key burned into the hardware at manufacture — that contains measurements of the software running inside the TEE. A remote party (the workload owner, a key management service, an auditable attestation service) can verify this report to confirm that the workload is running in a genuine, unmodified TEE on genuine hardware.

For AI inference, the critical evolution has been GPU memory encryption. SEV-SNP and TDX protect CPU memory, but AI inference primarily runs on GPUs. Until recently, this meant that while CPU-side data was protected, model weights were exposed once loaded into GPU VRAM. NVIDIA addressed this gap with the confidential computing capabilities in the H100 and newer GPUs: hardware memory encryption on the GPU, with keys managed by the GPU's own hardware root of trust and released only to attested workloads running on attested GPU hardware. This is the combination that makes confidential computing practically relevant for AI inference — CPU TEE plus GPU memory encryption plus remote attestation — and it's available in production today at major cloud providers.

What Is Actually Protected

Confidential computing protects against a specific and important threat, but it's important to be precise about what it does and doesn't cover.

What it protects:

Model weights in memory. During inference, model weights are loaded into GPU VRAM that is encrypted with keys the hypervisor cannot access. A compromised hypervisor cannot extract the weights from memory because it cannot obtain the decryption keys.

Inference input and output data. User queries and the inference responses are similarly protected inside the TEE boundary. This matters for data residency requirements, regulatory compliance, and any case where the inference inputs contain sensitive information that the cloud provider's infrastructure should not be able to observe.

The inference software stack. The attestation measurement covers the software stack inside the TEE — firmware, OS, runtime, and, when the boot chain is measured all the way up, the application. A remote party can verify that the exact inference runtime version it intends to use is what's running, and can refuse to send data to any instance that doesn't attest to the expected configuration.

Integrity of inference results. The TEE can sign inference outputs with a key held inside the hardware boundary. A downstream service that receives an inference result can verify that it was produced inside an attested TEE, by an expected software stack, without modification in transit.
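
To make that last point concrete, here is a minimal sketch of what downstream verification of a TEE-signed inference result could look like, assuming the signing public key has already been obtained from a verified attestation report. The key format and payload layout are illustrative, not a specific provider's API.

```python
# Minimal sketch: accept an inference result only if it verifies against the
# public key extracted from the TEE's attestation report. The payload layout
# and key handling here are illustrative assumptions.
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_inference_result(result_json: bytes, signature: bytes,
                            attested_public_key_bytes: bytes) -> dict:
    """Return the parsed result only if it was signed inside the attested TEE."""
    public_key = Ed25519PublicKey.from_public_bytes(attested_public_key_bytes)
    try:
        public_key.verify(signature, result_json)  # raises if tampered with
    except InvalidSignature:
        raise ValueError("result was not produced by the attested TEE")
    return json.loads(result_json)
```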

What it does not protect:

Side-channel attacks on the hardware itself. This is where the limitations become important. Confidential computing protects the software/hypervisor boundary, but it does not protect against physical side-channel attacks — power analysis, electromagnetic emissions, timing channels — that operate below the level of the software stack. I wrote in detail about side-channel attacks on ML accelerators and why physical-layer defenses are a separate requirement for high-security deployments. Confidential computing and side-channel defenses address different attack surfaces and both are needed in high-threat environments.

Training workloads with gradient leakage risk. Confidential computing protects the confidentiality of data and weights during inference. Training — where gradients are computed across large datasets and weight updates are computed across many steps — has its own set of threats that confidential computing addresses only partially. Federated learning with TEEs is an active research area, but the production story for confidential training is less mature than for inference.

Application-level vulnerabilities. A SQL injection in your inference API, a prompt injection in your LLM pipeline, a model robustness failure under adversarial inputs — these are outside the scope of confidential computing. The TEE protects the hardware boundary. It doesn't make your application code correct or secure.

Inference latency and throughput. Hardware memory encryption and TEE transitions impose measurable performance costs. For some workloads, the performance impact is acceptable. For latency-critical real-time inference, it may not be. The cloud providers have published benchmark data showing 5-15% throughput overhead for some workload types; real-world impact depends heavily on workload characteristics.
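
Because the impact is so workload-dependent, it is worth measuring on your own workload rather than assuming. Below is a small sketch of a latency comparison harness; the endpoint URLs and payload are placeholders, and the assumption is that both instances serve the same model with the same configuration.

```python
# Sketch: compare latency percentiles between a standard instance and a
# confidential-computing instance serving the same model. URLs and payload
# are placeholders for your own deployments.
import statistics
import time
import requests

def measure_latency(endpoint: str, payload: dict, warmup: int = 10, runs: int = 200):
    for _ in range(warmup):                      # discard warm-up requests
        requests.post(endpoint, json=payload, timeout=30)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        requests.post(endpoint, json=payload, timeout=30)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {"p50_ms": statistics.median(samples) * 1000,
            "p95_ms": samples[int(0.95 * len(samples)) - 1] * 1000}

if __name__ == "__main__":
    baseline = measure_latency("https://standard-instance.example/infer", {"prompt": "..."})
    confidential = measure_latency("https://cc-instance.example/infer", {"prompt": "..."})
    print(f"p95 overhead: {confidential['p95_ms'] / baseline['p95_ms'] - 1:.1%}")
```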

The Attestation Chain: How You Actually Verify What's Running

Attestation is the mechanism that makes confidential computing practically useful, and it's worth understanding in some detail because it's the part most often glossed over in summaries.

At its core, attestation is a cryptographic proof that a piece of hardware is running specific, unmodified software in a specific, unmodified configuration. The proof works as follows:

The hardware (CPU or GPU) computes a cryptographic hash of the software it is about to execute — this happens during boot of the TEE. This measurement is stored in hardware registers that cannot be modified by software. When the TEE starts, it can generate an attestation report: a signed statement, using a hardware-backed key, that includes the measurement values, the hardware identity, and optionally additional claims about the TEE configuration.

This attestation report is sent to an attestation service — either the cloud provider's native service (Nitro Enclaves attestation on AWS, Microsoft Azure Attestation, Confidential Space on Google Cloud) or a custom service you operate. The attestation service verifies the signature, checks the measurements against an expected value you provide, and issues an attestation token if they match. Your workload — or your key management infrastructure — only releases sensitive material (model weights, decryption keys, sensitive input data) if the attestation token is valid.
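
The policy decision at the heart of that step can be sketched compactly. The fragment below checks a parsed launch measurement against a pinned expected value; report parsing and vendor signature verification differ between SEV-SNP, TDX, and GPU attestation and are assumed to have already succeeded before this check runs.

```python
# Sketch of the verifier's measurement policy: is the measured software one of
# the builds we approved? Report parsing and vendor signature checks are
# assumed to have happened upstream; the values shown are illustrative.
import hmac
from dataclasses import dataclass

@dataclass
class ParsedReport:
    image_label: str        # e.g. "inference-runtime-v1.4.2"
    measurement_hex: str    # launch measurement extracted from the verified report

# Pinned measurements produced by your build pipeline (values are illustrative).
EXPECTED_MEASUREMENTS = {
    "inference-runtime-v1.4.2": "9f2c6a...",
}

def measurement_is_approved(report: ParsedReport) -> bool:
    expected = EXPECTED_MEASUREMENTS.get(report.image_label)
    if expected is None:
        return False
    # Constant-time comparison of the launch measurement against the pinned value.
    return hmac.compare_digest(report.measurement_hex, expected)
```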

For AI inference, this means: your model weights can be stored encrypted in cloud object storage. The decryption key is held by an HSM or a key management service that requires a valid attestation token before releasing the key. When a confidential computing instance boots and your inference workload starts, it requests attestation. The attestation service verifies that the instance is running the expected inference runtime, in an expected TEE configuration, on expected hardware. Only then does it release the decryption key to load the model weights. A compromised hypervisor or a misconfigured instance never gets the key, because it can never produce a valid attestation.
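
On the workload side, the sequencing can be sketched as follows. The attestation and key-release endpoints are placeholders standing in for your provider's or your own services, and the decryption step uses AES-GCM from the cryptography package; treat this as an outline of the flow, not a production client.

```python
# Sketch of the workload-side flow: exchange the hardware attestation report
# for a token, exchange the token for the model key, and only then decrypt
# the weights. Endpoints and JSON shapes are placeholders.
import requests
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

ATTESTATION_SERVICE = "https://attest.example.internal/verify"   # placeholder
KEY_RELEASE_SERVICE = "https://kms.example.internal/release"     # placeholder

def fetch_model_key(raw_attestation_report: bytes) -> bytes:
    # 1. Exchange the hardware attestation report for a signed attestation token.
    attest_resp = requests.post(ATTESTATION_SERVICE, data=raw_attestation_report, timeout=10)
    attest_resp.raise_for_status()
    token = attest_resp.json()["token"]
    # 2. Present the token to the key-release service, which verifies it and
    #    returns the model key only for approved measurements on approved hardware.
    release_resp = requests.post(KEY_RELEASE_SERVICE, json={"attestation_token": token}, timeout=10)
    release_resp.raise_for_status()
    return bytes.fromhex(release_resp.json()["model_key_hex"])

def decrypt_weights(ciphertext: bytes, nonce: bytes, key: bytes) -> bytes:
    # AES-256-GCM: decryption fails loudly if the ciphertext was tampered with.
    return AESGCM(key).decrypt(nonce, ciphertext, associated_data=None)
```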

This is the architecture that converts the security property of confidential computing from theoretical to practical: not merely "the hypervisor can't read my memory," but "my keys won't be released to an instance that isn't properly configured and attested."

Building this attestation chain correctly is non-trivial. The expected measurement values must be computed from your exact software build, stored securely, and updated every time you update the inference runtime. The attestation service must be operated or configured with appropriate security properties. The key management system must be integrated with the attestation workflow. These are engineering challenges that require expertise in both cloud infrastructure and hardware security — they're not insurmountable, but they're also not click-to-deploy.
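
The build-time half of that work is largely bookkeeping: compute the measurement for each release and pin it where the verifier can find it. The sketch below illustrates that bookkeeping with a file-based registry; the measurement itself should come from the TEE vendor's tooling for your image format, and the SHA-384 of the image file here is only a stand-in for that step.

```python
# Sketch of measurement bookkeeping in a build pipeline: record the expected
# measurement per release tag and refuse silent changes. The hash shown here
# is a stand-in for the vendor-defined launch measurement computation.
import hashlib
import json
import pathlib

MEASUREMENT_REGISTRY = pathlib.Path("expected_measurements.json")

def record_expected_measurement(image_path: str, release_tag: str) -> None:
    digest = hashlib.sha384(pathlib.Path(image_path).read_bytes()).hexdigest()
    registry = json.loads(MEASUREMENT_REGISTRY.read_text()) if MEASUREMENT_REGISTRY.exists() else {}
    if registry.get(release_tag) not in (None, digest):
        # A changed measurement for an existing tag should never happen silently.
        raise RuntimeError(f"measurement for {release_tag} changed unexpectedly")
    registry[release_tag] = digest
    MEASUREMENT_REGISTRY.write_text(json.dumps(registry, indent=2))
```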

Deployment Patterns: Where Confidential Computing Fits

There are several deployment patterns for confidential computing in AI inference, each with different security/cost/complexity tradeoffs.

Confidential inference with dedicated GPU instances. The highest-security pattern uses dedicated GPU instances that support confidential computing — for example, Azure confidential GPU VMs with NVIDIA H100s running in confidential computing mode, with comparable offerings emerging from other major providers. The entire inference workload — CPU-side serving, GPU inference, data handling — runs inside a TEE. This pattern is appropriate for workloads where both model weight confidentiality and input data confidentiality are required, and where the workload characteristics fit within the dedicated instance's capacity. The cost premium over shared instances is significant.

Confidential computing for key management only. A more incremental pattern keeps the inference workload on standard infrastructure but uses confidential computing for the key management infrastructure that protects model weights. Weights are encrypted in storage, and the keys that protect them are held in a TEE-backed HSM. The inference instances themselves are not confidential — but the keys that protect the model are. This is a lower-cost approach that addresses the hypervisor-layer threat to keys without requiring dedicated GPU instances. For many organizations, this is the right starting point before committing to full confidential inference.
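
A sketch of the underlying envelope-encryption pattern follows: the weights are encrypted locally under a fresh data key, and only the wrapped form of that data key, produced by the HSM- or TEE-backed KMS, is stored alongside them. The wrap_data_key callable is a placeholder for whatever KMS wrap or encrypt call you actually use; the inference side reverses the steps, unwrapping the key only after attestation succeeds.

```python
# Sketch of envelope encryption for model weights: weights are sealed under a
# fresh data key, and only the KMS-wrapped data key is persisted. wrap_data_key
# is a placeholder for your KMS's wrap/encrypt call.
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_weights(weights: bytes, wrap_data_key) -> dict:
    data_key = AESGCM.generate_key(bit_length=256)   # plaintext key stays in this process only
    nonce = os.urandom(12)
    ciphertext = AESGCM(data_key).encrypt(nonce, weights, associated_data=None)
    return {
        "ciphertext": ciphertext,
        "nonce": nonce,
        # Only the wrapped form of the data key is stored alongside the weights.
        "wrapped_data_key": wrap_data_key(data_key),
    }
```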

Hybrid patterns for multi-party inference. A more sophisticated use case: scenarios where two or more parties want to run inference on combined data without revealing their individual inputs to one another. Confidential computing alone doesn't solve multi-party computation — you'd typically combine it with secure multi-party computation protocols or federated learning techniques. But TEE-based approaches are being explored as a way to reduce the overhead of those protocols, by providing a hardware-isolated environment where the computation can run without exposing any party's data to the others.

The Hardware Root-of-Trust Connection

Confidential computing is not a standalone technology — it's one component of a hardware-anchored security architecture. Its relationship to the broader hardware root-of-trust stack is important to understand.

The root of trust for confidential computing is the hardware itself: the CPU's trusted anchor, the GPU's secure enclave, the HSM's tamper-resistant boundary. These hardware components are trusted because they are designed with specific security properties — tamper resistance, key isolation, secure boot — and because their security claims are verified through Common Criteria certifications, FIPS 140-3 validations, and the cloud provider's own hardware attestation programs.

When you deploy confidential computing for AI inference, you're relying on this entire chain:

The CPU hardware has a root of trust anchored in a dedicated security coprocessor — the AMD Secure Processor for SEV-SNP, or Intel's equivalent on-die security engine for TDX — that is physically isolated from the main CPU cores and runs its own firmware with controlled update processes. These security coprocessors generate and hold the attestation keys and perform the cryptographic operations that underpin the TEE. Their security properties are the foundation of the entire system.

The GPU confidential computing capability adds another component to this chain: the H100's memory encryption engine, its secure boot path, and its attestation interface. The GPU's TEE and the CPU's TEE must be jointly attested — the workload must verify that both the CPU and GPU are in the expected state before releasing sensitive material to either.

The HSM or cloud KMS provides the key management infrastructure that ties everything together: the keys that encrypt model weights at rest, the keys that encrypt data in transit, and the key hierarchy that establishes trust from hardware anchors through to the application layer.
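
The policy implied by this chain can be stated compactly: no key release unless both the CPU TEE evidence and the GPU evidence verify against pinned expectations. The sketch below passes the vendor-specific verification steps in as callables, since those depend on the CPU vendor's and NVIDIA's attestation tooling; it illustrates the gating logic, not a specific SDK.

```python
# Sketch of joint CPU + GPU attestation gating: sensitive material is released
# only when both halves of the platform are in an approved state. The verify
# callables stand in for vendor-specific report verification.
from typing import Callable

def release_model_key(cpu_evidence: bytes,
                      gpu_evidence: bytes,
                      verify_cpu: Callable[[bytes], bool],
                      verify_gpu: Callable[[bytes], bool],
                      fetch_key: Callable[[], bytes]) -> bytes:
    if not verify_cpu(cpu_evidence):
        raise PermissionError("CPU TEE attestation failed; refusing key release")
    if not verify_gpu(gpu_evidence):
        raise PermissionError("GPU attestation failed; refusing key release")
    # Both the CPU TEE and the GPU attested to approved states: release the key.
    return fetch_key()
```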

This is why I described hardware root-of-trust as the foundation — and why the hardware root-of-trust architecture for cloud AI infrastructure matters independently of confidential computing. Confidential computing is the runtime mechanism. The hardware root-of-trust is the principle that makes it trustworthy. Without the tamper resistance, immutable measurement, and secure key storage that hardware root-of-trust provides, the TEE's security guarantees would rest on software assumptions rather than hardware guarantees.

What Organizations Should Do Now

If you're running AI inference on sensitive models in cloud environments and you haven't evaluated confidential computing, the risk assessment question you should be asking is straightforward: what is the consequence if your model weights or inference inputs are exposed through a hypervisor-level compromise or a malicious insider with infrastructure access?

If the answer is "significant," "catastrophic," or "career-ending," confidential computing should be in your threat model. Here's a practical path:

Short term (0-3 months): Evaluate your current deployment's exposure. Do you have model weights in cleartext in cloud GPU memory? Do your inference inputs contain sensitive data that your cloud provider's infrastructure staff could theoretically access? If yes to either, you have a gap that confidential computing addresses. Start by moving to HSM-backed key management for model weights — this is the lower-complexity first step and it directly reduces the most obvious exposure.

Medium term (3-9 months): Evaluate the confidential computing offerings from your cloud provider. AWS Nitro Enclaves, Azure Confidential Computing, and GCP Confidential Space have matured substantially, and the providers have published reference architectures for AI inference. Test a deployment with your specific inference runtime, measure the performance impact, and evaluate the attestation workflow. The operational overhead is real and you should understand it before committing to production.

Long term (9-18 months): Build confidential computing into your standard deployment architecture for sensitive AI workloads. This means integrating attestation into your CI/CD pipeline, establishing the measurement infrastructure to track expected measurements for each build, and operating the attestation verification service. This is the work that converts a security feature into a durable security practice.

The specific timeline should be driven by your threat model, not by industry hype. For most organizations, moving model weight key management to HSM-backed systems is a reasonable immediate step that doesn't require rearchitecting the inference stack. Full confidential computing adoption is warranted when the asset value and threat profile justify the investment and operational complexity.

What should not continue is the current default: running high-value AI inference workloads on infrastructure where the trust model implicitly requires trusting the cloud provider's entire software stack — a stack that is large, complex, frequently updated, and not designed with your threat model in mind.

The hardware isolation is available. The attestation infrastructure is available. The cloud provider support is available. The question is whether the organizations deploying AI on sensitive workloads recognize that the default trust model has a gap — and whether they're willing to close it before an incident forces the issue.

The hardware security research I've been tracking continues to expand the understanding of where these gaps matter most. The ventures I've built in AI infrastructure have given me direct experience with how fast the threat landscape is moving. The integration of hardware security primitives into cloud AI infrastructure is one of the most practically important developments in enterprise AI security right now — and for most organizations, it's still not on the security roadmap.

It should be.