5 architecture decisions behind Apple Intelligence on Nvidia GPUs (2026)

Apple Intelligence now runs on Nvidia Blackwell GPUs inside Google Cloud, governed by Private Cloud Compute. Five design lessons for CTOs.

By Manu Shukla

Founder & Director

June 22, 2026

Summary. On June 8, 2026, Apple confirmed that Apple Intelligence now runs on Nvidia Blackwell B200 GPUs inside Google Cloud, the first time Private Cloud Compute (PCC) has operated outside Apple's own data centers since its 2024 launch. The move pairs three new server models in Apple's third-generation Foundation Models with on-device models of 3 billion and 20 billion parameters, and it roots trust in Nvidia Confidential Computing, Intel CPUs with TDX, and Google's Titan chip. Apple was reported in 2025 to be buying 250 Nvidia NVL72 systems at roughly $4 million each, and Apple's earlier ReDrafter work already showed 2.7x faster token generation on Nvidia GPUs. For CTOs weighing private versus cloud inference, five architecture decisions stand out.

This is not a story about Apple buying GPUs. It is a working example of how to put a frontier model behind a privacy bar that an external researcher can check, while keeping latency and cost under control. Apple held the same five PCC requirements it set in 2024, then re-implemented them on hardware it does not own. That is the exact problem most enterprises face in 2026: you want a capable cloud model, your data cannot be exposed to the operator, and your auditors want evidence rather than assurances.

The sections below read the architecture as a set of decisions, each with a takeaway for your own stack. The facts come from Apple's security blog, Apple's Machine Learning Research, and Nvidia's own engineering posts, all dated and linked.

What Apple actually shipped at WWDC 2026

Apple's third generation of Foundation Models spans device and cloud. On the device, AFM 3 Core is a 3-billion-parameter dense model, and AFM 3 Core Advanced is a 20-billion-parameter model that uses a sparse design, activating only 1 to 4 billion parameters per request, according to Apple Machine Learning Research. In the cloud, AFM 3 Cloud is the server workhorse, a separate model handles image generation, and AFM 3 Cloud Pro runs on Nvidia GPUs hosted in Google Cloud, as 9to5Mac reported from the keynote.

For infrastructure leads, the headline is where the models now run. Apple's security team wrote that the company is "collaborating with Google and NVIDIA to run new Apple Intelligence workloads on Google Cloud, extending our industry-leading PCC privacy commitments to third-party data centers for the first time." Server-side inference uses Nvidia Blackwell GPUs with Confidential Computing, per Nvidia's blog dated June 9, 2026. The reported chip is the B200, and the cloud is Google's, as AppleInsider noted ahead of the show.

Dimension	On-device (AFM 3 Core and AFM 3 Core Advanced)	Private Cloud Compute
Where it runs	iPhone, iPad and Mac silicon	Apple silicon servers and Nvidia GPUs in Google Cloud
Model size	3B dense; 20B sparse, 1-4B active	Larger server models, including AFM 3 Cloud Pro
Best for	Low-latency, offline, routine tasks	Agentic tool-use and complex reasoning
Data handling	Stays on the device	Sent to attested nodes, not stored
Trust mechanism	Secure Enclave, on-device OS	Remote attestation and a transparency log
Cost owner	Bundled in the device	Server GPU time and engineering

Decision 1: Route by sensitivity, not by capability

Apple's first decision was to treat on-device and cloud as one pipeline with a routing boundary, not as two separate products. A component Apple calls the system orchestrator decides what stays on the device and what goes to PCC. Apple's senior vice president of software engineering, Craig Federighi, described that orchestrator as "key to the privacy architecture of our entire system" in the company's WWDC remarks reported by CNBC. Routine and latency-sensitive work runs locally on the 3-billion-parameter model. Only the most demanding tasks, including agentic tool-use and complex reasoning, are sent to the cloud.

The takeaway for your stack is to design the boundary first. Decide which classes of request are allowed to leave the device or the on-premise tier, and make that decision on data sensitivity rather than on which model is most capable. Most teams do the reverse: they send everything to the largest model and bolt on redaction later. Apple inverted that order, and it is the cheaper order, because on-device inference carries no per-request GPU bill.
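
A minimal sketch of such a boundary, assuming a three-level sensitivity scheme and two tiers. It is an illustration of the "sensitivity first, capability second" ordering, not Apple's system orchestrator; the names `Sensitivity`, `Tier`, `Request`, `MAX_CLOUD_SENSITIVITY`, and `route` are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3


class Tier(Enum):
    ON_DEVICE = "on_device"
    ATTESTED_CLOUD = "attested_cloud"


@dataclass(frozen=True)
class Request:
    text: str
    sensitivity: Sensitivity
    needs_large_model: bool  # e.g. agentic tool-use or complex reasoning


# ASSUMPTION: the highest sensitivity class your policy allows off the device.
MAX_CLOUD_SENSITIVITY = Sensitivity.INTERNAL


def route(req: Request) -> Tier:
    """Pick a tier from data sensitivity first, model capability second."""
    if req.sensitivity.value > MAX_CLOUD_SENSITIVITY.value:
        return Tier.ON_DEVICE       # never leaves, however hard the task is
    if req.needs_large_model:
        return Tier.ATTESTED_CLOUD  # allowed out, and worth the GPU time
    return Tier.ON_DEVICE           # default: local, no per-request GPU bill


if __name__ == "__main__":
    print(route(Request("summarise my health notes", Sensitivity.RESTRICTED, True)))
    print(route(Request("plan a multi-step booking", Sensitivity.INTERNAL, True)))
```

The point of the ordering is that the sensitivity check cannot be bypassed by a capability argument: a restricted request stays local even when the larger model would answer it better.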

Decision 2: Root trust in hardware, then attest before sending data

The second decision is where the real engineering sits. Apple did not extend trust to Google or Nvidia as companies. It extended trust to specific, measured hardware states. The new PCC implementation combines Nvidia Confidential Computing on the GPUs, Intel CPUs with TDX, and Google's Titan security chip, as Apple's security team set out in its Expanding Private Cloud Compute post.

Nvidia Confidential Computing provides hardware-rooted trust, encrypted communication paths, and remote attestation, so software can verify the platform state before any sensitive data is released to it. Blackwell is the first GPU generation Nvidia describes as TEE-I/O capable, with inline protection across NVLink, per the Blackwell architecture page. Nvidia's stated result is that confidential mode runs close to unencrypted throughput, which removes the usual reason teams skip it.

For CTOs, the lesson is to require attestation as a precondition, not a feature. A request should only leave your trust boundary after the receiving node has proven, cryptographically, that it is running the exact firmware and software you approved. Confidential computing is now available on H100, H200, B200, and GB200 class hardware, so this pattern is buildable today on rented GPUs, not only inside Apple.
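
The "attest first, send second" rule can be sketched as a gate in front of the transport. This is a simplified illustration, not Nvidia's or Apple's attestation API: the report format, `sign_report`, `is_trusted`, and `send_if_attested` are hypothetical, and an HMAC over a shared key stands in for the asymmetric, hardware-rooted signature a real TEE would produce.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Callable, FrozenSet


@dataclass(frozen=True)
class AttestationReport:
    measurement: str  # hex digest of the firmware and software the node booted
    nonce: bytes      # freshness value chosen by the client
    signature: bytes  # stand-in for a hardware-rooted signature


def sign_report(key: bytes, measurement: str, nonce: bytes) -> bytes:
    # ASSUMPTION: real TEEs sign with a key chained to a vendor root of trust;
    # a shared HMAC key is used here only to keep the sketch self-contained.
    return hmac.new(key, measurement.encode() + nonce, hashlib.sha256).digest()


def is_trusted(report: AttestationReport, key: bytes,
               expected_nonce: bytes, approved: FrozenSet[str]) -> bool:
    expected_sig = sign_report(key, report.measurement, report.nonce)
    return (
        hmac.compare_digest(report.signature, expected_sig)      # authentic
        and hmac.compare_digest(report.nonce, expected_nonce)    # fresh
        and report.measurement in approved                       # approved build
    )


def send_if_attested(payload: bytes, report: AttestationReport, key: bytes,
                     expected_nonce: bytes, approved: FrozenSet[str],
                     transport: Callable[[bytes], object]) -> object:
    """Release the payload only after the node has proven its state."""
    if not is_trusted(report, key, expected_nonce, approved):
        raise PermissionError("node failed attestation; payload withheld")
    return transport(payload)
```

The design choice worth copying is that the payload is an argument to the gate rather than to the transport, so there is no code path that sends data without a passing report.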

Decision 3: Make the privacy claim verifiable, not just stated

Apple's third decision is the one auditors care about. Its five core PCC requirements did not change with the move to Google Cloud: stateless computation, enforceable guarantees, no privileged runtime access, non-targetability, and verifiable transparency. The last one is the differentiator. Apple maintains a cryptographically verifiable, append-only ledger of every Google Cloud machine in the PCC fleet, and it publishes the binaries running in production for public inspection through its Security Bounty program.

Apple also went past a standard confidential-computing deployment. For components that could exfiltrate user data if compromised, its software attestation is rooted in at least two separate roots of trust from independent vendors. Apple devices will only trust PCC software that Apple has cryptographically approved, regardless of whose data center hosts it.

The takeaway is to design for external verification from day one. If your only answer to "prove my data was not retained" is a contract clause, you have a promise, not a guarantee. A tamper-evident log of what ran, plus published measurements an outside party can match, is what turns a privacy statement into something a regulator or customer can test.
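
One simple way to make a log tamper-evident is a hash chain, sketched below. This is an assumption about how such a record could be built, not a description of Apple's ledger; production transparency logs typically use Merkle trees so that inclusion can be proven without replaying the whole log. `GENESIS`, `entry_hash`, `append`, and `verify_chain` are hypothetical names.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash: str, record: Dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    body = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + body).hexdigest()


def append(log: List[Dict], record: Dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev, "hash": entry_hash(prev, record)})


def verify_chain(log: List[Dict]) -> bool:
    """Recompute every link; any edited or removed entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


if __name__ == "__main__":
    log: List[Dict] = []
    append(log, {"node": "gpu-01", "measurement": "aa"})
    append(log, {"node": "gpu-02", "measurement": "aa"})
    print(verify_chain(log))
    log[0]["record"]["measurement"] = "ff"  # simulate tampering
    print(verify_chain(log))
```

An outside party who holds only the latest hash can detect any rewrite of earlier entries, which is the property that separates evidence from assurance.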

Decision 4: Keep inference stateless and non-targetable

The fourth decision shapes the runtime itself. PCC on Google Cloud reuses the architectural patterns Apple built on Apple silicon. Initial network parsing for each request happens in a dedicated process inside its own namespace. Shared inference software is recycled on a short time-to-live, so no session accumulates state. Attested keys live in a separate, dedicated confidential virtual machine that is isolated from external inputs.

Two properties fall out of this. Stateless computation means user data is used only to serve the request and is never stored. Non-targetability means an attacker cannot steer a specific user's request to a compromised machine, because the fleet and its routing are themselves attested. Apple noted that PCC on Google Cloud would ramp toward the full protection set across the summer 2026 preview period, which is a reminder that even Apple stages this kind of rollout rather than flipping it on at once.

For your own systems, the design rule is to make a node forget. Short-lived processes, per-request isolation, and keys held away from the inference path limit the blast radius if a single machine is compromised. It is harder to build than a long-running service, and it is the difference between a breach of one request and a breach of a database.
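
A rough sketch of "make a node forget", assuming an in-process worker object stands in for what would really be a separate, short-lived process or VM. The names `Worker` and `RecyclingPool` and the TTL value are hypothetical, and the uppercase call is only a placeholder for inference.

```python
import time
from typing import Callable


class Worker:
    """Holds no per-user state; request data lives only in local variables."""

    def __init__(self, born_at: float):
        self.born_at = born_at

    def handle(self, prompt: str) -> str:
        scratch = prompt.upper()  # placeholder for real inference
        return scratch            # nothing is written to self or to disk


class RecyclingPool:
    """Replaces the worker once it is older than ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.worker = Worker(clock())
        self.recycled = 0

    def serve(self, prompt: str) -> str:
        now = self.clock()
        if now - self.worker.born_at >= self.ttl:
            self.worker = Worker(now)  # drop the old worker and anything it held
            self.recycled += 1
        return self.worker.handle(prompt)


if __name__ == "__main__":
    pool = RecyclingPool(ttl_seconds=0.05)
    print(pool.serve("first request"))
    time.sleep(0.06)
    print(pool.serve("second request"), "recycled:", pool.recycled)
```

In a real deployment the recycle step would tear down the process and its memory rather than rebind a Python attribute; the sketch only shows where the TTL check sits in the request path.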

Decision 5: Engineer for token economics from the start

The fifth decision is about money and latency, which is why it matters to a CTO signing the cloud bill. Server inference on GPUs is metered, so tokens per second and GPU utilisation are cost levers, not just speed levers. Apple's ReDrafter technique, built with Nvidia and folded into TensorRT-LLM, uses a recurrent draft model with beam search and dynamic tree attention to generate up to 3.5 tokens per step. Apple's own ReDrafter research reported 2.7x more tokens per second than standard auto-regression, with lower GPU use and power draw, and MacRumors covered the collaboration when it launched in December 2024.
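
To show why drafting several tokens per step saves target-model work, here is a toy draft-and-verify loop. It is plain greedy speculative decoding, swapped in for clarity; it is not ReDrafter, which uses a recurrent draft model with beam search and dynamic tree attention. The "models" are hypothetical next-token functions over integers, and the bonus token that full speculative decoding adds after a fully accepted draft is omitted.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Return (sequence, number of target verification steps)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    steps = 0
    while len(seq) < goal:
        # 1. The cheap draft model proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The target checks each position; on a GPU this is one batched pass.
        steps += 1
        for tok in proposal:
            correct = target(seq)
            seq.append(correct)
            if correct != tok:
                break  # first mismatch: keep the target's token, drop the rest
    return seq, steps


def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except after a 6, to force an occasional rejection.
    return 0 if seq[-1] == 6 else (seq[-1] + 1) % 10


if __name__ == "__main__":
    out, steps = speculative_decode(toy_target, toy_draft, [0], n_new=12)
    print(out == greedy_decode(toy_target, [0], 12), "target steps:", steps)
```

Because every emitted token is the target's own choice, the output matches plain greedy decoding; the saving comes from needing fewer target steps for the same number of tokens.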

The unit cost depends on the hardware tier you rent. As of June 2026, a B200 ranged from about $2.12 per GPU-hour on spot capacity to roughly $14.24 on a fully bundled hyperscaler instance, while H100 capacity ran from about $2.69 to $9.98 per GPU-hour, according to a 2026 GPU cloud pricing survey. Those ranges decide whether a feature is viable at your request volume.

GPU	Confidential computing support	Indicative 2026 cloud price
H100	CC mode with remote attestation	$2.69 to $9.98 per GPU-hour
H200	CC mode with remote attestation	mid-range, provider dependent
B200 (Blackwell)	TEE-I/O, encrypted NVLink	$2.12 spot to $14.24 on-demand
GB200 (Blackwell)	CC mode, multi-GPU NVLink	enterprise bundled pricing
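
As a back-of-envelope illustration of how these price ranges interact with throughput, the sketch below converts a GPU-hour price into a cost per million generated tokens. The baseline of 1,000 tokens per second per GPU is a made-up figure for illustration, and applying the 2.7x ReDrafter result as a flat multiplier is an assumption; measured gains depend on model, batch size, and workload.

```python
def cost_per_million_tokens(price_per_gpu_hour: float,
                            tokens_per_second: float) -> float:
    """Dollars per one million generated tokens on a single GPU."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_gpu_hour / tokens_per_hour * 1_000_000


BASELINE_TPS = 1_000.0  # ASSUMPTION: hypothetical sustained tokens/s per GPU
SPEEDUP = 2.7           # ReDrafter figure cited above, applied as a flat factor

if __name__ == "__main__":
    for label, price in [("B200 spot", 2.12), ("B200 bundled", 14.24)]:
        base = cost_per_million_tokens(price, BASELINE_TPS)
        fast = cost_per_million_tokens(price, BASELINE_TPS * SPEEDUP)
        print(f"{label}: ${base:.2f} -> ${fast:.2f} per million tokens")
```

Swapping in your own measured throughput and negotiated price is the useful exercise: the spread between the spot and bundled figures is larger than the gain from any single decoding optimisation.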

The takeaway is to treat speculative decoding, batching, and model routing as first-class budget items. The fastest way to cut an inference bill is usually to send fewer requests to the largest model, which loops straight back to Decision 1.

The five decisions in one view

Decision	What Apple did	What it means for your stack
1. Route by sensitivity	On-device first, cloud only for hard tasks	Design a routing boundary, not one model
2. Root trust in hardware	Nvidia CC, Intel TDX, Google Titan	Require attestation before data leaves
3. Make privacy verifiable	Append-only ledger, published binaries	Demand audit evidence, not promises
4. Keep inference stateless	No privileged access, short-TTL recycling	Architect for no retention and isolation
5. Engineer token economics	ReDrafter speculative decoding on GPUs	Budget GPU-hours, optimise tokens per second

India-specific considerations

For Indian teams, the same five decisions intersect with the Digital Personal Data Protection Act 2023 (DPDP). The routing decision in particular maps onto data-residency planning: classifying which requests may go to a cloud region and which must stay on-device or in-country is a DPDP design question before it is a performance one. The attestation and transparency patterns also help with the Act's accountability expectations, because they produce evidence a Data Protection Officer can show rather than a vendor claim to repeat.

On cost, GPU-hour pricing is quoted in dollars, but the budget lands in rupees, and a feature that runs at $2.12 per GPU-hour on spot capacity behaves very differently from one pinned to a roughly $14.24 on-demand instance once you convert and multiply by request volume. We design applications aligned with DPDP requirements and build the routing layer so that sensitive Indian user data can be kept on-device or in an approved region by default. For a broader view of building these systems, see our guide to generative AI enterprise strategy.

What this changes for buyers and builders

Two years ago, Apple said Apple Intelligence would run only on Apple silicon. In 2026 it runs frontier workloads on someone else's GPUs in someone else's data center, and it kept its privacy bar by moving the trust anchor into attested hardware rather than into a company name. That is the reusable idea. You do not need to own the metal to make a strong privacy claim, but you do need hardware-rooted attestation, a verifiable record of what ran, and a runtime that forgets.

The harder truth for builders is that the privacy architecture, not the model, is now the expensive part. Renting a Blackwell GPU is a purchase order. Designing stateless, attested, non-targetable inference that an external researcher can verify is an engineering programme. Apple staged its own rollout across a summer preview for that reason.

How eCorpIT can help

eCorpIT (eCorp Information Technologies Private Limited) is a Gurugram-based, CMMI Level 5 technology organisation with senior engineering teams that design private and hybrid AI inference architectures. We help CTOs set the on-device versus cloud routing boundary, add hardware attestation and audit logging, and build applications aligned with DPDP requirements, working across cloud platforms including AWS, Microsoft, and Google. Learn more about us, or contact our team to review your private-AI design.

References

Apple Security Research, Expanding Private Cloud Compute, June 8, 2026.

Nvidia Blog, NVIDIA Confidential Computing to Help Expand Apple's Private Cloud Compute, June 9, 2026.

Apple Machine Learning Research, Introducing the Third Generation of Apple's Foundation Models, 2026.

Apple Security Research, Private Cloud Compute: A new frontier for AI privacy in the cloud, 2024.

Apple Machine Learning Research, Accelerating LLM Inference on NVIDIA GPUs with ReDrafter, 2024.

Nvidia Technical Blog, TensorRT-LLM Now Supports Recurrent Drafting, 2024.

MacRumors, Apple Teams Up With NVIDIA to Speed Up AI Language Models, December 20, 2024.

9to5Mac, Apple's third-generation Foundation Models explained, June 11, 2026.

AppleInsider, Revamped Siri will tap Nvidia chips for fast, private cloud computing, June 4, 2026.

Nvidia, Blackwell architecture, 2026.

CNBC, Apple partnering with Google and Nvidia for most advanced AI model, June 8, 2026.

Spheron, GPU Cloud Pricing 2026, 2026.

Spheron, Confidential GPU Computing on Cloud: NVIDIA TEE and Encrypted VRAM, 2026.

_Last updated: June 22, 2026._

Frequently asked

01 What did Apple announce about Nvidia GPUs in 2026?

On June 8, 2026, Apple said it is running new Apple Intelligence workloads on Nvidia Blackwell GPUs hosted in Google Cloud, under Private Cloud Compute. It is the first time PCC has run outside Apple's own data centers, and server-side inference uses Nvidia Confidential Computing for hardware-level privacy.

02 What is Apple Private Cloud Compute?

Private Cloud Compute is Apple's system for running cloud AI requests under device-grade privacy. Introduced in 2024, it enforces five requirements: stateless computation, enforceable guarantees, no privileged runtime access, non-targetability, and verifiable transparency. User data is used only for the request, is never stored, and the guarantees are externally checkable.

03 Which models run on-device versus in the cloud?

Apple's on-device models are AFM 3 Core, a 3-billion-parameter dense model, and AFM 3 Core Advanced, a 20-billion-parameter sparse model that activates 1 to 4 billion parameters per request. Cloud models include AFM 3 Cloud and AFM 3 Cloud Pro, which runs on Nvidia GPUs in Google Cloud.

04 How does confidential computing protect data on shared GPUs?

Nvidia Confidential Computing isolates a workload in a trusted execution environment, encrypts data in use, and supports remote attestation. Software verifies the hardware and firmware state before releasing sensitive data. Blackwell adds TEE-I/O and encrypted NVLink, so confidential mode runs near unencrypted throughput, removing the usual performance penalty.

05 What is ReDrafter and why does it matter for cost?

ReDrafter is Apple's speculative decoding method, built with Nvidia and integrated into TensorRT-LLM. It uses a recurrent draft model with beam search and dynamic tree attention to generate up to 3.5 tokens per step, and Apple reported 2.7x more tokens per second than standard decoding, which lowers GPU use, power, and the per-request cost of server inference.

06 Can enterprises copy this architecture today?

Yes, in pattern if not in scale. Confidential computing is available on H100, H200, B200, and GB200 class GPUs, so attestation-before-data and stateless inference are buildable on rented hardware. The harder work is the verifiable-transparency layer: an append-only record of what ran and published measurements an external party can match.

07 How does this affect Indian data-protection planning?

Under the Digital Personal Data Protection Act 2023, the routing decision becomes a residency decision. Classifying which requests may reach a cloud region and which must stay on-device or in-country is a compliance design choice. Attestation and transparency logs also produce accountability evidence a Data Protection Officer can present rather than restate.

08 What should a CTO do first?

Define the routing boundary before choosing a model. Decide which request classes may leave your trust boundary based on data sensitivity, require hardware attestation before any data is sent, and budget GPU-hours as a cost lever. Sending fewer requests to the largest model is usually the fastest saving.

About the author

Manu Shukla

Founder & Director

Founder of eCorpIT. Hands-on engineer leading senior-only delivery for AI apps, custom software, and cloud systems for global clients.
