Summary. On June 8, 2026, Apple confirmed that Apple Intelligence now runs on Nvidia Blackwell B200 GPUs inside Google Cloud, the first time Private Cloud Compute (PCC) has operated outside Apple's own data centers since its 2024 launch. The move pairs three new server models in Apple's third-generation Foundation Models with on-device models of 3 billion and 20 billion parameters, and it roots trust in Nvidia Confidential Computing, Intel CPUs with TDX, and Google's Titan chip. Apple was reported in 2025 to be buying 250 Nvidia NVL72 systems at roughly $4 million each, and Apple's earlier ReDrafter work already showed 2.7x faster token generation on Nvidia GPUs. For CTOs weighing private versus cloud inference, five architecture decisions stand out.
This is not a story about Apple buying GPUs. It is a working example of how to put a frontier model behind a privacy bar that an external researcher can check, while keeping latency and cost under control. Apple held the same five PCC requirements it set in 2024, then re-implemented them on hardware it does not own. That is the exact problem most enterprises face in 2026: you want a capable cloud model, your data cannot be exposed to the operator, and your auditors want evidence rather than assurances.
The sections below read the architecture as a set of decisions, each with a takeaway for your own stack. The facts come from Apple's security blog, Apple's Machine Learning Research, and Nvidia's own engineering posts, all dated and listed under References.
What Apple actually shipped at WWDC 2026
Apple's third generation of Foundation Models spans device and cloud. On the device, AFM 3 Core is a 3-billion-parameter dense model, and AFM 3 Core Advanced is a 20-billion-parameter model that uses a sparse design, activating only 1 to 4 billion parameters per request, according to Apple Machine Learning Research. In the cloud, AFM 3 Cloud is the server workhorse, a separate model handles image generation, and AFM 3 Cloud Pro runs on Nvidia GPUs hosted in Google Cloud, as 9to5Mac reported from the keynote.
For infrastructure leads, the hosting change is the headline. Apple's security team wrote that the company is "collaborating with Google and NVIDIA to run new Apple Intelligence workloads on Google Cloud, extending our industry-leading PCC privacy commitments to third-party data centers for the first time." Server-side inference uses Nvidia Blackwell GPUs with Confidential Computing, per Nvidia's blog dated June 9, 2026. The reported chip is the B200, and the cloud is Google's, as AppleInsider noted ahead of the show.
| Dimension | On-device (AFM 3 Core and Core Advanced) | Private Cloud Compute |
|---|---|---|
| Where it runs | iPhone, iPad and Mac silicon | Apple silicon servers and Nvidia GPUs in Google Cloud |
| Model size | 3B dense; 20B sparse, 1-4B active | Larger server models, including AFM 3 Cloud Pro |
| Best for | Low-latency, offline, routine tasks | Agentic tool-use and complex reasoning |
| Data handling | Stays on the device | Sent to attested nodes, not stored |
| Trust mechanism | Secure Enclave, on-device OS | Remote attestation and a transparency log |
| Cost owner | Bundled in the device | Server GPU time and engineering |
Decision 1: Route by sensitivity, not by capability
Apple's first decision was to treat on-device and cloud as one pipeline with a routing boundary, not as two separate products. A component Apple calls the system orchestrator decides what stays on the device and what goes to PCC. Apple's senior vice president of software engineering, Craig Federighi, described that orchestrator as "key to the privacy architecture of our entire system" in the company's WWDC remarks reported by CNBC. Routine and latency-sensitive work runs locally on the 3-billion-parameter model. Only the most demanding tasks, including agentic tool-use and complex reasoning, are sent to the cloud.
The takeaway for your stack is to design the boundary first. Decide which classes of request are allowed to leave the device or the on-premise tier, and make that decision on data sensitivity rather than on which model is most capable. Most teams do the reverse: they send everything to the largest model and bolt on redaction later. Apple inverted that order, and it is the cheaper order, because on-device inference carries no per-request GPU bill.
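A minimal sketch of a sensitivity-first routing boundary follows. The `Sensitivity` classes, the `Tier` names, and the policy itself are illustrative assumptions, not Apple's orchestrator logic; the point is only that the sensitivity check runs before any capability check.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    # Hypothetical data classes; a real scheme would come from your data policy.
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3


class Tier(Enum):
    ON_DEVICE = "on_device"
    ATTESTED_CLOUD = "attested_cloud"


@dataclass(frozen=True)
class Request:
    text: str
    sensitivity: Sensitivity
    needs_large_model: bool  # e.g. agentic tool-use or complex reasoning


def route(request: Request) -> Tier:
    """Decide the tier from data sensitivity first, capability second."""
    # Restricted data never leaves the local tier, whatever the task needs.
    if request.sensitivity is Sensitivity.RESTRICTED:
        return Tier.ON_DEVICE
    # Data that is allowed to leave goes to the cloud only when the task demands it.
    if request.needs_large_model:
        return Tier.ATTESTED_CLOUD
    # Default: routine work stays local and carries no per-request GPU bill.
    return Tier.ON_DEVICE


if __name__ == "__main__":
    print(route(Request("summarise this contract", Sensitivity.RESTRICTED, True)))
    print(route(Request("plan a multi-step task", Sensitivity.INTERNAL, True)))
```

In this sketch a restricted request that needs the large model is still served locally, so the product has to degrade gracefully rather than silently widen the boundary.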
Decision 2: Root trust in hardware, then attest before sending data
The second decision is where the real engineering sits. Apple did not extend trust to Google or Nvidia as companies. It extended trust to specific, measured hardware states. The new PCC implementation combines Nvidia Confidential Computing on the GPUs, Intel CPUs with TDX, and Google's Titan security chip, as Apple's security team set out in its Expanding Private Cloud Compute post.
Nvidia Confidential Computing provides hardware-rooted trust, encrypted communication paths, and remote attestation, so software can verify the platform state before any sensitive data is released to it. Blackwell is the first GPU generation Nvidia describes as TEE-I/O capable, with inline protection across NVLink, per the Blackwell architecture page. The practical result is that confidential mode runs close to unencrypted throughput, which removes the usual reason teams skip it.
For CTOs, the lesson is to require attestation as a precondition, not a feature. A request should only leave your trust boundary after the receiving node has proven, cryptographically, that it is running the exact firmware and software you approved. Confidential computing is now available on H100, H200, B200, and GB200 class hardware, so this pattern is buildable today on rented GPUs, not only inside Apple.
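The attest-before-send rule can be sketched as a gate in front of the network call. Everything here is a simplified assumption: the report format, the `APPROVED_MEASUREMENTS` allowlist, and the use of an HMAC as a stand-in for a hardware-signed quote are all hypothetical. Real confidential-computing attestation verifies vendor-signed evidence against certificate chains, which this sketch does not attempt.

```python
import hashlib
import hmac
import json

# Hypothetical allowlist of approved software measurements (SHA-256 hex digests).
APPROVED_MEASUREMENTS = {
    hashlib.sha256(b"approved-firmware-v1").hexdigest(),
}


def sign_report(report: dict, key: bytes) -> str:
    """Stand-in for a hardware-signed quote (assumption: HMAC instead of a vendor signature)."""
    payload = json.dumps(report, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def node_is_trusted(report: dict, signature: str, key: bytes, nonce: str) -> bool:
    """Accept a node only if its report is authentic, fresh, and on the allowlist."""
    if not hmac.compare_digest(sign_report(report, key), signature):
        return False  # report was not produced by the expected root of trust
    if report.get("nonce") != nonce:
        return False  # stale or replayed report
    return report.get("measurement") in APPROVED_MEASUREMENTS


def send_if_attested(payload: bytes, report: dict, signature: str,
                     key: bytes, nonce: str, send) -> bool:
    """Release the payload only after the receiving node passes attestation."""
    if not node_is_trusted(report, signature, key, nonce):
        return False
    send(payload)
    return True
```

The design choice worth copying is the ordering: the payload is not handed to `send` until verification has returned true, so a failed check cannot leak data as a side effect.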
Decision 3: Make the privacy claim verifiable, not just stated
Apple's third decision is the one auditors care about. Its five core PCC requirements did not change with the move to Google Cloud: stateless computation, enforceable guarantees, no privileged runtime access, non-targetability, and verifiable transparency. The last one is the differentiator. Apple maintains a cryptographically verifiable, append-only ledger of every Google Cloud machine in the PCC fleet, and it publishes the binaries running in production for public inspection through its Security Bounty program.
Apple also went past a standard confidential-computing deployment. For components that could exfiltrate user data if compromised, its software attestation is rooted in at least two separate roots of trust from independent vendors. Apple devices will only trust PCC software that Apple has cryptographically approved, regardless of whose data center hosts it.
The takeaway is to design for external verification from day one. If your only answer to "prove my data was not retained" is a contract clause, you have a promise, not a guarantee. A tamper-evident log of what ran, plus published measurements an outside party can match, is what turns a privacy statement into something a regulator or customer can test.
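One way to make a record of what ran tamper-evident is a hash chain, sketched below. This is an illustration of the general idea, not Apple's ledger format: the record fields are hypothetical, and production transparency logs typically use Merkle trees with signed checkpoints rather than a flat chain.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def entry_hash(prev_hash: str, record: dict) -> str:
    """Bind each record to the hash of the entry before it."""
    payload = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()


def append(log: list, record: dict) -> None:
    """Add a record to the end of the log; earlier entries are never rewritten."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash,
                "hash": entry_hash(prev_hash, record)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited or dropped entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append(log, {"node": "node-a", "measurement": "abc123"})  # hypothetical values
    append(log, {"node": "node-b", "measurement": "def456"})
    print(verify(log))
```

An outside party who holds only the latest hash can detect a rewritten history, which is the property a contract clause cannot provide.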
Decision 4: Keep inference stateless and non-targetable
The fourth decision shapes the runtime itself. PCC on Google Cloud reuses the architectural patterns Apple built on Apple silicon. Initial network parsing for each request happens in a dedicated process inside its own namespace. Shared inference software is recycled on a short time-to-live, so no session accumulates state. Attested keys live in a separate, dedicated confidential virtual machine that is isolated from external inputs.
Two properties fall out of this. Stateless computation means user data is used only to serve the request and is never stored. Non-targetability means an attacker cannot steer a specific user's request to a compromised machine, because the fleet and its routing are themselves attested. Apple noted that PCC on Google Cloud would ramp toward the full protection set across the summer 2026 preview period, which is a reminder that even Apple stages this kind of rollout rather than flipping it on at once.
For your own systems, the design rule is to make a node forget. Short-lived processes, per-request isolation, and keys held away from the inference path limit the blast radius if a single machine is compromised. It is harder to build than a long-running service, and it is the difference between a breach of one request and a breach of a database.
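The short time-to-live idea can be sketched as a pool that discards its worker once it is older than a limit. The `Worker` and `RecyclingPool` names are hypothetical, and a real system would recycle an isolated process or virtual machine rather than a Python object; the sketch only shows the control flow.

```python
import time
from typing import Callable


class Worker:
    """Hypothetical inference worker that keeps no per-user state."""

    def __init__(self, now: float):
        self.created_at = now
        self.served = 0

    def handle(self, prompt: str) -> str:
        self.served += 1
        # The prompt lives only in this call's local scope; nothing is cached on self.
        return prompt.upper()  # placeholder for real inference


class RecyclingPool:
    """Replaces the worker once it exceeds its time-to-live."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.worker = Worker(clock())
        self.recycled = 0

    def handle(self, prompt: str) -> str:
        now = self.clock()
        if now - self.worker.created_at >= self.ttl_seconds:
            # Drop the old worker and start from a clean one.
            self.worker = Worker(now)
            self.recycled += 1
        return self.worker.handle(prompt)
```

Injecting the clock keeps the recycling rule testable without sleeping, and it makes the TTL an explicit, reviewable parameter rather than an operational habit.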
Decision 5: Engineer for token economics from the start
The fifth decision is about money and latency, which is why it matters to a CTO signing the cloud bill. Server inference on GPUs is metered, so tokens per second and GPU utilisation are cost levers, not just speed levers. Apple's ReDrafter technique, built with Nvidia and folded into TensorRT-LLM, uses a recurrent draft model with beam search and dynamic tree attention to generate up to 3.5 tokens per step. Apple's own ReDrafter research reported 2.7x more tokens per second than standard auto-regression, with lower GPU use and power draw, and MacRumors covered the collaboration when it launched in December 2024.
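To show why speculative decoding yields more than one token per step, here is a toy greedy draft-and-verify loop. It is deliberately not ReDrafter: there is no recurrent draft head, beam search, or tree attention, and the "models" are plain functions over integer tokens. It only illustrates the shared principle that a cheap draft proposes several tokens and the target model keeps the ones it agrees with.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a token sequence to the next token


def greedy_decode(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Return the decoded tokens and the number of verification steps used."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < max_new:
        steps += 1
        # 1. The cheap draft model proposes k tokens.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each position (a single batched pass on real hardware).
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if expected != tok:
                accepted.append(expected)  # keep the target's token and stop
                break
            accepted.append(tok)
        else:
            # Every draft token matched, so the target adds one bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new], steps
```

Because every accepted token is one the target would have produced anyway, the output matches plain greedy decoding; the saving comes from needing fewer target steps when the draft is usually right.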
The unit cost depends on the hardware tier you rent. As of June 2026, a B200 ranged from about $2.12 per GPU-hour on spot capacity to roughly $14.24 on a fully bundled hyperscaler instance, while H100 capacity ran from about $2.69 to $9.98 per GPU-hour, according to a 2026 GPU cloud pricing survey. Those ranges decide whether a feature is viable at your request volume.
| GPU | Confidential computing support | Indicative 2026 cloud price |
|---|---|---|
| H100 | CC mode with remote attestation | $2.69 to $9.98 per GPU-hour |
| H200 | CC mode with remote attestation | mid-range, provider dependent |
| B200 (Blackwell) | TEE-I/O, encrypted NVLink | $2.12 (spot) to $14.24 (bundled hyperscaler) per GPU-hour |
| GB200 (Blackwell) | CC mode, multi-GPU NVLink | enterprise bundled pricing |
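The price ranges above translate into a unit cost once a throughput figure is attached. The sketch below does that arithmetic; the 1,000 tokens-per-second throughput is a made-up assumption for illustration, and the 2.7x factor is applied as if the reported ReDrafter speedup carried over unchanged, which it may not on your model and hardware.

```python
def cost_per_million_tokens(price_per_gpu_hour: float,
                            tokens_per_second: float) -> float:
    """Convert a GPU-hour price and a sustained throughput into cost per 1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_gpu_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    assumed_throughput = 1000.0  # tokens/second, hypothetical
    speedup = 2.7                # assumption: reported ReDrafter gain applies as-is
    for label, price in [("B200 spot", 2.12), ("B200 bundled", 14.24)]:
        base = cost_per_million_tokens(price, assumed_throughput)
        fast = cost_per_million_tokens(price, assumed_throughput * speedup)
        print(f"{label}: ${base:.2f} per 1M tokens, ${fast:.2f} with speedup")
```

The formula makes the two levers explicit: the price tier changes the numerator and decoding efficiency changes the denominator, so both belong in the same budget line.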
The takeaway is to treat speculative decoding, batching, and model routing as first-class budget items. The fastest way to cut an inference bill is usually to send fewer requests to the largest model, which loops straight back to Decision 1.
The five decisions in one view
| Decision | What Apple did | What it means for your stack |
|---|---|---|
| 1. Route by sensitivity | On-device first, cloud only for hard tasks | Design a routing boundary, not one model |
| 2. Root trust in hardware | Nvidia CC, Intel TDX, Google Titan | Require attestation before data leaves |
| 3. Make privacy verifiable | Append-only ledger, published binaries | Demand audit evidence, not promises |
| 4. Keep inference stateless | No privileged access, short-TTL recycling | Architect for no retention and isolation |
| 5. Engineer token economics | ReDrafter speculative decoding on GPUs | Budget GPU-hours, optimise tokens per second |
India-specific considerations
For Indian teams, the same five decisions intersect with the Digital Personal Data Protection Act 2023 (DPDP). The routing decision in particular maps onto data-residency planning: classifying which requests may go to a cloud region and which must stay on-device or in-country is a DPDP design question before it is a performance one. The attestation and transparency patterns also help with the Act's accountability expectations, because they produce evidence a Data Protection Officer can show rather than a vendor claim to repeat.
On cost, GPU-hour pricing is quoted in dollars, but the budget lands in rupees. A feature that runs at $2.12 per GPU-hour on spot capacity behaves very differently from one pinned to a bundled hyperscaler instance at roughly $14.24 per GPU-hour once you convert and multiply by request volume. We design applications aligned with DPDP requirements and build the routing layer so that sensitive Indian user data can be kept on-device or in an approved region by default. For a broader view of building these systems, see our guide to generative AI enterprise strategy.
What this changes for buyers and builders
Two years ago, Apple said Apple Intelligence would run only on Apple silicon. In 2026 it runs frontier workloads on someone else's GPUs in someone else's data center, and it kept its privacy bar by moving the trust anchor into attested hardware rather than into a company name. That is the reusable idea. You do not need to own the metal to make a strong privacy claim, but you do need hardware-rooted attestation, a verifiable record of what ran, and a runtime that forgets.
The harder truth for builders is that the privacy architecture, not the model, is now the expensive part. Renting a Blackwell GPU is a purchase order. Designing stateless, attested, non-targetable inference that an external researcher can verify is an engineering programme. Apple staged its own rollout across a summer preview for that reason.
How eCorpIT can help
eCorpIT (eCorp Information Technologies Private Limited) is a Gurugram-based, CMMI Level 5 technology organisation with senior engineering teams that design private and hybrid AI inference architectures. We help CTOs set the on-device versus cloud routing boundary, add hardware attestation and audit logging, and build applications aligned with DPDP requirements, working across cloud platforms including AWS, Microsoft, and Google. Learn more about us, or contact our team to review your private-AI design.
References
- Apple Security Research, Expanding Private Cloud Compute, June 8, 2026.
- Nvidia Blog, NVIDIA Confidential Computing to Help Expand Apple's Private Cloud Compute, June 9, 2026.
- Apple Machine Learning Research, Introducing the Third Generation of Apple's Foundation Models, 2026.
- Apple Security Research, Private Cloud Compute: A new frontier for AI privacy in the cloud, 2024.
- Apple Machine Learning Research, Accelerating LLM Inference on NVIDIA GPUs with ReDrafter, 2024.
- Nvidia Technical Blog, TensorRT-LLM Now Supports Recurrent Drafting, 2024.
- MacRumors, Apple Teams Up With NVIDIA to Speed Up AI Language Models, December 20, 2024.
- 9to5Mac, Apple's third-generation Foundation Models explained, June 11, 2026.
- AppleInsider, Revamped Siri will tap Nvidia chips for fast, private cloud computing, June 4, 2026.
- Nvidia, Blackwell architecture, 2026.
- CNBC, Apple partnering with Google and Nvidia for most advanced AI model, June 8, 2026.
- Spheron, GPU Cloud Pricing 2026, 2026.
_Last updated: June 22, 2026._