Apple's system orchestrator: how 3-tier AI routing works in 2026

Apple's system orchestrator routes each request across on-device, Private Cloud Compute, and third-party cloud models. How 2026 hybrid AI routing works.

Figure: Routing each request between device and cloud.
On this page · 12 sections
  1. What the system orchestrator actually is
  2. The three routing destinations
  3. How the routing decision is made
  4. Why the on-device tier can now carry more
  5. The developer interface: one Swift API
  6. Whose model is it, and why that matters for routing
  7. What engineers can borrow
  8. India-specific considerations
  9. What this changes for builders
  10. FAQ
  11. How eCorpIT can help
  12. References

Summary. Apple Intelligence in 2026 runs behind a system orchestrator, an on-device component that decides whether each request stays local, goes to Private Cloud Compute, or reaches a third-party model. Apple's senior vice president of software engineering, Craig Federighi, called the orchestrator "key to the privacy architecture of our entire system" in Apple's WWDC remarks. The local tier now spans two models: AFM 3 Core at 3 billion parameters, and AFM 3 Core Advanced, a 20-billion-parameter sparse model that activates only 1 to 4 billion parameters per prompt. The cloud tier adds three more, including AFM 3 Cloud Pro, which runs on Nvidia GPUs in Google Cloud, capacity priced at roughly $2.12 per GPU-hour on spot to $14.24 on-demand as of June 2026. For developers, all of it sits behind one Swift API, and the on-device tier is free, with no per-token cost. This piece explains how the routing works and what to copy.

The interesting part for an engineer is not that Apple has a big cloud model. It is that Apple shipped a routing layer that an app developer never has to see, and a device tier strong enough to keep most requests off the network. In blind human tests, reviewers preferred the new on-device text model 45.6% of the time against 23.3% for Apple's 2025 model, per figures Apple published and ofox.ai reproduced from the research post. A stronger local tier changes the routing maths: the more a 3-billion-parameter model can answer well, the less traffic, cost, and exposure you push to the cloud.

This piece reads the orchestrator as a routing system and pulls out the design choices that transfer to any hybrid-inference stack. It builds on the infrastructure side covered in our look at Apple Intelligence on Nvidia GPUs.

What the system orchestrator actually is

The system orchestrator is the part of Apple Intelligence that coordinates a request from start to finish. It reads what the user is asking, gathers the context the request needs, decides which model tier should answer, and routes the result to whatever app or tool acts on it. Apple describes it as coordinating across four capabilities: personal context drawn from the on-device Spotlight semantic index, broad world knowledge that can reach the web through Private Cloud Compute, App Actions that call into installed apps through an app toolbox built on App Intents, and on-screen awareness of what the user is currently looking at, as summarised in The Neuron's WWDC recap.

Two of those capabilities are local by nature. The Spotlight index and on-screen content never need to leave the device. The other two can require more compute or fresh data, which is when the orchestrator considers a cloud tier. The design intent is plain: keep the request on the device unless it genuinely needs more, and make that the default rather than the exception.

The three routing destinations

The orchestrator chooses among three destinations. The first is the on-device tier: the 3-billion-parameter AFM 3 Core for lightweight text and routing, and the 20-billion-parameter AFM 3 Core Advanced for harder local work such as the new Siri, dictation, and image understanding. The second is Private Cloud Compute, where AFM 3 Cloud handles the main cloud workload, an image-generation model sits alongside it, and AFM 3 Cloud Pro handles complex reasoning and agentic tool use. The third is a third-party model, such as the optional ChatGPT integration, which the user is asked to approve.

| Tier | Where it runs | Typical job |
| --- | --- | --- |
| AFM 3 Core | On device | Lightweight text, fast language understanding, routing |
| AFM 3 Core Advanced | On device (20B sparse, 1-4B active) | Siri, dictation, text to speech, image understanding |
| AFM 3 Cloud | Private Cloud Compute | Main cloud text and image-understanding workload |
| AFM 3 Cloud Pro | Nvidia GPUs in Google Cloud (PCC extension) | Complex reasoning and agentic tool use |
| Third-party model | External cloud, with user approval | General-knowledge chat and open-ended queries |

Apple has not published parameter counts for any of the three cloud models, a point ofox.ai makes plainly. Only the two on-device sizes are disclosed. For a routing design, the sizes matter less than the boundaries between tiers, which is where the decision logic sits.

How the routing decision is made

The orchestrator's decision is a short sequence of checks rather than a single model call. It asks whether the on-device model can answer with acceptable quality, whether the task needs reasoning or context beyond the local model's reach, and what the current system load and connectivity allow, a pattern described in Basil AI's breakdown of on-device versus cloud processing. Sensitivity is a further, implicit check: anything tied to the Spotlight index or on-screen content is handled so that personal data stays local by default.

| Signal | What it checks | Routing effect |
| --- | --- | --- |
| Device capability | Can the local model answer well enough | Keep on device when yes |
| Task complexity | Does it need reasoning beyond the local tier | Escalate to Private Cloud Compute |
| Data sensitivity | Does the request touch personal local context | Prefer on-device handling |
| System load | Memory, thermal, and battery headroom | Defer or escalate under pressure |
| Connectivity | Is a reliable network available | Stay local when offline |

The takeaway for your own systems is to make routing a cheap, explicit step, not an afterthought. A small, fast classifier in front of an expensive model is almost always cheaper than sending everything to the largest model and trimming later. Apple put that classifier in the operating system so every app inherits it.
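The five signals in the table can be sketched as a small, explicit routing function. This is a minimal illustration for your own stack, not Apple's implementation: the `Request`, `SystemState`, and `Tier` names, the boolean inputs, and the order of the checks are all assumptions made for the example. In a real system each boolean would come from a cheap classifier or a system probe.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ON_DEVICE = "on_device"
    PRIVATE_CLOUD = "private_cloud"
    THIRD_PARTY = "third_party"
    DEFER = "defer"


@dataclass(frozen=True)
class Request:
    local_quality_ok: bool              # device capability: local answer is good enough
    needs_deep_reasoning: bool          # task complexity: beyond the local tier
    touches_personal_data: bool         # data sensitivity: personal local context
    open_ended_chat: bool = False       # general-knowledge style query
    third_party_approved: bool = False  # user has approved the external model


@dataclass(frozen=True)
class SystemState:
    under_pressure: bool  # system load: memory, thermal, or battery headroom is low
    online: bool          # connectivity: a reliable network is available


def route(req: Request, state: SystemState) -> Tier:
    # Connectivity: without a network the only options are local or later.
    if not state.online:
        return Tier.DEFER if state.under_pressure else Tier.ON_DEVICE

    local_fits = req.local_quality_ok and not req.needs_deep_reasoning
    if local_fits:
        if not state.under_pressure:
            return Tier.ON_DEVICE
        if req.touches_personal_data:
            # Assumed policy: wait for headroom rather than send personal context out.
            return Tier.DEFER

    # Sensitivity gate: personal context never goes to the external model here.
    if req.open_ended_chat and req.third_party_approved and not req.touches_personal_data:
        return Tier.THIRD_PARTY
    return Tier.PRIVATE_CLOUD
```

Because the function is pure and takes only a handful of flags, it is cheap to run on every request and easy to unit test, which is the property the paragraph above argues for.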

Why the on-device tier can now carry more

The reason this routing model works in 2026 is that the local tier got much stronger without a bigger memory footprint. AFM 3 Core Advanced is a 20-billion-parameter model that never activates more than about 4 billion parameters for a given prompt. It does this with a technique Apple Research calls Instruction-Following Pruning, published in a January 2025 paper: a small predictor reads the prompt and chooses which rows and columns of the network to switch on for that request.

The paper's headline result is that a model with 3 billion activated parameters beat a 3-billion-parameter dense baseline by 5 to 8 absolute points on math and coding, and matched the quality of a 9-billion-parameter dense model. In the shipping product, Apple stores the full model in flash, keeps a small set of always-active shared parameters in memory, and pages the selected parts into memory only when the predictor picks them. Routing happens per prompt rather than per token, because moving weights from flash to memory for every token would be too slow.
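The per-prompt paging idea can be shown with a toy model. This is a deliberately simplified sketch, not Instruction-Following Pruning itself: the real predictor is learned and selects rows and columns of the network, whereas the stand-in below picks named blocks by keyword match. The class name, block names, and sizes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PagedSparseModel:
    """Toy model: all blocks live in flash; only some are resident in memory."""
    block_sizes: dict[str, float]  # block name -> parameters in billions, stored in flash
    shared: frozenset[str]         # blocks that are always resident
    budget: float                  # cap on active parameters per prompt, in billions
    resident: set[str] = field(default_factory=set)
    flash_reads: int = 0

    def __post_init__(self) -> None:
        self.resident = set(self.shared)  # shared parameters start in memory

    def select(self, prompt: str) -> set[str]:
        """Stand-in predictor: pick blocks whose name appears in the prompt."""
        words = set(prompt.lower().split())
        active = set(self.shared)
        used = sum(self.block_sizes[b] for b in active)
        for name in sorted(self.block_sizes):
            if name in active or name not in words:
                continue
            if used + self.block_sizes[name] <= self.budget:
                active.add(name)
                used += self.block_sizes[name]
        return active

    def prepare(self, prompt: str) -> float:
        """Route once per prompt, page in missing blocks, return active size."""
        wanted = self.select(prompt)
        self.flash_reads += len(wanted - self.resident)  # one read per newly paged block
        self.resident = wanted                           # unselected blocks are evicted
        return sum(self.block_sizes[b] for b in wanted)


model = PagedSparseModel(
    block_sizes={"shared": 1.0, "code": 1.5, "math": 1.5, "vision": 2.0},
    shared=frozenset({"shared"}),
    budget=4.0,
)
```

Calling `model.prepare(...)` once per prompt and then decoding every token against the same resident set is the point of the design: flash reads scale with the number of prompts, not the number of tokens.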

That detail is the quiet engineering story. A phone now runs a model that behaves like a much larger one on the queries that need it, while spending the compute of a small one on the queries that do not. It raises the bar for what the on-device destination can absorb before the orchestrator reaches for the cloud.

The developer interface: one Swift API

App developers do not call the orchestrator directly. They use the Foundation Models framework, introduced in 2025 as a Swift API that gives any third-party app access to the on-device model with no API key, no network, and no per-token cost. The 2026 update adds image input, so an app can pass a photo to the local model for tasks such as captioning, receipt extraction, or classifying an on-screen element without a cloud round trip. Apple's own sessions, What's new in the Foundation Models framework and Build with the new Apple Foundation Model on Private Cloud Compute, cover the API surface.

The framework is strong at structured output, tool calling, and privacy-sensitive embedded work that must run offline. It is not built to be a general chatbot back end, to answer fresh world-knowledge questions, or to run frontier-tier reasoning over long context. That split is the routing decision restated at the API level.

| Capability | On-device (Foundation Models) | Cloud fallback |
| --- | --- | --- |
| Structured, typed output | Yes, native Swift values | Yes |
| Tool and function calling | Yes | Yes |
| Offline reliability | Yes, no network needed | No |
| Per-request cost | None on device | Metered by tokens or GPU time |
| Fresh world knowledge | No | Yes |
| Frontier reasoning, long context | No | Yes |

Apple also lowered the cost barrier for the cloud tier. Developers in the App Store Small Business Program with fewer than 2 million first-time downloads can use Apple Foundation Models on Private Cloud Compute with no cloud API cost, as byteiota reported. For small teams, that removes the usual reason to avoid a cloud fallback.

Whose model is it, and why that matters for routing

A routing layer is also a trust boundary, so model provenance belongs in the design conversation. Apple is explicit that its cloud models are its own, refined with help from Google's frontier work rather than served by it. Apple AI vice president Amar Subramanya said the models are "all custom builds for Apple Silicon, trained using proprietary data, and refined using outputs from Gemini frontier models," per CNBC. Federighi was blunter about the runtime, telling 9to5Mac that "the amount of the Google Assistant we use is none."

The engineering reading is that Gemini is a teacher signal used in training, not the model answering at runtime. For a routing design, the lesson is that each destination carries its own provenance, cost, and privacy profile, and the orchestrator is where those differences are reconciled. When you build your own version, document what each tier is, who trained it, and what data it may see.
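One lightweight way to keep that documentation honest is to store it as data the router can check. The sketch below assumes a generic three-tier stack of your own; the `TierManifest` fields, the tier names, and the data classes (`personal_context`, `on_screen`, `general`) are illustrative placeholders, not a description of what Apple's tiers are permitted to see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierManifest:
    name: str
    runs_on: str                 # where inference happens
    trained_by: str              # who is accountable for the weights
    allowed_data: frozenset[str]  # data classes this tier may receive
    needs_user_approval: bool = False


# Hypothetical manifests for an in-house stack; replace with your own facts.
MANIFESTS = {
    "local": TierManifest(
        "local", "user device", "in-house",
        frozenset({"personal_context", "on_screen", "general"}),
    ),
    "private_cloud": TierManifest(
        "private_cloud", "company-controlled cloud", "in-house",
        frozenset({"general"}),
    ),
    "third_party": TierManifest(
        "third_party", "external provider", "vendor",
        frozenset({"general"}), needs_user_approval=True,
    ),
}


def permitted_tiers(data_classes: set[str], user_approved: bool) -> list[str]:
    """Return the tiers allowed to see every data class in the request."""
    return [
        name
        for name, manifest in MANIFESTS.items()
        if data_classes <= manifest.allowed_data
        and (user_approved or not manifest.needs_user_approval)
    ]
```

For example, `permitted_tiers({"personal_context"}, user_approved=True)` returns only `["local"]` under these placeholder rules, so the provenance record doubles as an enforcement check.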

What engineers can borrow

Four choices from Apple's design transfer cleanly. First, route by capability and sensitivity before capacity, so the question becomes what cannot run locally rather than how large a model to call. Second, keep the call site identical across tiers, the way the Foundation Models framework lets a prompt move from the on-device model to a cloud provider by changing a dependency, so swapping a provider does not mean rewriting session logic. Third, make the local tier as capable as the hardware allows, because every request it absorbs is one you do not pay to send. Fourth, treat multi-provider sourcing as the default, since even Apple now ships models refined using outputs from Gemini and runs cloud work on Nvidia hardware in Google's data centers.
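The "identical call site" choice can be illustrated with a small interface. This is a generic sketch in Python, not the Foundation Models API: `TextModel`, `LocalModel`, `CloudModel`, and `Session` are hypothetical names, and the two model classes return canned strings in place of real inference.

```python
from typing import Protocol


class TextModel(Protocol):
    """Anything that can answer a prompt; the session depends only on this."""
    def respond(self, prompt: str) -> str: ...


class LocalModel:
    def respond(self, prompt: str) -> str:
        return f"[local] {prompt}"  # placeholder for on-device inference


class CloudModel:
    def respond(self, prompt: str) -> str:
        return f"[cloud] {prompt}"  # placeholder for a network call


class Session:
    """Session logic is written once and never mentions a concrete tier."""
    def __init__(self, model: TextModel) -> None:
        self.model = model
        self.history: list[tuple[str, str]] = []

    def ask(self, prompt: str) -> str:
        reply = self.model.respond(prompt)
        self.history.append((prompt, reply))
        return reply
```

Swapping tiers is then a one-line change at construction time, `Session(LocalModel())` versus `Session(CloudModel())`, while `ask` and the history handling stay untouched.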

India-specific considerations

For Indian teams, the routing boundary is also a compliance boundary under the Digital Personal Data Protection Act 2023 (DPDP). Classifying which request types may reach a cloud region, and which must stay on the device or in an approved location, is a residency decision before it is a performance one, and an on-device-first default makes that easier to defend. Apple's own rollout names English (India) and Hindi among the supported locales arriving over time, so India is in scope for the consumer features even as the EU and mainland China are restricted at launch.

On cost, the cloud tier is billed in dollars but budgeted in rupees, and a workload that runs on a GPU at $2.12 per hour on spot capacity behaves very differently from one pinned near $14.24 on-demand once converted and multiplied by volume. We build applications aligned with DPDP requirements and design the routing layer so sensitive Indian user data can stay on-device or in an approved region by default. For the wider build, see our guide to generative AI enterprise strategy.
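The gap between the two GPU prices quoted above is easier to judge once it is multiplied out. The sketch below uses the article's $2.12 and $14.24 per GPU-hour figures; the monthly volume of 1,000 GPU-hours and the exchange rate of 85 rupees per dollar are placeholders chosen for the arithmetic, not quotes, and should be replaced with your own numbers.

```python
def monthly_gpu_cost_inr(gpu_hours: float, usd_per_gpu_hour: float, inr_per_usd: float) -> float:
    """Convert a GPU-hour bill in dollars into a rupee budget figure."""
    return gpu_hours * usd_per_gpu_hour * inr_per_usd


GPU_HOURS = 1_000    # hypothetical monthly volume
INR_PER_USD = 85.0   # placeholder exchange rate, substitute the current one

spot_inr = monthly_gpu_cost_inr(GPU_HOURS, 2.12, INR_PER_USD)        # low end, spot
on_demand_inr = monthly_gpu_cost_inr(GPU_HOURS, 14.24, INR_PER_USD)  # high end, on-demand
ratio = on_demand_inr / spot_inr

print(f"spot: INR {spot_inr:,.0f}  on-demand: INR {on_demand_inr:,.0f}  ratio: {ratio:.1f}x")
```

Under these placeholder inputs the same workload costs a little under seven times more at the on-demand price, and that ratio holds whatever exchange rate or volume is plugged in.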

What this changes for builders

The build decision in 2026 moves up the stack. With a 20-billion-parameter sparse model on the phone, image input, and a free on-device API, a real slice of in-app AI can stop paying for cloud calls. Frontier work still belongs in the cloud, but the question shifts from how big a model you need to what genuinely cannot run on-device. The orchestrator is Apple's answer to that question, baked into the operating system. Your version can be smaller, but the shape is the same: a cheap router, a strong local default, and a metered cloud tier you reach for only when the request earns it.

FAQ

How eCorpIT can help

eCorpIT (eCorp Information Technologies Private Limited) is a Gurugram-based, CMMI Level 5 technology organisation whose senior engineering teams design hybrid and on-device inference systems. We help teams build the routing layer that decides what runs locally and what escalates to the cloud, keep the call site swappable across providers, and design applications aligned with DPDP requirements across cloud platforms including AWS, Microsoft, and Google. Read more about us, or contact our team to design your routing strategy.

References

  1. Apple Machine Learning Research, Introducing the Third Generation of Apple's Foundation Models, 2026.
  2. ofox.ai, Apple's Third-Generation Foundation Models: A Developer's Read on WWDC 2026, June 9, 2026.
  3. Apple Developer, What's new in the Foundation Models framework (WWDC26), 2026.
  4. Apple Developer, Build with the new Apple Foundation Model on Private Cloud Compute (WWDC26), 2026.
  5. CNBC, Apple partnering with Google and Nvidia for most advanced AI model, June 8, 2026.
  6. 9to5Mac, Craig Federighi details Apple's collaboration with Google for Siri AI, June 8, 2026.
  7. arXiv, Instruction-Following Pruning for Large Language Models, January 2025.
  8. The Neuron, Everything AI Apple announced at WWDC 2026, June 2026.
  9. byteiota, Apple Foundation Models: Free Private Cloud Compute, 2026.
  10. Basil AI, Apple Intelligence and Siri requests: on-device vs cloud, March 8, 2026.
  11. Apple Security Research, Expanding Private Cloud Compute, June 8, 2026.
  12. Spheron, GPU Cloud Pricing 2026, 2026.

_Last updated: June 22, 2026._

Frequently asked

Quick answers.

01 What is Apple's system orchestrator?
It is the on-device part of Apple Intelligence that coordinates each request. It reads the ask, gathers context from sources such as the Spotlight index and on-screen content, decides whether to answer on-device, in Private Cloud Compute, or with a third-party model, and routes the result to the app or tool that acts on it.
02 How does the orchestrator decide on-device versus cloud?
It runs a short set of checks: whether the local model can answer well enough, whether the task needs reasoning beyond the local tier, how much memory and thermal headroom the system has, and whether a network is available. Personal, local context is handled on-device by default, so sensitivity also shapes the route.
03 What are the five Apple Foundation Models in 2026?
Two run on-device: AFM 3 Core at 3 billion parameters and AFM 3 Core Advanced at 20 billion sparse parameters. Three run in Private Cloud Compute: AFM 3 Cloud, an image-generation model, and AFM 3 Cloud Pro for complex reasoning. Apple has not disclosed parameter counts for the cloud models.
04 How does a 20-billion-parameter model run on a phone?
AFM 3 Core Advanced uses Instruction-Following Pruning. A small predictor reads each prompt and activates only 1 to 4 billion parameters for that request. Apple stores the full model in flash and pages the selected parts into memory per prompt, so it behaves like a larger model only when a query needs it.
05 Do developers call the orchestrator directly?
No. Apps use the Foundation Models framework, a Swift API that reaches the on-device model with no API key, no network, and no per-token cost, and added image input in 2026. The same call site can fall back to a cloud model, so swapping providers does not require rewriting the app's session logic.
06 Is Apple running Google Gemini for Apple Intelligence?
No. Apple AI vice president Amar Subramanya said the models are custom builds refined using outputs from Gemini frontier models, and Craig Federighi said the amount of Google Assistant used is none. Gemini acts as a teacher signal during training, not as the model that answers requests at runtime.
07 Is the cloud tier free for developers?
For some. Developers in the App Store Small Business Program with fewer than 2 million first-time App Store downloads can use Apple Foundation Models on Private Cloud Compute with no cloud API cost. Larger apps fall outside that allowance, so they should budget cloud inference as a metered cost.
08 What can engineers copy from this routing design?
Route by capability and sensitivity first, keep the call site identical across tiers so providers are swappable, make the local model as capable as the hardware allows to cut cloud traffic, and treat multi-provider sourcing as the default. A cheap router in front of a strong local default is usually the lowest-cost design.

About the author

Manu Shukla

Founder & Director

Founder of eCorpIT. Hands-on engineer leading senior-only delivery for AI apps, custom software, and cloud systems for global clients.
