Strategic Briefing: Google TPU Landscape (History, Status, and Future)
Google has aggressively accelerated its custom-silicon roadmap to challenge NVIDIA's dominance. The past 12 months marked a pivot from purely internal optimization to aggressive commercial scaling, highlighted by the release of Trillium (TPU v6) and the announcement of Ironwood (TPU v7). For developers, the most critical shift is the unification of the software stack: the vLLM TPU backend breaks down the barrier between JAX and PyTorch. Looking toward 2026, Google aims to evolve from a cloud provider into a global AI utility, with projected deployments exceeding 5 million TPUs by 2027.
- Origins (2015): Google introduced TPU v1 specifically for internal inference workloads (Search, Translate), realizing standard CPUs/GPUs could not sustain the required throughput [cite: 1, 2].
- Training Era (v2 - v3):
- TPU v2 (2017): Introduced training capabilities and High Bandwidth Memory (HBM), enabling the training of large models like BERT [cite: 2, 3].
- TPU v3 (2018): Doubled performance per chip and introduced liquid cooling to pods [cite: 3, 4].
- Scale and Efficiency (v4 - v5):
- TPU v4 (2021): Moved to 7nm process; introduced Optical Circuit Switching (OCS) for dynamic topology reconfiguration [cite: 4, 5].
- TPU v5 (2023): Bifurcated the line into v5e (efficiency/inference) and v5p (performance/training), optimizing cost-to-performance ratios for different workloads [cite: 4, 5].
Note: This section covers developments from late 2024 through late 2025.
- Trillium (TPU v6) Deployment:
- Performance: Delivers 4.7x the peak compute performance of TPU v5e and 67% better energy efficiency [cite: 6, 7].
- Specs: Doubled HBM capacity and Interchip Interconnect (ICI) bandwidth. Features 3rd-gen SparseCore accelerators for embedding-heavy workloads (ranking/recommendation) [cite: 6, 8].
- Scale: Scales to 256 chips per pod, with multislice technology connecting tens of thousands of chips [cite: 6].
- Ironwood (TPU v7) Announcement:
- Purpose: Designed specifically for the "Age of Inference" and agentic workflows, optimizing for low latency at massive scale [cite: 5, 9].
- Specs: 4.6 PFLOPS (FP8) per chip, rivaling NVIDIA's Blackwell B200. Features 192 GB HBM3e memory per chip (6x increase over Trillium) [cite: 10, 11].
- Cluster Size: Supports massive pods of 9,216 chips, significantly larger than standard GPU clusters, enabling massive model residency [cite: 11, 12].
- vLLM TPU Backend:
- Google released a unified backend for vLLM (the popular open-source inference engine) that supports both JAX and PyTorch [cite: 13, 14].
- Impact: Developers can run PyTorch models on TPUs with zero code changes via Torchax, removing the historical friction of migrating from CUDA to TPU [cite: 14, 15].
- Project EAT:
- A company-wide initiative to unify chip design, infrastructure, and developer tools. The goal is to create a coherent platform that reduces TCO (Total Cost of Ownership) and latency for both internal teams and cloud customers [cite: 16].
| Feature | Google (TPU) | NVIDIA (GPU) | AMD (Instinct) |
|---|---|---|---|
| 2025 Flagship | Ironwood (TPU v7) | Blackwell Ultra (B300) | Instinct MI350 |
| 2026 Flagship | TPU v8 (Projected) | Rubin (R100) | Instinct MI400 |
| Architecture | ASIC (Matrix-centric, Systolic Arrays) | General Purpose GPU (CUDA Cores) | CDNA (Compute-focused GPU) |
| Memory (2026) | HBM3e / HBM4 (High Capacity focus) | HBM4 (High Bandwidth focus) | HBM4 (Capacity leadership) |
| Interconnect | ICI + Optical Circuit Switching (OCS) | NVLink + InfiniBand/Spectrum-X | Infinity Fabric |
| Primary Software | JAX, PyTorch/XLA, vLLM | CUDA, TensorRT | ROCm (Open Source) |
| Key Advantage | Cost/Performance, Massive Pod Scale | Ecosystem Maturity, Raw Power | Memory Capacity, Open Ecosystem |
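The "ASIC (Matrix-centric, Systolic Arrays)" entry in the table above can be made concrete with a toy simulation. This is a sketch under simplifying assumptions (output-stationary dataflow, Python integers standing in for hardware multiply-accumulate units); the function name and matrix sizes are illustrative, and a real TPU matrix unit is a much larger grid of MACs (e.g. 128x128).

```python
# Toy simulation of an output-stationary systolic array multiplying two
# matrices, in the spirit of a TPU's matrix unit. Each "processing element"
# (i, j) owns one accumulator; operands reach it along a skewed wavefront.
def systolic_matmul(a, b):
    n, k, m = len(a), len(a[0]), len(b[0])
    acc = [[0] * m for _ in range(n)]          # one accumulator per PE
    # Cycle t: PE (i, j) receives a[i][t - i - j] from the left and
    # b[t - i - j][j] from above, then multiply-accumulates.
    for t in range(n + m + k - 2):
        for i in range(n):
            for j in range(m):
                step = t - i - j               # which k-index reaches PE (i, j)
                if 0 <= step < k:
                    acc[i][j] += a[i][step] * b[step][j]
    return acc

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))   # same result as a conventional matmul
```

The skewed arrival times are the key idea: each PE sees its operands one cycle after its neighbors, which is what lets a physical array keep every MAC busy with only nearest-neighbor wiring, rather than the global data paths a general-purpose GPU core relies on.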
- NVIDIA:
- Blackwell Ultra (2025): A refresh of the Blackwell architecture [cite: 17].
- Rubin (2026): The next-gen architecture featuring HBM4 memory. Mass production is expected in late 2025/early 2026 [cite: 17, 18].
- Strategy: Moving to an annual release cadence to maintain performance leadership [cite: 17].
- AMD:
- Instinct MI350 (2025): Based on CDNA 4, targeting inference with up to 35x inference performance gains over the MI300 series [cite: 19, 20].
- Instinct MI400 (2026): Will utilize CDNA Next architecture and HBM4, aiming to compete directly with NVIDIA Rubin [cite: 21, 22].
- Strategy: Focusing on memory capacity advantages (up to 432GB on MI400) to hold larger models per GPU [cite: 22, 23].
- Massive Scale Deployment:
- Google projects 5 million TPUs deployed by 2027, a shift that aims to position compute as a "global utility" rather than just a cloud service [cite: 24].
- Anthropic Partnership: Anthropic has committed to using over 1 million TPUs (approx. 1 GW capacity) for training Claude models, validating the TPU's capability for frontier models [cite: 24, 25].
- Hardware Evolution (TPU v8):
- Process Node: Expected to utilize TSMC 3nm process technology [cite: 26, 27].
- Partnerships: Reports suggest Google may partner with MediaTek for future TPU production (potentially v8 or edge variants) to diversify supply chains beyond Broadcom [cite: 26, 28].
- "Age of Inference" & Agentic AI:
- Future architectures will prioritize inference efficiency over raw training throughput.
- Focus on Agentic AI: Systems that plan and execute multi-step tasks. This requires hardware optimized for long-context windows and complex logic, driving the design of chips like Ironwood and its successors [cite: 9, 29].
- Infrastructure Unification:
- Optical Interconnects: Continued heavy investment in OCS (Optical Circuit Switching) to allow dynamic reconfiguration of clusters at runtime, reducing power and latency [cite: 30, 31].
- Power Management: New data center designs (Project EAT) will integrate liquid cooling and power management more tightly with the chip architecture to handle densities exceeding 100kW per rack [cite: 16, 30].
- Reduced Ecosystem Lock-in: With the maturation of PyTorch/XLA and vLLM, developers are less "locked into" NVIDIA's CUDA.
- Cost Efficiency: Google is positioning TPUs as the "price-performance" leader (approx. 20-50% lower TCO than NVIDIA), making them the preferred choice for inference-heavy applications [cite: 24].
- JAX Dominance: While PyTorch support is improving, JAX remains the "native" language of TPUs, offering the highest performance ceiling for research and training [cite: 32, 33].
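The JAX-native workflow mentioned above can be sketched in a few lines: `jax.jit` traces a function and compiles it with XLA, the same path that lowers matrix math onto a TPU's matrix units when TPU hardware is present (the snippet runs unchanged on CPU). The function and shapes are illustrative, not from the source.

```python
# Minimal JAX sketch: jit-compiling a matmul-heavy function via XLA.
# On a TPU host, the dot product below is lowered onto the matrix unit;
# on CPU/GPU the identical code compiles to the local backend.
import jax
import jax.numpy as jnp

@jax.jit
def attention_scores(q, k):
    # Scaled dot-product scores, the kind of kernel TPUs are built around.
    return jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]))

q = jnp.ones((4, 8))
k = jnp.ones((4, 8))
print(attention_scores(q, k).shape)  # (4, 4)
```

This "write once, compile anywhere XLA runs" property is why JAX retains the highest performance ceiling on TPUs: the compiler sees the whole function, not a sequence of eagerly dispatched kernels.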
- [cite: 1, 2] History and origins of TPU v1.
- [cite: 3, 4] Evolution of TPU v2, v3, v4.
- [cite: 6, 7] Trillium (v6) specifications and performance claims.
- [cite: 10, 11] Ironwood (v7) specifications and comparison to Blackwell.
- [cite: 13, 14] vLLM TPU backend and software unification.
- [cite: 17, 18] NVIDIA Rubin and Blackwell Ultra timelines.
- [cite: 21, 22] AMD MI350 and MI400 roadmaps.
- [cite: 16, 24] Future scale (5M units), Project EAT, and Anthropic deal.
- [cite: 26, 28] TSMC 3nm plans and MediaTek partnership rumors.
Sources:
1. google.com
2. orhanergun.net
3. medium.com
4. wikipedia.org
5. uplatz.com
6. google.com
7. google.com
8. google.com
9. google.com
10. theregister.com
11. medium.com
12. blog.google
13. google.com
14. joshuaberkowitz.us
15. vllm.ai
16. completeaitraining.com
17. wccftech.com
18. tomshardware.com
19. amd.com
20. storagereview.com
21. instant-gaming.com
22. wccftech.com
23. techpowerup.com
24. youtube.com
25. introl.com
26. trendforce.com
27. smyg.hk
28. siliconangle.com
29. dev.to
30. investing.com
31. digitimes.com
32. google.com
33. googleblog.com