Every week, I do a Briefing Note on one important technology, from Vector Databases, LLMs, and Decentralised AI to Optical, Neuromorphic, and Analog Computing. If you’re new, join the World’s best investors and subscribe. I don’t have a podcast or TV show like Azeem. Yet. 👀
You live in the eternal now. See, memory is an illusion. It’s all gone. Everything you know about, everything that made an impression, is no longer there. Nothing is in your mind. Because even when you are remembering, it’s happening in this present. The memory is in the eternal now.
So this week you get computer memories. Do Androids Dream of Electric Sheep? We have short-term memory that is fast, but needs power. Volatile memory. Mainly RAM. And long-term memory that is slower, but doesn’t need power. Non-volatile memory. Mainly flash and SSD.
Today, I want to cover advances in volatile memory, in particular, dynamic RAM (DRAM). And specifically a recent innovation in DRAM which is to stack DRAMs on top of each other.
It’s important because it’s the memory used in every Nvidia H100. Without it we don’t have LLMs, basically. So we must make loads of it, right? Wrong. It’s really hard to make. So hard, in fact, that only one company in the world can make it. And even then they get pretty bad yields.
I’m covering it today as it comes out of a larger thesis I’m developing with Elad on AI austerity. Not to spill the beans too early, but the chip bottleneck isn’t where you think it is. We can make plenty of GPUs.
The problems are making enough memory, the HBM I will cover today, and gluing the GPU and HBM together in an advanced package using a very specific TSMC process called Chip-on-Wafer-on-Substrate (CoWoS). Stay tuned for more on this.
💾 State of the Future: High Bandwidth Memory TLDR 💾
Summary: High Bandwidth Memory (HBM) is a type of DRAM designed to provide higher bandwidth at lower power by stacking DRAM dies on top of each other, in contrast to the traditional planar process used to make DDR.
Viability [5/5]: Commercially deployed and industry standard since 2013. Now on third generation (HBM3). R&D focused on bandwidth increases, latency reduction, and process cost reduction
Novelty [4/5]: Critical enabler of datacenter AI accelerators and LLMs with unique high bandwidth, low latency and low power consumption characteristics
Drivers [5/5]: Powerful LLM tailwinds and innovations in 3D stacking with through-silicon vias (TSVs), microbumps, and wafer thinning combined with Samsung and Micron joining SK Hynix in HBM3 volume production in 2024
Restraints [3/5]: Cost is the major restraint: HBM is 3-5x more expensive than GDDR6 or DDR5 due to more complex manufacturing processes and lower volumes, with current supply bottlenecks driven by explosive demand and a single HBM3 vendor in SK Hynix
Impact [4/5]: High impact from 2024-2030 as a critical enabler of LLMs, but limited to HPC/AI as it won’t replace planar-based DDR5/6 for mainstream applications
Timing [2020-2025]: Investment ready today growing 50% for next 5 years supporting AI datacentre chip build out
Rating [Underrated]: Forecasts suggest $10 billion by 2030, which is not consistent with forecast AI datacentre chip growth.
2030 Prediction: Massive growth to $25 billion from $2 billion today
Opportunities:
Reduce manufacturing costs: Remove the costly silicon interposer from the 2.5D/3D stacking process and/or integrate with SerDes or UCIe
Increase bandwidth / Solve the Memory Wall: Photonic interconnects for CXL HBM pooling node-to-node for LLM clusters (Ayar Labs, Celestial AI)
Thermal management: Advanced thermal solutions and heatsinks optimized for HBM stack cooling, potentially opening up high-power and mobile markets
💾 State of the Future: High Bandwidth Memory Memo 💾
Summary
High Bandwidth Memory (HBM) is a type of dynamic random access memory (DRAM), the volatile semiconductor memory used for primary system memory in computers. HBM is designed to provide significantly higher bandwidth than traditional DRAM technologies. It achieves this by using a novel 2.5D/3D stacking technique where multiple DRAM dies are vertically stacked on top of each other. This is in contrast to other DRAM technologies like GDDR or DDR, which use a planar layout with a single DRAM die.
The key benefit of stacking DRAM dies vertically is that it enables much wider memory interfaces. HBM supports memory interfaces up to 1024 bits wide, compared to just 32 or 64 bits for standard DRAM. This wider interface is what gives HBM its very high bandwidth. HBM is integrated directly onto silicon interposers or application-specific integrated circuits (ASICs) rather than packaged as a standalone chip.
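To make the interface-width point concrete, here is a rough back-of-envelope comparison. The per-pin data rates are indicative, generation-level figures, not specs for any particular part:

```python
# Peak bandwidth = bus width (bits) x per-pin data rate (Gb/s) / 8.
# Per-pin rates below are indicative figures; actual parts vary by vendor and SKU.
def peak_bandwidth_gbs(bus_width_bits, pin_rate_gbps):
    return bus_width_bits * pin_rate_gbps / 8

memories = {
    "DDR5 DIMM   (64-bit   @ 6.4 Gb/s)": (64, 6.4),
    "GDDR6 chip  (32-bit   @ 16 Gb/s)":  (32, 16.0),
    "HBM2E stack (1024-bit @ 3.6 Gb/s)": (1024, 3.6),
    "HBM3 stack  (1024-bit @ 6.4 Gb/s)": (1024, 6.4),
}

for name, (width, rate) in memories.items():
    print(f"{name}  ~{peak_bandwidth_gbs(width, rate):.0f} GB/s")

# DDR5 ~51 GB/s per DIMM, GDDR6 ~64 GB/s per chip,
# HBM2E ~461 GB/s and HBM3 ~819 GB/s per stack.
```

Slower pins, but a much, much wider bus: that trade-off is the whole design philosophy of HBM, and it’s why a GPU carrying five or six HBM3 stacks lands in the multi-terabyte-per-second class.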
HBM requires less power and PCB area than other high bandwidth memories like GDDR while providing similar or higher data transfer rates. This makes HBM well suited for high performance computing, AI, networking and other applications that need very high memory bandwidth. The demand for HBM has increased significantly due to large language models and is expected to continue rising despite supply constraints.
Viability [Score: 5/5]
HBM reached technical maturity and entered volume production at SK Hynix in 2013. The first devices to use HBM were the AMD Fiji GPUs in 2015. HBM2 entered mass production in 2016 with SK Hynix and Samsung. HBM2E entered mass production in 2020, with HBM3 following in 2022 for use with Nvidia's H100 GPU. HBM3E is slated for mass production in 2024 by SK Hynix, Samsung and Micron. Each generation has increased data processing speeds and bandwidth and lowered latencies. As it stands in September 2023, SK Hynix are the sole provider of HBM3. By 2024, we can expect all three of the major memory players to have competitive HBM products in market.
The remaining challenges are economic rather than technical. The cost and complexity need to come down for HBM to become economical outside of the datacenter and the HPC/AI market segment. With a more competitive 2024 than 2023, and given the memory industry’s history of aggressive pricing to win market share, it’s highly likely prices will drop, especially as Samsung claims it wants to displace SK Hynix as the market leader and is “cutting favorable deals with some of the accelerator firms to try to capture more share.”
Interesting products such as Photonic Fabric from Celestial AI and TeraPHY from Ayar Labs are using optical interconnects to connect HBM modules chip-to-chip, package-to-package, and node-to-node, potentially enabling pooled HBM across servers using Compute Express Link (CXL).
Novelty [Score: 5/5]
HBM offers a huge amount of bandwidth at low latency with good power efficiency. No other technology is able to match HBM for applications that need very high performance within a reasonable power budget. But this level of performance comes at a cost, limiting the market. Hybrid Memory Cube (HMC) was a competing technology from 2013 to 2018 before the last vendor, Micron, moved away from HMC to HBM and GDDR (Graphics DDR SDRAM). GDDR is the only real alternative, but it falls well short of HBM on bandwidth, latency, and power efficiency, with DDR5 further behind still.
HBM offers a compelling balance of high bandwidth, small footprint, and low latency, all at good power efficiency. These benefits come from the fact the dies are stacked vertically. This design also brings the downsides: higher costs from additional components, including expensive silicon interposers and the through-silicon vias (TSVs) needed to stack the dies on top of each other. The stacking process also requires specific tooling, which adds further cost. With relatively low production volumes, HBM2E is estimated to cost around $25-35 per GB, 3-5x higher than the competing GDDR6 technology. HBM is becoming the memory of choice for hyperscalers, but there are still questions about its ultimate fate in the mainstream marketplace. While it’s well established in data centers, with usage growing due to the demands of AI/ML, wider adoption is inhibited by drawbacks inherent in its basic design. GDDR will continue dominating mainstream graphics and is the technology of choice for cost-sensitive applications.
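To put those per-GB figures in context, here is a rough, purely illustrative memory bill-of-materials comparison. Actual contract prices are not public, and the capacities are just typical examples:

```python
# Illustrative memory bill-of-materials using the per-GB estimates above.
# Real contract prices are not public; capacities are typical examples only.
hbm2e_per_gb = 30.0                # midpoint of the $25-35/GB estimate
gddr6_per_gb = hbm2e_per_gb / 4    # assuming the ~3-5x premium, midpoint 4x

accelerator_hbm_gb = 80            # H100-class accelerator capacity
graphics_gddr_gb = 16              # high-end graphics card capacity

print(f"80 GB of HBM2E-class memory: ~${accelerator_hbm_gb * hbm2e_per_gb:,.0f}")  # ~$2,400
print(f"16 GB of GDDR6:              ~${graphics_gddr_gb * gddr6_per_gb:,.0f}")    # ~$120
```

Memory alone can run to thousands of dollars per accelerator, which is why HBM stays locked to parts whose end markets can carry the cost.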
In the long term, 2030+, new computing architectures pose the biggest threat to HBM. For datacenter chips, of all the exotic computing architectures, optical and quantum are likely to offer the greatest competition, potentially utilising new memory technologies like Ferroelectric RAM (FeRAM, FRAM), Resistive RAM (RRAM, ReRAM, memristors), Phase-Change Memory (PCM), or Magnetic RAM (MRAM). It’s unlikely HBM would be replaced outright; more likely, these exotic technologies will complement DRAM with a different performance/cost trade-off within the memory hierarchy.
On balance, HBM is a critical enabler to scale LLMs in the datacenter for the next 5 years and beyond. If we want bigger models, we need HBM. LLMs may shrink in size with advances in quantization, pruning and even efficient transformers that do not suffer from quadratic dependence on the input. However, as it stands, models are likely to get bigger, and therefore HBM is fundamental.
Drivers [Score: 5/5]
The main driver of HBM is the demand for extreme memory bandwidth to handle LLMs in the datacentre. The speed of the demand increase is close to unparalleled: whilst HBM served a small niche for high-end graphics and HPC before 2022, it is demand for datacenter AI accelerators driving the market today. HBM is only able to meet that demand thanks to advances in through-silicon vias (TSVs), microbumps, and wafer thinning.
LLMs demand HBM due to their substantial bandwidth requirements, with HBM providing 2-5 times higher bandwidth compared to GDDR or DDR memory, enabling efficient parallel data processing essential for LLMs. The low latency and fast data access of HBM are pivotal, particularly when LLMs constantly move large datasets; it also aligns with the requirements of tasks like training models with 175 billion parameters, necessitating frequent data transfers between memory and processors. In terms of power efficiency, HBM excels by minimizing energy consumption during memory access, contributing to reduced overall power requirements, which is crucial for energy-conscious data centers and AI accelerators hosting LLMs.
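A quick, deliberately rough sketch of why the 175-billion-parameter example is memory-bound. The GPU figures are H100-class assumptions, not exact specs:

```python
import math

# Rough sketch: a 175B-parameter model in FP16 against an 80 GB / ~3 TB/s
# HBM3 accelerator (H100-class assumptions, not exact specs).
params = 175e9
weights_gb = params * 2 / 1e9                  # 2 bytes per FP16 parameter
hbm_per_gpu_gb = 80
min_gpus = math.ceil(weights_gb / hbm_per_gpu_gb)

print(f"Weights alone: ~{weights_gb:.0f} GB")                # ~350 GB
print(f"Minimum GPUs just to hold the weights: {min_gpus}")  # 5, before KV cache and activations

# Autoregressive decoding reads roughly every weight once per generated token,
# so the per-sequence token rate is bounded by aggregate HBM bandwidth, not FLOPs.
hbm_bandwidth_gbs = 3000                                     # ~3 TB/s per GPU
max_tokens_per_s = hbm_bandwidth_gbs * min_gpus / weights_gb
print(f"Bandwidth-bound ceiling: ~{max_tokens_per_s:.0f} tokens/s per sequence")
```

The model doesn’t even fit in one GPU’s HBM, and once it’s spread across several, how fast you can generate is set by how fast you can stream the weights out of memory.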
Demand can only be met through recent innovations in 3D stacking processes: through-silicon vias (TSVs), microbumps, and wafer thinning and handling techniques all play a pivotal role in making HBM manufacturable at scale. High-density vertically stacked TSVs have become instrumental in efficiently transmitting signals between stacked semiconductor dies, thanks to electroplating techniques that enable TSV pitches as small as 5um. Additionally, high-precision thermo-compression bonding can generate over 10,000 high-density microbumps per square millimeter between dies, establishing numerous I/O interconnections. Microbump technology is the packaging technique that makes these dense, precise, and reliable interconnections between dies possible, and it is essential wherever data transfer speed and efficiency are critical.
To further enhance the feasibility of fine-pitch stacking without die warpage, memory wafers are thinned down to a mere 50um, and advanced handling materials have significantly reduced wafer breakage rates during the backgrinding process to less than 5%.
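Those density figures are easier to picture as pitches. A quick sanity check, assuming a simple square-grid layout (a simplification):

```python
import math

# Convert the quoted interconnect figures between pitch and areal density,
# assuming a simple square-grid layout (a simplification).
def pitch_um(bumps_per_mm2):
    return 1000 / math.sqrt(bumps_per_mm2)      # 1 mm = 1000 um

def density_per_mm2(pitch):
    return (1000 / pitch) ** 2

print(f"10,000 microbumps/mm^2 -> ~{pitch_um(10_000):.0f} um bump pitch")          # ~10 um
print(f"5 um TSV pitch         -> ~{density_per_mm2(5):,.0f} TSVs/mm^2 possible")  # ~40,000
```

For comparison, conventional flip-chip bumps sit at a pitch of very roughly 100-150um, a couple of orders of magnitude less dense than the microbumps used between HBM dies.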
Diffusion [Score: 2/5]
HBM is expensive and will be unable to serve mainstream use cases as long as costs stay high. Other factors weigh in too: with a relatively low density of only 8-16GB per stack, HBM cannot match GDDR's cost-per-bit for applications that need hundreds of gigabytes of memory, and the dense 3D stacking creates thermal hotspots that are difficult to cool, requiring careful thermal management and liquid cooling solutions that add cost and complexity. But as long as HBM remains 3-5x more expensive per bit than GDDR, it will not break out of the HPC/AI segment.
Potential cost reductions include removing or tweaking the silicon interposer, as this is a major component cost. TSMC are experimenting with different types of interposers, and there are prototypes of 3D stacking that remove the interposer entirely. Other ways to reduce costs would be through standardization; one potential route would be to allow finer pin pitches on standard packaging technology like SerDes or UCIe. Despite these potential cost reductions, HBM will always have a higher manufacturing cost because of the expense of putting TSVs on the wafer. Wafer thinning and handling remain challenging processes with high defect rates, limiting manufacturing scale-up, and handling thinned wafers requires high-precision equipment. Creating TSV interconnects adds significant fabrication complexity and cost versus standard DRAM processes, requiring multiple deposition, lithography and etch steps. And HBM testing and validation requires sophisticated 3D inspection tools to probe the dense microbumps between stacked dies; lower testing throughput drives up costs.
Overall, the lowish density and heat management issues can be handled as costs come down. Costs will inevitably come down as AI accelerator volumes increase. But the complicated, and therefore expensive, HBM production process puts a ceiling on adoption and mainstream use.
Impact [Score: 4/5]
AI Bust: In a low impact, low probability scenario, HBM remains confined to high-end accelerators and HPC with minimal mainstream adoption. The current AI boom driven by LLMs does not materialise into sustainable, valuable applications, and from 2025 onwards demand for high-bandwidth memory dries up and prices stay low because of overcapacity. This is the DRAM wars of the 90s redux: HBM prices collapse due to oversupply and a lack of demand. The HBM market stays below $8 billion through 2030.
AI Boom + Low Prices: A medium impact, high probability scenario sees the AI boom continue, albeit at a less furious pace. Valuable applications are discovered, serving billions of users. A price war breaks out between Samsung, Micron and SK Hynix to win market share in 2024. Lower prices bring down the overall cost of AI accelerators, growing the market faster as well as making HBM more cost competitive with GDDR for graphics and other high-performance (but not extreme high-performance) applications. HBM will be a roughly $25bn business by 2030.
Replacing DDR5/6: A high impact, medium probability scenario sees the AI accelerator market grow to become one of the largest semiconductor segments (smartphones are at 25% today, AI accelerators about 5%). At these volumes, production costs decline dramatically. Combined with a price war, this drives down prices to such an extent that HBM becomes cost competitive for PCs and smartphones, replacing GDDR first and then DDR5/6. The DRAM market was worth $110 billion in 2021; if this scenario plays out, the market grows faster than the predicted 8-10% annually, and HBM could be worth $150bn by 2030.
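For what it’s worth, here is one way to arrive at a number like $150bn in that bullish scenario. These are illustrative assumptions, not a forecast:

```python
# Illustrative path to the bullish figure: grow the whole DRAM market a bit
# faster than the 8-10% consensus, then let HBM take half of DRAM revenue.
dram_2021_bn = 110
growth = 0.12                                    # above the 8-10% consensus
dram_2030_bn = dram_2021_bn * (1 + growth) ** 9  # 2021 -> 2030
hbm_share = 0.5                                  # HBM displaces GDDR and much of DDR

print(f"DRAM market 2030: ~${dram_2030_bn:.0f}bn")                          # ~$305bn
print(f"HBM at {hbm_share:.0%} share: ~${dram_2030_bn * hbm_share:.0f}bn")  # ~$150bn
```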
On balance, the medium impact scenario is more plausible. There is always going to be a cost advantage for the planar process regardless of volumes and price wars. And cost will always be a core consideration for any application at some margin.
Timing [2025-2030]
The market for HBM was still nascent, in the range of $1-2 billion, in 2021. The best proxy for growth is probably the same 50% YoY growth suggested for AI datacenter chips, all of which will use HBM. All future AI accelerator chips will use HBM3, with SK Hynix, Samsung and Micron all competing for market share from 2024. I'm going to assume 50% growth for the next 5 years due to market frenzy and AI deployments, and then a further 3 years of 15% growth, resulting in a $25 billion market in 2030, an average of roughly 36% a year over the 8 years.
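The arithmetic behind that projection, starting from a roughly $2bn base:

```python
# Timing arithmetic: ~$2bn base, 50% growth for 5 years (2023-2027),
# then 15% growth for 3 years (2028-2030).
market_bn = 2.0
for year in range(2023, 2031):
    market_bn *= 1.50 if year <= 2027 else 1.15

cagr = (market_bn / 2.0) ** (1 / 8) - 1
print(f"2030 HBM market: ~${market_bn:.0f}bn")        # ~$23bn, call it $25 billion
print(f"Implied average growth: ~{cagr:.0%} a year")  # ~36%
```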
Rating [Underrated]
HBM is forecast to be a small $6 billion market by 2027, so at best $10 billion by 2030. This isn’t consistent with announcements by TSMC, AMD and Nvidia of 50% AI datacentre chip growth for the next five years. My forecast suggests a $25 billion market, more than double the consensus.
Open Questions
Will TSV interconnect technology continue advancing fast enough to keep up with HBM roadmap density and bandwidth scaling?
Can HBM maintain its advantage over competing technologies like GDDR6 and future GDDR iterations? Much may depend on continued density improvements through advanced packaging techniques.
How will HBM standardization and intellectual property issues between memory vendors play out? Collaboration on standards will likely be needed for further innovation.
What is the floor for HBM costs and how much margin will Samsung, SK Hynix and Micron be willing to eat for market share? Unlikely to be a price collapse like DRAM in the 1990s, but a large decline should reduce the cost of AI accelerators.
2030 Prediction
HBM is a $25 billion market
Opportunities
Removing silicon interposer: Current HBM stacks DRAM dies on top of a silicon interposer, increasing costs. Removing the interposer could lower costs but requires directly stacking DRAM on top of the host processor. This is challenging due to yield and thermal issues. Advanced bonding, stacking, and cooling methods could make this viable.
Photonic interconnects: Optical CXL links between pooled HBM could remove bandwidth bottlenecks between nodes for exascale computing. Startups like Ayar Labs are developing in-package optical I/O chiplets to replace electrical links. This could enable terabyte-per-second bandwidth for memory pooling and disaggregation. Significant engineering is needed.
Advanced HBM cooling: As HBM stacks more dies, thermal density increases. This limits max bandwidth per stack. Optimized microchannel heatsinks, liquid cooling, and materials like graphene could enable higher stacked configurations without overheating. But this requires expensive, precision manufacturing methods. Overall system cooling requirements also increase with more powerful HBM stacks.
Sources
AI Expansion - Supply Chain Analysis For CoWoS And HBM
What’s Next For High Bandwidth Memory, https://semiengineering.com/whats-next-for-high-bandwidth-memory/
Can Any Emerging Memory Technology Topple DRAM and NAND Flash?, https://www.eejournal.com/article/can-any-emerging-memory-technology-topple-dram-and-nand-flash/
HBM’s Future: Necessary But Expensive, https://semiengineering.com/hbms-future-necessary-but-expensive/
Emerging Memories Branch Out, https://objective-analysis.com/wp-content/uploads/2023/08/2023-Objective-Analysis-Coughlin-Emerging-Memory-Report-Flyer.pdf
What is High Bandwidth Memory 3 (HBM3)?, https://www.synopsys.com/glossary/what-is-high-bandwitdth-memory-3.html