Groq's LPU Chip Technology: On-Chip SRAM

Groq’s technology is defined by a “Software-First” philosophy that treats hardware as a predictable assembly line rather than a pool of dynamically scheduled resources. At the heart of this approach are the Language Processing Unit (LPU) and its memory architecture.

1. What is Groq’s “On-Chip” SRAM?

On a Groq LPU, SRAM (Static Random-Access Memory) is the primary memory, and it is the only memory on the chip [1.2, 3.3]. Unlike GPUs, which rely on external High Bandwidth Memory (HBM), Groq places 230 MB of SRAM directly on the silicon die [2.2, 5.2].

Speed: Because the memory is on-chip, it achieves a bandwidth of roughly 80 TB/s, compared with roughly 3.3 TB/s of HBM bandwidth on a top-tier Nvidia H100 GPU [3.3, 4.3].
Latency: Accessing on-chip SRAM takes only a few clock cycles, whereas a trip to external memory (HBM) takes hundreds of nanoseconds [3.3].
Capacity Trade-off: SRAM cells are physically large (a standard cell uses 6 transistors per bit), so capacity per die is limited. A single Groq chip holds only 230 MB, meaning hundreds of chips must be linked to run a large model such as a 70-billion-parameter Llama 3 [2.2, 4.1].
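
The capacity trade-off can be checked with a back-of-envelope calculation. The sketch below is only an estimate: it takes the 230 MB, 80 TB/s, and 3.3 TB/s figures from the text, assumes 1 MB = 10^6 bytes, counts weights only (no activations or KV cache), and uses the 8B and 70B Llama 3 sizes with an assumed 1 or 2 bytes per parameter. The function name `chips_needed` is made up for illustration.

```python
import math

SRAM_PER_CHIP_MB = 230   # on-chip SRAM per LPU (from the text)
SRAM_BW_TBPS = 80.0      # on-chip SRAM bandwidth (from the text)
HBM_BW_TBPS = 3.3        # H100 HBM bandwidth (from the text)


def chips_needed(params_billion: float, bytes_per_param: float) -> int:
    """Minimum number of chips whose combined SRAM holds the weights alone.

    Assumption: 1 MB = 10**6 bytes, and nothing but weights is stored.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    sram_bytes = SRAM_PER_CHIP_MB * 1e6
    return math.ceil(weight_bytes / sram_bytes)


bandwidth_ratio = SRAM_BW_TBPS / HBM_BW_TBPS

print(chips_needed(70, 2))        # 70B parameters at 16 bits each -> 609
print(chips_needed(70, 1))        # 70B parameters at 8 bits each  -> 305
print(chips_needed(8, 1))         # 8B parameters at 8 bits each   -> 35
print(round(bandwidth_ratio, 1))  # -> 24.2
```

Real deployments also need room for activations and intermediate data, so actual chip counts may be higher than this lower bound.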

2. Is it Conventional “Embedded” SRAM?

Technically, it is a form of embedded SRAM (eSRAM), but its implementation is unconventional [2.1, 3.3].

As a Primary Store: Conventional eSRAM is typically used as a cache (a small, temporary waiting room like an L1/L2/L3 cache) that hardware fills automatically from a larger DRAM. Groq uses it as a scratchpad: it is the main storage where the actual AI model weights live [1.1, 2.1].
No Cache Logic: Conventional caches rely on hardware “managers” (cache controllers) to predict which data the processor will need next. Groq’s SRAM has no such controllers; the compiler assigns data to specific memory addresses ahead of time, before the program ever runs [1.3, 4.2].
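
The scratchpad idea can be illustrated with a toy placement pass. This is a minimal sketch, not Groq’s compiler: the function `plan_placement`, the tensor names, and their sizes are all hypothetical. It only shows the general principle that every tensor receives a fixed address before execution, so there is no tag lookup, no miss, and no eviction at run time.

```python
SCRATCHPAD_BYTES = 230 * 10**6  # per-chip SRAM capacity (from the text)


def plan_placement(tensor_sizes: dict[str, int],
                   capacity: int = SCRATCHPAD_BYTES) -> dict[str, tuple[int, int]]:
    """Toy 'compile-time' pass: assign each tensor a fixed (offset, size).

    Tensors are laid out back to back in the order given. If one does not
    fit, placement fails up front instead of falling back to a cache miss.
    """
    plan = {}
    offset = 0
    for name, size in tensor_sizes.items():
        if offset + size > capacity:
            raise MemoryError(f"{name} does not fit on this chip")
        plan[name] = (offset, size)
        offset += size
    return plan


# Hypothetical layer weights, sizes in bytes.
weights = {"attn.q": 32_000_000, "attn.k": 32_000_000, "mlp.up": 100_000_000}
plan = plan_placement(weights)
print(plan["mlp.up"])  # -> (64000000, 100000000)
```

Because the plan is fixed in advance, the address of every operand is known before the first cycle runs, which is what makes the timing predictable.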

3. Is it useful for Matrix Multiplication?

Yes, it is designed specifically for it. Matrix multiplication (GEMM) is the core operation of neural-network workloads, and Groq’s architecture optimizes it through its Tiled Design [3.1, 3.2].

Functional Slices: The chip is divided into vertical “slices” specialized for memory (MEM) or matrix math (MXM) [3.1, 3.5].
The MXM Unit: Each Matrix Execution Module can perform hundreds of thousands of operations per cycle. Because the SRAM sits right next to the math units, the chip can feed the matrix multipliers at full speed rather than stalling on fetches from external memory [1.1, 3.1].
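
Why memory bandwidth matters for feeding the math units can be seen from a simple lower bound: when each weight is read once per pass (as in a matrix-vector product), the pass cannot finish faster than the weight bytes divided by the memory bandwidth. The sketch below applies that bound to one chip’s worth of weights, using the 230 MB, 80 TB/s, and 3.3 TB/s figures quoted earlier. The helper name `min_stream_time_us` is made up, and the result ignores compute time and all other overheads.

```python
def min_stream_time_us(weight_bytes: float, bandwidth_tb_per_s: float) -> float:
    """Lower bound, in microseconds, on the time to read every weight once."""
    seconds = weight_bytes / (bandwidth_tb_per_s * 1e12)
    return seconds * 1e6


ON_CHIP_WEIGHTS = 230e6  # bytes: one chip's SRAM capacity (from the text)

sram_us = min_stream_time_us(ON_CHIP_WEIGHTS, 80.0)  # on-chip SRAM
hbm_us = min_stream_time_us(ON_CHIP_WEIGHTS, 3.3)    # external HBM

print(f"SRAM: {sram_us:.3f} us, HBM: {hbm_us:.1f} us")
```

Under these assumptions the same 230 MB takes about 2.9 microseconds to stream from on-chip SRAM and about 70 microseconds from HBM.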

4. Does SRAM work as the “Conveyor Belt”?

Not exactly—the data streams act as the conveyor belt, while the SRAM acts as the loading docks [3.1, 4.1].

The Streams: Data moves horizontally across the chip in “streams,” each carrying a 320-byte vector (one byte per lane). These streams physically move one “step” across the chip on every clock cycle, exactly like a conveyor belt [3.1, 4.1].
The SRAM Interaction: As the “conveyor belt” (stream) passes a Memory tile, the SRAM can drop new data onto it or pick data up. If the belt passes a Matrix tile, the math is performed on the data while it’s moving [3.1, 4.2].
Global Synchronization: Because Groq’s chip-to-chip protocol (which Groq calls plesiosynchronous) compensates for clock drift between chips, every chip in a rack stays aligned, and these conveyor belts effectively extend across hundreds of chips, creating a single massive, multi-chip assembly line [1.3, 4.2].
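
The conveyor-belt picture can be made concrete with a toy simulation. This is a sketch under simplifying assumptions, not a model of the real chip: a single stream is a one-row list, `mem_pos` and `mxm_pos` are hypothetical slice positions, and multiplying by `scale` stands in for the matrix operation. It shows the one property the analogy relies on: an operand placed on the belt at a known cycle arrives at the math slice exactly `mxm_pos - mem_pos` cycles later.

```python
def simulate(num_slices: int, mem_pos: int, mxm_pos: int,
             payloads: list[int], scale: int = 2) -> list[tuple[int, int]]:
    """Shift a one-row 'stream' one slice per cycle.

    Returns (cycle, result) pairs recorded at the math slice.
    """
    belt = [None] * num_slices
    pending = list(payloads)
    results = []
    cycle = 0
    while pending or any(v is not None for v in belt):
        # 1. The belt advances one slice; data leaving the last slice is dropped.
        belt = [None] + belt[:-1]
        # 2. The memory slice drops the next operand onto the belt.
        if pending:
            belt[mem_pos] = pending.pop(0)
        # 3. The math slice operates on whatever is passing by this cycle.
        if belt[mxm_pos] is not None:
            results.append((cycle, belt[mxm_pos] * scale))
        cycle += 1
    return results


print(simulate(num_slices=6, mem_pos=1, mxm_pos=4, payloads=[10, 20, 30]))
# -> [(3, 20), (4, 40), (5, 60)]
```

The first operand is placed at cycle 0 and reaches the math slice at cycle 3, three slices later. Since arrival times follow directly from positions, a compiler can schedule every operation ahead of time with no run-time arbitration.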

References

[1.1] Groq Official (2025), “LPU Architecture: Single Core & On-Chip SRAM.”
[1.2] Groq (2025), “Why Groq is Built Different for Inference.”
[1.3] HackerNoon (2025), “Groq’s Deterministic Architecture: Rewriting the Physics of AI.”
[2.1] HPCwire (2022), “Groq Designs Chip that Hands Over Controls to Software.”
[2.2] Reddit r/LocalLLaMA (2025), “How Groq achieves high-speed inference with SRAM.”
[3.1] Groq Whitepaper (2023), “Groq Rocks Neural Networks: The Tensor Streaming Processor.”
[3.3] Medium (2025), “Anatomy of the LPU: SRAM vs HBM.”
[3.5] arXiv (2024), “LPU: A Latency-Optimized and Highly Scalable Processor.”
[4.1] The Register (2025), “Nvidia’s $20B Groq Deal and the Assembly Line Architecture.”
[4.2] ALCF x Groq Workshop (2024), “Day 2: LPU and Systems as Programmable Assembly Lines.”
[5.2] Hacker News (2023), “Discussion on GroqChip tech specs and SRAM capacity.”
