{"id":3943,"date":"2026-01-04T16:27:34","date_gmt":"2026-01-04T22:27:34","guid":{"rendered":"https:\/\/ykim.synology.me\/wordpress\/?p=3943"},"modified":"2026-01-04T21:39:46","modified_gmt":"2026-01-05T03:39:46","slug":"groqs-on-chip-sram","status":"publish","type":"post","link":"https:\/\/ykim.synology.me\/wordpress\/groqs-on-chip-sram-3943\/","title":{"rendered":"Groq&#8217;s LPU Chip Technology: On-Chip SRAM"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" src=\"https:\/\/cdn.sanity.io\/images\/chol0sk5\/production\/df308eef891cc2f8811be4e05b10561b81247747-1280x720.gif\" alt=\"\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Groq chip technology, specifically the <strong>Language Processing Unit (LPU)<\/strong>, is an architecture designed specifically to solve the &#8220;memory bottleneck&#8221; that slows down AI models on traditional chips like GPUs [1.2, 2.1]. Founded by former architects of Google\u2019s TPU, Groq has shifted from a startup to a major industry player, famously resulting in a $20 billion acquisition\/licensing deal with Nvidia in late 2025 [3.2].<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The core of Groq&#8217;s success lies in three pillars: <strong>Deterministic Architecture<\/strong>, <strong>On-chip SRAM<\/strong>, and a <strong>Software-First<\/strong> design [2.1, 4.2].<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">1. Deterministic &#8220;Clockwork&#8221; Execution<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Standard GPUs use dynamic scheduling, where the chip decides in real-time which task to prioritize. 
This creates &#8220;jitter&#8221;\u2014random delays that make AI responses feel uneven [6.1].<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Zero Variance:<\/strong> Groq removed all hardware-side &#8220;decision makers&#8221; (like branch predictors or cache controllers) [1.2, 4.2].<\/li>\n\n\n\n<li><strong>Predictability:<\/strong> The hardware does exactly what the software tells it to do, down to the exact clock cycle. If a calculation is scheduled to take 400 cycles, it takes exactly 400 cycles every single time [1.2, 6.1].<\/li>\n\n\n\n<li><strong>Synchronous Networking:<\/strong> Using a &#8220;Plesiosynchronous&#8221; system, hundreds of Groq chips can be synchronized to act as one single, massive processor [1.2, 4.3].<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2. The SRAM Memory Advantage<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Instead of using external High Bandwidth Memory (HBM) like Nvidia, which requires data to &#8220;travel&#8221; back and forth, Groq uses <strong>SRAM (Static Random-Access Memory)<\/strong> built directly onto the chip [7.1, 7.3].<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Speed:<\/strong> Groq\u2019s on-chip SRAM provides a bandwidth of roughly <strong>80 TB\/s<\/strong>, compared to only <strong>3.35 TB\/s<\/strong> on an Nvidia H100 [2.1, 7.3].<\/li>\n\n\n\n<li><strong>Capacity Trade-off:<\/strong> SRAM is fast but physically large. One Groq chip only holds <strong>230 MB<\/strong> of data [2.3, 7.3].<\/li>\n\n\n\n<li><strong>Scaling:<\/strong> To run a model like Llama 3 70B (which needs ~140 GB), Groq links together a rack of approximately <strong>576 chips<\/strong> [1.2, 7.2].<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3. 
Tiled Architecture &amp; Data Streams<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The chip is organized into a &#8220;Tiled&#8221; or &#8220;Functional Slice&#8221; layout [4.1].<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Vertical Slices:<\/strong> The chip is divided into specialized columns: Memory (MEM), Vector math (VXM), and Matrix math (MXM) [4.1].<\/li>\n\n\n\n<li><strong>Horizontal Streams:<\/strong> Data flows across these slices on &#8220;conveyor belts&#8221; [2.1].<\/li>\n\n\n\n<li><strong>Matrix Multiplication:<\/strong> The Matrix Execution Modules (MXM) can perform a 320-element fused dot-product in just 20 cycles, making it incredibly efficient for the linear algebra required by AI [4.1, 2.3].<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4. Software-First (The Compiler is Captain)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In Groq\u2019s world, the <strong>compiler<\/strong> does all the hard work that hardware usually handles [2.1, 4.2].<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Spatial Orchestration:<\/strong> The compiler maps the AI model&#8217;s data flow across the physical geometry of the chip before the program even runs [1.2, 4.2].<\/li>\n\n\n\n<li><strong>Resource Management:<\/strong> It manages all memory and data movement, ensuring that when a math unit is ready for a number, that number arrives at that exact nanosecond [4.1, 6.1].<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">Comparison of Performance (2025\/2026 Benchmarks)<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><td><strong>Metric<\/strong><\/td><td><strong>Nvidia H100 (GPU)<\/strong><\/td><td><strong>Groq LPU<\/strong><\/td><\/tr><\/thead><tbody><tr><td><strong>Throughput (Llama 3 8B)<\/strong><\/td><td>~100\u2013150 tokens\/sec<\/td><td><strong>877 tokens\/sec<\/strong> [3.2]<\/td><\/tr><tr><td><strong>Throughput (Llama 3 
70B)<\/strong><\/td><td>~30\u201350 tokens\/sec<\/td><td><strong>240\u2013300 tokens\/sec<\/strong> [5.1, 5.2]<\/td><\/tr><tr><td><strong>Memory Bandwidth<\/strong><\/td><td>3.35 TB\/s<\/td><td><strong>~80 TB\/s<\/strong> [7.3]<\/td><\/tr><tr><td><strong>Best For<\/strong><\/td><td>Training &amp; Batch Inference<\/td><td>Real-time, Low-latency Inference [3.2, 6.1]<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">References<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>[1.2] HackerNoon (2025), &#8220;Groq&#8217;s Deterministic Architecture is Rewriting the Physics of AI Inference.&#8221;<\/li>\n\n\n\n<li>[2.1] Groq Official (2025), &#8220;What is a Language Processing Unit?&#8221;<\/li>\n\n\n\n<li>[2.3] TechPowerUp (2024), &#8220;Groq LPU AI Inference Chip is Rivaling Major Players.&#8221;<\/li>\n\n\n\n<li>[3.2] IntuitionLabs (2025), &#8220;Nvidia&#8217;s $20B Groq Deal: Strategy, LPU Tech &amp; Antitrust.&#8221;<\/li>\n\n\n\n<li>[4.1] Groq Whitepaper (2020), &#8220;Groq Rocks Neural Networks: Tensor Streaming Processor.&#8221;<\/li>\n\n\n\n<li>[4.2] Medium (2025), &#8220;The Compiler is the Captain: Software-Defined Hardware.&#8221;<\/li>\n\n\n\n<li>[5.1] ArtificialAnalysis.ai (2024), &#8220;LLM Benchmark: Groq LPU Performance Results.&#8221;<\/li>\n\n\n\n<li>[6.1] 601MEDIA (2025), &#8220;How Groq LPU Works: A Comparison with GPU and TPU.&#8221;<\/li>\n\n\n\n<li>[7.3] Bojie Li (2024), &#8220;Groq Inference Chips: A Trick of Trading Space for Time.&#8221;<\/li>\n<\/ul>\n\n\n\n<!--nextpage-->\n\n\n\n<h2 class=\"wp-block-heading\">What is Groq\u2019s &#8220;On-Chip&#8221; SRAM?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Groq\u2019s technology is defined by a &#8220;Software-First&#8221; philosophy that treats hardware as a predictable assembly line rather than a pool of resources. At the heart of this is the Language Processing Unit (LPU) and its unique
memory architecture.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">1. What is Groq\u2019s &#8220;On-Chip&#8221; SRAM?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">On a Groq LPU, <strong>SRAM (Static Random-Access Memory)<\/strong> is the only memory on the chip [1.2, 3.3]. Unlike GPUs that use external High Bandwidth Memory (HBM), Groq places 230 MB of SRAM directly onto the silicon die [2.2, 5.2].<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Speed:<\/strong> Because it is on-chip, it achieves a bandwidth of roughly <strong>80 TB\/s<\/strong>, compared to only ~3.35 TB\/s on a top-tier Nvidia H100 GPU [3.3, 4.3].<\/li>\n\n\n\n<li><strong>Latency:<\/strong> Accessing on-chip SRAM takes only a few clock cycles, whereas going to external memory (HBM) takes hundreds of nanoseconds [3.3].<\/li>\n\n\n\n<li><strong>Capacity Trade-off:<\/strong> SRAM cells are physically large (typically 6 transistors per bit). A single Groq chip can only hold <strong>230 MB<\/strong>, meaning hundreds of chips must be linked to run a large model such as Llama 3 70B [2.2, 4.1].<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2. Is it Conventional &#8220;Embedded&#8221; SRAM?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Technically, it <strong>is<\/strong> a form of embedded SRAM (eSRAM), but its implementation is unconventional [2.1, 3.3].<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>As a Primary Store:<\/strong> Conventional eSRAM is used as a <strong>cache<\/strong> (a small, temporary waiting room like L1\/L2\/L3 cache) that automatically fetches data from a larger DRAM.
Groq uses it as a <strong>Scratchpad<\/strong>: it is the main storage where the actual AI model weights live [1.1, 2.1].<\/li>\n\n\n\n<li><strong>No Cache Logic:<\/strong> Conventional eSRAM uses complex hardware &#8220;managers&#8221; (cache controllers) to guess what data the CPU needs. Groq\u2019s SRAM has no cache controllers; the <strong>compiler<\/strong> assigns data to specific memory locations before the program runs [1.3, 4.2].<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3. Is it useful for Matrix Multiplication?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Yes, it is designed specifically for it.<\/strong> Matrix multiplication (GEMM) is the core operation of modern AI models, and Groq\u2019s architecture optimizes this through its <strong>Tiled Design<\/strong> [3.1, 3.2].<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Functional Slices:<\/strong> The chip is divided into vertical &#8220;slices&#8221;, each specialized for one function, such as memory (MEM), vector math (VXM), or matrix math (MXM) [3.1, 3.5].<\/li>\n\n\n\n<li><strong>The MXM Unit:<\/strong> Each Matrix Execution Module can perform hundreds of thousands of operations per cycle. Because the SRAM is right next to the math units, the chip can feed the matrix multipliers at full speed without waiting on off-chip memory [1.1, 3.1].<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4. Does SRAM work as the &#8220;Conveyor Belt&#8221;?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Not exactly: the <strong>data streams<\/strong> act as the conveyor belt, while the <strong>SRAM<\/strong> acts as the loading docks [3.1, 4.1].<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>The Streams:<\/strong> Data moves horizontally across the chip in &#8220;streams&#8221; (each 320 bytes wide).
These streams physically move one &#8220;step&#8221; across the chip on every clock cycle, much like a conveyor belt [3.1, 4.1].<\/li>\n\n\n\n<li><strong>The SRAM Interaction:<\/strong> As the &#8220;conveyor belt&#8221; (stream) passes a Memory tile, the SRAM can drop new data onto it or pick data up. If the belt passes a Matrix tile, the math is performed on the data while it\u2019s moving [3.1, 4.2].<\/li>\n\n\n\n<li><strong>Global Synchronization:<\/strong> Because the chips in a rack are kept closely synchronized (a scheme Groq calls &#8220;Plesiosynchronous&#8221;), these conveyor belts effectively extend across hundreds of chips, creating a single massive, multi-chip assembly line [1.3, 4.2].<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\">References<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>[1.1] Groq Official (2025), &#8220;LPU Architecture: Single Core &amp; On-Chip SRAM.&#8221;<\/li>\n\n\n\n<li>[1.2] Groq (2025), &#8220;Why Groq is Built Different for Inference.&#8221;<\/li>\n\n\n\n<li>[1.3] HackerNoon (2025), &#8220;Groq&#8217;s Deterministic Architecture: Rewriting the Physics of AI.&#8221;<\/li>\n\n\n\n<li>[2.1] HPCwire (2022), &#8220;Groq Designs Chip that Hands Over Controls to Software.&#8221;<\/li>\n\n\n\n<li>[2.2] Reddit r\/LocalLLaMA (2025), &#8220;How Groq achieves high-speed inference with SRAM.&#8221;<\/li>\n\n\n\n<li>[3.1] Groq Whitepaper (2023), &#8220;Groq Rocks Neural Networks: The Tensor Streaming Processor.&#8221;<\/li>\n\n\n\n<li>[3.3] Medium (2025), &#8220;Anatomy of the LPU: SRAM vs HBM.&#8221;<\/li>\n\n\n\n<li>[3.5] arXiv (2024), &#8220;LPU: A Latency-Optimized and Highly Scalable Processor.&#8221;<\/li>\n\n\n\n<li>[4.1] The Register (2025), &#8220;Nvidia&#8217;s $20B Groq Deal and the Assembly Line Architecture.&#8221;<\/li>\n\n\n\n<li>[4.2] ALCF x Groq Workshop (2024), &#8220;Day 2: LPU and Systems as Programmable Assembly
Lines.&#8221;<\/li>\n\n\n\n<li>[5.2] Hacker News (2023), &#8220;Discussion on GroqChip tech specs and SRAM capacity.&#8221;<\/li>\n<\/ul>\n\n\n\n<!--nextpage-->\n\n\n\n<h2 class=\"wp-block-heading\">Appendix<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/groq.com\/blog\/inside-the-lpu-deconstructing-groq-speed\" target=\"_blank\" rel=\"noopener\">https:\/\/groq.com\/blog\/inside-the-lpu-deconstructing-groq-speed<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/groq.com\/blog\/the-groq-lpu-explained\" target=\"_blank\" rel=\"noopener\">https:\/\/groq.com\/blog\/the-groq-lpu-explained<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/blog.codingconfessions.com\/p\/groq-lpu-design\" target=\"_blank\" rel=\"noopener\">https:\/\/blog.codingconfessions.com\/p\/groq-lpu-design<\/a><\/p>\n\n\n\n<figure class=\"wp-block-embed is-type-wp-embed is-provider-the-next-platform wp-block-embed-the-next-platform\"><div class=\"wp-block-embed__wrapper\">\n<blockquote class=\"wp-embedded-content\" data-secret=\"O3S4mHmAPW\"><a href=\"https:\/\/www.nextplatform.com\/2020\/09\/29\/groq-shares-recipe-for-tsp-nodes-systems\/\" target=\"_blank\" rel=\"noopener\">Groq Shares Recipe for TSP Nodes, Systems<\/a><\/blockquote><iframe loading=\"lazy\" class=\"wp-embedded-content\" sandbox=\"allow-scripts\" security=\"restricted\" style=\"position: absolute; visibility: hidden;\" title=\"&#8220;Groq Shares Recipe for TSP Nodes, Systems&#8221; &#8212; The Next Platform\" src=\"https:\/\/www.nextplatform.com\/2020\/09\/29\/groq-shares-recipe-for-tsp-nodes-systems\/embed\/#?secret=pV3RibnY92#?secret=O3S4mHmAPW\" data-secret=\"O3S4mHmAPW\" width=\"600\" height=\"338\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\"><\/iframe>\n<\/div><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n<div style='text-align:center' class='yasr-auto-insert-overall'><\/div><div style='text-align:center' 
class='yasr-auto-insert-visitor'><\/div>","protected":false},"excerpt":{"rendered":"<p>Groq chip technology, specifically the Language Processing Unit (LPU), is an architecture designed specifically to solve the &#8220;memory bottleneck&#8221; that slows down AI models on traditional chips like GPUs [1.2, 2.1]. Founded by former architects of Google\u2019s TPU, Groq has shifted from a startup to a major industry player, famously resulting in a $20 billion&#8230;<\/p>\n","protected":false},"author":4,"featured_media":3950,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_bbp_topic_count":0,"_bbp_reply_count":0,"_bbp_total_topic_count":0,"_bbp_total_reply_count":0,"_bbp_voice_count":0,"_bbp_anonymous_reply_count":0,"_bbp_topic_count_hidden":0,"_bbp_reply_count_hidden":0,"_bbp_forum_subforum_count":0,"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","yasr_overall_rating":0,"yasr_post_is_review":"","yasr_auto_insert_disabled":"","yasr_review_type":"","fifu_image_url":"","fifu_image_alt":"","iawp_total_views":1,"footnotes":""},"categories":[6,319,4],"tags":[],"class_list":["post-3943","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-sram-slug","category-language-processing-unit-slug","category-semiconductor-slug"],"yasr_visitor_votes":{"stars_attributes":{"read_only":false,"span_bottom":false},"number_of_votes":0,"sum_votes":0},"jetpack_featured_media_url":"https:\/\/ykim.synology.me\/wordpress\/wp-content\/uploads\/2026\/01\/20260104-GroqChip-1-LPU.png","_links":{"self":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/3943","targetHints":{"allow":["GET"]}}],"collection
":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/comments?post=3943"}],"version-history":[{"count":4,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/3943\/revisions"}],"predecessor-version":[{"id":3954,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/posts\/3943\/revisions\/3954"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media\/3950"}],"wp:attachment":[{"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/media?parent=3943"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/categories?post=3943"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/ykim.synology.me\/wordpress\/wp-json\/wp\/v2\/tags?post=3943"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}