on-chip SRAM AI ASIC


An on-chip SRAM AI ASIC is an accelerator in which most of the working set (activations, partial sums, and sometimes the weights themselves) stays in SRAM on the compute die, rather than being fetched from off-chip DRAM/HBM.

 

1. Latency dominance (especially LLM inference)

  • SRAM access: ~0.3–1 ns

  • HBM access (effective): ~50–100 ns

  • DDR access: 100+ ns

For token-by-token inference, this difference dominates user-visible latency.
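
A back-of-envelope sketch of what this means per token, splitting the time into a bandwidth-bound weight-streaming term and a serialized access-latency term. Every number here (model size, bandwidths, latencies, layer count) is an illustrative assumption, not a measurement:

```python
# Rough per-token latency model for batch-1 autoregressive decoding.
# All parameters are illustrative assumptions, not vendor measurements.

def per_token_latency_ns(bytes_per_token, bandwidth_gbps, access_ns, dependent_accesses):
    """Bandwidth-bound streaming time plus serialized (non-overlapped) access latency."""
    streaming_ns = bytes_per_token / bandwidth_gbps   # 1 GB/s == 1 byte per ns
    serial_ns = dependent_accesses * access_ns        # dependent fetches that cannot be hidden
    return streaming_ns + serial_ns

weights_bytes = 7e9   # hypothetical 7B-parameter model, INT8 weights read once per token
layers = 32           # assume one unhidden memory round-trip per layer (simplified)

hbm  = per_token_latency_ns(weights_bytes, bandwidth_gbps=3_000,  access_ns=80, dependent_accesses=layers)
sram = per_token_latency_ns(weights_bytes, bandwidth_gbps=80_000, access_ns=1,  dependent_accesses=layers)

print(f"HBM-resident weights : {hbm / 1e6:.2f} ms/token")    # ~2.3 ms with these assumptions
print(f"SRAM-resident weights: {sram / 1e6:.3f} ms/token")   # ~0.09 ms with these assumptions
```

With these assumed numbers the streaming term dominates, but both terms shrink by one to two orders of magnitude once the working set stays on-die, which is exactly the gap users feel as time-per-token.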

2. Energy efficiency

Approximate energy per bit moved:

  • SRAM: ~0.1–1 pJ/bit

  • HBM: ~3–5 pJ/bit

  • DDR: 10+ pJ/bit

LLMs are often memory-energy limited, not compute-limited.
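
The list above turns into a per-token energy estimate with a few lines of arithmetic. This is a minimal sketch assuming a hypothetical 7B-parameter INT8 model whose weights are read once per token, using mid-range pJ/bit values:

```python
# Memory energy per generated token, using mid-range pJ/bit values from the list above.
PJ_PER_BIT = {"SRAM": 0.5, "HBM": 4.0, "DDR": 12.0}   # illustrative mid-range assumptions

def memory_energy_joules(bytes_moved, tech):
    bits = bytes_moved * 8
    return bits * PJ_PER_BIT[tech] * 1e-12            # pJ -> J

bytes_per_token = 7e9   # hypothetical 7B-parameter model, INT8 weights read once per token
for tech in PJ_PER_BIT:
    print(f"{tech}: {memory_energy_joules(bytes_per_token, tech):.3f} J per token")
# SRAM ~0.03 J, HBM ~0.22 J, DDR ~0.67 J per token under these assumptions
```

At, say, 50 tokens/s, the DDR case alone is over 30 W of pure data movement, which is why memory traffic rather than arithmetic usually sets the power budget.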

3. Deterministic performance

  • No DRAM scheduling, refresh, or bank conflicts

  • Enables cycle-accurate pipelines (important for real-time systems)
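
A toy illustration of the determinism point: when every SRAM access and every compute tile has a fixed cycle count, a compiler can add them up and get the exact runtime at compile time rather than a statistical bound. The cycle counts below are made-up placeholders:

```python
# With fixed SRAM latencies and a static schedule, total runtime is exact, not a bound.
SRAM_READ_CYCLES  = 1     # placeholder values; real numbers depend on the design
MATMUL_CYCLES     = 128   # hypothetical fixed-size tile on the MAC array
SRAM_WRITE_CYCLES = 1

def static_schedule_cycles(num_tiles):
    # No refresh stalls, no bank-conflict retries, no cache misses:
    # every tile costs exactly the same number of cycles.
    per_tile = SRAM_READ_CYCLES + MATMUL_CYCLES + SRAM_WRITE_CYCLES
    return num_tiles * per_tile

print(static_schedule_cycles(num_tiles=4096), "cycles, known before the chip ever runs")
```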

Typical on-chip SRAM capacity by chip class:

  Chip class                   On-chip SRAM
  Mobile NPU                   4–32 MB
  Edge inference ASIC          32–128 MB
  Datacenter inference ASIC    100–300 MB
  Wafer-scale (Cerebras)       10s of GB

 

Famous examples (and what they optimized for)

Groq

  • All on-chip SRAM

  • Static schedule, no caches

  • Very low per-token latency

  • Limited flexibility and capacity

Google TPU v1–v3

  • Large on-chip SRAM buffers (TPU v1's unified buffer was 24 MB)

  • Matrix-centric workloads

  • Inference-only in v1; v2/v3 added training (HBM alongside the on-chip buffers)

Cerebras

  • Wafer-scale SRAM + compute

  • Avoids off-chip memory entirely for models that fit on the wafer

  • Extreme cost, extreme performance for certain models

 

When on-chip SRAM AI ASICs are the right answer

  • Ultra-low latency LLM inference

  • Real-time systems (finance, robotics, telecom)

  • Edge or power-constrained environments

  • Predictable workloads with known model shapes
