A mixed-precision memristor and SRAM compute-in-memory AI processor
Publication type:
Article
Authors:
Khwa, Win-San; Wen, Tai-Hao; Hsu, Hung-Hsi; Huang, Wei-Hsing; Chang, Yu-Chen; Chiu, Ting-Chien; Ke, Zhao-En; Chin, Yu-Hsiang; Wen, Hua-Jin; Hsu, Wei-Ting; Lo, Chung-Chuan; Liu, Ren-Shuo; Hsieh, Chih-Cheng; Tang, Kea-Tiong; Ho, Mon-Shu; Lele, Ashwin Sanjay; Teng, Shih-Hsin; Chou, Chung-Cheng; Chih, Yu-Der; Chang, Tsung-Yung Jonathan; Chang, Meng-Fan
Affiliations:
Taiwan Semiconductor Manufacturing Company
Journal:
Nature
ISSN/ISBN:
0028-0836
DOI:
10.1038/s41586-025-08639-2
Publication date:
2025-03-20
Keywords:
CMOS
macro
Abstract:
Artificial intelligence (AI) edge devices(1-12) demand high-precision, energy-efficient computation, large on-chip model storage, rapid wakeup-to-response time and cost-effective, foundry-ready solutions. Floating-point (FP) computation provides precision exceeding that of integer (INT) formats at the cost of higher power and storage overhead. Multi-level-cell (MLC) memristor compute-in-memory (CIM)(13-15) provides compact non-volatile storage and energy-efficient computation but is prone to accuracy loss owing to process variation. Digital static random-access memory (SRAM)-CIM(16-22) enables lossless computation; however, storage capacity is low as a result of the large bit-cell area, and model loading is required during inference. Thus, conventional approaches using homogeneous CIM architectures and computation formats impose a trade-off between efficiency, storage, wakeup latency and inference accuracy. Here we present a mixed-precision heterogeneous CIM AI edge processor that supports layer-granular/kernel-granular partitioning of network layers among on-chip CIM architectures (that is, memristor-CIM, SRAM-CIM and tiny-digital units) and computation number formats (INT and FP) based on sensitivity to error. This layer-granular/kernel-granular flexibility allows simultaneous optimization within the two-dimensional design space at the hardware level. The proposed hardware achieved high energy efficiency (40.91 TFLOPS W-1 for ResNet-20 with CIFAR-100 and 28.63 TFLOPS W-1 for MobileNet-v2 with ImageNet), low accuracy degradation (<0.45% for ResNet-20 with CIFAR-100 and for MobileNet-v2 with ImageNet) and rapid wakeup-to-response time (373.52 μs).
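The abstract describes partitioning network layers among CIM back-ends and number formats according to each layer's sensitivity to computation error. The sketch below is a minimal, hypothetical illustration of that idea only; the function name, thresholds and formats are assumptions for exposition and are not taken from the paper's actual mapping flow.

```python
# Hypothetical sketch (not the authors' implementation): layer-granular
# partitioning of a network among CIM back-ends and number formats,
# driven by each layer's measured sensitivity to computation error.

from dataclasses import dataclass


@dataclass
class LayerPlan:
    name: str
    backend: str   # "memristor-CIM", "SRAM-CIM" or "tiny-digital"
    fmt: str       # "INT8" or "FP16" (illustrative formats)


def partition_layers(sensitivities, fp_threshold=0.05, int_threshold=0.02):
    """Assign each layer a back-end and number format.

    `sensitivities` maps layer name -> accuracy drop (fraction) observed
    when that layer's computation is perturbed; thresholds are illustrative.
    """
    plan = []
    for layer, s in sensitivities.items():
        if s >= fp_threshold:
            # Highly error-sensitive: lossless digital SRAM-CIM with FP.
            plan.append(LayerPlan(layer, "SRAM-CIM", "FP16"))
        elif s >= int_threshold:
            # Moderately sensitive: SRAM-CIM with INT to save energy.
            plan.append(LayerPlan(layer, "SRAM-CIM", "INT8"))
        else:
            # Error-tolerant: non-volatile memristor-CIM with INT for
            # compact storage and fast wakeup (no model reloading).
            plan.append(LayerPlan(layer, "memristor-CIM", "INT8"))
    return plan


if __name__ == "__main__":
    example = {"conv1": 0.08, "conv2": 0.03, "conv3": 0.005}
    for p in partition_layers(example):
        print(p)
```

The same per-layer scoring could in principle be applied at kernel granularity; this sketch stays at the layer level for brevity.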