The proliferation of the Internet of Things (IoT) places increasing demands on the performance and energy efficiency of microcontroller-based endpoint devices. These microcontrollers (MCUs) must process raw sensor data with integer workloads, such as digital signal processing (DSP) algorithms and quantized neural network (QNN) inference. Snitch is a tiny RV32I control core based on the open-source RISC-V instruction set architecture (ISA). The Snitch system built around this core currently targets high performance in floating-point applications: its floating-point subsystem implements novel hardware extensions, such as stream semantic registers (SSRs) and the floating-point repetition (FREP) hardware loop, to achieve high floating-point unit (FPU) utilization. For integer computation, however, it supports only the RV32IM instruction set, which does not meet the growing demand of these integer workloads. In this work, we present a unified Snitch architecture with integer extensions targeting integer workload acceleration. Custom extensions have previously been proposed to address performance bottlenecks in DSP and QNN applications, namely the Xpulpimg ISA and a sub-byte single-instruction-multiple-data (SIMD) ISA, respectively. Both, however, were built on an outdated version of Snitch used in another many-core system, Mempool. We first integrated the DSP-oriented Xpulpimg ISA extension and the sub-byte SIMD ISA extension into mainline Snitch. We then extended the existing floating-point SSRs to support integer data. To evaluate the proposed extensions, we benchmarked the Snitch core complex (CC) with integer matrix multiplication algorithms and compared the performance of our extensions against the RV32IM baseline. We measured speedups of 5.9$\times$, 22.6$\times$, and 77.4$\times$ in multiply-accumulate operations (MACs) per cycle over the baseline for 32-bit, 8-bit, and 4-bit data, respectively.
Post-synthesis area and timing figures were obtained in GlobalFoundries 22 nm technology. Our integer extensions introduced only 12\% area overhead compared with the original floating-point (FP) capable Snitch CC, and they had no measurable impact on the maximum effective frequency with the FP extensions enabled.