### General-Purpose Code Acceleration with Limited-Precision Analog Computation

Renée St. Amant, Amir Yazdanbakhsh, Jongse Park, Bradley Thwaites, Hadi Esmaeilzadeh, Arjang Hassibi, Luis Ceze, Doug Burger

Alternative Computing Technologies (ACT) Lab, Georgia Institute of Technology

Georgia Institute of Technology, University of Washington, The University of Texas at Austin, Microsoft Research

ISCA 2014



**Analog Domain:** input and output (display, communication, sensing)

**Digital Domain:** processing, storage

How can analog circuits accelerate programs written in conventional languages?

- 1) Neural transformation [Esmaeilzadeh et al., MICRO 2012]
- 2) Analog neurons

### Challenges

- Analog circuits are mainly single function
- Instruction control cannot be analog
- Storing intermediate results in analog domain is not effective
- Analog circuits have limited operational range
- **1)** Neural transformation
- 2) Analog neurons

### 1<sup>st</sup> Design Principle

## **Neural Transformation**




Esmaeilzadeh, Sampson, Ceze, Burger, "Neural Acceleration for General-Purpose Approximate Programs," MICRO 2012.

### **A-NPU** acceleration



### 2<sup>nd</sup> Design Principle

## **Analog Neurons**

### Analog Neurons for Accelerated Computation



### Mixed-signal A-NPU





### Limitations of Analog Neuron

- Limited range of operation (e.g., 600 mV)
- Margins for noise resiliency (2-3 mV)
- Limited bit-width
- Topology restrictions
- Circuit non-idealities (e.g., sigmoid)
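The effect of limited bit-width on a single neuron can be sketched in software. This is a minimal behavioral model, not the circuit from the slides: the value ranges (`[-1, 1]` for inputs and weights, `[0, 1]` for the output), the uniform quantizer, and all function names are assumptions made for illustration.

```python
import math

def quantize(x, bits=8, lo=-1.0, hi=1.0):
    """Clamp x to [lo, hi] and round it to one of 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    x = min(max(x, lo), hi)
    step = (hi - lo) / levels
    return lo + round((x - lo) / step) * step

def sigmoid(z, steepness=1.0):
    """Logistic activation; a lower steepness gives a shallower curve."""
    return 1.0 / (1.0 + math.exp(-steepness * z))

def limited_precision_neuron(inputs, weights, bits=8, steepness=1.0):
    """Weighted sum of quantized inputs and weights, then a quantized sigmoid output."""
    q_in = [quantize(x, bits) for x in inputs]    # stands in for the DAC step
    q_w = [quantize(w, bits) for w in weights]    # stands in for the stored weights
    z = sum(x * w for x, w in zip(q_in, q_w))     # accumulation, modeled as ideal here
    return quantize(sigmoid(z, steepness), bits, lo=0.0, hi=1.0)  # stands in for the ADC step

# Hypothetical usage with two inputs.
y = limited_precision_neuron([0.5, -0.25], [0.8, 0.4])
```

Lowering `bits` in this model makes the output visibly coarser, which is the kind of error the training step has to absorb.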

### **3<sup>rd</sup> Design Principle**

## **Compiler-Circuit Co-design**

### **Digital** Compilation Workflow



### **Analog** Compilation Workflow



### (1) Training with Limited Bit-width



Continuous-Discrete Learning Method (CDLM), E. Fiesler, 1990
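A rough sketch of training with limited bit-width, loosely in the spirit of a continuous-discrete scheme: the error is computed with discretized weights, while the updates are applied to full-precision weights. The two-weight linear model, the 4-bit setting, the learning rate, and every name below are illustrative assumptions, and any initial full-precision training stage is omitted.

```python
def discretize(w, bits=8, w_max=1.0):
    """Round w to the nearest signed level in [-w_max, w_max]."""
    half = (1 << (bits - 1)) - 1              # e.g. 7 levels per side for 4 bits
    w = min(max(w, -w_max), w_max)
    return round(w / w_max * half) * w_max / half

def train_discrete_aware(samples, bits=8, lr=0.1, epochs=200):
    """Fit y = w0*x0 + w1*x1 using discretized weights in the forward pass."""
    w = [0.0, 0.0]                             # full-precision weights that receive updates
    for _ in range(epochs):
        for x, y in samples:
            wq = [discretize(wi, bits) for wi in w]
            err = sum(wqi * xi for wqi, xi in zip(wq, x)) - y   # "discrete" error
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]    # continuous update
    return w, [discretize(wi, bits) for wi in w]

# Hypothetical data generated by the weights (0.30, -0.70).
samples = [((1.0, 0.0), 0.30), ((0.0, 1.0), -0.70), ((1.0, 1.0), -0.40)]
w_cont, w_disc = train_discrete_aware(samples, bits=4)
```

The discretized weights returned at the end are what a limited-precision implementation would actually store; they settle near the representable levels closest to the full-precision solution.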

### (2) Training with Topology Restrictions and Non-idealities



- Robust to the topology restrictions
- Tolerates a shallower sigmoid activation steepness across all applications

Resilient Back Propagation (RPROP), M. Riedmiller, 1993
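The core idea of resilient backpropagation is that each weight has its own step size, adapted from the sign of the gradient rather than its magnitude. The sketch below uses a common simplified variant (no weight backtracking, update skipped after a sign flip) on a toy quadratic; the hyperparameters, the function names, and the toy objective are assumptions, not the setup used in the talk.

```python
def sign(v):
    return (v > 0) - (v < 0)

def rprop_minimize(grad_fn, w, steps=100, eta_plus=1.2, eta_minus=0.5,
                   step_init=0.1, step_min=1e-6, step_max=1.0):
    """Minimize a function given its gradient, using sign-based per-weight steps."""
    w = list(w)
    step = [step_init] * len(w)
    prev = [0.0] * len(w)
    for _ in range(steps):
        g = grad_fn(w)
        for i in range(len(w)):
            if g[i] * prev[i] > 0:            # same direction: grow the step
                step[i] = min(step[i] * eta_plus, step_max)
            elif g[i] * prev[i] < 0:          # sign flip: shrink the step
                step[i] = max(step[i] * eta_minus, step_min)
                g[i] = 0.0                    # and skip this update
            w[i] -= sign(g[i]) * step[i]      # only the sign of the gradient is used
            prev[i] = g[i]
    return w

# Toy objective (w0 - 3)^2 + (w1 + 2)^2, chosen only for illustration.
w_opt = rprop_minimize(lambda w: [2 * (w[0] - 3.0), 2 * (w[1] + 2.0)], [0.0, 0.0])
```

Because the update ignores gradient magnitude, the rule is insensitive to how much a shallow sigmoid scales the gradients, which may be one reason it suits the non-ideal activation described above.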

### Measurements

Signal Processing, Robotics, 3D Gaming, Financial Analysis, Compression, Machine Learning, Image Processing

#### Analog A-NPU with 8 Analog Neurons

- Transistor-Level HSPICE Simulation
- Predictive Technology Models (PTM), 45nm
- Vdd: 1.2 V, f: 1.1 GHz

#### **Digital Components**

- Power Models: McPAT, CACTI, and Verilog

#### **Processor Simulator**

- Marssx86 Cycle-Accurate Simulation
- Intel Nehalem-like 4-wide/5-issue OoO processor
- Technology: 45 nm, Vdd: 0.9 V, f: 3.4 GHz



Ranges from 0.8× to 24.5× with Analog NPU

1.2× increase in application speedup with Analog over Digital NPU



Energy savings with the Analog NPU are very close to the ideal case (6.5×)

### **Application quality loss**



Quality loss is below 10% in all cases but one

Based on application-specific quality metrics

### What is left?





#### We cannot reduce the energy of the computation much more.

**3.7×** speedup × **6.3×** energy reduction ≈ **23×** improvement in energy-delay product
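As a rough sanity check, and assuming the energy-delay product (EDP) is defined as energy times execution time, the two geometric-mean factors multiply:

$$
\frac{\mathrm{EDP}_{\text{baseline}}}{\mathrm{EDP}_{\text{A-NPU}}} = \frac{E_{\text{baseline}}}{E_{\text{A-NPU}}} \times \frac{D_{\text{baseline}}}{D_{\text{A-NPU}}} \approx 6.3 \times 3.7 \approx 23.3
$$

The symbols are introduced here for illustration only. Multiplying the two averages approximates, but does not exactly equal, the average of per-benchmark EDP improvements.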

#### Quality Degradation: Avg. 8.2%, Max. 19.7%



- It is still the beginning...
  - 1) **Broad applicability** of the analog computation
  - 2) Prototyping and integrating A-NPU within **noisy** high performance processors
  - 3) Reasoning about the acceptable level of error at the programming level



# **Backup Slides**

### Area Breakdown

| Sub-circuit                        | Area                       |
|------------------------------------|----------------------------|
| **A-NPU**                          |                            |
| 8 × 8-bit DAC                      | 3,096 T                    |
| 8 × Resistor Ladder (8-bit weights) | 4,096 T + 1 kΩ (≈ 450 T)  |
| 8 × Differential Pair              | 48 T                       |
| I-to-V Resistors                   | 20 kΩ (≈ 30 T)             |
| Differential Amplifier             | 244 T                      |
| 8-bit ADC                          | 2,550 T + 1 kΩ (≈ 450 T)   |
| Total                              | ≈ 10,964 T                 |
| **D-NPU**                          |                            |
| 8 × 8-bit multiply-adds            | ≈ 56,000 T                 |
| 8-bit Sigmoid lookup table         | 16,456 T                   |
| Total                              | ≈ 72,456 T                 |

#### 6.6× fewer transistors in the analog neuron implementation
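The totals and the 6.6× ratio can be re-derived from the table entries. The sketch below treats each resistor as its quoted transistor-equivalent area; the dictionary keys are just labels chosen for this example.

```python
# Transistor counts (T) per sub-circuit, with resistors at their quoted equivalents.
a_npu = {
    "8 x 8-bit DAC": 3096,
    "8 x resistor ladder": 4096 + 450,
    "8 x differential pair": 48,
    "I-to-V resistors": 30,
    "differential amplifier": 244,
    "8-bit ADC": 2550 + 450,
}
d_npu = {
    "8 x 8-bit multiply-adds": 56000,
    "8-bit sigmoid lookup table": 16456,
}

a_total = sum(a_npu.values())
d_total = sum(d_npu.values())
ratio = d_total / a_total   # how many times more transistors the digital neuron needs

print(f"A-NPU: {a_total} T, D-NPU: {d_total} T, ratio: {ratio:.1f}x")
```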

### Power Breakdown

| Sub-circuit                       | Percentage of total power |
|-----------------------------------|---------------------------|
| **A-NPU**                         |                           |
| SRAM-accesses                     | 13%                       |
| DAC-Resistor Ladder-Diff Pair-Sum | 54%                       |
| Sigmoid-ADC                       | 33%                       |

**Power numbers vary with applications** 

### Applications

| Domain            | Benchmark    | x86 instructions | Dynamic instructions | Topology    | Error |
|-------------------|--------------|------------------|----------------------|-------------|-------|
| Financial         | blackscholes | 309              | 97.2%                | 6→8→8→1     | 10.2% |
| Signal Processing | fft          | 34               | 67.4%                | 1→4→4→2     | 4.1%  |
| Compression       | jpeg         | 1,257            | 56.3%                | 64→16→8→64  | 8.4%  |
| Robotics          | inversek2j   | 100              | 95.9%                | 2→8→2       | 9.4%  |
| Machine Learning  | kmeans       | 26               | 29.7%                | 6→8→4→1     | 7.3%  |
| 3D Gaming         | jmeint       | 1,079            | 95.1%                | 18→32→8→2   | 19.7% |
| Image Processing  | sobel        | 88               | 57.1%                | 9→8→1       | 5.2%  |
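The arrow notation in the table lists layer sizes from input to output. Assuming the layers are fully connected (an assumption of this sketch, as are the helper names), the number of weights and neurons each network needs follows directly:

```python
def parse_topology(text):
    """Turn a string such as '6→8→8→1' into a list of layer sizes."""
    return [int(part) for part in text.replace(" ", "").split("→")]

def count_connections(layers):
    """Weights between consecutive fully connected layers (biases excluded)."""
    return sum(a * b for a, b in zip(layers, layers[1:]))

def count_neurons(layers):
    """Hidden and output neurons; the input layer only forwards values."""
    return sum(layers[1:])

for name, topo in [("blackscholes", "6→8→8→1"), ("inversek2j", "2→8→2")]:
    layers = parse_topology(topo)
    print(name, layers, count_connections(layers), count_neurons(layers))
```

Each connection corresponds to one multiply-add per invocation, so these counts give a first-order sense of how much work each benchmark hands to the accelerator.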

### Speedup with A-NPU over 8-bit D-NPU



3.3× geometric mean speedup

Ranges from 1.8× to 15.2×

### Energy savings with A-NPU over 8-bit D-NPU



12.1× geometric mean energy reduction

Ranges from 3.7× to 82.2×

### **Dynamic Instruction Reduction**



### Speedup with A-NPU acceleration



3.7× geometric mean speedup

Ranges from 0.8× to 24.5×

### Energy savings with A-NPU acceleration



6.3× geometric mean energy reduction

All benchmarks benefit