# **Deploying DNNs in the Embedded Space: Challenges and Opportunities**

**Christos-Savvas Bouganis** 



**About myself** 



# Intelligent Digital Systems Lab

#### The team



**Aditya Rajagopal** Machine Learning



**Alexander Montgomerie** Hardware Acceleration for Machine Learning



**Zhewen Yu** Machine Learning



Machine Learning,

Machine Learning

**Petros Toupas** 

Machine Learning

Robotics

**Diederik Vink** Machine Learning



**Mudhar Bin Rabieah**



**Georgios Zampokas** Computer Vision, Machine Learning







```
Welcome to the Intelligent Digital
Systems Lab at Imperial College
```




The iDSL lab is part of the Electrical and Electronic Engineering Department of Imperial College London.






**Our vision**

# To research and develop intelligent autonomous systems

"see", "understand", "process"

#### Some of our work

- Autonomous Navigation
- Human Pose Estimation
- Localisation and Mapping
- Traffic Detection
- fpgaConvNet
- Multi-CNN Deployment
- Time-constrained LSTM Inference
- Data-Driven CNN Inference

This demo is based on the work of Engel et al. (LSD-SLAM), for which iDSL has developed a custom FPGA-based hardware architecture.

#### A bit of history: Artificial Intelligence – Machine Learning – Deep Neural Networks

| Time   | Milestone                                                    |
|--------|--------------------------------------------------------------|
| <1950s | Statistical Model                                            |
| 1950s  | The term "Machine Learning" was used                         |
| 1990s  | Shift from a knowledge-driven to a data-driven approach      |
| 2000s  | Supervised ML methods (SVM, Kernel Methods)                  |
| 2009   | Power of many and real-world examples: ImageNet is created   |
| 2010s  | Deep Neural Networks: performance improvement with data      |

Nesting of the fields: Artificial Intelligence ⊃ Machine Learning ⊃ Deep Neural Networks ⊃ CNNs

ImageNet Challenge (IMAGENET):

- 1,000 object classes (categories)
- Images: 1.2 M train, 100k test

#### **Convolutional Neural Networks**




#### Models – Where we are today



- Many models, each trading off complexity against accuracy
- Top-1 accuracy 82% (increase of 30 pp)
- 20x higher computational complexity




## **DNNs in the Embedded Space – Variability in Performance Requirements**







M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, 2014, pp. 10-14.

## **Efficiency comes from customisation**


## Putting things in perspective – What customisation buys you

#### Impact on LSTM-based Image Captioning – Computations tailored to the architecture




A. Kouris, S. Venieris, M. Rizakis and C.S. Bouganis, "Approximate LSTMs for Time-Constrained Inference: Enabling Fast Reaction in Self-Driving Cars", in IEEE Consumer Electronics Magazine, 2019.

#### Algorithm-Hardware Co-design



## **CNN acceleration through an FPGA**






#### Characteristics

- Custom datapath
- Custom memory subsystem
- Programmable interconnections

- Reconfigurability
- Heterogeneous
- Difficult to program

## The Challenge of the Mapping Problem







| FPGA resource     | Amount |
|-------------------|--------|
| Logic cells (LC)  | 2M     |
| BRAMs (36 kbit)   | 1,880  |
| DSPs              | 3,360  |

#### Specifications

- Latency
- Throughput
- Power consumption

## Challenges:

- Diversity of operations in modern NNs
- Diversity and resources of modern FPGAs
- Competition (or need for performance)
- Large number of parameters in the target architecture





## **Challenge #1: Automated CNN-to-FPGA Toolflow**





- ConvNet Inference
  - Tailored to images and data with spatial patterns
  - Built as a sequence of layers (Convolutional, Nonlinearity and Pooling layers)
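As a rough illustration of "a sequence of layers", the sketch below chains a convolution, a ReLU nonlinearity and a max-pooling step in plain Python. The function names, the 5x5 input and the 2x2 kernel are hypothetical and chosen only to keep the example small; it says nothing about how any particular framework or the toolflow in this talk implements these layers.

```python
from typing import Callable, List

Matrix = List[List[float]]


def conv2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Stride-1 convolution without padding (cross-correlation, as in CNNs)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def relu(fmap: Matrix) -> Matrix:
    """Element-wise nonlinearity: negative values are clamped to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap: Matrix, size: int = 2) -> Matrix:
    """Non-overlapping max pooling over size x size blocks."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


def run_network(image: Matrix, layers: List[Callable[[Matrix], Matrix]]) -> Matrix:
    """Inference is simply the layers applied one after another."""
    out = image
    for layer in layers:
        out = layer(out)
    return out


# Hypothetical input (a 5x5 ramp) and a 2x2 diagonal-difference kernel.
image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
kernel = [[-1.0, 0.0], [0.0, 1.0]]
layers = [lambda x: conv2d(x, kernel), relu, max_pool]
result = run_network(image, layers)
```

Because every layer consumes the previous layer's output in a fixed order, the same chain can be read as a pipeline of hardware blocks, which is the view the streaming architecture takes.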



## fpgaConvNet – Streaming Architecture for CNNs



Figure: a CNN and its corresponding hardware SDF graph, built from Sliding Window, Conv, Nonlin, Pool and Fork units.

Complex Model  $\rightarrow$  Bottlenecks:

- Limited compute resources
- Limited on-chip memory capacity for model parameters
- Limited off-chip memory bandwidth



Define a set of **graph transformations** to traverse the design space in a **fast** and **principled** way



## **Transformation 3: Graph Partitioning with Reconfiguration**



## **Transformation 4: Weights Reloading**



- Synchronous Dataflow Modelling
  - Capture hardware mappings as matrices
  - Transformations as *algebraic operations*
  - Analytical *performance model*
  - Cast design space exploration as a mathematical optimisation problem
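To make "hardware mappings as matrices" more concrete, here is a minimal sketch based on standard synchronous dataflow theory rather than on the toolflow's own code. A topology matrix has one row per arc and one column per node, with tokens produced as positive entries and tokens consumed as negative entries; solving the balance equations gives how often each node must fire per iteration. The three-node graph and its rates are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd
from typing import List

# Hypothetical pipeline: sliding window -> conv -> pool.
NODES = ["window", "conv", "pool"]

# One row per arc, one column per node.
# +p: tokens produced per firing, -c: tokens consumed per firing.
GAMMA = [
    [9, -9, 0],   # window emits a 3x3 patch (9 values), conv consumes 9
    [0, 1, -4],   # conv emits 1 value, pool consumes a 2x2 block (4 values)
]


def repetition_vector(gamma: List[List[int]]) -> List[int]:
    """Smallest positive integer vector q such that gamma @ q == 0."""
    n = len(gamma[0])
    q: List = [None] * n
    q[0] = Fraction(1)
    changed = True
    while changed:  # propagate firing ratios along the arcs
        changed = False
        for row in gamma:
            src = next(i for i, v in enumerate(row) if v > 0)
            dst = next(i for i, v in enumerate(row) if v < 0)
            prod, cons = row[src], -row[dst]
            if q[src] is not None and q[dst] is None:
                q[dst] = q[src] * prod / cons
                changed = True
            elif q[dst] is not None and q[src] is None:
                q[src] = q[dst] * cons / prod
                changed = True
    if any(v is None for v in q):
        raise ValueError("graph is not connected")
    # Scale the rational solution to the smallest integers.
    scale = reduce(lambda a, b: a * b // gcd(a, b), (v.denominator for v in q), 1)
    ints = [int(v * scale) for v in q]
    common = reduce(gcd, ints)
    ints = [v // common for v in ints]
    # Check the balance equation on every arc.
    for row in gamma:
        if sum(r * x for r, x in zip(row, ints)) != 0:
            raise ValueError("inconsistent rates: no periodic schedule exists")
    return ints


firings = dict(zip(NODES, repetition_vector(GAMMA)))
```

Once a design point is a matrix, a transformation such as changing a unit's parallelism becomes an edit to its entries, which is what makes an analytical performance model and a mathematical optimisation formulation possible.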



$$t_{\text{total}}(B, N_P, \mathbf{\Gamma}) = \sum_{i=1}^{N_P} t_i(B, \mathbf{\Gamma}_i) + (N_P - 1) \cdot t_{\text{reconfig}}$$
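The expression above can be evaluated directly. The sketch below reads $B$ as the batch size, $N_P$ as the number of partitions and $\mathbf{\Gamma}_i$ as the hardware mapping of partition $i$, which is an interpretation since the slide does not define the symbols. It also assumes, purely for illustration, that each partition's time grows linearly with the batch size; all numbers are hypothetical.

```python
from typing import Sequence


def total_time_ms(batch_size: int,
                  per_input_ms: Sequence[float],
                  t_reconfig_ms: float) -> float:
    """Sum of the partition times plus (N_P - 1) reconfigurations.

    Assumption: t_i(B) = B * per_input_ms[i], a linear stand-in for the
    analytical per-partition model.
    """
    partition_times = [batch_size * t for t in per_input_ms]
    n_p = len(partition_times)
    return sum(partition_times) + (n_p - 1) * t_reconfig_ms


# Hypothetical design: two partitions and a 100 ms reconfiguration.
PER_INPUT_MS = [2.0, 3.0]
T_RECONFIG_MS = 100.0

# Time per input once the reconfiguration cost is spread over the batch.
amortised_ms = {
    b: total_time_ms(b, PER_INPUT_MS, T_RECONFIG_MS) / b
    for b in (1, 10, 100)
}
```

Under these assumptions the reconfiguration term dominates at a batch size of 1 and fades as the batch grows, which is why partitioning with reconfiguration suits throughput-driven workloads better than latency-driven ones.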


## **Challenge #2: Multi-CNN Systems – Autonomous Drones**



#### Challenge #2: Multi-DNN System





## **Key characteristics**

- Latency is relevant: Reconfiguration is not an option
- One highly customisable hardware engine per CNN
- Hardware scheduler to control memory access schedule

## **Multi-CNN Hardware Architecture**


**Key characteristics** 

- One highly customisable hardware engine per CNN
- Hardware scheduler to control memory access schedule



## **Proposed Design Space Exploration Method**



- Memory contention
  - Problem 1: Performance model != actual performance (scheduler)
  - Problem 2: Memory bandwidth is not fully utilised
- CNN inference over a stream of inputs
  - Cast as a cyclic scheduling problem
  - Search for a periodic solution
- An optimal ILP scheduler has very high runtimes for large problems
- Develop a heuristic Resource Constrained List Scheduler (RCLS)
- Key points:
  - Scheduler exposed in the engine design optimisation process
  - Introduce slow-downs => fine control over bandwidth
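The idea behind a resource-constrained list scheduler can be sketched with memory bandwidth as the single shared resource. The code below is a generic greedy list scheduler, not the RCLS from this work: it schedules one pass of independent tasks rather than a cyclic stream, and the task names, durations and bandwidth figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Task:
    name: str
    duration: float   # execution time, arbitrary units
    bandwidth: float  # memory bandwidth needed while running


def list_schedule(tasks: List[Task], capacity: float) -> Tuple[Dict[str, float], float]:
    """Greedy list scheduling under a bandwidth cap.

    Tasks are considered in list (priority) order; a task starts as soon
    as enough bandwidth is free. Returns start times and the makespan.
    """
    for task in tasks:
        if task.bandwidth > capacity:
            raise ValueError(f"{task.name} can never fit in the bandwidth budget")
    pending = list(tasks)
    running: List[Tuple[float, Task]] = []  # (finish time, task)
    start: Dict[str, float] = {}
    now = 0.0
    while pending or running:
        used = sum(t.bandwidth for _, t in running)
        for task in list(pending):
            if used + task.bandwidth <= capacity + 1e-9:
                start[task.name] = now
                running.append((now + task.duration, task))
                pending.remove(task)
                used += task.bandwidth
        # Jump to the next completion and release its bandwidth.
        now = min(finish for finish, _ in running)
        running = [(f, t) for f, t in running if f > now]
    return start, now


# Hypothetical workload: three engines sharing 10 units of bandwidth.
tasks = [Task("A", 4.0, 6.0), Task("B", 2.0, 3.0), Task("C", 3.0, 5.0)]
start_times, makespan = list_schedule(tasks, capacity=10.0)
```

In this example task C has to wait for A to finish even though bandwidth sits idle after B completes, which is the kind of under-utilisation that slow-downs are meant to address.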

#### The effect of slow-downs

Available memory bandwidth: 2 GB/s

| Subgraph          | Layers                   | Scheduler: bandwidth requirement | Scheduler: exec time | Slow-down | Scheduler + slow-downs: bandwidth requirement | Scheduler + slow-downs: exec time |
|-------------------|--------------------------|----------------------------------|----------------------|-----------|-----------------------------------------------|-----------------------------------|
| CNN1 - Subgraph 1 | CONV 7x7, ReLU, MAX POOL | 1.5 GB/s                         | 0.05 ms              | 0.8x      | 1.2 GB/s                                      | 0.062 ms                          |
| CNN2 - Subgraph 1 | CONV 5x5, ReLU, MAX POOL | 0.25 GB/s                        | 0.025 ms             | 0.8x      | 0.2 GB/s                                      | 0.031 ms                          |
| CNN3 - Subgraph 1 | CONV 5x5, ReLU           | 0.75 GB/s                        | 0.02 ms              | 0.75x     | 0.56 GB/s                                     | 0.026 ms                          |

Resulting schedule length: 0.07 ms with the scheduler alone, 0.0625 ms with the scheduler plus slow-downs.
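The figures on this slide are consistent with a simple model in which running an engine at a fraction of its full speed scales its bandwidth requirement by that fraction and its execution time by the inverse. That model is inferred from the numbers, not stated on the slide, so the sketch below should be read as a plausibility check rather than the authors' method.

```python
CAPACITY_GBPS = 2.0

# (subgraph, exec time in ms, bandwidth in GB/s, slow-down factor) from the slide.
ENGINES = [
    ("CNN1 - Subgraph 1", 0.05, 1.5, 0.8),
    ("CNN2 - Subgraph 1", 0.025, 0.25, 0.8),
    ("CNN3 - Subgraph 1", 0.02, 0.75, 0.75),
]


def apply_slowdown(exec_ms: float, bandwidth: float, factor: float):
    """Assumed model: bandwidth scales by factor, time by 1 / factor."""
    return exec_ms / factor, bandwidth * factor


slowed = [(name, *apply_slowdown(t, bw, s)) for name, t, bw, s in ENGINES]

total_bw_before = sum(bw for _, _, bw, _ in ENGINES)
total_bw_after = sum(bw for _, _, bw in slowed)

# Can all three subgraphs stream from memory at the same time?
fits_before = total_bw_before <= CAPACITY_GBPS
fits_after = total_bw_after <= CAPACITY_GBPS

# If they can, the schedule is as long as the slowest engine.
makespan_after_ms = max(t for _, t, _ in slowed)
```

At full speed the three subgraphs need 2.5 GB/s in total, so they cannot all run together; CNN1 and CNN2 fit side by side and CNN3 then follows, giving 0.05 + 0.02 = 0.07 ms. With the slow-downs the total drops to about 1.96 GB/s, all three run concurrently, and the schedule shortens to 0.0625 ms even though each engine is individually slower.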

## **Comparison with Embedded GPUs**



- Latency-driven scenario  $\rightarrow$  batch size of 1
- Up to 19.09× speedup with an average of 6.85× (geo. mean)



#### Performance-per-Watt: f-CNN<sup>x</sup> vs. TX1

- Latency-driven scenario  $\rightarrow$  batch size of 1
- Up to 9.61× speedup with an average of 2.76× (geo. mean)

## **Conclusions**

- Customisation is key, but also a challenge in the design of DNN systems
- We need toolflows to support deployment of DNNs in the embedded space
  - Many choices, high-dimensional space
- Exposing the hardware capabilities to the algorithm can lead to performance gains
  - Challenging task
  - Rethink current approaches to fully utilise the underlying hardware




#### What we are looking into...






Co-optimise topology and hardware architecture



![](_page_34_Figure_4.jpeg)

*HW architecture (latency, throughput, resources)* 

![](_page_35_Picture_0.jpeg)


Adversarial attacks on DNNs and how to prevent them

![](_page_35_Picture_3.jpeg)

# Tesla "sees" 85

McAfee Advanced Threat Research (ATR). Feb 2020

#### Opportunities at Imperial

- MSc Programmes
  - Analogue and Digital Integrated Circuit Design
  - Applied Machine Learning
  - Communications and Signal Processing
  - Control and Optimisation
  - Future Power Networks
- PhD Programme
  - Scholarships available for top students

![](_page_36_Picture_10.jpeg)

![](_page_36_Picture_11.jpeg)

![](_page_36_Picture_12.jpeg)

### Questions


![](_page_37_Picture_3.jpeg)


![](_page_37_Picture_4.jpeg)


![](_page_37_Picture_8.jpeg)


![](_page_37_Picture_9.jpeg)


![](_page_37_Picture_10.jpeg)