MemA

Fast Inference of Multiple Deep Models

Conference paper (2021)

Authors

Jeroen Galjaard, Student

B.A. Cox, Data-Intensive Systems

S. Ghiassi, Data-Intensive Systems

Y. Chen, Data-Intensive Systems

Lydia Y. Chen, Data-Intensive Systems

Robert Birke, ABB Research Switzerland

Research Group

Data-Intensive Systems (TU Delft)

Scheduling, Multi-inference, Deep neural networks, Edge computing, Constrained memory, Memory aware

To reference this document use:

http://resolver.tudelft.nl/uuid:7bde3164-6814-40fe-be83-05402a5bb85d

More Info

Published Date

2021

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Faculty

Electrical Engineering, Mathematics and Computer Science

Department

Software Technology

Abstract

The execution of deep neural network (DNN) inference jobs on edge devices has become increasingly popular. Multiple such inference models can concurrently analyse on-device data, e.g. images, to extract valuable insights. Prior art focuses on low-power accelerators, compressed neural network architectures, and specialized frameworks to reduce the execution time of single inference jobs on resource-constrained edge devices. However, little is known about how different scheduling policies can further improve the runtime performance of multi-inference jobs without additional edge resources. To enable the exploration of scheduling policies, we first develop an execution framework, EdgeCaffe, which splits DNN inference jobs into the loading and execution of each network layer. We empirically characterize the impact of loading and scheduling policies on the execution time of multi-inference jobs and point out their dependency on the available memory space. We propose a novel memory-aware scheduling policy, MemA, which opportunistically interleaves the executions of different types of DNN layers based on their estimated run-time memory demands. Our evaluation on exhaustive combinations of five networks, data inputs, and memory configurations shows that MemA can alleviate the degradation of execution times of multi-inference jobs (up to 5×) under severely constrained memory compared to standard scheduling policies, without affecting accuracy.
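
The idea of splitting each layer into a load task and an execute task, and interleaving those tasks across networks according to estimated memory demand, can be illustrated with a toy sketch. This is not the MemA policy or the EdgeCaffe API, neither of which is specified on this page. It is a simple greedy heuristic written under stated assumptions: loading a layer occupies its estimated memory, executing it releases that memory, and layers of one network execute in order. The names `Net`, `mem_aware_schedule`, and `budget`, and all memory figures, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Net:
    name: str
    layer_mem: list      # estimated memory per layer (hypothetical units)
    loaded: int = 0      # number of layers loaded so far
    executed: int = 0    # number of layers executed so far


def mem_aware_schedule(nets, budget):
    """Toy greedy scheduler (an assumption, not the paper's algorithm).

    Returns (order, peak): the ordered (net, task, layer) steps and the
    peak memory in use, which never exceeds `budget`.
    """
    in_use, peak, order = 0, 0, []
    while any(n.executed < len(n.layer_mem) for n in nets):
        # 1) Load the smallest pending layer that still fits in memory.
        fits = [n for n in nets
                if n.loaded < len(n.layer_mem)
                and in_use + n.layer_mem[n.loaded] <= budget]
        if fits:
            n = min(fits, key=lambda x: x.layer_mem[x.loaded])
            order.append((n.name, "load", n.loaded))
            in_use += n.layer_mem[n.loaded]
            peak = max(peak, in_use)
            n.loaded += 1
            continue
        # 2) Nothing fits: execute the loaded layer that frees the most memory.
        runnable = [n for n in nets if n.executed < n.loaded]
        if not runnable:
            raise MemoryError("a pending layer exceeds the memory budget")
        n = max(runnable, key=lambda x: x.layer_mem[x.executed])
        order.append((n.name, "exec", n.executed))
        in_use -= n.layer_mem[n.executed]
        n.executed += 1
    return order, peak


if __name__ == "__main__":
    nets = [Net("A", [30, 10]), Net("B", [20, 40])]
    for step in mem_aware_schedule(nets, budget=50)[0]:
        print(step)
```

In this sketch, loads from different networks fill the budget first, and executions are used to free memory once no further layer fits, so the tasks of the two networks end up interleaved rather than run one network at a time.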

Files

MemA_Fast_Inference_of_Multipl... (pdf)

(pdf | 0.395 Mb)

Unknown license

Download not available