Differences

This shows you the differences between two versions of the page.


Previous revision: start [2018/05/12 00:33] Andreas Moshovos
Current revision: start [2019/02/27 22:55] Andreas Moshovos [Deep Learning Acceleration]
Line 4:
I will also be serving as the Director of the newly formed Natural Sciences and Engineering Research Council Strategic Partnership Network on Machine Learning Hardware Acceleration (NSERC COHESA), a partnership of 19 Researchers across 7 Universities involving 8 Industrial Partners.
  
-For the work I have done with my students and collaborators, I have been awarded the [[https://www.sigarch.org/awards/acm-sigarch-maurice-wilkes-award/|ACM SIGARCH Maurice Wilkes mid-career award]], a National Science Foundation CAREER Award, two IBM Faculty Partnership awards, a Semiconductor Research Innovation award, an IEEE Top Picks in Computer Architecture selection, and a MICRO conference Hall of Fame award. I have served as Program Chair for the ACM/IEEE International Symposium on Microarchitecture and the ACM/IEEE International Symposium on Performance Analysis of Systems and Software. I am also a Fellow of the ACM.
+For the work I have done with my students and collaborators, I have been awarded the [[https://www.sigarch.org/awards/acm-sigarch-maurice-wilkes-award/|ACM SIGARCH Maurice Wilkes mid-career award]], a National Science Foundation CAREER Award, two IBM Faculty Partnership awards, a Semiconductor Research Innovation award, an IEEE Top Picks in Computer Architecture selection, and a MICRO conference Hall of Fame award. I have served as Program Chair for the ACM/IEEE International Symposium on Microarchitecture and the ACM/IEEE International Symposium on Performance Analysis of Systems and Software. I am also a Fellow of the ACM and a Faculty Affiliate of the [[https://vectorinstitute.ai|Vector Institute]].
  
  
Line 11:
===== Deep Learning Acceleration =====
  
-We are developing, designing, and demonstrating a novel class of hardware accelerators for Deep Learning networks whose key feature is that they are **value-based**. [[DLA_valuebased|What does this mean?]] Conventional accelerators rely mostly on the structure of computation, that is, which calculations are performed and how they communicate. Value-based accelerators further boost performance by taking advantage of expected properties of the values computed at runtime, such as dynamically redundant or ineffectual computations, the distribution of values, or even their bit content. In short, our accelerator designs reduce the amount of work that needs to be performed for existing neural models, and they do so transparently to the model designer. Why are we pursuing these designs? Because Deep Learning is transforming our world by leaps and bounds. One of the three drivers behind Deep Learning's success is the computing hardware that enabled its first practical applications. While algorithmic improvements will allow Deep Learning to evolve, much hinges on hardware's ability to keep delivering ever higher performance and ever greater data storage and processing capability. As Dennard scaling has ceased, the only viable way to do so is through architecture specialization.
  
-The figure below highlights the potential of, and hence the motivation for, some of the methods we have developed:
+**Value-Based Acceleration:** We are developing methods that reduce the work, storage, and communication needed to execute Deep Learning models. We target optimizations at the middleware (software) and hardware levels so that they benefit out-of-the-box models and do not require intervention from the Machine Learning expert; developing models is hard enough already. Our methods rely on value properties exhibited by typical models, such as value- and bit-sparsity and variability in the data types actually needed. Our methods do, however, reward model optimizations: for example, they reward quantization to smaller data widths where possible, yet still provide benefits for non-quantized models. The same holds for sparsity.
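
To make the value properties named above concrete, here is a minimal, hypothetical sketch (illustrative only, not code from any of the designs or articles mentioned on this page) that measures them for a toy tensor of 16-bit fixed-point values: the fraction of zero values (value sparsity), the fraction of zero bits (bit sparsity), and the precision each group of 16 values actually needs. The tensor contents, bit width, and group size are assumptions made for the example.

<code python>
# Illustrative sketch only: measure value sparsity, bit sparsity, and the
# per-group precision actually needed for a toy 16-bit fixed-point tensor.
import numpy as np

def value_sparsity(x):
    """Fraction of values that are exactly zero."""
    return float(np.mean(x == 0))

def bit_sparsity(x, bits=16):
    """Fraction of zero bits: only the 1 bits generate effectual work."""
    ones = sum(bin(abs(int(v))).count("1") for v in x)
    return 1.0 - ones / (len(x) * bits)

def per_group_precision(x, group=16):
    """Bits needed by each group of values (dynamic, per-group precision)."""
    x = np.abs(np.asarray(x))
    x = np.pad(x, (0, (-len(x)) % group))
    return [int(g.max()).bit_length() for g in x.reshape(-1, group)]

rng = np.random.default_rng(0)
# Toy "activations": many zeros and mostly small magnitudes,
# mimicking what typical models exhibit.
act = rng.integers(0, 2, 4096) * rng.integers(0, 256, 4096)
print("value sparsity :", value_sparsity(act))
print("bit sparsity   :", bit_sparsity(act))
print("avg group bits :", float(np.mean(per_group_precision(act))))
</code>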
  
-{{ :wiki:mlpotential.gif?1024 |}}
+See an overview article here: [[https://ieeexplore.ieee.org/document/8364645|Exploiting Typical Values to Accelerate Deep Learning]].
-**A:** remove zero Activations, **W:** remove zero Weights, **Ap:** use dynamic, per-group precision for Activations, **Ae:** skip ineffectual terms after Booth-encoding the Activations. We also have designs that exploit Weight precision (see LOOM below) and yet-to-be-released designs that exploit further properties :)
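
As a rough, back-of-the-envelope illustration of where such potential comes from (a hypothetical sketch on synthetic data, not measurements from real networks or from the designs above), the following counts how many products survive when zero Activations (A) or zero Weights (W) are skipped, and how many effectual terms remain when activations are Booth-encoded (Ae):

<code python>
# Illustrative sketch only: potential work reduction from skipping zero
# activations (A), zero weights (W), and ineffectual Booth-encoded
# activation terms (Ae), estimated on synthetic data.
import numpy as np

def booth_terms(v, bits=16):
    """Number of nonzero radix-4 Booth digits of v, i.e., its effectual terms."""
    v = int(v) & ((1 << bits) - 1)
    b = [(v >> i) & 1 for i in range(bits)]
    terms, prev = 0, 0
    for i in range(0, bits, 2):
        if -2 * b[i + 1] + b[i] + prev != 0:
            terms += 1
        prev = b[i + 1]
    return terms

rng = np.random.default_rng(1)
n = 1 << 16
act = rng.integers(0, 2, n) * rng.integers(0, 256, n)  # sparse, small activations
wgt = rng.integers(-128, 128, n)                       # dense 8-bit-ish weights

a_kept   = np.count_nonzero(act)              # A  : skip products with a zero activation
aw_kept  = np.count_nonzero(act * wgt)        # A+W: skip if either operand is zero
ae_terms = sum(booth_terms(a) for a in act)   # Ae : effectual activation terms only

print("A   potential speedup:", n / a_kept)
print("A+W potential speedup:", n / aw_kept)
print("Ae  potential speedup (vs 16 terms/product):", n * 16 / ae_terms)
</code>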
  
-This [[https://ieeexplore.ieee.org/document/8259428/|IEEE MICRO]] article and an IEEE Computer article present our rationale and summarize some of our designs. The most recently publicly disclosed design is [[https://arxiv.org/abs/1803.03688|Bit-Tactical]], which targets, but does not require, sparse networks.
- 
- 
-The tables below summarize key characteristics of some of our designs:
-{{ :wiki:summary2.gif?1024 |}}

-{{ :wiki:summary1.gif?1024 |}}
- 
-===== Laconic ===== 
- 
-Our most recent design targets the effectual "bit" content of both activations and weights. The potential reduction in work is consistently two orders of magnitude compared to processing full 16b values. Like all our other designs, Laconic does not sacrifice accuracy. The graph below shows the work reduction and, equivalently, the potential performance improvement.
-{{ :laconic_potential.png?600 |}}
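
To illustrate how such a bound is computed, here is a minimal, hypothetical sketch (synthetic data, with nonzero magnitude bits as a simple stand-in for Booth-encoded terms): each product costs (effectual activation terms) × (effectual weight terms) single-bit operations, versus 16 × 16 = 256 when both operands are processed as full 16-bit values. The data and the resulting number are illustrative only, not the measurements plotted above.

<code python>
# Illustrative sketch only: Laconic-style work bound on synthetic data.
# Per product: (effectual activation terms) x (effectual weight terms)
# single-bit operations, vs. 16 x 16 = 256 for full 16-bit operands.
import numpy as np

def effectual_terms(v):
    """Nonzero magnitude bits of v (a simple stand-in for Booth terms)."""
    return bin(abs(int(v))).count("1")

rng = np.random.default_rng(2)
n = 1 << 16
act = rng.integers(0, 2, n) * rng.integers(0, 256, n)  # sparse, small activations
wgt = rng.integers(-128, 128, n)                       # dense 8-bit-ish weights

work = sum(effectual_terms(a) * effectual_terms(w) for a, w in zip(act, wgt))
baseline = n * 16 * 16
print("potential work reduction:", baseline / max(work, 1))
</code>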
- 
-And here are the performance improvements achieved by different configurations of Laconic vs. a DaDianNao configuration with 1 tile, processing 8 filters and 16 products per filter. The subscript is the number of wires used in the Weight Memory interface. The baseline uses filters x weights x bits = 8 x 16 x 16 = 2K wires.
- 
-{{ :laconic_actual_performance.png?400 |}}