University of Toronto, Electrical and Computer Engineering Dept.
I am a postdoctoral fellow in The Edward S. Rogers Sr. Department of Electrical and Computer Engineering at the University of Toronto, where my supervisor is Andreas Moshovos. I received my PhD in 2019 from the University of Toronto, where I designed machine learning accelerators for both training and inference targeting data centers and edge devices. I received my MSc (2013) and BSc with Honours (2009) degrees in Computer Engineering from the Faculty of Engineering, Cairo University. I was an R&D Senior Software Engineer at Fingerprint Consultancy Egypt from August 2009 to July 2013.
My research interests include computer architecture, data center memory systems, computational imaging, and machine learning acceleration.
During my PhD, I pursued a number of projects under the supervision of Prof. Moshovos covering memory systems for cloud workloads, GPU-CPU heterogeneous architectures, computational imaging, and machine learning acceleration. More on each of these projects can be found on my "Research" page. My research has been supported by an NSERC Engage Grant, an ECE department fellowship, the Connaught Scholarship for top international PhD students, and an RBC Post-Doctoral Entrepreneur Fellowship.
Under the supervision of Prof. Amr Wassal, my MSc thesis proposed a new cache directory scheme, the hybrid limited-pointer linked-list (HLPLL) directory, for many-core Chip Multiprocessors (CMPs). HLPLL combines the scalability and low coherence traffic of the linked-list directory scheme with the parallel operation of the limited-pointer directory.
My hobbies include playing soccer, swimming, traveling, and watching documentaries, especially historical ones. I also enjoy reading novels and political articles.
University of Toronto, Electrical and Computer Engineering Dept.
University of Toronto, Electrical and Computer Engineering Dept.
Cairo University, Computer Engineering Dept.
Cairo University, Computer Engineering Dept.
Fingerprint Consultancy, R&D Team
Clausthal University of Technology, Car Ring II project
Ph.D. in Computer Engineering
University of Toronto
Master of Science in Computer Engineering
Bachelor of Science in Computer Engineering
- TensorDash: a deep learning accelerator that exploits sparsity in activations, weights, and gradients during training to achieve a 1.95x speedup and 1.5x better energy efficiency vs. Tensorcore-based accelerators with less than 5% area overhead.
- Bit-Tactical: a deep neural network hardware accelerator that exploits weight sparsity, per-layer precision variability, dynamic fine-grained precision tuning for activations, and optionally the sparse effectual bit content naturally occurring in binary representations. Bit-Tactical improves performance by 5.05x and is 2.98x more energy efficient than an equivalent data-parallel accelerator.
- Laconic: a deep neural network accelerator that decomposes convolutions down to the bit level, reducing the amount of work needed for inference by two orders of magnitude. With a 1K-wire weight memory interface, Laconic is 15.4x faster and 1.95x more energy efficient than a conventional accelerator with a 2K-wire weight memory interface.
Accelerating computational imaging applications such as image denoising, sharpening, deblurring, demosaicking, and super-resolution.
- Diffy: a hardware DNN accelerator that exploits unique characteristics of Computational Imaging Deep Neural Networks (CI-DNNs) to reduce the computation, communication, and storage needed. CI-DNNs are fully convolutional DNN models that perform per-pixel prediction on high-resolution HD and 4K input frames. Diffy boosts the performance of CI-DNNs by 7.1x over a state-of-the-art value-agnostic data-parallel accelerator, needs 2.8x less on-chip storage, and is 1.83x more energy efficient when only on-chip power is considered. Overall energy efficiency is much higher still, as Diffy reduces the expensive off-chip traffic by 4.57x. Diffy is robust and not limited to CI-DNNs: it improves performance even for traditional image classification models.
- IDEAL: a hardware ASIC accelerator for BM3D that incorporates Matches Reuse, a novel software/hardware co-design optimization, to reduce the needed computation and off-chip bandwidth. IDEAL can be practically integrated into cameras and other mobile imaging systems. IDEAL boosts performance by 11,352x on average over an Intel Xeon E5-2650 v2 CPU and by 597x over a GTX 980 GPU, and is orders of magnitude more energy efficient than both.
GPU acceleration of a software 3D scanner algorithm that constructs a 3D model of an object from a stack of 2D images taken at different focal lengths. Achieved an 18.4x speedup vs. a CPU implementation (35.4x excluding CPU-GPU memory transfer time). The work was done as a research-based project for a graduate course. The final technical report and presentation are available online.
Researching the suitability of state-of-the-art memory controller designs for emerging cloud workloads, including NoSQL data stores, MapReduce, streaming, web services, and search engines. The project outlines design recommendations for memory controllers to better match the unique characteristics of cloud workloads.
Investigating state-of-the-art memory scheduling algorithms in heterogeneous systems where CPUs and GPU share the DRAM, specifically in mobile systems. The project contributes the following:
- Enhancing the memory controller model in Gem5-GPU to enable plugging in and testing different memory scheduling algorithms
- Identifying deficiencies from the simulation results and suggesting better scheduling algorithms
- Comparing the performance of the suggested algorithms to the state-of-the-art
Implemented a source-to-source compiler that analyzes sequential loops and data dependencies to automatically convert C++ loops into CUDA kernels. The required memory transfers before and after each kernel call are auto-generated. The compiler is built using ROSE, an open-source compiler framework. The project was part of a research-based graduate course. The final technical report and presentation are available online.
GPU acceleration of in-memory database compression/decompression. We achieve an average speedup of 3x over an Intel Xeon 3.2 GHz CPU (including CPU-GPU memory transfer time), while a state-of-the-art VPU SSE implementation is only 1.58x faster. Excluding CPU-GPU memory transfers, our speedup ranges from 41.5x to 87.7x depending on the compression ratio. Results are shown here.
TensorDash is a hardware-based technique that enables data-parallel MAC units to take advantage of sparsity in their input operand streams. When used to compose a hardware accelerator for deep learning, TensorDash can speed up the training process while also increasing energy efficiency. TensorDash combines a low-cost sparse input operand interconnect with an area-efficient hardware scheduler. The scheduler can effectively extract sparsity in the activations, the weights, and the gradients. Over a wide set of state-of-the-art models covering various applications, TensorDash accelerates the training process by 1.95× while being 1.5× more energy efficient when incorporated on top of a Tensorcore-based accelerator at less than 5% area overhead. TensorDash is datatype agnostic, and we demonstrate it with IEEE-standard mixed-precision floating-point units and with BFloat16, a popular floating-point format optimized for machine learning.
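To illustrate the core idea, here is a minimal software sketch (function names and the 4-lane width are my own illustrative choices; the actual TensorDash hardware uses a limited-lookahead scheduler rather than the ideal packing shown here):

```python
def dense_mac_cycles(activations, weights, lanes=4):
    """Cycles a value-agnostic data-parallel unit needs: one pass over every pair."""
    assert len(activations) == len(weights)
    return -(-len(activations) // lanes)  # ceiling division

def sparse_mac_cycles(activations, weights, lanes=4):
    """Cycles when a scheduler forwards only effectual (non-zero x non-zero) pairs."""
    effectual = [(a, w) for a, w in zip(activations, weights) if a != 0 and w != 0]
    return max(1, -(-len(effectual) // lanes))

acts = [0.5, 0.0, 1.2, 0.0, 0.0, 3.1, 0.0, 0.7]
wts  = [1.0, 2.0, 0.0, 0.5, 1.1, 0.2, 0.0, 0.9]
# only 3 of the 8 pairs are effectual, so a 4-lane unit needs 1 cycle instead of 2
```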
We present FPRaker, a processing element for composing training accelerators. FPRaker processes several floating-point multiply-accumulate operations concurrently and accumulates their results into a higher-precision accumulator. FPRaker boosts performance and energy efficiency during training by taking advantage of the values that naturally appear during training. Specifically, it processes the significand of the operands of each multiply-accumulate as a series of signed powers of two; the conversion to this form is done on-the-fly. This exposes ineffectual work that can be skipped: when encoded, values have few terms, and some of those can be discarded because they would fall outside the range of the accumulator given the limited precision of floating point. We demonstrate that FPRaker can be used to compose an accelerator for training and that it improves performance and energy efficiency compared to using conventional floating-point units under iso-compute area constraints. We also demonstrate that FPRaker delivers additional benefits when training incorporates pruning and quantization. Finally, we show that FPRaker naturally amplifies performance with training methods that use a different precision per layer.
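The signed powers-of-two idea can be sketched as follows. I use a non-adjacent-form recoding for illustration; the exact on-the-fly encoder and accumulator-range check in FPRaker differ in detail:

```python
def naf_terms(x):
    """Recode a non-negative integer significand as signed powers of two
    (non-adjacent form): returns [(digit, exponent)] with digit in {+1, -1}."""
    terms, k = [], 0
    while x != 0:
        if x % 2:
            d = 2 - (x % 4)      # pick +1 or -1 so the next bit becomes zero
            terms.append((d, k))
            x -= d
        x //= 2
        k += 1
    return terms  # x == sum(d * 2**k for d, k in terms)

def effectual_terms(significand, prod_exp, acc_lsb_exp):
    """Keep only terms that land at or above the accumulator's least
    significant bit; terms below it cannot affect the result and are skipped."""
    return [(d, k) for d, k in naf_terms(significand)
            if prod_exp + k >= acc_lsb_exp]

# 7 = 0b111 has three '1' bits but only two signed terms: 8 - 1
# 255 = 0b11111111 collapses to just two terms: 256 - 1
```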
Data accesses between on- and off-chip memories account for a large fraction of overall energy consumption during inference with deep learning networks. We present Boveda, a lossless on-chip memory compression technique for neural networks operating on fixed-point values. Boveda reduces the datawidth used per block of values to be only as long as necessary: since most values are of small magnitude, Boveda drastically reduces their footprint. Boveda can be used to increase the effective on-chip capacity, to reduce off-chip traffic, or to reduce the on-chip memory capacity needed to achieve a performance/energy target. Boveda reduces total model footprint to 53%.
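A minimal software sketch of per-block width reduction, assuming an illustrative 16-value block and 4-bit width header (the actual Boveda encoding differs in detail):

```python
def pack(values, block=16):
    """Split signed fixed-point values into blocks; each block records a
    width and stores its members in that many bits (two's complement)."""
    packed = []
    for i in range(0, len(values), block):
        blk = values[i:i + block]
        width = max(1, max(abs(v).bit_length() for v in blk) + 1)  # conservative
        packed.append((width, [v & ((1 << width) - 1) for v in blk]))
    return packed

def unpack(packed):
    """Lossless decode: sign-extend each member from its block's width."""
    out = []
    for width, enc in packed:
        for e in enc:
            out.append(e - (1 << width) if e >> (width - 1) else e)
    return out

def footprint_bits(packed, header=4):
    """Total storage: a small width header plus the packed members per block."""
    return sum(header + width * len(enc) for width, enc in packed)

vals = [3, -2, 0, 1] * 8   # mostly small-magnitude values, as is typical
```

Because every block of small values collapses to a few bits per member, the packed footprint is well under the 8 bits/value baseline while the round trip stays exact.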
We review two inference accelerators that exploit value properties in deep neural networks: 1) Diffy that targets spatially correlated activations in computational imaging DNNs, and 2) Tactical that targets sparse neural networks using a low-overhead hardware/software weight-skipping front-end. Then we combine both into Di-Tactical to boost benefits for both scene understanding workloads and computational imaging tasks.
We show that selecting one data width for all values in Deep Neural Networks, quantized or not, and even if that width differs per layer, amounts to worst-case design. Much shorter data widths can be used if we target the common case by adjusting the data type width at a much finer granularity. We propose ShapeShifter, where we group weights and activations and encode them using a width specific to each group, with typical group sizes varying from 16 to 256 values. The per-group widths are selected statically for the weights and dynamically by hardware for the activations. We present two applications of ShapeShifter. In the first, which is applicable to any system, ShapeShifter reduces off- and on-chip storage and communication. This ShapeShifter-based memory compression is simple and low cost yet reduces off-chip traffic to 33% and 36% of its original volume for 8-bit and 16-bit models, respectively. This makes it possible to sustain higher performance for a given off-chip memory interface while also boosting energy efficiency. In the second application, we show how ShapeShifter can be implemented as a surgical extension over designs that exploit variable precision in time.
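The worst-case vs. common-case argument can be illustrated with a toy sketch (the group size, header size, and width rule are illustrative assumptions, not the paper's exact encoding):

```python
def per_layer_bits(vals):
    """Worst-case design: one width for the whole layer, set by the largest value."""
    width = max(1, max(abs(v).bit_length() for v in vals) + 1)
    return width * len(vals)

def per_group_bits(vals, group=16, header=4):
    """Common-case design: each group carries its own width plus a small header."""
    total = 0
    for i in range(0, len(vals), group):
        g = vals[i:i + group]
        width = max(1, max(abs(v).bit_length() for v in g) + 1)
        total += header + width * len(g)
    return total

# one outlier forces the whole layer to 8 bits, but only its own group pays for it
layer = [1, -2, 3, 0] * 15 + [100, 2, -1, 3]
```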
We motivate a method for transparently identifying ineffectual computations in unmodified Deep Learning models without affecting accuracy. Specifically, we show that if we decompose multiplications down to the bit level, the amount of work performed during inference for image classification models can be consistently reduced by two orders of magnitude. In the best case studied, a sparse variant of AlexNet, this approach can ideally reduce computation work by more than 500x. We present Laconic, a hardware accelerator that implements this approach to improve execution time and energy efficiency for inference with Deep Learning Networks. Laconic judiciously gives up some of the work reduction potential to yield a low-cost, simple, and energy-efficient design that outperforms other state-of-the-art accelerators. For example, a Laconic configuration that uses a weight memory interface with just 128 wires outperforms a conventional accelerator with a 2K-wire weight memory interface by 2.3x on average while being 2.13x more energy efficient on average. A Laconic configuration that uses a 1K-wire weight memory interface outperforms the 2K-wire conventional accelerator by 15.4x and is 1.95x more energy efficient. Laconic does not require, but rewards, advances in model design such as a reduction in precision, the use of alternate numeric representations that reduce the number of bits that are "1", or an increase in weight or activation sparsity.
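The work-reduction argument can be sketched by counting single-bit products (a simplification of Laconic's term-serial processing; the function names are mine):

```python
def bit_serial_work(acts, weights):
    """Single-bit products needed when multiplications are decomposed to the
    bit level: only the '1' bits of each operand pair contribute."""
    return sum(bin(a).count("1") * bin(w).count("1")
               for a, w in zip(acts, weights))

def bit_parallel_work(acts, weights, bits=8):
    """A conventional bit-parallel unit pays bits x bits per multiplication,
    regardless of the operands' values."""
    return len(acts) * bits * bits

# sparse, low-magnitude 8-bit operands: most bit pairs are ineffectual
acts = [1, 0, 2, 4, 0, 3, 0, 8]
wts  = [3, 5, 0, 1, 7, 0, 2, 1]
```

Even this tiny example exposes a large gap (4 effectual single-bit products vs. 512 bit pairs); across real models the paper reports reductions of two orders of magnitude.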
We show that, during inference with Convolutional Neural Networks (CNNs), 2x to more than 8x additional ineffectual work can be exposed if, instead of targeting only those weights and activations that are zero, we target different combinations of value stream properties. We demonstrate a practical application with Bit-Tactical (TCL), a hardware accelerator that exploits weight sparsity, per-layer precision variability and dynamic fine-grained precision reduction for activations, and optionally the naturally occurring sparse effectual bit content of activations to improve performance and energy efficiency. TCL benefits both sparse and dense CNNs, natively supports both convolutional and fully-connected layers, and exploits properties of all activations to reduce storage, communication, and computation demands. While TCL does not require changes to the CNN to deliver benefits, it does reward any technique that would amplify any of the aforementioned weight and activation value properties. Compared to an equivalent data-parallel accelerator for dense CNNs, TCLp, a variant of TCL, improves performance by 5.05x and is 2.98x more energy efficient while requiring 22% more area.
Recent hardware accelerators for inference on the ImageNet dataset with Deep Convolutional Neural Networks eliminate ineffectual work and traffic resulting from value properties, such as precision variability and sparsity at the value- or bit-level. This work analyzes to what extent these properties persist in a broader set of neural network models and data sets. We analyzed: a) image classification networks, b) other image application models (non-classification), and c) Long Short-Term Memory (LSTM) models for non-image applications. We show that these properties persist, albeit to a different degree, and identify opportunities for future accelerator design efforts.
We show that Deep Convolutional Neural Network (CNN) implementations of computational imaging tasks exhibit spatially correlated values. We exploit this correlation to reduce the amount of computation, communication, and storage needed to execute such CNNs by introducing Diffy, a hardware accelerator that performs Differential Convolution. Diffy stores, communicates, and processes the bulk of the activation values as deltas. Experiments show that, over five state-of-the-art CNN models and for HD resolution inputs, Diffy boosts the average performance by 7.1x over a baseline value-agnostic accelerator and by 1.41x over a state-of-the-art accelerator that processes only the effectual content of the raw activation values. Diffy is respectively 1.83x and 1.36x more energy efficient when considering only the on-chip energy. Further, Diffy requires 55% less on-chip storage and 2.5x less off-chip bandwidth compared to storing the raw values using profiled per-layer precisions. Compared to using dynamic per-group precisions, Diffy requires 32% less storage and 1.43x less off-chip memory bandwidth. More importantly, Diffy provides the performance necessary to achieve real-time processing of HD resolution images with practical configurations. Finally, Diffy is robust and can serve as a general CNN accelerator as it improves performance even for image classification models.
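A toy sketch of why deltas help, assuming simple horizontal deltas and one width per run (Diffy's actual delta scheme and precision selection are more elaborate):

```python
def width(vals):
    """Bits needed to store a run of signed values."""
    return max(1, max(abs(v).bit_length() for v in vals) + 1)

def delta_encode(row):
    """Keep the first activation raw; store the rest as deltas from the left neighbor."""
    return [row[0]] + [b - a for a, b in zip(row, row[1:])]

def delta_decode(deltas):
    """Reconstruct the original run by cumulative summation."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

# spatially correlated pixel activations: 8 bits raw, but deltas fit in 3 bits
row = [120, 122, 121, 119, 118, 120, 123, 121]
```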
The rapid pace and successful application of machine learning research and development has seen widespread deployment of deep convolutional neural networks (CNNs). Alongside these algorithmic efforts, the compute- and memory-intensive nature of CNNs has stimulated a large amount of work in the field of hardware acceleration for these networks. In this paper, we profile the memory requirements of CNNs in terms of both on-chip memory size and off-chip memory bandwidth, in order to understand the impact of the memory system on accelerator design. We show that there are fundamental trade-offs between performance, bandwidth, and on-chip memory. Further, this paper explores how the wide variety of CNNs for different application domains each have fundamentally different characteristics. We show that bandwidth and memory requirements for different networks, and occasionally for different layers within a network, can each vary by multiple orders of magnitude. This makes designing fast and efficient hardware for all CNN applications difficult. To remedy this, we outline heuristic design points that attempt to optimize for select dataflow scenarios.
Recent hardware accelerators for inference on the ImageNet dataset with Deep Convolutional Neural Networks eliminate ineffectual work and traffic resulting from value properties, such as precision variability and sparsity at the value- or bit-level. This work analyzes to what extent these properties persist in a broader set of neural network models and data sets. We analyzed: a) image classification networks, b) other image application models (non-classification), and c) Long Short-Term Memory (LSTM) models for non-image applications. We show that these properties persist, albeit to a different degree, and identify opportunities for future accelerator design efforts.
Computational imaging pipelines (CIPs) convert the raw output of imaging sensors into the high-quality images that are used for further processing. This work studies how Block-Matching and 3D filtering (BM3D), a state-of-the-art denoising algorithm, can be implemented to meet the demands of user-interactive (UI) applications. Denoising is the most computationally demanding stage of a CIP, taking more than 95% of the time in a highly-optimized software implementation. We analyze the performance and energy consumption of optimized software implementations on three commodity platforms and find that their performance falls far short of that needed. To enable BM3D to be used for UI applications, we consider two alternatives: a dedicated accelerator, and running recently proposed Neural Network (NN) based approximations of BM3D [2, 3] on an NN accelerator. We develop the Image DEnoising AcceLerator (IDEAL), a hardware BM3D accelerator which incorporates the following techniques: 1) a novel software-hardware optimization, Matches Reuse (MR), that exploits typical image content to reduce the computations needed by BM3D, 2) prefetching and judicious use of on-chip buffering to minimize execution stalls and off-chip bandwidth consumption, 3) a careful arrangement of specialized computing blocks, and 4) data type precision tuning. Over a dataset of images with resolution ranging from 8 megapixels (MP) up to 42MP, IDEAL is 11,352× and 591× faster than high-end general-purpose (CPU) and graphics processor (GPU) software implementations with orders of magnitude better energy efficiency. Even when the NN approximations of BM3D are run over DaDianNao, a state-of-the-art high-end hardware NN accelerator, IDEAL is 5.4× faster and 3.95× more energy efficient.
This work studies the behavior of state-of-the-art memory controller designs when executing scale-out workloads. It considers memory scheduling techniques, memory page management policies, the number of memory channels, and the address mapping scheme used. Experimental measurements demonstrate that: 1) several recently proposed memory scheduling policies are not a good match for these scale-out workloads; 2) the relatively simple First-Ready First-Come-First-Served (FR-FCFS) policy performs consistently better; 3) for most of the studied workloads, the even simpler First-Come-First-Served scheduling policy is within 1% of FR-FCFS; 4) increasing the number of memory channels offers negligible performance benefits, e.g., performance improves by 1.7% on average for 4 channels vs. 1 channel; and 5) 77%-90% of DRAM row activations are accessed only once before closure. These observations can guide future development and optimization of memory controllers for scale-out workloads.
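For reference, the FR-FCFS policy discussed above can be sketched in a few lines (a single-bank toy model; real controllers track timing constraints per bank, rank, and channel):

```python
def fr_fcfs(queue, open_row):
    """First-Ready First-Come-First-Served: service the oldest request that
    hits the currently open DRAM row; if none hits, fall back to the oldest
    request overall. `queue` is held in arrival order."""
    for req in queue:          # oldest-first scan finds the oldest row hit
        if req["row"] == open_row:
            return req
    return queue[0]            # no row hit: plain FCFS

requests = [{"id": 0, "row": 5}, {"id": 1, "row": 7}, {"id": 2, "row": 7}]
# with row 7 open, request 1 (oldest hit) goes first; with row 3 open, request 0 does
```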
The rise of Chip Multiprocessors (CMPs) as the prevailing approach to high-performance processor design has made scalable cache directory organizations, paired with simple cache coherence protocols, a hot research area. While thousands of cores are expected to fit on a single chip soon, previously proposed cache directory schemes still lack the scalability to accommodate more than tens of cores. The inefficiencies of these directory schemes come in the form of unaffordable memory overhead, excessive coherence traffic that degrades performance due to inexact representation of sharers, and very complex coherence protocols. In this paper we introduce a new cache directory scheme for many-core CMPs. The proposed scheme retains, and in fact improves, the scalability and low coherence traffic of cache-based linked-list directory schemes while avoiding their completely sequential operation by exploiting the parallel operation of limited-pointer directory schemes. We compare the proposed organization with these two previously proposed ones on CMP configurations ranging from 4 cores to 32 cores. We show that the proposed scheme can avoid one third of the excessive broadcast invalidation messages and two thirds of the extraneous acks in the case of directory pointer overflows in limited-pointer schemes. It also achieves around 10% better performance than the completely sequential cache-based linked-list directory while reducing the number of invalidation messages per invalidation event by 24%.
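A toy software model of the hybrid idea (my simplification for illustration; the actual HLPLL protocol manages the list in hardware across the sharers' caches):

```python
class HLPLLEntry:
    """Toy hybrid directory entry: the first P sharers occupy exact parallel
    pointers; later sharers are chained in a linked list instead of forcing
    an imprecise (broadcast) representation."""
    def __init__(self, pointers=4):
        self.P = pointers
        self.pointers = []   # invalidated in parallel
        self.chain = []      # walked sequentially on invalidation

    def add_sharer(self, core):
        (self.pointers if len(self.pointers) < self.P else self.chain).append(core)

    def invalidations(self):
        """Sharers are tracked exactly, so one message per sharer suffices."""
        return len(self.pointers) + len(self.chain)

def limited_pointer_invalidations(entry, total_cores):
    """A pure limited-pointer directory must broadcast to every core once
    its pointer set overflows, because sharers are no longer tracked exactly."""
    return entry.invalidations() if not entry.chain else total_cores
```

With 6 sharers of a line on a 32-core CMP and 4 pointers, the hybrid entry sends 6 targeted invalidations where the pure limited-pointer scheme would broadcast 32.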
Moore's law still offers more transistors per unit die area, leading to the expectation that thousands of cores will fit on a single chip soon. Network-on-Chip has proved to be a successful approach to accommodating this increasing number of on-chip cores. However, previously proposed 2D architectures still lack scalability beyond a few tens of cores; their inefficiencies come in the form of long interconnect delays, leading to performance degradation and high power consumption due to long wires. Fortunately, rapid advances in 3D die-stacking technology for high-performance processor design have opened up new approaches toward a scalable interconnection network. In this paper, we propose a novel 3D crossbar-based architecture that separates cores from cache modules in different 3D-stacked dies. We introduce an area model of the adopted crossbar and analyze the scalability of the proposed architecture up to 1024 communicating entities (cores and L2 cache banks).
My teaching experience extends back to September 2009, when I worked as a teaching assistant (TA) at Cairo University for the following courses: Data Structures, Computer Architecture, Parallel Programming, Microprocessors, Operating Systems, Digital Design, Embedded Systems, Image Processing, Software Engineering, and Advanced Database Systems.
I led my parallel programming students to advanced ranks (7th to 12th) in the 2012 and 2013 editions of the Intel Accelerate Your Code (AYC) contest.
During my PhD at the University of Toronto, I worked as a TA for the following courses:
ECE243 Computer Organization
ECE353 Systems Software
CSC190 Computer Algorithms and Data Structures
ECE241 Digital Systems
ECE352 Computer Organization
Basic computer structure. Design of central processing unit. Hardwired control. Input-output and the use of interrupts. Assembly language programming. Main memory organization and caches. Peripherals and interfacing. System design considerations.
Digital logic circuit design. Optimizations of combinational logic. Transistor-level design of logic gates; propagation delay and timing of gates and circuits. Verilog. Memory in digital circuits, including latches, clocked flip-flops, and Static Random Access Memory. Set-up and hold times of sequential logic. Finite state machines. Hardware addition and multiplication.
Introduction to C++ programming and computational thinking. Algorithm design techniques. Data structures: linked lists, stacks, queues, trees, heaps, hashing, pointers (including function pointers), and arrays; data types and bit operations. Dynamic memory management.
Digital and Computer Systems. Synchronous and asynchronous sequential circuits, pipelining, integer and floating-point arithmetic, RISC processors.
Operating system structure, processes, threads, synchronization, CPU scheduling, memory management, file systems, input/output, multiple processor systems, virtualization, protection, and security. The laboratory exercises will require implementation of part of an operating system.
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
10 King's College Rd.
University of Toronto
Toronto, ON M5S 3G4 Canada
Office: Engineering Annex Building, room EA306