Positions

  • 2019 – Present

    Postdoctoral Fellow

    University of Toronto, Electrical and Computer Engineering Dept.

  • 2013 – 2019

    Teaching and Research Assistant

    University of Toronto, Electrical and Computer Engineering Dept.

  • 2013 – Present

    Assistant Lecturer (on leave)

    Cairo University, Computer Engineering Dept.

  • 2009 – 2013

    Teaching and Research Assistant

    Cairo University, Computer Engineering Dept.

  • 2009 – 2013

    Senior Software Engineer

    Fingerprint Consultancy, R&D Team

  • 2008

    Summer Research Assistant

    Clausthal University of Technology, Car Ring II project

Education

  • Ph.D. 2019

    Ph.D. in Computer Engineering

    University of Toronto

  • M.Sc. 2013

    Master of Science in Computer Engineering

    Cairo University

  • B.Sc. 2009

    Bachelor of Science in Computer Engineering

    Cairo University

Honors, Awards and Grants

  • RBC 2019
    RBC Post-Doctoral Fellowship
    Honored to receive the 2019 RBC Post-Doctoral Fellowship in support of my entrepreneurial research on accelerating deep learning model training and inference at the edge.
  • NSERC 2016
    NSERC Engage Grant
    My research on Computational Imaging will be supported by an NSERC Engage Grant.
  • Connaught 2013
    Connaught International Scholarships for Doctoral Students
    University of Toronto
    The Connaught International Scholarships assist graduate units in recruiting and supporting top international graduate students. Only the top 20 PhD students are awarded the scholarship annually.
  • 2013
    JEC-ECC 2013 Best Paper Award
    My paper "Hybrid limited-pointer linked-list cache directory and cache coherence protocol" was selected for the best paper award.
  • 2010
    Graduation Ceremony Prize for Excellence
    First-rank student award, Computer Engineering class of 2009, B.Sc. with Honors.

Research Projects

  • Machine Learning Acceleration

    Designing efficient ASIC accelerators for training and inference with deep learning models, for both data centers and edge devices

    - TensorDash: a deep learning accelerator that exploits sparsity in activations, weights, and gradients during training to achieve a 2x speedup and 1.5x better energy efficiency vs. Tensorcores with under 5% area overhead.
    - Bit-Tactical: a deep neural network hardware accelerator that exploits weight sparsity, per-layer precision variability, dynamic fine-grain precision tuning for activations, and optionally the sparse effectual bit content naturally occurring in binary representations. Bit-Tactical improves performance by 5.05x and is 2.98x more energy efficient than an equivalent data-parallel accelerator.
    - Laconic: a deep neural network accelerator that decomposes convolutions down to the bit level. The design reduces the amount of work needed for inference by two orders of magnitude. With a 1K-wire weight memory interface, Laconic is 15.4x faster and 1.95x more energy efficient than a conventional accelerator with a 2K-wire weight memory interface. A toy sketch of the bit-level decomposition idea follows this list.
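
    A minimal sketch of the bit-level decomposition idea (Python; the 16-bit operand width and the simple work metric are assumptions for illustration, not the evaluated configurations):

      def effectual_bit_work(weights, activations, bit_width=16):
          # Only the '1' bits of each operand contribute terms to a product,
          # so the single-bit products actually needed per multiplication are
          # popcount(w) * popcount(a).
          effectual = sum(bin(abs(w)).count("1") * bin(abs(a)).count("1")
                          for w, a in zip(weights, activations))
          # A conventional bit-parallel multiplier always does bit_width^2 bit products.
          baseline = bit_width * bit_width * len(weights)
          return effectual, baseline

      # Sparse, low-magnitude operands expose large reductions:
      w = [0, 3, 0, 0, 1, 0, 8, 0]
      a = [5, 0, 0, 2, 1, 0, 4, 7]
      print(effectual_bit_work(w, a))   # (2, 2048)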

  • Computational Photography Acceleration

    Custom ASIC accelerator for Computational Imaging and Photography applications

    Accelerating computational imaging applications such as image denoising, sharpening, deblurring, demosaicking, and super-resolution.
    - Diffy: a hardware DNN accelerator that exploits unique characteristics of Computational Imaging Deep Neural Networks (CI-DNNs) to reduce the computation, communication, and storage needed. CI-DNNs are fully convolutional DNN models that perform per-pixel prediction on high-resolution HD and 4K input frames. Diffy boosts the performance of CI-DNNs by 7.1x over a state-of-the-art value-agnostic data-parallel accelerator. Diffy needs 2.8x less on-chip storage and is 1.83x more energy efficient if only on-chip power is considered. Overall energy efficiency is much higher as Diffy reduces the expensive off-chip traffic by 4.57x. Diffy is robust and not limited to CI-DNNs, as it improves performance even for traditional image classification models. A small sketch of the differential representation Diffy relies on appears after this list.
    - IDEAL: an ASIC accelerator for BM3D incorporating Matches Reuse, a novel software/hardware co-design optimization, to reduce the needed computation and off-chip bandwidth. IDEAL can be practically integrated into cameras and other mobile imaging systems. IDEAL boosts performance by 11,352x on average over an Intel Xeon E5-2650 v2 CPU and by 597x over a GTX980 GPU, and is orders of magnitude more energy efficient than both.
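
    A hedged sketch of the differential (delta) representation Diffy relies on (Python/NumPy; the group sizes and hardware precision selection Diffy actually uses are not modeled):

      import numpy as np

      def row_deltas(activations):
          # Keep the first column raw; store every other activation as the
          # difference from its left neighbour. Spatially correlated values
          # produce small deltas.
          deltas = activations.astype(np.int64)
          deltas[:, 1:] = deltas[:, 1:] - deltas[:, :-1]
          return deltas

      def signed_bits_needed(x):
          # Bits for the largest magnitude plus a sign bit.
          return int(np.abs(x).max()).bit_length() + 1

      # A smooth (spatially correlated) activation map shrinks dramatically:
      acts = np.add.outer(np.arange(4) * 2, np.arange(8)) + 100
      print(signed_bits_needed(acts), signed_bits_needed(row_deltas(acts)[:, 1:]))  # 8 vs. 2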

  • Computational Imaging GPU Acceleration

    Accelerating a software 3D scanner on GPU

    GPU acceleration of a software 3D scanner algorithm that constructs a 3D model of an object from a stack of 2D images taken at different focal lengths. Achieved an 18.4x speedup over the CPU (35.4x excluding CPU-GPU memory transfer time). The work was done as a research-based project for a graduate course. The final technical report and presentation are available online.
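
    The report documents the exact algorithm; purely as an illustration, a common shape-from-focus formulation selects, per pixel, the focal slice with the highest local contrast:

      import numpy as np
      from scipy.ndimage import uniform_filter

      def depth_from_focus(stack, window=9):
          # stack: (num_slices, H, W) grayscale frames taken at different focal lengths.
          # For each pixel, pick the slice index where the local variance (a simple
          # sharpness measure) is largest; that index serves as a proxy for depth.
          sharpness = np.empty(stack.shape, dtype=np.float64)
          for i, frame in enumerate(stack.astype(np.float64)):
              mean = uniform_filter(frame, size=window)
              sharpness[i] = uniform_filter((frame - mean) ** 2, size=window)
          return np.argmax(sharpness, axis=0)   # per-pixel depth index map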

  • Memory Systems for The Cloud

    Rethinking the memory controller design for cloud workloads

    Researching the suitability of state-of-the-art memory controller designs for emerging cloud workloads, including NoSQL data stores, MapReduce, streaming, web services, and search engines. The project outlines design recommendations for the memory controller to better match the unique characteristics of cloud workloads.

  • Memory Scheduling in Heterogeneous Mobile Systems

    Enhancing the memory subsystem of CPU-GPU heterogeneous mobile platforms

    Investigating state-of-the-art memory scheduling algorithms in heterogeneous systems where CPUs and the GPU share DRAM, specifically in mobile systems. The project contributes the following:
    - Enhancing the memory controller model in Gem5-GPU to enable plugging in and testing different memory scheduling algorithms
    - Identifying deficiencies from the simulation results and suggesting better scheduling algorithms
    - Comparing the performance of the suggested algorithms to the state-of-the-art

  • C-to-CUDA compiler

    Auto-generation of CUDA kernels replacing loops

    Implemented a source-to-source compiler that analyzes sequential loops and data dependencies to automatically convert C++ loops into CUDA kernels. The required memory transfers before and after the kernel call are auto-generated. The compiler is built using ROSE, the open-source compiler framework. The project was part of a research-based graduate course. The final technical report and presentation are available online.

  • In-Memory Database GPU Acceleration

    GPU acceleration of compression/decompression of in-memory DBs

    GPU acceleration of in-memory database compression/decompression. We achieve an average speedup of 3x over an Intel Xeon 3.2 GHz CPU (including CPU-GPU memory transfer time), while a state-of-the-art VPU (SSE) implementation is only 1.58x faster. Excluding CPU-GPU memory transfers, our speedup ranges from 41.5x to 87.7x depending on the compression ratio. Results are available online.

TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training

Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Omar Mohamed Awad, Gennady Pekhimenko, Jorge Albericio, Andreas Moshovos
Conference Paper (MICRO 2020) The 53rd Annual IEEE/ACM International Symposium on Microarchitecture.

Abstract

TensorDash is a hardware-based technique that enables data-parallel MAC units to take advantage of sparsity in their input operand streams. When used to compose a hardware accelerator for deep learning, TensorDash can speed up the training process while also increasing energy efficiency. TensorDash combines a low-cost sparse input operand interconnect with an area-efficient hardware scheduler. The scheduler can effectively extract sparsity in the activations, the weights, and the gradients. Over a wide set of state-of-the-art models covering various applications, TensorDash accelerates the training process by 1.95× while being 1.5× more energy efficient when incorporated on top of a Tensorcore-based accelerator at less than 5% area overhead. TensorDash is datatype agnostic, and we demonstrate it with IEEE standard mixed-precision floating-point units and a popular floating-point format optimized for machine learning (BFloat16).
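
As a toy, software-level illustration of the work TensorDash skips (the hardware scheduler, lookahead window, and interconnect are not modeled; the function and variable names below are illustrative only):

    def macs_after_skipping(activations, weights):
        # A MAC whose activation or weight operand is zero is ineffectual;
        # the scheduler's job is to fill those slots with useful work instead.
        performed = sum(1 for a, w in zip(activations, weights) if a != 0.0 and w != 0.0)
        return performed, len(activations)

    # With roughly 50% sparsity in each operand stream, few MACs remain:
    acts = [0.0, 1.2, 0.0, 3.4, 0.5, 0.0, 0.0, 2.2]
    wgts = [1.0, 0.0, 0.0, 0.7, 0.0, 0.3, 1.1, 0.9]
    print(macs_after_skipping(acts, wgts))   # (2, 8)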

FPRaker: A Processing Element For Accelerating Neural Network Training

Omar Mohamed Awad, Mostafa Mahmoud, Isak Edo, Ali Hadi Zadeh, Ciaran Bannon, Anand Jayarajan, Gennady Pekhimenko, Andreas Moshovos
arXiv Paper (arXiv 2020) arXiv:2010.08065.

Abstract

We present FPRaker, a processing element for composing training accelerators. FPRaker processes several floating-point multiply-accumulate operations concurrently and accumulates their result into a higher-precision accumulator. FPRaker boosts performance and energy efficiency during training by taking advantage of the values that naturally appear during training. Specifically, it processes the significand of the operands of each multiply-accumulate as a series of signed powers of two. The conversion to this form is done on-the-fly. This exposes ineffectual work that can be skipped: encoded values have few terms, and some of them can be discarded as they would fall outside the range of the accumulator given the limited precision of floating-point. We demonstrate that FPRaker can be used to compose an accelerator for training and that it can improve performance and energy efficiency compared to using conventional floating-point units under iso-compute area constraints. We also demonstrate that FPRaker delivers additional benefits when training incorporates pruning and quantization. Finally, we show that FPRaker naturally amplifies performance with training methods that use a different precision per layer.
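
To make the "series of signed powers of two" concrete, the hedged sketch below recodes an integer significand into signed power-of-two terms; the on-the-fly hardware recoding and the accumulator-range check that FPRaker performs are simplified away:

    def signed_pow2_terms(significand):
        # Recode a non-negative integer into signed power-of-two terms
        # (a canonical-signed-digit style recoding): a run of 1s collapses
        # into one positive and one negative term.
        terms, k = [], 0
        while significand:
            if significand & 1:
                if (significand & 3) == 3:      # start of a run of 1s
                    terms.append((-1, k))
                    significand += 1            # the carry ends the run higher up
                else:
                    terms.append((+1, k))
                    significand -= 1
            significand >>= 1
            k += 1
        return terms

    print(signed_pow2_terms(7))    # [(-1, 0), (1, 3)]  i.e. 2**3 - 2**0
    print(signed_pow2_terms(96))   # two terms instead of iterating over 7 bits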

Building an on-chip deep learning memory hierarchy brick by brick: late breaking results

Isak Edo Vivancos, Sayeh Sharify, Milos Nikolic, Ciaran Bannon, Mostafa Mahmoud, Alberto Delmas Lascorz, Andreas Moshovos
Conference Paper (DAC 2020) The 57th ACM/EDAC/IEEE Design Automation Conference, July 2020.

Abstract

Data accesses between on- and off-chip memories account for a large fraction of overall energy consumption during inference with deep learning networks. We present Boveda, a lossless on-chip memory compression technique for neural networks operating on fixed-point values. Boveda reduces the datawidth used per block of values to be only as long as necessary: since most values are of small magnitude, Boveda drastically reduces their footprint. Boveda can be used to increase the effective on-chip capacity, to reduce off-chip traffic, or to reduce the on-chip memory capacity needed to achieve a performance/energy target. Boveda reduces total model footprint to 53%.
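
A software-level sketch of the per-block width idea (the block size, sign handling, and per-block width metadata below are assumptions for illustration; the actual hardware packing is not modeled):

    def per_block_footprint(values, block_size=16, base_width=16, width_field_bits=4):
        # Store each block with just enough bits for its largest-magnitude value
        # (plus a sign bit), and a small per-block field recording that width.
        packed_bits = 0
        for i in range(0, len(values), block_size):
            block = values[i:i + block_size]
            width = max(abs(v) for v in block).bit_length() + 1   # +1 for the sign bit
            packed_bits += width * len(block) + width_field_bits
        raw_bits = base_width * len(values)
        return packed_bits, raw_bits

    # Mostly small-magnitude values compress well; one large outlier only hurts its own block:
    vals = [1, -2, 0, 3, 5, -1, 2, 0] * 4 + [1000]
    print(per_block_footprint(vals))   # (151, 528)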

Accelerating Image Sensor Based Deep Learning Applications

Mostafa Mahmoud, Dylan Malone Stuart, Zissis Poulos, Alberto Delmas Lascorz, Patrick Judd, Sayeh Sharify, Milos Nikolic, Kevin Siu, Isak Edo Vivancos, Jorge Albericio, Andreas Moshovos
Journal Paper (IEEE Micro 2019) IEEE Micro, Volume 39, Issue 5, Sept.-Oct. 2019.

Abstract

We review two inference accelerators that exploit value properties in deep neural networks: 1) Diffy, which targets spatially correlated activations in computational imaging DNNs, and 2) Tactical, which targets sparse neural networks using a low-overhead hardware/software weight-skipping front-end. We then combine both into Di-Tactical to boost benefits for both scene understanding workloads and computational imaging tasks.

ShapeShifter: Enabling Fine-Grain Data Width Adaptation in Deep Learning

Alberto Delmas Lascorz, Sayeh Sharify, Isak Edo, Dylan Malone Stuart, Omar Mohamed Awad, Patrick Judd, Mostafa Mahmoud, Milos Nikolic, Kevin Siu, Zissis Poulos, Andreas Moshovos
Conference Paper (MICRO 2019) The 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Columbus, Ohio.

Abstract

We show that selecting a single data width for all values in Deep Neural Networks, quantized or not, and even if that width differs per layer, amounts to worst-case design. Much shorter data widths can be used if we target the common case by adjusting the data type width at a much finer granularity. We propose ShapeShifter, where we group weights and activations and encode them using a width specific to each group, with typical group sizes varying from 16 to 256 values. The per-group widths are selected statically for the weights and dynamically by hardware for the activations. We present two applications of ShapeShifter. In the first, which is applicable to any system, ShapeShifter reduces off- and on-chip storage and communication. This ShapeShifter-based memory compression is simple and low cost yet reduces off-chip traffic to 33% and 36% for 8-bit and 16-bit models, respectively. This makes it possible to sustain higher performance for a given off-chip memory interface while also boosting energy efficiency. In the second application, we show how ShapeShifter can be implemented as a surgical extension over designs that exploit variable precision in time.

Laconic Deep Learning Computing

Sayeh Sharify, Alberto Delmas Lascorz, Mostafa Mahmoud, Milos Nikolic, Kevin Siu, Dylan Malone Stuart, Zissis Poulos, Andreas Moshovos
Conference Paper (ISCA 2019) The 46th International Symposium on Computer Architecture, Phoenix, Arizona.

Abstract

We motivate a method for transparently identifying ineffectual computations in unmodified Deep Learning models without affecting accuracy. Specifically, we show that if we decompose multiplications down to the bit level, the amount of work performed during inference for image classification models can be consistently reduced by two orders of magnitude. In the best case studied, a sparse variant of AlexNet, this approach can ideally reduce computation work by more than 500x. We present Laconic, a hardware accelerator that implements this approach to improve execution time and energy efficiency for inference with Deep Learning Networks. Laconic judiciously gives up some of the work reduction potential to yield a low-cost, simple, and energy-efficient design that outperforms other state-of-the-art accelerators. For example, a Laconic configuration that uses a weight memory interface with just 128 wires outperforms a conventional accelerator with a 2K-wire weight memory interface by 2.3x on average while being 2.13x more energy efficient on average. A Laconic configuration that uses a 1K-wire weight memory interface outperforms the 2K-wire conventional accelerator by 15.4x and is 1.95x more energy efficient. Laconic does not require but rewards advances in model design such as a reduction in precision, the use of alternate numeric representations that reduce the number of bits that are "1", or an increase in weight or activation sparsity.

Bit-Tactical: Exploiting Ineffectual Computations in Convolutional Neural Networks: Which, Why, and How

Alberto Delmas Lascorz, Patrick Judd, Dylan Malone Stuart, Zissis Poulos, Mostafa Mahmoud, Sayeh Sharify, Milos Nikolic, Andreas Moshovos
Conference Paper (ASPLOS 2019) The 24th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Providence, RI.

Abstract

We show that, during inference with Convolutional Neural Networks (CNNs), 2x to 8x more ineffectual work can be exposed if, instead of targeting only those weights and activations that are zero, we target different combinations of value stream properties. We demonstrate a practical application with Bit-Tactical (TCL), a hardware accelerator which exploits weight sparsity, per-layer precision variability and dynamic fine-grain precision reduction for activations, and optionally the naturally occurring sparse effectual bit content of activations, to improve performance and energy efficiency. TCL benefits both sparse and dense CNNs, natively supports both convolutional and fully-connected layers, and exploits properties of all activations to reduce storage, communication, and computation demands. While TCL does not require changes to the CNN to deliver benefits, it does reward any technique that would amplify any of the aforementioned weight and activation value properties. Compared to an equivalent data-parallel accelerator for dense CNNs, TCLp, a variant of TCL, improves performance by 5.05x and is 2.98x more energy efficient while requiring 22% more area.

Characterizing Sources of Ineffectual Computations in Deep Learning Networks

Milos Nikolic, Mostafa Mahmoud, Yiren Zhao, Robert Mullins, Andreas Moshovos
Conference Paper (ISPASS 2019) The 2019 IEEE International Symposium on Performance Analysis of Systems and Software, Madison, Wisconsin.

Abstract

Recent hardware accelerators for inference on the ImageNet dataset with Deep Convolutional Neural Networks eliminate ineffectual work and traffic resulting from value properties, such as precision variability and sparsity at the value or bit-level. This work analyzes to what extent these properties persist in a broader set of neural network models and data sets. We analyzed: a) image classification networks, b) other image application models (non-classification), and c) Long-Short-Term Memory (LSTM) models for non-image applications. We show that these properties persist, albeit to a different degree, and identify opportunities for future accelerator design efforts.

Diffy: a Déjà vu-Free Differential Deep Neural Network Accelerator

Mostafa Mahmoud, Kevin Siu, Andreas Moshovos
Conference Paper (MICRO 2018) The 51st Annual IEEE/ACM International Symposium on Microarchitecture, Fukuoka, Japan, 2018.

Abstract

We show that Deep Convolutional Neural Network (CNN) implementations of computational imaging tasks exhibit spatially correlated values. We exploit this correlation to reduce the amount of computation, communication, and storage needed to execute such CNNs by introducing Diffy, a hardware accelerator that performs Differential Convolution. Diffy stores, communicates, and processes the bulk of the activation values as deltas. Experiments show that, over five state-of-the-art CNN models and for HD resolution inputs, Diffy boosts the average performance by 7.1x over a baseline value-agnostic accelerator [1] and by 1.41x over a state-of-the-art accelerator that processes only the effectual content of the raw activation values [2]. Further, Diffy is respectively 1.83x and 1.36x more energy efficient when considering only the on-chip energy. However, Diffy requires 55% less on-chip storage and 2.5x less off-chip bandwidth compared to storing the raw values using profiled per-layer precisions [3]. Compared to using dynamic per-group precisions [4], Diffy requires 32% less storage and 1.43x less off-chip memory bandwidth. More importantly, Diffy provides the performance necessary to achieve real-time processing of HD resolution images with practical configurations. Finally, Diffy is robust and can serve as a general CNN accelerator as it improves performance even for image classification models.

Memory Requirements for Convolutional Neural Network Hardware Accelerators

Kevin Siu, Dylan Malone Stuart, Mostafa Mahmoud, Andreas Moshovos
Conference Paper (IISWC 2018) The IEEE International Symposium on Workload Characterization, Raleigh, NC, 2018.

Abstract

The rapid pace and successful application of machine learning research and development has seen widespread deployment of deep convolutional neural networks (CNNs). Alongside these algorithmic efforts, the compute- and memory-intensive nature of CNNs has stimulated a large amount of work in the field of hardware acceleration for these networks. In this paper, we profile the memory requirements of CNNs in terms of both on-chip memory size and off-chip memory bandwidth, in order to understand the impact of the memory system on accelerator design. We show that there are fundamental trade-offs between performance, bandwidth, and on-chip memory. Further, this paper explores how the wide variety of CNNs for different application domains each have fundamentally different characteristics. We show that bandwidth and memory requirements for different networks, and occasionally for different layers within a network, can each vary by multiple orders of magnitude. This makes designing fast and efficient hardware for all CNN applications difficult. To remedy this, we outline heuristic design points that attempt to optimize for select dataflow scenarios.

Characterizing Sources of Ineffectual Computations in Deep Learning Networks

Miloš Nikolić, Mostafa Mahmoud, Andreas Moshovos
Poster Paper (IISWC 2018) The IEEE International Symposium on Workload Characterization, Raleigh, NC, 2018.

Abstract

Recent hardware accelerators for inference on the ImageNet dataset with Deep Convolutional Neural Networks eliminate ineffectual work and traffic resulting from value properties, such as precision variability and sparsity at the value- or bit-level. This work analyzes to what extent these properties persist in a broader set of neural network models and data sets. We analyzed: a) image classification networks, b) other image application models (non-classification), and c) Long-Short-Term-Memory (LSTM) models for non-image applications. We show that these properties persist, albeit to a different degree, and identify opportunities for future accelerator design efforts.

IDEAL: Image DEnoising AcceLerator

Mostafa Mahmoud, Bojian Zheng, Alberto Delmás Lascorz, Felix Heide, Jonathan Assouline, Paul Boucher, Emmanuel Onzon, Andreas Moshovos
Conference Paper (MICRO 2017) The 50th Annual IEEE/ACM International Symposium on Microarchitecture, Boston, MA, 2017.

Abstract

Computational imaging pipelines (CIPs) convert the raw output of imaging sensors into the high-quality images that are used for further processing. This work studies how Block-Matching and 3D filtering (BM3D), a state-of-the-art denoising algorithm, can be implemented to meet the demands of user-interactive (UI) applications. Denoising is the most computationally demanding stage of a CIP, taking more than 95% of the time in a highly optimized software implementation [1]. We analyze the performance and energy consumption of optimized software implementations on three commodity platforms and find that their performance falls far short of that needed. To enable BM3D to be used for UI applications, we consider two alternatives: a dedicated accelerator, and running recently proposed Neural Network (NN) based approximations of BM3D [2, 3] on an NN accelerator. We develop Image DEnoising AcceLerator (IDEAL), a hardware BM3D accelerator which incorporates the following techniques: 1) a novel software-hardware optimization, Matches Reuse (MR), that exploits typical image content to reduce the computations needed by BM3D, 2) prefetching and judicious use of on-chip buffering to minimize execution stalls and off-chip bandwidth consumption, 3) a careful arrangement of specialized computing blocks, and 4) data type precision tuning. Over a dataset of images with resolutions ranging from 8 megapixels (MP) up to 42MP, IDEAL is 11,352× and 591× faster than high-end general-purpose (CPU) and graphics processor (GPU) software implementations, respectively, with orders of magnitude better energy efficiency. Even when the NN approximations of BM3D are run over DaDianNao [4], a state-of-the-art high-end hardware NN accelerator, IDEAL is 5.4× faster and 3.95× more energy efficient.

Memory Controller Design Under Cloud Workloads

Mostafa Mahmoud, Andreas Moshovos
Conference Paper (IISWC 2016) The IEEE International Symposium on Workload Characterization, Providence, RI, 2016, pp. 1-11.

Abstract

This work studies the behavior of state-of-the-art memory controller designs when executing scale-out workloads. It considers memory scheduling techniques, memory page management policies, the number of memory channels, and the address mapping scheme used. Experimental measurements demonstrate: 1) Several recently proposed memory scheduling policies are not a good match for these scale-out workloads. 2) The relatively simple First-Ready First-Come-First-Served (FR-FCFS) policy performs consistently better. 3) For most of the studied workloads, the even simpler First-Come-First-Served scheduling policy is within 1% of FR-FCFS. 4) Increasing the number of memory channels offers negligible performance benefits, e.g., performance improves by 1.7% on average for 4 channels vs. 1 channel. 5) 77%-90% of activated DRAM rows are accessed only once before closure. These observations can guide future development and optimization of memory controllers for scale-out workloads.
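
For reference, a minimal sketch of the FR-FCFS policy discussed above (Python; DRAM timing constraints, read/write priorities, and per-bank queues are intentionally omitted):

    def fr_fcfs_pick(queue, open_rows):
        # queue: list of (arrival_time, bank, row) requests; open_rows: {bank: row}.
        # First-Ready: serve the oldest request that hits an already-open row.
        # Otherwise fall back to plain First-Come-First-Served.
        pending = sorted(queue, key=lambda req: req[0])
        for req in pending:
            _, bank, row = req
            if open_rows.get(bank) == row:      # row-buffer hit is "ready"
                return req
        return pending[0] if pending else None  # FCFS fallback

    # The younger row-hit request is served before the older row miss:
    q = [(1, 0, 42), (2, 0, 7)]
    print(fr_fcfs_pick(q, {0: 7}))   # (2, 0, 7)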

Hybrid Limited-Pointer Linked-List Cache Directory and Cache Coherence Protocol

Mostafa Mahmoud, Amr Wassal
Conference Paper (JEC-ECC 2013) The Second International Japan-Egypt Conference on Electronics, Communications and Computers, 6th of October City, 2013, pp. 77-82.
Best Paper Award

Abstract

The rise of Chip-Multiprocessors (CMPs) as a promising trend in high-performance processor design has made scalable cache directory organizations, together with simple cache coherence protocols, a hot research area. While thousands of cores are expected to fit on a single chip soon, previously proposed cache directory schemes still lack the scalability to accommodate more than tens of cores. The inefficiencies of these directory schemes come in the form of unaffordable memory overhead, excessive coherence traffic leading to performance degradation due to inexact representation of sharers, and very complex coherence protocols. In this paper, we introduce a new cache directory scheme for many-core CMPs. The proposed scheme retains, and actually improves, the scalability and low coherence traffic of cache-based linked-list directory schemes while avoiding their completely sequential operation by exploiting the parallel operation of limited-pointer directory schemes. We compare the proposed organization with these two previously proposed ones on different CMP configurations, from a 4-core CMP up to a 32-core CMP. We show that the proposed scheme can avoid one third of the excessive broadcasted invalidation messages and two thirds of the extraneous acknowledgments in the case of directory pointer overflows in limited-pointer schemes. On the other hand, the proposed scheme achieves around 10% better performance than the completely sequential cache-based linked-list directory while reducing the number of invalidation messages per invalidation event by 24%.
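
As background only, the sketch below models a plain limited-pointer directory entry whose overflow forces broadcast invalidations; the hybrid HLPLL scheme avoids exactly this cost by chaining additional sharers in a linked list, which is not modeled here:

    class LimitedPointerEntry:
        # Toy directory entry with up to P sharer pointers. Once more than P
        # cores share the line, the exact sharer set is lost and invalidations
        # must be broadcast to all cores (the overhead a linked-list extension removes).
        P = 4

        def __init__(self):
            self.sharers = set()
            self.overflow = False

        def add_sharer(self, core_id):
            if self.overflow:
                return
            if len(self.sharers) < self.P:
                self.sharers.add(core_id)
            else:
                self.overflow = True

        def invalidation_targets(self, num_cores):
            # Broadcast to everyone on overflow; otherwise only the tracked sharers.
            return set(range(num_cores)) if self.overflow else set(self.sharers)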

A Novel 3D Crossbar-Based Chip Multiprocessor Architecture

Mostafa Mahmoud, Amr Wassal
Conference Paper (JEC-ECC 2013) The Second International Japan-Egypt Conference on Electronics, Communications and Computers, 6th of October City, 2013, pp. 83-87.

Abstract

Moore's law still offers more transistors per die unit area, leading to the expectation that thousands of cores will soon fit on a single chip. Network-on-Chip has proved to be a successful approach to accommodating this increasing number of cores on chip. However, previously proposed 2D architectures still lack scalability beyond a few tens of cores, with their inefficiencies appearing as long interconnect delays that degrade performance and as high power consumption due to long wires. Fortunately, the rapid advances in 3D die-stacking technology, a promising trend for state-of-the-art high-performance processor designs, open up new approaches towards a scalable interconnection network. In this paper, we propose a novel 3D crossbar-based architecture that separates cores from cache modules into different 3D-stacked dies. We introduce an area model of the adopted crossbar and analyze the scalability of the proposed architecture up to 1024 communicating entities: cores and L2 cache banks.

Hybrid Limited-Pointer Linked-List (HLPLL) Cache Directory and Cache Coherence Protocol

Mostafa Mahmoud
M.Sc. Thesis, Cairo University, 2013.


Teaching Experience

  • 2014 – 2019

    ECE243 Computer Organization

    Basic computer structure. Design of central processing unit. Hardwired control. Input-output and the use of interrupts. Assembly language programming. Main memory organization and caches. Peripherals and interfacing. System design considerations.

  • 2014 – 2019

    ECE241 Digital Systems

    Digital logic circuit design. Optimizations of combinational logic. Transistor-level design of logic gates; propagation delay and timing of gates and circuits. Verilog. Memory in digital circuits, including latches, clocked flip-flops, and Static Random Access Memory. Set-up and hold times of sequential logic. Finite state machines. Hardware addition and multiplication.

  • 2017 – 2018

    CSC190 Computer Algorithms and Data Structures

    Introduction to C++ programming and computational thinking. Algorithm design techniques. Data structures: linked lists, stacks, queues, trees, heaps, hashing, pointers (including function pointers) and arrays, data types and bit operations. Dynamic memory management.

  • 2016 – 2020

    ECE352 Computer Organization

    Digital and Computer Systems. Synchronous and asynchronous sequential circuits, pipelining, integer and floating-point arithmetic, RISC processors.

  • 2015 – 2017

    ECE353 Operating Systems

    Operating system structure, processes, threads, synchronization, CPU scheduling, memory management, file systems, input/output, multiple processor systems, virtualization, protection, and security. The laboratory exercises will require implementation of part of an operating system.

My Office

The Edward S. Rogers Sr. Department of Electrical and Computer Engineering
10 King's College Rd.
University of Toronto
Toronto, ON M5S 3G4 Canada
Office: EA306 (Engineering Annex Building, Room 306)