Approximate Computing

Recent work proposes trading off accuracy for greater energy efficiency. The key insight of this work is that some compelling algorithms and applications, such as machine learning, image processing, video processing, gaming and computer vision, are tolerant to modest errors. These applications are inherently tolerant to some error in their output, whether due to noisy input data, multiple correct answers or not requiring perfect execution. Leveraging this error tolerance can lead to significant energy savings and performance improvements. As the speed gap between memory and processors continues to widen, removing memory accesses from the critical path of applications can reap significant performance advantages. Furthermore, forgoing memory accesses to power-hungry structures leads to lower-power designs. Our research leverages approximation and value similarity to avoid costly accesses to memory; value similarity observes that two values may be close enough that replacing one with the other will still yield an acceptable result. In load value approximation, memory accesses are replaced with predicted values. Predicting values allows the processor to consume them faster and avoids power-hungry accesses to large caches and off-chip memories; tolerance to errors allows this design to forgo the complexities associated with checkpointing and rollback in out-of-order processors. With Doppelganger Caches, we develop hardware techniques to identify values that are similar and avoid storing them redundantly in on-chip caches. This increases effective cache capacity, leading to performance improvements through lower miss rates. Our follow-on work, Bunker Caches, simplifies the Doppelganger architecture by exploiting the insight that approximately similar data follows regular spatial patterns, such as pixels in adjacent rows of an image.
This observation led to a practical cache design that reduces storage pressure leading to energy savings through power gating or performance improvement through increased effective capacity. In each of these projects, we synergistically consider the context of the data being processed and the low-level hardware structures that can be optimized to better handle data that exhibits temporal, spatial and value locality.
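
The idea of not storing similar values redundantly can be illustrated in software. The following is only a minimal sketch under stated assumptions, not the Doppelganger or Bunker hardware design: the `SimilarityCache` class, the `signature` function and the quantization step of 8.0 are all hypothetical, chosen just to show how two nearby blocks can share one stored copy while a read returns an approximate result. Eviction and overwrites are ignored.

```python
from typing import Dict, List, Tuple


def signature(block: List[float], step: float) -> Tuple[int, ...]:
    """Quantize each element so that nearby blocks share one signature."""
    return tuple(int(v // step) for v in block)


class SimilarityCache:
    """Toy store that keeps one data copy per group of similar blocks."""

    def __init__(self, step: float) -> None:
        self.step = step
        self.tags: Dict[int, Tuple[int, ...]] = {}          # address -> signature
        self.data: Dict[Tuple[int, ...], List[float]] = {}  # signature -> stored block

    def write(self, addr: int, block: List[float]) -> None:
        sig = signature(block, self.step)
        self.tags[addr] = sig
        # Only the first block seen with a given signature is stored.
        self.data.setdefault(sig, list(block))

    def read(self, addr: int) -> List[float]:
        # May return a similar block rather than the exact one written.
        return self.data[self.tags[addr]]


cache = SimilarityCache(step=8.0)
cache.write(0x100, [120.0, 121.0, 64.0])
cache.write(0x140, [122.0, 123.0, 66.0])  # close to the block at 0x100
cache.write(0x180, [10.0, 200.0, 90.0])   # clearly different
print(len(cache.tags), "addresses,", len(cache.data), "stored blocks")
print(cache.read(0x140))
```

In this toy run, three addresses map to two stored blocks, and the read of address 0x140 returns the block first written at 0x100. Whether that substitution is acceptable depends on the error tolerance of the application.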

Machine Learning Acceleration

Deep Neural Networks (DNNs) are becoming ubiquitous thanks to their exceptional capacity to extract meaningful features from complex pieces of information such as text, images, or voice. For example, DNNs, and in particular Convolutional Neural Networks (CNNs), currently offer the best recognition quality versus alternative object recognition algorithms. While current DNNs enjoy several practical applications, it is likely that future DNNs will be larger, deeper, process larger inputs, and be used to perform more intricate classification tasks at faster speeds, if not in real time. Accordingly, there is a need to boost hardware compute capability while reducing energy per operation, and to do so for smaller form factor devices. To improve energy efficiency, we are exploring hardware acceleration techniques for CNNs. First, we propose reduced and variable precision for different computations in CNNs. Some layers in the neural network can effectively compute their results with only a few bits of precision while others require close to 16 bits. Forcing the network to use a one-size-fits-all precision wastes energy and resources when accessing memory. We propose hardware support to store and transfer reduced-precision values in memory and convert them to full precision for computation within the accelerator. This results in bandwidth and energy savings while achieving accuracy within 1% of the original design. Second, we eliminate unnecessary computation within the hardware accelerator. At the heart of CNN computation are multiplications; however, considering a wide range of neural networks and inputs, we found that a large fraction of the multiplications have zero, or a value very close to zero, as one of their inputs. We propose hardware support to identify zero-value neurons and eliminate their computation on-the-fly without sacrificing the throughput of the accelerator. These two approaches both consider the values being computed as a means to improve energy efficiency.
As future work, these types of approaches can be applied to other machine learning algorithms. Our work on machine learning acceleration also pairs well with approximate computing which we plan to explore in greater detail in the future.
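
The two ideas above, reduced precision and skipping zero-valued inputs, can be sketched together in a few lines. This is an illustrative software analogy only, not the accelerator design: the `quantize` and `dot_skip_zeros` functions, the 4-bit width and the sample values are assumptions made for the example, and the sketch multiplies the quantized integers directly rather than converting back to full precision as the hardware proposal does.

```python
from typing import List, Tuple


def quantize(values: List[float], bits: int, max_abs: float) -> List[int]:
    """Map floats in [-max_abs, max_abs] to signed integers of `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    return [round(max(-max_abs, min(max_abs, v)) / max_abs * levels) for v in values]


def dot_skip_zeros(activations: List[int], weights: List[int]) -> Tuple[int, int]:
    """Return (dot product, number of multiplications actually performed)."""
    total, multiplies = 0, 0
    for a, w in zip(activations, weights):
        if a == 0:
            continue  # a zero activation cannot change the sum
        total += a * w
        multiplies += 1
    return total, multiplies


# Hypothetical neuron outputs: several are zero or very close to zero.
acts = quantize([0.0, 0.02, 0.9, 0.0, -0.5, 0.01], bits=4, max_abs=1.0)
wts = quantize([0.3, -0.7, 0.5, 0.2, 0.1, -0.9], bits=4, max_abs=1.0)
result, used = dot_skip_zeros(acts, wts)
print("activations:", acts)
print("multiplications performed:", used, "of", len(acts))
```

Note that at low precision the near-zero activations quantize to exactly zero, so they are skipped along with the true zeros. The result is identical to the full dot product over the quantized values; only the work performed changes.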

Simulation Methodologies and Acceleration

A significant challenge facing computer architects is the long latency associated with determining the performance of proposed system modifications. Simulation is the most common method for evaluating new architectures; among simulation methodologies, full-system simulation is widely favoured. Full-system simulation runs the application, the operating system and the architecture in great detail. This level of detail and architectural fidelity comes at a significant speed penalty; existing simulators can simulate 200 KIPS (kilo instructions per second) or four orders of magnitude slower than native execution. This problem is exacerbated when one wants to simulate a large number of cores. Simulation is inherently difficult to parallelize; these simulations can take several days or longer to complete. We are currently pursuing several avenues of research to address the challenges associated with computer architecture simulation: high-fidelity synthetic traffic models and analytical models for synchronization behaviour.
We are re-evaluating the early-stage design process and coming up with new patterns that are richer in detail and closer to current and emerging applications running on existing and novel cache coherence protocols. These parameterized traffic generators are able to model dependence relationships between messages and mimic different coherence protocol options prior to finalization of protocol-level details; they allow us to better examine the suitability of the network to certain protocols. We use machine learning to identify application phases and then extract a Markov model to reproduce key traffic characteristics. Network simulation can exit once the Markov chain has reached its steady state which can lead to substantial simulation time reductions. This work enables a very fast turnaround for performance analysis of NoCs allowing for large and accurate design-space explorations. The results obtained via our synthetic coherence models are a better predictor of the system performance obtained at later design stages than traditional synthetic traffic patterns are. As part of this work, we have identified synchronization overhead as a key challenge to developing accurate models and scaling them to larger systems; we are currently developing analytical models to estimate synchronization overhead for a wide variety of architectures and core counts. We have also extended these models to consider more complex coherence protocols such as those that would be present in heterogeneous systems composed of CPUs and GPUs; this work also develops techniques to extrapolate our traffic models to larger systems. These new models also incorporate a realistic memory traffic generator that captures key memory behaviour such as row buffer hits and bank conflicts. To further improve performance, we also apply sampling methodologies to NoC simulation.
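
To make the steady-state exit condition concrete, here is a minimal sketch of iterating a Markov chain until its state distribution stops changing. It is one possible reading of "steady state", not the traffic models described above: the three states, the transition matrix `P`, the tolerance and the function names are all hypothetical.

```python
from typing import List, Tuple


def step(dist: List[float], P: List[List[float]]) -> List[float]:
    """One Markov step: new_dist[j] = sum over i of dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]


def steps_to_steady_state(
    P: List[List[float]],
    start: List[float],
    tol: float = 1e-6,
    max_steps: int = 10_000,
) -> Tuple[int, List[float]]:
    """Iterate until the distribution changes by less than tol per step."""
    dist = list(start)
    for k in range(1, max_steps + 1):
        nxt = step(dist, P)
        if max(abs(a - b) for a, b in zip(dist, nxt)) < tol:
            return k, nxt
        dist = nxt
    raise RuntimeError("no steady state within max_steps")


# Hypothetical 3-state traffic model: idle, request burst, response burst.
P = [
    [0.6, 0.4, 0.0],
    [0.0, 0.3, 0.7],
    [0.5, 0.1, 0.4],
]
steps, steady = steps_to_steady_state(P, [1.0, 0.0, 0.0])
print("converged after", steps, "steps:", [round(p, 3) for p in steady])
```

A simulator driven by such a model could stop once this condition holds, since further steps would not change the long-run mix of traffic states.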

Routing and Flow Control Optimizations for Cache-Coherent Many-Core Architectures

Shared memory models continue to dominate many-core architecture proposals. Providing an on-chip network that efficiently handles cache coherence traffic is imperative for these types of systems. In recent work, we have proposed several optimizations for cache coherence traffic including routing algorithms to handle multicast and reduction traffic [ISCA 2008][HPCA 2012]. Due to abundant wiring on chip, cache coherence traffic contains a large fraction of short (single-flit) packets. We propose a novel flow control technique called Whole Packet Forwarding [HPCA 2012][TPDS 2013] specifically designed to exploit and optimize short packets in the on-chip network. These optimizations offer performance improvements and energy reductions for cache-coherence traffic. Alternatively, they can be used to reduce the required resources of the network while maintaining performance.

Application-Aware On-Chip Networks

As architectures scale to many cores, it becomes increasingly difficult to scale individual programs to fully utilize the available cores. As a result, multiple workloads are being consolidated on a single chip to maximize utilization. Existing routing algorithms, both deterministic and adaptive, largely overlook the issues associated with workload consolidation. Ideally, the performance of each application should be the same whether it is running in isolation or is co-scheduled with other applications. Significant research has focused on maintaining isolation and effectively sharing on-chip resources such as caches and memory controllers. Recently, we have proposed DBAR, a destination-based adaptive routing scheme [ISCA 2011][TC 2012]. DBAR dynamically filters network congestion information to prevent the traffic patterns and congestion of one workload from impacting the routing decisions of a separate workload.

To adapt to changing communication demands both within and across applications, we are exploring adaptive and configurable networks. Our most recent result explores the benefits of implementing bi-directional channels whose direction can be configured at a fine granularity [NOCS 2012]. This router microarchitecture can switch the directionality of channels with low overhead to provide additional bandwidth in the event of heavy traffic flows between particular source-destination pairs. By enabling intelligent network resources, we can potentially reduce the footprint of the network without loss in performance.

Die-Stacked Architectures

Interconnection Networks in Die-Stacked Systems. Fast, high-bandwidth access to memory is critical in modern and future systems. In early work, I explored the impact of memory traffic on the network-on-chip (NoC). The long latency to access memory remains a significant challenge in architecture; NoCs contribute to the bandwidth and latency associated with memory accesses and must be a key consideration when optimizing memory performance. Memory traffic results in hotspot patterns as there are far fewer memory controllers than cores. Looking beyond the impact on a single chip, we consider multi-chip systems designed on silicon interposers. Silicon interposer technology (“2.5D” stacking) enables the integration of multiple memory stacks with a processor chip, thereby greatly increasing in-package memory capacity while largely avoiding the thermal challenges of 3D stacking DRAM onto the processor. Current systems employing interposers for memory integration use the interposer to provide point-to-point interconnects between chips. However, these interconnects only utilize a fraction of the interposer’s overall routing capacity. We propose to extend the NoC architecture to the interposer to exploit otherwise unused routing resources. Once the choice has been made to integrate memory through an interposer, the cost of the interposer is factored into the system; our work looks at effectively using that interposer for increased performance. Follow-on work considers how the interposer can be leveraged to improve yield for large many-core systems. We exploit the interposer to “disintegrate” a multi-core CPU chip into smaller chips that individually and collectively cost less to manufacture than a single large chip. However, this fragments the overall NoC, which decreases performance as core-to-core messages between chips must now route through the interposer.
We study the performance-cost tradeoffs of interposer-based, multi-chip, multi-core systems and propose new interposer NoC organizations to mitigate the performance challenges while preserving the cost benefits. Composing a system of multiple, heterogeneous chips on an interposer introduces new routing challenges. It is desirable to maintain flexibility in the types of chips that can be integrated but the final, fully-integrated system must function correctly. Individual chips with deadlock-free interconnection networks can be composed into a system that suffers from deadlock. We propose a routing strategy to enable deadlock-free integration of multiple chips while limiting overheads and maintaining high performance [CS3]. Many open challenges and interesting opportunities remain in this relatively new area of silicon interposer-based systems.
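
The observation that deadlock-free chips can compose into a deadlock-prone system can be illustrated with a channel dependency graph, where an acyclic graph is a standard sufficient condition for deadlock freedom. The sketch below is only a toy check and is not the routing strategy proposed in our work: the channel names (`a1`, `a2`, `b1`, `b2`) and the dependencies added across the interposer are hypothetical.

```python
from typing import Dict, List


def has_cycle(deps: Dict[str, List[str]]) -> bool:
    """Depth-first search for a cycle in a channel dependency graph."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    colour = {node: WHITE for node in deps}
    for targets in deps.values():
        for t in targets:
            colour.setdefault(t, WHITE)

    def visit(node: str) -> bool:
        colour[node] = GREY
        for nxt in deps.get(node, []):
            if colour[nxt] == GREY:
                return True  # back edge: a cycle of waiting channels
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour[n] == WHITE and visit(n) for n in list(colour))


# Each chip alone: a packet holding the first channel may wait on the second.
chip_a = {"a1": ["a2"], "a2": []}
chip_b = {"b1": ["b2"], "b2": []}

# Hypothetical interposer links add dependencies a2 -> b1 and b2 -> a1.
system = {"a1": ["a2"], "a2": ["b1"], "b1": ["b2"], "b2": ["a1"]}

print(has_cycle(chip_a), has_cycle(chip_b), has_cycle(system))
```

Neither chip has a cycle on its own, but the composed graph does, because the inter-chip dependencies close a loop that no individual chip designer could see.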

Interconnection Network Optimizations

Mechanisms to facilitate efficient on-die communication are critical to future architectures. My research focuses on innovations in on-chip interconnection networks to provide scalable, low-latency, cost-effective communication. I address energy-efficient network design through innovations in dynamic voltage and frequency scaling (DVFS), network prioritization and novel topologies. Communication can consume a significant fraction of the on-die power budget. The goal of our work is to dynamically match network resources to demand. Providing more resources than required wastes power; providing fewer resources than needed leads to performance loss. Effectively scaling the voltage and frequency to match communication demands can mitigate some of the power costs of the network. To be most effective, a NoC DVFS technique should accurately predict communication phases. We leverage coherence protocol behaviour to identify traffic phases and proactively adjust the voltage and frequency of the NoC to match the traffic demands. Another approach we take is to divide the network into two subnetworks with different characteristics. One network provides fast communication for critical data while the other network provides more energy-efficient communication for data that can be slowed down without negatively affecting performance. We also explore the use of a speculative network to deliver performance-critical packets in a fast, best-effort manner. By allowing packets to be dropped, we can vastly simplify the network design, leading to an aggressive clock cycle and lower power by omitting buffers. Speculative packets are guaranteed to be delivered on the secondary network. Novel topologies can reduce network cost while maintaining the performance required by current applications. We develop low-radix topologies that require less area and power than current popular topologies such as meshes. Randomized long links keep the network diameter low.
The randomized connections in our network result in more uniform access to memory controllers within the on-die fabric regardless of placement. In this manner, the network provides greater equality of service; all threads receive similar latency and bandwidth from the network when accessing shared resources. We are further exploring novel topologies for heterogeneous systems where the size and demands of processing elements in the system may vary. This work includes floorplanning and innovations in routing algorithms to support the proposed topologies.
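
The effect of randomized long links on network diameter can be seen in a small experiment. The following is a sketch under stated assumptions rather than one of our proposed topologies: a 16-node ring stands in for a low-radix base network, and the helper names, link count and random seed are arbitrary choices made for illustration.

```python
import random
from collections import deque
from typing import Dict, Set

Graph = Dict[int, Set[int]]


def ring(n: int) -> Graph:
    """Low-radix base network: each node links to its two ring neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def add_random_links(graph: Graph, count: int, seed: int) -> Graph:
    """Add `count` bidirectional links between randomly chosen node pairs."""
    rng = random.Random(seed)
    nodes = list(graph)
    added = 0
    while added < count:
        a, b = rng.sample(nodes, 2)
        if b not in graph[a]:  # skip pairs that are already connected
            graph[a].add(b)
            graph[b].add(a)
            added += 1
    return graph


def diameter(graph: Graph) -> int:
    """Longest shortest-path hop count, via breadth-first search from each node."""
    worst = 0
    for src in graph:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist.values()))
    return worst


base = ring(16)
augmented = add_random_links(ring(16), count=4, seed=1)
print("ring diameter:", diameter(base))
print("with 4 random long links:", diameter(augmented))
```

Adding links can never increase the diameter, and a handful of long links usually shortens the worst-case path noticeably. How much it helps for a given seed and link count is an empirical question that this toy does not settle.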

In addition to these projects, I am currently recruiting outstanding Master's and PhD students to work on new projects. These projects explore various aspects of the memory system, on-chip network design and optimization.

If you have applied to graduate school at the University of Toronto and feel your interests align with mine, please email me. You are more likely to receive a response if you can demonstrate that you have read at least one of my papers. E-mails that address me as "Dear Sir:" will be ignored. Bonus points if you figure out that my full last name is "Enright Jerger". Undergraduates at the University of Toronto looking for summer research positions are also welcome to contact me. Unfortunately, I am unable to accept summer undergraduate students from international universities at this time.

We are grateful for the funding and in-kind contributions for these projects provided by the following: Natural Sciences and Engineering Research Council (NSERC), Ontario Centres of Excellence (OCE), Sloan Foundation, Connaught Foundation, University of Toronto, Percy Edward Hart Endowed Chair, Canada Foundation for Innovation (CFI), Ontario Ministry of Research and Innovation, Intel, Qualcomm, AMD and Fujitsu.

Last updated: May 2017
