My research interests fall into the general area of parallel systems, i.e., systems with multiple processing elements. The last few years have witnessed a shift towards parallel systems with heterogeneous processing elements, exemplified by systems that integrate multicores with Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). Such heterogeneous parallel systems appear across a spectrum of computing platforms, ranging from mobile phones and tablets that augment their processor cores with GPUs for faster and more power-efficient computing, to high-performance computing platforms that utilize GPUs and FPGAs for higher levels of performance.
My research tackles an important challenge in using heterogeneous parallel systems: their programmability. To realize the performance potential of GPUs, application developers must apply a host of program optimizations to their code. Similarly, the use of FPGAs demands knowledge of hardware design and involves lengthy development cycles. These hurdles make it difficult for software application developers to use parallel heterogeneous systems effectively and productively. My research aims to design, implement and evaluate innovative compiler, run-time and architectural support for these systems. I have several projects that relate to GPU programmability and performance, to the design and implementation of high-performance FPGA overlays, and to architectural support for the programming of multicore systems.
Graduate Students: T.D. Han, J.D. Garvey, W. Feng, A. Lee, A. Chiu, M.C. Delorme
Sample Publications: ICPP15, CF15, ADAPT15, GTC14, ICPP13, GPGPU13, GPGPU11, TPDS11, GPGPU09
Automatic performance tuning of GPU programs
GPUs require software developers to restructure, or optimize, their application code to exploit the underlying GPU architecture. For example, developers may optimize their code to exploit memory coalescing, reduce thread divergence, use local and texture memories, and select kernel launch configurations that balance parallelism with resource usage. The impact of each optimization is often difficult to assess when combined with other optimizations. Programmers are left to explore a large space of optimization configurations, i.e., combinations of optimizations and their parameters, a space that can take months of compute time to explore exhaustively. In this project, my students and I develop compiler and run-time support that aids programmers in exploring this large space and in deciding which optimization configurations to use. We explore the use of machine-learning models to predict which configurations will perform well. We also explore online tuning approaches that use heuristics to navigate the optimization space.
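To make the scale of this space concrete, the sketch below enumerates a small, hypothetical optimization space and picks the best configuration against a stand-in cost model. The knobs, their candidate values, and the cost model are all illustrative, not the actual tuner developed in this project; a real tuner would time compiled kernels rather than evaluate a formula.

```python
import itertools

# Hypothetical optimization knobs and candidate values (illustrative only).
space = {
    "block_size":    [64, 128, 256, 512],
    "use_local_mem": [False, True],
    "unroll":        [1, 2, 4, 8],
}

def runtime_ms(cfg):
    # Stand-in for building and timing a kernel with this configuration.
    t = 10.0 / (cfg["block_size"] / 64)       # more parallelism helps ...
    t *= 0.8 if cfg["use_local_mem"] else 1.0  # ... as does local memory ...
    t /= cfg["unroll"] ** 0.5                  # ... and unrolling, with diminishing returns
    return t

# Exhaustive search is feasible for 4 * 2 * 4 = 32 points, but real
# optimization spaces are far too large to enumerate this way.
configs = [dict(zip(space, vals)) for vals in itertools.product(*space.values())]
best = min(configs, key=runtime_ms)
print(len(configs), best)
```

With thousands of knob combinations and minutes per measurement, enumerating every point quickly becomes intractable, which is what motivates the predictive models and online heuristics above.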
Directive-based programming for GPUs
This project develops high-level programming models for GPUs. More specifically, it involves the design and implementation of hiCUDA, a directive-based language for GPUs. The language facilitates programming GPUs through simple directives added to the sequential code, maintains the well-adopted CUDA/OpenCL programming models, and does so with no penalty to performance. We developed a prototype hiCUDA compiler, which is released to the public domain at www.hicuda.org. We are currently extending this infrastructure to support directive-based optimizations for OpenCL and CUDA kernels.
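As a loose analogy (hiCUDA's real directives are C pragmas, and their exact syntax is not reproduced here), the Python sketch below uses a decorator as a stand-in "directive" that marks a sequential loop nest for GPU execution while leaving the loop itself untouched. The decorator name and its parameters are purely illustrative.

```python
# Illustrative decorator playing the role of a directive: it annotates a
# sequential loop nest with a launch configuration without rewriting it.
def kernel(tblock, thread):
    def wrap(fn):
        def run(*args):
            # A real compiler would partition iterations across
            # tblock * thread GPU threads; here we just run sequentially.
            return fn(*args)
        run.launch_config = (tblock, thread)
        return run
    return wrap

@kernel(tblock=(16, 16), thread=(16, 16))
def matmul(A, B, n):
    # The sequential code itself is unchanged by the annotation.
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

I = [[1, 0], [0, 1]]
print(matmul(I, [[2, 3], [4, 5]], 2))
```

The point of the analogy is the separation of concerns: the algorithm stays sequential and readable, while the annotation carries the mapping onto the GPU.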
Compiler Optimizations for GPUs
This project developed new compiler optimizations for GPU programs. Much of the existing work in this area focused on optimizing memory performance. However, it ignored thread divergence due to branches and loops, which also impacts performance due to the GPU’s SIMT execution model. We explored three optimizations for reducing the impact of thread divergence purely in software and without the need for hardware support.
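One well-known software-only technique in this space is thread-data remapping, which reorders work so that all threads in a warp take the same branch path. The toy cost model below illustrates why divergence hurts and how remapping helps; it sketches the general idea, not the specific three optimizations studied in this project.

```python
# Toy SIMT cost model: a warp executes in lockstep, so it pays for
# every branch path that any of its lanes takes.
def warp_cost(preds, then_len, else_len):
    cost = 0
    if any(preds):          # some lane takes the "then" path
        cost += then_len
    if not all(preds):      # some lane takes the "else" path
        cost += else_len
    return cost

def total_cost(data, warp_size, then_len, else_len):
    preds = [x % 2 == 0 for x in data]  # illustrative branch condition
    return sum(warp_cost(preds[i:i + warp_size], then_len, else_len)
               for i in range(0, len(preds), warp_size))

data = list(range(64))  # evens and odds interleaved: every warp diverges
divergent = total_cost(data, 32, 10, 20)
# Remapping groups same-path elements into the same warp, so each warp
# executes only one path.
remapped = total_cost(sorted(data, key=lambda x: x % 2), 32, 10, 20)
print(divergent, remapped)  # 60 vs 30: remapping halves the cost here
```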
Graduate Students: D. Capalija, L. Barron.
Publications: FPL14, FPL13, FCCM11.
FPGAs offer massively parallel resources that, if exploited by application developers, can deliver high levels of performance. However, the widespread use of FPGAs to accelerate applications is hindered by: (1) their low-level programming abstraction, which requires expertise in hardware design that application developers often lack; and (2) the long development cycles associated with FPGA design tools, to which software developers are not accustomed. We advocate a solution to these problems that is based on overlays: FPGA circuits that are themselves programmable. Overlays can be designed to have software-friendly programming models and, once synthesized, can be quickly programmed without the need for FPGA design tools. They can lower the entry threshold to using FPGAs, allow rapid prototyping, and offer portability across different FPGAs, albeit at higher resource usage.
This work designed and prototyped an overlay architecture that exposes the software model of pipelined dataflow graphs (DFGs). The overlay consists of a set of cells connected in a 4-nearest-neighbor topology. Each cell contains a function unit capable of implementing the functionality of a machine instruction, and a set of programmable elastic pipeline stages that can be configured to connect the outputs of function units to the inputs of others. The overlay is heterogeneous in that each cell has a different function unit, specialized for a specific machine instruction. Instances of our overlay architecture deliver performance in the gigaflops range that scales with FPGA resources, and are fast to program. Current work focuses on (1) automatic extraction of DFGs from applications, (2) determining the best function units for an overlay for a given application, (3) extending the design of the overlay to multiple FPGA devices, and (4) exploring a just-in-time compilation framework for dynamically and transparently translating binary code into overlay circuits.
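The sketch below illustrates the basic idea of placing a small DFG onto a grid of heterogeneous cells: each node is greedily assigned to a free cell whose function unit matches its operation. The grid layout, operation set, and placement heuristic are illustrative only, and far simpler than the actual overlay, which must also route connections through neighboring cells' pipeline stages.

```python
# A 2x2 grid of cells, each specialized for one operation (illustrative).
GRID = [["add", "mul"],
        ["mul", "add"]]

# DFG for (a*b) + (c*d): node -> (operation, operand names)
DFG = {
    "m1": ("mul", ("a", "b")),
    "m2": ("mul", ("c", "d")),
    "s":  ("add", ("m1", "m2")),
}

def place(dfg, grid):
    # Greedily assign each DFG node to a free cell with a matching unit.
    free = {(r, c): op for r, row in enumerate(grid)
                       for c, op in enumerate(row)}
    placement = {}
    for node, (op, _) in dfg.items():
        cell = next(p for p, unit in free.items() if unit == op)
        placement[node] = cell
        del free[cell]
    return placement

def evaluate(dfg, inputs, node):
    # Functional simulation of the DFG (one value per node).
    op, args = dfg[node]
    vals = [inputs[a] if a in inputs else evaluate(dfg, inputs, a)
            for a in args]
    return vals[0] + vals[1] if op == "add" else vals[0] * vals[1]

print(place(DFG, GRID))
print(evaluate(DFG, {"a": 2, "b": 3, "c": 4, "d": 5}, "s"))  # 2*3 + 4*5 = 26
```

Because the cells are elastically pipelined, a placed DFG like this one can accept a new set of inputs every cycle, which is where the overlay's throughput comes from.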
Graduate Students: C. Segulja.
Publications: ISCA15, PACT14, HPCA12.
In this set of projects, we explore new architectural mechanisms that ease the programming of parallel systems. This work stems from the view that it is necessary for the hardware and the software to collaborate to address the problems of programming parallel systems.
Efficient Race Detection for Cleaner Parallel Execution Semantics
This work develops CLEAN, a system that precisely detects WAW and RAW races and deterministically orders synchronization operations. It proposes relatively inexpensive architectural mechanisms for supporting these two operations, resulting in cleaner semantics for racy programs. A software-only implementation of CLEAN runs all Pthread benchmarks from the SPLASH-2 and PARSEC suites with an average 7.8x slowdown. The overhead of precise WAW and RAW detection (5.8x) constitutes the majority of this slowdown. Simple hardware extensions reduce the slowdown of CLEAN’s race detection to on average 10.4% and never more than 46.7%.
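The essence of detecting such races over an execution trace can be sketched with vector clocks: a read or write races with a prior write if the two accesses are unordered by synchronization. The event format, lock-based ordering, and two-thread trace below are illustrative; CLEAN's actual mechanisms are architectural rather than trace-based.

```python
def detect(trace, nthreads):
    clocks = [[0] * nthreads for _ in range(nthreads)]  # per-thread vector clocks
    last_write = {}  # addr -> (writer tid, writer's clock at the write)
    locks = {}       # lock -> releasing thread's clock at last release
    races = []
    for tid, op, x in trace:
        clocks[tid][tid] += 1
        if op == "acquire":
            if x in locks:  # inherit ordering from the last release
                clocks[tid] = [max(a, b) for a, b in zip(clocks[tid], locks[x])]
        elif op == "release":
            locks[x] = list(clocks[tid])
        else:  # "read" or "write" of shared address x
            if x in last_write:
                wt, wc = last_write[x]
                # The prior write is unordered with this access iff this
                # thread has not yet "seen" the writer's clock component.
                if wt != tid and clocks[tid][wt] < wc[wt]:
                    races.append(("WAW" if op == "write" else "RAW", x, wt, tid))
            if op == "write":
                last_write[x] = (tid, list(clocks[tid]))
    return races

trace = [
    (0, "write", "x"),                    # unsynchronized write ...
    (1, "write", "x"),                    # ... racing write: WAW
    (0, "acquire", "L"), (0, "write", "y"), (0, "release", "L"),
    (1, "acquire", "L"), (1, "read", "y"), (1, "release", "L"),  # ordered: no race
]
print(detect(trace, 2))  # [('WAW', 'x', 0, 1)]
```

Maintaining this metadata on every memory access is what makes the software-only implementation slow, and what the hardware extensions accelerate.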
The Cost of Determinism
This work evaluates the extent to which the significant overheads (3X-10X) reported in the literature for systems that enforce deterministic execution stem from fundamentally imposing a deterministic order on the threads. To the best of our knowledge, this is the first “limits” study to characterize this overhead, in an area that is of growing importance to both academia and industry. It is also the first to characterize this overhead under changing execution conditions. The work reports that the overhead of deterministic ordering can be as low as 4% on average.
Synchronization-Free Deterministic Parallel Programming
This work developed a novel approach to synchronization-free and deterministic parallel programming of multicores. The approach uses hardware extensions to dynamically (i.e., at run time) establish a deterministic order of accesses to shared data. The approach (called versioning) requires no modifications to a multicore’s processors and can be efficiently implemented in an area equivalent to a small fraction of the processor’s cache. It also requires minimal compiler support. The viability of this approach is demonstrated using an FPGA-based prototype of a multicore running real applications. This approach will allow future multicores to dedicate a small fraction of their silicon resources to making the system easier to program.
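The core idea of versioning can be sketched in software: each access to a shared variable is tagged (e.g., by a compiler) with the version at which it must occur, and the access waits until the variable reaches that version, so the access order is the same under any thread schedule. The class and the static tagging scheme below are illustrative only, not the hardware mechanism itself.

```python
import threading

class Versioned:
    # A shared variable paired with a version counter; accesses are
    # admitted strictly in version order.
    def __init__(self, value):
        self.value, self.version = value, 0
        self.cond = threading.Condition()

    def access(self, required_version, fn):
        with self.cond:
            while self.version != required_version:
                self.cond.wait()
            self.value = fn(self.value)
            self.version += 1
            self.cond.notify_all()

shared = Versioned(0)
log = []

def worker(tid, turn):
    # Each access is statically assigned a version slot (its turn).
    shared.access(turn, lambda v: log.append(tid) or v + 1)

# Assign turn 0 to thread 2, turn 1 to thread 0, turn 2 to thread 1.
threads = [threading.Thread(target=worker, args=(t, turn))
           for turn, t in enumerate([2, 0, 1])]
for th in threads: th.start()
for th in threads: th.join()
print(log, shared.value)  # always [2, 0, 1] 3, regardless of scheduling
```

No locks appear in the program logic itself: the deterministic order is enforced entirely by the version counter, which is the role the hardware plays in the actual design.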
Over the years, my research has been supported by the Government of Canada, by the Province of Ontario and by industry, in Canada and in the U.S. I gratefully acknowledge the support of: the Natural Sciences and Engineering Research Council of Canada (NSERC), the Ontario Centres of Excellence (OCE), Qualcomm, Intel, NVIDIA, Advanced Micro Devices (AMD), STMicroelectronics and IBM Canada.