International Conference on 
Parallel Architecture and Compilation Techniques
Toronto, Canada
October 25-29, 2008

The tutorials and workshops for PACT 2008 are listed below. Registration for tutorials and workshops is handled through the PACT 2008 registration site. Both half-day and full-day tutorials and workshops include breaks but not lunch.


The tutorials and workshops will be held on Saturday October 25th and Sunday October 26th in the conference hotel.


The daily schedule is as follows:

  • 08:00-08:30: Breakfast (included)
  • 08:30-10:00: Morning Session #1
  • 10:00-10:20: Morning Break
  • 10:20-12:00: Morning Session #2
  • 12:00-13:30: Lunch on your own
  • 13:30-15:00: Afternoon Session #1
  • 15:00-15:20: Afternoon Break
  • 15:20-17:00: Afternoon Session #2





SimFlex and ProtoFlex: Fast, Accurate, and Flexible Simulation of Multicore Systems

Eric Chung, Mike Ferdman, Nikos Hardavellas


Carnegie Mellon University




Introducing Microthreading and its Programming Model

Thomas Bernard, Mike Lankamp, Chris Jesshope

Universiteit van Amsterdam




Workshop on Parallel Architectures and Bioinspired Algorithms

J. Ignacio Hidalgo


Universidad Complutense de Madrid




Programming Models and Compiler Optimizations for GPUs and Multi-Core Processors
J. (Ram) Ramanujam, P. (Saday) Sadayappan

Louisiana State University



Productive Parallel Programming in PGAS
George Almasi, Christopher Barton, Calin Cascaval, Ettore Tiotto









MEDEA: Workshop on MEmory performance: DEaling with Applications, Systems and Architecture

Sandro Bartolini, Pierfrancesco Foglia, and Cosimo Antonio Prete


Università degli Studi di Siena

Università di Pisa


Transactional Memory

Presenters: Yang Ni, Tatiana Shpeisman, and Adam Welc
(Intel Corporation)




WoSPS: Workshop on Soft Processor Systems
J. Gregory Steffan


University of Toronto



Descriptions and Links:

Tutorial #1: SimFlex and ProtoFlex: Fast, Accurate, and Flexible Simulation of Multi-Core Systems


Computer architects have long relied on software simulation to evaluate the functionality and performance of architectural innovations. Unfortunately, modern cycle-accurate simulators are several orders of magnitude slower than real hardware, and the growing levels of hardware integration increase simulation complexity even further. In addition, conventional simulators are optimized for speed at the expense of code flexibility and maintainability. In this tutorial, we present the SimFlex and ProtoFlex family of simulation tools for fast, accurate, and flexible simulation of uniprocessor, multi-core and distributed shared-memory systems. SimFlex achieves fast simulation turnaround while ensuring representative results by leveraging the SMARTS simulation sampling framework. At the same time, its component-based design allows for easy composition of complex multi-core and multiprocessor systems. ProtoFlex is an FPGA-accelerated simulation technology that complements SimFlex by enabling full-system functional simulation of multiprocessor and multi-core systems at speeds one to two orders of magnitude faster than software tools.

First, we introduce attendees to the SMARTS simulation sampling approach. We present relevant background from statistics and compare and contrast statistical sampling with other sampling proposals. Second, we present the design, implementation and use of the Flexus simulator suite. Flexus is a family of component-based C++ architecture simulators that implement timing-accurate models of multi-core and multiprocessor systems. We give attendees hands-on experience with SimFlex/TraceFlex, a Flexus model for fast functional and memory system simulation, SimFlex/OoO, a Flexus model for cycle-accurate simulation, and Flexus' statistical managers and sampling tools. Finally, we present a hands-on technology preview of ProtoFlex. We give attendees the opportunity to compile, execute and profile multithreaded applications on a real operating system running on a BEE2 FPGA platform.
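As a rough illustration of the statistical idea behind simulation sampling, the sketch below estimates a mean per-unit metric (here a made-up CPI) from a small systematic sample and reports a confidence interval. It is a generic textbook calculation, not the SMARTS or Flexus implementation; the function names, the synthetic `population` data, and the chosen coefficient of variation are all assumptions made for the example.

```python
import math
import random
import statistics

def required_sample_size(cv, rel_error, z=1.96):
    """Units needed so the confidence interval is within +/- rel_error of
    the mean, given the coefficient of variation cv of the metric."""
    return math.ceil((z * cv / rel_error) ** 2)

def systematic_sample(values, n):
    """Pick evenly spaced units (at least n of them when len(values) >= n)."""
    step = max(1, len(values) // n)
    return values[::step]

def estimate_mean(sample, z=1.96):
    """Return (sample mean, half-width of the confidence interval)."""
    mean = statistics.fmean(sample)
    stderr = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean, z * stderr

# Hypothetical per-unit CPI values standing in for detailed-simulation output.
rng = random.Random(0)
population = [1.0 + 0.5 * rng.random() for _ in range(100_000)]

n = required_sample_size(cv=0.2, rel_error=0.03)
sample = systematic_sample(population, n)
mean, half_width = estimate_mean(sample)
print(f"sampled {len(sample)} of {len(population)} units: "
      f"CPI = {mean:.3f} +/- {half_width:.3f}")
```

The point of the sketch is that only a few hundred measured units are needed out of a very large population, which is why sampling can shorten simulation turnaround so dramatically.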

Tutorial #2: Introducing Microthreading and its Programming Model

At the University of Amsterdam, our research group developed a disruptive parallel programming model called the SANE Virtual Processor, or SVP, model [1] [2] [3], which captures all the parallelism in a program in order to enable ideal speedup. The model is also deadlock free and deterministic, and it provides dynamic and adaptive concurrency management.

This half-day tutorial will introduce the audience to the SVP model. Following this, we will present two instantiations of the model: first as instructions implementing its actions in a microgrid of DRISC processors [4], and then as the μTC language and its compiler, which is based on GCC. We will also show how we aim to parallelize C by translating C to μTC. Finally, we present results we have achieved so far using our compiler and emulator, which show speedup that scales with the number of processors used across many orders of magnitude.

[1] K. Bousias, L. Guang, C.R. Jesshope, M. Lankamp (2008), Implementation and Evaluation of a Microthread Architecture, Submitted to: Journal of Systems Architecture
[2] C. R. Jesshope (2007), A model for the design and programming of multicores, to be published in: Advances in Parallel Computing, IOS Press, Amsterdam
[3] C. R. Jesshope (2007), SVP and μTC - A dynamic model of concurrency and its implementation as a compiler target, Report (unpublished)
[4] T. Bernard, K. Bousias, L. Guang, C. R. Jesshope, M. Lankamp, M. W. van Tol and L. Zhang (2008), A general model of concurrency and its implementation as many-core dynamic RISC processors, SAMOS 08

Tutorial #3: Programming Models and Compiler Optimizations for GPUs and Multi-Core Processors

On-chip parallelism with multiple cores is now ubiquitous. Because of power and cooling constraints, recent performance improvements in both general-purpose and special-purpose processors have come primarily from increased on-chip parallelism rather than increased clock rates. Parallelism is therefore of considerable interest to a much broader group than developers of parallel applications for high-end supercomputers. Several programming environments have recently emerged in response to the need to develop applications for GPUs, the Cell processor, and multi-core processors from AMD, IBM, Intel, and others. As commodity computing platforms all go parallel, programming these platforms in order to attain high performance has become an extremely important issue. There has been considerable recent interest in two complementary approaches:

  • developing programming models that explicitly expose the programmer to parallelism; and
  • compiler optimization frameworks to automatically transform sequential programs for parallel execution.

This tutorial will provide an introductory survey covering both these aspects. In contrast to conventional multicore architectures, GPUs and the Cell processor have to exploit parallelism while managing the physical memory on the processor (since there are no caches) by explicitly orchestrating the movement of data between large off-chip memories and the limited on-chip memory. This tutorial will address the issue of explicit memory management in detail.
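To make the idea of explicitly orchestrated data movement concrete, here is a minimal sketch of a tiled matrix multiply in which each tile is copied into a small local buffer before it is used. The local buffers merely stand in for limited on-chip memory; the function names and the tile size are assumptions for illustration, and real GPU or Cell code would use the platform's own memory-transfer mechanisms instead.

```python
def load_tile(matrix, row0, col0, size):
    """Copy a tile of at most size x size elements into a fresh local
    buffer, standing in for an explicit off-chip to on-chip transfer."""
    return [row[col0:col0 + size] for row in matrix[row0:row0 + size]]

def tiled_matmul(a, b, tile):
    """Multiply two square matrices, touching only local tile copies
    inside the innermost computation."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                a_tile = load_tile(a, i0, k0, tile)  # staged input tile
                b_tile = load_tile(b, k0, j0, tile)  # staged input tile
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        c[i0 + i][j0 + j] += sum(
                            a_row[k] * b_tile[k][j]
                            for k in range(len(b_tile))
                        )
    return c

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
product = tiled_matmul(a, identity, tile=2)
```

The tile size bounds how much data is live in the local buffers at any time, which is the quantity a programmer has to match to the available on-chip memory.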

Tutorial #4: Productive Parallel Programming in PGAS


Partitioned Global Address Space (PGAS) programming languages offer an attractive, high-productivity programming model for parallel programming. PGAS languages, such as Unified Parallel C (UPC), combine the simplicity of shared-memory programming with the efficiency of the message-passing paradigm. The efficiency is obtained through a combination of factors: programmers declare how the data is partitioned and distributed between threads and use the SPMD programming model to define work; compilers can use the data annotations to optimize accesses and communication. We have demonstrated that UPC applications can outperform MPI applications on large-scale machines, such as BlueGene/L.


In this tutorial we shall present our work on IBM's XLUPC compiler. We will discuss language issues, compiler optimizations for PGAS languages, runtime trade-offs for scalability, and performance results obtained on a number of benchmarks and applications. Attendees should not only gain a better understanding of parallel programming, but also learn about compiler and system limitations. The expected outcome is that programmers will be able to code their applications such that performance optimization opportunities are exposed and exploited.
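The data partitioning described above can be sketched in a few lines. The example below models a block-cyclic layout of the kind UPC uses for shared arrays, where element `i` of an array declared with block size `B` has affinity to thread `(i / B) mod THREADS`, and then lets each simulated thread work only on the elements it owns, in SPMD style. It is plain Python written for illustration, not UPC and not output of the XLUPC compiler; the helper names are hypothetical.

```python
def owner(index, block_size, threads):
    """Thread with affinity to element `index` under a block-cyclic layout."""
    return (index // block_size) % threads

def local_indices(thread_id, n, block_size, threads):
    """Indices of an n-element shared array that belong to one thread."""
    return [i for i in range(n) if owner(i, block_size, threads) == thread_id]

def spmd_sum(data, block_size, threads):
    """Every simulated thread sums its own elements; the partial sums are
    then combined, as a reduction would do on a real machine."""
    partials = [
        sum(data[i] for i in local_indices(t, len(data), block_size, threads))
        for t in range(threads)
    ]
    return partials, sum(partials)

data = list(range(10))
partials, total = spmd_sum(data, block_size=2, threads=4)
print(local_indices(0, len(data), 2, 4), partials, total)
```

Because the layout is declared rather than hidden, a compiler can tell which accesses are local and which require communication, which is where the optimization opportunities come from.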

Tutorial #5: Transactional Memory

Transactions have recently emerged as a promising alternative to lock-based synchronization. The tutorial will cover a range of topics related to transactional memory, from the description of high-level language constructs and their semantics to the low-level details of specific algorithms used to support efficient execution of these constructs. We will take a programming systems view of transactional memory and walk the audience through each layer of the system, starting from the top-level programmer's view of transactional memory and working down to the implementation level. We show how transactional memory can avoid the problems of lock-based synchronization, such as deadlock and poor scalability, when lock-based software modules are composed. We discuss how transactional constructs can be added to languages, such as C/C++ or Java, as an alternative to current synchronization constructs. We present software strategies for implementing transactional memory and show how to leverage compiler optimizations to reduce its overheads. We also describe our experience writing transactional applications and present experimental results comparing their performance with that of lock-based applications. Finally, we discuss advanced topics related to the semantics of transactional language constructs, including isolation levels and integration with language memory models.
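The composition argument can be illustrated with a toy software transactional memory. The sketch below buffers writes, records the version of every variable it reads, and validates those versions at commit time, retrying on conflict. It is a deliberately simplified teaching model under stated assumptions (a single global commit lock, transaction bodies without side effects), not the design of the Intel C/C++ TM compiler or of any production system; `TVar`, `Transaction`, `atomically` and `transfer` are names invented for this example.

```python
import threading

_commit_lock = threading.Lock()  # serializes shared reads and commits

class TVar:
    """A shared variable whose version is bumped on every committed write."""
    def __init__(self, value):
        self.value = value
        self.version = 0

class Transaction:
    def __init__(self):
        self.reads = {}   # TVar -> version observed at first read
        self.writes = {}  # TVar -> buffered new value

    def read(self, tvar):
        if tvar in self.writes:          # read our own pending write
            return self.writes[tvar]
        with _commit_lock:
            self.reads.setdefault(tvar, tvar.version)
            return tvar.value

    def write(self, tvar, value):
        self.writes[tvar] = value        # invisible to others until commit

    def commit(self):
        with _commit_lock:
            if any(t.version != seen for t, seen in self.reads.items()):
                return False             # conflict: a variable we read changed
            for tvar, value in self.writes.items():
                tvar.value = value
                tvar.version += 1
            return True

def atomically(body):
    """Run body(tx) repeatedly until its transaction commits."""
    while True:
        tx = Transaction()
        result = body(tx)
        if tx.commit():
            return result

def transfer(tx, src, dst, amount):
    tx.write(src, tx.read(src) - amount)
    tx.write(dst, tx.read(dst) + amount)

# Two transfers composed into one atomic step, with no lock ordering to get wrong.
x, y, z = TVar(100), TVar(100), TVar(100)
atomically(lambda tx: (transfer(tx, x, y, 10), transfer(tx, y, z, 5)))
```

With locks, composing the two transfers would require the caller to know which locks each one takes and in what order; here the caller only wraps both calls in one atomic body.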


Yang Ni is a Research Scientist in Intel's Programming Systems Lab. He has been working on programming languages for platforms from mobile devices to chip multiprocessors. His current research focuses on transactional memory. He is a major contributor to the Intel C/C++ TM compiler. Yang received his Ph.D. in Computer Science from Rutgers University.


Adam Welc is a Research Scientist in Intel's Programming Systems Lab. His work is in the area of programming language design and implementation, with specific interests in concurrency control, compiler and run-time system optimizations, transactional processing as well as architectural support for programming languages and applications. Adam received the Master of Science in Computer Science from Poznan University of Technology, Poland, in July 1999. He continued his graduate studies at Purdue University, receiving the Master of Science in Computer Science in May 2003, and the Ph.D. in Computer Science in March 2006.


Tatiana Shpeisman is a Research Scientist in Intel's Programming Systems Lab. Her general research interest lies in finding ways to simplify software development while improving program efficiency. Her current research focuses on the semantics of transactional memory. In the past, she worked on dynamic compilation for managed runtime environments, IPF code generation and compiler support for sparse matrix computations. She holds a Ph.D. in Computer Science from the University of Maryland, College Park, and a B.S. in Applied Math from Leningrad Electrical Engineering Institute, Russia.

Workshop #1: Workshop on Parallel Architectures and Bioinspired Algorithms

Parallel computer architecture and bioinspired algorithms have been converging in recent years. On the one hand, applying bioinspired algorithms to difficult problems has shown that they need high computational power and communications technology; parallel architectures and distributed systems offer an attractive alternative to their sequential counterparts. On the other hand, and perhaps more interesting for the computer architecture community, bioinspired algorithms comprise a series of heuristics that can help optimize a wide range of tasks required for parallel and distributed architectures to work efficiently. Genetic Algorithms (GAs), Genetic Programming (GP), Ant Colony Algorithms (ACAs) and Simulated Annealing (SA) are now helping designers advance computer architecture, while improvements in parallel architectures make it possible to run compute-intensive bioinspired algorithms for solving other difficult problems. The literature contains several evolutionary solutions for design problems such as partitioning and place and route, which enable technology improvements. Researchers have also used these meta-heuristics to optimize computer architectures, load balancing, instruction code, and other related problems. Nevertheless, any effort to strengthen the relationship between the two fields would be very welcome by the community. This workshop will gather scientists, engineers, and practitioners to share and exchange their experiences, discuss challenges, and report state-of-the-art and in-progress research on all aspects of the answer to two questions: What can bioinspired algorithms do for parallel computer architectures? And what can parallel computer architectures do for bioinspired algorithms?

Workshop #2: WoSPS: Workshop on Soft Processor Systems


Processors implemented in programmable logic, called soft processors, are becoming increasingly important in both industry and academia. FPGA-based processors provide an easy way for software programmers to target FPGAs without having to write in a hardware-description language; hence designers of FPGA-based embedded systems are increasingly including soft processors in their designs. Soft processors will also likely play an important role in FPGA-based co-processors for high-performance computing. Furthermore, academics are embracing FPGA-based processors as the foundation of systems for faster architectural simulation. In all cases, we need to develop a deeper understanding of processor and multiprocessor architecture for this new medium.

This workshop will serve as a forum for academia and industry to discuss and present challenges, ideas, and recent developments in soft processors, soft multiprocessors, application-specific soft processors, and soft-processor-based accelerators and architectural simulation platforms.


Workshop #3: MEDEA: Workshop on MEmory performance: DEaling with Applications, Systems and Architecture


MEDEA aims to continue the high level of interest of the previous editions, held with the PACT conference since 2000.


Due to the ever-increasing gap between CPU and memory speed, there is always great interest in evaluating and proposing processor, multiprocessor, CMP, multi-core and system architectures dealing with the "memory wall" and wire-delay problems. At the same time, modular high-level design is becoming increasingly attractive as a way to reduce design costs.


In this scenario, design solutions and their corresponding performance are shaped by the combined pressure of a) technological opportunities and limitations, b) features and organization of system architecture and c) critical requirements of specific application domains. Evaluating and controlling the effects on the memory subsystem (e.g. caches, interconnection, bus, memory, coherence) of any architectural proposal is extremely important both from the performance (e.g. bandwidth, latency, predictability) and power (e.g. static, dynamic, manageability) points of view.


In particular, the emerging trend of single-chip multi-core solutions will push towards new design principles for memory hierarchies and interconnection networks, especially when the design aims to build systems with a high number of cores (many-core rather than multi-core systems) that scale performance and power efficiency across a variety of application domains.


From a slightly different point of view, the mutual interaction between an application's behavior and the system on which it executes is responsible for the figures of merit of the memory subsystem and therefore pushes towards specific solutions. In addition, it can suggest specific compile-time or link-time tunings for adapting the application to the features of the target architecture.


In the overall picture, power consumption requirements are increasingly important cross-cutting issues and raise specific challenges.


Typical architectural choices of interest include, but are not limited to, single processors, chip and board multiprocessors, SoC, traditional and tiled/clustered architectures, multithreaded or VLIW architectures with emphasis on single-chip design, massive parallelism designs, heterogeneous architectures, architectures equipped with application-domain accelerators as well as endowed with reconfigurable modules. Application domains encompass embedded (e.g. multimedia, mobile, automotive, automation, medical), commercial (e.g. Web, DB, multimedia), scientific and networking applications, security, etc. The emerging network on chip infrastructure and transactional memory may suggest new solutions and issues.


The MEDEA workshop aims to remain a forum where people from academia and industry can meet, discuss and exchange their ideas, experience and solutions in the design and evaluation of architectures for embedded, commercial and general/special purpose systems, taking into account memory issues both directly and indirectly.


The workshop proceedings will be published under an ACM ISBN and will also appear in the ACM Digital Library. As in previous years, a selection of papers will be considered for publication in Transactions on HiPEAC.


The format of the workshop includes the presentation of selected papers and discussion after each presentation.