
Accomplishments

Our first study showed that deeper pipelining on the DSP56000 is a feasible way to achieve better performance [1,2]. The results indicated that a straightforward architectural technique could improve performance and that much more room for improvement remained.

The next step was to develop a compiler and simulation environment for studying compiler optimizations and their effect on performance. The first compiler was based on gcc [3,4] and targeted a VLIW-style machine, where the primary goal was to make the work easy for the compiler and to avoid introducing hardware constraints that could affect performance. We developed a DSP post-optimizer phase that handles low-overhead looping constructs, memory partitioning, operation compaction, and modulo addressing [5]. A simulator executes the resulting code and reports statistics. More recent work has produced a SUIF front end, which lets us capture much more information about the source code and makes it easier to experiment with new optimizations.
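To make two of the post-optimizer targets concrete, here is a minimal C sketch of the kind of source pattern involved: a FIR filter whose delay line is indexed with explicit wraparound (the software form of modulo addressing) and whose inner loop has a fixed trip count (a natural candidate for a low-overhead hardware loop). The names `fir_state`, `fir_step`, and `NTAPS` are hypothetical and chosen for illustration; this is not code from our benchmark suite or compiler.

```c
#include <stddef.h>

#define NTAPS 4  /* hypothetical filter length, for illustration only */

typedef struct {
    int delay[NTAPS];  /* circular delay line */
    size_t head;       /* index of the newest sample */
} fir_state;

/* Push one sample into the delay line and return the filter output. */
int fir_step(fir_state *s, const int coeff[NTAPS], int sample)
{
    /* Advance the write position with wraparound. A DSP with modulo
       addressing can do this update in the address unit for free. */
    s->head = (s->head + 1) % NTAPS;
    s->delay[s->head] = sample;

    int acc = 0;
    size_t idx = s->head;

    /* Fixed trip count: a candidate for a zero-overhead hardware loop. */
    for (size_t k = 0; k < NTAPS; k++) {
        acc += coeff[k] * s->delay[idx];
        idx = (idx + NTAPS - 1) % NTAPS;  /* step back one sample, wrapping */
    }
    return acc;
}
```

Feeding a unit impulse through a zero-initialized `fir_state` returns the coefficients one per call, which is a quick way to check the wraparound logic.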

We developed a small set of benchmarks coded in C, consisting of the typical DSP kernels that commercial vendors use to measure performance and a set of application benchmarks. Some of the benchmarks are coded in several different styles so that we can see how our compiler behaves with each [6]. More information on the current benchmark suite is available on the web.

Early results from our system showed that kernels and applications have very different characteristics and that using kernels as the sole benchmarks for evaluating performance can be misleading [7]. When evaluating new architectures, it is necessary to use complete applications. We have also shown that with our VLIW model and compiler system, it is possible to generate efficient code [5].

To explore the challenges and gain an understanding of generating code that performs as well as handwritten assembly code, we developed an optimizing C compiler for the TI TMS320C25. We were able to generate code comparable, and in some cases superior, in performance to the version of the TI compiler we had [8].

One of the interesting challenges in compilation is handling the dual memory banks found in many DSPs. Most existing compilers rely on language extensions or on pragmas added to the program. We are able to partition data without either of these techniques [9,10]. As part of this work, we also developed another way to improve performance on this type of memory system, which we call partial data duplication [10]. Duplicating appropriate memory locations gives some applications significant performance improvements. We have not seen this technique used in assembly language programming, but our compiler and analysis tools can find the instances where it is beneficial and handle the necessary consistency issues.
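The intuition behind data duplication can be sketched with a toy cost model. The C code below assumes a hypothetical two-bank memory in which two loads complete in one cycle only when they hit different banks, and charges one extra store per duplicated element as the consistency cost. The cost model, the autocorrelation example, and all names (`mem_bank`, `load_pair_cycles`, `autocorr_single_bank`, `autocorr_duplicated`) are illustrative assumptions, not the analysis described in [10].

```c
#include <stddef.h>

/* Hypothetical two-bank memory model: one access per bank per cycle. */
typedef enum { BANK_X, BANK_Y } mem_bank;

/* Modeled cycles to fetch two operands: parallel only across banks. */
static unsigned load_pair_cycles(mem_bank a, mem_bank b)
{
    return (a == b) ? 2u : 1u;
}

/* r[lag] = sum over i of x[i] * x[i + lag], with x held in bank X only.
   Both operands come from the same bank, so every pair costs 2 cycles.
   Returns the modeled number of memory cycles. */
unsigned autocorr_single_bank(const int *x, size_t n, long *r, size_t nlags)
{
    unsigned cycles = 0;
    for (size_t lag = 0; lag < nlags; lag++) {
        r[lag] = 0;
        for (size_t i = 0; i + lag < n; i++) {
            r[lag] += (long)x[i] * x[i + lag];
            cycles += load_pair_cycles(BANK_X, BANK_X);
        }
    }
    return cycles;
}

/* Same computation with x duplicated into x_dup, assumed to live in
   bank Y. Each pair now costs 1 cycle, at the price of one modeled
   store per duplicated element to keep the two copies consistent. */
unsigned autocorr_duplicated(const int *x, int *x_dup, size_t n,
                             long *r, size_t nlags)
{
    unsigned cycles = 0;
    for (size_t i = 0; i < n; i++) {
        x_dup[i] = x[i];
        cycles += 1u;  /* consistency cost of maintaining the copy */
    }
    for (size_t lag = 0; lag < nlags; lag++) {
        r[lag] = 0;
        for (size_t i = 0; i + lag < n; i++) {
            r[lag] += (long)x[i] * x_dup[i + lag];
            cycles += load_pair_cycles(BANK_X, BANK_Y);
        }
    }
    return cycles;
}
```

Under this model, duplication pays off only when the duplicated data is reused often enough to amortize the extra stores. With a single lag the copy costs more than it saves, which is why the decision is better left to analysis than applied everywhere.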

Software pipelining is a common technique among DSP programmers, and we have new results for handling conditionals in software-pipelined loops [11]. Some experts have said that these results solve the last remaining issue in applying this technique.
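For readers unfamiliar with the problem, the following C sketch shows a loop containing a conditional, first in its plain form and then software-pipelined by hand. The conditional is handled here with if-conversion (a branch-free select), a standard textbook approach; it is shown only to illustrate why conditionals complicate pipelined loops and is not the method of [11]. The function names are hypothetical.

```c
#include <stddef.h>

/* Reference loop: clip negative samples to zero, then scale. */
void clip_scale_ref(const int *x, int *y, size_t n, int gain)
{
    for (size_t i = 0; i < n; i++) {
        int v = x[i];
        if (v < 0)
            v = 0;
        y[i] = v * gain;
    }
}

/* Hand software-pipelined version. The load for iteration i + 1 is
   overlapped with the clip and multiply of iteration i. The branch is
   replaced by a select so the kernel body is straight-line code. */
void clip_scale_pipelined(const int *x, int *y, size_t n, int gain)
{
    if (n == 0)
        return;

    /* Prologue: start iteration 0 by issuing its load. */
    int loaded = x[0];

    /* Kernel: finish iteration i while starting iteration i + 1. */
    for (size_t i = 0; i + 1 < n; i++) {
        int next = x[i + 1];                /* load stage, iteration i + 1 */
        int v = (loaded < 0) ? 0 : loaded;  /* select stage, iteration i */
        y[i] = v * gain;                    /* multiply and store, iteration i */
        loaded = next;
    }

    /* Epilogue: drain the final iteration. */
    int v = (loaded < 0) ? 0 : loaded;
    y[n - 1] = v * gain;
}
```

If-conversion works well for a short conditional like this one, but it executes both paths on every iteration, so it becomes wasteful when the branches are long or unbalanced.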

With our current infrastructure, we were recently able to demonstrate quantitatively that it is necessary to move beyond traditional DSP architectures [12] if significant performance benefits are desired. We show that instruction set design has an important effect on the performance of programmable DSPs. A simple VLIW model can improve performance by at least a factor of two to three at minimal additional cost in silicon area. We believe the actual benefit is much larger, because our results were based on a much simpler compilation model for the traditional processors than really exists.

As part of our exploration of architectures, we are also examining the feasibility of using vector processor technology for multimedia applications as a way of achieving a high degree of processor parallelism without the complexity of modern processors. Current results from this work can be seen at http://www.eecg.toronto.edu/~corinna/vector/index.html.


Paul Chow 2005-01-02