Jongsok Choi
University of Toronto
January 2012
We describe new multi-ported cache designs suitable for use in FPGA-based processor/parallel-accelerator systems, and evaluate their impact on application performance and area. The baseline system comprises a MIPS soft processor and custom hardware accelerators generated by high-level synthesis, which share a single memory architecture: an on-FPGA L1 cache backed by off-chip DDR2 SDRAM. Within this general system model, we evaluate traditional cache design parameters (cache size, line size, and associativity). In the parallel-accelerator context, we examine the impact of the cache design and its interface. Specifically, we look at how the number of cache ports affects performance when multiple accelerators operate (and access memory) in parallel, and we evaluate two different hardware implementations of multi-porting: 1) multi-pumping, and 2) a recently published approach based on the concept of a live-value table. Results show that application performance depends strongly on the cache interface and architecture: for a system with 6 parallel accelerators, the average speed-up ranges from 0.73X to 6.1X, depending on the cache design, relative to a baseline sequential system (with a single accelerator and a direct-mapped, 2KB cache with 32B lines).
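As a rough illustration of the cache design parameters named above, the following Python sketch derives the address layout implied by the baseline configuration (direct-mapped, 2KB, 32B lines). It assumes 32-bit byte addresses and power-of-two sizes, neither of which is stated in the abstract, and the function names `cache_geometry` and `split_address` are hypothetical helpers introduced only for this example; it is not a model of the authors' hardware.

```python
def cache_geometry(cache_bytes, line_bytes, ways=1, addr_bits=32):
    """Derive line/set counts and address field widths for a cache.

    Assumes power-of-two sizes and byte addressing (an assumption of
    this sketch, not a detail taken from the paper).
    """
    num_lines = cache_bytes // line_bytes
    num_sets = num_lines // ways             # ways=1 means direct-mapped
    offset_bits = line_bytes.bit_length() - 1  # log2(line_bytes)
    index_bits = num_sets.bit_length() - 1     # log2(num_sets)
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "num_lines": num_lines,
        "num_sets": num_sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
    }


def split_address(addr, geom):
    """Split a byte address into (tag, set index, byte offset)."""
    offset = addr & ((1 << geom["offset_bits"]) - 1)
    index = (addr >> geom["offset_bits"]) & ((1 << geom["index_bits"]) - 1)
    tag = addr >> (geom["offset_bits"] + geom["index_bits"])
    return tag, index, offset


# Baseline configuration from the abstract: direct-mapped, 2KB, 32B lines.
baseline = cache_geometry(cache_bytes=2048, line_bytes=32, ways=1)
print(baseline)
print(split_address(0x12345678, baseline))
```

Under these assumptions the baseline cache holds 64 lines, so an address splits into a 21-bit tag, a 6-bit index, and a 5-bit byte offset. Raising `ways` while holding the cache size fixed reduces the number of sets and shifts bits from the index into the tag, which is the trade-off an associativity sweep explores.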