Impact of Cache Architecture and Interface on Performance and Area of FPGA-Based Processor/Parallel-Accelerator Systems

Jongsok Choi

University of Toronto

January 2012

We describe new multi-ported cache designs suitable for use in
FPGA-based processor/parallel-accelerator systems, and evaluate their
impact on application performance and area.  The baseline system
comprises a MIPS soft processor and high-level synthesis-generated
custom hardware accelerators with a shared memory architecture:
on-FPGA L1 cache backed by off-chip DDR2 SDRAM.  Within this general
system model, we evaluate traditional cache design parameters (cache
size, line size, associativity).  In the parallel accelerator context,
we examine the impact of the cache design and its interface.
Specifically, we look at how the number of cache ports affects
performance when multiple accelerators operate (and access memory) in
parallel, and evaluate two different hardware implementations of
multi-porting: 1) multi-pumping, and 2) a recently-published approach
based on the concept of a live-value table.  Results show that
application performance depends strongly on the cache interface and
architecture: for a system with 6 parallel accelerators, average
speed-up ranges from 0.73X to 6.1X depending on the cache design,
relative to a baseline sequential system (with a single accelerator
and a direct-mapped, 2KB cache with 32B lines).
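The two multi-porting implementations mentioned above can be sketched in software. The following is an illustrative behavioral model only, not the actual hardware design: the class and method names are hypothetical, and the live-value-table model follows the general published technique (per write port, a replicated memory bank plus a small table recording which bank holds the freshest value for each address), while the multi-pumping model services several logical ports sequentially within one system clock cycle by running the memory at a clock-frequency multiple.

```python
class LVTMultiPortRAM:
    """Behavioral model of a live-value-table (LVT) multi-ported memory.

    Each of the m write ports owns its own replicated bank (in hardware,
    a simple dual-ported block RAM).  The LVT is a small register-based
    table that records, per address, which bank last wrote that address;
    reads consult the LVT to select the bank holding the live value.
    """

    def __init__(self, depth, num_write_ports):
        self.banks = [[0] * depth for _ in range(num_write_ports)]
        self.lvt = [0] * depth  # bank index of the live value per address

    def write(self, port, addr, data):
        # A write port updates only its own bank...
        self.banks[port][addr] = data
        # ...and marks itself in the LVT as holding the live value.
        self.lvt[addr] = port

    def read(self, addr):
        # Any read port steers through the LVT to the correct bank.
        return self.banks[self.lvt[addr]][addr]


class MultiPumpedRAM:
    """Behavioral model of multi-pumping: the memory runs at a multiple
    of the system clock, so one physical port can service several
    logical port requests sequentially within a single system cycle."""

    def __init__(self, depth, pump_factor=2):
        self.mem = [0] * depth
        self.pump_factor = pump_factor

    def system_cycle(self, requests):
        # `requests` is a list of up to `pump_factor` tuples
        # (op, addr, data), serviced one per fast-clock sub-cycle.
        assert len(requests) <= self.pump_factor
        results = []
        for op, addr, data in requests:
            if op == "write":
                self.mem[addr] = data
                results.append(None)
            else:  # "read"
                results.append(self.mem[addr])
        return results
```

In the LVT model, simultaneous writes to different addresses never conflict because each write port has a private bank; the table only arbitrates which copy is current. Multi-pumping instead trades clock headroom for ports, which is why its benefit diminishes once the memory clock can no longer be multiplied.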