## Titan: Large and Complex Benchmarks in Academic CAD

Kevin E. Murray, Scott Whitty, Suya Liu, Jason Luu, Vaughn Betz



## Outline

- Motivation
- Hybrid CAD Flow & Benchmarks
- VPR and Quartus II Comparison
- Conclusion and Future Work



# Motivation



## Evaluating FPGA Architectures and CAD

#### Must quantitatively compare:

- FPGA Architectures
- FPGA CAD Algorithms

Benchmarks often neglected



#### Good benchmarks:

- Exploit device characteristics (i.e. hard blocks)
- Comparable to modern device sizes

















## State of FPGA Benchmarks

#### MCNC20 (1991)

- < 1% of Stratix V
- No Hard Blocks

#### VTR (2012)

- < 5% of Stratix V
- Few Hard Blocks

Benchmarks will be even smaller relative to future devices





## Why Don't We Have Better Benchmarks?

Academic tools cannot handle real designs

- Limited HDL support
- No IP cores (vendor, 3rd party)



Vendor tools are too restrictive

- Limited to Vendor's Architectures
- Cannot modify CAD algorithms





#### **Options?**

#### Upgrade academic tools

- Add support for a wide range of HDLs
- Create an IP library
- A huge investment!

#### Exploit vendor tool strengths?

- Hybrid CAD flow





# Hybrid CAD Flow & Benchmarks



















## Titan Flow Capabilities & Limitations

| Experiment Modification                | VTR | Titan | Titan Flow Method                              |
|----------------------------------------|-----|-------|------------------------------------------------|
| Device Floorplan                       | Yes | Yes   | Architecture file                              |
| Inter-cluster Routing                  | Yes | Yes   | Architecture file                              |
| Clustered Block Size / Configuration   | Yes | Yes   | Architecture file                              |
| Intra-cluster Routing                  | Yes | Yes   | Architecture file                              |
| LUT size / Combinational Logic Element | Yes | Yes   | ABC re-synthesis                               |
| New RAM Block                          | Yes | Yes   | Architecture file (up to 16K depth*)           |
| New DSP Block                          | Yes | Yes   | Architecture file (up to 36 bit width*)        |
| New Primitive Type                     | Yes | No    | No method to pass black box through Quartus II |



\* Maximum for Stratix IV
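
The limits in the table above can be written down as a small feasibility check. This is only an illustrative sketch: the function and parameter names are hypothetical, and "16K depth" is assumed to mean 16,384 words.

```python
# Illustrative only: names are hypothetical; limits are read off the table above.
MAX_RAM_DEPTH = 16 * 1024  # assumption: "16K depth" means 16,384 words
MAX_DSP_WIDTH = 36         # bits


def titan_flow_supports(ram_depth=None, dsp_width=None, new_primitive=False):
    """Return (ok, reason) for a proposed architecture experiment."""
    if new_primitive:
        return False, "no method to pass a black box through Quartus II"
    if ram_depth is not None and ram_depth > MAX_RAM_DEPTH:
        return False, "RAM depth exceeds the Stratix IV maximum"
    if dsp_width is not None and dsp_width > MAX_DSP_WIDTH:
        return False, "DSP width exceeds the Stratix IV maximum"
    return True, "expressible in the architecture file"


if __name__ == "__main__":
    print(titan_flow_supports(ram_depth=8192))
    print(titan_flow_supports(dsp_width=72))
    print(titan_flow_supports(new_primitive=True))
```

Experiments that only change the floorplan, routing, or clustering pass this check trivially, since the table lists them as architecture-file changes.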

## Titan 23 Benchmarks

- 23 Benchmarks
- Wide range of application domains
- All make use of hard blocks (DSPs, RAMs)
- 90K to 1.9M netlist primitives

Application domains: Neural Network, Control Systems, Signal Processing, Communications, SHA Hashing, Computer Vision, Sorting, Microprocessor, DSP, Radar Processing, Communication Switch, On Chip Network, Matrix Decomposition, Multi-core, Image Processing


























# VPR and Quartus II Comparison







#### **VPR and Quartus II Flows**





### Titan Compatible Architecture

- Architecture must use the same primitives as logic synthesis
- Primitives can be grouped into arbitrary blocks

| Primitive     | Description        |
|---------------|--------------------|
| lcell_comb    | LUT and Full Adder |
| dffeas        | Register           |
| mlab_cell     | LUT RAM            |
| mac_mult      | Multiplier         |
| mac_out       | Accumulator        |
| ram_block     | RAM Slice          |
| io_{i,o}buf   | I/O Buffer         |
| ddio_{in,out} | DDR I/O            |
| pll           | Phase Locked Loop  |
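
The primitive-matching requirement above can be checked mechanically on a BLIF netlist, where black-box instances appear as `.subckt <model> ...` lines. The sketch below is a minimal illustration, not part of the Titan flow: the function names, the example netlist, and its port names are hypothetical, and it compares model names exactly, whereas real netlists may use prefixed or otherwise decorated names.

```python
# Illustrative sketch: flag netlist primitives the architecture does not model.
from collections import Counter

# Primitive names from the table above, with {a,b} patterns expanded.
ARCH_PRIMITIVES = {
    "lcell_comb", "dffeas", "mlab_cell", "mac_mult", "mac_out",
    "ram_block", "io_ibuf", "io_obuf", "ddio_in", "ddio_out", "pll",
}


def count_subckts(blif_text):
    """Count .subckt instances per model name in BLIF text."""
    counts = Counter()
    for line in blif_text.splitlines():
        tokens = line.split()
        if len(tokens) >= 2 and tokens[0] == ".subckt":
            counts[tokens[1]] += 1
    return counts


def unsupported_primitives(blif_text, arch_primitives=ARCH_PRIMITIVES):
    """Return sorted model names used in the netlist but absent from the architecture."""
    return sorted(m for m in count_subckts(blif_text) if m not in arch_primitives)


# Hypothetical netlist fragment; port names are made up for illustration.
EXAMPLE = """.model top
.subckt lcell_comb dataa=a datab=b combout=n1
.subckt dffeas d=n1 clk=clk q=q
.subckt my_custom_block in=q out=z
.end
"""

if __name__ == "__main__":
    print(unsupported_primitives(EXAMPLE))  # ['my_custom_block']
```

An empty result means every instantiated primitive has a counterpart in the architecture; how those primitives are grouped into blocks is left to the architecture file.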



#### Stratix IV Architecture Capture



**Floorplan:** Based on EP4SE820

**Fully Modeled Blocks:**

**Routing Network:** Mixture of long and short wires



## Architecture Details

#### LAB

- Detailed internal connectivity
- Full instead of partial crossbars
- Extra carry chain connectivity

#### M9K & M144K RAM Blocks

- All modes and sizes
- Approximated mixed-width modes



#### DSP Blocks

- All Stratix IV multiplier/accumulator modes
- Extra routing flexibility for packing

#### **ALM Internal Connectivity**

#### **Benchmark Completion**

| Tool       | Benchmarks Completed    |
|------------|-------------------------|
| Quartus II | 21/23                   |
| VPR        | 14/23                   |





#### Tool Performance vs. Benchmark Size





#### Tool Memory vs. Benchmark Size

































































#### Impact of Clustering





### Stratix IV & Academic LUT/FF Flexibility

- Additional flexibility in the Stratix IV architecture allows for denser packing
- Denser packing can be detrimental to wirelength

**Traditional Academic BLE**

**Tight Packing, Higher Wirelength**

**Loose Packing, Lower Wirelength**

**Stratix IV like Half-ALM**



# **Conclusion and Future Work**





- Titan Flow
  - Hybrid CAD Flow
  - Enables academic tools to use large benchmarks
- Titan23 Benchmark Suite
  - Significantly improves open-source FPGA benchmarks
- Comparison of VPR and Quartus II
  - Stratix IV architecture capture
  - VPR: 2.7x slower, 5.1x more memory, 2.6x more wirelength
  - Identified packing density as an important factor in wirelength



#### Future Work

- Timing-driven comparison



# Thanks!

# Questions?

**Email:** kmurray@eecg.utoronto.ca

**Titan Flow & Titan 23 Benchmarks:** http://uoft.me/titan

#### **Demo Night:**

From Quartus to VPR: Converting HDL to BLIF with the Titan Flow

