Assignment #5

Please read the following publication: Self-Optimizing Memory Controllers: A Reinforcement Learning Approach, Ipek et al., ISCA 2008.

Then answer the following questions:

A DRAM chip contains several independent banks. At a high level, what sequence of operations must be issued to a bank to read and write data? Briefly explain what a row is and what purpose it serves.
Briefly explain what the FR-FCFS policy is.
What is a Markov Decision Process? Please formally define it.
Why is a discounted cumulative reward function more appropriate for infinite horizon problems?
Explain what Q-values are.
What is epsilon-greedy action selection? Why is it needed?
What is CMAC and why is it used here? What goal does it serve?
In Fig. 6(a), why are there two parallel vertical pipes?

You are asked to think about the on-chip and off-chip memory system for our DaDianNao-like accelerator. Let's restrict attention to CNNs, where the layers are convolutional, pooling, and fully-connected. Assume you have an external memory interface that can provide X bytes/cycle. For example, a DDR3-1600 memory system can provide, at peak, 1600 x 10^6 transfers/sec x 8 B/transfer = 12,800,000,000 bytes/sec. Assuming a 1 GHz operating frequency for our accelerator, that translates into just 12.8 bytes/cycle.
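The unit conversion from a memory interface's peak rate to bytes per accelerator cycle can be sketched as follows. This is a minimal sketch, not part of the assignment handout: the function name `peak_bytes_per_cycle` and its parameters are hypothetical, and it assumes that the number in a name such as "DDR3-1600" counts millions (10^6) of transfers per second over a 64-bit (8-byte) channel.

```c
#include <assert.h>

/* Peak bytes delivered per accelerator cycle (hypothetical helper).
 *   mega_transfers_per_sec: e.g. 1600 for DDR3-1600 (assumed to mean 10^6 transfers/s)
 *   bytes_per_transfer:     e.g. 8 for a 64-bit channel (assumption)
 *   accel_hz:               accelerator clock in Hz, e.g. 1e9 for 1 GHz */
double peak_bytes_per_cycle(double mega_transfers_per_sec,
                            double bytes_per_transfer,
                            double accel_hz)
{
    assert(accel_hz > 0.0);
    /* Convert mega-transfers to transfers, then to bytes per second. */
    double bytes_per_sec = mega_transfers_per_sec * 1e6 * bytes_per_transfer;
    /* Divide by cycles per second to get bytes per cycle. */
    return bytes_per_sec / accel_hz;
}
```

Calling `peak_bytes_per_cycle(1600.0, 8.0, 1e9)` gives the peak figure for the example interface; sustained bandwidth in a real system will be lower than this peak.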

You are asked to think about strategies for allocating your on-chip memory so as to minimize off-chip traffic and keep the execution cores as busy as possible. We will provide you with the architecture of a few CNNs shortly. You will have to calculate how much traffic is needed to execute each layer, assuming a 1 GHz operating frequency for your accelerator. This is an open-ended problem, and I am looking for suggestions on how to start thinking about solutions. To further simplify things, let's consider only the convolutional layers.

Here are some starting pointers. Option #1 is to assume that everything fits on chip. Your traffic is then just reading the input activations for the first layer and writing the output activations from the last layer. What if you cannot fit all of the filters and all of the activations on chip? What are the options? One approach would be to treat weights and activations separately. Let's assume that you can indeed fit all activations on chip. What happens when you cannot fit all the weights for all layers on chip? What other options exist? Try to identify cases and report what you find.

Then switch the problem around. What if you can fit the weights but cannot fit the activations? What are the choices there?

Finally, if you can fit neither the weights nor the activations, can you suggest good ways to compute the layers?

For all options, calculate the bandwidth that would be needed from off-chip. That is the number of bytes needed to read the activations and weights, plus the number of bytes needed to write the output activations, with the sum divided by the number of cycles the accelerator takes to do the computation. Can you compare and contrast different policies for allocating on-chip memory between weights and activations with regard to the overall off-chip bandwidth?
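As a starting point, the per-layer bandwidth calculation might be sketched as below. Everything here is an illustrative assumption rather than the course's reference method: the `conv_layer` struct, the helper names, the `bytes_per_value` and `macs_per_cycle` parameters, and the simplification that each input activation, weight, and output activation crosses the chip boundary exactly once (no re-fetching, no bias terms). The cycle count assumes the accelerator sustains a fixed number of multiply-accumulates per cycle.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one convolutional layer. */
struct conv_layer {
    uint64_t in_c, in_h, in_w;    /* input activations: channels x height x width */
    uint64_t out_c, out_h, out_w; /* output activations: channels x height x width */
    uint64_t k;                   /* each of the out_c filters is k x k x in_c */
};

uint64_t input_bytes(const struct conv_layer *l, uint64_t bytes_per_value)
{
    return l->in_c * l->in_h * l->in_w * bytes_per_value;
}

uint64_t weight_bytes(const struct conv_layer *l, uint64_t bytes_per_value)
{
    return l->out_c * l->in_c * l->k * l->k * bytes_per_value;
}

uint64_t output_bytes(const struct conv_layer *l, uint64_t bytes_per_value)
{
    return l->out_c * l->out_h * l->out_w * bytes_per_value;
}

/* One multiply-accumulate per filter element per output value. */
uint64_t layer_macs(const struct conv_layer *l)
{
    return l->out_c * l->out_h * l->out_w * l->in_c * l->k * l->k;
}

/* Cycles needed if the accelerator sustains macs_per_cycle (assumption). */
uint64_t layer_cycles(const struct conv_layer *l, uint64_t macs_per_cycle)
{
    assert(macs_per_cycle > 0);
    return (layer_macs(l) + macs_per_cycle - 1) / macs_per_cycle; /* round up */
}

/* Off-chip bytes per cycle. Each flag is 1 if that data crosses the chip
 * boundary (exactly once, by assumption) and 0 if it stays on chip. */
double offchip_bytes_per_cycle(const struct conv_layer *l,
                               uint64_t bytes_per_value,
                               uint64_t macs_per_cycle,
                               int read_inputs, int read_weights,
                               int write_outputs)
{
    uint64_t traffic = 0;
    if (read_inputs)   traffic += input_bytes(l, bytes_per_value);
    if (read_weights)  traffic += weight_bytes(l, bytes_per_value);
    if (write_outputs) traffic += output_bytes(l, bytes_per_value);

    uint64_t cycles = layer_cycles(l, macs_per_cycle);
    assert(cycles > 0);
    return (double)traffic / (double)cycles;
}
```

The three flags make it easy to compare cases: for example, passing `0, 1, 0` models a layer whose activations stay on chip while its weights are streamed in. Policies that re-fetch data (for instance, tiling that reloads weights per tile) would need a multiplier on the corresponding term.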

Find the input data and further information here. Thank you to Kevin Siu for preparing these.

Due Thursday, March 8, before class. Submission link will be provided in due time. Do not e-mail.

Description

Due Thursday, February 15, before class. Submission link will be provided in due time. Do not e-mail.

Repeat Assignment #1 but use Intel's PIN tool.

Due Thursday, February 1, before class. Submission link will be provided in due time. Do not e-mail.

Getting to know the SimpleScalar Simulator

Read the lab0 and lab4 handouts from ECE552. These will introduce you to the SimpleScalar simulator. Our goal here is to modify the cache simulation module to implement a different replacement policy. The cache module is implemented in cache.c. The simplest simulator that uses it is sim-cache.c.

Part A: Modify cache.c to add a “not MRU” replacement policy. You will have to modify the cache_access() function and potentially others. For example, check whether you need to change cache_probe() too. Not MRU replaces one cache block chosen at random, excluding the MRU block. Run sim-cache on the go and gcc traces, first with LRU replacement and then with your not-MRU policy.
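The victim-selection step of not MRU can be sketched in isolation as below. This is a standalone illustration, not SimpleScalar code: the function `not_mru_victim` and its way-index interface are hypothetical, and how cache.c actually tracks the MRU block within a set is something you need to check in the source. The sketch assumes a set with at least two ways, since a direct-mapped set leaves no choice.

```c
#include <assert.h>
#include <stdlib.h>

/* Pick a victim way at random among all ways of a set except the MRU way.
 *   assoc:   number of ways in the set (must be > 1)
 *   mru_way: index of the most recently used way, in [0, assoc)
 * Uses rand(), so the choice is only approximately uniform. */
int not_mru_victim(int assoc, int mru_way)
{
    assert(assoc > 1);
    assert(mru_way >= 0 && mru_way < assoc);

    /* Draw from the assoc-1 non-MRU candidates: 0 .. assoc-2. */
    int r = rand() % (assoc - 1);

    /* Map the draw onto way indices, skipping over the MRU way. */
    return (r >= mru_way) ? r + 1 : r;
}
```

Drawing from `assoc - 1` candidates and shifting past the MRU index avoids a retry loop, so the selection takes constant time regardless of associativity.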

Part B: Read the following paper: Adaptive Insertion Policies for High-Performance Caching, M. K. Qureshi et al, IEEE/ACM Intl’ Symposium on Computer Architecture, .

Modify cache.c to implement DIP. No need to implement set dueling.


Andreas Moshovos