Advanced Computer Architecture
HW #2  -  Due Friday, October 21, 2005, by 11:59pm under the door of EA310

Instructor: Andreas Moshovos

Note: In questions 1 & 2 you will have to use a few benchmarks which will be made available through the course’s website.

 

0. Understanding MIPS assembly and the impact the compiler can have: Grab the file test.c from the website and compile it down to assembly. To do so you will need the ss-gcc (SimpleScalar PISA version) compiler. First compile using “ss-gcc -v -S test.c”. This uses the default level of optimization and will produce the file “test.S”. Take a look at test.S and comment on each line (what does it do?). Then compile using “ss-gcc -O3 -o test.O3.S -S test.c”. Look at test.O3.S and comment on what each line does. What are the key differences from the test.S file produced using the default optimization flags? Finally, compile using “ss-gcc -O3 -o test.O3 test.c” and then use “ss-objdump -d test.O3”. The latter command will print out a disassembly of the executable. Notice that there is additional code in there besides your boo() and main(). This includes some startup/initialization code and libraries.

Since you ran ss-gcc with the “-v” argument it will report exactly which steps it takes to compile your program. Notice that gcc is just a front-end that launches the various compilation phases. In turn these are the preprocessor (cpp), the C compiler (cc1), the assembler (as) and if needed the linker (ld). Notice the various default flags passed to each tool. 

What to hand-in: The annotated test.S and test.O3.S. Comments are needed only for the boo() function.

 1.        Limits of Conventional Superscalar and Out-of-Order Execution: You’ll have to perform a study on the “limits” of instruction-level parallelism. We’ll compare in-order superscalar execution with generalized out-of-order execution. Read all of the assignment before attempting a solution. The last section provides suggestions on how to proceed with developing the necessary infrastructure. The simulator’s output should report the average parallelism, i.e., the average number of instructions executed in parallel.

(A) In-order superscalar execution: Assume an ideal, infinitely wide (∞-way) machine. This machine can execute any number of instructions in parallel. However, since it uses in-order superscalar execution, it can execute instructions in parallel only if they are independent and adjacent. Furthermore, make the following assumptions: It can do register renaming and has infinite physical registers, so execution order is restricted only by true (RAW) dependences. It executes loads and stores in order since it does not support memory renaming. It resolves control flow auto-magically (it never waits for branches to execute before fetching and executing the target instructions). All instructions take a single cycle to complete.

Modify Simplescalar’s sim-safe to measure the average parallelism for the aforementioned model. A  register availability table returning the cycle at which each register becomes available should be sufficient for determining when the next instruction can execute.  An instruction can execute immediately after its source operands become available subject to the additional constraints of adjacency and memory ordering.

 (B) Generalized out-of-order execution: Give our “ideal” ∞-way machine a face-lift. Now it can do out-of-order execution even when instructions are not adjacent. Memory operations still execute in order and control flow is resolved auto-magically as before. However, an instruction cannot execute earlier than an instruction that precedes it by N instructions (N is our instruction window). Measure average parallelism for N = 64, 128 and 256.

Here are a couple of suggestions on how to proceed: Let’s assume a 256-wide instruction window. Build a 256-entry table (WIN) in sim-safe that is going to operate as a FIFO (it has head and tail pointers). As instructions are simulated, fill in the table in order. Also keep a table (Register Availability Table, or RAT) with an entry per register. This entry indicates the earliest time the register value is available. Upon inserting an instruction into WIN, determine its input register dependences. The instruction can execute only after the corresponding register values are available (given by the RAT). The output becomes available the next “cycle”; make sure to update the RAT accordingly. Instructions are retired from the WIN in order, that is, oldest first. New instructions enter WIN only when space becomes available (as the result of retiring “executed” instructions).

2.       Realistic Out-Of-Order Execution: Study the relative importance of various organizational parameters under a realistic out-of-order execution model. For this assignment we will use the sim-outorder simulator from the Simplescalar suite. Dump the default configuration into a file by running “sim-outorder -dumpconfig filename”. You will have to make changes in this file and use it (“-config filename”). Change the memory latency to 60 cycles for the first chunk and 4 cycles per subsequent chunk (“-mem:lat 60 4”).
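For reference, the dumped configuration file lists one option per line. The memory-latency entry should look roughly like the excerpt below; the comment wording and the default values shown (18 and 2) may differ between SimpleScalar versions, so locate the “-mem:lat” line in your own dump and edit it to read:

```
# memory access latency (<first_chunk> <inter_chunk>)
-mem:lat             60 4
```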

2.1.     Scheduler (window) size: The default configuration uses a small window of up to 16 instructions out of which up to 8 can be loads or stores. The relevant parameters are “-ruu:size” and “-lsq:size” respectively. Measure the average IPC for windows of 32, 64 and 128 instructions (in all cases set “-lsq:size” to half of the “-ruu:size”).

2.2.     Branch Prediction: The default configuration uses a bimodal predictor. Assuming a window of 128 instructions, try the 2-level and combined predictors. To do so, set the “-bpred” parameter accordingly (“-bpred 2lev” and “-bpred comb”). For all predictors, use the default configuration. If possible, also try the perfect predictor (“-bpred perfect”). Report the average IPC for all cases.

2.3.     Issue Width: Issue width is the maximum number of instructions that can be issued (i.e., commence execution) during the same cycle. The default configuration uses an issue width of 4. Assume a 128-instruction window (as in part 2.1) and a combined predictor (as in part 2.2) and simulate configurations with issue widths of 4, 8 and 16 instructions. The relevant parameters are “-issue:width”, “-decode:width”, “-commit:width” and “-fetch:ifqsize”. The last parameter is the size of the fetch queue (a buffer between the fetch unit and the decode stage). You will also have to change the number of functional units (“-res:xxx” options). Read through section 4.4 of the Simplescalar technical report for additional information.

2.4.     Data Cache Latency: Finally, the default data cache latency is 1 cycle, which is quite fast by today’s standards. Increase the latency to 2 and then to 3 cycles (parameter “-cache:dl1lat”). For this experiment use the configuration from part 2.3 with an issue width of 4.

2.5.     Briefly discuss your findings commenting on the relative importance of each parameter.

To obtain reasonable simulation times, skip through the first 20M instructions (-fastfwd 20000000) and then simulate the next 40M instructions (-max:inst 40000000). This is a very short run. It is generally not a good idea to draw conclusions from such short runs, however, due to time and resource limitations this is necessary for this assignment. As you will see, timing simulation (sim-outorder) is about two orders of magnitude slower than functional simulation (sim-safe).