Advanced Computer Architecture
HW #3  -  Due Friday, Nov. 4th 2005 at 23:59 under the door of EA310
Andreas Moshovos

1.       Consider a MIPS-like architecture that has delayed loads. That is, an instruction immediately following a load is guaranteed not to see the loaded value. In this architecture, the following two code fragments are equivalent:

                load r1, 10(r2)                    add  r3, r1, r4
                add  r3, r1, r4                    load r1, 10(r2)
                sub  r5, r1, 1                     noop
                                                   sub  r5, r1, 1


In either case, the “add” sees the value r1 had before the load and the “sub” sees the value loaded from memory. The “noop” (no operation) is necessary. Delayed loads “simplify” the pipeline control as it is not necessary for the hardware to check for load-use hazards; enforcement of load-use hazards is left to the compiler. Can you modify conventional out-of-order execution to correctly execute programs for this architecture? Assume a register renaming table such as the one we discussed in class and describe the necessary extensions. Assume that no control flow (a.k.a. branch) prediction is used. Of course, stalling execution after a load is one solution; however, it may not perform very well. Come up with a solution that allows out-of-order execution for all instructions following a load. Assume that it is illegal for an instruction in a delay slot to have the same destination register as the corresponding load. Briefly explain what changes need to be made to the RAT and the scheduler, if any. Be concise and describe your changes at a high level.
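
For concreteness, the following is a minimal C sketch of the delayed-load semantics above; the register numbers, memory contents, and the single delay slot are assumptions chosen only for illustration.

      /* Toy model of a delayed load: the instruction in the delay slot
         reads the register file before the loaded value is written back. */
      #include <stdio.h>

      int main(void) {
          int reg[8] = {0, 111, 4, 0, 5, 0, 0, 0}; /* r1 holds a stale 111   */
          int mem[16] = {0};
          mem[reg[2] + 10] = 999;                  /* value the load fetches */

          /* load r1, 10(r2): buffer the loaded value for one cycle instead
             of writing it to r1 immediately                                 */
          int load_buffer = mem[reg[2] + 10];

          /* delay slot: add r3, r1, r4 still sees the old r1 (111)          */
          reg[3] = reg[1] + reg[4];

          /* the load now completes; r1 is visible to later instructions     */
          reg[1] = load_buffer;

          /* sub r5, r1, 1 sees the loaded value (999)                       */
          reg[5] = reg[1] - 1;

          printf("r3 = %d (stale r1)\n", reg[3]);   /* prints 116 */
          printf("r5 = %d (loaded r1)\n", reg[5]);  /* prints 998 */
          return 0;
      }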

2.       Consider adding indirect addressing to the register file. That is, consider instructions of the following form:

op Rdest, (Rs1), Rs2

The semantics are:

            A = read register Rs1 from register file
            B = read register A from register file
            C = read register Rs2 from register file
            Rdest = B op C
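
The C sketch below mimics these semantics for op = add; the register-file size and the register contents are made up purely for illustration.

      /* Toy model of a register-indirect source operand: op Rdest, (Rs1), Rs2 */
      #include <stdio.h>

      static int regfile[32];

      static void add_indirect(int rdest, int rs1, int rs2) {
          int a = regfile[rs1];    /* A = read register Rs1                 */
          int b = regfile[a];      /* B = read register A (the indirection) */
          int c = regfile[rs2];    /* C = read register Rs2                 */
          regfile[rdest] = b + c;  /* Rdest = B op C, with op = add         */
      }

      int main(void) {
          regfile[4] = 7;          /* r4 names r7            */
          regfile[7] = 100;        /* the value actually read */
          regfile[5] = 23;
          add_indirect(3, 4, 5);   /* add r3, (r4), r5        */
          printf("r3 = %d\n", regfile[3]);  /* prints 123 */
          return 0;
      }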

Argue (one reason is good enough) why this is a good idea. Then argue why this might not be such a good idea by commenting on how these instructions impact register renaming. NOTE: Similar issues exist in stack-based ISAs.

3.       Is it possible for a dynamically scheduled, superscalar processor with imperfect branch prediction to perform better than the same processor with perfect branch prediction? Explain why or why not. If needed, provide a pseudo-code/block-level example.

4.       Write a code sequence of two MIPS instructions for which register renaming will increase the opportunities for out-of-order execution. Also, write a sequence of two MIPS instructions for which register renaming will not increase the opportunities for out-of-order execution.

5.       Current Instruction Set Architectures use an indirect specification of inter-instruction communication. In this specification, communication happens through registers. The specification is indirect because producing instructions do not directly name their consumers, or vice versa. Of course, this is not the only possible way of specifying inter-instruction communication. An alternative is to have the consuming instructions name the producer. For example, an instruction may have the following format:

  add (10), (2)

This is interpreted as: take the result of the instruction at distance 10 before this one (distance is measured in instructions, in execution order), take the result of the instruction at distance 2, and add them. Notice that no target register is specified. Explain why such an ISA might be a good idea. What are the implications for register renaming and code size? Will we be able to do everything that is possible with registers using such an ISA? Can you think of a scenario where this scheme creates complications?
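
To make the producer-naming idea concrete, here is a minimal C sketch in which each operand “(d)” is resolved against a small circular buffer of the most recent results; the window size and the dummy results are assumptions for illustration only.

      /* Toy resolution of distance-named operands such as "add (10), (2)" */
      #include <stdio.h>

      #define WINDOW 16
      static int results[WINDOW];   /* most recent WINDOW results   */
      static unsigned long count;   /* instructions executed so far */

      static int operand(unsigned long dist) {    /* read "(dist)"  */
          return results[(count - dist) % WINDOW];
      }

      static void produce(int value) {  /* record this instruction's result */
          results[count % WINDOW] = value;
          count++;
      }

      int main(void) {
          /* execute 12 dummy instructions whose result is just their index */
          for (int i = 0; i < 12; i++)
              produce(i);

          /* "add (10), (2)": add the results produced 10 and 2 instructions
             before this one                                                */
          int sum = operand(10) + operand(2);
          produce(sum);
          printf("add (10), (2) = %d\n", sum);  /* prints 12 (= 2 + 10) */
          return 0;
      }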

6.       In dynamically-scheduled, superscalar processors, producers broadcast their result tag to their consumers so that the consumers can determine when all of their source operands are available. In modern processors, the result tag broadcast happens the cycle the result is actually produced. This way, the result can be bypassed the next cycle directly to the consumers. For operations of fixed latency the aforementioned process is straightforward: if we know an add takes two cycles to produce its result, we have it broadcast its result tag at the beginning of the second cycle of its execution. Loads combined with caches present a challenge. With caches, a load requires a variable number of cycles to produce its result, depending on where it finds the data. Accordingly, we cannot safely broadcast the result tag at the beginning of the last cycle, since we do not know which cycle will be the last. Do a quick online survey to determine what solutions are currently used to overcome this limitation. Summarize your findings in less than 500 words (with something closer to 100 being preferable).
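
As a small illustration of the fixed-latency case described above, the sketch below computes when the tag of a two-cycle add would be broadcast; the cycle numbers and latency are made-up values, and a load's broadcast cycle could not be precomputed this way because its latency is unknown at issue time.

      /* Toy timing of result-tag broadcast for a fixed-latency operation */
      #include <stdio.h>

      int main(void) {
          int add_latency = 2;   /* cycles the add needs to produce its result */
          int issue_cycle = 5;   /* cycle the add starts executing             */

          /* broadcast the tag at the beginning of the last execution cycle    */
          int broadcast_cycle = issue_cycle + add_latency - 1;

          /* a consumer waiting only on this tag wakes up on the broadcast and
             executes the next cycle, picking up the result from the bypass    */
          int consumer_cycle = broadcast_cycle + 1;

          /* for a load, add_latency depends on where the data is found, so
             broadcast_cycle cannot be computed like this at issue time        */
          printf("add: issue %d, tag broadcast %d, dependent issues %d\n",
                 issue_cycle, broadcast_cycle, consumer_cycle);
          return 0;
      }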