Problem

Suppose we wish to write a procedure that computes the inner product of two vectors u and v. An abstract version of the function has a CPE of 14–18 with x86-64 and 26–29 with IA32 for integer, single-precision, and double-precision data. By doing the same sort of transformations we did to transform the abstract program combine1 into the more efficient combine4, we get the following code:
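The listing is not reproduced in this copy. As a minimal sketch of what a combine4-style inner product might look like, assuming plain arrays and an explicit length in place of the vector abstraction (the name inner4, the signature, and the choice of double for data_t are assumptions, not the original code):

```c
#include <stddef.h>

/* Assumption: data_t stands in for the element type under test
   (int, float, or double); double is chosen here for illustration. */
typedef double data_t;

/* Hypothetical combine4-style inner product. The running sum is kept in a
   local variable, so the loop neither reads nor writes *dest per iteration;
   the result is stored once after the loop finishes. */
void inner4(const data_t *u, const data_t *v, size_t n, data_t *dest)
{
    data_t sum = (data_t) 0;
    for (size_t i = 0; i < n; i++) {
        sum = sum + u[i] * v[i];   /* one multiply and one add per element */
    }
    *dest = sum;
}
```

In this sketch each iteration's addition depends on the previous value of sum, while each multiplication depends only on the two elements loaded in that iteration.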

Our measurements show that this function has a CPE of 1.50 for integer data and 3.00 for floating-point data. For data type float, the x86-64 assembly code for the inner loop is as follows:

Assume that the functional units have the characteristics listed in Figure 5.12.

A. Diagram how this instruction sequence would be decoded into operations and show how the data dependencies between them would create a critical path of operations, in the style of Figures 5.13 and 5.14.

B. For data type double, what lower bound on the CPE is determined by the critical path?

C. Assuming similar instruction sequences for the integer code as well, what lower bound on the CPE is determined by the critical path for integer data?

D. Explain how the two floating-point versions can have CPEs of 3.00, even though the multiplication operation requires 5 clock cycles.