Problem

In Problem 5.12, we were able to reduce the CPE for the prefix-sum computation to 3.00...

In Problem 5.12, we were able to reduce the CPE for the prefix-sum computation to 3.00, limited by the latency of floating-point addition on this machine. Simple loop unrolling does not improve things. Using a combination of loop unrolling and reassociation, write code for a prefix sum that achieves a CPE less than the latency of floating-point addition on your machine Doing this requires actually increasing the number of additions performed. For example, our version with two-way unrolling requires three ad-ditions per iteration, while our version with four-way unrolling requires five. Our best implementation achieves a CPE of 1.67 on our reference machine. Determine how the throughput and latency limits of your machine limit the minimum CPE you can achieve for the prefix-sum operation.

Step-by-Step Solution

Request Professional Solution

Request Solution!

We need at least 10 more requests to produce the solution.

0 / 10 have requested this problem solution

The more requests, the faster the answer.

Request! (Login Required)


All students who have requested the solution will be notified once they are available.
Add your Solution
Textbook Solutions and Answers Search
Solutions For Problems in Chapter 5