Question

The basic pipeline for DLX has five stages: IF, ID, EX, MEM, and WB. Assuming all...

The basic pipeline for DLX has five stages: IF, ID, EX, MEM, and WB. Assuming all memory accesses take 1 clock cycle.

a) What are the three types of pipeline hazards?

b) What is the control (branch) hazard of an instruction pipeline? Provide three branch prediction alternatives to reduce branch hazards.

c) What is the data forwarding scheme used to reduce the data hazard?

d) Give all the forwarding paths of the five-stage DLX pipeline, including sources, destinations, and information transferred.

0 0
Add a comment Improve this question Transcribed image text
Answer #1

There are situations, called hazards, that prevent the next instruction in the instruction stream from being executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining.

There are three classes of hazards:

Structural Hazards. They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
Data Hazards. They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
Control Hazards.They arise from the pipelining of branches and other instructions that change the PC.

Hazards in pipelines can make it necessary to stall the pipeline. The processor can stall on different events:

A cache miss. A cache miss stalls all the instructions on pipeline both before and after the instruction causing the miss.
A hazard in pipeline. Eliminating a hazard often requires that some instructions in the pipeline to be allowed to proceed while others are delayed. When the instruction is stalled, all the instructions issued later than the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction must continue, since otherwise the hazard will never clear.

In case of structural hazards:

Clock cycle number

Instr

1

2

3

4

5

6

7

8

9

10

Instr i

IF

ID

EX

MEM

WB

Instr i+1

IF

ID

EX

MEM

WB

Instr i+2

IF

ID

EX

MEM

WB

Stall

bubble

bubble

bubble

bubble

bubble

Instr i+3

IF

ID

EX

MEM

WB

Instr i+4

IF

ID

EX

MEM

WB

A stall causes the pipeline performance to degrade the ideal performance.

                                                          Average instruction time unpipelined

Speedup from pipelining   =        ----------------------------------------

                                                        Average instruction time pipelined

                                                     CPI unpipelined * Clock Cycle Time unpipelined

                                                 = -------------------------------------

                                                      CPI pipelined * Clock Cycle Time pipelined

The ideal CPI on a pipelined machine is almost always 1. Hence, the pipelined CPI is

CPIpipelined = Ideal CPI + Pipeline stall clock cycles per instruction

= 1 + Pipeline stall clock cycles per instruction

If we ignore the cycle time overhead of pipelining and assume the stages are all perfectly balanced, then the cycle time of the two machines are equal and

                    CPI unpipelined

Speedup = ----------------------------

                    1+ Pipeline stall cycles per instruction

If all instructions take the same number of cycles, which must also equal the number of pipeline stages ( the depth of the pipeline) then unpipelined CPI is equal to the depth of the pipeline, leading to

                         Pipeline depth

Speedup = --------------------------

                       1 + Pipeline stall cycles per instruction

If there are no pipeline stalls, this leads to the intuitive result that pipelining can improve performance by the depth of pipeline.

Structural Hazards

When a machine is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all posible combinations of instructions in the pipeline.
If some combination of instructions cannot be accommodated because of a resource conflict, the machine is said to have a structural hazard.

Common instances of structural hazards arise when

Some functional unit is not fully pipelined. Then a sequence of instructions using that unpipelined unit cannot proceed at the rate of one per clock cycle
Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.

Example1:
a machine may have only one register-file write port, but in some cases the pipeline might want to perform two writes in a clock cycle.

Example2:
a machine has shared a single-memory pipeline for data and instructions. As a result, when an instruction contains a data-memory reference(load), it will conflict with the instruction reference for a later instruction (instr 3):

Clock cycle number

Instr

1

2

3

4

5

6

7

8

Load

IF

ID

EX

MEM

WB

Instr 1

IF

ID

EX

MEM

WB

Instr 2

IF

ID

EX

MEM

WB

Instr 3

IF

ID

EX

MEM

WB


To resolve this, we stall the pipeline for one clock cycle when a data-memory access occurs. The effect of the stall is actually to occupy the resources for that instruction slot. The following table shows how the stalls are actually implemented.

Clock cycle number

Instr

1

2

3

4

5

6

7

8

9

Load

IF

ID

EX

MEM

WB

Instr 1

IF

ID

EX

MEM

WB

Instr 2

IF

ID

EX

MEM

WB

Stall

bubble

bubble

bubble

bubble

bubble

Instr 3

IF

ID

EX

MEM

WB

Instruction 1 assumed not to be data-memory reference (load or store), otherwise Instruction 3 cannot start execution for the same reason as above.

To simplify the picture it is also commonly shown like this:

Clock cycle number

Instr

1

2

3

4

5

6

7

8

9

Load

IF

ID

EX

MEM

WB

Instr 1

IF

ID

EX

MEM

WB

Instr 2

IF

ID

EX

MEM

WB

Instr 3

stall

IF

ID

EX

MEM

WB

Introducing stalls degrades performance as we saw before. Why, then, would the designer allow structural hazards? There are two reasons:

  1. To reduce cost. For example, machines that support both an instruction and a cache access every cycle (to prevent the structural hazard of the above example) require at least twice as much total memory.
    To reduce the latency of the unit. The shorter latency comes from the lack of pipeline registers that introduce overhead

Data Hazards

A major effect of pipelining is to change the relative timing of instructions by overlapping their execution. This introduces data and control hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on the unpipelined machine.

Consider the pipelined execution of these instructions:

1

2

3

4

5

6

7

8

9

ADD

R1, R2, R3

IF

ID

EX

MEM

WB

SUB

R4, R5, R1

IF

IDsub

EX

MEM

WB

AND

R6, R1, R7

IF

IDand

EX

MEM

WB

OR

R8, R1, R9

IF

IDor

EX

MEM

WB

XOR

R10,R1,R11

IF

IDxor

EX

MEM

WB

All the instructions after the ADD use the result of the ADD instruction (in R1). The ADD instruction writes the value of R1 in the WB stage (shown black), and the SUB instruction reads the value during ID stage (IDsub). This problem is called a data hazard. Unless precautions are taken to prevent it, the SUB instruction will read the wrong value and try to use it.

The AND instruction is also affected by this data hazard. The write of R1 does not complete until the end of cycle 5 (shown black). Thus, the AND instruction that reads the registers during cycle 4 (IDand) will receive the wrong result.

The OR instruction can be made to operate without incurring a hazard by a simple implementation technique. The technique is to perform register file reads in the second half of the cycle, and writes in the first half. Because both WB for ADD and IDor for OR are performed in one cycle 5, the write to register file by ADD will perform in the first half of the cycle, and the read of registers by OR will perform in the second half of the cycle.

The XOR instruction operates properly, because its register read occur in cycle 6 after the register write by ADD.

The next page discusses forwarding, a technique to eliminate the stalls for the hazard involving the SUB and AND instructions.

We will also classify the data hazards and consider the cases when stalls can not be eliminated. We will see what compiler can do to schedule the pipeline to avoid stalls

Forwarding

The problem with data hazards, introduced by this sequence of instructions can be solved with a simple hardware technique called forwarding.

1

2

3

4

5

6

7

ADD

R1, R2, R3

IF

ID

EX

MEM

WB

SUB

R4, R5, R1

IF

IDsub

EX

MEM

WB

AND

R6, R1, R7

IF

IDand

EX

MEM

WB


The key insight in forwarding is that the result is not really needed by SUB until after the ADD actually produces it. The only problem is to make it available for SUB when it needs it.

If the result can be moved from where the ADD produces it (EX/MEM register), to where the SUB needs it (ALU input latch), then the need for a stall can be avoided.
Using this observation , forwarding works as follows:

The ALU result from the EX/MEM register is always fed back to the ALU input latches.
If the forwarding hardware detects that the previous ALU operation has written the register corresponding to the source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.

Forwarding of results to the ALU requires the additional of three extra inputs on each ALU multiplexer and the addtion of three paths to the new inputs.


The paths correspond to a forwarding of:
(a) the ALU output at the end of EX,
(b) the ALU output at the end of MEM, and
(c) the memory output at the end of MEM.

Without forwarding our example will execute correctly with stalls:

1

2

3

4

5

6

7

8

9

ADD

R1, R2, R3

IF

ID

EX

MEM

WB

SUB

R4, R5, R1

IF

stall

stall

IDsub

EX

MEM

WB

AND

R6, R1, R7

stall

stall

IF

IDand

EX

MEM

WB

As our example shows, we need to forward results not only from the immediately previous instruction, but possibly from an instruction that started three cycles earlier. Forwarding can be arranged from MEM/WB latch to ALU input also. Using those forwarding paths the code sequence can be executed without stalls:

1

2

3

4

5

6

7

ADD

R1, R2, R3

IF

ID

EXadd

MEMadd

WB

SUB

R4, R5, R1

IF

ID

EXsub

MEM

WB

AND

R6, R1, R7

IF

ID

EXand

MEM

WB

The first forwarding is for value of R1 from EXadd to EXsub .
The second forwarding is also for value of R1 from MEMadd to EXand.
This code now can be executed without stalls.

Forwarding can be generalized to include passing the result directly to the functional unit that requires it: a result is forwarded from the output of one unit to the input of another, rather than just from the result of a unit to the input of the same unit.

One more Example
To prevent a stall in this example, we would need to forward the values of R1 and R4 from the pipeline registers to the ALU and data memory inputs.

1

2

3

4

5

6

7

ADD

R1, R2, R3

IF

ID

EXadd

MEMadd

WB

LW

R4, d (R1)

IF

ID

EXlw

MEMlw

WB

SW

R4,12(R1)

IF

ID

EXsw

MEMsw

WB

Stores require an operand during MEM, and forwarding of that operand is shown here.
The first forwarding is for value of R1 from EXadd to EXlw .
The second forwarding is also for value of R1 from MEMadd to EXsw.
The third forwarding is for value of R4 from MEMlw to MEMsw.
Observe that the SW instruction is storing the value of R4 into a memory location computed by adding the displacement 12 to the value contained in register R1. This effective address computation is done in the ALU during the EX stage of the SW instruction. The value to be stored (R4 in this case) is needed only in the MEM stage as an input to Data Memory. Thus the value of R1 is forwarded to the EX stage for effective address computation and is needed earlier in time than the value of R4 which is forwarded to the input of Data Memory in the MEM stage.
So forwarding takes place from "left-to-right" in time, but operands are not ALWAYS forwarded to the EX stage - it depends on the instruction and the point in the Datapath where the operand is needed. Of course, hardware support is necessary to support data forwarding.

Add a comment
Know the answer?
Add Answer to:
The basic pipeline for DLX has five stages: IF, ID, EX, MEM, and WB. Assuming all...
Your Answer:

Post as a guest

Your Name:

What's your source?

Earn Coins

Coins can be redeemed for fabulous gifts.

Not the answer you're looking for? Ask your own homework help question. Our experts will answer your question WITHIN MINUTES for Free.
Similar Homework Help Questions
  • I need help plz 6. 4-stage MIPS pipeline: It has "IF, ID-EX, MEM, WB" stages. Specify...

    I need help plz 6. 4-stage MIPS pipeline: It has "IF, ID-EX, MEM, WB" stages. Specify how many stalls (or bubbles) are required for the following cases. Assume that there is a forwarding logic and branch condition is determined at ID-EX stage. No delayed branch is assumed. (a) lw $1, 4($0) add $2, $1, $1 (b) lw $1, 4($0) beq $1, $2, add $3, $4, $5 X: sub $3, $3, $5 7. For problem 2 (c), how many bubbles are...

  • Consider a standard 5-stage MIPS pipeline of the type discussed during the class sessions: IF- ID-EX-M-WB....

    Consider a standard 5-stage MIPS pipeline of the type discussed during the class sessions: IF- ID-EX-M-WB. Assume that forwarding is not implemented and only the hazard detection and stall logic is implemented so that all data dependencies are handled by having the pipeline stall until the register fetch will result in the correct data being fetched. Furthermore, assume that the memory is written/updated in the first half of the clock cycle (i.e. on the rising edge of the clock) and...

  • c. The classic five-stage pipeline MIPS architecture is used to execute the code fragments. Assume the...

    c. The classic five-stage pipeline MIPS architecture is used to execute the code fragments. Assume the followings: Register write is done in the first half of the clock cycle; register read is performed in the second half of the clock cycle, Branches are resolved in the second stage of the pipeline and the architecture does not utilize any branch prediction mechanism Forwarding is fully supported Clock Cycle à 1 2 3 4 5 6 7 8 9 10 11 12...

  • A 5-Stage pipeline is composed of the following stages Instruction Fetch (IF), Decode (DE), Execute (EX), Memory Access...

    A 5-Stage pipeline is composed of the following stages Instruction Fetch (IF), Decode (DE), Execute (EX), Memory Access (ME) and Register Write-back (WB). Assume the pipeline does not have a branch prediction unit, does not have superscalar support and does not support out of order execution. Assume that all memory accesses are in the L1 cache and therefore do not introduce any stalls. Show a pipeline diagram that shows the execution of each stage for the assembly code below. Also...

  • We found that the instruction fetch and memory stages are the critical path of our 5-stage...

    We found that the instruction fetch and memory stages are the critical path of our 5-stage pipelined MIPS CPU. Therefore, we changed the IF and MEM stages to take two cycles while increasing the clock rate. You can assume that the register file is written at the falling edge of the clock. Assume that no pipelining optimizations have been made, and that branch comparisons are made by the ALU. Here’s how our pipeline looks when executing two add instructions: Clock...

  • Using a 5 stage pipeline architecture (IF-OF-EX-MEM-WB), use a drawing to indicate the stages used by...

    Using a 5 stage pipeline architecture (IF-OF-EX-MEM-WB), use a drawing to indicate the stages used by each one of the following instructions: add, sw, lw, j, and lui. (use Y for yes and blank for no)

  • The latencies of individual stages in five-stage MIPS (Microprocessor without Interlocked Pipeline Stages) Architecture are given...

    The latencies of individual stages in five-stage MIPS (Microprocessor without Interlocked Pipeline Stages) Architecture are given below. Instruction Instruction Fetch Register Read Arithmetic Logic Unit (ALU) Memory Access Register Write Latency 200ps 100ps 200ps 300ps 100ps a. (10 pts) What is the clock cycle time in a pipelined and non-pipelined processor? Pipelined version : ______________ Non-pipelined version : ______________ b. The classic five-stage pipeline MIPS architecture is used to execute the code fragments. Assume the followings: Register write is done...

  • The datapath for 5-stage MIPS pipelined architecture is given below. IFAID ID/EX EX/MEM MEM/WB > Add...

    The datapath for 5-stage MIPS pipelined architecture is given below. IFAID ID/EX EX/MEM MEM/WB > Add Add Add result Shift left 2 Lo-o PC Address Read register 1 Read Read data Instruction memory register Registers Read SALU Zero ALU result Read Address data data 2 Write register Write data Data memory Write data 16 Sign- extend Choose all the components that generate a useful result during the execution of the following instruction: ADD R1, R2, R3 O 1. Program Counter...

  • The datapath for 5-stage MIPS pipelined architecture is given below. IF/D ID/EX EX/MEM MEM/WB >Add Add...

    The datapath for 5-stage MIPS pipelined architecture is given below. IF/D ID/EX EX/MEM MEM/WB >Add Add Add result Shift left 2 Address Instruction memory Read Read register 1 data Read register "Registers Read Write data 2 register Write data SALU ALU result Address Read data Data memory Write data 16 Sign- 32 extend Choose all the components that generate a useful result during the execution of the following instruction: Choose all the components that generate a useful result during the...

  • 1. Assume that individual stages of the datapath have the following latencies: IF ID EX MEM...

    1. Assume that individual stages of the datapath have the following latencies: IF ID EX MEM WB 250 ps 350 ps 150 ps 300 ps 200 ps Also, assume that instructions executed by the processor are broken down as follows: alu beq lw sw 45% 20% 20% 15% What is the total latency of an ?w instruction in a pipelined and non-pipelined processor? Instead of a single-cycle organization, we can use a multi-cycle organization where each instruction takes multiple cycles...

ADVERTISEMENT
Free Homework Help App
Download From Google Play
Scan Your Homework
to Get Instant Free Answers
Need Online Homework Help?
Ask a Question
Get Answers For Free
Most questions answered within 3 hours.
ADVERTISEMENT
ADVERTISEMENT
ADVERTISEMENT