

Instruction Execution Steps: The Multi Cycle Circuit

### The MicroMIPS ISA

| Class            | Instruction             | Usage          | Meaning                                          | op | fn |
|------------------|-------------------------|----------------|--------------------------------------------------|----|----|
| Copy             | Load upper immediate    | lui rt,imm     | $rt \leftarrow (imm, 0x0000)$                    | 15 |    |
| Arithmetic       | Add                     | add rd,rs,rt   | $rd \leftarrow (rs) + (rt)$                      | 0  | 32 |
|                  | Subtract                | sub rd,rs,rt   | $rd \leftarrow (rs) - (rt)$                      | 0  | 34 |
|                  | Set less than           | slt rd,rs,rt   | $rd \leftarrow$ if $(rs) < (rt)$ then 1 else 0   | 0  | 42 |
|                  | Add immediate           | addi rt,rs,imm | $rt \leftarrow (rs) + imm$                       | 8  |    |
|                  | Set less than immediate | slti rt,rs,imm | $rt \leftarrow$ if $(rs) < imm$ then 1 else 0    | 10 |    |
| Logic            | AND                     | and rd,rs,rt   | $rd \leftarrow (rs) \land (rt)$                  | 0  | 36 |
|                  | OR                      | or rd,rs,rt    | $rd \leftarrow (rs) \lor (rt)$                   | 0  | 37 |
|                  | XOR                     | xor rd,rs,rt   | $rd \leftarrow (rs) \oplus (rt)$                 | 0  | 38 |
|                  | NOR                     | nor rd,rs,rt   | $rd \leftarrow ((rs) \lor (rt))'$                | 0  | 39 |
|                  | AND immediate           | andi rt,rs,imm | $rt \leftarrow (rs) \land imm$                   | 12 |    |
|                  | OR immediate            | ori rt,rs,imm  | $rt \leftarrow (rs) \lor imm$                    | 13 |    |
|                  | XOR immediate           | xori rt,rs,imm | $rt \leftarrow (rs) \oplus imm$                  | 14 |    |
| Memory access    | Load word               | lw rt,imm(rs)  | $rt \leftarrow mem[(rs) + imm]$                  | 35 |    |
|                  | Store word              | sw rt,imm(rs)  | $mem[(rs) + imm] \leftarrow (rt)$                | 43 |    |
| Control transfer | Jump                    | j L            | goto L                                           | 2  |    |
|                  | Jump register           | jr rs          | goto $(rs)$                                      | 0  | 8  |
|                  | Branch on less than 0   | bltz rs,L      | if $(rs) < 0$ then goto L                        | 1  |    |
|                  | Branch on equal         | beq rs,rt,L    | if $(rs) = (rt)$ then goto L                     | 4  |    |
|                  | Branch on not equal     | bne rs,rt,L    | if $(rs) \neq (rt)$ then goto L                  | 5  |    |
|                  | Jump and link           | jal L          | goto L; $\$31 \leftarrow (PC) + 4$               | 3  |    |
|                  | System call             | syscall        | Associated with an OS system routine             | 0  | 12 |



### Performance of the Single Cycle Architecture

- The control circuit designed above is stateless and purely combinational.
- Each new instruction is fetched from the address held in the PC and is executed in a single clock cycle.
  - Thus CPI = 1.
- The clock cycle time is determined by the longest (slowest) instruction.



### Obtaining better performance

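- Note that the average instruction time is lower than the worst case; it depends on the instruction types and their percentages in an application.

| Instruction class | Frequency | Time | Remarks                                        |
|-------------------|-----------|------|------------------------------------------------|
| R-type            | 44%       | 6 ns | No data cache access                           |
| Load              | 24%       | 8 ns |                                                |
| Store             | 12%       | 7 ns | No register write-back                         |
| Branch            | 18%       | 5 ns | Fetch + register read + next-address formation |
| Jump              | 2%        | 3 ns | Fetch + instruction decode                     |

- Weighted average = 6.36 ns.
- So, with a variable-cycle-time implementation, the performance would be about 157 MIPS.
- However, a variable cycle time is not feasible. Still, the comparison shows that a single-cycle implementation has poor performance.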
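The arithmetic behind these figures can be checked with a short script. This is only a sketch: the instruction mix and latencies are the ones quoted above, while the names `MIX`, `weighted_average_ns` and `mips` are chosen here for illustration.

```python
# Instruction mix (fraction of instructions) and latency in ns per class,
# as quoted in the notes above.
MIX = {
    "R-type": (0.44, 6.0),
    "Load":   (0.24, 8.0),
    "Store":  (0.12, 7.0),
    "Branch": (0.18, 5.0),
    "Jump":   (0.02, 3.0),
}


def weighted_average_ns(mix):
    """Average instruction time if every class could take exactly its own latency."""
    return sum(fraction * latency for fraction, latency in mix.values())


def mips(time_per_instruction_ns):
    """One instruction every t ns corresponds to 1000 / t million instructions per second."""
    return 1000.0 / time_per_instruction_ns


avg_ns = weighted_average_ns(MIX)
worst_ns = max(latency for _, latency in MIX.values())

print(f"variable cycle time: {avg_ns:.2f} ns per instruction, {mips(avg_ns):.0f} MIPS")
print(f"single cycle:        {worst_ns:.2f} ns per instruction, {mips(worst_ns):.0f} MIPS")
```

The second line of output shows the cost of fixing the clock at the slowest instruction: every instruction is charged the 8 ns of a load.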



### Shorter Clock Cycles in a Multi-cycle Implementation

- A MIPS instruction typically consists of a set of actions, namely: memory access, register read, ALU operation, and register write-back.
- Each action takes around 2 ns.
- In a single-cycle implementation, the worst-case (longest) instruction time determines the clock period.
- In a multi-cycle implementation, only a subset of these actions is performed in one clock cycle, so the clock cycle can be much shorter.
- Every instruction takes several clock cycles (thus $CPI > 1$).
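As a rough illustration of the trade-off, the sketch below estimates the average CPI and throughput of a multi-cycle design. It rests on assumptions that are not stated at this point in the notes: a 2 ns clock (one action per cycle), the instruction mix from the single-cycle discussion, and cycle counts per class (4, 5, 4, 3, 3) obtained by counting the states each instruction passes through in the micro-program given later in these notes.

```python
CLOCK_NS = 2.0  # assumed: one ~2 ns action per clock cycle

# (fraction of instructions, assumed cycles per instruction)
CLASSES = {
    "R-type": (0.44, 4),  # fetch, decode, ALU operation, register write
    "Load":   (0.24, 5),  # fetch, decode, address, cache read, register write
    "Store":  (0.12, 4),  # fetch, decode, address, cache write
    "Branch": (0.18, 3),  # fetch, decode, PC update
    "Jump":   (0.02, 3),  # fetch, decode, PC update
}

avg_cpi = sum(fraction * cycles for fraction, cycles in CLASSES.values())
avg_time_ns = avg_cpi * CLOCK_NS
throughput_mips = 1000.0 / avg_time_ns

print(f"average CPI   = {avg_cpi:.2f}")
print(f"average time  = {avg_time_ns:.2f} ns")
print(f"throughput    = {throughput_mips:.1f} MIPS")
```

Under these assumptions the average CPI is about 4.04, so the shorter clock roughly offsets the larger cycle count for this particular mix; the benefit grows when some instructions are much slower than the rest.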













- A single memory unit suffices, since reads from and writes to memory take place in different clock cycles.
- Requirement of an Instruction Register: this register holds the instruction so that the appropriate control signals can be generated throughout the multiple cycles until the instruction has finished executing.













- State encoding
  - sequential
  - gray
  - Johnson
  - one-hot

Encoding formats for eight states:

| No | Sequential | Gray | Johnson | One-hot  |
|----|------------|------|---------|----------|
| 0  | 000        | 000  | 0000    | 00000001 |
| 1  | 001        | 001  | 0001    | 00000010 |
| 2  | 010        | 011  | 0011    | 00000100 |
| 3  | 011        | 010  | 0111    | 00001000 |
| 4  | 100        | 110  | 1111    | 00010000 |
| 5  | 101        | 111  | 1110    | 00100000 |
| 6  | 110        | 101  | 1100    | 01000000 |
| 7  | 111        | 100  | 1000    | 10000000 |
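The four encodings in the table can be generated mechanically. The following sketch shows one way to do so; the function names are chosen here for illustration, and the Johnson code is modelled as a twisted-ring counter that shifts left and feeds the inverted most significant bit back into the least significant bit.

```python
def sequential(i, bits):
    """Plain binary count."""
    return format(i, f"0{bits}b")


def gray(i, bits):
    """Reflected Gray code: adjacent codes differ in exactly one bit."""
    return format(i ^ (i >> 1), f"0{bits}b")


def johnson(i, bits):
    """State i of a twisted-ring (Johnson) counter starting from all zeros."""
    state, mask = 0, (1 << bits) - 1
    for _ in range(i):
        msb = (state >> (bits - 1)) & 1
        state = ((state << 1) | (msb ^ 1)) & mask  # shift left, insert inverted MSB
    return format(state, f"0{bits}b")


def one_hot(i, n_states):
    """Exactly one bit set, one flip-flop per state."""
    return format(1 << i, f"0{n_states}b")


N_STATES = 8
for i in range(N_STATES):
    print(i, sequential(i, 3), gray(i, 3), johnson(i, 4), one_hot(i, N_STATES))
```

Note how the register width grows: 3 bits for sequential and Gray, 4 for Johnson, and 8 for one-hot when encoding eight states.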

### Comments on the coding styles

- Binary (sequential): good for arithmetic operations, but may cause more bit transitions per state change, leading to higher power consumption. It is also prone to errors during state transitions.
- Gray: only one bit changes between consecutive states, which reduces transitions and hence dynamic power. It can also be handy in detecting state transition errors.
- Johnson: again only one bit changes per transition, which can be useful in detecting errors during transitions. More bits are required, and the number of bits increases linearly with the number of states. There are unused states, so either an explicit asynchronous reset or recovery from illegal states is required (even more hardware!).
- **One-hot:** yet another low-power coding style, which requires one bit per state. Useful for describing bus protocols.





### Good FSMs

Keep the current-state (CS), next-state (NS) and output-logic (OL) blocks separate.

NextState (NS):

    always @(input or currentstate)
    begin
      NextState = ST0;
      case (currentstate)
        ST0: begin
          NextState = ST1;
        end
        ST1: begin
          ...
        ST3: NextState = ST0;
      endcase
    end
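The same separation can be illustrated with a behavioural model. The sketch below is written in Python rather than Verilog and is only an illustration: the four states mirror ST0 to ST3 from the fragment above, but the `go` input, the `busy` output and the ST1 and ST2 transitions are hypothetical, since the fragment elides them.

```python
from enum import Enum


class State(Enum):
    ST0 = 0
    ST1 = 1
    ST2 = 2
    ST3 = 3


def next_state(current, go):
    """NS block: a pure function of the current state and the input."""
    if current is State.ST0:
        return State.ST1 if go else State.ST0  # hypothetical start condition
    if current is State.ST1:
        return State.ST2                        # hypothetical (elided in the notes)
    if current is State.ST2:
        return State.ST3                        # hypothetical (elided in the notes)
    return State.ST0                            # ST3 returns to ST0, as in the notes


def output_logic(current):
    """OL block: Moore-style output, depends on the current state only."""
    return {"busy": current is not State.ST0}   # hypothetical output signal


class Fsm:
    """CS block: the only place where the state is stored and updated."""

    def __init__(self):
        self.state = State.ST0

    def clock(self, go):
        self.state = next_state(self.state, go)
        return output_logic(self.state)


fsm = Fsm()
trace = [fsm.clock(go=True)["busy"] for _ in range(4)]
print(trace, fsm.state)
```

Because `next_state` and `output_logic` hold no state of their own, each can be tested in isolation, which is the practical benefit of keeping the three blocks apart.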













### Example

- Consider a MIPS++ processor, which is similar to our processor, except that there are 3 types of R-type instructions:
  - $R_a$-type: half of all R-type instructions, 4 cycles
  - $R_b$-type: one quarter of all R-type instructions, 6 cycles
  - $R_c$-type: one quarter of all R-type instructions, 10 cycles
- With the same instruction mix as in the last example, and assuming that the slowest R-type instruction takes 16 ns to execute in a single-cycle implementation, derive the performance ratio for a multi-cycle implementation.
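One possible way to work this example is sketched below. Several inputs are assumptions rather than givens: a 2 ns multi-cycle clock, the 44/24/12/18/2 instruction mix from the earlier performance discussion, and 5, 4, 3 and 3 cycles for load, store, branch and jump. The single-cycle clock is taken to be 16 ns because the slowest R-type instruction is then the slowest instruction overall.

```python
CLOCK_NS = 2.0          # assumed multi-cycle clock period
SINGLE_CYCLE_NS = 16.0  # given: slowest R-type instruction in a single-cycle design

# (fraction of R-type instructions, cycles), as stated in the example
R_SUBTYPES = [(0.50, 4), (0.25, 6), (0.25, 10)]
r_cycles = sum(fraction * cycles for fraction, cycles in R_SUBTYPES)

# (fraction of all instructions, cycles); non-R-type cycle counts are assumed
MIX = {
    "R-type": (0.44, r_cycles),
    "Load":   (0.24, 5),
    "Store":  (0.12, 4),
    "Branch": (0.18, 3),
    "Jump":   (0.02, 3),
}

avg_cpi = sum(fraction * cycles for fraction, cycles in MIX.values())
multi_cycle_ns = avg_cpi * CLOCK_NS
ratio = SINGLE_CYCLE_NS / multi_cycle_ns

print(f"average R-type cycles = {r_cycles:.2f}")
print(f"average CPI           = {avg_cpi:.2f}")
print(f"multi-cycle time      = {multi_cycle_ns:.2f} ns per instruction")
print(f"performance ratio     = {ratio:.2f}")
```

With these numbers the multi-cycle design comes out roughly 1.6 times faster, because only the rare slow R-type instructions pay for their extra cycles.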



### Microprogramming

- The control state machine resembles a program that has instructions, states, branching, and loops.
- We call such a hardware program a microprogram.
- Its basic steps are called micro-instructions.
- Within each micro-instruction, several different actions are performed; each of these is called a micro-order.



### Advantages

- More regular.
- Less dependent on the instruction-set architecture.
  - The same hardware can be reused by simply changing the contents of the ROM.
- Errors and omissions can be taken care of by simply changing the micro-program, rather than redesigning the circuit.
- Microprogramming means designing a suitable sequence of micro-instructions to realize a particular ISA.



### Disadvantages

- Lower speed compared to a hardwired control circuit.
- Each machine-level instruction takes 3-5 ROM accesses to fetch its micro-instructions.
- After each micro-instruction has been read and placed in the micro-instruction register, sufficient time has to be allowed for the signals to stabilize and the actions to take place.



- The design of the microprogrammed controller begins with a micro-instruction format.
- Each of the 20 control signals has a one-to-one relationship with a control bit in the micro-instruction.
- The exception is the last field, the 2-bit sequence control, which does not drive the datapath.





- The 2-bit sequence control field controls micro-instruction sequencing in the same way that "PC control" affects the sequencing of machine-language instructions.
- Option 0 is to advance to the next micro-instruction in sequence by incrementing the µPC.
- Options 1 and 2 allow branching, depending on the opcode of the instruction.
- Option 3 is to go to micro-instruction 0, corresponding to state 0; this initiates the fetch phase of the next machine instruction.
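The four options amount to a small next-address function for the µPC. The sketch below models it and walks a load word (opcode 35) through the sequencer. The function name, the ROM addresses 28 to 31 and the contents of the two dispatch dictionaries are hypothetical values chosen for illustration.

```python
def next_upc(upc, seq_ctrl, opcode, dispatch1, dispatch2):
    """Select the next micro-program counter value from the 2-bit sequence control."""
    if seq_ctrl == 0b00:
        return upc + 1             # option 0: next micro-instruction in sequence
    if seq_ctrl == 0b01:
        return dispatch1[opcode]   # option 1: branch on opcode (first dispatch)
    if seq_ctrl == 0b10:
        return dispatch2[opcode]   # option 2: branch on opcode (second dispatch)
    if seq_ctrl == 0b11:
        return 0                   # option 3: back to micro-instruction 0 (fetch)
    raise ValueError(f"sequence control must fit in 2 bits, got {seq_ctrl}")


# Hypothetical micro-instruction addresses for the load/store path.
DISPATCH_1 = {35: 28, 43: 28}      # lw and sw share the address-computation step
DISPATCH_2 = {35: 29, 43: 31}      # then split into cache load and cache store

# Sequence-control bits stored at each address visited by a load word.
SEQ_BITS = {0: 0b00, 1: 0b01, 28: 0b10, 29: 0b00, 30: 0b11}

LW_OPCODE = 35
upc, trace = 0, [0]
while True:
    upc = next_upc(upc, SEQ_BITS[upc], LW_OPCODE, DISPATCH_1, DISPATCH_2)
    if upc == 0:                   # back at fetch: the instruction is complete
        break
    trace.append(upc)

print(trace)  # addresses visited, one per clock cycle
```

In this layout the load word visits five micro-instructions, one per clock cycle, before the µPC returns to 0.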



### Dispatch tables

- Each of the two dispatch tables translates the opcode into a micro-instruction address.
- Dispatch table 1 corresponds to the multi-way branch in going from cycle 2 to cycle 3.
- Dispatch table 2 implements the branch between cycles 3 and 4.
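A sketch of how the two tables might be filled in is given below. It assumes that the micro-instructions are stored contiguously from address 0 in the order of the micro-program listing in these notes, with routine labels written as `add1`, `sub1` and so on. Because all R-format instructions share opcode 0, this sketch keys dispatch table 1 on an (op, fn) pair using the standard MIPS function codes; that keying is an assumption, not something the notes specify.

```python
# (label, number of micro-instructions), in listing order.
LISTING = [
    ("fetch", 2), ("lui1", 2), ("add1", 2), ("sub1", 2), ("slt1", 2),
    ("addi1", 2), ("slti1", 2), ("and1", 2), ("or1", 2), ("xor1", 2),
    ("nor1", 2), ("andi1", 2), ("ori1", 2), ("xori1", 2), ("lwsw1", 1),
    ("lw2", 2), ("sw2", 1), ("j1", 1), ("jr1", 1), ("branch1", 1),
    ("jal1", 1), ("syscall", 1),
]


def assign_addresses(listing):
    """Give each label the address of its first micro-instruction."""
    addresses, next_free = {}, 0
    for label, length in listing:
        addresses[label] = next_free
        next_free += length
    return addresses, next_free


ADDR, ROM_SIZE = assign_addresses(LISTING)

# Dispatch 1 targets, keyed by (op, fn); fn is None for non-R-format opcodes.
TARGETS_1 = {
    (15, None): "lui1", (8, None): "addi1", (10, None): "slti1",
    (12, None): "andi1", (13, None): "ori1", (14, None): "xori1",
    (0, 32): "add1", (0, 34): "sub1", (0, 42): "slt1",
    (0, 36): "and1", (0, 37): "or1", (0, 38): "xor1", (0, 39): "nor1",
    (35, None): "lwsw1", (43, None): "lwsw1",
    (2, None): "j1", (3, None): "jal1", (0, 8): "jr1",
    (1, None): "branch1", (4, None): "branch1", (5, None): "branch1",
    (0, 12): "syscall",
}
DISPATCH_1 = {key: ADDR[label] for key, label in TARGETS_1.items()}

# Dispatch 2 only separates load word from store word.
DISPATCH_2 = {35: ADDR["lw2"], 43: ADDR["sw2"]}

print("ROM size:", ROM_SIZE, "micro-instructions")
print("lw:", DISPATCH_1[(35, None)], "->", DISPATCH_2[35])
print("sw:", DISPATCH_1[(43, None)], "->", DISPATCH_2[43])
```

Only load word and store word need the second table, since every other instruction finishes on a straight-line path after the first dispatch.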

### Microinstruction field values and their symbolic names (default value is 0)

| Field            | Code  | Symbolic name |
|------------------|-------|---------------|
| PC control       | 0001  | PCjump        |
|                  | 1001  | PCsyscall     |
|                  | x011  | PCjreg        |
|                  | x101  | PCbranch      |
|                  | x111  | PCnext        |
| Cache control    | 0101  | CacheFetch    |
|                  | 1010  | CacheStore    |
|                  | 1100  | CacheLoad     |
| Register control | 1000  | rt←Data       |
|                  | 1001  | rt←z          |
|                  | 1011  | rd←z          |
|                  | 1101  | \$31←PC       |
| ALU inputs       | 000   | PC ∘ 4        |
|                  | 011   | PC ∘ 4imm     |
|                  | 101   | x ∘ y         |
|                  | 110   | x ∘ imm       |
| ALU function     | 0xx10 | +             |
|                  | 1xx01 | <             |
|                  | 1xx10 | −             |
|                  | x0011 | ∧             |
|                  | x0111 | ∨             |
|                  | x1011 | XOR           |
|                  | x1111 | NOR           |
|                  | xxx00 | lui           |
| Sequence control | 01    | µPCdisp1      |
|                  | 10    | µPCdisp2      |
|                  | 11    | µPCfetch      |

An x in a code denotes a don't-care bit; ∘ stands for the operation selected by the ALU function field.



### Complete Micro-program

| Label    | Micro-orders              | State           |
|----------|---------------------------|-----------------|
| fetch:   | PCnext, CacheFetch, PC+4  | State 0 (start) |
|          | PC+4imm, µPCdisp1         | State 1         |
| lui1:    | lui(imm)                  | State 7lui      |
|          | rt←z, µPCfetch            | State 8lui      |
| add1:    | x+y                       | State 7add      |
|          | rd←z, µPCfetch            | State 8add      |
| sub1:    | x−y                       | State 7sub      |
|          | rd←z, µPCfetch            | State 8sub      |
| slt1:    | x−y                       | State 7slt      |
|          | rd←z, µPCfetch            | State 8slt      |
| addi1:   | x+imm                     | State 7addi     |
|          | rt←z, µPCfetch            | State 8addi     |
| slti1:   | x−imm                     | State 7slti     |
|          | rt←z, µPCfetch            | State 8slti     |
| and1:    | x∧y                       | State 7and      |
|          | rd←z, µPCfetch            | State 8and      |
| or1:     | x∨y                       | State 7or       |
|          | rd←z, µPCfetch            | State 8or       |
| xor1:    | x⊕y                       | State 7xor      |
|          | rd←z, µPCfetch            | State 8xor      |
| nor1:    | x~∨y                      | State 7nor      |
|          | rd←z, µPCfetch            | State 8nor      |
| andi1:   | x∧imm                     | State 7andi     |
|          | rt←z, µPCfetch            | State 8andi     |
| ori1:    | x∨imm                     | State 7ori      |
|          | rt←z, µPCfetch            | State 8ori      |
| xori1:   | x⊕imm                     | State 7xori     |
|          | rt←z, µPCfetch            | State 8xori     |
| lwsw1:   | x+imm, µPCdisp2           | State 2         |
| lw2:     | CacheLoad                 | State 3         |
|          | rt←Data, µPCfetch         | State 4         |
| sw2:     | CacheStore, µPCfetch      | State 6         |
| j1:      | PCjump, µPCfetch          | State 5j        |
| jr1:     | PCjreg, µPCfetch          | State 5jr       |
| branch1: | PCbranch, µPCfetch        | State 5branch   |
| jal1:    | PCjump, \$31←PC, µPCfetch | State 5jal      |
| syscall: | PCsyscall, µPCfetch       | State 5syscall  |



### Assignment (not for submission)

Simplify the micro-instruction format, and design the micro-programs for the ISA, if the 5 ALU bits are directly generated in a separate decoder and fed to the ALU.

### Horizontal vs Vertical Microinstruction

- The micro-instruction format discussed so far, with a separate bit for each of the 20 control signals of the datapath, is called a horizontal microinstruction.
- However, suitable encoding can reduce the size of the micro-instructions.
  - E.g., the cache control field has four values, which can be encoded in 2 bits.
- Such an encoded instruction format is called a vertical microinstruction.
- However, vertical micro-instructions are slower, as they need additional decoders.
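The cache control example can be made concrete with a small sketch. The 4-bit horizontal codes are the ones listed in the field-value table, with 0 as the default "no operation" value; the helper names, the label `none` and the particular 2-bit assignment are hypothetical.

```python
# Horizontal (unencoded) cache control codes, 4 control bits wide.
CACHE_CONTROL = {
    "none":       0b0000,  # default value: no cache operation
    "CacheFetch": 0b0101,
    "CacheStore": 0b1010,
    "CacheLoad":  0b1100,
}


def bits_needed(n_values):
    """Smallest field width that can distinguish n_values different codes."""
    return max(1, (n_values - 1).bit_length())


def make_codec(field):
    """Return (encode, decode): symbolic name -> short code -> control bits."""
    encode = {name: index for index, name in enumerate(field)}
    decode = {index: field[name] for index, name in enumerate(field)}
    return encode, decode


ENCODE, DECODE = make_codec(CACHE_CONTROL)
width = bits_needed(len(CACHE_CONTROL))

for name, code in ENCODE.items():
    # The vertical format stores `code`; a decoder must expand it to the control bits.
    print(f"{name:<10} vertical {code:0{width}b} -> horizontal {DECODE[code]:04b}")
```

The `DECODE` lookup plays the role of the extra decoder: it saves two bits per micro-instruction in the ROM but adds a decoding step before the control signals reach the datapath.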