1 00:00:00,729 --> 00:00:06,310 6.004 students work around the dryer bottleneck by finding a laundromat that has two dryers 2 00:00:06,310 --> 00:00:07,950 for every washer. 3 00:00:07,950 --> 00:00:13,339 Looking at the timeline you can see the plan, which is divided into 30-minute steps. 4 00:00:13,339 --> 00:00:18,800 The washer is in use every step, producing a newly-washed load every 30 minutes. 5 00:00:18,800 --> 00:00:24,140 Dryer usage is interleaved, where Dryer #1 is used to dry the odd-numbered loads and 6 00:00:24,140 --> 00:00:28,410 Dryer #2 is used to dry the even-numbered loads. 7 00:00:28,410 --> 00:00:34,050 Once started, a dryer runs for a duration of two steps, a total of 60 minutes. 8 00:00:34,050 --> 00:00:38,920 Since the dryers run on a staggered schedule, the system as a whole produces a load of clean, 9 00:00:38,920 --> 00:00:41,980 dry laundry every 30 minutes. 10 00:00:41,980 --> 00:00:47,039 The steady-state throughput is 1 load of laundry every 30 minutes and the latency for a particular 11 00:00:47,039 --> 00:00:50,280 load of laundry is 90 minutes. 12 00:00:50,280 --> 00:00:53,590 And now here's the take-home message from this example. 13 00:00:53,590 --> 00:00:56,970 Consider the operation of the two-dryer system. 14 00:00:56,970 --> 00:01:01,929 Even though the component dryers themselves aren't pipelined, the two-dryer interleaving 15 00:01:01,929 --> 00:01:07,190 system is acting like a 2-stage pipeline with a clock period of 30 minutes and a latency 16 00:01:07,190 --> 00:01:08,750 of 60 minutes. 17 00:01:08,750 --> 00:01:14,250 In other words, by interleaving the operation of 2 unpipelined components we can achieve 18 00:01:14,250 --> 00:01:18,120 the effect of a 2-stage pipeline. 
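The laundromat plan just described can be checked with a short simulation. This is only a sketch under the assumptions stated in the narration (30-minute steps, a one-step wash, a two-step dry, odd loads to Dryer #1 and even loads to Dryer #2); the names `schedule`, `STEP_MIN`, `WASH_STEPS` and `DRY_STEPS` are made up for illustration.

```python
# Minimal sketch of the 1-washer, 2-dryer plan; all names are illustrative.
STEP_MIN = 30    # length of one step of the plan, in minutes
WASH_STEPS = 1   # a wash takes one step
DRY_STEPS = 2    # a dry takes two steps (60 minutes)


def schedule(num_loads):
    """Return (load, dryer, start_min, finish_min) for each load."""
    dryer_free_at = {1: 0, 2: 0}  # step at which each dryer is next free
    plan = []
    for load in range(1, num_loads + 1):
        wash_start = (load - 1) * WASH_STEPS   # the washer starts a new load every step
        wash_end = wash_start + WASH_STEPS
        dryer = 1 if load % 2 == 1 else 2      # odd loads -> Dryer #1, even loads -> Dryer #2
        dry_start = max(wash_end, dryer_free_at[dryer])
        dry_end = dry_start + DRY_STEPS
        dryer_free_at[dryer] = dry_end
        plan.append((load, dryer, wash_start * STEP_MIN, dry_end * STEP_MIN))
    return plan


if __name__ == "__main__":
    for load, dryer, start, finish in schedule(4):
        print(f"load {load}: dryer #{dryer}, start {start} min, "
              f"done {finish} min, latency {finish - start} min")
```

In this model a finished load comes out every 30 minutes and every load takes 90 minutes from start to finish, matching the throughput and latency quoted above.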
19 00:01:18,120 --> 00:01:23,010 Returning to the example of the previous section, we couldn't improve the throughput of our 20 00:01:23,010 --> 00:01:29,789 pipelined system past 1 result every 8 ns because the minimum clock period was set by the 8 ns latency of 21 00:01:29,789 --> 00:01:31,580 the C module. 22 00:01:31,580 --> 00:01:36,180 To improve the throughput further we either need to find a pipelined version of the C 23 00:01:36,180 --> 00:01:41,340 component or use an interleaving strategy to achieve the effect of a 2-stage pipeline 24 00:01:41,340 --> 00:01:44,610 using two instances of the unpipelined C component. 25 00:01:44,610 --> 00:01:47,330 Let's try that… 26 00:01:47,330 --> 00:01:52,270 Here's a circuit for a general-purpose two-way interleaver, using, in this case, two copies 27 00:01:52,270 --> 00:01:57,160 of the unpipelined C component, C_0 and C_1. 28 00:01:57,160 --> 00:02:01,580 The input for each C component comes from a D-latch, which has the job of capturing 29 00:02:01,580 --> 00:02:03,990 and holding the input value. 30 00:02:03,990 --> 00:02:08,239 There's also a multiplexer to select which C-component output will be captured by the 31 00:02:08,239 --> 00:02:10,380 output register. 32 00:02:10,380 --> 00:02:15,390 In the lower left-hand corner of the circuit is a very simple 2-state FSM with one state 33 00:02:15,390 --> 00:02:16,690 bit. 34 00:02:16,690 --> 00:02:21,360 The next-state logic is a single inverter, which causes the state to alternate between 35 00:02:21,360 --> 00:02:25,239 0 and 1 on successive clock cycles. 36 00:02:25,239 --> 00:02:31,299 This timing diagram shows how the state bit changes right after each rising clock edge. 37 00:02:31,299 --> 00:02:35,379 To help us understand the circuit, we'll look at some signal waveforms to illustrate its 38 00:02:35,379 --> 00:02:36,400 operation. 
39 00:02:36,400 --> 00:02:42,930 To start, here are the waveforms for the CLK signal and our FSM state bit from the previous 40 00:02:42,930 --> 00:02:43,930 slide. 41 00:02:43,930 --> 00:02:49,920 A new X input arrives from the previous stage just after the rising edge of the clock. 42 00:02:49,920 --> 00:02:54,340 Next, let's follow the operation of the C_0 component. 43 00:02:54,340 --> 00:03:00,500 Its input latch is open when FSM Q is low, so the newly arriving X_1 input passes through 44 00:03:00,500 --> 00:03:06,670 the latch and C_0 can begin its computation, producing its result at the end of clock cycle 45 00:03:06,670 --> 00:03:08,420 #2. 46 00:03:08,420 --> 00:03:13,209 Note that the C_0 input latch closes at the beginning of the second clock cycle, holding 47 00:03:13,209 --> 00:03:19,400 the X_1 input value stable even though the X input is starting to change. 48 00:03:19,400 --> 00:03:25,030 The effect is that C_0 has a valid and stable input for the better part of 2 clock cycles 49 00:03:25,030 --> 00:03:29,670 giving it enough time to compute its result. 50 00:03:29,670 --> 00:03:33,819 The C_1 waveforms are similar, just shifted by one clock cycle. 51 00:03:33,819 --> 00:03:40,200 C_1's input latch is open when FSM Q is high, so the newly arriving X_2 input passes through 52 00:03:40,200 --> 00:03:46,239 the latch and C_1 can begin its computation, producing its result at the end of clock cycle 53 00:03:46,239 --> 00:03:48,400 #3. 54 00:03:48,400 --> 00:03:50,480 Now let's check the output of the multiplexer. 55 00:03:50,480 --> 00:03:57,659 When FSM Q is high, it selects the value from C_0 and when FSM Q is low, it selects the 56 00:03:57,659 --> 00:03:59,560 value from C_1. 57 00:03:59,560 --> 00:04:02,590 We can see that happening in the waveform shown. 
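The latch and multiplexer behavior walked through above can be mimicked at the granularity of whole clock cycles. The sketch below ignores all propagation delays and only tracks which value each latch holds and which C component the multiplexer selects at the end of each cycle. The function `c_function` is a hypothetical stand-in (the narration does not say what C computes), and `mux_outputs` is an illustrative name.

```python
def c_function(x):
    # Hypothetical stand-in for the unpipelined C component.
    return x * x


def mux_outputs(inputs):
    """Return the multiplexer output at the end of each clock cycle.

    Cycle 1 has FSM Q low, cycle 2 has Q high, and so on.
    An entry is None while the selected C component has no input yet.
    """
    held = [None, None]  # values held by the input latches of C_0 and C_1
    results = []
    for cycle, x in enumerate(inputs, start=1):
        q = (cycle - 1) % 2   # FSM state bit: 0 in cycle 1, then alternating
        held[q] = x           # C_0's latch is open when Q is low, C_1's when Q is high;
                              # the other latch is closed and keeps its old value
        selected = 1 - q      # mux: Q high selects C_0, Q low selects C_1
        value = held[selected]
        results.append(None if value is None else c_function(value))
    return results


if __name__ == "__main__":
    print(mux_outputs([3, 4, 5, 6]))  # [None, 9, 16, 25]
```

With these assumptions the value selected at the end of cycle 2 is the result for X_1 and the value selected at the end of cycle 3 is the result for X_2, as in the waveforms.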
58 00:04:02,590 --> 00:04:07,430 Finally, at the rising edge of the clock, the output register captures the value on 59 00:04:07,430 --> 00:04:12,319 its input and holds it stable for the remainder of the clock cycle. 60 00:04:12,319 --> 00:04:17,820 The behavior of the interleaving circuit is like a 2-stage pipeline: the input value arriving 61 00:04:17,820 --> 00:04:22,810 in cycle i is processed over two clock cycles and the result output becomes available on 62 00:04:22,810 --> 00:04:25,720 cycle i+2. 63 00:04:25,720 --> 00:04:28,980 What about the clock period for the interleaving system? 64 00:04:28,980 --> 00:04:34,240 Well, there is some time lost to the propagation delays of the upstream pipeline register that 65 00:04:34,240 --> 00:04:39,600 supplies the X input, the internal latches and multiplexer, and the setup time of the 66 00:04:39,600 --> 00:04:40,670 output register. 67 00:04:40,670 --> 00:04:46,090 So the clock cycle has to be just a little bit longer than half the propagation delay 68 00:04:46,090 --> 00:04:47,810 of the C module. 69 00:04:47,810 --> 00:04:51,680 We can treat the interleaving circuit as a 2-stage pipeline, consuming an input value 70 00:04:51,680 --> 00:04:56,070 every clock cycle and producing a result two cycles later. 71 00:04:56,070 --> 00:05:01,070 When incorporating an N-way interleaved component in our pipeline diagrams, we treat it just 72 00:05:01,070 --> 00:05:03,550 like an N-stage pipeline. 73 00:05:03,550 --> 00:05:07,970 So N of our pipelining contours have to pass through the component. 74 00:05:07,970 --> 00:05:13,790 Here we've replaced the slow unpipelined C component with a 2-way interleaved C-prime 75 00:05:13,790 --> 00:05:14,850 component. 76 00:05:14,850 --> 00:05:18,810 We can follow our process for drawing pipeline contours. 77 00:05:18,810 --> 00:05:22,180 First we draw a contour across all the outputs. 
78 00:05:22,180 --> 00:05:27,730 Then we add contours, ensuring that two of them pass through the C-prime component. 79 00:05:27,730 --> 00:05:33,470 Then we add pipeline registers at the intersections of the contours with the signal connections. 80 00:05:33,470 --> 00:05:38,280 We see that the contours passing through C-prime have caused extra pipeline registers to be 81 00:05:38,280 --> 00:05:45,430 added on the other inputs to the F module, accommodating the 2-cycle delay through C-prime. 82 00:05:45,430 --> 00:05:50,940 Somewhat optimistically we've specified the C-prime minimum t_CLK to be 4 ns, so that 83 00:05:50,940 --> 00:05:55,830 means that the slow component which determines the system's clock period is now the F module, 84 00:05:55,830 --> 00:05:59,810 with a propagation delay of 5 ns. 85 00:05:59,810 --> 00:06:06,090 So the throughput of our new pipelined circuit is 1 output every 5 ns, and with 5 contours, 86 00:06:06,090 --> 00:06:11,930 it's a 5-stage pipeline so the latency is 5 times the clock period or 25 ns. 87 00:06:11,930 --> 00:06:17,150 By running pipelined systems in parallel we can continue to increase the throughput. 88 00:06:17,150 --> 00:06:22,220 Here we show a laundry with 2 washers and 4 dryers, essentially just two copies of the 89 00:06:22,220 --> 00:06:25,210 1-washer, 2-dryer system shown earlier. 90 00:06:25,210 --> 00:06:30,010 The operation is as described before, except that at each step the system produces and 91 00:06:30,010 --> 00:06:33,030 consumes two loads of laundry. 92 00:06:33,030 --> 00:06:39,240 So the throughput is 2 loads every 30 minutes for an effective rate of 1 load every 15 minutes. 93 00:06:39,240 --> 00:06:44,000 The latency for a load hasn't changed; it's still 90 minutes per load. 94 00:06:44,000 --> 00:06:47,909 We've seen that even with slow components we can use interleaving and parallelism to 95 00:06:47,909 --> 00:06:50,340 continue to increase throughput. 
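The numbers quoted above follow from two simple rules: the clock period is set by the slowest stage, and the latency is the number of stages times the clock period. The sketch below just redoes that arithmetic. Only the 4 ns and 5 ns figures come from the narration; the helper names `pipeline_stats` and `effective_interval` are illustrative, and the other modules are assumed to be no slower than F.

```python
def pipeline_stats(t_clk_ns, num_stages):
    """Return (throughput in results per ns, latency in ns) for a pipeline."""
    throughput = 1.0 / t_clk_ns        # one result per clock period
    latency = num_stages * t_clk_ns    # each input crosses every stage
    return throughput, latency


def effective_interval(step_min, copies):
    """Minutes per finished load when `copies` identical systems run in parallel."""
    return step_min / copies


if __name__ == "__main__":
    # Slowest stage: C-prime needs t_CLK >= 4 ns, F has a 5 ns propagation delay.
    t_clk = max(4, 5)
    throughput, latency = pipeline_stats(t_clk, num_stages=5)
    print(f"t_CLK = {t_clk} ns, throughput = 1 result per {t_clk} ns, latency = {latency} ns")

    # Two copies of the 1-washer, 2-dryer laundry, each finishing a load every 30 minutes.
    print(f"effective rate: 1 load every {effective_interval(30, 2)} minutes")
```

Note that running copies in parallel changes only the effective interval between results; the latency of any single load or input is unchanged.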
96 00:06:50,340 --> 00:06:53,370 Is there an upper bound on the throughput we can achieve? 97 00:06:53,370 --> 00:06:54,520 Yes! 98 00:06:54,520 --> 00:06:58,440 The timing overhead of the pipeline registers and interleaving components will set a lower 99 00:06:58,440 --> 00:07:04,160 bound on the achievable clock period, thus setting an upper bound on the achievable throughput. 100 00:07:04,160 --> 00:07:07,690 Sorry, no infinite speed-up is possible in the real world.
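One rough way to write down this bound, as a simplified model rather than anything stated in the lecture: assume an N-way interleaved component whose unpipelined propagation delay is t_PD, and lump the register, latch, multiplexer and setup overheads into a single per-cycle term t_OV (both symbols are assumptions introduced here).

```latex
t_{\mathrm{CLK}} \;\ge\; \frac{t_{\mathrm{PD}}}{N} + t_{\mathrm{OV}} \;>\; t_{\mathrm{OV}}
\qquad\Longrightarrow\qquad
\text{throughput} \;=\; \frac{1}{t_{\mathrm{CLK}}} \;<\; \frac{1}{t_{\mathrm{OV}}}
```

Under this model, increasing N shrinks the first term toward zero but never touches the overhead term, so the throughput approaches, and never exceeds, one result per t_OV.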