1 00:00:01,550 --> 00:00:03,920 The following content is provided under a Creative 2 00:00:03,920 --> 00:00:05,310 Commons license. 3 00:00:05,310 --> 00:00:07,520 Your support will help MIT OpenCourseWare 4 00:00:07,520 --> 00:00:11,610 continue to offer high-quality educational resources for free. 5 00:00:11,610 --> 00:00:14,180 To make a donation or to view additional materials 6 00:00:14,180 --> 00:00:18,140 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:18,140 --> 00:00:19,026 at ocw.mit.edu. 8 00:00:22,500 --> 00:00:25,020 GILBERT STRANG: OK, so this is an important day, 9 00:00:25,020 --> 00:00:27,730 and Friday was an important day. 10 00:00:27,730 --> 00:00:32,800 I hope you enjoyed Professor Sra's terrific lecture as much 11 00:00:32,800 --> 00:00:33,610 as I did. 12 00:00:33,610 --> 00:00:39,430 You probably saw me taking notes like mad for the section that's 13 00:00:39,430 --> 00:00:44,020 now to be written about stochastic gradient descent. 14 00:00:44,020 --> 00:00:49,020 And he promised a theorem, if you remember, 15 00:00:49,020 --> 00:00:50,710 and there wasn't time. 16 00:00:50,710 --> 00:00:52,720 And so he was going to send it to me 17 00:00:52,720 --> 00:00:54,760 or still is going to send it to me. 18 00:00:54,760 --> 00:00:57,520 I'll report, I haven't got it yet, 19 00:00:57,520 --> 00:01:04,599 but I'll bring it to class, waiting to see, hopefully. 20 00:01:04,599 --> 00:01:09,280 And that will give us a chance to review stochastic gradient 21 00:01:09,280 --> 00:01:13,930 descent, the central algorithm of deep learning. 22 00:01:13,930 --> 00:01:20,980 And then this today is about the central structure 23 00:01:20,980 --> 00:01:22,780 of deep neural nets. 
24 00:01:22,780 --> 00:01:28,780 And some of you will know already how they're connected, 25 00:01:28,780 --> 00:01:36,800 what the function F, the learning function-- 26 00:01:36,800 --> 00:01:40,360 you could call it the learning function-- 27 00:01:40,360 --> 00:01:41,980 that's constructed. 28 00:01:41,980 --> 00:01:46,990 The whole system is aiming at constructing this function 29 00:01:46,990 --> 00:01:52,180 F which learns the training data and then 30 00:01:52,180 --> 00:01:55,480 applying it to the test data. 31 00:01:55,480 --> 00:02:01,570 And the miracle is that it does so well in practice. 32 00:02:01,570 --> 00:02:07,150 That's what has transformed deep learning into such 33 00:02:07,150 --> 00:02:11,290 an important application. 34 00:02:17,230 --> 00:02:22,270 Chapter 7 has been up for months on the 35 00:02:22,270 --> 00:02:27,340 math.mit.edu/learningfromdata site, 36 00:02:27,340 --> 00:02:28,525 and I'll add it to Stellar. 37 00:02:28,525 --> 00:02:33,270 Of course, that's where you'll be looking for it. 38 00:02:33,270 --> 00:02:37,630 OK, and then the second, the back propagation, 39 00:02:37,630 --> 00:02:42,940 the way to compute the gradient, I'll 40 00:02:42,940 --> 00:02:45,940 probably reach that idea today. 41 00:02:45,940 --> 00:02:51,070 And you'll see it's the chain rule, but how is it organized. 42 00:02:51,070 --> 00:02:53,290 OK, so what's the structure? 43 00:02:53,290 --> 00:02:57,360 What's the plan for deep neural nets? 44 00:02:57,360 --> 00:02:58,830 Good. 45 00:02:58,830 --> 00:03:02,820 Starting here, so what we have is training data. 46 00:03:06,200 --> 00:03:13,640 So we have vectors, x1 to x-- 47 00:03:13,640 --> 00:03:17,840 what should I use for the number of samples 48 00:03:17,840 --> 00:03:21,590 that we have in the training data? 49 00:03:21,590 --> 00:03:24,480 Well, let's say D for Data. 50 00:03:24,480 --> 00:03:26,120 OK. 
51 00:03:26,120 --> 00:03:30,980 And each vector, those are called feature vectors, 52 00:03:30,980 --> 00:03:37,055 so equals feature vectors. 53 00:03:40,200 --> 00:03:48,010 So each one, each x, has like m features. 54 00:03:48,010 --> 00:03:55,230 So maybe my notation isn't so hot here. 55 00:03:55,230 --> 00:03:58,640 I have a whole lot of vectors. 56 00:03:58,640 --> 00:04:03,510 Let me not use the subscript for those right away. 57 00:04:03,510 --> 00:04:08,010 So vectors, feature vectors, and each vector 58 00:04:08,010 --> 00:04:13,170 has got maybe shall we say m features? 59 00:04:13,170 --> 00:04:17,250 Like, if we were measuring height and age and weight 60 00:04:17,250 --> 00:04:19,594 and so on, those would be features. 61 00:04:23,100 --> 00:04:28,140 The job of the neural network is to create-- 62 00:04:28,140 --> 00:04:29,730 and we're going to classify. 63 00:04:29,730 --> 00:04:35,220 Maybe we're going to classify men and women or boys 64 00:04:35,220 --> 00:04:36,080 and girls. 65 00:04:36,080 --> 00:04:41,330 So let's make it a classification problem, 66 00:04:41,330 --> 00:04:43,720 just binary. 67 00:04:43,720 --> 00:04:47,610 So the classification problem is-- 68 00:04:47,610 --> 00:04:48,600 what shall we say? 69 00:04:48,600 --> 00:04:59,040 Minus 1 or 1, or 0 or 1, or boy or girl, 70 00:04:59,040 --> 00:05:05,760 or cat or dog, or truck or car, or anyway, just two classes. 71 00:05:08,290 --> 00:05:12,360 So I'm just going to do two-class classification. 72 00:05:15,240 --> 00:05:18,480 We know which class the training data is in. 73 00:05:18,480 --> 00:05:21,970 For each vector x, we know the right answer. 74 00:05:21,970 --> 00:05:25,660 So we want to create a function that gives the right answer, 75 00:05:25,660 --> 00:05:30,460 and then we'll use that function on other data. 76 00:05:30,460 --> 00:05:33,420 So let me write that down. 
77 00:05:33,420 --> 00:05:42,600 Create a function F of x so that it gets-- 78 00:05:42,600 --> 00:05:50,140 mostly gets the class correct. 79 00:05:50,140 --> 00:05:58,010 In other words, F of x should be negative for when 80 00:05:58,010 --> 00:06:04,960 the classification is minus 1, and F of x 81 00:06:04,960 --> 00:06:11,470 should be positive when the classification is plus 1. 82 00:06:11,470 --> 00:06:14,080 And as we know, we don't necessarily 83 00:06:14,080 --> 00:06:17,860 have to get every x, every sample, right. 84 00:06:17,860 --> 00:06:19,990 That may be over-fitting. 85 00:06:19,990 --> 00:06:24,850 If there's some sample that's just truly weird, by getting 86 00:06:24,850 --> 00:06:26,380 that right we're going to be looking 87 00:06:26,380 --> 00:06:32,100 for truly weird data in the test set, 88 00:06:32,100 --> 00:06:33,830 and that's not a good idea. 89 00:06:39,620 --> 00:06:41,650 We're trying to discover the rule that 90 00:06:41,650 --> 00:06:46,420 covers almost all cases but not every crazy, weird case. 91 00:06:46,420 --> 00:06:47,290 OK? 92 00:06:47,290 --> 00:06:50,020 So that's our job, to create a function 93 00:06:50,020 --> 00:06:58,000 F of x that is correct on almost all of the training data. 94 00:07:01,860 --> 00:07:02,360 Yeah. 95 00:07:05,900 --> 00:07:12,760 So before I draw the picture of the network, 96 00:07:12,760 --> 00:07:23,750 let me just remember to mention the site Playground. 97 00:07:23,750 --> 00:07:26,840 I don't know if you've looked at that, so I'm going to ask you, 98 00:07:26,840 --> 00:07:28,640 playground.tensorflow.org. 99 00:07:36,650 --> 00:07:40,140 How many of you know that site or have met with it? 100 00:07:40,140 --> 00:07:42,530 Just a few, OK. 101 00:07:42,530 --> 00:07:46,140 OK, so it's not a very sophisticated site. 102 00:07:46,140 --> 00:07:49,680 It's got only four examples, four examples. 
103 00:07:57,060 --> 00:08:03,590 So one example is a whole lot of points that are blue, 104 00:08:03,590 --> 00:08:08,830 B for Blue, inside a bunch of points 105 00:08:08,830 --> 00:08:18,760 that are another set that are O for Orange, orange, blue. 106 00:08:18,760 --> 00:08:23,350 OK, so those are the two classes, orange and blue. 107 00:08:23,350 --> 00:08:34,480 So for the points x, the feature vector here is just the xy-- 108 00:08:34,480 --> 00:08:42,820 the features are the xy coordinates of the points. 109 00:08:42,820 --> 00:08:46,420 And our job is to find a function that's 110 00:08:46,420 --> 00:08:51,400 positive on these points and negative on those points. 111 00:08:51,400 --> 00:08:55,910 So there is a simple model problem, and I recommend-- 112 00:08:55,910 --> 00:08:58,060 well, just partly-- 113 00:08:58,060 --> 00:09:00,910 if you're an expert in deep learning. 114 00:09:00,910 --> 00:09:07,150 This is for children, but morally here, I certainly 115 00:09:07,150 --> 00:09:10,960 learned from playing in this playground. 116 00:09:10,960 --> 00:09:19,320 So you set the step size. 117 00:09:22,930 --> 00:09:25,980 Do you set it, or does it set it? 118 00:09:25,980 --> 00:09:27,570 I guess you can change it. 119 00:09:27,570 --> 00:09:30,310 I don't think I've changed it. 120 00:09:30,310 --> 00:09:31,320 What else do you set? 121 00:09:31,320 --> 00:09:41,460 Oh, you set the nonlinear activation, the nonlinear 122 00:09:41,460 --> 00:09:46,590 activation function. 123 00:09:46,590 --> 00:09:51,570 And let me just go over here and say what function people now 124 00:09:51,570 --> 00:09:52,500 mostly use. 125 00:09:52,500 --> 00:09:59,520 The activation function is called 126 00:09:59,520 --> 00:10:03,570 ReLU, pronounced different ways. 127 00:10:03,570 --> 00:10:06,470 I don't know how we got into that crazy name. 128 00:10:06,470 --> 00:10:13,430 It's this function: the larger of 0 and x. 
129 00:10:13,430 --> 00:10:15,680 So the function ReLU of x 130 00:10:15,680 --> 00:10:20,935 is the maximum, the larger, of 0 and x. 131 00:10:24,080 --> 00:10:27,400 The point is, it's not linear, and the point 132 00:10:27,400 --> 00:10:31,690 is that if we didn't allow nonlinearity in here somewhere, 133 00:10:31,690 --> 00:10:34,630 we couldn't even solve this playground problem. 134 00:10:34,630 --> 00:10:38,980 Because if our classifiers were all linear classifiers, 135 00:10:38,980 --> 00:10:44,410 like support vector machines, I couldn't separate the blue 136 00:10:44,410 --> 00:10:48,685 from the orange with a plane. 137 00:10:48,685 --> 00:10:53,130 It's got to somehow create some nonlinear function which maybe 138 00:10:53,130 --> 00:10:55,610 the function is trying to be-- 139 00:10:55,610 --> 00:11:03,430 a good function would be a function of r and theta maybe, 140 00:11:03,430 --> 00:11:07,060 maybe r minus 5. 141 00:11:07,060 --> 00:11:10,210 So maybe the distance out to that. 142 00:11:10,210 --> 00:11:13,970 Let's suppose that distance is 5. 143 00:11:13,970 --> 00:11:17,680 Then, r minus 5 will be negative on the blues, 144 00:11:17,680 --> 00:11:19,630 because r is small. 145 00:11:19,630 --> 00:11:22,720 And r minus 5 will be positive on the oranges, 146 00:11:22,720 --> 00:11:24,340 because r is bigger. 147 00:11:24,340 --> 00:11:30,705 And therefore, we will have the right signs, less than 0 148 00:11:30,705 --> 00:11:35,310 or greater than 0, and it'll classify 149 00:11:35,310 --> 00:11:39,960 this data, this training data. 150 00:11:39,960 --> 00:11:41,450 Yeah. 151 00:11:41,450 --> 00:11:43,230 So it has to do that. 152 00:11:43,230 --> 00:11:45,060 This is not a hard one to do. 153 00:11:45,060 --> 00:11:51,450 There are four examples, as I say, two are trivial. 154 00:11:51,450 --> 00:11:53,430 It finds a good function. 
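[In code, the two ideas just described-- ReLU, and the r minus 5 classifier for the circle data-- look like this. A minimal sketch; the radius 5 matches the lecture's example, but the sample points are made up for illustration:]

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), the larger of 0 and x, applied componentwise
    return np.maximum(0, x)

def circle_score(point, radius=5.0):
    # r - radius: negative inside the circle (blue), positive outside (orange)
    x, y = point
    return np.hypot(x, y) - radius

print(relu(np.array([-2.0, 0.0, 3.0])))  # [0. 0. 3.]
print(circle_score((1.0, 2.0)))          # negative: inner point, classified blue
print(circle_score((6.0, 3.0)))          # positive: outer point, classified orange
```

[The sign of the score is the predicted class, exactly the less-than-0 or greater-than-0 test described above.]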
155 00:11:53,430 --> 00:11:56,760 Well yeah, I've forgotten, they're so trivial, 156 00:11:56,760 --> 00:12:07,920 they shouldn't be mentioned, and then this is the medium test. 157 00:12:07,920 --> 00:12:10,350 And then the hard test is when you 158 00:12:10,350 --> 00:12:17,370 have a sort of spiral of oranges, 159 00:12:17,370 --> 00:12:20,880 and inside, you have a spiral of blues. 160 00:12:20,880 --> 00:12:24,905 That was cooked up by a fiend. 161 00:12:29,590 --> 00:12:35,020 So the system is trying to find a function 162 00:12:35,020 --> 00:12:39,580 that's positive on one spiral and negative on the other 163 00:12:39,580 --> 00:12:45,670 spiral, and that takes quite a bit of time, many, many epochs. 164 00:12:45,670 --> 00:12:47,320 I learned what an epoch is. 165 00:12:50,140 --> 00:12:51,670 Did you know what an epoch is? 166 00:12:51,670 --> 00:12:53,320 I didn't know whether it was just 167 00:12:53,320 --> 00:12:58,780 a fancy word for counting the steps in gradient descent. 168 00:12:58,780 --> 00:13:03,250 But it counts the steps, all right, 169 00:13:03,250 --> 00:13:07,930 but one epoch is the number of steps that matches 170 00:13:07,930 --> 00:13:11,610 the size of the training data. 171 00:13:11,610 --> 00:13:14,710 So if you have a million samples-- 172 00:13:14,710 --> 00:13:18,760 where ordinary gradient descent you would be doing a million-- 173 00:13:18,760 --> 00:13:23,780 you'd have a million by a million problem per step. 174 00:13:23,780 --> 00:13:27,040 Of course, stochastic gradient descent 175 00:13:27,040 --> 00:13:32,650 just does a mini-batch of 1 or 32 or something, but anyway. 176 00:13:37,810 --> 00:13:41,860 So you have to do it enough mini-batches 177 00:13:41,860 --> 00:13:44,440 so that the total number you've covered 178 00:13:44,440 --> 00:13:54,563 is the equivalent of one full run through the training data, 179 00:13:54,563 --> 00:13:55,980 and that was an interesting point. 
180 00:13:55,980 --> 00:13:57,340 Did you pick up that point? 181 00:13:57,340 --> 00:14:01,690 That in stochastic gradient descent, 182 00:14:01,690 --> 00:14:05,810 you could either do a mini-batch, 183 00:14:05,810 --> 00:14:11,320 and then put them back in the soup, so with replacement. 184 00:14:11,320 --> 00:14:17,760 Or you could just put your data in some order, 185 00:14:17,760 --> 00:14:20,610 from one to a zillion. 186 00:14:20,610 --> 00:14:25,290 So here's a first x and then more and more 187 00:14:25,290 --> 00:14:30,330 x's, and then just randomize the order. 188 00:14:30,330 --> 00:14:34,050 So you'd have to randomize the order for stochastic gradient 189 00:14:34,050 --> 00:14:36,930 descent to be reasonable, and then 190 00:14:36,930 --> 00:14:38,940 take a mini-batch and a mini-batch 191 00:14:38,940 --> 00:14:40,600 and a mini-batch and a mini-batch. 192 00:14:40,600 --> 00:14:44,070 And when you get to the bottom, you've finished one epoch. 193 00:14:44,070 --> 00:14:47,070 And then you'd probably randomize again, maybe, 194 00:14:47,070 --> 00:14:51,510 if you wanted to live right. 195 00:14:51,510 --> 00:14:56,040 And go through the mini-batches again, 196 00:14:56,040 --> 00:15:01,320 and probably do 1,000 times or more. 197 00:15:01,320 --> 00:15:06,060 Anyway, so I haven't said yet what you do, what this F of x 198 00:15:06,060 --> 00:15:09,870 is like, but you can sort of see it on the screen. 199 00:15:09,870 --> 00:15:14,410 Because as it creates this function F, 200 00:15:14,410 --> 00:15:17,150 it kind of plots it. 201 00:15:17,150 --> 00:15:23,680 And what you see on the screen is the 0 set for that function. 202 00:15:23,680 --> 00:15:27,810 So perfect would be for it to go through 0-- 203 00:15:27,810 --> 00:15:29,640 if I had another color. 204 00:15:29,640 --> 00:15:31,290 Oh, I do have another color. 205 00:15:31,290 --> 00:15:36,160 Look, this is the first time the whole semester blue is up here. 
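[The epoch and mini-batch bookkeeping described above can be sketched as follows. The batch size 32 and data sizes are illustrative, and the gradient update itself is left as a stub, since the lecture has not yet defined the function being trained:]

```python
import numpy as np

def run_epochs(X, n_epochs, batch_size=32, seed=0):
    # One epoch = enough mini-batches to pass once through all the training data.
    rng = np.random.default_rng(seed)
    n = len(X)
    steps = 0
    for _ in range(n_epochs):
        order = rng.permutation(n)              # re-randomize the order each epoch
        for start in range(0, n, batch_size):
            batch = X[order[start:start + batch_size]]
            # ...one stochastic gradient step on this mini-batch would go here...
            steps += 1
    return steps

X = np.zeros((1000, 5))   # 1,000 training samples, 5 features each
print(run_epochs(X, 3))   # ceil(1000 / 32) = 32 steps per epoch -> 96 steps
```

[This is the "randomize the order, take mini-batch after mini-batch, finish one epoch, randomize again" recipe, sampling without replacement within each epoch.]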
206 00:15:36,160 --> 00:15:43,310 OK, so if the function was positive there-- 207 00:15:43,310 --> 00:15:45,390 in this part, on the blues-- 208 00:15:45,390 --> 00:15:51,560 and negative outside that region for the oranges, that 209 00:15:51,560 --> 00:15:52,940 would be just what we want. 210 00:15:52,940 --> 00:15:53,840 Right? 211 00:15:53,840 --> 00:15:57,530 That would be what this little Playground site is creating. 212 00:15:57,530 --> 00:15:59,780 And on the screen, you'll see it. 213 00:15:59,780 --> 00:16:05,210 You'll see this curve, where it crosses 0. 214 00:16:05,210 --> 00:16:07,040 So that curve, where it crosses 0, 215 00:16:07,040 --> 00:16:10,130 is supposed to separate the two sets. 216 00:16:10,130 --> 00:16:12,610 One set is positive, one set is negative, 217 00:16:12,610 --> 00:16:15,250 where 0 is in between. 218 00:16:15,250 --> 00:16:19,870 And the point is, it's not a straight line, because we've 219 00:16:19,870 --> 00:16:22,100 got this nonlinear function. 220 00:16:22,100 --> 00:16:33,490 This is nonlinear, and it allows us to have 221 00:16:33,490 --> 00:16:36,520 functions like r minus 5. 222 00:16:36,520 --> 00:16:42,640 And so at 5, that's where the function would be 0, 223 00:16:42,640 --> 00:16:45,290 and you'll see that on the screen. 224 00:16:45,290 --> 00:16:50,110 You might just go to playground.tensorflow.org. 225 00:16:50,110 --> 00:16:53,650 Of course, TensorFlow is a big system. 226 00:16:53,650 --> 00:17:00,025 This is the child's department, but I 227 00:17:00,025 --> 00:17:01,150 thought it was pretty good. 228 00:17:01,150 --> 00:17:04,470 And then on this site, you decide 229 00:17:04,470 --> 00:17:08,800 how many layers there will be, how many neurons in each layer. 230 00:17:08,800 --> 00:17:13,140 So you create the structure that I'm about to draw. 
231 00:17:13,140 --> 00:17:21,180 And you won't be able to get to solve this problem 232 00:17:21,180 --> 00:17:25,290 to find a function F that learns that data 233 00:17:25,290 --> 00:17:31,440 without a number of layers and a number of neurons. 234 00:17:31,440 --> 00:17:34,080 If you don't give it enough, you'll see it struggling. 235 00:17:39,250 --> 00:17:46,120 The 0 set tries to follow this, but it gives up at some point. 236 00:17:46,120 --> 00:17:48,850 This one doesn't take too many layers, 237 00:17:48,850 --> 00:17:54,910 and the two trivial examples, just a few neurons do the job. 238 00:17:54,910 --> 00:17:55,720 OK. 239 00:17:55,720 --> 00:18:02,350 So now, that's a little commented on one website. 240 00:18:02,350 --> 00:18:05,530 If you know other websites that I should know 241 00:18:05,530 --> 00:18:10,430 and should call attention to, could you send me an email? 242 00:18:10,430 --> 00:18:15,000 I'm just not aware of everything that's out there. 243 00:18:15,000 --> 00:18:22,650 Or if you know a good Convolutional Neural Net, CNN, 244 00:18:22,650 --> 00:18:27,450 that is available to practice on, 245 00:18:27,450 --> 00:18:32,760 where you could give it the training set. 246 00:18:32,760 --> 00:18:34,730 That's what I'm talking about here. 247 00:18:38,680 --> 00:18:41,050 I'd be glad to know, because I just 248 00:18:41,050 --> 00:18:43,300 don't know all that I should. 249 00:18:43,300 --> 00:18:43,930 OK. 250 00:18:43,930 --> 00:18:47,430 So what does the function look like? 251 00:18:47,430 --> 00:18:51,230 Well, as I say, linear isn't going to do it, 252 00:18:51,230 --> 00:18:56,840 but linear is a very important part of it, of this function 253 00:18:56,840 --> 00:19:01,220 F. So the function F really has the form-- 254 00:19:01,220 --> 00:19:08,000 well, so we start here with a vector of one, 255 00:19:08,000 --> 00:19:11,840 two, three, four, m is five. 256 00:19:11,840 --> 00:19:19,840 This is the vector x, five components. 
257 00:19:19,840 --> 00:19:23,690 OK, so let me erase that now. 258 00:19:23,690 --> 00:19:36,700 OK, so then we have layer 1 with some number of points. 259 00:19:36,700 --> 00:19:47,670 Let's say, n1 is 6 neurons, and let me make this simple. 260 00:19:47,670 --> 00:19:51,420 I'll just have that one layer, and then I'll have the output. 261 00:19:51,420 --> 00:19:57,600 This will be the output layer, and it's just 262 00:19:57,600 --> 00:19:58,890 going to be one number. 263 00:20:01,490 --> 00:20:08,970 So I'm going to have a matrix, A1, that takes me from this. 264 00:20:08,970 --> 00:20:18,040 A1 will be 6 by 5, because I want 6 outputs and 5 inputs. 265 00:20:18,040 --> 00:20:23,570 6 by 5 matrix, so I have 30 weights to choose there. 266 00:20:23,570 --> 00:20:31,230 And so the y that comes out is going 267 00:20:31,230 --> 00:20:36,730 to be y1 will be A1 times x0. 268 00:20:36,730 --> 00:20:44,140 So x0 is the feature vector with 5 components. 269 00:20:44,140 --> 00:20:47,740 So that's a purely linear thing, but we also 270 00:20:47,740 --> 00:20:54,650 want an offset function, offset vector. 271 00:20:54,650 --> 00:20:56,120 So that's a vector. 272 00:20:56,120 --> 00:21:02,060 Then, this, the y that's coming out, has 6 components. 273 00:21:02,060 --> 00:21:07,940 The A1 is 6 by 5, the x0 was 5 by 1, 274 00:21:07,940 --> 00:21:10,550 and then of course, this is 6 by 1. 275 00:21:10,550 --> 00:21:12,830 So these are the weights. 276 00:21:15,500 --> 00:21:19,460 Yeah, I'll call them all weights, weights to compute. 277 00:21:25,940 --> 00:21:27,020 So these are connected. 278 00:21:27,020 --> 00:21:34,450 The usual picture is to show all these connections. 279 00:21:34,450 --> 00:21:38,270 I'll just put in some of them. 280 00:21:38,270 --> 00:21:49,010 So in here, we have 30 plus 6 parameters, 36 parameters, 281 00:21:49,010 --> 00:21:52,070 and then I'm going to close this. 
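[The dimension count on the board-- A1 is 6 by 5, the offset b1 has 6 components, 36 weights in this layer-- can be checked in a few lines. Random numbers stand in for the weights that training would actually choose:]

```python
import numpy as np

rng = np.random.default_rng(0)
m, n1 = 5, 6                        # 5 input features, 6 neurons in layer 1

A1 = rng.standard_normal((n1, m))   # 6 by 5 matrix: 30 weights
b1 = rng.standard_normal(n1)        # offset (bias) vector: 6 more weights

x0 = rng.standard_normal(m)         # one feature vector, 5 components
y1 = A1 @ x0 + b1                   # affine map: 6 by 5 times 5 by 1, plus 6 by 1

print(y1.shape)                     # (6,): six outputs from five inputs
print(A1.size + b1.size)            # 36 parameters in this layer
```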
282 00:21:52,070 --> 00:22:02,960 It's going to be a very shallow thing, so that will be just 1 283 00:22:02,960 --> 00:22:03,930 by 6. 284 00:22:03,930 --> 00:22:04,910 Yeah. 285 00:22:04,910 --> 00:22:05,410 OK. 286 00:22:08,780 --> 00:22:11,210 Right, so we're just getting one output. 287 00:22:11,210 --> 00:22:18,290 So that's just a vector at this final point, but of course, 288 00:22:18,290 --> 00:22:21,170 that the whole idea of deep neural nets 289 00:22:21,170 --> 00:22:23,810 is that you have many layers. 290 00:22:23,810 --> 00:22:29,420 So 36 more realistically is in the tens of thousands, 291 00:22:29,420 --> 00:22:32,330 and you have it multiple times. 292 00:22:32,330 --> 00:22:42,000 And the idea seems to be that you can separate what 293 00:22:42,000 --> 00:22:49,200 layer one learns about the data and from what layer two learns 294 00:22:49,200 --> 00:22:50,400 about the data. 295 00:22:50,400 --> 00:22:57,130 Layer one-- this A1, apparently by just looking after 296 00:22:57,130 --> 00:22:58,810 the computation-- 297 00:22:58,810 --> 00:23:04,630 this learns some basic facts about the data. 298 00:23:04,630 --> 00:23:13,930 The next, A2 which would go in here, would learn more detail, 299 00:23:13,930 --> 00:23:16,000 and then A3 would learn more details. 300 00:23:16,000 --> 00:23:19,540 So we would have a number of layers, 301 00:23:19,540 --> 00:23:27,150 and it's that construction that has made neural net successful. 302 00:23:27,150 --> 00:23:32,670 But I haven't finished, because right now, it's only linear. 303 00:23:32,670 --> 00:23:36,090 Right now, I just have, I'll call it A2 in here. 304 00:23:36,090 --> 00:23:38,700 Right now, I would just have a matrix multiplication 305 00:23:38,700 --> 00:23:47,850 apply A1 and then apply A2, but in between there is a 1 306 00:23:47,850 --> 00:23:55,930 by 1 action on each by this function. 
307 00:23:58,580 --> 00:24:02,860 So that function acts on that number 308 00:24:02,860 --> 00:24:06,610 to give that number back again or to give 0. 309 00:24:06,610 --> 00:24:09,850 So in there is ReLU. 310 00:24:09,850 --> 00:24:18,340 In this comes ReLU on each, 6 copies of ReLU acting 311 00:24:18,340 --> 00:24:20,820 on each of those 6 numbers. 312 00:24:20,820 --> 00:24:21,630 Right? 313 00:24:21,630 --> 00:24:32,460 So really x1 comes from y1 by applying ReLU to it. 314 00:24:32,460 --> 00:24:34,470 Then, that gives the x. 315 00:24:34,470 --> 00:24:37,830 So here are the y's from the linear part, 316 00:24:37,830 --> 00:24:39,540 and here are the x-- 317 00:24:39,540 --> 00:24:40,480 that's y1. 318 00:24:40,480 --> 00:24:46,920 That's a vector y1 from just the linear plus an affine map. 319 00:24:46,920 --> 00:24:51,120 Linear plus constant, that's affine. 320 00:24:51,120 --> 00:24:55,590 And then the next step is component by component 321 00:24:55,590 --> 00:25:03,240 we apply this function, and we get x1, and then do it 322 00:25:03,240 --> 00:25:05,100 again and again and again. 323 00:25:05,100 --> 00:25:07,660 So do you see the function? 324 00:25:07,660 --> 00:25:10,830 How do I describe now the function F of x? 325 00:25:14,150 --> 00:25:26,390 So the learning function which depends on the weights, 326 00:25:26,390 --> 00:25:28,655 on the A's and b's. 327 00:25:32,150 --> 00:25:39,270 So I start with an x, I apply A1 to it. 328 00:25:39,270 --> 00:25:41,610 Yeah, let me do this. 329 00:25:41,610 --> 00:25:44,600 This is the function F of x. 330 00:25:44,600 --> 00:25:48,710 F of x is going to be F3, let's say, 331 00:25:48,710 --> 00:25:58,520 of F2 of F1 of x, one, two, three, parentheses, right? 332 00:25:58,520 --> 00:26:01,520 OK, so it's a chain, you could say. 333 00:26:01,520 --> 00:26:06,830 F is a-- what's the right word for a chain of functions, 334 00:26:06,830 --> 00:26:09,890 if I take a function of a function? 
335 00:26:09,890 --> 00:26:12,800 The reason I use the word chain is that the chain rule 336 00:26:12,800 --> 00:26:14,360 gives the derivative. 337 00:26:14,360 --> 00:26:19,430 So a function of a function of a function, that's 338 00:26:19,430 --> 00:26:23,570 called composition, composing function. 339 00:26:23,570 --> 00:26:27,720 So this is a composition. 340 00:26:27,720 --> 00:26:30,900 I don't know if there's a standard symbol for starting 341 00:26:30,900 --> 00:26:36,820 with F1 and do some composition and do some composition. 342 00:26:36,820 --> 00:26:38,970 And now what are those separate F's? 343 00:26:43,180 --> 00:26:47,670 So the separate F's are the-- 344 00:26:47,670 --> 00:26:52,410 F1 of a vector would be-- it includes 345 00:26:52,410 --> 00:27:01,900 the ReLU part, the nonlinear part, of A1, x0 plus b1. 346 00:27:01,900 --> 00:27:07,590 So two parts, you do the linear or affine map 347 00:27:07,590 --> 00:27:13,080 on your feature vector, and then component 348 00:27:13,080 --> 00:27:18,560 by component you apply that nonlinear function. 349 00:27:18,560 --> 00:27:22,770 And it took some years before that nonlinear function 350 00:27:22,770 --> 00:27:27,630 became a big favorite. 351 00:27:27,630 --> 00:27:29,670 People imagined that it was better, 352 00:27:29,670 --> 00:27:33,180 it was important, to have a smooth function. 353 00:27:33,180 --> 00:27:42,950 So the original functions were sigmoids, like S curves, 354 00:27:42,950 --> 00:27:46,560 but of course, it turned out that experiments showed 355 00:27:46,560 --> 00:27:48,630 that this worked even better. 356 00:27:48,630 --> 00:27:52,140 Yeah, so that would be F1, and then F2 357 00:27:52,140 --> 00:27:55,780 would have the same form, and F3 would have the same form. 358 00:27:55,780 --> 00:27:59,640 So maybe this had 36 weights, and the next one 359 00:27:59,640 --> 00:28:04,440 would have another number and the next another number. 
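[Putting the pieces together for the shallow example on the board: F1 is ReLU applied to A1 x0 plus b1, and the output layer is the final 1 by 6 affine map. A sketch with 5 inputs, 6 neurons, 1 output; the random weights are placeholders for what training would learn, and a deeper net would simply repeat the affine-then-ReLU pattern:]

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

rng = np.random.default_rng(1)

# Layer 1: affine map from 5 features to 6 neurons, then componentwise ReLU
A1, b1 = rng.standard_normal((6, 5)), rng.standard_normal(6)
# Output layer: a 1 by 6 affine map giving one number whose sign is the class
A2, b2 = rng.standard_normal((1, 6)), rng.standard_normal(1)

def F(x0):
    x1 = relu(A1 @ x0 + b1)   # F1: linear plus constant (affine), then ReLU
    return A2 @ x1 + b2       # final affine map closes the composition

out = F(rng.standard_normal(5))
print(out.shape)              # (1,): one output, classified by its sign
```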
360 00:28:04,440 --> 00:28:08,940 You get quite complicated functions by composition, 361 00:28:08,940 --> 00:28:14,640 by like e to the sine of x, or e to the sine of the logarithm 362 00:28:14,640 --> 00:28:18,510 of x, or things like that. 363 00:28:18,510 --> 00:28:23,190 Pure math has asked, what functions can you get? 364 00:28:23,190 --> 00:28:24,780 Try to think of them all. 365 00:28:24,780 --> 00:28:27,690 Now, what kind of functions do we have here? 366 00:28:27,690 --> 00:28:33,870 What can I say about F of x as a function, as a math person? 367 00:28:33,870 --> 00:28:37,140 What kind of a function is it? 368 00:28:37,140 --> 00:28:42,660 So it's created out of matrices and vectors, 369 00:28:42,660 --> 00:28:49,980 out of a linear or affine map, followed by a nonlinear, 370 00:28:49,980 --> 00:28:54,490 by that particular nonlinear function. 371 00:28:54,490 --> 00:28:56,980 So what kind of a function is it? 372 00:28:56,980 --> 00:29:04,000 Well, I've written those words down up here, and F of x 373 00:29:04,000 --> 00:29:07,960 is going to be a continuous piecewise linear function. 374 00:29:10,630 --> 00:29:15,580 Because every step is continuous, 375 00:29:15,580 --> 00:29:17,800 that's a continuous function. 376 00:29:17,800 --> 00:29:19,900 Linear functions are continuous functions, 377 00:29:19,900 --> 00:29:24,700 so we're taking a composition of continuous functions, 378 00:29:24,700 --> 00:29:26,500 so it's continuous. 379 00:29:26,500 --> 00:29:30,940 And it's piecewise linear, because part of it is linear, 380 00:29:30,940 --> 00:29:32,950 and part of it is piecewise linear. 381 00:29:35,460 --> 00:29:51,480 So this is some continuous, piecewise, linear function 382 00:29:51,480 --> 00:29:58,900 of x, x in m dimensions. 383 00:29:58,900 --> 00:29:59,400 OK. 
384 00:30:02,830 --> 00:30:07,810 So one little math question which I think 385 00:30:07,810 --> 00:30:13,810 helps to understand, to like to swallow 386 00:30:13,810 --> 00:30:19,960 the idea of a chain, of the kind of chain we have here, 387 00:30:19,960 --> 00:30:22,525 of linear followed by ReLU. 388 00:30:27,850 --> 00:30:29,470 So here's my question. 389 00:30:29,470 --> 00:30:31,450 This is the question I'm going to ask. 390 00:30:31,450 --> 00:30:34,660 And by the way, back propagation is certainly 391 00:30:34,660 --> 00:30:37,510 going to come Wednesday rather than today. 392 00:30:37,510 --> 00:30:41,120 That's a major topic in itself. 393 00:30:41,120 --> 00:30:44,450 So let me keep going with this function. 394 00:30:47,180 --> 00:30:50,750 Could you get any function whatsoever this way? 395 00:30:50,750 --> 00:30:53,270 Well, no, you only get continuous, piecewise, 396 00:30:53,270 --> 00:30:56,500 linear functions. 397 00:30:56,500 --> 00:30:59,060 It's an interesting case. 398 00:30:59,060 --> 00:31:01,840 Let me just ask you. 399 00:31:01,840 --> 00:31:04,810 One of the exercises says, if I took 400 00:31:04,810 --> 00:31:09,790 two continuous, piecewise, linear functions-- 401 00:31:09,790 --> 00:31:12,640 the next 20 minutes are an attempt 402 00:31:12,640 --> 00:31:16,780 to give us a picture of the graph of a piecewise, 403 00:31:16,780 --> 00:31:23,605 linear function in say a function of two variables. 404 00:31:23,605 --> 00:31:30,160 So I have m equal to 2, and I draw its graph. 405 00:31:30,160 --> 00:31:32,340 OK, help me to draw this graph. 406 00:31:32,340 --> 00:31:38,330 So this would be a graph of F of x1, x2, 407 00:31:38,330 --> 00:31:41,000 and it's going to be continuous and piecewise linear. 408 00:31:41,000 --> 00:31:43,550 So what does its graph look like? 409 00:31:43,550 --> 00:31:45,680 That's the question. 410 00:31:45,680 --> 00:31:50,540 What's the graph of a piecewise, linear function looks like? 
411 00:31:50,540 --> 00:32:00,220 Well, it's got flat pieces in between the change from-- 412 00:32:00,220 --> 00:32:04,230 I do say piecewise, that means it's got different pieces. 413 00:32:04,230 --> 00:32:12,060 But within a piece, it's linear, and the pieces fit with each other, 414 00:32:12,060 --> 00:32:13,800 because it's continuous. 415 00:32:13,800 --> 00:32:20,050 So I visualize, well, it's like origami. 416 00:32:20,050 --> 00:32:24,760 This is the theory of origami almost. 417 00:32:24,760 --> 00:32:27,320 So right, origami, you take a flat thing, 418 00:32:27,320 --> 00:32:32,580 and you fold it along straight folds. 419 00:32:32,580 --> 00:32:34,330 So what's different from origami? 420 00:32:34,330 --> 00:32:35,300 Maybe not much. 421 00:32:38,136 --> 00:32:43,520 Well, maybe origami allows more than we allow here, 422 00:32:43,520 --> 00:32:46,790 or origami would allow you to fold it up and over. 423 00:32:46,790 --> 00:32:51,050 So origami would give you a multi-valued thing, 424 00:32:51,050 --> 00:32:55,880 because it's got a top and a bottom and other folds. 425 00:32:55,880 --> 00:33:02,980 This is just going out to infinity in flat pieces, 426 00:33:02,980 --> 00:33:06,020 and the question will be, how many pieces? 427 00:33:06,020 --> 00:33:07,710 So let me ask you that question. 428 00:33:07,710 --> 00:33:11,890 How many pieces do I have? 429 00:33:11,890 --> 00:33:15,440 Do you see what I mean by a piece? 430 00:33:15,440 --> 00:33:19,300 So I'm thinking of a graph that has these flat pieces, 431 00:33:19,300 --> 00:33:23,880 and they're connected along straight edges. 432 00:33:23,880 --> 00:33:30,870 And those straight edges come from the ReLU operation. 433 00:33:30,870 --> 00:33:33,520 Well, that's got two pieces. 434 00:33:33,520 --> 00:33:35,290 Actually, we could do it 1D. 435 00:33:35,290 --> 00:33:39,170 In 1D, we could count the number of pieces pretty easily. 436 00:33:39,170 --> 00:33:41,450 So what would be a piecewise linear? 
437 00:33:41,450 --> 00:33:45,260 Let me put it over here on the side and erase it soon. 438 00:33:45,260 --> 00:33:48,140 OK. 439 00:33:48,140 --> 00:33:56,350 So here's m equal 1, a continuous, piecewise 440 00:33:56,350 --> 00:34:00,380 linear F. I'll just draw its graph. 441 00:34:00,380 --> 00:34:08,159 So OK, so it's got straight pieces, 442 00:34:08,159 --> 00:34:11,300 straight pieces like so. 443 00:34:11,300 --> 00:34:12,855 Yeah, you've got the idea. 444 00:34:12,855 --> 00:34:14,540 It's a broken line type. 445 00:34:14,540 --> 00:34:16,310 Sometimes, people say broken line, 446 00:34:16,310 --> 00:34:21,350 but I'm never sure that's a good description of this. 447 00:34:21,350 --> 00:34:24,500 Piecewise linear, continuous, so it's continuous 448 00:34:24,500 --> 00:34:30,070 because the pieces meet, and it's piecewise 449 00:34:30,070 --> 00:34:31,900 linear, obviously. 450 00:34:31,900 --> 00:34:34,909 OK, so that's the kind of picture 451 00:34:34,909 --> 00:34:39,460 I have for a function of one variable. 452 00:34:39,460 --> 00:34:42,650 Now, my question-- 453 00:34:42,650 --> 00:34:47,630 as an aid to try to visualize this function in 2D-- 454 00:34:47,630 --> 00:34:51,500 is to see if we can count the pieces, 455 00:34:51,500 --> 00:34:53,540 see if we can count the pieces. 456 00:34:53,540 --> 00:34:55,639 Yes. 457 00:34:55,639 --> 00:34:58,060 So that's in the notes. 458 00:34:58,060 --> 00:35:06,270 I found it in a paper by five authors for a meeting. 459 00:35:08,946 --> 00:35:16,390 So actually, in the whole world of neural nets, 460 00:35:16,390 --> 00:35:21,150 it's the conferences every couple of years 461 00:35:21,150 --> 00:35:27,150 that everybody prepares for, submitting more than one paper. 462 00:35:27,150 --> 00:35:30,450 So it's kind of a piecewise linear conference, 463 00:35:30,450 --> 00:35:35,460 and those are the big conferences. 464 00:35:35,460 --> 00:35:36,300 OK.
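[Editor's note: the 1D picture just described can be sketched in a few lines of code. This is my own illustration, not part of the lecture, and every number in it is a made-up example: a linear term plus a weighted sum of shifted ReLUs is exactly a continuous, piecewise linear function of one variable, with one fold at each ReLU's kink.]

```python
# My own 1D sketch (not from the lecture): slope, bias, and the fold
# locations/weights below are arbitrary example values.

def relu(x):
    # ReLU: zero to the left of the kink, identity to the right
    return max(x, 0.0)

def f(x, folds=(-1.0, 0.5, 2.0), weights=(1.0, -2.0, 1.5), slope=0.3, bias=0.1):
    # linear part + one ReLU term per fold point c_i:
    # between consecutive folds the function is a straight line,
    # and the slope changes by w_i as x crosses c_i
    return slope * x + bias + sum(w * relu(x - c) for w, c in zip(weights, folds))
```

With N fold points in 1D, the line is cut into N + 1 linear pieces: the 3 folds above give 4 pieces, matching the broken-line graph on the board.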
465 00:35:36,300 --> 00:35:38,550 So this is the back propagation section, 466 00:35:38,550 --> 00:35:42,800 and I want to look at the-- 467 00:35:42,800 --> 00:35:43,300 OK. 468 00:35:46,560 --> 00:35:49,400 So this is a paper by Kleinberg and four others. 469 00:35:49,400 --> 00:35:54,530 Kleinberg, he's a computer science guy at Cornell. 470 00:35:54,530 --> 00:35:57,440 He was a PhD from here in math, and he's 471 00:35:57,440 --> 00:36:04,700 a very cool and significant person, 472 00:36:04,700 --> 00:36:13,930 not so much on neural networks as just this whole part 473 00:36:13,930 --> 00:36:15,130 of computer science. 474 00:36:15,130 --> 00:36:15,820 Right. 475 00:36:15,820 --> 00:36:19,390 So anyway, they and other people too 476 00:36:19,390 --> 00:36:20,740 have asked this same problem. 477 00:36:23,450 --> 00:36:24,910 Suppose I'm in two variables. 478 00:36:28,250 --> 00:36:33,830 So what are you imagining now for the surface, 479 00:36:33,830 --> 00:36:39,200 the graph of F of x and y? 480 00:36:39,200 --> 00:36:43,520 It has these lines, fold lines, right? 481 00:36:43,520 --> 00:36:45,600 I'm thinking it has fold lines. 482 00:36:48,840 --> 00:36:51,890 So I can start with a complete plane, and I fold it 483 00:36:51,890 --> 00:36:53,520 along one line. 484 00:36:53,520 --> 00:36:55,920 So now, it's like ReLU. 485 00:36:55,920 --> 00:37:00,070 It's one half plane there going into a different half plane 486 00:37:00,070 --> 00:37:00,570 there. 487 00:37:00,570 --> 00:37:03,130 Everybody with it? 488 00:37:03,130 --> 00:37:08,170 And now, I take that function, that surface 489 00:37:08,170 --> 00:37:14,020 which just has two parts, and I put in another fold. 490 00:37:14,020 --> 00:37:17,850 OK, how many parts have I got now? 491 00:37:17,850 --> 00:37:20,450 I think four, am I right? 492 00:37:20,450 --> 00:37:26,430 Four parts, yes, because this will be different from this, 493 00:37:26,430 --> 00:37:28,840 because it was folded along that line. 
494 00:37:28,840 --> 00:37:31,380 So these will be four different pieces. 495 00:37:31,380 --> 00:37:34,820 They have the same value at the center there, 496 00:37:34,820 --> 00:37:39,230 and they match along the lines. 497 00:37:39,230 --> 00:37:43,800 So the number of flat pieces is four for this. 498 00:37:43,800 --> 00:37:47,640 So that's with two folds, and now I just want to ask you, 499 00:37:47,640 --> 00:37:51,975 with m folds how many pieces are there? 500 00:37:51,975 --> 00:37:53,655 Can I get up to three folds? 501 00:37:56,330 --> 00:37:59,390 So I'm going to look for the number of folds. 502 00:37:59,390 --> 00:38:04,690 So let me just use a notation, maybe r. 503 00:38:04,690 --> 00:38:22,910 r is the number of flat pieces, and m is the dimension of x. 504 00:38:22,910 --> 00:38:29,530 In my picture, it's two, and N is the number of folds. 505 00:38:33,750 --> 00:38:35,340 So let me say it again. 506 00:38:35,340 --> 00:38:36,310 I'm taking a plane. 507 00:38:40,200 --> 00:38:42,240 I'll fold that plane-- 508 00:38:42,240 --> 00:38:45,060 because the dimension was two-- 509 00:38:45,060 --> 00:38:47,205 I'll fold it N times. 510 00:38:51,260 --> 00:38:52,860 How many pieces? 511 00:38:52,860 --> 00:38:53,930 How many flat pieces? 512 00:39:06,190 --> 00:39:10,150 This would be a central step in understanding 513 00:39:10,150 --> 00:39:13,750 how close the function-- 514 00:39:13,750 --> 00:39:18,730 what freedom you have in the function F. For example, 515 00:39:18,730 --> 00:39:22,850 can you approximate any continuous function 516 00:39:22,850 --> 00:39:27,910 by one of these functions F by taking enough folds? 517 00:39:27,910 --> 00:39:30,910 Seems like the answer should be yes, and it is yes. 518 00:39:34,020 --> 00:39:38,380 For pure math, that's one question. 519 00:39:38,380 --> 00:39:41,830 Is this class of functions universal? 
520 00:39:41,830 --> 00:39:44,230 So the universality theorem would 521 00:39:44,230 --> 00:39:49,060 be to say that any function-- 522 00:39:49,060 --> 00:39:53,680 sine x, whatever-- could be approximated 523 00:39:53,680 --> 00:39:59,380 as close as you like by one of these guys with enough folds. 524 00:39:59,380 --> 00:40:05,010 And over here, we're kind of making it more numerical. 525 00:40:05,010 --> 00:40:07,570 We're going to count the number of pieces 526 00:40:07,570 --> 00:40:10,480 just to see how quickly they grow. 527 00:40:10,480 --> 00:40:12,490 So what happens here? 528 00:40:12,490 --> 00:40:14,760 So I have four folds. 529 00:40:14,760 --> 00:40:18,660 Right now, I have N equal 2. 530 00:40:18,660 --> 00:40:23,080 m is 2 here in this picture. 531 00:40:23,080 --> 00:40:27,030 And I'm trying to draw this surface, and in here I've put in 2. 532 00:40:27,030 --> 00:40:28,050 Did I take N? 533 00:40:28,050 --> 00:40:34,030 Yeah, two folds, and now I'm going to go up to three folds. 534 00:40:34,030 --> 00:40:34,720 OK. 535 00:40:34,720 --> 00:40:37,890 So let me fold it along that line. 536 00:40:37,890 --> 00:40:39,570 How many pieces have I got now? 537 00:40:44,240 --> 00:40:49,600 Let's see, can I count those pieces? 538 00:40:49,600 --> 00:40:52,360 Is it seven? 539 00:40:52,360 --> 00:40:54,340 So what is a formula? 540 00:40:54,340 --> 00:40:55,810 What if I do another fold? 541 00:40:59,890 --> 00:41:01,910 Yeah, let's pretend we do another fold. 542 00:41:01,910 --> 00:41:02,630 Yeah? 543 00:41:02,630 --> 00:41:04,260 AUDIENCE: [INAUDIBLE] 544 00:41:04,260 --> 00:41:06,600 GILBERT STRANG: Uh, yeah. 545 00:41:06,600 --> 00:41:10,050 Well, maybe that's going to be it. 546 00:41:12,580 --> 00:41:15,020 It's a kind of nice question, because it asks 547 00:41:15,020 --> 00:41:16,970 you to visualize this thing. 548 00:41:16,970 --> 00:41:17,470 OK. 549 00:41:20,290 --> 00:41:22,030 So what happened?
550 00:41:22,030 --> 00:41:24,250 How many of those lines will be-- 551 00:41:24,250 --> 00:41:27,190 if I put in a fourth line-- 552 00:41:27,190 --> 00:41:29,040 how many? 553 00:41:29,040 --> 00:41:33,300 Yeah, how many new folds do I create? 554 00:41:33,300 --> 00:41:34,850 That's kind of the question, and I'm 555 00:41:34,850 --> 00:41:37,430 assuming that fourth line doesn't 556 00:41:37,430 --> 00:41:39,110 go through any of these points. 557 00:41:39,110 --> 00:41:40,655 It's sort of in general position. 558 00:41:43,660 --> 00:41:47,650 So I put it in a fourth line, da-da-da-da, there it is. 559 00:41:47,650 --> 00:41:50,620 OK, so what happened here? 560 00:41:50,620 --> 00:41:53,890 How many new ones did it create? 561 00:41:53,890 --> 00:41:55,360 How many new ones did it create? 562 00:41:58,690 --> 00:42:01,390 Let me make that one green, because I'm 563 00:42:01,390 --> 00:42:03,460 distinguishing that's the guy that's 564 00:42:03,460 --> 00:42:06,580 added after the original. 565 00:42:06,580 --> 00:42:07,660 We had seven. 566 00:42:07,660 --> 00:42:14,860 We had seven pieces, and now we've got more. 567 00:42:14,860 --> 00:42:15,580 Was it seven? 568 00:42:15,580 --> 00:42:16,390 It was, wasn't it? 569 00:42:16,390 --> 00:42:20,800 One, two, three, four, five, six, seven, but now how many 570 00:42:20,800 --> 00:42:22,090 pieces have I got? 571 00:42:22,090 --> 00:42:30,160 Or how many pieces did this new line create? 572 00:42:30,160 --> 00:42:32,750 We want to build it up, use a recursion. 573 00:42:32,750 --> 00:42:35,150 How many pieces did this new-- 574 00:42:35,150 --> 00:42:39,920 well, this new line created one new piece there. 575 00:42:39,920 --> 00:42:40,910 Right? 576 00:42:40,910 --> 00:42:43,760 One new piece there, one new piece there, 577 00:42:43,760 --> 00:42:50,050 one new piece there, so there are four new pieces. 578 00:42:50,050 --> 00:42:52,410 OK. 
579 00:42:52,410 --> 00:42:55,490 Yes, so there's some formula that's 580 00:42:55,490 --> 00:42:59,430 going to tell us that, and now what would the next one create? 581 00:42:59,430 --> 00:43:03,870 Well, now I have one, two, three, four lines. 582 00:43:03,870 --> 00:43:06,870 So now, I'm going to put through a fifth line, 583 00:43:06,870 --> 00:43:09,300 and that will create a whole bunch of pieces. 584 00:43:09,300 --> 00:43:14,970 I'm losing the thread of this argument, but you're onto it. 585 00:43:14,970 --> 00:43:16,410 Right? 586 00:43:16,410 --> 00:43:19,800 Yeah, so any suggestions? 587 00:43:19,800 --> 00:43:20,860 Yeah. 588 00:43:20,860 --> 00:43:23,910 AUDIENCE: Yeah, I think you add essentially the number of lines 589 00:43:23,910 --> 00:43:26,840 that you have each time you add a line at most. 590 00:43:26,840 --> 00:43:27,960 GILBERT STRANG: OK. 591 00:43:27,960 --> 00:43:30,230 Yes. 592 00:43:30,230 --> 00:43:30,930 That's right. 593 00:43:30,930 --> 00:43:34,290 So there is a recursion formula that I want to know, 594 00:43:34,290 --> 00:43:36,840 and I learned it from Kleinberg's paper. 595 00:43:39,820 --> 00:43:42,210 And then we have an addition to do, 596 00:43:42,210 --> 00:43:45,970 so the recursion will tell me how much it goes up 597 00:43:45,970 --> 00:43:49,330 with each new function, and then we have to add. 598 00:43:49,330 --> 00:43:49,960 OK. 599 00:43:49,960 --> 00:43:52,210 So the recursion formula, let me write that down. 600 00:43:56,520 --> 00:44:06,030 So this is r of N and m that I'd like to find a formula for. 601 00:44:06,030 --> 00:44:12,260 It's the number of flat pieces with an m dimensional surface-- 602 00:44:12,260 --> 00:44:14,510 well, we're taking m to be 2-- 603 00:44:14,510 --> 00:44:16,115 and N folds. 604 00:44:18,790 --> 00:44:22,340 So N equal 1, 2, 3. 605 00:44:22,340 --> 00:44:25,982 Let's write down the numbers we know. 606 00:44:25,982 --> 00:44:29,630 With one fold, how many pieces? 
607 00:44:29,630 --> 00:44:32,710 Two, good, so far so good. 608 00:44:35,760 --> 00:44:38,940 With one fold, there were two pieces. 609 00:44:38,940 --> 00:44:44,640 So this is the count r, and then with two folds, how many? 610 00:44:44,640 --> 00:44:46,740 Oh, we've gone past that point. 611 00:44:46,740 --> 00:44:50,130 So can we get back to just those two? 612 00:44:50,130 --> 00:44:51,280 Was it four? 613 00:44:51,280 --> 00:44:52,020 AUDIENCE: Yes. 614 00:44:52,020 --> 00:44:55,050 GILBERT STRANG: OK, thanks. 615 00:44:55,050 --> 00:44:59,790 Now, when I put in that third fold, how many did I have 616 00:44:59,790 --> 00:45:02,550 without the green line yet? 617 00:45:02,550 --> 00:45:06,180 Seven, was it seven? 618 00:45:06,180 --> 00:45:12,300 And when the fourth one went in, that green one, how many have I 619 00:45:12,300 --> 00:45:13,320 got in this picture? 620 00:45:15,990 --> 00:45:19,860 So the question is how many new ones did I create, I guess. 621 00:45:19,860 --> 00:45:24,270 So that line got chopped into that piece, that piece, 622 00:45:24,270 --> 00:45:27,930 that piece, that piece, four pieces for the new line. 623 00:45:27,930 --> 00:45:32,400 Four pieces for the new line, and then each of those pieces 624 00:45:32,400 --> 00:45:36,720 like added a flat bit. 625 00:45:36,720 --> 00:45:40,350 Because that piece from here to here 626 00:45:40,350 --> 00:45:43,260 separated these two which were previously 627 00:45:43,260 --> 00:45:45,770 just one piece, one flat piece. 628 00:45:45,770 --> 00:45:47,400 I folded on that line. 629 00:45:47,400 --> 00:45:48,420 I folded on this. 630 00:45:48,420 --> 00:45:49,380 I folded there. 631 00:45:49,380 --> 00:45:51,930 I think it went up by 4 to 11. 632 00:45:55,510 --> 00:45:59,740 So now, we just have to guess a formula that 633 00:45:59,740 --> 00:46:02,950 matches those numbers, and then of course, we really 634 00:46:02,950 --> 00:46:09,490 should guess it for any m and any N. 
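[Editor's note: before guessing the formula, the counts on the board (2, 4, 7, 11) can be checked numerically. This is my own sketch, not part of the lecture: drop N lines in general position, and count the flat pieces as the distinct sign patterns the lines carve out, sampled on a fine grid. The particular lines and the name `count_pieces` are my choices for illustration.]

```python
# Each line is (a, b, c) for a*x + b*y + c = 0. These four are in general
# position: no two parallel, no three through one point (checked by hand).
LINES = [(1.0, 0.0, 0.0),    # x = 0
         (0.0, 1.0, 0.0),    # y = 0
         (1.0, 1.0, -1.0),   # x + y = 1
         (1.0, -1.0, 0.5)]   # x - y = -0.5

def count_pieces(lines, lo=-3.0, hi=3.0, step=0.05):
    # A flat piece is exactly the set of points with one sign pattern,
    # so counting distinct patterns over a fine grid counts the pieces.
    patterns = set()
    x = lo + 0.001          # small offset so no grid point sits on a line
    while x < hi:
        y = lo + 0.001
        while y < hi:
            patterns.add(tuple(a * x + b * y + c > 0 for a, b, c in lines))
            y += step
        x += step
    return len(patterns)
```

Taking the first 1, 2, 3, 4 lines reproduces the counts 2, 4, 7, 11 from the blackboard, before any formula is written down.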
635 00:46:09,490 --> 00:46:13,390 And I'll write down the formula that they found. 636 00:46:16,340 --> 00:46:19,410 It involves binomial numbers. 637 00:46:19,410 --> 00:46:24,450 Everything in the world involves binomial numbers, 638 00:46:24,450 --> 00:46:28,650 because they satisfy every identity you could think of. 639 00:46:34,060 --> 00:46:35,065 So here's their formula. 640 00:46:38,920 --> 00:46:43,180 r with N folds, and we're in m dimensions. 641 00:46:43,180 --> 00:46:46,300 So we've really in our thinking had m equal to 2, 642 00:46:46,300 --> 00:46:53,470 but we should grow up and get m to be five dimensional. 643 00:46:53,470 --> 00:46:55,540 So we have a five dimensional-- 644 00:46:55,540 --> 00:46:57,700 let's not think about that. 645 00:46:57,700 --> 00:47:00,180 OK. 646 00:47:00,180 --> 00:47:03,030 So it turns out it's binomial numbers-- 647 00:47:03,030 --> 00:47:11,840 N 0, N 1, up to N m. 648 00:47:16,260 --> 00:47:28,880 So for m equals 2, which is my picture, it's N 0 plus N 1 649 00:47:28,880 --> 00:47:32,750 plus N 2, and what are these? 650 00:47:32,750 --> 00:47:37,560 What does that N 2 mean, for example? 651 00:47:37,560 --> 00:47:41,040 That's a binomial number. 652 00:47:41,040 --> 00:47:45,640 I don't know if you're keen on binomial numbers. 653 00:47:45,640 --> 00:47:50,170 Some people, their whole lives go into binomial numbers. 654 00:47:50,170 --> 00:47:53,290 So it's something like-- 655 00:47:53,290 --> 00:48:00,610 is it N factorial divided by N minus 2 factorial and 2 656 00:48:00,610 --> 00:48:01,390 factorial? 657 00:48:05,590 --> 00:48:08,070 I think that's what that number means. 658 00:48:08,070 --> 00:48:09,410 That's the binomial number. 659 00:48:12,860 --> 00:48:17,030 So at this point, I'm hoping to get the answer seven, I think. 660 00:48:22,290 --> 00:48:25,200 I'm in m equal to-- 661 00:48:25,200 --> 00:48:27,690 I've gone up to 2. 
662 00:48:27,690 --> 00:48:36,510 Yeah, so I think I've obviously allowed for three cuts, 663 00:48:36,510 --> 00:48:41,430 and the r, when we had just three, was 7. 664 00:48:46,380 --> 00:48:49,110 So now I'm taking N to be 3, 665 00:48:49,110 --> 00:48:56,050 and I'm hoping the answer is 7. 666 00:49:01,370 --> 00:49:03,170 So I add these three things. 667 00:49:03,170 --> 00:49:08,490 So what is 3, the binomial number 3 with 2? 668 00:49:08,490 --> 00:49:10,550 I've forgotten how to say that. 669 00:49:10,550 --> 00:49:11,660 I'm ashamed to admit. 670 00:49:11,660 --> 00:49:13,380 3 choose 2, thanks. 671 00:49:13,380 --> 00:49:14,550 I knew there was a good way. 672 00:49:14,550 --> 00:49:15,890 So what is 3 choose 2? 673 00:49:18,700 --> 00:49:23,450 Well, put in 3, and 2 is in there already, 674 00:49:23,450 --> 00:49:28,520 so that'd be 6 over 1 times 2. 675 00:49:28,520 --> 00:49:29,690 This would be 3. 676 00:49:29,690 --> 00:49:30,530 Would that be 3? 677 00:49:35,520 --> 00:49:39,639 And what is 3 choose 1? 678 00:49:39,639 --> 00:49:41,120 AUDIENCE: 3. 679 00:49:41,120 --> 00:49:42,780 GILBERT STRANG: How do you know that? 680 00:49:42,780 --> 00:49:45,730 You're probably right. 681 00:49:45,730 --> 00:49:48,190 3, I think, yeah. 682 00:49:48,190 --> 00:49:52,800 Oh yeah, probably a theorem that if these add 3. 683 00:49:52,800 --> 00:49:55,340 Yeah, so I'm doing N equals 3 here. 684 00:49:55,340 --> 00:49:55,840 OK. 685 00:49:55,840 --> 00:49:57,270 So yeah, I agree. 686 00:49:57,270 --> 00:50:01,000 That's 3, and what about N choose 0? 687 00:50:01,000 --> 00:50:05,580 There you have to live with 0 factorial, 688 00:50:05,580 --> 00:50:10,000 but 0 factorial is by no means 0. 689 00:50:10,000 --> 00:50:13,200 So what is 0 factorial? 690 00:50:13,200 --> 00:50:17,470 1, yeah. 691 00:50:17,470 --> 00:50:19,510 I remember when I was an undergraduate having 692 00:50:19,510 --> 00:50:20,470 a bet on that.
693 00:50:24,890 --> 00:50:29,030 I won, but he didn't pay off. 694 00:50:29,030 --> 00:50:31,480 Yeah, so it's 3. 695 00:50:31,480 --> 00:50:35,320 This is 3 factorial over 3 factorial times 0 factorial. 696 00:50:35,320 --> 00:50:37,060 So it's 6 over 6 times 1. 697 00:50:37,060 --> 00:50:38,130 So it's 1. 698 00:50:38,130 --> 00:50:41,496 Yeah, 1 and 3 and 3 make 7. 699 00:50:41,496 --> 00:50:44,450 So that proves the formula. 700 00:50:44,450 --> 00:50:46,190 Well, it doesn't quite prove the formula, 701 00:50:46,190 --> 00:50:51,380 but the way to prove it is by an induction. 702 00:50:51,380 --> 00:51:00,910 If you like this stuff, it's the recursion that you use 703 00:51:00,910 --> 00:51:01,850 induction on, 704 00:51:01,850 --> 00:51:04,810 which is just what we did now, what we did here. 705 00:51:04,810 --> 00:51:11,130 Here comes in line number 4, and it cuts through, 706 00:51:11,130 --> 00:51:14,910 and then we just counted the 4 pieces there. 707 00:51:14,910 --> 00:51:23,250 So yeah, so let me just tell you the recursion for r of N and m. 708 00:51:23,250 --> 00:51:26,250 The number we're looking for is the number 709 00:51:26,250 --> 00:51:30,780 that we had with one less cut. 710 00:51:30,780 --> 00:51:35,310 So that's the previous count of flat pieces 711 00:51:35,310 --> 00:51:43,080 plus the number of pieces that the new cut is divided into-- 712 00:51:43,080 --> 00:51:44,310 here that number was 4. 713 00:51:44,310 --> 00:51:52,000 And that's r of N minus 1, m minus 1. 714 00:51:52,000 --> 00:51:57,790 Yeah, and I won't go further. 715 00:51:57,790 --> 00:52:02,800 Time's up, but that rule for recursion 716 00:52:02,800 --> 00:52:08,440 is proved in section 7.1, taken from the paper 717 00:52:08,440 --> 00:52:11,030 by Kleinberg and others. 718 00:52:11,030 --> 00:52:11,530 Yeah. 719 00:52:11,530 --> 00:52:14,320 So OK, I think this is-- 720 00:52:14,320 --> 00:52:16,180 I don't know what you feel.
721 00:52:16,180 --> 00:52:20,380 For me, this, like, gave me a better feeling 722 00:52:20,380 --> 00:52:25,660 that I was understanding what kind of functions we had here. 723 00:52:25,660 --> 00:52:30,430 And so then the question is-- 724 00:52:33,270 --> 00:52:35,220 with this family of functions, we 725 00:52:35,220 --> 00:52:45,880 want to choose the A's and the weights, the A's and b's, 726 00:52:45,880 --> 00:52:50,130 to match the training data. 727 00:52:50,130 --> 00:52:56,190 So then we have a problem of minimizing the total loss, 728 00:52:56,190 --> 00:53:00,480 and we have a gradient descent problem. 729 00:53:00,480 --> 00:53:04,370 So we have to find the gradient, so that's Wednesday's job. 730 00:53:04,370 --> 00:53:09,330 Wednesday's job is to find the gradient of F, 731 00:53:09,330 --> 00:53:11,260 and that's back propagation. 732 00:53:11,260 --> 00:53:11,760 Good. 733 00:53:11,760 --> 00:53:14,040 Thank you very much. 734 00:53:14,040 --> 00:53:16,790 7.1 is done.
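[Editor's note: the closed-form count and the recursion from the board can be put side by side in a few lines of code. This is my own check, not part of the lecture; the function names are mine. The formula from Section 7.1 is r(N, m) = C(N,0) + C(N,1) + ... + C(N,m), and the recursion is r(N, m) = r(N-1, m) + r(N-1, m-1).]

```python
from math import comb

def r_formula(N, m):
    # Closed form: r(N, m) = C(N,0) + C(N,1) + ... + C(N,m).
    # math.comb(N, i) returns 0 when i > N, which is the right convention here.
    return sum(comb(N, i) for i in range(m + 1))

def r_recursive(N, m):
    # The recursion from the board: the old count with one less cut,
    # plus the pieces the new cut is divided into (a count one dimension down).
    # Base cases: no folds means one piece; dimension 0 means one piece.
    if N == 0 or m == 0:
        return 1
    return r_recursive(N - 1, m) + r_recursive(N - 1, m - 1)
```

For m = 2 (folding a plane), N = 1, 2, 3, 4 folds give 2, 4, 7, 11 flat pieces, exactly the counts from the lecture, and the two functions agree for every N and m because Pascal's rule C(N, i) = C(N-1, i) + C(N-1, i-1) is the recursion term by term.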