1 00:00:01,170 --> 00:00:03,510 The following content is provided under a Creative 2 00:00:03,510 --> 00:00:04,930 Commons license. 3 00:00:04,930 --> 00:00:07,120 Your support will help MIT OpenCourseWare 4 00:00:07,120 --> 00:00:11,230 continue to offer high-quality educational resources for free. 5 00:00:11,230 --> 00:00:13,770 To make a donation or to view additional materials 6 00:00:13,770 --> 00:00:17,730 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,730 --> 00:00:18,610 at ocw.mit.edu. 8 00:00:23,542 --> 00:00:25,750 GABRIEL SANCHEZ-MARTINEZ: Any questions on Homework 1 9 00:00:25,750 --> 00:00:26,680 before we get started? 10 00:00:29,308 --> 00:00:30,190 AUDIENCE: Yeah. 11 00:00:30,190 --> 00:00:33,100 GABRIEL SANCHEZ-MARTINEZ: OK, fire away. 12 00:00:33,100 --> 00:00:36,940 AUDIENCE: I guess, first, do you think 13 00:00:36,940 --> 00:00:39,400 we have like this minimum cycle time, 14 00:00:39,400 --> 00:00:42,360 like a theoretical minimum cycle time and then what was actually 15 00:00:42,360 --> 00:00:45,630 [INAUDIBLE] cycle time? 16 00:00:45,630 --> 00:00:49,890 GABRIEL SANCHEZ-MARTINEZ: So cycle time, just to review-- 17 00:00:49,890 --> 00:00:54,540 it's the time that it takes a bus to-- 18 00:00:54,540 --> 00:00:57,620 from the time [AUDIO OUT] for a trip. 19 00:00:57,620 --> 00:01:01,530 It goes all the way one way, has to wait at the other end 20 00:01:01,530 --> 00:01:05,010 to recover the schedule, comes back, waits to recover, 21 00:01:05,010 --> 00:01:07,290 and is ready to begin the next round. 22 00:01:07,290 --> 00:01:09,890 So that's a cycle. 23 00:01:09,890 --> 00:01:14,040 AUDIENCE: Since you have [INAUDIBLE] going on, 24 00:01:14,040 --> 00:01:17,579 if you had 4.1 buses, then you use a cycle time. 25 00:01:17,579 --> 00:01:18,995 Then obviously, you can't do that? 
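The question above turns on rounding: a fleet comes in whole buses. A minimal sketch, assuming the usual relationship that the number of buses equals cycle time divided by headway, rounded up. The function name and the numbers (a 41-minute cycle on a 10-minute headway) are illustrative assumptions, not values from the assignment.

```python
import math

def buses_required(cycle_time_min, headway_min):
    """Whole buses needed to hold a headway over one full cycle (rounded up)."""
    return math.ceil(cycle_time_min / headway_min)

# Hypothetical numbers: 41 / 10 gives 4.1 buses, which cannot be operated.
print(buses_required(41, 10))  # 5

# With only 4 buses, the cycle would have to fit in 4 * 10 = 40 minutes,
# so about one minute of recovery time would have to be given up.
print(buses_required(40, 10))  # 4
```

Rounding down instead would mean shortening the cycle, which in practice comes out of recovery time and so trades away reliability.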
26 00:01:18,995 --> 00:01:19,930 [INTERPOSING VOICES] 27 00:01:19,930 --> 00:01:21,315 GABRIEL SANCHEZ-MARTINEZ: So you would need five buses-- 28 00:01:21,315 --> 00:01:21,610 AUDIENCE: Yeah. 29 00:01:21,610 --> 00:01:23,860 GABRIEL SANCHEZ-MARTINEZ: --if that's what you've got. 30 00:01:23,860 --> 00:01:26,892 Or you would have to do a trade-off with reliability 31 00:01:26,892 --> 00:01:27,850 if that were to happen. 32 00:01:31,650 --> 00:01:33,312 AUDIENCE: I think most of my questions 33 00:01:33,312 --> 00:01:35,440 were on this very last couple of questions. 34 00:01:35,440 --> 00:01:38,368 GABRIEL SANCHEZ-MARTINEZ: Yeah. 35 00:01:38,368 --> 00:01:41,680 AUDIENCE: We were aggregating a bunch of data for-- 36 00:01:41,680 --> 00:01:45,094 [INAUDIBLE] you did it across both directions 37 00:01:45,094 --> 00:01:46,510 and then asked, how does it change 38 00:01:46,510 --> 00:01:49,650 when you would like to evaluate each direction separately 39 00:01:49,650 --> 00:01:51,062 in layover time? 40 00:01:51,062 --> 00:01:53,520 GABRIEL SANCHEZ-MARTINEZ: This is the penultimate question, 41 00:01:53,520 --> 00:01:53,880 correct? 42 00:01:53,880 --> 00:01:54,120 AUDIENCE: Yeah. 43 00:01:54,120 --> 00:01:54,510 GABRIEL SANCHEZ-MARTINEZ: So that's 44 00:01:54,510 --> 00:01:55,920 the hardest question on the assignment. 45 00:01:55,920 --> 00:01:56,670 AUDIENCE: OK. 46 00:01:56,670 --> 00:01:58,836 GABRIEL SANCHEZ-MARTINEZ: It is a challenge question 47 00:01:58,836 --> 00:02:02,430 because there are different cases that you have to analyze. 48 00:02:02,430 --> 00:02:05,330 That's maybe the hint, right? 49 00:02:05,330 --> 00:02:07,330 There are some cases. 50 00:02:07,330 --> 00:02:09,537 And for each case, there is a probability 51 00:02:09,537 --> 00:02:10,620 that that case will occur. 52 00:02:10,620 --> 00:02:11,830 AUDIENCE: Yeah. 
53 00:02:11,830 --> 00:02:22,334 GABRIEL SANCHEZ-MARTINEZ: And-- let's see if this starts-- 54 00:02:22,334 --> 00:02:23,750 there's a probability that it will 55 00:02:23,750 --> 00:02:30,090 occur and then a consequence, or something happens in that case. 56 00:02:30,090 --> 00:02:32,990 So you have to look at each case and then aggregate the cases 57 00:02:32,990 --> 00:02:34,880 together, if that makes sense. 58 00:02:34,880 --> 00:02:36,146 AUDIENCE: Yes. 59 00:02:36,146 --> 00:02:38,770 GABRIEL SANCHEZ-MARTINEZ: We're taking questions for Assignment 60 00:02:38,770 --> 00:02:40,090 1, which is due on Thursday. 61 00:02:43,740 --> 00:02:44,880 Any other questions? 62 00:02:44,880 --> 00:02:47,565 AUDIENCE: That's it. 63 00:02:47,565 --> 00:02:48,440 AUDIENCE: [INAUDIBLE] 64 00:02:48,440 --> 00:02:51,200 GABRIEL SANCHEZ-MARTINEZ: It is due at 4:00 65 00:02:51,200 --> 00:02:54,620 so at class time essentially, yeah. 66 00:02:54,620 --> 00:02:56,410 I actually [AUDIO OUT] if you 4:00. 67 00:02:56,410 --> 00:02:58,960 I said 4:05, so you have five minutes. 68 00:03:01,876 --> 00:03:05,035 AUDIENCE: Can you [INAUDIBLE] what assumptions there 69 00:03:05,035 --> 00:03:06,250 are [INAUDIBLE]? 70 00:03:10,485 --> 00:03:12,276 GABRIEL SANCHEZ-MARTINEZ: In what question? 71 00:03:12,276 --> 00:03:14,463 AUDIENCE: When you said it seems to be 72 00:03:14,463 --> 00:03:17,622 the reasoning or assumption about the schedule [INAUDIBLE]? 73 00:03:17,622 --> 00:03:20,052 Which metric do you use? 74 00:03:20,052 --> 00:03:22,757 Based on the data, which [INAUDIBLE]? 75 00:03:22,757 --> 00:03:25,340 GABRIEL SANCHEZ-MARTINEZ: Yeah, so that's Question 3, correct? 76 00:03:25,340 --> 00:03:25,965 AUDIENCE: Yeah. 77 00:03:25,965 --> 00:03:28,887 GABRIEL SANCHEZ-MARTINEZ: So I can't really explain. 78 00:03:28,887 --> 00:03:30,720 I can't give you the answer to the question.
79 00:03:30,720 --> 00:03:34,250 So what I'm looking for there is your intuition 80 00:03:34,250 --> 00:03:38,870 and your understanding of why you would pick which statistics 81 00:03:38,870 --> 00:03:45,240 from Question 2, where it tells you calculate all these things. 82 00:03:45,240 --> 00:03:48,500 Now I'm saying pick from those statistics 83 00:03:48,500 --> 00:03:52,370 what you would use for t and for r. 84 00:03:52,370 --> 00:03:54,770 And you may want to combine different statistics 85 00:03:54,770 --> 00:03:57,450 for the computation of r. 86 00:03:57,450 --> 00:03:57,950 Yeah? 87 00:03:57,950 --> 00:04:02,132 AUDIENCE: [INAUDIBLE] multiple valid responses but-- 88 00:04:02,132 --> 00:04:04,590 GABRIEL SANCHEZ-MARTINEZ: Yes, some more valid than others, 89 00:04:04,590 --> 00:04:06,510 but some that are definitely invalid 90 00:04:06,510 --> 00:04:12,900 and some that are almost 100% valid but not 100% valid. 91 00:04:12,900 --> 00:04:15,670 So there are several correct answers, 92 00:04:15,670 --> 00:04:18,130 and some that are very good answers 93 00:04:18,130 --> 00:04:21,220 because you can justify the choice of the statistic 94 00:04:21,220 --> 00:04:23,310 conceptually. 95 00:04:23,310 --> 00:04:24,380 Yeah. 96 00:04:24,380 --> 00:04:26,920 Any other questions on Homework 1? 97 00:04:26,920 --> 00:04:30,860 I can take some more questions after class, if that's OK. 98 00:04:30,860 --> 00:04:36,260 So we had a snow day if you had a good time, and/or at least, 99 00:04:36,260 --> 00:04:37,910 you could use it to catch up. 100 00:04:37,910 --> 00:04:40,430 So the schedule is a little different now. 101 00:04:40,430 --> 00:04:43,260 I've posted an update about that on Stellar (class site). 102 00:04:43,260 --> 00:04:44,930 There's a new syllabus. 103 00:04:44,930 --> 00:04:47,970 And we're going to do some [AUDIO OUT] different 104 00:04:47,970 --> 00:04:49,890 [AUDIO OUT]. 
105 00:04:49,890 --> 00:04:53,370 You may remember that we have three introductory classes 106 00:04:53,370 --> 00:04:56,370 on topics of [INAUDIBLE]. 107 00:04:56,370 --> 00:04:59,140 And then, we had modal characteristics and roles. 108 00:04:59,140 --> 00:05:02,100 And then, [AUDIO OUT]. 109 00:05:02,100 --> 00:05:03,788 We're going to shuffle a little bit. 110 00:05:03,788 --> 00:05:08,770 [AUDIO OUT] Microphone working? 111 00:05:08,770 --> 00:05:13,360 So because the second assignment is on data collection, 112 00:05:13,360 --> 00:05:14,770 we're going to cover that today. 113 00:05:14,770 --> 00:05:16,769 And we're going to give you that homework today, 114 00:05:16,769 --> 00:05:19,810 so that you can get started on the data collection side. 115 00:05:19,810 --> 00:05:23,080 Then, we're going to cover some of the short-range [INAUDIBLE] 116 00:05:23,080 --> 00:05:24,670 of planning concepts. 117 00:05:24,670 --> 00:05:25,910 Nema is going to do that-- 118 00:05:25,910 --> 00:05:26,480 Nema Nassir. 119 00:05:26,480 --> 00:05:29,540 You might recall him from the previous lecture. 120 00:05:29,540 --> 00:05:33,370 And then, we'll finish with [INAUDIBLE] and costs 121 00:05:33,370 --> 00:05:36,250 on March the 2nd, OK? 122 00:05:36,250 --> 00:05:39,860 So remember, there's no class on Monday the 21st. 123 00:05:44,410 --> 00:05:46,400 AUDIENCE: You mean Tuesday? 124 00:05:46,400 --> 00:05:49,131 GABRIEL SANCHEZ-MARTINEZ: Sorry, yes, Tuesday. 125 00:05:49,131 --> 00:05:50,630 I think, there's no class on Monday. 126 00:05:50,630 --> 00:05:52,190 And then, Tuesday there are classes. 127 00:05:52,190 --> 00:05:53,420 But it's Monday's schedule. 128 00:05:53,420 --> 00:05:55,250 So we don't have class. 129 00:05:55,250 --> 00:05:58,320 Thank you for bringing that up. 130 00:05:58,320 --> 00:05:59,630 OK. 131 00:05:59,630 --> 00:06:03,610 I'll leave Homework 2 for when we finish with the lecture. 132 00:06:03,610 --> 00:06:06,370 But I'll distribute it later.
133 00:06:06,370 --> 00:06:09,240 So let's just get started on that. 134 00:06:09,240 --> 00:06:11,430 So data collection techniques and program design-- 135 00:06:11,430 --> 00:06:13,999 that's the topic for today. 136 00:06:13,999 --> 00:06:14,790 Here's the outline. 137 00:06:14,790 --> 00:06:17,660 So we're going to cover a summary of current practice 138 00:06:17,660 --> 00:06:18,747 quite quickly. 139 00:06:18,747 --> 00:06:21,330 Then, we're going to talk about data collection program design 140 00:06:21,330 --> 00:06:25,050 process, the needs, the data needs, the techniques for data 141 00:06:25,050 --> 00:06:26,301 collection, the sampling. 142 00:06:26,301 --> 00:06:28,050 We're going to get into the details of how 143 00:06:28,050 --> 00:06:29,890 we get sample sizes. 144 00:06:29,890 --> 00:06:32,730 And we're going to finish with special considerations 145 00:06:32,730 --> 00:06:35,200 for surveys and surveying techniques. 146 00:06:37,740 --> 00:06:38,610 So where are we? 147 00:06:38,610 --> 00:06:42,090 Where is the transit industry in terms of data collection, 148 00:06:42,090 --> 00:06:44,370 and sampling, and these things? 149 00:06:44,370 --> 00:06:45,810 Largely, there's been a transition 150 00:06:45,810 --> 00:06:48,047 from manual to automatic data collection. 151 00:06:48,047 --> 00:06:50,130 As you might imagine, with the internet of things, 152 00:06:50,130 --> 00:06:52,800 and sensors, and the internet, and wireless, 153 00:06:52,800 --> 00:06:54,720 it used to be that if you wanted to have 154 00:06:54,720 --> 00:06:56,100 statistics on your running times, 155 00:06:56,100 --> 00:06:57,690 you had to send people out. 156 00:06:57,690 --> 00:06:59,880 We call those people checkers. 157 00:06:59,880 --> 00:07:03,330 And those checkers would have notebooks and record 158 00:07:03,330 --> 00:07:05,400 running times, and number of people boarding, 159 00:07:05,400 --> 00:07:06,660 and these things.
160 00:07:06,660 --> 00:07:09,950 Nowadays, with the modern systems, especially 161 00:07:09,950 --> 00:07:13,920 the modern systems, we have several sensors and types 162 00:07:13,920 --> 00:07:16,410 of sensors that collect some of that data for us. 163 00:07:16,410 --> 00:07:20,085 So we're going to cover both approaches. 164 00:07:20,085 --> 00:07:23,054 [INAUDIBLE] data collection to supplement 165 00:07:23,054 --> 00:07:24,220 [INAUDIBLE] data collection. 166 00:07:24,220 --> 00:07:27,730 And if you happen to be consulting for a developing 167 00:07:27,730 --> 00:07:32,560 country that is working with a system that has not yet brought 168 00:07:32,560 --> 00:07:35,500 in automatic data collection technologies, 169 00:07:35,500 --> 00:07:39,100 it's also useful to know all about the manual design 170 00:07:39,100 --> 00:07:42,120 and manual data collection process. 171 00:07:42,120 --> 00:07:44,710 [AUDIO OUT] took this class and ended up 172 00:07:44,710 --> 00:07:47,980 working in large consulting firms have gone off 173 00:07:47,980 --> 00:07:52,369 to help countries put in new transit systems. 174 00:07:52,369 --> 00:07:54,160 And one of the first things they have to do 175 00:07:54,160 --> 00:07:59,222 is go back to these slides and see what the plan is going to be, 176 00:07:59,222 --> 00:08:01,180 and how many people you need, and how much it's 177 00:08:01,180 --> 00:08:01,930 going to cost. 178 00:08:01,930 --> 00:08:04,510 So very useful topic. 179 00:08:04,510 --> 00:08:07,519 So as I said, there's automatic data collection. 180 00:08:07,519 --> 00:08:08,810 There's manual data collection. 181 00:08:08,810 --> 00:08:11,860 There's sometimes a mix of data collection techniques. 182 00:08:11,860 --> 00:08:14,470 Often, what happens is that we just send people 183 00:08:14,470 --> 00:08:15,970 out and collect data. 184 00:08:15,970 --> 00:08:19,420 Or we just extract a sample of automatically collected data.
185 00:08:19,420 --> 00:08:21,970 And we don't really think about sampling, and the confidence 186 00:08:21,970 --> 00:08:24,750 interval, and how sure are we of that result 187 00:08:24,750 --> 00:08:27,490 that we're going to use to influence policy or make decisions 188 00:08:27,490 --> 00:08:29,260 that will affect service. 189 00:08:29,260 --> 00:08:31,390 How sure are we of those? 190 00:08:31,390 --> 00:08:33,640 So statistical validity. 191 00:08:33,640 --> 00:08:37,179 Often, there's inefficient use of data. 192 00:08:37,179 --> 00:08:41,919 And ADCS, which is Automatic Data Collection Systems-- 193 00:08:41,919 --> 00:08:44,260 we'll use that abbreviation throughout the course-- 194 00:08:44,260 --> 00:08:47,020 presents a major opportunity for strengthening data 195 00:08:47,020 --> 00:08:48,260 to support decision making. 196 00:08:48,260 --> 00:08:49,790 We'll talk about how that happens. 197 00:08:49,790 --> 00:08:52,520 Let's first compare manual and automatic data collection. 198 00:08:52,520 --> 00:08:54,386 So what happens with manual data collection? 199 00:08:54,386 --> 00:08:55,510 You hire people, as I said. 200 00:08:55,510 --> 00:08:56,950 You hire checkers. 201 00:08:56,950 --> 00:08:59,860 So initially, there's no setup cost. 202 00:08:59,860 --> 00:09:01,950 There's a low capital cost to that. 203 00:09:01,950 --> 00:09:04,210 But there's a high marginal cost because if you 204 00:09:04,210 --> 00:09:06,680 want to collect more data, you have to hire more people. 205 00:09:06,680 --> 00:09:08,134 Does that make sense? 206 00:09:08,134 --> 00:09:10,300 If you want to bring in an automatic data collection 207 00:09:10,300 --> 00:09:12,341 system, you might have to retrofit all your buses 208 00:09:12,341 --> 00:09:13,930 with AVL sensors. 209 00:09:13,930 --> 00:09:16,410 And that's going to cost you initially. 210 00:09:16,410 --> 00:09:19,710 So that's a high capital cost relatively.
211 00:09:19,710 --> 00:09:22,420 But low marginal cost-- once you have those systems in place, 212 00:09:22,420 --> 00:09:24,160 they keep collecting data for you. 213 00:09:24,160 --> 00:09:25,300 And it's almost free. 214 00:09:25,300 --> 00:09:27,760 You do need some maintenance on this equipment. 215 00:09:27,760 --> 00:09:31,510 But compared to manual data collection, 216 00:09:31,510 --> 00:09:33,310 you have low marginal cost. 217 00:09:33,310 --> 00:09:35,770 Because of that marginal cost difference, 218 00:09:35,770 --> 00:09:38,320 it tends to happen that when you have manual data collection, 219 00:09:38,320 --> 00:09:41,920 you only pay checkers for small sample sizes-- 220 00:09:41,920 --> 00:09:43,300 just what you need. 221 00:09:43,300 --> 00:09:46,930 Whereas, once you put in automatic data collection 222 00:09:46,930 --> 00:09:49,720 systems, they keep collecting data. 223 00:09:49,720 --> 00:09:52,110 So you get much bigger data. 224 00:09:52,110 --> 00:09:53,950 Bless you. 225 00:09:53,950 --> 00:09:57,430 OK, in both cases, we can collect data and analyze it 226 00:09:57,430 --> 00:09:59,860 for aggregate analysis and disaggregate analysis. 227 00:09:59,860 --> 00:10:01,720 So you might want passenger-specific data 228 00:10:01,720 --> 00:10:02,620 on things. 229 00:10:02,620 --> 00:10:06,400 Or you might want things like just averages 230 00:10:06,400 --> 00:10:09,340 and aggregate things, total number of passengers 231 00:10:09,340 --> 00:10:10,960 using the system. 232 00:10:10,960 --> 00:10:12,940 And when you're doing manual data collection, 233 00:10:12,940 --> 00:10:14,890 you can look at quantitative things, things 234 00:10:14,890 --> 00:10:16,440 you can measure and count. 235 00:10:16,440 --> 00:10:19,820 Or you can also observe things qualitatively. 236 00:10:19,820 --> 00:10:22,090 One example that I saw in a recent paper 237 00:10:22,090 --> 00:10:26,680 was considering the [? therivation ?]
238 00:10:26,680 --> 00:10:28,719 by student in some country. 239 00:10:28,719 --> 00:10:30,760 And they didn't ask people if they were students. 240 00:10:30,760 --> 00:10:32,582 They were looking at people's-- 241 00:10:32,582 --> 00:10:33,790 more or less, are they young? 242 00:10:33,790 --> 00:10:35,410 Are they carrying a backpack? 243 00:10:35,410 --> 00:10:38,390 And that would be the labeling for your student. 244 00:10:38,390 --> 00:10:42,010 So that's something that a sensor might not do so well. 245 00:10:42,010 --> 00:10:44,270 Although now with machine learning, who knows? 246 00:10:44,270 --> 00:10:45,890 But we haven't seen that so far. 247 00:10:45,890 --> 00:10:48,580 So you can do qualitative observations 248 00:10:48,580 --> 00:10:50,410 when you're doing manual data collection. 249 00:10:50,410 --> 00:10:52,810 Manual data collection tends to be unreliable, 250 00:10:52,810 --> 00:10:56,020 especially when people aren't very well trained 251 00:10:56,020 --> 00:10:59,320 and when you have a group of different people collecting 252 00:10:59,320 --> 00:10:59,830 data. 253 00:10:59,830 --> 00:11:01,621 So each person might have different biases. 254 00:11:01,621 --> 00:11:05,020 It's hard to reproduce the exact bias across persons. 255 00:11:05,020 --> 00:11:07,870 With automatic data collection, you do have errors. 256 00:11:07,870 --> 00:11:10,450 And often, they are not corrected. 257 00:11:10,450 --> 00:11:14,260 But if you do correct them, and you estimate those biases 258 00:11:14,260 --> 00:11:18,550 and adjust for them, you can end up with a better result. 259 00:11:18,550 --> 00:11:21,280 Because of the small sample sizes in manual data 260 00:11:21,280 --> 00:11:25,180 collection, you tend to have limited spatial 261 00:11:25,180 --> 00:11:27,290 and temporal coverage of data.
262 00:11:27,290 --> 00:11:29,650 So for example, if you're interested in ridership 263 00:11:29,650 --> 00:11:34,900 in the system, it's unlikely that you will cover ridership 264 00:11:34,900 --> 00:11:38,650 in holidays for [INAUDIBLE] system 265 00:11:38,650 --> 00:11:40,330 because there are only a few holidays. 266 00:11:40,330 --> 00:11:44,350 And usually, you're not mostly interested in holidays. 267 00:11:44,350 --> 00:11:48,160 So chances are, you won't have data collection for holidays. 268 00:11:48,160 --> 00:11:50,320 Whereas once you install automatic data collection 269 00:11:50,320 --> 00:11:51,880 systems, they keep collecting data. 270 00:11:51,880 --> 00:11:56,500 So you get data at midnight on President's Day. 271 00:11:56,500 --> 00:11:59,350 So they're always on. 272 00:11:59,350 --> 00:12:01,210 They're always collecting data. 273 00:12:01,210 --> 00:12:06,170 Manual data needs to be checked, cleaned, analyzed, coded, 274 00:12:06,170 --> 00:12:08,670 and sometimes put into systems before they can be analyzed. 275 00:12:08,670 --> 00:12:09,670 That could take a while. 276 00:12:09,670 --> 00:12:11,320 You need to hire people to do that. 277 00:12:11,320 --> 00:12:15,490 Whereas automatic data collection systems often 278 00:12:15,490 --> 00:12:17,969 send their data to databases in real-time or very 279 00:12:17,969 --> 00:12:18,760 close to real-time. 280 00:12:18,760 --> 00:12:24,580 [INAUDIBLE] you can start analyzing things the next day. 281 00:12:24,580 --> 00:12:28,750 So you arrive in the morning to your desk at a transit agency, 282 00:12:28,750 --> 00:12:30,790 and you have performance metrics for yesterday. 283 00:12:30,790 --> 00:12:33,520 So you wouldn't be able to do that unless you have people 284 00:12:33,520 --> 00:12:36,250 working very hard if you're using manual data 285 00:12:36,250 --> 00:12:37,870 collection system. 
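The capital-versus-marginal cost contrast drawn in this comparison can be put into numbers. A minimal sketch under a simple linear cost assumption; every dollar figure and function name below is hypothetical and chosen only to make the arithmetic easy to follow.

```python
def manual_cost(n_obs, cost_per_obs):
    """Manual collection: no capital outlay, every observation costs checker time."""
    return n_obs * cost_per_obs

def automatic_cost(n_obs, capital, cost_per_obs):
    """Automatic collection: large up-front cost, small cost per extra observation."""
    return capital + n_obs * cost_per_obs

def break_even_observations(capital, manual_per_obs, auto_per_obs):
    """Number of observations at which both approaches cost the same."""
    return capital / (manual_per_obs - auto_per_obs)

# Hypothetical figures: $400,000 to equip the fleet, $4.50 per manually
# recorded observation, $0.50 per automatically recorded one (maintenance).
n_star = break_even_observations(400_000, 4.5, 0.5)
print(n_star)                                  # 100000.0
print(manual_cost(n_star, 4.5))                # 450000.0
print(automatic_cost(n_star, 400_000, 0.5))    # 450000.0
```

Below the break-even volume the manual program is cheaper, which is consistent with agencies paying checkers only for small samples; above it, the always-on system wins.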
286 00:12:37,870 --> 00:12:41,050 When we talk about automatic data collection systems, 287 00:12:41,050 --> 00:12:42,220 there are many. 288 00:12:42,220 --> 00:12:47,630 But there are three types that we refer to very, very often. 289 00:12:47,630 --> 00:12:51,250 And so the first one is AFC, Automatic Fare Collection 290 00:12:51,250 --> 00:12:52,180 Systems. 291 00:12:52,180 --> 00:12:54,710 This is your fare box or your fare gates and your smart card, 292 00:12:54,710 --> 00:12:55,370 your Charlie Card. 293 00:12:55,370 --> 00:12:56,078 You're in Boston. 294 00:12:56,078 --> 00:12:57,210 You tap to enter the bus. 295 00:12:57,210 --> 00:13:00,040 And you tap to enter the subway system. 296 00:13:00,040 --> 00:13:03,220 Increasingly, it's based on contactless smart cards. 297 00:13:03,220 --> 00:13:04,660 And those contactless smart cards 298 00:13:04,660 --> 00:13:06,760 have some sort of RFID technology 299 00:13:06,760 --> 00:13:08,440 with a unique identifier. 300 00:13:08,440 --> 00:13:10,780 When you tap that card to the sensor, 301 00:13:10,780 --> 00:13:13,240 the sensor will read that identifier. 302 00:13:13,240 --> 00:13:16,240 And it'll do things like fare calculation for you. 303 00:13:16,240 --> 00:13:18,760 But that record gets sent to a database. 304 00:13:18,760 --> 00:13:23,680 And it's there for people like us to analyze and make 305 00:13:23,680 --> 00:13:25,340 good use of it for planning. 306 00:13:25,340 --> 00:13:29,770 So it tends to provide entry information almost always. 307 00:13:29,770 --> 00:13:34,810 In some systems, like the Washington, DC metro or the TfL 308 00:13:34,810 --> 00:13:37,600 subway, you tap in to enter and exit. 309 00:13:37,600 --> 00:13:41,320 So you have both origin and destinations.
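Where cards are tapped on both entry and exit, each completed trip record carries an origin and a destination, so origin-destination flows can be counted directly. A minimal sketch with made-up records; the tuple layout, card IDs, and station names are assumptions for illustration and do not reflect any agency's actual data schema.

```python
from collections import Counter

# Hypothetical completed trips: (card_id, entry_station, exit_station)
taps = [
    ("card_a", "Station 1", "Station 3"),
    ("card_b", "Station 1", "Station 3"),
    ("card_c", "Station 2", "Station 3"),
    ("card_a", "Station 3", "Station 1"),
]

def od_flows(records):
    """Count trips for each (origin, destination) pair."""
    return Counter((origin, dest) for _, origin, dest in records)

flows = od_flows(taps)
print(flows[("Station 1", "Station 3")])  # 2
print(flows[("Station 3", "Station 1")])  # 1
```

In entry-only systems the exit station is not recorded, so the destination would have to be inferred rather than read off the record.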
310 00:13:41,320 --> 00:13:43,690 And if you always have the systems on, 311 00:13:43,690 --> 00:13:47,050 then you have full spatial and temporal coverage 312 00:13:47,050 --> 00:13:51,100 of all of the use of the system at an individual passenger 313 00:13:51,100 --> 00:13:51,600 level. 314 00:13:51,600 --> 00:13:55,040 So very disaggregate-- sorry about that. 315 00:13:55,040 --> 00:13:57,320 Traditionally, these systems are not real-time. 316 00:13:57,320 --> 00:14:01,340 So it might take a while for those transactions 317 00:14:01,340 --> 00:14:03,170 to make it to the data warehouse, where 318 00:14:03,170 --> 00:14:05,810 they're available for planners to analyze it. 319 00:14:05,810 --> 00:14:10,070 The calculation of how much fare in some systems 320 00:14:10,070 --> 00:14:11,000 is in real-time. 321 00:14:11,000 --> 00:14:13,400 In other systems like the Charlie Card, 322 00:14:13,400 --> 00:14:17,210 the stored value that you have is stored on your card. 323 00:14:17,210 --> 00:14:21,020 So it may take a while if you tap at a bus for that bus 324 00:14:21,020 --> 00:14:23,570 to go to a garage and get probed-- 325 00:14:23,570 --> 00:14:25,940 and for the data that has been stored in that bus 326 00:14:25,940 --> 00:14:31,500 to be extracted from that bus to the central server. 327 00:14:31,500 --> 00:14:33,171 There is a move-- 328 00:14:33,171 --> 00:14:34,920 and we'll talk more about this when we get 329 00:14:34,920 --> 00:14:37,020 to fare policy and technology-- 330 00:14:37,020 --> 00:14:39,810 towards using mobile phone payments 331 00:14:39,810 --> 00:14:42,820 and using contactless bank card payment systems. 332 00:14:42,820 --> 00:14:45,840 And those systems often do the full transaction 333 00:14:45,840 --> 00:14:47,040 over the air in real-time. 
334 00:14:47,040 --> 00:14:49,770 So we're starting to look at the possibility 335 00:14:49,770 --> 00:14:52,170 of having all this data in real-time or almost 336 00:14:52,170 --> 00:14:53,130 in real-time. 337 00:14:53,130 --> 00:14:54,110 But it's not there yet. 338 00:14:54,110 --> 00:14:56,360 AUDIENCE: [INAUDIBLE] can I ask a question about that? 339 00:14:56,360 --> 00:14:56,980 GABRIEL SANCHEZ-MARTINEZ: Yeah, of course. 340 00:14:56,980 --> 00:14:59,305 AUDIENCE: In terms of smart card, 341 00:14:59,305 --> 00:15:01,449 where this balance is stored on the card-- 342 00:15:01,449 --> 00:15:02,740 GABRIEL SANCHEZ-MARTINEZ: Yeah. 343 00:15:02,740 --> 00:15:06,134 AUDIENCE: --if one can figure out how to hack that card-- 344 00:15:06,134 --> 00:15:07,425 GABRIEL SANCHEZ-MARTINEZ: Yeah. 345 00:15:07,425 --> 00:15:08,966 AUDIENCE: --then what can [INAUDIBLE] 346 00:15:08,966 --> 00:15:12,877 fares through an elaborate technology that I couldn't do 347 00:15:12,877 --> 00:15:14,320 and most people couldn't do. 348 00:15:14,320 --> 00:15:15,290 But maybe some could. 349 00:15:15,290 --> 00:15:17,081 GABRIEL SANCHEZ-MARTINEZ: Yeah, definitely. 350 00:15:17,081 --> 00:15:19,880 So the Charlie Card system is an example about-- 351 00:15:19,880 --> 00:15:23,480 actually, MIT students were the first to hack it. 352 00:15:23,480 --> 00:15:24,980 AUDIENCE: I'm not surprised. 353 00:15:24,980 --> 00:15:28,310 GABRIEL SANCHEZ-MARTINEZ: So it's older technology. 354 00:15:28,310 --> 00:15:30,530 It used a low-bit encryption key. 355 00:15:30,530 --> 00:15:32,660 That's a symmetric encryption key. 356 00:15:32,660 --> 00:15:35,731 And they just brute forced it. 357 00:15:35,731 --> 00:15:36,980 They figured what the key was. 358 00:15:36,980 --> 00:15:39,260 They happened to use the same key for every card. 359 00:15:39,260 --> 00:15:43,250 So once you broke that key, you could take any card. 
360 00:15:43,250 --> 00:15:45,844 And with the right hardware, you could add however much value 361 00:15:45,844 --> 00:15:46,760 you want to that card. 362 00:15:46,760 --> 00:15:47,260 And-- 363 00:15:47,260 --> 00:15:48,140 AUDIENCE: [INAUDIBLE] 364 00:15:48,140 --> 00:15:50,390 GABRIEL SANCHEZ-MARTINEZ: Yeah, yeah, exactly. 365 00:15:50,390 --> 00:15:52,700 We don't think it's been a major problem. 366 00:15:52,700 --> 00:15:54,590 AUDIENCE: But it happens. 367 00:15:54,590 --> 00:15:56,798 GABRIEL SANCHEZ-MARTINEZ: I haven't seen MIT students 368 00:15:56,798 --> 00:15:58,550 selling special MIT cards. 369 00:15:58,550 --> 00:16:02,690 But that would be criminal, of course. 370 00:16:02,690 --> 00:16:06,450 Yeah, so newer systems have much stronger encryption. 371 00:16:06,450 --> 00:16:10,410 And they have different encryption keys for each card. 372 00:16:10,410 --> 00:16:13,970 And certainly, when we're moving towards contactless bank cards, 373 00:16:13,970 --> 00:16:17,570 we're talking about a much more secure encryption. 374 00:16:17,570 --> 00:16:20,270 It's your credit card that you're using to tap 375 00:16:20,270 --> 00:16:21,539 or your Android or Apple Pay. 376 00:16:21,539 --> 00:16:23,080 AUDIENCE: Account based [INAUDIBLE].. 377 00:16:23,080 --> 00:16:24,788 GABRIEL SANCHEZ-MARTINEZ: Account based-- 378 00:16:24,788 --> 00:16:27,860 and essentially, what you have is a token with an ID. 379 00:16:27,860 --> 00:16:32,020 And then, the balance is not even stored on your card. 380 00:16:32,020 --> 00:16:36,320 The account server is handling the balance and those things. 381 00:16:36,320 --> 00:16:39,740 So much more difficult to break. 382 00:16:39,740 --> 00:16:42,380 Yup. 383 00:16:42,380 --> 00:16:45,950 OK, AVL systems, or Automatic Vehicle Location systems-- 384 00:16:45,950 --> 00:16:49,250 so these are systems that track vehicle movement. 385 00:16:49,250 --> 00:16:51,490 So for bus, they tend to be based on GPS. 
386 00:16:51,490 --> 00:16:54,520 You have GPS on a bus, on the top of the bus, a little hub. 387 00:16:54,520 --> 00:16:58,960 And it collects data every five seconds or every 10 seconds. 388 00:16:58,960 --> 00:17:04,119 And these positions might get sent either in real-time, 389 00:17:04,119 --> 00:17:07,089 or maybe they get stored on the onboard computer 390 00:17:07,089 --> 00:17:10,920 and then are extracted when the bus reaches the garage. 391 00:17:10,920 --> 00:17:17,160 So just GPS-- sophisticated AVL systems for bus 392 00:17:17,160 --> 00:17:21,930 also have gyroscopes to do inertial navigation and dead 393 00:17:21,930 --> 00:17:25,380 reckoning, especially when the GPS precision drops. 394 00:17:25,380 --> 00:17:28,830 And that happens especially with the urban canyon effect. 395 00:17:28,830 --> 00:17:31,540 If you have tall buildings, GPS signal bounces around. 396 00:17:31,540 --> 00:17:36,950 The dilution of precision messes up the position of the bus. 397 00:17:36,950 --> 00:17:38,790 Or maybe you're entering a tunnel, 398 00:17:38,790 --> 00:17:42,210 and you want to continue to get updates 399 00:17:42,210 --> 00:17:43,800 of positions inside the tunnel. 400 00:17:43,800 --> 00:17:45,390 So this is a temporary system that 401 00:17:45,390 --> 00:17:49,500 kicks in and interpolates positions and figures 402 00:17:49,500 --> 00:17:51,780 out how the bus is moving. 403 00:17:51,780 --> 00:17:54,119 For a train, it's usually based on track circuits. 404 00:17:54,119 --> 00:17:56,160 So we're going to talk more about track circuits. 405 00:17:56,160 --> 00:17:59,160 But essentially, a track knows if a train 406 00:17:59,160 --> 00:18:02,640 is occupying that segment or not occupying that segment. 407 00:18:02,640 --> 00:18:09,570 And there are often some sensors that read with RFID technology 408 00:18:09,570 --> 00:18:11,670 the ID number of a car. 
409 00:18:11,670 --> 00:18:14,190 And sometimes, you have a sensor in the front of each car 410 00:18:14,190 --> 00:18:15,750 and [AUDIO OUT] each car. 411 00:18:15,750 --> 00:18:20,490 And so a computer will look up the sequence of readings 412 00:18:20,490 --> 00:18:23,610 and follow track circuits as they are being occupied 413 00:18:23,610 --> 00:18:25,560 and unoccupied-- 414 00:18:25,560 --> 00:18:29,530 and in that manner, track trains throughout the system. 415 00:18:29,530 --> 00:18:32,790 These systems were put in place mostly for safety 416 00:18:32,790 --> 00:18:35,670 to prevent train crashes. 417 00:18:35,670 --> 00:18:39,330 And because of that, you needed to know where a bus 418 00:18:39,330 --> 00:18:41,310 or where a train was. 419 00:18:41,310 --> 00:18:42,900 They are available in real-time. 420 00:18:42,900 --> 00:18:44,460 They were designed from the beginning 421 00:18:44,460 --> 00:18:46,000 to track vehicles in real-time. 422 00:18:46,000 --> 00:18:48,086 So that's what we have. 423 00:18:48,086 --> 00:18:49,710 I guess what's newer is that now, we're 424 00:18:49,710 --> 00:18:52,650 collecting them and keeping them in a data warehouse 425 00:18:52,650 --> 00:18:54,730 so that we can analyze running times. 426 00:18:54,730 --> 00:18:56,895 AUDIENCE: [INAUDIBLE] these systems have benefit 427 00:18:56,895 --> 00:18:58,320 to the consumer? 428 00:18:58,320 --> 00:18:58,680 GABRIEL SANCHEZ-MARTINEZ: They do. 429 00:18:58,680 --> 00:19:00,638 And that's the newest thing that has happened-- 430 00:19:00,638 --> 00:19:02,460 that nobody thought about consumers when 431 00:19:02,460 --> 00:19:04,080 they were put in place. 432 00:19:04,080 --> 00:19:07,110 So yeah, we are talking about tracking, 433 00:19:07,110 --> 00:19:09,780 knowing how many minutes I have to wait for my bus, 434 00:19:09,780 --> 00:19:10,725 for example.
435 00:19:10,725 --> 00:19:13,380 And those things are pushed through a public API, 436 00:19:13,380 --> 00:19:16,200 so that if I'm a smartphone app developer, 437 00:19:16,200 --> 00:19:19,950 I can go ahead and pull data from this next bus app 438 00:19:19,950 --> 00:19:20,979 and make an app. 439 00:19:20,979 --> 00:19:23,520 And so people can download it, and they know how many minutes 440 00:19:23,520 --> 00:19:24,380 they have to wait. 441 00:19:24,380 --> 00:19:27,800 Yeah, so definitely. 442 00:19:27,800 --> 00:19:31,170 So we have seen a lot of AVL being pushed in that manner. 443 00:19:31,170 --> 00:19:35,850 We have not seen so much AFC data or APC data being pushed. 444 00:19:35,850 --> 00:19:37,980 Obviously, you wouldn't want all the details 445 00:19:37,980 --> 00:19:39,840 of AFC being pushed. 446 00:19:39,840 --> 00:19:42,540 But you might want to know how crowded is my next bus, 447 00:19:42,540 --> 00:19:45,100 or how crowded is my next train. 448 00:19:45,100 --> 00:19:46,860 And you might actually alter your decision 449 00:19:46,860 --> 00:19:48,780 whether to wait for a crowded train 450 00:19:48,780 --> 00:19:52,640 or walk a longer time based on that information. 451 00:19:52,640 --> 00:19:54,127 So that's coming. 452 00:19:54,127 --> 00:19:55,710 I think, in the next few years, that's 453 00:19:55,710 --> 00:19:57,900 going to start happening. 454 00:19:57,900 --> 00:20:00,690 So passenger counting-- many different technologies exist. 455 00:20:00,690 --> 00:20:05,700 For bus, we tend to have these optical sensors in the back. 456 00:20:05,700 --> 00:20:08,640 You might see them if you pay attention-- 457 00:20:08,640 --> 00:20:09,740 broken beam sensors. 458 00:20:09,740 --> 00:20:12,210 They look like two little eyes with two little mirrors 459 00:20:12,210 --> 00:20:13,320 on each door. 
460 00:20:13,320 --> 00:20:16,260 And so when you cross the beams, if you 461 00:20:16,260 --> 00:20:18,870 cross one beam first and then the other, 462 00:20:18,870 --> 00:20:20,280 that sensor will know-- 463 00:20:20,280 --> 00:20:22,230 is a person coming into the bus? 464 00:20:22,230 --> 00:20:24,270 Or is a person exiting the bus? 465 00:20:24,270 --> 00:20:26,100 And you have that at each door. 466 00:20:26,100 --> 00:20:31,470 And it counts those beam crossings going in and going out. 467 00:20:31,470 --> 00:20:34,110 And often, this is slightly inaccurate. 468 00:20:34,110 --> 00:20:36,780 So you might get more boardings than alightings for a given trip. 469 00:20:36,780 --> 00:20:39,150 So at the end of a trip, whatever 470 00:20:39,150 --> 00:20:41,950 remains in terms of imbalance between boardings and alightings 471 00:20:41,950 --> 00:20:42,900 gets zeroed out. 472 00:20:42,900 --> 00:20:46,910 And the error is distributed throughout that trip 473 00:20:46,910 --> 00:20:48,372 that was just run. 474 00:20:48,372 --> 00:20:50,580 And often, you still have to do some error correction 475 00:20:50,580 --> 00:20:51,360 after that. 476 00:20:51,360 --> 00:20:54,420 But it's a way of counting people getting on and off. 477 00:20:54,420 --> 00:20:57,060 And that's useful to get how many people are riding 478 00:20:57,060 --> 00:21:00,330 the system and also the passenger miles-- 479 00:21:00,330 --> 00:21:02,720 the passengers multiplied by distance, which is often 480 00:21:02,720 --> 00:21:07,380 a required reporting element in things like the NTD, 481 00:21:07,380 --> 00:21:10,020 the National Transit Database. 482 00:21:10,020 --> 00:21:14,420 So for rail systems, we have gates 483 00:21:14,420 --> 00:21:16,231 that count how many times they open 484 00:21:16,231 --> 00:21:17,480 and how many times they close. 485 00:21:17,480 --> 00:21:21,530 So you might have that kind of counting in rail.
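[EDITOR'S NOTE: The end-of-trip balancing described for bus APC counts can be sketched as follows. This is a minimal illustration, not the method of any particular APC vendor: it assumes the imbalance is spread proportionally over the alightings, and the function name `balance_trip` and the sample counts are hypothetical.]

```python
def balance_trip(ons, offs):
    """Scale alightings so that total offs equal total ons for one trip.

    Assumption: the end-of-trip imbalance is distributed in proportion
    to the raw alighting counts. Real systems may use other rules.
    """
    total_on = sum(ons)
    total_off = sum(offs)
    if total_off == 0:
        raise ValueError("no alightings recorded; cannot rescale")
    factor = total_on / total_off
    adjusted_offs = [off * factor for off in offs]

    # Running load after each stop, using the adjusted alightings.
    loads = []
    load = 0.0
    for on, off in zip(ons, adjusted_offs):
        load += on - off
        loads.append(load)
    return adjusted_offs, loads


# Hypothetical raw counts by stop: 15 boardings but only 14 alightings.
raw_ons = [10, 5, 0, 0]
raw_offs = [0, 3, 6, 5]
adjusted_offs, loads = balance_trip(raw_ons, raw_offs)
```

[After balancing, the load leaving the last stop is zero, which is the "zeroed out" condition. Intermediate loads can still look implausible, which is why further error correction is often needed.]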
486 00:21:21,530 --> 00:21:23,150 You also have video-based counting-- 487 00:21:23,150 --> 00:21:27,710 so camera feeds that can be hooked up 488 00:21:27,710 --> 00:21:31,990 to a system that will essentially track nodes moving 489 00:21:31,990 --> 00:21:33,270 inside that frame. 490 00:21:33,270 --> 00:21:36,260 And you can count things that cross a certain line, 491 00:21:36,260 --> 00:21:37,370 for example. 492 00:21:37,370 --> 00:21:42,020 And you could do that to count flows. 493 00:21:42,020 --> 00:21:45,200 And then for train, we also have the weight systems. 494 00:21:45,200 --> 00:21:47,870 So this is only in trains. 495 00:21:47,870 --> 00:21:50,780 The braking systems in trains apply braking force 496 00:21:50,780 --> 00:21:53,780 in proportion to the load on each car. 497 00:21:53,780 --> 00:21:55,340 So if you have a very heavy car, you 498 00:21:55,340 --> 00:21:58,490 need to apply stronger braking force than in a car that 499 00:21:58,490 --> 00:22:00,050 is almost empty. 500 00:22:00,050 --> 00:22:04,400 If you don't do that, then you apply a lot more force 501 00:22:04,400 --> 00:22:06,430 per weight on the lighter car. 502 00:22:06,430 --> 00:22:10,070 That car is going to be the one pushing the other cars 503 00:22:10,070 --> 00:22:12,530 or pulling the other cars through the coupling. 504 00:22:12,530 --> 00:22:14,480 And that will eventually break the [INAUDIBLE] 505 00:22:14,480 --> 00:22:15,360 at a faster rate. 506 00:22:15,360 --> 00:22:18,560 So what you want is, each car to slow down 507 00:22:18,560 --> 00:22:21,690 at the same rate by itself as much as possible. 508 00:22:21,690 --> 00:22:24,620 And for that, you need to brake in proportion to the weight. 509 00:22:24,620 --> 00:22:26,630 And therefore, you have these weight systems. 510 00:22:26,630 --> 00:22:29,000 They used to just do that. 
511 00:22:29,000 --> 00:22:30,680 And more recently, we hooked them 512 00:22:30,680 --> 00:22:33,770 up to a little storage device that 513 00:22:33,770 --> 00:22:36,830 keeps track of the weight and maybe Wi-Fi, 514 00:22:36,830 --> 00:22:39,410 so that each time it reaches a station or the terminal, 515 00:22:39,410 --> 00:22:40,930 it sends the data off. 516 00:22:40,930 --> 00:22:47,240 And we might have a somewhat imprecise 517 00:22:47,240 --> 00:22:50,600 idea of how many people are in the car just based 518 00:22:50,600 --> 00:22:54,440 on an average weight of a person. 519 00:22:54,440 --> 00:22:56,940 And these are traditionally not available in real-time. 520 00:22:56,940 --> 00:22:57,710 [INAUDIBLE] you have questions? 521 00:22:57,710 --> 00:22:58,090 Yeah? 522 00:22:58,090 --> 00:22:59,410 AUDIENCE: You could also just reconcile it 523 00:22:59,410 --> 00:23:00,760 with the other system, right? 524 00:23:00,760 --> 00:23:01,370 GABRIEL SANCHEZ-MARTINEZ: Of course, yeah. 525 00:23:01,370 --> 00:23:02,250 AUDIENCE: So if you have-- 526 00:23:02,250 --> 00:23:02,460 [INTERPOSING VOICES] 527 00:23:02,460 --> 00:23:02,945 GABRIEL SANCHEZ-MARTINEZ: Yeah. 528 00:23:02,945 --> 00:23:05,370 AUDIENCE: --people early can transport to get on to. 529 00:23:05,370 --> 00:23:05,520 GABRIEL SANCHEZ-MARTINEZ: Yeah. 530 00:23:05,520 --> 00:23:06,250 AUDIENCE: [INAUDIBLE] 531 00:23:06,250 --> 00:23:07,420 GABRIEL SANCHEZ-MARTINEZ: Yeah, definitely. 532 00:23:07,420 --> 00:23:07,920 Yeah. 533 00:23:07,920 --> 00:23:11,900 And that's cutting edge research that's happening right now. 534 00:23:11,900 --> 00:23:14,570 How do you do data fusion and merge different systems? 535 00:23:14,570 --> 00:23:15,650 They all have errors. 536 00:23:15,650 --> 00:23:17,100 And how do you detect when one is 537 00:23:17,100 --> 00:23:18,350 more erroneous than the other?
538 00:23:18,350 --> 00:23:20,420 And how do you mix these data sources 539 00:23:20,420 --> 00:23:23,847 to get the most precise, not just loads, but paths 540 00:23:23,847 --> 00:23:25,430 within a network and things like that. 541 00:23:25,430 --> 00:23:26,460 Yeah. 542 00:23:26,460 --> 00:23:31,039 So any questions on these three very important automatic data 543 00:23:31,039 --> 00:23:31,830 collection systems? 544 00:23:31,830 --> 00:23:32,640 AUDIENCE: [INAUDIBLE] 545 00:23:32,640 --> 00:23:33,889 GABRIEL SANCHEZ-MARTINEZ: Yup. 546 00:23:33,889 --> 00:23:41,426 AUDIENCE: So if there [INAUDIBLE] 547 00:23:41,426 --> 00:23:45,782 AVL, what kind of reason can be [INAUDIBLE]? 548 00:23:45,782 --> 00:23:47,490 GABRIEL SANCHEZ-MARTINEZ: So the question 549 00:23:47,490 --> 00:23:52,780 is, why might some of these technologies produce errors? 550 00:23:52,780 --> 00:23:55,150 And in particular, you're asking about AVL. 551 00:23:55,150 --> 00:23:58,090 So each of these has a different behavior. 552 00:23:58,090 --> 00:24:01,030 And within each of these categories of technologies, 553 00:24:01,030 --> 00:24:04,870 each vendor's system might have specific things that happen. 554 00:24:04,870 --> 00:24:06,730 With AVL, the most common thing is 555 00:24:06,730 --> 00:24:10,900 end-of-route problems-- detecting when a trip actually 556 00:24:10,900 --> 00:24:12,460 begins and ends. 557 00:24:12,460 --> 00:24:17,020 So AVL systems, you have this GPS 558 00:24:17,020 --> 00:24:18,450 coming in every five seconds. 559 00:24:18,450 --> 00:24:20,950 Depending on your chip set, you might get it more frequently 560 00:24:20,950 --> 00:24:21,450 than that. 561 00:24:21,450 --> 00:24:25,230 But you also actually sometimes hook it to the doors. 562 00:24:25,230 --> 00:24:28,420 So if the door is opening, you say, well, I must be at a stop. 563 00:24:28,420 --> 00:24:30,880 And therefore, let me find which one is closest.
564 00:24:30,880 --> 00:24:32,540 So there are ways to correct it. 565 00:24:32,540 --> 00:24:34,750 But when you get to the end of the route, 566 00:24:34,750 --> 00:24:37,430 it's not clear always-- have you finished your trip? 567 00:24:37,430 --> 00:24:41,290 Or rather, are you starting your trip already? 568 00:24:41,290 --> 00:24:45,970 So maybe if the terminal is at the same place on the trip-- 569 00:24:45,970 --> 00:24:47,710 the previous trip ends at the same place 570 00:24:47,710 --> 00:24:49,960 that the next trip begins, there might 571 00:24:49,960 --> 00:24:53,950 be a time where the doors open and close various times. 572 00:24:53,950 --> 00:24:56,140 And the trip isn't ready to leave yet. 573 00:24:56,140 --> 00:24:58,810 And so you really have to wait to see the bus leaving 574 00:24:58,810 --> 00:25:00,370 that terminal and moving. 575 00:25:00,370 --> 00:25:01,900 Sometimes, there are false starts. 576 00:25:01,900 --> 00:25:06,040 So maybe another bus comes along, and it needs that space. 577 00:25:06,040 --> 00:25:10,270 So the driver moves the bus a few meters forward. 578 00:25:10,270 --> 00:25:13,880 And the system thinks my trip has started. 579 00:25:13,880 --> 00:25:16,130 And then, when you're looking at aggregate data, 580 00:25:16,130 --> 00:25:19,120 you're looking at, say, running times at the trip level. 581 00:25:19,120 --> 00:25:21,940 You see these outliers with very long times. 582 00:25:21,940 --> 00:25:23,500 And if you were to plot them by stop, 583 00:25:23,500 --> 00:25:25,510 you see that the link between the first stop 584 00:25:25,510 --> 00:25:29,360 and the second stop is sometimes very high, 15 minutes. 585 00:25:29,360 --> 00:25:30,880 And so you can throw those out. 586 00:25:30,880 --> 00:25:33,850 Or you can do some interpolation or imputation of data.
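[EDITOR'S NOTE: One way to flag the false-start outliers described here is a robust threshold on the first-link running time. The sketch below is only an assumption about how such a screen might look: the median-plus-scaled-MAD rule, the cutoff `k`, and the sample values are illustrative and do not come from the lecture.]

```python
from statistics import median


def flag_false_starts(first_link_minutes, k=3.0):
    """Return indices of trips whose first-link time is suspiciously long.

    Uses median + k * 1.4826 * MAD as the threshold (a common robust rule).
    If the MAD is zero, every value above the median gets flagged, so the
    rule should be checked against real data before use.
    """
    med = median(first_link_minutes)
    abs_dev = [abs(x - med) for x in first_link_minutes]
    scale = 1.4826 * median(abs_dev)
    threshold = med + k * scale
    return [i for i, x in enumerate(first_link_minutes) if x > threshold]


# Hypothetical first-stop-to-second-stop times, in minutes, for seven trips.
times = [2.0, 2.5, 3.0, 2.2, 15.0, 2.8, 2.4]
suspect = flag_false_starts(times)
```

[Flagged trips can then be dropped or have the first link imputed, for example from the median of the unflagged trips.]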
587 00:25:33,850 --> 00:25:36,880 Some systems that care very much about that 588 00:25:36,880 --> 00:25:40,240 will purposely place the terminal 589 00:25:40,240 --> 00:25:45,310 stops sufficiently far apart to prevent that 590 00:25:45,310 --> 00:25:48,210 from happening because it is a problem. 591 00:25:48,210 --> 00:25:52,050 And this data is crucial to planning service and figuring 592 00:25:52,050 --> 00:25:54,610 out how much resource you're going to put into each route. 593 00:25:54,610 --> 00:25:56,856 So yup. 594 00:25:56,856 --> 00:26:03,758 AUDIENCE: For tap cards, [INAUDIBLE] and metros, 595 00:26:03,758 --> 00:26:07,702 some of them we have to tap out to exit. 596 00:26:07,702 --> 00:26:09,903 It is because of variable [INAUDIBLE].. 597 00:26:09,903 --> 00:26:11,153 GABRIEL SANCHEZ-MARTINEZ: Yes. 598 00:26:11,153 --> 00:26:14,604 AUDIENCE: But in some systems, it's still a flat fare. 599 00:26:14,604 --> 00:26:16,083 You still have to tap out. 600 00:26:16,083 --> 00:26:18,548 Is the reason behind that mostly data collection? 601 00:26:18,548 --> 00:26:20,766 Or is there anything [INAUDIBLE] you're 602 00:26:20,766 --> 00:26:22,695 going to still have to tap out [INAUDIBLE]?? 603 00:26:22,695 --> 00:26:24,170 GABRIEL SANCHEZ-MARTINEZ: So yeah, no examples of it 604 00:26:24,170 --> 00:26:24,999 come to mind. 605 00:26:24,999 --> 00:26:25,790 You might know one. 606 00:26:25,790 --> 00:26:27,360 AUDIENCE: MARTA? 607 00:26:27,360 --> 00:26:29,460 GABRIEL SANCHEZ-MARTINEZ: OK, I haven't visited. 608 00:26:29,460 --> 00:26:31,890 So yeah, data collection might be a reason to do that. 609 00:26:31,890 --> 00:26:35,640 But I'll have to get back to you on why MARTA did that. 610 00:26:35,640 --> 00:26:41,090 But yeah, most systems that have controls in and out 611 00:26:41,090 --> 00:26:43,680 are for fare policy reasons and not 612 00:26:43,680 --> 00:26:46,560 for data collection reasons. 
613 00:26:46,560 --> 00:26:49,980 We're starting to see more interest in data collection 614 00:26:49,980 --> 00:26:53,797 and in investing on these technologies just 615 00:26:53,797 --> 00:26:54,630 for data collection. 616 00:26:54,630 --> 00:26:58,220 So maybe-- but I'll have to check and get back to you. 617 00:26:58,220 --> 00:27:01,177 AUDIENCE: You mentioned some systems separate their depots 618 00:27:01,177 --> 00:27:03,260 to not confuse the end [? from the start point. ?] 619 00:27:03,260 --> 00:27:03,515 [INTERPOSING VOICES] 620 00:27:03,515 --> 00:27:04,760 GABRIEL SANCHEZ-MARTINEZ: Their terminal stops, yeah. 621 00:27:04,760 --> 00:27:06,720 AUDIENCE: What are some examples of those? 622 00:27:06,720 --> 00:27:10,510 GABRIEL SANCHEZ-MARTINEZ: TFL will do that in London, yeah. 623 00:27:10,510 --> 00:27:11,960 Yeah, so they'll monitor this. 624 00:27:11,960 --> 00:27:17,110 And if they see that this is occurring often, 625 00:27:17,110 --> 00:27:20,319 they will separate the stops a bit. 626 00:27:20,319 --> 00:27:22,110 And the reason they do that is because they 627 00:27:22,110 --> 00:27:26,190 have people whose job it is to impute data 628 00:27:26,190 --> 00:27:27,420 when it's incorrect. 629 00:27:27,420 --> 00:27:30,330 So if they don't do that, and the system is consistently 630 00:27:30,330 --> 00:27:32,170 producing bad data, then that means 631 00:27:32,170 --> 00:27:35,850 they're going to have to spend human resources on correcting 632 00:27:35,850 --> 00:27:37,050 that data. 633 00:27:37,050 --> 00:27:38,520 So at some point, it's just easier 634 00:27:38,520 --> 00:27:40,350 to move the stop a little bit. 635 00:27:40,350 --> 00:27:42,427 It doesn't have to be a long distance. 636 00:27:42,427 --> 00:27:43,135 AUDIENCE: Got it. 
637 00:27:43,135 --> 00:27:45,260 GABRIEL SANCHEZ-MARTINEZ: Just don't make them the same, 638 00:27:45,260 --> 00:27:48,030 and make them far enough apart that the geofences can 639 00:27:48,030 --> 00:27:51,180 be told apart from each other. 640 00:27:51,180 --> 00:27:51,680 Alright? 641 00:27:51,680 --> 00:27:54,382 AUDIENCE: Really small scale data of the EZRide who I work 642 00:27:54,382 --> 00:27:57,922 for, actually you could see real-time bus loads 643 00:27:57,922 --> 00:27:59,340 [INAUDIBLE]-- 644 00:27:59,340 --> 00:28:02,349 GABRIEL SANCHEZ-MARTINEZ: Oh, interesting. 645 00:28:02,349 --> 00:28:04,890 AUDIENCE: --which was actually helpful if you're dispatching, 646 00:28:04,890 --> 00:28:07,950 and you know a bus is getting through people on it. 647 00:28:07,950 --> 00:28:08,450 [INAUDIBLE] 648 00:28:08,450 --> 00:28:10,450 GABRIEL SANCHEZ-MARTINEZ: Yeah, for real-time control. 649 00:28:10,450 --> 00:28:11,070 [INTERPOSING VOICES] 650 00:28:11,070 --> 00:28:12,778 AUDIENCE: But the terminal at our station 651 00:28:12,778 --> 00:28:15,082 had a drop-off point and a pick-up point. 652 00:28:15,082 --> 00:28:18,004 The drop-off point was before layover [INAUDIBLE] 653 00:28:18,004 --> 00:28:21,570 was after for this exact reason to make sure 654 00:28:21,570 --> 00:28:23,361 that it will go through the drop-off point, 655 00:28:23,361 --> 00:28:25,009 reset, until people get off of it. 656 00:28:25,009 --> 00:28:26,300 GABRIEL SANCHEZ-MARTINEZ: Yeah. 657 00:28:26,300 --> 00:28:27,192 Yeah, so it happens. 658 00:28:27,192 --> 00:28:28,025 [INTERPOSING VOICES] 659 00:28:28,025 --> 00:28:28,900 AUDIENCE: Definitely. 660 00:28:28,900 --> 00:28:31,179 [INAUDIBLE] 661 00:28:31,179 --> 00:28:33,262 GABRIEL SANCHEZ-MARTINEZ: That sounds about right.
662 00:28:36,324 --> 00:28:37,740 OK, if there are no more questions 663 00:28:37,740 --> 00:28:41,859 on the three very important categories of automated data 664 00:28:41,859 --> 00:28:43,650 collection systems, let's talk a little bit 665 00:28:43,650 --> 00:28:46,360 about the data collection program design process. 666 00:28:46,360 --> 00:28:49,920 So this comes from before automatic data collection. 667 00:28:49,920 --> 00:28:53,179 And nowadays, we think a little bit less about this. 668 00:28:53,179 --> 00:28:54,220 But it's still important. 669 00:28:54,220 --> 00:28:59,010 So if you do need to collect some data, 670 00:28:59,010 --> 00:29:01,500 there's a structure that you can follow to do it properly 671 00:29:01,500 --> 00:29:03,660 and to make sure that you collect data efficiently, 672 00:29:03,660 --> 00:29:06,624 so that you don't spend too much resources on data collection 673 00:29:06,624 --> 00:29:08,790 and that you can answer your policy or your planning 674 00:29:08,790 --> 00:29:09,840 questions. 675 00:29:09,840 --> 00:29:15,060 So based on your needs and the properties of your agency, 676 00:29:15,060 --> 00:29:17,400 I say here, determine property characteristics. 677 00:29:17,400 --> 00:29:18,630 That's a North American term. 678 00:29:18,630 --> 00:29:20,050 A property is an agency. 679 00:29:20,050 --> 00:29:23,259 So if you see that, that's an agency. 680 00:29:23,259 --> 00:29:25,800 So based on the characteristics of the service you're running 681 00:29:25,800 --> 00:29:28,811 and your data needs, you can select some data collection 682 00:29:28,811 --> 00:29:29,310 technique. 683 00:29:29,310 --> 00:29:31,690 We'll get into what some of these are. 684 00:29:31,690 --> 00:29:35,070 Then, you can develop route-by-route sampling plans 685 00:29:35,070 --> 00:29:39,810 based on how variable the data is in each case. 686 00:29:39,810 --> 00:29:41,900 And you can determine how many checkers do I need. 
687 00:29:41,900 --> 00:29:44,760 A checker is a person who goes out and collects data. 688 00:29:44,760 --> 00:29:46,770 And then from that, the cost-- 689 00:29:46,770 --> 00:29:47,790 so human resources. 690 00:29:47,790 --> 00:29:49,440 It's a planning exercise. 691 00:29:49,440 --> 00:29:52,740 And what we do usually is that we conduct a baseline phase. 692 00:29:52,740 --> 00:29:57,280 So that's the first time you go out and collect data. 693 00:29:57,280 --> 00:30:00,150 You don't know much about what you're 694 00:30:00,150 --> 00:30:01,970 wanting to collect data on. 695 00:30:01,970 --> 00:30:06,870 So it might be OD matrices, or loads, 696 00:30:06,870 --> 00:30:09,600 the people getting on and off. 697 00:30:09,600 --> 00:30:13,350 So you have to go out and do a bigger effort. 698 00:30:13,350 --> 00:30:15,810 And that's called the baseline phase effort. 699 00:30:15,810 --> 00:30:19,020 Once you've done that and you've established some tendencies, 700 00:30:19,020 --> 00:30:22,080 you might want to monitor that to see if it changes. 701 00:30:22,080 --> 00:30:25,530 So then, you do a lighter weight data collection effort, where 702 00:30:25,530 --> 00:30:29,220 you go out and less frequently, using fewer resources, 703 00:30:29,220 --> 00:30:31,980 you collect sometimes the same thing. 704 00:30:31,980 --> 00:30:37,890 Or sometimes, you observe something else that is related 705 00:30:37,890 --> 00:30:41,100 or can be correlated with what you really want. 706 00:30:41,100 --> 00:30:44,340 And then based on a relationship between the two, 707 00:30:44,340 --> 00:30:46,560 you can estimate what you really want. 708 00:30:46,560 --> 00:30:50,090 So you can monitor what you collected.
709 00:30:50,090 --> 00:30:51,870 And then, if you detect that there's 710 00:30:51,870 --> 00:30:54,360 been a trend or a change, and you need to investigate it 711 00:30:54,360 --> 00:30:57,420 further, you might go ahead and repeat the baseline phase 712 00:30:57,420 --> 00:30:59,530 to increase your accuracy. 713 00:30:59,530 --> 00:31:04,080 So one of the catches of this is that to determine 714 00:31:04,080 --> 00:31:06,600 sampling plans, to determine required sample 715 00:31:06,600 --> 00:31:09,700 sizes to achieve some confidence interval, 716 00:31:09,700 --> 00:31:12,120 you need to know how variable your data is. 717 00:31:12,120 --> 00:31:15,030 And if you haven't collected it yet, you don't know. 718 00:31:15,030 --> 00:31:18,350 So you might have some default values that you resort to. 719 00:31:18,350 --> 00:31:20,800 And we'll get to that later in this lecture. 720 00:31:20,800 --> 00:31:22,770 But you might also do a pre-test, where 721 00:31:22,770 --> 00:31:24,270 you send some people out, and you 722 00:31:24,270 --> 00:31:27,150 collect some data to really start 723 00:31:27,150 --> 00:31:30,030 to get a sense of how variable is it, 724 00:31:30,030 --> 00:31:35,090 and how big will my sample requirements be, 725 00:31:35,090 --> 00:31:37,560 and how much will it cost for me to do this. 726 00:31:37,560 --> 00:31:40,810 So this is the process that you might follow. 727 00:31:40,810 --> 00:31:44,622 And there are different data needs by the question 728 00:31:44,622 --> 00:31:45,830 that you're trying to answer. 729 00:31:45,830 --> 00:31:48,430 So one way of looking at that is, are you 730 00:31:48,430 --> 00:31:51,130 collecting things that are for specific routes, 731 00:31:51,130 --> 00:31:54,070 or for specific route segments, or at the stop level? 732 00:31:54,070 --> 00:31:57,950 Or are you using more aggregate system level data collection? 733 00:31:57,950 --> 00:32:00,100 Are your questions more system level? 
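[EDITOR'S NOTE: The link between variability and required sample size mentioned above can be illustrated with the textbook formula n = (z * cv / d)^2, where cv is the coefficient of variation, d is the relative tolerance, and z is the normal quantile for the chosen confidence level. This is a generic sketch, not necessarily the formula used later in the lecture; the function names and pre-test values are hypothetical.]

```python
import math
from statistics import NormalDist, mean, stdev


def required_sample_size(cv, tolerance, confidence=0.90):
    """Sample size so the mean is within +/- tolerance (relative) at the
    given confidence, assuming approximately normal sample means."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return math.ceil((z * cv / tolerance) ** 2)


def cv_from_pretest(observations):
    """Estimate the coefficient of variation from a small pre-test."""
    return stdev(observations) / mean(observations)


# With a default cv of 0.3 and a +/-10% tolerance at 90% confidence:
n_default = required_sample_size(cv=0.3, tolerance=0.10, confidence=0.90)

# With a cv estimated from hypothetical pre-test running times (minutes):
pretest = [31.0, 28.5, 35.0, 30.0, 33.5, 29.0]
n_pretest = required_sample_size(cv_from_pretest(pretest), tolerance=0.10)
```

[A more variable quantity, or a tighter tolerance, drives the required sample size up quadratically, which is why a pre-test or a sensible default cv matters for budgeting checkers.]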
734 00:32:00,100 --> 00:32:02,920 So system-level things are more about reporting, 735 00:32:02,920 --> 00:32:06,730 and they might be tied to things like federal funding. 736 00:32:06,730 --> 00:32:09,610 Whereas route-level things and stop-level things 737 00:32:09,610 --> 00:32:12,020 are more important for planning. 738 00:32:12,020 --> 00:32:14,860 So when we talk about route and route segment level, 739 00:32:14,860 --> 00:32:17,350 we're looking at things like loads at the peak load points 740 00:32:17,350 --> 00:32:18,580 or at some other key points. 741 00:32:18,580 --> 00:32:20,900 How many people are in the bus? 742 00:32:20,900 --> 00:32:23,980 The running times by segment, 743 00:32:23,980 --> 00:32:26,260 to do a schedule that has time points, 744 00:32:26,260 --> 00:32:30,010 or maybe end-to-end for your operations plan. 745 00:32:30,010 --> 00:32:32,470 Schedule adherence-- are these buses running on time? 746 00:32:32,470 --> 00:32:34,870 Or are my schedules not realistic? 747 00:32:34,870 --> 00:32:37,120 Total boardings or revenue, two things 748 00:32:37,120 --> 00:32:41,590 that are highly correlated-- so number of passenger trips. 749 00:32:41,590 --> 00:32:44,014 Boardings by fare category-- so you might say, 750 00:32:44,014 --> 00:32:45,430 well, I want boardings, but I want 751 00:32:45,430 --> 00:32:47,170 to know how many seniors are using this, 752 00:32:47,170 --> 00:32:50,500 and how many students are using this, and how many people are 753 00:32:50,500 --> 00:32:52,540 using monthly passes, and how many people are 754 00:32:52,540 --> 00:32:55,750 using pay-per-ride. 755 00:32:55,750 --> 00:32:58,600 So you have different fare categories. 756 00:32:58,600 --> 00:33:03,430 And you might want to segregate the data by that. 757 00:33:03,430 --> 00:33:05,920 You might want passenger boardings and alightings by stop.
758 00:33:05,920 --> 00:33:07,840 So that's what APC would give you 759 00:33:07,840 --> 00:33:10,900 if you have an automated system. 760 00:33:10,900 --> 00:33:13,540 But you might also use a ride checker, who sits on the bus 761 00:33:13,540 --> 00:33:16,660 and counts people boarding and alighting. 762 00:33:16,660 --> 00:33:19,690 Transfer rates between routes-- maybe you're 763 00:33:19,690 --> 00:33:23,460 looking at changing service so that people 764 00:33:23,460 --> 00:33:25,920 don't have to transfer. 765 00:33:25,920 --> 00:33:28,440 Passenger characteristics and attitudes-- this usually 766 00:33:28,440 --> 00:33:31,020 requires some degree of survey, where 767 00:33:31,020 --> 00:33:35,622 you ask people things, passenger travel patterns. 768 00:33:35,622 --> 00:33:37,080 At the system level, we have things 769 00:33:37,080 --> 00:33:39,840 like unlinked passenger trips, passenger miles, linked 770 00:33:39,840 --> 00:33:40,817 passenger trips. 771 00:33:40,817 --> 00:33:42,150 This is at the whole system level. 772 00:33:42,150 --> 00:33:45,990 So sometimes, you do route level or route segment level 773 00:33:45,990 --> 00:33:47,400 analysis, and then, you aggregate 774 00:33:47,400 --> 00:33:48,750 to get the system-level things. 775 00:33:48,750 --> 00:33:50,970 That's usually how you proceed. 776 00:33:50,970 --> 00:33:54,120 But the requirements in terms of how many of these 777 00:33:54,120 --> 00:33:56,204 you have to sample might be different.
778 00:33:56,204 --> 00:33:58,620 So if you want to achieve a certain accuracy at the system 779 00:33:58,620 --> 00:34:01,260 level, you don't need to achieve the accuracy 780 00:34:01,260 --> 00:34:04,830 for each of the routes that are in that system 781 00:34:04,830 --> 00:34:07,740 because you might have-- 782 00:34:07,740 --> 00:34:11,400 so if you want to say 90% confidence 783 00:34:11,400 --> 00:34:15,810 in some system-level data element, 784 00:34:15,810 --> 00:34:19,239 you might only need 80% or 70% at the route level. 785 00:34:19,239 --> 00:34:21,000 And once you bring those all together, 786 00:34:21,000 --> 00:34:23,159 you achieve the 90% that you need. 787 00:34:23,159 --> 00:34:27,840 So data inference, I talked about how sometimes we 788 00:34:27,840 --> 00:34:33,280 can infer items if we don't observe them directly. 789 00:34:33,280 --> 00:34:36,540 So from AFC, which is the automatic fare collection system, 790 00:34:36,540 --> 00:34:39,449 we have boardings because people are tapping into the bus 791 00:34:39,449 --> 00:34:41,560 or tapping into the subway system. 792 00:34:41,560 --> 00:34:44,909 And if we have APC, we count people getting on. 793 00:34:44,909 --> 00:34:49,360 So we can look at total number of boardings that way, 794 00:34:49,360 --> 00:34:50,670 if that makes sense. 795 00:34:50,670 --> 00:34:51,690 That's pretty direct. 796 00:34:51,690 --> 00:34:54,300 Sometimes, you want to correct for errors in the APC system, 797 00:34:54,300 --> 00:34:57,271 or you might have things like fare evasion affecting 798 00:34:57,271 --> 00:34:59,520 that number-- how it goes from AFC to how many people 799 00:34:59,520 --> 00:35:00,720 were actually in that bus. 800 00:35:00,720 --> 00:35:02,100 How many people actually boarded? 801 00:35:02,100 --> 00:35:05,160 So you might do a little bit of manual surveys 802 00:35:05,160 --> 00:35:09,450 to check what that relationship is and apply some correction.
803 00:35:09,450 --> 00:35:11,340 For passenger miles, we need to know 804 00:35:11,340 --> 00:35:15,330 how many people are on the bus between each stop pair. 805 00:35:15,330 --> 00:35:18,930 So AFC gives you boardings and only boardings. 806 00:35:18,930 --> 00:35:20,820 APC gives you ons and offs. 807 00:35:20,820 --> 00:35:23,700 If every bus had APC, then you could calculate passenger miles 808 00:35:23,700 --> 00:35:24,630 directly. 809 00:35:24,630 --> 00:35:28,710 But often, you have systems where only a portion 810 00:35:28,710 --> 00:35:29,970 of the fleet has APC. 811 00:35:29,970 --> 00:35:33,240 So maybe 15% of your fleet is equipped with APC. 812 00:35:33,240 --> 00:35:38,220 And from that, you get the sample OD matrix. 813 00:35:38,220 --> 00:35:40,080 And you can use that OD matrix to convert 814 00:35:40,080 --> 00:35:43,560 from boardings only to the distribution of the ons 815 00:35:43,560 --> 00:35:45,810 and offs on all bus routes. 816 00:35:45,810 --> 00:35:47,760 And from that, you can get passenger miles. 817 00:35:47,760 --> 00:35:50,280 Or you might just use your buses that 818 00:35:50,280 --> 00:35:54,720 have APC, if that suffices for your data collection needs. 819 00:35:54,720 --> 00:35:59,070 Same thing with peak point load-- similar idea. 820 00:35:59,070 --> 00:36:01,200 The AFC only measures boardings. 821 00:36:01,200 --> 00:36:03,940 So it doesn't give you the peak point load automatically. 822 00:36:03,940 --> 00:36:05,670 But from APC, you could get it. 823 00:36:05,670 --> 00:36:09,390 And if you can establish a relationship between boardings 824 00:36:09,390 --> 00:36:11,460 and the peak load point, then you 825 00:36:11,460 --> 00:36:14,730 can use that model to infer the peak load 826 00:36:14,730 --> 00:36:16,200 point from just boardings. 827 00:36:16,200 --> 00:36:19,620 So this is a key thing to be efficient about data 828 00:36:19,620 --> 00:36:20,580 collection.
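[EDITOR'S NOTE: The inference idea here can be sketched with a single APC-equipped trip: compute link loads from ons and offs, derive passenger miles and the peak load, then apply the resulting per-boarding ratios to an AFC boardings total. The simple ratio model, the function names, and all numbers below are assumptions for illustration; an agency would fit the relationship on many sampled trips.]

```python
def link_loads(ons, offs):
    """Load on the link leaving each stop, from stop-level ons and offs."""
    loads = []
    load = 0
    for on, off in zip(ons, offs):
        load += on - off
        loads.append(load)
    return loads


def passenger_miles(loads, link_miles):
    """Sum of (load on link) * (link length). zip() drops the final
    stop's load, which has no outgoing link."""
    return sum(load * dist for load, dist in zip(loads, link_miles))


# Hypothetical APC trip over four stops (three links).
ons = [10, 5, 3, 0]
offs = [0, 2, 6, 10]
miles = [0.5, 1.0, 0.8]

loads = link_loads(ons, offs)
pm = passenger_miles(loads, miles)
peak_load = max(loads)

# Ratios from the APC sample, applied to boardings counted by AFC.
boardings = sum(ons)
pm_per_boarding = pm / boardings
peak_per_boarding = peak_load / boardings

afc_boardings = 1800  # hypothetical AFC total for comparable trips
estimated_pm = afc_boardings * pm_per_boarding
estimated_peak_share = peak_per_boarding
```

[The ratio approach only holds if the APC-equipped trips are representative of the trips the AFC total covers, so the sampled vehicles should rotate across routes and times of day.]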
829 00:36:20,580 --> 00:36:24,194 Any questions on this idea? 830 00:36:24,194 --> 00:36:25,176 Yup. 831 00:36:25,176 --> 00:36:27,140 AUDIENCE: So to get passenger miles, 832 00:36:27,140 --> 00:36:29,104 you're also going to have a GPS system 833 00:36:29,104 --> 00:36:31,068 as well to know the distance? 834 00:36:31,068 --> 00:36:33,277 Or are we just basically [INAUDIBLE] this 835 00:36:33,277 --> 00:36:34,699 is the routing [INAUDIBLE]? 836 00:36:34,699 --> 00:36:35,990 GABRIEL SANCHEZ-MARTINEZ: Both. 837 00:36:35,990 --> 00:36:36,790 AUDIENCE: [INAUDIBLE] 838 00:36:36,790 --> 00:36:37,730 GABRIEL SANCHEZ-MARTINEZ: Yeah, both. 839 00:36:37,730 --> 00:36:39,190 AUDIENCE: [INAUDIBLE] 840 00:36:39,190 --> 00:36:41,106 GABRIEL SANCHEZ-MARTINEZ: What tends to happen 841 00:36:41,106 --> 00:36:44,770 is that the APC, it'll come in. 842 00:36:44,770 --> 00:36:47,720 And it'll say, at this stop, this many people boarded. 843 00:36:47,720 --> 00:36:48,940 This many people alighted. 844 00:36:48,940 --> 00:36:52,740 So you have other layers in your database 845 00:36:52,740 --> 00:36:55,780 that say where the bus is and what the distance 846 00:36:55,780 --> 00:37:00,450 is between stops at the stop pair level. 847 00:37:00,450 --> 00:37:03,267 So you then essentially know how many people 848 00:37:03,267 --> 00:37:05,350 are riding on each link and how long that link is, 849 00:37:05,350 --> 00:37:06,391 and you multiply the two. 850 00:37:06,391 --> 00:37:09,000 So yeah, passenger miles. 851 00:37:09,000 --> 00:37:10,640 Yeah, more questions. 852 00:37:10,640 --> 00:37:13,420 AUDIENCE: Yeah, for these checks that are going on 853 00:37:13,420 --> 00:37:14,649 like the more manual checks-- 854 00:37:14,649 --> 00:37:15,940 GABRIEL SANCHEZ-MARTINEZ: Yeah. 855 00:37:15,940 --> 00:37:17,100 AUDIENCE: --I know often, there's 856 00:37:17,100 --> 00:37:18,900 derivation checkers who are coming into a check.
857 00:37:18,900 --> 00:37:20,775 GABRIEL SANCHEZ-MARTINEZ: That's right, yeah. 858 00:37:20,775 --> 00:37:23,480 AUDIENCE: Do they also use that data to cross-reference 859 00:37:23,480 --> 00:37:25,370 the passenger counts? 860 00:37:25,370 --> 00:37:26,790 As in, [? this ?] person gets on, 861 00:37:26,790 --> 00:37:28,939 and they check everyone's voice to [INAUDIBLE] DFL. 862 00:37:28,939 --> 00:37:30,230 GABRIEL SANCHEZ-MARTINEZ: Yeah. 863 00:37:30,230 --> 00:37:32,970 AUDIENCE: They then know exactly how they go on the bus. 864 00:37:32,970 --> 00:37:34,220 GABRIEL SANCHEZ-MARTINEZ: Yes. 865 00:37:34,220 --> 00:37:34,615 Yeah. 866 00:37:34,615 --> 00:37:35,800 AUDIENCE: Do they use that data? 867 00:37:35,800 --> 00:37:36,740 GABRIEL SANCHEZ-MARTINEZ: Yeah, they can. 868 00:37:36,740 --> 00:37:39,350 In the APC, sometimes there's reliability problems, 869 00:37:39,350 --> 00:37:41,200 especially when vehicles are very 870 00:37:41,200 --> 00:37:43,090 full because sometimes, people will 871 00:37:43,090 --> 00:37:44,530 block the sensor by the door. 872 00:37:47,230 --> 00:37:49,030 Actually, people like to stand by the door 873 00:37:49,030 --> 00:37:50,821 all the time, even when the bus isn't full. 874 00:37:50,821 --> 00:37:53,235 And that kind of affects APC. 875 00:37:53,235 --> 00:37:54,610 You might notice this on the one. 876 00:37:54,610 --> 00:37:55,600 If you take the one-- 877 00:37:55,600 --> 00:38:00,970 so yeah, you sometimes have a little bit of a manual effort 878 00:38:00,970 --> 00:38:01,840 to figure out. 879 00:38:01,840 --> 00:38:03,730 Just learn about your APC system, 880 00:38:03,730 --> 00:38:06,190 and what are the errors, and when do you see them. 881 00:38:06,190 --> 00:38:09,430 It often happens that you have more variation when 882 00:38:09,430 --> 00:38:10,540 you have very high loads. 883 00:38:10,540 --> 00:38:12,400 And that's when APC is least accurate. 884 00:38:12,400 --> 00:38:15,880 So it all comes together. 
885 00:38:15,880 --> 00:38:17,294 Yeah. 886 00:38:17,294 --> 00:38:18,210 Questions on the back? 887 00:38:18,210 --> 00:38:19,330 I think I saw a question. 888 00:38:19,330 --> 00:38:19,829 No? 889 00:38:19,829 --> 00:38:22,864 AUDIENCE: Yeah, I noticed that in Chicago, 890 00:38:22,864 --> 00:38:26,720 when the bus would be crowded, then people get off the bus. 891 00:38:26,720 --> 00:38:28,180 They let people off-- 892 00:38:28,180 --> 00:38:28,660 GABRIEL SANCHEZ-MARTINEZ: That's right. 893 00:38:28,660 --> 00:38:28,990 AUDIENCE: --and then back on. 894 00:38:28,990 --> 00:38:29,510 GABRIEL SANCHEZ-MARTINEZ: Yeah. 895 00:38:29,510 --> 00:38:30,010 Yeah. 896 00:38:30,010 --> 00:38:31,210 These double things. 897 00:38:31,210 --> 00:38:33,296 But somebody might be by the door just blocking 898 00:38:33,296 --> 00:38:34,295 the two little sensors-- 899 00:38:34,295 --> 00:38:34,990 [INTERPOSING VOICES] 900 00:38:34,990 --> 00:38:36,990 GABRIEL SANCHEZ-MARTINEZ: --the two little eyes. 901 00:38:36,990 --> 00:38:40,710 And that's it, no records of people getting on or off. 902 00:38:44,369 --> 00:38:46,660 So if you're doing a little data collection, as I said, 903 00:38:46,660 --> 00:38:48,010 we use checkers. 904 00:38:48,010 --> 00:38:50,260 And actually, your second assignment, you 905 00:38:50,260 --> 00:38:52,630 will be checkers of some kind. 906 00:38:52,630 --> 00:38:55,550 The typical checkers which you won't be in this assignment 907 00:38:55,550 --> 00:38:57,940 are ride checkers and point checkers. 908 00:38:57,940 --> 00:39:02,050 So a ride checker sits in the vehicle and rides 909 00:39:02,050 --> 00:39:03,370 with the vehicle. 910 00:39:03,370 --> 00:39:07,540 And the typical thing that these ride checkers are looking at 911 00:39:07,540 --> 00:39:10,380 is, how long did it take to cover some distance? 912 00:39:10,380 --> 00:39:12,479 So what was the running time for that trip? 
913 00:39:12,479 --> 00:39:14,020 And also, people getting on and off-- 914 00:39:14,020 --> 00:39:16,090 so they act as APC essentially. 915 00:39:16,090 --> 00:39:18,140 And they act as AVL. 916 00:39:18,140 --> 00:39:20,770 So AVL and APC together might replace 917 00:39:20,770 --> 00:39:23,170 most of the functionality of a ride checker. 918 00:39:23,170 --> 00:39:26,980 Although a ride checker often can conduct an onboard survey, 919 00:39:26,980 --> 00:39:30,250 asking passengers about where they are going, 920 00:39:30,250 --> 00:39:33,590 or their trip purpose, or things related to sociodemographics, 921 00:39:33,590 --> 00:39:38,530 which are qualitative and cannot be collected with the sensors. 922 00:39:38,530 --> 00:39:40,900 Point checkers stand outside of the vehicle. 923 00:39:40,900 --> 00:39:43,720 They stay at a specific place, and they 924 00:39:43,720 --> 00:39:46,360 can look at headways between buses-- 925 00:39:46,360 --> 00:39:49,570 so how long did it take between each bus to come by, 926 00:39:49,570 --> 00:39:52,120 and how loaded were these buses? 927 00:39:52,120 --> 00:39:55,540 So if you're interested in the peak load point, 928 00:39:55,540 --> 00:39:57,400 and you know where the peak load point is, 929 00:39:57,400 --> 00:40:01,612 and you just want to observe, measure 930 00:40:01,612 --> 00:40:03,070 what are the loads at the peak load 931 00:40:03,070 --> 00:40:05,530 point, then you can just station a point 932 00:40:05,530 --> 00:40:06,880 checker at the peak load point. 933 00:40:06,880 --> 00:40:09,310 And if that person is trained, they'll 934 00:40:09,310 --> 00:40:13,270 be able to more or less say how many people are in the vehicle 935 00:40:13,270 --> 00:40:16,680 from looking at the vehicle. 936 00:40:16,680 --> 00:40:19,212 With automated data collection systems-- 937 00:40:19,212 --> 00:40:21,420 yeah, with a fare system, we have passenger counts.
938 00:40:21,420 --> 00:40:23,410 We have transaction data, which is very rich. 939 00:40:23,410 --> 00:40:25,440 It will tell you not only that somebody 940 00:40:25,440 --> 00:40:27,930 is entering or exiting, but also how much they're 941 00:40:27,930 --> 00:40:32,430 paying, sometimes information about the fare product 942 00:40:32,430 --> 00:40:36,510 type, which might help you infer if this person is 943 00:40:36,510 --> 00:40:39,630 a senior, or a student, or a frequent user, an infrequent 944 00:40:39,630 --> 00:40:40,290 user-- 945 00:40:40,290 --> 00:40:43,490 so many things that are very useful for planning. 946 00:40:43,490 --> 00:40:46,060 And we'll get to play with some of these later in the course. 947 00:40:46,060 --> 00:40:49,750 And then, there's Automatic Passenger Counters, APC. 948 00:40:49,750 --> 00:40:54,880 So as more and more systems switch to automatic data 949 00:40:54,880 --> 00:40:57,940 collection, we still use some manual data collection, 950 00:40:57,940 --> 00:41:01,000 but not in the traditional sense. 951 00:41:01,000 --> 00:41:02,920 Now, we reserve those resources for things 952 00:41:02,920 --> 00:41:07,180 like surveys about sociodemographics and other things. 953 00:41:07,180 --> 00:41:10,640 And we also carry out web-based surveys, 954 00:41:10,640 --> 00:41:12,640 which would have some biases. 955 00:41:12,640 --> 00:41:16,260 But if people registered their cards, 956 00:41:16,260 --> 00:41:18,010 and you have email accounts, you can maybe 957 00:41:18,010 --> 00:41:21,730 send a mass email to everyone and carry out surveys. 958 00:41:21,730 --> 00:41:23,110 The MBTA does that. 959 00:41:23,110 --> 00:41:25,090 Maybe some of you are in the panel 960 00:41:25,090 --> 00:41:27,670 of people who are e-mailed every now and then. 961 00:41:27,670 --> 00:41:29,370 Is anybody in that panel? 962 00:41:29,370 --> 00:41:30,350 No hands. 963 00:41:30,350 --> 00:41:31,410 I'm in that panel.
964 00:41:31,410 --> 00:41:35,006 But I know somebody must be. 965 00:41:35,006 --> 00:41:37,630 So yeah, they send an email, and they ask about your last ride. 966 00:41:37,630 --> 00:41:39,370 And they say, where did you start from? 967 00:41:39,370 --> 00:41:41,350 What were you doing this trip for? 968 00:41:41,350 --> 00:41:43,900 How long did you have to walk? 969 00:41:43,900 --> 00:41:45,450 Are you happy with the system? 970 00:41:45,450 --> 00:41:47,130 Was your bus on time? 971 00:41:47,130 --> 00:41:48,130 Yeah, things like that-- 972 00:41:48,130 --> 00:41:51,000 how satisfied are you? 973 00:41:51,000 --> 00:41:53,200 It's a survey with qualitative questions 974 00:41:53,200 --> 00:41:55,450 that you couldn't collect automatically. 975 00:41:55,450 --> 00:41:57,850 It's [INAUDIBLE] seeing things about your experience 976 00:41:57,850 --> 00:42:03,370 outside of the bus, which there are no sensors for. 977 00:42:03,370 --> 00:42:05,170 All right, sampling strategies-- a bunch 978 00:42:05,170 --> 00:42:08,200 of different ones and the simplest one 979 00:42:08,200 --> 00:42:12,200 is called simple random sampling-- very, very simple. 980 00:42:12,200 --> 00:42:14,310 So when you have simple random sampling, 981 00:42:14,310 --> 00:42:16,060 what happens is that every trip, if you're 982 00:42:16,060 --> 00:42:18,850 looking at surveying trips, for things like how many people 983 00:42:18,850 --> 00:42:19,930 boarded this trip-- 984 00:42:19,930 --> 00:42:22,960 let's take that as an example. 985 00:42:22,960 --> 00:42:24,920 Then, if you're using simple random sampling, 986 00:42:24,920 --> 00:42:27,490 every trip has equal likelihood of being picked and being 987 00:42:27,490 --> 00:42:28,330 surveyed. 988 00:42:28,330 --> 00:42:33,610 So if you go through your process, 989 00:42:33,610 --> 00:42:35,560 and you determine that you need to observe 100 990 00:42:35,560 --> 00:42:38,550 trips to get an average reliably.
991 00:42:38,550 --> 00:42:42,380 And you're going to use that to plan something, 992 00:42:42,380 --> 00:42:44,450 then you need to look at 100 trips. 993 00:42:44,450 --> 00:42:46,270 So if you use simple random sampling, 994 00:42:46,270 --> 00:42:50,020 you take your schedule, and you randomly pick 100 trips. 995 00:42:50,020 --> 00:42:51,190 And that's your sample. 996 00:42:51,190 --> 00:42:53,860 Those are the ones that you send people out to collect data. 997 00:42:53,860 --> 00:42:56,450 Now, there's a little bit of a problem with that. 998 00:42:56,450 --> 00:42:57,880 It's not the most efficient method 999 00:42:57,880 --> 00:42:59,713 because if you're going to send someone out, 1000 00:42:59,713 --> 00:43:03,340 and that person is going to be active, and require some time 1001 00:43:03,340 --> 00:43:06,520 to get to the site and some time to return, then 1002 00:43:06,520 --> 00:43:08,260 once they're out there, you want them 1003 00:43:08,260 --> 00:43:10,240 to collect as much as they can. 1004 00:43:10,240 --> 00:43:12,100 So that's not simple random sampling. 1005 00:43:12,100 --> 00:43:14,600 That's cluster sampling. 1006 00:43:14,600 --> 00:43:16,460 Before we get to that systematic sampling-- 1007 00:43:16,460 --> 00:43:21,350 so typically, instead of picking randomly, we say, 1008 00:43:21,350 --> 00:43:26,000 OK, we need to get 10% of the trips. 1009 00:43:26,000 --> 00:43:30,290 So let's just make it such that we count. 1010 00:43:30,290 --> 00:43:33,680 And maybe it's every five trips, we have to survey it. 1011 00:43:33,680 --> 00:43:35,780 So now, it's evenly spaced. 1012 00:43:35,780 --> 00:43:38,850 And this is useful for some things. 1013 00:43:38,850 --> 00:43:41,180 One example is weekday, picking the weekday 1014 00:43:41,180 --> 00:43:43,610 that you're going to survey on. 1015 00:43:43,610 --> 00:43:47,340 So the technique that is often used is sample every six days. 1016 00:43:47,340 --> 00:43:49,581 Why would that be? 
1017 00:43:49,581 --> 00:43:50,080 Yeah. 1018 00:43:50,080 --> 00:43:53,000 So if you do it every seven, then you always have a Monday. 1019 00:43:53,000 --> 00:43:54,655 And that's going to get some bias 1020 00:43:54,655 --> 00:43:57,120 if Mondays happen to be low ridership days 1021 00:43:57,120 --> 00:43:58,420 or high ridership days. 1022 00:43:58,420 --> 00:44:01,630 So if you do every sixth day over a year, 1023 00:44:01,630 --> 00:44:04,390 you have a good sample of every weekday. 1024 00:44:04,390 --> 00:44:07,510 So that's an example of systematic sampling. 1025 00:44:07,510 --> 00:44:11,830 But you still have that issue of it 1026 00:44:11,830 --> 00:44:13,540 might not be the most efficient. 1027 00:44:13,540 --> 00:44:17,740 Cluster sampling, sometimes it's more efficient 1028 00:44:17,740 --> 00:44:20,110 once you send out a person to collect 1029 00:44:20,110 --> 00:44:22,690 data to do as much as possible. 1030 00:44:22,690 --> 00:44:24,760 And you survey a cluster. 1031 00:44:24,760 --> 00:44:28,510 So one example is, if you're distributing surveys 1032 00:44:28,510 --> 00:44:31,180 to passengers, and you need to distribute 100 surveys. 1033 00:44:31,180 --> 00:44:35,071 If you do a simple random sample of 100, 1034 00:44:35,071 --> 00:44:37,570 then those people might be in different parts of the system. 1035 00:44:37,570 --> 00:44:40,570 And one might be the first person 1036 00:44:40,570 --> 00:44:43,000 you see getting off at South Station. 1037 00:44:43,000 --> 00:44:44,920 And then another one might be 1038 00:44:44,920 --> 00:44:48,700 the first person you see getting off at the Kendall station. 1039 00:44:48,700 --> 00:44:50,270 So that's very inefficient. 1040 00:44:50,270 --> 00:44:53,070 So a cluster might be everybody on board a bus, 1041 00:44:53,070 --> 00:44:55,690 and that will get a bunch of people together. 1042 00:44:55,690 --> 00:44:59,470 However, it's not as efficient statistically to do that.
1043 00:44:59,470 --> 00:45:01,930 So you can't just add up to 100, and you're 1044 00:45:01,930 --> 00:45:07,360 done because there might be some correlation within the people 1045 00:45:07,360 --> 00:45:09,670 riding that vehicle that they will tend 1046 00:45:09,670 --> 00:45:12,310 to answer in a similar way. 1047 00:45:12,310 --> 00:45:14,410 So you might need to increase your sample size 1048 00:45:14,410 --> 00:45:15,576 when you use this technique. 1049 00:45:15,576 --> 00:45:19,830 But still, you might have a more efficient sampling plan. 1050 00:45:19,830 --> 00:45:21,290 Then, there is the ratio estimation 1051 00:45:21,290 --> 00:45:22,520 and conversion factors. 1052 00:45:22,520 --> 00:45:24,560 We gave examples of this already. 1053 00:45:24,560 --> 00:45:26,820 This is in the context of baseline phase 1054 00:45:26,820 --> 00:45:28,770 and then monitoring phase. 1055 00:45:28,770 --> 00:45:31,930 So you start out with a baseline phase. 1056 00:45:31,930 --> 00:45:33,790 And in the baseline phase, you collect 1057 00:45:33,790 --> 00:45:36,640 the thing you really want and something 1058 00:45:36,640 --> 00:45:40,480 that is very easily collected with lower resources. 1059 00:45:40,480 --> 00:45:42,850 And you make a model of the thing you really 1060 00:45:42,850 --> 00:45:45,910 want as a function of the thing that 1061 00:45:45,910 --> 00:45:47,920 is cheap and easy to collect. 1062 00:45:47,920 --> 00:45:49,420 And then, in the monitoring phase, 1063 00:45:49,420 --> 00:45:54,310 you only measure the thing that is cheap, and easy, and quick. 1064 00:45:54,310 --> 00:45:57,890 And you then use the model to estimate what you really want. 1065 00:45:57,890 --> 00:46:00,790 So converting AFC boardings to passenger miles, 1066 00:46:00,790 --> 00:46:02,320 we gave an example of that. 1067 00:46:02,320 --> 00:46:04,090 Or converting loads at checkpoints 1068 00:46:04,090 --> 00:46:05,840 to loads somewhere else.
1069 00:46:05,840 --> 00:46:07,840 So maybe only measure loads with a point 1070 00:46:07,840 --> 00:46:09,640 checker at the peak load point. 1071 00:46:09,640 --> 00:46:12,910 And you have some relationship to convert those loads 1072 00:46:12,910 --> 00:46:18,702 to loads at other key transfer stations as an example. 1073 00:46:18,702 --> 00:46:20,160 And then, the stratified sampling-- 1074 00:46:20,160 --> 00:46:23,970 so one of the things that determines how big of a sample 1075 00:46:23,970 --> 00:46:25,830 you need is the variability in the data 1076 00:46:25,830 --> 00:46:26,910 that you're collecting. 1077 00:46:26,910 --> 00:46:30,900 So correlation, when you're looking 1078 00:46:30,900 --> 00:46:35,650 at a whole system with multiple routes or multiple segments-- 1079 00:46:35,650 --> 00:46:37,690 maybe when you look at one route, 1080 00:46:37,690 --> 00:46:42,550 there's some variability of running times. 1081 00:46:42,550 --> 00:46:44,770 But they have a central tendency as well. 1082 00:46:44,770 --> 00:46:46,420 And when you've got a second route, 1083 00:46:46,420 --> 00:46:48,392 you have also some variability and 1084 00:46:48,392 --> 00:46:49,600 a different central tendency. 1085 00:46:49,600 --> 00:46:51,624 So you bunch all the data together, 1086 00:46:51,624 --> 00:46:54,040 some of the variability across data points in our data set 1087 00:46:54,040 --> 00:46:56,980 are going to be the inherent variability of each route. 1088 00:46:56,980 --> 00:46:59,920 And some of it will be systematic-- the differences 1089 00:46:59,920 --> 00:47:01,390 between both routes. 1090 00:47:01,390 --> 00:47:03,340 So if you do a simple random sample, 1091 00:47:03,340 --> 00:47:05,800 and you don't separate the systematic variability 1092 00:47:05,800 --> 00:47:08,560 from the inherent variability, then you're 1093 00:47:08,560 --> 00:47:10,650 going to get a wider variability. 1094 00:47:10,650 --> 00:47:13,270 And you will require a bigger sample size. 
1095 00:47:13,270 --> 00:47:14,950 Stratified sampling is an approach 1096 00:47:14,950 --> 00:47:18,790 where you determine sample sizes for each of these separately. 1097 00:47:18,790 --> 00:47:21,790 And it's more efficient if you do it well 1098 00:47:21,790 --> 00:47:25,270 because you eliminate the need, or you at least 1099 00:47:25,270 --> 00:47:28,600 reduce the need, to collect data for the sake 1100 00:47:28,600 --> 00:47:32,380 of the systematic differences between different parts 1101 00:47:32,380 --> 00:47:35,090 of the system. 1102 00:47:35,090 --> 00:47:36,450 Any questions on these methods? 1103 00:47:39,614 --> 00:47:41,066 Yes. 1104 00:47:41,066 --> 00:47:42,518 AUDIENCE: [INAUDIBLE] 1105 00:47:45,034 --> 00:47:46,450 GABRIEL SANCHEZ-MARTINEZ: Yeah, so 1106 00:47:46,450 --> 00:47:47,860 let's maybe pick another example. 1107 00:47:55,330 --> 00:48:01,130 Let's say that you're looking at the proportion of passengers 1108 00:48:01,130 --> 00:48:04,070 in a bus who are students. 1109 00:48:04,070 --> 00:48:05,660 And you're distributing a survey. 1110 00:48:05,660 --> 00:48:11,800 And they tell you whether they're students or not. 1111 00:48:11,800 --> 00:48:13,820 And you want this for the whole system 1112 00:48:13,820 --> 00:48:16,830 or for at least a group of routes. 1113 00:48:16,830 --> 00:48:19,900 And it tends to be that some routes don't serve universities 1114 00:48:19,900 --> 00:48:20,900 and don't serve schools. 1115 00:48:20,900 --> 00:48:24,020 So they have a lower proportion of people. 1116 00:48:24,020 --> 00:48:26,690 And then, some routes that do go through universities, 1117 00:48:26,690 --> 00:48:28,860 and they have a higher proportion of students. 
1118 00:48:28,860 --> 00:48:33,290 So if you just want the system-wide proportion 1119 00:48:33,290 --> 00:48:36,890 of people who are students, and you join all these data points 1120 00:48:36,890 --> 00:48:39,320 together, there's going to be a lot of variability 1121 00:48:39,320 --> 00:48:41,630 in what proportion that is across every trip 1122 00:48:41,630 --> 00:48:44,930 that you survey, correct? 1123 00:48:44,930 --> 00:48:49,610 So in some sense, it will indicate 1124 00:48:49,610 --> 00:48:51,290 that because of that variability, 1125 00:48:51,290 --> 00:48:55,260 you're going to need a higher sample size. 1126 00:48:55,260 --> 00:48:57,830 You're going to have to survey more trips 1127 00:48:57,830 --> 00:49:02,810 to get at your desired accuracy level and tolerance. 1128 00:49:02,810 --> 00:49:06,080 But now, if you say no, I'm going to split routes in two, 1129 00:49:06,080 --> 00:49:07,100 into two strata. 1130 00:49:07,100 --> 00:49:11,060 One is the routes that serve the universities. 1131 00:49:11,060 --> 00:49:16,700 And these tend to have around 50% proportion. 1132 00:49:16,700 --> 00:49:19,610 And then, there's the routes that don't serve universities. 1133 00:49:19,610 --> 00:49:23,180 And these tend to have proportions near 0. 1134 00:49:23,180 --> 00:49:27,260 So if you're near 0, you might require a lower sample 1135 00:49:27,260 --> 00:49:28,700 size to cover those. 1136 00:49:28,700 --> 00:49:30,410 And you can just very efficiently 1137 00:49:30,410 --> 00:49:32,480 cover most of your bus routes that way. 1138 00:49:32,480 --> 00:49:35,030 And then, focus your efforts on just the ones 1139 00:49:35,030 --> 00:49:37,160 that have higher proportion. 1140 00:49:37,160 --> 00:49:39,980 And you achieve your system-level tolerance 1141 00:49:39,980 --> 00:49:44,990 requirements with much fewer, with by far fewer resources 1142 00:49:44,990 --> 00:49:47,069 required to collect the data.
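The student-proportion example above can be sketched numerically. The following is only an illustration under made-up assumptions: the per-trip proportions are hypothetical numbers chosen to sit near 50% for university routes and near 0 for the rest, and the variable names are not from the lecture. It compares the variance of the pooled data with the variance inside each stratum.

```python
from statistics import pvariance

# Hypothetical per-trip student proportions (illustrative values only).
university_routes = [0.45, 0.55, 0.50, 0.48, 0.52]
other_routes = [0.00, 0.02, 0.01, 0.03, 0.00, 0.01, 0.02, 0.01]

# Pooling mixes the systematic difference between the two groups
# with the variability inside each group.
pooled = university_routes + other_routes

var_pooled = pvariance(pooled)
var_university = pvariance(university_routes)
var_other = pvariance(other_routes)

print(f"pooled variance:            {var_pooled:.5f}")
print(f"university stratum variance: {var_university:.5f}")
print(f"other stratum variance:      {var_other:.5f}")
```

With numbers like these, the pooled variance is far larger than either within-stratum variance, which is why sizing the sample per stratum can need fewer surveyed trips overall.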
1143 00:49:47,069 --> 00:49:48,360 Does that answer your question? 1144 00:49:48,360 --> 00:49:48,942 Yeah. 1145 00:49:48,942 --> 00:49:50,298 AUDIENCE: [INAUDIBLE] 1146 00:49:52,260 --> 00:49:54,510 GABRIEL SANCHEZ-MARTINEZ: So what he meant by inherent 1147 00:49:54,510 --> 00:49:57,600 is that within each bus route or within each stratum, 1148 00:49:57,600 --> 00:49:59,130 there will be some variability. 1149 00:49:59,130 --> 00:50:02,130 Even within the trips that are serving universities, 1150 00:50:02,130 --> 00:50:04,530 every trip might have a different proportion. 1151 00:50:04,530 --> 00:50:07,120 So there's going to be a little bit of variability in that. 1152 00:50:07,120 --> 00:50:10,600 But if you mix that with trips that are not serving students, 1153 00:50:10,600 --> 00:50:12,992 and you pool all that data together, 1154 00:50:12,992 --> 00:50:15,450 then it's going to look like the variance of that data set 1155 00:50:15,450 --> 00:50:16,170 is much higher. 1156 00:50:20,950 --> 00:50:23,200 All right, so we've tossed these terms 1157 00:50:23,200 --> 00:50:25,430 around-- tolerance, confidence level, accuracy. 1158 00:50:25,430 --> 00:50:27,996 So let's define them more precisely. 1159 00:50:27,996 --> 00:50:29,620 Accuracy-- when we talk about accuracy, 1160 00:50:29,620 --> 00:50:31,960 that has two dimensions. 1161 00:50:31,960 --> 00:50:36,070 So somebody might say, the average boardings per trip 1162 00:50:36,070 --> 00:50:38,032 is 33.1. 1163 00:50:38,032 --> 00:50:39,490 And then, the question that follows 1164 00:50:39,490 --> 00:50:42,070 is, do you mean exactly 33.1? 1165 00:50:42,070 --> 00:50:43,570 How certain are you of that? 1166 00:50:43,570 --> 00:50:45,100 And how accurate is that? 1167 00:50:45,100 --> 00:50:48,970 So when we talk about tolerance, there's relative tolerance, 1168 00:50:48,970 --> 00:50:50,860 and there's absolute tolerance.
1169 00:50:50,860 --> 00:50:52,750 Relative tolerance is expressed in terms 1170 00:50:52,750 --> 00:50:57,760 of a percent of the amount you were collecting or a fraction. 1171 00:50:57,760 --> 00:51:01,660 So you might say mean boardings per trip is 33.1, 1172 00:51:01,660 --> 00:51:03,170 plus or minus 10%. 1173 00:51:03,170 --> 00:51:05,710 And that's the 10% of 33.1. 1174 00:51:05,710 --> 00:51:07,876 That's why it's relative tolerance. 1175 00:51:07,876 --> 00:51:09,250 Then, there's absolute tolerance. 1176 00:51:09,250 --> 00:51:14,240 So mean boarding per trip is 33.1, plus or minus 3.3. 1177 00:51:14,240 --> 00:51:17,630 Now, in this case, these two are equivalent. 1178 00:51:17,630 --> 00:51:20,810 3.3 in absolute terms is 10% of 33.1. 1179 00:51:20,810 --> 00:51:23,600 But this was expressed in absolute terms, 1180 00:51:23,600 --> 00:51:25,820 and the previous one was expressed in relative terms. 1181 00:51:28,766 --> 00:51:32,130 So don't always assume that if you see a percent, 1182 00:51:32,130 --> 00:51:35,190 it's relative because if what you're measuring is in itself 1183 00:51:35,190 --> 00:51:38,850 a percent, unless you're using a percent of a percent, 1184 00:51:38,850 --> 00:51:39,930 then it's absolute. 1185 00:51:39,930 --> 00:51:41,940 So here's an example. 1186 00:51:41,940 --> 00:51:46,740 Mean percentage of students is 23%, plus or minus 5%. 1187 00:51:46,740 --> 00:51:49,785 That's absolute because it's 5%, not 5% of 23%. 1188 00:51:54,660 --> 00:51:57,480 First, we talked about, is that exactly 33.1? 1189 00:51:57,480 --> 00:52:00,010 Or is it something different from 33.1? 1190 00:52:00,010 --> 00:52:02,460 Then, the second question is, how sure are you, 1191 00:52:02,460 --> 00:52:06,310 how confident are you that the number you give, 1192 00:52:06,310 --> 00:52:12,320 plus or minus the tolerance you give, is the right answer? 
1193 00:52:12,320 --> 00:52:15,400 So now, you say I'm 95% confident 1194 00:52:15,400 --> 00:52:18,355 that the mean boardings per trip is 33.1, plus or minus 10%. 1195 00:52:18,355 --> 00:52:20,347 So now, you combine the tolerance 1196 00:52:20,347 --> 00:52:21,430 with the confidence level. 1197 00:52:21,430 --> 00:52:24,170 And that's the full expression of your accuracy. 1198 00:52:24,170 --> 00:52:27,740 And that's what you need when we look at the data collection. 1199 00:52:27,740 --> 00:52:30,800 So you have two different things that you could play with. 1200 00:52:30,800 --> 00:52:33,860 And what happens typically is that you choose 1201 00:52:33,860 --> 00:52:35,210 a high confidence level-- 1202 00:52:35,210 --> 00:52:38,150 90%, 95 percent are typical. 1203 00:52:38,150 --> 00:52:39,830 And then, you hold that fixed. 1204 00:52:39,830 --> 00:52:42,830 And you calculate what level of accuracy you need. 1205 00:52:42,830 --> 00:52:45,020 Or rather, you decide what level of accuracy 1206 00:52:45,020 --> 00:52:48,110 you need, depending on the question you want to answer, 1207 00:52:48,110 --> 00:52:51,560 and the impact it could have on the system. 1208 00:52:51,560 --> 00:52:54,350 So if you're looking to [INAUDIBLE] something 1209 00:52:54,350 --> 00:53:01,850 that will have very significant effects on the service plan 1210 00:53:01,850 --> 00:53:04,070 or maybe on investment in the system, 1211 00:53:04,070 --> 00:53:07,430 then you might need a higher accuracy. 1212 00:53:07,430 --> 00:53:10,597 But if you're collecting data just for reporting, 1213 00:53:10,597 --> 00:53:11,930 maybe it doesn't matter as much. 1214 00:53:11,930 --> 00:53:15,830 And you don't need to spend as much money on data collection. 
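The distinction between relative and absolute tolerance can be made concrete with a few lines of arithmetic. This is a minimal sketch using the numbers quoted in the lecture (33.1 boardings per trip, 23% students); the helper names `absolute_tolerance` and `interval` are hypothetical, introduced only for illustration.

```python
def absolute_tolerance(estimate, relative_tolerance):
    """Convert a relative tolerance (a fraction of the estimate) to absolute units."""
    return abs(estimate) * relative_tolerance


def interval(estimate, abs_tol):
    """Return the (low, high) range implied by estimate plus or minus abs_tol."""
    return (estimate - abs_tol, estimate + abs_tol)


# Mean boardings per trip: 33.1 plus or minus 10% (relative).
mean_boardings = 33.1
tol_boardings = absolute_tolerance(mean_boardings, 0.10)
boardings_range = interval(mean_boardings, tol_boardings)

# Share of students: 23% plus or minus 5 percentage points (absolute),
# compared with 23% plus or minus 5% of 23% (relative).
student_share = 0.23
absolute_range = interval(student_share, 0.05)
relative_range = interval(student_share, absolute_tolerance(student_share, 0.05))

print(boardings_range)
print(absolute_range)
print(relative_range)
```

The two student-share ranges differ a lot in width, which is why a stated percentage needs to say whether it is absolute or relative whenever the quantity being measured is itself a percentage.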
1215 00:53:15,830 --> 00:53:20,540 So as an example here, the National Transit Database-- 1216 00:53:20,540 --> 00:53:23,150 NTD, we call it NTD-- 1217 00:53:23,150 --> 00:53:26,150 for annual boardings and passenger miles, it says, 1218 00:53:26,150 --> 00:53:27,740 you should collect data to achieve 1219 00:53:27,740 --> 00:53:31,890 an accuracy of 10%, relative tolerance at 95% confidence 1220 00:53:31,890 --> 00:53:33,250 level. 1221 00:53:33,250 --> 00:53:36,090 You need both. 1222 00:53:36,090 --> 00:53:38,340 So take home message about this. 1223 00:53:38,340 --> 00:53:40,630 The other thing, the t distribution-- 1224 00:53:40,630 --> 00:53:43,920 so this is a probability distribution that 1225 00:53:43,920 --> 00:53:44,960 is bell-shaped. 1226 00:53:44,960 --> 00:53:47,490 It kind of looks like the normal distribution. 1227 00:53:47,490 --> 00:53:49,440 And it approaches the normal distribution 1228 00:53:49,440 --> 00:53:52,330 as the sample size gets very large. 1229 00:53:52,330 --> 00:53:54,960 This is the distribution that arises naturally 1230 00:53:54,960 --> 00:53:58,110 when you're estimating the mean of a population that 1231 00:53:58,110 --> 00:54:01,950 is normally distributed with unknown mean and variance 1232 00:54:01,950 --> 00:54:04,380 and some known sample size. 1233 00:54:04,380 --> 00:54:08,870 So to the right here, we have your equations 1234 00:54:08,870 --> 00:54:11,990 that I'm sure you've seen before for sample mean, sample 1235 00:54:11,990 --> 00:54:13,880 variance. 1236 00:54:13,880 --> 00:54:15,740 And I guess, what's important to think 1237 00:54:15,740 --> 00:54:18,470 about is that the distribution of what 1238 00:54:18,470 --> 00:54:20,220 you're collecting-- for example, you 1239 00:54:20,220 --> 00:54:23,630 might be collecting data on a number of people boarding route 1240 00:54:23,630 --> 00:54:25,100 1. 1241 00:54:25,100 --> 00:54:29,390 So that might have some distribution. 
1242 00:54:29,390 --> 00:54:31,440 As you collect more and more data, 1243 00:54:31,440 --> 00:54:36,350 so as you survey more and more trips, 1244 00:54:36,350 --> 00:54:40,700 the distribution of how many people board each trip 1245 00:54:40,700 --> 00:54:43,400 does not necessarily have to be normal. 1246 00:54:43,400 --> 00:54:45,980 But it turns out from the Central Limit Theorem 1247 00:54:45,980 --> 00:54:52,990 and other laws and properties of statistics and probability 1248 00:54:52,990 --> 00:54:54,920 that the distribution of the estimator-- 1249 00:54:54,920 --> 00:54:58,570 so the distribution of the mean that you calculate based 1250 00:54:58,570 --> 00:55:00,040 on that sample that you collected-- 1251 00:55:00,040 --> 00:55:03,380 approaches a normal distribution as the sample size increases. 1252 00:55:03,380 --> 00:55:06,402 So if you have a lower sample size, 1253 00:55:06,402 --> 00:55:08,110 instead of using the normal distribution, 1254 00:55:08,110 --> 00:55:10,650 use t distribution. 1255 00:55:10,650 --> 00:55:12,730 Sometimes, we call that the Student's t 1256 00:55:12,730 --> 00:55:13,780 distribution. 1257 00:55:13,780 --> 00:55:20,440 And this distribution gets wider as the variability increases 1258 00:55:20,440 --> 00:55:23,310 and as the sample size gets smaller. 1259 00:55:23,310 --> 00:55:26,090 It has a property called degrees of freedom, 1260 00:55:26,090 --> 00:55:28,360 which is sample size minus 1. 1261 00:55:28,360 --> 00:55:31,294 And you can see from this chart right here when 1262 00:55:31,294 --> 00:55:32,710 you have degrees of freedom equals 1263 00:55:32,710 --> 00:55:35,540 1, which means you collected two data points, 1264 00:55:35,540 --> 00:55:38,870 it's wider than when V approaches infinity. 1265 00:55:38,870 --> 00:55:42,610 And what you have in black here, the thinnest and least variable 1266 00:55:42,610 --> 00:55:46,520 of these, is essentially a normal distribution.
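The claim that the estimated mean has its own distribution, and that it tightens as the sample size grows even when the per-trip data are not normal, can be checked with a small simulation. This is only a sketch under stated assumptions: boardings per trip are drawn from an exponential distribution with a mean of 30, a deliberately skewed and entirely hypothetical choice, and the function names are illustrative.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable


def simulate_trip_boardings():
    # Hypothetical, skewed (non-normal) boardings on one trip, mean of 30.
    return random.expovariate(1 / 30)


def spread_of_sample_means(sample_size, repetitions=2000):
    """Repeat the survey many times and measure how much the sample mean varies."""
    means = []
    for _ in range(repetitions):
        sample = [simulate_trip_boardings() for _ in range(sample_size)]
        means.append(mean(sample))
    return stdev(means)


spread_small = spread_of_sample_means(5)
spread_large = spread_of_sample_means(50)

print(f"std. dev. of the mean with  5 trips per survey: {spread_small:.2f}")
print(f"std. dev. of the mean with 50 trips per survey: {spread_large:.2f}")
```

In a real survey only one sample is collected, so this repetition is purely hypothetical; it is a way of seeing why a small sample calls for the wider t distribution.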
1267 00:55:46,520 --> 00:55:48,990 And this is the distribution not of what you collected. 1268 00:55:48,990 --> 00:55:52,540 It's not the distribution of the number 1269 00:55:52,540 --> 00:55:54,250 of people who boarded route 1. 1270 00:55:54,250 --> 00:55:58,690 It's the distribution of the mean that you estimate. 1271 00:55:58,690 --> 00:55:59,860 AUDIENCE: [INAUDIBLE] 1272 00:55:59,860 --> 00:56:00,550 GABRIEL SANCHEZ-MARTINEZ: Exactly, it's 1273 00:56:00,550 --> 00:56:02,420 a sampling distribution of the mean. 1274 00:56:02,420 --> 00:56:05,980 And if you were to repeat that experiment with the same number 1275 00:56:05,980 --> 00:56:08,680 of trips but different trips, 1276 00:56:08,680 --> 00:56:11,320 you might get a slightly different mean. 1277 00:56:11,320 --> 00:56:14,110 So if you were to repeat that many, many times, 1278 00:56:14,110 --> 00:56:19,145 the distribution of those means would be shaped in this manner. 1279 00:56:19,145 --> 00:56:20,020 AUDIENCE: [INAUDIBLE] 1280 00:56:20,020 --> 00:56:22,519 GABRIEL SANCHEZ-MARTINEZ: Yeah, well, student t distributed. 1281 00:56:22,519 --> 00:56:26,700 And as sample size increases to infinity, normally distributed. 1282 00:56:26,700 --> 00:56:27,200 Harry. 1283 00:56:27,200 --> 00:56:32,009 AUDIENCE: So just for V equals 5, I think you [INAUDIBLE]. 1284 00:56:32,009 --> 00:56:33,175 GABRIEL SANCHEZ-MARTINEZ: 4. 1285 00:56:33,175 --> 00:56:33,620 AUDIENCE: 4. 1286 00:56:33,620 --> 00:56:34,460 GABRIEL SANCHEZ-MARTINEZ: Sorry, 6. 1287 00:56:34,460 --> 00:56:35,308 6. 1288 00:56:35,308 --> 00:56:36,955 AUDIENCE: Approximately 5 [INAUDIBLE]. 1289 00:56:36,955 --> 00:56:38,330 GABRIEL SANCHEZ-MARTINEZ: Yes, 6. 1290 00:56:38,330 --> 00:56:39,280 Yeah. 1291 00:56:39,280 --> 00:56:41,950 I misspoke. 1292 00:56:41,950 --> 00:56:43,940 [INAUDIBLE] 1293 00:56:43,940 --> 00:56:47,340 AUDIENCE: When there's a sample variance, sigma x squared 1294 00:56:47,340 --> 00:56:48,320 equals roughly.
1295 00:56:48,320 --> 00:56:50,250 Is that not supposed to be an equals? 1296 00:56:50,250 --> 00:56:52,890 Is that not the way the sample variance is defined? 1297 00:56:52,890 --> 00:56:56,104 Because I thought it's the-- 1298 00:56:56,104 --> 00:56:57,645 GABRIEL SANCHEZ-MARTINEZ: So-- --it's 1299 00:56:57,645 --> 00:56:59,103 below the variance of distribution. 1300 00:56:59,103 --> 00:57:02,110 But that's roughly [INAUDIBLE]. 1301 00:57:02,110 --> 00:57:05,380 AUDIENCE: Yeah, I guess the issue is that you 1302 00:57:05,380 --> 00:57:10,060 don't know the true mean. 1303 00:57:10,060 --> 00:57:14,530 So you're using an estimate to calculate the sample variance. 1304 00:57:14,530 --> 00:57:17,517 And therefore, it's almost, almost the sample variance. 1305 00:57:17,517 --> 00:57:18,850 GABRIEL SANCHEZ-MARTINEZ: Right. 1306 00:57:18,850 --> 00:57:19,782 But I thought-- 1307 00:57:19,782 --> 00:57:21,240 AUDIENCE: You're using an estimator 1308 00:57:21,240 --> 00:57:23,340 to do the-- that's what you have to do. 1309 00:57:23,340 --> 00:57:24,700 [INTERPOSING VOICES] 1310 00:57:24,700 --> 00:57:26,390 AUDIENCE: He's incorporating the fact 1311 00:57:26,390 --> 00:57:29,350 we're dividing by n minus 1 rather than dividing by [INAUDIBLE]. 1312 00:57:29,350 --> 00:57:31,225 GABRIEL SANCHEZ-MARTINEZ: No, so n minus 1, 1313 00:57:31,225 --> 00:57:34,450 that has to do with the degrees of freedom issue. 1314 00:57:34,450 --> 00:57:38,750 And that's to go from population variance to sample variance. 1315 00:57:38,750 --> 00:57:40,630 But the other thing that happens is 1316 00:57:40,630 --> 00:57:43,760 that if you're doing the population, 1317 00:57:43,760 --> 00:57:46,200 then you know exactly what your mean is. 1318 00:57:46,200 --> 00:57:47,295 It's exact, right? 1319 00:57:47,295 --> 00:57:47,920 AUDIENCE: Yeah.
1320 00:57:47,920 --> 00:57:49,920 GABRIEL SANCHEZ-MARTINEZ: And then in that case, 1321 00:57:49,920 --> 00:57:52,500 you would know what the exact variance is as well. 1322 00:57:52,500 --> 00:57:53,080 Yeah. 1323 00:57:53,080 --> 00:57:55,900 So the n minus 1 is just to remove 1324 00:57:55,900 --> 00:57:59,700 a bias that would arise from collecting only a sample. 1325 00:57:59,700 --> 00:58:01,460 AUDIENCE: But here for example, you 1326 00:58:01,460 --> 00:58:03,960 can say this is equal to [INAUDIBLE]. 1327 00:58:03,960 --> 00:58:04,960 GABRIEL SANCHEZ-MARTINEZ: Yeah, yeah, yeah, yeah. 1328 00:58:04,960 --> 00:58:05,792 AUDIENCE: You're working with the sample 1329 00:58:05,792 --> 00:58:07,890 to know it would be an approximate [INAUDIBLE]. 1330 00:58:07,890 --> 00:58:09,310 GABRIEL SANCHEZ-MARTINEZ: Yeah, in practice equal to. 1331 00:58:09,310 --> 00:58:11,270 AUDIENCE: As your sample distribution 1332 00:58:11,270 --> 00:58:13,377 increases, then obviously, your sample increases-- 1333 00:58:13,377 --> 00:58:14,210 [INTERPOSING VOICES] 1334 00:58:14,210 --> 00:58:14,920 GABRIEL SANCHEZ-MARTINEZ: And therefore, this 1335 00:58:14,920 --> 00:58:16,030 becomes more and more accurate. 1336 00:58:16,030 --> 00:58:16,660 AUDIENCE: [INAUDIBLE] 1337 00:58:16,660 --> 00:58:17,080 GABRIEL SANCHEZ-MARTINEZ: Exactly. 1338 00:58:17,080 --> 00:58:18,640 AUDIENCE: It should be approaching more [INAUDIBLE]. 1339 00:58:18,640 --> 00:58:19,180 GABRIEL SANCHEZ-MARTINEZ: Yeah, so I 1340 00:58:19,180 --> 00:58:20,620 guess what's important to realize 1341 00:58:20,620 --> 00:58:27,430 is that this is an estimate of the population variance, which 1342 00:58:27,430 --> 00:58:30,260 in itself uses another estimate. 1343 00:58:30,260 --> 00:58:32,200 And I guess, that's why that's there. 1344 00:58:32,200 --> 00:58:33,530 But it's a very small detail. 1345 00:58:33,530 --> 00:58:37,496 I didn't mean to distract you. 
1346 00:58:37,496 --> 00:58:42,046 AUDIENCE: So for the n, is it the sum of all the different 1347 00:58:42,046 --> 00:58:43,730 samples of [INAUDIBLE] or is it just-- 1348 00:58:43,730 --> 00:58:44,070 [INTERPOSING VOICES] 1349 00:58:44,070 --> 00:58:45,630 GABRIEL SANCHEZ-MARTINEZ: So you don't ever 1350 00:58:45,630 --> 00:58:47,160 repeat the experiment like this. 1351 00:58:47,160 --> 00:58:49,800 This is more of a theoretical explanation 1352 00:58:49,800 --> 00:58:52,420 to why there is a distribution to the mean, 1353 00:58:52,420 --> 00:58:53,850 even though you only have one. 1354 00:58:53,850 --> 00:58:55,380 You only have one mean, right? 1355 00:58:55,380 --> 00:58:57,340 Because you're going to collect data. 1356 00:58:57,340 --> 00:58:59,100 And once you finish collecting data, 1357 00:58:59,100 --> 00:59:01,680 you're going to calculate the mean of all that data. 1358 00:59:01,680 --> 00:59:04,020 So you only have one mean. 1359 00:59:04,020 --> 00:59:08,160 If you were hypothetically to repeat that experiment, 1360 00:59:08,160 --> 00:59:10,656 and you calculated separate means for each one, 1361 00:59:10,656 --> 00:59:12,030 then you would get a distribution 1362 00:59:12,030 --> 00:59:14,220 that would look like this. 1363 00:59:14,220 --> 00:59:17,100 In practice, you would just increase your sample size 1364 00:59:17,100 --> 00:59:21,554 and still compute one mean, which would be more accurate. 1365 00:59:21,554 --> 00:59:22,054 Yeah. 1366 00:59:24,910 --> 00:59:27,010 OK, let's move on. 1367 00:59:27,010 --> 00:59:28,629 So tolerance and confidence level-- 1368 00:59:28,629 --> 00:59:29,920 so we have these distributions. 1369 00:59:29,920 --> 00:59:33,760 These are the distributions of the statistics, 1370 00:59:33,760 --> 00:59:35,310 of the mean in this case. 1371 00:59:35,310 --> 00:59:36,640 They are bell-shaped. 1372 00:59:36,640 --> 00:59:41,977 As your sample size increases, the degrees of freedom goes up. 
1373 00:59:41,977 --> 00:59:43,060 And your accuracy goes up. 1374 00:59:43,060 --> 00:59:45,790 And the variance of that statistic distribution 1375 00:59:45,790 --> 00:59:46,370 decreases. 1376 00:59:46,370 --> 00:59:47,770 So it gets thinner. 1377 00:59:47,770 --> 00:59:52,240 So here in red, you have a distribution with a smaller 1378 00:59:52,240 --> 00:59:55,570 sample, and therefore, less accuracy or less confidence 1379 00:59:55,570 --> 00:59:56,590 would look like. 1380 00:59:56,590 --> 00:59:59,260 And then as you increase your sample size, 1381 00:59:59,260 --> 01:00:04,120 you see that it becomes more peaky. 1382 01:00:04,120 --> 01:00:09,130 So when we talk about tolerance, and let's 1383 01:00:09,130 --> 01:00:11,170 come back to the concept of absolute tolerance 1384 01:00:11,170 --> 01:00:13,330 in particular, we're talking about the distance 1385 01:00:13,330 --> 01:00:16,000 between the center of that distribution, which 1386 01:00:16,000 --> 01:00:20,020 is a symmetrical distribution, and some limit. 1387 01:00:20,020 --> 01:00:24,460 So we're saying, if you have a tolerance of plus/minus 10. 1388 01:00:24,460 --> 01:00:28,390 Then, you're going to measure 10, say 10 boardings, 1389 01:00:28,390 --> 01:00:32,270 from the center to the right and from the center to the left. 1390 01:00:32,270 --> 01:00:35,590 And that's your absolute tolerance. 1391 01:00:35,590 --> 01:00:38,410 So when you calculate absolute tolerance, 1392 01:00:38,410 --> 01:00:40,750 you can express that tolerance as a function 1393 01:00:40,750 --> 01:00:46,210 of the variance and/or the standard deviation, 1394 01:00:46,210 --> 01:00:48,790 rather of your mean. 1395 01:00:48,790 --> 01:00:52,750 So instead of saying 10, you could say 2 times 1396 01:00:52,750 --> 01:00:57,010 the standard deviation of that distribution using the equation 1397 01:00:57,010 --> 01:00:58,496 that we just calculated. 1398 01:00:58,496 --> 01:00:59,620 And that's very convenient. 
1399 01:00:59,620 --> 01:01:02,100 Why would we do that? 1400 01:01:02,100 --> 01:01:04,170 Why would I want to complicate things that way? 1401 01:01:07,068 --> 01:01:09,075 AUDIENCE: [? Outside ?] [? of ?] a cumulative 1402 01:01:09,075 --> 01:01:10,950 GABRIEL SANCHEZ-MARTINEZ: No, I mean, there's 1403 01:01:10,950 --> 01:01:12,690 a mathematical convenience here. 1404 01:01:12,690 --> 01:01:15,420 What is this a function of? 1405 01:01:15,420 --> 01:01:18,510 It's a function of the standard deviation 1406 01:01:18,510 --> 01:01:22,490 of the thing you were collecting and your sample size, right? 1407 01:01:22,490 --> 01:01:23,670 And what do we want to do? 1408 01:01:23,670 --> 01:01:25,470 We want to determine how many things we 1409 01:01:25,470 --> 01:01:26,670 need to collect, right? 1410 01:01:26,670 --> 01:01:27,570 So here we go-- 1411 01:01:27,570 --> 01:01:28,650 we have n. 1412 01:01:28,650 --> 01:01:32,280 And now we can solve for n, we have the sample size 1413 01:01:32,280 --> 01:01:34,380 that we require for a given tolerance. 1414 01:01:34,380 --> 01:01:37,560 So we're going to decide what the tolerance is 1415 01:01:37,560 --> 01:01:41,100 and calculate sample size, a minimum required sample size. 1416 01:01:41,100 --> 01:01:44,030 You can always collect more data. 1417 01:01:44,030 --> 01:01:44,810 All right. 1418 01:01:44,810 --> 01:01:46,790 So again, to review, this is the same equation 1419 01:01:46,790 --> 01:01:48,200 I had in the last slide. 1420 01:01:48,200 --> 01:01:51,020 You have absolute tolerance. 1421 01:01:51,020 --> 01:01:54,740 You can express that as a multiplier 1422 01:01:54,740 --> 01:01:59,270 times the standard deviation of the mean. 1423 01:01:59,270 --> 01:02:02,330 And then you solve for n, and you get this equation 1424 01:02:02,330 --> 01:02:03,260 right here. 1425 01:02:03,260 --> 01:02:06,170 t is your tolerance and you can-- 1426 01:02:06,170 --> 01:02:09,980 oh, sorry. 
t is the number of standard deviations 1427 01:02:09,980 --> 01:02:11,690 from the mean. 1428 01:02:11,690 --> 01:02:14,600 d is your tolerance, which you choose. 1429 01:02:14,600 --> 01:02:17,090 And this is something that you know, or collect, 1430 01:02:17,090 --> 01:02:18,410 or approximate. 1431 01:02:18,410 --> 01:02:20,210 So these are all given. 1432 01:02:20,210 --> 01:02:21,510 Where does t come from? 1433 01:02:21,510 --> 01:02:24,050 Well, we said that we're going to use the t 1434 01:02:24,050 --> 01:02:24,890 distribution, right? 1435 01:02:24,890 --> 01:02:28,220 So the t distribution has a table-- 1436 01:02:28,220 --> 01:02:30,230 or it has a certain shape, rather. 1437 01:02:30,230 --> 01:02:32,870 And using Excel or looking up at some table, 1438 01:02:32,870 --> 01:02:38,030 you can figure out what t is for two times 1439 01:02:38,030 --> 01:02:40,890 the standard deviation from the center. 1440 01:02:40,890 --> 01:02:43,820 So you can just plug it in from Excel or from-- 1441 01:02:43,820 --> 01:02:46,040 it's a property of the distribution, essentially. 1442 01:02:46,040 --> 01:02:48,890 Once you pick a confidence interval, you know t. 1443 01:02:48,890 --> 01:02:51,470 If you want to go to 95, it's a certain value. 1444 01:02:51,470 --> 01:02:54,320 If you want to go to 90, it's a different value. 1445 01:02:54,320 --> 01:02:55,460 OK. 1446 01:02:55,460 --> 01:02:57,050 When we look at relative tolerance, 1447 01:02:57,050 --> 01:03:00,740 relative tolerance is just absolute tolerance 1448 01:03:00,740 --> 01:03:03,890 divided by the mean that you are collecting, correct? 1449 01:03:03,890 --> 01:03:06,890 Because instead of saying plus or minus 10 boardings, 1450 01:03:06,890 --> 01:03:09,380 we're saying plus or minus 5% of the mean. 1451 01:03:09,380 --> 01:03:13,730 So we just take absolute tolerance and divide by x bar, 1452 01:03:13,730 --> 01:03:17,240 the sampling mean, the sample mean. 
1453 01:03:17,240 --> 01:03:19,040 And we solve for n again. 1454 01:03:19,040 --> 01:03:23,690 So what we have now, it looks very similar to the equation 1455 01:03:23,690 --> 01:03:24,570 right here. 1456 01:03:24,570 --> 01:03:27,660 But now we have the mean in the denominator. 1457 01:03:27,660 --> 01:03:30,860 OK, this quantity, standard deviation 1458 01:03:30,860 --> 01:03:34,040 divided by mean, sample standard deviation divided 1459 01:03:34,040 --> 01:03:38,200 by sampling mean, is called the coefficient of variation. 1460 01:03:38,200 --> 01:03:40,930 And there's a convenience to this. 1461 01:03:40,930 --> 01:03:44,192 And there's actually a reason why 1462 01:03:44,192 --> 01:03:45,900 sometimes relative tolerance is preferred 1463 01:03:45,900 --> 01:03:46,820 to absolute tolerance. 1464 01:03:46,820 --> 01:03:48,361 It's because of this, because there's 1465 01:03:48,361 --> 01:03:52,620 a mathematically convenient characteristic or property 1466 01:03:52,620 --> 01:03:53,760 coming out of this-- 1467 01:03:53,760 --> 01:03:57,270 that you don't need to know the standard deviation of what 1468 01:03:57,270 --> 01:04:00,300 you're collecting to figure out your sample size. 1469 01:04:00,300 --> 01:04:02,310 We're kind of running in circles here, right? 1470 01:04:02,310 --> 01:04:04,101 We're saying that to determine sample size, 1471 01:04:04,101 --> 01:04:05,829 you need to know the standard deviation. 1472 01:04:05,829 --> 01:04:07,120 Well, I haven't collected data. 1473 01:04:07,120 --> 01:04:09,009 So I don't know how variable the data is. 1474 01:04:09,009 --> 01:04:09,800 So that's an issue. 1475 01:04:09,800 --> 01:04:11,820 Now I have to estimate what that is. 1476 01:04:11,820 --> 01:04:15,510 It tends to happen that the coefficient of variation 1477 01:04:15,510 --> 01:04:19,230 is a more stable property than the variation in itself, 1478 01:04:19,230 --> 01:04:22,410 than the variance or the standard deviation itself. 
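[Editor's note] The slide equations are not captured in the transcript. The block below is a reconstruction from the spoken definitions only, so the symbol names are assumptions: d is the absolute tolerance, d_r the relative tolerance, t the number of standard deviations for the chosen confidence level, s the sample standard deviation, \bar{x} the sample mean, n the sample size, and v the coefficient of variation.

```latex
% Absolute tolerance: a multiplier times the standard deviation of the mean
d = t\,\frac{s}{\sqrt{n}}
\quad\Longrightarrow\quad
n = \left(\frac{t\,s}{d}\right)^{2}

% Relative tolerance: absolute tolerance divided by the sample mean
d_r = \frac{d}{\bar{x}} = t\,\frac{s}{\bar{x}\,\sqrt{n}}
\quad\Longrightarrow\quad
n = \left(\frac{t\,v}{d_r}\right)^{2},
\qquad v = \frac{s}{\bar{x}}
```

In the relative form only the ratio v appears, which is why a default coefficient of variation can stand in for an unknown standard deviation.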
1479 01:04:22,410 --> 01:04:26,670 So you're more likely to get away 1480 01:04:26,670 --> 01:04:29,640 with using default values for the coefficient of variation 1481 01:04:29,640 --> 01:04:34,260 than you are with assuming a specific standard deviation. 1482 01:04:34,260 --> 01:04:37,480 AUDIENCE: It should be noted that it's unitless, coefficient 1483 01:04:37,480 --> 01:04:38,410 of variation. 1484 01:04:38,410 --> 01:04:40,326 GABRIEL SANCHEZ-MARTINEZ: Yes, it is unitless. 1485 01:04:40,326 --> 01:04:41,980 Thank you. 1486 01:04:41,980 --> 01:04:42,480 OK. 1487 01:04:42,480 --> 01:04:45,210 So what happens is that relative tolerances are typically 1488 01:04:45,210 --> 01:04:46,080 used for averages. 1489 01:04:46,080 --> 01:04:47,310 So here's an example-- 1490 01:04:47,310 --> 01:04:51,760 you measured 5720 boardings plus minus 5%. 1491 01:04:51,760 --> 01:04:54,494 So if you were to get the absolute equivalent 1492 01:04:54,494 --> 01:04:55,910 of the absolute tolerance of that. 1493 01:04:55,910 --> 01:04:58,890 That would be 5% of 5720. 1494 01:04:58,890 --> 01:05:01,180 That would be 286 passengers. 1495 01:05:01,180 --> 01:05:03,600 That's a weird thing to report. 1496 01:05:03,600 --> 01:05:06,000 5% is more understandable, right? 1497 01:05:06,000 --> 01:05:07,630 And it kind of makes more sense. 1498 01:05:07,630 --> 01:05:11,700 So that's what we want naturally, anyway. 1499 01:05:11,700 --> 01:05:14,360 So as I said, the coefficient variation 1500 01:05:14,360 --> 01:05:17,310 is typically easier to guess than the mean and the variance 1501 01:05:17,310 --> 01:05:18,690 separately. 1502 01:05:18,690 --> 01:05:20,820 So we use that. 1503 01:05:20,820 --> 01:05:23,070 Here's an example using the t distribution, 1504 01:05:23,070 --> 01:05:26,070 where the sample is not large enough 1505 01:05:26,070 --> 01:05:30,280 to assume a normal distribution. 
1506 01:05:30,280 --> 01:05:33,550 So we say, let's have a relative tolerance of plus minus 5%, 1507 01:05:33,550 --> 01:05:36,120 a confidence level of 95%, and a coefficient 1508 01:05:36,120 --> 01:05:37,650 of variation of 0.3. 1509 01:05:37,650 --> 01:05:39,660 So we start out assuming large sample, 1510 01:05:39,660 --> 01:05:42,210 and therefore degrees of freedom is infinity. 1511 01:05:42,210 --> 01:05:44,140 We can use the normal distribution. 1512 01:05:44,140 --> 01:05:46,860 If we look at the normal distribution, 1513 01:05:46,860 --> 01:05:52,920 with plus minus 5%, confidence level 95%, the t is 1.96. 1514 01:05:52,920 --> 01:05:57,030 So we look that up on a table, or we use Excel norm dist, 1515 01:05:57,030 --> 01:05:58,440 or-- yeah. 1516 01:05:58,440 --> 01:06:02,110 t dist for t and norm dist for normal. 1517 01:06:02,110 --> 01:06:04,860 We got 1.96. 1518 01:06:04,860 --> 01:06:06,870 We plug in the relative tolerance, 1519 01:06:06,870 --> 01:06:08,366 the 0.3-- we get 140. 1520 01:06:08,366 --> 01:06:11,460 140 is not quite infinity, right? 1521 01:06:11,460 --> 01:06:14,190 So if we look at 140 as a sample size, 1522 01:06:14,190 --> 01:06:16,980 that would imply that the degrees of freedom is 139. 1523 01:06:16,980 --> 01:06:19,410 Now we go back and look at the t dist, 1524 01:06:19,410 --> 01:06:23,730 and we change 1.96 to the value from the t distribution 1525 01:06:23,730 --> 01:06:26,680 for those degrees of freedom. 1526 01:06:26,680 --> 01:06:28,710 And we get 140.73. 1527 01:06:28,710 --> 01:06:32,010 So you're sort of seeing that you were almost right. 1528 01:06:32,010 --> 01:06:35,160 140 is very large. 
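[Editor's note] The two-pass calculation just described can be sketched in Python. This is a minimal illustration, not the spreadsheet used in class: the function names are hypothetical, and the t value comes from a standard two-term series approximation around the normal quantile rather than from a table or Excel. With z of about 1.96 the first pass evaluates to roughly 138.3, a little under the 140 quoted aloud, and the second pass with 139 degrees of freedom gives roughly 140.7, in line with the 140.73 quoted.

```python
import math
from statistics import NormalDist


def t_quantile_approx(p: float, df: float) -> float:
    """Approximate Student t quantile using a two-term series in 1/df.

    This is an approximation (assumption: good enough for df in the hundreds),
    not an exact inverse of the t distribution.
    """
    z = NormalDist().inv_cdf(p)
    g1 = (z**3 + z) / 4.0
    g2 = (5 * z**5 + 16 * z**3 + 3 * z) / 96.0
    return z + g1 / df + g2 / df**2


def sample_size_relative(t: float, cv: float, rel_tol: float) -> float:
    """Required sample size n = (t * cv / d_r)^2 for a relative tolerance d_r."""
    return (t * cv / rel_tol) ** 2


confidence = 0.95
p_upper = 1 - (1 - confidence) / 2          # two-sided 95% -> 0.975 quantile

# Pass 1: assume a large sample, so use the normal distribution.
z = NormalDist().inv_cdf(p_upper)
n_first = sample_size_relative(z, cv=0.3, rel_tol=0.05)

# Pass 2: use the t distribution with the 139 degrees of freedom from the lecture.
t_139 = t_quantile_approx(p_upper, df=139)
n_second = sample_size_relative(t_139, cv=0.3, rel_tol=0.05)

print(f"normal pass: n = {n_first:.1f}")
print(f"t pass (df=139): n = {n_second:.2f}, round up to {math.ceil(n_second)}")
```

The second pass barely moves the answer, which is the point made in the lecture: at this sample size the normal assumption was almost right.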
1529 01:06:35,160 --> 01:06:37,380 In practice, you would just round up a little bit 1530 01:06:37,380 --> 01:06:40,800 and get a nice round number, and you would even play with this 1531 01:06:40,800 --> 01:06:43,860 once you're looking at planning who you're going to send out 1532 01:06:43,860 --> 01:06:45,780 and how many hours you're going to collect. 1533 01:06:45,780 --> 01:06:48,974 You want to get at least 141, but if you're 1534 01:06:48,974 --> 01:06:51,390 going to have people in units of eight hours, for example, 1535 01:06:51,390 --> 01:06:54,560 or units of four hours, then you might as well finish the batch 1536 01:06:54,560 --> 01:06:56,250 for four hours, the last one. 1537 01:06:56,250 --> 01:07:00,500 Maybe you'll get 150, 160 from that. 1538 01:07:00,500 --> 01:07:02,740 Here's an example of that equation 1539 01:07:02,740 --> 01:07:08,260 with different assumptions of confidence and tolerance. 1540 01:07:08,260 --> 01:07:11,320 And so we're using 90% confidence, 1541 01:07:11,320 --> 01:07:15,410 and we're assuming a certain sample size here. 1542 01:07:15,410 --> 01:07:19,000 So you can see that, as the tolerance decreases, which 1543 01:07:19,000 --> 01:07:22,150 means that you require a greater accuracy 1544 01:07:22,150 --> 01:07:25,060 for different coefficients of variation, 1545 01:07:25,060 --> 01:07:26,670 the sample size can get really large. 1546 01:07:26,670 --> 01:07:29,200 So if your data is not very variable, 1547 01:07:29,200 --> 01:07:31,160 then you can sample just a few trips. 1548 01:07:31,160 --> 01:07:33,490 And you know because they don't vary 1549 01:07:33,490 --> 01:07:35,440 that much what the mean is. 1550 01:07:35,440 --> 01:07:37,540 But if there's a lot of variability across trips, 1551 01:07:37,540 --> 01:07:38,440 then you need more. 1552 01:07:38,440 --> 01:07:43,630 So that's what you see as you go down the rows on this table. 1553 01:07:43,630 --> 01:07:44,860 Here we have tolerance. 
1554 01:07:44,860 --> 01:07:51,850 If you only have to be 50% accurate, plus minus 50%, 1555 01:07:51,850 --> 01:07:54,160 then you don't have to collect that much data. 1556 01:07:54,160 --> 01:07:56,410 If you want to be more precise, and you 1557 01:07:56,410 --> 01:08:00,860 want to say plus minus 5%, then you need a bigger sample size, 1558 01:08:00,860 --> 01:08:01,870 right? 1559 01:08:01,870 --> 01:08:03,720 OK. 1560 01:08:03,720 --> 01:08:05,940 Proportions-- and the homework, actually, 1561 01:08:05,940 --> 01:08:08,871 is based on proportions, so this is important. 1562 01:08:08,871 --> 01:08:10,620 Consider something, a group of passengers, 1563 01:08:10,620 --> 01:08:13,740 to estimate the proportion of passengers who are students. 1564 01:08:13,740 --> 01:08:16,109 So from probability, when you are 1565 01:08:16,109 --> 01:08:17,880 looking at an event that can either 1566 01:08:17,880 --> 01:08:20,830 be 0 or 1, or black or white-- 1567 01:08:20,830 --> 01:08:24,540 in this case, students or non-students-- 1568 01:08:24,540 --> 01:08:27,240 there's a certain probability that that person is a student, 1569 01:08:27,240 --> 01:08:27,739 right? 1570 01:08:27,739 --> 01:08:29,850 And what you want to estimate is that probability 1571 01:08:29,850 --> 01:08:31,920 or, in other words, what percent of the things 1572 01:08:31,920 --> 01:08:34,290 you observe are students. 1573 01:08:37,020 --> 01:08:40,229 So from the properties of the Bernoulli distribution, 1574 01:08:40,229 --> 01:08:43,200 the variance is p times 1 minus p. 1575 01:08:43,200 --> 01:08:47,160 So if everybody is a student, or nobody is a student, 1576 01:08:47,160 --> 01:08:49,800 either way there's no variability, right? 1577 01:08:49,800 --> 01:08:55,319 So you would have 1 times 1 minus 1, 1 times 0, 0-- no 1578 01:08:55,319 --> 01:08:56,430 variability. 
1579 01:08:56,430 --> 01:08:59,609 Though at the peak variability, the highest variance 1580 01:08:59,609 --> 01:09:02,910 of this distribution, is when 50% of your people 1581 01:09:02,910 --> 01:09:08,340 are students, so 0.5 times 1 minus 0.5, 0.25. 1582 01:09:08,340 --> 01:09:10,859 That's the highest variance, OK? 1583 01:09:10,859 --> 01:09:12,792 So the tolerance is typically specified 1584 01:09:12,792 --> 01:09:15,000 in absolute terms when you're estimating proportions, 1585 01:09:15,000 --> 01:09:18,420 because the proportion is in itself a percent. 1586 01:09:18,420 --> 01:09:22,470 So you use absolute tolerance. 1587 01:09:22,470 --> 01:09:28,859 And you just substitute, essentially, this variance. 1588 01:09:28,859 --> 01:09:31,710 You put in the variance of the Bernoulli distribution, 1589 01:09:31,710 --> 01:09:33,300 which is p times 1 minus p. 1590 01:09:33,300 --> 01:09:36,205 And that's how you get the sampling equation, sample size 1591 01:09:36,205 --> 01:09:37,080 requirement equation. 1592 01:09:41,950 --> 01:09:43,180 Here's a problem. 1593 01:09:43,180 --> 01:09:47,899 We don't know in advance what the proportion will be, right? 1594 01:09:47,899 --> 01:09:50,415 And we need that to know how many people we need to survey 1595 01:09:50,415 --> 01:09:52,540 to figure out-- or how many trips we need to survey 1596 01:09:52,540 --> 01:09:53,649 to figure out-- 1597 01:09:53,649 --> 01:09:55,066 sorry, how many students we need-- 1598 01:09:55,066 --> 01:09:56,440 how many riders we need to survey 1599 01:09:56,440 --> 01:09:58,760 to figure out what the average number of students are. 1600 01:09:58,760 --> 01:09:59,963 OK, so-- 1601 01:09:59,963 --> 01:10:03,414 AUDIENCE: And it's also a [INAUDIBLE] p times 1 1602 01:10:03,414 --> 01:10:05,248 minus p [INAUDIBLE] is a constrained number. 
1603 01:10:05,248 --> 01:10:07,455 GABRIEL SANCHEZ-MARTINEZ: It is a constrained number, 1604 01:10:07,455 --> 01:10:09,417 and that's exactly where we're going. 1605 01:10:09,417 --> 01:10:11,750 So we use something called absolute equivalent tolerance 1606 01:10:11,750 --> 01:10:13,340 instead of absolute tolerance. 1607 01:10:13,340 --> 01:10:15,950 We assume that p is 0.5-- 1608 01:10:15,950 --> 01:10:18,090 that's the maximum it could be. 1609 01:10:18,090 --> 01:10:20,750 So let's go ahead with a worst case scenario. 1610 01:10:20,750 --> 01:10:22,830 And then what happens with p itself? 1611 01:10:22,830 --> 01:10:27,260 Well, if your percent is high, then you 1612 01:10:27,260 --> 01:10:29,960 can tolerate a bigger number, right? 1613 01:10:29,960 --> 01:10:35,600 So if it's 32%, you're probably OK with plus minus 5%. 1614 01:10:35,600 --> 01:10:39,320 If your average were 1.2, plus minus 5% 1615 01:10:39,320 --> 01:10:40,970 is not that good, right? 1616 01:10:40,970 --> 01:10:42,170 You need a higher-- 1617 01:10:42,170 --> 01:10:46,220 you need a much stricter, tighter confidence 1618 01:10:46,220 --> 01:10:47,700 interval for that. 1619 01:10:47,700 --> 01:10:51,259 So probably not good to do plus minus 5% in that case. 1620 01:10:51,259 --> 01:10:53,550 AUDIENCE: [? Well, do ?] [? you mean ?] you have a plus 1621 01:10:53,550 --> 01:10:55,730 minus 5% absolute percentage? 1622 01:10:55,730 --> 01:10:56,040 GABRIEL SANCHEZ-MARTINEZ: Absolute, yeah. 1623 01:10:56,040 --> 01:10:57,530 AUDIENCE: And you'd be going negative [INAUDIBLE] 1624 01:10:57,530 --> 01:10:57,800 GABRIEL SANCHEZ-MARTINEZ: Negative, 1625 01:10:57,800 --> 01:11:00,590 which is possible but difficult to interpret. 
1626 01:11:00,590 --> 01:11:04,320 AUDIENCE: Sorry, so this isn't actually 32% plus or minus 5% 1627 01:11:04,320 --> 01:11:06,020 of 32 [INAUDIBLE] 1628 01:11:06,020 --> 01:11:06,610 GABRIEL SANCHEZ-MARTINEZ: It is not-- yeah, 1629 01:11:06,610 --> 01:11:09,110 it's absolute tolerance, not relative tolerance, right. 1630 01:11:09,110 --> 01:11:12,500 So what's convenient about this is that these two factors work 1631 01:11:12,500 --> 01:11:13,490 in opposite directions. 1632 01:11:13,490 --> 01:11:19,490 So as you get bigger, as the proportion gets closer to 50%, 1633 01:11:19,490 --> 01:11:20,690 the variance increases. 1634 01:11:20,690 --> 01:11:23,150 So oh, well, we need a bigger sample. 1635 01:11:23,150 --> 01:11:26,630 But your tolerance increases as well, 1636 01:11:26,630 --> 01:11:28,670 so you don't need as big of a sample. 1637 01:11:28,670 --> 01:11:30,050 And so it's convenient. 1638 01:11:30,050 --> 01:11:33,500 And the practical solution is assume p is 0.5 1639 01:11:33,500 --> 01:11:37,070 and work in terms of absolute equivalent tolerance. 1640 01:11:37,070 --> 01:11:40,160 So you pick a tolerance under the assumption 1641 01:11:40,160 --> 01:11:43,670 that our proportion is 50%. 1642 01:11:43,670 --> 01:11:46,170 And here's what happens. 1643 01:11:46,170 --> 01:11:48,800 Yeah, if the expected proportion is 50%, 1644 01:11:48,800 --> 01:11:51,890 and you say plus minus 5 percent, what you would get 1645 01:11:51,890 --> 01:11:56,520 is this 5%, if it turns out that p is 50%. 1646 01:11:56,520 --> 01:12:01,400 But if it were more to the extremes, like 5% or 95%, 1647 01:12:01,400 --> 01:12:04,970 what you would actually achieve from having planned the survey, 1648 01:12:04,970 --> 01:12:07,580 assuming 50%, is 2.2-- 1649 01:12:07,580 --> 01:12:11,690 so much better, much more acceptable 1650 01:12:11,690 --> 01:12:14,840 to say 5% plus minus 2.2%, right? 1651 01:12:14,840 --> 01:12:16,370 So it works out. 
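[Editor's note] The 2.2% figure can be checked with a short Python sketch. The function name is hypothetical. The reasoning assumed here is that the sample size is fixed by planning at p = 0.5, so the tolerance actually achieved scales with the ratio of the true Bernoulli standard deviation, sqrt(p(1 - p)), to its worst-case value of 0.5.

```python
import math


def achieved_tolerance(equiv_tol: float, p: float) -> float:
    """Absolute tolerance achieved at true proportion p when the survey
    was sized for an absolute equivalent tolerance at p = 0.5.

    Assumes the same confidence level and sample size as in planning.
    """
    worst_case_sd = 0.5                       # sqrt(0.5 * (1 - 0.5))
    actual_sd = math.sqrt(p * (1.0 - p))      # Bernoulli standard deviation
    return equiv_tol * actual_sd / worst_case_sd


for p in (0.50, 0.32, 0.05, 0.95):
    d = achieved_tolerance(0.05, p)
    print(f"p = {p:.2f}: plus/minus {100 * d:.1f} percentage points")
```

At p = 0.05 or 0.95 this gives about 2.2 percentage points, matching the lecture, and at p = 0.5 it returns the planned 5.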
1652 01:12:16,370 --> 01:12:20,030 And there's a convenient equation 1653 01:12:20,030 --> 01:12:22,640 if you assume a very large sample, or large enough sample, 1654 01:12:22,640 --> 01:12:26,570 and you pick 95%, 0.25, which is the variance 1655 01:12:26,570 --> 01:12:30,320 times the normal distribution t squared 1656 01:12:30,320 --> 01:12:32,280 is 0.96, which is almost 1. 1657 01:12:32,280 --> 01:12:33,530 So then you get this equation. 1658 01:12:33,530 --> 01:12:35,654 You take 1, you divide it by the tolerance 1659 01:12:35,654 --> 01:12:37,820 that you want, your equivalent tolerance, and that's 1660 01:12:37,820 --> 01:12:38,940 your sample size. 1661 01:12:38,940 --> 01:12:42,950 So it doesn't depend on anything about the data in itself. 1662 01:12:42,950 --> 01:12:46,490 You just say if I want, on whatever I'm collecting, 1663 01:12:46,490 --> 01:12:48,270 whatever proportion I'm collecting, 1664 01:12:48,270 --> 01:12:51,620 a 5% absolute equivalent tolerance, 1665 01:12:51,620 --> 01:12:57,190 then I need 400 surveys to be answered. 1666 01:12:57,190 --> 01:12:58,296 Yeah? 1667 01:12:58,296 --> 01:13:03,152 AUDIENCE: So this assumes a random-- 1668 01:13:03,152 --> 01:13:05,110 GABRIEL SANCHEZ-MARTINEZ: Simple random sample. 1669 01:13:05,110 --> 01:13:05,750 AUDIENCE: [INAUDIBLE] 1670 01:13:05,750 --> 01:13:07,190 GABRIEL SANCHEZ-MARTINEZ: Yes, a simple random sample. 1671 01:13:07,190 --> 01:13:08,648 So you would increase these numbers 1672 01:13:08,648 --> 01:13:11,050 if you are using cluster sampling 1673 01:13:11,050 --> 01:13:13,330 to account for correlation. 1674 01:13:13,330 --> 01:13:16,830 You would have to increase them if you're giving people 1675 01:13:16,830 --> 01:13:19,330 a survey, and not all of them answer the survey, because you 1676 01:13:19,330 --> 01:13:22,240 need 400 surveys answered. 
1677 01:13:22,240 --> 01:13:24,610 So if only half of the people answer the survey, 1678 01:13:24,610 --> 01:13:27,010 then you need to distribute 800 surveys. 1679 01:13:27,010 --> 01:13:28,962 AUDIENCE: Do you recommend calculating also 1680 01:13:28,962 --> 01:13:32,070 that the standard error after this so that [INAUDIBLE] 1681 01:13:32,070 --> 01:13:32,570 make sure? 1682 01:13:32,570 --> 01:13:33,530 GABRIEL SANCHEZ-MARTINEZ: Absolutely, yeah. 1683 01:13:33,530 --> 01:13:35,810 You want to go back and check with the standard error 1684 01:13:35,810 --> 01:13:38,330 and what your confidence interval is and see 1685 01:13:38,330 --> 01:13:39,830 if you meet it or if you need to add 1686 01:13:39,830 --> 01:13:41,570 a few days of data collection. 1687 01:13:41,570 --> 01:13:42,360 AUDIENCE: Right. 1688 01:13:42,360 --> 01:13:43,651 GABRIEL SANCHEZ-MARTINEZ: Yeah. 1689 01:13:43,651 --> 01:13:48,740 OK, so with proportions, you need a very large sample size 1690 01:13:48,740 --> 01:13:51,590 to estimate a proportion if you want accuracy. 1691 01:13:51,590 --> 01:13:54,800 If you say absolute equivalent tolerance of 4%, 1692 01:13:54,800 --> 01:13:56,900 then you need 600. 1693 01:13:56,900 --> 01:14:00,020 That's a big number, so it just gives you an idea of that. 1694 01:14:00,020 --> 01:14:02,390 If you get greedy with the tolerance, 1695 01:14:02,390 --> 01:14:08,580 you have to pay for the surveyors to go out. 1696 01:14:08,580 --> 01:14:11,090 OK. 1697 01:14:11,090 --> 01:14:14,510 So the process is you determine the needed sample size 1698 01:14:14,510 --> 01:14:18,140 just with the discussion of the equations that we discussed. 1699 01:14:18,140 --> 01:14:19,850 Then you multiply the sample sizes. 
1700 01:14:23,675 --> 01:14:25,550 If you're using stratified sampling or if you 1701 01:14:25,550 --> 01:14:28,280 have questions that have multiple variables, 1702 01:14:28,280 --> 01:14:30,920 you need to then make sure that you achieve that sample 1703 01:14:30,920 --> 01:14:34,200 size for each combination of things that you're measuring. 1704 01:14:34,200 --> 01:14:36,830 So if you're, for example, looking at not just 1705 01:14:36,830 --> 01:14:42,880 boardings, but proportion of passengers that are car-owning, 1706 01:14:42,880 --> 01:14:43,810 who are pleased. 1707 01:14:43,810 --> 01:14:46,990 So you could just independently measure pleased, independently 1708 01:14:46,990 --> 01:14:52,840 measure passengers who own a car. 1709 01:14:52,840 --> 01:14:55,960 And you might have the tolerance you need on each one, 1710 01:14:55,960 --> 01:14:57,910 but if you want the combination of that, 1711 01:14:57,910 --> 01:14:59,650 now you need a higher sample, because you 1712 01:14:59,650 --> 01:15:03,610 need that number for the combination of those things. 1713 01:15:03,610 --> 01:15:05,230 Then there's a clustering effect, 1714 01:15:05,230 --> 01:15:06,862 so a typical thing if you're doing 1715 01:15:06,862 --> 01:15:08,820 the clustering of a whole vehicle of passengers 1716 01:15:08,820 --> 01:15:12,190 is to multiply by 4. 1717 01:15:12,190 --> 01:15:15,010 And then for things like OD matrices, the rule of thumb 1718 01:15:15,010 --> 01:15:17,927 is 20 times the number of cells. 1719 01:15:17,927 --> 01:15:18,760 What does that mean? 1720 01:15:18,760 --> 01:15:20,554 That if your OD matrix is quite aggregate, 1721 01:15:20,554 --> 01:15:21,970 and it's at the segment level-- so 1722 01:15:21,970 --> 01:15:24,700 say you divide a route into two segments, 1723 01:15:24,700 --> 01:15:27,190 then your OD matrix has four cells. 1724 01:15:27,190 --> 01:15:30,310 Four cells times 20, that's how many people you have to survey. 
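[Editor's note] The rules of thumb mentioned so far can be strung together in a small Python sketch. All function names are hypothetical, and combining the clustering factor and the response rate by simple multiplication is an assumption made for illustration; the lecture lists the adjustments without giving a single combined formula.

```python
import math


def base_sample_size(equiv_tol: float) -> int:
    """Rule of thumb n ~ 1 / d^2 for a proportion at about 95% confidence,
    planning with p = 0.5 (absolute equivalent tolerance d)."""
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(1.0 / equiv_tol**2, 6))


def surveys_to_distribute(answered_needed: int,
                          response_rate: float,
                          cluster_factor: float = 1.0) -> int:
    """Scale the number of answered surveys up for clustering and non-response."""
    return math.ceil(round(answered_needed * cluster_factor / response_rate, 6))


def od_matrix_sample(n_segments: int, per_cell: int = 20) -> int:
    """Rule of thumb: 20 responses per origin-destination cell."""
    return per_cell * n_segments**2


n = base_sample_size(0.05)                        # 5% equivalent tolerance
print(n)                                          # 400 answered surveys
print(surveys_to_distribute(n, 0.5))              # half respond: hand out 800
print(surveys_to_distribute(n, 1.0, 4.0))         # whole-vehicle clusters: x4
print(od_matrix_sample(2))                        # 2 segments: 4 cells x 20
```

The stratification requirement (meeting the target in every combination of variables) is not modeled here; it would multiply the base figure by the number of combinations being measured.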
1725 01:15:30,310 --> 01:15:33,269 If you do it at the stop level, 1726 01:15:33,269 --> 01:15:35,560 then you have many more stops and, therefore, many more 1727 01:15:35,560 --> 01:15:39,465 cells and, therefore, a much higher sample size. 1728 01:15:39,465 --> 01:15:41,090 If you have a response rate that is not 1729 01:15:41,090 --> 01:15:43,360 100%, which is always the case, then you 1730 01:15:43,360 --> 01:15:46,480 have to expand by 1 minus that in the reciprocal-- sorry, 1 1731 01:15:46,480 --> 01:15:48,080 over that in the reciprocal. 1732 01:15:48,080 --> 01:15:49,760 And then you get a very large number, 1733 01:15:49,760 --> 01:15:51,860 and you say I don't have the budget for that. 1734 01:15:51,860 --> 01:15:57,460 And you have to make tradeoffs and figure out what you can do. 1735 01:15:57,460 --> 01:15:59,590 And maybe you have to-- 1736 01:15:59,590 --> 01:16:01,750 maybe you can't collect this combination 1737 01:16:01,750 --> 01:16:03,160 and know that accurately, right? 1738 01:16:03,160 --> 01:16:07,500 So you revise your expectations. 1739 01:16:07,500 --> 01:16:11,265 OK, with response rates, you are concerned 1740 01:16:11,265 --> 01:16:12,640 with getting the correct answers. 1741 01:16:12,640 --> 01:16:14,740 You also want to be getting a high response rate. 1742 01:16:14,740 --> 01:16:17,281 If you don't get a high response rate, there might be a bias. 1743 01:16:17,281 --> 01:16:19,730 So you have to worry about that. 1744 01:16:19,730 --> 01:16:21,250 If you have low response rates, that 1745 01:16:21,250 --> 01:16:23,000 means you need to distribute more surveys, 1746 01:16:23,000 --> 01:16:24,190 and that costs money. 1747 01:16:24,190 --> 01:16:26,240 And there's the bias that I just mentioned, 1748 01:16:26,240 --> 01:16:29,980 so people who don't respond may not be responding for a reason. 1749 01:16:29,980 --> 01:16:32,560 And then that might bias your results. 
1750 01:16:32,560 --> 01:16:34,990 And that might make you decide something in planning 1751 01:16:34,990 --> 01:16:38,570 that is not the right decision based on what actually happens. 1752 01:16:38,570 --> 01:16:41,410 So we call that the non-response bias. 1753 01:16:41,410 --> 01:16:43,099 OK, so what happens? 1754 01:16:43,099 --> 01:16:44,890 People who don't respond might be different 1755 01:16:44,890 --> 01:16:47,015 or might have responded differently to the question 1756 01:16:47,015 --> 01:16:47,950 had they responded. 1757 01:16:47,950 --> 01:16:50,120 So here's some examples. 1758 01:16:50,120 --> 01:16:52,660 If you're surveying people who are standing, 1759 01:16:52,660 --> 01:16:54,220 they are less comfortable. 1760 01:16:54,220 --> 01:16:57,280 And maybe it's a crowded bus-- they are less comfortable. 1761 01:16:57,280 --> 01:17:00,100 Or maybe they're getting off one of those stops that 1762 01:17:00,100 --> 01:17:03,250 is coming up, so they are less likely to have the time 1763 01:17:03,250 --> 01:17:05,020 to respond to your survey. 1764 01:17:05,020 --> 01:17:07,570 People with low literacy, teenagers, 1765 01:17:07,570 --> 01:17:10,420 people who don't speak the language, 1766 01:17:10,420 --> 01:17:11,870 are less likely to respond. 1767 01:17:11,870 --> 01:17:14,720 And they might have different travel patterns. 1768 01:17:14,720 --> 01:17:16,710 So if you understand those things, 1769 01:17:16,710 --> 01:17:18,392 and you get lower samples for them, 1770 01:17:18,392 --> 01:17:20,350 you might be able to do some sort of correction 1771 01:17:20,350 --> 01:17:21,670 to those biases. 1772 01:17:21,670 --> 01:17:23,740 But you have to pay attention. 1773 01:17:23,740 --> 01:17:25,390 How do you improve your response rate? 1774 01:17:25,390 --> 01:17:28,150 Well you can make your questions shorter. 1775 01:17:28,150 --> 01:17:29,950 You can do a quick oral survey. 
1776 01:17:29,950 --> 01:17:32,890 That's what we're going to do for this homework. 1777 01:17:32,890 --> 01:17:36,550 You can try to get information from automatic sources whenever 1778 01:17:36,550 --> 01:17:37,100 possible. 1779 01:17:37,100 --> 01:17:42,100 So if you have an AFC system, let's not collect boardings, 1780 01:17:42,100 --> 01:17:45,100 because we know that. 1781 01:17:45,100 --> 01:17:47,650 And then of course some training, and just being kind, 1782 01:17:47,650 --> 01:17:51,610 and having supervision helps a lot. 1783 01:17:51,610 --> 01:17:53,650 OK, here's some suggested tolerances 1784 01:17:53,650 --> 01:17:55,340 for different things. 1785 01:17:55,340 --> 01:17:58,420 So we're looking here at boardings or the peak load. 1786 01:17:58,420 --> 01:18:00,580 And you see here that the suggested tolerance 1787 01:18:00,580 --> 01:18:05,290 is 30%, plus minus 30%, when you have a route with one 1788 01:18:05,290 --> 01:18:05,980 to three buses. 1789 01:18:05,980 --> 01:18:07,270 And then as you have more and more buses, 1790 01:18:07,270 --> 01:18:08,380 the tolerance decreases. 1791 01:18:08,380 --> 01:18:11,920 That means you require a larger sample. 1792 01:18:11,920 --> 01:18:14,670 Why is that? 1793 01:18:14,670 --> 01:18:16,530 Why do you need a bigger sample if you 1794 01:18:16,530 --> 01:18:20,238 have a route with more buses? 1795 01:18:20,238 --> 01:18:23,560 AUDIENCE: You're less likely to sample a different bus. 1796 01:18:23,560 --> 01:18:26,830 GABRIEL SANCHEZ-MARTINEZ: Yes, and when you have higher-- 1797 01:18:26,830 --> 01:18:30,112 when you have more buses, you tend to have higher frequency. 1798 01:18:30,112 --> 01:18:30,820 There's bunching. 
1799 01:18:30,820 --> 01:18:35,260 OK, so if you then survey loads, for example, 1800 01:18:35,260 --> 01:18:38,590 and you only get a few because of the bunching effect 1801 01:18:38,590 --> 01:18:40,380 and because there are more buses, 1802 01:18:40,380 --> 01:18:42,910 and you're observing a smaller percentage of them 1803 01:18:42,910 --> 01:18:45,490 for a given time period, say, you're 1804 01:18:45,490 --> 01:18:48,760 less likely to have observed the bus that was really crowded, 1805 01:18:48,760 --> 01:18:49,330 right? 1806 01:18:49,330 --> 01:18:52,390 So that means that you need to decrease your tolerance. 1807 01:18:52,390 --> 01:18:55,490 And therefore, it's more expensive to survey that. 1808 01:18:55,490 --> 01:18:56,350 OK, good. 1809 01:18:56,350 --> 01:19:00,220 Trip time-- 10% for routes with less than 20 minutes, 1810 01:19:00,220 --> 01:19:03,490 5% with routes of greater than 20 minutes. 1811 01:19:03,490 --> 01:19:06,100 Similar concept if you have greater than 20 minutes-- 1812 01:19:09,540 --> 01:19:11,170 there can be just more variability, 1813 01:19:11,170 --> 01:19:14,860 and you really want to get that right. 1814 01:19:14,860 --> 01:19:16,900 When you have less than 20 minutes, 1815 01:19:16,900 --> 01:19:20,080 your decision on cycle times and things 1816 01:19:20,080 --> 01:19:22,960 like this are not going to have as much impact on the fleet 1817 01:19:22,960 --> 01:19:24,550 size that you require. 1818 01:19:24,550 --> 01:19:31,720 As you get bigger running times, a small percentage change 1819 01:19:31,720 --> 01:19:34,600 in the mean could influence how many buses 1820 01:19:34,600 --> 01:19:37,750 you need to dedicate to that and the cost of running 1821 01:19:37,750 --> 01:19:39,990 that service. 1822 01:19:39,990 --> 01:19:43,610 On-time performance-- 10% absolute equivalent tolerance. 1823 01:19:43,610 --> 01:19:47,260 These are typical values-- don't take them as gospel, please. 
1824 01:19:47,260 --> 01:19:49,530 And these are for reporting, not for anything that's 1825 01:19:49,530 --> 01:19:51,750 very critical for operations. 1826 01:19:51,750 --> 01:19:53,560 Some of them are. 1827 01:19:53,560 --> 01:19:56,820 Yeah, 30% at least, I would say, is for reporting. 1828 01:19:56,820 --> 01:20:00,840 I wouldn't make any critical decisions with 30%. 1829 01:20:00,840 --> 01:20:04,410 On-time performance-- we're talking here about whether 1830 01:20:04,410 --> 01:20:07,060 a trip is on time or not on time-- 1831 01:20:07,060 --> 01:20:08,970 so Bernoulli trials, right? 1832 01:20:08,970 --> 01:20:10,650 And there's a proportion of trips 1833 01:20:10,650 --> 01:20:15,120 that are on time, and what we do is that, we essentially 1834 01:20:15,120 --> 01:20:19,710 say plus-- if we say plus minus 10%, then we're saying 1835 01:20:19,710 --> 01:20:22,980 that the sample size should be 1 over 0.1. 1836 01:20:22,980 --> 01:20:23,480 Yeah. 1837 01:20:26,330 --> 01:20:28,160 All right, default coefficient-- these 1838 01:20:28,160 --> 01:20:29,576 are default values for coefficient 1839 01:20:29,576 --> 01:20:30,950 of variation of key data items. 1840 01:20:30,950 --> 01:20:33,170 Ideally, you have your own data that you look at, 1841 01:20:33,170 --> 01:20:34,790 and you don't resort to this. 1842 01:20:34,790 --> 01:20:37,910 But if you ever find yourself in a situation 1843 01:20:37,910 --> 01:20:40,470 where you need to start out with something. 1844 01:20:40,470 --> 01:20:45,440 Here are some based on studies that previous [AUDIO OUT] They 1845 01:20:45,440 --> 01:20:48,590 took different routes and looked at loads 1846 01:20:48,590 --> 01:20:51,440 and running times for different time periods 1847 01:20:51,440 --> 01:20:53,570 and found what the coefficients of variations were. 1848 01:20:53,570 --> 01:20:56,420 And here they are on a table for you to use. 
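To make the link between tolerance, coefficient of variation, and sample size concrete, here is a minimal sketch using the standard textbook formulas for a mean and for a proportion. These formulas, the 95% confidence level (z = 1.96), the worst-case p = 0.5, and the example coefficient of variation of 0.3 are all assumptions for illustration; the course slides may use a different form or different defaults.

```python
import math

Z_95 = 1.96  # assumed two-sided 95% confidence level


def n_for_mean(cv: float, rel_tol: float, z: float = Z_95) -> int:
    """Sample size for a mean with coefficient of variation cv and
    relative tolerance rel_tol (e.g. 0.10 for plus or minus 10%)."""
    return math.ceil((z * cv / rel_tol) ** 2)


def n_for_proportion(abs_tol: float, p: float = 0.5, z: float = Z_95) -> int:
    """Sample size for a proportion with absolute tolerance abs_tol.
    p = 0.5 is the worst case when the true proportion is unknown."""
    return math.ceil(z ** 2 * p * (1 - p) / abs_tol ** 2)


# Hypothetical running-time case: cv = 0.3, tolerance of plus or minus 10%.
print(n_for_mean(cv=0.3, rel_tol=0.10))   # 35
# On-time performance as Bernoulli trials, plus or minus 10% absolute.
print(n_for_proportion(abs_tol=0.10))     # 97
```

Halving the tolerance roughly quadruples the sample size in both formulas, which is why tighter tolerances get expensive quickly.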
1849 01:21:00,947 --> 01:21:03,530 In the interest of time, since I want to discuss the homework, 1850 01:21:03,530 --> 01:21:05,750 I'm going to stop here with slide 25. 1851 01:21:05,750 --> 01:21:11,480 And I'm going to not cover the whole process, which 1852 01:21:11,480 --> 01:21:14,880 includes the monitoring phase. 1853 01:21:14,880 --> 01:21:17,790 And in this slide here, we have how you 1854 01:21:17,790 --> 01:21:20,010 establish a conversion factor. 1855 01:21:20,010 --> 01:21:23,870 The conversion factor in itself has a variance. 1856 01:21:23,870 --> 01:21:26,130 So there's some uncertainty about the relationship 1857 01:21:26,130 --> 01:21:31,380 that you estimate between your baseline data item 1858 01:21:31,380 --> 01:21:33,550 and your auxiliary data item. 1859 01:21:33,550 --> 01:21:37,210 So you need to consider that in your sample size. 1860 01:21:37,210 --> 01:21:39,720 And here are some tables with some examples of what 1861 01:21:39,720 --> 01:21:44,130 happens when you require different-- well, when your 1862 01:21:44,130 --> 01:21:47,940 variability or your coefficient 1863 01:21:47,940 --> 01:21:52,650 of variation of your relationship increases 1864 01:21:52,650 --> 01:21:54,390 or decreases. 1865 01:21:54,390 --> 01:21:56,430 OK, let's look at the homework. 1866 01:21:56,430 --> 01:21:59,893 I really want to use these last five minutes for that. 1867 01:21:59,893 --> 01:22:07,560 So please take one and pass. 1868 01:22:07,560 --> 01:22:12,300 OK, so the MBTA, there's a proposal here in Boston 1869 01:22:12,300 --> 01:22:14,850 of taking Route 70 and 70A-- 1870 01:22:14,850 --> 01:22:17,370 they run through Waltham, and they 1871 01:22:17,370 --> 01:22:20,130 go into around Central Square.
1872 01:22:20,130 --> 01:22:23,690 And some people are saying those two routes should be extended 1873 01:22:23,690 --> 01:22:28,430 to Kendall Square, because a lot of people 1874 01:22:28,430 --> 01:22:31,700 are actually going to MIT, or Kendall Square, or the Kendall 1875 01:22:31,700 --> 01:22:33,890 Square area-- 1876 01:22:33,890 --> 01:22:37,280 not just Kendall Square Station, but the whole area around. 1877 01:22:37,280 --> 01:22:39,350 So if it's true, a lot of people could 1878 01:22:39,350 --> 01:22:40,670 benefit from that extension. 1879 01:22:40,670 --> 01:22:41,670 And we don't know. 1880 01:22:41,670 --> 01:22:43,140 So what are you going to do? 1881 01:22:43,140 --> 01:22:45,620 You're going to go to a specific stop 1882 01:22:45,620 --> 01:22:48,620 where it is very likely that the people who would be going 1883 01:22:48,620 --> 01:22:52,430 to MIT or those areas of Kendall Square that would benefit 1884 01:22:52,430 --> 01:22:55,040 from this extension would alight, 1885 01:22:55,040 --> 01:22:57,140 and you're going to ask people, would you 1886 01:22:57,140 --> 01:23:01,250 have stayed on your bus if this bus had continued 1887 01:23:01,250 --> 01:23:02,960 to MIT and Kendall Square? 1888 01:23:02,960 --> 01:23:06,635 It's a simple oral survey, yes or no question, one question. 1889 01:23:06,635 --> 01:23:08,510 You're going to work in teams of four people. 1890 01:23:13,130 --> 01:23:16,670 The stop that you're going to station yourself in 1891 01:23:16,670 --> 01:23:18,296 is shown in figure 3. 1892 01:23:21,230 --> 01:23:23,360 And you're going to collect data for the AM peak, 1893 01:23:23,360 --> 01:23:25,760 from 7:30 to 9:30. 1894 01:23:25,760 --> 01:23:27,320 You pick the day. 1895 01:23:27,320 --> 01:23:29,090 The teams are assigned on Stellar, 1896 01:23:29,090 --> 01:23:32,150 so please log into Stellar and see what your team is 1897 01:23:32,150 --> 01:23:34,910 and coordinate with them to pick a day.
1898 01:23:34,910 --> 01:23:37,580 And tell me what that day is, because-- 1899 01:23:37,580 --> 01:23:39,410 actually, right after class, I'm going 1900 01:23:39,410 --> 01:23:43,370 to set up a shared spreadsheet that you can all access. 1901 01:23:43,370 --> 01:23:46,519 And just go into that spreadsheet and pick a day. 1902 01:23:46,519 --> 01:23:48,560 I'm going to put all the days that are available, 1903 01:23:48,560 --> 01:23:51,170 and you can say team 1, team 2, et cetera. 1904 01:23:51,170 --> 01:23:54,410 Make sure that two teams don't go on the same day. 1905 01:23:54,410 --> 01:23:56,400 We want data from different days. 1906 01:23:56,400 --> 01:23:58,400 And you're going to all bring that data together 1907 01:23:58,400 --> 01:24:00,650 in that same spreadsheet, and there 1908 01:24:00,650 --> 01:24:03,650 are some questions for you to analyze 1909 01:24:03,650 --> 01:24:06,230 the data that you collected, all of the class collected 1910 01:24:06,230 --> 01:24:08,540 together. 1911 01:24:08,540 --> 01:24:12,980 You're measuring the percent of people who would 1912 01:24:12,980 --> 01:24:14,450 have stayed on the bus, right? 1913 01:24:14,450 --> 01:24:18,730 So it's a proportion. 1914 01:24:18,730 --> 01:24:22,960 And one submission per team in PDF format to Stellar. 1915 01:24:22,960 --> 01:24:26,410 This is due March 7, but in order 1916 01:24:26,410 --> 01:24:28,830 to leave you enough time to do the analysis, 1917 01:24:28,830 --> 01:24:31,590 the data collection efforts should be done by February 28. 1918 01:24:31,590 --> 01:24:37,150 So please submit your data by the end of Tuesday, February 28 1919 01:24:37,150 --> 01:24:41,230 at midnight, say, or sometime before the beginning of March 1920 01:24:41,230 --> 01:24:43,300 in the morning, where a person would 1921 01:24:43,300 --> 01:24:45,100 be trying to analyze your data. 1922 01:24:48,510 --> 01:24:51,280 OK, if you have questions, let me know. 
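Since the quantity being measured is a proportion of yes answers pooled across teams and days, the basic estimate can be sketched as below. The counts are made up for illustration, the function name is hypothetical, and the normal-approximation interval at an assumed 95% level treats every respondent as an independent Bernoulli trial, ignoring any day-to-day clustering.

```python
import math


def pooled_proportion(counts, z=1.96):
    """Pool (yes, total) counts and return (p_hat, lower, upper).

    Uses a normal-approximation interval; z = 1.96 assumes 95% confidence.
    """
    yes = sum(y for y, _ in counts)
    n = sum(t for _, t in counts)
    p_hat = yes / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical (yes, total) counts from three teams on three different days.
team_counts = [(12, 40), (9, 35), (15, 50)]
p_hat, low, high = pooled_proportion(team_counts)
print(f"would have stayed on: {p_hat:.3f} (approx. 95% CI {low:.3f} to {high:.3f})")
```

The width of the interval shrinks with the square root of the pooled sample size, which is one reason for combining data from all teams.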
1923 01:24:51,280 --> 01:24:56,250 And if not, have fun. 1924 01:24:56,250 --> 01:24:58,910 Remember that assignment 1 is due Thursday. 1925 01:25:01,531 --> 01:25:02,030 Eric? 1926 01:25:02,030 --> 01:25:03,822 AUDIENCE: Just the one question: [? is that ?] [? this is ?] 1927 01:25:03,822 --> 01:25:06,140 going to miss anyone who is transferred to the Red Line 1928 01:25:06,140 --> 01:25:07,746 to then go to Kendall Square. 1929 01:25:07,746 --> 01:25:09,620 GABRIEL SANCHEZ-MARTINEZ: And going back to-- 1930 01:25:09,620 --> 01:25:10,120 let's see. 1931 01:25:14,430 --> 01:25:17,660 I forget where I had it. 1932 01:25:17,660 --> 01:25:20,630 Well, I guess what I-- there was a point I made earlier 1933 01:25:20,630 --> 01:25:23,730 where we can measure that from automatically collected data, 1934 01:25:23,730 --> 01:25:24,230 right? 1935 01:25:24,230 --> 01:25:24,950 AUDIENCE: OK. 1936 01:25:24,950 --> 01:25:25,610 GABRIEL SANCHEZ-MARTINEZ: Does that make sense? 1937 01:25:25,610 --> 01:25:27,860 AUDIENCE: Yeah, people who [? car up ?] come from 70. 1938 01:25:27,860 --> 01:25:29,235 GABRIEL SANCHEZ-MARTINEZ: So if I 1939 01:25:29,235 --> 01:25:32,410 see you tapping on the 70 or the 70A, 1940 01:25:32,410 --> 01:25:35,480 and then I see you tapping at Central Square, 1941 01:25:35,480 --> 01:25:37,790 I can infer that you were using the service 1942 01:25:37,790 --> 01:25:40,790 to transfer to Central Square. 1943 01:25:40,790 --> 01:25:42,950 And then we'll cover ODX, which is 1944 01:25:42,950 --> 01:25:44,480 an inference model for destinations 1945 01:25:44,480 --> 01:25:46,210 later in this course. 1946 01:25:46,210 --> 01:25:51,170 But looking at the sequence of taps, I can infer-- 1947 01:25:51,170 --> 01:25:53,420 we can infer-- what the destination of that bus trip 1948 01:25:53,420 --> 01:25:53,919 was. 1949 01:25:53,919 --> 01:25:55,720 We can infer that it was the stop that 1950 01:25:55,720 --> 01:25:57,440 was closest to Central.
1951 01:25:57,440 --> 01:26:00,710 And later that day, presumably the person 1952 01:26:00,710 --> 01:26:04,310 who might be going to Kendall Square Station after work taps 1953 01:26:04,310 --> 01:26:05,199 to Kendall Square. 1954 01:26:05,199 --> 01:26:07,490 So I might think, oh, he took the Red Line from Central 1955 01:26:07,490 --> 01:26:09,500 to Kendall. 1956 01:26:09,500 --> 01:26:12,320 So I don't need to ask those people where they're going. 1957 01:26:12,320 --> 01:26:14,880 And anyway, they might not care about this extension. 1958 01:26:14,880 --> 01:26:17,570 So we're going to stand on the bus stop that 1959 01:26:17,570 --> 01:26:21,590 is after Central Square and see where those people are going 1960 01:26:21,590 --> 01:26:25,350 and whether they would have stayed on that bus. 1961 01:26:25,350 --> 01:26:28,119 AUDIENCE: Is this an actual [INAUDIBLE] 1962 01:26:28,119 --> 01:26:30,410 GABRIEL SANCHEZ-MARTINEZ: Some people are proposing it. 1963 01:26:30,410 --> 01:26:33,060 It is a real proposal. 1964 01:26:33,060 --> 01:26:35,230 The MBTA is a big organization. 1965 01:26:35,230 --> 01:26:41,490 So I can't say that the MBTA wants to do this 1966 01:26:41,490 --> 01:26:43,210 or doesn't want to do this. 1967 01:26:43,210 --> 01:26:45,090 But some people are interested. 1968 01:26:45,090 --> 01:26:48,630 And it will get looked into. 1969 01:26:48,630 --> 01:26:50,770 So it's useful. 1970 01:26:50,770 --> 01:26:53,195 AUDIENCE: [? Can ?] [? we ?] [? share ?] [INAUDIBLE] 1971 01:26:53,195 --> 01:26:56,105 GABRIEL SANCHEZ-MARTINEZ: Yeah, why not? 1972 01:26:56,105 --> 01:26:58,045 AUDIENCE: [INAUDIBLE] 1973 01:26:58,045 --> 01:27:00,470 GABRIEL SANCHEZ-MARTINEZ: Yeah. 1974 01:27:00,470 --> 01:27:03,320 And I guess one other thing that I-- yeah, 1975 01:27:03,320 --> 01:27:06,250 so we're going to probably make of this 1976 01:27:06,250 --> 01:27:08,594 like a theme of assignments. 
1977 01:27:08,594 --> 01:27:10,760 So there's going to be another assignment on service 1978 01:27:10,760 --> 01:27:12,510 planning, operations planning. 1979 01:27:12,510 --> 01:27:15,160 So we're going to start looking at this combination of Route 70 1980 01:27:15,160 --> 01:27:19,760 and 70A, and we're going to essentially make 1981 01:27:19,760 --> 01:27:22,520 a thread of this and do some serious planning 1982 01:27:22,520 --> 01:27:26,300 on some scenarios where the 70 and the 70A could be merged. 1983 01:27:26,300 --> 01:27:29,860 And they could maybe be terminated a little-- 1984 01:27:29,860 --> 01:27:32,810 yeah, we'll make some changes to the service plan 1985 01:27:32,810 --> 01:27:34,440 under some hypothetical scenarios. 1986 01:27:34,440 --> 01:27:38,840 And you'll get a chance to do an operations plan on these. 1987 01:27:38,840 --> 01:27:41,550 And then the last homework will be on policy, 1988 01:27:41,550 --> 01:27:44,170 so there might be some policy questions 1989 01:27:44,170 --> 01:27:47,930 that I have in mind about what we could do about 1990 01:27:47,930 --> 01:27:52,640 service outside, on the outer parts of the 70 and 70A. 1991 01:27:57,440 --> 01:27:59,290 All right?