1 00:00:01,130 --> 00:00:03,470 The following content is provided under a Creative 2 00:00:03,470 --> 00:00:04,860 Commons license. 3 00:00:04,860 --> 00:00:07,070 Your support will help MIT OpenCourseWare 4 00:00:07,070 --> 00:00:11,160 continue to offer high-quality educational resources for free. 5 00:00:11,160 --> 00:00:13,730 To make a donation or to view additional materials 6 00:00:13,730 --> 00:00:17,690 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,690 --> 00:00:18,570 at ocw.mit.edu. 8 00:00:21,860 --> 00:00:26,200 GABRIEL SANCHEZ-MARTINEZ: I'll start today with an animation. 9 00:00:26,200 --> 00:00:28,670 I think most of you have seen this. 10 00:00:28,670 --> 00:00:32,140 Raise your hand if you haven't. 11 00:00:32,140 --> 00:00:33,641 So some of you have not. 12 00:00:33,641 --> 00:00:34,140 OK. 13 00:00:34,140 --> 00:00:37,020 So I'm going to just play it. 14 00:00:37,020 --> 00:00:38,850 This is London. 15 00:00:38,850 --> 00:00:41,650 And you're going to see different colors. 16 00:00:47,880 --> 00:00:52,170 There's a legend right here on this corner, lower left. 17 00:00:52,170 --> 00:00:58,790 And blue stands for some cardholder 18 00:00:58,790 --> 00:01:05,990 in the London system that has not yet been [AUDIO OUT] 19 00:01:05,990 --> 00:01:09,500 or it's after the last time that person was seen. 20 00:01:09,500 --> 00:01:15,270 So it's a proxy for home, essentially. 21 00:01:15,270 --> 00:01:19,620 Bright green is going to be a proxy for travel. 22 00:01:19,620 --> 00:01:23,700 Not a proxy, it's going to mean that this card is currently 23 00:01:23,700 --> 00:01:29,400 in the TFL system in a bus ride or in a train somewhere. 24 00:01:29,400 --> 00:01:39,570 And then red will show anything between trips that day. 25 00:01:39,570 --> 00:01:41,280 So that's a proxy for work. 26 00:01:41,280 --> 00:01:43,260 Or it could be a proxy for shopping, 27 00:01:43,260 --> 00:01:48,870 or restaurants, or anything between travel. 
28 00:01:48,870 --> 00:01:51,500 So after the first trip is completed, 29 00:01:51,500 --> 00:01:53,500 and before the last trip begins. 30 00:01:53,500 --> 00:01:55,500 So I'm just going to play this. 31 00:01:55,500 --> 00:01:59,001 And hopefully this will motivate the discussion, 32 00:01:59,001 --> 00:02:00,000 the rest of the lecture. 33 00:02:00,000 --> 00:02:04,030 So you can see the time at the bottom. 34 00:02:04,030 --> 00:02:06,020 Sorry. 35 00:02:06,020 --> 00:02:11,310 So you see the morning rush, and then people starting to work, 36 00:02:11,310 --> 00:02:12,630 so it turns red. 37 00:02:12,630 --> 00:02:17,442 That means most people are between trips. 38 00:02:17,442 --> 00:02:18,900 You see a lot of buzz in the middle 39 00:02:18,900 --> 00:02:21,530 of the city, middle of the day. 40 00:02:21,530 --> 00:02:23,400 And then as we approach the afternoon peak, 41 00:02:23,400 --> 00:02:27,160 you start seeing more green activity, 42 00:02:27,160 --> 00:02:28,750 starting from the center, going out. 43 00:02:32,210 --> 00:02:34,580 And then some blue as people reach their homes 44 00:02:34,580 --> 00:02:37,280 and won't travel again that day. 45 00:02:37,280 --> 00:02:42,470 Still a lot of activity in some centers, especially 46 00:02:42,470 --> 00:02:45,980 in the center of London, a lot of travel still. 47 00:02:45,980 --> 00:02:48,590 And then past midnight, you sort of see Soho, 48 00:02:48,590 --> 00:02:51,670 so you know where to hang out in London. 49 00:02:51,670 --> 00:02:53,370 OK? 50 00:02:53,370 --> 00:02:54,200 OK. 51 00:02:54,200 --> 00:03:01,440 So before we continue, any questions about that video? 52 00:03:01,440 --> 00:03:05,380 AUDIENCE: Yeah, it seemed to me that some of the dots 53 00:03:05,380 --> 00:03:07,954 were actually not moving along the lines. 54 00:03:07,954 --> 00:03:08,746 Is that deliberate? 55 00:03:08,746 --> 00:03:10,162 GABRIEL SANCHEZ-MARTINEZ: So yeah. 
56 00:03:10,162 --> 00:03:10,805 In this video-- 57 00:03:10,805 --> 00:03:11,440 AUDIENCE: [INAUDIBLE] 58 00:03:11,440 --> 00:03:12,706 GABRIEL SANCHEZ-MARTINEZ: --there are multiple ways-- 59 00:03:12,706 --> 00:03:13,400 AUDIENCE: --magnitude? 60 00:03:13,400 --> 00:03:13,870 GABRIEL SANCHEZ-MARTINEZ: Yeah. 61 00:03:13,870 --> 00:03:16,780 So there are multiple ways of generating this visualization. 62 00:03:16,780 --> 00:03:21,930 And the one that my colleague used to make this-- 63 00:03:21,930 --> 00:03:23,590 and by the way, this video was made 64 00:03:23,590 --> 00:03:25,720 by my colleague, Jay Gordon. 65 00:03:25,720 --> 00:03:28,750 You'll see in the last slide, the references to his papers 66 00:03:28,750 --> 00:03:31,780 and to the website with a link to the video. 67 00:03:31,780 --> 00:03:36,560 So yeah, for each stage-- 68 00:03:36,560 --> 00:03:38,410 and we'll talk about what the stage is. 69 00:03:38,410 --> 00:03:41,710 For each of these trips, for now, you 70 00:03:41,710 --> 00:03:44,410 could do it as a straight line or you could really interpolate 71 00:03:44,410 --> 00:03:46,300 geographically along the line. 72 00:03:46,300 --> 00:03:50,820 And in some aspects, the straight line one 73 00:03:50,820 --> 00:03:53,510 is showing isochrones almost. 74 00:03:53,510 --> 00:03:59,170 So it's easier to understand, visually, the OD pairs 75 00:03:59,170 --> 00:04:01,160 when you do it that way. 76 00:04:01,160 --> 00:04:02,200 But both have value. 77 00:04:02,200 --> 00:04:04,780 And you could look at it both ways. 78 00:04:04,780 --> 00:04:08,045 Any other questions about this animation visualization? 79 00:04:12,590 --> 00:04:13,090 OK. 80 00:04:13,090 --> 00:04:14,420 AUDIENCE: [INAUDIBLE] 81 00:04:14,420 --> 00:04:15,711 GABRIEL SANCHEZ-MARTINEZ: Yeah. 82 00:04:15,711 --> 00:04:18,829 Let's talk about how we made that. 
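The straight-line option mentioned in the answer above can be sketched in a few lines. This is only an illustration of the idea, not the code behind the video: the function name, the coordinate tuples, and the use of plain numeric times are all assumptions made for the example.

```python
def straight_line_position(origin, destination, t_start, t_end, t):
    """Place a dot on the straight line between origin and destination.

    origin and destination are (x, y) tuples; t_start and t_end are the
    start and end times of the trip stage; t is the current animation time.
    All names here are hypothetical, chosen for this sketch.
    """
    if t_end <= t_start:
        # Degenerate stage with no duration: show the dot at the destination.
        return destination
    # Fraction of the stage completed, clamped to the range [0, 1].
    frac = min(1.0, max(0.0, (t - t_start) / (t_end - t_start)))
    x = origin[0] + frac * (destination[0] - origin[0])
    y = origin[1] + frac * (destination[1] - origin[1])
    return (x, y)


# Halfway through a stage, the dot sits at the midpoint of the OD pair.
midpoint = straight_line_position((0.0, 0.0), (10.0, 20.0), 0.0, 100.0, 50.0)
```

Interpolating geographically along the route would replace the single segment with the route's shape points, which is why the dots in that version follow the lines.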
83 00:04:18,829 --> 00:04:23,060 And so what goes into creating this visualization, 84 00:04:23,060 --> 00:04:25,570 and what data was used for it. 85 00:04:25,570 --> 00:04:29,750 So today's lecture is on origin destination and transfer 86 00:04:29,750 --> 00:04:31,220 inference. 87 00:04:31,220 --> 00:04:35,210 We abbreviate that ODX, O for origin, D for destination, 88 00:04:35,210 --> 00:04:39,950 X being more graphical, two lines crossing in the middle. 89 00:04:39,950 --> 00:04:43,280 So your opportunity to transfer or interchange. 90 00:04:43,280 --> 00:04:46,550 And if we use the British term for transfers, 91 00:04:46,550 --> 00:04:47,900 that would be interchange. 92 00:04:47,900 --> 00:04:52,344 So what's common in all of these methods 93 00:04:52,344 --> 00:04:54,260 is that we're going to be basing these methods 94 00:04:54,260 --> 00:04:56,690 on automatically-collected data. 95 00:04:56,690 --> 00:04:58,400 So we're going to be using AFC, AVL, 96 00:04:58,400 --> 00:05:00,410 APC, instead of manual surveys. 97 00:05:00,410 --> 00:05:03,710 There are ways of estimating origin destination matrices 98 00:05:03,710 --> 00:05:07,130 in the traditional way, with manual surveys. 99 00:05:07,130 --> 00:05:10,910 You go out, and you distribute colored cards and collect them, 100 00:05:10,910 --> 00:05:12,950 and you can do this. 101 00:05:12,950 --> 00:05:17,360 So a little bit about that when we talk about surveys 102 00:05:17,360 --> 00:05:18,350 and survey planning. 103 00:05:22,509 --> 00:05:24,800 Some of these methods can be used to infer destinations 104 00:05:24,800 --> 00:05:26,060 in open systems. 105 00:05:26,060 --> 00:05:29,240 So open systems are like the bus and rail, 106 00:05:29,240 --> 00:05:34,760 here in Boston, where you tap in, but you don't tap out. 107 00:05:34,760 --> 00:05:39,350 If you look at the rail system in London or in Washington DC, 108 00:05:39,350 --> 00:05:42,830 that's a closed system, where you tap in and tap out. 
109 00:05:42,830 --> 00:05:44,870 And therefore, the OD pair for each trip 110 00:05:44,870 --> 00:05:48,050 is given in that part of the system. 111 00:05:48,050 --> 00:05:49,730 It also infers transfers. 112 00:05:49,730 --> 00:05:53,630 So we'll talk about why this is important. 113 00:05:53,630 --> 00:05:55,190 One of the caveats of these methods 114 00:05:55,190 --> 00:05:58,395 is that they only look at the current public transportation 115 00:05:58,395 --> 00:05:59,250 demand. 116 00:05:59,250 --> 00:06:04,550 So if you want a model of all of the demand that could be there, 117 00:06:04,550 --> 00:06:08,400 the latent demand included, this does not look at that. 118 00:06:08,400 --> 00:06:10,450 So just be aware. 119 00:06:10,450 --> 00:06:15,860 And also, specifically, one of the models we will look at, 120 00:06:15,860 --> 00:06:18,650 it can't infer destinations for every transaction. 121 00:06:18,650 --> 00:06:19,910 Some of that makes sense. 122 00:06:19,910 --> 00:06:22,480 If you see a card only one time a day, 123 00:06:22,480 --> 00:06:25,640 then you don't necessarily have information 124 00:06:25,640 --> 00:06:28,520 to infer destination or transfers. 125 00:06:28,520 --> 00:06:30,560 That also happens with cash transactions, 126 00:06:30,560 --> 00:06:35,000 which cannot be tracked using a smart card. 127 00:06:35,000 --> 00:06:35,940 There's fare evasion. 128 00:06:35,940 --> 00:06:37,470 So some people might jump into a bus 129 00:06:37,470 --> 00:06:40,220 and not interact with the fare system, 130 00:06:40,220 --> 00:06:43,100 so we can't capture that directly. 131 00:06:43,100 --> 00:06:44,810 And then there's trips on other modes. 132 00:06:44,810 --> 00:06:49,400 So part of the logic applying to destinations and transfers 133 00:06:49,400 --> 00:06:54,576 will, essentially, assume that the people are mostly traveling 134 00:06:54,576 --> 00:06:55,700 with public transportation. 
135 00:06:55,700 --> 00:06:57,950 And they're not going to take Uber or a long bike ride 136 00:06:57,950 --> 00:06:58,670 in between. 137 00:06:58,670 --> 00:07:02,210 So we will look at that more in detail. 138 00:07:02,210 --> 00:07:05,390 Most of these methods have been validated with surveys, 139 00:07:05,390 --> 00:07:06,920 with good results. 140 00:07:06,920 --> 00:07:09,560 And they need to be scaled up because they don't 141 00:07:09,560 --> 00:07:11,414 make inferences for every card. 142 00:07:11,414 --> 00:07:13,580 If you want the full demand, you need to scale it up 143 00:07:13,580 --> 00:07:14,540 to the full demand. 144 00:07:14,540 --> 00:07:16,160 We'll talk about scaling. 145 00:07:16,160 --> 00:07:16,730 Questions? 146 00:07:16,730 --> 00:07:20,698 AUDIENCE: [INAUDIBLE] but with London, fare evasion, 147 00:07:20,698 --> 00:07:24,666 that's not a big problem, right? 148 00:07:24,666 --> 00:07:25,465 Or is-- 149 00:07:25,465 --> 00:07:27,090 GABRIEL SANCHEZ-MARTINEZ: I don't know. 150 00:07:27,090 --> 00:07:27,631 I don't know. 151 00:07:27,631 --> 00:07:28,465 AUDIENCE: All right. 152 00:07:28,465 --> 00:07:29,839 GABRIEL SANCHEZ-MARTINEZ: I don't 153 00:07:29,839 --> 00:07:31,320 have information about that. 154 00:07:33,850 --> 00:07:35,880 So it could be fare evasion, or it could be-- 155 00:07:35,880 --> 00:07:38,610 you might have a pass and hop onto the bus, 156 00:07:38,610 --> 00:07:40,860 and technically, it wouldn't be fare evasion. 157 00:07:40,860 --> 00:07:44,070 But it is non-interaction, so that would still 158 00:07:44,070 --> 00:07:46,799 be counted here as fare evasion. 159 00:07:46,799 --> 00:07:48,840 AUDIENCE: The more open a system, the more people 160 00:07:48,840 --> 00:07:50,400 will manage to evade. 161 00:07:50,400 --> 00:07:53,420 So I saw a number, then, the Boston commuter rail, 162 00:07:53,420 --> 00:07:54,660 they estimate 14%. 163 00:07:54,660 --> 00:07:55,000 GABRIEL SANCHEZ-MARTINEZ: Yeah. 
164 00:07:55,000 --> 00:07:56,310 AUDIENCE: Yeah, that number's a lie, but yes. 165 00:07:56,310 --> 00:07:56,850 GABRIEL SANCHEZ-MARTINEZ: But that's-- 166 00:07:56,850 --> 00:07:57,945 AUDIENCE: --a lot or-- 167 00:07:57,945 --> 00:08:00,510 AUDIENCE: Yeah, 70% of people pay with a pass, so it's-- 168 00:08:00,510 --> 00:08:01,843 GABRIEL SANCHEZ-MARTINEZ: Right. 169 00:08:01,843 --> 00:08:05,670 So there, they are using fare evasion more overtly. 170 00:08:05,670 --> 00:08:07,770 It is high, though, because the train attendants 171 00:08:07,770 --> 00:08:10,560 don't manage to collect tickets from everyone. 172 00:08:10,560 --> 00:08:12,859 And there are ways to game the system, where 173 00:08:12,859 --> 00:08:15,150 you can activate a ticket only if the fare inspector is 174 00:08:15,150 --> 00:08:15,990 approaching you. 175 00:08:15,990 --> 00:08:18,420 So that's a flaw in the system. 176 00:08:18,420 --> 00:08:19,650 AUDIENCE: [INAUDIBLE] 177 00:08:19,650 --> 00:08:21,840 GABRIEL SANCHEZ-MARTINEZ: Yeah. 178 00:08:21,840 --> 00:08:26,670 So more generally, when we talk about data collection systems-- 179 00:08:26,670 --> 00:08:29,932 we've seen the key ones, AVL, AFC, APC. 180 00:08:29,932 --> 00:08:31,890 I think we're all familiar with how these look. 181 00:08:31,890 --> 00:08:35,039 Does anybody have questions on any of these systems right now? 182 00:08:35,039 --> 00:08:37,262 You were looking at some data in your homework. 183 00:08:37,262 --> 00:08:38,220 Do you have a question? 184 00:08:38,220 --> 00:08:38,761 AUDIENCE: No. 185 00:08:38,761 --> 00:08:40,559 GABRIEL SANCHEZ-MARTINEZ: OK. 186 00:08:40,559 --> 00:08:42,780 So they can be used for many things. 187 00:08:42,780 --> 00:08:46,510 And so if you look at supply and demand, 188 00:08:46,510 --> 00:08:49,350 they both produce automatically-collected data. 189 00:08:49,350 --> 00:08:52,380 So on the demand side, we have the fare transactions 190 00:08:52,380 --> 00:08:54,360 of the AFC system. 
191 00:08:54,360 --> 00:08:56,490 On the supply side, you have the vehicle 192 00:08:56,490 --> 00:08:58,780 tracking with AVL system. 193 00:08:58,780 --> 00:09:00,060 You have APC as well. 194 00:09:00,060 --> 00:09:05,370 So they can be sent to some server or data warehouse 195 00:09:05,370 --> 00:09:06,820 and used for many things. 196 00:09:06,820 --> 00:09:09,120 It could be used for offline functions. 197 00:09:09,120 --> 00:09:11,340 So performance measurement is one example 198 00:09:11,340 --> 00:09:14,290 of that, where you want to measure reliability, 199 00:09:14,290 --> 00:09:16,050 running times, et cetera. 200 00:09:16,050 --> 00:09:18,510 It could be used for service and operations planning. 201 00:09:18,510 --> 00:09:20,610 And then it can also be used in real-time. 202 00:09:20,610 --> 00:09:22,440 So you could use some of this information 203 00:09:22,440 --> 00:09:24,960 to generate customer information, which feeds back 204 00:09:24,960 --> 00:09:25,650 to demand. 205 00:09:25,650 --> 00:09:28,140 You could send alerts saying the trains are 206 00:09:28,140 --> 00:09:29,640 being delayed right now. 207 00:09:29,640 --> 00:09:31,170 Expect a longer wait. 208 00:09:31,170 --> 00:09:33,510 And that could, in fact, influence demand. 209 00:09:33,510 --> 00:09:38,580 It could make some people not take a particular trip 210 00:09:38,580 --> 00:09:41,670 or wait if they get it on their phone. 211 00:09:41,670 --> 00:09:44,610 And then on the supply side, the information 212 00:09:44,610 --> 00:09:46,870 can be used to control service. 213 00:09:46,870 --> 00:09:48,990 So you might actually affect supply 214 00:09:48,990 --> 00:09:51,090 by changing the departure times. 215 00:09:51,090 --> 00:09:53,009 And that would affect the data that is being 216 00:09:53,009 --> 00:09:54,300 generated from the supply side. 217 00:09:54,300 --> 00:09:55,740 So there's a feedback loop. 
218 00:09:55,740 --> 00:10:00,060 In this lecture, we will focus on only one 219 00:10:00,060 --> 00:10:01,980 aspect of this framework, and that 220 00:10:01,980 --> 00:10:03,870 is origin destination matrices. 221 00:10:03,870 --> 00:10:05,820 Origin destination matrices are 222 00:10:05,820 --> 00:10:07,710 one of the key inputs to service planning. 223 00:10:07,710 --> 00:10:11,520 They are the key demand input to service planning. 224 00:10:11,520 --> 00:10:14,560 You need them to figure out where people want to go. 225 00:10:14,560 --> 00:10:17,970 And it's just the data that expresses 226 00:10:17,970 --> 00:10:19,170 where people want to go. 227 00:10:19,170 --> 00:10:22,560 And you should try to design your system 228 00:10:22,560 --> 00:10:24,180 to match that demand. 229 00:10:24,180 --> 00:10:29,250 So it used to be that we had to use manual surveys. 230 00:10:29,250 --> 00:10:31,240 They were expensive. 231 00:10:31,240 --> 00:10:35,630 They didn't cover all the times or places very well. 232 00:10:35,630 --> 00:10:37,250 And now, with all these automated data 233 00:10:37,250 --> 00:10:40,340 collection systems, we can infer some of the origin destination 234 00:10:40,340 --> 00:10:42,060 matrices from the data. 235 00:10:42,060 --> 00:10:43,850 And that's what we want to understand. 236 00:10:43,850 --> 00:10:45,050 How does it happen? 237 00:10:45,050 --> 00:10:46,440 What can we do? 238 00:10:46,440 --> 00:10:47,250 How can we do it? 239 00:10:47,250 --> 00:10:52,410 So OD matrix estimation can happen at different levels. 240 00:10:52,410 --> 00:10:54,950 One way of looking at it is to think about route 241 00:10:54,950 --> 00:10:56,950 level versus network level. 242 00:10:56,950 --> 00:10:59,600 So route level, we're talking about one bus line. 243 00:10:59,600 --> 00:11:03,320 And we want to look up the trips made in the one bus line 244 00:11:03,320 --> 00:11:05,960 and understand where people get on and off. 
245 00:11:05,960 --> 00:11:06,930 That's route level. 246 00:11:06,930 --> 00:11:10,410 So if you have two routes here, route one, route two, 247 00:11:10,410 --> 00:11:12,920 we might notice or estimate that a person 248 00:11:12,920 --> 00:11:17,340 boards here and alights here. 249 00:11:17,340 --> 00:11:19,400 Some of the people doing that OD pair 250 00:11:19,400 --> 00:11:21,806 may, in fact, continue on route two. 251 00:11:21,806 --> 00:11:23,180 They might transfer to route two. 252 00:11:23,180 --> 00:11:25,940 So what's the drawback? 253 00:11:25,940 --> 00:11:28,850 Or why would it be important to know 254 00:11:28,850 --> 00:11:32,210 the transfer to route two and the destination on route two? 255 00:11:38,135 --> 00:11:39,010 AUDIENCE: [INAUDIBLE] 256 00:11:39,010 --> 00:11:39,940 GABRIEL SANCHEZ-MARTINEZ: Or is it irrelevant? 257 00:11:39,940 --> 00:11:40,990 Is it just cool? 258 00:11:40,990 --> 00:11:42,104 Or-- 259 00:11:42,104 --> 00:11:44,840 AUDIENCE: Maybe the destination of the route two 260 00:11:44,840 --> 00:11:50,260 is the real destination of the person's trips 261 00:11:50,260 --> 00:11:53,610 because if he goes to work, for example, 262 00:11:53,610 --> 00:11:55,917 the work will be at the end of the route two. 263 00:11:55,917 --> 00:11:57,250 GABRIEL SANCHEZ-MARTINEZ: Right. 264 00:11:57,250 --> 00:11:59,680 So the real destination, the place where the person actually 265 00:11:59,680 --> 00:12:02,138 wants to go, might be on the destination route two, and not 266 00:12:02,138 --> 00:12:05,020 on the destination at route one. 267 00:12:05,020 --> 00:12:06,580 So this might be-- 268 00:12:06,580 --> 00:12:13,120 the only reason why the person is alighting there 269 00:12:13,120 --> 00:12:14,920 is because that's the only way-- that's 270 00:12:14,920 --> 00:12:17,140 sort of a function of the network of the supply you 271 00:12:17,140 --> 00:12:17,710 provide. 
272 00:12:17,710 --> 00:12:21,210 So if there were a direct route from the first origin 273 00:12:21,210 --> 00:12:23,920 to the last destination, perhaps they would prefer that. 274 00:12:23,920 --> 00:12:25,870 And for service planning, you want 275 00:12:25,870 --> 00:12:27,700 to know what people want to do. 276 00:12:27,700 --> 00:12:29,560 That's what demand is. 277 00:12:29,560 --> 00:12:32,560 So we want network level. 278 00:12:32,560 --> 00:12:35,620 So we go from unlinked trips to linked trips. 279 00:12:35,620 --> 00:12:37,270 That's part of what we want to do. 280 00:12:37,270 --> 00:12:40,330 At the network level, we are looking at all the buses 281 00:12:40,330 --> 00:12:42,220 and the rail system. 282 00:12:42,220 --> 00:12:45,220 So again, that's what we want to do. 283 00:12:45,220 --> 00:12:48,050 We'll look at both kinds of OD matrix estimation 284 00:12:48,050 --> 00:12:49,220 in this lecture. 285 00:12:49,220 --> 00:12:51,820 Let's start with one of the simpler cases. 286 00:12:51,820 --> 00:12:56,830 Consider a bus having APC and only APC. 287 00:12:56,830 --> 00:12:59,586 So APC data looks like this. 288 00:12:59,586 --> 00:13:01,210 You have timestamps, you have a bus ID, 289 00:13:01,210 --> 00:13:03,820 you have a route ID, a trip ID, some information 290 00:13:03,820 --> 00:13:06,210 about direction, perhaps, and then 291 00:13:06,210 --> 00:13:08,950 counts at each stop of how many people get on 292 00:13:08,950 --> 00:13:12,190 and how many people get off, boardings and alightings. 293 00:13:12,190 --> 00:13:14,320 And, of course, a stop ID, or a stop name, 294 00:13:14,320 --> 00:13:15,370 or something like that. 295 00:13:15,370 --> 00:13:21,250 So you can aggregate that across trips or do it at a trip level, 296 00:13:21,250 --> 00:13:23,890 and count how many people get on at each stop, 297 00:13:23,890 --> 00:13:25,960 and how many people get off at each stop. 
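The per-stop aggregation just described can be sketched as follows. The record layout is an assumption made for illustration: the field names `stop_id`, `ons`, and `offs` are hypothetical and do not come from any particular APC vendor or agency format, and the sample counts are invented.

```python
from collections import defaultdict


def stop_totals(apc_records):
    """Sum boardings and alightings per stop over a set of APC records.

    Each record is a dict with hypothetical keys: 'trip_id', 'stop_id',
    'ons' (boardings counted at the stop), 'offs' (alightings counted).
    """
    boardings = defaultdict(int)
    alightings = defaultdict(int)
    for rec in apc_records:
        boardings[rec["stop_id"]] += rec["ons"]
        alightings[rec["stop_id"]] += rec["offs"]
    return dict(boardings), dict(alightings)


# Two made-up trips on the same three-stop route, aggregated together.
records = [
    {"trip_id": "t1", "stop_id": 1, "ons": 12, "offs": 0},
    {"trip_id": "t1", "stop_id": 2, "ons": 5, "offs": 7},
    {"trip_id": "t1", "stop_id": 3, "ons": 0, "offs": 10},
    {"trip_id": "t2", "stop_id": 1, "ons": 8, "offs": 0},
    {"trip_id": "t2", "stop_id": 2, "ons": 3, "offs": 4},
    {"trip_id": "t2", "stop_id": 3, "ons": 0, "offs": 7},
]
total_ons, total_offs = stop_totals(records)
```

In practice the records would first be filtered to the route, direction, and time period of interest, for example one 30-minute period across several days.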
298 00:13:25,960 --> 00:13:29,680 These are called control totals in the context of scaling. 299 00:13:29,680 --> 00:13:34,510 So you might be aggregating across days, for example, 300 00:13:34,510 --> 00:13:36,790 for a 30-minute period, and count 301 00:13:36,790 --> 00:13:39,340 how many people are getting on at each stop 302 00:13:39,340 --> 00:13:41,200 and getting off at each stop. 303 00:13:41,200 --> 00:13:45,040 And then what we want to estimate, 304 00:13:45,040 --> 00:13:47,470 knowing how many people get on and off at each stop, 305 00:13:47,470 --> 00:13:49,330 is the origin destination matrix. 306 00:13:49,330 --> 00:13:52,540 That is the cells inside of this matrix 307 00:13:52,540 --> 00:13:56,980 saying how many people get on at stop one and off at stop three. 308 00:13:56,980 --> 00:13:58,780 We're showing here, 10. 309 00:13:58,780 --> 00:14:01,990 Now you may notice that the matrix here, it doesn't 310 00:14:01,990 --> 00:14:06,460 necessarily match the totals. 311 00:14:06,460 --> 00:14:11,270 So here we have 35 and 2, so 37, and the target is 40. 312 00:14:11,270 --> 00:14:13,780 So we have to scale it up a little bit. 313 00:14:13,780 --> 00:14:16,240 Here, we have 30 people getting off at stop two, 314 00:14:16,240 --> 00:14:17,821 and we only have 25. 315 00:14:17,821 --> 00:14:18,820 So that number is wrong. 316 00:14:18,820 --> 00:14:20,060 We need to scale up. 317 00:14:20,060 --> 00:14:21,670 So what we're going to look at now 318 00:14:21,670 --> 00:14:25,480 is a procedure called iterative proportional fitting that 319 00:14:25,480 --> 00:14:29,260 estimates, given some control totals, what the origin 320 00:14:29,260 --> 00:14:32,050 destination matrix is. 321 00:14:32,050 --> 00:14:35,320 This is known as biproportional fitting or matrix scaling 322 00:14:35,320 --> 00:14:36,790 as well. 323 00:14:36,790 --> 00:14:41,980 And we start with some initial matrix or some seed matrix 324 00:14:41,980 --> 00:14:43,720 here in the center. 
325 00:14:43,720 --> 00:14:45,910 The value of that seed matrix is important. 326 00:14:45,910 --> 00:14:47,530 It affects the solution. 327 00:14:47,530 --> 00:14:49,930 So having an accurate seed matrix 328 00:14:49,930 --> 00:14:52,480 improves the accuracy of the final estimate. 329 00:14:52,480 --> 00:14:54,790 If you don't have an idea, then you 330 00:14:54,790 --> 00:14:59,110 could certainly initialize that seed matrix with all ones, 331 00:14:59,110 --> 00:15:00,850 and it will produce an output. 332 00:15:00,850 --> 00:15:03,430 But it may not be the best output or the most accurate 333 00:15:03,430 --> 00:15:04,750 result. 334 00:15:04,750 --> 00:15:10,890 So it has been shown that if all the values provided 335 00:15:10,890 --> 00:15:13,870 in the matrix are strictly positive-- 336 00:15:13,870 --> 00:15:18,750 and here I am excluding what we call structural zeros, so 337 00:15:18,750 --> 00:15:23,880 all the cells in which people could actually be traveling. 338 00:15:23,880 --> 00:15:44,460 Here we are showing a route with four stops, A, B, C, D. 339 00:15:44,460 --> 00:15:47,130 And we have a matrix showing how many people go 340 00:15:47,130 --> 00:15:50,790 from A to B, from A to C, from A to D, from B to C, from B to D, 341 00:15:50,790 --> 00:15:52,230 and from C to D. 342 00:15:52,230 --> 00:15:55,590 Those are the only possible OD pairs. 343 00:15:55,590 --> 00:15:58,230 Nobody is going to go from A to A because that's the same stop, 344 00:15:58,230 --> 00:15:59,640 so that's not a valid trip. 345 00:15:59,640 --> 00:16:01,540 And we're only looking at one direction, 346 00:16:01,540 --> 00:16:03,540 so anything below that diagonal would 347 00:16:03,540 --> 00:16:05,910 be in the opposite direction. 348 00:16:05,910 --> 00:16:07,740 And we're not including it in this example. 
349 00:16:10,620 --> 00:16:14,220 So what we want to do is start off with adding up each row, 350 00:16:14,220 --> 00:16:16,987 adding up each column, so we have total boardings 351 00:16:16,987 --> 00:16:17,820 and total alightings. 352 00:16:17,820 --> 00:16:21,150 I want them to match target boardings, and match target alightings, 353 00:16:21,150 --> 00:16:22,770 or the control totals. 354 00:16:22,770 --> 00:16:23,768 Questions? 355 00:16:23,768 --> 00:16:25,720 AUDIENCE: What are the control totals again? 356 00:16:28,125 --> 00:16:29,500 GABRIEL SANCHEZ-MARTINEZ: They're 357 00:16:29,500 --> 00:16:32,050 counts of boardings and counts of alightings at each stop. 358 00:16:32,050 --> 00:16:35,670 And they can, in this example, come from APC. 359 00:16:35,670 --> 00:16:39,080 AUDIENCE: That's what comes from APC. 360 00:16:39,080 --> 00:16:41,930 GABRIEL SANCHEZ-MARTINEZ: So what we do, the algorithm 361 00:16:41,930 --> 00:16:43,880 for iterative proportional fitting, 362 00:16:43,880 --> 00:16:47,240 it calculates a scaling factor for each row. 363 00:16:47,240 --> 00:16:48,800 We start with rows. 364 00:16:48,800 --> 00:16:53,420 And we say, well we need to scale up everything 365 00:16:53,420 --> 00:16:58,010 on the first row by 40 over 3, so that that number adds up 366 00:16:58,010 --> 00:16:59,660 to 40. 367 00:16:59,660 --> 00:17:02,660 And you calculate the scaling factor for each row 368 00:17:02,660 --> 00:17:06,270 and apply it to the cells in the matrix. 369 00:17:06,270 --> 00:17:11,920 And of course, the sum of cells column-wise 370 00:17:11,920 --> 00:17:14,079 is not going to add up to the target alightings. 371 00:17:14,079 --> 00:17:17,710 So the second step is to apply the same procedure 372 00:17:17,710 --> 00:17:19,390 on the columns. 373 00:17:19,390 --> 00:17:21,160 And now we realize, well, we need 374 00:17:21,160 --> 00:17:24,730 to apply a scaling factor of 30 over 13.3 375 00:17:24,730 --> 00:17:27,280 to get B to add up to 30. 
376 00:17:27,280 --> 00:17:30,440 And you do the same for each column. 377 00:17:30,440 --> 00:17:34,160 And now the columns sum up perfectly, 378 00:17:34,160 --> 00:17:36,120 but the rows don't anymore. 379 00:17:36,120 --> 00:17:38,060 So now we go back and repeat the process. 380 00:17:38,060 --> 00:17:40,835 And we go to the rows and the columns. 381 00:17:43,850 --> 00:17:47,210 I've put it in all the slides so you can actually 382 00:17:47,210 --> 00:17:51,200 repeat this in your spreadsheet program, if you want. 383 00:17:51,200 --> 00:17:57,400 It has been shown that if all the non-structural values 384 00:17:57,400 --> 00:18:00,310 of the cells in that matrix are not zero, 385 00:18:00,310 --> 00:18:04,000 and they are positive, then it will converge. 386 00:18:04,000 --> 00:18:06,850 And it will converge to the maximum likelihood estimate, 387 00:18:06,850 --> 00:18:10,390 or the best possible estimate, given your seed 388 00:18:10,390 --> 00:18:13,160 and given your control totals. 389 00:18:13,160 --> 00:18:16,450 So you can apply this if you ever 390 00:18:16,450 --> 00:18:18,640 have a situation where you have control totals, 391 00:18:18,640 --> 00:18:20,650 but not the origin destination matrix. 392 00:18:20,650 --> 00:18:22,270 And that's one example of-- that would 393 00:18:22,270 --> 00:18:25,790 be having APC and nothing else. 394 00:18:25,790 --> 00:18:28,110 Any questions on this method? 395 00:18:28,110 --> 00:18:30,110 We're going to see it again in a different application 396 00:18:30,110 --> 00:18:31,480 later in this lecture. 397 00:18:31,480 --> 00:18:35,360 AUDIENCE: Is it guaranteed to converge to the correct value? 398 00:18:35,360 --> 00:18:39,200 GABRIEL SANCHEZ-MARTINEZ: Well, what is correct? 399 00:18:39,200 --> 00:18:41,900 It may not be the truth, if that's 400 00:18:41,900 --> 00:18:43,700 what you mean by correct. 
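The row-then-column procedure described above can be written as a short function. This is a minimal sketch under stated assumptions, not the lecture's own implementation. The seed is all ones on the valid OD pairs of the four-stop route A, B, C, D. The targets of 40 boardings at A and 30 alightings at B are the values mentioned in the lecture; every other target below is made up for the example.

```python
def ipf(seed, row_targets, col_targets, max_iter=1000, tol=1e-9):
    """Iterative proportional fitting on a list-of-lists matrix.

    Rows are origins and columns are destinations. Cells that are zero in
    the seed stay zero, because every step only multiplies.
    """
    m = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(max_iter):
        # Row step: scale each row so it sums to its target boardings.
        for i, target in enumerate(row_targets):
            total = sum(m[i])
            if total > 0:
                factor = target / total
                m[i] = [v * factor for v in m[i]]
        # Column step: scale each column so it sums to its target alightings.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in m)
            if total > 0:
                factor = target / total
                for row in m:
                    row[j] *= factor
        # Columns now match exactly; stop once the rows also match.
        if all(abs(sum(m[i]) - t) <= tol for i, t in enumerate(row_targets)):
            break
    return m


# Origins A, B, C (rows) and destinations B, C, D (columns).
# Zeros mark the cells that are not valid trips in this direction.
seed = [
    [1.0, 1.0, 1.0],  # A to B, A to C, A to D
    [0.0, 1.0, 1.0],  # B to C, B to D
    [0.0, 0.0, 1.0],  # C to D
]
boardings = [40.0, 30.0, 20.0]   # 40 at A is from the lecture; the rest are assumed
alightings = [30.0, 30.0, 30.0]  # 30 at B is from the lecture; the rest are assumed
od = ipf(seed, boardings, alightings)
for origin, row in zip("ABC", od):
    print(origin, [round(v, 2) for v in row])
```

The first row step multiplies row A by 40 over 3, giving about 13.3 in each cell, and the first column step then multiplies column B by 30 over 13.3, which matches the two factors quoted above. The boarding and alighting targets must add up to the same total for the procedure to settle.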
401 00:18:43,700 --> 00:18:46,610 So it's the best estimate of the truth, 402 00:18:46,610 --> 00:18:48,126 given the information provided. 403 00:18:48,126 --> 00:18:49,959 AUDIENCE: How might it converge to something 404 00:18:49,959 --> 00:18:51,590 that isn't correct? 405 00:18:51,590 --> 00:18:54,460 GABRIEL SANCHEZ-MARTINEZ: Your seed matrix might be wrong. 406 00:18:54,460 --> 00:18:58,330 Or there might be aggregation errors, for example. 407 00:18:58,330 --> 00:19:00,820 So if you start with all ones, that's clearly not true. 408 00:19:00,820 --> 00:19:02,444 And this is the best possible estimate, 409 00:19:02,444 --> 00:19:04,780 given your starting assumption that all of the pairs 410 00:19:04,780 --> 00:19:10,090 are equally likely, and then adjusting from there, right? 411 00:19:10,090 --> 00:19:12,800 And then I mentioned structural zeros. 412 00:19:12,800 --> 00:19:17,080 So if you have a non-structural zero, 413 00:19:17,080 --> 00:19:21,790 say that with an onboard survey, you 414 00:19:21,790 --> 00:19:25,090 collected OD demand only for a couple of trips 415 00:19:25,090 --> 00:19:27,290 and used that to seed the matrix. 416 00:19:27,290 --> 00:19:29,290 And let's say that for some of these OD pairs, 417 00:19:29,290 --> 00:19:32,815 you didn't observe a single person taking that trip. 418 00:19:32,815 --> 00:19:35,460 So you say, well, in the seed, I have a zero, 419 00:19:35,460 --> 00:19:39,400 but that's a non-structural zero because-- 420 00:19:39,400 --> 00:19:40,790 it is not a structural zero, rather. 421 00:19:40,790 --> 00:19:45,540 So this value, some people might be taking the OD pair. 422 00:19:45,540 --> 00:19:47,140 And if you seed it with a zero, then you 423 00:19:47,140 --> 00:19:49,750 can't scale it up above zero. 424 00:19:49,750 --> 00:19:54,970 So in this case, you would not converge necessarily. 
425 00:19:54,970 --> 00:19:56,470 And you certainly would not converge 426 00:19:56,470 --> 00:19:59,170 to the maximum likelihood estimate. 427 00:19:59,170 --> 00:20:01,347 OK? 428 00:20:01,347 --> 00:20:03,930 AUDIENCE: I was going to ask how you get a better seed matrix, 429 00:20:03,930 --> 00:20:04,200 but you-- 430 00:20:04,200 --> 00:20:06,033 GABRIEL SANCHEZ-MARTINEZ: So with any kind-- 431 00:20:06,033 --> 00:20:10,440 we'll talk about other methods for ODX, 432 00:20:10,440 --> 00:20:13,470 so for estimating origin destination matrices. 433 00:20:13,470 --> 00:20:15,930 Manual surveys is one example. 434 00:20:15,930 --> 00:20:20,130 Any knowledge that you have about what OD pairs are busiest 435 00:20:20,130 --> 00:20:21,400 should help. 436 00:20:21,400 --> 00:20:25,280 So you could do on-off counts with the traditional way 437 00:20:25,280 --> 00:20:28,940 if the only thing you have is this. 438 00:20:28,940 --> 00:20:32,035 And we've talked about surveys extensively. 439 00:20:32,035 --> 00:20:33,466 AUDIENCE: So just to [INAUDIBLE],, 440 00:20:33,466 --> 00:20:35,660 APC is the target boarding [INAUDIBLE].. 441 00:20:35,660 --> 00:20:36,951 GABRIEL SANCHEZ-MARTINEZ: Yeah. 442 00:20:36,951 --> 00:20:38,550 So over some number of trips-- 443 00:20:38,550 --> 00:20:40,290 and this is a toy example, clearly. 444 00:20:40,290 --> 00:20:42,960 Over some number of trips, you counted 445 00:20:42,960 --> 00:20:46,440 40 people boarding at A, and 30 people 446 00:20:46,440 --> 00:20:48,690 alighting at B, and so forth. 447 00:20:48,690 --> 00:20:51,360 So you want the cells to match that. 448 00:20:51,360 --> 00:20:53,020 OK? 449 00:20:53,020 --> 00:20:54,730 So you could do this in Excel, or you 450 00:20:54,730 --> 00:20:58,910 could write your own little function to do this. 451 00:20:58,910 --> 00:21:01,680 It amplifies errors in the seed matrix. 
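The points about a zero in the seed and about errors being scaled up can both be seen in a single row-scaling step. This is a small sketch with made-up numbers, and the function name is hypothetical. It is meant only to show that multiplying by a scaling factor can never lift a zero cell and that it preserves whatever proportions the seed had.

```python
def scale_row(row, target):
    """Scale one row of a seed matrix so that it sums to the target."""
    factor = target / sum(row)
    return [value * factor for value in row]


# Made-up seed row from a small survey: the middle OD pair was never
# observed, even though some riders may actually make that trip.
seed_row = [1.0, 0.0, 1.0]
scaled = scale_row(seed_row, 40.0)
# The unobserved pair is still exactly zero after scaling.

# Made-up seed row in which the first pair is overstated by a factor of two.
biased_row = [2.0, 1.0, 1.0]
scaled_biased = scale_row(biased_row, 40.0)
# The first cell is still twice each of the others after scaling.
```

One common workaround, which is an assumption here and not something stated in the lecture, is to seed unobserved but possible pairs with a small positive value instead of zero.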
452 00:21:01,680 --> 00:21:03,200 You're scaling up, so if you have 453 00:21:03,200 --> 00:21:05,690 errors in your seed matrix, they will be scaled up too. 454 00:21:05,690 --> 00:21:07,550 Just be aware of that. 455 00:21:07,550 --> 00:21:10,610 So what about if we don't have APC? 456 00:21:10,610 --> 00:21:13,700 What if we only have AFC and AVL? 457 00:21:13,700 --> 00:21:15,470 So now we don't have control totals. 458 00:21:15,470 --> 00:21:19,040 AFC might give you boardings, but not alightings. 459 00:21:19,040 --> 00:21:21,980 So what are the ways of scaling up with that? 460 00:21:21,980 --> 00:21:25,480 You have different systems, and it depends on the system. 461 00:21:25,480 --> 00:21:29,360 So if you look at TFL in London, we said AFC there is closed, 462 00:21:29,360 --> 00:21:32,540 so the origin destination pairs are given by the rail system 463 00:21:32,540 --> 00:21:34,670 because people have to tap in and tap out. 464 00:21:34,670 --> 00:21:36,350 On bus, however, people only tap in. 465 00:21:36,350 --> 00:21:39,350 So there, you would have to apply this inference method. 466 00:21:39,350 --> 00:21:42,090 Here in the MBTA, both bus and rail are open. 467 00:21:42,090 --> 00:21:42,590 You tap in. 468 00:21:42,590 --> 00:21:43,339 You don't tap out. 469 00:21:43,339 --> 00:21:47,840 So we have to infer destinations in rail and in bus. 470 00:21:47,840 --> 00:21:49,910 And then in some more advanced systems, 471 00:21:49,910 --> 00:21:52,860 a lot of information, including transfer information, is given. 472 00:21:52,860 --> 00:21:54,497 Seoul is one example of that. 473 00:21:54,497 --> 00:21:55,830 AUDIENCE: How is that different? 474 00:21:55,830 --> 00:21:56,430 They-- 475 00:21:56,430 --> 00:21:57,020 AUDIENCE: They tap in [INAUDIBLE]. 476 00:21:57,020 --> 00:21:58,978 GABRIEL SANCHEZ-MARTINEZ: They tap in between-- 477 00:21:58,978 --> 00:21:59,630 yeah. 478 00:21:59,630 --> 00:22:05,142 So that there's an interchange tap.
479 00:22:05,142 --> 00:22:07,225 AUDIENCE: They actually have to tap out to leave-- 480 00:22:07,225 --> 00:22:08,933 GABRIEL SANCHEZ-MARTINEZ: And by the way, 481 00:22:08,933 --> 00:22:10,940 in some parts of London's network, that is true. 482 00:22:10,940 --> 00:22:16,370 You tap to prove that you were transferring. 483 00:22:16,370 --> 00:22:19,540 There might be a fare advantage to doing that. 484 00:22:19,540 --> 00:22:22,840 So control totals. 485 00:22:22,840 --> 00:22:27,970 So here in Boston, with the MBTA buses, some portion of buses 486 00:22:27,970 --> 00:22:30,110 have APC, but not all of them. 487 00:22:30,110 --> 00:22:32,650 So you could use the first method applied 488 00:22:32,650 --> 00:22:34,870 to only a fraction of vehicles and then scale up 489 00:22:34,870 --> 00:22:35,710 to all vehicles. 490 00:22:35,710 --> 00:22:38,210 That's one possibility. 491 00:22:38,210 --> 00:22:41,890 Or you need something else. 492 00:22:41,890 --> 00:22:46,990 In London, they don't have APC, at least not widespread. 493 00:22:46,990 --> 00:22:49,550 And they do have the ticketing machine. 494 00:22:49,550 --> 00:22:51,490 So in theory, drivers are supposed 495 00:22:51,490 --> 00:22:56,050 to push a button if somebody boards, and they don't tap. 496 00:22:56,050 --> 00:22:58,360 Do they actually do it? 497 00:22:58,360 --> 00:23:01,120 Not clear to what extent the drivers 498 00:23:01,120 --> 00:23:03,370 comply with that instruction. 499 00:23:03,370 --> 00:23:07,800 And then gates and rail gates. 500 00:23:07,800 --> 00:23:11,460 So tapping in or out of the subway system there, 501 00:23:11,460 --> 00:23:12,240 there's a counter. 502 00:23:12,240 --> 00:23:13,740 So it counts people passing through. 
503 00:23:13,740 --> 00:23:20,490 So if somebody goes in through a gate and out some other place, 504 00:23:20,490 --> 00:23:22,690 and we don't know exactly what they did, 505 00:23:22,690 --> 00:23:28,530 the total number of people at each node in the system 506 00:23:28,530 --> 00:23:29,220 can be counted. 507 00:23:29,220 --> 00:23:31,270 And we can use that information to scale up. 508 00:23:31,270 --> 00:23:34,170 So we'll talk about that later. 509 00:23:34,170 --> 00:23:36,190 So it depends on the context. 510 00:23:36,190 --> 00:23:38,670 Let's start with origin inference, the first letter 511 00:23:38,670 --> 00:23:41,110 in ODX is origin inference. 512 00:23:41,110 --> 00:23:46,260 So we're looking at a bus, which has one stop and then 513 00:23:46,260 --> 00:23:47,400 another stop. 514 00:23:47,400 --> 00:23:52,410 And if we match the AFC transaction times to the AVL 515 00:23:52,410 --> 00:23:57,150 stop visit times, we can put them on the same timeline 516 00:23:57,150 --> 00:24:00,060 and realize, well, there was a tap 517 00:24:00,060 --> 00:24:04,260 right after that AVL system said that the bus left that stop. 518 00:24:04,260 --> 00:24:06,690 It's very close, however, to that stop. 519 00:24:06,690 --> 00:24:09,150 So let's assume that the tap-- 520 00:24:09,150 --> 00:24:11,250 maybe the bus pulled out and started moving, 521 00:24:11,250 --> 00:24:13,410 and the person was finding the card. 522 00:24:13,410 --> 00:24:16,380 But you still tap-- if it's close enough, 523 00:24:16,380 --> 00:24:18,864 let's assign it to that stop. 524 00:24:18,864 --> 00:24:20,280 AUDIENCE: Just a second, are there 525 00:24:20,280 --> 00:24:25,227 systems where the AFC is connected to the AVL directly? 526 00:24:25,227 --> 00:24:26,060 [INTERPOSING VOICES] 527 00:24:26,060 --> 00:24:26,780 GABRIEL SANCHEZ-MARTINEZ: Are there systems 528 00:24:26,780 --> 00:24:28,250 where the AFC is connected to the AVL? 529 00:24:28,250 --> 00:24:28,749 Yes.
530 00:24:28,749 --> 00:24:30,057 In London, they do that now. 531 00:24:30,057 --> 00:24:30,890 [INTERPOSING VOICES] 532 00:24:30,890 --> 00:24:32,690 GABRIEL SANCHEZ-MARTINEZ: They didn't when we started 533 00:24:32,690 --> 00:24:33,800 with this, but they do it now. 534 00:24:33,800 --> 00:24:34,610 AUDIENCE: Because I know from the process 535 00:24:34,610 --> 00:24:37,070 that you and Neema wrote for Chicago, where 536 00:24:37,070 --> 00:24:42,600 you had to connect the AFC to the AVL, it was a headache. 537 00:24:42,600 --> 00:24:45,962 But it seems to me like it shouldn't be that way. 538 00:24:45,962 --> 00:24:48,295 GABRIEL SANCHEZ-MARTINEZ: Well, remember these systems-- 539 00:24:48,295 --> 00:24:50,680 AUDIENCE: It should have an AV feeder into AFC. 540 00:24:50,680 --> 00:24:53,180 GABRIEL SANCHEZ-MARTINEZ: So it's starting to move that way, 541 00:24:53,180 --> 00:24:56,900 but none of these systems were put in to capture data. 542 00:24:56,900 --> 00:24:57,910 None of them. 543 00:24:57,910 --> 00:24:59,900 APC is the only one, actually. 544 00:24:59,900 --> 00:25:02,450 So APC was put in to collect data and not 545 00:25:02,450 --> 00:25:06,150 have to do all these surveys because that's expensive. 546 00:25:06,150 --> 00:25:11,210 But AFC was put in to collect fares, and avoid 547 00:25:11,210 --> 00:25:17,510 theft of fare revenue, and simplify the duties of drivers, 548 00:25:17,510 --> 00:25:18,480 improve safety. 549 00:25:18,480 --> 00:25:21,680 So there are many advantages to it. 550 00:25:21,680 --> 00:25:23,940 Smart cards have the advantage of having passes 551 00:25:23,940 --> 00:25:28,180 and all these things, so many advantages to an AFC system. 552 00:25:28,180 --> 00:25:30,410 AVL was for safety. 
553 00:25:30,410 --> 00:25:32,885 If there was an emergency on the bus, 554 00:25:32,885 --> 00:25:34,760 the driver could hit a button, and the police 555 00:25:34,760 --> 00:25:36,440 and the ambulances could be dispatched 556 00:25:36,440 --> 00:25:38,990 to the location of the bus. 557 00:25:38,990 --> 00:25:41,780 That's why it was started. 558 00:25:41,780 --> 00:25:46,220 Later on, it started being used for management as well. 559 00:25:46,220 --> 00:25:49,560 Data collection of how many miles of service you provided, 560 00:25:49,560 --> 00:25:52,900 which is a requirement for the NTD reporting. 561 00:25:52,900 --> 00:25:54,470 So aggregate level reporting. 562 00:25:54,470 --> 00:25:57,830 But none of these systems were put in thinking, 563 00:25:57,830 --> 00:25:59,870 oh, we're going to estimate origin destination 564 00:25:59,870 --> 00:26:00,960 matrices with them. 565 00:26:00,960 --> 00:26:03,600 So that's something that has come after the fact. 566 00:26:03,600 --> 00:26:06,590 Now that people are thinking about that, yes. 567 00:26:06,590 --> 00:26:09,350 We start seeing, can we hook up these two systems, 568 00:26:09,350 --> 00:26:11,450 which might be from different vendors, 569 00:26:11,450 --> 00:26:14,780 and make them talk to each other? 570 00:26:14,780 --> 00:26:18,120 So London does that now. 571 00:26:18,120 --> 00:26:22,290 So if you see some tap that is very far away, in time, 572 00:26:22,290 --> 00:26:26,075 from any stop, you might not be able to infer the origin. 573 00:26:26,075 --> 00:26:32,040 If it's close or between the reported arrival and departure, 574 00:26:32,040 --> 00:26:34,530 we match that transaction to that origin. 575 00:26:34,530 --> 00:26:35,670 Simple, right? 576 00:26:35,670 --> 00:26:39,410 So in London, we did that. 577 00:26:39,410 --> 00:26:42,360 This is Jay Gordon's thesis, which is 578 00:26:42,360 --> 00:26:44,910 referenced in the last slide. 579 00:26:44,910 --> 00:26:46,950 Looking at 10 weekdays.
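The matching rule described above, a tap between the reported arrival and departure or close enough in time to a stop visit, can be sketched as follows. The function name `infer_origin`, the record layout, the sample times, and the default tolerance of 300 seconds are assumptions for illustration, not the actual ODX implementation.

```python
def infer_origin(tap_time, stop_visits, tolerance=300):
    """Match one AFC tap to an AVL stop visit on the same vehicle.

    stop_visits: list of (stop_id, arrival, departure), times in seconds.
    Returns the stop_id, or None if no visit is within the tolerance.
    """
    best_stop, best_gap = None, None
    for stop_id, arrival, departure in stop_visits:
        if arrival <= tap_time <= departure:
            return stop_id  # tap falls inside the reported dwell
        # Otherwise measure how far the tap is from this visit in time.
        gap = arrival - tap_time if tap_time < arrival else tap_time - departure
        if best_gap is None or gap < best_gap:
            best_stop, best_gap = stop_id, gap
    if best_gap is not None and best_gap <= tolerance:
        return best_stop
    return None  # too far in time from any stop: leave the origin uninferred


# Hypothetical AVL stop visits for one bus trip.
visits = [("S1", 100, 130), ("S2", 400, 420), ("S3", 2000, 2015)]

print(infer_origin(110, visits))   # tap during the dwell at S1
print(infer_origin(140, visits))   # tap just after the bus left S1
print(infer_origin(1200, visits))  # tap far from every stop visit
```

The tolerance is the knob that trades inference rate against accuracy: a tighter window matches fewer taps, but with more confidence.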
580 00:26:46,950 --> 00:26:49,650 Oyster is the AFC system in London. 581 00:26:49,650 --> 00:26:52,150 And so 96% of boarding locations were 582 00:26:52,150 --> 00:26:54,150 inferred within plus or minus five minutes. 583 00:26:54,150 --> 00:26:56,910 And that was one of the thresholds they looked at. 584 00:26:56,910 --> 00:27:00,840 28% were exactly between the reported arrival and departure. 585 00:27:00,840 --> 00:27:04,530 So some tolerance before the arrival 586 00:27:04,530 --> 00:27:07,260 and after the departure from each stop 587 00:27:07,260 --> 00:27:11,880 was needed to infer a large portion of these. 588 00:27:11,880 --> 00:27:13,320 All right? 589 00:27:13,320 --> 00:27:13,820 Simple. 590 00:27:16,360 --> 00:27:18,140 Destinations, that's the next step. 591 00:27:18,140 --> 00:27:19,570 So we have origins. 592 00:27:19,570 --> 00:27:22,420 It's a rail system, you tap in. 593 00:27:22,420 --> 00:27:24,910 It just tells you I'm gate number blah, 594 00:27:24,910 --> 00:27:26,590 and that is at some station. 595 00:27:26,590 --> 00:27:30,010 If it's a bus, you can join AVL with AFC, and you get it. 596 00:27:30,010 --> 00:27:31,550 Now let's look at destinations. 597 00:27:31,550 --> 00:27:35,892 So there are different methods for inferring destinations. 598 00:27:35,892 --> 00:27:38,030 [AUDIO OUT] AFC and AVL. 599 00:27:38,030 --> 00:27:40,820 And one of the simplest methods, or the family of methods, 600 00:27:40,820 --> 00:27:45,270 is the closest stop assumption. 601 00:27:45,270 --> 00:27:47,780 So what are the key assumptions? 602 00:27:47,780 --> 00:27:52,610 We start by saying that the destination of each trip 603 00:27:52,610 --> 00:27:56,070 segment is close to the origin of the following trip segment. 604 00:27:56,070 --> 00:27:59,330 So in other words, that is true, physically, right? 605 00:27:59,330 --> 00:28:01,920 You have to move somehow through space. 
606 00:28:01,920 --> 00:28:05,870 So we, further, now assume that that movement is happening 607 00:28:05,870 --> 00:28:08,090 mostly through the public transportation network, 608 00:28:08,090 --> 00:28:10,310 and that no trips on other modes are being made. 609 00:28:13,010 --> 00:28:17,270 So if you go from home to work in the morning, 610 00:28:17,270 --> 00:28:21,030 and you have to then-- 611 00:28:21,030 --> 00:28:23,390 say that you work, and then at the end of the workday, 612 00:28:23,390 --> 00:28:25,500 you go back home. 613 00:28:25,500 --> 00:28:28,190 We see that your next origin is the stop 614 00:28:28,190 --> 00:28:30,660 across the street from where you got off, hopefully. 615 00:28:30,660 --> 00:28:36,080 So we'll look at which stop was closest on the trip you boarded 616 00:28:36,080 --> 00:28:37,940 to the next origin. 617 00:28:37,940 --> 00:28:40,700 And we'll infer that is the destination. 618 00:28:40,700 --> 00:28:41,996 If it's a rail system-- 619 00:28:41,996 --> 00:28:43,370 that's what we show here in blue. 620 00:28:43,370 --> 00:28:45,800 So we have an origin on this bus line. 621 00:28:45,800 --> 00:28:47,840 We want to infer which of the downstream stops 622 00:28:47,840 --> 00:28:49,004 is a destination. 623 00:28:49,004 --> 00:28:50,420 And then we look at the next trip, 624 00:28:50,420 --> 00:28:53,120 and the next trip started at T, the target. 625 00:28:53,120 --> 00:28:55,940 And we want to get as close as possible to the target. 626 00:28:55,940 --> 00:28:57,680 So we'll say that the destination 627 00:28:57,680 --> 00:29:02,660 is D, if the distance between the D and T is small enough. 628 00:29:02,660 --> 00:29:04,520 Because if it's three kilometers, 629 00:29:04,520 --> 00:29:06,140 we might say we have no clue. 630 00:29:06,140 --> 00:29:08,784 This person may have moved with a different mode. 
631 00:29:08,784 --> 00:29:10,700 And therefore, this assumption of closest stop 632 00:29:10,700 --> 00:29:12,720 may not apply in that case. 633 00:29:12,720 --> 00:29:15,440 So in those few cases, we won't make 634 00:29:15,440 --> 00:29:16,910 an inference of destination. 635 00:29:16,910 --> 00:29:20,150 If it's rail, so two rail lines, and you 636 00:29:20,150 --> 00:29:24,770 may be able to change between lines behind the gate, 637 00:29:24,770 --> 00:29:30,920 then closest stop is the same station 638 00:29:30,920 --> 00:29:36,680 that you next enter because that's the closest. 639 00:29:36,680 --> 00:29:42,080 So if it's a bus-- so you may have boarded the Red Line here 640 00:29:42,080 --> 00:29:45,810 and somehow gotten to the Blue Line. 641 00:29:45,810 --> 00:29:47,900 We don't know that yet, but we observe 642 00:29:47,900 --> 00:29:50,160 that the next tap is at a bus, then 643 00:29:50,160 --> 00:29:52,220 we find which station on the rail network 644 00:29:52,220 --> 00:29:54,230 is closest to that bus stop. 645 00:29:54,230 --> 00:29:55,850 And that's the inference. 646 00:29:55,850 --> 00:29:57,170 That's destination. 647 00:29:57,170 --> 00:30:02,350 So this is the simplest method for destination inference. 648 00:30:02,350 --> 00:30:05,630 Any questions with the closest stop rule 649 00:30:05,630 --> 00:30:07,100 and that inference method? 650 00:30:09,780 --> 00:30:14,690 Here's an example of one card with four trips. 651 00:30:14,690 --> 00:30:17,040 It's a time-space diagram of sorts. 652 00:30:17,040 --> 00:30:19,430 So we start the day here in the morning, 653 00:30:19,430 --> 00:30:24,230 and we maybe observe a boarding at this line. 654 00:30:24,230 --> 00:30:28,235 This person, in reality, transferred to the second trip. 655 00:30:28,235 --> 00:30:32,090 So we don't know that at first, but we do see 656 00:30:32,090 --> 00:30:34,490 the origin and the second trip. 
657 00:30:34,490 --> 00:30:38,090 So we find which of these stops was closest to that origin 658 00:30:38,090 --> 00:30:40,880 and we say that's destination. 659 00:30:40,880 --> 00:30:44,960 And likewise, from the trip leading to work, 660 00:30:44,960 --> 00:30:47,300 and from the trip returning from work-- work, school, 661 00:30:47,300 --> 00:30:48,950 whatever it is-- 662 00:30:48,950 --> 00:30:52,340 we find the closest one, and we say that's destination. 663 00:30:52,340 --> 00:30:53,780 And we just keep doing that. 664 00:30:53,780 --> 00:30:56,840 What happens at the end of the day? 665 00:30:56,840 --> 00:31:00,260 There's no next one, right? 666 00:31:00,260 --> 00:31:01,760 So what do we do? 667 00:31:01,760 --> 00:31:03,480 AUDIENCE: If a person gets on a bus 668 00:31:03,480 --> 00:31:10,540 that does go to where they started, if the last bus-- 669 00:31:10,540 --> 00:31:11,980 let's talk about bus for a second. 670 00:31:11,980 --> 00:31:13,605 If it leads back to where they started, 671 00:31:13,605 --> 00:31:15,260 then we can assume that they-- 672 00:31:15,260 --> 00:31:15,620 GABRIEL SANCHEZ-MARTINEZ: Right. 673 00:31:15,620 --> 00:31:16,977 So that's the key assumption. 674 00:31:16,977 --> 00:31:18,560 The key assumption is that the person 675 00:31:18,560 --> 00:31:21,830 returns to the first place seen that day. 676 00:31:21,830 --> 00:31:24,320 Another option is to look at the AFC system 677 00:31:24,320 --> 00:31:28,430 and see what is the next place of origin the next day, 678 00:31:28,430 --> 00:31:29,870 if you have that information. 679 00:31:29,870 --> 00:31:31,510 Both things are possible. 680 00:31:31,510 --> 00:31:32,570 OK? 681 00:31:32,570 --> 00:31:34,800 AUDIENCE: But if they get on the last bus of the day, 682 00:31:34,800 --> 00:31:35,675 does not [INAUDIBLE]. 683 00:31:35,675 --> 00:31:37,300 GABRIEL SANCHEZ-MARTINEZ: Well then you 684 00:31:37,300 --> 00:31:38,360 can't make an inference.
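The closest stop rule, together with the end-of-day assumption that the person returns to the first place seen that day, can be sketched as below. The names `infer_destination` and `infer_day`, the planar coordinates in meters, the sample trips, and the 1,000-meter walking cutoff are all assumptions for illustration. The lecture only says that something like three kilometers is too far to support an inference.

```python
import math


def distance(a, b):
    """Straight-line distance between two (x, y) points in meters."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def infer_destination(downstream_stops, target, max_walk=1000.0):
    """Pick the downstream stop closest to the next tap location (the target).

    downstream_stops: list of (stop_id, (x, y)) after the boarding stop.
    Returns None when there is no target or the closest stop is too far.
    """
    if target is None or not downstream_stops:
        return None
    stop_id, xy = min(downstream_stops, key=lambda s: distance(s[1], target))
    return stop_id if distance(xy, target) <= max_walk else None


def infer_day(trips, max_walk=1000.0):
    """Infer a destination for each trip segment of one card on one day.

    The target of each segment is the origin of the next one. The last
    segment uses the first origin of the day as its target.
    """
    results = []
    for k, trip in enumerate(trips):
        if len(trips) < 2:
            target = None  # a single tap that day: no target location
        else:
            target = trips[(k + 1) % len(trips)]["origin"]
        results.append(infer_destination(trip["downstream"], target, max_walk))
    return results


# Hypothetical card with a morning trip and an evening return trip.
trips = [
    {"origin": (0, 0),
     "downstream": [("B", (1000, 0)), ("C", (2000, 0)), ("D", (3000, 0))]},
    {"origin": (2050, 100),
     "downstream": [("C2", (1000, 50)), ("A2", (0, 50))]},
]

print(infer_day(trips))
print(infer_destination(trips[0]["downstream"], (2000, 5000)))  # target too far
```

Using the next day's first origin as the target for the last trip, the other option mentioned above, would only change how `target` is chosen for the final segment.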
685 00:31:38,360 --> 00:31:40,410 So if they get on-- the question was, 686 00:31:40,410 --> 00:31:43,800 what happens if they get on, say, a bus. 687 00:31:43,800 --> 00:31:46,520 And none of the downstream stops of the origin 688 00:31:46,520 --> 00:31:49,610 get close to the first origin of that day 689 00:31:49,610 --> 00:31:53,660 or the first origin of the next day, 690 00:31:53,660 --> 00:31:56,430 then we can't make an inference. 691 00:31:56,430 --> 00:32:01,680 So we leave that destination uninferred for now. 692 00:32:01,680 --> 00:32:02,300 All right? 693 00:32:02,300 --> 00:32:04,970 So there are some tests. 694 00:32:04,970 --> 00:32:06,230 We talked about distance. 695 00:32:06,230 --> 00:32:08,540 Time is another one. 696 00:32:08,540 --> 00:32:11,720 So there's different ways of looking at this. 697 00:32:11,720 --> 00:32:14,630 In London, when Jay Gordon did this, 698 00:32:14,630 --> 00:32:18,200 he got an inference rate of about 75%. 699 00:32:18,200 --> 00:32:22,520 Here's a distribution of speed between station 700 00:32:22,520 --> 00:32:26,880 exit and inferred bus alighting or subsequent station entry. 701 00:32:26,880 --> 00:32:30,290 So very slow speeds here. 702 00:32:30,290 --> 00:32:33,420 This goes up way higher than 800. 703 00:32:33,420 --> 00:32:34,460 What does that show? 704 00:32:37,640 --> 00:32:41,000 This is meters per hour, so if you move zero or, say, 705 00:32:41,000 --> 00:32:45,445 one meter per hour, what does that imply? 706 00:32:45,445 --> 00:32:48,022 AUDIENCE: Someone was taking a bus? 707 00:32:48,022 --> 00:32:49,980 GABRIEL SANCHEZ-MARTINEZ: Who are those people? 708 00:32:52,866 --> 00:32:53,828 Emily? 709 00:32:53,828 --> 00:33:04,680 AUDIENCE: People who get off at, say, a tube stop and then 710 00:33:04,680 --> 00:33:05,370 go to work-- 711 00:33:05,370 --> 00:33:06,450 GABRIEL SANCHEZ-MARTINEZ: Go to work for eight hours. 712 00:33:06,450 --> 00:33:06,780 AUDIENCE: --for eight hours.
713 00:33:06,780 --> 00:33:07,080 And then-- 714 00:33:07,080 --> 00:33:07,260 [INTERPOSING VOICES] 715 00:33:07,260 --> 00:33:09,260 GABRIEL SANCHEZ-MARTINEZ: And then next boarding 716 00:33:09,260 --> 00:33:11,340 is right across the street, 8 hours later. 717 00:33:11,340 --> 00:33:13,860 So those are people who are between trips. 718 00:33:13,860 --> 00:33:18,000 And then to the right, here, we have something sort 719 00:33:18,000 --> 00:33:23,100 of bell-shaped, with some distribution, quite wide. 720 00:33:23,100 --> 00:33:25,770 These are in the range of walking speeds. 721 00:33:25,770 --> 00:33:33,650 So it checks with what we know, and what we infer. 722 00:33:33,650 --> 00:33:35,190 Here is a distribution of distance 723 00:33:35,190 --> 00:33:37,180 between subsequent tap and closest 724 00:33:37,180 --> 00:33:39,360 stop on the current route. 725 00:33:39,360 --> 00:33:43,080 So how far away that you walk to the target 726 00:33:43,080 --> 00:33:45,180 from your destination, in other words. 727 00:33:45,180 --> 00:33:48,680 And there's a cutoff. 728 00:33:48,680 --> 00:33:52,830 If this were too far, we would not want to make an inference. 729 00:33:52,830 --> 00:33:57,640 But you can see that most people have quite short distances. 730 00:33:57,640 --> 00:33:59,050 So that's good. 731 00:33:59,050 --> 00:34:04,830 That means that our inference is more likely to be true. 732 00:34:04,830 --> 00:34:05,640 OK? 733 00:34:05,640 --> 00:34:07,900 And you have some details here. 734 00:34:07,900 --> 00:34:10,870 So in the case of London, there was a comparison 735 00:34:10,870 --> 00:34:14,830 of the origins and destinations produced by this algorithm, 736 00:34:14,830 --> 00:34:18,219 with the bus OD survey, which is a manual survey. 737 00:34:18,219 --> 00:34:21,010 And it compared favorably. 738 00:34:21,010 --> 00:34:24,760 One thing with BODS, of course, it had the biases 739 00:34:24,760 --> 00:34:26,800 that a manual survey has. 
740 00:34:26,800 --> 00:34:29,881 So it seems that BODS underestimated ridership 741 00:34:29,881 --> 00:34:32,380 during the peak periods, where it was maybe harder to count. 742 00:34:36,070 --> 00:34:39,310 Sometimes the BODS return rates were low. 743 00:34:39,310 --> 00:34:42,670 We saw some of the reasons for biases in manual surveys. 744 00:34:42,670 --> 00:34:45,280 If it's a very full bus, people are 745 00:34:45,280 --> 00:34:46,900 less likely to return a survey. 746 00:34:46,900 --> 00:34:50,500 Or if you are a person who is getting off at the next stop, 747 00:34:50,500 --> 00:34:52,610 you are less likely to answer the survey. 748 00:34:52,610 --> 00:34:56,830 So BODS was, of course, subject to those biases. 749 00:34:56,830 --> 00:35:01,160 And essentially, the people who were doing this validation 750 00:35:01,160 --> 00:35:02,900 were happy with the inference method. 751 00:35:05,730 --> 00:35:10,950 Now what happens in rail in Boston, say? 752 00:35:10,950 --> 00:35:12,130 It's an open system. 753 00:35:12,130 --> 00:35:16,560 So you can apply the nearest node method. 754 00:35:20,300 --> 00:35:23,750 But you have a complication in Boston, the Green Line. 755 00:35:23,750 --> 00:35:25,675 So the Green Line is-- 756 00:35:25,675 --> 00:35:27,800 if you take it in the branches, you board, and then 757 00:35:27,800 --> 00:35:29,550 you tap into the vehicle. 758 00:35:29,550 --> 00:35:32,780 So it looks like a bus from the fare standpoint. 759 00:35:32,780 --> 00:35:37,430 And then you could end up anywhere on the rail network. 760 00:35:37,430 --> 00:35:39,980 So it's a little harder to make an inference 761 00:35:39,980 --> 00:35:45,971 of where you get off in that case, especially going back. 762 00:35:45,971 --> 00:35:51,020 Going back to the branch, it's not clear. 763 00:35:51,020 --> 00:35:55,300 Besides that, there are some other reasons 764 00:35:55,300 --> 00:35:58,550 to try more sophisticated destination inference methods.
765 00:35:58,550 --> 00:36:03,550 We know that it may not always be the case that the nearest 766 00:36:03,550 --> 00:36:07,420 station in the rail system is actually the alighting station. 767 00:36:07,420 --> 00:36:10,420 There are some cases where you wouldn't take an extra 15 768 00:36:10,420 --> 00:36:12,630 minutes to get a little bit closer. 769 00:36:12,630 --> 00:36:14,560 And you are willing to walk and make 770 00:36:14,560 --> 00:36:16,900 a compromise between those two. 771 00:36:16,900 --> 00:36:22,000 So the minimum cost path method is an improvement 772 00:36:22,000 --> 00:36:25,040 over the closest stop method. 773 00:36:25,040 --> 00:36:26,800 What we do there is we look at-- here's 774 00:36:26,800 --> 00:36:30,490 the origin tap location, the entry to the system. 775 00:36:30,490 --> 00:36:36,010 And we, essentially, explore using a minimum cost 776 00:36:36,010 --> 00:36:38,710 formulation, a dynamic programming approach. 777 00:36:38,710 --> 00:36:41,680 All the feasible paths that the person 778 00:36:41,680 --> 00:36:45,880 could take to their next tap location, the target. 779 00:36:45,880 --> 00:36:50,620 And that includes walking links from any possible exit station. 780 00:36:50,620 --> 00:36:55,990 So we then use a generalized cost equation 781 00:36:55,990 --> 00:36:59,740 to assign a cost to each of these paths, 782 00:36:59,740 --> 00:37:03,050 with relative disutility weights on each component. 783 00:37:03,050 --> 00:37:05,170 So waiting is more than vehicle time. 784 00:37:05,170 --> 00:37:09,010 Walking is more than waiting time. 785 00:37:09,010 --> 00:37:11,020 We've seen these equations before. 786 00:37:11,020 --> 00:37:14,650 And now we have a list of paths, and we 787 00:37:14,650 --> 00:37:16,840 assume that the person took the one that 788 00:37:16,840 --> 00:37:19,330 minimized their disutility. 
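The generalized cost comparison described above can be sketched as follows. This sketch skips the dynamic programming path search and only ranks candidate paths that are assumed to be enumerated already. The weights, the transfer penalty, the leg times, and the names `generalized_cost` and `infer_exit` are hypothetical values chosen to respect the ordering stated in the lecture (walking weighs more than waiting, which weighs more than in-vehicle time).

```python
# Hypothetical relative disutility weights per minute of each component.
WEIGHTS = {"in_vehicle": 1.0, "wait": 1.5, "walk": 2.0}
TRANSFER_PENALTY = 5.0  # assumed fixed penalty per transfer, in weighted minutes


def generalized_cost(path):
    """Weighted sum of leg times plus a penalty for each transfer."""
    cost = sum(WEIGHTS[kind] * minutes for kind, minutes in path["legs"])
    return cost + TRANSFER_PENALTY * path["transfers"]


def infer_exit(paths):
    """Assume the rider took the feasible path with the lowest generalized cost."""
    best = min(paths, key=generalized_cost)
    return best["exit_station"], generalized_cost(best)


# Two hypothetical candidate paths from one entry tap to the next tap location.
paths = [
    {   # ride, exit, and walk the rest of the way to the next tap
        "exit_station": "Park Street",
        "legs": [("in_vehicle", 5), ("walk", 6)],
        "transfers": 0,
    },
    {   # ride, transfer, wait, ride one more stop, then a short walk
        "exit_station": "Boylston",
        "legs": [("in_vehicle", 5), ("wait", 4), ("in_vehicle", 2), ("walk", 1)],
        "transfers": 1,
    },
]

print(infer_exit(paths))
```

With these assumed numbers the walking option wins, so the exit is inferred at the station where the walk begins rather than at the station nearest the next tap. Different weights could flip the result, which is why some of the inferences that change under this method are judgment calls.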
789 00:37:19,330 --> 00:37:21,310 Their combined, generalized disutility-- 790 00:37:21,310 --> 00:37:23,920 avoiding walking, in-vehicle time, transfers, 791 00:37:23,920 --> 00:37:25,670 all those things. 792 00:37:25,670 --> 00:37:29,020 So in this case, perhaps the person 793 00:37:29,020 --> 00:37:31,750 prefers to get off of the Red Line 794 00:37:31,750 --> 00:37:35,860 and walk to the next location. 795 00:37:35,860 --> 00:37:38,620 You could think of this as the Red Line 796 00:37:38,620 --> 00:37:41,620 from Kendall to Park Street. 797 00:37:41,620 --> 00:37:44,890 And then the next entry is at, say, a bus stop 798 00:37:44,890 --> 00:37:46,870 close to Boylston Street. 799 00:37:46,870 --> 00:37:50,310 You could transfer to the Green Line and take it one stop, 800 00:37:50,310 --> 00:37:52,650 or you might decide it's a nice walk. 801 00:37:52,650 --> 00:37:53,940 I'm going to walk. 802 00:37:53,940 --> 00:37:56,900 I'm not going to wait for the Green Line. 803 00:37:56,900 --> 00:37:59,970 And then some possible paths take you 804 00:37:59,970 --> 00:38:04,470 way far from your next location, so they are pruned. 805 00:38:04,470 --> 00:38:06,810 They're not included. 806 00:38:06,810 --> 00:38:11,600 So what happens if we compare the two methods? 807 00:38:11,600 --> 00:38:15,530 What's your intuition? 808 00:38:15,530 --> 00:38:18,050 Or what do you think happens if we compare 809 00:38:18,050 --> 00:38:22,600 the results of nearest node with this more sophisticated method? 810 00:38:26,186 --> 00:38:30,427 What percentage of destinations do you 811 00:38:30,427 --> 00:38:32,260 think will be inferred at a different place? 812 00:38:35,417 --> 00:38:36,710 Is it close to 5? 813 00:38:36,710 --> 00:38:37,850 Close to 50? 814 00:38:37,850 --> 00:38:39,285 Close to 25? 815 00:38:42,280 --> 00:38:43,426 10%? 816 00:38:43,426 --> 00:38:44,920 AUDIENCE: 5%. 817 00:38:44,920 --> 00:38:46,752 GABRIEL SANCHEZ-MARTINEZ: 5%?
818 00:38:46,752 --> 00:38:49,210 AUDIENCE: What percent of the destinations will be inferred 819 00:38:49,210 --> 00:38:49,810 or will not be inferred? 820 00:38:49,810 --> 00:38:51,921 GABRIEL SANCHEZ-MARTINEZ: Will be inferred differently. 821 00:38:51,921 --> 00:38:53,875 AUDIENCE: Oh, will be inferred differently. 822 00:38:56,785 --> 00:38:57,449 5%. 823 00:38:57,449 --> 00:38:58,740 GABRIEL SANCHEZ-MARTINEZ: Five? 824 00:38:58,740 --> 00:38:59,240 OK. 825 00:38:59,240 --> 00:39:01,860 So that's actually close. 826 00:39:01,860 --> 00:39:05,210 I actually don't think I wrote the results, which is good. 827 00:39:05,210 --> 00:39:08,445 So let's look at two examples. 828 00:39:08,445 --> 00:39:10,710 It is close to 5%, in fact. 829 00:39:10,710 --> 00:39:13,960 Some of the differences in the Boston network 830 00:39:13,960 --> 00:39:16,810 are clear improvements in the accuracy. 831 00:39:16,810 --> 00:39:18,670 I'll give you one example of that. 832 00:39:18,670 --> 00:39:21,310 Some people go from Forest Hills, 833 00:39:21,310 --> 00:39:25,300 and then their next tap is at Copley. 834 00:39:25,300 --> 00:39:29,450 So the walk between Back Bay and Copley is five minutes, 835 00:39:29,450 --> 00:39:31,700 and it's a nice walk. 836 00:39:31,700 --> 00:39:33,470 If you use nearest node, you have 837 00:39:33,470 --> 00:39:34,850 to remain on the rail line. 838 00:39:34,850 --> 00:39:38,660 And you have to transfer either at Downtown Crossing to the Red 839 00:39:38,660 --> 00:39:40,190 and then at Park Street to the Green 840 00:39:40,190 --> 00:39:43,340 or go to Haymarket and transfer it to the Orange Line. 841 00:39:43,340 --> 00:39:47,100 That's a 20-minute, 25-minute ordeal. 842 00:39:47,100 --> 00:39:49,621 AUDIENCE: For a walk from Downtown across to Park Street, 843 00:39:49,621 --> 00:39:50,120 that's-- 844 00:39:50,120 --> 00:39:53,725 AUDIENCE: Sure, but then you are not following the method. 
845 00:39:53,725 --> 00:39:55,600 AUDIENCE: Oh, you don't [INAUDIBLE] transfer? 846 00:39:55,600 --> 00:39:57,620 GABRIEL SANCHEZ-MARTINEZ: Yeah. 847 00:39:57,620 --> 00:40:02,976 So this is a case where the minimum cost approach says, 848 00:40:02,976 --> 00:40:04,850 yeah, you get off at Back Bay, and you walk. 849 00:40:04,850 --> 00:40:07,314 So yes, it's an improvement. 850 00:40:07,314 --> 00:40:08,480 There are some other cases-- 851 00:40:08,480 --> 00:40:10,580 AUDIENCE: What was the person's destination? 852 00:40:10,580 --> 00:40:11,900 GABRIEL SANCHEZ-MARTINEZ: Well, we don't know destination. 853 00:40:11,900 --> 00:40:13,135 We're inferring destination. 854 00:40:13,135 --> 00:40:13,440 [INTERPOSING VOICES] 855 00:40:13,440 --> 00:40:14,270 GABRIEL SANCHEZ-MARTINEZ: And what we know 856 00:40:14,270 --> 00:40:16,480 is that they get on at Copley the next time. 857 00:40:16,480 --> 00:40:18,990 AUDIENCE: Yeah, but was that their afternoon trip? 858 00:40:18,990 --> 00:40:20,360 Or was that their-- 859 00:40:20,360 --> 00:40:21,380 GABRIEL SANCHEZ-MARTINEZ: It's a morning trip. 860 00:40:21,380 --> 00:40:22,490 They go from Forest Hills. 861 00:40:22,490 --> 00:40:25,339 AUDIENCE: So both Forest Hills and Copley were morning taps? 862 00:40:25,339 --> 00:40:27,130 GABRIEL SANCHEZ-MARTINEZ: Oh, I don't know. 863 00:40:27,130 --> 00:40:28,870 Copley might have been the afternoon tap. 864 00:40:28,870 --> 00:40:30,286 AUDIENCE: It doesn't really matter 865 00:40:30,286 --> 00:40:32,734 if it was in the afternoon. 866 00:40:32,734 --> 00:40:34,190 AUDIENCE: [INAUDIBLE]. 867 00:40:34,190 --> 00:40:36,815 But then why would a person get on-- 868 00:40:36,815 --> 00:40:38,440 AUDIENCE: The question is where do they 869 00:40:38,440 --> 00:40:39,740 get off the Orange Line. 870 00:40:39,740 --> 00:40:41,073 GABRIEL SANCHEZ-MARTINEZ: Right.
871 00:40:41,073 --> 00:40:43,300 We're trying to infer the destination when they board 872 00:40:43,300 --> 00:40:44,810 the Orange Line in the morning. 873 00:40:44,810 --> 00:40:48,446 AUDIENCE: But the time gap also matters. 874 00:40:48,446 --> 00:40:49,050 Forest Hills-- 875 00:40:49,050 --> 00:40:51,466 GABRIEL SANCHEZ-MARTINEZ: We are making an assumption that 876 00:40:51,466 --> 00:40:56,260 people [AUDIO OUT] too far from their-- 877 00:40:56,260 --> 00:40:57,250 AUDIENCE: But I-- 878 00:40:57,250 --> 00:40:58,875 GABRIEL SANCHEZ-MARTINEZ: --destination 879 00:40:58,875 --> 00:41:01,276 in a non-public transportation mode. 880 00:41:01,276 --> 00:41:01,900 AUDIENCE: Sure. 881 00:41:01,900 --> 00:41:03,400 GABRIEL SANCHEZ-MARTINEZ: That's one 882 00:41:03,400 --> 00:41:06,010 of the assumptions in this method 883 00:41:06,010 --> 00:41:07,650 and in the previous method as well. 884 00:41:07,650 --> 00:41:08,660 AUDIENCE: But I'm still troubled by the time gap. 885 00:41:08,660 --> 00:41:10,410 Was Copley an afternoon tap? 886 00:41:10,410 --> 00:41:11,110 Or was it-- 887 00:41:11,110 --> 00:41:12,250 GABRIEL SANCHEZ-MARTINEZ: It could have been. 888 00:41:12,250 --> 00:41:14,030 There are many people who do this. 889 00:41:14,030 --> 00:41:14,680 AUDIENCE: It makes a difference-- 890 00:41:14,680 --> 00:41:15,340 GABRIEL SANCHEZ-MARTINEZ: So there 891 00:41:15,340 --> 00:41:16,870 are many people who do this. 892 00:41:16,870 --> 00:41:19,080 So there are some people who do it 893 00:41:19,080 --> 00:41:21,850 close in time and some people who do it in the evening, 894 00:41:21,850 --> 00:41:23,110 after they leave work. 895 00:41:23,110 --> 00:41:27,780 So I'm giving you an example of one origin target pair. 896 00:41:27,780 --> 00:41:33,231 And I would say it's a marked improvement. 897 00:41:33,231 --> 00:41:35,730 It's certainly not the case that the person goes all the way 898 00:41:35,730 --> 00:41:39,000 to Haymarket and turns around.
899 00:41:39,000 --> 00:41:39,900 OK. 900 00:41:39,900 --> 00:41:44,110 The other example is less clear. 901 00:41:44,110 --> 00:41:49,380 So somebody-- an OD pair starting at Maverick, 902 00:41:49,380 --> 00:41:54,280 and then the next tap is at Downtown Crossing. 903 00:41:54,280 --> 00:41:58,030 So obviously, the closest node assumption 904 00:41:58,030 --> 00:42:00,340 is that you transfer at State Street 905 00:42:00,340 --> 00:42:03,250 and take the Orange Line one stop. 906 00:42:03,250 --> 00:42:05,500 That's actually not too bad. 907 00:42:05,500 --> 00:42:08,145 The algorithm infers, for a lot of these people, 908 00:42:08,145 --> 00:42:09,520 that you get off at State Street, 909 00:42:09,520 --> 00:42:13,210 and you walk about four minutes, and if you 910 00:42:13,210 --> 00:42:14,730 look at Google directions, Google 911 00:42:14,730 --> 00:42:16,188 will say that's what you should do. 912 00:42:18,770 --> 00:42:22,070 The transfer to the Orange Line would take 913 00:42:22,070 --> 00:42:23,780 six minutes, instead of four. 914 00:42:23,780 --> 00:42:25,216 So it's very close. 915 00:42:25,216 --> 00:42:26,840 And it depends on the weather that day. 916 00:42:26,840 --> 00:42:29,350 And it depends on people's preference. 917 00:42:29,350 --> 00:42:31,760 It might depend on real-time information 918 00:42:31,760 --> 00:42:33,350 about whether the train is right here, 919 00:42:33,350 --> 00:42:37,160 and I can run, or arriving in one minute, 920 00:42:37,160 --> 00:42:39,600 or if it's 10 minutes away. 921 00:42:39,600 --> 00:42:41,970 So this is more subtle, more nuanced. 922 00:42:41,970 --> 00:42:43,950 And I wouldn't say that was an improvement. 923 00:42:43,950 --> 00:42:47,850 So part of the 5% is clear improvement, 924 00:42:47,850 --> 00:42:52,310 and another part of it is, well, it might be an improvement 925 00:42:52,310 --> 00:42:52,810 or not. 926 00:42:52,810 --> 00:42:54,990 It depends on people's preferences.
927 00:42:54,990 --> 00:43:03,170 So if we look at the distribution of the results 928 00:43:03,170 --> 00:43:07,820 from destination in this case, 70% of destinations 929 00:43:07,820 --> 00:43:09,700 were inferred. 930 00:43:09,700 --> 00:43:13,690 And then we have different reasons why we can't infer it. 931 00:43:13,690 --> 00:43:16,150 So for 16%, there was no target location. 932 00:43:16,150 --> 00:43:23,010 That means there was no other tap that day, essentially. 933 00:43:23,010 --> 00:43:28,000 So there was no target. 934 00:43:28,000 --> 00:43:29,870 For 8% of them, there was another target, 935 00:43:29,870 --> 00:43:31,150 but it was very far. 936 00:43:31,150 --> 00:43:36,870 So somehow, the person went to another bus stop in the system, 937 00:43:36,870 --> 00:43:40,240 and it was far away from any rail station. 938 00:43:40,240 --> 00:43:43,430 So we're not so comfortable, in that case, 939 00:43:43,430 --> 00:43:47,960 saying that the destination is close to the next tap. 940 00:43:47,960 --> 00:43:51,820 So we will not make an inference for those people. 941 00:43:51,820 --> 00:43:55,000 Some paths were non-feasible. 942 00:43:55,000 --> 00:43:59,830 So that means that the algorithm did not 943 00:43:59,830 --> 00:44:03,790 find any path that made it to the target on time 944 00:44:03,790 --> 00:44:06,760 to make the next. 945 00:44:06,760 --> 00:44:08,590 So that could be about data. 946 00:44:08,590 --> 00:44:10,695 It could be a number of things. 947 00:44:10,695 --> 00:44:12,820 There are some assumptions about how quickly people 948 00:44:12,820 --> 00:44:15,250 can access trains and hop. 949 00:44:15,250 --> 00:44:16,810 So many things can go into that. 950 00:44:19,390 --> 00:44:20,250 Yeah, and so forth. 951 00:44:20,250 --> 00:44:25,831 And then unknown origin, so errors in the data, et cetera. 952 00:44:25,831 --> 00:44:26,330 OK? 953 00:44:30,220 --> 00:44:34,790 And the inference probabilities, the total ones are shown here. 
954 00:44:34,790 --> 00:44:37,980 So the blue line is overall destination inference rates, 955 00:44:37,980 --> 00:44:40,529 so I said close to 70%. 956 00:44:40,529 --> 00:44:42,070 That's what you see on the blue line. 957 00:44:42,070 --> 00:44:44,260 It dips a little bit on weekends because there 958 00:44:44,260 --> 00:44:46,930 are fewer taps and maybe more walking 959 00:44:46,930 --> 00:44:50,770 between taps or between trips. 960 00:44:50,770 --> 00:44:54,070 For rail, it's a little higher than the general. 961 00:44:54,070 --> 00:44:57,350 For bus, shown in yellow, it's a little lower. 962 00:44:57,350 --> 00:44:58,900 And if you take away the part that 963 00:44:58,900 --> 00:45:05,260 didn't have a second tap or a tap after that transaction, 964 00:45:05,260 --> 00:45:10,345 then it goes up to closer to 90%, not quite 90%. 965 00:45:10,345 --> 00:45:11,220 AUDIENCE: [INAUDIBLE] 966 00:45:11,220 --> 00:45:11,370 GABRIEL SANCHEZ-MARTINEZ: Yeah? 967 00:45:11,370 --> 00:45:13,495 AUDIENCE: Does this include only people who tapped? 968 00:45:13,495 --> 00:45:16,550 Or does this also include people who paid cash? 969 00:45:16,550 --> 00:45:20,870 GABRIEL SANCHEZ-MARTINEZ: So these numbers in this slide 970 00:45:20,870 --> 00:45:22,010 are everyone. 971 00:45:22,010 --> 00:45:26,510 If the person is cash, then they are not counted on the red line 972 00:45:26,510 --> 00:45:29,450 because they wouldn't have an inferable destination. 973 00:45:29,450 --> 00:45:35,567 But certainly, the bus line does include cash transactions. 974 00:45:35,567 --> 00:45:36,400 [INTERPOSING VOICES] 975 00:45:36,400 --> 00:45:36,580 GABRIEL SANCHEZ-MARTINEZ: And that's 976 00:45:36,580 --> 00:45:38,536 one of the reasons why it's lower. 977 00:45:38,536 --> 00:45:40,230 AUDIENCE: [INAUDIBLE] target location on the last slide 978 00:45:40,230 --> 00:45:41,355 includes cash transactions? 979 00:45:41,355 --> 00:45:44,890 GABRIEL SANCHEZ-MARTINEZ: So this was only for the rail.
980 00:45:44,890 --> 00:45:47,140 Sorry, this is for the whole system, 981 00:45:47,140 --> 00:45:52,720 but the two examples that I gave here, I was looking at-- 982 00:45:52,720 --> 00:45:55,300 I quoted 5% difference, and that's 983 00:45:55,300 --> 00:45:58,990 a case study where I compared rail transactions, not 984 00:45:58,990 --> 00:46:01,680 bus transactions. 985 00:46:01,680 --> 00:46:04,470 So that's the one thing to have in mind. 986 00:46:04,470 --> 00:46:06,620 But yeah, this is overall destination inference 987 00:46:06,620 --> 00:46:08,207 in the MBTA, and this as well. 988 00:46:08,207 --> 00:46:09,540 Different ways of looking at it. 989 00:46:09,540 --> 00:46:10,366 AUDIENCE: So if you paid cash, you'd 990 00:46:10,366 --> 00:46:11,540 be in the no target range. 991 00:46:11,540 --> 00:46:13,080 GABRIEL SANCHEZ-MARTINEZ: Yes. 992 00:46:13,080 --> 00:46:15,030 And you could infer an origin for that person, 993 00:46:15,030 --> 00:46:17,820 but not a destination, so you leave that trip uninferred 994 00:46:17,820 --> 00:46:19,460 destination for now. 995 00:46:19,460 --> 00:46:21,700 OK. 996 00:46:21,700 --> 00:46:25,210 We've covered O and D. Let's move to X, transfer inference. 997 00:46:25,210 --> 00:46:29,060 We talked about why transfer inference is so important. 998 00:46:29,060 --> 00:46:31,120 We also call this interchange inference. 999 00:46:31,120 --> 00:46:34,710 Interchange is a term preferred in London by the British. 1000 00:46:34,710 --> 00:46:37,670 In the US, we say transfer. 1001 00:46:37,670 --> 00:46:41,194 So we have seen this diagram before. 1002 00:46:41,194 --> 00:46:42,610 But now there are these blue boxes 1003 00:46:42,610 --> 00:46:47,980 surrounding both, say, the morning pair and the afternoon 1004 00:46:47,980 --> 00:46:48,770 pair. 
1005 00:46:48,770 --> 00:46:52,630 So the inference we want to make now 1006 00:46:52,630 --> 00:46:55,660 is whether this first trip was connected 1007 00:46:55,660 --> 00:46:57,670 to the second with a transfer. 1008 00:46:57,670 --> 00:47:00,760 Or whether, in fact, the person was doing something else 1009 00:47:00,760 --> 00:47:04,630 in between those two trips, and this was the actual intended 1010 00:47:04,630 --> 00:47:07,340 destination of that passenger. 1011 00:47:07,340 --> 00:47:09,430 And that's an important question for the reasons 1012 00:47:09,430 --> 00:47:10,450 we talked about earlier. 1013 00:47:16,170 --> 00:47:19,790 These are some definitions for your reference. 1014 00:47:19,790 --> 00:47:24,860 A journey, in this subject, is everything 1015 00:47:24,860 --> 00:47:29,420 that is accomplished from the real origin 1016 00:47:29,420 --> 00:47:31,170 to the real destination of the person, 1017 00:47:31,170 --> 00:47:33,740 including transfers and, possibly, multiple fare 1018 00:47:33,740 --> 00:47:35,240 payments. 1019 00:47:35,240 --> 00:47:38,090 A fare stage, not included in the slide, 1020 00:47:38,090 --> 00:47:41,450 is everything that you do in a single fare payment. 1021 00:47:41,450 --> 00:47:44,720 So it could involve behind-the-gate transfers, 1022 00:47:44,720 --> 00:47:48,920 or it could be one bus ride. 1023 00:47:48,920 --> 00:47:53,630 Transfers are transfers between stages. 1024 00:47:53,630 --> 00:47:58,070 So they link segments of a journey. 1025 00:47:58,070 --> 00:48:00,390 How do we do this linking? 1026 00:48:00,390 --> 00:48:02,730 This is also from Jay Gordon's thesis, which 1027 00:48:02,730 --> 00:48:05,820 is referenced in the back. 1028 00:48:05,820 --> 00:48:09,000 We look at a series of three kinds of conditions-- 1029 00:48:09,000 --> 00:48:10,860 temporal conditions, logical conditions, 1030 00:48:10,860 --> 00:48:12,390 and spatial conditions. 
1031 00:48:12,390 --> 00:48:16,020 Temporal conditions, say, how much time 1032 00:48:16,020 --> 00:48:17,940 happened between the inferred destination 1033 00:48:17,940 --> 00:48:21,280 and the next origin, the inferred origin. 1034 00:48:21,280 --> 00:48:24,280 If that was a very long time, and the distance was short, 1035 00:48:24,280 --> 00:48:27,370 then the person might have been doing something else. 1036 00:48:27,370 --> 00:48:33,780 So we can't necessarily assume that this was a transfer. 1037 00:48:33,780 --> 00:48:36,160 We also look at bus wait time. 1038 00:48:36,160 --> 00:48:37,680 So what if the distance was short? 1039 00:48:37,680 --> 00:48:42,390 A long time happened [AUDIO OUT] of next bus or every 20 1040 00:48:42,390 --> 00:48:45,960 minutes, and the person had to wait that long? 1041 00:48:45,960 --> 00:48:47,970 Well, that is also considered. 1042 00:48:47,970 --> 00:48:53,340 So if we look and see that that next bus passed 1043 00:48:53,340 --> 00:48:56,420 after a reasonable time allowed to get to the next stop. 1044 00:48:56,420 --> 00:48:59,280 Or how many buses passed? 1045 00:48:59,280 --> 00:49:01,500 Maybe you want to allow one, just in case 1046 00:49:01,500 --> 00:49:04,270 that bus is very full and can't take that person. 1047 00:49:04,270 --> 00:49:07,980 So these are the considerations in temporal conditions. 1048 00:49:07,980 --> 00:49:11,310 Spatial conditions, you want to look at maximum interchange 1049 00:49:11,310 --> 00:49:13,440 distance, assuming that a person can actually 1050 00:49:13,440 --> 00:49:18,720 do a transfer that is two kilometers long, for example. 1051 00:49:18,720 --> 00:49:20,940 You probably would be doing something else, 1052 00:49:20,940 --> 00:49:22,020 if that's the case. 1053 00:49:22,020 --> 00:49:24,870 And we look at circuity, so circuity at the journey level 1054 00:49:24,870 --> 00:49:26,750 and between stages. 
1055 00:49:26,750 --> 00:49:29,250 A circuitous journey is one that ends very 1056 00:49:29,250 --> 00:49:32,280 close to where you started. 1057 00:49:32,280 --> 00:49:36,270 So if you infer transfers, and you 1058 00:49:36,270 --> 00:49:39,240 end up back where you started, then somewhere 1059 00:49:39,240 --> 00:49:42,420 in that chain of stages, there must have not 1060 00:49:42,420 --> 00:49:43,890 been a real transfer. 1061 00:49:43,890 --> 00:49:46,860 There must have been a non-transportation activity. 1062 00:49:46,860 --> 00:49:50,190 So therefore, you can't really infer that all of that chain 1063 00:49:50,190 --> 00:49:52,170 is linked with transfers. 1064 00:49:52,170 --> 00:49:53,830 And then circuity between stages. 1065 00:49:53,830 --> 00:49:59,490 So if you, for example, board the same bus line 1066 00:49:59,490 --> 00:50:03,720 going backwards, even if the time was short, 1067 00:50:03,720 --> 00:50:06,632 and the distance was short, you may have seen your friend 1068 00:50:06,632 --> 00:50:09,090 and given your friend something, and then hopped on the bus 1069 00:50:09,090 --> 00:50:09,720 again. 1070 00:50:09,720 --> 00:50:11,370 So it might have been a quick transfer. 1071 00:50:11,370 --> 00:50:14,430 And therefore, we want to look at circuity to infer 1072 00:50:14,430 --> 00:50:18,031 if that was a transfer or not. 1073 00:50:18,031 --> 00:50:20,530 Logical conditions, I actually gave an example of that right 1074 00:50:20,530 --> 00:50:21,030 now. 1075 00:50:21,030 --> 00:50:25,420 So if you're entering the same station you get off at, 1076 00:50:25,420 --> 00:50:28,660 or you take the same bus line, then 1077 00:50:28,660 --> 00:50:30,760 that shouldn't be a transfer because you could 1078 00:50:30,760 --> 00:50:32,890 have stayed on the same bus. 1079 00:50:32,890 --> 00:50:35,560 One example of that breaking would 1080 00:50:35,560 --> 00:50:39,220 be a bus being taken out of service or something like this. 
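The three kinds of conditions described here (temporal, spatial, logical) can be sketched as a single check on the gap between two fare stages. This is only an illustration, not the implementation from the thesis mentioned in the lecture: the `StageLink` fields, the `is_transfer` function, and every threshold value are hypothetical names and numbers chosen for the example. Following the lecture, a link is called a transfer only if every test passes.

```python
from dataclasses import dataclass


@dataclass
class StageLink:
    """Gap between one stage's inferred alighting and the next boarding."""
    gap_minutes: float           # time from inferred alighting to the next tap
    walk_metres: float           # distance from inferred alighting to next boarding
    buses_missed: int            # next-route buses that passed after the rider could have reached the stop
    same_route_or_station: bool  # re-boards the same bus line or re-enters the same station
    journey_span_metres: float   # first origin to final destination if the stages were linked


def is_transfer(link, max_gap_minutes=60.0, max_walk_metres=750.0,
                max_buses_missed=1, min_span_metres=400.0):
    """Return True only when all three groups of tests pass (thresholds are assumed)."""
    # Temporal: not too long a gap, and at most one bus allowed to pass.
    temporal = (link.gap_minutes <= max_gap_minutes
                and link.buses_missed <= max_buses_missed)
    # Spatial: interchange walk is plausible and the journey is not circuitous.
    spatial = (link.walk_metres <= max_walk_metres
               and link.journey_span_metres >= min_span_metres)
    # Logical: the rider could not simply have stayed on the same vehicle.
    logical = not link.same_route_or_station
    return temporal and spatial and logical


quick_change = StageLink(6, 150, 0, False, 5000)
same_line_back = StageLink(6, 150, 0, True, 5000)
long_walk = StageLink(20, 2000, 0, False, 5000)
print(is_transfer(quick_change), is_transfer(same_line_back), is_transfer(long_walk))
```

Failing any one test leaves the two stages unlinked, which keeps the inference conservative in the sense the lecture describes.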
1081 00:50:39,220 --> 00:50:43,390 So you could consider that to make that logical condition 1082 00:50:43,390 --> 00:50:49,322 more accurate. 1083 00:50:49,322 --> 00:50:51,280 In many cases when that happens, though, people 1084 00:50:51,280 --> 00:50:52,780 are not asked to tap again. 1085 00:50:52,780 --> 00:50:59,250 So take that with a grain of salt. 1086 00:50:59,250 --> 00:51:01,380 Questions about these assumptions and the tests 1087 00:51:01,380 --> 00:51:02,670 that we impose? 1088 00:51:02,670 --> 00:51:05,340 So essentially, if all of these tests pass, 1089 00:51:05,340 --> 00:51:07,920 we say, yes, this is a transfer. 1090 00:51:07,920 --> 00:51:10,650 If one of them doesn't pass, we say, we're not sure. 1091 00:51:10,650 --> 00:51:13,570 It could have been or maybe not. 1092 00:51:13,570 --> 00:51:16,780 And therefore, this will be a conservative assumption 1093 00:51:16,780 --> 00:51:23,580 about whether these two stages are linked as one journey. 1094 00:51:23,580 --> 00:51:26,970 Here are the results in London. 1095 00:51:26,970 --> 00:51:31,860 So about 2/3 were one stage, about 1/4 were two stages, 1096 00:51:31,860 --> 00:51:34,650 and then about 10% were more than two stages. 1097 00:51:34,650 --> 00:51:37,950 And here's a distribution of duration 1098 00:51:37,950 --> 00:51:42,660 of journey from first origin to last destination, 1099 00:51:42,660 --> 00:51:45,310 instead of the unlinked trip time. 1100 00:51:45,310 --> 00:51:49,270 This includes transfer time in between. 1101 00:51:49,270 --> 00:51:53,770 So very powerful for service planning. 1102 00:51:53,770 --> 00:51:57,400 There was a comparison with the London travel diary survey, 1103 00:51:57,400 --> 00:51:59,230 LTDS. 1104 00:51:59,230 --> 00:52:02,600 And it lined up quite well, but there were some differences.
1105 00:52:02,600 --> 00:52:05,320 So if we look at people reporting that they only 1106 00:52:05,320 --> 00:52:08,770 took one journey on the day they were queried about, 1107 00:52:08,770 --> 00:52:10,870 it lined up very well. 1108 00:52:10,870 --> 00:52:15,220 If you then look at two or more, it 1109 00:52:15,220 --> 00:52:18,820 turns out that a lot of people in LTDS say that they took two, 1110 00:52:18,820 --> 00:52:21,100 but they may have taken more. 1111 00:52:21,100 --> 00:52:23,680 And they're just simplifying the reporting. 1112 00:52:23,680 --> 00:52:26,620 That's one possibility of errors. 1113 00:52:26,620 --> 00:52:30,020 And it is a known bias in surveys 1114 00:52:30,020 --> 00:52:33,130 that people try to help you out by saying what I usually 1115 00:52:33,130 --> 00:52:36,970 do, instead of what I did yesterday, or things like that. 1116 00:52:36,970 --> 00:52:41,170 So percent of cardholders, shown on the second graph, 1117 00:52:41,170 --> 00:52:43,420 and the number of stages per journey, similar pattern. 1118 00:52:46,730 --> 00:52:54,780 So in LTDS, you tend to have fewer people. 1119 00:52:54,780 --> 00:53:01,030 Well, yeah, here we have more people inferred with one stage, 1120 00:53:01,030 --> 00:53:02,560 so no transfer. 1121 00:53:02,560 --> 00:53:05,370 But if you look at two or more, a similar pattern 1122 00:53:05,370 --> 00:53:13,400 emerges where people might be reporting two and, in fact, 1123 00:53:13,400 --> 00:53:15,230 it might have been more or less. 1124 00:53:15,230 --> 00:53:21,020 So in this case, the bias is towards more direct trips 1125 00:53:21,020 --> 00:53:24,120 on the inferred side, versus the questionnaire side. 1126 00:53:24,120 --> 00:53:26,090 OK? 1127 00:53:26,090 --> 00:53:28,530 So it didn't validate perfectly. 1128 00:53:28,530 --> 00:53:30,850 But there were some known issues with LTDS. 1129 00:53:33,740 --> 00:53:40,120 London has since decided that ODX is more accurate.
1130 00:53:40,120 --> 00:53:42,410 Now they continue LTDS because LTDS 1131 00:53:42,410 --> 00:53:45,480 is useful for other things, like asking 1132 00:53:45,480 --> 00:53:48,240 about social demographics, and trip purpose, 1133 00:53:48,240 --> 00:53:50,280 and things like that. 1134 00:53:50,280 --> 00:53:55,580 That doesn't obviate the need for LTDS, 1135 00:53:55,580 --> 00:54:02,300 but it might reduce the need to have as many LTDS surveys. 1136 00:54:02,300 --> 00:54:03,240 Scaling. 1137 00:54:03,240 --> 00:54:06,870 So we've done ODX, and we've inferred 1138 00:54:06,870 --> 00:54:09,500 a percentage of destinations-- or we've inferred destinations 1139 00:54:09,500 --> 00:54:12,720 for a percentage of transactions, not all of them. 1140 00:54:12,720 --> 00:54:14,880 And we've linked up the ones that we could. 1141 00:54:14,880 --> 00:54:17,820 Now, we want the full matrix, because for planning, 1142 00:54:17,820 --> 00:54:22,570 we want to know how many people want to go from here to there. 1143 00:54:22,570 --> 00:54:26,940 So there are different methods for scaling. 1144 00:54:26,940 --> 00:54:30,510 We have different situations. 1145 00:54:30,510 --> 00:54:35,460 So AFC, AVL, and ODX, together, give an OD matrix, 1146 00:54:35,460 --> 00:54:38,970 but it's only for a sample of passenger trips. 1147 00:54:38,970 --> 00:54:41,730 If you have APC, that gives you the full boarding count 1148 00:54:41,730 --> 00:54:42,709 [AUDIO OUT]. 1149 00:54:42,709 --> 00:54:45,000 So if you have that for all your bus fleet, then great. 1150 00:54:45,000 --> 00:54:46,460 You can use IPF. 1151 00:54:46,460 --> 00:54:53,240 And you could apply your ODX matrix as the seed 1152 00:54:53,240 --> 00:54:56,170 to make it more accurate. 1153 00:54:56,170 --> 00:54:57,100 That's great. 1154 00:54:57,100 --> 00:55:01,090 In some cases, you only have APC on a fraction of vehicles 1155 00:55:01,090 --> 00:55:02,650 or on no vehicles.
1156 00:55:02,650 --> 00:55:05,750 And therefore, that's a little tougher. 1157 00:55:05,750 --> 00:55:10,940 So IPF can be applied in this context, 1158 00:55:10,940 --> 00:55:13,420 not just on the whole matrix, but also 1159 00:55:13,420 --> 00:55:18,100 on the part of the matrix that is not inferred. 1160 00:55:18,100 --> 00:55:20,920 So you can, essentially, subtract from your control 1161 00:55:20,920 --> 00:55:25,660 totals the portion of the demand that was inferred, 1162 00:55:25,660 --> 00:55:28,570 and apply IPF only on the remainder. 1163 00:55:28,570 --> 00:55:31,590 And that will scale up only that part. 1164 00:55:31,590 --> 00:55:33,470 OK? 1165 00:55:33,470 --> 00:55:39,670 That's better if you don't want to distort your seed too 1166 00:55:39,670 --> 00:55:41,080 much, essentially. 1167 00:55:41,080 --> 00:55:43,900 If you're not very comfortable assuming 1168 00:55:43,900 --> 00:55:47,050 that all the people that are not inferred 1169 00:55:47,050 --> 00:55:50,050 have the same demand OD structure as the people 1170 00:55:50,050 --> 00:55:51,880 that you do have an inference for, 1171 00:55:51,880 --> 00:55:53,410 then separating those out and using 1172 00:55:53,410 --> 00:55:55,780 IPF on the uninferred portion will give you 1173 00:55:55,780 --> 00:55:58,660 a more accurate result because you're not 1174 00:55:58,660 --> 00:56:03,850 amplifying whatever you observed and was able to infer. 1175 00:56:03,850 --> 00:56:08,910 So one example of that is transfers. 1176 00:56:08,910 --> 00:56:12,300 And we'll give an example of that in the next few slides, 1177 00:56:12,300 --> 00:56:13,320 actually, right here. 1178 00:56:13,320 --> 00:56:19,110 So consider scaling when you have transfer information 1179 00:56:19,110 --> 00:56:21,990 from ODX, and you don't have APC on every bus. 1180 00:56:21,990 --> 00:56:25,050 You have it on some buses, but not every bus. 1181 00:56:25,050 --> 00:56:28,410 So the real complete OD matrix is what you want. 
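The idea of applying IPF only to the part of the matrix that was not inferred can be sketched as follows. This is a minimal illustration under stated assumptions: the `ipf` function is a generic iterative proportional fitting routine written for this example, and the three-stop bus line, the inferred matrix, and the boarding and alighting control totals are made-up numbers. The inferred portion is subtracted from the control totals, IPF is run on what remains using the ODX matrix as the seed, and the result is added back.

```python
import numpy as np


def ipf(seed, row_targets, col_targets, max_iters=200, tol=1e-9):
    """Scale `seed` so its row and column sums match the targets."""
    m = np.asarray(seed, dtype=float).copy()
    rows = np.asarray(row_targets, dtype=float)
    cols = np.asarray(col_targets, dtype=float)
    for _ in range(max_iters):
        row_sums = m.sum(axis=1)
        # Rows with a zero sum keep a factor of 0 (nothing to scale).
        m *= np.divide(rows, row_sums, out=np.zeros_like(rows), where=row_sums > 0)[:, None]
        col_sums = m.sum(axis=0)
        m *= np.divide(cols, col_sums, out=np.zeros_like(cols), where=col_sums > 0)[None, :]
        if np.allclose(m.sum(axis=1), rows, rtol=0.0, atol=tol):
            break
    return m


# Hypothetical three-stop line: rows are boarding stops, columns alighting stops.
inferred = np.array([[0, 30, 20],
                     [0, 0, 25],
                     [0, 0, 0]], dtype=float)
boardings = np.array([70, 35, 0], dtype=float)    # assumed control totals (ons)
alightings = np.array([0, 40, 65], dtype=float)   # assumed control totals (offs)

# Subtract what ODX already explains, then scale only the remainder.
remaining_on = boardings - inferred.sum(axis=1)
remaining_off = alightings - inferred.sum(axis=0)
remainder = ipf(inferred, remaining_on, remaining_off)
full_matrix = inferred + remainder
print(np.round(full_matrix, 3))
```

The inferred cells are carried over unchanged; only the leftover boardings and alightings are distributed by IPF.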
1182 00:56:28,410 --> 00:56:29,820 And we could split it. 1183 00:56:29,820 --> 00:56:32,370 I'm using algebraic notation here. 1184 00:56:32,370 --> 00:56:35,400 The real matrix, R, can be split into the inferred portion 1185 00:56:35,400 --> 00:56:39,600 and the missing portion, M. 1186 00:56:39,600 --> 00:56:40,650 And the missing portion-- 1187 00:56:40,650 --> 00:56:42,191 there's two reasons for missing data. 1188 00:56:42,191 --> 00:56:44,160 One of them is, I saw the boarding, 1189 00:56:44,160 --> 00:56:46,840 but I couldn't infer destination. 1190 00:56:46,840 --> 00:56:49,150 So that's U, the uninferred portion. 1191 00:56:49,150 --> 00:56:51,296 And then there's the N, the non-interaction part. 1192 00:56:51,296 --> 00:56:52,920 Those are the people that board without 1193 00:56:52,920 --> 00:56:54,170 interacting with the fare box. 1194 00:56:58,045 --> 00:57:02,820 We want all of R. We know I, or at least we made an inference 1195 00:57:02,820 --> 00:57:05,790 for I. And then we want to estimate U and estimate N. 1196 00:57:05,790 --> 00:57:08,910 And then we can add those two estimates together to the I, 1197 00:57:08,910 --> 00:57:12,230 and we'll have the estimate of R. And that's what we want. 1198 00:57:12,230 --> 00:57:14,490 That's what scaling accomplishes. 1199 00:57:14,490 --> 00:57:18,980 Now there's one critical observation to make here. 1200 00:57:22,710 --> 00:57:25,560 If you take a trip on a bus line, 1201 00:57:25,560 --> 00:57:28,320 and then you transferred somewhere else, 1202 00:57:28,320 --> 00:57:32,910 there will be a tap close to your destination, 1203 00:57:32,910 --> 00:57:35,710 shortly after your destination. 1204 00:57:35,710 --> 00:57:38,790 So the likelihood that you were able to make 1205 00:57:38,790 --> 00:57:41,380 a destination inference is very high. Do you agree with that? 1206 00:57:45,490 --> 00:57:46,874 Yes or no?
1207 00:57:46,874 --> 00:57:49,684 AUDIENCE: [INAUDIBLE] 1208 00:57:49,684 --> 00:57:51,850 GABRIEL SANCHEZ-MARTINEZ: If you have some bus line, 1209 00:57:51,850 --> 00:57:57,120 and let's say that at stop B, there is a rail station. 1210 00:57:57,120 --> 00:58:01,380 And you are taking the bus line from D, C, 1211 00:58:01,380 --> 00:58:07,680 to B. If you are actually going to transfer to this transfer 1212 00:58:07,680 --> 00:58:14,580 station, then you will have a tap onto x, shortly 1213 00:58:14,580 --> 00:58:17,426 after you get off at B. Right? 1214 00:58:17,426 --> 00:58:18,370 OK. 1215 00:58:18,370 --> 00:58:23,680 So given that I inferred your origin being D, 1216 00:58:23,680 --> 00:58:26,470 the probability that I actually infer that your destination was 1217 00:58:26,470 --> 00:58:29,230 B is very high because I have the information 1218 00:58:29,230 --> 00:58:30,590 to make that inference. 1219 00:58:30,590 --> 00:58:33,100 It's close in time and in distance. 1220 00:58:33,100 --> 00:58:34,660 It will pass the checks. 1221 00:58:34,660 --> 00:58:37,960 So if we make the assumption that in every case where 1222 00:58:37,960 --> 00:58:40,750 we have a transfer, we've successfully 1223 00:58:40,750 --> 00:58:45,100 inferred the destination, then we have to then say, well, 1224 00:58:45,100 --> 00:58:48,280 then none of the people who are uninferred 1225 00:58:48,280 --> 00:58:49,990 had a transfer afterwards. 1226 00:58:49,990 --> 00:58:50,490 Right? 1227 00:58:53,430 --> 00:58:53,930 Right? 1228 00:58:53,930 --> 00:58:54,540 OK. 1229 00:58:54,540 --> 00:58:56,730 And what happens with-- 1230 00:58:56,730 --> 00:58:59,200 what if this is a very popular rail station, 1231 00:58:59,200 --> 00:59:02,670 and a lot of people take it? 1232 00:59:02,670 --> 00:59:06,540 Then the uninferred portion of the demand 1233 00:59:06,540 --> 00:59:08,130 are people who don't transfer there.
1234 00:59:08,130 --> 00:59:13,590 And if you applied the ODX matrix of the people 1235 00:59:13,590 --> 00:59:16,440 that you had a destination inference for, 1236 00:59:16,440 --> 00:59:20,820 you would be weighing B as the destination too much. 1237 00:59:20,820 --> 00:59:23,340 Those people are not very likely transferring to-- 1238 00:59:23,340 --> 00:59:26,430 some people might be getting off at B, but fewer of them 1239 00:59:26,430 --> 00:59:28,230 because you're looking at the people who 1240 00:59:28,230 --> 00:59:29,670 don't end up transferring. 1241 00:59:29,670 --> 00:59:33,090 So it could be people that go somewhere else around B, 1242 00:59:33,090 --> 00:59:34,780 but the percentage will be lower. 1243 00:59:34,780 --> 00:59:39,690 So what we want to do is produce a destination probability matrix 1244 00:59:39,690 --> 00:59:43,350 from the portion of I that we inferred that was not 1245 00:59:43,350 --> 00:59:45,850 followed by a transfer. 1246 00:59:45,850 --> 00:59:48,280 So we prepare a different matrix, 1247 00:59:48,280 --> 00:59:51,970 excluding the people that transferred after this trip. 1248 00:59:51,970 --> 00:59:55,570 And then we use that to scale up the remaining 1249 00:59:55,570 --> 00:59:57,680 origins in probability. 1250 00:59:57,680 --> 01:00:01,040 So that's what we have here, expressed mathematically. 1251 01:00:01,040 --> 01:00:03,572 U is the vector of boarding locations 1252 01:00:03,572 --> 01:00:04,780 with uninferred destinations. 1253 01:00:04,780 --> 01:00:06,970 And we multiply it times L bar, where 1254 01:00:06,970 --> 01:00:10,690 L bar is a matrix of destination probabilities of trips 1255 01:00:10,690 --> 01:00:12,040 not followed by transfers. 1256 01:00:12,040 --> 01:00:16,070 So that comes from ODX, but we remove 1257 01:00:16,070 --> 01:00:18,890 trips followed by transfers. 1258 01:00:18,890 --> 01:00:19,490 All right?
1259 01:00:19,490 --> 01:00:25,340 And then we have to take care of the non-interaction trips 1260 01:00:25,340 --> 01:00:26,780 or not observed trips. 1261 01:00:26,780 --> 01:00:31,490 Some of them are trips with uninferred origins. 1262 01:00:31,490 --> 01:00:35,080 So it could be that we know that this person was at this trip 1263 01:00:35,080 --> 01:00:37,420 because the origin inference failed, 1264 01:00:37,420 --> 01:00:39,220 or it could be that the person-- 1265 01:00:39,220 --> 01:00:42,280 there's some information that, in general, 1266 01:00:42,280 --> 01:00:44,540 some portion of passengers don't interact. 1267 01:00:44,540 --> 01:00:48,010 So you want to scale everything up by 5%, say, as an example. 1268 01:00:48,010 --> 01:00:50,830 That could come from surveys or from APC, 1269 01:00:50,830 --> 01:00:53,200 if you have APC on a portion of the fleet. 1270 01:00:53,200 --> 01:00:57,950 So essentially, n, here, is the scaling factor. 1271 01:00:57,950 --> 01:01:02,410 It could bump what you have so far, which is I plus U, 1272 01:01:02,410 --> 01:01:04,870 by some amount, some percentage. 1273 01:01:04,870 --> 01:01:06,790 That's the simple way of doing it. 1274 01:01:06,790 --> 01:01:07,870 n could be a vector. 1275 01:01:07,870 --> 01:01:09,880 So you could have different scaling factors 1276 01:01:09,880 --> 01:01:12,050 for each boarding stop. 1277 01:01:12,050 --> 01:01:13,990 It could be a 5% overall, or it could 1278 01:01:13,990 --> 01:01:16,750 be there's a lot of non-interaction at this stop, 1279 01:01:16,750 --> 01:01:17,950 but not this stop. 1280 01:01:17,950 --> 01:01:19,660 It could be correlated to loads. 1281 01:01:19,660 --> 01:01:20,965 So many things could happen. 1282 01:01:20,965 --> 01:01:22,720 You can scale up this way. 1283 01:01:22,720 --> 01:01:24,220 And now we have everything together. 1284 01:01:24,220 --> 01:01:27,880 So this is just combining the terms.
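The pieces of this trip-level scaling (the inferred matrix I, the uninferred boardings u spread with the no-transfer destination probabilities L bar, and the non-interaction factor n) can be put together in a short sketch. It rests on assumptions: the function name `scale_trip_level` and all numbers are hypothetical, and "u times L bar" is interpreted here as multiplying each origin's row of L bar by that origin's count of uninferred boardings. The combined estimate is computed as (1 + n) times (I + u L bar), with n either a flat rate or one value per boarding stop.

```python
import numpy as np


def scale_trip_level(inferred, inferred_no_transfer, uninferred_boardings, n):
    """Estimate the full OD matrix as (1 + n) * (I + u * L_bar)."""
    I = np.asarray(inferred, dtype=float)
    I_nt = np.asarray(inferred_no_transfer, dtype=float)
    u = np.asarray(uninferred_boardings, dtype=float)

    # L_bar: destination probabilities by origin, from trips NOT followed by a transfer.
    row_sums = I_nt.sum(axis=1, keepdims=True)
    L_bar = np.divide(I_nt, row_sums, out=np.zeros_like(I_nt), where=row_sums > 0)

    # Spread each origin's uninferred boardings over destinations.
    # (Origins with no no-transfer trips get nothing in this simple sketch.)
    U_hat = u[:, None] * L_bar

    # n may be a scalar (flat rate) or one value per boarding stop.
    n = np.asarray(n, dtype=float)
    factor = 1.0 + (n[:, None] if n.ndim == 1 else n)
    return factor * (I + U_hat)


# Hypothetical bus line D -> C -> B, where B has a popular rail transfer.
I = [[0, 10, 60], [0, 0, 30], [0, 0, 0]]            # all inferred trips
I_no_transfer = [[0, 10, 10], [0, 0, 10], [0, 0, 0]]  # inferred trips without a following transfer
u = [20, 10, 0]                                       # boardings with uninferred destinations
R_hat = scale_trip_level(I, I_no_transfer, u, n=0.05)
print(R_hat)
```

In this made-up case stop B attracts 6 of every 7 inferred trips from D, but only half of the uninferred boardings from D are assigned to it, because the probabilities come from trips without a following transfer.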
1285 01:01:27,880 --> 01:01:31,150 This is the scaling factor for not observed, 1286 01:01:31,150 --> 01:01:35,560 so if n is a flat 5%, this is 1.05. 1287 01:01:35,560 --> 01:01:37,930 And then I is what you had from ODX. 1288 01:01:37,930 --> 01:01:44,660 And uL is the application of the destination probability 1289 01:01:44,660 --> 01:01:48,420 to the people that had origins, but not inferred destinations. 1290 01:01:48,420 --> 01:01:49,840 OK? 1291 01:01:49,840 --> 01:01:53,370 Questions about this method, the scaling method? 1292 01:01:53,370 --> 01:01:57,690 This is for one trip, or for bus trips together, say. 1293 01:01:57,690 --> 01:02:00,860 It's not journey-level scaling. 1294 01:02:00,860 --> 01:02:03,620 So let's move on to journey-level scaling. 1295 01:02:03,620 --> 01:02:06,080 So now we're getting a little more complicated. 1296 01:02:06,080 --> 01:02:10,760 So we have journeys, which include full itineraries 1297 01:02:10,760 --> 01:02:13,370 of people boarding at one location, 1298 01:02:13,370 --> 01:02:16,640 or entering a station in one location, doing several trips, 1299 01:02:16,640 --> 01:02:18,396 including transfers, possibly. 1300 01:02:18,396 --> 01:02:20,270 And each of those is considered an itinerary. 1301 01:02:20,270 --> 01:02:23,960 An itinerary could be one stage, or could be multiple stages 1302 01:02:23,960 --> 01:02:26,180 linked together with transfers. 1303 01:02:26,180 --> 01:02:31,910 So again, we have inferred itineraries 1304 01:02:31,910 --> 01:02:33,260 for a portion of the demand. 1305 01:02:33,260 --> 01:02:37,280 But we want to scale up the demand, knowing control totals, 1306 01:02:37,280 --> 01:02:40,070 but at the itinerary level, because we have information 1307 01:02:40,070 --> 01:02:41,280 about itineraries. 1308 01:02:41,280 --> 01:02:43,340 So why not do it that way? 1309 01:02:43,340 --> 01:02:44,760 It could be more accurate.
1310 01:02:44,760 --> 01:02:49,340 So it's challenging because there are many possibilities. 1311 01:02:49,340 --> 01:02:52,460 And some places that people go through 1312 01:02:52,460 --> 01:02:55,260 don't have good control totals. 1313 01:02:55,260 --> 01:02:57,870 But essentially, we can follow an approach that 1314 01:02:57,870 --> 01:03:02,100 is, in essence, IPF but applied not 1315 01:03:02,100 --> 01:03:07,620 to boardings and alightings, but to the scaling 1316 01:03:07,620 --> 01:03:08,890 factors themselves. 1317 01:03:08,890 --> 01:03:10,980 So this is a toy example. 1318 01:03:10,980 --> 01:03:13,440 Here's a rail line where there's tap in and tap out. 1319 01:03:13,440 --> 01:03:16,170 Here's a bus line where there's only tap in. 1320 01:03:16,170 --> 01:03:18,300 So there are many possible itineraries 1321 01:03:18,300 --> 01:03:19,860 that a person could have here. 1322 01:03:19,860 --> 01:03:21,600 Going from A to B, transfer to D, 1323 01:03:21,600 --> 01:03:23,610 alight at E, that's one itinerary. 1324 01:03:23,610 --> 01:03:27,100 Go from A to C, through B, that's another itinerary. 1325 01:03:27,100 --> 01:03:30,390 And at each of the nodes here, A, B, C, D, E, 1326 01:03:30,390 --> 01:03:32,100 there might be some counts. 1327 01:03:32,100 --> 01:03:35,520 So on D and E, we only have on counts because people are not 1328 01:03:35,520 --> 01:03:36,750 tapping off. 1329 01:03:36,750 --> 01:03:41,490 On A, B, and C, we have on and off, or entry and exit, counts. 1330 01:03:41,490 --> 01:03:45,750 That means that there are these count nodes, A 1331 01:03:45,750 --> 01:03:48,630 in, B out, as two examples, where 1332 01:03:48,630 --> 01:03:52,070 we count how many people go through that place. 1333 01:03:52,070 --> 01:03:59,300 And we know from-- 1334 01:03:59,300 --> 01:04:03,620 we have, for a portion of the people, this ODX sample here, 1335 01:04:03,620 --> 01:04:08,840 some number of itineraries that go through A. So this is A in. 
1336 01:04:08,840 --> 01:04:12,957 So we have some people going-- 1337 01:04:12,957 --> 01:04:14,540 a portion of those people are inferred 1338 01:04:14,540 --> 01:04:18,560 to have gone from A to B, a portion to ABC, 1339 01:04:18,560 --> 01:04:21,560 a portion on the itinerary ABDE. 1340 01:04:21,560 --> 01:04:25,320 And then there's some portion of it, shown here as delta A 1341 01:04:25,320 --> 01:04:28,040 in, who are included in the counts, 1342 01:04:28,040 --> 01:04:33,440 but we don't have an inference of their itinerary. 1343 01:04:33,440 --> 01:04:34,210 OK? 1344 01:04:34,210 --> 01:04:37,990 So what we want to do is scale up the mixture of ODX 1345 01:04:37,990 --> 01:04:42,360 here to make up the total entry count. 1346 01:04:42,360 --> 01:04:44,760 But there's a catch. 1347 01:04:44,760 --> 01:04:47,760 These itineraries are affecting counts elsewhere 1348 01:04:47,760 --> 01:04:48,790 on the network. 1349 01:04:48,790 --> 01:04:51,310 So we also have B out, as one example. 1350 01:04:51,310 --> 01:04:54,360 And we know that the people on AB 1351 01:04:54,360 --> 01:04:58,320 also show up on the count on B, not the people going ABC, 1352 01:04:58,320 --> 01:04:58,890 though. 1353 01:04:58,890 --> 01:05:00,660 Those are not included in the count of B 1354 01:05:00,660 --> 01:05:03,480 out because they don't exit at B. 1355 01:05:03,480 --> 01:05:07,950 And there are some new itineraries showing up at B out 1356 01:05:07,950 --> 01:05:09,600 that are not in A in. 1357 01:05:09,600 --> 01:05:14,880 So we want to somehow match the counts at all the places 1358 01:05:14,880 --> 01:05:18,000 that are affected, or that are showing up, 1359 01:05:18,000 --> 01:05:22,200 or associated with each itinerary, 1360 01:05:22,200 --> 01:05:24,600 and scale the demand so that the control totals are 1361 01:05:24,600 --> 01:05:27,590 satisfied at all locations. 1362 01:05:27,590 --> 01:05:30,070 So the method is similar to IPF.
1363 01:05:30,070 --> 01:05:33,460 We prepare a binary location itinerary incidence matrix 1364 01:05:33,460 --> 01:05:37,510 with zeros and ones, associating each itinerary 1365 01:05:37,510 --> 01:05:39,790 with the count nodes. 1366 01:05:39,790 --> 01:05:44,530 So A in, well, AB is one itinerary that is shown there. 1367 01:05:44,530 --> 01:05:50,150 So is ABC, ABDE, not CB, not CBDE, not DE, as an example. 1368 01:05:50,150 --> 01:05:53,680 So we have this big matrix of zeros and ones. 1369 01:05:53,680 --> 01:05:57,340 And we have two equations. 1370 01:06:01,120 --> 01:06:06,820 Ti is the total scaled up itinerary demand 1371 01:06:06,820 --> 01:06:09,550 on itinerary i. 1372 01:06:09,550 --> 01:06:11,130 And we know that that total is going 1373 01:06:11,130 --> 01:06:15,490 to be 1 plus the scaling factor, or the scaling factor 1374 01:06:15,490 --> 01:06:16,930 is 1 plus alpha, really. 1375 01:06:16,930 --> 01:06:18,730 Alpha is the portion over 1 that we 1376 01:06:18,730 --> 01:06:23,560 want to scale by times the observed or inferred flow 1377 01:06:23,560 --> 01:06:27,310 on that itinerary i, which is t. 1378 01:06:27,310 --> 01:06:31,690 Then we also have this other relationship that the remaining 1379 01:06:31,690 --> 01:06:37,270 portion of the count on a node, which is the count on node n, 1380 01:06:37,270 --> 01:06:40,030 the control total on node n, minus the portion 1381 01:06:40,030 --> 01:06:46,060 that was seen through that place is the amount-- 1382 01:06:46,060 --> 01:06:48,490 it adds up to the sum of all itineraries going 1383 01:06:48,490 --> 01:06:52,295 through that place times their scaling factors. 1384 01:06:52,295 --> 01:06:54,670 So now we need to figure out what the scaling factors are 1385 01:06:54,670 --> 01:06:58,490 for each itinerary, satisfying both of these equations. 
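The two relationships just described can be written out as follows. This is a reconstruction from the spoken description, not a copy of the slide: the symbols $a_{n,i}$ (the zero-one incidence entry linking count node $n$ to itinerary $i$) and $C_n$ (the control total at node $n$) are names assumed here for illustration, while $T_i$, $t_i$, $\alpha_i$ and $\Delta_n$ follow the lecture's wording.

```latex
% Scaled-up demand on itinerary i: observed (inferred) flow times (1 + alpha_i)
T_i = (1 + \alpha_i)\, t_i

% Unexplained part of the count at node n: control total minus observed flow
% through n, which must equal the added (scaled) flow on itineraries through n
\Delta_n = C_n - \sum_i a_{n,i}\, t_i = \sum_i a_{n,i}\, \alpha_i\, t_i
```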
1386 01:06:58,490 --> 01:07:02,620 We have two equations on vectors, 1387 01:07:02,620 --> 01:07:05,950 we have it on control nodes, and we also have it on itineraries. 1388 01:07:05,950 --> 01:07:07,450 So we're back to the same situation. 1389 01:07:07,450 --> 01:07:13,330 We can do IPF on itineraries and count nodes. 1390 01:07:13,330 --> 01:07:15,090 And that's what we do. 1391 01:07:15,090 --> 01:07:17,770 If we have better data, we could initialize with a good seed 1392 01:07:17,770 --> 01:07:21,200 matrix, otherwise you could initialize to 1. 1393 01:07:21,200 --> 01:07:29,740 We then update the estimated count nodes, which is delta. 1394 01:07:29,740 --> 01:07:32,680 Again, delta here is the difference 1395 01:07:32,680 --> 01:07:36,340 between the count of flow through that node 1396 01:07:36,340 --> 01:07:37,660 and the observed flow. 1397 01:07:37,660 --> 01:07:40,700 So it's the part that you have to scale up to. 1398 01:07:40,700 --> 01:07:42,610 And you do that for all nodes. 1399 01:07:42,610 --> 01:07:46,070 And then you-- oh, let's look at what happens here. 1400 01:07:46,070 --> 01:07:50,620 So when you apply this equation, you calculate a delta hat 1401 01:07:50,620 --> 01:07:51,880 right here. 1402 01:07:51,880 --> 01:07:54,440 That is much higher, it looks like. 1403 01:07:54,440 --> 01:07:58,919 Yeah, it's much higher than the actual measured delta. 1404 01:07:58,919 --> 01:08:00,460 So you know that that initial scaling 1405 01:08:00,460 --> 01:08:05,860 is producing demand flow that is too high through those nodes. 1406 01:08:05,860 --> 01:08:07,840 And that's because we said that alpha 1407 01:08:07,840 --> 01:08:10,210 was 1 for all of those nodes. 1408 01:08:10,210 --> 01:08:13,210 But it isn't, so those alphas need to be adjusted now. 1409 01:08:13,210 --> 01:08:14,740 So we moved to alphas.
1410 01:08:14,740 --> 01:08:18,490 And we update alphas by looking at, essentially, 1411 01:08:18,490 --> 01:08:21,430 the average scaling factor required 1412 01:08:21,430 --> 01:08:27,399 across these control deltas that apply to each itinerary. 1413 01:08:27,399 --> 01:08:29,380 Not all of them apply to each itinerary. 1414 01:08:29,380 --> 01:08:31,930 So for itinerary AB, you would take 1415 01:08:31,930 --> 01:08:35,500 the average required adjustment factor 1416 01:08:35,500 --> 01:08:37,990 across all the deltas that apply to AB, 1417 01:08:37,990 --> 01:08:40,180 which are only the first two. 1418 01:08:40,180 --> 01:08:42,310 You wouldn't include the last three 1419 01:08:42,310 --> 01:08:44,760 because they don't touch AB. 1420 01:08:44,760 --> 01:08:50,120 So you calculate the average, and you adjust the alphas. 1421 01:08:50,120 --> 01:08:56,060 But now you're not getting the demand that you expected, 1422 01:08:56,060 --> 01:08:59,470 so you have to go back, and you cycle through again. 1423 01:08:59,470 --> 01:09:01,220 And you go back and forth, back and forth. 1424 01:09:01,220 --> 01:09:04,550 You apply these two equations until you converge. 1425 01:09:04,550 --> 01:09:07,609 And convergence in this case means that the delta hats 1426 01:09:07,609 --> 01:09:10,700 will match the deltas that you measured, 1427 01:09:10,700 --> 01:09:12,979 and that the alpha values are not changing 1428 01:09:12,979 --> 01:09:15,180 much between iterations. 1429 01:09:15,180 --> 01:09:16,189 OK? 1430 01:09:16,189 --> 01:09:22,660 So we haven't seen the proof that this converges. 1431 01:09:22,660 --> 01:09:24,840 It relies on an average here. 1432 01:09:24,840 --> 01:09:29,560 So it's an introduction or a new aspect of the method. 1433 01:09:29,560 --> 01:09:34,670 But in every test that we've run, it does converge. 
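The back-and-forth between delta hats and alphas described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for London: the function name `scale_itineraries`, its parameters, and the exact update rule (multiply each alpha by the average of delta over delta hat across the count nodes that itinerary touches) are one reading of the spoken description. The toy network is also made up: two itineraries, AB and ABC, and three count nodes, A in, B out, C out.

```python
def scale_itineraries(incidence, observed, counts, max_iter=500, tol=1e-9):
    """Hypothetical sketch of journey-level scaling.

    incidence[n][i] is 1 if itinerary i passes through count node n, else 0.
    observed[i] is the inferred flow t_i on itinerary i.
    counts[n] is the control total at node n.
    Returns the list of alpha_i, so that scaled demand is (1 + alpha_i) * t_i.
    """
    n_nodes, n_itins = len(incidence), len(observed)
    # delta_n: part of each control total not explained by observed flow.
    delta = [
        counts[n] - sum(incidence[n][i] * observed[i] for i in range(n_itins))
        for n in range(n_nodes)
    ]
    alpha = [1.0] * n_itins  # no better seed available, so start at 1
    for _ in range(max_iter):
        # Step 1: added flow through each node implied by the current alphas.
        delta_hat = [
            sum(incidence[n][i] * alpha[i] * observed[i] for i in range(n_itins))
            for n in range(n_nodes)
        ]
        # Step 2: adjust each alpha by the average required correction
        # over the count nodes that the itinerary actually touches.
        new_alpha = []
        for i in range(n_itins):
            ratios = [
                delta[n] / delta_hat[n]
                for n in range(n_nodes)
                if incidence[n][i] and delta_hat[n] > 0
            ]
            factor = sum(ratios) / len(ratios) if ratios else 1.0
            new_alpha.append(alpha[i] * factor)
        change = max(abs(a - b) for a, b in zip(new_alpha, alpha))
        alpha = new_alpha
        if change < tol:  # alphas have stopped moving between iterations
            break
    return alpha


# Toy data (assumed): rows are A in, B out, C out; columns are AB, ABC.
incidence = [[1, 1], [1, 0], [0, 1]]
observed = [100.0, 50.0]
counts = [225.0, 150.0, 75.0]
alpha = scale_itineraries(incidence, observed, counts)
scaled = [(1 + a) * t for a, t in zip(alpha, observed)]
```

In this toy case every itinerary needs the same 50% uplift, which matches the situation described next in the lecture where similar scaling factors converge quickly. The sketch does not guard against negative deltas or prove convergence.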
1434 01:09:34,670 --> 01:09:36,500 On a toy example, where we did 1435 01:09:36,500 --> 01:09:38,540 know what the actual journeys were, 1436 01:09:38,540 --> 01:09:41,090 because we produced the real data, 1437 01:09:41,090 --> 01:09:43,609 and then we hid it, and scaled it up, 1438 01:09:43,609 --> 01:09:50,750 we started with different required scaling factors 1439 01:09:50,750 --> 01:09:52,260 by itinerary. 1440 01:09:52,260 --> 01:09:59,480 So in the blue line case, this blue line right here, 1441 01:09:59,480 --> 01:10:02,090 the scaling factors required across itineraries 1442 01:10:02,090 --> 01:10:04,050 were very similar to each other. 1443 01:10:04,050 --> 01:10:06,750 They all needed to be scaled up by the same amount. 1444 01:10:06,750 --> 01:10:10,100 And what we show here is that the solution converged 1445 01:10:10,100 --> 01:10:13,280 very quickly, and that the accuracy was high, 1446 01:10:13,280 --> 01:10:16,880 because the root mean squared error was low. 1447 01:10:16,880 --> 01:10:17,630 OK? 1448 01:10:17,630 --> 01:10:19,940 But then as we start moving to differences 1449 01:10:19,940 --> 01:10:22,940 in scaling factors required, the algorithm 1450 01:10:22,940 --> 01:10:24,970 did converge and produced different alphas 1451 01:10:24,970 --> 01:10:26,990 for each itinerary. 1452 01:10:26,990 --> 01:10:29,960 But it took longer to converge, and the root 1453 01:10:29,960 --> 01:10:32,540 mean squared error of the final solution 1454 01:10:32,540 --> 01:10:34,460 was higher, which makes sense. 1455 01:10:34,460 --> 01:10:37,280 All these methods, the IPF-type methods, 1456 01:10:37,280 --> 01:10:39,425 amplify errors. 1457 01:10:39,425 --> 01:10:41,550 They amplify whatever you give them at the beginning.
1458 01:10:41,550 --> 01:10:45,950 So if you start with something all 1's and in fact, 1459 01:10:45,950 --> 01:10:49,910 they are quite different from 1's or some of them are 1, 1460 01:10:49,910 --> 01:10:52,160 and others are not, then you're going to have a bigger 1461 01:10:52,160 --> 01:10:54,800 error in the final solution. 1462 01:10:54,800 --> 01:11:01,400 So this was applied to London once again. 1463 01:11:01,400 --> 01:11:05,710 In practice, there is another complication. 1464 01:11:05,710 --> 01:11:10,450 People who are counted at each node, 1465 01:11:10,450 --> 01:11:13,210 they don't all finish their journeys, 1466 01:11:13,210 --> 01:11:18,880 or they don't all start their journeys in the time band 1467 01:11:18,880 --> 01:11:22,660 that you are including the counts in. 1468 01:11:22,660 --> 01:11:26,620 So if you do a trip that takes a whole hour, 1469 01:11:26,620 --> 01:11:28,780 you might be seeing tapping in here, 1470 01:11:28,780 --> 01:11:32,590 and then you might end up tapping out at the next hour. 1471 01:11:32,590 --> 01:11:35,530 So you need control totals, in this case, by the hour. 1472 01:11:35,530 --> 01:11:39,950 We were looking at hour scaling, scaling of demand every hour. 1473 01:11:39,950 --> 01:11:45,550 But you need to adjust the control totals to get 1474 01:11:45,550 --> 01:11:48,010 what percentage of people who are tapping out 1475 01:11:48,010 --> 01:11:51,640 at this location actually started 1476 01:11:51,640 --> 01:11:53,680 their journey in that hour. 1477 01:11:53,680 --> 01:11:56,380 How many of them actually started on the hour before? 1478 01:11:56,380 --> 01:11:58,180 So there was an offset correction, 1479 01:11:58,180 --> 01:12:00,200 and this is what is being shown here. 1480 01:12:00,200 --> 01:12:04,000 Here you have raw entries and raw exits in dashed lines. 
1481 01:12:04,000 --> 01:12:07,270 And the correction essentially shifted those entries 1482 01:12:07,270 --> 01:12:09,930 backwards in time a little bit, so that the control totals 1483 01:12:09,930 --> 01:12:11,260 matched. 1484 01:12:11,260 --> 01:12:13,876 And then you can run the journey scaling as we just showed. 1485 01:12:17,050 --> 01:12:21,430 And the results, we don't have ground truth data in this case. 1486 01:12:21,430 --> 01:12:22,630 You could run a survey. 1487 01:12:22,630 --> 01:12:25,060 I guess LTDS, in some ways, is ground truth, 1488 01:12:25,060 --> 01:12:26,610 but it's a low sample. 1489 01:12:26,610 --> 01:12:29,470 The overall scaling required was about 50%. 1490 01:12:29,470 --> 01:12:33,950 You can see the 3/2 line here. 1491 01:12:33,950 --> 01:12:38,110 One way of validating it was to run it only on rail. 1492 01:12:38,110 --> 01:12:43,540 Because rail has O and D, we can run IPF on the rail matrix, 1493 01:12:43,540 --> 01:12:47,650 because you have the control totals for all the gates 1494 01:12:47,650 --> 01:12:48,750 in and out. 1495 01:12:48,750 --> 01:12:51,490 So there's no complication of the bus. 1496 01:12:51,490 --> 01:12:53,200 You can run IPF on that. 1497 01:12:53,200 --> 01:12:57,310 And it aligned very well with the more sophisticated solution 1498 01:12:57,310 --> 01:13:02,710 of running this biproportional fitting on the itinerary 1499 01:13:02,710 --> 01:13:07,570 scaling factors, instead of the simpler IPF method. 1500 01:13:07,570 --> 01:13:09,670 Presumably, the errors that you see here 1501 01:13:09,670 --> 01:13:11,410 that are slightly off the diagonal 1502 01:13:11,410 --> 01:13:14,050 are improvements in accuracy. 1503 01:13:14,050 --> 01:13:17,830 Because instead of starting from, say, all 1's you 1504 01:13:17,830 --> 01:13:23,080 are using good information about a seed matrix from ODX. 1505 01:13:23,080 --> 01:13:26,080 So the accuracy should have been improving.
1506 01:13:26,080 --> 01:13:32,419 We don't have, again, the ground truth to assert that, to measure 1507 01:13:32,419 --> 01:13:33,835 how close we are to reality. 1508 01:13:37,830 --> 01:13:40,670 And one application of this, here's 1509 01:13:40,670 --> 01:13:48,610 a chart showing all the origins. 1510 01:13:48,610 --> 01:13:50,550 So it's like a heat map of London. 1511 01:13:50,550 --> 01:13:56,880 And a darker color of red shading each cell 1512 01:13:56,880 --> 01:13:58,590 shows a higher proportion of people 1513 01:13:58,590 --> 01:14:01,890 originating at that location and going to Oxford Circus, which 1514 01:14:01,890 --> 01:14:03,640 is right here in the middle. 1515 01:14:03,640 --> 01:14:05,970 So for a planner, looking at this 1516 01:14:05,970 --> 01:14:08,430 and knowing if I want to know where 1517 01:14:08,430 --> 01:14:13,260 people are coming from to Oxford Circus, here's a map. 1518 01:14:13,260 --> 01:14:17,620 And you can do this by time band, so only for the AM peak or-- 1519 01:14:17,620 --> 01:14:20,720 there are many applications of this origin-destination data. 1520 01:14:20,720 --> 01:14:24,140 I'm just showing you one here. 1521 01:14:24,140 --> 01:14:29,600 And here are some references, so Jay Gordon's thesis, 1522 01:14:29,600 --> 01:14:32,300 a paper he wrote. 1523 01:14:32,300 --> 01:14:36,620 The Southwick reference is for the scaling of buses 1524 01:14:36,620 --> 01:14:39,690 without the transfer demand. 1525 01:14:39,690 --> 01:14:41,930 So if you want to read more about that, 1526 01:14:41,930 --> 01:14:43,820 that's the write-up. 1527 01:14:43,820 --> 01:14:47,990 And then I wrote a paper on the inference 1528 01:14:47,990 --> 01:14:49,850 of destinations using dynamic programming, 1529 01:14:49,850 --> 01:14:51,680 instead of closest node. 1530 01:14:51,680 --> 01:14:53,310 That's also published. 1531 01:14:53,310 --> 01:14:54,380 You can get that.
1532 01:14:54,380 --> 01:14:56,870 And then finally, Jay's website has 1533 01:14:56,870 --> 01:15:02,000 these visualizations for London, and Boston, and yeah. 1534 01:15:02,000 --> 01:15:04,610 So you can have fun looking at that. 1535 01:15:07,380 --> 01:15:07,940 All right. 1536 01:15:07,940 --> 01:15:09,440 Do we have any questions about this? 1537 01:15:13,120 --> 01:15:16,160 We can watch the animations again. 1538 01:15:16,160 --> 01:15:18,960 And maybe now we really know-- 1539 01:15:18,960 --> 01:15:22,670 now we really appreciate what went into it. 1540 01:15:22,670 --> 01:15:24,970 AUDIENCE: So I know there's a number 1541 01:15:24,970 --> 01:15:29,380 of old systems [AUDIO OUT] a lot of newer systems 1542 01:15:29,380 --> 01:15:31,840 are just doing proof of payment. 1543 01:15:31,840 --> 01:15:34,380 So in that case, they only can use-- 1544 01:15:34,380 --> 01:15:36,280 GABRIEL SANCHEZ-MARTINEZ: So some new systems 1545 01:15:36,280 --> 01:15:37,720 are proof of payment as well. 1546 01:15:37,720 --> 01:15:39,220 AUDIENCE: Right. 1547 01:15:39,220 --> 01:15:41,054 And in that case, they can only use APC to-- 1548 01:15:41,054 --> 01:15:42,511 GABRIEL SANCHEZ-MARTINEZ: So right. 1549 01:15:42,511 --> 01:15:43,360 You can have APC. 1550 01:15:43,360 --> 01:15:46,490 APC has some issues, especially when the vehicle is crowded. 1551 01:15:46,490 --> 01:15:48,430 Some people block APC sensors. 1552 01:15:48,430 --> 01:15:51,490 So you could have-- 1553 01:15:51,490 --> 01:15:54,550 if it's proof of payment, manual surveys 1554 01:15:54,550 --> 01:15:57,942 are the other alternative. 1555 01:15:57,942 --> 01:15:58,441 Yeah. 1556 01:16:01,190 --> 01:16:04,510 And proof of payment is something we can debate. 1557 01:16:04,510 --> 01:16:06,470 It has benefits and [AUDIO OUT]. 1558 01:16:06,470 --> 01:16:08,750 So on the data collection side, that's 1559 01:16:08,750 --> 01:16:11,890 a clear disadvantage of proof of payment.
1560 01:16:11,890 --> 01:16:13,430 Yeah. 1561 01:16:13,430 --> 01:16:17,600 AUDIENCE: So for the scaling-- 1562 01:16:17,600 --> 01:16:19,810 GABRIEL SANCHEZ-MARTINEZ: Which scaling method? 1563 01:16:19,810 --> 01:16:21,720 AUDIENCE: The slide 41 1564 01:16:21,720 --> 01:16:23,860 GABRIEL SANCHEZ-MARTINEZ: 41, OK. 1565 01:16:23,860 --> 01:16:27,250 AUDIENCE: You've got an itinerary and an itinerary, so 1566 01:16:27,250 --> 01:16:29,000 how many itineraries do you have? 1567 01:16:29,000 --> 01:16:30,660 GABRIEL SANCHEZ-MARTINEZ: Oh many, many, many, many. 1568 01:16:30,660 --> 01:16:31,535 AUDIENCE: [INAUDIBLE] 1569 01:16:31,535 --> 01:16:35,630 GABRIEL SANCHEZ-MARTINEZ: Any possible combination of-- 1570 01:16:35,630 --> 01:16:37,380 I don't know if I have that here. 1571 01:16:37,380 --> 01:16:40,600 AUDIENCE: This is only one example of one trip. 1572 01:16:40,600 --> 01:16:42,100 GABRIEL SANCHEZ-MARTINEZ: Yeah, so-- 1573 01:16:42,100 --> 01:16:42,630 AUDIENCE: So you have-- 1574 01:16:42,630 --> 01:16:43,280 GABRIEL SANCHEZ-MARTINEZ: So we know 1575 01:16:43,280 --> 01:16:45,590 that there are trillions of solutions 1576 01:16:45,590 --> 01:16:47,660 that satisfy the control total. 1577 01:16:47,660 --> 01:16:50,090 I forget how many unique itineraries there 1578 01:16:50,090 --> 01:16:52,730 are, but many, many, many. 1579 01:16:52,730 --> 01:16:58,650 It's a large number in a city like London, particularly. 1580 01:16:58,650 --> 01:17:00,530 AUDIENCE: So for this method, you-- 1581 01:17:00,530 --> 01:17:01,905 GABRIEL SANCHEZ-MARTINEZ: So this 1582 01:17:01,905 --> 01:17:03,560 is a computationally intense activity. 1583 01:17:03,560 --> 01:17:05,730 Yeah. 1584 01:17:05,730 --> 01:17:09,219 AUDIENCE: You need the APC information. 1585 01:17:09,219 --> 01:17:10,760 GABRIEL SANCHEZ-MARTINEZ: On bus, you 1586 01:17:10,760 --> 01:17:11,930 would want to have that. 1587 01:17:11,930 --> 01:17:15,380 That would be one control total that you could use for bus. 
1588 01:17:15,380 --> 01:17:17,630 This method is flexible, though, because it 1589 01:17:17,630 --> 01:17:20,870 doesn't require that you have control totals everywhere. 1590 01:17:20,870 --> 01:17:23,967 You just use whatever control total you trust. 1591 01:17:23,967 --> 01:17:27,260 So say, if you didn't have APC on buses, 1592 01:17:27,260 --> 01:17:30,000 then you would only use the control total on rail. 1593 01:17:30,000 --> 01:17:31,700 If you have APC on some buses, you 1594 01:17:31,700 --> 01:17:34,100 could use the control totals on those buses 1595 01:17:34,100 --> 01:17:36,890 to improve the scaling information. 1596 01:17:36,890 --> 01:17:42,446 But it doesn't require that all the places have counts. 1597 01:17:42,446 --> 01:17:43,320 Does that make sense? 1598 01:17:43,320 --> 01:17:48,330 Because you have, essentially, a list of count nodes. 1599 01:17:48,330 --> 01:17:51,810 And that list is not necessarily a complete list of places 1600 01:17:51,810 --> 01:17:53,650 that people go through. 1601 01:17:53,650 --> 01:17:58,740 And then you have all the list of itineraries that you see, 1602 01:17:58,740 --> 01:18:00,170 and you want to associate those. 1603 01:18:06,330 --> 01:18:07,980 Again, no proof of convergence here. 1604 01:18:07,980 --> 01:18:11,310 But [AUDIO OUT]. 1605 01:18:11,310 --> 01:18:14,280 And on the toy examples we run, we 1606 01:18:14,280 --> 01:18:20,010 observe these properties of convergence rate and error 1607 01:18:20,010 --> 01:18:26,080 at the end, which is consistent with normal or more traditional 1608 01:18:26,080 --> 01:18:28,830 applications of IPF. 1609 01:18:28,830 --> 01:18:32,596 So we can hypothesize that it behaves very similarly. 1610 01:18:35,255 --> 01:18:36,130 Question in the back. 1611 01:18:36,130 --> 01:18:39,640 AUDIENCE: [INAUDIBLE] generate a list of reasonable itineraries. 
1612 01:18:39,640 --> 01:18:40,070 GABRIEL SANCHEZ-MARTINEZ: It's not 1613 01:18:40,070 --> 01:18:41,444 a list of reasonable itineraries. 1614 01:18:41,444 --> 01:18:43,650 It's a list of inferred itineraries. 1615 01:18:43,650 --> 01:18:47,000 So it's an itinerary that was inferred 1616 01:18:47,000 --> 01:18:49,010 because I saw you tapping in here, 1617 01:18:49,010 --> 01:18:51,710 out there, then transferring, taking this bus, 1618 01:18:51,710 --> 01:18:53,870 and maybe you took three other buses after that. 1619 01:18:53,870 --> 01:18:55,160 Maybe one person did that. 1620 01:18:55,160 --> 01:18:56,210 AUDIENCE: [INAUDIBLE] 1621 01:18:56,210 --> 01:18:57,240 GABRIEL SANCHEZ-MARTINEZ: That's one itinerary 1622 01:18:57,240 --> 01:18:58,114 that's included here. 1623 01:18:58,114 --> 01:19:00,270 AUDIENCE: So do we just ignore the possibility 1624 01:19:00,270 --> 01:19:02,634 that the people with uninferred destinations have-- 1625 01:19:02,634 --> 01:19:04,550 GABRIEL SANCHEZ-MARTINEZ: Something different? 1626 01:19:04,550 --> 01:19:06,140 Yes. 1627 01:19:06,140 --> 01:19:08,300 You could generate every combination, 1628 01:19:08,300 --> 01:19:10,520 but that's maybe intractable. 1629 01:19:10,520 --> 01:19:13,130 So what we're doing is only considering 1630 01:19:13,130 --> 01:19:17,120 itineraries that were observed, and only scaling those up. 1631 01:19:19,700 --> 01:19:22,350 But that's a good point. 1632 01:19:22,350 --> 01:19:24,780 There might be some people that were counted, 1633 01:19:24,780 --> 01:19:27,450 but didn't have an inference, and their itinerary 1634 01:19:27,450 --> 01:19:29,230 might be completely different. 1635 01:19:29,230 --> 01:19:31,730 And this method doesn't handle that.
1636 01:19:36,926 --> 01:19:42,370 AUDIENCE: How different is [AUDIO OUT] the scaling matrix, 1637 01:19:42,370 --> 01:19:45,990 compared with the traditional matrix that you can calculate 1638 01:19:45,990 --> 01:19:50,010 with the sample [INAUDIBLE] to infer the alighting? 1639 01:19:50,010 --> 01:19:52,527 GABRIEL SANCHEZ-MARTINEZ: You mean the traditional IPF? 1640 01:19:52,527 --> 01:19:53,110 AUDIENCE: Yes. 1641 01:19:53,110 --> 01:19:55,110 Are you [INAUDIBLE]? 1642 01:19:55,110 --> 01:19:56,300 Is it very different? 1643 01:19:56,300 --> 01:19:56,810 Or is-- 1644 01:19:56,810 --> 01:19:58,018 GABRIEL SANCHEZ-MARTINEZ: No. 1645 01:19:58,018 --> 01:20:02,530 Well, it depends on the accuracy of your IPF procedure 1646 01:20:02,530 --> 01:20:06,890 and the accuracy of your control totals. 1647 01:20:06,890 --> 01:20:10,030 Yeah, if your seed matrix is very good, 1648 01:20:10,030 --> 01:20:14,160 then IPF should [AUDIO OUT] well. 1649 01:20:14,160 --> 01:20:18,900 [AUDIO OUT] any applications of IPF, we just seed it to one 1650 01:20:18,900 --> 01:20:20,220 and run it. 1651 01:20:20,220 --> 01:20:23,910 And you have the issue of the transfer, which 1652 01:20:23,910 --> 01:20:26,880 would plague that, and it would amplify the error 1653 01:20:26,880 --> 01:20:30,540 because presumably, your seed matrix-- 1654 01:20:30,540 --> 01:20:32,850 a lot of people might be getting off here to transfer, 1655 01:20:32,850 --> 01:20:35,040 and you may not be considering that. 1656 01:20:35,040 --> 01:20:37,110 If you use IPF to scale up the inferred portion, 1657 01:20:37,110 --> 01:20:40,650 for example, that would happen. 1658 01:20:40,650 --> 01:20:44,760 So I think through our validation, 1659 01:20:44,760 --> 01:20:49,650 we have seen that there's more reason 1660 01:20:49,650 --> 01:20:53,070 to trust the inference algorithms scaled 1661 01:20:53,070 --> 01:20:57,960 up than to just take control totals and scale those up. 
1662 01:20:57,960 --> 01:20:59,690 From an information theory perspective, 1663 01:20:59,690 --> 01:21:02,370 we're adding information, so you should-- 1664 01:21:02,370 --> 01:21:05,550 even if we can't provide evidence that absolutely, it 1665 01:21:05,550 --> 01:21:08,070 is the case, from information theory, 1666 01:21:08,070 --> 01:21:09,750 we see we're consuming more information 1667 01:21:09,750 --> 01:21:12,300 to generate this estimate, so the estimate 1668 01:21:12,300 --> 01:21:15,630 should be more accurate. 1669 01:21:15,630 --> 01:21:19,440 But it depends on whether your inferences were correct or not. 1670 01:21:19,440 --> 01:21:23,150 That depends on the assumptions you made about people. 1671 01:21:23,150 --> 01:21:24,980 So many assumptions go into this, 1672 01:21:24,980 --> 01:21:28,602 and it's hard to say exactly. 1673 01:21:32,570 --> 01:21:33,210 All right. 1674 01:21:33,210 --> 01:21:35,940 So do you want to look at Boston? 1675 01:21:35,940 --> 01:21:37,716 Or-- 1676 01:21:43,870 --> 01:21:45,370 GABRIEL SANCHEZ-MARTINEZ: Let's see. 1677 01:21:48,250 --> 01:21:51,840 Turn it off. 1678 01:21:51,840 --> 01:21:54,002 All right, so here's Boston. 1679 01:22:06,900 --> 01:22:09,370 GABRIEL SANCHEZ-MARTINEZ: It's a much smaller city. 1680 01:22:09,370 --> 01:22:11,830 And this is an earlier video where we still 1681 01:22:11,830 --> 01:22:13,660 had some issues with ODX. 1682 01:22:13,660 --> 01:22:16,750 And so some of them, you could perhaps 1683 01:22:16,750 --> 01:22:19,202 detect by looking at the animation. 1684 01:22:19,202 --> 01:22:21,160 So that's another application of this animation 1685 01:22:21,160 --> 01:22:25,830 is to find issues with the algorithm. 1686 01:22:25,830 --> 01:22:28,170 In this case, people are being routed 1687 01:22:28,170 --> 01:22:29,980 through the actual paths.
1688 01:22:29,980 --> 01:22:32,610 So that's another difference between the video of London, 1689 01:22:32,610 --> 01:22:35,384 where people were bursting in linearly. 1690 01:22:35,384 --> 01:22:36,360 AUDIENCE: Right. 1691 01:22:36,360 --> 01:22:38,568 GABRIEL SANCHEZ-MARTINEZ: Here, it looks less bursty, 1692 01:22:38,568 --> 01:22:42,000 and part of that is that the nodes are moving through paths. 1693 01:22:42,000 --> 01:22:45,480 AUDIENCE: You see these thick dots along the red line. 1694 01:22:45,480 --> 01:22:47,095 GABRIEL SANCHEZ-MARTINEZ: Yeah. 1695 01:22:47,095 --> 01:22:47,580 AUDIENCE: [INAUDIBLE] 1696 01:22:47,580 --> 01:22:48,871 GABRIEL SANCHEZ-MARTINEZ: Yeah. 1697 01:22:50,580 --> 01:22:56,630 Some of the issues are very slowly-moving dots. 1698 01:22:56,630 --> 01:23:00,520 So you see some green dots that barely move. 1699 01:23:00,520 --> 01:23:04,490 So those are errors that have been fixed already, 1700 01:23:04,490 --> 01:23:07,250 and we should run this program again 1701 01:23:07,250 --> 01:23:11,180 to generate a new animation without those errors. 1702 01:23:11,180 --> 01:23:13,790 But yeah, you see the same pattern. 1703 01:23:13,790 --> 01:23:15,920 This is now late at night. 1704 01:23:15,920 --> 01:23:20,300 So you can see where people are still at work, or perhaps 1705 01:23:20,300 --> 01:23:21,520 at restaurants, or bars. 1706 01:23:21,520 --> 01:23:28,380 And Boston system just shuts down. 1707 01:23:28,380 --> 01:23:33,170 So it's not as alive at night because we're only looking 1708 01:23:33,170 --> 01:23:37,430 at people who would take the T. 1709 01:23:37,430 --> 01:23:40,080 All right? 1710 01:23:40,080 --> 01:23:40,710 Thank you. 1711 01:23:40,710 --> 01:23:42,480 And if you have questions, I'll take them. 1712 01:23:42,480 --> 01:23:47,090 Otherwise, I'll see you next class.