1 00:00:04,500 --> 00:00:08,420 First, let's make sure we have ggplot2 loaded. 2 00:00:08,420 --> 00:00:10,110 So library(ggplot2). 3 00:00:16,420 --> 00:00:20,460 Now let's load our data frame, which is in households.csv. 4 00:00:20,460 --> 00:00:22,880 So read.csv(households.csv). 5 00:00:31,190 --> 00:00:36,180 If we look at the structure of households, 6 00:00:36,180 --> 00:00:39,020 we see that there is a year column and then 7 00:00:39,020 --> 00:00:44,220 six other columns for each of the different household types. 8 00:00:44,220 --> 00:00:46,740 So this is actually a problem for us. 9 00:00:46,740 --> 00:00:49,190 Given this structure of a data frame, 10 00:00:49,190 --> 00:00:50,840 what would we put in the aesthetic 11 00:00:50,840 --> 00:00:54,370 for our ggplot command? 12 00:00:54,370 --> 00:00:56,710 It's not obvious, to me at least, 13 00:00:56,710 --> 00:00:59,970 and in fact, I don't think it's really possible. 14 00:00:59,970 --> 00:01:02,040 The reason is that ggplot needs it 15 00:01:02,040 --> 00:01:08,750 in the form of: year, group, and fraction. 16 00:01:11,330 --> 00:01:14,150 The solution is to use the melt function 17 00:01:14,150 --> 00:01:16,720 from the reshape package. 18 00:01:16,720 --> 00:01:19,920 Melt will take a 2-dimensional data frame like ours, 19 00:01:19,920 --> 00:01:24,550 and convert it into exactly the right form we need for ggplot2. 20 00:01:24,550 --> 00:01:28,490 So first, let's load reshape2 -- library(reshape2). 21 00:01:31,820 --> 00:01:35,229 Now, let's look at the first two columns of our households data 22 00:01:35,229 --> 00:01:37,880 frame -- households[,1:2]. 23 00:01:45,750 --> 00:01:49,610 So there's a Year and a MarriedWChild for each year. 24 00:01:49,610 --> 00:01:53,390 Now, let's look at the first few rows of our melted households 25 00:01:53,390 --> 00:01:54,320 data frame. 26 00:01:54,320 --> 00:01:55,860 So head(melt(households, id="Year")). 27 00:02:09,990 --> 00:02:10,930 And there you have it. 28 00:02:10,930 --> 00:02:13,000 So, basically, what's happened is 29 00:02:13,000 --> 00:02:16,730 that each value of MarriedWChild has 30 00:02:16,730 --> 00:02:19,870 turned into its own row in the new data frame. 31 00:02:22,810 --> 00:02:24,610 To make it more clear, perhaps, let's look 32 00:02:24,610 --> 00:02:26,610 at the first three columns of households. 33 00:02:32,890 --> 00:02:35,250 Now we have MarriedWOChild. 34 00:02:35,250 --> 00:02:39,500 Now let's look at, instead of just the first rows 35 00:02:39,500 --> 00:02:43,290 of our melted data frame, let's look at the first 10 rows. 36 00:02:43,290 --> 00:02:46,070 So rows 1 to 10, all columns. 37 00:02:53,420 --> 00:02:55,750 There we go. 38 00:02:55,750 --> 00:03:01,110 So there you can see the eight values of MarriedWChild, 39 00:03:01,110 --> 00:03:03,810 and the first two values of MarriedWOChild. 40 00:03:03,810 --> 00:03:09,020 So there's that 30.3 up there, gone down to 30.3 here, 41 00:03:09,020 --> 00:03:11,540 29.9 gone to down here. 42 00:03:11,540 --> 00:03:13,280 So every value in our data frame now 43 00:03:13,280 --> 00:03:17,470 corresponds to a new row in our melted data frame, 44 00:03:17,470 --> 00:03:20,070 which is exactly what we need for ggplot. 45 00:03:20,070 --> 00:03:26,060 So let's try plotting this melted data frame -- ggplot, 46 00:03:26,060 --> 00:03:33,430 melt, households, using the Year column as an id column, 47 00:03:33,430 --> 00:03:35,670 the key column. 48 00:03:35,670 --> 00:03:41,910 Our aesthetic is going to be now to use Year on the x-axis. 49 00:03:41,910 --> 00:03:45,260 Our y-axis will be the value column of our melted data 50 00:03:45,260 --> 00:03:47,970 frame. 51 00:03:47,970 --> 00:03:50,579 And the color of the line will depend 52 00:03:50,579 --> 00:03:53,850 on the group, which is called variable in the melted data 53 00:03:53,850 --> 00:03:54,350 frame. 54 00:04:00,080 --> 00:04:01,430 So that's our aesthetic. 55 00:04:01,430 --> 00:04:05,220 Our geometry is going to be lines. 56 00:04:05,220 --> 00:04:07,610 And I want to make the lines a little bit thicker. 57 00:04:07,610 --> 00:04:12,380 So let's say line size is 2. 58 00:04:12,380 --> 00:04:16,060 And I also want to have points for each year 59 00:04:16,060 --> 00:04:18,470 in the data frame, so I'm going to have lines and points. 60 00:04:18,470 --> 00:04:20,820 So geom_point. 61 00:04:20,820 --> 00:04:23,270 And I'm going to make the size of these a little bit bigger 62 00:04:23,270 --> 00:04:26,010 than normal too, size = 5. 63 00:04:26,010 --> 00:04:28,140 And we should put a y-axis label. 64 00:04:28,140 --> 00:04:29,640 So ylab("Percentage of Households"). 65 00:04:37,710 --> 00:04:39,970 And there you go. 66 00:04:39,970 --> 00:04:42,350 Now, this is actually quite interesting 67 00:04:42,350 --> 00:04:45,670 when we compare it back to the chart we had in the slides. 68 00:04:45,670 --> 00:04:48,159 Now you can see just how quickly MarriedWChild 69 00:04:48,159 --> 00:04:49,760 is decreasing as a relative share. 70 00:04:49,760 --> 00:04:53,960 You can also more clearly see that MarriedWOChild 71 00:04:53,960 --> 00:04:57,490 is pretty much flat, and that the differences being made up 72 00:04:57,490 --> 00:04:59,790 by the other four types of households 73 00:04:59,790 --> 00:05:01,430 is steadily increasing over the years. 74 00:05:03,990 --> 00:05:05,820 So there you have it, the same data, 75 00:05:05,820 --> 00:05:07,780 plotted in two different ways. 76 00:05:07,780 --> 00:05:11,410 Now, I'm not saying one of these is better than the other one. 77 00:05:11,410 --> 00:05:13,940 For example, if I want to compare inside a given year, 78 00:05:13,940 --> 00:05:18,330 say 1970, it's not the most easy thing, at a glance, 79 00:05:18,330 --> 00:05:22,370 to see just how much of a total hundred percent 80 00:05:22,370 --> 00:05:24,280 is taken up by each. 81 00:05:24,280 --> 00:05:26,980 But if I want to see across years, it's far superior. 82 00:05:26,980 --> 00:05:29,980 And I can clearly see that the last data point is pretty much 83 00:05:29,980 --> 00:05:31,730 right next to the second to last data 84 00:05:31,730 --> 00:05:33,150 point, which is something that was 85 00:05:33,150 --> 00:05:36,690 hard to tell with the other visualization. 86 00:05:36,690 --> 00:05:38,970 So I hope this has made you think a little bit more 87 00:05:38,970 --> 00:05:41,560 about the different ways you can plot the same data. 88 00:05:41,560 --> 00:05:43,690 And hopefully improved your ggplot2 skills 89 00:05:43,690 --> 00:05:45,270 a little bit more. 90 00:05:45,270 --> 00:05:47,140 Thanks for watching.