6.5 Assignment 6

Let's start by building a hierarchical clustering model. First, read the data set into R. Then, compute the distances (using method="euclidean"), and use hclust to build the model (using method="ward.D2"). You should cluster on all of the variables.

Running the dist function will probably take you a while. Why? Select all that apply.

We have a lot of observations, so it takes a long time to compute the distance between each pair of observations.

We have a lot of variables, so the distance computation is long.

Our variables have a wide range of values, so the distances are more complicated.

The Euclidean distance is known to take a long time to compute, regardless of the size of the data.

Explanation

You can read in the data set, compute the distances, and build the hierarchical clustering model by using the following commands:

dailykos = read.csv("dailykos.csv")

kosDist = dist(dailykos, method="euclidean")

kosHierClust = hclust(kosDist, method="ward.D2")

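The distance computation can take a long time if you have a lot of observations and/or a lot of variables. The dist function computes a distance for every pair of observations, so the number of distances grows quadratically with the number of observations, and each distance sums over all of the variables.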
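As a rough illustration of why the number of observations matters so much, the sketch below counts the distances that dist stores for a small synthetic matrix. The name toyData and the sizes n and p are made up for this example and are not part of the Daily Kos data.

```r
# Synthetic stand-in for the real data: 200 observations, 50 variables.
set.seed(1)
n = 200
p = 50
toyData = matrix(rnorm(n * p), nrow = n, ncol = p)

# dist() stores one distance per unordered pair of rows.
toyDist = dist(toyData, method = "euclidean")
nPairs = length(toyDist)

# The pair count is n * (n - 1) / 2, so it grows quadratically in n;
# each individual distance sums over all p variables.
print(nPairs)
print(n * (n - 1) / 2)
```

Doubling n roughly quadruples the number of distances, whereas doubling p only doubles the work per distance.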

In this problem, we are trying to cluster news articles or blog posts into groups. This can be used to show readers categories to choose from when trying to decide what to read. Just thinking about this application, what are good choices for the number of clusters? Select all that apply.

2

3

7

8

Explanation

Thinking about the application, it is probably better to show the reader more categories than 2 or 3. These categories would probably be too broad to be useful. Seven or eight categories seem more reasonable.

Let's pick 7 clusters. This number is reasonable according to the dendrogram, and also seems reasonable for the application. Use the cutree function to split your data into 7 clusters, calling the grouping variable "hierGroups".

Create 7 new datasets, each containing the observations from one of the clusters, using the following commands:

library(dplyr)

HierCluster1 = dailykos %>% filter(hierGroups == 1)

HierCluster2 = dailykos %>% filter(hierGroups == 2)

HierCluster3 = dailykos %>% filter(hierGroups == 3)

HierCluster4 = dailykos %>% filter(hierGroups == 4)

HierCluster5 = dailykos %>% filter(hierGroups == 5)

HierCluster6 = dailykos %>% filter(hierGroups == 6)

HierCluster7 = dailykos %>% filter(hierGroups == 7)

How many observations are in cluster 3?

Which cluster has the most observations?

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

Which cluster has the fewest observations?

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

Explanation

You can split your data into clusters by first using the cutree function to compute the cluster numbers:

hierGroups = cutree(kosHierClust, k = 7)

Then, you can compute the subset associated with each cluster with the provided calls to the filter() function.

If you use the nrow function on each of these new datasets, you can see that cluster 3 has 324 observations, cluster 1 has the most observations, and cluster 7 has the fewest observations.

Alternatively, you could answer these questions by looking at the output of table(hierGroups).
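The table approach can be tried on a small synthetic data set first. The sketch below is only an illustration: toyData, toyClust, and toyGroups are hypothetical names, and the three artificial groups have nothing to do with the Daily Kos clusters.

```r
# Synthetic data with three well-separated groups (not the Daily Kos data).
set.seed(1)
toyData = rbind(
  matrix(rnorm(20 * 2, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(30 * 2, mean = 5, sd = 0.1), ncol = 2),
  matrix(rnorm(10 * 2, mean = 10, sd = 0.1), ncol = 2)
)

# Same pipeline as above: distances, Ward clustering, then cut the tree.
toyClust = hclust(dist(toyData, method = "euclidean"), method = "ward.D2")
toyGroups = cutree(toyClust, k = 3)

# One count per cluster; which.max / which.min pick out the largest and smallest.
toySizes = table(toyGroups)
print(toySizes)
largest = names(toySizes)[which.max(toySizes)]
smallest = names(toySizes)[which.min(toySizes)]
print(largest)
print(smallest)
```

A single table call answers all three size questions at once, without creating seven separate data frames.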

More Advanced Approach:

There is a very useful function in R called the "split" function. Given a vector assigning groups like hierGroups, you could split dailykos into the clusters by typing:

HierCluster = split(dailykos, hierGroups)

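Then cluster 1 can be accessed by typing HierCluster[[1]], cluster 2 can be accessed by typing HierCluster[[2]], etc. If you have defined your own function called "split" in your current R session, it will mask the built-in one, so remove it with rm(split) before using the split function. (A non-function variable called "split" does not interfere, because R skips non-function objects when it looks up a function call.)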
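A minimal sketch of the split approach on toy data follows. The data frame toyData and the vector toyGroups are invented for illustration; with the real data you would pass dailykos and hierGroups instead.

```r
# Hypothetical toy data frame and group vector (not the Daily Kos data).
toyData = data.frame(x = 1:6, y = c(2, 4, 6, 8, 10, 12))
toyGroups = c(1, 2, 1, 3, 2, 1)

# split() returns a named list with one data frame per group.
toyClusters = split(toyData, toyGroups)

# Row counts for every cluster in a single call.
clusterSizes = sapply(toyClusters, nrow)
print(clusterSizes)

# Individual clusters are accessed with double brackets.
print(toyClusters[[1]])
```

Keeping the clusters in one list makes it easy to apply the same function to all of them with sapply or lapply.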

Instead of looking at the average value in each variable individually, we'll just look at the top 6 words in each cluster. To do this for cluster 1, type the following in your R console (where "HierCluster1" should be replaced with the name of your first cluster subset):

tail(sort(colMeans(HierCluster1)))

This computes the mean frequency values of each of the words in cluster 1, and then outputs the 6 words that occur the most frequently. The colMeans function computes the column (word) means, the sort function orders the words in increasing order of the mean values, and the tail function outputs the last 6 words listed, which are the ones with the largest column means.
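To see each step of that command separately, here is a small sketch on an invented word-count table. The data frame toyCounts and its values are hypothetical; the n argument of tail is set to 2 only because the toy table has four words (the default is 6).

```r
# Hypothetical word-count table: rows are documents, columns are words.
toyCounts = data.frame(
  apple = c(0, 1, 0),
  bush  = c(4, 6, 5),
  iraq  = c(2, 3, 1),
  kerry = c(1, 0, 2)
)

wordMeans = colMeans(toyCounts)      # mean frequency of each word
sortedMeans = sort(wordMeans)        # increasing order of the means
topWords = tail(sortedMeans, n = 2)  # last n entries, i.e. the largest means

print(topWords)
```

Because sort orders in increasing order, the most frequent word is the last entry of the output, not the first.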

What is the most frequent word in this cluster, in terms of average value? Enter the word exactly how you see it in the output:

Explanation

After running the R command given above, we can see that the most frequent word on average is "bush". This corresponds to President George W. Bush.

Now repeat the command given in the previous problem for each of the other clusters, and answer the following questions.

Which cluster could best be described as the cluster related to the Iraq war?

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

In 2004, Howard Dean was one of the candidates for the Democratic nomination for President of the United States, John Kerry was the candidate who won the Democratic nomination, and John Edwards was Kerry's running mate (the vice-presidential nominee). Given this information, which cluster best corresponds to the Democratic Party?

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

Explanation

You can repeat the command on each of the clusters by typing the following:

tail(sort(colMeans(HierCluster2)))

tail(sort(colMeans(HierCluster3)))

tail(sort(colMeans(HierCluster4)))

tail(sort(colMeans(HierCluster5)))

tail(sort(colMeans(HierCluster6)))

tail(sort(colMeans(HierCluster7)))

The most common words in Cluster 6 are iraq, war, bush, and iraqi, so it is the cluster that can best be described as corresponding to the Iraq war. The most common words in Cluster 7 are dean, kerry, democrat, and edward, so it looks like the Democratic Party cluster.
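Rather than typing the command once per cluster, the same pipeline can be applied to every cluster in one call. The sketch below assumes a list of clusters built with split and uses invented toy data (toyCounts and toyGroups are hypothetical names).

```r
# Hypothetical toy data: word counts plus a cluster assignment per document.
toyCounts = data.frame(
  bush  = c(5, 4, 0, 1),
  iraq  = c(3, 4, 0, 0),
  dean  = c(0, 0, 6, 5),
  kerry = c(0, 1, 4, 3)
)
toyGroups = c(1, 1, 2, 2)

# Apply tail(sort(colMeans(...))) to each cluster in turn.
topWordsByCluster = lapply(
  split(toyCounts, toyGroups),
  function(cluster) tail(sort(colMeans(cluster)), n = 2)
)
print(topWordsByCluster)
```

With the real data, the equivalent call would pass dailykos and hierGroups to split and drop the n argument to get the default six words.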

Now, run k-means clustering, setting the seed to 1000 right before you run the kmeans function. Again, pick the number of clusters equal to 7. You don't need to add the iter.max argument.

Subset your data into the 7 clusters (7 new datasets) by using the "cluster" variable of your kmeans output.

How many observations are in Cluster 3?

Which cluster has the most observations?

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

Which cluster has the fewest observations?

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

Explanation

You can run k-means clustering by using the following commands:

set.seed(1000)

KmeansCluster = kmeans(dailykos, centers=7)

Then, you can subset your data into the 7 clusters by using the following commands:

library(dplyr)

KmeansCluster1 = dailykos %>% filter(KmeansCluster$cluster == 1)

KmeansCluster2 = dailykos %>% filter(KmeansCluster$cluster == 2)

KmeansCluster3 = dailykos %>% filter(KmeansCluster$cluster == 3)

KmeansCluster4 = dailykos %>% filter(KmeansCluster$cluster == 4)

KmeansCluster5 = dailykos %>% filter(KmeansCluster$cluster == 5)

KmeansCluster6 = dailykos %>% filter(KmeansCluster$cluster == 6)

KmeansCluster7 = dailykos %>% filter(KmeansCluster$cluster == 7)

Alternatively, you could answer these questions by looking at the output of table(KmeansCluster$cluster).
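The reason for setting the seed right before kmeans is that the starting centers are chosen at random. The sketch below illustrates this on synthetic data; toyData, fitA, and fitB are hypothetical names, and the two artificial groups are unrelated to the Daily Kos clusters.

```r
# Synthetic data with two well-separated groups (not the Daily Kos data).
set.seed(1)
toyData = rbind(
  matrix(rnorm(25 * 2, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(15 * 2, mean = 10, sd = 0.1), ncol = 2)
)

# Setting the same seed immediately before each call makes the random
# starting centers, and therefore the cluster labels, reproducible.
set.seed(1000)
fitA = kmeans(toyData, centers = 2)
set.seed(1000)
fitB = kmeans(toyData, centers = 2)
sameLabels = identical(fitA$cluster, fitB$cluster)
print(sameLabels)

# Cluster sizes, analogous to table(KmeansCluster$cluster).
toySizes = table(fitA$cluster)
print(toySizes)
```

If any other command that draws random numbers runs between set.seed and kmeans, the cluster numbering can differ from the one the questions assume.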

More Advanced Approach:

There is a very useful function in R called the "split" function. Given a vector assigning groups like KmeansCluster$cluster, you could split dailykos into the clusters by typing:

AllKmeansCluster = split(dailykos, KmeansCluster$cluster)

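Then cluster 1 can be accessed by typing AllKmeansCluster[[1]], cluster 2 can be accessed by typing AllKmeansCluster[[2]], etc. If you have defined your own function called "split" in your current R session, it will mask the built-in one, so remove it with rm(split) before using the split function. (A non-function variable called "split" does not interfere, because R skips non-function objects when it looks up a function call.)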

Now, output the six most frequent words in each cluster, like we did in the previous problem, for each of the k-means clusters.

Which k-means cluster best corresponds to the Iraq War?

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

Which k-means cluster best corresponds to the Democratic Party? (Remember that we are looking for the names of the key Democratic Party leaders.)

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

Explanation

You can output the most frequent words in each of the k-means clusters by using the following commands:

tail(sort(colMeans(KmeansCluster1)))

tail(sort(colMeans(KmeansCluster2)))

tail(sort(colMeans(KmeansCluster3)))

tail(sort(colMeans(KmeansCluster4)))

tail(sort(colMeans(KmeansCluster5)))

tail(sort(colMeans(KmeansCluster6)))

tail(sort(colMeans(KmeansCluster7)))

By looking at the output, you can see that the cluster best corresponding to the Iraq War is cluster 3 (top words are iraq, war, and bush) and the cluster best corresponding to the Democratic Party is cluster 2 (top words are dean, kerry, clark, and edward).

For the rest of this problem, we'll ask you to compare how observations were assigned to clusters in the two different methods. Use the table function to compare the cluster assignment of hierarchical clustering to the cluster assignment of k-means clustering. The following will place the k-means clusters on the rows of the table and the hierarchical clusters on the columns:

table(KmeansCluster$cluster, hierGroups)

Which Hierarchical Cluster best corresponds to K-Means Cluster 2?

Hierarchical Cluster 1

Hierarchical Cluster 2

Hierarchical Cluster 3

Hierarchical Cluster 4

Hierarchical Cluster 5

Hierarchical Cluster 6

Hierarchical Cluster 7

No Hierarchical Cluster contains at least half of the points in K-Means Cluster 2.

Explanation

From "table(KmeansCluster$cluster, hierGroups)", reading across the row for K-Means Cluster 2, we see that 116 (55.6%) of the observations in K-Means Cluster 2 also fall in Hierarchical Cluster 1.

Which Hierarchical Cluster best corresponds to K-Means Cluster 3?

Hierarchical Cluster 1

Hierarchical Cluster 2

Hierarchical Cluster 3

Hierarchical Cluster 4

Hierarchical Cluster 5

Hierarchical Cluster 6

Hierarchical Cluster 7

No Hierarchical Cluster contains at least half of the points in K-Means Cluster 3.

Explanation

From "table(KmeansCluster$cluster, hierGroups)", reading across the row for K-Means Cluster 3, we see that 179 (64.6%) of the observations in K-Means Cluster 3 also fall in Hierarchical Cluster 6.
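The percentages quoted in these explanations are row shares of the cross-tabulation. As a sketch of how they could be computed directly, the example below uses invented assignments for ten observations (toyKmeans and toyHier are hypothetical vectors, not the real cluster labels).

```r
# Hypothetical cluster assignments for 10 observations under the two methods.
toyKmeans = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3)
toyHier   = c(1, 1, 2, 2, 2, 2, 3, 3, 3, 3)

# k-means clusters on the rows, hierarchical clusters on the columns.
crossTab = table(toyKmeans, toyHier)
print(crossTab)

# Row proportions: the share of each k-means cluster that falls in
# each hierarchical cluster.
rowShares = prop.table(crossTab, margin = 1)
print(round(rowShares, 3))

# Column index of the largest share in each row. It equals the hierarchical
# cluster label here because the labels are 1, 2, 3 in column order.
bestMatch = apply(rowShares, 1, which.max)
print(bestMatch)
```

A row whose largest share is below 0.5 corresponds to the answer option that no hierarchical cluster contains at least half of the points.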

Document Clustering with Daily Kos
