In R, load the tidyverse package.
Consider the `USArrests` dataset, which contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percentage of the population living in urban areas.
(a) Perform k-means clustering using all numerical variables in this dataset, scaling the variables before running the clustering algorithm.
(b) Try two different values of $k$ and comment on your results.
(c) Visualize the results of the clustering using the variables `Murder` and `UrbanPop`.
data("USArrests")
mydata <- USArrests
mydata <- na.omit(mydata)
mydata <- scale(mydata)
head(mydata, n = 10)
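Parts (a) and (b) ask for k-means on the scaled data with two values of $k$. A minimal sketch follows, using `kmeans()` from base R's `stats` package. The seed, the choices $k = 2$ and $k = 4$, `nstart = 25`, and the object names `scaled`, `km2`, and `km4` are illustrative assumptions, not part of the original answer.

```r
# Sketch for parts (a) and (b): k-means on the scaled USArrests data.
# Seed, k values, nstart and object names are illustrative assumptions.
data("USArrests")
scaled <- scale(na.omit(USArrests))  # centre each column and divide by its SD

set.seed(123)                        # k-means starts from random centres
km2 <- kmeans(scaled, centers = 2, nstart = 25)
km4 <- kmeans(scaled, centers = 4, nstart = 25)

# Cluster sizes for each choice of k
km2$size
km4$size

# Total within-cluster sum of squares: lower means tighter clusters,
# but it tends to fall as k grows, so it cannot pick k on its own
c(k2 = km2$tot.withinss, k4 = km4$tot.withinss)

# Cluster means on the original scale help interpret the groups
aggregate(USArrests, by = list(cluster = km4$cluster), FUN = mean)
```

Comparing the cluster means for the two fits is one way to comment on the results: a small $k$ usually separates states with high arrest rates from those with low ones, while a larger $k$ may split those groups further, for example by urbanisation.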
set.seed(124)
ss <- sample(1:50, 10)
df <- USArrests[ss, ]
df <- na.omit(df)
head(df, n = 6)
df.scaled <- scale(df)
head(round(df.scaled, 2))
desc_stats <- data.frame(
  Min = apply(USArrests, 2, min),
  Max = apply(USArrests, 2, max),
  Med = apply(USArrests, 2, median),
  SD = apply(USArrests, 2, sd),
  Mean = apply(USArrests, 2, mean))
desc_stats <- round(desc_stats, 1)
head(desc_stats)
library(stats)
eucl <- dist(df.scaled, method = "euclidean")
round(as.matrix(eucl)[1:6, 1:6], 1)
cor <- cor(t(df.scaled), method = "pearson")
dist_cor <- as.dist(1 - cor)
round(as.matrix(dist_cor)[1:6, 1:6], 1)
# daisy() computes dissimilarity matrices between observations
library(cluster)
library(factoextra)
daisy(df.scaled, metric = "euclidean", stand = FALSE)
# daisy() also handles mixed-type data such as the flower dataset
data("flower")
head(flower)
str(flower)
daisy_dist <- as.matrix(daisy(flower))
head(round(daisy_dist[1:6, 1:6], 2))
# Visualize the Euclidean distance matrix
library(corrplot)
corrplot(as.matrix(eucl), is.corr = FALSE, method = "color")
corrplot(as.matrix(eucl), is.corr = FALSE, method = "color", order = "hclust", type = "upper")
# Hierarchical clustering dendrogram using Ward's method
plot(hclust(eucl, method = "ward.D2"))
heatmap(as.matrix(eucl), symm = TRUE, distfun = function(x) as.dist(x))
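For part (c), the cluster assignments can be drawn on the `Murder` and `UrbanPop` axes. The sketch below uses base R graphics to avoid extra dependencies; `ggplot2` from the tidyverse would work equally well. The seed, $k = 4$, and the object names `scaled` and `km` are illustrative assumptions.

```r
# Sketch for part (c): plot the clusters on the Murder and UrbanPop axes.
# Seed, k = 4 and object names are illustrative assumptions.
data("USArrests")
scaled <- scale(na.omit(USArrests))  # cluster on all four scaled variables

set.seed(123)
km <- kmeans(scaled, centers = 4, nstart = 25)

# Plot on the original (unscaled) units so the axes stay interpretable
plot(USArrests$UrbanPop, USArrests$Murder,
     col = km$cluster, pch = 19,
     xlab = "UrbanPop (% urban population)",
     ylab = "Murder (arrests per 100,000)",
     main = "k-means clusters (k = 4)")

# Label each point with its state name
text(USArrests$UrbanPop, USArrests$Murder,
     labels = rownames(USArrests), pos = 3, cex = 0.6)

legend("topleft",
       legend = paste("Cluster", sort(unique(km$cluster))),
       col = sort(unique(km$cluster)), pch = 19, bty = "n")
```

Because the clustering uses all four variables, clusters may overlap in this two-variable view; the overlap reflects separation along `Assault` and `Rape`, which the plot does not show.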