K-means clustering Problem 1. (10 pts) Suppose that we have the gene expression values for 5...

Question

K-means clustering

Problem 1. (10 pts) Suppose that we have the gene expression values for 5 genes (G1 to G5) under 4 time points (t1 to t4) as shown in the following table. Please use K-Means clustering to group 5 genes into 2 clusters based on Euclidean distance. Find out the final centroids and their affiliated genes. The initial centroids are c1=(1,2,3,4) and c2=(9,8,7,6). Please write down your algorithm step by step. Result without steps won't get points.

	t1	t2	t3	t4
G1	2	2	1	5
G2	2	2	2	2
G3	2	2	0	2
G4	10	8	10	6
G5	11	10	10	8

Problem 2. (10 pts) Use R function kmeans to write code in Jupyter and give the results. The document is at https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kmeans

You should specify K=2 and don't give initial centroids. Your R code should return the final centroids (cluster means) and clustering vector showing the cluster where each gene falls. Please post your R code and output below.

Problem 3. (10 pts) Use the same kmeans function, give the initial centroids as c1 and c2. Your R code should return the final centroids (cluster means) and clustering vector showing the cluster where each gene falls. Please post your R code and output below.

Answer 1

PROBLEM 1

The initial centroids are c1=(1,2,3,4) and c2=(9,8,7,6). The distance from each gene to each centroid is calculated with the Euclidean distance, which for two points x and y over the four time points is SQRT((x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2 + (x4-y4)^2).

Iteration 1

					(1,2,3,4)	(9,8,7,6)
Data	t1	t2	t3	t4	Distance Mean 1	Distance Mean 2	Cluster
G1	2	2	1	5	2.45	11.05	1
G2	2	2	2	2	2.45	11.22	1
G3	2	2	0	2	3.74	12.25	1
G4	10	8	10	6	13.04	3.16	2
G5	11	10	10	8	15.13	4.58	2

The distance between (1,2,3,4) and G1 is calculated as SQRT((t1-1)^2+(t2-2)^2+(t3-3)^2+(t4-4)^2) = SQRT((2-1)^2+(2-2)^2+(1-3)^2+(5-4)^2) = SQRT(6) = 2.45. Similarly, the distance between (9,8,7,6) and G1 is calculated as SQRT((t1-9)^2+(t2-8)^2+(t3-7)^2+(t4-6)^2) = SQRT((2-9)^2+(2-8)^2+(1-7)^2+(5-6)^2) = SQRT(122) = 11.05.

Since the shortest distance, 2.45, is to the first centroid, G1 belongs to the first cluster. The distances from G2, G3, G4 and G5 to each centroid are calculated in the same way, and each gene is assigned to the cluster of its nearest centroid, as given in the table above. So the resultant clusters are

Cluster C1={G1,G2,G3}

Cluster C2={ G4,G5}
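
The distances and assignments of Iteration 1 can be checked with a few lines of R. This is only a minimal sketch: the names genes, centroids, dist_to and assignment are chosen here for illustration and are not part of the original answer.

```r
# Expression matrix from the problem statement (rows = genes, columns = time points)
genes <- matrix(c( 2,  2,  1, 5,
                   2,  2,  2, 2,
                   2,  2,  0, 2,
                  10,  8, 10, 6,
                  11, 10, 10, 8),
                ncol = 4, byrow = TRUE,
                dimnames = list(paste0("G", 1:5), paste0("t", 1:4)))

# Initial centroids given in the problem
centroids <- rbind(c1 = c(1, 2, 3, 4), c2 = c(9, 8, 7, 6))

# Hypothetical helper: Euclidean distance from every row of x to every centroid.
# Returns a matrix with one row per gene and one column per centroid.
dist_to <- function(x, centers) {
  apply(centers, 1, function(ctr) sqrt(rowSums(sweep(x, 2, ctr)^2)))
}

d <- dist_to(genes, centroids)
print(round(d, 2))

# Each gene goes to the centroid with the smallest distance
assignment <- apply(d, 1, which.min)
print(assignment)
```

The printed distance matrix should agree with the Distance Mean 1 and Distance Mean 2 columns of the table, and the assignment vector with the Cluster column.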

The centroids are recalculated as the mean of the genes in each cluster.

Cluster 1 = (2,2,1,5), (2,2,2,2) and (2,2,0,2)

Mean of Cluster 1 = ((2+2+2)/3, (2+2+2)/3, (1+2+0)/3, (5+2+2)/3) = (2,2,1,3)

Cluster 2 = (10,8,10,6) and (11,10,10,8)

Mean of Cluster 2 = ((10+11)/2, (8+10)/2, (10+10)/2, (6+8)/2) = (10.5,9,10,7)
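
The same centroid update can be reproduced in R with colMeans. The sketch below assumes the cluster assignment from Iteration 1 (G1 to G3 in cluster 1, G4 and G5 in cluster 2); the variable names are illustrative only.

```r
# Expression matrix from the problem statement (rows = genes, columns = time points)
genes <- matrix(c( 2,  2,  1, 5,
                   2,  2,  2, 2,
                   2,  2,  0, 2,
                  10,  8, 10, 6,
                  11, 10, 10, 8),
                ncol = 4, byrow = TRUE,
                dimnames = list(paste0("G", 1:5), paste0("t", 1:4)))

# Cluster membership after Iteration 1
assignment <- c(1, 1, 1, 2, 2)

# New centroid of each cluster = column-wise mean of its member genes
new_centroids <- rbind(
  c1 = colMeans(genes[assignment == 1, , drop = FALSE]),
  c2 = colMeans(genes[assignment == 2, , drop = FALSE])
)
print(new_centroids)
```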

Iteration 2

The distance from each gene to the new centroids is recomputed and given below.

					(2,2,1,3)	(10.5,9,10,7)
Data	t1	t2	t3	t4	Distance Mean 1	Distance Mean 2	Cluster
G1	2	2	1	5	2.00	14.36	1
G2	2	2	2	2	1.41	14.50	1
G3	2	2	0	2	1.41	15.69	1
G4	10	8	10	6	13.78	1.50	2
G5	11	10	10	8	15.84	1.50	2

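Since the cluster assignments did not change between two consecutive iterations, the centroids will not change either, and the k-means algorithm terminates.

So the final results are

Cluster C1={G1,G2,G3} with final centroid (2,2,1,3)

Cluster C2={G4,G5} with final centroid (10.5,9,10,7)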
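
The whole hand calculation (assign, update, repeat until the assignment stops changing) can be sketched as a short loop in R. The function kmeans_by_hand is a hypothetical helper written for this illustration, not a library function, and it does not handle the case of a cluster becoming empty.

```r
# Expression matrix from the problem statement (rows = genes, columns = time points)
genes <- matrix(c( 2,  2,  1, 5,
                   2,  2,  2, 2,
                   2,  2,  0, 2,
                  10,  8, 10, 6,
                  11, 10, 10, 8),
                ncol = 4, byrow = TRUE,
                dimnames = list(paste0("G", 1:5), paste0("t", 1:4)))

# Hypothetical helper: plain assign/update k-means with Euclidean distance.
# Assumes at least two centroids and that no cluster ever becomes empty.
kmeans_by_hand <- function(x, centers, max_iter = 10) {
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Assignment step: distance of every gene to every centroid
    d <- apply(centers, 1, function(ctr) sqrt(rowSums(sweep(x, 2, ctr)^2)))
    new_assignment <- unname(apply(d, 1, which.min))
    # Stop when no gene changed cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Update step: each centroid becomes the mean of its member genes
    for (k in seq_len(nrow(centers))) {
      centers[k, ] <- colMeans(x[assignment == k, , drop = FALSE])
    }
  }
  list(centers = centers, cluster = assignment, iterations = iter)
}

result <- kmeans_by_hand(genes, rbind(c(1, 2, 3, 4), c(9, 8, 7, 6)))
print(result$centers)
print(result$cluster)
```

On this data the loop stops in the second iteration, matching the two tables above.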

Problem 2

The R script for running k-means with K=2 and without specifying initial centroids is pasted below.

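# Gene expression values: rows are genes G1..G5, columns are time points t1..t4
df <- data.frame(t1 = c(2, 2, 2, 10, 11), t2 = c(2, 2, 2, 8, 10),
                 t3 = c(1, 2, 0, 10, 10), t4 = c(5, 2, 2, 6, 8),
                 row.names = paste0("G", 1:5))
df
# The initial centres are picked at random, so fix the seed for a reproducible run
set.seed(1)
genecluster <- kmeans(df, centers = 2)
genecluster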

Output

> df=data.frame(t1=c(2,2,2,10,11), t2=c(2,2,2,8,10), t3=c(1,2,0,10,10), t4=c(5,2,2,6,8))
> df
  t1 t2 t3 t4
1  2  2  1  5
2

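From the output, we can see that the cluster means computed by kmeans are the same as the centroids calculated by hand in Problem 1. Because the initial centres are chosen at random, the cluster numbering (which group is labelled 1 and which is labelled 2) may differ between runs.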

Problem 3

The initial cluster centres can be passed to kmeans through the centers argument: instead of the number of clusters, supply a matrix with one row per initial centroid. The syntax of kmeans in R is as follows.

kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace=FALSE)

where

x = numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).

centers = either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.

iter.max = the maximum number of iterations allowed.

nstart = if centers is a number, how many random sets should be chosen?

algorithm = character: may be abbreviated. Note that "Lloyd" and "Forgy" are alternative names for one algorithm.
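
Since centers may be a set of initial cluster centres, c1 and c2 can be supplied as the rows of a matrix. The following is a minimal sketch for Problem 3; the names df, init and fit are illustrative, and algorithm = "Lloyd" is chosen here only because it mirrors the hand calculation in Problem 1 (the default is "Hartigan-Wong").

```r
# Gene expression values: rows are genes G1..G5, columns are time points t1..t4
df <- data.frame(t1 = c(2, 2, 2, 10, 11), t2 = c(2, 2, 2, 8, 10),
                 t3 = c(1, 2, 0, 10, 10), t4 = c(5, 2, 2, 6, 8),
                 row.names = paste0("G", 1:5))

# Initial centroids from the problem, one per row
init <- rbind(c1 = c(1, 2, 3, 4), c2 = c(9, 8, 7, 6))

# Passing a matrix as centers sets both K (number of rows) and the starting points
fit <- kmeans(df, centers = init, algorithm = "Lloyd")

fit$centers   # final centroids (cluster means)
fit$cluster   # clustering vector: cluster index of each gene
```

No seed is needed here because nothing is chosen at random once the initial centres are given.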
