Question

K-means clustering

Problem 1. (10 pts) Suppose that we have the gene expression values for 5 genes (G1 to G5) at 4 time points (t1 to t4) as shown in the following table. Please use K-Means clustering to group the 5 genes into 2 clusters based on Euclidean distance. Find the final centroids and their affiliated genes. The initial centroids are c1=(1,2,3,4) and c2=(9,8,7,6). Please write down your algorithm step by step. Results without steps won't get points.

Gene  t1  t2  t3  t4
G1     2   2   1   5
G2     2   2   2   2
G3     2   2   0   2
G4    10   8  10   6
G5    11  10  10   8

Problem 2. (10 pts) Use the R function kmeans to write code in Jupyter and give the results. The documentation is at https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/kmeans

You should specify K=2 and don't give initial centroids. Your R code should return the final centroids (cluster means) and clustering vector showing the cluster where each gene falls. Please post your R code and output below.

Problem 3. (10 pts) Use the same kmeans function, give the initial centroids as c1 and c2. Your R code should return the final centroids (cluster means) and clustering vector showing the cluster where each gene falls. Please post your R code and output below.

Answer #1

ANSWER

PROBLEM 1

The initial centroids are c1=(1,2,3,4) and c2=(9,8,7,6). The distance from each gene to each centroid is calculated using the Euclidean distance, which for two points x and y with four coordinates is SQRT((x1-y1)^2 + (x2-y2)^2 + (x3-y3)^2 + (x4-y4)^2).

Iteration 1

Data  t1  t2  t3  t4  Distance Mean 1 (1,2,3,4)  Distance Mean 2 (9,8,7,6)  Cluster
G1     2   2   1   5   2.45                      11.05                      1
G2     2   2   2   2   2.45                      11.22                      1
G3     2   2   0   2   3.74                      12.25                      1
G4    10   8  10   6  13.04                       3.16                      2
G5    11  10  10   8  15.13                       4.58                      2

The distance between (1,2,3,4) and G1 is calculated as SQRT((t1-1)^2 + (t2-2)^2 + (t3-3)^2 + (t4-4)^2) = SQRT(1+0+4+1) = 2.45. Similarly, the distance between (9,8,7,6) and G1 is calculated as SQRT((t1-9)^2 + (t2-8)^2 + (t3-7)^2 + (t4-6)^2) = SQRT(49+36+36+1) = 11.05.

Since the shortest distance, 2.45, is to the first centroid, G1 belongs to the first cluster. The distances from G2, G3, G4 and G5 to each centroid are calculated in the same way, and each gene is assigned to the cluster of its nearest centroid, as given in the table above. So the resultant clusters are

Cluster C1={G1,G2,G3}

Cluster C2={ G4,G5}
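The assignment step above can be checked with a short R sketch. The variable names (genes, centroids, dist_to_centroids, assignment) are chosen here for illustration and are not part of the original answer.

```r
# Gene expression matrix from the problem statement
# (rows = genes G1..G5, columns = time points t1..t4)
genes <- matrix(c( 2,  2,  1, 5,
                   2,  2,  2, 2,
                   2,  2,  0, 2,
                  10,  8, 10, 6,
                  11, 10, 10, 8),
                ncol = 4, byrow = TRUE,
                dimnames = list(paste0("G", 1:5), paste0("t", 1:4)))

# Initial centroids c1 and c2, one per row
centroids <- rbind(c1 = c(1, 2, 3, 4), c2 = c(9, 8, 7, 6))

# Euclidean distance from every gene to every centroid (5 x 2 matrix)
dist_to_centroids <- sapply(seq_len(nrow(centroids)), function(k) {
  sqrt(rowSums(sweep(genes, 2, centroids[k, ])^2))
})
colnames(dist_to_centroids) <- rownames(centroids)

# Each gene goes to the cluster of its nearest centroid
assignment <- apply(dist_to_centroids, 1, which.min)

print(round(dist_to_centroids, 2))
print(assignment)  # G1, G2, G3 in cluster 1; G4, G5 in cluster 2
```

The rounded distance matrix should agree with the Iteration 1 table.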

The centroids are recalculated as the mean of the genes in each cluster, as follows.

Cluster 1 = (2,2,1,5), (2,2,2,2) and (2,2,0,2)

Mean of Cluster 1 = ((2+2+2)/3, (2+2+2)/3, (1+2+0)/3, (5+2+2)/3) = (2,2,1,3)

Cluster 2 = (10,8,10,6) and (11,10,10,8)

Mean of Cluster 2 = ((10+11)/2, (8+10)/2, (10+10)/2, (6+8)/2) = (10.5,9,10,7)
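The same centroid update can be reproduced in R with colMeans, which averages each time point over the genes in a cluster. This is only a sketch; the names cluster1, cluster2, mean1 and mean2 are illustrative.

```r
# Members of each cluster after the first assignment step
cluster1 <- rbind(G1 = c(2, 2, 1, 5),
                  G2 = c(2, 2, 2, 2),
                  G3 = c(2, 2, 0, 2))
cluster2 <- rbind(G4 = c(10,  8, 10, 6),
                  G5 = c(11, 10, 10, 8))

# New centroid = column-wise mean over the cluster members
mean1 <- colMeans(cluster1)
mean2 <- colMeans(cluster2)

print(mean1)  # 2 2 1 3
print(mean2)  # 10.5 9 10 7
```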

Iteration 2

The distance from each gene to the new means is recomputed and given below.

Data  t1  t2  t3  t4  Distance Mean 1 (2,2,1,3)  Distance Mean 2 (10.5,9,10,7)  Cluster
G1     2   2   1   5   2.00                      14.36                          1
G2     2   2   2   2   1.41                      14.50                          1
G3     2   2   0   2   1.41                      15.69                          1
G4    10   8  10   6  13.78                       1.50                          2
G5    11  10  10   8  15.84                       1.50                          2

Since there is no change in clusters in two consecutive iterations, the k-means algorithm can be terminated.

So the final results are

Cluster C1={G1,G2,G3}, with final centroid (2,2,1,3)

Cluster C2={G4,G5}, with final centroid (10.5,9,10,7)
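Putting the two steps together, the whole hand calculation can be sketched as a small loop in R. The function name lloyd_kmeans and its arguments are hypothetical, written for this example only. It assumes that no cluster ever becomes empty, which holds for this data but is not checked.

```r
# Minimal sketch of the assign/update loop (assumes no cluster becomes empty)
lloyd_kmeans <- function(x, centers, max_iter = 10) {
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every row of x to every centre
    d <- sapply(seq_len(nrow(centers)), function(k) {
      rowSums(sweep(x, 2, centers[k, ])^2)
    })
    # Index of the nearest centre for each row
    new_assignment <- max.col(-d, ties.method = "first")
    # Stop when the clusters do not change between two iterations
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Recompute each centre as the mean of its members
    for (k in seq_len(nrow(centers))) {
      centers[k, ] <- colMeans(x[assignment == k, , drop = FALSE])
    }
  }
  list(centers = centers, cluster = assignment, iterations = iter)
}

genes <- matrix(c( 2,  2,  1, 5,
                   2,  2,  2, 2,
                   2,  2,  0, 2,
                  10,  8, 10, 6,
                  11, 10, 10, 8),
                ncol = 4, byrow = TRUE,
                dimnames = list(paste0("G", 1:5), paste0("t", 1:4)))

result <- lloyd_kmeans(genes, rbind(c(1, 2, 3, 4), c(9, 8, 7, 6)))
print(result$centers)
print(result$cluster)
```

On this data the loop stops in its second pass, when the assignment repeats, matching the termination argument above.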

Problem 2

The R script for running K-means with K=2 and without initial centroids is pasted below.

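# Gene expression values: rows are genes G1 to G5, columns are time points t1 to t4
df <- data.frame(t1 = c(2,2,2,10,11), t2 = c(2,2,2,8,10), t3 = c(1,2,0,10,10), t4 = c(5,2,2,6,8))
df
# K = 2 with no initial centroids: kmeans picks 2 random rows of df as the starting centres
genecluster <- kmeans(df[,1:4], 2)
# Printing the fit shows the cluster means and the clustering vector
genecluster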

Output

> df=data.frame(t1=c(2,2,2,10,11), t2=c(2,2,2,8,10), t3=c(1,2,0,10,10), t4=c(5,2,2,6,8))
> df
  t1 t2 t3 t4
1  2  2  1  5
2

From the output, we can see that the cluster means calculated by hand in Problem 1 and the cluster means computed by kmeans are the same.
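Since the question asks specifically for the final centroids and the clustering vector, these can also be read directly from the fitted object. The sketch below is illustrative: the seed value, the nstart = 10 setting and the gene row names are assumptions added here for reproducibility and readability, not part of the original script.

```r
# Same data as above, with gene names as row names (an addition for readability)
df <- data.frame(t1 = c(2, 2, 2, 10, 11),
                 t2 = c(2, 2, 2, 8, 10),
                 t3 = c(1, 2, 0, 10, 10),
                 t4 = c(5, 2, 2, 6, 8),
                 row.names = paste0("G", 1:5))

set.seed(1)  # arbitrary seed so the random start is repeatable
fit <- kmeans(df, centers = 2, nstart = 10)

fit$centers  # final centroids (cluster means), one row per cluster
fit$cluster  # clustering vector: cluster number for each gene
```

Because the starting centres are random, which group is labelled 1 and which is labelled 2 can change between runs; only the grouping itself is expected to be stable.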

Problem 3

The initial cluster centers can be passed to the kmeans function through its centers argument, as a matrix with one row per centroid. The syntax of kmeans in R is as follows.

kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace=FALSE)

where

x = numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns).

centers = either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres.

iter.max = the maximum number of iterations allowed.

nstart = if centers is a number, how many random sets should be chosen?

algorithm = character: may be abbreviated. Note that "Lloyd" and "Forgy" are alternative names for one algorithm.
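Because centers accepts a set of initial cluster centres, c1 and c2 can be supplied as the rows of a matrix. The following is a sketch rather than the original poster's code: the object names, the gene row names and the choice of algorithm = "Lloyd" (picked here because it mirrors the assign/update steps of the hand calculation in Problem 1) are assumptions.

```r
# Gene expression data, with gene names as row names (an addition for readability)
df <- data.frame(t1 = c(2, 2, 2, 10, 11),
                 t2 = c(2, 2, 2, 8, 10),
                 t3 = c(1, 2, 0, 10, 10),
                 t4 = c(5, 2, 2, 6, 8),
                 row.names = paste0("G", 1:5))

# Initial centroids c1 and c2, one per row, in the same column order as df
init_centers <- rbind(c(1, 2, 3, 4),
                      c(9, 8, 7, 6))

# Passing a matrix as `centers` makes kmeans start from these centroids
fit_init <- kmeans(df, centers = init_centers, algorithm = "Lloyd")

fit_init$centers  # final centroids (cluster means)
fit_init$cluster  # clustering vector for G1..G5
```

With fixed starting centroids no random start is involved, so the result should match the hand calculation: cluster 1 holds G1, G2 and G3 with mean (2,2,1,3), and cluster 2 holds G4 and G5 with mean (10.5,9,10,7).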
