Question

Data clustering and the k-means algorithm. I'm not able to list all of the data sets, but they include: ecoli.txt, glass.txt, ionosphere.txt, iris_bezdek.txt, landsat.txt, letter_recognition.txt, segmentation.txt, vehicle.txt, wine.txt, and yeast.txt.

Input: Your program should be non-interactive (that is, the program should not interact with the user by asking him/her explicit questions) and take the following command-line arguments: <F> <K> <I> <T> <R>, where
F: name of the data file
K: number of clusters (positive integer greater than one)
I: maximum number of iterations (positive integer)
T: convergence threshold (non-negative real)
R: number of runs (positive integer)

Warning: Do not hard-code any of these parameters (e.g., using the statement int R = 100;) into your program. The values of these parameters should be given by the user at run-time through the command prompt. See below for an example. A 'run' is an execution of k-means until it converges.

The first line of F contains the number of points (N) and the dimensionality of each point (D). Each of the subsequent lines contains one data point in blank-separated format. In your program, you should represent the attributes of each point using a double-precision floating-point data type (e.g., the double data type in C/C++/Java). In other words, you should not use an integral data type (byte, short, char, int, long, etc.) to represent an attribute.

The initial cluster centers should be selected randomly from the data set. A run should be terminated when the number of iterations reaches I or the relative improvement in SSE (Sum-of-Squared Error) between two consecutive iterations drops below T, that is, whenever (SSE(t-1) - SSE(t)) / SSE(t-1) < T, where SSE(t) denotes the SSE value at the end of iteration t (t = 1, 2, ..., I) and SSE(0) = infinity.

Warning: The basic k-means algorithm does not involve the square root function (i.e., sqrt in C/C++/Java). If you use sqrt, the algorithm may not converge or, even if it converges, it will converge more slowly, as sqrt is a computationally expensive function (unlike +, -, x, and /, there is no direct instruction to perform sqrt in most modern CPUs). Another function to avoid is the exponentiation function (i.e., pow in C/C++/Java).
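The termination rule above can be sketched as a small helper; the class and method names here are illustrative, not part of the assignment.

```java
// Sketch of the relative-improvement stopping rule described above.
public class StoppingRule {
    // True when the relative SSE improvement between two consecutive
    // iterations drops below the threshold T.
    public static boolean converged(double prevSSE, double currSSE, double T) {
        // SSE(0) is defined as infinity, so iteration 1 never stops the run
        if (Double.isInfinite(prevSSE)) return false;
        return (prevSSE - currSSE) / prevSSE < T;
    }
}
```

A run would call this once per iteration with SSE(t-1) and SSE(t), and also stop when the iteration count reaches I.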
The basic k-means algorithm does not require any exponentiation by a non-integer exponent. It only requires squaring, which can be performed efficiently using the multiplication operator. In short, to square a number x, do not use pow(x, 2); use the multiplication operator (x * x) instead.

The algorithm should be executed R times, each run started with a different set of randomly selected centers.

Output: Display the SSE value at the end of each iteration.

Sample Output
% F = ecoli.txt (name of data file)
% K = 8 (number of clusters)
% I = 100 (maximum number of iterations in a run)
% T = 0.000001 (convergence threshold)
% R = 3 (number of runs)

test ecoli.txt 8 100 0.000001 3
Run 1
Iteration 1: SSE = 22.0312
Iteration 2: SSE = 18.3704
Iteration 3: SSE = 17.5163
Iteration 4: SSE = 17.3435
Iteration 5: SSE = 17.2725
Iteration 6: SSE = 17.2331
Iteration 7: SSE = 17.221
Iteration 8: SSE = 17.2185
Iteration 9: SSE = 17.2175
Iteration 10: SSE = 17.2108
Iteration 11: SSE = 17.1934
Iteration 12: SSE = 17.184
Iteration 13: SSE = 17.182
Iteration 14: SSE = 17.1766
Iteration 15: SSE = 17.1639
Iteration 16: SSE = 17.1639
Run 2
Iteration 1: SSE = 18.5142
Iteration 2: SSE = 16.7108
Iteration 3: SSE = 16.07
Iteration 4: SSE = 15.6179
Iteration 5: SSE = 15.3449
Iteration 6: SSE = 15.1791
Iteration 7: SSE = 15.1201
Iteration 8: SSE = 15.0975
Iteration 9: SSE = 15.0624
Iteration 10: SSE = 15.0292
Iteration 11: SSE = 15.0162
Iteration 12: SSE = 15.0162
Run 3
Iteration 1: SSE = 18.7852
Iteration 2: SSE = 16.4215
Iteration 3: SSE = 16.2141
Iteration 4: SSE = 16.1159
Iteration 5: SSE = 16.1081
Iteration 6: SSE = 16.1045
Iteration 7: SSE = 16.0995
Iteration 8: SSE = 16.0838
Iteration 9: SSE = 16.0619
Iteration 10: SSE = 16.0547
Iteration 11: SSE = 16.0327
Iteration 12: SSE = 16.0109
Iteration 13: SSE = 16.0109
Best Run: 2: SSE = 15.0162

Testing: Test your program on the given data sets using the parameters given in the table below.
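The squaring advice can be illustrated with a minimal squared-distance helper (illustrative names; the assignment does not prescribe this structure):

```java
// Squared Euclidean distance using only subtraction and multiplication:
// no sqrt (not needed for nearest-center comparisons or for SSE) and
// no pow (x * x is cheaper than pow(x, 2)).
public class SquaredDist {
    public static double dist2(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```

Since sqrt is monotone, comparing squared distances picks the same closest center as comparing true Euclidean distances, and SSE is itself a sum of squared distances.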
It is important to observe that, as the iterations progress, the SSE values never increase, that is, SSE(t) <= SSE(t-1) for t = 1, 2, ..., I. If you observe the exact same SSE value in two successive iterations, this means that the algorithm has converged and no further reduction in SSE is possible. Note that because of random center initialization, the output of your program might be different in each execution.

Hint: Here is a simple way to find out whether or not your k-means implementation is probably correct. For the iris_bezdek data set (K = 3), if you execute R = 100 runs of k-means, the possibilities include:
i. At least one run produces an SSE value less than 78.8514: your program is definitely buggy. 78.8514 is the global optimum for this data set (for K = 3 clusters). It is impossible to get an SSE value less than 78.8514 unless your program is buggy.
ii. Every run produces an SSE value more than 78.8514: your program is most likely buggy. This is a small and simple data set, so you should be able to get an SSE value approximately equal to 78.8514 in at least one run.
iii. None of the runs produce an SSE value less than 78.8514 and at least one run produces an SSE value approximately equal to 78.8514: your program is most likely correct.

Data Set             K    I    T      R
ecoli                8    100  0.001  100
glass                6    100  0.001  100
ionosphere           2    100  0.001  100
iris_bezdek          3    100  0.001  100
landsat              6    100  0.001  100
letter_recognition   26   100  0.001  100
segmentation         7    100  0.001  100
vehicle              4    100  0.001  100
wine                 3    100  0.001  100
yeast                10   100  0.001  100

Large Data Sets: The purpose of the large data sets (e.g., letter_recognition and landsat) is not to test your patience, but to motivate you to write more efficient programs. K-means is actually a very fast algorithm, but only when implemented with care (an efficient C-based implementation can cluster a few hundred million points within seconds on a modern PC).
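The never-increasing property can double as a self-test: record the SSE of each iteration of a run and assert that the sequence is nonincreasing. A minimal sketch, with illustrative names:

```java
// Returns true iff the per-iteration SSE sequence never increases,
// i.e., SSE(t) <= SSE(t-1) for t = 1, 2, ...
public class SseMonotone {
    public static boolean nonIncreasing(double[] sse) {
        for (int t = 1; t < sse.length; t++) {
            if (sse[t] > sse[t - 1]) return false; // an increase signals a bug
        }
        return true;
    }
}
```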

Answer #1

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.StringReader;
import java.util.ArrayList;

// The SPMF imports below are needed only by the bisecting k-means variant
// shown further down, which builds on the SPMF library.
import ca.pfv.spmf.patterns.cluster.DoubleArray;
import ca.pfv.spmf.tools.MemoryLogger;
import ca.pfv.spmf.algorithms.clustering.distanceFunctions.DistanceFunction;
import ca.pfv.spmf.patterns.cluster.ClusterWithMean;

public class KMeansClu
{
    private int [] _withLabel1;     // optional ground-truth labels
    private double [][] _centroids; // current cluster centers (K x D)
    private double [][] dat;        // data matrix (N x D)
    private int [] _label1;         // cluster assignment of each point
    private int _K;                 // number of clusters
    private int _Nrws, _ndims;      // number of rows and of dimensions

public KMeansClu(String F1, String label1name)
{
    // CSVHelper is an external helper class (assumed to be provided with
    // this answer) that parses one line of separated values into a list.
    CSVHelper csv = new CSVHelper();
    BufferedReader reader;
    ArrayList<String> Vals;

    try
    {
      // first pass: count the rows and read the dimensionality
      reader = new BufferedReader(new FileReader(F1));
      _Nrws = 1;
      Vals = csv.parseLine(reader);
      _ndims = Vals.size();
      while (reader.readLine() != null)
        _Nrws++;
      reader.close();
      System.out.println(_Nrws + " " + _ndims);

      dat = new double[_Nrws][];
      for (int i = 0; i < _Nrws; i++)
        dat[i] = new double[_ndims];

      // second pass: read the records into the data matrix
      reader = new BufferedReader(new FileReader(F1));
      int nrow = 0;
      while ((Vals = csv.parseLine(reader)) != null) {
        double [] dv = new double[Vals.size()];
        for (int i = 0; i < Vals.size(); i++) {
          dv[i] = Double.parseDouble(Vals.get(i));
        }
        dat[nrow] = dv;
        nrow++;
      }
      reader.close();
      System.out.println("loaded F");

      // optionally load ground-truth labels, one per line
      if (label1name != null) {
        reader = new BufferedReader(new FileReader(label1name));
        _withLabel1 = new int[_Nrws];
        int c = 0;
        while ((Vals = csv.parseLine(reader)) != null) {
          _withLabel1[c++] = Integer.parseInt(Vals.get(0)); // advance the index
        }
        reader.close();
        System.out.println("loaded label1s");
      }
    }
    catch (Exception e)
    {
      System.out.println(e);
      System.exit(0);
    }
}

public void Clsting(int K, int I, double [][] centers)
{
      _K = K;
      if (centers != null)
          _centroids = centers;
      else {
        // randomly select K distinct data points as the initial centers
        _centroids = new double[_K][];
        ArrayList<Integer> idx = new ArrayList<Integer>();
        for (int i = 0; i < _K; i++) {
          int c;
          do {
            c = (int) (Math.random() * _Nrws);
          } while (idx.contains(c)); // avoid duplicate centers
          idx.add(c);
          _centroids[i] = new double[_ndims];
          for (int j = 0; j < _ndims; j++)
            _centroids[i][j] = dat[c][j];
        }
        System.out.println("selected random initial centers");
      }

      double [][] c1 = _centroids;
      double T = 0.001; // convergence threshold (the assignment passes T at run-time)
      int rounds1 = 0;
      while (true) {
        _centroids = c1;
        // assign each point to the closest centroid
        _label1 = new int[_Nrws];
        for (int i = 0; i < _Nrws; i++) {
          _label1[i] = closest(dat[i]);
        }
        // recompute the centroids from the new assignments
        c1 = updateK();
        rounds1++;
        if ((I > 0 && rounds1 >= I) || converge(_centroids, c1, T))
          break;
      }
      System.out.println("Clsting converges at rounds1 " + rounds1);
}

// find the closest centroid for the record v

private int closest(double [] v){
    double mindist = dist(v, _centroids[0]);
    int label1 = 0;
    for (int i = 1; i < _K; i++) {
      double t = dist(v, _centroids[i]);
      if (mindist > t) {
        mindist = t;
        label1 = i;
      }
    }
    return label1;
}

// compute the SQUARED Euclidean distance between two vectors v1 and v2;
// the square root is omitted on purpose (see the sqrt warning in the
// assignment): it does not change which centroid is closest
private double dist(double [] v1, double [] v2){
    double sum = 0;
    for (int i = 0; i < _ndims; i++) {
      double d = v1[i] - v2[i];
      sum += d * d; // x * x instead of pow(x, 2)
    }
    return sum;
}

// according to the cluster label1s, recompute the centers;
// each center is updated by averaging the members of its cluster
// (this only applies with Euclidean distance as the similarity measure)
private double [][] updateK(){
    // initialize the new centers and the cluster sizes to 0
    double [][] newc = new double [_K][];
    int [] counts = new int[_K];
    for (int i = 0; i < _K; i++) {
      counts[i] = 0;
      newc[i] = new double [_ndims];
      for (int j = 0; j < _ndims; j++)
        newc[i][j] = 0;
    }
    // accumulate the members of each cluster
    for (int i = 0; i < _Nrws; i++) {
      int cn = _label1[i]; // cluster membership of record i
      for (int j = 0; j < _ndims; j++) {
        newc[cn][j] += dat[i][j];
      }
      counts[cn]++;
    }
    // finally take the average; keep the old center if a cluster is empty
    for (int i = 0; i < _K; i++) {
      if (counts[i] == 0) {
        newc[i] = _centroids[i];
        continue;
      }
      for (int j = 0; j < _ndims; j++) {
        newc[i][j] /= counts[i];
      }
    }
    return newc;
}

// check the convergence condition:
// max{ dist(c1[i], c2[i]) : i = 1..K } < T
// (dist returns squared distances, so T is compared against a squared value)
private boolean converge(double [][] c1, double [][] c2, double T){
    // c1 and c2 are two consecutive sets of centers
    double maxv = 0;
    for (int i = 0; i < _K; i++) {
        double d = dist(c1[i], c2[i]);
        if (maxv < d)
            maxv = d;
    }
    return maxv < T;
}

public double[][] getK()
{
    return _centroids;
}

public int [] getLabel1()

{

    return _label1;

}

public int Nrws(){

    return _Nrws;

}

public void printResults(){
      System.out.println("Label1:");
      for (int i = 0; i < _Nrws; i++)
        System.out.println(_label1[i]);
      System.out.println("K:");
      for (int i = 0; i < _K; i++){
        for (int j = 0; j < _ndims; j++)
           System.out.print(_centroids[i][j] + " ");
        System.out.println();
      }
}

public static void main( String[] astrArgs )

{

    /**

     * The code commented out here is just an example of how to use

     * the provided functions and constructors.

     *

     */

     KMeansClu KM = new KMeansClu( "F.csv", null );

     KM.Clsting(2, 10, null); // 2 clusters, maximum 10 iterations

     KM.printResults();

     /* example: using CSVHelper to parse strings
     CSVHelper csv = new CSVHelper();
     StringReader r = new StringReader("x,y,z");
     try{
       ArrayList<String> ss = csv.parseLine(r);
       for (String v:ss)
        System.out.println(v);
     }catch(Exception e){
       System.err.println(e);
     }
     */
   }

}

SSE means Sum of Squared Error. Cluster performance can be measured with this metric; below is a bisecting k-means variant (built on the SPMF library) that computes the SSE values in an iterative manner.
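As a reference for what getSSE computes: SSE is the sum, over all points, of the squared distance from each point to its assigned center. A standalone, array-based sketch (independent of the SPMF types used below; names are illustrative):

```java
// SSE for an array-based clustering: points is N x D, label[i] gives the
// cluster index of point i, centers is K x D. Uses squared distances, no sqrt.
public class SseMetric {
    public static double sse(double[][] points, int[] label, double[][] centers) {
        double total = 0.0;
        for (int i = 0; i < points.length; i++) {
            double[] c = centers[label[i]];
            for (int j = 0; j < points[i].length; j++) {
                double d = points[i][j] - c[j];
                total += d * d; // squared error term
            }
        }
        return total;
    }
}
```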

// Bisecting k-means built on the SPMF library. DistanceFunction, DoubleArray
// and ClusterWithMean are SPMF types (ca.pfv.spmf.*); the base class name
// AlgoKMeans is assumed from the SPMF API.
public class KMeansClus extends AlgoKMeans {

int I1 = -1; // number of 2-means trials per split

public KMeansClus() {
}

public List<ClusterWithMean> runAlgo(String F, int k,
   DistanceFunction distFun, int I1) throws NumberFormatException, IOException {
this.I1 = I1;
return runAlgo(F, k, distFun);
}

void applyAlgorithm(int k, DistanceFunction distFun,
   List<DoubleArray> vectors1, double Min_Val, double Max_Val,
   int vectors1Size) {
List<DoubleArray> currentVectors1 = vectors1;
clusters = new ArrayList<ClusterWithMean>();

   

while(true) {

   List<ClusterWithMean> bestClustersUntilNow = null;

   double smtSSE = Double.MAX_VALUE;

   for(int i = 0; i < I1; i++) {

    List<ClusterWithMean> nwClusters = applyKMeans(2, distFun, currentVectors1, Min_Val, Max_Val, vectors1Size);

    double sse = getSSE(nwClusters);

    if(sse < smtSSE) {

     bestClustersUntilNow = nwClusters;

     smtSSE = sse;

    }

   }

   

   clusters.addAll(bestClustersUntilNow);

   if(clusters.size() == k){

    break;

   }     

   int biggestClusterSize = -1;

   int biggestClusterIndex = -1;

   // split the biggest cluster next
   for(int i = 0; i < clusters.size(); i++) {
    ClusterWithMean cluster = clusters.get(i);
    if(cluster.getVectors().size() > biggestClusterSize) {
     biggestClusterIndex = i;
     biggestClusterSize = cluster.getVectors().size();
     currentVectors1 = cluster.getVectors();
    }
   }

   clusters.remove(biggestClusterIndex);

}

}

public void printStatistics1() {
System.out.println("======== BISECTING KMEANS - STATISTICS ==========");
// distanceFunction, startTimestamp, endTimestamp and clusters are assumed
// to be fields inherited from the SPMF AlgoKMeans base class
System.out.println(" Distance function: " + distanceFunction.getName());
System.out.println(" Total time ~: " + (endTimestamp - startTimestamp)
    + " ms");
System.out.println(" SSE (Sum of Squared Errors) (lower is better): " + getSSE(clusters));
System.out.println(" Max memory: " + MemoryLogger.getInstance().getMaxMemory() + " mb");
}

}
