Question

In this assignment, you are to write a C++ program (using Visual C++ 2015) that reads training data in WEKA ARFF format and generates an ID3 decision tree in a format similar to that of the tree generated by WEKA ID3. Please note the following:

Your algorithm will use the entire data set to generate the tree. You may assume that the attributes (a) are of nominal type (i.e., no numeric data), and (b) have no missing values.

In general, the basic ID3 algorithm uses an entropy measure to select the best attribute on which to divide the data set. It continues to select attributes for further branching (based on the subset of data belonging to that branch) until either (a) all attributes have been used, or (b) all instances under a node belong to the same class. Under condition (b) the tree makes no errors on the training set, although it may not work well with future data due to over-fitting.
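
The attribute-selection step described above can be sketched in C++ as follows. This is a minimal illustration rather than a complete ID3 implementation: the `Instance` layout (nominal values with the class label stored last) and the function names `entropy`, `informationGain`, and `bestAttribute` are assumptions made for this example and are not part of the assignment.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One training instance: nominal attribute values, with the class label in
// the last position (a hypothetical layout chosen for this sketch).
using Instance = std::vector<std::string>;

// Shannon entropy, in bits, of the class labels found in `rows`.
double entropy(const std::vector<Instance>& rows) {
    assert(!rows.empty());
    std::map<std::string, std::size_t> counts;
    for (const Instance& r : rows) ++counts[r.back()];
    double h = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / rows.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Information gain obtained by splitting `rows` on the attribute at `attr`:
// parent entropy minus the size-weighted entropy of each subset.
double informationGain(const std::vector<Instance>& rows, std::size_t attr) {
    std::map<std::string, std::vector<Instance>> parts;
    for (const Instance& r : rows) parts[r[attr]].push_back(r);
    double remainder = 0.0;
    for (const auto& kv : parts) {
        const double weight = static_cast<double>(kv.second.size()) / rows.size();
        remainder += weight * entropy(kv.second);
    }
    return entropy(rows) - remainder;
}

// Index of the attribute with the highest gain among the unused `candidates`.
std::size_t bestAttribute(const std::vector<Instance>& rows,
                          const std::vector<std::size_t>& candidates) {
    assert(!candidates.empty());
    std::size_t best = candidates.front();
    double bestGain = -1.0;
    for (std::size_t a : candidates) {
        const double g = informationGain(rows, a);
        if (g > bestGain) { bestGain = g; best = a; }
    }
    return best;
}
```

A recursive tree builder would call `bestAttribute` on the rows at a node, create one child per value of the chosen attribute, and stop when the candidate list is empty or `entropy(rows)` is zero, which correspond to stopping conditions (a) and (b).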

Answer #1

In this assignment, you will use the WEKA system to analyze two artificial data sets and one real data set. You will apply five learning algorithms to each data set and compare their performance. I have included a section at the end that describes how to get WEKA and how to run it from the GUI or from the command line.

  • Learning Algorithms. We will compare Perceptron, Logistic Regression, Decision Trees (J48), and k-nearest neighbors (IBk) (two variations: 1-NN and 5-NN).
  • Data Sets. We will apply these five algorithms to the data sets hw_gmm, hw_step, and statlog. The last of these is from the UCI Machine Learning Repository and has been cleaned so that there are no missing values. The artificial data sets have one or more training data files and one test data file; the statlog data is one large file:
    statlog files in the data folder directory: Index, australian.dat, australian.doc
    
    hw_gmm data files 
          hw_gmm_25.arff       25 training examples
          hw_gmm_50.arff       50 training examples
          hw_gmm_100.arff      100 training examples
          hw_gmm_250.arff      250 training examples
          hw_gmm_500.arff      500 training examples
          hw_gmm_test.arff     test data file
    
    hw_step data files
          hw_step-25.arff      25 training examples
          hw_step-50.arff      50 training examples
          hw_step-100.arff     100 training examples
          hw_step-250.arff     250 training examples
          hw_step-500.arff     500 training examples
          hw_step_test.arff    test data file
    

    You will run the five learning algorithms on each training data file and evaluate the results on the corresponding test data files.

  • Exercises / What to turn in.
    1. [Data Handling, 10 points]: Your first task is to download the statlog data and convert it to WEKA format. This task will make you familiar with how to download and handle data.
      • Go to the webpage and follow the link to 'Data Folder' at the top (right under the main name).
      • Download the australian.dat file and split it into two files: 490 instances for the training set (name it statlog.arff) and 200 instances for the test set (name it statlog_test.arff).
      • Add a WEKA ARFF header to these two files, using the details of the attribute information on the web page. Look at the artificial data sets to help you get the syntax correct. Make sure you keep the attributes and instances in the same order as they are in the original files.
      You will be predicting the last attribute (class).

      TURN IN:
      You should turn in the top 50 lines of your statlog.arff and statlog_test.arff files.
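
The split-and-convert step in this exercise could be scripted along the following lines. This is only a sketch under stated assumptions: the input is taken to be whitespace-separated with one instance per line, the attribute names `A1`, `A2`, ... and the `numeric` type are placeholders that must be replaced with the attribute information from the web page, and the class values `{0,1}` are likewise assumed. The helper names `tokens`, `writeArff`, and `splitToArff` are hypothetical.

```cpp
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

using Row = std::vector<std::string>;

// Split one whitespace-separated line into its values.
Row tokens(const std::string& line) {
    std::istringstream ss(line);
    Row out;
    for (std::string t; ss >> t;) out.push_back(t);
    return out;
}

// Write an ARFF header followed by the rows as comma-separated data lines.
// Attribute names, types, and class labels below are PLACEHOLDERS.
void writeArff(std::ostream& out, const std::string& relation,
               const std::vector<Row>& rows) {
    out << "@relation " << relation << "\n\n";
    const std::size_t n = rows.empty() ? 0 : rows.front().size();
    for (std::size_t i = 0; i + 1 < n; ++i)
        out << "@attribute A" << (i + 1) << " numeric\n";  // placeholder type
    if (n > 0) out << "@attribute class {0,1}\n";           // assumed labels
    out << "\n@data\n";
    for (const Row& r : rows) {
        for (std::size_t i = 0; i < r.size(); ++i)
            out << (i ? "," : "") << r[i];
        out << "\n";
    }
}

// Send the first `nTrain` instances to `train` and the rest to `test`,
// preserving the original order of instances and attributes.
void splitToArff(std::istream& in, std::size_t nTrain,
                 std::ostream& train, std::ostream& test) {
    std::vector<Row> first, rest;
    for (std::string line; std::getline(in, line);) {
        Row t = tokens(line);
        if (t.empty()) continue;  // skip blank lines
        (first.size() < nTrain ? first : rest).push_back(t);
    }
    writeArff(train, "statlog", first);
    writeArff(test, "statlog_test", rest);
}
```

In practice the three streams would be `std::ifstream` and `std::ofstream` objects opened on australian.dat, statlog.arff, and statlog_test.arff, with `nTrain` set to 490.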

    2. [Training Set Sensitivity, 30 points]: How sensitive are the various learners to the training set size? We will have each learner learn on each of the training files (sizes 10 to 500) and record its accuracy. This exercise gives us insight into the behavior of each learner and how sensitive it is to training set size. This knowledge is useful when deciding which learner to use for a specific problem.

      For each classifier and each problem domain, you should learn using each of the training files (e.g., hw_step_10.arff) and test the learned model on the given test file (e.g., hw_step_test.arff). Record the accuracy of the learned model and report it in a table and a graph as specified in parts 1 and 2 below. See the end of the homework for how to do these runs and get the accuracies. I suggest you use the command line to do these runs as a batch.

      1. [Tabular comparison, 20 points]

        TURN IN:
        A table in the following format:

        -------------------------------------------------------
        hw_gmm:
        N       Perceptron   LogReg   J48     kNN-1     kNN-5
        25      xxx          yyy      zzz     kkk1      kkk5
        50      xxx          yyy      zzz     kkk1      kkk5
        100     xxx          yyy      zzz     kkk1      kkk5
        250     xxx          yyy      zzz     kkk1      kkk5
        500     xxx          yyy      zzz     kkk1      kkk5
        
        hw_step:
        N       Perceptron   LogReg   J48     kNN-1     kNN-5
        25      xxx          yyy      zzz     kkk1      kkk5
        50      xxx          yyy      zzz     kkk1      kkk5
        100     xxx          yyy      zzz     kkk1      kkk5
        250     xxx          yyy      zzz     kkk1      kkk5
        500     xxx          yyy      zzz     kkk1      kkk5
        
        statlog:
        N       Perceptron   LogReg   J48     kNN-1     kNN-5
        490     xxx          yyy      zzz     kkk1      kkk5
        -------------------------------------------------------
        
        Where xxx gives the error rate of the perceptron, yyy gives the error rate of LogisticRegression, etc.
      2. [Graph Comparison, 10 points]
        TURN IN:
        Graphs of the results for hw_gmm and hw_step plotting the performance of the five algorithms as a function of the size of the training data set (known as a "learning curve"). I recommend using gnuplot, Excel, or MATLAB for constructing the graphs, as WEKA does not provide an easy way to do this.

        For gnuplot, you need to create a separate file for each learner. Each file should consist of x,y pairs, where x is the training set size and y is the accuracy. You can then plot these files using the plot command.

        For Excel, you can build the graphs from the table above using the chart wizard.
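
For the gnuplot route, each learner's data file is just one line per training set size. A minimal sketch of producing that text is shown here; the function name `curveData` is hypothetical, and the values are separated by a space because gnuplot reads whitespace-separated columns by default.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Build the contents of one learner's data file: one "x y" line per run,
// where x is the training set size and y is the recorded accuracy.
std::string curveData(const std::vector<int>& sizes,
                      const std::vector<double>& accuracies) {
    assert(sizes.size() == accuracies.size());
    std::ostringstream out;
    for (std::size_t i = 0; i < sizes.size(); ++i)
        out << sizes[i] << ' ' << accuracies[i] << '\n';
    return out.str();
}
```

Writing the returned string to one file per learner (for example with `std::ofstream`) gives files that can all be passed to a single gnuplot `plot` command.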

    3. [Decision Boundaries, 60 points]
      Each learner creates decision boundaries and we often would like to know what these boundaries are. In some cases, such as logistic regression and J48, computing these boundaries is straightforward. In other cases, such as VotedPerceptron and Nearest Neighbor, this is not so easy and we need to use other means. These exercises are meant to help you understand how to get the decision boundaries from the learned models.
      1. [Logistic Regression, 20 points] Let us consider the hw_gmm_25 and hw_step_50 training sets and the kind of decision boundaries that logistic regression finds. To compute the decision boundary for Logistic Regression, recall that the logistic regression model has the form
        log [ P(y=1|X) / P(y=0|X) ] = w0 + w1*x1 + w2*x2
        
        WEKA produces a table that looks like
         Variable      Coeff.
                1      w1
                2      w2
        Intercept      w0
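
The decision boundary is the set of points where the two classes are equally likely, so the log-odds on the left-hand side of the model equals zero: w0 + w1*x1 + w2*x2 = 0. Solving for x2 gives x2 = -(w0 + w1*x1) / w2, which is the line to plot. A small sketch of that calculation follows; the `LogisticModel` struct and `boundaryX2` function are names invented for this example, and the coefficients are whatever values appear in the WEKA table.

```cpp
#include <cassert>

// Coefficients read from the WEKA output table (w0 is the intercept).
struct LogisticModel {
    double w0;
    double w1;
    double w2;
};

// x2 coordinate of the decision boundary at a given x1, obtained by solving
// w0 + w1*x1 + w2*x2 = 0 for x2. Requires w2 != 0; when w2 is zero the
// boundary is the vertical line x1 = -w0 / w1 instead.
double boundaryX2(const LogisticModel& m, double x1) {
    assert(m.w2 != 0.0);
    return -(m.w0 + m.w1 * x1) / m.w2;
}
```

Evaluating `boundaryX2` at the smallest and largest x1 values in the data gives two endpoints, which is enough to draw the line over the scatter plot.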
        

        TURN IN:

        (i, 10 points) Plot of the data points for hw_gmm_25 with lines showing the decision boundary learned by Logistic Regression. That is, you should plot the data as points in the x/y plane and then plot the decision boundary learned by the algorithm.

        (ii, 10 points) Plot of the data points for hw_step_50 with a line showing the learned decision boundary for Logistic Regression.

      2. [J48, 20 points]:

        Now, let us consider the hw_gmm_250 and hw_step_250 training sets and the kind of decision boundaries found by J48. This will require that you read the decision tree and understand the decision boundary. J48 displays the tree in the following format:

        x1 <= 1.0: positive (75.0/17.0)
        x1 > 1.0
        |   x2 <= 5.0: negative (42.0/12.0)
        |   x2 > 5.0: positive (33.0/10.0)
        
        The first line indicates a split on feature x1 with threshold 1.0. The first branch leads to a leaf labeled "positive". The numbers in parentheses indicate that this leaf contains 75 data points of which 17 were misclassified. Indentation indicates child nodes. The vertical bars are intended to make it easier to see the indentations.
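
One way to check your reading of such a tree is to transcribe it as nested conditionals. The sketch below encodes the example tree shown above, not a tree learned from the homework data. The names `classify`, `Leaf`, and `trainingAccuracy` are hypothetical. Each test corresponds to an axis-parallel boundary: the vertical line x1 = 1.0, and the horizontal line x2 = 5.0 drawn only where x1 > 1.0.

```cpp
#include <cassert>
#include <string>
#include <vector>

// The example tree written as nested conditionals.
std::string classify(double x1, double x2) {
    if (x1 <= 1.0) return "positive";  // x1 <= 1.0: positive (75.0/17.0)
    if (x2 <= 5.0) return "negative";  // x1 > 1.0, x2 <= 5.0: negative (42.0/12.0)
    return "positive";                 // x1 > 1.0, x2 > 5.0: positive (33.0/10.0)
}

// Counts taken from the parentheses at a leaf: points reaching the leaf and
// how many of them were misclassified.
struct Leaf {
    double total;
    double errors;
};

// Fraction of points classified correctly, summed over all leaves.
double trainingAccuracy(const std::vector<Leaf>& leaves) {
    double total = 0.0;
    double errors = 0.0;
    for (const Leaf& l : leaves) {
        total += l.total;
        errors += l.errors;
    }
    assert(total > 0.0);
    return (total - errors) / total;
}
```

For the example tree, the leaves hold 75 + 42 + 33 = 150 points with 17 + 12 + 10 = 39 errors, so `trainingAccuracy` returns 111/150 = 0.74.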