In this assignment, you are to write a C++ program (using Visual C++ 2015) that reads training data in WEKA ARFF format and generates an ID3 decision tree in a format similar to that of the tree generated by WEKA's ID3. Please note the following:
Your algorithm will use the entire data set to generate the tree. You may assume that the attributes (a) are of nominal type (i.e., no numeric data), and (b) have no missing values.
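As a rough illustration of the input side, the sketch below reads a nominal-only ARFF file into attribute definitions and rows. It is written in Python for brevity rather than the C++ the assignment requires, and the names `parse_arff` and `SAMPLE` are hypothetical. It assumes unquoted attribute names and values, no missing values, and no sparse-format rows.

```python
# Minimal sketch of a nominal-only ARFF reader (assumptions: no quoting,
# no missing values, no sparse rows). SAMPLE is made-up illustration data.
SAMPLE = """@relation weather
% comment lines start with a percent sign
@attribute outlook {sunny, overcast, rainy}
@attribute play {yes, no}
@data
sunny,no
overcast,yes
"""

def parse_arff(text):
    """Return (attributes, rows): attributes is a list of (name, values),
    rows is a list of dicts mapping attribute name to value."""
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            names = [name for name, _ in attributes]
            rows.append(dict(zip(names, values)))
        elif lower.startswith("@attribute"):
            # "@attribute name {v1, v2, ...}" -> name and the brace-delimited spec
            name, spec = line.split(None, 2)[1:]
            values = [v.strip() for v in spec.strip("{} ").split(",")]
            attributes.append((name, values))
        elif lower.startswith("@data"):
            in_data = True
        # "@relation" and any other header lines are ignored in this sketch
    return attributes, rows
```

By the usual ARFF convention the last attribute is treated as the class, so a tree builder would take `attributes[-1]` as the target.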
In general, the basic ID3 algorithm uses an entropy measure (information gain) to select the best attribute on which to divide the data set. It continues to select attributes for further branching (based on the subset of data belonging to each branch) until either (a) all attributes have been used, or (b) all instances under a node belong to the same class. This yields a 0% error rate on the training set whenever no two instances share identical attribute values but different classes, although the tree may not perform as well on future data due to over-fitting.
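The attribute-selection step described above can be sketched as follows. This is a minimal Python illustration, not the required C++ solution. The function names and the toy `rows` data are hypothetical, and rows are assumed to be dicts of nominal values.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attr, target):
    """Entropy of the class minus the weighted entropy after splitting on attr."""
    base = entropy([r[target] for r in rows])
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return base - remainder

def best_attribute(rows, attrs, target):
    """Attribute with the highest information gain (ties go to the first one)."""
    return max(attrs, key=lambda a: information_gain(rows, a, target))

# Made-up toy data for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]
```

On this toy data, splitting on `outlook` separates the classes perfectly while `windy` gives no gain, so `best_attribute(rows, ["outlook", "windy"], "play")` selects `outlook`. A full ID3 would recurse on each subset with the chosen attribute removed.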
In this assignment, you will use the WEKA system to analyze two artificial data sets and one real data set. You will apply five learning algorithms to each data set and compare their performance. I have included a section at the end that describes how to get WEKA and how to run it from the GUI or from the command line.
statlog files in the data folder directory:
  Index, australian.dat, australian.doc

hw_gmm data files:
  hw_gmm_25.arff      25 training examples
  hw_gmm_50.arff      50 training examples
  hw_gmm_100.arff     100 training examples
  hw_gmm_250.arff     250 training examples
  hw_gmm_500.arff     500 training examples
  hw_gmm_test.arff    test data file

hw_step data files:
  hw_step-25.arff     25 training examples
  hw_step-50.arff     50 training examples
  hw_step-100.arff    100 training examples
  hw_step-250.arff    250 training examples
  hw_step-500.arff    500 training examples
  hw_step_test.arff   test data file
You will run the five learning algorithms on each training data file and evaluate the results on the corresponding test data files.
TURN IN:
You should turn in the first 50 lines of your statlog.arff and statlog_test.arff files.
For each classifier and each problem domain, you should learn using each of the training files (e.g., hw_step_10.arff) and test the learned model on the given test file (e.g., hw_step_test.arff). Record the accuracy of the learned model and report it in a table and a graph as specified in (a) and (b). See the end of the homework for how to do these runs and obtain the accuracies. I suggest you use the command line to run them as a batch.
TURN IN:
A table in the following format:
-------------------------------------------------------
hw_gmm:
  N     Perceptron  LogReg  J48   kNN-1  kNN-5
  25    xxx         yyy     zzz   kkk1   kkk5
  50    xxx         yyy     zzz   kkk1   kkk5
  100   xxx         yyy     zzz   kkk1   kkk5
  250   xxx         yyy     zzz   kkk1   kkk5
  500   xxx         yyy     zzz   kkk1   kkk5

hw_step:
  N     Perceptron  LogReg  J48   kNN-1  kNN-5
  25    xxx         yyy     zzz   kkk1   kkk5
  50    xxx         yyy     zzz   kkk1   kkk5
  100   xxx         yyy     zzz   kkk1   kkk5
  250   xxx         yyy     zzz   kkk1   kkk5
  500   xxx         yyy     zzz   kkk1   kkk5

adult:
  N     Perceptron  LogReg  J48   kNN-1  kNN-5
  490   xxx         yyy     zzz   kkk1   kkk5
-------------------------------------------------------
Where xxx gives the error rate of the perceptron, yyy gives the error rate of LogisticRegression, etc.
For gnuplot, you need to create a separate file for each learner. Each file should consist of x,y pairs, where x is the training set size and y is the accuracy. You can then plot these files using the plot command.
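One way to produce the per-learner files is sketched below in Python. The function names and the `.dat` extension are hypothetical choices. The sketch writes whitespace-separated pairs, which gnuplot reads by default, and any numbers passed in are whatever you recorded from your own runs.

```python
import os

def format_points(points):
    """Render (training_set_size, accuracy) pairs as 'x y' lines, sorted by x."""
    return "".join(f"{n} {acc}\n" for n, acc in sorted(points))

def write_learner_files(results, out_dir="."):
    """Write one '<learner>.dat' file per learner.

    results maps a learner name (e.g. "J48") to a list of
    (training_set_size, accuracy) pairs. Returns the paths written.
    """
    paths = []
    for learner, points in results.items():
        path = os.path.join(out_dir, f"{learner}.dat")
        with open(path, "w") as f:
            f.write(format_points(points))
        paths.append(path)
    return paths
```

A file written this way can then be drawn in gnuplot with a command along the lines of `plot "J48.dat" using 1:2 with linespoints`, with one such entry per learner separated by commas.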
For Excel, you can plot the graphs from the table above using the chart wizard.
  log [ P(y=1|X) / P(y=0|X) ] = w0 + w1*x1 + w2*x2

WEKA produces a table that looks like

  Variable    Coeff.
  1           w1
  2           w2
  Intercept   w0
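The decision boundary is where the log-odds equal zero, i.e. w0 + w1*x1 + w2*x2 = 0, which rearranges to x2 = -(w0 + w1*x1)/w2 when w2 is nonzero. The Python sketch below computes two endpoints of that line for plotting. The function names are hypothetical, and the coefficients are whatever WEKA reports for your run.

```python
def boundary_x2(x1, w0, w1, w2):
    """x2 coordinate of the line w0 + w1*x1 + w2*x2 = 0 at the given x1."""
    if w2 == 0:
        # With w2 == 0 the boundary is the vertical line x1 = -w0/w1.
        raise ValueError("w2 is zero; the boundary is vertical")
    return -(w0 + w1 * x1) / w2

def boundary_segment(x1_min, x1_max, w0, w1, w2):
    """Two (x1, x2) endpoints spanning the plotted x1 range."""
    return [
        (x1_min, boundary_x2(x1_min, w0, w1, w2)),
        (x1_max, boundary_x2(x1_max, w0, w1, w2)),
    ]
```

Negating all three coefficients describes the same line, so the boundary does not depend on which class the reported log-odds refer to. Only the side labeled positive changes.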
TURN IN:
(i, 10 points) Plot of the data points for hw_gmm_25 with lines showing the decision boundary learned by Logistic Regression. That is, you should plot the data as points in the x/y plane and then plot the decision boundary learned by the algorithm.
(ii, 10 points) Plot of the data points for hw_step_50 with a line showing the learned decision boundary for Logistic Regression.
Now, let us consider the hw_gmm_250 and hw_step_250 training sets and the kind of decision boundaries found by J48. This will require that you read the decision tree and understand the decision boundary. J48 displays the tree in the following format:
x1 <= 1.0: positive (75.0/17.0)
x1 > 1.0
|   x2 <= 5.0: negative (42.0/12.0)
|   x2 > 5.0: positive (33.0/10.0)

The first line indicates a split on feature x1 with threshold 1.0. The first branch leads to a leaf labeled "positive". The numbers in parentheses indicate that this leaf contains 75 data points, of which 17 were misclassified. Indentation indicates child nodes. The vertical bars are intended to make it easier to see the indentations.
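To make the geometry concrete, the example tree above can be read as nested threshold tests, each of which cuts the x1/x2 plane with an axis-parallel line. The Python sketch below hard-codes that example tree and also totals the leaf counts. The names `classify`, `training_error`, and `EXAMPLE_TREE` are hypothetical, and the regular expression assumes leaf counts appear as `(total/errors)` or `(total)`.

```python
import re

EXAMPLE_TREE = """x1 <= 1.0: positive (75.0/17.0)
x1 > 1.0
|   x2 <= 5.0: negative (42.0/12.0)
|   x2 > 5.0: positive (33.0/10.0)
"""

def classify(x1, x2):
    """Prediction of the example tree: regions bounded by x1 = 1.0 and x2 = 5.0."""
    if x1 <= 1.0:
        return "positive"   # everything left of the vertical line x1 = 1.0
    if x2 <= 5.0:
        return "negative"   # right of x1 = 1.0 and below x2 = 5.0
    return "positive"       # right of x1 = 1.0 and above x2 = 5.0

def training_error(tree_text):
    """Misclassified fraction summed over all leaves of a printed tree."""
    leaves = re.findall(r"\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)", tree_text)
    total = sum(float(n) for n, _ in leaves)
    errors = sum(float(e) if e else 0.0 for _, e in leaves)
    return errors / total
```

For the example tree the leaves hold 75 + 42 + 33 = 150 points with 17 + 12 + 10 = 39 misclassified, so `training_error(EXAMPLE_TREE)` comes to 0.26. The decision boundary to draw is the vertical line x1 = 1.0 plus the horizontal segment x2 = 5.0 for x1 > 1.0.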