Data Mining: Explain why decision tree algorithm based on impurity measures such as entropy and Gini...

Question


Data Mining:

Explain why decision tree algorithms based on impurity measures such as entropy and the Gini index tend to favor attributes with a larger number of distinct values. How would you overcome this problem?



Answer 1


Impurity measures such as entropy and the Gini index tend to favor attributes that have a large number of distinct values.

[Figure: three candidate test conditions for the same data set: Gender (Female, Male), Car Type (Family, Sports, Luxury), and Customer ID (one branch per ID, e.g. v10, v11, v20). Each child node is labeled with its class counts C0 and C1.]

The figure above shows three alternative test conditions for partitioning the given data set. Comparing the first test condition, Gender, with the second, Car Type, it is easy to see that Car Type provides a better way of splitting the data, since it produces purer descendant nodes. However, if we compare both conditions with Customer ID, the latter appears to produce even purer partitions: every partition holds a single record, so its impurity is zero. Yet Customer ID is not a predictive attribute, because its value is unique for each record. Even in a less extreme situation, a test condition that results in a large number of outcomes may not be desirable, because the number of records associated with each partition is too small to support reliable predictions.
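To make the bias concrete, here is a minimal sketch that computes the size-weighted entropy and Gini index of the child nodes for each of the three test conditions. The class counts are illustrative assumptions (20 records, 10 per class), not values read off the figure, and the helper names `entropy`, `gini`, and `weighted_impurity` are hypothetical.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a node, given its class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def gini(counts):
    """Gini index of a node, given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_impurity(children, measure):
    """Average impurity of the child nodes, weighted by child size."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * measure(child) for child in children)


# Assumed (C0, C1) counts per child node; 20 records in total.
splits = {
    "Gender": [(6, 4), (4, 6)],
    "Car Type": [(1, 3), (8, 0), (1, 7)],
    # One record per customer, so every child node is trivially pure.
    "Customer ID": [(1, 0)] * 10 + [(0, 1)] * 10,
}

results = {
    name: (weighted_impurity(ch, entropy), weighted_impurity(ch, gini))
    for name, ch in splits.items()
}

for name, (e, g) in results.items():
    print(f"{name:12s} entropy={e:.3f}  gini={g:.3f}")
```

Under these assumed counts, both measures rank Customer ID best, Car Type second, and Gender last, even though Customer ID cannot generalize to any unseen record.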

There are two strategies for overcoming this problem. The first strategy is to restrict the test conditions to binary splits only. This strategy is used by decision tree algorithms such as CART. The second strategy is to modify the splitting criterion to take into account the number of outcomes produced by the attribute test condition. For example, in the C4.5 decision tree algorithm, a splitting criterion known as gain ratio is used to determine the goodness of a split. This criterion is defined as follows:

Gain ratio = $\Delta_{\text{info}} / \text{Split Info}$

Here, Split Info = $-\sum_{i=1}^{k} P(v_i) \log_2 P(v_i)$ and $k$ is the total number of splits. For example, if each attribute value has the same number of records, then $\forall i : P(v_i) = 1/k$ and the split information would be equal to $\log_2 k$. This example suggests that if an attribute produces a large number of splits, its split information will also be large, which in turn reduces its gain ratio.
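The effect of the normalization can be checked with a short sketch. It computes information gain, split information, and gain ratio for three candidate splits, using assumed class counts (20 records, 10 per class) rather than values taken from the figure. The function names `info_gain`, `split_info`, and `gain_ratio` are hypothetical, and this is an illustration of the formula above, not C4.5's full procedure.

```python
from math import log2


def entropy(counts):
    """Entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def info_gain(children):
    """Parent entropy minus the size-weighted entropy of the children."""
    sizes = [sum(child) for child in children]
    n = sum(sizes)
    parent = [sum(col) for col in zip(*children)]  # class totals at the parent
    after = sum(s / n * entropy(child) for s, child in zip(sizes, children))
    return entropy(parent) - after


def split_info(children):
    """Split Info: entropy of the partition sizes, -sum P(v_i) log2 P(v_i)."""
    return entropy([sum(child) for child in children])


def gain_ratio(children):
    """Gain ratio; assumes at least two non-empty partitions."""
    return info_gain(children) / split_info(children)


# Assumed (C0, C1) counts per child node; 20 records in total.
splits = {
    "Gender": [(6, 4), (4, 6)],
    "Car Type": [(1, 3), (8, 0), (1, 7)],
    "Customer ID": [(1, 0)] * 10 + [(0, 1)] * 10,
}

gains = {name: info_gain(ch) for name, ch in splits.items()}
ratios = {name: gain_ratio(ch) for name, ch in splits.items()}

for name in splits:
    print(f"{name:12s} gain={gains[name]:.3f}  "
          f"split_info={split_info(splits[name]):.3f}  "
          f"gain_ratio={ratios[name]:.3f}")
```

With these counts, Customer ID has the largest plain information gain, but its split information equals $\log_2 20$, so its gain ratio falls below that of Car Type.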
