Question

Data Mining: Explain why decision tree algorithm based on impurity measures such as entropy and Gini...

Data Mining:

Explain why decision tree algorithms based on impurity measures such as entropy and the Gini index tend to favor attributes with a larger number of distinct values. How would you overcome this problem?

Answer #1

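Impurity measures such as entropy and the Gini index tend to favor attributes that have a large number of distinct values, because splitting the data into more partitions yields smaller child nodes that look purer.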

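Figure: three alternative test conditions on the same data set: Gender (Male, Female), Car Type (Family, Sports, Luxury), and Customer ID (one branch per ID, e.g. v10, v11, ..., v20). Each child node is labeled with its class counts C0 and C1.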

The figure above shows three alternative test conditions for partitioning the given data set. Comparing the first test condition, Gender, with the second, Car Type, it is easy to see that Car Type provides a better way of splitting the data, since it produces purer descendant nodes. However, if we compare both conditions with Customer ID, the latter appears to produce even purer partitions. Yet Customer ID is not a predictive attribute, because its value is unique for each record. Even in a less extreme situation, a test condition that results in a large number of outcomes may not be desirable, because the number of records associated with each partition is too small to make any reliable predictions.
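The effect described above can be reproduced with a short Python sketch. The 20-record data set below is hypothetical toy data loosely modeled on the three attributes in the figure (the exact counts are assumptions), and the helper names `entropy`, `gini`, and `weighted_impurity` are chosen only for this illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_impurity(values, labels, impurity):
    """Impurity after a multiway split on `values`, weighted by partition size."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    n = len(labels)
    return sum(len(g) / n * impurity(g) for g in groups.values())

# Hypothetical toy data (assumed counts): 10 records of class C0, then 10 of C1.
labels = ["C0"] * 10 + ["C1"] * 10
gender = ["M"] * 6 + ["F"] * 4 + ["M"] * 4 + ["F"] * 6
car_type = (["Family"] * 1 + ["Sports"] * 8 + ["Luxury"] * 1
            + ["Family"] * 3 + ["Luxury"] * 7)
customer_id = list(range(1, 21))  # unique value per record

for name, attr in [("Gender", gender), ("Car Type", car_type),
                   ("Customer ID", customer_id)]:
    h = weighted_impurity(attr, labels, entropy)
    g = weighted_impurity(attr, labels, gini)
    gain = entropy(labels) - h
    print(f"{name}: entropy={h:.3f} gini={g:.3f} info gain={gain:.3f}")
```

On this toy data the Customer ID split reaches zero impurity under both measures, and so gets the largest possible information gain, even though it cannot generalize to any new record.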

There are two strategies for overcoming this problem. The first strategy is to restrict the test conditions to binary splits only. This strategy is used by decision tree algorithms such as CART. The second strategy is to modify the splitting criterion to take into account the number of outcomes produced by the attribute test condition. For example, in the C4.5 decision tree algorithm, a splitting criterion known as gain ratio is used to determine the goodness of a split. This criterion is defined as follows:
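To make the first strategy concrete, here is a minimal Python sketch of a binary-only split search on a categorical attribute. It illustrates the idea and is not CART's actual implementation. The function name `best_binary_split` and the Car Type counts are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Try every two-way grouping of the attribute's distinct values and
    return (left_group, weighted_gini) for the lowest weighted Gini.
    Returns None if the attribute has fewer than two distinct values."""
    distinct = sorted(set(values))
    n = len(labels)
    best = None
    # Keep the first value on the left so each grouping is tried only once.
    first, rest = distinct[0], distinct[1:]
    for r in range(len(rest)):  # r < len(rest) keeps the right side non-empty
        for extra in combinations(rest, r):
            left = {first, *extra}
            left_y = [y for v, y in zip(values, labels) if v in left]
            right_y = [y for v, y in zip(values, labels) if v not in left]
            score = (len(left_y) * gini(left_y) + len(right_y) * gini(right_y)) / n
            if best is None or score < best[1]:
                best = (left, score)
    return best

# Hypothetical toy data (assumed counts): 10 records of class C0, then 10 of C1.
labels = ["C0"] * 10 + ["C1"] * 10
car_type = (["Family"] * 1 + ["Sports"] * 8 + ["Luxury"] * 1
            + ["Family"] * 3 + ["Luxury"] * 7)

left_group, score = best_binary_split(car_type, labels)
print(left_group, round(score, 4))
```

Because every candidate test has exactly two outcomes, attributes are compared on equal footing regardless of how many distinct values they have. Note that the number of groupings grows exponentially with the number of distinct values, so this exhaustive search is only practical for small categorical domains.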

Gain ratio = $\Delta_{\text{info}} / \text{Split Info}$

Here, $\text{Split Info} = -\sum_{i=1}^{k} P(v_i) \log_2 P(v_i)$ and $k$ is the total number of splits. For example, if each attribute value has the same number of records, then $\forall i : P(v_i) = 1/k$ and the split information would be equal to $\log_2 k$. This example suggests that if an attribute produces a large number of splits, its split information will also be large, which in turn reduces its gain ratio.
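The gain ratio can be sketched in a few lines of Python. This follows the definition given above and is not the C4.5 source code. The function names and the toy data are assumptions made for illustration. The split information is computed as the entropy of the attribute's own value distribution, which is exactly the sum over the k outcomes.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of `items`."""
    n = len(items)
    return sum(-(c / n) * log2(c / n) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of a multiway split on `values`, divided by split info."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    info_gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(values)  # -sum_i P(v_i) * log2 P(v_i) over the k outcomes
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data (assumed counts): 10 records of class C0, then 10 of C1.
labels = ["C0"] * 10 + ["C1"] * 10
gender = ["M"] * 6 + ["F"] * 4 + ["M"] * 4 + ["F"] * 6
car_type = (["Family"] * 1 + ["Sports"] * 8 + ["Luxury"] * 1
            + ["Family"] * 3 + ["Luxury"] * 7)
customer_id = list(range(1, 21))  # k = 20 equally likely outcomes

for name, attr in [("Gender", gender), ("Car Type", car_type),
                   ("Customer ID", customer_id)]:
    print(f"{name}: split info={entropy(attr):.3f} gain ratio={gain_ratio(attr, labels):.3f}")
```

For Customer ID, every value is equally likely, so the split information equals log2 of 20, and the perfect information gain of 1 bit is divided by that large denominator. On this toy data the gain ratio ranks Car Type above Customer ID, although Customer ID still scores above Gender, so the penalty reduces the bias rather than removing it.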
