a) Consider the following taxpayer dataset. It captures several
characteristics of each taxpayer, such as marital status, date of
birth (DOB), income, and refund status, along with tax evasion
status. Using this dataset, answer the following sub-questions.
Txn_ID  Marital_Status  Date_of_Birth  Taxable_Income  Refund_Status  Evasion_Status
1       Single          20 March 82    125K            Yes            No
2       Married         31 July 86     100K            No             No
3       Single          17 Jan 89      70K             No             No
4       Married         25 Aug 84      120K            Yes            No
5       Divorced        17 Sept 91     95K             No             Yes
6       Married         2 Nov 89       60K             No             No
7       Divorced        8 Nov 87       220K            Yes            No
8       Single          9 Feb 81       85K             No             Yes
9       Married         18 Apr 85      75K             No             No
10      Single          7 March 87     90K             No             Yes
You need to use this dataset to predict the probability that a
taxpayer will evade tax.
a) Which modelling technique would be useful for the above
requirement? Why?
b) List the significant changes that need to be made to this dataset
so that it can be used as input to the modelling technique
identified in (a).
c) Show the final dataset structure that can be used as input.
a) Clustering techniques can be used. We can use K-means
clustering, because in this technique the centre of each group can
be initialised randomly and then updated as records are assigned.
b) This can be done in many ways. For example, I would form groups based on date of birth and choose a centre for each group: ages 0-10 fall in group 1 with centre 5, ages 11-20 in group 2 with centre 15, and so on. To find the probability for a new record, I would place it in the closest group; one way to read a probability off the group is the share of its existing records with Evasion_Status = Yes. Every time a record is added to a group, I would recompute that group's centre by taking the mean of its members.
c) The final dataset is the same as the input dataset.
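The grouping idea in (b) can be sketched roughly as below. This is only an illustration under several assumptions that are not in the question: age is approximated as a reference year (2020, chosen arbitrarily) minus the birth year, the two-digit years are read as 19xx, and the probability for a new record is taken as the share of evaders in its closest group. The names AgeGroup, make_groups, closest_group and estimate_evasion_probability are hypothetical helpers, and this is a simplified incremental grouping rather than a full K-means run.

```python
from dataclasses import dataclass, field

REFERENCE_YEAR = 2020  # assumption: the year at which age is measured

# (birth_year, evaded) pairs read from the Date_of_Birth and
# Evasion_Status columns, assuming two-digit years mean 19xx.
RECORDS = [
    (1982, False), (1986, False), (1989, False), (1984, False), (1991, True),
    (1989, False), (1987, False), (1981, True), (1985, False), (1987, True),
]


@dataclass
class AgeGroup:
    centre: float                       # starts at the midpoint of the age band
    ages: list = field(default_factory=list)
    evaders: int = 0

    def add(self, age, evaded):
        """Add a record and recompute the centre as the mean member age."""
        self.ages.append(age)
        self.evaders += int(evaded)
        self.centre = sum(self.ages) / len(self.ages)

    def evasion_share(self):
        """Share of evaders in the group, or None if the group is empty."""
        return self.evaders / len(self.ages) if self.ages else None


def make_groups(width=10, max_age=100):
    # Initial centres 5, 15, 25, ... as in the example in (b).
    return [AgeGroup(centre=low + width / 2) for low in range(0, max_age, width)]


def closest_group(groups, age):
    # The closest group is the one whose current centre is nearest to the age.
    return min(groups, key=lambda g: abs(g.centre - age))


def estimate_evasion_probability(groups, age):
    # Assumption: the probability is the evader share of the closest group.
    return closest_group(groups, age).evasion_share()


groups = make_groups()
for birth_year, evaded in RECORDS:
    age = REFERENCE_YEAR - birth_year
    closest_group(groups, age).add(age, evaded)
```

For a new taxpayer, calling estimate_evasion_probability(groups, age) returns the evader share of the nearest group, or None when that group has no records yet. Because each centre moves every time a record is added, the result depends on the order in which the records are processed.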