Data is only as good as its completeness, and we have all dealt with incomplete data. When a dataset has missing or incomplete data, the question becomes how to handle the gaps. There are several strategies one can use to address this issue. Common strategies include:
Deleting the row where the missing data appears
Placing a unique number in the missing data field or location
Averaging the data that DOES appear and placing the average value into the missing data field or location.
Placing a value in the missing data field or location from an adjacent field or location
Analyze each strategy, and any other strategies you can think of, and determine the advantages and disadvantages of each. As part of your analysis, make sure to address whether anything can be gained or lost by using one strategy over another.
Thanks for asking!
First of all, there is no single best method that should be applied every time we encounter missing values in data.
Let's discuss the methods one by one:
1) Deleting the row when missing data appears
This is the simplest approach, but also the crudest way to deal with missing values. When we delete a whole row to get rid of a single missing value, we not only reduce the sample size but may also throw away important data in the row's other fields, which can affect our results a lot. So this method should only be used when we are confident that deleting the rows will not affect the results much, for example when only a few rows in a large dataset are incomplete.
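To see how much data row deletion can cost, here is a minimal sketch in plain Python. The rows, the field names, and the helper `drop_incomplete` are made up for illustration, and `None` is assumed to mark a missing value.

```python
# Hypothetical sample data: None marks a missing value.
rows = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 61000},
    {"id": 3, "age": 45, "income": None},
    {"id": 4, "age": 29, "income": 48000},
]


def drop_incomplete(rows):
    """Return only the rows in which no field is missing."""
    return [r for r in rows if all(v is not None for v in r.values())]


complete = drop_incomplete(rows)
lost = len(rows) - len(complete)
print(f"kept {len(complete)} of {len(rows)} rows, lost {lost}")
```

Note that rows 2 and 3 each had one valid field (an income and an age) that is discarded along with the missing one.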
2) Placing a unique number
Before doing this, we should first decide which value to insert as the placeholder. For example, if the column has values from 100 to 900 and we insert a placeholder between 1 and 9, any calculation that treats the placeholder as a real measurement will be skewed. Unless later steps recognize the placeholder and exclude it, the insertion only makes the data more inconsistent and worsens the situation.
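The effect of a placeholder on a simple statistic can be sketched as follows. The column values and the choice of `-1` as `SENTINEL` are assumptions for illustration only.

```python
from statistics import mean

SENTINEL = -1  # assumed placeholder, chosen outside the valid range 100-900

values = [100, 250, None, 900, 400]  # hypothetical column, None = missing
filled = [SENTINEL if v is None else v for v in values]

# Treating the placeholder as real data drags the average down.
naive_mean = mean(filled)

# Excluding the placeholder recovers the average of the observed values.
aware_mean = mean(v for v in filled if v != SENTINEL)

print(naive_mean)  # 329.8
print(aware_mean)  # 412.5
```

The placeholder is only harmless when every later calculation knows to skip it.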
3) Taking the average and putting it in place of the missing values
This can be a good option for handling missing values. The average will usually be closer to the actual value than an arbitrary placeholder. But suppose we have 10 rows, where 9 rows have values between 1 and 9 and one row has an extreme value such as 100. That single large value contributes heavily to the average, so the value we fill in, although better than a random placeholder, can still be far from typical for the column.
A solution to this is to use the median instead: sort the values and take the middle one, or, when the number of values is even, take the average of the two middle values. The median is much less affected by a few extreme values than the average is.
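The outlier example above can be worked through in a short sketch. The helper `impute` and the concrete values (1 through 9 plus 100, with one missing entry) are assumptions chosen to match the description.

```python
from statistics import mean, median

# Nine values between 1 and 9, one outlier, and one missing entry (None).
column = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100, None]


def impute(column, strategy):
    """Fill missing entries with strategy(observed values)."""
    observed = [v for v in column if v is not None]
    fill = strategy(observed)
    return [fill if v is None else v for v in column]


mean_filled = impute(column, mean)
median_filled = impute(column, median)

print(mean_filled[-1])    # 14.5, pulled upward by the outlier 100
print(median_filled[-1])  # 5.5, the average of the middle values 5 and 6
```

With ten observed values the median is the average of the fifth and sixth sorted values, which is why it stays near the bulk of the data.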
4) Placing a value from an adjacent field
This method can work when the column has only a limited set of possible values, as in a classification problem where only 0 and 1 are possible. On the other hand, if the adjacent values are abruptly large or small compared with the rest of the column, the copied value can be far from the true one, and the method fails.
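One common way to take a value from an adjacent location is to copy the previous row's value (often called forward fill). The sketch below assumes that interpretation; the function `forward_fill` and both sample columns are hypothetical.

```python
def forward_fill(values):
    """Replace each None with the most recent non-missing value.

    Leading None entries stay None because nothing precedes them.
    """
    filled = []
    last = None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Works reasonably for a column with few possible values.
labels = [0, None, 1, 1, None, 0]
print(forward_fill(labels))    # [0, 0, 1, 1, 1, 0]

# Risky when a neighbor is an abrupt spike: 500 is copied forward.
readings = [10, None, 500, None, 12]
print(forward_fill(readings))  # [10, 10, 500, 500, 12]
```

In the second column the spike of 500 is duplicated, which is exactly the failure case described above.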
Conclusion:
We should choose among these methods according to the size of the dataset, the values that are possible in the field, and how much the chosen method can be expected to change the results.
I hope you like it!!