Question

In this Module 2 Discussion, we discuss how to use R to obtain information by exploring, cleaning, and preprocessing data. The following is a checklist of frequent steps in data preparation; more precisely, they are typical steps in “cleansing” data. Such steps include (at least):

No.  Steps                                                                                   R functions
1    Loading and looking at the dataset in R
2    Identify missing values
3    Identify outliers
4    Check for overall plausibility and errors (e.g., typos)
5    Identify highly correlated variables
6    Identify variables with (nearly) no variance
7    Identify variables with strange names or values
8    Check variable classes (e.g., characters vs. factors)
9    Remove/transform some variables (maybe your model does not like categorical variables)
10   Rename some variables or values (especially interesting if large number)
11   Check some overall pattern (statistical/numerical summaries/graphical illustrations)
12   Center/scale variables

In view of the above steps, please scan through the three examples (Examples 1, 2, and 3) in Chapter 2 of Data Mining and Business Analytics with R and Section 2.4 of Data Mining for Business Analytics: Concepts, Techniques, and Applications in R (found in this week's Reading & Resources), and fill in the blanks in the above table with the R functions that handle each step. For example, you may put read.csv() and View() in the first row, as they are ways to carry out that step. You may also consult open resources to find relevant R functions; each blank can hold multiple R functions.

Answer #1

1. Loading and looking at the dataset in R

# Load and take a first look at the txhousing dataset that ships with the ggplot2 package
# Load the library
library(ggplot2)
# Load the dataset
data(txhousing)
# Print the structure of the dataset (column names, classes, first values)
str(txhousing)
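
The question mentions read.csv() for this step, so here is a minimal sketch of loading a CSV file and inspecting it with a few base R functions. The temporary file and the data frame `df` with its columns `id` and `value` are made up for illustration; in practice you would pass the path of your own file to read.csv().

```r
# Write a small hypothetical CSV to a temporary file so the example is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(2.5, 3.1, 4.8)), tmp, row.names = FALSE)

# Load the file into a data frame
df <- read.csv(tmp)

# Common first-look functions
str(df)          # structure: column names and classes
head(df, 2)      # first two rows
summary(df)      # per-column numerical summary
dims <- dim(df)  # number of rows and columns
```

In an interactive session, View(df) opens the same data in a spreadsheet-style viewer.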

2. Identify missing values

# Gives the number of all missing values in the dataset
sum(is.na(txhousing))
# Gives the number of all missing values within individual columns for the dataset
colSums(is.na(txhousing))
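
Beyond raw counts, it can help to see the share of missing values per column and which rows are fully observed. The sketch below uses a small made-up data frame (`df` and its columns are hypothetical); the same calls should work on txhousing.

```r
# Hypothetical toy data with some missing values
df <- data.frame(
  sales  = c(10, NA, 30, 40),
  volume = c(100, 200, NA, 400),
  city   = c("A", "B", "C", "D")
)

# Share of missing values per column
na_share <- colMeans(is.na(df))

# Logical vector marking rows without any missing value
complete_rows <- complete.cases(df)

# Data frame restricted to the complete rows
df_complete <- na.omit(df)
```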

3. Identify outliers

# Outlier detection for the column txhousing$sales using a ggplot2 boxplot
ggplot(txhousing) +
  aes(x = "", y = sales) +
  geom_boxplot(fill = "#0c4c8a")
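
The boxplot shows outliers visually; to get them as values, base R offers boxplot.stats(). The following is a rough sketch on a made-up vector `x` (hypothetical data), together with a hand-written version of the 1.5 * IQR rule. Note that boxplot.stats() works from the boxplot hinges while quantile() uses a slightly different definition, so the two approaches can disagree on borderline points.

```r
# Hypothetical numeric vector with one obvious extreme value
x <- c(12, 15, 14, 13, 16, 15, 14, 200)

# boxplot.stats() returns the points beyond the whiskers in $out
outliers <- boxplot.stats(x)$out

# A similar rule by hand: flag values more than 1.5 * IQR outside the quartiles
q <- quantile(x, c(0.25, 0.75))
fence <- 1.5 * IQR(x)
is_outlier <- x < q[1] - fence | x > q[2] + fence
```

For a data frame column, the same idea would be applied to, for example, txhousing$sales after deciding how to treat its missing values.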

8. Check variable classes (e.g., characters vs. factors)

sapply(txhousing, class)
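
Once the classes are known, character columns are often converted to factors before modelling. A minimal sketch, assuming a made-up data frame `df` (its columns are hypothetical):

```r
# Hypothetical data frame with one character and one numeric column
df <- data.frame(
  city  = c("Austin", "Dallas", "Austin"),
  sales = c(10, 20, 30),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each of them to a factor
is_chr <- sapply(df, is.character)
df[is_chr] <- lapply(df[is_chr], factor)

# Re-check the classes after the conversion
classes <- sapply(df, class)
```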

9. Remove/transform some variables (maybe your model does not like categorical variables)

# Drop columns 2 and 3 (year and month) by position
txhousing <- txhousing[, -(2:3), drop = FALSE]
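
Dropping by position breaks silently if the column order changes, and removal is not the only option for categorical variables. The sketch below, on a made-up data frame `df` (hypothetical columns), drops a column by name and expands a factor into 0/1 indicator columns with model.matrix().

```r
# Hypothetical data frame with one factor and one numeric column
df <- data.frame(
  city  = factor(c("Austin", "Dallas", "Houston")),
  sales = c(10, 20, 30)
)

# Drop a column by name rather than by position
df_no_sales <- df[, setdiff(names(df), "sales"), drop = FALSE]

# Expand the factor into indicator columns; "- 1" removes the intercept
# so that every level gets its own column
dummies <- model.matrix(~ city - 1, data = df)
```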

10. Rename some variables or values (especially interesting if large number)

library(questionr)
# rename.variable() returns a modified copy, so assign the result back
txhousing <- rename.variable(txhousing, "listings", "listing")
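
If you prefer not to load an extra package, a column can also be renamed with base R by assigning to names(). A short sketch on a made-up data frame `df` (hypothetical columns):

```r
# Hypothetical data frame with a column to rename
df <- data.frame(listings = c(5, 7), sales = c(1, 2))

# Replace the name of the matching column in place
names(df)[names(df) == "listings"] <- "listing"
```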