Question

Robust data loading poses a challenge in database systems because the input data are often dirty. In many cases, an input record may be missing multiple values; some records could be contaminated, with some data values out of range or of a different data type than expected. Work out an automated data cleaning and loading algorithm so that erroneous data will be marked and contaminated data will not be mistakenly inserted into the database during data loading.

Answer #1

begin
    for each record r
    begin
        check r for missing values
            If possible, fill in missing values according to domain knowledge
            (e.g., mean, mode, most likely value, etc.).
        check r for out-of-range values
            If possible, correct out-of-range values according to domain knowledge
            (e.g., min or max value for the attribute).
        check r for erroneous data types
            If possible, correct the data type using domain knowledge.
        If r could not be corrected, mark it as bad and output it to a log;
        otherwise, load r into the database.
    end
end
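
The pseudocode above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a definitive implementation: the `FieldRule` class, the `clean_record` and `load` functions, and the example attributes (`age`, `city`) are all hypothetical names introduced here, and a plain list stands in for the database. The type check runs before the range check, since a value has to be converted before it can be compared against the bounds.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class FieldRule:
    """Hypothetical per-attribute domain knowledge (not from the original answer)."""
    cast: Callable[[Any], Any]      # expected data type, e.g. int or str
    low: Optional[float] = None     # smallest valid value, if any
    high: Optional[float] = None    # largest valid value, if any
    default: Any = None             # fill-in for missing values; None means "cannot fill"


def clean_record(record, rules):
    """Return (cleaned_record, problems, ok) for one input record."""
    cleaned, problems, ok = {}, [], True
    for name, rule in rules.items():
        value = record.get(name)
        # Step 1: missing value -> fill in from domain knowledge if possible.
        if value is None or value == "":
            if rule.default is None:
                problems.append(f"{name}: missing and no fill-in value available")
                ok = False
                continue
            problems.append(f"{name}: missing, filled with {rule.default!r}")
            value = rule.default
        # Step 2: wrong data type -> try to convert, otherwise reject.
        try:
            value = rule.cast(value)
        except (TypeError, ValueError):
            problems.append(f"{name}: cannot convert {value!r} to expected type")
            ok = False
            continue
        # Step 3: out of range -> clamp to the min or max for the attribute.
        if rule.low is not None and value < rule.low:
            problems.append(f"{name}: {value!r} below minimum, set to {rule.low!r}")
            value = rule.low
        if rule.high is not None and value > rule.high:
            problems.append(f"{name}: {value!r} above maximum, set to {rule.high!r}")
            value = rule.high
        cleaned[name] = value
    return cleaned, problems, ok


def load(records, rules):
    """Split records into those safe to load and those marked as bad."""
    loaded, rejected = [], []
    for record in records:
        cleaned, problems, ok = clean_record(record, rules)
        if ok:
            loaded.append(cleaned)                # stand-in for a database insert
        else:
            rejected.append((record, problems))   # stand-in for the error log
    return loaded, rejected


# Example usage with made-up data.
rules = {
    "age": FieldRule(cast=int, low=0, high=120),
    "city": FieldRule(cast=str, default="unknown"),
}
records = [
    {"age": "34", "city": "Oslo"},    # clean
    {"age": "abc", "city": "Rome"},   # wrong type, cannot be corrected
    {"age": "150", "city": ""},       # out of range and missing, both correctable
]
loaded, rejected = load(records, rules)
```

In this sketch, a record is rejected as a whole as soon as any one attribute cannot be corrected, so a partially cleaned record never reaches the database.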

The domain knowledge can come from a combination of manual and automatic work. We can, for example, use the data already in the database to construct a decision tree that induces missing values for a given attribute, and at the same time rely on human-entered rules for correcting wrong data types.
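
As a rough illustration of combining the two sources of domain knowledge, the sketch below uses a one-level decision tree (a decision stump) as a simplified stand-in for a full decision tree: it predicts a missing attribute from the most common value seen among existing rows that share one other attribute. The function names `build_stump` and `impute`, the `TYPE_FIXES` table, and the example attributes are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_stump(rows, predictor, target):
    """Learn, for each predictor value, the most common target value.

    This is a one-level decision tree, used here as a simplified
    stand-in for a full decision tree learner.
    """
    counts = defaultdict(Counter)
    overall = Counter()
    for row in rows:
        if row.get(predictor) is not None and row.get(target) is not None:
            counts[row[predictor]][row[target]] += 1
            overall[row[target]] += 1
    table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    fallback = overall.most_common(1)[0][0] if overall else None
    return table, fallback


def impute(row, predictor, target, table, fallback):
    """Return a copy of row with a missing target filled in from the stump."""
    if row.get(target) is None:
        guess = table.get(row.get(predictor), fallback)
        return {**row, target: guess}
    return row


# Human-entered rules for correcting wrong data types (hypothetical examples).
TYPE_FIXES = {
    "salary": lambda s: float(str(s).replace("$", "").replace(",", "")),
    "age": lambda s: int(str(s).strip()),
}

# Example usage with made-up rows standing in for data already in the database.
existing = [
    {"dept": "sales", "grade": "B"},
    {"dept": "sales", "grade": "B"},
    {"dept": "sales", "grade": "A"},
    {"dept": "it", "grade": "A"},
    {"dept": "it", "grade": "A"},
]
table, fallback = build_stump(existing, "dept", "grade")
filled = impute({"dept": "sales", "grade": None}, "dept", "grade", table, fallback)
```

A real system would likely use more than one predictor attribute and would record which values were imputed, so that filled-in data can be told apart from observed data later.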
