Question

3) In a Hadoop environment, there are many capabilities which allow for Hadoop to be integrated as an integral part of a warehouse...

Answer #1

Solution:

a)

Hadoop is an ideal platform to run ETL. You can feed the results into a traditional data warehouse or, better yet, simply use Hadoop itself as your warehouse: two for the price of one. Ingesting data from all sources into a centralized Hadoop repository is also future-proof; as your business scales and the data grows rapidly, the Hadoop infrastructure can scale easily alongside it.

ETL process in Hadoop:

Here are the typical steps to use Hadoop for ETL:

1. Set up a Hadoop cluster.

2. Connect data sources.

3. Define the metadata.

4. Create the ETL jobs (a minimal sketch follows this list).

5. Create the workflow.
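As an illustration of step 4, here is a minimal PySpark sketch of an ETL job that reads raw CSV data from HDFS, applies a simple transformation, and writes the result back to HDFS as Parquet. The paths and column names are hypothetical placeholders, not part of the original question:

# Minimal ETL job sketch using PySpark (paths and columns are hypothetical)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw CSV data that was ingested into HDFS
raw = spark.read.option("header", True).csv("hdfs:///data/raw/sales/")

# Transform: clean the records and aggregate by day
cleaned = (raw
           .dropna(subset=["order_id"])
           .withColumn("amount", F.col("amount").cast("double")))
daily = cleaned.groupBy("order_date").agg(F.sum("amount").alias("total_amount"))

# Load: write the result back to HDFS in a columnar format
daily.write.mode("overwrite").parquet("hdfs:///data/warehouse/daily_sales/")

spark.stop()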

b)

For a highly available environment, data cleansing and transformations are easier to manage when multiple jobs cascade into a workflow, each performing a specific task. Often the data mappings/transformations need to be executed in a specific order, and/or there may be dependencies to check. These dependencies and sequences are captured in workflows; parallel flows allow parallel execution that can speed up the ETL process. Finally, the entire workflow needs to be scheduled; it may have to run weekly, nightly, or perhaps even hourly.

Although technologies such as Oozie provide some workflow management, they are often insufficient, so many organizations create their own workflow management tools. This can be a complex process, as it is important to handle failure scenarios and restart the workflow appropriately (a small illustration of dependency ordering and parallel branches follows).
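As a toy illustration (not an Oozie example), the Python sketch below shows the two ideas described above: independent jobs running in parallel, and a dependent job that must wait for both before it starts. The task names are hypothetical:

# Toy illustration of workflow dependencies and parallel branches
# (task names are hypothetical; real deployments use Oozie or similar tools)
from concurrent.futures import ThreadPoolExecutor

def run(task):
    print(f"running {task}")
    return task

with ThreadPoolExecutor(max_workers=2) as pool:
    # Two independent cleansing jobs can run in parallel
    f1 = pool.submit(run, "cleanse_orders")
    f2 = pool.submit(run, "cleanse_customers")
    f1.result()
    f2.result()  # wait for both branches before the dependent step

    # The join/aggregate step depends on both cleansing jobs finishing
    pool.submit(run, "join_and_aggregate").result()

    # Loading the warehouse table depends on the aggregate step
    pool.submit(run, "load_warehouse_table").result()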

c)

Structured data consists of clearly defined data types whose patterns make it easily searchable, while unstructured data ("everything else") consists of data that is usually not as easily searchable, including formats like audio, video, and social media postings.

"Unstructured vs. structured" does not denote a conflict between the two. Customers select one or the other not based on the data's structure, but on the applications that use it: relational databases for structured data, and most any other type of application for unstructured data.
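To make the distinction concrete, here is a small, purely illustrative Python sketch (the records and text are made up): a structured record has named, typed fields that can be queried directly, while an unstructured blob can only be scanned as free text:

# Structured: named, typed fields that can be filtered directly
orders = [
    {"order_id": 1, "customer": "acme", "amount": 120.50},
    {"order_id": 2, "customer": "globex", "amount": 75.00},
]
big_orders = [o for o in orders if o["amount"] > 100]

# Unstructured: a free-form blob with no predefined schema;
# without extra processing, the best we can do is a text search
review = "Loved the product, but delivery took two weeks..."
mentions_delivery = "delivery" in review.lower()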

d)

i. Click on the HDFS service, and under Quick Links choose "Replication". From the drop-down list choose HDFS Replication.

ii. Then fill in the replication form. You have to supply the source and destination clusters, which path to replicate (choose / for everything), what kind of schedule to set (run once now, run once in the future, or a recurring schedule), and when. The default user to run the replication task is hdfs, so it is best to just leave it that way.

iii. If you want to change the default values, you can go to the Resources tab, where you can set how many MapReduce jobs will run concurrently (the default is 20) and how they will pick their work.
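Cloudera Manager's HDFS replication feature is built on top of DistCp. As a rough command-line analogue of the same copy (not the Cloudera Manager workflow itself), the sketch below shells out to hadoop distcp; the NameNode addresses and paths are hypothetical placeholders:

# Rough command-line analogue of an HDFS replication job using DistCp
# (NameNode addresses and paths below are hypothetical placeholders)
import subprocess

source = "hdfs://source-nn:8020/data/warehouse"
destination = "hdfs://backup-nn:8020/data/warehouse"

# -update only copies files that changed; -m caps the number of map tasks
subprocess.run(
    ["hadoop", "distcp", "-update", "-m", "20", source, destination],
    check=True,
)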

e)

Do one of the following:

i. Select Backup > Replications.

ii. Click Schedule HDFS Replication.

or

i. Select Clusters > HDFS service name.

ii. Select Quick Links > Replication.

iii. Click Schedule HDFS Replication.

The Create Replication dialog box displays. Click the Source field and select the source HDFS service.

f)

It is easy to quickly get lost in the details when talking about information security. To minimize confusion, we will focus on three fundamental areas:

1. How data is encrypted or otherwise protected while it is in storage (at rest) and when it is moving across the network (in motion).

2. How systems and users are authenticated before they access data in the Hadoop infrastructure.

3. How access to different data is managed within the environment.

The Hadoop ecosystem has resources to support security; Knox and Ranger are two important Apache open source projects in this area.
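As a small, hedged illustration of the first two areas, the sketch below queries a cluster's effective Hadoop configuration for a few well-known security keys using the hdfs getconf command. It only reports what is configured and assumes the Hadoop client tools are on the PATH:

# Report a few well-known Hadoop security settings via "hdfs getconf"
# (assumes the Hadoop client is installed and on the PATH)
import subprocess

KEYS = [
    "hadoop.security.authentication",  # "kerberos" if strong authentication is on
    "hadoop.rpc.protection",           # "privacy" encrypts RPC traffic in motion
    "dfs.encrypt.data.transfer",       # "true" encrypts HDFS block data in motion
]

for key in KEYS:
    result = subprocess.run(
        ["hdfs", "getconf", "-confKey", key],
        capture_output=True, text=True,
    )
    value = result.stdout.strip() or "<not set>"
    print(f"{key} = {value}")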

g)

SQL-on-Hadoop is a class of analytical application tools that combines established SQL-style querying with newer Hadoop data framework elements.

The different means of executing SQL in a Hadoop environment can be divided into (1) connectors that translate SQL into a MapReduce format, (2) "push-down" systems that forgo batch-oriented MapReduce and execute SQL within Hadoop clusters, and (3) systems that apportion SQL work between MapReduce/HDFS clusters and raw HDFS clusters, depending on the workload.
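For a concrete flavor of SQL-on-Hadoop, here is a minimal Spark SQL sketch (one of several SQL-on-Hadoop engines; the HDFS path, view name, and columns are hypothetical) that exposes data stored in HDFS as a view and queries it with plain SQL:

# Minimal SQL-on-Hadoop flavor using Spark SQL
# (the HDFS path and column names are hypothetical placeholders)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-hadoop-sketch").getOrCreate()

# Expose Parquet files stored in HDFS as a temporary SQL view
spark.read.parquet("hdfs:///data/warehouse/daily_sales/") \
     .createOrReplaceTempView("daily_sales")

# Query the Hadoop-resident data with ordinary SQL
top_days = spark.sql("""
    SELECT order_date, total_amount
    FROM daily_sales
    ORDER BY total_amount DESC
    LIMIT 10
""")
top_days.show()

spark.stop()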

h)

Apache Hadoop is a comprehensive ecosystem which now features many open source components that can fundamentally change an enterprise's approach to storing, processing, and analyzing data.

Unlike traditional relational database management systems, Hadoop now enables different types of analytical workloads to run on the same set of data, and it can also manage data volumes at massive scale with advanced hardware and software. Many open source platforms are available as popular distributions of Hadoop.
