Question

hadoop command below:

hadoop jar ~/hadoop-2.6.4/hadoop-streaming-2.6.4.jar -D stream.map.output.field.separator="sed -ie s/|/~/g" -input /user/FinalData/part.sample.tbl -output /user/FinalData/output

part.sample.tbl= dataset {198|khaki papaya|MFGR#3|MFGR#33|MFGR#338|orange|PROMO BRUSHED NICKEL|43|SM PACK|
16406 199|slate lace|MFGR#1|MFGR#13|MFGR#136|royal|ECONOMY PLATED STEEL|23|JUMBO DRUM|
16487 200|mint navajo|MFGR#2|MFGR#22|MFGR#2227|burnished|MEDIUM POLISHED BRASS|22|LG PKG|}

After running the above Hadoop command, the file should use the tilde (~) as its delimiter:

198~khaki papaya~MFGR#3~MFGR#33~MFGR#338~orange~PROMO BRUSHED NICKEL~43~SM PACK~

I am not getting the tilde (~) in my file. Again, the above command is a Hadoop command.

Answer #1

According to the Apache Hadoop streaming documentation, when the Map/Reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. How does it decide where to split the line? That is what stream.map.output.field.separator dictates.

Example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -input myInputDirs \
    -output myOutputDir

Here, "-D stream.map.output.field.separator=." specifies "." as the field separator, and "-D stream.num.map.output.key.fields=4" specifies that the prefix up to the fourth "." in a line will be the key and the rest of the line (excluding the fourth ".") will be the value. For example, if the mapper outputs "alpha.beta.gamma.delta.theta", the key would be alpha.beta.gamma.delta and the value would be theta.
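
The split described above can be tried locally with a small shell sketch. This only emulates the behaviour with cut; it is not Hadoop code, and split_key_value is a hypothetical helper name introduced for illustration:

```shell
#!/usr/bin/env bash
# Local emulation only, NOT Hadoop code: divide a line into key and value
# at the Nth occurrence of a separator, as the two properties describe.
# split_key_value is a hypothetical helper name.
split_key_value() {
  local line="$1" sep="$2" n="$3"
  local key value
  # Key: fields 1..N, re-joined with the separator.
  key=$(printf '%s\n' "$line" | cut -d"$sep" -f1-"$n")
  # Value: everything after the Nth separator. -s keeps the value empty
  # when the line contains no separator at all.
  value=$(printf '%s\n' "$line" | cut -s -d"$sep" -f"$((n + 1))"-)
  printf 'key=%s value=%s\n' "$key" "$value"
}

split_key_value "alpha.beta.gamma.delta.theta" "." 4
# prints: key=alpha.beta.gamma.delta value=theta
```

In this sketch, a line with fewer than four separators ends up entirely in the key with an empty value.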

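So stream.map.output.field.separator cannot hold a sed command: it takes a literal separator string, and it only controls how each line of mapper output is split into key and value. It does not rewrite the data. What you can do instead is create a mapper-only job in which the mapper runs sed to do the replacement you want.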

So the job would be something like below:

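hadoop jar ~/hadoop-2.6.4/hadoop-streaming-2.6.4.jar -D mapred.reduce.tasks=0 -input /user/FinalData/part.sample.tbl -output /user/FinalData/output -mapper "sed -e s/|/~/g"

Here mapred.reduce.tasks=0 makes it a mapper-only job (no reduce step is required). Note that sed is run with -e rather than -ie: a streaming mapper reads its input on stdin and writes to stdout, so there is no file for -i to edit in place.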
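
Before submitting the job, the substitution s/|/~/g can be checked locally by piping one sample row through sed on stdin, which is roughly how a streaming mapper receives its input. This is a minimal sketch run outside Hadoop; to_tilde is a hypothetical wrapper name used only here:

```shell
#!/usr/bin/env bash
# Local sanity check, outside Hadoop. to_tilde is a hypothetical wrapper
# around the sed expression; it reads stdin and writes stdout.
to_tilde() {
  # In sed's basic regular expressions an unescaped | is a literal
  # character, so every pipe becomes a tilde.
  sed -e 's/|/~/g'
}

printf '%s\n' '198|khaki papaya|MFGR#3|MFGR#33|MFGR#338|orange|PROMO BRUSHED NICKEL|43|SM PACK|' | to_tilde
# prints: 198~khaki papaya~MFGR#3~MFGR#33~MFGR#338~orange~PROMO BRUSHED NICKEL~43~SM PACK~
```

If the local output looks right but the job output does not, the problem is more likely in how the job is submitted than in the sed expression itself.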

Hope this answers your question. Feel free to comment if you have any doubts or if the command still doesn't produce the required results and I'll be pleased to help! Also, it would be better if you could post the output of the command as well, so it is easier to debug :)
