hadoop command below: hadoop jar ~/hadoop-2.6.4/hadoop-streaming-2.6.4.jar -D stream.map.output.field.separator="sed -ie s/|/~/g" -input /user/FinalData/part.sample.tbl -output /user/FinalData/output part.sample.tbl= dataset...

Question

hadoop command below:

hadoop jar ~/hadoop-2.6.4/hadoop-streaming-2.6.4.jar -D stream.map.output.field.separator="sed -ie s/|/~/g" -input /user/FinalData/part.sample.tbl -output /user/FinalData/output

part.sample.tbl= dataset {198|khaki papaya|MFGR#3|MFGR#33|MFGR#338|orange|PROMO BRUSHED NICKEL|43|SM PACK|
16406 199|slate lace|MFGR#1|MFGR#13|MFGR#136|royal|ECONOMY PLATED STEEL|23|JUMBO DRUM|
16487 200|mint navajo|MFGR#2|MFGR#22|MFGR#2227|burnished|MEDIUM POLISHED BRASS|22|LG PKG|}

After running the above hadoop command, the file should use a tilde (~) as its delimiter:

198~khaki papaya~MFGR#3~MFGR#33~MFGR#338~orange~PROMO BRUSHED NICKEL~43~SM PACK~

I am not getting the tilde (~) in my file. Again, the above command is a hadoop command.

Answer 1

According to the Apache Hadoop streaming documentation, when the Map/Reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. How does it decide where to split the line? That is what stream.map.output.field.separator dictates.

Example:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -input myInputDirs \
    -output myOutputDir

Here, "-D stream.map.output.field.separator=." specifies "." as the field separator, and "-D stream.num.map.output.key.fields=4" specifies that the prefix up to the fourth "." in a line will be the key and the rest of the line (excluding the fourth ".") will be the value. For example, if the mapper outputs "alpha.beta.gamma.delta.theta", the key would be "alpha.beta.gamma.delta" and the value would be "theta".
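
To make the splitting rule concrete, here is a small Python sketch of it. This is only an illustration of the behaviour described above, not Hadoop's actual code; the function name split_key_value is made up for this example, and treating a line with too few separators as "all key, empty value" is an assumption.

```python
def split_key_value(line, separator=".", num_key_fields=4):
    """Split one mapper output line into (key, value) at the Nth separator."""
    parts = line.split(separator)
    if len(parts) <= num_key_fields:
        # Assumption: with fewer than N separators, the whole line is the key.
        return line, ""
    key = separator.join(parts[:num_key_fields])
    value = separator.join(parts[num_key_fields:])
    return key, value


print(split_key_value("alpha.beta.gamma.delta.theta"))
# ('alpha.beta.gamma.delta', 'theta')
```

Note that the separator only decides where the key ends and the value begins. Nothing in this step rewrites the characters inside the line.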

So stream.map.output.field.separator expects a literal separator string. It is never executed as a command, so putting a sed expression there will not rewrite your data. What you can do instead is create a mapper-only job in which the mapper runs the sed command to achieve what you're trying to do!

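So the job would be something like the one below. Note that sed is called with -e rather than -ie: a streaming mapper reads its records from stdin and writes them to stdout, whereas -i asks sed to edit a file in place and there is no file argument here.

hadoop jar ~/hadoop-2.6.4/hadoop-streaming-2.6.4.jar -D mapred.reduce.tasks=0 -input /user/FinalData/part.sample.tbl -output /user/FinalData/output -mapper "sed -e s/|/~/g"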

Here mapred.reduce.tasks=0 makes it a mapper-only job (no reduce phase is run). In Hadoop 2.x the same setting is also available under the newer name mapreduce.job.reduces.
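
If you want to check the substitution locally before submitting the job, a rough Python equivalent of the sed mapper could look like the sketch below. The names to_tilde and run are hypothetical helpers for this example, and the sample line is the first row from the question.

```python
import io
import sys


def to_tilde(line):
    """Replace every '|' in one record with '~' (same effect as sed 's/|/~/g')."""
    return line.replace("|", "~")


def run(instream, outstream):
    """Copy records from instream to outstream with the delimiter swapped."""
    for line in instream:
        outstream.write(to_tilde(line))


# Local check on the first sample row from the question.
sample = io.StringIO(
    "198|khaki papaya|MFGR#3|MFGR#33|MFGR#338|orange|PROMO BRUSHED NICKEL|43|SM PACK|\n"
)
run(sample, sys.stdout)
```

To use something like this as the actual streaming mapper, the local check at the bottom would be replaced by run(sys.stdin, sys.stdout), since streaming feeds input records to the mapper on stdin.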

Hope this answers your question. Feel free to comment if you have any doubts or if the command still doesn't produce the required results and I'll be pleased to help! Also, it would be better if you could post the output of the command as well, so it is easier to debug :)
