The Hadoop command I am running is below:
hadoop jar ~/hadoop-2.6.4/hadoop-streaming-2.6.4.jar -D stream.map.output.field.separator="sed -ie s/|/~/g" -input /user/FinalData/part.sample.tbl -output /user/FinalData/output
part.sample.tbl dataset:
{198|khaki papaya|MFGR#3|MFGR#33|MFGR#338|orange|PROMO BRUSHED NICKEL|43|SM PACK|
16406 199|slate lace|MFGR#1|MFGR#13|MFGR#136|royal|ECONOMY PLATED STEEL|23|JUMBO DRUM|
16487 200|mint navajo|MFGR#2|MFGR#22|MFGR#2227|burnished|MEDIUM POLISHED BRASS|22|LG PKG|}
After running the above Hadoop command, the delimiter in the file should change to a tilde (~), like this:
198~khaki papaya~MFGR#3~MFGR#33~MFGR#338~orange~PROMO BRUSHED NICKEL~43~SM PACK~
However, I am not getting the tilde (~) in my output file. Again, the above command is a Hadoop command.
According to the Apache Hadoop streaming documentation, when the Map/Reduce framework reads a line from the stdout of the mapper, it splits the line into a key/value pair. How does it decide where to split the line? That is what stream.map.output.field.separator controls.
Example,
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-input myInputDirs \
-output myOutputDir
Here, "-D stream.map.output.field.separator=." specifies "." as the field separator, and "-D stream.num.map.output.key.fields=4" specifies that the prefix up to the fourth "." in a line will be the key and the rest of the line (excluding the fourth ".") will be the value. For example, if the mapper outputs "alpha.beta.gamma.delta.theta", the key would be alpha.beta.gamma.delta and the value would be theta.
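To make the split concrete, here is a small shell sketch that mimics it with cut. This is only an illustration of the rule described above, not what Hadoop runs internally; the variable names (line, sep, num_key_fields) are made up for the example.

```shell
# Illustration only: mimic the streaming key/value split with cut.
# This is not Hadoop code; the variable names are hypothetical.
line="alpha.beta.gamma.delta.theta"
sep="."
num_key_fields=4

# Key: everything up to (not including) the fourth separator.
key=$(printf '%s\n' "$line" | cut -d"$sep" -f1-"$num_key_fields")
# Value: everything after the fourth separator.
value=$(printf '%s\n' "$line" | cut -d"$sep" -f"$((num_key_fields + 1))"-)

echo "key=$key"
echo "value=$value"
```

Note that the separator here is a plain string that tells the framework where to cut a line. It is never executed, which is why putting a command in that property has no effect on the data.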
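So stream.map.output.field.separator only accepts a literal separator string. It does not run a command, so passing a sed expression there will not rewrite the delimiters in your data. What you can do instead is create a mapper-only job in which the mapper itself is the sed command that does the replacement!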
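So the job would be something like the one below. Note that it uses -e rather than -ie: the -i option asks sed to edit files in place, but a streaming mapper receives its records on stdin, so there is no file to edit and sed would most likely fail.
hadoop jar ~/hadoop-2.6.4/hadoop-streaming-2.6.4.jar -D mapred.reduce.tasks=0 -input /user/FinalData/part.sample.tbl -output /user/FinalData/output -mapper "sed -e s/|/~/g"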
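Here mapred.reduce.tasks=0 makes it a mapper-only job (no reduce phase). In Hadoop 2.x this property is deprecated in favor of mapreduce.job.reduces, but the old name is still accepted.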
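Before submitting the job, it may be worth checking the substitution locally on one record. The sketch below pipes the first sample row from the question through sed the same way a streaming mapper would get it on stdin; the variable names sample and converted are just for this example.

```shell
# Local sanity check of the mapper command, no cluster needed.
# In sed's default (basic) regex syntax, | is a literal character.
sample='198|khaki papaya|MFGR#3|MFGR#33|MFGR#338|orange|PROMO BRUSHED NICKEL|43|SM PACK|'

converted=$(printf '%s\n' "$sample" | sed -e 's/|/~/g')
echo "$converted"
```

If this prints the row with every | replaced by ~, the same sed expression should behave the same way as the mapper. Since this is a one-character swap, tr '|' '~' would be an equivalent alternative.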
Hope this answers your question. Feel free to comment if you have any doubts or if the command still doesn't produce the required results and I'll be pleased to help! Also, it would be better if you could post the output of the command as well, so it is easier to debug :)