Question:
I’ve split a big binary file into 2 GB chunks and uploaded them to Amazon S3.
Now I want to join it back into one file and process it with my custom
I’ve tried to run
    elastic-mapreduce -j $JOBID -ssh \
      "hadoop dfs -cat s3n://bucket/dir/in/* > s3n://bucket/dir/outfile"
but it failed because -cat writes its output to my local terminal – it does not work remotely…
How can I do this?
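As far as I understand, the > redirection is evaluated by my local shell, so the joined data never gets back to S3. An untested sketch of keeping everything on the cluster side would be to pipe cat into put on the master node (paths here are just placeholders):

    hadoop dfs -cat s3n://bucket/dir/in/* | hadoop dfs -put - s3n://bucket/dir/outfile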
P.S. I’ve tried to run cat as a streaming MR job:
    den@aws:~$ elastic-mapreduce --create --stream --input s3n://bucket/dir/in \
                 --output s3n://bucket/dir/out --mapper /bin/cat --reducer NONE
This job finished successfully. But: I had 3 file parts in dir/in – now I have 6 parts in /dir/out:
    part-0000
    part-0001
    part-0002
    part-0003
    part-0004
    part-0005
And of course the _SUCCESS file, which is not part of my output…
So: how do I join the file I split back together?
Answer:
So, I’ve found a solution. Maybe not the best one – but it works.
I’ve created an EMR job flow with a bootstrap action:
    --bootstrap-action joinfiles.sh
In that joinfiles.sh I download my file pieces from S3 using wget and join them with a regular cat a b c > abc.
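A sketch of what such a joinfiles.sh can look like (the bucket, part names and working directory below are placeholders, not my real ones, and wget assumes the parts are publicly readable or reachable via pre-signed URLs):

    #!/bin/bash
    # joinfiles.sh - hypothetical sketch; adjust bucket, part names and paths
    set -e

    WORKDIR=/mnt/joinfiles            # /mnt usually has room for large files
    mkdir -p "$WORKDIR"
    cd "$WORKDIR"

    # Download the pieces from S3 (they must be reachable by plain wget,
    # i.e. public objects or pre-signed URLs)
    wget https://bucket.s3.amazonaws.com/dir/in/a
    wget https://bucket.s3.amazonaws.com/dir/in/b
    wget https://bucket.s3.amazonaws.com/dir/in/c

    # Join the pieces in the original order into a single file
    cat a b c > abc

    # The joined file still has to end up somewhere the follow-up s3distcp
    # step can read it from (for example HDFS, once the cluster is up)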
After that I added an s3distcp step which copied the result back to S3 (a sample can be found at https://stackoverflow.com/a/12302277/658346).
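The copy-back step is just an s3distcp invocation added to the same job flow; a rough sketch with the old elastic-mapreduce CLI (the jar location and the source path are assumptions – see the linked answer for working parameters):

    elastic-mapreduce --jobflow $JOBID \
      --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
      --args '--src,hdfs:///joined/,--dest,s3n://bucket/dir/out/'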
That is all.