Question:
I have some files that are being uploaded to S3 and processed for some Redshift task. After that task is complete these files need to be merged. Currently I am deleting these files and uploading merged files again.
This eats up a lot of bandwidth. Is there any way to merge the files directly on S3?
I am using Apache Camel for routing.
Answer:
S3 allows you to use an S3 object URI as the source for a copy operation. Combined with S3’s Multi-Part Upload API, you can supply several S3 object URIs as the source keys for a multi-part upload.
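As a rough illustration of the idea (not my production code), here is a minimal Python sketch. It assumes `s3` is a boto3 S3 client created with `boto3.client("s3")`; the function name `concat_objects` and the bucket and key arguments are made up for the example. Each source object is copied server-side as one part, so no object data passes through your machine.

```python
def concat_objects(s3, bucket, source_keys, dest_key):
    """Concatenate S3 objects server-side using multipart part copies.

    Assumption: `s3` is a boto3 S3 client. Every source except the last
    must satisfy S3's minimum part size, or the upload will fail.
    """
    upload = s3.create_multipart_upload(Bucket=bucket, Key=dest_key)
    upload_id = upload["UploadId"]
    parts = []
    try:
        for number, key in enumerate(source_keys, start=1):
            # Copy an existing object in as part `number`; nothing is downloaded.
            result = s3.upload_part_copy(
                Bucket=bucket,
                Key=dest_key,
                UploadId=upload_id,
                PartNumber=number,
                CopySource={"Bucket": bucket, "Key": key},
            )
            parts.append(
                {"ETag": result["CopyPartResult"]["ETag"], "PartNumber": number}
            )
        return s3.complete_multipart_upload(
            Bucket=bucket,
            Key=dest_key,
            UploadId=upload_id,
            MultipartUpload={"Parts": parts},
        )
    except Exception:
        # Abort so the unfinished parts do not linger (and get billed).
        s3.abort_multipart_upload(Bucket=bucket, Key=dest_key, UploadId=upload_id)
        raise
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real bucket.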
However, the devil is in the details. S3’s multi-part upload API has a minimum part size of 5MB. Thus, if any file in the series of files being concatenated is < 5MB, it will fail.
However, you can work around this by exploiting the loophole which allows the final upload piece to be < 5MB (allowed because this happens in the real world when uploading remainder pieces).
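To make the rule concrete, here is a small Python sketch that flags which files in a manifest would break a straight concatenation. The helper name `undersized_parts` is hypothetical, and sizes are assumed to be in bytes.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last


def undersized_parts(sizes):
    """Return the indexes of non-final files that are below the 5MB minimum.

    The final file is exempt, which is the loophole described above.
    """
    return [i for i, size in enumerate(sizes[:-1]) if size < MIN_PART_SIZE]
```

An empty result means the files can be concatenated in a single multi-part upload as they are.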
My production code does this by:
- Interrogating the manifest of files to be uploaded
- If first part is under 5MB, download pieces* and buffer to disk until 5MB is buffered.
- Append parts sequentially until file concatenation complete
- If a non-terminus file is < 5MB, append it, then finish the upload and create a new upload and continue.
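Here is one possible reading of the steps above as a planning function, sketched in Python. It is an interpretation, not the production code: the name `split_into_uploads` is made up, sizes are in bytes, and it assumes that each finished upload becomes the first part of the next one. The undersized-first-file case is left to the buffering step and is simply rejected here.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last


def split_into_uploads(sizes):
    """Group file indexes into consecutive multi-part uploads.

    A file under 5MB that is not the last file ends the current upload,
    where it is legal as the final part. Assumption: the object produced
    by each upload is then used as part 1 of the following upload.
    """
    if len(sizes) > 1 and sizes[0] < MIN_PART_SIZE:
        raise ValueError("first file is under 5MB; buffer it up to 5MB first")
    groups, current = [], []
    for index, size in enumerate(sizes):
        current.append(index)
        is_last_file = index == len(sizes) - 1
        if size < MIN_PART_SIZE and not is_last_file:
            groups.append(current)  # small file closes this upload
            current = []
    if current:
        groups.append(current)
    return groups
```

For example, with sizes of 6MB, 1MB, 7MB, 2MB and 8MB the plan is three uploads: files 0 and 1, then files 2 and 3, then file 4.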
Finally, there is a bug in the S3 API. The ETag (which is really an MD5 file checksum on S3) is not properly recalculated at the completion of a multi-part upload. To fix this, copy the file on completion. If you use a temp location during concatenation, this will be resolved on the final copy operation.
* Note that you can download a byte range of a file. This way, if part 1 is 10K, and part 2 is 5GB, you only need to read in 5110K to meet the 5MB size needed to continue.
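The arithmetic behind that footnote, as a short Python sketch. The helper name `top_up_range` is hypothetical. The string it returns is in the standard HTTP `Range` format, which is what boto3's `get_object` accepts in its `Range` parameter.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # 5MB minimum for a non-final part


def top_up_range(first_size):
    """Range header for the bytes to fetch from the next file.

    Returns None when the first piece already meets the 5MB minimum.
    Byte ranges are zero-based and inclusive.
    """
    needed = MIN_PART_SIZE - first_size
    if needed <= 0:
        return None
    return "bytes=0-%d" % (needed - 1)
```

With a 10K first part, this asks for the first 5110K of the next file and nothing more.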
** You could also have a 5MB block of zeros on S3 and use it as your default starting piece. Then, when the upload is complete, do a file copy using a byte range of 5MB+1 to EOF-1.
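A sketch of how that byte range could be computed in Python, assuming the padding block is exactly 5MB and that the copy is done with a part copy (boto3's `upload_part_copy` takes such a string in its `CopySourceRange` parameter). The helper name `payload_range` is made up. Ranges are zero-based and inclusive, so the first real byte sits at offset 5242880 and the last at size - 1.

```python
PADDING_SIZE = 5 * 1024 * 1024  # assumed size of the zero-filled starting piece


def payload_range(total_size):
    """Byte range that skips the leading padding and keeps everything after it."""
    if total_size <= PADDING_SIZE:
        raise ValueError("object contains nothing beyond the padding block")
    return "bytes=%d-%d" % (PADDING_SIZE, total_size - 1)
```

Keep in mind that a single part copy is limited to 5GB, so a larger result would need the range split across several parts.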
P.S. When I have time to make a Gist of this code I’ll post the link here.