Question:
I have a use case where I upload hundreds of files to my S3 bucket using multipart upload. After each upload I need to make sure that the uploaded file is not corrupt (basically a data-integrity check). Currently, after uploading a file, I re-download it, compute the MD5 of the content string, and compare it with the MD5 of the local file. Something like:
```python
import hashlib
import math
import os

from boto.s3.connection import S3Connection
from boto.utils import compute_md5
from filechunkio import FileChunkIO

conn = S3Connection('access key', 'secretkey')
bucket = conn.get_bucket('bucket_name')

source_path = 'file_to_upload'
source_size = os.stat(source_path).st_size
key_name = os.path.basename(source_path)

mp = bucket.initiate_multipart_upload(key_name)
chunk_size = 52428800  # 50 MB parts
chunk_count = int(math.ceil(source_size / float(chunk_size)))

for i in range(chunk_count):
    offset = chunk_size * i
    bytes = min(chunk_size, source_size - offset)
    with FileChunkIO(source_path, 'r', offset=offset, bytes=bytes) as fp:
        hex_md5, b64_md5, _ = compute_md5(fp, size=bytes)
        mp.upload_part_from_file(fp, part_num=i + 1, md5=(hex_md5, b64_md5))
mp.complete_upload()

obj_key = bucket.get_key(key_name)
print(obj_key.md5)        # prints None
print(obj_key.base64md5)  # prints None

# Wasteful integrity check: re-download the whole object and compare MD5s.
content = bucket.get_key(key_name).get_contents_as_string()
remote_md5 = hashlib.md5(content).hexdigest()
with open(source_path, 'rb') as f:
    local_md5 = hashlib.md5(f.read()).hexdigest()
print(remote_md5 == local_md5)
```
This approach is wasteful, as it doubles the bandwidth usage. I tried:
```python
bucket.get_key('file_name').md5
bucket.get_key('file_name').base64md5
```
but both return None.
Is there any other way to get the MD5 without downloading the whole object?
Answer:
Yes. Use:

```python
bucket.get_key('file_name').etag[1:-1]
```

This returns the key's ETag (with its surrounding double quotes stripped) without downloading its contents. For an object uploaded in a single PUT, the ETag is simply the hex MD5 of its data. For a multipart upload, though, the ETag is not the MD5 of the whole file: S3 sets it to the MD5 of the concatenated binary MD5 digests of the individual parts, followed by `-N`, where N is the part count, so you have to compute that same value locally to compare.
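For the multipart case, here is a minimal sketch of that comparison. It assumes the usual multipart ETag format described above, that every part except possibly the last was uploaded with the same `chunk_size`, and that the object is not encrypted with SSE-KMS or SSE-C (which produce non-MD5 ETags); `multipart_etag` is a hypothetical helper written for this answer, not a boto API:

```python
import hashlib
import os

from boto.s3.connection import S3Connection


def multipart_etag(path, chunk_size):
    """Compute the ETag S3 reports for a file uploaded in chunk_size parts."""
    digests = []
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digests.append(hashlib.md5(chunk).digest())
    # Multipart ETag: MD5 of the concatenated binary part digests, then "-<parts>".
    combined = hashlib.md5(b''.join(digests)).hexdigest()
    return '%s-%d' % (combined, len(digests))


conn = S3Connection('access key', 'secretkey')
bucket = conn.get_bucket('bucket_name')

source_path = 'file_to_upload'
chunk_size = 52428800  # must match the part size used for the upload

remote_etag = bucket.get_key(os.path.basename(source_path)).etag[1:-1]
if remote_etag == multipart_etag(source_path, chunk_size):
    print('upload verified')
else:
    print('mismatch -- object may be corrupt')
```

Note that the part size is effectively baked into the checksum: the same file uploaded with a different `chunk_size` yields a different ETag, so the verifier must reuse the exact part size from the upload.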