Amazon S3 ETag Advanced Information

You are probably here because you looked at the ETag of one of your S3 objects, and it had a dash character (“-”) in it. Most of your other ETag values are simple, correct md5sum hashes. But this one is weird.

Or, you’re here because one of your S3 objects has an ETag ending in “-2”, and you’ve looked up multipart uploads, and you’ve seen the multipart documentation around “multipart_threshold” and “multipart_chunksize”, so you know that “-2” means the ETag was computed from two (2) chunks. But things are still not working out.

Or, you’re here because you know that “-2” means two (2) chunks, and you know the default chunk size is 8MB (8*1024*1024 bytes). Which is all super, except the object is 18MB in size – and 8+8 is only 16 – surely S3 is not throwing away chunks? What is going on here?

The TL;DR answer is – S3 uses both 8MB and 16MB as the “default” chunk size (and, I assume, 32MB, 64MB, etc. Once you break the rules, nothing stops you from doing it again.) As a concrete example – the object size was 17,325,568 bytes and the ETag was “c44bfa98b2c188777ed18cb9190e304b-2”. I used the aws cli (aws-cli/2.0.50 Python/3.7.3 Linux) for this upload, so it should have used 8MB chunks, which means the ETag should end in “-3”, not “-2”. Running the code (below) shows that 16MB chunks produce a matching ETag from the local file.
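To make the arithmetic concrete, here is a quick sketch of the part-count math for that 17,325,568-byte object at both chunk sizes:

    import math

    size = 17_325_568                  # object size in bytes, from the example above

    for chunk_mb in (8, 16):
        chunk = chunk_mb * 1024 * 1024
        parts = math.ceil(size / chunk)
        print(f"{chunk_mb}MB chunks -> {parts} parts -> ETag suffix -{parts}")

    # 8MB chunks  -> 3 parts -> ETag suffix -3
    # 16MB chunks -> 2 parts -> ETag suffix -2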

I used “calculate_s3_etag” from this Stack Overflow post by hypernot [which also seems to be on GitHub – but I used the Stack Overflow code, not the GitHub code]. I have confirmed the Stack Overflow code works against all 30,000+ of my files – trying 8MB first, then trying 16MB – when computing the ETag from the local file.
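For reference, here is a minimal sketch of that kind of calculation (my own paraphrase, not the exact Stack Overflow code): a multipart ETag is the MD5 of the concatenated binary MD5 digests of the parts, followed by “-<part count>”. The local file path is a placeholder.

    import hashlib

    def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
        # Multipart ETag: md5(md5(part1) + md5(part2) + ...) plus "-<number of parts>".
        md5s = []
        with open(file_path, 'rb') as f:
            while True:
                data = f.read(chunk_size)
                if not data:
                    break
                md5s.append(hashlib.md5(data))

        if len(md5s) == 1:
            return md5s[0].hexdigest()      # single part: plain md5, no "-N" suffix

        combined = hashlib.md5(b''.join(m.digest() for m in md5s))
        return '{}-{}'.format(combined.hexdigest(), len(md5s))

    # Try both "default" chunk sizes against the ETag S3 reported.
    s3_etag = 'c44bfa98b2c188777ed18cb9190e304b-2'   # from the example above
    for chunk_mb in (8, 16):
        if calculate_s3_etag('local-copy-of-object', chunk_mb * 1024 * 1024) == s3_etag:
            print('ETag matches with {}MB chunks'.format(chunk_mb))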

Other references:

  • One reference seems to indicate only 8MB and 16MB are used as chunk sizes (8MB for the aws cli aka boto3, and 16MB for s3cmd). Since I’ve only used ‘aws s3 sync …’ to upload files, and I’ve seen ETags “right next to each other” where one uses 8MB and the other uses 16MB, I know this is not a rule. Maybe it’s a “guideline”.
  • Another Stack Overflow post has the code in Python, Go, PowerShell, etc. That post also mentions the following setting – which I have not tried yet (see the configuration sketch after this list):
aws configure set default.s3.multipart_threshold 64MB
  • PyPI has a page that talks about defaults of 5MB, 8MB, 15MB, and 16MB.
  • This teppen.io post has some information (but the description doesn’t agree with any S3 documentation)
  • This savjee.be post has the implementation in Bash.
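Following up on the “multipart_threshold” setting above (and the “multipart_chunksize” setting mentioned at the top of this post): I have not tried this, but pinning both values should at least make the part count – and therefore the ETag suffix – of future uploads predictable. A sketch, assuming the defaults can be overridden the same way:

    aws configure set default.s3.multipart_threshold 64MB
    aws configure set default.s3.multipart_chunksize 16MB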
