AWS EC2 GPU instance comparison

These are the results from running the ml-style-transfer project on three different AWS EC2 instance types.

| Instance Name | Cost/hour | 250 epochs (time / cost) | 2,500 epochs (time / cost) | 50,000 epochs (time / cost) |
|---------------|-----------|--------------------------|----------------------------|------------------------------|
| p3.2xlarge    | $3.06     | 14s / $0.0119            | 55s / $0.0468              | 928s / $0.7888               |
| t2.large      | $0.093    | 849s / $0.0219           | 14,676s / $0.3791          | 293,520s (extrapolated) / $7.5826 |
| c5.4xlarge    | $0.68     | 221s / $0.0417           | 2,152s / $0.4065           | 43,040s (extrapolated) / $8.1298  |

Comparing p3.2xlarge GPU with t2.large non-GPU

The p3.2xlarge is roughly 33x more expensive per hour ($3.06 vs. $0.093). Yet on the 2,500-epoch test it is 266x faster, which nets out to about 8x more cost-effective ($0.05 versus $0.40 for the 2,500-epoch run).
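
For reference, the per-run costs in the table are just runtime (in hours) times the hourly rate, and the ratios above fall out of the same numbers. A rough sketch of that arithmetic, using only the measured values from the table:

```python
# Rough sketch of the arithmetic behind the table and the ratios above.
# Prices and runtimes are the measured values from the table; nothing here
# talks to AWS.

def run_cost(runtime_s: float, hourly_rate: float) -> float:
    """Per-run cost: runtime converted to hours, times the on-demand hourly rate."""
    return runtime_s / 3600 * hourly_rate

p3_rate, t2_rate = 3.06, 0.093    # $/hour, on-demand
p3_secs, t2_secs = 55, 14_676     # seconds for the 2,500-epoch run

price_ratio = p3_rate / t2_rate   # how much more the GPU instance costs per hour
speed_ratio = t2_secs / p3_secs   # how much faster it finishes the same run
value_ratio = run_cost(t2_secs, t2_rate) / run_cost(p3_secs, p3_rate)  # per-run cost advantage

print(f"p3.2xlarge, 2,500 epochs: ${run_cost(p3_secs, p3_rate):.4f}")  # table value: $0.0468
print(f"t2.large,   2,500 epochs: ${run_cost(t2_secs, t2_rate):.4f}")  # table value: $0.3791
print(f"{price_ratio:.0f}x price, {speed_ratio:.0f}x speed, {value_ratio:.1f}x cost-effectiveness")
```

Rounding aside, those are the 33x / 266x / 8x figures quoted above.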

The p3.2xlarge also delivers results sooner in wall-clock terms – its 50,000-epoch run took only about 15 minutes, while the extrapolated t2.large run is 4,892 minutes (over 3 days).

The c5.4xlarge is a “compute optimized” instance type – many vCPUs, plenty of RAM. It costs about 7x the hourly price of the t2.large, and on the 2,500-epoch test it delivers roughly that same wall-clock improvement – so essentially the same per-run cost, but results arrive 7x sooner.

Note: the total test time was approximately 40 minutes on the p3.2xlarge ($2.04) and 280 minutes on the t2.large ($0.44). But for roughly a 5x difference in total cost, the p3.2xlarge also completed a 50,000-epoch run that would have taken over three days on the t2.large.

In contrast, an RTX 3080 Ti purchased for $900 ran the 2,500-epoch test in 16 seconds (barely slower than the current record-holder, the p3.2xlarge, at 14 seconds).

Next up is to upgrade my AWS account to allow “spot” pricing for the p3.2xlarge – if Amazon will allow it for my non-commercial account. The $3.06 on-demand price drops to about $1.01 for spot instances. Update (1 day later): “We have approved and processed your service quota increase request.” So my EC2 -> Limits -> “All P Spot Instance Requests” limit now says “8 vCPUs”, which is enough for one p3.2xlarge instance.
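
Once that quota is in place, the spot request itself can go through the console or the API. Below is only a hedged boto3 sketch of what such a launch might look like – the AMI ID, key pair, region, and price ceiling are placeholders, not the values from these tests:

```python
# Hypothetical spot-instance launch for a p3.2xlarge with boto3.
# The AMI ID, key pair, region, and MaxPrice below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: the deep-learning AMI used for these tests
    InstanceType="p3.2xlarge",
    KeyName="my-key-pair",             # placeholder key pair
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "MaxPrice": "1.10",        # ceiling a bit above the ~$1.01 spot price mentioned above
            "SpotInstanceType": "one-time",
        },
    },
)
print("Launched:", response["Instances"][0]["InstanceId"])
```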

Just FYI: when that AMI is run on a non-GPU instance type, the first run takes an extra 5+ minutes while the system reports “Matplotlib is building the font cache; this may take a moment.” It only does this on the first run. (This is an example of something that causes differences between “python elapsed time” and “wall clock elapsed time”.)
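
If you want the timings comparable across instance types, one option is to trigger that font-cache build once during setup, before the real runs. A minimal sketch, assuming the delay comes from the first Matplotlib import (as it did here):

```python
# Pre-warm Matplotlib's font cache so the first real run isn't charged the
# one-time "building the font cache" step. Run this once right after launch.
import time

start = time.time()
import matplotlib.pyplot  # the first import builds the font cache if it's missing
print(f"Matplotlib import took {time.time() - start:.1f}s")
```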
