AWS EC2 GPU instance comparison

These are the results from running the ml-style-transfer project on three different AWS EC2 instance types.

Instance Name	Cost/hour	250 epochs	2,500 epochs	50,000 epochs
p3.2xlarge	$3.06	14s $0.0119	55s $0.0468	928s $0.7888
t2.large	$0.093	849s $0.0219	14,676s $0.3791	293,520s (extrapolated) $7.5826
c5.4xlarge	$0.68	221s $0.0417	2,152s $0.4065	43,040s (extrapolated) $8.1298
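The dollar figures in the table follow from the run time and the hourly price. Here is a minimal sketch of that arithmetic, assuming simple per-second proration with no minimum charge; the helper name `run_cost` is made up for illustration.

```python
def run_cost(seconds: float, price_per_hour: float) -> float:
    """Dollar cost of a run, assuming simple per-second proration."""
    return seconds / 3600 * price_per_hour

# On-demand hourly prices, taken from the table
PRICES = {"p3.2xlarge": 3.06, "t2.large": 0.093, "c5.4xlarge": 0.68}

# Measured run times in seconds for the 2,500 epoch test
TIMES_2500 = {"p3.2xlarge": 55, "t2.large": 14_676, "c5.4xlarge": 2_152}

for name, secs in TIMES_2500.items():
    print(f"{name}: {secs}s costs ${run_cost(secs, PRICES[name]):.4f}")
```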

Comparing p3.2xlarge GPU with t2.large non-GPU

The p3.2xlarge is about 33x more expensive per hour. Yet for the 2,500 epoch test it is about 267x faster, which makes it roughly 8x more cost-effective ($0.05 versus $0.38 for 2,500 epochs).
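The ratios in this comparison can be reproduced from the table. This is a rough sketch using only the 2,500 epoch figures; the variable names are mine.

```python
# Values from the table (2,500 epoch test)
p3_price, t2_price = 3.06, 0.093   # dollars per hour
p3_secs, t2_secs = 55, 14_676      # run time in seconds

price_ratio = p3_price / t2_price        # how much more the p3 costs per hour
speedup = t2_secs / p3_secs              # how much sooner the p3 finishes
cost_advantage = speedup / price_ratio   # net cost-effectiveness of the p3

print(f"price ratio:    {price_ratio:.1f}x")
print(f"speedup:        {speedup:.1f}x")
print(f"cost advantage: {cost_advantage:.1f}x")
```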

The p3.2xlarge also gets results faster in wall-clock terms: its 50,000 epoch run took only about 15 minutes, while the projected run on t2.large is 4,892 minutes (over 3 days).
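The extrapolated 50,000 epoch times in the table match a straight linear scaling of the 2,500 epoch runs (a factor of 20). Here is a small sketch of that projection, assuming run time grows in proportion to the epoch count; `extrapolate_seconds` is a hypothetical helper.

```python
def extrapolate_seconds(measured_secs: float, measured_epochs: int,
                        target_epochs: int) -> float:
    """Project run time, assuming time is proportional to epoch count."""
    return measured_secs * target_epochs / measured_epochs

t2_50k = extrapolate_seconds(14_676, 2_500, 50_000)   # t2.large
c5_50k = extrapolate_seconds(2_152, 2_500, 50_000)    # c5.4xlarge

print(f"t2.large:   {t2_50k:,.0f}s = {t2_50k / 60:,.0f} min = {t2_50k / 86_400:.1f} days")
print(f"c5.4xlarge: {c5_50k:,.0f}s = {c5_50k / 60:,.0f} min")
```

The measured p3.2xlarge times do not scale quite linearly (928s is about 17x the 55s run, not 20x), so these projections may be somewhat pessimistic.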

The c5.4xlarge is a “compute optimized” instance, with more vCPUs and more RAM. It is about 7x the hourly price of the t2.large and, on the 2,500 epoch test, delivers about that much wall-clock improvement. So the cost is basically the same, but the results arrive 7x sooner.

Note: the total test time was approximately 40 minutes on the p3.2xlarge ($2.04) and 280 minutes on the t2.large ($0.44). But for a total cost difference of roughly 5x, the p3.2xlarge also completed a 50,000 epoch run that is projected to take over 3 days on the t2.large.

In contrast, a purchased RTX 3080 Ti ($900) completed the 250 epoch run in 16 seconds, barely slower than the current record-holder, the p3.2xlarge, at 14 seconds.
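One way to weigh buying against renting is a break-even estimate. The sketch below divides the card price by the p3.2xlarge on-demand rate. It deliberately ignores electricity, the rest of the host machine, and any resale value, so treat the result as a rough lower bound rather than a real cost model. `break_even_hours` is a name invented here.

```python
def break_even_hours(purchase_price: float, rental_per_hour: float) -> float:
    """Hours of rental that would cost as much as the one-time purchase."""
    return purchase_price / rental_per_hour

card_price = 900.0   # RTX 3080 Ti purchase price quoted above
on_demand = 3.06     # p3.2xlarge on-demand dollars per hour

hours = break_even_hours(card_price, on_demand)
print(f"break-even after about {hours:.0f} hours ({hours / 24:.1f} days) of GPU time")
```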

Next up is to upgrade my AWS account to allow “spot” pricing for p3.2xlarge, if Amazon will allow it for my non-commercial account. The $3.06 on-demand price seems to drop to about $1.01 for spot instances. Update (1 day later): “We have approved and processed your service quota increase request”. So my “EC2->Limits->All P Spot Instance Requests” limit now says “8 vCPUs”, which is enough for one p3.2xlarge instance.
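To get a feel for what the spot price would mean, here is a quick sketch that re-prices the measured 50,000 epoch run. It assumes the $1.01 spot rate holds for the whole run and that the run is not interrupted, neither of which is guaranteed with spot instances.

```python
def run_cost(seconds: float, price_per_hour: float) -> float:
    """Dollar cost of a run, assuming simple per-second proration."""
    return seconds / 3600 * price_per_hour

on_demand, spot = 3.06, 1.01   # p3.2xlarge dollars per hour, as quoted above
secs_50k = 928                 # measured 50,000 epoch run time in seconds

on_demand_cost = run_cost(secs_50k, on_demand)
spot_cost = run_cost(secs_50k, spot)
saving = 1 - spot / on_demand  # fraction saved by using spot

print(f"on-demand: ${on_demand_cost:.2f}, spot: ${spot_cost:.2f}, saving: {saving:.0%}")
```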

Just FYI: when the AMI is run on a non-GPU machine type, the first run takes an extra 5+ minutes, as the system says “Matplotlib is building the font cache; this may take a moment.” It only does this on the first run. (This is an example of something that causes differences between “python elapsed time” and “wall clock elapsed time”.)
