We recently ran into an issue where a t2.small instance on Amazon EC2 sat at 100% CPU for a period and then dropped to a consistent 20%. Because our CPU usage alert had a long trigger interval, we were not alerted until the instance had been at 100% for a fair while, and once it dropped to 20% the alert quickly cleared. Looking at the graphs, everything seemed fine: a nice, even 20%.
What we did not realise at the time was that it sat at 20% only because Amazon was throttling it to its baseline level of 20%. When we ran top on the machine, we saw that it was in fact running at 100% as far as the OS was concerned.
Amazon throttles an instance when it runs out of CPU credits, so if you are only monitoring CPU usage you are not going to see the issue.
So what is a CPU credit? From Amazon’s help pages:
A CPU Credit provides the performance of a full CPU core for one minute. Traditional Amazon EC2 instance types provide fixed performance, while T2 instances provide a baseline level of CPU performance with the ability to burst above that baseline level. The baseline performance and ability to burst are governed by CPU credits.
What is a CPU credit?
One CPU credit is equal to one vCPU running at 100% utilization for one minute. Other combinations of vCPUs, utilization, and time are also equal to one CPU credit; for example, one vCPU running at 50% utilization for two minutes or two vCPUs running at 25% utilization for two minutes.
So every minute you spend with one vCPU running full blast takes a credit from your bank. When the balance hits zero, which happens very quickly on two-vCPU instances, you find yourself throttled to 10, 15 or 20% (depending on which instance type you have).
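To make the arithmetic concrete, here is a minimal Python sketch of the credit maths described above. The function names are made up for illustration, and the earn rate of 12 credits per hour is an assumption chosen to match a 20% baseline on a single vCPU (12 / 60 = 0.2 credits per minute); check the figures for your own instance type. The starting balance of 30 credits is also just an example value.

```python
def credits_used(vcpus: int, utilization: float, minutes: float) -> float:
    """Credits spent: one credit is one vCPU at 100% for one minute."""
    return vcpus * utilization * minutes


def minutes_until_throttled(balance: float, vcpus: int, utilization: float,
                            earn_per_hour: float) -> float:
    """Estimate how long a credit balance lasts at a steady load.

    earn_per_hour is an assumed, instance-type-specific earn rate.
    Returns infinity if the load is at or below the baseline,
    because the balance never drains in that case.
    """
    net_burn_per_minute = vcpus * utilization - earn_per_hour / 60.0
    if net_burn_per_minute <= 0:
        return float("inf")
    return balance / net_burn_per_minute


# The two combinations from Amazon's definition both cost one credit.
one_vcpu = credits_used(vcpus=1, utilization=0.5, minutes=2)
two_vcpus = credits_used(vcpus=2, utilization=0.25, minutes=2)

# Hypothetical example: 30 credits banked, one vCPU pinned at 100%,
# earning an assumed 12 credits per hour.
remaining = minutes_until_throttled(balance=30, vcpus=1, utilization=1.0,
                                    earn_per_hour=12)
print(one_vcpu, two_vcpus, remaining)
```

Under those assumptions the bank lasts a little over half an hour at full load, which is roughly the sort of window in which a slow CPU alert can fire and then clear again.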
So in order to monitor CPU usage properly you also need to monitor your CPUCreditBalance, which luckily AWS reports through CloudWatch.
We added two new datapoints to our existing EC2 datasource: CPUCreditBalance and CPUCreditUsage. The second is more interesting than useful, as it simply shows the rate at which you are using or earning CPU credits. Alerting on CPUCreditBalance, however, lets us know that Amazon is going to throttle us before it happens.
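If you are not using a monitoring tool with an EC2 datasource, the same check can be sketched directly against CloudWatch. The following is a rough Python example, not our actual setup: it assumes a boto3 CloudWatch client is passed in, and the function names, the 30 minute lookback and the alert threshold of 20 credits are all illustrative choices rather than recommendations.

```python
from datetime import datetime, timedelta, timezone


def latest_credit_balance(cloudwatch, instance_id, lookback_minutes=30):
    """Return the most recent CPUCreditBalance average, or None if no data.

    `cloudwatch` is expected to be a boto3 CloudWatch client,
    e.g. boto3.client("cloudwatch").
    """
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUCreditBalance",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=end - timedelta(minutes=lookback_minutes),
        EndTime=end,
        Period=300,  # five-minute buckets
        Statistics=["Average"],
    )
    datapoints = response.get("Datapoints", [])
    if not datapoints:
        return None
    # Datapoints are not guaranteed to be ordered, so pick the newest.
    newest = max(datapoints, key=lambda point: point["Timestamp"])
    return newest["Average"]


def should_alert(balance, threshold=20.0):
    """Alert when the balance is known and below the (assumed) threshold."""
    return balance is not None and balance < threshold


# Usage, assuming boto3 is installed and AWS credentials are configured
# (the instance ID below is a placeholder):
#
#   import boto3
#   balance = latest_credit_balance(boto3.client("cloudwatch"), "i-0123456789abcdef0")
#   if should_alert(balance):
#       print("CPU credits running low:", balance)
```

Taking the client as a parameter keeps the function easy to exercise with a stub, and picking a threshold well above zero is what gives you warning before the throttling starts rather than after.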
The first image shows the misleading CPU usage. The second image shows clearly that we ran out of CPU credits.