Monitoring EC2 T2.Small and T2.Micro instances.

We recently ran into an issue where a T2.Small instance in Amazon sat at 100% CPU for a period and then dropped to a consistent 20%. Because our CPU usage alert had a long trigger interval, we were not alerted until the CPU had been at 100% for a fair while. Then, as usage dropped to 20%, the alert quickly cleared, and the graphs looked good: a nice, even 20%.

What we did not realise at the time was that it was sitting at 20% because Amazon was throttling it to a base level of 20%. When we ran top on the machine, we saw that it was in fact running at 100% as far as the OS was concerned.

Amazon throttles your instance when it runs out of CPU Credits, and if you are only monitoring CPU usage you are not going to see the issue.

So what is a CPU Credit? From Amazon’s help pages:

CPU Credits

A CPU Credit provides the performance of a full CPU core for one minute. Traditional Amazon EC2 instance types provide fixed performance, while T2 instances provide a baseline level of CPU performance with the ability to burst above that baseline level. The baseline performance and ability to burst are governed by CPU credits.

What is a CPU credit?

One CPU credit is equal to one vCPU running at 100% utilization for one minute. Other combinations of vCPUs, utilization, and time are also equal to one CPU credit; for example, one vCPU running at 50% utilization for two minutes or two vCPUs running at 25% utilization for two minutes.

So every minute you spend with one vCPU running full blast takes a credit from your bank. When the bank hits zero, which happens very quickly on two-vCPU instances, you find yourself throttled to 10, 15 or 20% (depending on which instance type you have).
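The arithmetic in Amazon's definition can be sketched in a few lines of Python. The function name and its parameters are our own illustration, not part of any AWS API.

```python
def cpu_credits_used(vcpus: int, utilization: float, minutes: float) -> float:
    """Credits consumed: one credit is one vCPU at 100% for one minute.

    utilization is a fraction between 0.0 and 1.0 (hypothetical helper,
    not an AWS API).
    """
    return vcpus * utilization * minutes


# The combinations from Amazon's definition each come to one credit.
print(cpu_credits_used(1, 1.0, 1))   # one vCPU at 100% for one minute
print(cpu_credits_used(1, 0.5, 2))   # one vCPU at 50% for two minutes
print(cpu_credits_used(2, 0.25, 2))  # two vCPUs at 25% for two minutes
```

By the same sum, two vCPUs running flat out spend two credits every minute, which is why the bank empties so quickly on the larger instances.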

So in order to monitor CPU usage properly you need to monitor your CPUCreditBalance, which luckily AWS reports through CloudWatch.

We added two new datapoints to our existing EC2 datasource, CPUCreditBalance and CPUCreditUsage. The second is more interesting than useful, as it simply shows the rate at which you are spending CPU credits. Setting an alert on CPUCreditBalance, however, lets us know that Amazon is going to throttle us before they do.
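To get a feel for how much warning a balance alert gives, here is a rough sketch in Python. It is not our datasource, and the function names, the threshold and the sample numbers are all hypothetical. The rate at which credits are earned depends on the instance type, so check Amazon's documentation for yours.

```python
from typing import Optional


def minutes_until_throttled(balance: float, earned_per_hour: float,
                            vcpus: int, utilization: float) -> Optional[float]:
    """Estimate minutes until the credit balance reaches zero.

    Returns None when credits are earned at least as fast as they are spent.
    """
    spent_per_minute = vcpus * utilization       # credits spent each minute
    earned_per_minute = earned_per_hour / 60.0   # credits earned each minute
    net_burn = spent_per_minute - earned_per_minute
    if net_burn <= 0:
        return None
    return balance / net_burn


def should_alert(balance: float, threshold: float) -> bool:
    """True when CPUCreditBalance has dropped below the alert threshold."""
    return balance < threshold


# Illustrative numbers only: 30 credits banked, 12 earned per hour,
# one vCPU running flat out.
print(minutes_until_throttled(30, 12, 1, 1.0))
print(should_alert(balance=30, threshold=50))
```

The point of the estimate is to pick a threshold that leaves enough time to react before the balance reaches zero.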

The first image shows the misleading CPU usage. The second shows clearly that we ran out of CPU credits.

[Image: CPU usage graph]

[Image: CPU credit balance graph]

Windows update monitoring

There is a simple method to help keep your PC safer and running smoothly: Windows Update, of course. All you have to do is turn it on, and you will get the latest security and other important updates from Microsoft automatically.

However, running it on a mission-critical server is the last thing you want. You do not get a chance to evaluate updates before installing them, nor are you protected when Windows Update decides it's time to reboot. It may reboot in the middle of your latest backup, for example, since most people schedule updates and backups in the same early hours of the morning to avoid disruption.

So in the majority of cases administrators disable automatic updating altogether, planning to check for updates on a regular basis and install them within a reasonable time frame. But the best-laid plans of mice and men often go awry.

It's forgotten, pushed back or simply delayed due to other issues. And before you know it there are potentially hundreds of critical updates waiting to be installed, mainly security updates, meaning your servers are vulnerable.

So we created a LogicModule, CriticalUpdateCheckPS, which uses PowerShell to check your servers for any available updates and alerts you by email, text or voice using LogicMonitor.

In its first version it suffered a huge drawback: establishing a Microsoft Update Session and then interrogating it for updates took so long that scripts were tying up threads and using far more collector resources than necessary.

So, back to the drawing board. We wrote a script that runs locally on each server, called once per day by Task Scheduler. That script queries the update site for the number of critical updates and writes this number to a file stored on the server.

Then we created the CriticalUpdateCheckPS datasource with a much simpler script that logs onto the server being monitored and reads the contents of the file written by the scheduled script. This allowed us to collect the data in seconds.
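The real scripts are PowerShell, but the hand-off pattern is easy to sketch in Python: a slow job writes a number to a file on a schedule, and a fast job only ever reads that file. The function names and file name below are hypothetical, and the staleness check is our own addition for illustration rather than something the datasource is known to do.

```python
import tempfile
import time
from pathlib import Path


def write_update_count(path: Path, count: int) -> None:
    """Scheduled side: store the number of outstanding critical updates."""
    path.write_text(str(count))


def read_update_count(path: Path, max_age_hours: float = 48.0) -> int:
    """Collector side: read the stored number without querying for updates.

    Raises ValueError if the file is older than max_age_hours, which would
    suggest the scheduled script has stopped running (our own assumption).
    """
    age_hours = (time.time() - path.stat().st_mtime) / 3600.0
    if age_hours > max_age_hours:
        raise ValueError(f"{path} is stale ({age_hours:.1f} hours old)")
    return int(path.read_text().strip())


# Demonstration in a temporary folder; the file name is hypothetical.
demo_file = Path(tempfile.mkdtemp()) / "critical_update_count.txt"
write_update_count(demo_file, 17)
print(read_update_count(demo_file))
```

The slow query happens once a day, out of band, so the collector only pays for a small file read each time it polls.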

You can set your thresholds however you want. We suggest > 5 10 20, so you get a warning alert on 5 or more, an error alert on 10 or more and a critical alert on 20 or more. But this is of course your choice.
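As a quick illustration of how those three thresholds map a count to a severity, here is a small Python sketch. It follows the "or more" wording above and is not how LogicMonitor evaluates thresholds internally. The names are our own.

```python
from typing import Optional

# Suggested thresholds from the text, highest first.
THRESHOLDS = [(20, "critical"), (10, "error"), (5, "warning")]


def alert_level(outstanding_updates: int) -> Optional[str]:
    """Return the alert severity for a count, or None if below all thresholds."""
    for minimum, level in THRESHOLDS:
        if outstanding_updates >= minimum:
            return level
    return None


for count in (4, 5, 17, 20):
    print(count, alert_level(count))
```

With these values a server with 17 outstanding updates raises an error alert, and one with 4 raises nothing.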

An example alert is shown here:-

  • ID: LMD12345
  • This server, APP013XYZ, has 17 outstanding critical updates to be installed. 

Because it alerts you to the fact and keeps reminding you (also known as nagging!), you are more likely to deal with it.

The datasource applies to any device that has a system category of “CheckUpdates”.

  1. Navigate to the Devices tab
  2. Navigate to the level at which you want to set the property – the root level for your device tree, a group, or a device
  3. Click the Manage button for that group or device
  4. From the Manage dialog you can change the value of the system category by clicking on the value field and adding CheckUpdates. If there are already values in there, remember to separate them with a comma.
  5. You will also need two new properties, PS.USER and PS.PASS (credentials which allow you to run remote scripts in PowerShell)

Additionally, you will need to deploy a folder (c:/LMCriticalUpdates) containing the local script to each server, and schedule the script to run once per day.

Your servers must be set up for remote PowerShell scripts as per LogicMonitor’s help page.

SSL errors and alerting

SSL/TLS is a deceptively simple technology. It is simple to deploy, and it just works.

Except that it does not really just work, because it is not easy to deploy correctly. To ensure that SSL provides the necessary security, you have to put effort into properly configuring your servers.

For example, consider the POODLE attack (Padding Oracle On Downgraded Legacy Encryption), a man-in-the-middle exploit that takes advantage of Internet and security software clients’ fallback to SSL 3.0.

An attacker who exploits this vulnerability needs on average only 256 SSL 3.0 requests to reveal one byte of an encrypted message.

But the time taken to check all the sites under your control quickly mounts up, and it becomes a task you leave for another day, which in IT means you will get around to sorting it someday.

So we created a LogicModule, SSL Test, which checks your sites for certain vulnerabilities and alerts you by email, text or voice using LogicMonitor. At the time of publication it checks for the BEAST, Logjam, FREAK, Heartbleed, Lucky Minus 20, Debian flaw, OpenSSL CCS, DROWN, known DH primes and POODLE vulnerabilities. It also checks that the SSL certificate matches the address.
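The certificate check is the easiest of these to try by hand. The sketch below uses only Python's standard library and is not how SSL Test itself works. The function names are our own, and it covers only the certificate check, not the vulnerability scans listed above.

```python
import socket
import ssl


def build_context() -> ssl.SSLContext:
    """Default client context: verifies the chain and the host name."""
    return ssl.create_default_context()


def certificate_matches(hostname: str, port: int = 443,
                        timeout: float = 10.0) -> bool:
    """Return False if the TLS handshake fails certificate verification.

    That covers a certificate that does not match hostname, but also an
    expired or untrusted one. Network errors are raised to the caller.
    """
    context = build_context()
    try:
        with socket.create_connection((hostname, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=hostname):
                return True
    except ssl.SSLCertVerificationError:
        return False
```

Calling certificate_matches("www.yourwebsite.co.uk") would then return True or False for that site, assuming the machine running it can reach the site on port 443.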

An example alert is shown here:-

  • ID: LMD12345
  • This server, www.yourwebsite.co.uk, is vulnerable to the POODLE attack. If possible, disable SSL 3 to mitigate.

Because it alerts you to the fact and lets you know how to deal with it, you save the time you would otherwise spend trawling through RSS feeds and security updates.

You need to manually add each website you want to check as an instance in LogicMonitor.

To do this, select a host in LogicMonitor (it doesn’t matter which one, as it is just a placeholder for the datasource; the actual check is done from the collector) and click the down arrow shown here.

[Image: one]

And select Add monitored instance.

Then fill out the various required values.

[Image: two]

And that is it!