
Monitoring system load average with Nmon Performance

For my own development purposes, I use a Linux VPS from vpsdime.com. Their offer is great and cheap: 4 vCPUs, 6 GB of memory and 30 GB of disk for $7 per month. Nice.

Cool, isn’t it? But there is a “but”: yesterday I received this kind of message:

Excessive load on your virtual machine, your VM has been running above acceptable load limits for an extended period of time

Oops! OK, so let’s have a detailed look at the terms of service:

To ensure the best performance for all of our customers, we enforce the following limits:

– Your VPS should not have a load average of 4 for more than one hour, or Your VPS will get restarted.

– Your VPS should not have a load average of 2 for more than two hours, or Your VPS will get restarted.

– Our system will automatically suspend VPSs with 5 minute load average more than 30.0

All right, that’s a job for Nmon and Splunk, no? Let’s set up some proper monitoring and be warned before they shut down my server!

SYSTEM LOAD AVERAGE

vpsdime shows an RRD chart of my system load average in my client area:

We can see the famous load peak at the beginning of the chart. When I received the ticket, it came with an attached file showing the situation at that time:

The Nmon Performance application reports the system load average using the “nmon external” feature. This allows integrating any non-nmon metric into the nmon processing, so that the data is parsed along with the nmon metrics and easily made available to Splunk.

See: http://ta-nmon.readthedocs.io/en/latest/external.html
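For context, the three values involved are the 1, 5 and 15 minute load averages that Linux exposes as the first three fields of /proc/loadavg. Here is a minimal Python sketch that parses such a line into the field names used by the dashboard search. This is only an illustration, not the nmon external script itself, and the sample line is made up:

```python
from typing import NamedTuple


class LoadAverage(NamedTuple):
    load_average_1min: float
    load_average_5min: float
    load_average_15min: float


def parse_loadavg(line: str) -> LoadAverage:
    """Parse the first three fields of a /proc/loadavg-style line."""
    one, five, fifteen = line.split()[:3]
    return LoadAverage(float(one), float(five), float(fifteen))


# Made-up sample: the remaining fields are running/total tasks and last PID.
sample = "4.12 2.57 1.03 3/245 18233"
print(parse_loadavg(sample))
```

On a real Linux host, the same function could be fed with the content of /proc/loadavg.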

The “NMON SUMMARY OVERVIEW” dashboard exposes the load average statistics:

Cool 😉

We also have the picture in the “NMON DARK MONITORING” dashboard:

Let’s have a look at the SPL search behind those dashboards:

eventtype=nmon:performance type=UPTIME (host="splunkdev")
| timechart `nmon_span` avg(load_average_1min) as load_average_1min, avg(load_average_5min) as load_average_5min, avg(load_average_15min) as load_average_15min

Pretty simple.

The terms of service have a notion of duration, so my alerting must take into account not only an excessive load average, but also how long that excess lasts.

Nmon Performance embeds various alerting features that use the “transaction” command to provide smart monitoring: they trigger not just on a peak, but on a peak that lasts longer than a given amount of time.

That’s cool, and this is what I need. Looking at the “HOWTO Interface”, a pre-built example is provided using the CPU usage statistics:

All right, it is more or less what I need and the logic is there, so let’s start working on our search.

BUILDING OUR ALERT

Let’s start by searching for our events, and filtering out any event with a load average lower than 2, which is our first level of condition:

eventtype=nmon:performance type=UPTIME (host="splunkdev")
| where load_average_5min>=2

Not too bad. Let’s add some streamstats stuff to compute the average load over the last hour and create a “state” field from it:

| streamstats avg(load_average_5min) as average_load_per_hour time_window=60m
| eval average_load_state=case(average_load_per_hour>=4, "excess_load_4", average_load_per_hour>=2, "excess_load_2" )
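To make the logic of these two lines concrete outside Splunk, here is a rough Python sketch of a trailing one-hour average followed by the same state classification. It is not how streamstats is implemented; the function names are mine, and the exact handling of the window boundary is an assumption:

```python
from collections import deque

WINDOW_SECONDS = 60 * 60  # mirrors time_window=60m


def classify(average_load_per_hour):
    """Same thresholds as the case() expression in the search."""
    if average_load_per_hour >= 4:
        return "excess_load_4"
    if average_load_per_hour >= 2:
        return "excess_load_2"
    return None  # case() yields null when no condition matches


def rolling_states(samples):
    """samples: (epoch_seconds, load_average_5min) tuples in ascending time order.

    Returns (epoch_seconds, trailing_average, state) tuples.
    """
    window = deque()
    results = []
    for ts, load in samples:
        window.append((ts, load))
        # Drop samples that fell out of the trailing window (boundary is assumed).
        while ts - window[0][0] >= WINDOW_SECONDS:
            window.popleft()
        average = sum(value for _, value in window) / len(window)
        results.append((ts, average, classify(average)))
    return results


if __name__ == "__main__":
    for row in rolling_states([(0, 2.0), (300, 4.0), (600, 6.0)]):
        print(row)
```

Since the search already keeps only events with a 5 minute load of 2 or more, the average can never drop below 2 there, so the null branch is only reachable in the sketch.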

 

So far so good. What if we use some transaction stuff to compute our duration and group our events?

| transaction host average_load_state maxpause=60m
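As I understand it, transaction groups the events sharing the same host and state, starts a new group when the gap between two events exceeds maxpause, and sets duration to the time between the first and the last event of each group. A simplified Python sketch of that idea, with my own function name and a made-up sample (the real command has more subtleties):

```python
MAXPAUSE_SECONDS = 60 * 60  # mirrors maxpause=60m


def group_transactions(events):
    """events: (epoch_seconds, host, average_load_state) tuples, ascending time order.

    Returns one dict per group with host, average_load_state, start, end, duration.
    """
    open_groups = {}  # (host, state) -> group still accepting events
    closed = []
    for ts, host, state in events:
        key = (host, state)
        group = open_groups.get(key)
        if group is not None and ts - group["end"] > MAXPAUSE_SECONDS:
            closed.append(group)  # pause too long: close the group
            group = None
        if group is None:
            group = {"host": host, "average_load_state": state, "start": ts, "end": ts}
            open_groups[key] = group
        else:
            group["end"] = ts
    closed.extend(open_groups.values())
    for group in closed:
        # duration is the span between the first and last event of the group
        group["duration"] = group["end"] - group["start"]
    return sorted(closed, key=lambda g: g["start"])


if __name__ == "__main__":
    sample = [
        (0, "splunkdev", "excess_load_2"),
        (300, "splunkdev", "excess_load_2"),
        (600, "splunkdev", "excess_load_2"),
        (4201, "splunkdev", "excess_load_2"),  # more than 60 minutes after the previous event
    ]
    for tx in group_transactions(sample):
        print(tx)
```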

 

This starts to look nice, doesn’t it?

Let’s add a filter that matches our 2 main alerting conditions:

| table _time host average_load_state duration
| where ( average_load_state="excess_load_2" AND duration>(60*60*2) ) OR ( average_load_state="excess_load_4" AND duration>(60*60*1) )

And some improvements to make the duration value easier to read:

| eval duration_nb_hours=round(duration/3600, 2), duration_string=tostring(duration, "duration")
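To double check the thresholds, here is the same filter and formatting written as a small Python sketch. The input dicts stand for transactions with made-up durations, the function names are mine, and the HH:MM:SS string only roughly mirrors tostring(duration, "duration") for durations under one day:

```python
# Minimum duration in seconds for each state before alerting, as in the where clause.
LIMITS_SECONDS = {
    "excess_load_2": 60 * 60 * 2,
    "excess_load_4": 60 * 60 * 1,
}


def format_duration(seconds):
    """Render a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def breaches(transactions):
    """Keep only the transactions lasting longer than the limit for their state."""
    alerts = []
    for tx in transactions:
        limit = LIMITS_SECONDS.get(tx["average_load_state"])
        if limit is not None and tx["duration"] > limit:
            alerts.append({
                **tx,
                "duration_nb_hours": round(tx["duration"] / 3600, 2),
                "duration_string": format_duration(tx["duration"]),
            })
    return alerts


if __name__ == "__main__":
    sample = [
        {"host": "splunkdev", "average_load_state": "excess_load_2", "duration": 9000},
        {"host": "splunkdev", "average_load_state": "excess_load_4", "duration": 3000},
        {"host": "splunkdev", "average_load_state": "excess_load_4", "duration": 4000},
    ]
    for alert in breaches(sample):
        print(alert)
```

Only the first and third sample transactions exceed their limit, so only those two would raise an alert.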

That looks good! Let’s make an alert that runs every 5 minutes over the last 4 hours and sends me an email in case of an excessive load average:



Et voilà! I should be warned before they shut down my server!

What about the last critical condition, when the load exceeds 30?

Well, this one is quite simple and has no notion of duration; a simple dedicated alert based on the value would do the job:

eventtype=nmon:performance type=UPTIME (host="splunkdev")
| where load_average_5min>=30
| stats latest(_time) as _time, latest(load_*) AS "load_*" by host
| fields host, _time, load_average_1min, load_average_5min, load_average_15min

However, as they would instantly suspend the service, I would probably want to lower the threshold and start alerting at half of this value, a load average of 15 for instance:

eventtype=nmon:performance type=UPTIME (host="splunkdev")
| where load_average_5min>=15
| stats latest(_time) as _time, latest(load_*) AS "load_*" by host
| fields host, _time, load_average_1min, load_average_5min, load_average_15min

Enjoy 😉
