{"id":180,"date":"2017-05-13T12:33:32","date_gmt":"2017-05-13T11:33:32","guid":{"rendered":"https:\/\/www.octamis.com\/octamis-blog\/?p=180"},"modified":"2017-05-15T00:48:22","modified_gmt":"2017-05-14T23:48:22","slug":"monitoring-system-load-average-with-nmon-performance","status":"publish","type":"post","link":"https:\/\/www.octamis.com\/octamis-blog\/monitoring-system-load-average-with-nmon-performance\/","title":{"rendered":"Monitoring system load average with Nmon Performance"},"content":{"rendered":"<p>For my own development purposes, I am using a great Linux VPS server from vpsdime.com. Their offer is great and cheap: 4 vCPUs, 6 GB of memory and 30 GB of disk for $7 per month. Nice.<\/p>\n<p>Cool, isn&#8217;t it ? But&#8230; there is a &#8220;but&#8221;: yesterday\u00a0I received this kind of message:<\/p>\n<blockquote>\n<h2>Excessive load on your virtual machine,\u00a0your VM has been running above acceptable load limits for an extended period of time<\/h2>\n<\/blockquote>\n<p>Oops\u00a0! Ok, so let&#8217;s have a detailed look at the terms of service:<\/p>\n<blockquote><p>To ensure the best performance for all of our customers, we enforce the following limits:<\/p>\n<p>&#8211; Your VPS should not have a load average of 4 for more than one hour, or Your VPS will get restarted.<\/p>\n<p>&#8211; Your VPS should not have a load average of 2 for more than two hours, or Your VPS will get restarted.<\/p>\n<p>&#8211;\u00a0Our system will automatically suspend VPSs with 5 minute load average more than 30.0<\/p><\/blockquote>\n<p>All right, that&#8217;s a job for Nmon and Splunk, no ? 
Let&#8217;s have some amazing monitoring and be warned before they shut down my server !<\/p>\n<h2><span style=\"color: #339966;\">SYSTEM LOAD AVERAGE<\/span><\/h2>\n<p>vpsdime reports in my client area an RRD chart of my system load average:<\/p>\n<p><a href=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_701.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-182\" src=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_701.png\" alt=\"\" width=\"576\" height=\"203\" srcset=\"https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_701.png 576w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_701-300x106.png 300w\" sizes=\"(max-width: 576px) 100vw, 576px\" \/><\/a><\/p>\n<p>We can see the famous load peak at the beginning of the chart. When I received the ticket, it came with an attached file showing the situation at that time:<\/p>\n<p><a href=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/load-average.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-183\" src=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/load-average.png\" alt=\"\" width=\"591\" height=\"219\" srcset=\"https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/load-average.png 591w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/load-average-300x111.png 300w\" sizes=\"(max-width: 591px) 100vw, 591px\" \/><\/a><\/p>\n<p>The Nmon Performance application reports the system load average using the &#8220;nmon external&#8221; feature. This allows integrating any non-nmon metric into the nmon processing, so that the data is parsed along with the nmon metrics and easily made available to Splunk.<\/p>\n<p><strong>See:<\/strong>\u00a0<a 
href=\"http:\/\/ta-nmon.readthedocs.io\/en\/latest\/external.html\">http:\/\/ta-nmon.readthedocs.io\/en\/latest\/external.html<\/a><\/p>\n<p><strong>The &#8220;NMON SUMMARY OVERVIEW&#8221;\u00a0dashboard exposes the load average statistics:<\/strong><\/p>\n<p><a href=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_702.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-185\" src=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_702.png\" alt=\"\" width=\"1872\" height=\"620\" srcset=\"https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_702.png 1872w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_702-300x99.png 300w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_702-768x254.png 768w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_702-1024x339.png 1024w\" sizes=\"(max-width: 1872px) 100vw, 1872px\" \/><\/a><\/p>\n<p>Cool \ud83d\ude09<\/p>\n<p><strong>We also have the picture in the &#8220;NMON DARK MONITORING&#8221; dashboard:<\/strong><\/p>\n<p><a href=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_703.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-186\" src=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_703.png\" alt=\"\" width=\"1875\" height=\"443\" srcset=\"https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_703.png 1875w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_703-300x71.png 300w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_703-768x181.png 768w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_703-1024x242.png 1024w\" sizes=\"(max-width: 1875px) 100vw, 1875px\" 
\/><\/a><\/p>\n<p><strong>Let&#8217;s have a look at the SPL search behind those dashboards:<\/strong><\/p>\n<pre>eventtype=nmon:performance type=UPTIME (host=\"splunkdev\")\r\n| timechart `nmon_span` avg(load_average_1min) as load_average_1min, avg(load_average_5min) as load_average_5min, avg(load_average_15min) as load_average_15min\r\n<\/pre>\n<p>Pretty simple.<\/p>\n<p>The terms of service include a notion of duration, so my alerting must take into account\u00a0not only the excess in the load average, but also how long that excess lasts compared to the terms of service.<\/p>\n<p>Nmon Performance embeds various alerting features that use the &#8220;transaction&#8221; command to provide smart monitoring: alerts fire not simply when there is a peak, but when that peak exceeds a given amount of time.<\/p>\n<p><strong>That&#8217;s cool, and this is what I need. Having a look at the &#8220;HOWTO Interface&#8221;, a pre-built example is provided using the CPU usage statistics:<\/strong><\/p>\n<p><a href=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_704.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-187\" src=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_704.png\" alt=\"\" width=\"1875\" height=\"715\" srcset=\"https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_704.png 1875w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_704-300x114.png 300w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_704-768x293.png 768w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_704-1024x390.png 1024w\" sizes=\"(max-width: 1875px) 100vw, 1875px\" \/><\/a><\/p>\n<p><strong>All right, this is more or less what I need, and the logic is there, so let&#8217;s start to work on our search.<\/strong><\/p>\n<h2><span style=\"color: #339966;\">BUILDING OUR 
ALERT<\/span><\/h2>\n<p><strong>Let&#8217;s start by searching for our events and filtering out any event with a 5-minute load average lower than 2, which is our first condition level:<\/strong><\/p>\n<pre>eventtype=nmon:performance type=UPTIME (host=\"splunkdev\")\r\n| where load_average_5min&gt;=2\r\n<\/pre>\n<p><a href=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_705.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-188\" src=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_705.png\" alt=\"\" width=\"1901\" height=\"857\" srcset=\"https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_705.png 1901w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_705-300x135.png 300w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_705-768x346.png 768w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_705-1024x462.png 1024w\" sizes=\"(max-width: 1901px) 100vw, 1901px\" \/><\/a><\/p>\n<p><strong>Not too bad. Let&#8217;s add some streamstats to compute the average load over a rolling 60-minute window and create a &#8220;state&#8221; field from it:<\/strong><\/p>\n<pre>| streamstats avg(load_average_5min) as average_load_per_hour time_window=60m\r\n| eval average_load_state=case(average_load_per_hour&gt;=4, \"excess_load_4\", average_load_per_hour&gt;=2, \"excess_load_2\" )\r\n<\/pre>\n<p><a href=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_706.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-189\" src=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_706.png\" alt=\"\" width=\"1899\" height=\"874\" srcset=\"https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_706.png 1899w, 
https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_706-300x138.png 300w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_706-768x353.png 768w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_706-1024x471.png 1024w\" sizes=\"(max-width: 1899px) 100vw, 1899px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p><strong>So far so good. What if we use the transaction command to group our events and compute the duration:<\/strong><\/p>\n<pre>| transaction host average_load_state maxpause=60m\r\n<\/pre>\n<p><a href=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_707.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-190\" src=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_707.png\" alt=\"\" width=\"1903\" height=\"958\" srcset=\"https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_707.png 1903w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_707-300x151.png 300w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_707-768x387.png 768w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_707-1024x515.png 1024w\" sizes=\"(max-width: 1903px) 100vw, 1903px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>This starts to look nice, doesn&#8217;t it ?<\/p>\n<p><strong>Let&#8217;s add a filter to match our 2 main alerting conditions:<\/strong><\/p>\n<pre>| table _time host average_load_state duration\r\n| where ( average_load_state=\"excess_load_2\" AND duration&gt;(60*60*2) ) OR ( average_load_state=\"excess_load_4\" AND duration&gt;(60*60*1) )\r\n<\/pre>\n<p><a href=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_708.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-191\" 
src=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_708.png\" alt=\"\" width=\"1902\" height=\"590\" srcset=\"https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_708.png 1902w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_708-300x93.png 300w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_708-768x238.png 768w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_708-1024x318.png 1024w\" sizes=\"(max-width: 1902px) 100vw, 1902px\" \/><\/a><\/p>\n<p><strong>And some improvements to understand the duration value:<\/strong><\/p>\n<pre>| eval duration_nb_hours=round(duration\/3600, 2), duration_string=tostring(duration, \"duration\")\r\n<\/pre>\n<p><a href=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_709.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-192\" src=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_709.png\" alt=\"\" width=\"1903\" height=\"528\" srcset=\"https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_709.png 1903w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_709-300x83.png 300w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_709-768x213.png 768w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_709-1024x284.png 1024w\" sizes=\"(max-width: 1903px) 100vw, 1903px\" \/><\/a><\/p>\n<p><strong>That looks good ! 
Let&#8217;s make an alert that runs every 5 minutes over the last 4 hours and sends me an email in case of an excessive\u00a0load average:<\/strong><\/p>\n<p><a href=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_710.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-193\" src=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_710.png\" alt=\"\" width=\"1915\" height=\"787\" srcset=\"https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_710.png 1915w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_710-300x123.png 300w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_710-768x316.png 768w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_710-1024x421.png 1024w\" sizes=\"(max-width: 1915px) 100vw, 1915px\" \/><\/a><br \/>\n<a href=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_711.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-194\" src=\"https:\/\/51.68.196.81\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_711.png\" alt=\"\" width=\"1918\" height=\"612\" srcset=\"https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_711.png 1918w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_711-300x96.png 300w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_711-768x245.png 768w, https:\/\/www.octamis.com\/octamis-blog\/wp-content\/uploads\/2017\/05\/Selection_711-1024x327.png 1024w\" sizes=\"(max-width: 1918px) 100vw, 1918px\" \/><\/a><br \/>\nEt voila ! 
I should be warned before they shut down my server !<\/p>\n<p><strong>What about the last critical condition, when the load exceeds 30 ?<\/strong><\/p>\n<p><strong>Well, this one is a quite simple condition with no notion of duration; a simple dedicated alert based on the value would do the job:<\/strong><\/p>\n<pre>eventtype=nmon:performance type=UPTIME (host=\"splunkdev\")\r\n| where load_average_5min&gt;=30\r\n| stats latest(_time) as _time, latest(load_*) AS \"load_*\" by host\r\n| fields host, _time, load_average_1min, load_average_5min, load_average_15min\r\n<\/pre>\n<p><strong>However, as they would instantly suspend the service, I would probably be keen to lower the threshold and get alerted starting at half of this value, that is, a load average of 15 for instance:<\/strong><\/p>\n<pre>eventtype=nmon:performance type=UPTIME (host=\"splunkdev\")\r\n| where load_average_5min&gt;=15\r\n| stats latest(_time) as _time, latest(load_*) AS \"load_*\" by host\r\n| fields host, _time, load_average_1min, load_average_5min, load_average_15min\r\n<\/pre>\n<p>Enjoy \ud83d\ude09<\/p>\n","protected":false},"excerpt":{"rendered":"<p>For my own development purposes, I am using a great Linux VPS server from vpsdime.com. Their offer is great and cheap: 4 vCPUs, 6 GB of memory and 30 GB of disk for $7 per month. Nice. Cool, isn&#8217;t it ? 
But&#8230; there is a &#8220;but&#8221;, yesterday\u00a0I received this kind of message: Excessive load on your virtual [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[7,9,6,3,2,5],"tags":[19,17,14,13,12,11,10,18,15],"_links":{"self":[{"href":"https:\/\/www.octamis.com\/octamis-blog\/wp-json\/wp\/v2\/posts\/180"}],"collection":[{"href":"https:\/\/www.octamis.com\/octamis-blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.octamis.com\/octamis-blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.octamis.com\/octamis-blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.octamis.com\/octamis-blog\/wp-json\/wp\/v2\/comments?post=180"}],"version-history":[{"count":8,"href":"https:\/\/www.octamis.com\/octamis-blog\/wp-json\/wp\/v2\/posts\/180\/revisions"}],"predecessor-version":[{"id":215,"href":"https:\/\/www.octamis.com\/octamis-blog\/wp-json\/wp\/v2\/posts\/180\/revisions\/215"}],"wp:attachment":[{"href":"https:\/\/www.octamis.com\/octamis-blog\/wp-json\/wp\/v2\/media?parent=180"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.octamis.com\/octamis-blog\/wp-json\/wp\/v2\/categories?post=180"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.octamis.com\/octamis-blog\/wp-json\/wp\/v2\/tags?post=180"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}