WINDOWS PERFORMANCE MONITORING TIPS WITH SPLUNK

At Octamis we love Splunk, and we love to share our knowledge and experience, so let’s look at some tips for monitoring Windows with Splunk!

PREPARING YOUR SPLUNK

Let’s proceed in order: first, we want to get Splunk ready to receive Windows performance data.

This is quite simple and relies on deploying the Windows technical add-on built by Splunk:

https://splunkbase.splunk.com/app/742/

Depending on your Splunk architecture, make sure you deploy the technical add-on everywhere it is required:

  • Indexers (clustered or standalone)
  • Search heads
  • Intermediate forwarders on the path to your indexers, if any

The add-on has no inputs activated by default, so at this point no modification is required.

Indexes creation:

The Windows technical add-on contains the embedded definition for a few indexes:

  • perfmon: dedicated index for Windows monitoring data collected through perfmon
  • windows: dedicated index for various log and monitoring data not related to perfmon
  • wineventlog: security and log data

If you are running a standalone server, you have nothing else to do, as the indexes have been created for you by the add-on.

If you are running Splunk with clustered indexers, be sure to declare those indexes properly on the cluster before continuing the setup.
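
For a clustered deployment, the declaration could look like the sketch below: an “indexes.conf” pushed to the peers from the cluster manager as part of the configuration bundle. The paths assume the default $SPLUNK_DB layout, which is an assumption on our side, so adapt them to your own storage volumes. The important part is “repFactor = auto”, which tells the cluster to replicate the index.

```ini
# indexes.conf, distributed to the indexer peers by the cluster manager
# (paths assume the default $SPLUNK_DB layout; adapt to your volumes)

[perfmon]
homePath   = $SPLUNK_DB/perfmon/db
coldPath   = $SPLUNK_DB/perfmon/colddb
thawedPath = $SPLUNK_DB/perfmon/thaweddb
repFactor  = auto

[windows]
homePath   = $SPLUNK_DB/windows/db
coldPath   = $SPLUNK_DB/windows/colddb
thawedPath = $SPLUNK_DB/windows/thaweddb
repFactor  = auto

[wineventlog]
homePath   = $SPLUNK_DB/wineventlog/db
coldPath   = $SPLUNK_DB/wineventlog/colddb
thawedPath = $SPLUNK_DB/wineventlog/thaweddb
repFactor  = auto
```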

DEPLOYING AND CONFIGURING THE ADD-ON

For the purpose of this demonstration, we will assume that:

  • You already have servers running with the Splunk Universal Forwarder
  • The servers are connected to your Splunk indexer(s) and properly configured for Splunk indexing
  • The servers are connected to a Splunk deployment server (recommended), or you use your own deployment solution

Deploy the technical add-on as usual and continue the setup.

COLLECTING PERFORMANCE DATA

Let’s have a look at the default “inputs.conf” file provided with the technical add-on. Since we are focusing on performance metrics, we are only interested for now in the “perfmon” stanzas.

For demonstration purposes, let’s have a look at the CPU and memory metrics:

Splunk_TA_windows/default/inputs.conf

## CPU
[perfmon://CPU]
counters = % Processor Time; % User Time; % Privileged Time; Interrupts/sec; % DPC Time; % Interrupt Time; DPCs Queued/sec; DPC Rate; % Idle Time; % C1 Time; % C2 Time; % C3 Time; C1 Transitions/sec; C2 Transitions/sec; C3 Transitions/sec
disabled = 1
instances = *
interval = 10
object = Processor
useEnglishOnly=true
index = perfmon

## Memory
[perfmon://Memory]
counters = Page Faults/sec; Available Bytes; Committed Bytes; Commit Limit; Write Copies/sec; Transition Faults/sec; Cache Faults/sec; Demand Zero Faults/sec; Pages/sec; Pages Input/sec; Page Reads/sec; Pages Output/sec; Pool Paged Bytes; Pool Nonpaged Bytes; Page Writes/sec; Pool Paged Allocs; Pool Nonpaged Allocs; Free System Page Table Entries; Cache Bytes; Cache Bytes Peak; Pool Paged Resident Bytes; System Code Total Bytes; System Code Resident Bytes; System Driver Total Bytes; System Driver Resident Bytes; System Cache Resident Bytes; % Committed Bytes In Use; Available KBytes; Available MBytes; Transition Pages RePurposed/sec; Free & Zero Page List Bytes; Modified Page List Bytes; Standby Cache Reserve Bytes; Standby Cache Normal Priority Bytes; Standby Cache Core Bytes; Long-Term Average Standby Cache Lifetime (s)
disabled = 1
interval = 10
object = Memory
useEnglishOnly=true
index = perfmon

Things you will probably want (and should) customise:

  • “interval”: the time in seconds between two performance collections, which directly drives the volume of data generated. Collecting every 10 seconds is quite frequent; 30 or 60 seconds are good values that save license, bandwidth and CPU footprint on the servers
  • “mode = multikv”: a great option introduced years ago (see: https://www.splunk.com/blog/2013/10/28/new-features-for-perfmon-in-splunk-6). It groups the counters of a collection into a single event instead of emitting one event per counter, which saves license, storage and bandwidth
  • “disabled = 1”: this deactivates the input, which is the default; you need to explicitly activate each input you want by setting “disabled = 0”
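
To put numbers on the “interval” setting: a day has 86400 seconds, so each input runs 8640 collections per day at 10 seconds, 2880 at 30 seconds and 1440 at 60 seconds. If you want to measure the real effect on your license, a search along these lines can help. It is only a sketch, and it assumes that you can search the “_internal” index of your license manager and that the data goes to the “perfmon” index as in this article:

```spl
index=_internal source=*license_usage.log* type=Usage idx=perfmon
| eval MB=b/1024/1024
| timechart span=1d sum(MB) as MB by st
```

Here “b” is the number of bytes counted against the license, “idx” the index and “st” the sourcetype, as recorded in license_usage.log. Compare a day before and a day after changing the interval or the mode.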

Let’s go with the following configuration. As always, never modify a default file: create a local file and copy only the stanzas you are interested in:

Splunk_TA_windows/local/inputs.conf

## CPU
[perfmon://CPU]
counters = % Processor Time; % User Time; % Privileged Time; Interrupts/sec; % DPC Time; % Interrupt Time; DPCs Queued/sec; DPC Rate; % Idle Time; % C1 Time; % C2 Time; % C3 Time; C1 Transitions/sec; C2 Transitions/sec; C3 Transitions/sec
disabled = 0
instances = *
interval = 30
object = Processor
useEnglishOnly=true
index = perfmon
mode = multikv

## Memory
[perfmon://Memory]
counters = Page Faults/sec; Available Bytes; Committed Bytes; Commit Limit; Write Copies/sec; Transition Faults/sec; Cache Faults/sec; Demand Zero Faults/sec; Pages/sec; Pages Input/sec; Page Reads/sec; Pages Output/sec; Pool Paged Bytes; Pool Nonpaged Bytes; Page Writes/sec; Pool Paged Allocs; Pool Nonpaged Allocs; Free System Page Table Entries; Cache Bytes; Cache Bytes Peak; Pool Paged Resident Bytes; System Code Total Bytes; System Code Resident Bytes; System Driver Total Bytes; System Driver Resident Bytes; System Cache Resident Bytes; % Committed Bytes In Use; Available KBytes; Available MBytes; Transition Pages RePurposed/sec; Free & Zero Page List Bytes; Modified Page List Bytes; Standby Cache Reserve Bytes; Standby Cache Normal Priority Bytes; Standby Cache Core Bytes; Long-Term Average Standby Cache Lifetime (s)
disabled = 0
interval = 30
object = Memory
useEnglishOnly=true
index = perfmon
mode = multikv

Deploy this configuration to your Windows servers and, if you use a Splunk deployment server, make sure you check “restart splunkd”.
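
If you manage the deployment server through configuration files rather than the UI, the “restart splunkd” checkbox corresponds to the “restartSplunkd” setting. A minimal sketch, where the server class name “windows_perf” and the catch-all whitelist are hypothetical and only there for illustration:

```ini
# serverclass.conf on the deployment server
# "windows_perf" is a hypothetical server class name
[serverClass:windows_perf]
whitelist.0 = *
machineTypesFilter = windows-x64

# Push the add-on and restart the forwarder so the new inputs are loaded
[serverClass:windows_perf:app:Splunk_TA_windows]
stateOnClient = enabled
restartSplunkd = true
```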

CHECKING DATA COMING IN

Next step, let’s check that some data is coming in:

index=perfmon

Depending on the mode (multikv or not), the data will be available in:

CPU statistics:

  • standard mode: index=perfmon sourcetype="Perfmon:CPU"
  • multikv mode: index=perfmon sourcetype="PerfmonMk:CPU"

Memory statistics:

  • standard mode: index=perfmon sourcetype="Perfmon:Memory"
  • multikv mode: index=perfmon sourcetype="PerfmonMk:Memory"

In this article, we will go with the multikv mode.
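
A quick way to confirm which hosts and sourcetypes are actually reporting is to count the recent events. This is just a sketch that looks at the last 15 minutes, an arbitrary window chosen for illustration:

```spl
index=perfmon earliest=-15m
| stats count, max(_time) as last_event by host, sourcetype
| eval last_event=strftime(last_event, "%Y-%m-%d %H:%M:%S")
| sort 0 host, sourcetype
```

A host that shows “Perfmon:*” sourcetypes instead of “PerfmonMk:*” has most likely not received the local configuration with “mode = multikv” yet.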

ANALYSING CPU STATISTICS

Let’s get some CPU statistics:

Per host average CPU usage over time: 

index=perfmon sourcetype="PerfmonMk:CPU" instance=_Total
| timechart avg(%_Processor_Time) as cpu_usage by host

Very simple.
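
The same data can also be summarised per host instead of charted over time, which is handy to spot the busiest servers. A possible variation, where the output field names (avg_cpu, p95_cpu, max_cpu) are our own choice:

```spl
index=perfmon sourcetype="PerfmonMk:CPU" instance=_Total
| rename "%_Processor_Time" as cpu_usage
| stats avg(cpu_usage) as avg_cpu, perc95(cpu_usage) as p95_cpu, max(cpu_usage) as max_cpu by host
| sort 0 - p95_cpu
```

Renaming the field first avoids having to quote the “%” character in every later command.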

WHERE IS MY PERCENTAGE OF MEMORY UTILISATION ?

If you are like me, when looking at memory statistics, the first (and potentially the only) metric you want to retrieve is the percentage of memory being used, or possibly the percentage of memory free.

So what’s the problem then? Well, although we have dozens of metrics, the percentage of utilisation is not available as is in the perfmon data.

What ???

Fortunately, we can calculate it! Using Splunk’s power and features, we can correlate the inventory data, which contains the amount of physical memory installed, with the memory metrics available in perfmon.

The following search reports the amount of physical memory in KB:

index=windows sourcetype=WinHostMon
| stats latest(TotalPhysicalMemoryKB) as TotalPhysicalMemoryKB, latest(TotalVirtualMemoryKB) as TotalVirtualMemoryKB by host | sort 0 host

Notes:

This requires the input “OperatingSystem” to be activated in your deployment, using:

[WinHostMon://OperatingSystem]
interval = 600
disabled = 0
type = OperatingSystem
index = windows

For the demonstration, let’s store this result in a temporary CSV lookup file:

index=windows sourcetype=WinHostMon
| stats latest(TotalPhysicalMemoryKB) as TotalPhysicalMemoryKB, latest(TotalVirtualMemoryKB) as TotalVirtualMemoryKB by host | sort 0 host
| outputlookup windows_memory_inventory.csv

Then, looking at the memory statistics, we have the amount of memory currently available in KB (the “Available KBytes” counter). Let’s map this to the inventory data and apply a simple calculation:

index=perfmon sourcetype="PerfmonMk:Memory"
| eval available_memory_KB=coalesce('Available_KBytes', Value)
| lookup windows_memory_inventory.csv host as host OUTPUTNEW TotalPhysicalMemoryKB
| eval free_memory_pct=((available_memory_KB/TotalPhysicalMemoryKB)*100), used_memory_pct=(100-free_memory_pct)
| timechart avg(used_memory_pct) as used_memory_pct by host

There you go!
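
To make the arithmetic concrete, here is a worked example with made-up numbers: a server with 16 GB of physical memory (16777216 KB) that reports 4194304 KB available. The values are purely illustrative and the search does not read any index:

```spl
| makeresults
| eval TotalPhysicalMemoryKB=16777216, Available_KBytes=4194304
| eval free_memory_pct=(Available_KBytes/TotalPhysicalMemoryKB)*100, used_memory_pct=100-free_memory_pct
| table TotalPhysicalMemoryKB, Available_KBytes, free_memory_pct, used_memory_pct
```

4194304 / 16777216 is 0.25, so this host has 25% of its memory free and 75% used.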

For a more resilient solution:

  • create a KV store based lookup table to store the Windows inventory data
  • schedule a report to update the lookup table on a regular basis (daily, for example)
  • create an automatic lookup so that it is not necessary to run the lookup command manually
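
These three steps could be sketched as follows, on the search head. The collection name, the lookup name and the report name are all hypothetical, and the daily schedule at 03:00 is only an example:

```ini
# collections.conf: declares the KV store collection
[windows_memory_inventory]

# transforms.conf: exposes the collection as a lookup
[windows_memory_inventory]
external_type = kvstore
collection = windows_memory_inventory
fields_list = _key, host, TotalPhysicalMemoryKB, TotalVirtualMemoryKB

# savedsearches.conf: refreshes the inventory once a day at 03:00
[Windows memory inventory - lookup update]
search = index=windows sourcetype=WinHostMon \
| stats latest(TotalPhysicalMemoryKB) as TotalPhysicalMemoryKB, latest(TotalVirtualMemoryKB) as TotalVirtualMemoryKB by host \
| outputlookup windows_memory_inventory
enableSched = 1
cron_schedule = 0 3 * * *
dispatch.earliest_time = -24h
dispatch.latest_time = now

# props.conf: automatic lookup applied to the memory sourcetype
[PerfmonMk:Memory]
LOOKUP-windows_memory_inventory = windows_memory_inventory host OUTPUTNEW TotalPhysicalMemoryKB
```

With the automatic lookup in place, TotalPhysicalMemoryKB is available on every “PerfmonMk:Memory” event and the explicit lookup command can be dropped from the memory search.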

WHAT ABOUT PROCESSES ?

Understanding a system’s CPU load requires knowing which processes consume resources and when. Perfmon provides process-related data through the “[perfmon://Process]” stanza.

However, for some reason the perfmon data is not accurate on multi-core systems. A nice article gave me the answer I was looking for:

Windows CPU monitoring with Splunk

Based on this great article, let’s add a WMI input to generate accurate per-process CPU statistics (caution: this goes in “wmi.conf”, not “inputs.conf”):

Splunk_TA_windows/local/wmi.conf

[WMI:process]
index = windows
disabled = 0
interval = 30
wql = Select IDProcess,Name,PercentProcessorTime,Timestamp_Sys100NS from Win32_PerfRawData_PerfProc_Process

Once deployed, let’s use some magic searches and start analysing process activity:

index=windows sourcetype="WMI:process" Name!=_Total Name!=Idle
| reverse | streamstats current=f last(PercentProcessorTime) as last_PercentProcessorTime last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS by host, Name
| eval cputime = 100 * (PercentProcessorTime - last_PercentProcessorTime) / (Timestamp_Sys100NS - last_Timestamp_Sys100NS)
| search cputime > 0
| timechart limit=50 useother=f avg(cputime) by Name
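
Why the subtraction? The raw class returns cumulative counters: PercentProcessorTime is the CPU time consumed by the process so far and Timestamp_Sys100NS is the system timestamp, both expressed in units of 100 nanoseconds. The percentage is therefore the CPU time consumed between two samples divided by the time elapsed between them. A worked example with made-up values, for two samples taken 30 seconds apart (300000000 units):

```spl
| makeresults
| eval last_PercentProcessorTime=1000000000, PercentProcessorTime=1075000000
| eval last_Timestamp_Sys100NS=5000000000, Timestamp_Sys100NS=5300000000
| eval cputime = 100 * (PercentProcessorTime - last_PercentProcessorTime) / (Timestamp_Sys100NS - last_Timestamp_Sys100NS)
| table cputime
```

The process consumed 75000000 units out of 300000000, so cputime is 25. Keep in mind that a multi-threaded process running on several cores can legitimately exceed 100 with this formula.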

Since Windows reports each running instance of a program as a separate entry (“name”, “name#1”, “name#2” and so on), we can improve this search and aggregate on a per-command basis:

index=windows sourcetype="WMI:process" Name!=_Total Name!=Idle
| reverse | streamstats current=f last(PercentProcessorTime) as last_PercentProcessorTime last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS by host, Name
| eval cputime = 100 * (PercentProcessorTime - last_PercentProcessorTime) / (Timestamp_Sys100NS - last_Timestamp_Sys100NS)
| search cputime > 0
| stats avg(cputime) as cputime by _time,host,Name
| rex field=Name "(?<Command>[^#]*)#{0,}"
| stats sum(cputime) as cputime by _time,host,Command
| timechart limit=50 useother=f avg(cputime) as cputime by Command
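
To see what the extraction of the command name does, it can be tried on a few sample instance names without touching any index. The names below are invented for the illustration:

```spl
| makeresults
| eval Name=split("chrome,chrome#1,chrome#2,svchost#12", ",")
| mvexpand Name
| rex field=Name "(?<Command>[^#]*)#{0,}"
| table Name, Command
```

Everything before the first “#” is captured into Command, so the three chrome entries end up under the same “chrome” command and “svchost#12” under “svchost”.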

 

Et voilà !

You now have all the main building blocks to start analysing Windows performance accurately, enjoy.

 

7 thoughts on “WINDOWS PERFORMANCE MONITORING TIPS WITH SPLUNK”

  1. Doesn’t work.

    G:\SplunkForwarder\bin>splunk restart
    SplunkForwarder: Stopped

    Splunk> All batbelt. No tights.

    Checking prerequisites…
    Checking mgmt port [8089]: open
    Checking conf files for problems…
    Invalid key in stanza [WMI:process] in G:\SplunkForwarder\etc\sy
    stem\local\inputs.conf, line 8: wql (value: Select IDProcess,Name,PercentProce
    ssorTime,TimeStamp_Sys100NS from Win32_PerfRawData_PerfProc_Process)
    Your indexes and inputs configurations are not internally consis
    tent. For more information, run 'splunk btool check --debug'
    Done
    All preliminary checks passed.

    Starting splunk server daemon (splunkd)…

    SplunkForwarder: Starting (pid 1284)
    Done

    1. Hello David,

      This works perfectly fine 😉

      Your issue is that you have put the configuration into your “local/inputs.conf” instead of “local/wmi.conf”, which is why it does not work and why Splunk complains…

      Guilhem

    1. Hello,

      Do you mean that copying and pasting the wmi.conf code block didn’t work at first, and then it did?
      When I checked recently step by step it worked fine directly…. strange

  2. Hi Guilhem,

    Thank you for your post. I am trying to get the CPU usage of a server as a percentage (20%, 30%, etc.). I did some digging and even asked in the Splunk forums. I am just wondering, is it really hard to get a value for CPU usage on a server? I haven’t got a good answer or solution to my question. I am not interested in getting it for every single process, just a simple value so I can run a report and get a value every 12 hours or so.

    i ended up using your solution and query;

    sourcetype="WMI:CPU" index=main sourcetype="WMI:CPU" Name!=_Total Name!=Idle Name!=_Total Name!=Idle
    | streamstats current=f last(PercentProcessorTime) as last_PercentProcessorTime last(Timestamp_Sys100NS) as last_Timestamp_Sys100NS by Name
    | eval cputime = 100 * (PercentProcessorTime - last_PercentProcessorTime) / (Timestamp_Sys100NS - last_Timestamp_Sys100NS)
    | search cputime > 0
    | timechart span=5m eval(round(avg(cputime),0)) by host

    and this is what i am getting

    x server value is 8172
    y server value is 110003

    and the results i am getting does not make sense to me.numbers are so big and don’t even know what they represent.

    so someone suggested using different stanza

    ## Processes
    [WMI:LocalProcesses]
    interval = 5
    wql = SELECT Name, IDProcess, PrivateBytes, PercentProcessorTime FROM Win32_PerfFormattedData_PerfProc_Process
    index = cpu
    disabled =0

    so I used that and, from what I understand, “PercentProcessorTime” is the percentage of processor time for the given Name, so I should be able to get this with a simple timechart command, right?

    index=5sv sourcetype="WMI:LocalProcesses" host=hc1aptr5sv Name!=_Total Name!=Idle Name!=_Total Name!=Idle | search PercentProcessorTime > 0 | timechart eval(round(avg(PercentProcessorTime),0)) by host

    i actually tried running this in realtime and at the same time going into the host machine and running some processes.numbers are close,but not sure if they are accurate.

    if it is not too much to ask,can you at least tell me or point me out to right direction on how to get this resolved.or even better if you have a suggestion.

    Thanks,
    Seyhun
