Real-Time Performance Monitoring

Every administrator should know the performance of the servers and services they are responsible for. Only then can they be sure that the servers deliver the performance needed in daily business. Additionally, they can identify bottlenecks faster. Normally the performance of a service is measured in actions per unit of time. For a database this would be accesses or inserts per second, for a web server pages delivered per second, and for a mail server messages handled per second. Of course, continuous measurement of the system's performance in real time is best.

Every admin can estimate server performance, but measuring is always better than speculating. Measuring things on live production systems is the task of a monitoring system. Or, in short: Stop speculating! Start measuring!

In this article I want to present a method to measure the performance of production systems in real time. This yields much more accurate values than any educated guess. I will demonstrate the method with a mail server, but it can be applied to any service. To calculate the maximum performance of your system you just have to measure the throughput together with the CPU usage and combine both values with some simple math.

The moment you add a correlating value (here: CPU usage), you begin to see the machine's potential. From then on you can tune and optimize your system and see the result in the monitoring graphs. Optimize step by step, one change at a time, so that you can identify the performance impact of each change.

Mail Server Performance

A domain's mail server is crucial for modern business. Service providers in particular should be sure that their mail servers are capable of dealing with the usual load peaks. The domain should provide at least enough mail service capacity that no mail is deferred with a 421 response during normal business hours. A mail server that doesn't perform well slows down or even delays message transport. This slows down communication and hinders business.

In my setup I will measure the performance of a mail server in terms of forwarded messages per second and correlate this value with the utilization of the CPUs of that server.

To get access to the number of processed mails we use the amavis SNMP subagent. It provides access to the important fault and performance data of the mail server setup. For the performance measurement the inMsgs OID is the important one. Its data type is COUNTER, so no mail slips through the system uncounted, even if the monitoring system does not request the data for some time. Of course, the monitoring system has to calculate the difference between the current measurement and the previous one, but every modern monitoring system implements this.
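
The difference calculation is simple enough to sketch by hand. The following Python snippet is only an illustration and not part of the amavis or net-snmp tooling: the function name counter_rate is made up for this example, and it assumes a 32-bit counter that wraps around at 2^32.

```python
# Sketch: turn two readings of an ever-increasing counter into a rate.
# Assumption (not from the article): a 32-bit counter wrapping at 2**32.

COUNTER_MAX = 2**32


def counter_rate(prev_value, prev_time, cur_value, cur_time):
    """Return the average rate (events per second) between two samples."""
    elapsed = cur_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    delta = cur_value - prev_value
    if delta < 0:
        # Treat a smaller reading as a wrap-around of the counter.
        # A restart of the SNMP agent would look the same from here.
        delta += COUNTER_MAX
    return delta / elapsed


# Two samples taken 300 seconds apart: 450 messages in 5 minutes.
print(counter_rate(10_000, 0, 10_450, 300))  # 1.5
```

Because only the difference between two readings matters, a missed polling interval merely lowers the time resolution; the messages handled in the meantime are still counted.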

Mail Server Load

There are several ways to determine the load of a server. I use the ssCpuRawIdle OID of the net-snmp agent. The CPU idle figure correlates closely with the messages processed and also reflects the real workload of the server. Other figures, such as the load average displayed by uptime, might be misleading. The data type of this OID is again COUNTER; it tells me how many ticks the Linux kernel spent idle. In a standard setup every CPU provides 100 ticks per second, so a six-core system has 600 ticks per second. If the kernel does not idle any more, it is completely busy. In other words, if the CPUs no longer idle, the server has reached its maximum performance.
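
To make the tick arithmetic concrete, here is a small Python sketch that converts an idle-tick rate into a utilization figure. It assumes the 100 ticks per second per CPU mentioned above; the function name cpu_utilization is invented for this illustration.

```python
# Sketch: convert an idle-tick rate into a CPU utilization figure.
# Assumption: 100 ticks per second per CPU, as in a standard setup.

TICKS_PER_SECOND_PER_CPU = 100


def cpu_utilization(idle_ticks_per_second, cpu_count):
    """Return the busy fraction of all CPUs, between 0.0 and 1.0."""
    total_ticks = TICKS_PER_SECOND_PER_CPU * cpu_count
    idle_fraction = idle_ticks_per_second / total_ticks
    # Clamp to [0, 1] to absorb small sampling jitter.
    return min(1.0, max(0.0, 1.0 - idle_fraction))


# Six-core server: 600 ticks per second in total.
print(cpu_utilization(600, 6))  # 0.0, all CPUs idle
print(cpu_utilization(150, 6))  # 0.75, three quarters busy
print(cpu_utilization(0, 6))    # 1.0, no idle ticks left
```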

Measurement

To measure and store the values of inMsgs and ssCpuRawIdle I use rrdtool. The command

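# Two DERIVE data sources (heartbeat 600 s, minimum 0, no maximum), one AVERAGE archive
rrdtool create mailstats.rrd --step 300 \
  DS:inMsgs:DERIVE:600:0:U \
  DS:ssCpuIdle:DERIVE:600:0:U \
  RRA:AVERAGE:0.5:1:576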

creates an RRD that holds the data in five-minute intervals for the last 48 hours. Because both data sources are of type DERIVE, rrdtool calculates the difference between consecutive readings automatically and stores the rate of incoming mails.
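
The 48 hours follow directly from the step size and the number of archive rows. As a quick check of the arithmetic (the variable names are chosen for this example only):

```python
# Sketch: how much history the round-robin archive holds.
step_seconds = 300   # --step: one primary data point every 5 minutes
steps_per_row = 1    # each archive row consolidates a single data point
rows = 576           # number of rows kept in the archive

retention_hours = step_seconds * steps_per_row * rows / 3600
print(retention_hours)  # 48.0
```

The value 600 in the data source definitions is the heartbeat: if no update arrives within 600 seconds, that is two steps, rrdtool records the interval as unknown.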

With the following cron job my monitoring system gathers the data every 5 minutes:

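# Read both counters via SNMP and store them with the current timestamp.
NOW=$(date +%s)
# amavis SNMP subagent: inMsgs
INMSGS=$(snmpget -Oeqv mailserver .1.3.6.1.4.1.15312.2.1.1.2.1.0)
# net-snmp agent: ssCpuRawIdle
CPUIDLE=$(snmpget -Oeqv mailserver .1.3.6.1.4.1.2021.11.53.0)
rrdtool update mailstats.rrd "$NOW:$INMSGS:$CPUIDLE"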

Review of the Data

The cron job collects the data. Now let's see what the data look like. I like the program gnuplot. As you will see, I use it not only for viewing the data but also for calculating the performance. But first we export the data with

rrdtool dump mailstats.rrd | grep "2015-12-15" | awk -F"<|>" '{print $7,$11}' > tmpdata
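
The awk field numbers 7 and 11 come from splitting each row of the dump at every angle bracket. For readers who find that opaque, here is the same extraction as a Python sketch. It is not a replacement for the command above; the function name parse_row is invented, and the sample line is illustrative rather than taken from a real RRD.

```python
import math
import re

# Sketch: extract the two values from one row of `rrdtool dump` output.
ROW_PATTERN = re.compile(r"<row><v>([^<]+)</v><v>([^<]+)</v></row>")


def parse_row(line):
    """Return (msgs_per_second, idle_ticks_per_second), or None if unusable."""
    match = ROW_PATTERN.search(line)
    if match is None:
        return None
    try:
        msgs, idle = (float(value) for value in match.groups())
    except ValueError:
        return None
    # rrdtool writes NaN for intervals without data; skip those rows.
    if math.isnan(msgs) or math.isnan(idle):
        return None
    return msgs, idle


# Illustrative sample line, not real measurement data.
sample = (
    "<!-- 2015-12-15 10:00:00 CET / 1450170000 --> "
    "<row><v>1.5000000000e+00</v><v>4.5000000000e+01</v></row>"
)
print(parse_row(sample))  # (1.5, 45.0)
```

Unlike the awk one-liner, this sketch drops rows without data, which would otherwise end up as NaN entries in the exported file.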

Please note that you have to use the correct date. After the export you can display the data with gnuplot:

gnuplot
gnuplot> set xlabel "Msgs/sec"
gnuplot> set ylabel "Idle CPU"
gnuplot> plot 'tmpdata' title 'data'

Figure: Sample performance plot (idle CPU vs. msgs/sec).

As you can see, load versus throughput forms a nice curve. When no mail is passing through the system, nearly all CPUs of my 6-core server idle (left part). When my server processes more than approximately 1.5 messages/sec, all CPUs are working and none is idling any more. With the naked eye, you could estimate the maximum performance at 1.5 msgs/sec.

But we want to calculate it more precisely and automate the process, so that we get a number we can display in our monitoring system.

Calculating the Performance

To calculate the performance automatically I again use gnuplot. It provides a nice algorithm to find the best fit of a curve to the given data. From the graph you can see that a quadratic equation would fit the data; it should fit better than a straight line.

First I export the data of the last 20 hours:

rrdtool fetch mailstats.rrd AVERAGE --end NOW --start end-20h | grep -v "^\W\|^$\|nan" > tmpdata

Twenty hours is a compromise between seeing changes fast enough and gathering enough data to get a good fit.

The following gnuplot script fits a quadratic equation to the data and solves for its zero (CPU not idling any more) to calculate the performance:

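# Timestamp of this calculation (seconds since the epoch)
DATE=`date +%s`
set fit quiet
# Model: idle CPU ticks as a quadratic function of msgs/sec
f(x) = A*x*x + B*x + C
# Column 2 is inMsgs (msgs/sec), column 3 is ssCpuIdle (idle ticks/sec)
fit f(x) 'tmpdata' using 2:3 via A,B,C
# Zero of f(x): the throughput at which the CPUs no longer idle
Z = (-B - sqrt(B*B - 4*A*C)) / (2*A)
set print "performance.dat" append
print DATE, Z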

Please note that you have to adapt the script for the case that the equation has no real solution (a negative discriminant yields a complex result). The script also writes the result together with the timestamp into a file. You should also feed the data to your monitoring system so that it can display the calculated performance together with the actual throughput and the CPU load.
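
For readers who want to follow the math outside gnuplot, here is a minimal Python sketch of the same idea, including a guard for the complex case. It is not the gnuplot script: it solves the least-squares problem directly via the normal equations, which for a polynomial model should arrive at the same minimum as an iterative fit. The function names and the sample points are invented for this illustration.

```python
import math


def fit_quadratic(points):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c).

    Needs at least three distinct x values, otherwise the system is
    singular and a ZeroDivisionError is raised.
    """
    # Power sums needed for the normal equations.
    sx = [sum(x**k for x, _ in points) for k in range(5)]
    sy = [sum(y * x**k for x, y in points) for k in range(3)]
    # Augmented matrix for the unknowns (a, b, c).
    m = [
        [sx[4], sx[3], sx[2], sy[2]],
        [sx[3], sx[2], sx[1], sy[1]],
        [sx[2], sx[1], sx[0], sy[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [v - factor * p for v, p in zip(m[row], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def max_throughput(a, b, c):
    """Return the zero used in the gnuplot script, or None if there is none."""
    disc = b * b - 4 * a * c
    if a == 0 or disc < 0:
        # Not a parabola, or no real zero (the complex case).
        return None
    return (-b - math.sqrt(disc)) / (2 * a)


# Synthetic points on y = 600 - 100*x - 200*x^2 (msgs/sec, idle ticks/sec).
points = [(0.0, 600.0), (0.5, 500.0), (1.0, 300.0), (1.5, 0.0)]
a, b, c = fit_quadratic(points)
print(round(max_throughput(a, b, c), 3))  # 1.5
```

Returning None instead of a complex number lets the calling cron job skip that run rather than write a meaningless value to the monitoring system.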

If you run the performance calculation in a cron job, you can measure the performance of your system continuously. Your monitoring system will then display the maximum performance every time you change your configuration. You also get a good feeling for the power of your system and for when it is time to add more CPUs to your server, buy a new server, or put more servers behind a load balancer.

Configuration Change

A good test for the performance monitoring came when we changed the setup of the server. The figure below shows the effect of the change. Before the change the server could handle 1.8 msgs/sec. After the change the performance skyrocketed to 7.3 msgs/sec.

Figure: Performance impact of a configuration change.
