Monitoring Systems, SNMPv3 and the engineID

Sometimes encrypted SNMPv3 requests from a Network Monitoring System (NMS) to a device just fail out of the blue, with no reasonable explanation for the behaviour. In this article I will describe the problem as it happened to me in Zabbix, speculate about the reason, and present a workaround (or solution, if you like). It works for me, but I am not sure whether I found the real cause. At least it resolves the weird behaviour of the Zabbix server and the monitored device. The problem is not limited to Zabbix: I also noticed it in CA's Spectrum system.

SNMPv3

For more than 10 years, SNMPv3 has been the only valid SNMP standard. Older versions such as SNMPv1 and v2c are considered obsolete, although they are still used to manage most devices in networks.

SNMPv3 offers a full framework with a User-based Security Model (USM) and a View-based Access Control Model (VACM) to limit the information an authenticated user is authorized to see. All requests from the manager to the agent can be authenticated and, if needed, encrypted. In most designs it is sufficient to authenticate monitoring traffic inside an internal network, but if the traffic passes through an external network (e.g. the internet) it should also be encrypted. SNMPv3 offers all of these options.

engineID

Every SNMPv3 entity has its own ID, the so-called engineID. It is a unique number for every context that an agent operates in. On most devices only one SNMP agent runs, so every device has a unique engineID. RFC 3411 describes in section snmpModules.10 (page 40) how the engineID should be composed. If the traffic is encrypted, the engineID is part of the algorithm, so the data is encrypted differently for every device.
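
How the engineID enters the key material can be sketched in a few lines of Python. This is a minimal illustration of the password-to-key and key localization steps as I understand them from RFC 3414 (appendix A.2), not code taken from Zabbix or net-snmp. The function names, the passphrase and the second engineID are made up for the example; the first engineID is the one from the trace in this article. SHA-1 is an assumption, your devices may use a different hash.

```python
import hashlib


def password_to_key(password: bytes, hash_name: str = "sha1") -> bytes:
    """Stretch a passphrase into the user key Ku (sketch after RFC 3414, A.2)."""
    if not password:
        raise ValueError("empty passphrase")
    total = 1_048_576  # the RFC hashes exactly 1 MiB of repeated passphrase
    stream = (password * (total // len(password) + 1))[:total]
    return hashlib.new(hash_name, stream).digest()


def localize_key(ku: bytes, engine_id: bytes, hash_name: str = "sha1") -> bytes:
    """Bind Ku to a single engine: Kul = H(Ku || engineID || Ku)."""
    return hashlib.new(hash_name, ku + engine_id + ku).digest()


if __name__ == "__main__":
    ku = password_to_key(b"example-passphrase")  # hypothetical passphrase
    engine_a = bytes.fromhex("80001f88808ee5d87b120c335200000000")  # from the trace
    engine_b = bytes.fromhex("80001f8803001122334455")  # made-up second device
    print(localize_key(ku, engine_a).hex())
    print(localize_key(ku, engine_b).hex())
```

The same passphrase yields a different localized key for each engineID. The flip side is that two agents sharing one engineID also share the same keys for a given user.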

Since the manager initially does not know about the engineID of the agent on the managed device, RFC 5343 describes a method for the initial discovery of the engineID.

Replay Protection

The SNMPv3 standard also protects the communication against replay attacks, in which an attacker records packets and sends them again later to the destination station to provoke some kind of reaction. To achieve that goal, the manager first asks the agent how many times it has already rebooted (snmpEngineBoots) and how long ago the last reboot was (snmpEngineTime). The manager encrypts the data with the number of boots and the time since the last boot of the agent. The agent can only decrypt the packet correctly if its own time matches the time in the incoming request. Please note that SNMPv3 uses no absolute time reference, only a relative time frame. So a manager can retrieve information from the monitored device even if the clocks of the two systems do not agree.
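
The check on the agent side can be sketched as follows. This is a simplified reading of the time window rule in RFC 3414 (150 seconds, boot counters must match), not the code of any real agent. The function and constant names are my own.

```python
TIME_WINDOW_SECONDS = 150         # window size given in RFC 3414
MAX_ENGINE_BOOTS = 2_147_483_647  # counter is latched at this value


def in_time_window(local_boots: int, local_time: int,
                   msg_boots: int, msg_time: int) -> bool:
    """Return True if an authenticated request passes the agent's timeliness check."""
    if local_boots == MAX_ENGINE_BOOTS:
        return False  # the boot counter can no longer advance
    if msg_boots != local_boots:
        return False  # the manager refers to a different boot cycle
    # Same boot cycle: the engine times may differ by at most 150 seconds.
    return abs(local_time - msg_time) <= TIME_WINDOW_SECONDS


if __name__ == "__main__":
    print(in_time_window(5, 1000, 5, 1100))  # same boot cycle, 100 s apart
    print(in_time_window(5, 1000, 4, 1000))  # stale boot counter
```

A request that carries the boot counter of an earlier boot cycle fails the check no matter how close the time values are.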

Communication

On the wire, the encrypted SNMPv3 communication looks like this:

manager -> agent: What is your engineID?
agent -> manager: My engineID is ..... My SNMP Agent restarted x times and the last time y seconds ago.
manager -> agent: What is your processor load? Encrypted with the engineID, snmpEngineBoots, and snmpEngineTime.
agent -> manager: The processor load was x, encrypted with the engineID, snmpEngineBoots, and snmpEngineTime.

As you can see, any problem with the interpretation of the engineID, snmpEngineBoots, or snmpEngineTime will be fatal for the encrypted communication.

Zabbix

Zabbix is a monitoring system that makes use of the SNMP protocol, including version 3. In Zabbix you just have to configure the SNMP user, the authentication passphrase and the encryption passphrase. The username and the passphrases can be configured as macros in the host definition. Using these macros makes it much simpler to re-use templates for hosts that might have different parameters.

Symptoms

The problems start when you notice that suddenly one device does not respond to the encrypted SNMPv3 requests from the Zabbix system. Checking the SNMP configuration with snmpget or snmpwalk from the command line of the monitoring host shows no problem. All requests from the command line return the correct answer.
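
For a repeatable command line check, the snmpget call can be assembled in a small script. This is only a sketch: the host name, user, passphrases, and the choice of SHA and AES are placeholders and assumptions, so substitute whatever your agent is configured with. The options used (-v, -l, -u, -a, -A, -x, -X) are the usual net-snmp SNMPv3 options, and the OID is sysUpTime.0.

```python
import shlex


def build_snmpget_cmd(host: str, user: str, auth_pass: str, priv_pass: str,
                      oid: str = "1.3.6.1.2.1.1.3.0",  # sysUpTime.0
                      auth_proto: str = "SHA", priv_proto: str = "AES") -> list:
    """Build an authenticated and encrypted (authPriv) snmpget command line."""
    return [
        "snmpget", "-v", "3", "-l", "authPriv",
        "-u", user,
        "-a", auth_proto, "-A", auth_pass,
        "-x", priv_proto, "-X", priv_pass,
        host, oid,
    ]


if __name__ == "__main__":
    # All values below are placeholders, not taken from a real setup.
    cmd = build_snmpget_cmd("agent.example.org", "user",
                            "auth-passphrase", "priv-passphrase")
    print(shlex.join(cmd))
    # To actually send the request: subprocess.run(cmd, check=True)
```

Changing "authPriv" to "authNoPriv" (and dropping the -x and -X options) gives the authenticated but unencrypted variant for comparison.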

In the next step you start tcpdump to check the communication on the wire. In the first packet the manager asks the agent for its engineID:

07.406643 IP (tos 0x0, ttl 53, id 0, offset 0, flags [DF], proto UDP (17), length 92)
  manager.37780 > agent.161: [udp sum ok]  { SNMPv3 { F=r } { USM B=0 T=0 U= }
  { ScopedPDU E=  C= { GetRequest(14) R=537447614  } } }

As you can see, there is no user name included (U=), the number of boots is zero (B=0), as well as the time since last boot (T=0). In the next packet the agent provides the manager with the necessary information:

07.408916 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 143)
  agent.161 > manager.37780: [udp sum ok]  { SNMPv3 { F= } { USM B=28 T=27 U= }
  { ScopedPDU E= 0x800x000x1F0x880x800x8E0xE50xD80x7B0x120x0C0x330x520x000x000x000x00
    C= { Report(31) R=537447614  .1.3.6.1.6.3.15.1.1.4.0=1 } } }

It tells the manager its engineID, that it has already booted 28 times, and that the last boot was 27 seconds ago. Now the manager behaves very strangely. On the wire you can see a packet like this:

07.427289 IP (tos 0x0, ttl 53, id 0, offset 0, flags [DF], proto UDP (17), length 177)
  manager.37780 > agent.161: [udp sum ok]  { SNMPv3 { F=apr } { USM B=27 T=178 U=user }
  { ScopedPDU [!scoped PDU] (encrypted data) } }

You can see that the manager sends 27 boots and 178 seconds since the last reboot, which is complete nonsense. The agent reported 28 boots and 27 seconds less than half a second earlier. Of course the agent answers with:

07.429382 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 163)
  agent.161 > manager.37780: [udp sum ok]  { SNMPv3 { F=a } { USM B=28 T=27 U=user }
  { ScopedPDU E= 0x800x000x1F0x880x800x8E0xE50xD80x7B0x120x0C0x330x520x000x000x000x00
    C= { Report(28) R=0  .1.3.6.1.6.3.15.1.1.2.0=1 } } }

With this OID (usmStatsNotInTimeWindows) the agent tells the manager that the request was rejected because it was outside the accepted time window.
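
The OIDs in the two Report PDUs above belong to the usmStats counters of the SNMP-USER-BASED-SM-MIB (RFC 3414). A small lookup table makes traces like this easier to read. The helper name is my own; the table lists the six usmStats counters as defined in the MIB.

```python
# usmStats counters from SNMP-USER-BASED-SM-MIB (RFC 3414)
USM_STATS = {
    "1.3.6.1.6.3.15.1.1.1.0": "usmStatsUnsupportedSecLevels",
    "1.3.6.1.6.3.15.1.1.2.0": "usmStatsNotInTimeWindows",
    "1.3.6.1.6.3.15.1.1.3.0": "usmStatsUnknownUserNames",
    "1.3.6.1.6.3.15.1.1.4.0": "usmStatsUnknownEngineIDs",
    "1.3.6.1.6.3.15.1.1.5.0": "usmStatsWrongDigests",
    "1.3.6.1.6.3.15.1.1.6.0": "usmStatsDecryptionErrors",
}


def explain_report(oid: str) -> str:
    """Translate the OID of a USM Report PDU into its counter name."""
    return USM_STATS.get(oid.lstrip("."), "unknown OID")


if __name__ == "__main__":
    # The two OIDs that appear in the trace in this article.
    for oid in (".1.3.6.1.6.3.15.1.1.4.0", ".1.3.6.1.6.3.15.1.1.2.0"):
        print(oid, "=", explain_report(oid))
```

The first report in the trace (usmStatsUnknownEngineIDs) is the normal answer to the discovery request. Only the second one (usmStatsNotInTimeWindows) signals the actual failure.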

As I wrote before, the same SNMP request from the command line works perfectly. An SNMP request that is only authenticated, but not encrypted, also works. I saw this behaviour with Zabbix installed on a Linux server and with a Spectrum server in a Windows environment. On the agent side I had Cisco IOS, Cisco Nexus, or a Linux machine with net-snmp installed.

The Cause

It seems that the Zabbix server does not interpret snmpEngineBoots and snmpEngineTime correctly. But the developers of Zabbix swear to high heaven that they use the SNMP libraries of net-snmp, the same libraries that the command line tools use. So the investigation seems to lead to a dead end here.

On the other hand, there are some bug reports on the Zabbix site that point to a non-unique engineID on the agent side. See bug ZBX-2152 for details.

The Cure

Since there was no reasonable explanation, I tried changing the engineID on the agent side. The line

engineIDType 3

in the snmpd.conf configuration file of the agent changed the ID from the net-snmp random value to the MAC address of eth0, which really should be unique. Of course, this only took effect after the next reboot. And indeed, Zabbix resumed the communication with the agent and gathered data. Problem solved.
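
To check which kind of engineID an agent actually announces, the bytes from a trace can be decoded according to the SnmpEngineID layout in RFC 3411: four bytes of enterprise number with the top bit set, one format byte, then the data. The following is a rough sketch with a helper name of my own. The first input is the engineID from the trace above; the second one is a made-up MAC-based ID for comparison.

```python
FORMATS = {1: "IPv4 address", 2: "IPv6 address", 3: "MAC address",
           4: "text", 5: "octets"}


def describe_engine_id(engine_id: bytes) -> dict:
    """Split an RFC 3411 style engineID into enterprise, format and data."""
    if not 5 <= len(engine_id) <= 32:
        raise ValueError("engineID must be 5 to 32 bytes long")
    if not engine_id[0] & 0x80:
        raise ValueError("pre-SNMPv3 engineID layout, not handled in this sketch")
    enterprise = int.from_bytes(engine_id[:4], "big") & 0x7FFFFFFF
    fmt = engine_id[4]
    fmt_name = FORMATS.get(fmt, "enterprise-specific" if fmt >= 128 else "reserved")
    return {"enterprise": enterprise, "format": fmt,
            "format_name": fmt_name, "data": engine_id[5:].hex()}


if __name__ == "__main__":
    traced = bytes.fromhex("80001f88808ee5d87b120c335200000000")  # from the trace
    mac_based = bytes.fromhex("80001f8803001122334455")           # made-up example
    print(describe_engine_id(traced))
    print(describe_engine_id(mac_based))
```

Enterprise number 8072 is net-snmp. A format byte of 0x80 or higher means an enterprise-specific scheme (here the net-snmp random ID), while format 3 means the data is a MAC address.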

If you have any further questions, please mail me at ms@sys4.de.

