Network Monitoring/Ten Best Practices

Ten best practices for network monitoring, the short list:

Establish baseline behavior
Perform network inventory
Avoid network alert sawtoothing/flapping
Don't filter your email alerts
Monitor deltas
Provide details
Escalation policy
Parent-Child
Event Correlation
View traffic from application endpoint

Establish baseline behavior

A network baseline gives you a sense of how the network performs normally, so you can tell when something is abnormal. (Note that, to this end, Bro, since renamed Zeek, can be used for network baselining, even though it is designed as an intrusion detection system, not as a network monitoring tool.)
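As a rough illustration of what a baseline buys you, the sketch below summarizes historical measurements as a mean and standard deviation and flags values far from that mean. The function names, the sample latencies, and the three-standard-deviation cutoff are all assumptions made for this example, not something prescribed by any particular tool.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Summarize historical measurements as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def is_anomalous(value, baseline, k=3.0):
    """True if value lies more than k standard deviations from the baseline mean."""
    mu, sigma = baseline
    return abs(value - mu) > k * sigma


# Hypothetical latency samples (milliseconds) collected during normal operation.
history = [20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 18.0, 21.0]
baseline = build_baseline(history)

normal_check = is_anomalous(21.0, baseline)   # within the usual range
spike_check = is_anomalous(60.0, baseline)    # far outside the usual range
```

A real baseline would usually account for time of day and day of week; a single mean and deviation is only the simplest starting point.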

Perform network inventory

Keep an inventory of devices on the network:

Network devices
Ports
Interfaces being used for network connections
Network hardware (links, switches, controllers, power supplies)
Servers
Virtual machines
SAN devices

If you don't know what's on your network, you can't monitor it very well!

Avoid network alert sawtoothing

Alert sawtoothing occurs when an element's numerical value hovers right around the threshold, causing the alert to be triggered repeatedly. This is a sign that the threshold, or the way the alert is triggered, needs to be changed.

Options:

Once a single alert is triggered, silence that alert for a given window of time
Add a delay before the alert is triggered
Add a "state" to each alert, and don't re-trigger alerts until the state of the alert has been returned to normal
Two-way communication with ticket system or alert management system
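The "state" option above can be sketched as follows. This is a minimal example, not a reference implementation: the class name is hypothetical, and the separate, lower clear threshold (hysteresis) is an assumption added here so that a value hovering near the trigger threshold does not keep returning the alert to normal.

```python
class StatefulAlert:
    """Alert that fires once and stays quiet until the value returns to normal."""

    def __init__(self, trigger_at, clear_at):
        if clear_at > trigger_at:
            raise ValueError("clear_at must not exceed trigger_at")
        self.trigger_at = trigger_at
        self.clear_at = clear_at
        self.firing = False

    def observe(self, value):
        """Return 'fire', 'clear', or None for one new sample."""
        if not self.firing and value >= self.trigger_at:
            self.firing = True
            return "fire"
        if self.firing and value <= self.clear_at:
            self.firing = False
            return "clear"
        return None


# Disk usage (percent) hovering around a 90% threshold.
alert = StatefulAlert(trigger_at=90, clear_at=80)
events = [alert.observe(v) for v in [89, 91, 89, 92, 88, 79, 91]]
```

Here the readings 89, 92, and 88 that follow the first alert produce no new notifications, because the alert only returns to normal once usage drops to 80% or below.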

Don't filter your email alerts

This is sage advice: if you need to set a filter on your alert emails, it means they're happening too frequently. Alerts should land PLOP in the center of your inbox when they happen.

Monitor deltas

Rather than, or in addition to, implementing threshold alerts, you may also want to monitor deltas. For example, you might monitor disk usage and alert when it exceeds 90%, but you may also monitor disk usage and alert when it changes by more than X% over Y minutes.
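A delta alert of this kind could look something like the sketch below. The class name and parameters are hypothetical, and comparing each new value against the oldest sample still inside the window is just one reasonable reading of "changes by more than X over Y minutes".

```python
from collections import deque


class DeltaMonitor:
    """Flag a value that changes by more than max_change within a time window."""

    def __init__(self, max_change, window_seconds):
        self.max_change = max_change
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def observe(self, timestamp, value):
        """Record a sample; return True if the change over the window is too large."""
        self.samples.append((timestamp, value))
        # Drop samples that have aged out of the window.
        while self.samples[0][0] < timestamp - self.window_seconds:
            self.samples.popleft()
        oldest_value = self.samples[0][1]
        return abs(value - oldest_value) > self.max_change


# Disk usage in percent; alert on a change of more than 10 points in 10 minutes.
monitor = DeltaMonitor(max_change=10.0, window_seconds=600)
results = [
    monitor.observe(0, 50.0),
    monitor.observe(300, 55.0),
    monitor.observe(540, 62.0),   # 12 points above the sample at t=0
    monitor.observe(1200, 63.0),  # earlier samples have aged out
]
```

Note that usage never comes near a 90% threshold in this example, yet the rapid growth is still caught.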

Provide details

The alert is the entry point for identifying and responding to the problem, so make sure you provide enough detail with the alert to jump-start the troubleshooting process. Include details like:

The machine the fault was detected on
The machine that detected the fault (if different)
Name of alert
Duration of alert
Link or reference to where current state of this element can be seen/monitored
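One way to make sure every alert carries these details is to give alerts a fixed structure. The sketch below is illustrative only: the field names, the host names, and the status URL are all made up for the example.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class Alert:
    name: str
    faulty_host: str                   # machine the fault was detected on
    duration: timedelta                # how long the alert has been active
    status_url: str                    # where the current state can be seen
    detected_by: Optional[str] = None  # detecting machine, if different

    def render(self):
        """Format the alert as a plain-text message body."""
        detector = self.detected_by or self.faulty_host
        return "\n".join([
            f"Alert: {self.name}",
            f"Fault on: {self.faulty_host}",
            f"Detected by: {detector}",
            f"Duration: {self.duration}",
            f"Current state: {self.status_url}",
        ])


# Hypothetical alert; the hosts and URL are placeholders.
message = Alert(
    name="disk usage above 90%",
    faulty_host="db01",
    duration=timedelta(minutes=12),
    status_url="https://monitor.example.com/hosts/db01/disk",
    detected_by="probe02",
).render()
```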

Escalation policy

Sometimes an alert is triggered, but it doesn't go to the right person, or the person who receives the alert is not equipped to solve the problem. There should be a policy in place to determine the chain of command: who gets notified of what kinds of alerts, and when.

Parent-Child

In the context of network monitoring, a parent-child relationship (set up manually for devices) tells the monitoring software which entities depend on which, and creates a chain of authority for alerts.

For example, suppose that there is a router that connects a handful of servers running virtual appliances. In this case, the router is the "parent" and the servers and virtual appliances are the "children".

If that particular router goes down, everything behind it (servers, virtual appliances, etc.) will also appear to be down. The alert system should be smart enough to identify that the real problem is with the parent, and not with any of the children. Alerts related to the children should be suppressed if there is an existing alert about the parent.

Upstream verification is a process by which the network monitoring tool checks each upstream parent of a given device before the device is marked as down and an alert created.
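The suppression logic can be sketched roughly as below. This is an assumption-laden illustration rather than how any specific monitoring tool does it: the function names and device names are hypothetical, each device is assumed to have at most one parent, and the parent map is assumed to contain no cycles.

```python
def root_cause(device, parents, down):
    """Walk upstream from a down device; return the highest ancestor that is also down."""
    cause = device
    current = parents.get(device)
    while current is not None:
        if current in down:
            cause = current
        current = parents.get(current)
    return cause


def alerts_to_send(parents, down):
    """Collapse all down devices into the set of root causes worth alerting on."""
    return sorted({root_cause(device, parents, down) for device in down})


# Child -> parent relationships, set up manually (hypothetical names).
parents = {
    "server-a": "router-1",
    "server-b": "router-1",
    "vm-1": "server-a",
}

router_outage = alerts_to_send(parents, {"router-1", "server-a", "server-b", "vm-1"})
server_outage = alerts_to_send(parents, {"server-a", "vm-1"})
```

In the first case only the router produces an alert; in the second, the router is up, so the alert is raised for server-a and the alert for its virtual machine is suppressed.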

Event Correlation

Once you've gathered a bunch of network data, it's important to utilize it! This leads to the much deeper dive of how you actually analyze your network data. The event correlation component of the network monitor should utilize multiple network alerts to identify patterns.

On event X, look for event Y
On event X, wait Y minutes and look for event Z
If an event X occurs multiple times, suppress duplicate alerts
Alert only when the event occurs X times
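The last two rules (suppress duplicates, and alert only after X occurrences) can be combined in a small sketch like the one below. The class name, the event key, and the idea of counting within a sliding time window are assumptions made for the example.

```python
from collections import defaultdict, deque


class RepeatCorrelator:
    """Alert once per burst, when an event occurs min_count times within a window."""

    def __init__(self, min_count, window_seconds):
        self.min_count = min_count
        self.window_seconds = window_seconds
        self.seen = defaultdict(deque)  # event key -> recent timestamps
        self.alerted = set()            # keys with an alert already raised

    def observe(self, key, timestamp):
        """Record one event; return True if an alert should be raised now."""
        times = self.seen[key]
        times.append(timestamp)
        # Forget occurrences that have aged out of the window.
        while times[0] < timestamp - self.window_seconds:
            times.popleft()
        if len(times) < self.min_count:
            self.alerted.discard(key)  # burst is over; allow a future alert
            return False
        if key in self.alerted:
            return False  # duplicate of an alert already raised
        self.alerted.add(key)
        return True


# Alert when a (hypothetical) "link-flap" event occurs 3 times within 60 seconds.
correlator = RepeatCorrelator(min_count=3, window_seconds=60)
decisions = [correlator.observe("link-flap", t) for t in [0, 10, 20, 30, 200]]
```

The third occurrence raises the alert, the fourth is suppressed as a duplicate, and the isolated event at t=200 is below the count and raises nothing.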

View traffic from application endpoint

Users don't care about network components; they care about whatever they're using the network to do. Measure network performance and traffic as close to the user endpoint as possible, possibly using techniques like packet inspection.
