From charlesreid1

Ten best practices for network monitoring, the short list:

  1. Establish baseline behavior
  2. Perform network inventory
  3. Avoid network alert sawtoothing/flapping
  4. Don't filter your email alerts
  5. Monitor deltas
  6. Provide details
  7. Escalation policy
  8. Parent-Child
  9. Event Correlation
  10. View traffic from application endpoint

Establish baseline behavior

A network baseline gives you a sense of how the network performs normally, which is what lets you recognize abnormal behavior later. (Note that Bro, since renamed Zeek, can be used for network baselining, even though it is designed as an intrusion detection system, not as a network monitoring tool.)
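As a rough illustration of what a baseline can look like in practice, the sketch below summarizes historical samples of one metric as a mean and standard deviation, then flags new values that fall far outside that range. The function names, the sample data, and the three-standard-deviation cutoff are all assumptions made for the example, not something a particular tool prescribes.

```python
import statistics

def build_baseline(samples):
    """Summarize historical samples as (mean, standard deviation)."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_anomalous(value, baseline, num_stdevs=3.0):
    """True if value is more than num_stdevs standard deviations from the mean."""
    mean, stdev = baseline
    return abs(value - mean) > num_stdevs * stdev

# Hypothetical throughput samples (Mbps) collected during normal operation
history = [98, 102, 101, 97, 100, 103, 99]
baseline = build_baseline(history)

print(is_anomalous(100, baseline))  # a typical value
print(is_anomalous(250, baseline))  # far outside the usual range
```

A real baseline would usually be kept per time of day or day of week, since "normal" traffic at 3 AM differs from normal traffic at noon.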

Perform network inventory

Keep an inventory of devices on the network:

  • Network devices
  • Ports
  • Interfaces being used for network connections
  • Network hardware (links, switches, controllers, power supplies)
  • Servers
  • Virtual machines
  • SAN devices

If you don't know what's on your network, you can't monitor it very well!

Avoid network alert sawtoothing

Alert sawtoothing (also called flapping) is where an element's numerical value hovers right around the threshold, causing the alert to be triggered over and over. This is a sign that the threshold, or the way the alert is triggered, needs to be changed.

Options:

  • Once a single alert is triggered, silence that alert for a given window of time
  • Add a delay before the alert is triggered
  • Add a "state" to each alert, and don't re-trigger alerts until the state of the alert has been returned to normal
  • Two-way communication with ticket system or alert management system
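The "state" option above can be sketched as follows. This is a minimal example, not any particular monitoring tool's behavior: the class name is hypothetical, and the use of a separate, lower threshold for returning to normal is an added assumption that keeps a value hovering near the trigger point from clearing and re-firing the alert.

```python
class StatefulAlert:
    """Fires once when a value crosses trigger_at, then stays quiet
    until the value has dropped back to clear_at or below."""

    def __init__(self, trigger_at, clear_at):
        self.trigger_at = trigger_at
        self.clear_at = clear_at
        self.active = False

    def update(self, value):
        """Return True only on the sample where the alert fires."""
        if not self.active and value >= self.trigger_at:
            self.active = True
            return True
        if self.active and value <= self.clear_at:
            self.active = False  # back to normal; the alert may fire again later
        return False

alert = StatefulAlert(trigger_at=90, clear_at=80)
readings = [88, 91, 89, 92, 90, 79, 93]
fired = [alert.update(v) for v in readings]
print(fired)
```

With a plain "value >= 90" check, the readings 91, 92, 90, and 93 would each produce an alert. Here only 91 and 93 do, because the value has to fall to 80 or below (the 79 reading) before the alert is re-armed.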

Don't filter your email alerts

This is sage advice: if you need to set a filter on your alert emails, it means they're happening too frequently. Alerts should land PLOP in the center of your inbox when they happen.

Monitor deltas

Rather than, or in addition to, implementing threshold alerts, you may also want to monitor deltas. For example, you might monitor disk usage and alert when it exceeds 90%, but you may also monitor disk usage and alert when it changes by more than X% over Y minutes.
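A delta alert of the "more than X% over Y minutes" kind might look like the sketch below. The class and parameter names are hypothetical, and comparing the newest sample against the oldest sample still inside the window is just one simple way to define the change.

```python
from collections import deque

class DeltaMonitor:
    """Flags a metric that changes by more than max_change within window_seconds."""

    def __init__(self, max_change, window_seconds):
        self.max_change = max_change
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        """Record a sample; return True if the change over the window is too large."""
        self.samples.append((timestamp, value))
        # Drop samples that have aged out of the window
        while self.samples[0][0] < timestamp - self.window_seconds:
            self.samples.popleft()
        oldest_value = self.samples[0][1]
        return abs(value - oldest_value) > self.max_change

# Disk usage in percent, sampled at the given times (seconds)
monitor = DeltaMonitor(max_change=5.0, window_seconds=600)
results = [
    monitor.add(0, 70.0),
    monitor.add(300, 72.0),
    monitor.add(600, 78.0),   # 8 points higher than 10 minutes earlier
    monitor.add(1500, 79.0),  # earlier samples have aged out of the window
]
print(results)
```

Note that 78% and 79% never come close to a 90% threshold, yet the jump from 70% to 78% in ten minutes is exactly the kind of change worth hearing about early.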

Provide details

The alert is the entry point for identifying and responding to the problem, so make sure you provide enough detail with the alert to jump-start the troubleshooting process. Include details like:

  • The machine the fault was detected on
  • The machine that detected the fault (if different)
  • Name of alert
  • Duration of alert
  • Link or reference to where current state of this element can be seen/monitored
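One way to make sure every alert carries these details is to give alerts a fixed structure. The sketch below is only an illustration: the field names, host names, and the URL are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class AlertDetails:
    name: str            # name of the alert
    faulty_host: str     # machine the fault was detected on
    detected_by: str     # machine that detected the fault
    duration: timedelta  # how long the alert has been active
    status_url: str      # where the current state can be seen

    def render(self):
        """Format the alert as the body of a notification message."""
        lines = [
            f"Alert: {self.name}",
            f"Fault detected on: {self.faulty_host}",
            f"Detected by: {self.detected_by}",
            f"Duration: {self.duration}",
            f"Current state: {self.status_url}",
        ]
        return "\n".join(lines)

alert = AlertDetails(
    name="disk usage above 90%",
    faulty_host="db01",
    detected_by="monitor01",
    duration=timedelta(minutes=12),
    status_url="https://monitor.example.com/hosts/db01",  # placeholder URL
)
message = alert.render()
print(message)
```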

Escalation policy

Sometimes an alert is triggered, but it doesn't go to the right person, or the person who receives the alert is not equipped to solve the problem. There should be a policy in place to determine the chain of command: who gets notified of what kind of alerts, and when.
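An escalation policy can be written down as plain data, which makes it easy to review and to change. The following is a minimal sketch; the severity levels, roles, and delays are invented for illustration.

```python
# Hypothetical policy: for each severity, who is notified after how many
# minutes without an acknowledgement
ESCALATION_POLICY = {
    "critical": [(0, "on-call engineer"), (15, "team lead"), (60, "operations manager")],
    "warning": [(0, "on-call engineer"), (120, "team lead")],
}

def contacts_to_notify(severity, minutes_unacknowledged):
    """Return everyone who should have been notified by now."""
    steps = ESCALATION_POLICY[severity]
    return [contact for delay, contact in steps if minutes_unacknowledged >= delay]

print(contacts_to_notify("critical", 20))
print(contacts_to_notify("warning", 20))
```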

Parent-Child

In the context of network monitoring, a parent-child relationship (set up manually for devices) tells the monitoring software which entities depend on which, and creates a chain of authority for alerts.

For example, suppose that there is a router that connects a handful of servers running virtual appliances. In this case, the router is the "parent" and the servers and virtual appliances are the "children".

If that particular router goes down, everything else (servers, virtual appliances, etc.) will also go down. The alert system should be smart enough to identify that the real problem is with the parent, and not with any of the children. Alerts related to the children should be suppressed if there is an existing alert about the parent appliance.

Upstream verification is a process by which the network monitoring tool checks each upstream parent of a given device before the device is marked as down and an alert is created.
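The router example can be sketched in a few lines. This is a simplified illustration rather than how any particular monitoring tool implements it; the device names and the PARENTS mapping are hypothetical, and each device is assumed to have at most one parent.

```python
# Hypothetical topology: child -> parent (None marks the top of the chain)
PARENTS = {
    "router1": None,
    "server1": "router1",
    "server2": "router1",
    "vm1": "server1",
}

def root_cause(device, down):
    """Walk up the parent chain and return the highest ancestor that is down,
    or the device itself if all of its ancestors are up."""
    cause = device
    parent = PARENTS[device]
    while parent is not None:
        if parent in down:
            cause = parent
        parent = PARENTS[parent]
    return cause

def alerts_to_send(down):
    """Alert only on down devices that have no down ancestor."""
    return sorted(d for d in down if root_cause(d, down) == d)

# The router fails, taking everything behind it down as well
print(alerts_to_send({"router1", "server1", "server2", "vm1"}))
# Only one server fails, along with the virtual machine it hosts
print(alerts_to_send({"server1", "vm1"}))
```

In both cases a single alert is sent, for the device highest in the chain that is actually down, instead of one alert per unreachable child.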

Event Correlation

Once you've gathered a bunch of network data, it's important to utilize it! This leads to the much deeper dive of how you actually analyze your network data. The event correlation component of the network monitor should utilize multiple network alerts to identify patterns.

  • On event X, look for event Y
  • On event X, wait Y minutes and look for event Z
  • If an event X occurs multiple times, suppress duplicate alerts
  • Alert only once an event has occurred X times
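The last two rules in the list (suppress duplicates, and alert only after X occurrences) can be combined in one small correlator. The sketch below is an assumption-laden example: the class name, the event name, and the choice to count occurrences inside a sliding time window are all illustrative.

```python
from collections import defaultdict, deque

class RepeatCorrelator:
    """Raises one alert when an event occurs min_count times within window_seconds."""

    def __init__(self, min_count, window_seconds):
        self.min_count = min_count
        self.window_seconds = window_seconds
        self.seen = defaultdict(deque)  # event name -> recent timestamps

    def observe(self, event, timestamp):
        """Record an event; return True if it should now raise an alert."""
        times = self.seen[event]
        times.append(timestamp)
        # Forget occurrences that fall outside the window
        while times[0] < timestamp - self.window_seconds:
            times.popleft()
        if len(times) >= self.min_count:
            times.clear()  # reset so the same burst does not alert again
            return True
        return False

correlator = RepeatCorrelator(min_count=3, window_seconds=60)
results = [correlator.observe("link_down", t) for t in (0, 10, 20, 25, 200)]
print(results)
```

Only the third occurrence raises an alert. The occurrence at 25 seconds starts a new count, and by 200 seconds it has aged out of the window.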

View traffic from application endpoint

Users don't care about network components; they care about whatever they're using the network to do. Measure network performance and traffic as close to the user endpoint as possible, possibly using techniques like packet inspection (?).
