From charlesreid1

Basics

Terminology

Element - the fundamental unit of network monitoring. An element consists of a single metric that is being monitored. There are usually hundreds or thousands of elements in a given network.

Acquisition - the process of actually obtaining the observational data from the element

Frequency - related to acquisition: how often does data arrive? What kind of data is being sent, and under what conditions?

Data warehousing - depending on the size of the network and the amount of data, you can end up with a big storage problem on your hands. For purposes of monitoring, you may decide not to store the data at all, you may decide to keep it for a short amount of time, or you may decide to archive it somewhere.

Threshold value - this gets into the "what" part of your monitoring. What, exactly, are you monitoring, and what is the value of the element that will trigger an alert? (What constitutes an emergency?)

Reset value - opposite of threshold, what is the value of the element that will un-trigger an alert and signify the "all clear"?

Threshold response - what is the response when a threshold is reached and an alert is triggered?

Requester - the entity that is requesting the monitoring data, and where it lives (may be on-board the machine, or may be a networked data store)
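The relationship between a threshold value and a reset value can be sketched in a few lines of Python. This is only an illustration under assumed numbers: the class name ThresholdAlert and the CPU values of 90 and 75 are hypothetical, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class ThresholdAlert:
    """Tracks one element with a separate trigger value and reset value."""
    threshold: float      # value at or above which the alert fires
    reset: float          # value at or below which the alert clears
    active: bool = False  # whether the alert is currently raised

    def update(self, value: float) -> str:
        """Feed one observation; return 'trigger', 'clear', or 'no change'."""
        if not self.active and value >= self.threshold:
            self.active = True
            return "trigger"
        if self.active and value <= self.reset:
            self.active = False
            return "clear"
        return "no change"


# Hypothetical CPU load element: alert at 90%, all clear at 75%.
cpu = ThresholdAlert(threshold=90.0, reset=75.0)
events = [cpu.update(v) for v in (50, 92, 85, 91, 70)]
print(events)
# ['no change', 'trigger', 'no change', 'no change', 'clear']
```

Keeping the reset value below the threshold means a reading that hovers around 90 does not trigger and clear the alert over and over.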

List of Monitoring Tools

Cross-platform tools:

  • Ping - checks if a target machine is online/up and running, and how long it takes to reach the machine
  • SNMP - Simple Network Management Protocol, a protocol for querying devices for data about elements on a network
  • ICMP - Internet Control Message Protocol, used by routers and hosts to send error messages, such as reports of unreachable hosts (Ping is built on ICMP echo messages)
  • Syslog - of course, the system log is a useful place for data about what's happening on a particular machine and can yield data about elements
  • Other log files - programs will typically provide a way to log information to a log file, so this is another source of data about various elements on the network
  • Scripting - scripting is a flexible way to collect information, and allows for custom element data to be collected and sent off to the receiver
  • Flow - understanding the flow of traffic on a network (where it comes from, where it goes, and what kind of traffic it is) is important to understanding the network
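As a small example of combining Syslog and Scripting from the list above, the sketch below decodes the priority prefix that syslog messages carry on the wire. In the syslog protocol the number in angle brackets is facility * 8 + severity. The function name parse_priority and the sample message are made up for illustration.

```python
import re

# Severity names in syslog order, 0 (most severe) through 7.
SEVERITIES = ("emerg", "alert", "crit", "err",
              "warning", "notice", "info", "debug")


def parse_priority(line: str):
    """Return (facility, severity_name, rest) for a line starting with <PRI>."""
    match = re.match(r"<(\d{1,3})>(.*)", line)
    if match is None:
        raise ValueError("line has no <PRI> prefix")
    pri = int(match.group(1))
    # PRI = facility * 8 + severity, so divmod recovers both parts.
    facility, severity = divmod(pri, 8)
    return facility, SEVERITIES[severity], match.group(2)


# Hypothetical message; 34 = facility 4 * 8 + severity 2.
print(parse_priority("<34>host1 su: authentication failure"))
# (4, 'crit', 'host1 su: authentication failure')
```

A script like this could forward only messages at or above a chosen severity to the receiver.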

Platform-specific tools:

  • (cisco) IP SLA - Internet Protocol Service Level Agreement, a feature found on Cisco routers that generates test traffic and measures how well the WAN is performing
  • (windoze) WMI - Windows Management Instrumentation is a Windows interface for collecting information about a target system, usually queried from scripts
  • (windoze) PerfMon - performance monitor that gives information about the machine's current state, as well as information about errors
  • (windoze) Event log - the event log in Windows is the equivalent of the syslog, recording everything happening onboard the machine

What To Monitor

Let's cover what you actually want to monitor on the network.

Availability, Faults, Performance

Three important things to measure for each element:

Availability - is an element online and responding to requests, or is it offline and not responding?

Faults - is a given element functioning correctly? Have any failures been detected? (Can failures be detected?)

Performance - how well is the network performing? (throughput, utilization, response times, error rates)

The most useful tools for determining these metrics are Ping, SNMP, and ICMP, to measure:

  • Response time
  • Packet loss
  • CPU load
  • Memory utilization
  • Hardware status
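Response time and packet loss reduce to simple arithmetic once the probe results are collected. Here is a minimal sketch, assuming the round-trip times have already been gathered (for example by a Ping wrapper) into a list where None marks a lost probe. The function name summarize_probes and the sample values are hypothetical.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Summarize probe results.

    rtts_ms: list of round-trip times in milliseconds, None for a lost probe.
    """
    answered = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(answered)) / sent if sent else 0.0
    return {
        "sent": sent,
        "loss_pct": loss_pct,
        "avg_ms": mean(answered) if answered else None,
        "max_ms": max(answered) if answered else None,
        # Treat the element as available if at least one probe was answered.
        "available": bool(answered),
    }


summary = summarize_probes([12.0, 14.0, None, 10.0])
print(summary)
# {'sent': 4, 'loss_pct': 25.0, 'avg_ms': 12.0, 'max_ms': 14.0, 'available': True}
```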

If you can't access each of the networks between you and your target, you can't directly measure availability, faults, or performance along the whole path. In this case, use IP SLA to simulate traffic between two networks and measure the performance of the connection. (This is particularly useful for things like audio or video, which tend to be more sensitive to network routes and delays.)

Address Space Monitoring

With thousands of IP addresses being assigned on a network, it's important to keep track of what IP addresses have already been used and when subnets are full. It is also important to identify when a network component (e.g., DHCP or DNS) is misconfigured.
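Checking how full a subnet is can be sketched with Python's standard ipaddress module. This assumes you already have a list of assigned addresses (from a DHCP lease export, for instance). The function name subnet_utilization and the 192.0.2.0/29 documentation range are illustrative only, and building a set of every host is only sensible for small subnets.

```python
import ipaddress


def subnet_utilization(cidr, assigned):
    """Report how many usable addresses in a subnet are already assigned."""
    network = ipaddress.ip_network(cidr)
    usable = set(network.hosts())  # fine for small subnets only
    used = {ipaddress.ip_address(a) for a in assigned}
    in_use = used & usable
    # Assigned addresses that are not usable hosts in this subnet
    # may point to a misconfigured DHCP scope or static entry.
    outside = sorted(used - usable)
    return {
        "usable": len(usable),
        "used": len(in_use),
        "percent_full": 100.0 * len(in_use) / len(usable),
        "outside": [str(a) for a in outside],
    }


report = subnet_utilization(
    "192.0.2.0/29",
    ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.9"],
)
print(report)
# {'usable': 6, 'used': 3, 'percent_full': 50.0, 'outside': ['192.0.2.9']}
```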

Different tools are useful for monitoring different types of elements. Figuring out what tools you need will depend on your role monitoring the network. If you are a one-man band, you need something that can do everything and keeps it simple. If you're monitoring something specific (e.g., virtual containers), you can go for specialized/expensive tools that help you monitor that one particular thing deeply.

Things to consider:

  • What is the (one-time) purchase cost?
  • What is the (ongoing) maintenance cost?
  • What is the support cost?
  • What is the customization cost?

DART Framework

SolarWinds recommends using a DART framework, which stands for:

  • Discovery
  • Alerting
  • Remediation
  • Troubleshooting

DART Explanation

Discovery

Discovery consists of finding out what is happening. What is the health of the network? Where are the problems? Where are the potential/future points of failure?

The discovery process helps you:

  • Identify all of your assets and find out if they are connected
  • Provide a network baseline for network performance
  • Gather data needed to compute network performance/efficiency statistics
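One simple way to express a network baseline is as the mean and standard deviation of past measurements, with new readings flagged when they fall too far from the mean. This is a sketch of that idea, not a method the source prescribes. The function names, the sample history, and the factor of 3 standard deviations are all assumptions.

```python
from statistics import mean, stdev


def build_baseline(samples):
    """Return (mean, sample standard deviation) for historical readings."""
    return mean(samples), stdev(samples)


def deviates(value, baseline, k=3.0):
    """True if value is more than k standard deviations from the baseline mean."""
    avg, sd = baseline
    return abs(value - avg) > k * sd


# Hypothetical response times in milliseconds gathered during discovery.
history = [20.0, 22.0, 21.0, 19.0, 23.0]
baseline = build_baseline(history)

print(deviates(22.0, baseline))  # False
print(deviates(40.0, baseline))  # True
```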

Alerting

Alerting is a notification that something has gone wrong or is broken. It is important to correctly calibrate your alerts and thresholds so your network isn't crying wolf and the person doing the monitoring doesn't get totally overwhelmed. Alerts are an indication that something has failed or is about to fail.

Alert lifecycle:

  • Creation - decide what elements are most important to monitor, and set up alerts by picking an appropriate threshold
  • Handling/Routing - create a meaningful notification trigger, send it out via email/sms/etc
  • Feedback - modify and update alerts if they are too noisy or not noisy enough.
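The routing and feedback steps of the lifecycle above can be sketched as follows. The routing table, the alert dictionary fields, and the limit of three alerts per window are hypothetical choices made for illustration. A real setup would send the notification rather than just return a channel name.

```python
from collections import Counter

# Hypothetical routing table: severity -> notification channel.
ROUTES = {"critical": "sms", "warning": "email"}


def route(alert):
    """Pick a channel for an alert dict; unknown severities go to a log."""
    return ROUTES.get(alert["severity"], "log")


def noisy_elements(alerts, max_per_window=3):
    """Feedback step: elements that fired more than max_per_window times."""
    counts = Counter(a["element"] for a in alerts)
    return sorted(e for e, n in counts.items() if n > max_per_window)


# Hypothetical alerts collected over one review window.
alerts = (
    [{"element": "router1-cpu", "severity": "warning"}] * 5
    + [{"element": "db1-disk", "severity": "critical"}]
)

print(route(alerts[-1]))       # sms
print(noisy_elements(alerts))  # ['router1-cpu']
```

An element that shows up in the noisy list is a candidate for a higher threshold or a wider gap between threshold and reset value.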

Remediation

Remediation is fixing the problem that you were alerted about. This is where time is of the essence - every moment a network is down is another moment of lost revenue and lost opportunities.

"As an IT administrator, you get paid for performance, but you keep your job with recovery." - Thomas LaRock

Three-step procedure for remediation:

  • Stop - take a step back, assess, come up with a game plan to resolve the issue
  • Drop - drop everything and focus on the problem. Remove all non-essential users from the system in question.
  • Roll - roll out/implement the game plan, and follow up with more monitoring to ensure problem is solved

Troubleshooting

Troubleshooting is the flip side of remediation - you're usually digging deeper into the issue that you just (temporarily?) fixed, and identifying the root cause of what went wrong. For example, you may have recovered from a nearly full hard drive by clearing non-essential junk off the drive (remediation), but troubleshooting is identifying where the non-essential junk is coming from.
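For the full-drive example, a first troubleshooting step might be to rank directories by size to see where the junk accumulates. The sketch below uses only the Python standard library. The function name directory_sizes is hypothetical, and the approach (summing file sizes under each immediate subdirectory) is just one reasonable choice.

```python
import os


def directory_sizes(root):
    """Return (size_in_bytes, path) per immediate subdirectory, largest first."""
    results = []
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            total = 0
            for dirpath, _dirs, files in os.walk(entry.path):
                for name in files:
                    path = os.path.join(dirpath, name)
                    try:
                        # Skip symlinks so linked files are not counted twice.
                        if not os.path.islink(path):
                            total += os.path.getsize(path)
                    except OSError:
                        # File vanished or is unreadable; ignore it.
                        pass
            results.append((total, entry.path))
    return sorted(results, reverse=True)
```

Calling directory_sizes on the mount point of the affected drive and looking at the first few entries, then repeating on the largest one, narrows down which program is writing the data.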

Here is an eight-step procedure for troubleshooting:

  1. Define the problem
  2. Gather information about the problem
  3. Construct a hypothesis
  4. Devise a game plan for a solution based on the hypothesis
  5. Implement the plan
  6. Observe the results
  7. Repeat steps 2 through 6
  8. Document your solution

Ten Best Practices

References

O'Reilly books:

Network Troubleshooting Tools: https://docstore.mik.ua/orelly/networking_2ndEd/tshoot/index.htm

TCP/IP Network Administration: https://docstore.mik.ua/orelly/networking_2ndEd/tcp/index.htm
