Network Monitoring
From charlesreid1
Basics
Terminology
Element - the fundamental unit of network monitoring. An element is a single metric that is being monitored, and a given network usually has hundreds or thousands of them.
Acquisition - the process of actually obtaining the observational data from the element
Frequency - related to acquisition: how often does data arrive? What kind of data is being sent, and under what conditions?
Data warehousing - depending on the size of the network and the amount of data, you can end up with a big storage problem on your hands. For purposes of monitoring, you may decide not to store the data at all, you may decide to keep it for a short amount of time, or you may decide to archive it somewhere.
Threshold value - this gets into the "what" part of your monitoring. What, exactly, are you monitoring, and what is the value of the element that will trigger an alert? (What constitutes an emergency?)
Reset value - opposite of threshold, what is the value of the element that will un-trigger an alert and signify the "all clear"?
Threshold response - what is the response when a threshold is reached and an alert is triggered?
Requester - the entity that is requesting the monitoring data, and where it lives (may be on-board the machine, or may be a networked data store)
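The relationship between the threshold value and the reset value can be sketched in a few lines of Python. This is only an illustration: the class name `ThresholdAlert`, the CPU-load element, and the 90/70 values are assumptions made up for the example, and it assumes an element where higher values are worse.

```python
from dataclasses import dataclass


@dataclass
class ThresholdAlert:
    """Tracks alert state for one element using a threshold and a reset value."""
    threshold: float      # value at or above which the alert triggers
    reset: float          # value at or below which the alert clears
    active: bool = False  # whether the alert is currently triggered

    def update(self, value: float) -> str:
        """Feed one new sample; return 'triggered', 'cleared', or 'unchanged'."""
        if not self.active and value >= self.threshold:
            self.active = True
            return "triggered"
        if self.active and value <= self.reset:
            self.active = False
            return "cleared"
        return "unchanged"


# Hypothetical CPU-load element: alert at 90%, all clear at 70%.
cpu_alert = ThresholdAlert(threshold=90.0, reset=70.0)
events = [cpu_alert.update(v) for v in (50, 92, 85, 95, 69, 40)]
print(events)
```

Keeping the reset value below the threshold means a metric hovering around 90% does not trigger and clear the alert over and over.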
List of Monitoring Tools
Cross-platform tools:
- Ping - checks if a target machine is online/up and running, and how long it takes to reach the machine
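- SNMP - Simple Network Management Protocol, a protocol for querying devices (and receiving notifications from them) that can supply data about elements on a network
- ICMP - Internet Control Message Protocol, used by routers and other network devices to send error messages, such as reports of unreachable hosts (Ping is built on ICMP echo messages)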
- Syslog - of course, the system log is a useful place for data about what's happening on a particular machine and can yield data about elements
- Other log files - programs will typically provide a way to log information to a log file, so this is another source of data about various elements on the network
- Scripting - scripts are a flexible way to collect information, and allow custom element data to be collected and sent off to the requester
- Flow - understanding the flow of traffic on a network (where it comes from, where it goes, and what kind of traffic it is) is important to understanding the network
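As one example of turning a tool from this list into element data, here is a minimal Python sketch that decodes the priority field of a syslog message. It assumes the messages arrive in wire format with a leading `<PRI>` field (priority = facility * 8 + severity, as in RFC 3164 and RFC 5424); the function name `parse_pri` and the sample line are made up for illustration.

```python
import re

# Severity names in numeric order (0 = most severe).
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]

PRI_PATTERN = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def parse_pri(line: str):
    """Split a raw syslog line into (facility, severity_name, rest)."""
    match = PRI_PATTERN.match(line)
    if match is None:
        raise ValueError("line does not start with a <PRI> field")
    pri = int(match.group(1))
    if pri > 191:  # highest valid value: facility 23, severity 7
        raise ValueError("PRI value out of range")
    facility, severity = divmod(pri, 8)  # PRI = facility * 8 + severity
    return facility, SEVERITIES[severity], match.group(2)


# Made-up sample line in the classic BSD syslog layout.
sample = "<34>Oct 11 22:14:15 router1 su: 'su root' failed for user on /dev/pts/8"
facility, severity, rest = parse_pri(sample)
print(facility, severity)
```

The severity name is a natural candidate for an element: for instance, counting `crit` or worse messages per host per hour.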
Platform-specific tools:
- (cisco) IP SLA - Internet Protocol Service Level Agreements, a feature found onboard Cisco routers that generates test traffic to measure how well a connection (e.g., a WAN link) is performing
- (windoze) WMI - Windows Management Instrumentation is a Windows interface, usable from scripts, for collecting information about a target system
- (windoze) PerfMon - performance monitor that gives information about the machine's current state, as well as information about errors
- (windoze) Event log - the event log in Windows is the equivalent of the syslog, recording everything happening onboard the machine
What To Monitor
Let's cover what you actually want to monitor on the network.
Availability, Faults, Performance
Three important things to measure for each element:
Availability - is an element online/responding to requests? or is it offline/not responding?
Faults - is a given element functioning correctly? have any failures been detected? (Can failures be detected?)
Performance - how well is the network performing? (throughput, utilization, response times, error rates)
The most useful tools for determining these metrics are Ping, SNMP, and ICMP, to measure:
- Response time
- Packet loss
- CPU load
- Memory utilization
- Hardware status
If you can't access each of the networks between you and your target (for example, when traffic crosses a provider's WAN), you cannot measure availability, faults, or performance on those intermediate networks directly. In this case, use IP SLA to simulate traffic between two networks and measure the performance of the connection. This is particularly useful for things like audio or video, which tend to be more sensitive to network routes and delays.
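Whichever tool produces the probes, the response time and packet loss figures come down to simple arithmetic over the results. The Python sketch below assumes the probe results have already been collected as a list of round-trip times in milliseconds, with `None` marking a probe that got no reply; the function name `summarize_probes` and the sample numbers are made up.

```python
from statistics import mean


def summarize_probes(rtts_ms):
    """Summarize probe results; each entry is a round-trip time in ms, or None if lost."""
    answered = [rtt for rtt in rtts_ms if rtt is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(answered)) / sent if sent else 0.0
    return {
        "sent": sent,
        "received": len(answered),
        "packet_loss_pct": loss_pct,
        "avg_rtt_ms": mean(answered) if answered else None,
        "max_rtt_ms": max(answered) if answered else None,
    }


# Made-up results for five probes; the third probe got no reply.
summary = summarize_probes([12.0, 14.0, None, 13.0, 21.0])
print(summary)
```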
Address Space Monitoring
With thousands of IP addresses being assigned on a network, it's important to keep track of what IP addresses have already been used and when subnets are full. It is also important to identify when a network component (e.g., DHCP or DNS) is misconfigured.
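A rough way to tell when a subnet is filling up is to compare the assigned addresses against the subnet's usable host range. The sketch below uses Python's standard `ipaddress` module; the function name `subnet_utilization`, the example subnet, and the list of assigned addresses are assumptions for illustration (in practice the list might come from a DHCP lease table or an inventory).

```python
import ipaddress


def subnet_utilization(cidr, assigned):
    """Return (used, usable, percent_used) for the addresses assigned inside cidr."""
    network = ipaddress.ip_network(cidr)
    usable = sum(1 for _ in network.hosts())  # excludes network/broadcast addresses
    # Deduplicate, and ignore addresses that belong to other subnets.
    used = len({ip for ip in map(ipaddress.ip_address, assigned) if ip in network})
    return used, usable, 100.0 * used / usable


# Made-up assignments; 10.0.1.5 lies outside the subnet and is not counted.
assigned = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.1.5"]
used, usable, percent = subnet_utilization("10.0.0.0/29", assigned)
print(used, usable, percent)
```

Counting hosts by iteration is fine for small subnets; for very large ranges, computing the count from the prefix length would be faster.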
Different tools are useful for monitoring different types of elements. Figuring out what tools you need will depend on your role in monitoring the network. If you are a one-man band, you need something that can do everything and keeps it simple. If you're monitoring something specific (e.g., virtual containers), you can go for specialized/expensive tools that help you monitor that one particular thing deeply.
Things to consider:
- What is the (one-time) purchase cost?
- What is the (ongoing) maintenance cost?
- What is the support cost?
- What is the customization cost?
DART Framework
SolarWinds recommends using a DART framework, which stands for:
- Discovery
- Alerting
- Remediation
- Troubleshooting
DART Explanation
Discovery
Discovery consists of finding out what is happening. What is the health of the network? Where are the problems? Where are the potential/future points of failure?
The discovery process helps you:
- Identify all of your assets and find out if they are connected
- Provide a network baseline for network performance
- Gather data needed to compute network performance/efficiency statistics
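One simple way to turn discovery data into a baseline is to take the mean of samples gathered during normal operation and treat values far above it as unusual. The Python sketch below uses mean plus `k` standard deviations; that rule, the function name `baseline_threshold`, and the sample values are assumptions for illustration, not something prescribed by the DART framework.

```python
from statistics import mean, stdev


def baseline_threshold(samples, k=3.0):
    """Return (baseline_mean, threshold) where threshold = mean + k * stdev."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate spread")
    avg = mean(samples)
    return avg, avg + k * stdev(samples)


# Made-up response times (ms) collected during normal operation.
history = [10.0, 12.0, 11.0, 13.0, 14.0]
avg, threshold = baseline_threshold(history)
print(avg, round(threshold, 2))
```

A threshold derived this way is only a starting point; metrics with daily or weekly cycles usually need a baseline per time window.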
Alerting
Alerting is a notification that something has gone wrong or is broken. It is important to correctly calibrate your alerts and thresholds so your network isn't crying wolf and the person doing the monitoring doesn't get totally overwhelmed. Alerts are an indication that something has failed or is about to fail.
Alert lifecycle:
- Creation - decide what elements are most important to monitor, and set up alerts by picking an appropriate threshold
- Handling/Routing - create a meaningful notification trigger, send it out via email/sms/etc
- Feedback - modify and update alerts if they are too noisy or not noisy enough.
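The feedback step can start from something as simple as counting how often each alert fired. The Python sketch below flags alerts that fire more than a chosen number of times per day on average; the function name `noisy_alerts`, the alert names, and the limit of one per day are all made up for illustration.

```python
from collections import Counter


def noisy_alerts(fired, max_per_day, days):
    """Return alert names that fired more than max_per_day times per day on average."""
    counts = Counter(fired)
    return sorted(name for name, n in counts.items() if n / days > max_per_day)


# Made-up alert history for one week.
fired = ["disk_full"] * 2 + ["link_flap"] * 30 + ["cpu_high"] * 8
candidates = noisy_alerts(fired, max_per_day=1, days=7)
print(candidates)
```

Alerts on this list are candidates for a higher threshold, a lower reset value, or removal; alerts that never fire may deserve a second look too.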
Remediation
Remediation is fixing the problem that you were alerted about. This is where time is of the essence - every moment a network is down is another moment of lost revenue and lost opportunities.
"As an IT administrator, you get paid for performance, but you keep your job with recovery." - Thomas LaRock
Three-step procedure for remediation:
- Stop - take a step back, assess, come up with a game plan to resolve the issue
- Drop - drop everything and focus on the problem. Remove all non-essential users from the system in question.
- Roll - roll out/implement the game plan, and follow up with more monitoring to ensure problem is solved
Troubleshooting
Troubleshooting is the flip side of remediation: you're usually digging deeper into the issue that you just (temporarily?) fixed, and identifying the root cause of what went wrong. For example, you may have recovered from a nearly full hard drive by emptying out non-essential junk from the drive (remediation), but troubleshooting is identifying where the non-essential junk is coming from.
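For the full-drive example, a first troubleshooting step might be to see which directory is growing. The Python sketch below totals file sizes under each immediate subdirectory of a root; the function name `size_by_subdirectory` is made up, and the demo runs against a throwaway temporary directory rather than a real system.

```python
import os
import tempfile


def size_by_subdirectory(root):
    """Return {immediate child name: total bytes of the regular files beneath it}."""
    totals = {}
    for entry in os.scandir(root):
        if entry.is_dir(follow_symlinks=False):
            total = 0
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    if not os.path.islink(path):  # skip symlinks
                        total += os.path.getsize(path)
            totals[entry.name] = total
    return totals


# Demo on a temporary tree with made-up contents.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "logs", "old"))
    os.makedirs(os.path.join(root, "cache"))
    with open(os.path.join(root, "logs", "old", "app.log"), "wb") as f:
        f.write(b"x" * 5000)
    with open(os.path.join(root, "cache", "tmp.bin"), "wb") as f:
        f.write(b"x" * 100)
    totals = size_by_subdirectory(root)

largest = max(totals, key=totals.get)
print(largest, totals[largest])
```

Running the same measurement on a schedule and comparing the totals over time shows where the junk is accumulating, not just where it currently sits.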
Here is an eight-step procedure for troubleshooting:
1. Define the problem
2. Gather information about the problem
3. Construct a hypothesis
4. Devise a game plan for a solution based on the hypothesis
5. Implement the plan
6. Observe the results
7. Repeat steps 2 through 6
8. Document your solution
Ten Best Practices
References
O'Reilly books:
Network Troubleshooting Tools: https://docstore.mik.ua/orelly/networking_2ndEd/tshoot/index.htm
TCP/IP Network Administration: https://docstore.mik.ua/orelly/networking_2ndEd/tcp/index.htm