
Prometheus is a time series database backend. It pulls metrics from data sources like Netdata and feeds graphing front-ends like Grafana, making it a solid choice for the storage layer of a monitoring stack.

This page mostly follows the guide from the Netdata wiki on how to set up Netdata with Prometheus: [1]

Also very useful: https://www.digitalocean.com/community/tutorials/how-to-install-prometheus-on-ubuntu-16-04

What Is It For


Prometheus is designed for reliability, to be the system you go to during an outage to allow you to quickly diagnose problems. Each Prometheus server is standalone, not depending on network storage or other remote services. You can rely on it when other parts of your infrastructure are broken, and you do not need to set up extensive infrastructure to use it.

- https://prometheus.io/docs/introduction/overview/


Installing

Quick and Dirty

The following is the quick and dirty installation procedure. This will download the pre-built Linux binary and put it in /opt/prometheus:

$ curl -L 'https://github.com/prometheus/prometheus/releases/download/v1.7.1/prometheus-1.7.1.linux-amd64.tar.gz' -o /tmp/prometheus.tar.gz
$ mkdir /opt/prometheus
$ tar -xf /tmp/prometheus.tar.gz -C /opt/prometheus/ --strip-components 1
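
Once unpacked, Prometheus can be started directly from that directory. A minimal sketch for the 1.7.x binary, which ships with a sample prometheus.yml (note that 1.x uses single-dash flags; Prometheus 2.x renamed them to double-dash form):

$ cd /opt/prometheus
$ ./prometheus -config.file=prometheus.yml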

Thorough and Secure

For a more thorough and secure installation of Prometheus, we'll want to create dedicated service user accounts and set the ownership of all Prometheus files accordingly.

Link: https://www.digitalocean.com/community/tutorials/how-to-install-prometheus-on-ubuntu-16-04
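
The rough shape of that setup, adapted to the /opt/prometheus layout used above (a sketch only; the linked tutorial also covers systemd service files and the Node Exporter):

$ sudo useradd --no-create-home --shell /bin/false prometheus
$ sudo chown -R prometheus:prometheus /opt/prometheus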

End Result

Command line where Prometheus is running:

[Screenshot: PrometheusCLI.png]

Web console:

[Screenshot: PrometheusFirstVisit.png]

Securing

See Prometheus/Security

Using

Once Prometheus is running, point your browser to localhost:9090 (or, if you are on another machine, access port 9090 of the machine running Prometheus).

A word of warning: the Prometheus web portal is insecure by default (see the Securing section above).
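
On a headless server you can verify that Prometheus is up without a browser; the server exposes its own metrics on the same port:

$ curl -s http://localhost:9090/metrics | head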

Web Dashboard

Navigate to localhost:9090 to see the web dashboard. You can run queries and plot the results. Example:

[Screenshot: PrometheusPlot.png]

Scraping

The Prometheus configuration file has a scrape_configs section with one entry per service to scrape.

Config file is in /opt/prometheus/prometheus.yml

Here is an example with three machines (one local) running Netdata and being scraped for Netdata metrics (also see Netdata/Prometheus):

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'codelab-monitor'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first.rules"
  # - "second.rules"



scrape_configs:


  - job_name: 'netdata_jupiter'

    params:
      format:
      - prometheus

    scrape_interval: 15s
    scrape_timeout: 10s

    metrics_path: /api/v1/allmetrics
    scheme: http

    static_configs:
      - targets:
        - localhost:19999


  - job_name: 'netdata_basilisk'

    params:
      format:
      - prometheus

    scrape_interval: 15s
    scrape_timeout: 10s

    metrics_path: /api/v1/allmetrics
    scheme: http

    static_configs:
      - targets:
        - 192.168.25.137:19999


  - job_name: 'netdata_morpheus'

    params:
      format:
      - prometheus

    scrape_interval: 15s
    scrape_timeout: 10s

    metrics_path: /api/v1/allmetrics
    scheme: http

    static_configs:
      - targets:
        - 192.168.25.242:19999
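
After editing prometheus.yml, the running server can reload its configuration without a restart by catching a SIGHUP:

$ kill -HUP $(pgrep prometheus)

Then check the targets page in the web UI (under Status) to confirm all three Netdata endpoints are being scraped.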

Alerting

Promgen is a utility that provides a web interface for generating Prometheus configuration files for complicated alerting scenarios.

Link: https://github.com/line/promgen

Pushing for Short-Lived Jobs

Sometimes the scraping model does not work, because a job is ephemeral or short-lived and may not live long enough to be scraped. In this case, you can have the job push its metrics to a push gateway (Pushgateway), which Prometheus then scrapes to collect the data.

Link: https://github.com/prometheus/pushgateway

The push gateway can also be used from the command line: https://github.com/prometheus/pushgateway#use-it
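
Note that Prometheus still has to scrape the Pushgateway itself, so the gateway needs its own entry in prometheus.yml, along these lines (honor_labels keeps the job and instance labels attached at push time from being overwritten by the scrape's own labels):

  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
      - targets:
        - localhost:9091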

Simple Example

Push a single sample into a group identified by the job name "my_job" (assuming the Pushgateway is listening on localhost:9091):

$ echo "metric_1 3.14" | curl --data-binary @- http://localhost:9091/metrics/job/my_job

This metric will be of type "untyped" since no type information was provided.
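
To confirm the push landed, query the gateway's own metrics endpoint; the pushed sample shows up with the job label attached:

$ curl -s http://localhost:9091/metrics | grep metric_1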

Complicated Example

Push multiple metrics to a push gateway at pushgateway.example.org, for a job called some_job and an instance called some_instance:

(Note: the # TYPE lines declare each metric's type)

cat <<EOF | curl --data-binary @- http://pushgateway.example.org:9091/metrics/job/some_job/instance/some_instance
# TYPE some_metric counter
some_metric{label="val1"} 42
# TYPE another_metric gauge
# HELP another_metric Just an example.
another_metric 2398.283
EOF

Removing Metrics from Push Gateway

To delete all metrics in the group identified by job and instance (again assuming the Pushgateway at localhost):

curl -X DELETE http://localhost:9091/metrics/job/some_job/instance/some_instance

Delete all metrics in the group identified by job only (note that this does not touch metrics for the same job pushed with an instance label):

curl -X DELETE http://localhost:9091/metrics/job/some_job

Using Other Languages

The curl examples above are written for the command line, but you can push the same payloads from any language with an HTTP client; the official Prometheus client libraries also include Pushgateway support.

Queries

To run a query, start typing the name of a time series and Prometheus will show an autocomplete drop-down of matching metric names.

Type the full name of a time series and press Enter (then click the Graph tab) to see a plot of that quantity.

As you collect and group more metrics together, you can also filter and match time series using label selectors, as the following sections show.

Link (Querying Basics): https://prometheus.io/docs/prometheus/latest/querying/basics/

Dimensions

As an example of dimensions: we are collecting data with Netdata and sending it to Prometheus. Netdata gathers several pieces of information about CPU utilization under the netdata_cpu_cpu_percentage_average variable. Prometheus stores these as one metric name with multiple time series, distinguished by a "dimension" label, so we can filter on that label. The dimensions are things like "user", "system", "idle", etc.

To graph all dimensions of this time series, type this in the query box:

netdata_cpu_cpu_percentage_average

To graph only the system CPU utilization, specify the "system" dimension:

netdata_cpu_cpu_percentage_average{dimension="system"}

Jobs

If we end up monitoring several services on a single machine, or multiple machines across a network, the Prometheus configuration file will have several scrape jobs, one per service. For example, I might have three Netdata instances running on three computers: jupiter, basilisk, and morpheus.

When we query data in Prometheus, we can specify which job's data to use:

netdata_cpu_cpu_percentage_average{dimension="system",job="netdata_jupiter"}

Alternatively, we can match multiple jobs by using regular expressions and the =~ operator:

netdata_cpu_cpu_percentage_average{dimension="system",job=~"netdata_.*"}

and the output:

[Screenshot: PrometheusFilterJobsRegex.png]

Best Practices

The Prometheus documentation has links to several best practice recommendations.

Building Dashboards

Here is one about building dashboards: https://prometheus.io/docs/practices/consoles/

  • Have no more than 5 graphs on a console.
  • Have no more than 5 plots (lines) on each graph. You can get away with more if it is a stacked/area graph.
  • When using the provided console template examples, avoid more than 20-30 entries in the right-hand-side table.
  • It is difficult for a set of consoles to serve more than one master. What you want to know when oncall (what is broken?) tends to be very different from what you want when developing features (how many people hit corner case X?). In such cases, two separate sets of consoles can be useful.

Grafana

http://docs.grafana.org/installation/debian/

https://blog.hda.me/2017/01/09/using-netdata-with-influxdb-backend.html

https://www.digitalocean.com/community/tutorials/how-to-set-up-real-time-performance-monitoring-with-netdata-on-ubuntu-16-04

Flags
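
Prometheus takes its runtime settings from command-line flags rather than from prometheus.yml. A few common ones for the 1.7.x binary installed above, which uses single-dash flags (Prometheus 2.x renamed them to double-dash form); run ./prometheus -h for the authoritative list:

$ ./prometheus -config.file=prometheus.yml \
               -web.listen-address=":9090" \
               -storage.local.path="/opt/prometheus/data"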