Sensu - Toe deep

I’ve put this off for a long time, so here we go: Sensu.

About this blog post: I made my first real ‘I want to learn this properly now’ attempt in July. I got caught up in the big wave of getting the community onto the new rewrite: Sensu Go. A lot was happening, there was a lot of contradictory information, and 99% of what Google returned was about the legacy version. Now, in November, I got a second chance to really sit down and take a proper stab at this. Although I got a lot more working, using simpler commands, the frustration and the feeling of “not production grade” are still there.

My “simple” challenge

This is what I set out to achieve (in November 2019), which I think is a relevant use case for Sensu:

  • get two Ubuntu servers up (one agent, one server)
  • wire things up so that when CPU goes above 75% for 5 minutes, an alert is raised in OpsGenie
  • succeed with this in an hour
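
The “above 75% for 5 minutes” condition is the core of the challenge. A minimal sketch of that rule in Python, independent of Sensu, could look like the following. The function name sustained_above, the (timestamp, percent) sample format and the defaults are assumptions made for illustration; none of them come from a Sensu API.

```python
def sustained_above(samples, threshold=75.0, window_seconds=300):
    """Return True when CPU stayed above `threshold` for the whole window.

    `samples` is a list of (timestamp_seconds, cpu_percent) tuples in
    ascending time order (hypothetical format, chosen for this sketch).
    """
    if not samples:
        return False
    latest = samples[-1][0]
    window_start = latest - window_seconds
    # Without a sample at or before the window start we cannot claim
    # that the whole window is covered.
    if samples[0][0] > window_start:
        return False
    recent = [value for ts, value in samples if ts >= window_start]
    return all(value > threshold for value in recent)


# One sample per minute for five minutes, all at 80%.
readings = [(t, 80.0) for t in range(0, 301, 60)]
print(sustained_above(readings))
```

A single dip below the threshold inside the window resets the condition, which is what keeps short spikes from raising an alert.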

I’ve failed miserably on the above, due to

  • poor documentation
    • example: adding assets can be super simple, yet the getting started documentation doesn’t even mention it: sensuctl asset add group/plugin
  • poor logging
    • example agent: I had to enable debug level logging on the agent to see that the dependencies for a check weren’t downloaded!
    • example backend: A handler on the backend (influxdb) has been failing for 3 weeks and there’s nothing telling me it’s not working
  • poor plugin documentation
    • unclear which plugins work in Sensu Go (new rewrite) vs Sensu Core (legacy version)
    • “want to use this? read the code!”
  • poor plugins support for Ubuntu
    • the sensu-plugin/sensu-plugin-x plugins, where x is one of many common things like cpu, disk, load, ntp, uptime, logging, …: about 50% support only Debian, and about 50% support Debian-esque distributions (Ubuntu included, i.e. supported)

I managed to get a docker-compose powered setup running within minutes, and created an Ansible playbook to get agents running on 6 Ubuntu machines using the Sensu-provided APT repo. That part works perfectly fine, and I’ve upgraded the backend and agents a couple of times with no issues.

All in all, I want Sensu Go to work. After two deep sessions, I like it to the point that I’m getting frustrated by how great it could be, but isn’t… Instead of telling me what’s wrong in my setup, it leaves me chasing poorly documented and empty logs - as if everything’s fine when it isn’t.

I’ll very likely go back to Prometheus + Grafana, and explore how the TICK stack stacks up these days.

Everything below is from July

Consider reading these, before reading on …

Two major versions

Sensu was rewritten and the new rewrite, called Sensu Go, was released rather recently. The older version requires more infrastructure to run (RabbitMQ, Redis). This post is about version 5.10, which is Sensu Go.

Main Components

Sensu’s model for monitoring is to have a central “brain” - the Sensu backend - which knows a couple of things:

  • things that should exist (entities)
  • checks to execute (checks)
  • results from those checks (events)
  • which check should execute on which entity (subscriptions)
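
The way subscriptions tie checks to entities can be sketched in a few lines of Python. This is a simplified model, assuming that a check runs on an entity whenever the two share at least one subscription name; the function checks_for_entity and the example check names are made up for illustration.

```python
def checks_for_entity(entity_subscriptions, checks):
    """Return the names of checks that share a subscription with the entity.

    `checks` maps a check name to the list of subscriptions it targets
    (hypothetical structure, not the Sensu API format).
    """
    wanted = set(entity_subscriptions)
    return sorted(
        name for name, subs in checks.items() if wanted & set(subs)
    )


checks = {
    "check-cpu": ["ubuntu"],
    "check-nginx": ["webserver"],
    "check-xen": ["dom0"],
}
# An agent subscribed to "ubuntu" and "webserver" gets two of the three checks.
print(checks_for_entity(["ubuntu", "webserver"], checks))
```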

None of the above is configured through files; it is all done through API calls. sensuctl is the command line tool one is meant to use to set up and configure most of it.

The simplest form of “things that should exist”, entities, are hosts (or target environments) where a sensu-agent is installed and running, as a systemd service (or as a standalone process if you want to).

More:

  • agents: execute checks in order to produce events; these events are transported to the Sensu Backend
  • runtime assets: downloadable assets that agents can pull (however: the asset manager of the agent won’t warn if it can’t install an asset)
  • plugins: plugins are assets, which can provide the software needed for checks, handlers, mutators, etc.
  • ruby: yeah, seems most “default” plugins can’t do without a Ruby runtime.

To be honest, this is a mess. The “Getting started” documentation is a mess too: it mentions “handler” in 6 or more different ways in just 10 lines of text, and there are no graphics to present the above components.

Sensu Agent

Each agent needs to be configured with:

  • where the sensu backend is, such as http://sensu-backend.mycorp.vpn/
  • what subscriptions to subscribe to, such as dom0, ubuntu, webserver

For every check that needs to be executed, you must make sure the agent can execute the command specified in that check. Sensu doesn’t distribute any binaries, packages, …

Here’s a clear job for Ansible - essentially each subscription likely brings a host of binaries/packages to make available.

Every agent also has implementation of

  • statsd - to collect metrics local to the agent

and a few other protocols where metrics can be collected. Some of these are referred to as

A Check

A check is essentially:

  • a command to execute, like check-cpu.rb
  • scheduling information, one of
    • interval: seconds between each invocation
    • cron: yes, cron pattern of when to execute
    • publish: true/false; if false, the check is not scheduled and only executes when triggered by an API call made to the Sensu backend

Checks are defined in the sensu backend, but executed on the sensu agents.
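
To make the pieces of a check concrete, here is a small Python sketch that builds a check definition as a dictionary and prints it as JSON. The field names (type, api_version, metadata, spec, command, interval, cron, publish, subscriptions) follow the Sensu Go check resource format as I understand it; the helper build_check, the check name and the subscription are assumptions for illustration, so verify the output against the official reference before feeding it to something like sensuctl create.

```python
import json


def build_check(name, command, subscriptions, interval=None, cron=None,
                publish=True, namespace="default"):
    """Build a check definition; exactly one of interval/cron must be set."""
    if (interval is None) == (cron is None):
        raise ValueError("set exactly one of interval or cron")
    spec = {
        "command": command,
        "subscriptions": list(subscriptions),
        "publish": publish,
    }
    if interval is not None:
        spec["interval"] = interval  # seconds between invocations
    else:
        spec["cron"] = cron          # cron pattern of when to execute
    return {
        "type": "CheckConfig",
        "api_version": "core/v2",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


# Hypothetical example: run check-cpu.rb every 60 seconds on "ubuntu" agents.
print(json.dumps(build_check("check-cpu", "check-cpu.rb", ["ubuntu"],
                             interval=60), indent=2))
```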

The protocol around a check is simple: it must be executable (the command), and the exit status code has special meaning:

0 = OK
1 = WARNING
2 = CRITICAL
3 = UNKNOWN

STDOUT can carry metrics, which can be processed by handlers in order to be stored in Graphite, InfluxDB, OpenTSDB, …
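
The exit status protocol is easy to follow from any language. The sketch below shows the idea in Python: one function maps a CPU reading to one of the four status codes, another formats a metric line in Graphite plaintext style (name, value, timestamp). The thresholds, the metric name and the function names are assumptions for illustration, and the CPU reading is passed in rather than measured.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(cpu_percent, warn=75.0, crit=90.0):
    """Map a CPU reading to (exit status, message)."""
    if cpu_percent is None:
        return UNKNOWN, "UNKNOWN: no CPU reading available"
    if cpu_percent >= crit:
        return CRITICAL, f"CRITICAL: CPU at {cpu_percent}%"
    if cpu_percent >= warn:
        return WARNING, f"WARNING: CPU at {cpu_percent}%"
    return OK, f"OK: CPU at {cpu_percent}%"


def graphite_line(name, value, timestamp):
    """Format one metric as a Graphite plaintext line."""
    return f"{name} {value} {int(timestamp)}"


def run_check(cpu_percent):
    """Print the message to STDOUT and return the exit status."""
    status, message = evaluate(cpu_percent)
    print(message)
    return status


# A real check script would end with: sys.exit(run_check(reading))
print(run_check(80.0))
# A metrics check would print lines like this one to STDOUT instead:
print(graphite_line("cpu.usage_percent", 80.0, 1562511971))
```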

Sensu-install

This non-official magic tool comes from the package sensu-... and consists of Ruby scripts that download other Ruby stuff and put it into some obscure path that agents are likely to pick up (PATH).

It is a legacy tool; for Sensu Go, one should use Runtime Assets instead. These are distributed to checks, filters, etc. The mapping between a check/filter and its Runtime Assets is handled manually in the check/filter definition.

A better way is to simply use sensuctl asset add group/the-plugin:1.2.3 --rename the-plugin which will install the asset from Bonsai. Unfortunately, more than 50% of the plugins in sensu

Sensu Backend

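Exposes these ports (among others):

  • 2380 - etcd peer communication (the backend embeds etcd; 2379 is the etcd client port). statsd metric collection (i.e. some process pushing metric data, converted from statsd format into sensu event format) happens on the agent, not on this port
  • 3000 - web UI
  • 8080 - Sensu API, this is what sensuctl talks to
  • 8081 - agent API (WebSocket), this is where sensu agents connect to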

The sensu backend is the orchestrator of everything. Out of the box, sensu agents do nothing but send keepalive (I’m alive!) events to the backend. Everything else has to be configured (through API calls, or using the sensuctl CLI tool).

Events

This is data coming into the sensu backend. Whenever a check is executed, the result of that execution is an event. It carries the exit status code and any output written to STDOUT.

ceda@lx1carbon:~$ sensuctl event list
   Entity       Check                                   Output                                  Status   Silenced             Timestamp             
 ─────────── ─────────── ───────────────────────────────────────────────────────────────────── ──────── ────────── ──────────────────────────────── 
  filserver   keepalive   Keepalive last sent from filserver at 2019-07-07 15:06:11 +0000 UTC        0   false      2019-07-07 17:06:11 +0200 CEST  
  jjim        keepalive   Keepalive last sent from jjim at 2019-07-07 15:10:15 +0000 UTC             0   false      2019-07-07 17:06:23 +0200 CEST  
  lx1carbon   keepalive   Keepalive last sent from lx1carbon at 2019-07-07 15:06:25 +0000 UTC        0   false      2019-07-07 17:06:25 +0200 CEST  
  trumpet     keepalive   Keepalive last sent from trumpet at 2019-07-07 15:06:23 +0000 UTC          0   false      2019-07-07 17:06:23 +0200 CEST  
  water       keepalive   Keepalive last sent from water at 2019-07-07 15:06:22 +0000 UTC            0   false      2019-07-07 17:06:22 +0200 CEST  

Filters

These are named filters, executed in the sensu backend, processing incoming events. There are two special filters:

  • is_incident - which filters out all success events (where the check’s exit status code is 0)
  • has_metrics - which filters out events that don’t carry metrics (STDOUT from the check’s command)
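
The two filters above can be modelled as simple predicates over an event. This Python sketch follows the description in this post, not Sensu’s actual implementation; as far as I know the real is_incident also lets resolution events through, which is left out here. The event layout (check.status, metrics.points) mirrors the Sensu Go event format as I understand it, and the function names are chosen for illustration.

```python
def is_incident(event):
    """Keep only events whose check exited with a non-zero status."""
    return event["check"]["status"] != 0


def has_metrics(event):
    """Keep only events that carry at least one metric point."""
    return bool((event.get("metrics") or {}).get("points"))


def passes(event, filters):
    """An event reaches the handler only if every filter accepts it."""
    return all(f(event) for f in filters)


ok_event = {"check": {"status": 0}}
critical_event = {
    "check": {"status": 2},
    "metrics": {"points": [{"name": "cpu.usage_percent", "value": 95.0}]},
}
print(passes(ok_event, [is_incident]))        # success event is dropped
print(passes(critical_event, [is_incident]))  # incident goes through
```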

Handlers

Handlers are executed on the sensu backend(s), and process events that pass any filters set up for the handler.

ceda@lx1carbon:~$ sensuctl handler info keepalive
=== keepalive
Name:                  keepalive
Type:                  pipe
Timeout:               0
Filters:               is_incident
Mutator:               
Execute:               RUN:  sensu-slack-handler -c "${KEEPALIVE_SLACK_CHANNEL}" -w "${KEEPALIVE_SLACK_WEBHOOK}"
Environment Variables: KEEPALIVE_SLACK_WEBHOOK=https://hooks.slack.com/services/AAA/BBB/CCC, KEEPALIVE_SLACK_CHANNEL=#monitoring
Runtime Assets:        sensu-slack-handler
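
A handler of type pipe, like the one above, is just a command that receives the event as JSON on STDIN. The following Python sketch shows the general shape of such a handler; it only formats a message and does not talk to Slack. The field paths (entity.metadata.name, check.metadata.name, check.status, check.output) follow the Sensu Go event format as I understand it, and the function names are made up for this example.

```python
import json


def format_message(event):
    """Turn an event into a one-line, human-readable message."""
    entity = event["entity"]["metadata"]["name"]
    check = event["check"]["metadata"]["name"]
    status = event["check"]["status"]
    output = event["check"].get("output", "").strip()
    return f"{entity}/{check} status={status}: {output}"


def handle(raw_json):
    """Parse the JSON text a pipe handler gets on STDIN and format it."""
    return format_message(json.loads(raw_json))


# A real handler script would do: print(handle(sys.stdin.read()))
example = json.dumps({
    "entity": {"metadata": {"name": "filserver"}},
    "check": {"metadata": {"name": "keepalive"}, "status": 2,
              "output": "No keepalive sent\n"},
})
print(handle(example))
```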

Runtime Assets

These magic beasts are incredibly poorly described in the Getting Started docs, but appear to be things that the sensu backend can make use of. After an hour of reading docs and trying things out, it’s still unclear how.

Assets can be executed by the backend (for handler, filter, and mutator assets), or by the agent (for check assets). At runtime, the entity sequentially fetches assets and stores them in its local cache. Asset dependencies are then injected into the PATH so they are available when the command is executed.

Except when they don’t, and nothing in the Sensu Go system tells you: bug report
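
The PATH injection described in the quoted documentation can be pictured with a short Python sketch. It assumes each unpacked asset lives in its own directory with a bin/ subdirectory that gets prepended to PATH; the cache paths and the function name env_with_assets are hypothetical, and this is not how the agent is actually implemented.

```python
import os


def env_with_assets(base_env, asset_dirs):
    """Return a copy of base_env with each asset's bin/ dir first on PATH."""
    env = dict(base_env)
    bin_dirs = [os.path.join(d, "bin") for d in asset_dirs]
    existing = env.get("PATH", "")
    parts = bin_dirs + ([existing] if existing else [])
    env["PATH"] = os.pathsep.join(parts)
    return env


# Hypothetical cache location, for illustration only.
env = env_with_assets({"PATH": "/usr/bin"}, ["/tmp/sensu-cache/abc123"])
print(env["PATH"])
```

With an environment like this, a check command such as check-cpu.rb resolves to the copy shipped in the asset before anything installed system-wide.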

found here

This work by Fredrik Wendt is licensed under CC by-sa.