Scaling Sensu - Overview

In this article we’ll provide brief overviews of the various ways that you can scale your Sensu deployment, from scaling individual components, to scaling across regions.

Sensu Components

A typical Sensu deployment consists of four pieces:

  • Sensu server
  • Sensu API
  • Data store (Redis)
  • Message bus (RabbitMQ)

The message bus and data store can vary, but using Redis as the data store and RabbitMQ as the message bus is the most common (and supported) combination.

Sensu Server

The sensu-server process is the workhorse of any deployment. It performs a number of tasks, including check scheduling and publishing, monitoring clients via keepalives, and event processing. To scale this component, add the desired number of Sensu servers and point them at the same RabbitMQ and Redis instances; the servers handle leader election among themselves.
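
As a rough sketch of what "pointing servers at" shared infrastructure can look like, the snippet below renders one configuration document that every sensu-server host could use unchanged. The hostnames, the password placeholder, and the helper name `build_server_config` are invented for illustration; the `rabbitmq` and `redis` sections follow the usual Sensu JSON configuration layout, so check the key names against the reference documentation for your version.

```python
import json

def build_server_config(rabbitmq_host, redis_host):
    """Return a Sensu-style config dict shared by every sensu-server host."""
    return {
        "rabbitmq": {
            "host": rabbitmq_host,
            "port": 5672,
            "vhost": "/sensu",        # assumed vhost name
            "user": "sensu",          # assumed user name
            "password": "CHANGE_ME",  # placeholder, never commit real secrets
        },
        "redis": {"host": redis_host, "port": 6379},
    }

# Hypothetical hostnames: every additional server receives the same document.
server_config = build_server_config(
    "rabbitmq.example.internal", "redis.example.internal"
)
print(json.dumps(server_config, indent=2))
```

Because each server receives identical settings, adding capacity amounts to starting another sensu-server process with this file in place.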

Sensu API

The sensu-api component is a stateless HTTP frontend. It can be scaled with traditional HTTP load-balancing strategies (HAProxy, Nginx, etc.). Configure each additional API instance to point to your Redis instance, and add the API instance to your load-balancing pool.
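
Because the API holds no state, any instance can serve any request, which is what makes simple rotation safe. The toy pool below illustrates the idea only; it is not how HAProxy or Nginx are implemented, and the class name and instance names are hypothetical.

```python
from itertools import cycle

class RoundRobinPool:
    """Rotate requests across stateless API instances, skipping any marked down."""

    def __init__(self, backends):
        self._backends = list(backends)
        self._cursor = cycle(range(len(self._backends)))
        self._healthy = set(self._backends)

    def mark_down(self, backend):
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self):
        # Try each backend at most once per call, starting from the cursor.
        for _ in range(len(self._backends)):
            candidate = self._backends[next(self._cursor)]
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy API instances in the pool")

pool = RoundRobinPool(["sensu-api-1:4567", "sensu-api-2:4567"])
first_three = [pool.next_backend() for _ in range(3)]
pool.mark_down("sensu-api-2:4567")
after_failure = pool.next_backend()
```

In practice a real load balancer provides the same behavior, plus health checks, so adding an API instance is a matter of registering one more backend.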

Redis

Redis can be scaled out in several different ways. Redis Sentinel is the primary supported way of scaling Redis. You can read more about installing and configuring Sentinel in our Redis reference documentation.
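
For orientation, here is a minimal sketch of a `redis` section that lists Sentinel endpoints instead of a single Redis host. The `master` and `sentinels` key names reflect our understanding of the Sensu Redis configuration and should be confirmed in the Redis reference documentation; the master name, hostnames, and helper function are assumptions made for this example.

```python
import json

def build_redis_sentinel_config(master_name, sentinel_hosts, sentinel_port=26379):
    """Build a Sensu-style "redis" section that lists Sentinel endpoints."""
    return {
        "redis": {
            "master": master_name,
            "sentinels": [
                {"host": host, "port": sentinel_port} for host in sentinel_hosts
            ],
        }
    }

sentinel_config = build_redis_sentinel_config(
    "sensu-redis",  # hypothetical name of the master as registered in Sentinel
    [
        "sentinel-1.example.internal",
        "sentinel-2.example.internal",
        "sentinel-3.example.internal",
    ],
)
print(json.dumps(sentinel_config, indent=2))
```

Three Sentinels are used here because Sentinel needs a majority to agree on a failover; the exact count is a deployment decision.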

RabbitMQ

RabbitMQ can be used in a clustered configuration for Sensu. You can read more about configuring RabbitMQ clusters in our RabbitMQ reference documentation.
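
As a hedged illustration, a clustered setup is typically expressed by listing every broker in the `rabbitmq` section rather than a single one. The list form shown below is our understanding of how Sensu accepts multiple brokers; the hostnames, credentials, and helper name are placeholders, so verify the details in the RabbitMQ reference documentation.

```python
import json

def build_rabbitmq_cluster_config(broker_hosts, user="sensu", password="CHANGE_ME"):
    """Build a Sensu-style "rabbitmq" section with one entry per cluster node."""
    return {
        "rabbitmq": [
            {
                "host": host,
                "port": 5672,
                "vhost": "/sensu",  # assumed vhost name
                "user": user,
                "password": password,
            }
            for host in broker_hosts
        ]
    }

cluster_config = build_rabbitmq_cluster_config(
    ["rabbit-1.example.internal", "rabbit-2.example.internal"]
)
print(json.dumps(cluster_config, indent=2))
```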

Scaling Sensu at a Single Site

Each Sensu component can be scaled independently at a single site, whether you need to make Redis highly available or add consumers (sensu-server instances) to keep your RabbitMQ queue depth at a manageable level. We’ll put all of these elements together in the next guide.

Scaling Sensu Across Multiple Sites

Every distributed system, Sensu included, needs special consideration when scaling across multiple sites (datacenters), because the network between them (the WAN) will be unreliable.

For the purposes of this documentation, each site is referred to as a “datacenter”.

Strategy 1: Isolated Clusters Aggregated by Uchiwa

This strategy involves building an isolated, independent Sensu server or cluster at each datacenter, and then using Uchiwa’s multi-datacenter configuration option to get an aggregate view of the events and clients.
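
To make the multi-datacenter option concrete, the sketch below generates a Uchiwa configuration with one `sensu` entry per datacenter, each pointing at that site's Sensu API. The datacenter names, API hostnames, and helper function are hypothetical, and the ports are assumed to be the conventional 4567 (Sensu API) and 3000 (Uchiwa).

```python
import json

def build_uchiwa_config(datacenters):
    """datacenters maps a display name to the host of that site's Sensu API."""
    return {
        "sensu": [
            {"name": name, "host": api_host, "port": 4567}
            for name, api_host in datacenters.items()
        ],
        "uchiwa": {"host": "0.0.0.0", "port": 3000},
    }

uchiwa_config = build_uchiwa_config(
    {
        "us-east": "sensu-api.us-east.example.internal",
        "eu-west": "sensu-api.eu-west.example.internal",
    }
)
print(json.dumps(uchiwa_config, indent=2))
```

Each entry is independent, so a site that becomes unreachable only hides its own events and clients from the dashboard.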

Pros

  • WAN instability does not lead to flapping Sensu checks
  • Sensu operation continues uninterrupted during a WAN outage
  • The overall architecture is easier to understand and troubleshoot

Cons

  • WAN outages mean a whole datacenter can go dark and not set off alerts (cross-datacenter checks are therefore essential)
  • WAN instability can lead to a lack of visibility as Uchiwa may not be able to connect to the remote Sensu APIs
  • Requires all the Sensu infrastructure in every datacenter

Strategy 2: Centralized Sensu and Distributed RabbitMQ

Sensu clients only need to connect to a RabbitMQ server to submit events. One scaling strategy is to centralize the Sensu infrastructure in one location and run only a RabbitMQ broker at each remote site, which in turn forwards events to the central cluster.

This is done with either the RabbitMQ Federation plugin or the Shovel plugin. (The RabbitMQ documentation compares the two.)

NOTE: With RabbitMQ, this strategy favors availability and partition tolerance over consistency.

Pros

  • Fewer infrastructure components necessary at remote datacenters
  • All Sensu server alerts originate from a single source

Cons

  • WAN instability can result in floods of client keepalive alerts. (The Sensu Enterprise check dependencies filter can help with this.)
  • Increased RabbitMQ configuration complexity
  • All clients appear to be in the same datacenter in Uchiwa

Strategy 3: Centralized Sensu and Directly Connected Clients

All Sensu clients execute checks locally. Their only interaction with Sensu servers is to push events onto RabbitMQ. Therefore, clients at remote sites can connect directly to the central RabbitMQ broker over the WAN.

Pros

  • Very simple architecture, no additional infrastructure needed at remote sites
  • Centralized alert handling

Cons

  • A keepalive failure caused by a client that is down is indistinguishable from one caused by WAN instability
  • Lots of remote clients means lots of TCP connections over the WAN
  • All clients appear to be in the same datacenter in Uchiwa
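
The keepalive concerns listed for the centralized strategies can be illustrated with a simplified model of staleness detection. This is a sketch, not Sensu's implementation: the 120 and 180 second thresholds are assumed here to match the default keepalive warning and critical thresholds, and the client names and timestamps are made up.

```python
WARNING_AFTER = 120   # seconds without a keepalive before a warning (assumed default)
CRITICAL_AFTER = 180  # seconds without a keepalive before a critical (assumed default)

def keepalive_status(last_seen, now, warning=WARNING_AFTER, critical=CRITICAL_AFTER):
    """Classify a client purely by how old its most recent keepalive is."""
    age = now - last_seen
    if age >= critical:
        return "critical"
    if age >= warning:
        return "warning"
    return "ok"

# Hypothetical remote site: db-1 really crashed at t=1000, while web-1 and
# web-2 are healthy but were cut off by a WAN outage at the same moment.
last_seen = {"web-1": 1000, "web-2": 1000, "db-1": 1000}
statuses = {name: keepalive_status(ts, now=1200) for name, ts in last_seen.items()}
print(statuses)
```

From the central server's point of view all three clients look identical, which is why a WAN outage produces a flood of keepalive alerts and why a genuinely failed client cannot be told apart from an unreachable one.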