Keep calm and Observe
Or how to build custom, holistic monitoring for you and your team
If monitoring complex cloud setups is something you have to deal with day after day, you know it is a real challenge, and fatigue sets in quickly.
On the one hand you want (nearly) perfect control over everything you are responsible for; on the other hand you only want to pay attention when it is really needed, so you can spend your time on more interesting things. Striking a balance here can be really hard.
Tuning out the noise is a constant effort and requires some endurance. How long did it take you to configure Prometheus Alertmanager so that it only nags you when it really matters?
There are plenty of really good monitoring services out there, but what I was actually longing for was to bundle all the noise into a single funnel: one status, in one tool, with the signals weighted according to our use and interest. That's why my colleagues and I started to build a new service of our own, and we want to share it with you: MetaControl.
MetaControl is about Collaborative Observability: breaking silos while still getting signals from all layers of the stack. It is about creating common visibility, and it is designed to encourage teams to share insights within their network.
This series is a hands-on guide to building such a holistic view: scraping data, visualizing it, and taking a highly compressed summary along with you on your smartwatch.
Our Showcase Scenario
Let’s pick something close to reality and define our scenario as a bunch of microservices, deployed from pipelines in Gitlab onto a managed Kubernetes cluster on Azure (AKS). The cluster will be monitored by the usual suspects (yes, Prometheus), but we will interconnect all environments ourselves:
Part 1 will examine what we want to learn from the scenario, set up the MetaControl project and design our workspace.
Part 2 will cover our workload — a sample Java microservice and the Gitlab pipeline. We will use webhooks to consume the pipeline’s state and look into mappings to make them visible in MetaControl.
Part 3 will look at how to send custom events from an application using REST APIs. We will also look at specific integrations with Spring Boot and application logging.
Part 4 is where we consume events from our AKS and Azure infra using the Collector SDK and actually scrape data in a custom way.
Part 5 will dive into alerting and how to set it up without getting annoyed.
Part 6 will describe how to visualize the environment in all its beauty to make it explainable to others.
Also make sure to visit verticle.io and learn a bit about MetaControl.
Part 1: Identifying the important bits
Probably the hardest part is finding out what brings the most value to the team. It’s easy to drown in vanity metrics, so the first question is:
Question: What information is worth the time and attention from the team?
Answer: Anything that is actionable enough and reasonably impacts your business.
This is of course a very simplified view. In reality there is a big grey zone that can quickly change severity from “unimportant” to “critical”.
We know that important things happen in all places throughout the scenario environment, so let’s try to structure this first.
We will use a forked version of the Spring PetClinic sample application that we extended a bit.
Spring Boot comes with an excellent interface to get telemetry, health states and meta information from its stack. We will scrape the actuator endpoints, monitor exceptions and raise events for the important bits of the business logic.
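To give an idea of what scraping the health endpoint yields: /actuator/health reports a status per component (disk space, database, and so on) and rolls them up into one overall status. The little sketch below reproduces Spring Boot’s default aggregation precedence (DOWN before OUT_OF_SERVICE before UP before UNKNOWN) in plain Java; the class and method names are our own, purely for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

// Sketch: reduce several component health states, as reported by Spring
// Boot's /actuator/health endpoint, to one overall status, using the same
// precedence Spring's default StatusAggregator applies.
class HealthRollup {

    // Most severe first: DOWN > OUT_OF_SERVICE > UP > UNKNOWN.
    private static final List<String> ORDER =
            Arrays.asList("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    private static int rank(String status) {
        int i = ORDER.indexOf(status);
        return i < 0 ? ORDER.size() : i; // unlisted custom statuses rank last
    }

    /** Return the most severe status found in the collection. */
    public static String aggregate(Collection<String> statuses) {
        return statuses.stream()
                .min(Comparator.comparingInt(HealthRollup::rank))
                .orElse("UNKNOWN");
    }
}
```

With this, a scraper that sees `{"db": "UP", "diskSpace": "DOWN"}` would report the overall status as DOWN, which is exactly what the actuator endpoint itself does.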
We use gitlab.com as our main source repository. It will also run our pipelines, which
- build Dockerized microservice images for the Java application
- deploy the application to Kubernetes with Helm
So this is obviously one of the major assets in our automation that we should care for. Let’s watch the pipeline states by utilizing Gitlab’s webhook feature.
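GitLab’s pipeline webhook delivers the pipeline state in the `object_attributes.status` field of the payload, with values such as pending, running, success, failed or canceled. How you weight those states is up to your team; the sketch below shows one possible mapping to coarse severities. The severity names ("ok", "info", "warning", "critical") are our own convention for illustration, not a fixed MetaControl vocabulary.

```java
// Sketch: map GitLab pipeline webhook statuses to event severities.
// The status values come from GitLab's pipeline events payload
// (object_attributes.status); the severity names are assumptions.
class PipelineStatusMapper {

    /** Translate a GitLab pipeline status into a coarse severity. */
    public static String severity(String gitlabStatus) {
        switch (gitlabStatus) {
            case "success":
                return "ok";
            case "pending":
            case "running":
                return "info";     // transient states: worth showing, not alerting
            case "canceled":
            case "skipped":
                return "warning";  // someone interfered with the pipeline
            case "failed":
                return "critical"; // broken deployment path, act on it
            default:
                return "unknown";
        }
    }
}
```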
This is a big one. Azure allows some pretty nifty integrations via its APIs. You can watch infrastructure events and platform alerts, and dig into logs for all components.
We will add an Application Gateway to our setup and monitor the response times and HTTP response codes. That way we can quickly learn about the user experience.
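As a rough sketch of how such a user-experience signal could be derived: count the gateway’s responses per HTTP status code over a window and flag the environment when the share of server errors climbs. The 2% threshold below is an assumption for illustration only, not a recommendation.

```java
import java.util.Map;

// Sketch: a crude user-experience signal derived from HTTP response
// code counts, e.g. as reported by an Azure Application Gateway.
class GatewayHealth {

    /** Fraction of responses that were server errors (5xx). */
    public static double errorRate(Map<Integer, Long> countsByStatus) {
        long total = countsByStatus.values().stream()
                .mapToLong(Long::longValue).sum();
        if (total == 0) return 0.0;
        long errors = countsByStatus.entrySet().stream()
                .filter(e -> e.getKey() >= 500)
                .mapToLong(Map.Entry::getValue)
                .sum();
        return (double) errors / total;
    }

    /** True if the 5xx share exceeds our (assumed) 2% threshold. */
    public static boolean degraded(Map<Integer, Long> countsByStatus) {
        return errorRate(countsByStatus) > 0.02;
    }
}
```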
For AKS we will monitor events with Fireboard Oculus (Github), a small service that reports k8s events to MetaControl. We are especially interested in failing application pods.
We will also deploy and monitor Prometheus and its Alerts API (somewhat replacing Alertmanager) and watch some prominent labels.
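Prometheus exposes the currently pending and firing alerts at GET /api/v1/alerts; each alert in the response carries a state and its label set. Assuming the JSON has already been deserialized into maps (we leave the HTTP and JSON parts to your client library of choice), filtering down to the labels we care about could look like this:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Sketch: keep only firing alerts whose "severity" label is on our
// team's watch list. The input mirrors the "data.alerts" array of
// Prometheus' GET /api/v1/alerts response, already parsed into maps.
class AlertFilter {

    @SuppressWarnings("unchecked")
    public static List<Map<String, String>> firing(
            List<Map<String, Object>> alerts, List<String> severities) {
        return alerts.stream()
                .filter(a -> "firing".equals(a.get("state")))    // skip "pending"
                .map(a -> (Map<String, String>) a.get("labels")) // the alert's label set
                .filter(labels -> severities.contains(labels.get("severity")))
                .collect(Collectors.toList());
    }
}
```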
And finally we will propagate events to Slack using FireBot, MetaControl’s ChatOps app.
If you want to build this setup along with the series, you will need:
- a Gitlab.com account (free) or an installation
- an active Azure subscription
- a MetaControl Early-Access seat
We are about to start early access, so make sure you reserve your seat at https://verticle.io.
After receiving your invite you can use the default team workspace and attach an external git repository (e.g. Github, Bitbucket). In our case we will choose a free Gitlab.com repo and also run our scenario pipelines there.
MetaControl will commit the default configuration to it and you are ready to start.
The workspace repo
MetaControl makes heavy use of YAML files stored in external git repositories. Each repo is team-specific and contains, for example:
- workspace structure definitions
- view configurations, e.g. the stackboard and schematics
- alert configs for the team
- webhook payload mappings
That way all configuration is version-controlled, and MetaControl will always pull the latest commit.
All incoming events are stored in buckets that can be shared with other teams. It is useful to create buckets for different event categories.
We will start off with the default bucket and create a finer-grained structure throughout the series.
The project setup is done at this point. Let’s dive into some implementations.