Monitoring Kubernetes job status in Amazon EKS: Cronitor or Prometheus

Oleksii Bebych

January 3, 2024

Problem statement

In the previous post, “How we migrated applications from Heroku to AWS”, I described the migration planning, the process, and the problems we encountered. Once the migration was done, operations and monitoring became central to our work.

Just for general understanding, here is what we got after the migration: an EKS cluster with several node groups, a set of controllers (Cluster Autoscaler, AWS Ingress Controller, External Secrets, External DNS, FluentBit, Prometheus stack, KEDA, ArgoCD), a couple of web services, and a lot of Kubernetes Jobs that KEDA starts based on messages in the queues.

Our customers used to have a convenient page in Heroku where they could see the status of job execution + history:

So they asked for something similar in our current monitoring solution. Since we use Amazon Managed Service for Prometheus and Amazon Managed Grafana, the most logical choice was to use them for monitoring everything, including Jobs, but we decided to check the market in parallel and see what alternatives exist.

Monitoring jobs with Cronitor

Installation

Cronitor is a product for monitoring Kubernetes jobs, websites, and other things that is easy to install and use. Once you sign up, you can start a 14-day trial, as we did. A Helm chart is available on GitHub.

The first thing you need for installation is to generate a new API token:

Add the Helm repo:

helm repo add cronitor https://cronitorio.github.io/cronitor-kubernetes/

Create a Kubernetes secret with the previously generated API token:

kubectl create secret generic cronitor-secret -n <namespace> --from-literal=CRONITOR_API_KEY=<api key>

Deploy the Helm chart:

helm upgrade --install <release name> cronitor/cronitor-kubernetes --namespace <namespace> --set credentials.secretName=cronitor-secret --set credentials.secretKey=CRONITOR_API_KEY

By default, the agent monitors all CronJobs in your Kubernetes cluster, but you can exclude specific CronJobs from monitoring if you want.

After the installation, you will see a simple application, just one deployment with one pod.

kubectl get all -n cronitor

NAME                                                READY   STATUS    RESTARTS   AGE
pod/cronitor-cronitor-kubernetes-74dd7f557d-qc584   1/1     Running   0          10d

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/cronitor-cronitor-kubernetes   1/1     1            1           11d

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/cronitor-cronitor-kubernetes-74dd7f557d   1         1         1       11d

Jobs monitoring

See what’s really happening with your jobs. Monitor every job in one place.

Cronitor monitors any type of cron job, tracking each execution and recording the exit status, metrics, and logs in a single location. Instant alerts for failed or missed executions mean the right people are alerted when anything goes wrong.

We can see the current status of every job (completed, running, failed):

We can see the success rate, job duration, number of executions and failures, job schedule, and timezone:

Find the details you need, without hunting through log files.

Audit your jobs and resolve incidents faster by accessing execution logs and error messages without leaving Cronitor:

You can see a timeline of job status:

The number of completed and failed invocations:

The “Issues” page shows current alerts:

You can see the failure details:

Alerts can be sent to relevant people:

The list of recipients is configurable, and notifications can be sent by email or SMS:

Reports and Status Pages keep everybody informed.

Cronitor gives you multiple ways to keep teammates, stakeholders, and customers informed about job health and downtime:

And here is what the report looks like:

Website and API monitoring

Global uptime monitoring

Run your website and API checks from 12+ locations across 5 continents. Understand performance trends for each region, and focus your monitoring on where your users are located.

The monitor checks a site's availability and response time:

As well as an SSL certificate validity:

Monitor anything using a simple heartbeat

Anything can send a heartbeat

An instant pulse on the health of any software component. A heartbeat can be sent with a simple “curl” command or from one of the supported programming languages:

Direct Integration

1. This is the unique Telemetry URL for this monitor.

https://cronitor.link/p/a6425************98853846f/ROPIkn

2. Send simple HTTP pings when your job runs, completes, or fails

# Send a heartbeat
> curl https://cronitor.link/p/a6425**************53846f/ROPIkn

# You can even report failures with heartbeats
# (the URL is quoted so the shell does not interpret the "?")
> curl "https://cronitor.link/p/a6425************846f/ROPIkn?state=fail"

3. Optionally send messages and metrics.

# Add a status or error message
> curl "https://cronitor.link/p/a64***********846f/ROPIkn?msg=Success!"

# Metrics will be aggregated automatically
> curl "https://cronitor.link/p/a642*******846f/ROPIkn?metric=count:33012"

4. Pings are recorded in your default environment unless you send them elsewhere.

# Usually, this will be an environment variable or config param
ENVIRONMENT=staging

# If the staging environment doesn't exist it will be created automatically on the first ping
> curl "https://cronitor.link/p/a64********53846f/ROPIkn?env=$ENVIRONMENT"
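The same pings can be sent from code without the SDK. The sketch below is a minimal illustration that assumes only the query parameters shown above (`state`, `msg`, `metric`, `env`). The helper names `build_ping_url` and `send_ping` are hypothetical, and the URL is a placeholder for the Telemetry URL of your own monitor.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_ping_url(telemetry_url, state=None, msg=None, metric=None, env=None):
    """Return telemetry_url with the optional query parameters appended."""
    params = {"state": state, "msg": msg, "metric": metric, "env": env}
    # Drop unset parameters so a plain heartbeat stays a bare URL.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{telemetry_url}?{query}" if query else telemetry_url


def send_ping(url, timeout=10):
    """Send the ping with an HTTP GET and return the response status code."""
    with urlopen(url, timeout=timeout) as response:
        return response.status


# Placeholder: substitute the Telemetry URL shown for your monitor.
TELEMETRY_URL = "https://cronitor.link/p/<api-key>/<monitor-key>"

if __name__ == "__main__":
    # Print the URL a failure ping would use; call send_ping(url) to send it.
    print(build_ping_url(TELEMETRY_URL, state="fail", env="staging"))
```

Building the URL with `urlencode` takes care of escaping characters such as `!` and `:` in messages and metrics, which the curl examples handle by quoting the whole URL.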

Python Integration

1. Install Cronitor’s Python SDK

pip install cronitor

2. Configure Cronitor with your API key. You can also do this by setting the CRONITOR_API_KEY env var.

import cronitor
cronitor.api_key = '990ac*********ba975'

3. Send telemetry events from within your code.

monitor = cronitor.Monitor('ROPIkn')

# send a heartbeat event with a message
monitor.ping(message="Alive!")

# include counts & error counts
monitor.ping(metrics={'count': 100, 'error_count': 3})
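To report a whole job run rather than a single heartbeat, the calls above can be wrapped around the job itself. The sketch below is an illustration, not part of the Cronitor SDK: `run_with_telemetry` is a hypothetical helper, and it assumes that the `ping` callable accepts `state` and `message` keyword arguments, with the `run`, `complete`, and `fail` states mirroring the HTTP pings described earlier. Passing `monitor.ping` as `ping` is the intended use, under that assumption.

```python
import time


def run_with_telemetry(job, ping):
    """Run job() and report its lifecycle through ping(**kwargs)."""
    ping(state="run")
    started = time.monotonic()
    try:
        result = job()
    except Exception as exc:
        # Report the failure with a short message, then re-raise so the
        # process still exits with an error.
        ping(state="fail", message=str(exc)[:200])
        raise
    elapsed = time.monotonic() - started
    ping(state="complete", message=f"finished in {elapsed:.1f}s")
    return result


if __name__ == "__main__":
    # Stand-in for monitor.ping so the sketch runs without an API key.
    def print_ping(**kwargs):
        print("ping", kwargs)

    run_with_telemetry(lambda: sum(range(1000)), print_ping)
```

Taking `ping` as a parameter keeps the wrapper independent of the SDK, so it can be exercised locally with a stand-in function before a real monitor is attached.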

A simple “curl” call from cron in EC2:

Real User Monitoring (RUM)

See site traffic in real-time.

Monitor what’s happening on your website in real time. Measure and compare visits by country, browser, and referrer to better understand your traffic.

It works with many modern frameworks and providers:

For example, a React site:

Simply install the library in your project:

npm install @cronitorio/cronitor-rum

# Or with yarn:

yarn add @cronitorio/cronitor-rum

You can now import and use the Cronitor client in your project.

import * as Cronitor from "@cronitorio/cronitor-rum";

// Load the Cronitor tracker once in your app
Cronitor.load("YOUR_SITE_ID");

// This is how you record page views
// You should trigger this on router/page changes
Cronitor.track("Pageview");

// You can also trigger custom events
Cronitor.track("NewsletterSignup");

Pricing

Cronitor uses a pay-as-you-go model. You pay for the number of monitors and the number of users per month:

In our case, it’s $195 per month:

Back to Prometheus and Grafana

Even though Cronitor is a pretty good, convenient, and affordable tool, our customers did not want to have several monitoring tools. They already had Amazon Managed Service for Prometheus and Amazon Managed Grafana for monitoring many things in AWS and Kubernetes, so we created one Grafana dashboard that has the required information about Job status:
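The post does not list the queries behind that dashboard, so the following is only a sketch of what such panels could be built on. It assumes that kube-state-metrics is scraped by Prometheus (it usually ships with the Prometheus stack) and uses its `kube_job_status_*` metrics. The helper name `job_status_queries` and the namespace value are hypothetical.

```python
def job_status_queries(namespace):
    """Return PromQL expressions for Job status panels in one namespace."""
    selector = f'{{namespace="{namespace}"}}'
    return {
        # Pods currently running for Jobs in the namespace.
        "active": f"sum(kube_job_status_active{selector})",
        # Pods that finished successfully or failed.
        "succeeded": f"sum(kube_job_status_succeeded{selector})",
        "failed": f"sum(kube_job_status_failed{selector})",
        # Per-Job run time in seconds, for Jobs that have completed.
        "duration_seconds": (
            f"kube_job_status_completion_time{selector}"
            f" - kube_job_status_start_time{selector}"
        ),
    }


if __name__ == "__main__":
    # "jobs" is an example namespace, not the one used in the project.
    for panel, expr in job_status_queries("jobs").items():
        print(f"{panel}: {expr}")
```

Each expression can be pasted into a Grafana panel backed by the Prometheus data source. Note that these series disappear once a finished Job object is deleted from the cluster, so the visible history depends on how long finished Jobs are retained.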

Conclusion

Prometheus and Grafana are powerful tools for monitoring infrastructure in general and Kubernetes workloads in particular. We had a seemingly trivial task: monitor Kubernetes Job status and execution history. We looked at the market and found a really interesting tool, Cronitor. It is simple to install and use, quite cheap, and helpful. In our case, Prometheus and Grafana needed some time to implement the required dashboard, but running two different monitoring tools side by side was not an option. Prometheus already covered a lot of monitoring beyond Kubernetes, so we decided to stay with it. Still, Cronitor was a good experience, and it is worth attention.

Oleksii Bebych

AWS expert and engineer with 10 years of experience in Information Technologies (product and outsourcing companies), networking, technical support, system administration, DevOps, and banking, certified by several world-famous vendors (AWS, Google, Cisco, Linux Foundation, Microsoft, Hashicorp). He is participating in AWS competency programs and the development of AWS partnerships. He writes posts for the company's tech blog and conducts webinars. He participates in well-architected reviews and leads strategic projects that improve delivery results and help in the presale phase.