Reliable monitoring with AWS-managed Prometheus and Grafana

Problem statement

Prometheus is an open-source monitoring system with a dimensional data model, a flexible query language, an efficient time series database, and a modern alerting approach. It is widely used to monitor different parts of an infrastructure, including Kubernetes clusters. An excellent Helm chart can be used to deploy Prometheus in Kubernetes (Amazon EKS in our case), but it has a couple of limitations that are discussed in this post.

Imagine you have an EKS cluster with nodes in several availability zones and Prometheus deployed there, including Alertmanager and Grafana. Every component generates data and needs a persistent volume. The most common choice for persistent storage is EBS volumes, but what happens if a Pod fails and Kubernetes moves it to a node in a different AZ, or a whole AZ fails?

Every EBS volume resides in a particular availability zone. If a Pod is moved to a node in a different AZ, its EBS volume can no longer be attached and the Pod will not start. We can use node affinity or nodeSelector to “bind” pods to nodes in a specific AZ, but then the monitoring solution is no longer highly available.

Solution design

Amazon Managed Service for Prometheus can be used as persistent storage for the setup described above, and Alertmanager can be enabled within the service.
Amazon Managed Grafana is a fully managed, highly available service that can replace Grafana running in Kubernetes.

In the new design, we deploy the monitoring stack without Alertmanager and Grafana and configure remote write on the Prometheus pod so that it sends metrics to Amazon Managed Service for Prometheus.
Amazon Managed Service for Prometheus uses its own Alertmanager and sends notifications via Amazon SNS. Login to Grafana is configured via AWS IAM Identity Center.
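
To make the SNS notification path more concrete, here is a minimal sketch of an Alertmanager definition for the managed service. It assumes the wrapped `alertmanager_config` layout that Amazon Managed Service for Prometheus expects; the function name, the receiver name `default`, and the topic ARN in the usage note are hypothetical placeholders, not values from our setup.

```shell
# Writes a minimal Alertmanager definition that routes every alert to one SNS topic.
# The function name and the receiver name "default" are illustrative assumptions.
write_alertmanager_definition() {
  local out_file="$1" topic_arn="$2" region="$3"
  cat > "$out_file" <<EOF
alertmanager_config: |
  route:
    receiver: default
  receivers:
    - name: default
      sns_configs:
        - topic_arn: ${topic_arn}
          sigv4:
            region: ${region}
EOF
}

# Upload step, left commented out because it needs AWS credentials and a real workspace ID:
# aws amp create-alert-manager-definition --workspace-id "$WORKSPACE_ID" --data fileb://alertmanager.yaml
```

For example, `write_alertmanager_definition alertmanager.yaml arn:aws:sns:us-east-1:111122223333:example-topic us-east-1` produces a file that can be reviewed before uploading. The SNS topic also needs an access policy that allows the service to publish to it.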

Amazon Managed Service for Prometheus configuration

Creation of a Prometheus workspace is super simple:

Once it’s done, you will see the “Remote write URL” endpoint that will be used as persistent storage during the installation of the Prometheus Helm Chart. The “Query URL” endpoint will be used in Grafana as a source of metrics.
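
The workspace can also be created from the command line. The sketch below assumes the AWS CLI is installed and configured; the alias `eks-monitoring` and the helper function names are illustrative only. The two helpers rebuild the endpoints shown in the console from a region and a workspace ID.

```shell
# Helpers that build the workspace endpoint URLs from a region and a workspace ID.
# Function names are hypothetical; the URL layout follows the endpoints shown in the console.
amp_remote_write_url() {
  local region="$1" workspace_id="$2"
  printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s/api/v1/remote_write\n' "$region" "$workspace_id"
}

amp_query_url() {
  local region="$1" workspace_id="$2"
  printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s/api/v1/query\n' "$region" "$workspace_id"
}

# Example usage, left commented out because it needs AWS credentials:
# workspace_id=$(aws amp create-workspace --alias eks-monitoring --region us-east-1 \
#   --query workspaceId --output text)
# amp_remote_write_url us-east-1 "$workspace_id"
```

When the data source is added by hand in Grafana, the URL is usually entered without the trailing `/api/v1/query` part.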

We need to override some of the Prometheus chart defaults. Use the following Helm values for the installation:

## The following is a set of default values for prometheus server helm chart which enable remoteWrite to AMP
## For the rest of prometheus helm chart values see:
serviceAccounts:
  server:
    name: "amp-iamproxy-ingest-service-account"
server:
  remoteWrite:
    - url:****-****-****-1c0cf252b5f3/api/v1/remote_write
      sigv4:
        region: us-east-1
      queue_config:
        max_samples_per_send: 1000
        max_shards: 200
        capacity: 2500
alertmanager:
  persistentVolume:
    ## If true, alertmanager will create/use a Persistent Volume Claim
    ## If false, use emptyDir
    enabled: false

     enabled: false

Or update your current Prometheus installation:


helm upgrade prometheus-chart-name prometheus-community/prometheus -n prometheus_namespace -f my_prometheus_values_yaml --version current_helm_chart_version
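
For a fresh installation, the steps could look like the sketch below. It assumes Helm 3 and the public prometheus-community chart repository; the function name and the release, namespace, and values file arguments are placeholders. The function only prints the commands so they can be reviewed before running.

```shell
# Prints the Helm commands for a first-time install instead of executing them.
# Release name, namespace, and values file are caller-supplied placeholders.
helm_install_commands() {
  local release="$1" namespace="$2" values_file="$3"
  echo "helm repo add prometheus-community https://prometheus-community.github.io/helm-charts"
  echo "helm repo update"
  echo "helm upgrade --install $release prometheus-community/prometheus --namespace $namespace --create-namespace -f $values_file"
}
```

Calling `helm_install_commands prometheus monitoring values.yaml` lists the three commands; piping the output to `sh` runs them, which is only safe when the arguments contain no spaces or shell metacharacters.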

You will need to set up the OIDC provider and IAM roles for service accounts yourself in the Kubernetes cluster.
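
One possible way to do this is with eksctl. The sketch below is an assumption about tooling, not the method used in our setup: it assumes eksctl is installed and uses the AWS managed policy `AmazonPrometheusRemoteWriteAccess`. The function name and its arguments are hypothetical, and it only prints the commands for review.

```shell
# Prints eksctl commands that associate an OIDC provider with the cluster and
# create an IAM role bound to the Prometheus server's service account.
irsa_commands() {
  local cluster="$1" namespace="$2" service_account="$3"
  # AWS managed policy that grants remote write access to the managed Prometheus service.
  local policy_arn="arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess"
  echo "eksctl utils associate-iam-oidc-provider --cluster $cluster --approve"
  echo "eksctl create iamserviceaccount --cluster $cluster --namespace $namespace --name $service_account --attach-policy-arn $policy_arn --approve"
}
```

For example, `irsa_commands my-cluster monitoring amp-iamproxy-ingest-service-account` uses the service account name from the Helm values. If eksctl creates the service account, the chart should probably be told not to create it again (`serviceAccounts.server.create: false` in the prometheus-community chart); check this against your chart version.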

Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their results as a new set of time series. Querying the precomputed result is often much faster than running the original expression every time it is needed.
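
As an illustration, the sketch below writes a rule file with a single recording rule. The function name, the group name, and the metric `http_requests_total` are hypothetical examples; the file uses the standard Prometheus rule format.

```shell
# Writes a minimal Prometheus rule file containing one recording rule.
# Group, rule, and metric names are illustrative, not taken from a real setup.
write_recording_rules() {
  local out_file="$1"
  cat > "$out_file" <<'EOF'
groups:
  - name: example-recording-rules
    rules:
      # Precompute the per-job request rate over the last 5 minutes.
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
EOF
}

# Upload step, left commented out because it needs AWS credentials and a real workspace ID:
# aws amp create-rule-groups-namespace --workspace-id "$WORKSPACE_ID" --name example-rules --data fileb://rules.yaml
```

After the upload, dashboards can query `job:http_requests:rate5m` directly instead of re-evaluating the full expression on every refresh.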

Amazon Managed Grafana configuration

Two Grafana versions are currently available: 8.4 and 9.4.

Only two authentication methods are available: AWS IAM Identity Center and SAML.

You can deploy Grafana inside or outside a VPC.

Access to the web UI can be public or restricted.

The following data sources are supported in every Amazon Managed Grafana workspace:

– Alertmanager data source
– Amazon CloudWatch
– Amazon OpenSearch Service
– AWS IoT SiteWise
– AWS IoT TwinMaker
– Amazon Managed Service for Prometheus and open-source Prometheus
– Amazon Timestream
– Amazon Athena
– Amazon Redshift
– AWS X-Ray
– Azure Monitor
– Cloudflare
– GitHub
– Graphite
– Google BigQuery
– Google Cloud Monitoring
– Google Sheets
– InfluxDB
– Jaeger
– Loki
– Microsoft SQL Server
– Moogsoft AIOps
– OpenSearch
– OpenTSDB
– Pixie
– PostgreSQL
– Redis
– Tempo
– TestData
– Zabbix
– Zipkin

The following data sources are supported in workspaces that have been upgraded to Grafana Enterprise:

– AppDynamics
– Databricks
– Datadog
– Dynatrace
– GitLab
– Honeycomb
– Jira
– MongoDB
– New Relic
– Oracle Database
– Salesforce
– ServiceNow
– Snowflake
– Splunk
– Splunk Infrastructure Monitoring (formerly SignalFX)
– Wavefront (VMware Tanzu Observability by Wavefront)

Grafana Enterprise comes with a 30-day free trial. After that, you can buy a Grafana Enterprise subscription.

Once Grafana is deployed, you can assign users and groups access to it.

The Grafana application then appears on the AWS IAM Identity Center web page.

When you enable a data source from the AWS console, it only adds the required permissions.

You also need to complete the data source configuration from the Grafana side.

Select the Amazon Managed Service for Prometheus workspace created in the previous steps.

Many Grafana dashboards are publicly available and can be imported, for example:


The price of Amazon Managed Service for Prometheus depends on the metrics ingested, queried, and stored.

The Grafana price depends on the number of active user licenses per workspace. Amazon Managed Grafana offers a 90-day free trial with up to five free users per account. There are two user license types, Editor and Viewer. Usage beyond the five free users is billed at the standard rates: $9 per active editor or administrator user per workspace, and $5 per active viewer per workspace. The total Grafana Enterprise monthly charge is $3500.

In our case, we paid $7.50 per day for Prometheus and $0 for Grafana, because it was still in the free trial.
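
To turn these figures into a rough monthly estimate, here is a small arithmetic sketch. It uses only the rates quoted above, ignores the free trial and the five free users, and assumes the daily Prometheus spend stays constant; the function names are illustrative.

```shell
# Monthly Amazon Managed Grafana license cost in USD for one workspace,
# at $9 per active editor or administrator and $5 per active viewer.
grafana_monthly_cost() {
  local editors="$1" viewers="$2"
  echo $(( editors * 9 + viewers * 5 ))
}

# Projects a constant daily Prometheus bill over a number of days.
prometheus_cost() {
  local per_day="$1" days="$2"
  awk -v d="$per_day" -v n="$days" 'BEGIN { printf "%.2f\n", d * n }'
}
```

For example, `grafana_monthly_cost 2 10` prints 68 and `prometheus_cost 7.5 30` prints 225.00, so a team with two editors and ten viewers at our ingestion level would pay roughly $293 per 30 days once the trial ends.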


In this post, we demonstrated the capabilities of Amazon Managed Grafana and Amazon Managed Service for Prometheus. They certainly cost more than a Kubernetes installation with persistent volumes on EBS, but they resolve several issues with high availability and data persistence, which may be critical for some organizations.