Simulating failures in Amazon Aurora MySQL / PostgreSQL

Clouds allow us to design highly available and fault-tolerant systems quite easily. Moreover, we can use managed services (for example, databases) to reduce operational overhead and focus on our business logic. But design and implementation are not enough; we should also test how our system tolerates failures and continuously improve it.
In this post, we will look at ways to simulate database failures, using Amazon Aurora MySQL and PostgreSQL as examples. We will cover the AWS Fault Injection Service and Aurora fault injection queries.

AWS Fault Injection Service for Amazon Aurora

Part of AWS Resilience Hub, AWS Fault Injection Service (FIS) is a fully managed service for running fault injection experiments to improve an application’s performance, observability, and resilience. FIS simplifies the process of setting up and running controlled fault injection experiments across a range of AWS services so that teams can build confidence in their application behavior.

Aurora MySQL experiment

We can test the Aurora/RDS failover by creating and running an experiment:

You can use it within a single AWS account or across multiple accounts in an AWS Organization:

Set a name and description of the experiment template:

Add an action (in our case, it’s “failover-db-cluster”):

Choose the target cluster (the first experiment will be for Aurora MySQL):

I’ve executed two simple scripts: one INSERTs a row into a table on the writer instance every 0.5 seconds, and the other SELECTs from the table on the Read Replica every second:
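The SQL behind these scripts is trivial; a minimal sketch, assuming a hypothetical test table named `failover_test`:

```sql
-- Hypothetical test table (any narrow table works)
CREATE TABLE failover_test (
  id INT AUTO_INCREMENT PRIMARY KEY,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Run against the cluster (writer) endpoint every 0.5 seconds
INSERT INTO failover_test () VALUES ();

-- Run against the reader endpoint every second
SELECT MAX(id), MAX(created_at) FROM failover_test;
```

Logging each statement's timestamp and success/failure is what lets us measure the recovery windows below.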

Start the experiment:

In the RDS console, we can see the cluster is Failing over:

INSERT queries started failing at 18:23:03 and succeeded again at 18:23:09, so the writer instance recovered in 6 seconds.

The read replica failed at 18:22:49 and was serving queries again at 18:23:05, so the reader recovered in 16 seconds.

 

In the RDS console, we can see recent events. Failover started at 18:22:48 (UTC)

So, the full failover process took 56 seconds.

Aurora PostgreSQL experiment

Let’s try the same with Aurora PostgreSQL.

Similar scripts: one INSERTs a row into a table on the writer instance every 0.5 seconds, and the other SELECTs from the table on the Read Replica every second:

The experiment started, and queries started failing:

In the RDS console, we can see the cluster is Failing over:

INSERT queries started failing at 18:32:36 and succeeded again at 18:32:49, so the writer instance recovered in 13 seconds.

The read replica failed at 18:32:36 and was serving queries again at 18:32:47, so the reader recovered in 11 seconds.

In the RDS console, we can see recent events. Failover started at 18:32:27 (UTC)

And completed at 18:32:54 (UTC).

So, the full failover process took 27 seconds.

Fault injection queries

Injection queries for MySQL

Using fault injection queries, you can test the fault tolerance of your Aurora MySQL DB cluster. Fault injection queries are issued as SQL commands to an Amazon Aurora instance, and they let you schedule a simulated occurrence of one of the following events:

  • a crash of an instance, dispatcher, or node
  • an Aurora Replica failure
  • a disk failure
  • disk congestion

Simulate a crash of the instance. You can crash the DB instance, the dispatcher, or both (NODE):

 

ALTER SYSTEM CRASH [ INSTANCE | DISPATCHER | NODE ];

I’ve tried it for the NODE:
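Following the syntax above, the node-level crash looks like this:

```sql
-- Crash both the DB instance and the dispatcher;
-- the client connection is dropped immediately
ALTER SYSTEM CRASH NODE;
```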

INSERT queries stopped at 14:02:58 and recovered at 14:03:45, so the writer was unavailable for 47 seconds.

In this case, my Read Replica replaced the Writer instance:

Next, let’s try a Disk Failure:

During a disk failure simulation, the Aurora DB cluster randomly marks disk segments as faulty. Requests to those segments are blocked for the duration of the simulation.

 

ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK FAILURE
    [ IN DISK index | NODE index ]
    FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };

I’ve simulated 80% of the Disk Failure during 30 seconds:
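Per the syntax above, that simulation looks like this:

```sql
-- Mark 80% of disk segments as faulty for 30 seconds
ALTER SYSTEM SIMULATE 80 PERCENT DISK FAILURE FOR INTERVAL 30 SECOND;
```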

We can see that INSERT queries succeeded only once every 4-8 seconds instead of the usual 0.5 seconds.

Simulate disk congestion

During a disk congestion simulation, the Aurora DB cluster randomly marks disk segments as congested. Requests to those segments will be delayed between the specified minimum and maximum delay time for the simulation duration.

 

ALTER SYSTEM SIMULATE percentage_of_failure PERCENT DISK CONGESTION
    BETWEEN minimum AND maximum MILLISECONDS
    [ IN DISK index | NODE index ]
    FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };
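For example, with illustrative values (the percentage and delay bounds here are my own, not from any particular run):

```sql
-- Delay requests to 50% of disk segments by 500-1000 ms for one minute
ALTER SYSTEM SIMULATE 50 PERCENT DISK CONGESTION
    BETWEEN 500 AND 1000 MILLISECONDS
    FOR INTERVAL 1 MINUTE;
```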

We can see that INSERT queries succeeded only once every 5-10 seconds instead of the usual 0.5 seconds.

Simulate replica failure

An Aurora Replica failure blocks all requests from the writer instance to an Aurora Replica, or to all Aurora Replicas in the DB cluster, for a specified time interval. When the time interval is complete, the affected Aurora Replicas automatically resync with the writer instance.

 

ALTER SYSTEM SIMULATE percentage_of_failure PERCENT READ REPLICA FAILURE
    [ TO ALL | TO "replica name" ]
    FOR INTERVAL quantity { YEAR | QUARTER | MONTH | WEEK | DAY | HOUR | MINUTE | SECOND };
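A concrete call might look like this (illustrative values):

```sql
-- Block 100% of requests from the writer to all replicas for one minute
ALTER SYSTEM SIMULATE 100 PERCENT READ REPLICA FAILURE
    TO ALL
    FOR INTERVAL 1 MINUTE;
```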

INSERT queries keep running every 0.5 seconds, but SELECT queries against the Read Replica no longer return new data:

Note
Take care when specifying the time interval for your Aurora Replica failure event. If you specify too long a time interval, and your writer instance writes a large amount of data during the failure event, your Aurora DB cluster might assume that the Aurora Replica has crashed and replace it.

As we can see, this happened:

Replica lag is visible in the default AuroraReplicaLag CloudWatch metric:

Injection queries for PostgreSQL

You can simulate an Aurora Replica failure, a disk failure, and disk congestion. Fault injection queries are supported by all currently available Aurora PostgreSQL versions:

  • Aurora PostgreSQL versions 12, 13, 14, and higher
  • Aurora PostgreSQL version 11.7 and higher
  • Aurora PostgreSQL version 10.11 and higher

Testing an instance crash

You can force a crash of an Aurora PostgreSQL instance by using the fault injection query function aurora_inject_crash().

SELECT aurora_inject_crash ('instance' | 'dispatcher' | 'node');

Testing an Aurora Replica failure

You can simulate the failure of an Aurora Replica by using the fault injection query function aurora_inject_replica_failure().

 

SELECT aurora_inject_replica_failure(
   percentage_of_failure, 
   time_interval, 
   'replica_name'
);
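A concrete call might look like this (the replica name is hypothetical, and I’m assuming the interval is specified in seconds, as in the AWS documentation):

```sql
-- Fail 100% of requests to the named replica for 60 seconds
SELECT aurora_inject_replica_failure(100, 60, 'my-aurora-replica-1');
```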

Testing a disk failure

You can simulate a disk failure for an Aurora PostgreSQL DB cluster by using the fault injection query function aurora_inject_disk_failure().

 

SELECT aurora_inject_disk_failure(
   percentage_of_failure, 
   index, 
   is_disk, 
   time_interval
);
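For example (illustrative values; per the AWS documentation, the boolean argument selects whether the index refers to a disk, true, or a node, false):

```sql
-- Fail 80% of requests to disk index 0 for 30 seconds
SELECT aurora_inject_disk_failure(80, 0, true, 30);
```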

Testing disk congestion

You can simulate disk congestion for an Aurora PostgreSQL DB cluster by using the fault injection query function aurora_inject_disk_congestion().

 

SELECT aurora_inject_disk_congestion(
   percentage_of_failure, 
   index, 
   is_disk, 
   time_interval, 
   minimum, 
   maximum
);
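For example, with illustrative values (mirroring the MySQL congestion example above, with the minimum and maximum delay in milliseconds):

```sql
-- Congest 50% of requests to disk index 0 for 30 seconds,
-- delaying them between 500 and 1000 milliseconds
SELECT aurora_inject_disk_congestion(50, 0, true, 30, 500, 1000);
```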

Conclusion

Simulating failures in your infrastructure is just as important as designing for high availability and fault tolerance. The AWS cloud takes a lot of operational effort off your hands if you use managed services like Amazon Aurora (check the shared responsibility model), but you still need to understand how your application behaves during a database failure (and not only a database one). For this particular database, you can simulate failures in two ways: with AWS Fault Injection Service or with Aurora fault injection queries. At the time of writing, injection queries give more testing capabilities, but AWS FIS is being actively developed and can cover other AWS services besides databases.