How we migrated applications from Heroku to AWS, and the issues and limitations we hit on both the Heroku and AWS sides.


Inputs and requirements

Our customer had a production application running on Heroku, consisting of several containerized web applications and APIs (dynos), a PostgreSQL database, Redis, RabbitMQ, and jobs (workers) triggered by messages from a RabbitMQ queue. The business decision was made to migrate to AWS because Heroku had become too expensive, and Automat-it was chosen to design and complete the migration.

Initial plan

The initial plan was to deploy EKS, Amazon Aurora, ElastiCache, and Amazon MQ, deploy the applications in EKS, migrate the database using AWS Database Migration Service (DMS), and switch DNS from Heroku to AWS.

But then we figured out that our timeline was very tight and that the database had to be migrated as soon as possible: the Heroku Enterprise subscription was about to end. The applications could keep working on Heroku for a while after that, but the database had to be migrated before that date.

According to security best practices, if a database and an application live in different networks (VPCs, clouds, etc.), we have to configure private connectivity between them (VPC peering, a VPN connection, etc.).

Heroku services run on AWS EC2, so creating private connectivity such as VPC peering looks straightforward (in theory), but there was a surprise here. If you want something like VPC peering between Heroku and AWS, you must have a Heroku Enterprise subscription and pay ~$1,200 per month for such a connection (called Heroku Private Space).

Nobody wanted to pay that much for VPC peering, which is completely free in AWS, so we started looking for alternatives.

We considered RDS Proxy but found that its endpoint does not resolve to a public IP even when it is deployed in a public subnet, so connecting to RDS Proxy from the internet is impossible.

An internet-facing Network Load Balancer in front of RDS could be risky because a DNS name cannot be an NLB target. We could configure the private IP address of the RDS instance as the NLB target, but there is no guarantee that this IP address will not change in the future.
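
For illustration, this is roughly what that discarded option would have looked like with boto3; all identifiers below are placeholders, not values from the actual project:

import boto3

elbv2 = boto3.client("elbv2")

# Target group with IP targets, because an NLB cannot point at a DNS name
target_group = elbv2.create_target_group(
    Name="rds-postgres-tg",          # placeholder name
    Protocol="TCP",
    Port=5432,
    VpcId="vpc-0123456789abcdef0",   # placeholder VPC
    TargetType="ip",
)["TargetGroups"][0]

# Register the *current* private IP of the RDS instance -- the weak point:
# this IP is not guaranteed to stay the same after failover or maintenance
elbv2.register_targets(
    TargetGroupArn=target_group["TargetGroupArn"],
    Targets=[{"Id": "10.0.12.34", "Port": 5432}],
)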

So, in this EXCEPTIONAL case, we decided to deploy a publicly resolvable RDS instance in a subnet with a temporary default route to the Internet Gateway. The old Heroku database had been publicly available its whole life, so our customer insisted on a similar approach because of the tight migration deadline.
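
A minimal sketch of that temporary setup with boto3 (the instance identifier, route table ID, and gateway ID are placeholders); the route and public accessibility were reverted right after the migration:

import boto3

rds = boto3.client("rds")
ec2 = boto3.client("ec2")

# Make the RDS endpoint resolve to a public IP (the instance must live in public subnets)
rds.modify_db_instance(
    DBInstanceIdentifier="migration-target-db",   # placeholder
    PubliclyAccessible=True,
    ApplyImmediately=True,
)

# Temporary default route to the Internet Gateway for the DB subnets
ec2.create_route(
    RouteTableId="rtb-0123456789abcdef0",         # placeholder
    DestinationCidrBlock="0.0.0.0/0",
    GatewayId="igw-0123456789abcdef0",            # placeholder
)

# After the migration, the route is deleted and the instance is made private again:
# ec2.delete_route(RouteTableId="rtb-0123456789abcdef0", DestinationCidrBlock="0.0.0.0/0")
# rds.modify_db_instance(DBInstanceIdentifier="migration-target-db",
#                        PubliclyAccessible=False, ApplyImmediately=True)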

Database migration

We had to migrate ~400 GB of PostgreSQL data as fast as possible and ensure the data remained consistent.
The first plan was to download the latest daily DB backup from Heroku, restore it to RDS, and configure replication of ongoing changes via AWS Database Migration Service. For that, we had to complete the following prerequisites:

For full-load plus CDC tasks or CDC-only tasks, grant superuser permissions for the user account specified for the PostgreSQL source database. The user account needs superuser permissions to access replication-specific functions in the source. For full-load only tasks, the user account needs SELECT permissions on tables to migrate them.
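
On a self-managed PostgreSQL source, these prerequisites translate into statements like the following sketch (the dms_user role name is a hypothetical example, and the connection string is a placeholder):

import psycopg2

# Connect to the source database as an administrator (placeholder credentials)
conn = psycopg2.connect("host=source-db.example.com dbname=app user=postgres password=XXXXXXX")
with conn, conn.cursor() as cur:
    # Full load + CDC: the DMS user needs superuser rights on the source
    cur.execute("ALTER ROLE dms_user WITH SUPERUSER;")
    # Full load only: SELECT on the migrated tables is enough
    cur.execute("GRANT SELECT ON ALL TABLES IN SCHEMA public TO dms_user;")
conn.close()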

But surprise-surprise:

 

Heroku Postgres does not give customers access to the superuser role, or allow customers to grant other users the replication role, due to these roles being very privileged.

So we could not configure replication of ongoing changes.

The next option was to configure a full-load-only migration task, but our tests showed that some data was simply lost during the migration (possibly LOBs), and the migration time was not acceptable (~24 hours), because the application serves many customers around the globe 24/7. The maximum downtime the company could afford was several hours.

So, in the end, we simply switched off the applications during a scheduled maintenance window, made a DB dump from Heroku, copied it to AWS, restored it to RDS, and pointed the applications to the new DB host.
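
A simplified sketch of that dump-and-restore step (connection strings are placeholders, and pg_dump/pg_restore are assumed to reach both databases directly; in reality the dump file was copied to AWS in between):

import subprocess

SOURCE_URL = "postgres://user:XXXXXXX@heroku-host:5432/app"   # placeholder
TARGET_URL = "postgres://user:XXXXXXX@rds-host:5432/app"      # placeholder

# Dump the source database in the custom format (compressed, restorable in parallel)
subprocess.run(["pg_dump", "--format=custom", "--no-owner",
                "--file=app.dump", SOURCE_URL], check=True)

# Restore into RDS with several parallel jobs to shorten the maintenance window
subprocess.run(["pg_restore", "--no-owner", "--no-privileges",
                "--jobs=4", "--dbname=" + TARGET_URL, "app.dump"], check=True)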

Post-migration RDS issues

Once the applications from Heroku started using the new RDS database, we detected massive “memory leaks” on the AWS RDS side.

We continuously observed the following behavior: the 128 GB of memory of a db.r6g.4xlarge instance was consumed down to zero within 3-6 hours.

After that, the DB restarted with the following messages:

 

{
  "SourceIdentifier": "******",
  "SourceType": "db-instance",
  "Message": "DB instance restarted",
  "EventCategories": [
    "availability"
  ],
  "Date": "2023-10-23T23:45:30.314000+00:00",
  "SourceArn": "arn:aws:rds:us-east-1:************:db:******"
},
{
  "SourceIdentifier": "******",
  "SourceType": "db-instance",
  "Message": "The database process was killed by the OS due to excessive memory consumption. It is recommended to tune your database workload and/or parameter usage to reduce memory consumption.",
  "EventCategories": [],
  "Date": "2023-10-23T23:46:19.938000+00:00",
  "SourceArn": "arn:aws:rds:us-east-1:************:db:******"
},

The first suspect for the root cause appeared when we looked at the OS process list in the RDS console. We saw many idle connections, and after discussing it with the application developers, we figured out that the application itself does not manage DB connections properly (it does not reuse them, does not close unneeded ones, etc.). Instead, the application relies on a connection pooling mechanism, implemented via pgbouncer in Heroku, which is absent in AWS RDS.
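
A quick way to confirm such a picture is to count sessions by state in pg_stat_activity; a sketch of that check (connection parameters are placeholders):

import psycopg2

conn = psycopg2.connect(host="XXXXXXX", dbname="XXXXXXX", user="XXXXXX", password="XXXXXXX")
with conn.cursor() as cur:
    # Number of backends per state; a large 'idle' count points at missing pooling
    cur.execute("""
        SELECT state, count(*), max(now() - state_change) AS oldest
        FROM pg_stat_activity
        WHERE pid <> pg_backend_pid()
        GROUP BY state
        ORDER BY count(*) DESC;
    """)
    for state, count, oldest in cur.fetchall():
        print(state, count, oldest)
conn.close()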

Then we tried many different things: added RDS Proxy, tuned PostgreSQL engine parameters together with AWS technical support, migrated from RDS to Aurora (AWS claims Aurora provides up to 3x the throughput of PostgreSQL), increased the Aurora instance type, and added more read replicas. Nothing helped, and the DB continued restarting several times per day.

After further investigation, we found that the applications intensively create and destroy temporary PostgreSQL functions.
Example of the function code (the data of the actual insert is omitted for security reasons):

CREATE OR REPLACE FUNCTION pg_temp.testfunc(
  OUT response "hidden_table_name",
  OUT sequelize_caught_exception text
) RETURNS RECORD AS $func_b529ba47fad14b51883867eb94fa17fc$
BEGIN
  INSERT INTO "test" ("id") VALUES (1) RETURNING * INTO response;
EXCEPTION WHEN unique_violation THEN
  GET STACKED DIAGNOSTICS sequelize_caught_exception = PG_EXCEPTION_DETAIL;
END $func_b529ba47fad14b51883867eb94fa17fc$ LANGUAGE plpgsql;

SELECT
  (testfunc.response).*,
  testfunc.sequelize_caught_exception
FROM
  pg_temp.testfunc();

DROP FUNCTION IF EXISTS pg_temp.testfunc();

The query rate was approximately 18,000 such queries per hour per application. Each application holds two connections to the DB, and the total number of running application instances was 74.


To test the query’s influence on the DB, we developed the following script:

import psycopg2
import concurrent.futures

def run_query(n, cur, conn):
    # Emulate the application's pattern: create a temp function, call it, drop it
    query = ("CREATE OR REPLACE FUNCTION pg_temp.testfunc(OUT response test, OUT sequelize_caught_exception text) \
        RETURNS RECORD AS $func_a8%s$ BEGIN INSERT INTO test (\"id\") VALUES (%s) RETURNING * INTO response; \
        EXCEPTION WHEN unique_violation THEN GET STACKED DIAGNOSTICS sequelize_caught_exception = PG_EXCEPTION_DETAIL; \
        END $func_a8%s$ LANGUAGE plpgsql; \
        SELECT (testfunc.response).*, testfunc.sequelize_caught_exception FROM pg_temp.testfunc(); \
        DROP FUNCTION IF EXISTS pg_temp.testfunc();" % (str(n), str(n), str(n)))
    cur.execute(query)
    conn.commit()

def run_task(id):
    # Each thread keeps its own connection, and therefore its own DB session
    conn = psycopg2.connect(
        database="XXXXXXX",
        user="XXXXXX",
        password="XXXXXXX",
        host="XXXXXXX",
        port='XXXX'
    )
    cur = conn.cursor()
    print("Thread: " + str(id))

    for n in range(1, 100000000):
        run_query(n + id, cur, conn)

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    futures = []
    for i in range(1, 50):
        futures.append(executor.submit(run_task, id=i))
    for future in concurrent.futures.as_completed(futures):
        print(future.result())

The script was executed against a test database, and we found that a DB session consumed about 500 MB of memory after 18,000 executions.

We also found the same, still unfixed bug reported on the PostgreSQL bug tracker:
https://www.postgresql.org/message-id/flat/20161028162317.15991.74004%40wrigleys.postgresql.org

The root of the problem:

The PostgreSQL session sub-process does not release memory after the temporary function is destroyed. The processes consume all the DB memory and the DB crashes. The issue is reproducible on PostgreSQL 11.16 and 12; other PostgreSQL versions were not tested.
The same test was performed on a Heroku DB (the Standard 0 plan was used). The memory consumption behavior is the same, but Heroku manages memory differently: the OOM killer terminates only the sub-processes with the biggest memory consumption, without crashing and fully restarting the DB.

Such OOM-killer behavior does not cause massive application terminations.


DB metrics (from Heroku), where we can see the DB memory leak, the automatic process kill, and the memory release:

2023-10-26T13:13:01.000000+00:00 app[heroku-postgres]: source=DATABASE
addon=postgresql-tetrahedral-52858 sample#current_transaction=506981 sample#db_size=15524719bytes
sample#tables=1 sample#active-connections=21 sample#waiting-connections=0
sample#index-cache-hit-rate=0 sample#table-cache-hit-rate=0.99874 sample#load-avg-1m=1.34
sample#load-avg-5m=1.25 sample#load-avg-15m=0.98 sample#read-iops=0 sample#write-iops=1.2083
sample#tmp-disk-used=543600640 sample#tmp-disk-available=72435191808 sample#memory-total=8024092kB
sample#memory-free=932408kB sample#memory-cached=2573720kB sample#memory-postgres=4135276kB
sample#wal-percentage-used=0.06824191790795707

2023-10-26T13:14:26.000000+00:00 app[heroku-postgres]: source=DATABASE
addon=postgresql-tetrahedral-52858 sample#current_transaction=510056 sample#db_size=15606639bytes
sample#tables=1 sample#active-connections=21 sample#waiting-connections=0
sample#index-cache-hit-rate=0 sample#table-cache-hit-rate=0.99875 sample#load-avg-1m=1.06
sample#load-avg-5m=1.185 sample#load-avg-15m=0.98 sample#read-iops=0 sample#write-iops=0.92308
sample#tmp-disk-used=543600640 sample#tmp-disk-available=72435191808 sample#memory-total=8024092kB
sample#memory-free=886012kB sample#memory-cached=2578276kB sample#memory-postgres=4176212kB
sample#wal-percentage-used=0.06824191790795707

2023-10-26T13:15:50.000000+00:00 app[heroku-postgres]: source=DATABASE
addon=postgresql-tetrahedral-52858 sample#current_transaction=513367 sample#db_size=15663983bytes
sample#tables=1 sample#active-connections=21 sample#waiting-connections=0
sample#index-cache-hit-rate=0 sample#table-cache-hit-rate=0.99876 sample#load-avg-1m=1.49
sample#load-avg-5m=1.3 sample#load-avg-15m=1.035 sample#read-iops=0 sample#write-iops=0.8
sample#tmp-disk-used=543600640 sample#tmp-disk-available=72435191808 sample#memory-total=8024092kB
sample#memory-free=834188kB sample#memory-cached=2583840kB sample#memory-postgres=4222112kB
sample#wal-percentage-used=0.06824191790795707

2023-10-26T13:17:14.000000+00:00 app[heroku-postgres]: source=DATABASE
addon=postgresql-tetrahedral-52858 sample#current_transaction=515149 sample#db_size=15680367bytes
sample#tables=1 sample#active-connections=21 sample#waiting-connections=0
sample#index-cache-hit-rate=0 sample#table-cache-hit-rate=0.99877 sample#load-avg-1m=1.36
sample#load-avg-5m=1.3 sample#load-avg-15m=1.045 sample#read-iops=0 sample#write-iops=1.075
sample#tmp-disk-used=543600640 sample#tmp-disk-available=72435191808 sample#memory-total=8024092kB
sample#memory-free=807004kB sample#memory-cached=2584244kB sample#memory-postgres=4247168kB
sample#wal-percentage-used=0.06824191790795707

2023-10-26T13:18:40.000000+00:00 app[heroku-postgres]: source=DATABASE
addon=postgresql-tetrahedral-52858 sample#current_transaction=521118 sample#db_size=15762287bytes
sample#tables=1 sample#active-connections=22 sample#waiting-connections=0
sample#index-cache-hit-rate=0 sample#table-cache-hit-rate=0.99879 sample#load-avg-1m=1.295
sample#load-avg-5m=1.325 sample#load-avg-15m=1.075 sample#read-iops=0 sample#write-iops=0.77632
sample#tmp-disk-used=543600640 sample#tmp-disk-available=72435191808 sample#memory-total=8024092kB
sample#memory-free=759424kB sample#memory-cached=2588780kB sample#memory-postgres=4290188kB
sample#wal-percentage-used=0.06824191790795707

2023-10-26T13:20:04.000000+00:00 app[heroku-postgres]: source=DATABASE
addon=postgresql-tetrahedral-52858 sample#current_transaction=524076 sample#db_size=15778671bytes
sample#tables=1 sample#active-connections=21 sample#waiting-connections=0
sample#index-cache-hit-rate=0 sample#table-cache-hit-rate=0.99881 sample#load-avg-1m=1.765
sample#load-avg-5m=1.4 sample#load-avg-15m=1.12 sample#read-iops=0 sample#write-iops=0.52632
sample#tmp-disk-used=543600640 sample#tmp-disk-available=72435191808 sample#memory-total=8024092kB
sample#memory-free=698500kB sample#memory-cached=2604852kB sample#memory-postgres=4334216kB
sample#wal-percentage-used=0.06824191790795707

2023-10-26T13:20:04.000000+00:00 app[heroku-postgres]: source=DATABASE
addon=postgresql-tetrahedral-52858 sample#current_transaction=525146 sample#db_size=15819631bytes
sample#tables=1 sample#active-connections=21 sample#waiting-connections=0
sample#index-cache-hit-rate=0 sample#table-cache-hit-rate=0.99881 sample#load-avg-1m=1.19
sample#load-avg-5m=1.33 sample#load-avg-15m=1.12 sample#read-iops=0 sample#write-iops=0.61728
sample#tmp-disk-used=543600640 sample#tmp-disk-available=72435191808 sample#memory-total=8024092kB
sample#memory-free=1186388kB sample#memory-cached=2609660kB sample#memory-postgres=3841152kB
sample#wal-percentage-used=0.06824191790795707

2023-10-26T13:21:29.000000+00:00 app[heroku-postgres]: source=DATABASE
addon=postgresql-tetrahedral-52858 sample#current_transaction=528400 sample#db_size=15819631bytes
sample#tables=1 sample#active-connections=20 sample#waiting-connections=0
sample#index-cache-hit-rate=0 sample#table-cache-hit-rate=0.99882 sample#load-avg-1m=0.955
sample#load-avg-5m=1.245 sample#load-avg-15m=1.11 sample#read-iops=0 sample#write-iops=0.4375
sample#tmp-disk-used=543600640 sample#tmp-disk-available=72435191808 sample#memory-total=8024092kB
sample#memory-free=1139696kB sample#memory-cached=2613400kB sample#memory-postgres=3887232kB
sample#wal-percentage-used=0.06824191790795707

2023-10-26T13:22:53.000000+00:00 app[heroku-postgres]: source=DATABASE
addon=postgresql-tetrahedral-52858 sample#current_transaction=533032 sample#db_size=15967087bytes
sample#tables=1 sample#active-connections=20 sample#waiting-connections=0
sample#index-cache-hit-rate=0 sample#table-cache-hit-rate=0.99884 sample#load-avg-1m=0.85
sample#load-avg-5m=1.18 sample#load-avg-15m=1.095 sample#read-iops=0 sample#write-iops=0.97872
sample#tmp-disk-used=543600640 sample#tmp-disk-available=72435191808 sample#memory-total=8024092kB
sample#memory-free=1114152kB sample#memory-cached=2615652kB sample#memory-postgres=3910812kB
sample#wal-percentage-used=0.06824191790795707

DB log (from Heroku):

2023-10-27T05:26:43.000000+00:00 app[postgres.21575]: [DATABASE] [167715-1] sql_error_code = 53200
time_ms = "2023-10-27 05:26:43.884 UTC" pid="5798" proc_start_time="2023-10-26 12:22:52 UTC"
session_id="653a5a1c.16a6" vtid="14/128115" tid="1003779" log_line="167716" database="XXXXXXXXXXX"
connection_source="45.9.29.247(63540)" user="XXXXXXXXXXX" application_name="[unknown]" ERROR: out of
memory

2023-10-27T05:26:43.000000+00:00 app[postgres.21575]: [DATABASE] [167715-2] sql_error_code = 53200
time_ms = "2023-10-27 05:26:43.884 UTC" pid="5798" proc_start_time="2023-10-26 12:22:52 UTC"
session_id="653a5a1c.16a6" vtid="14/128115" tid="1003779" log_line="167717" database="XXXXXXXXXXX"
connection_source="45.9.29.247(63540)" user="XXXXXXXXXXX" application_name="[unknown]" DETAIL: Failed on
request of size 528 in memory context "TopTransactionContext".

2023-10-27T05:26:44.000000+00:00 app[postgres.21575]: [DATABASE] [167717-1] sql_error_code = 53200
time_ms = "2023-10-27 05:26:44.533 UTC" pid="5798" proc_start_time="2023-10-26 12:22:52 UTC"
session_id="653a5a1c.16a6" vtid="14/128116" tid="1003783" log_line="167720" database="XXXXXXXXXXX"
connection_source="45.9.29.247(63540)" user="XXXXXXXXXXX" application_name="[unknown]" ERROR: out of
memory

2023-10-27T05:26:44.000000+00:00 app[postgres.21575]: [DATABASE] [167717-2] sql_error_code = 53200
time_ms = "2023-10-27 05:26:44.533 UTC" pid="5798" proc_start_time="2023-10-26 12:22:52 UTC"
session_id="653a5a1c.16a6" vtid="14/128116" tid="1003783" log_line="167721" database="XXXXXXXXXXX"
connection_source="45.9.29.247(63540)" user="XXXXXXXXXXX" application_name="[unknown]" DETAIL: Failed
on request of size 16384 in memory context "ExecutorState".

2023-10-27T05:33:22.000000+00:00 app[postgres.2007]: [DATABASE] [248-1] sql_error_code = 00000
time_ms = "2023-10-27 05:33:22.466 UTC" pid="42" proc_start_time="2023-10-26 10:53:34 UTC"
session_id="653a452e.2a" vtid="" tid="0" log_line="249" LOG: checkpoint starting: time

2023-10-27T05:33:31.000000+00:00 app[postgres.2007]: [DATABASE] [249-1] sql_error_code = 00000
time_ms = "2023-10-27 05:33:31.912 UTC" pid="42" proc_start_time="2023-10-26 10:53:34 UTC"
session_id="653a452e.2a" vtid="" tid="0" log_line="250" LOG: checkpoint complete: wrote 94 buffers
(0.1%); 0 WAL file(s) added, 0 removed, 4 recycled; write=9.434 s, sync=0.001 s, total=9.447 s; sync
files=28, longest=0.001 s, average=0.001 s; distance=65443 kB, estimate=154005 kB

2023-10-27T05:59:37.000000+00:00 app[postgres.2007]: [DATABASE] [250-1] sql_error_code = 00000
time_ms = "2023-10-27 05:59:37.184 UTC" pid="42" proc_start_time="2023-10-26 10:53:34 UTC"
session_id="653a452e.2a" vtid="" tid="0" log_line="251" LOG: checkpoint starting: force wait

2023-10-27T05:59:37.000000+00:00 app[postgres.2007]: [DATABASE] [251-1] sql_error_code = 00000
time_ms = "2023-10-27 05:59:37.203 UTC" pid="42" proc_start_time="2023-10-26 10:53:34 UTC"
session_id="653a452e.2a" vtid="" tid="0" log_line="252" LOG: checkpoint complete: wrote 0 buffers
(0.0%); 0 WAL file(s) added, 1 removed, 0 recycled; write=0.003 s, sync=0.001 s, total=0.020 s; sync
files=0, longest=0.000 s, average=0.000 s; distance=16383 kB, estimate=140243 kB

2023-10-27T05:59:44.000000+00:00 app[postgres.145408]: [DATABASE] [11-1] sql_error_code = 00000
time_ms = "2023-10-27 05:59:44.266 UTC" pid="24304" proc_start_time="2023-10-27 05:59:43 UTC"
session_id="653b51cf.5ef0" vtid="4/74001" tid="0" log_line="3" database="postgres"
connection_source="[local]" user="postgres" application_name="psql" NOTICE: all required WAL segments
have been archived

2023-10-27T06:09:37.000000+00:00 app[postgres.2007]: [DATABASE] [252-1] sql_error_code = 00000
time_ms = "2023-10-27 06:09:37.739 UTC" pid="42" proc_start_time="2023-10-26 10:53:34 UTC"
session_id="653a452e.2a" vtid="" tid="0" log_line="253" LOG: checkpoint starting: time

Resolution

The following recommendations were provided to the application developers to fix the problem:

  1. Change the application logic to insert data into tables without the function wrapper – the best option (see the sketch after this list).
  2. Change long-running application processes to short-running ones. This is not a good option because there is no guarantee that the application will actually be terminated within a few minutes after execution.
  3. Force Sequelize (the ORM used by the applications) to terminate sessions every XX minutes. The example is here – it is not the best solution, but it is fast.
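
For the first recommendation, one possible shape of such an insert is a plain statement with ON CONFLICT instead of a temporary function catching unique_violation; a sketch using the same placeholder table and connection parameters as the test script above:

import psycopg2

conn = psycopg2.connect(host="XXXXXXX", dbname="XXXXXXX", user="XXXXXX", password="XXXXXXX")
with conn, conn.cursor() as cur:
    # A duplicate key no longer raises an exception and no temp function is created
    cur.execute(
        'INSERT INTO test ("id") VALUES (%s) ON CONFLICT ("id") DO NOTHING RETURNING *;',
        (1,),
    )
    row = cur.fetchone()   # None if the row already existed
conn.close()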

While the developers were working on a permanent fix on the application side, we created a small automation as a workaround: a Lambda function was executed every 30 minutes, checked the DB free memory, and killed abusive idle sessions:
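
A minimal sketch of such a Lambda handler, assuming the DB credentials and instance identifier come from environment variables; the 8 GiB threshold and 30-minute idle cutoff below are illustrative assumptions, not the production values:

import os
from datetime import datetime, timedelta

import boto3
import psycopg2

DB_INSTANCE_ID = os.environ["DB_INSTANCE_ID"]
FREE_MEMORY_THRESHOLD_BYTES = 8 * 1024**3  # assumed threshold, e.g. 8 GiB

cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    # Check the latest FreeableMemory datapoint for the DB instance
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="FreeableMemory",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE_ID}],
        StartTime=datetime.utcnow() - timedelta(minutes=10),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Average"],
    )
    datapoints = sorted(stats["Datapoints"], key=lambda d: d["Timestamp"])
    if not datapoints or datapoints[-1]["Average"] > FREE_MEMORY_THRESHOLD_BYTES:
        return "Memory OK, nothing to do"

    # Free memory is low -- terminate idle sessions older than 30 minutes
    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT pg_terminate_backend(pid)
            FROM pg_stat_activity
            WHERE state = 'idle'
              AND pid <> pg_backend_pid()
              AND state_change < now() - interval '30 minutes'
            """
        )
    conn.close()
    return "Idle sessions terminated"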

Once the developers changed the SQL queries and reduced the number of CREATE OR REPLACE FUNCTION calls, we got the following result: the DB memory stopped leaking and Aurora stopped restarting.

Our next step was to downsize the Aurora instance type by a factor of two, which made the database roughly twice cheaper than it had been on Heroku.

Conclusion

It was a pretty challenging and interesting migration from Heroku to AWS. We ran into limitations on both sides: a huge price for something as simple as VPC peering on the Heroku side, the impossibility of getting a superuser role in Heroku PostgreSQL and configuring replication, differences in OOM-killer behavior between Heroku PostgreSQL and Amazon Aurora, unexpected results when migrating via AWS Database Migration Service, and other small nuances. Automat-it, as an AWS Premier Partner, was in constant contact with the AWS RDS service team and shared all the evidence to help improve RDS automatic remediation such as the OOM killer, which worked in Heroku but led to major unexpected issues in AWS. As a result, we finished the migration and reduced costs significantly.