Moving to AWS Graviton. Why and How?

AWS continuously improves its cloud services and introduces new hardware with more processing power, but customers usually do not rush to move to newer instance generations. AWS documentation states that newer generations are more powerful and cheaper, but what is the difference in numbers? In this post, I researched and compared four generations of the M (general purpose) instance type to show the difference in performance and price.

Comparing M4, M5, M6g and M7g instances

Four instance generations of the same instance type and size will be compared; all have 2 vCPUs and 8 GiB RAM.

I first checked the price (in the us-east-1 region) and measured network performance via Speedtest:

Test for m6g.large:

# curl -s https://raw.githubusercontent.com/sivel/speedtest-cli/master/speedtest.py | python -
Retrieving speedtest.net configuration...
Testing from Amazon.com (52.205.53.191)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by eero (Ashburn, VA) [0.81 km]: 1.447 ms
Testing download speed................................................................................
Download: 3633.89 Mbit/s
Testing upload speed......................................................................................................
Upload: 3298.03 Mbit/s

Here is a table combining AWS-provided data with my first findings:

| Instance Size / Gen | vCPU | Memory (GiB) | Instance Storage | Network Bandwidth | Speedtest (approx. Mbit/s) | EBS Bandwidth | Hourly Price, $ (us-east-1) | AWS declares |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| m4.large | 2 | 8 | EBS-only | Moderate | 500 | 450 Mbps | 0.10 | - |
| m5.large | 2 | 8 | EBS-only | Up to 10 Gbps | 3000 | Up to 4,750 Mbps | 0.096 | up to 20% improvement in price/performance compared to M4 instances |
| m6g.large | 2 | 8 | EBS-only | Up to 10 Gbps | 3500 | Up to 4,750 Mbps | 0.077 | up to 40% better price performance over M5 instances |
| m7g.large | 2 | 8 | EBS-only | Up to 12.5 Gbps | 5000 | Up to 10 Gbps | 0.0816 | up to 25% better performance over the Graviton2-based M6g instances; DDR5 memory, which provides 50% higher memory bandwidth compared to DDR4; 20% higher enhanced networking bandwidth compared to M6g instances |

Price difference

The price difference between M4 and M7g is about 20% ($0.10 vs. $0.0816 per hour).

M7g is a bit more expensive than M6g because M7g uses newer DDR5 memory instead of the DDR4 in M6g. DDR5 provides 50% higher memory bandwidth compared to DDR4, enabling high-speed access to data in memory.
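From the listed us-east-1 on-demand hourly prices, the relative differences can be computed directly (a quick sketch with awk):

```shell
# Percentage price differences, using the hourly prices from the table
awk 'BEGIN {
  m4 = 0.10; m5 = 0.096; m6g = 0.077; m7g = 0.0816
  printf "M4  -> M7g: %.1f%% cheaper\n", (m4  - m7g) / m4  * 100
  printf "M5  -> M6g: %.1f%% cheaper\n", (m5  - m6g) / m5  * 100
  printf "M6g -> M7g: %.1f%% more expensive\n", (m7g - m6g) / m6g * 100
}'
```

This puts M7g at roughly 18% cheaper than M4 and about 6% more expensive than M6g.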

Network performance

AWS categorizes network performance for some instances with qualitative descriptors like “Low,” “Moderate,” “High,” etc., rather than specifying exact numerical bandwidth values. For “Moderate” network performance, AWS does not publicly disclose precise bandwidth figures, as the actual throughput can vary based on multiple factors, including network congestion and the instance’s physical location.

The Speedtest utility was used to get concrete numbers. Network performance increased significantly from generation to generation:

CPU performance check

Sysbench was used to test the CPU and memory performance.

Sysbench is a scriptable multi-threaded benchmark tool based on LuaJIT. It is most frequently used for database benchmarks but can also create arbitrarily complex workloads that do not involve a database server.

Sysbench comes with the following bundled benchmarks:

  • oltp_*.lua: a collection of OLTP-like database benchmarks
  • fileio: a filesystem-level benchmark
  • cpu: a simple CPU benchmark
  • memory: a memory access benchmark
  • threads: a thread-based scheduler benchmark
  • mutex: a POSIX mutex benchmark

How to install the tool on Amazon Linux 2023:

# Build dependencies
sudo dnf -y install make automake libtool pkgconfig libaio-devel openssl-devel git

# MySQL client libraries (required to build sysbench's database benchmarks)
sudo rpm --import https://repo.mysql.com/RPM-GPG-KEY-mysql-2023
sudo dnf -y install https://dev.mysql.com/get/mysql80-community-release-el9-1.noarch.rpm
sudo dnf -y install mysql-community-client mysql-devel

# Build and install sysbench from source
git clone https://github.com/akopytov/sysbench.git
cd sysbench
./autogen.sh
./configure
make -j
sudo make install

M4 instance CPU / Memory test

Info about the CPU:

# lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  2
  On-line CPU(s) list:   0,1
Vendor ID:               GenuineIntel
  BIOS Vendor ID:        Intel
  Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    CPU family:          6
    Model:               79
    Thread(s) per core:  2
    Core(s) per socket:  1
    Socket(s):           1
    Stepping:            1
    BogoMIPS:            4599.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopolo
                         gy cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invp
                         cid_single pti fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
Virtualization features:
  Hypervisor vendor:     Xen
  Virtualization type:   full
Caches (sum of all):
  L1d:                   32 KiB (1 instance)
  L1i:                   32 KiB (1 instance)
  L2:                    256 KiB (1 instance)
  L3:                    45 MiB (1 instance)

This will run a single-threaded CPU benchmark.

$ sysbench cpu run

sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Prime numbers limit: 10000

Initializing worker threads...
Threads started!

CPU speed:
    events per second:   757.73

Throughput:
    events/s (eps):                      757.7278
    time elapsed:                        10.0010s
    total number of events:              7578

Latency (ms):
         min:                                    1.30
         avg:                                    1.32
         max:                                    1.68
         95th percentile:                        1.34
         sum:                                 9987.17

Threads fairness:
    events (avg/stddev):           7578.0000/0.00
    execution time (avg/stddev):   9.9872/0.00
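As a sanity check, the reported events/s figure is simply the total number of events divided by the elapsed time (values taken from the output above):

```shell
# events per second = total events / elapsed time
awk 'BEGIN { printf "%.1f events/s\n", 7578 / 10.0010 }'
```

which matches the reported 757.73 events/s up to rounding.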

One more test with 16 threads:

# sysbench --threads=16 cpu run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 16
Initializing random number generator from current time

Prime numbers limit: 10000

Initializing worker threads...
Threads started!

CPU speed:
    events per second:  1270.80

Throughput:
    events/s (eps):                      1270.7967
    time elapsed:                        10.0055s
    total number of events:              12715

Latency (ms):
         min:                                    1.55
         avg:                                   12.52
         max:                                  141.00
         95th percentile:                       71.83
         sum:                               159180.16

Threads fairness:
    events (avg/stddev):           794.6875/7.86
    execution time (avg/stddev):   9.9488/0.04

Test memory (single thread):

$ sysbench memory run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...
Threads started!

Total operations: 4285343 (428530.63 per second)
4184.91 MiB transferred (418.49 MiB/sec)

Throughput:
    events/s (eps):                      428530.6314
    time elapsed:                        10.0001s
    total number of events:              4285343

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.15
         95th percentile:                        0.00
         sum:                                 3419.16

Threads fairness:
    events (avg/stddev):           4285343.0000/0.00
    execution time (avg/stddev):   3.4192/0.00
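The MiB figure in the memory test follows directly from the operation count: each event writes one 1 KiB block, so total MiB transferred is operations divided by 1024 (values from the output above):

```shell
# 4285343 writes x 1 KiB each, converted to MiB (1 MiB = 1024 KiB)
awk 'BEGIN { printf "%.2f MiB\n", 4285343 / 1024 }'
```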

Test memory (16 threads):

$ sysbench --threads=16 memory run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 16
Initializing random number generator from current time

Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 5716923 (571674.03 per second)
5582.93 MiB transferred (558.28 MiB/sec)

Throughput:
    events/s (eps):                      571674.0298
    time elapsed:                        10.0003s
    total number of events:              5716923

Latency (ms):
         min:                                    0.00
         avg:                                    0.01
         max:                                  140.03
         95th percentile:                        0.00
         sum:                                54925.26

Threads fairness:
    events (avg/stddev):           357307.6875/2433.99
    execution time (avg/stddev):   3.4328/0.25

MUTEX benchmark

A mutex benchmark evaluates mutex implementations’ performance, scalability, and overhead in a multi-threaded environment. The primary goal is to measure how efficiently a mutex can manage access to shared resources by multiple threads, especially under heavy concurrency.

Throughput refers to the number of operations (or events) completed within a given time frame when the mutex synchronizes access to shared resources. Higher throughput indicates better performance under concurrent access.

$ sysbench mutex run
sysbench 1.1.0-2ca9e3f (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Initializing worker threads...

Threads started!

Throughput:
    events/s (eps):                      4.4504
    time elapsed:                        0.2247s
    total number of events:              1

Latency (ms):
         min:                                  224.58
         avg:                                  224.58
         max:                                  224.58
         95th percentile:                      223.34
         sum:                                  224.58

Threads fairness:
    events (avg/stddev):           1.0000/0.00
    execution time (avg/stddev):   0.2246/0.00
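The idea behind the mutex benchmark can be illustrated with a toy shell experiment (not sysbench itself, and Linux-only since it relies on util-linux `flock`): several background workers repeatedly take the same exclusive lock, queuing on the shared resource the way sysbench's threads queue on the mutex:

```shell
# Four background workers each acquire/release the same exclusive lock 100 times
lockfile=$(mktemp)
for w in 1 2 3 4; do
  (
    for i in $(seq 1 100); do
      flock "$lockfile" -c ':'   # take the lock, do nothing, release it
    done
  ) &
done
wait          # contention serialized the workers' critical sections
rm -f "$lockfile"
echo "all workers done"
```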

M5 instance CPU / Memory test

Info about the CPU:

# lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  2
  On-line CPU(s) list:   0,1
Vendor ID:               GenuineIntel
  BIOS Vendor ID:        Intel(R) Corporation
  Model name:            Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
    BIOS Model name:     Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  1
    Socket(s):           1
    Stepping:            4
    BogoMIPS:            4999.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtop
                         ology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand h
                         ypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx
                         512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   32 KiB (1 instance)
  L1i:                   32 KiB (1 instance)
  L2:                    1 MiB (1 instance)
  L3:                    33 MiB (1 instance)

The full sysbench output is omitted because all details will be provided in a table and graphs later:

$ sysbench cpu run
CPU speed:
    events per second:  1064.75

$ sysbench --threads=16 cpu run
CPU speed:
    events per second:  1671.36

M6g instance CPU / Memory test

Info about the CPU:

# lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 2
  On-line CPU(s) list:  0,1
Vendor ID:              ARM
  BIOS Vendor ID:       AWS
  Model name:           Neoverse-N1
    BIOS Model name:    AWS Graviton2
    Model:              1
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s):          1
    Stepping:           r3p1
    BogoMIPS:           243.75
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp ssbs
Caches (sum of all):
  L1d:                  128 KiB (2 instances)
  L1i:                  128 KiB (2 instances)
  L2:                   2 MiB (2 instances)
  L3:                   32 MiB (1 instance)

The full sysbench output is omitted because all details will be provided in a table and graphs later:

$ sysbench cpu run
CPU speed:
    events per second:  2853.55


$ sysbench --threads=16 cpu run
CPU speed:
    events per second:  5696.65

M7g instance CPU / Memory test

Info about the CPU:

# lscpu
Architecture:           aarch64
  CPU op-mode(s):       32-bit, 64-bit
  Byte Order:           Little Endian
CPU(s):                 2
  On-line CPU(s) list:  0,1
Vendor ID:              ARM
  BIOS Vendor ID:       AWS
  BIOS Model name:      AWS Graviton3
  Model:                1
  Thread(s) per core:   1
  Core(s) per socket:   2
  Socket(s):            1
  Stepping:             r1p1
  BogoMIPS:             2100.00
  Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm
                         ssbs paca pacg dcpodp svei8mm svebf16 i8mm bf16 dgh rng
Caches (sum of all):
  L1d:                  128 KiB (2 instances)
  L1i:                  128 KiB (2 instances)
  L2:                   2 MiB (2 instances)
  L3:                   32 MiB (1 instance)

The full sysbench output is omitted because all details will be provided in a table and graphs later:

$ sysbench cpu run
CPU speed:
    events per second:  3024.28

$ sysbench --threads=16 cpu run
CPU speed:
    events per second:  6044.47

Benchmark results

Here is a table where I collected all results from the two experiments (single thread and 16 threads) for the four instance generations (M4, M5, M6g, and M7g):

1 thread test

| Instance | CPU (events/s) | Memory (events/s) | Memory (MiB/s) | Mutex (events/s) |
| --- | --- | --- | --- | --- |
| m4.large | 757.73 | 428530.63 | 418.49 | 4.45 |
| m5.large | 1064.75 | 5774973.91 | 5639.62 | 6.07 |
| m6g.large | 2853.55 | 5020851.87 | 4903.18 | 4.28 |
| m7g.large | 3024.28 | 5570464.39 | 5439.91 | 5.13 |

16 threads test

| Instance | CPU (events/s) | Memory (events/s) | Memory (MiB/s) | Mutex (events/s) |
| --- | --- | --- | --- | --- |
| m4.large | 1270.80 | 571674.03 | 558.28 | 4.51 |
| m5.large | 1671.36 | 9205780.94 | 8990.02 | 6.12 |
| m6g.large | 5696.65 | 3973599.35 | 3880.47 | 8.34 |
| m7g.large | 6044.47 | 5794674.12 | 5658.86 | 9.88 |

CPU results show a significant performance increase from generation to generation, but the memory test shows a curious result: the Intel-based M5 delivers the best memory throughput.
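Combining the single-thread CPU numbers with the hourly prices gives a rough price/performance view (a sketch using the figures above; events per second per dollar-hour):

```shell
# Single-thread CPU events/s divided by hourly price (values from the tables above)
awk 'BEGIN {
  printf "m4.large:  %.0f\n", 757.73  / 0.10
  printf "m5.large:  %.0f\n", 1064.75 / 0.096
  printf "m6g.large: %.0f\n", 2853.55 / 0.077
  printf "m7g.large: %.0f\n", 3024.28 / 0.0816
}'
```

By this crude metric, the Graviton generations deliver roughly three times more CPU work per dollar than M4 and M5.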

Consideration for migration to Graviton

The tests showed a significant increase in CPU and network performance on Graviton instances, along with some cost savings.

AWS Graviton is a family of processors designed to deliver the best price performance for your cloud workloads running in Amazon Elastic Compute Cloud (Amazon EC2).

AWS Graviton-based instances cost up to 20% less than comparable x86-based Amazon EC2 instances.

AWS Graviton-based instances use up to 60% less energy than comparable EC2 instances.

Is your application ready to run on ARM?

A tool called "Porting Advisor for Graviton" analyzes source code for known code patterns and dependency libraries, then generates a report listing any incompatibilities with Graviton processors. It also suggests the minimum required and/or recommended versions of language runtimes and dependency libraries for running on Graviton instances.

Currently, the tool supports the following languages/dependencies:

  • Python 3+
  • Java 8+
  • Go 1.11+
  • C, C++, Fortran

You can run it as a Docker container. This option eliminates the need to worry about Python or Java versions or any other dependency that the tool needs, and it is the quickest way to get started:

# From a checkout of the Porting Advisor repository:
docker build -t porting-advisor .
docker run --rm -v my/repo/path:/repo -v my/output:/output porting-advisor /repo --output /output/report.html

Scan sample Python code:

Scan sample Java code:

Scan sample Go code:

PLEASE NOTE: Even though the tool does its best to find known incompatibilities, it’s still recommended that you perform the appropriate tests on your application on a Graviton instance before going to Production.

Conclusion

Graviton instances look great. They are much more powerful and a bit cheaper than previous generations. In this post, I tested CPU, Memory, and Network performance for M4, M5, M6g, and M7g instances, compared costs, built graphs for visibility, and demonstrated a tool that can help you with the preliminary assessment of how ready your applications are for running on ARM instances.