Comparing s3cp API vs FSx Lustre Performance

FSx for Lustre is based on the open-source Lustre file system, which is known for its scalability, high throughput, and low latency. FSx for Lustre provides two deployment options: scratch and persistent. Scratch file systems are designed for temporary storage and shorter-term processing of data. Persistent file systems are designed for longer-term storage and workloads.

Advantages of Amazon FSx for Lustre:

High Performance: FSx for Lustre is optimized for high-throughput, low-latency access to data, making it suitable for data-intensive applications and workloads that require fast I/O.

Scalability: It can scale storage capacity and throughput to support large datasets, allowing you to accommodate growing workloads seamlessly.

Fully Managed Service: AWS takes care of the underlying infrastructure, maintenance, and updates, reducing operational overhead for users.

Integration with AWS Ecosystem: FSx for Lustre integrates well with other AWS services, making it easier to build comprehensive solutions within the AWS cloud environment.

Data Security: Amazon FSx for Lustre supports data encryption at rest and in transit, helping you maintain data security and compliance.

Disadvantages of Amazon FSx for Lustre:

Cost: While FSx for Lustre offers high performance and scalability, it may also come with higher costs compared to other AWS storage options for less demanding workloads.

Single-AZ Deployment: An FSx for Lustre file system lives in a single Availability Zone, so you must specify one AZ per volume; clients in other AZs incur cross-AZ data transfer charges, and the file system is not resilient to an AZ outage.

Problem Statement

In most cases, when you need to synchronize data from S3 to local storage, you use the S3 API, which is suitable for relatively small files. However, when you need to handle quite large files, and do so frequently, I suggest using FSx for Lustre in combination with a Data Repository Association (DRA).
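As a concrete sketch of the S3 API path, the snippet below shows how a large object might be fetched with boto3's managed transfer. The bucket and key names are placeholders, and the 40/80 values mirror the chunk-size and threshold parameters used in the tests later, assumed here to be megabytes, as is typical for boto3's TransferConfig.

```python
# Sketch of the "s3cp" path: a multipart download through the S3 API.
# Bucket/key/destination names are placeholders, not from the article.

MB = 1024 * 1024

def transfer_settings(chunk_mb: int = 40, threshold_mb: int = 80) -> dict:
    """Translate chunk/threshold parameters (assumed MB) into bytes."""
    return {
        "multipart_chunksize": chunk_mb * MB,   # size of each ranged GET
        "multipart_threshold": threshold_mb * MB,  # switch to multipart above this
        "max_concurrency": 10,                  # parallel transfer threads
        "use_threads": True,
    }

def download(bucket: str, key: str, dest: str) -> None:
    """Download a large object via the S3 API with multipart transfers."""
    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client("s3")
    s3.download_file(bucket, key, dest, Config=TransferConfig(**transfer_settings()))
```

Even with high concurrency, each connection is still bounded by S3's per-connection throughput, which is the limitation the FSx Lustre path is meant to avoid.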

Solution design

To clearly understand the performance of s3cp versus FSx Lustre, I implemented a workflow based on Argo Workflows, which provides flexibility and integrates well with EKS. Here is how it works:

  1. Installed Argo Workflows (ArgoWF)
  2. Installed Karpenter for provisioning nodes
  3. Created the ArgoWF workflow templates
  4. Created a service account and k8s manifests
  5. Created two S3 buckets for input and output files
  6. Conducted tests on both single and multiple instances of different types
Stresstest workflow

More detailed information about the prerequisites for launching a workflow can be found in the descriptions of the two top-level folders, argowf_templates and k8s_manifests; each provides a more detailed explanation of what needs to be done to replicate a similar scenario. Now, let's proceed to the actual stress test simulation and the results.

Simulating the S3 CP vs FSx Lustre performance comparison

When you launch the main workflow, stresstest.yaml, you will be asked to fill in a set of parameters, namely:

  • Task_id: The task number (1, 2, 3, etc.).
  • Create_fsx_associations: Set to true if you are running the workflow for the first time.
  • pvc_size: Supported sizes are 1.2 TiB or increments of 2.4 TiB.
  • Namespace: The namespace where you are launching the workflow.
  • Instance_type: The instance type used for the comparison between s3cp and FSx, since different instance types have different network bandwidth.
  • Action_file_size: The size of the file you upload to S3.
  • Download_bucket_name: The bucket you use for downloading the file.
  • Download_file_name: The file name you use for downloading in s3cp and for MD5 validation.
  • Upload_bucket_name: The bucket you use for uploading the action file in DRA.
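The pvc_size constraint above can be captured in a small check. This is an illustrative helper, not part of the workflow, with sizes expressed in GiB (1.2 TiB = 1200 GiB, 2.4 TiB = 2400 GiB):

```python
def valid_pvc_size_gib(size_gib: int) -> bool:
    """FSx for Lustre accepts 1.2 TiB (1200 GiB) or multiples of 2.4 TiB (2400 GiB)."""
    return size_gib == 1200 or (size_gib > 0 and size_gib % 2400 == 0)
```

Validating the size up front avoids submitting a workflow whose PVC the CSI driver would reject.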

Upon initiating the workflow, it commences by establishing a Persistent Volume Claim through the aws-fsx-csi-driver on EKS, which links it to FSx for Lustre on Amazon.

FSx Lustre with DRA
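The article does not show the PVC manifest itself; a minimal dynamic-provisioning claim for the aws-fsx-csi-driver might look like the sketch below (the claim and StorageClass names are hypothetical):

```yaml
# Hypothetical example: a PVC dynamically provisioned by the aws-fsx-csi-driver.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc   # StorageClass backed by the fsx.csi.aws.com provisioner
  resources:
    requests:
      storage: 1200Gi        # 1.2 TiB, the smallest supported pvc_size
```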

Subsequently, file copying occurs in parallel: boto3 performs the s3cp transfer with multipart upload, while a DRA is created for FSx. For convenience, log generation was added to the s3cp-download-performance step to capture the file download status and speed on a specific instance type; the logs are stored in the S3 bucket for future viewing.

s3cp-download-performance logs
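The exact log format of the s3cp-download-performance step is not shown in the article; a hypothetical sketch of the status-and-speed line it might emit:

```python
def log_download_stats(size_bytes: int, seconds: float, instance_type: str) -> str:
    """Format a download-performance log line (illustrative format, not the
    workflow's actual one): instance type, file size, elapsed time, throughput."""
    mb = size_bytes / (1024 * 1024)
    return (
        f"instance={instance_type} size={mb:.0f}MB "
        f"time={seconds:.1f}s speed={mb / seconds:.1f}MB/s"
    )
```

Recording throughput per instance type is what makes the later c5.xlarge/c5.9xlarge/c5.24xlarge comparison possible, since network bandwidth differs across those instance sizes.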

The subsequent action involves simulating the creation of a file (the s3cp-action-result-simulation or fsx-action-result-simulation steps), which we use for uploading to S3, and the concluding step entails report generation and PVC deletion. Reports are stored in the S3 output bucket with the following structure:

  • fsx – contains the action file.
  • reports – contains the reports comparing s3cp and FSx Lustre performance.
  • s3cp – contains logs for the s3cp steps and also the action file.
Structure of reports
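The MD5 validation mentioned for Download_file_name can be sketched as a streaming checksum, so multi-gigabyte files never need to fit in memory. This is an illustrative helper, not the workflow's actual code:

```python
import hashlib

def md5_of_file(path: str, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Compute the MD5 of a file by streaming it in 8 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Note that for multipart uploads the S3 ETag is not the object's MD5, so validating against a locally computed digest like this is more reliable than comparing ETags.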

Results and conclusion

Launching a single workflow is not as informative as initiating several concurrent workflows, which compare the performance of s3cp and FSx more clearly. Therefore, multiple-stresstest.yaml was created for this purpose; it runs 8 parallel stress tests.


Launched 8 parallel stresstest

Input parameters:

  • chunk size: 40
  • threshold: 80
  • upload/download file size: 10 GB
  • instance types: c5.xlarge/c5.9xlarge/c5.24xlarge


So, after conducting all the tests, I can unequivocally say that if you are concerned about the speed and availability of large files, the clear choice is FSx: S3 limits upload and download throughput per connection, and you cannot raise those limits. If your product runs many batch workloads or Kubernetes jobs and must upload/download files from S3 every time a workload starts, FSx is the best way to speed that up. FSx for Lustre can also be scaled to accommodate changing storage and performance requirements: users can adjust storage capacity and throughput as needed.