Updated: Sep 29
FSx for Lustre is based on the open-source Lustre file system, which is known for its scalability, high throughput, and low latency. FSx for Lustre provides two deployment options: scratch and persistent. Scratch file systems are designed for temporary storage and shorter-term processing of data. Persistent file systems are designed for longer-term storage and workloads.
Advantages of Amazon FSx for Lustre:
High Performance: FSx for Lustre is optimized for high-throughput, low-latency access to data, making it suitable for data-intensive applications and workloads that require fast I/O.
Scalability: It scales storage capacity and throughput to support large datasets, letting you accommodate growing workloads seamlessly.
Fully Managed Service: AWS takes care of the underlying infrastructure, maintenance, and updates, reducing operational overhead for users.
Integration with AWS Ecosystem: FSx for Lustre integrates well with other AWS services, making it easier to build comprehensive solutions within the AWS cloud environment.
Data Security: Amazon FSx for Lustre supports data encryption at rest and in transit, helping you maintain data security and compliance.
Disadvantages of Amazon FSx for Lustre:
Cost: While FSx for Lustre offers high performance and scalability, it may also come with higher costs compared to other AWS storage options for less demanding workloads.
Single-AZ Deployment: An FSx for Lustre file system resides in a single Availability Zone, so you specify one AZ (subnet) per file system, and the service provides no cross-AZ redundancy on its own.
In most cases, when you need to synchronize data from S3 to local storage, the S3 API is sufficient, and it works well for relatively small files. However, when you need to handle very large files, and do so frequently, I suggest using FSx for Lustre in combination with a Data Repository Association (DRA).
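To see why the plain S3 API path gets expensive for big objects, here is a minimal pure-Python sketch (the helper name is my own, not part of the workflow) that computes how many parts a multipart transfer splits a file into, assuming the chunk size is in MiB:

```python
import math

def multipart_part_count(file_size_bytes: int, chunk_size_bytes: int) -> int:
    """Number of parts a multipart S3 transfer splits a file into."""
    if chunk_size_bytes <= 0:
        raise ValueError("chunk size must be positive")
    return math.ceil(file_size_bytes / chunk_size_bytes)

# A 10 GiB object with a 40 MiB multipart chunk size becomes 256 parts,
# i.e. 256 PUT/GET round trips the client has to schedule itself.
parts = multipart_part_count(10 * 1024**3, 40 * 1024**2)
print(parts)  # 256
```

With a DRA, by contrast, FSx for Lustre lazy-loads the same object on the server side, and the client simply reads a POSIX file.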
To clearly understand the relative performance of s3cp and FSx for Lustre, I implemented a workflow based on Argo Workflows, which provides flexibility and integrates well with EKS. Here is how it works:
Installed Karpenter for provisioning nodes
Created ArgoWF workflow templates
Created a service account and k8s manifests
Created two S3 buckets for input and output files
Conducted tests on both single instances and multiple instances of different types
More detailed information about the prerequisites for launching a workflow can be found in the descriptions of the two top-level folders, argowf_templates and k8s_manifests; each explains in more detail what needs to be done to replicate a similar scenario. Now, let's proceed to the actual stress-test simulation and the results.
Comparing s3cp vs. FSx for Lustre performance: the simulation
When you launch the main workflow, stresstest.yaml, you will be prompted for a set of parameters, namely:
task_id: The task number (1, 2, 3, etc.).
create_fsx_associations: Set to true if you are running the workflow for the first time.
pvc_size: Supported sizes are 1.2 TiB or increments of 2.4 TiB.
namespace: The namespace where you are launching the workflow.
instance_type: The instance type used for the s3cp vs. FSx comparison, since different instance types have different network bandwidth.
action_file_size: The size of the file uploaded to S3.
download_bucket_name: The bucket used for downloading the file.
download_file_name: The file name used for the s3cp download and for MD5 validation.
upload_bucket_name: The bucket used for uploading the action file via the DRA.
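Since download_file_name is also used for MD5 validation, a chunked checksum helper along these lines (a sketch; the function name is my own) keeps memory use flat even for multi-gigabyte files:

```python
import hashlib

def md5_of_file(path: str, chunk_bytes: int = 8 * 1024 * 1024) -> str:
    """Stream the file in fixed-size chunks so memory stays flat for large files."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Note that for multipart uploads the S3 ETag is not a plain MD5 of the object, so validating a download means computing the hash on both sides rather than comparing against the ETag.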
Upon initiating the workflow, it begins by creating a Persistent Volume Claim through the aws-fsx-csi-driver on EKS, which links the claim to an Amazon FSx for Lustre file system.
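For reference, dynamic provisioning with the aws-fsx-csi-driver is typically wired up with a StorageClass and a PVC along the following lines. This is a sketch based on the driver's documented parameters, not the actual manifests from k8s_manifests; the names, subnet ID, and security-group ID are placeholders:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: subnet-0123456789abcdef0        # placeholder
  securityGroupIds: sg-0123456789abcdef0    # placeholder
  deploymentType: SCRATCH_2
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 1200Gi   # matches the minimum pvc_size of 1.2 TiB
```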
Subsequently, file copying occurs in parallel: the s3cp path transfers the file via boto3 using multipart upload, while a DRA is created for the FSx path. For convenience, log generation was added to the s3cp-download-performance step to capture complete information about the file's download status and speed on a specific instance type; the logs are stored in the S3 bucket for later review.
The next step simulates the creation of a file (the s3cp-action-result-simulation or fsx-action-result-simulation step), which is then uploaded to S3; the final step generates the report and deletes the PVC. Reports are stored in the S3 output bucket with this structure:
fsx - contains the action file.
reports - contains the performance-comparison reports for s3cp and FSx Lustre.
s3cp - contains logs from the s3cp steps as well as the action file.
Results and conclusion
Launching a single workflow is not as informative as initiating several concurrent workflows, which makes the performance difference between s3cp and FSx much clearer. Therefore, multiple-stresstest.yaml was created for this purpose; it runs 8 parallel stress tests with the following settings:
chunk size: 40
upload/download file size: 10 GB
instance types: c5.xlarge/c5.9xlarge/c5.24xlarge
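As a sanity check on any measured numbers, the network-bound lower limit on transfer time is simply file size over link bandwidth. The helper below is my own illustration; the 10 Gbps figure is a nominal assumption, not a measured value for these instance types:

```python
def min_transfer_seconds(size_gigabytes: float, bandwidth_gbps: float) -> float:
    """Lower bound on transfer time: file size in bits over link speed in bits/s."""
    if bandwidth_gbps <= 0:
        raise ValueError("bandwidth must be positive")
    return (size_gigabytes * 8) / bandwidth_gbps

# Assuming a nominal 10 Gbps link, a 10 GB file cannot move in under:
print(min_transfer_seconds(10, 10))  # 8.0 seconds
```

Any measured time above this floor is overhead from request limits, client-side part scheduling, or instance CPU, which is exactly the gap this comparison probes.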
So, after conducting all the tests, I can say with confidence: if the speed and availability of large files matter to you, the clear choice is FSx, because S3 imposes upload and download limits that you cannot raise. If your product runs many batch or Kubernetes-job workloads that must upload and download files from S3 every time a workload starts, FSx is the best way to speed that up. FSx for Lustre can also be scaled to accommodate changing storage and performance requirements: you can adjust storage capacity and throughput as needed.