Summary
Techbio startup Numenos — builder of the world’s first clinico-genomic foundation model for drug discovery — was held back by a fragmented hybrid setup and had looming SOC2 obligations for clinical trial data. Automat-it re-architected everything into a unified, fully Terraform-managed, multi-account AWS environment, freeing Numenos from third-party GPU dependency and standing up infrastructure that scales from 100 concurrent base-training jobs to 1,500 parallel fine-tuning jobs while migrating 10TB of sensitive clinical data with 100% checksum-verified accuracy.
About Numenos
Numenos AI is a techbio startup founded in Israel in 2023. It developed the world’s first clinico-genomic foundation model for drug discovery and development.
Its patient-centric biological world models learn from individual patient data across all diseases simultaneously. Instead of finding patterns specific to lung cancer or kidney disease, the platform discovers the invariant causal mechanisms that underlie all human biology.
This breakthrough enables pharma companies to dramatically improve the cost, speed, and success of their clinical trials and to identify novel targets and shelved drugs with the greatest potential for approval.
The Challenge: Overcoming Fragmented Infrastructure
Numenos faced a critical operational hurdle: its core infrastructure could not keep pace with its data science ambitions. The in-house team managed a deeply fragmented hybrid setup split across a third-party GPU provider for model training, AWS for data processing, and a cloud application platform for the client app.
Particular challenges included:
- Operational Bottlenecks: A complete lack of CI/CD and Infrastructure as Code (IaC) meant Docker builds and deployments were entirely manual.
- Third-Party Dependency: Numenos was completely reliant on a specific GPU provider for GPU provisioning and training orchestration, severely limiting their control over workload prioritization.
- Massive Scaling Demands: The startup needed to scale from 20 models to 100 or more. Each model required 15 fine-tuning experiments across 15 datasets, processing 300GB of genetic sequences per job. Meeting that demand would require parallel training.
- Compliance and Data Growth: The team anticipated a 10x increase in data within a single year, all of which needed to meet strict SOC2 compliance standards for clinical trial data.
- Time Delays & Version Control Issues: Delays and version control inaccuracies limited how closely training could keep pace with newly arriving data.
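The scaling demands listed above can be checked with a quick back-of-the-envelope calculation. The sketch below assumes each of the 15 experiments per model runs as one job and uses decimal units (1 TB = 1,000 GB); both are illustrative assumptions, not figures stated by Numenos.

```python
# Back-of-the-envelope sizing for the fine-tuning workload.
# Assumption: each experiment is one job, so jobs grow linearly with models.

EXPERIMENTS_PER_MODEL = 15  # from the case study
GB_PER_JOB = 300            # genetic sequence data processed per job


def fine_tuning_load(models: int) -> tuple[int, float]:
    """Return (total jobs, total data processed in TB) for a model count."""
    jobs = models * EXPERIMENTS_PER_MODEL
    terabytes = jobs * GB_PER_JOB / 1000  # decimal TB (assumption)
    return jobs, terabytes


for models in (20, 100):
    jobs, tb = fine_tuning_load(models)
    print(f"{models} models -> {jobs} jobs, about {tb:.0f} TB processed")
```

Under these assumptions, 100 models translate into 1,500 fine-tuning jobs, which is why running them one after another was not an option.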
The Solution: Unified AWS-Native Architecture
The engineering team at Automat-it stepped in to overhaul the fragmented setup, designing a fully Terraform-managed, multi-account AWS environment (production, development, and website) tailored for complex AI workloads.
By taking full ownership of the infrastructure architecture, Automat-it enabled Numenos to keep its full focus on core application logic, data science, and performance optimization.
Key architectural implementations included:
- Scalable Compute & Orchestration: Built multi-GPU Amazon EKS clusters, with Karpenter and KEDA automatically scaling bioinformatics, pre-training, and fine-tuning jobs across GPU and CPU nodes. Training and pipelines are orchestrated on the cluster via Dagster.
- Automated GitOps Pipelines: Deployed GitHub Actions and ArgoCD to establish reliable CI/CD workflows. This eliminated manual builds and deployments, freeing up 5 hours of engineering time per week.
- Advanced Experiment Tracking: Integrated MLflow and Optuna with an Aurora MySQL backend, alongside extensive Prometheus and Grafana monitoring on EKS. Job distribution was streamlined using an SQS-based multi-queue system.
- Strict Security & Compliance: Established a SOC2-compliant baseline using GuardDuty, Audit Manager, and EventBridge, paired with network isolation via VPC and VPN access.
- Data Migration & Storage: Executed a 10TB data migration (S3 copy and RDS backup/restore) using multi-tier S3 storage with Glacier Deep Archive lifecycle policies. The pre-existing GPU provider was maintained purely as a fallback during the gradual transition to ensure zero downtime.
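To make the queue-driven scaling described above more concrete, here is a minimal sketch of the general idea behind queue-depth autoscaling: one queue per job type, with the worker count derived from the backlog. It is pure Python with no AWS calls. The queue names, per-worker targets, and worker limits are hypothetical, and the actual KEDA and Karpenter configuration used for Numenos is not described in this case study.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuePolicy:
    """Scaling policy for one job queue (all values are illustrative)."""
    target_per_worker: int  # backlog each worker is expected to absorb
    min_workers: int
    max_workers: int


# Hypothetical queue names and limits. The upper bounds echo the case
# study's figures of 100 base-training and 1,500 fine-tuning jobs.
POLICIES = {
    "base-training": QueuePolicy(target_per_worker=1, min_workers=0, max_workers=100),
    "fine-tuning": QueuePolicy(target_per_worker=1, min_workers=0, max_workers=1500),
    "bioinformatics": QueuePolicy(target_per_worker=5, min_workers=1, max_workers=50),
}


def desired_workers(queue: str, backlog: int) -> int:
    """Workers needed for a backlog: ceil(backlog / target), clamped to limits."""
    policy = POLICIES[queue]
    wanted = math.ceil(backlog / policy.target_per_worker)
    return max(policy.min_workers, min(policy.max_workers, wanted))


if __name__ == "__main__":
    backlog = {"base-training": 40, "fine-tuning": 2000, "bioinformatics": 12}
    for queue, depth in backlog.items():
        print(f"{queue}: {depth} queued -> {desired_workers(queue, depth)} workers")
```

Keeping a separate queue per job type lets each workload scale on its own backlog, so a burst of fine-tuning jobs does not have to compete with base training for the same scaling signal.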
The Results: High-Volume Agentic Workloads
Numenos transitioned to an enterprise-grade AWS environment built for high-volume agentic AI workloads.
Key outcomes included:
- Infrastructure Autonomy: Successfully eliminated the dependency on SwarmOne, giving the Numenos team full control over GPU provisioning and training orchestration.
- Verified Data Migration: Achieved a 100% successful migration of sensitive clinical data, fully verified with checksum validation.
- Massive Workload Scalability: Established an infrastructure capable of supporting 100 concurrent base-training jobs and scaling to 1,500 parallel fine-tuning jobs without performance degradation.
- Enterprise-Grade Compliance: Achieved a secure, SOC2-compliant framework, completely provisioned via Terraform with automated Slack alerting via SNS and Lambda.
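Checksum validation of a migration, as mentioned in the outcomes above, generally means comparing a digest of every source object with its migrated copy. The sketch below illustrates the idea with SHA-256 over local files. The function names and the choice of SHA-256 are assumptions for illustration; the tooling actually used for the Numenos migration is not specified in this case study.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_migration(source_dir: Path, target_dir: Path) -> list[str]:
    """Return relative paths that are missing or differ in target_dir.

    An empty list means every source file has a byte-identical copy.
    """
    mismatches = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = src.relative_to(source_dir)
        dst = target_dir / relative
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            mismatches.append(relative.as_posix())
    return mismatches
```

A migration counts as fully verified only when `verify_migration` returns an empty list; any path it reports would need to be copied again and rechecked.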
Start Your Journey with Automat-it
Achieve an enterprise-grade environment that lets you scale your AI workloads as you grow and join the ranks of high-growth startups like Numenos.