Principal DevOps / SRE Engineer
You won't be starting from scratch — you'll be building on a capable team that already responds to incidents in under 10 minutes at 10 PM without a formal on-call. Your job is to turn that into a scal
Location: Remote (India preferred)
Department: Product, Engineering & Data Science
Reports to: Senior Director of Engineering
About Us
ELSA is a global leader in AI-powered English communication training, dedicated to transforming how people learn and speak English with confidence. Founded in 2016 and headquartered in San Francisco, we operate across the U.S., Vietnam, Portugal, Indonesia, Brazil, and Japan.
Powered by proprietary speech-recognition technology and generative AI, ELSA delivers real-time, hyper-personalized feedback to help learners improve pronunciation, fluency, and overall communication effectiveness. With over 50 million learners and 1 billion hours of anonymized speech data, ELSA's depth of language training intelligence is unmatched in the industry.
Our flagship B2B platforms, ELSA Enterprise and ELSA Schools, empower organizations and educational institutions to elevate communication capabilities and unlock personal and professional opportunities for their people. We design engaging, bite-sized learning experiences that adapt to each learner's goals and context, ensuring measurable improvement and lasting confidence.
Our vision is to become the global standard for real-time English communication training, enabling 1.5 billion language learners worldwide to speak clearly, be understood, and share their stories with the world.
Backed by world-class investors including Google's Gradient Ventures, Monk's Hill Ventures, and SOSV, ELSA has been recognized among the top global AI innovators:
Forbes Top 4 Companies Using AI to Transform the World
Research Sniper Top 5 Best AI Apps
ASU+GSV EdTech 150
CB Insights Top 100 AI Companies
Join us in shaping the future of language learning and empowering millions to unlock opportunity through confident communication.
Role Summary
We are looking for a Principal DevOps / SRE engineer to build and own our reliability practice end-to-end. This is not a firefighting role — our team already responds well to incidents. This person will formalize what works, automate what repeats, and build the foundation for enterprise-grade SRE as ELSA scales its B2B footprint.
Key Responsibilities
Own the SRE practice: define severity tiers (P1–P4), formalize on-call rotation, build SLA tracking dashboards, and establish incident management workflows across a team of 4 DevOps engineers.
Build runbooks for the top recurring operational issues — pod scaling, deploy rollbacks, access management, EKS upgrades, CI/CD pipeline failures — and automate L1/L2 responses using tools like Shoreline.io, Rundeck, or PagerDuty automation.
Introduce and operationalize AI-assisted DevOps tooling: AIOps for alert correlation, CastAI/Kubecost for cost optimization, GitHub Copilot for IaC acceleration. Train the existing team on these tools.
Drive infrastructure modernization: EKS upgrades, Karpenter migration, observability (SigNoz/Prometheus), secrets management (ArgoCD/SOPS), and Terraform-based IaC maturity.
Collaborate with AI Engineering, Mobile, and B2B teams to ensure infrastructure supports real-time speech processing, GPU workloads, and multi-region enterprise deployments.
Design and plan a round-the-clock SRE coverage model as B2B enterprise SLA commitments grow — evaluate vendor partnerships or strategic hires for Americas timezone coverage.
What You Will Have
2+ years in DevOps/SRE, with at least 2 years in a principal or staff-level role owning reliability practices for a production SaaS product.
Deep hands-on experience with AWS (EKS, EC2, DynamoDB, S3, IAM, Secrets Manager), Kubernetes (HPA, KEDA, Karpenter, pod scheduling, GPU workloads), and IaC (Terraform, Helm, ArgoCD).
Track record of building runbooks, on-call rotations, and incident management frameworks — not just participating in them.
Experience with observability stacks (Prometheus, Grafana, SigNoz or Datadog), CI/CD (GitLab CI, GitHub Actions), and alerting (PagerDuty, Opsgenie).
Comfort working across timezones with distributed teams (India, Vietnam, Portugal). Strong written communication — you'll be writing runbooks, RCAs, and proposals as much as Terraform.
Nice to Have
Experience with AI/ML infrastructure (GPU scheduling, model serving, real-time audio/speech workloads).
Familiarity with compliance frameworks (ISO 27001, SOC 2, Vanta) in a DevOps context.
Hands-on experience with AIOps tooling, automated remediation platforms (Shoreline, Rundeck), or FinOps tools (CastAI, Kubecost).
What We Offer
Flexible work setup: Remote-first for Singapore, India, Indonesia, Malaysia; hybrid model for Vietnam.
Comprehensive employee well-being benefits.
Free ELSA Premium courses to polish your language skills.
Collaborative, international team culture.
Opportunity to contribute to a fast-growing, well-funded Silicon Valley startup with global impact.
- Locations
- India, Remote
- Remote status
- Fully Remote