Team: AI Infrastructure & Platform

Location preferred: Remote

About ELSA

ELSA is a global leader in AI-powered English communication training, dedicated to transforming how people learn and speak English with confidence. Founded in 2016 and headquartered in San Francisco, we operate across the U.S., Vietnam, Portugal, Indonesia, Brazil and Japan.

Powered by proprietary speech-recognition technology and generative AI, ELSA delivers real-time, hyper-personalized feedback to help learners improve pronunciation, fluency, and overall communication effectiveness. With over 50 million learners and 1 billion hours of anonymized speech data, ELSA's depth of language training intelligence is unmatched in the industry.

Our B2B flagship platforms, ELSA Enterprise and ELSA Schools, empower organizations and educational institutions to elevate communication capabilities and unlock personal and professional opportunities for their people. We design engaging, bite-sized learning experiences that adapt to each learner's goals and context, ensuring measurable improvement and lasting confidence.

Our vision is to become the global standard for real-time English communication training, enabling 1.5 billion language learners worldwide to speak clearly, be understood, and share their stories with the world.

Backed by world-class investors including Google's Gradient Ventures, Monk's Hill Ventures, and SOSV, ELSA has been recognized among the top global AI innovators:

Forbes Top 4 Companies Using AI to Transform the World
Research Sniper Top 5 Best AI Apps
ASU+GSV EdTech 150
CB Insights Top 100 AI Companies

Join us in shaping the future of language learning and empowering millions to unlock opportunity through confident communication.

Job overview:

Join the AI Infrastructure & Platform team to build, operate, and scale the production systems that power ELSA’s APIs, platform services, and AI-enabled applications. This Senior Site Reliability Engineer / API Platform Engineer role bridges software engineering, cloud infrastructure, and operational excellence, requiring a pragmatic, highly productive individual who can use modern AI tools and automation to accelerate delivery and improve reliability.

You will collaborate closely with engineering, AI, and product teams to ensure our services are secure, scalable, observable, and resilient in real-world production environments. This is not an AI Engineer role; rather, it is an infrastructure and reliability role for someone who works in an AI-first way and uses AI as a force multiplier in execution, automation, and systems operations.

Key Responsibilities

Design, build, and operate reliable, scalable infrastructure for APIs, platform services, and AI-enabled applications on AWS and Kubernetes.
Own and enhance CI/CD pipelines, deployment workflows, and operational tooling to enable safe and fast software delivery.
Build and maintain robust observability systems across metrics, logging, tracing, alerting, and service health.
Lead incident response, root cause analysis, postmortems, and remediation efforts to continuously improve production reliability.
Automate repetitive operational work through software, infrastructure-as-code, and AI-assisted workflows.
Use AI-native engineering tools including copilots, intelligent automation, and agentic operational tooling to improve debugging, response time, analysis, and team productivity.
Partner with backend, platform, and AI engineering teams to productionize new services and ensure they meet reliability, security, and scalability standards.
Optimize infrastructure and runtime performance across latency, throughput, availability, and cost.
Define and enforce engineering standards for reliability, security, observability, and operational excellence across services.
Contribute production-grade software and internal tools that reduce toil and improve platform leverage across the organization.

What You Will Have

Strong experience in Site Reliability Engineering, DevOps, Platform Engineering, or Infrastructure Software Engineering, with a track record of operating production systems at scale.
Solid experience writing and maintaining production-grade software for live systems and internal platform tooling.
Deep expertise in cloud infrastructure and distributed systems, particularly on AWS, including EKS, EC2, IAM, VPC, CloudWatch, and related services.
Hands-on experience running Kubernetes-based services in production environments.
Strong experience operating APIs and microservices in production, including release workflows, failure recovery, and service hardening.
Hands-on experience with observability and monitoring tools such as Prometheus, Grafana, SigNoz, Sentry, OpenTelemetry, or similar systems.
Strong understanding of CI/CD practices, incident management, production monitoring, and service reliability engineering.
Experience with infrastructure-as-code and automation tooling.
Experience using AI tools and automation as a core part of your engineering workflow to increase productivity, reduce toil, and improve execution quality.
Strong judgment, ownership, and follow-through. You take on hard operational problems and drive them through resolution.

Nice-to-Haves:

Experience supporting AI-powered products, inference services, or ML-adjacent systems in production.
Familiarity with GPU-based workloads and performance optimization for compute-intensive services.
Experience with performance tuning, benchmarking, capacity planning, and load testing.
Experience building internal developer platforms, self-service infrastructure, or reliability tooling.
Familiarity with AI-assisted incident response, automated remediation, or intelligent operational runbooks.
Experience working cross-functionally with AI, product, and engineering teams in fast-moving environments.
Good software engineering fundamentals, including distributed systems, APIs, containerization, and cloud-native deployment.

What We Offer

Flexible work setup: Remote-first for Indonesia, Malaysia, Thailand, and Taiwan; hybrid model for Vietnam.
Comprehensive employee well-being benefits.
Free ELSA Premium courses to polish your language skills.
Collaborative, international team culture.
Opportunity to contribute to a fast-growing, well-funded Silicon Valley startup with global impact.

Senior Site Reliability Engineer / API Platform Engineer (AI-First)
