Lead DevOps Engineer (Bangkok-based, relocation provided)

Full Time
Bangkok, Thailand

About Agoda

At Agoda, we bridge the world through travel. Our story began in 2005, when two lifelong friends and entrepreneurs, driven by their passion for travel, launched Agoda to make it easier for everyone to explore the world.  

 

Today, we are part of Booking Holdings [NASDAQ: BKNG], with a diverse team of over 7,000 people from 90 countries, working together in offices around the globe. Every day, we connect people to destinations and experiences through great deals on millions of hotels and holiday properties, flights, and experiences worldwide.

 

No two days are the same at Agoda. Data and technology are at the heart of our culture, fueling our curiosity and innovation. If you’re ready to begin your best journey and help build travel for the world, join us.

 

In this role, you’ll get to:

· Lead the technical vision, architecture, and execution of new SRE platforms or reliability initiatives.

· Define and promote SRE best practices across Agoda’s services (e.g., SLI/SLO-driven engineering, error budgets, and other data-driven reliability factors).

· Design, build, and operate reliability platforms, including load shedding, business signals monitoring, and safe-deployment automation, to reduce blast radius while preserving developer velocity.

· Own safe deployment strategies such as canary releases, automated rollback, and business-impact protection integrated with deployment & monitoring.

· Proactively identify and mitigate reliability and scaling risks across Agoda’s services.

· Improve system resilience and multi-cluster readiness by partnering with the platform and operations teams.

· Lead major incident response and operational excellence, driving fast detection, mitigation, root cause analysis, postmortems, and learnings focused on business impact.

· Maintain and evolve incident, observability, alerting, and on-call tooling, improving signal quality, alert enrichment, grouping, and reducing time-to-clue and time-to-mitigation for NOC and on-call engineers.

· Advance platform observability and reliability signals using Prometheus and Grafana, balancing actionability, scale, and cost efficiency.

· Define reliability roadmaps and OKRs, translating ambiguous business reliability goals into clear technical requirements.

What You’ll Need to Succeed:

· Demonstrated ownership of architecting, building, and operating mission-critical production systems, making long-term technical and reliability trade-off decisions.

· Proven ability to lead and coordinate complex cross-team initiatives, setting technical direction and aligning stakeholders to deliver outcomes at organizational scale.

· Expertise in one or more programming languages (e.g., Go, Python, Rust, Java) with a solid understanding of distributed systems fundamentals (concurrency, backpressure, timeouts/retries, idempotency, circuit breaking).

· Deep hands-on experience with the Kubernetes ecosystem, service mesh technologies (e.g., Istio), and Kubernetes deployment workflows (e.g., Argo CD).

· Observability & monitoring expertise, using Prometheus, Grafana, and common logging/telemetry stacks (e.g., OpenTelemetry), with an understanding of signal quality, scalability, and cost trade-offs.

· Strong command of the incident management lifecycle, with a focus on improving alert quality, alert management, incident response, RCA, and postmortems.

· Experience with reliability engineering patterns such as canary deployments, automated rollback, capacity/right-sizing automation, and production operations.

· Solid data analysis skills, including SQL (e.g., PostgreSQL, MSSQL) and data pipelines.

· Data-driven mindset, able to perform deep research, analyze complex problems, and make informed technical decisions.

· Excellent communication and collaboration skills, able to explain complex technical concepts clearly to stakeholders at all levels, and to operate effectively both as a self-directed individual contributor and as part of a team.

· Curiosity and continuous learning, staying current with industry trends, open-source advancements, and emerging reliability practices.

Nice-to-Have:

· Experience operating large-scale, high-QPS systems serving millions of users in domains such as e-commerce, travel, or fintech.

· Hands-on experience with multi-region / multi-DC architectures and traffic isolation or failover strategies.

· Background in chaos engineering and resilience testing.

· Experience defining or scaling org-wide SLO/SRE frameworks.

· Experience building or operating Kubernetes controllers/operators.

· Exposure to ML-assisted detection or statistical methods for signal tuning (e.g., windowing strategies, precision/recall trade-offs).

 

Discover more about working at Agoda
  • Agoda Careers https://careersatagoda.com
  • Facebook https://www.facebook.com/agodacareers/
  • LinkedIn https://www.linkedin.com/company/agoda
  • YouTube https://www.youtube.com/agodalife

 

Equal Opportunity Employer 

At Agoda, we pride ourselves on being a company represented by people of all different backgrounds and orientations. We prioritize attracting diverse talent and cultivating an inclusive environment that encourages collaboration and innovation. Employment at Agoda is based solely on a person’s merit and qualifications. We are committed to providing equal employment opportunity regardless of sex, age, race, color, national origin, religion, marital status, pregnancy, sexual orientation, gender identity, disability, citizenship, veteran or military status, and other legally protected characteristics.

We will keep your application on file so that we can consider you for future vacancies, and you can always ask to have your details removed from the file. For more details, please read our privacy policy.

Disclaimer

We do not accept any terms or conditions, nor do we recognize any agency’s representation of a candidate, from unsolicited third-party or agency submissions. If we receive unsolicited or speculative CVs, we reserve the right to contact and hire the candidate directly without any obligation to pay a recruitment fee.