Observability DevOps
As a DevOps Engineer specializing in observability platforms, you will play a pivotal role in enhancing the visibility, reliability, and performance of our systems and applications. Leveraging your expertise in modern technologies and protocols, you will be responsible for implementing and maintaining robust observability solutions to monitor, analyse, and troubleshoot our distributed systems.
This position requires a passion for leveraging observability platforms to drive operational excellence and the ability to thrive in a dynamic, collaborative environment. It also requires strong technical skills in AWS, Terraform, and Git workflows, and a deep understanding of DevOps principles.
A taster of what you will be involved with:
- Design, deploy, and maintain observability platforms and tools such as Prometheus, Grafana, OpenSearch, Jaeger, and others to provide comprehensive insights into system behaviour and performance.
- Collaborate with development, operations, and other cross-functional teams to integrate observability solutions into the CI/CD pipeline and automate monitoring and alerting processes.
- Develop custom monitoring solutions and instrumentation libraries to capture relevant metrics, logs, traces, and events across microservices architectures.
- Configure and optimize telemetry collection, storage, and visualization components to ensure scalability, reliability, and cost-effectiveness.
- Implement anomaly detection algorithms and predictive analytics to proactively identify and mitigate potential issues before they impact users.
- Conduct thorough root cause analysis of incidents and performance bottlenecks, leveraging observability data to drive continuous improvement initiatives.
- Stay abreast of emerging trends, best practices, and industry standards in observability, and assess their applicability to our environment.
- Provide mentorship and technical guidance to junior team members, fostering a culture of knowledge sharing and collaboration.
- Form part of a 24/7 on-call roster to support your engineering team during out-of-hours incident calls.
- Drive automation initiatives using Terraform, Git workflows, and other DevOps tools to streamline deployment processes and improve operational efficiency.
You are good at:
- 3+ years' hands-on experience with most of the following technologies: AWS, Kubernetes, Helm, Ansible.
- Proficiency in scripting languages such as Python, Bash, or Go for automation and tooling development.
- Hands-on experience with containerization and orchestration technologies (e.g., Docker, Kubernetes) in cloud-native environments.
- Strong understanding of distributed systems architecture, networking concepts, and security principles.
- Familiarity with modern observability protocols and standards, including OpenTelemetry for unified observability.
- Expertise in configuring and managing monitoring solutions like Prometheus, Grafana, and APM tools.
- Experience with infrastructure as code (IaC) tools such as Terraform or Ansible for provisioning and configuration management.
- Knowledge of eBPF (extended Berkeley Packet Filter) for deep observability into the Linux kernel and user-space applications.
- Excellent problem-solving skills, with the ability to analyse complex technical issues and provide effective solutions in a timely manner.
- Excellent communication and interpersonal skills, with the ability to collaborate with cross-functional teams, stakeholders, vendors, and partners.
By submitting your application, you understand that your personal data will be processed as set out in our Privacy Policy.