Site Reliability Engineer - Cloud Infrastructure (all genders)

Remote
How you can contribute to gridX

Please note: This position requires on-site work or remote work from within Germany.

Do stuff that matters - become part of gridX and help us digitalise the energy industry, making renewable energies accessible and affordable everywhere. #getshitdone

At gridX, we are building the digital brain for the energy transition. We are looking for an engineer who wants their code to have a tangible impact on a sustainable future.

As part of the SRE Cloud Infrastructure team, you will join a culture defined by a single principle: "Reliability First". For us, however, reliability isn't about fixing broken things or keeping the lights on manually. It’s about enablement. We engineer the automated, self-service solutions that empower our engineering teams to own their services from development to production.

You are a builder: an experienced, autonomous engineer who is ready to evolve our systems, engineer away complexity, and champion a culture where reliability is built in by design.

  • Take Ownership: You actively evolve our multi-tenant cloud and container infrastructure. You take end-to-end ownership of various components, ensuring they are secure, scalable, observable, and cost-efficient.

  • Engineer Infrastructure as Software: You bring a developer's mindset to operations. You solve complexity by writing high-quality code and automation, ensuring our platform is managed strictly via declarative code.

  • Drive Observability: You mature our observability platform, ensuring we aren't just collecting data but providing the insights teams need to drive architectural decisions, improve performance, and establish meaningful SLOs.

  • Architect for Resilience: You proactively identify bottlenecks before they become incidents and, when things do break, you lead the resolution and drive post-mortems to ensure we learn.

  • Empower Others: You build self-service capabilities that allow engineering teams to own their full lifecycle. You also drive the adoption of best practices through code and architecture reviews and technical deep-dives, and you share your expertise through high-quality documentation and operational runbooks.

This is how you and your application stand out

  • You have solid experience in an SRE or Platform role, building and managing distributed systems in production environments. You are comfortable working with a high degree of autonomy, navigating ambiguity and driving technical initiatives end-to-end.

  • You have strong hands-on experience with a major public cloud provider. You understand the architectural foundations of cloud infrastructure (Compute, Storage, Networking, and IAM) and are fluent in managing them as code.

  • You apply a pragmatic software engineering mindset to operations. You write clean, maintainable code and scripts, prioritizing long-term stability and quality.

  • You have operational experience with Kubernetes at scale, understanding how to manage upgrades, security and resource allocation in a production cluster.

  • You embody a "Reliability First" mindset, understanding incident lifecycle management and the importance of psychological safety in engineering.

What sets you apart

  • You have hands-on expertise in the AWS services we use heavily, such as EKS, EC2, VPC, RDS, Lambda, S3, Kinesis, DynamoDB, SNS and SQS.

  • You go beyond usage and understand the internal components of Kubernetes (scheduling, API server, controllers, RBAC). Experience writing custom Controllers or Operators is a significant plus.

  • You have strong skills in at least one modern programming language (e.g., Go, TypeScript, Java, Python, Rust), are willing to work with Go, which is our core language for tooling and automation, and embrace AI-assisted workflows to accelerate development.

  • You have expertise in modern observability stacks (e.g., Grafana LGTM, Thanos, VictoriaMetrics). You can operate and tune the platform at scale, while guiding teams on effective instrumentation and alerting strategies.

  • You have deep technical expertise in Release Engineering and GitOps, as well as in maintaining infrastructure that enables developers to release their software securely and reliably.

  • You have deep knowledge of TCP/IP, DNS, and HTTP protocols, and you understand the intricacies of container networking.

Why gridX

  • Flexible & mobile working: Work remotely for up to 70 days from anywhere in the EU and other selected countries such as Indonesia, Canada, Brazil and many more
  • Vacation: 30 days for your relaxation
  • Sports: 30 Euro allowance for Urban Sports Club or E-Gym Wellpass
  • Health: Make use of our (mental) health management offers such as Nilo.health (e.g. 1:1 coaching sessions, daily meditation offers, self-reflection options) for your mental well-being
  • Personal development: Annual development budget of 1,500 euros per employee
  • Employee discounts: Access to gridX Corporate Benefits
  • Stay fit and save the planet with our JobRad offer
  • Set up a pension plan and receive a fair monthly contribution
  • City travel subsidy: 30 Euros monthly allowance for your monthly/annual ticket
  • Modern workplaces in the heart of Aachen and Munich with IT equipment of your choice (Apple or Lenovo)
  • Annual Teamweek: Enjoy an unforgettable off-site, face extraordinary challenges together with all gridX teams and create lasting memories!
  • Experience the gridX culture at regular team events and receive 100 Euros on top per employee for your department event
  • We will donate 20 Euros to a charity of your choice on your birthday
  • Sabbatical option: Take a break from the daily work routine to pursue personal projects, travel or further education (depends on length of employment)