SRE

We are looking for an Engineering Manager – SRE & Observability to lead technical teams and ensure system reliability, scalability, and automation. You will play a key role in developing strategies for transforming our observability stack, optimizing cloud infrastructure, and fostering a culture of continuous improvement. Daily communication with the team and stakeholders will be in English, and work can be done remotely.

Portugal

Responsibilities:

  • Lead SRE / Observability teams to ensure system reliability, resiliency, and automation.
  • Develop and implement strategies to transform and optimize the SaaS-based observability stack.
  • Promote continuous improvement in system architecture, deployment, and operational processes through automation.
  • Collaborate with stakeholders to define and prioritize requirements, and to maintain high-performance systems aligned with business goals.
  • Oversee the development and implementation of monitoring, alerting, metrics, logs, and traces to enable rapid issue detection and resolution.
  • Foster a culture of continuous learning and knowledge sharing within the teams.
  • Ensure compliance with security standards and best practices in all infrastructure and operations.
  • Manage and optimize cloud infrastructure costs while maintaining high availability and performance.
  • Provide technical leadership in adopting new technologies and methodologies to improve system reliability and efficiency.

Skills & Qualifications:

  • Experience as an Engineering Manager or in a leadership role in SRE/DevOps.
  • Strong analytical and problem-solving skills with a focus on innovation and continuous improvement.
  • Proactive, results-driven, and committed to fostering excellence and continuous learning.

Technical Skills:

  • Experience with cloud-native technologies on AWS, Azure, or GCP.
  • Experience with monitoring tools (Grafana, Prometheus, Dynatrace), logging (ELK, Loki), and tracing (Jaeger, OpenTelemetry).
  • Expertise in containers (Docker, Kubernetes), cloud infrastructure, and Infrastructure as Code (Terraform, Ansible).
  • Strong programming skills in Python, Go, and SQL, along with knowledge of network protocols (HTTP, DNS, TCP/IP).
  • Experience with CI/CD pipelines and process automation.
  • Ability to troubleshoot distributed systems and identify performance bottlenecks.

If this sounds like you, share your CV with us through the e-mail address below, and let’s talk!

talent.europe@99x.io