SRE
We are looking for an Engineering Manager – SRE & Observability to lead technical teams and ensure system reliability, scalability, and automation. You will play a key role in developing strategies for transforming our observability stack, optimizing cloud infrastructure, and fostering a culture of continuous improvement. Daily communication with the team and stakeholders will be in English, and work can be done remotely.
Responsibilities:
- Lead SRE / Observability teams to ensure system reliability, resiliency, and automation.
- Develop and implement strategies to transform and optimize the observability stack based on SaaS solutions.
- Promote continuous improvement in system architecture, deployment, and operational processes through automation.
- Collaborate with stakeholders to define and prioritize requirements, and maintain high-performance systems aligned with business goals.
- Oversee the development and implementation of monitoring, alerting, metrics, logs, and traces to enable rapid issue detection and resolution.
- Foster a culture of continuous learning and knowledge sharing within the teams.
- Ensure compliance with security standards and best practices in all infrastructure and operations.
- Manage and optimize cloud infrastructure costs while maintaining high availability and performance.
- Provide technical leadership in adopting new technologies and methodologies to improve system reliability and efficiency.
Skills & Qualifications:
- Experience as an Engineering Manager or in a leadership role in SRE/DevOps.
- Strong analytical and problem-solving skills with a focus on innovation and continuous improvement.
- Proactive, results-driven, and committed to fostering excellence and continuous learning.
Technical Skills:
- Experience with cloud-native technologies and with AWS, Azure, or GCP.
- Experience with monitoring tools (Grafana, Prometheus, Dynatrace), logging (ELK, Loki), and tracing (Jaeger, OpenTelemetry).
- Expertise in containers and orchestration (Docker, Kubernetes), cloud infrastructure, and Infrastructure as Code (Terraform, Ansible).
- Strong programming skills in Python, Go, and SQL, along with knowledge of network protocols (HTTP, DNS, TCP/IP).
- Experience with CI/CD pipelines and process automation.
- Ability to troubleshoot distributed systems and identify performance bottlenecks.
If this sounds like you, send your CV to the e-mail address below, and let’s talk!
talent.europe@99x.io