DevOps/Observability Engineer
TLDR
Design and scale a next-generation observability platform for complex, distributed systems, unifying metrics, logs, and traces with OpenTelemetry, Prometheus, Grafana, and Splunk.
Responsibilities:
- Design and implement end-to-end observability architectures using OpenTelemetry, Prometheus, Grafana, and related tools across cloud environments.
- Build and maintain centralized observability pipelines across multi-account AWS environments, including CloudWatch, CloudTrail, and VPC Flow Logs.
- Develop scalable log aggregation and routing strategies, including filtering, noise reduction, and integration with systems such as Splunk HEC.
- Create advanced alerting frameworks and high-quality dashboards using Alertmanager, CloudWatch Alarms, and Grafana with PromQL.
- Deploy and manage observability infrastructure using Infrastructure as Code tools such as Terraform.
- Support Kubernetes and container-based observability across EKS and ECS environments.
- Optimize observability systems for performance, cost efficiency, and scalability in large-scale production environments.
- Collaborate with engineering teams to improve system reliability, monitoring standards, and incident response capabilities.
Requirements:
- 8+ years of experience in DevOps, Site Reliability Engineering, or Observability Engineering roles.
- Strong hands-on experience designing unified observability pipelines using OpenTelemetry, Prometheus, and Grafana.
- Deep expertise in AWS observability services including CloudWatch, CloudTrail, and cross-account telemetry strategies.
- Proven ability to build and manage large-scale log aggregation systems and optimize high-volume data pipelines.
- Strong experience with Kubernetes (EKS) or containerized environments (ECS) in production settings.
- Advanced proficiency with Terraform or other Infrastructure as Code tools for infrastructure and observability deployments.
- Experience building alerting systems, dashboards, and monitoring frameworks for distributed systems.
- Strong understanding of cost optimization strategies for observability platforms (log filtering, metric reduction, storage tiering).
- Excellent problem-solving, debugging, and collaboration skills in complex cloud-native environments.
Benefits:
- Competitive compensation aligned with experience and market benchmarks.
- Remote work flexibility within Canada.
- Opportunity to work on large-scale, AI-driven, cloud-native infrastructure systems.
- Exposure to enterprise clients and high-impact digital transformation projects.
- Hands-on experience with leading observability and cloud technologies in production environments.
- Strong learning and upskilling culture in AI, cloud, and platform engineering.
- Collaborative, high-performance engineering environment focused on innovation and reliability.
- Opportunity to shape next-generation observability practices at scale.
Jobgether runs the largest remote job platform, effectively linking job seekers with over 200,000 flexible and remote opportunities that match their unique skills and preferences. Our focus is on enhancing the hiring process, ensuring efficiency while prioritizing the candidate experience, particularly in the growing health and wellness sector.
- Founded: 2020
- Employees: 11-50
- Industry: Professional Services