Lead Observability Engineer
TLDR
Drive the strategy, adoption, and evolution of observability across all production environments, leveraging AIOps, automated remediation, and cross-team collaboration to elevate reliability.
· Lead the observability strategy and execution, ensuring comprehensive visibility across all production and delivery environments.
· Own and govern the enterprise observability platform (New Relic or equivalent tools such as Datadog or Dynatrace) and ensure consistent monitoring standards across systems.
· Explore and adopt AI-driven monitoring capabilities (AIOps) to automate anomaly detection, reduce alert fatigue, and enable predictive problem management.
· Collaborate closely with Production Support (L1/L2), DevOps, CloudOps, Software Engineering, and Database teams to triage complex production issues and accelerate incident resolution.
· Act as the operational coordinator during service-impacting events, organizing workflows, managing cross-team dependencies, and providing structured updates to leadership.
· Design and implement automated remediation workflows and self-healing mechanisms for recurring incidents.
· Analyze telemetry data (logs, metrics, traces) to identify incident patterns and systemic anomalies, and continuously refine alert thresholds and routing logic.
· Develop and maintain dynamic dashboards that reflect real-time system health, application performance, and infrastructure behavior.
· Define and track reliability metrics such as SLOs, SLIs, MTTD, and MTTR to improve service reliability.
· Ensure clear, timely communication with stakeholders during incidents and operational events.
· Drive organization-wide adoption of observability best practices through documentation, training, and knowledge sharing.
· 8–10+ years of experience in observability, site reliability engineering (SRE), DevOps, or advanced production operations in large-scale enterprise environments.
· Expert-level hands-on experience implementing and optimizing observability platforms such as New Relic, Datadog, Dynatrace, or Splunk.
· Strong understanding of monitoring fundamentals including logs, metrics, traces, and alerting strategies.
· Experience working with cloud-native architectures (AWS preferred).
· Familiarity with containerized environments and orchestration platforms such as Kubernetes.
· Experience integrating observability practices into CI/CD pipelines to ensure applications are observable by design.
· Strong understanding of incident management, problem management, and change management practices (ITIL concepts).
· Demonstrated ability to analyze telemetry data to identify patterns, detect anomalies, and improve operational reliability.
· Strong leadership and collaboration skills with the ability to coordinate across engineering, DevOps, and operations teams.
· Excellent communication skills and a strong focus on operational excellence and continuous improvement.
Nice to Have
· Experience implementing AI/ML capabilities within observability tools for anomaly detection and predictive monitoring.
· Familiarity with AIOps platforms and automated remediation workflows.
· Experience with event streaming platforms such as Kafka for telemetry ingestion or real-time data processing.
· Basic understanding of application architecture and troubleshooting distributed systems.
· Experience with automation frameworks or serverless workflows (e.g., AWS Lambda, scripting, or infrastructure automation).
Benefits
· Health Insurance: comprehensive health coverage
· Paid Time Off: recognizing public holidays
· Wellness Stipend: well-being perks
Kobie Marketing is a loyalty technology provider that partners with global brands to create personalized, data-driven loyalty experiences. By combining strategy-led technology with deep consumer insights, Kobie helps brands forge lasting emotional connections with their customers. With a commitment to innovation and an expanding presence, including a new tech hub in India, Kobie is shaping the future of loyalty solutions.
- Founded: 1990
- Employees: 201-500
- Industry: Professional Services