Staff Reliability Engineer
Job Type: Full Time / Permanent
A Senior Reliability Engineer is an engineer with deep knowledge of systems, software engineering, and the associated automation, tooling, and processes. They have the breadth and depth of knowledge to iteratively improve the operability, observability, reliability, scalability, and performance of systems in order to reduce operational overhead, reduce risk, and simplify the ecosystem. They drive operational excellence by enabling Balanced Product Teams and other Partner Teams to up-level the health of their services in production, improve reliability, and self-serve and run their own services through strong partnerships and continuous collaboration.
JOB RESPONSIBILITIES:
• Leads software lifecycle, reliability, observability, and efficiency across product teams within your domain
• Demonstrates operational excellence by leading major automation and toil reduction initiatives, simplifying the ecosystem, and reducing risks
• Guides product teams on building resilient and observable architectures
• Builds products to help teams manage their own reliability
• Advises product teams within your domain on capacity planning, chaos engineering, and DR/HA
• Works with product owners within your domain to define reliability best practices and set standards for SLIs/SLOs and error budgets
• Mentors and assists engineers on the team
QUALIFICATIONS REQUIRED:
• Bachelor’s Degree or equivalent in MIS, Computer Science, or related field
• 6+ years of experience in software development
• Strong programming skills in one or more languages: Java, Python, Go, or Node.js
• Proven ability to manage multiple competing priorities
• Advanced in-depth knowledge of application design patterns, event-driven architecture, database schemas, and testing strategies
• Demonstrated experience with large-scale application troubleshooting and performance tuning
• Demonstrated experience working with at least one major cloud platform (GCP, AWS, or Azure)
• Deep experience in one or more observability platforms: Prometheus, InfluxDB, Grafana, ELK, or APM
• Deep experience in at least one PaaS or container platform: OpenShift, Cloud Foundry, Kubernetes, or equivalent
• Deep experience with one or more configuration management systems such as Chef, Ansible, or Puppet
PREFERRED:
• Advanced in-depth knowledge and experience with continuous integration, continuous deployment, and test-driven development
• Advanced understanding of systems architecture, UNIX internals, networking topologies, multi-cluster applications, multi-tenant platforms, and systems/network security