Director of Reliability
Location: West Pittsburgh, PA
Job Type: Full Time / Permanent
As Director, Reliability Engineering (RE), you will inspire a diverse team of reliability engineers with your leadership and passion for increasing system reliability. You will coordinate RE engagement across the organization, focused on anticipating, identifying, and resolving some of the most complex production issues impacting the business. You will provide a single interface to engineering leadership and partner with Product and UX leadership to ensure overall product success.
The ideal candidate is interested in building scalable infrastructure, adding system resiliency, improving developer productivity, and automating everything that can (and should) be automated, and is a thoughtful people manager and leader. You will oversee a team of reliability engineers responsible for overall system health, availability, and performance, for reducing operational issues, and for the long-term strategy for our infrastructure. You will report to the VP, Technology Services and be based in our Pittsburgh office.
- Evaluate RE requirements and build out the RE team to scale:
- Drive the reliability of business critical services in a complex distributed ecosystem
- Develop a set of support practices for all Vertical (business facing) Product Teams, as well as Foundational (shared services including Platform, Infrastructure, Security, D&A) technology domains
- Partner with domain teams to steer product roadmaps and ensure reliability is built in
- Serve as an extension of these domains by discovering ways to improve support operations.
- Creating scalable engineering solutions will be at the heart of what you do.
- As the primary leader of RE, you will study and understand RE industry best practices and help to elevate the company’s status within the broader RE community.
- Oversee day-to-day Reliability Engineering activities across all Brands and Channels:
- Implement best practices to improve scalability of our systems across Store, Omni, eComm, Marketing, Supply Chain, and Corporate Tech as well as horizontal Foundational domains
- Establish consistent reliability processes for all Digital and traditional Channels as we support more Brands, Vendors, Products, features, and technology platforms
- Build an ecosystem of Observability to aid in the detection, triage, diagnosis, and ultimate resolution of business- and technology-impacting events
- Establish and monitor KPIs for reliability, throughput, quality, and controls; deliver dashboards that provide operational and executive views
- Perform 24×7 Level 2 support functions for all critical applications, systems, and products
- Own system uptime, monitoring/alerting, CI/CD, cloud networking, security, and overall performance
- Be a hands-on contributor to projects, including some coding, code reviews, and architectural discussions
- Partner with Software Engineering to maximize product and platform reliability through code, tools, and monitoring improvements
- Lead the next-generation transformation of system reliability, resiliency, and performance for all products and services.
- Automate everything
- Implement Self-Healing solutions to address failures and faults and reduce business impact
- Lead the Test Engineering team. Leverage test automation, end-to-end testing, and exploratory testing to detect issues and flaws before they result in business disruption
- By thoughtfully setting strategies for reducing toil, you will improve the athlete and teammate experiences and enable our engineering & support organizations to run highly reliable services.
- Staff Management and Financial Planning:
- Perform staff oversight and financial management for all aspects of functions described in this job description.
- Create, implement, and enhance an organization that best supports these responsibilities and delivers world-class operations and support functions to this Fortune 500 company.
- Control and manage a budget that leverages technology and automation to deliver seamless and reliable technology execution.
- As a technology leader, participate in overall technology strategy, goal setting, and future vision activities.
Education & Experience:
- Bachelor’s degree in Computer Science or a related technical field, or equivalent practical experience.
- 10 years of experience with system design, algorithms, data structures, analysis, and software design.
- 10 years of experience managing a distributed team of engineers
- Experience growing and building teams
- Experience managing technology infrastructure and conducting technical deep dives into code
Preferred Qualifications:
- 5+ years of site reliability engineering, DevOps, or related infrastructure experience
- 3+ years of engineering management experience
- 2+ years of retail and/or e-commerce experience
- Experience with modern architectures and cloud native design
- Experience with cloud infrastructure (Azure, GCP, etc.)
- Experience with data streaming platforms such as Apache Kafka, and other utility services
- Experience with PCF, or similar PaaS providers
- Proficiency in data collection and display toolsets (e.g., ELK, Prometheus)
- Familiarity with and exposure to Extreme Programming techniques
- Prior experience with test engineering and automation tools