Location: West Pittsburgh, PA
Job Type: Full Time / Permanent
- Troubleshoot high-severity performance and availability issues in e-commerce, infrastructure, and legacy business applications and websites, and manage the incident lifecycle through to resolution.
- Lead root cause analyses and investigations by identifying, analyzing, and remediating service performance and availability issues to maximize service uptime.
- Conduct blameless post-incident reviews.
- Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Be on call, and bring strong written communication skills and the ability to develop working relationships with coworkers.
- Experience in balancing service reliability, metrics, sustainability, technical debt, and operational toil for live services running at scale.
- Work across multiple project teams simultaneously to support rapid development efforts.
- Solve complex, business critical issues that impact bottom line financial numbers and customer loyalty/experience.
- Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
- Contribute positively to open source projects developed by the company and join existing communities.
- Navigate this broader ecosystem and structure projects with upstream/downstream opportunities in mind.
- Identify and integrate with third-party solutions where it makes the most sense.
- Use data to understand the availability, reliability, and sustainability of our software.
- Bring experience, pragmatism, empathy, and composure to interactions with teams outside of the RE organization.
- Work frequently with Product teams on shared goals and cross-team projects.
- Balance planned and reactive work using basic project planning techniques and technical roadmaps.
- Work and collaborate across teams such as Application Services, Capacity Planning, Hardware, Network, and Datacenter Operations.
- Participate in building advanced tooling for testing, monitoring, administration, and operations of multiple clusters across multiple environments.
- Experience negotiating SLIs, SLOs, and SLAs with product owners.
- 3-5+ years of experience applying reliability engineering principles to distributed services.
- Understanding of and comfort with the GNU/Linux operating system.
- Proficiency in high-level languages such as Ruby, Python, and Bash.
- Exposure to system-level languages such as Go, C/C++.
- Familiarity with configuration management software such as Puppet, Chef, Ansible, or Salt.
- Source control and repository management: branching and merging with git, svn, etc.
- Networking basics: TCP vs UDP, basic troubleshooting, HTTP – load balancing, firewall, private networks, multi-tier design, scale-out, persistent data
- Databases: at a minimum, an understanding of the basics (select/insert).
- Familiarity with standard infrastructure concepts like load balancers, firewalls, object storage and where/when they might be used.
- Service Management – Incident Response, Change, and Problem Management.
- Experience with Kubernetes and Docker.
- Cloud computing concepts (not necessarily provider specific) – VMs vs Docker Containers, block storage vs object storage, infra automation vs install automation.
- Experience operating a platform, software as a service, or shipping software.
- Experience as an open-source contributor.
- Valuable Technologies Like: WebSphere Commerce, WebSphere eXtreme Scale, WebSphere Application Server, WebSphere Message Broker, WebSphere MQ, Order Management, Web Services, Tomcat, Apache, TCP, UDP, Load Balancers, Repository Management (git/svn), Puppet, Chef, Ansible, Salt, VMs, Docker Containers
- Valuable Methodologies Like: ITIL, Agile, SCRUM, Reliability Engineering
- Valuable Databases/Operating Systems Like: Oracle, DB2, SQL Server, Windows, UNIX, Linux, System i
- Valuable Monitoring Tools Like: IBM Monitoring, SCOM, CA Spectrum, AppDynamics, Soasta, Foglight
- Service Management Tools Like: Remedy, ServiceNow, Jira, Pivotal Tracker, xMatters