Reliability Engineer

Location: West Pittsburgh, PA

Job Type: Full Time / Permanent

Responsibilities:

  • Troubleshoot high-severity performance and availability issues in e-commerce, infrastructure, and legacy business applications and websites, and manage the incident lifecycle through to resolution.
  • Lead root cause analyses and investigations by identifying, analyzing, and remediating service performance and availability issues to maximize service uptime.
  • Conduct blameless post-incident reviews.
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
  • Be on call, communicate clearly in writing, and develop working relationships with coworkers.
  • Balance service reliability, metrics, sustainability, technical debt, and operational toil for live services running at scale.
  • Work across multiple project teams simultaneously to support rapid development efforts.
  • Solve complex, business-critical issues that impact bottom-line financial numbers and customer loyalty/experience.
  • Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
  • Contribute positively to open source projects developed by the company and join existing communities.
  • Navigate this broader ecosystem and structure projects with upstream/downstream opportunities in mind.
  • Identify and integrate with third-party solutions where it makes the most sense.
  • Use data to understand the availability, reliability, and sustainability of our software.
  • Bring experience, pragmatism, empathy, and composure to interactions with teams outside of the RE organization.
  • Work frequently with Product teams on shared goals and cross-team projects.
  • Balance planned and reactive work using basic project planning techniques and technical roadmaps.
  • Work and collaborate across teams such as Application Services, Capacity Planning, Hardware, Network, and Datacenter Operations.
  • Participate in building advanced tooling for testing, monitoring, administration, and operations of multiple clusters across multiple environments.
  • Negotiate SLIs, SLOs, and SLAs with product owners.

Experience:

  • 3-5+ years of experience applying reliability engineering principles to distributed services.
  • Understanding of and comfort with the GNU/Linux operating system.
  • Proficiency in high-level languages such as Ruby, Python, and Bash.
  • Exposure to system-level languages such as Go, C/C++.
  • Familiarity with configuration management software such as Puppet, Chef, Ansible, or Salt.
  • Source control and repository management, including branching and merging (Git, SVN, etc.).
  • Networking basics: TCP vs UDP, basic troubleshooting, HTTP – load balancing, firewall, private networks, multi-tier design, scale-out, persistent data
  • Databases: at a minimum, an understanding of the basics (SELECT/INSERT).
  • Familiarity with standard infrastructure concepts like load balancers, firewalls, and object storage, and where and when they might be used.
  • Service Management – Incident Response, Change, and Problem Management.
  • Experience with Kubernetes and Docker.
  • Cloud computing concepts (not necessarily provider specific) – VMs vs Docker Containers, block storage vs object storage, infra automation vs install automation.
  • Experience operating a platform, software as a service, or shipping software.
  • Experience as an open-source contributor.
  • Valuable Technologies Like: WebSphere Commerce, WebSphere eXtreme Scale, WebSphere Application Server, WebSphere Message Broker, WebSphere MQ, Order Management, Web Services, Tomcat, Apache, TCP, UDP, Load Balancers, Repository Management (git/svn), Puppet, Chef, Ansible, Salt, VMs, Docker Containers
  • Valuable Methodologies Like: ITIL, Agile, SCRUM, Reliability Engineering
  • Valuable Languages Like: Java, JavaScript, SQL, XML, HTML, CSS, Visual Basic, AJAX, COBOL, JSTL, Ruby, Python, Bash, Go, C/C++
  • Valuable Databases/Operating Systems Like: Oracle, DB2, SQL Server, Windows, UNIX, Linux, System i
  • Valuable Monitoring Tools Like: IBM Monitoring, SCOM, CA Spectrum, AppDynamics, Soasta, Foglight
  • Valuable Service Management Tools Like: Remedy, ServiceNow, Jira, Pivotal Tracker, xMatters