High Performance Computing (HPC) Engineer
Location: Pittsburgh, PA
Job Type: Full Time / Permanent
This is an advanced technical position to administer and maintain various HPC environments and other mission-critical technology resources. The incumbent is responsible for ensuring continuity of service for HPC systems and for enterprise systems and services on UNIX and Linux platforms. Duties include installation, configuration, and day-to-day support of hardware, operating systems, and system administration applications for HPC, Linux, and UNIX infrastructures. The incumbent will coordinate and perform server hardware and software upgrades, monitor server performance and network connectivity, and monitor and resolve assigned customer trouble tickets associated with the HPC infrastructure and other systems. The position is charged with ensuring maximum availability and the highest level of performance for production and mission-critical technology resources. Excellent knowledge of RHEL file server operating systems and strong problem-solving and troubleshooting skills are required.
Seeking the following related skills:
- Installation, configuration, and day-to-day support of hardware, operating systems, and system administration applications for HPC infrastructures. Maintain, troubleshoot, and resolve issues with enterprise applications and with HPC, Linux, and UNIX systems, including application servers, storage area networks, and similar equipment.
- Ability to analyze information presented by advanced server monitoring technologies is critical for success. Must possess excellent troubleshooting and problem solving skills.
- Coordinate and perform server hardware and software upgrades, and monitor HPC/Linux/UNIX server performance and network connectivity. Install operating systems, apply maintenance releases of the operating system and related software, and apply security and other patches as required to maintain systems. Use and create documentation to set up and test enterprise equipment. Recommend software and hardware purchases to improve HPC services.
- Must possess ability to comprehend complex technical concepts and be familiar with patch management strategies and other CSSD procedures relating to system maintenance and performance tuning in order to ensure that systems are operating with maximum security, reliability, and operating efficiency. Able to perform research as required using vendor technical resources, and other tools to apply appropriately tested patches and operating system maintenance releases according to established change management and problem resolution procedures.
- Prepare comprehensive root cause analysis documentation for system issues or failures. Document procedures. Record information on faults reported by NOC management software in the call tracking database, system documentation, and related materials for enterprise-class and HPC systems.
- Must possess excellent written and oral communication skills and be able to create documentation that is both timely and comprehensive in nature. Must be able to interpret documentation on available systems and utilize this information effectively in maintaining assigned systems and servers.
- Problem resolution: collaborate and work cooperatively with other NOC engineers, other technical staff, and customers to resolve problems in a timely manner, and communicate necessary information to the Technology Help Desk as required. Monitor and resolve assigned customer trouble tickets associated with the HPC infrastructure and other systems on Linux and UNIX platforms.
- Must be able to communicate necessary information clearly and concisely, and participate effectively in ad hoc working groups to assess problems and devise problem-solving strategies. Must be able to translate complex technical information into summaries useful to Help Desk staff.
- Must be willing to work day, evening, and night shifts as required to support a 24-hour, seven-day operation.
Education & Experience:
- Bachelor’s degree (or equivalent experience), preferably in computer science or a related discipline. 6-9 years of total IT experience, with at least 3 years of experience administering and maintaining High Performance Computing environments with various server technologies. Incumbent should demonstrate experience and knowledge in the following:
- Experience with High Performance Computing environments (HPC)
- Experience with Scyld ClusterWare cluster management applications
- Experience with Red Hat and derivative Linux distributions
- Experience with high-bandwidth network fabrics such as InfiniBand or 10 Gigabit Ethernet
- Experience with Penguin Computing, Dell, and IBM HPC hardware; Sun Solaris; EMC
- Experience with Scyld Integrated Management Framework is a plus
- Application development background in C++, C#, and Java is a plus
- Knowledge of HPC job schedulers such as TORQUE and Scyld TaskMaster is a plus
- Familiarity with key protocols and technologies including TCP/IP, SSH, DNS, SMTP, SNMP, HTTP, LDAP, and SAN
- Familiarity with network switch configurations involving IOS and complex VLANs
- Excellent verbal and written communication skills required
- Excellent customer service skills required
- Red Hat Linux server administration experience
- Strong knowledge of Red Hat Enterprise Linux Servers (networking & storage)
- Experience patching Unix/Linux OS
- Familiarity with certificate renewals
- Solid experience decentralizing multi-server environments
- UNIX/Linux scripting
- High Availability and failover scenarios, load balanced environments
- Perl, Python, and InfiniBand skills a plus