As a Platform Reliability Engineer on Vanguard's Runtime Engineering team, you'll have the opportunity to put your operational savvy and engineering skills to work! On the job you'll be ensuring the "-ilities" (Availability, Reliability, Scalability, Usability, etc.) of our private and public cloud platforms in both test and production environments. You'll respond to incidents, apply upgrades to the platform, and leverage a strategic mindset to "automate all the things" (repetitive manual work is the worst!).
Additionally, you can anticipate working with real-time monitoring and diagnostic data, analyzing trends, and planning for future infrastructure needs. As a caretaker of these platforms, you'll be collaborating and planning activities with our internal development teams to ensure that application service level objectives are met. As the name might suggest, a passion for platforms and reliability is a must!
On the job you'll be...
Maintaining, upgrading, and patching our private and public cloud platforms in test and production environments. Managing communications and coordinating change events with development and support teams. Identifying and resolving reliability issues and implementing long-term mitigation strategies - ideally through automation.
Responding to production incidents and availability needs. Facilitating and documenting platform post-mortems. Training and mentoring junior staff members on reliability practices, processes, and technologies. Participating in the Runtime Engineering off-hours on-call rotation while helping to define future-state implementations for microservice runtime platforms.
Duties & Responsibilities:
- Provides senior-level Tier 3 technical infrastructure support services for issues elevated from the Support Center and other Technical Services groups. Ensures reliable operation of production.
- Diagnoses and troubleshoots availability interruptions and other production issues.
- Plans and coordinates enterprise-wide infrastructure projects with other IT and client teams.
- Communicates with teams to keep them apprised of status and issues. Contacts vendors to resolve technical issues.
- Tests, installs, and migrates software, patches, upgrades, applications, and/or hardware.
- Develops technical standards. Tests and evaluates IT vendor products.
- Writes documentation, including project plans, installation procedures, and troubleshooting tips. Creates diagrams, including technical topology.
- Maintains, monitors, and tunes production system and application performance. Debugs source code and performance problems and/or provides debugging assistance to developers.
- Identifies opportunities to improve system and applications performance (e.g., automating manual system tasks).
- Trains and mentors staff. Resolves complex issues elevated from staff with less experience.
- Adds, updates, and closes IT Problem Management database records. Researches and resolves complex issues, and reviews related technology records to mitigate impact on assigned system.
- Reviews numerous IT knowledge repositories to update technical knowledge.
- Learns and understands client area business functions and requirements. Has the ability to determine the appropriate technical tool to address the client's business needs.
- Thoroughly understands and complies with IT policies and procedures, especially those for quality and productivity standards that enable the team to meet established client service levels.
- Thoroughly understands and complies with Information Security policies and procedures, and verifies deliverables meet Information Security and VSA requirements.
- Participates in special projects and performs other duties as assigned.
Education & Experience:
- Minimum of 3 years of overall technical engineering experience
- Bachelor’s degree or equivalent technical experience preferred
- Deep understanding of, and practical experience with, upgrading and maintaining distributed orchestration systems (Cloud Foundry, OpenShift, etc.)
- Experience maintaining and monitoring distributed systems deployed in private and public clouds. Understanding of monitoring/telemetry solutions (Splunk, ELK, Nagios, etc.), including data ingestion and analysis.
- Deep knowledge of Linux systems and cloud platforms/providers
- Strong oral and written communication skills
- Passion for problem solving and strategic thinking and a desire to own and execute
- Advanced understanding and application of at least one scripting language (Shell, PHP, Python, etc.)
- Development experience (Java, etc.) a plus
- A flexible schedule - some activities you'll be performing may require off-hours or weekend support
Location/Region: Malvern, PA