High Performance Computing Reliability Analyst
Jump Trading Group is committed to world-class research. We empower exceptional talents in Mathematics, Physics, and Computer Science to seek scientific boundaries, push through them, and apply cutting-edge research to global financial markets. Our culture is unique. Constant innovation requires fearlessness, creativity, intellectual honesty, and a relentless competitive streak. We believe in winning together and unlocking unique individual talent by incentivizing collaboration and mutual respect. At Jump, research outcomes drive more than superior risk-adjusted returns. We design, develop, and deploy technologies that change our world, fund start-ups across industries, and partner with leading global research organizations and universities to solve problems.
Our Production Engineering team is made up of engineers from a variety of backgrounds, spread around the globe. Some have run clusters of thousands of machines at national labs; others have optimized systems to the hilt for distributed databases, contributed to core Python, or skipped college and taught themselves information security. What connects this team is a desire to work on challenging technical problems, to keep learning, and to do both in a collaborative and supportive environment. As a team, we design and support the trading infrastructure and the high-performance computing environment, solving for both latency and parallelism. We cover everything from BIOS tweaks and certifying the newest chassis to supporting internally written applications and advising people around the company, using a blend of open source and homegrown code.
The scale of our computing environments poses unique challenges in delivering good performance and reliability. Several systems, including compute, scheduling, networks, and large-scale data storage, must integrate seamlessly to support data pipelines and quantitative research. We’re looking to expand our global team with a strong Linux administrator who is interested in analyzing systems at this scale and relentlessly seeking to eliminate inefficiency and downtime.
What you’ll do:
• Maintain and support high-performance compute and storage systems
• Support performance monitoring and fault monitoring systems
• Monitor system and storage performance, up to and including network components
• Write code to automate frequently performed tasks
• Develop and improve user documentation
• Develop and monitor the tools used to maintain a production computing environment
• Provide operational support on a rotating basis and as needed
• Other duties as assigned or needed
Skills you’ll need:
• Experience in high-performance computing (HPC), including parallel filesystems (e.g., Lustre, GPFS), batch systems (e.g., Slurm, Grid Engine), and high-performance network interconnects, is a plus but not required
• Extensive experience with Linux systems administration
• High proficiency with at least one programming/scripting language (e.g., Go, Python)
• Extensive experience designing, building, and maintaining complicated, interdependent, and distributed systems
• Experience with system configuration management tools (e.g., SaltStack, Ansible, Puppet)
• A compulsion to perform root cause analysis
• Reliable and predictable availability