Dearborn, MI, 48123, USA
1 day ago
HPC SRE Systems Engineer
We are seeking a highly skilled and motivated HPC SRE Systems Engineer to join our growing team. You will be responsible for designing, building, and maintaining the HPC and SRE infrastructure that our platform depends on for daily operation, ensuring optimal performance and reliability for our critical applications. This role also focuses on automating deployments of our infrastructure and monitoring stack using CI/CD and IaC. If you are interested in engaging with a dynamic HPC stack and being a driving force behind the resiliency of our platform, this position could be a good fit for you.

What you'll do...

+ Design, implement, and maintain a robust and scalable HPC infrastructure to support containerized AI/ML workloads across traditional HPC and Kubernetes environments.
+ Implement monitoring solutions to ensure the health and availability of critical infrastructure and applications.
+ Develop automation for repeatable and resilient infrastructure deployments.
+ Troubleshoot and resolve complex technical issues related to Linux systems, networking, storage, and HPC applications.
+ Develop and maintain documentation for software and procedures.
+ Collaborate with software engineers and researchers to ensure seamless integration of HPC resources and scaling of applications.
+ Stay up to date on the latest advancements in HPC and AI/ML technologies and best practices.

You'll have...

+ Associate's degree in Computer Science or Engineering, or equivalent work experience.
+ 5+ years of experience in systems or software engineering.
+ Strong understanding of Linux operating systems, preferably in an HPC environment.
+ Proficiency in programming in one or more languages, preferably Go, Python, or Bash scripting.
+ Familiarity with scaling applications and with the metrics collection, analysis, and visualization tools used to identify bottlenecks, such as Prometheus and Grafana.
+ Excellent problem-solving and troubleshooting skills. The ability to define what problems need to be solved.
+ Strong communication and collaboration skills.

Even better, you may have...

+ Experience with containerization technologies like Docker or Kubernetes.
+ Experience with automation tools like Ansible, Puppet, or Chef.
+ Experience with monitoring tools like Prometheus, Icinga, Nagios, or Elasticsearch.

**Requisition ID** : 45304