Position Summary:
The Senior Application Developer is part of the Advanced Computing Center for Research and Education (ACCRE) cluster team at Vanderbilt University and is a key individual contributor responsible for acting as the team lead and subject matter expert for the cluster scheduler, the cluster software stack, and software pipelines. Reporting directly to the Director of Research Computing Operations, the Senior Application Developer role places an emphasis on computational cluster performance and the latest tools for cluster operations.
About the Work Unit:
The Advanced Computing Center for Research and Education (ACCRE) is built and operated by Vanderbilt faculty. Its mission is to enable Vanderbilt researchers to explore and benefit from the “New World” of computing, thereby addressing questions of great societal importance that they could not otherwise have addressed. To achieve this, the center has established the following goals:
• Application Driven: ACCRE emphasizes the application of computational resources to important questions across the diverse disciplines of Vanderbilt researchers, rather than focusing solely on the development of computational hardware, tools, and methodologies.
• Low Barriers: The center aims to provide computational services with low barriers to participation, working closely with researchers to develop and adapt computing tools to their specific areas of inquiry.
• Expand the Paradigm: ACCRE collaborates with the Vanderbilt community to discover innovative ways of utilizing computing in the humanities, arts, and education.
• Promote Community: The center fosters an interactive community of researchers and cultivates a campus culture that supports and promotes the use of computing tools.
• Investigator Driven: ACCRE maintains a grassroots, bottom-up approach, operating as a facility by and for Vanderbilt faculty.
ACCRE provides computing resources flexible enough to support High Performance Computing applications in a broad range of research projects. To meet the growing demand for data storage, the center is developing and deploying solutions for both online and offline data repositories. Moreover, ACCRE offers the necessary hardware for investigators and students to visualize high-dimensional data using parallel graphics and stereo projection technologies. The center’s infrastructure also includes expertise and support staff to facilitate usage, including educational/outreach staff focused on lowering barriers to use and expanding the paradigm to encompass new and non-traditional areas of investigation. The center operates a 14,000+ core Linux cluster comprised of multiple computer architectures and over 30 petabytes of parallel access, fault tolerant, distributed disk storage.
Key Functions and Expected Performance:
• Continuously review cluster performance bottlenecks and investigate tuning opportunities
• Develop analytical tools to report on user-level and group-level cluster usage patterns and statistics
• Curate the software stack for the cluster observing appropriate architectures and production pipelines
• Promote contributions to the software stack via the EasyBuild community
• Develop stock virtual images/containers that can be customized by users
• Adapt and redesign the SLURM scheduler as the computational cluster grows and shifts
• Work closely with funded research groups to architect production workflows across ACCRE resources
• Research, develop, implement, maintain, document, and support infrastructure technology and services that facilitate the management and usage of the cluster, which includes elevated end-user cluster support
• Develop scripts to facilitate cluster and user management and assist computer system analysts in the creation of cluster-related software applications
• Assess software packages that could expand ACCRE’s value to users
• Research and evaluate new technologies and concepts which could potentially further improve ACCRE’s capabilities and services
• Respond to help desk tickets to solve user problems and to educate users on ACCRE services
• Participate in the on-call rotation and in after-hours scheduled and unscheduled downtimes
Supervisory Relationships:
This position does not have supervisory responsibility. This position reports administratively and functionally to the Director of Research Computing Operations.
Education and Certifications:
• A Bachelor’s degree in computer science, computer engineering, or a similar field is required.
• A Master’s degree is preferred.
Experience and Skills:
• A minimum of five years of experience, gained through work or school, with one or more major programming languages such as C, C++, Java, or Fortran is required.
• Five years of experience, gained through work or school, with one or more Unix scripting languages such as Perl, Bash, Csh, or Python, plus a working knowledge of all of these, is preferred.
• A minimum of two years of experience managing cluster schedulers is required.
• Strong ability to work independently and in a team environment and make decisions is required.
• Strong ability to share knowledge coherently with others and motivate and integrate peers is preferred.
• Ability to communicate to researchers the value that ACCRE provides is required.
• Physical ability to lift 30 pounds is required.
• Strong programming ability and an understanding of commonly used design patterns are required.
• Knowledge of and experience with version control tools and configuration management tools are required.