Software Dvlpr 2

Stanford University • FULL_TIME • Stanford, California • 2d ago

The Stanford Center for Genomics and Personalized Medicine (SCGPM) has an exciting opportunity available for a motivated Bioinformatics Systems Architect/Team Leader to create innovative data infrastructures that will automate the process of turning big genomic data into biomedical insights. The ideal person for this position is a keen listener who can interpret biological questions, assess the value and relevance of different technologies and methods, and deliver actionable technical solutions.

Background:
The Department of Veterans Affairs (VA) has commissioned the sequencing of thousands of whole genomes from participants in the Million Veteran Program (MVP). This data is currently being delivered to the SCGPM’s cloud computing environment and constitutes one of the largest repositories of whole-genome sequencing data in the world. The scale and richness of this data make it an incredible resource for biomedical research. Our goal is to turn this data lake into a data commons: a dynamic computing environment where researchers bring questions and get answers, all without having to go through the ordeal of manually collecting, cleaning, massaging, scrubbing, sorting, transforming, and filtering data.

Position:
In this position, you would be the lead architect and system implementer of Trellis, the cloud-based MVP data management system that we created. Trellis keeps track of the petabytes of sequence data contributed to the MVP by veterans. It also orchestrates the processing of that data into derivative files and records which programs were used to transform the data, maintaining a detailed record of data provenance.
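To make the idea of data provenance concrete, the following is a minimal sketch, not the Trellis implementation. It assumes each derived file is recorded together with the program and inputs that produced it. The names `ProvenanceLog`, `record`, and `lineage`, and the file and program names, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Derivation:
    """One processing step: which program turned which inputs into an output."""
    output: str
    program: str
    inputs: Tuple[str, ...]


class ProvenanceLog:
    """Hypothetical in-memory provenance record, keyed by output file name."""

    def __init__(self) -> None:
        self._by_output: Dict[str, Derivation] = {}

    def record(self, output: str, program: str, inputs: List[str]) -> None:
        self._by_output[output] = Derivation(output, program, tuple(inputs))

    def lineage(self, filename: str) -> List[str]:
        """Walk back from a file to the raw data it was derived from."""
        steps: List[str] = []
        pending = [filename]
        seen = set()
        while pending:
            current = pending.pop()
            if current in seen:
                continue
            seen.add(current)
            step = self._by_output.get(current)
            if step is None:
                continue  # No recorded derivation: treat as raw input.
            steps.append(f"{step.program}: {', '.join(step.inputs)} -> {step.output}")
            pending.extend(step.inputs)
        return steps


log = ProvenanceLog()
log.record("sample-001.bam", "aligner", ["sample-001.fastq.gz"])
log.record("sample-001.vcf", "variant-caller", ["sample-001.bam"])
history = log.lineage("sample-001.vcf")
```

Here `history` lists the variant-calling step first and the alignment step second, tracing the derived file back to the raw reads. A graph database would be a natural home for the same relationships at scale.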

To manage the enormous volumes of biomedical research data that the MVP generates, we built and run Trellis on the Google Cloud Platform. The Trellis architecture uses serverless cloud services such as Cloud Functions and Pub/Sub to build a workflow that responds to the arrival of new data by automatically initiating pipeline processes.
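The event-driven pattern described above can be illustrated with a small sketch. This is an assumption-laden toy, not Trellis code: `MessageBus` is a hypothetical in-memory stand-in for a publish/subscribe service, and the topic names, pipeline names, and file names are invented for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class MessageBus:
    """Hypothetical in-memory stand-in for a publish/subscribe service."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> None:
        # Deliver the message to every handler subscribed to the topic.
        for handler in self._subscribers[topic]:
            handler(message)


launched_jobs: List[dict] = []


def on_new_object(event: dict, bus: MessageBus) -> None:
    """React to a newly arrived file by requesting the matching pipeline."""
    name = event["name"]
    if name.endswith(".fastq.gz"):
        bus.publish("launch-job", {"pipeline": "align", "input": name})
    elif name.endswith(".bam"):
        bus.publish("launch-job", {"pipeline": "call-variants", "input": name})
    # Unrecognized file types are ignored.


def launch_job(message: dict) -> None:
    """Stand-in for a job launcher: here it only records the request."""
    launched_jobs.append(message)


bus = MessageBus()
bus.subscribe("new-object", lambda event: on_new_object(event, bus))
bus.subscribe("launch-job", launch_job)

# Simulate three files arriving in storage.
bus.publish("new-object", {"name": "sample-001.fastq.gz"})
bus.publish("new-object", {"name": "sample-001.bam"})
bus.publish("new-object", {"name": "notes.txt"})
```

After the three simulated arrivals, `launched_jobs` holds two requests, one per recognized file type. In a real deployment the handlers would be separate serverless functions and the bus a managed messaging service, so no component needs to poll for new data.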

A production version of Trellis has already processed the whole-genome sequences of 150,000 veterans, and we plan to process at least as many more in the coming year. You would be in charge of keeping this production system running and optimized, and you would interface with the DevOps team that will maintain the system in a FedRAMP-secure environment.

Now that we have proven that we can process and manage biomedical data at scale, we want to make MVP data more easily accessible to VA-internal researchers and to the scientific community at large. Possible directions for this sharing include creating a visualization front end that lets researchers explore data graphically and providing a cohort selection mechanism so that subpopulations of veterans can be studied. You would continue developing the Trellis system to integrate new data from the VA and to present Trellis data to the research community through tools and interfaces that are both powerful and easy to use.

To help you achieve these goals, you would direct a small team of excellent, self-starting engineers in tasks like devising new pipelines for quality control and integrating demographic data with sequence data.

This project has many open-source components, and you would be encouraged to publish details from your systems architecture work or results from processing the genomic data. As an example of a publication from this group, see this reference describing the early design of the Trellis system:

Ross, P.B., Song, J., Tsao, P.S. et al. Trellis for efficient data and task management in the VA Million Veteran Program. Scientific Reports 11, 23229 (2021). https://doi.org/10.1038/s41598-021-02569-5

Our Team:
Our SCGPM bioinformatics team is a multi-disciplinary group of about a dozen scientists, engineers, and software developers with complementary backgrounds, each contributing their own expertise in managing and analyzing complex biomedical data. Projects supported by this team include the Stanford Genomics Sequencing Center, the VA Million Veteran Program, the NCI Human Tumor Atlas Network, the Human BioMolecular Atlas Program, and the Stanford Metabolic Health Center.

This position can be on-site, fully remote, or hybrid.

Bioinformatics Systems Architect Duties include:

Collaborating with researchers to design solutions to relevant biological questions and maximize the value of our whole-genome sequencing dataset to the public
Determining how to implement big data technical solutions to those questions in a cloud environment
Dockerizing bioinformatics tools and integrating them with our internal data management system to automate workflows
Implementing population-level genomic analyses (GWAS, PCA) using big data technologies
Connecting research data to biological knowledge to streamline the process of answering biological questions
Transforming genomic data from custom file formats into database-native formats
Developing an ontology to describe the relationships between data objects and resources involved in research data management

Team Leader Duties include:

Proposing, conceptualizing, designing, implementing, and developing solutions for difficult and complex applications independently
Overseeing systems testing, debugging, change control, and documentation
Supervising professional staff, as necessary, working on all phases of application development projects
Engaging in long-term strategic planning
Defining complex application development administration and programming standards
Overseeing the support, maintenance, operation, and upgrades of applications
Troubleshooting and resolving complex technical problems
Working with other technical professionals to develop globally applicable standards and implement best practices

Desired qualifications:

Four-year degree in Genetics, Computer Science, Bioinformatics, Computational Physics, or a related field
Experience modeling biological/biomedical data and metadata
Experience with biological -omics data and file formats (FASTQ, FASTA, BAM, proteomics, metabolomics, etc.)
Comfortable programming in Python
An ability to independently grasp the objectives of research projects and assemble solutions from a range of technologies, standards, and approaches
A desire to learn new methods and technologies and to adapt to demands of fast-paced research
Excellent verbal and written communication skills
Experience managing small teams
Experience managing projects
Experience with cloud computing, especially Google Cloud
Experience with databases, especially graph databases
Experience with big data technologies (e.g., BigQuery, Spark)
Familiarity with issues in computer data security
Familiarity with FAIR principles of data management

* - Other duties may also be assigned

EDUCATION & EXPERIENCE (REQUIRED):

Bachelor's degree and five years of relevant experience, or a combination of education and relevant experience.

KNOWLEDGE, SKILLS AND ABILITIES (REQUIRED):

Expertise in designing, developing, testing, and deploying applications.
Proficiency with application design and data modeling.
Ability to define and solve logical problems for highly technical applications.
Strong communication skills with both technical and non-technical clients.
Ability to lead activities on structured team development projects.
Ability to select, adapt, and effectively use a variety of programming methods.
Knowledge of application domain.
