Repository

The OpenProblems repository structure

In the OpenProblems codebase, the src directory contains the Viash components that make up the project: common dataset components, task-specific components, and common helper components. The target folder stores the artifacts built from these Viash components, including Dockerized Nextflow modules. The resources_test directory contains the test resources required for running unit tests on the Viash components. These test resources are not stored in the git repository; instead, they are fetched by running a sync resources script in the scripts directory.

The main data flow of the pipeline is shown in Figure 1. The common dataset components create common dataset objects which are used in one or more tasks.

graph LR
  subgraph Common dataset components
    dataset_loader[/Dataset<br/>loader/]:::component
    raw_dataset[Raw<br/>dataset]:::anndata
    preprocessing[/Pre-processing/]:::component
    common_dataset[Common<br/>dataset]:::anndata
  end
  subgraph Task-specific components
    task_benchmark[/Benchmarking<br/>workflow/]:::component
    results[Results]:::anndata
  end
  
  dataset_loader --> raw_dataset --> preprocessing --> common_dataset --> task_benchmark --> results
Figure 1: Flow of data in OpenProblems benchmarks. All datasets are processed by a common processing pipeline. Further task-specific processing can occur prior to the task-specific benchmarking workflow. Legend: Grey rectangles are AnnData .h5ad files, purple rhomboids are Viash components.
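
The flow in Figure 1 can be sketched in plain Python to make the hand-offs explicit. This is only an illustration: the `Dataset` class stands in for an AnnData .h5ad file, and the function names, the toy counts, the per-cell normalization step, and the task identifiers `task_a` and `task_b` are all assumptions made for the example rather than names taken from the OpenProblems codebase.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Plain-Python stand-in for an AnnData .h5ad file (hypothetical)."""

    dataset_id: str
    counts: list[list[int]]  # rows are cells, columns are genes
    layers: dict[str, list[list[float]]] = field(default_factory=dict)


def dataset_loader(dataset_id: str) -> Dataset:
    """Dataset loader: produce a raw dataset (toy counts instead of a download)."""
    return Dataset(dataset_id=dataset_id, counts=[[1, 0, 3], [0, 2, 2]])


def preprocessing(raw: Dataset) -> Dataset:
    """Common processing: add a derived layer so every task receives the same format.

    Per-cell normalization is an assumed example of such a step.
    """
    normalized = []
    for row in raw.counts:
        total = sum(row) or 1  # avoid dividing by zero for empty cells
        normalized.append([value / total for value in row])
    return Dataset(raw.dataset_id, raw.counts, {"normalized": normalized})


def task_benchmark(common: Dataset, task_id: str) -> dict:
    """Benchmarking workflow: consume a common dataset and emit a result record."""
    return {
        "task_id": task_id,
        "dataset_id": common.dataset_id,
        "n_cells": len(common.counts),
    }


# One common dataset is created once and reused by several tasks.
common = preprocessing(dataset_loader("toy_dataset"))
results = [task_benchmark(common, task_id) for task_id in ("task_a", "task_b")]
```

The point of the sketch is the last two lines: the common dataset is built a single time and then passed to each task's benchmarking workflow, which mirrors the single path from loader to common dataset in the figure.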

Directory Structure

  • src/common: This subdirectory contains helper components that help with creating new components, unit testing other components, or managing task results.
  • src/datasets: The dataset processing pipeline uses dataset loaders to create raw dataset files. The raw dataset files are then processed to generate common dataset files. Common dataset files are used in one or more tasks.
  • src/tasks/<task_id>: Each task should contain a data processor (to transform common datasets into task-specific datasets), methods, control methods (for quality control), and metrics.
  • resources_test: This directory contains the test resources required for running unit tests on the Viash components.
  • target: This directory contains the artifacts built from the Viash components in the src directory.
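
As a quick sanity check on a fresh checkout, the layout above could be verified with a few lines of Python. This is a minimal sketch that uses only the directory names listed in this section; the helper names `check_layout` and `missing_dirs` are hypothetical and not part of the repository.

```python
from __future__ import annotations

from pathlib import Path

# Directories described in this section, with a short reminder of their role.
EXPECTED_DIRS = {
    "src/common": "helper components",
    "src/datasets": "dataset loaders and processors",
    "src/tasks": "one subdirectory per task",
    "resources_test": "test resources (synced, not stored in git)",
    "target": "artifacts built from the Viash components",
}


def check_layout(repo_root: Path) -> dict[str, bool]:
    """Return, for each expected directory, whether it exists under repo_root."""
    return {name: (repo_root / name).is_dir() for name in EXPECTED_DIRS}


def missing_dirs(repo_root: Path) -> list[str]:
    """List the expected directories that are absent, in the order given above."""
    return [name for name, present in check_layout(repo_root).items() if not present]


if __name__ == "__main__":
    for name in missing_dirs(Path(".")):
        print(f"missing: {name} ({EXPECTED_DIRS[name]})")
```

On a fresh clone, resources_test is expected to show up as missing until the sync resources script has been run, and target until the components have been built.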

Technology stack

  • AnnData: A file format designed for handling annotated, high-dimensional biological data (Virshup et al. 2021). In OpenProblems, AnnData serves as the standard data format for both input and output files of components, ensuring a consistent and seamless exchange of data between different components of the benchmarking pipelines.

  • AWS: Amazon Web Services provides scalable and cost-effective cloud computing and storage. OpenProblems uses AWS to store datasets and test resources and to run the Nextflow benchmarking pipelines.

  • CELLxGENE Census: A cloud-based library of single-cell RNA sequencing (scRNA-seq) datasets, developed by the Chan Zuckerberg Initiative. OpenProblems uses the CELLxGENE Census platform to fetch datasets for benchmarking.

  • Docker: Provides a consistent and reproducible environment for building, packaging, and deploying applications and dependencies across different platforms. Docker images are generated by Viash and stored on ghcr.io.

  • GitHub Actions: A continuous integration and continuous deployment (CI/CD) platform integrated with GitHub. This project uses GitHub Actions to continuously build and unit test the components in the project.

  • Nextflow: A workflow management system that simplifies the design, deployment, and execution of complex data processing pipelines, enabling seamless scaling and parallelization. All Nextflow modules are generated by Viash and are stored in the target/nextflow/ folder in the project releases (and on the main_build branch).

  • Python: A widely used, high-level programming language, offering extensive libraries and packages for data manipulation, analysis, and machine learning. Most of the OpenProblems components are written in Python.

  • R: A programming language and software environment for statistical computing and graphics, widely used in data analysis and bioinformatics. OpenProblems also offers support for R components.

  • Viash: A tool that facilitates the creation of modular pipeline components by allowing developers to combine a code block or script with a small amount of metadata (Cannoodt et al. 2021). Viash components are used in OpenProblems for dataset loaders, dataset processors, methods, and metrics, enabling developers to focus on the core functionality of their components without worrying about the chosen pipeline framework.
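
The idea behind the Viash entry, pairing a script with a small amount of metadata, can be illustrated with a conceptual sketch in plain Python. This is not how Viash is implemented and it does not use Viash's configuration format: the `METADATA` dictionary, the `toy_method` name, and the `build_cli` and `main` helpers are hypothetical, and the `par` dictionary is borrowed only as a naming convention for the parsed arguments.

```python
import argparse

# Hypothetical metadata describing the component's interface.
METADATA = {
    "name": "toy_method",
    "arguments": [
        {"name": "--input", "type": str, "required": True},
        {"name": "--output", "type": str, "required": True},
    ],
}


def run(par: dict) -> str:
    """Core functionality: the only part a component author would write."""
    return f"would read {par['input']} and write {par['output']}"


def build_cli(metadata: dict) -> argparse.ArgumentParser:
    """Generate a command-line interface from the metadata alone."""
    parser = argparse.ArgumentParser(prog=metadata["name"])
    for arg in metadata["arguments"]:
        parser.add_argument(arg["name"], type=arg["type"], required=arg["required"])
    return parser


def main(argv: list) -> str:
    """Parse arguments according to the metadata, then call the core function."""
    args = build_cli(METADATA).parse_args(argv)
    return run(vars(args))


if __name__ == "__main__":
    print(main(["--input", "in.h5ad", "--output", "out.h5ad"]))
```

The separation is the point: `run` knows nothing about how it is invoked, so the same function could in principle be wrapped as a command-line tool, a container entry point, or a workflow module by swapping the generated wrapper.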

Virshup, Isaac, Sergei Rybakov, Fabian J. Theis, Philipp Angerer, and F. Alexander Wolf. 2021. “Anndata: Annotated Data.” https://doi.org/10.1101/2021.12.16.473007.
Cannoodt, Robrecht, Hendrik Cannoodt, Eric Van de Kerckhove, Andy Boschmans, Dries De Maeyer, and Toni Verbeiren. 2021. “Viash: From Scripts to Pipelines.” arXiv. https://doi.org/10.48550/ARXIV.2110.11494.