Starter kit contents
The Starter Kits have all you need to compete in the Multimodal Single-Cell Data Integration Challenge. The following sections will use the Python starter kit for the Modality Prediction task. After unzipping the starter kit, the working directory will contain the following files.
├── LICENSE # MIT License
├── README.md # Some starter information
├── bin/ # Binaries needed to generate a submission
│ ├── check_format
│ ├── nextflow
│ └── viash
├── config.vsh.yaml # Viash configuration file
├── script.py # Script containing your method
├── sample_data/ # Small sample datasets for unit testing and debugging
│ ├── openproblems_bmmc_cite_starter/ # Contains H5AD files for CITE data
│ └── openproblems_bmmc_multiome_starter/ # Contains H5AD files for multiome data
├── scripts/ # Scripts to test, generate, and evaluate a submission
│ ├── 0_sys_checks.sh # Checks that the necessary software is installed
│ ├── 1_unit_test.sh # Runs the unit tests in test.py
│ ├── 2_generate_submission.sh # Generates a submission pkg by running your method on validation data
│ ├── 3_evaluate_submission.sh # (Optional) Scores your method locally
│ └── nextflow.config # Configurations for running Nextflow locally
└── test.py # Default unit tests. Feel free to add more tests, but don't remove any.
The most important parts of this directory are the `script.py` and `config.vsh.yaml` files, which you will need to modify in order to submit your method to EvalAI.
Anatomy of script.py
The Python script provided in this starter kit has three main sections: Imports, Viash block, and Method.
Imports
This section defines which packages the method expects. If you want to import a new package, add the import statement here and add the dependency to `config.vsh.yaml` (see below). This ensures that we can build an appropriate image to run your code.
import logging
import anndata as ad
import numpy as np
logging.basicConfig(level=logging.INFO)
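For example, if your method uses scikit-learn (which is listed under the docker platform's setup in the example config below), the paired change would look like this:

# Hypothetical extra import for a method built on scikit-learn.
# The matching dependency must also be declared in config.vsh.yaml,
# under the docker platform's `setup` section (see Platform below),
# or the Docker image built for your component will not contain it.
from sklearn.neighbors import KNeighborsRegressor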
Viash block
This optional code block exists to facilitate prototyping, so your script can run when called directly with `python script.py` (or `Rscript script.R` for R users).
## VIASH START
# Anything within this block will be removed by `viash` and will be
# replaced with the parameters as specified in your config.vsh.yaml.
input_path = 'sample_data/openproblems_bmmc_multiome_starter/openproblems_bmmc_multiome_starter'
par = {
    # Required arguments for the task
    'input_train_mod1': input_path + '.train_mod1.h5ad',
    'input_train_mod2': input_path + '.train_mod2.h5ad',
    'input_test_mod1': input_path + '.test_mod1.h5ad',
    'output': 'output.h5ad',
    # Optional method-specific arguments
    'n_neighbors': 5,
}
meta = { 'resources_dir': '.', 'functionality_name': '.' }
## VIASH END
Here, the `par` dictionary contains all the arguments defined in the `config.vsh.yaml` file. When you create the component using the `2_generate_submission.sh` script, Viash will remove everything between the `## VIASH START` and `## VIASH END` comments and replace it with a dictionary / named list containing the parameter values retrieved from an auto-generated command-line interface.
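Concretely, the substituted code is a plain assignment along these lines (a sketch only; the exact generated code and paths depend on how the component is invoked, and the file paths below are placeholders):

# Sketch of what viash substitutes for the VIASH START/END block.
# Values come from the command-line interface it generates from
# config.vsh.yaml; these paths are illustrative placeholders.
par = {
    'input_train_mod1': '/path/to/dataset.train_mod1.h5ad',
    'input_train_mod2': '/path/to/dataset.train_mod2.h5ad',
    'input_test_mod1': '/path/to/dataset.test_mod1.h5ad',
    'output': 'output.h5ad',
    'n_neighbors': 5,
}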
Method
This code block will typically consist of reading the input files, performing some preprocessing, training a model on the train cells, generating predictions for the test cells, and writing the predictions to an AnnData file (see Data Formats).
## Data reader
logging.info('Reading `h5ad` files...')
input_train_mod1 = ad.read_h5ad(par['input_train_mod1'])
input_train_mod2 = ad.read_h5ad(par['input_train_mod2'])
input_test_mod1 = ad.read_h5ad(par['input_test_mod1'])

# ... preprocessing ...
# ... train model ...
# ... generate predictions ...

# write output to file
adata = ad.AnnData(
    X=y_pred,
    uns={
        'dataset_id': input_train_mod1.uns['dataset_id'],
        'method_id': "python_starter_kit",  # all lowercase letters or underscores
    },
)
adata.write_h5ad(par['output'], compress='gzip')
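As a concrete illustration of the elided steps, here is a minimal sketch using scikit-learn's KNeighborsRegressor wired to the `n_neighbors` parameter. The model choice is an assumption for this sketch, not the required approach; scikit-learn is already declared in the dependencies of the example config below.

from sklearn.neighbors import KNeighborsRegressor

# Densify the training targets if they are stored sparse;
# scikit-learn regressors expect a dense y.
y_train = input_train_mod2.X
if hasattr(y_train, 'toarray'):
    y_train = y_train.toarray()

# Fit on modality 1 profiles of the train cells, then predict
# modality 2 values for the test cells.
model = KNeighborsRegressor(n_neighbors=par['n_neighbors'])
model.fit(input_train_mod1.X, y_train)
y_pred = model.predict(input_test_mod1.X)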
Anatomy of config.vsh.yaml
Once the script is working, you'll probably need to update the `config.vsh.yaml` file. The configuration has two sections:
- Functionality - metadata about the script, including method arguments and any additional resources (e.g. pretrained models)
- Platform - targets that define the component's dependencies (which are used to build a Docker container)
Full documentation on the Viash configuration file is available on the Viash documentation site.
Functionality
The first section of the configuration file contains metadata about the script, including its parameters and a list of resource files.
functionality:
  # a unique name for your method, same as what is being output by the script.
  # must match the regex [a-z][a-z0-9_]*
  name: method

  # metadata for your method
  description: A description for your method.
  authors:
    - name: John Doe
      email: johndoe@youremailprovider.com
      roles: [ author, maintainer ]
      props: { github: johndoe, orcid: "0000-1111-2222-3333" }

  # component parameters
  arguments:
    # required inputs (do not modify)
    - name: "--input_train_mod1"
      type: "file"
      example: "dataset_mod1.h5ad"
      description: Censored dataset, training cells.
      required: true
    - name: "--input_test_mod1"
      type: "file"
      example: "dataset_mod1.h5ad"
      description: Censored dataset, test cells.
      required: true
    - name: "--input_train_mod2"
      type: "file"
      example: "dataset_mod2.h5ad"
      description: Censored dataset.
      required: true
    # required outputs (do not modify)
    - name: "--output"
      type: "file"
      direction: "output"
      example: "output.h5ad"
      description: Dataset with predicted values for modality2.
      required: true
    # Method-specific parameters.
    # Change these to expose parameters of your method to Nextflow (optional)
    - name: "--n_neighbors"
      type: "integer"
      default: 5
      description: Number of neighbors to use.

  # files your script needs
  resources:
    # the script itself
    - type: python_script
      path: script.py
    # additional resources your script needs (optional)
    - path: pretrained_model.pt
In this section of the configuration you should focus on updating the following parts:
- Description and Authors - Information about who contributed the method.
- Arguments - Each entry here defines a command-line argument for the script. These arguments are all passed to the script in the form of a dictionary called `par`. You only need to change the method-specific parameters, and if you would like these parameters hard-coded into the script, you do not need to declare them here at all. Make sure not to remove required parameters such as the inputs; your method cannot be executed properly otherwise!
- Resources - This section describes the files that need to be included in your component. For example, if you'd like to add a file containing model weights called `weights.pt`, add `{ type: file, path: weights.pt }` to the resources. You can then load the additional resource in your script at the path `meta['resources_dir'] + '/weights.pt'` (see the sketch below).
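A minimal sketch of that resource-loading pattern (hypothetical: it assumes a `weights.pt` file listed under `resources:` and `torch` declared as a dependency in `config.vsh.yaml`):

import torch  # assumption: declared under the docker platform's setup

# Viash copies everything listed under `resources:` next to the built
# component and exposes that directory as meta['resources_dir'].
weights_path = meta['resources_dir'] + '/weights.pt'
state_dict = torch.load(weights_path)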
Platform
The Platform section defines how the Viash component is run on various backend platforms. The two platforms that are important for the competition are:
# target platforms
platforms:
  # By specifying 'docker' platform, viash will build a standalone
  # executable which uses docker in the back end to run your method.
  - type: docker
    # you need to specify a base image that contains at least bash and python
    image: dataintuitive/randpy:py3.8
    # You can specify additional dependencies with 'setup'.
    # See https://viash.io/reference
    # for more information on how to add more dependencies.
    setup:
      # - type: apt
      #   packages:
      #     - bash
      # - type: python
      #   packages:
      #     - scanpy
      - type: python
        packages:
          - scikit-learn
          - anndata
          - scanpy
  # By specifying a 'nextflow', viash will also build a viash module
  # which uses the docker container built above to also be able to
  # run your method as part of a nextflow pipeline.
  - type: nextflow
    labels: [ lowmem, lowtime, lowcpu ]
The most important part of this section to update is the `setup` definition, which describes the packages that need to be installed in the Docker container for the method to run. The many different ways of specifying these requirements are described in the Viash docs.
You can also change the memory, runtime, and CPU utilization by editing the Nextflow labels section. Available options are `[low|med|high]` for each of `mem`, `time`, and `cpu` (i.e. labels like `lowmem`, `highcpu`). The corresponding resource values can be found in the `scripts/nextflow.config` file.
Tip: After making changes to the component's dependencies, you will need to rebuild the Docker container as follows:
$ bin/viash run -- ---setup cachedbuild
[notice] Running 'docker build -t method:dev /tmp/viashsetupdocker-method-tEX78c'
Tip #2: You can view the Dockerfile that Viash generates from the config file using the `---dockerfile` argument:
$ bin/viash run -- ---dockerfile
Input/output AnnData file format
All the datasets provided in the competition are organized in AnnData objects. This page describes the structure of AnnData objects so that you can work with the data in the competition efficiently.
AnnData is a generic data storage format especially well-suited for single-cell data. The class stores a data matrix `X` together with annotations of observations and variables in Pandas DataFrames. The advantage of storing the annotations and data in the same object is to ensure that when inspecting slices of the data, relevant annotations are properly linked with the views of the data.
The most common attributes are:
- `adata.X` - The data. This can be a sparse array or a numpy array.
- `adata.n_obs` - The number of observations, equal to `adata.X.shape[0]`.
- `adata.obs_names` - The index for the observations, similar to a Pandas DataFrame `index`. Typically this contains cell barcodes.
- `adata.obs` - A Pandas DataFrame containing annotations of the observations, for example, cluster labels or the donor of origin. The index of the DataFrame must be the same as `adata.obs_names`.
- `adata.obsm` - A dictionary of n-dimensional arrays (ndarrays) of shape `(n_obs, m)`. For example, it is common to store a PCA or UMAP embedding of the data in `adata.obsm["X_pca"]` or `adata.obsm["X_umap"]`.
- `adata.obsp` - Pairwise annotations of the observations. Most often this contains a graph adjacency matrix where the nodes in the graph are observations.
- `adata.n_vars` - The number of variables or features in the data, equal to `adata.X.shape[1]`.
- `adata.var_names` - The index for the variables, similar to a Pandas DataFrame `columns`. Typically this contains gene names.
- `adata.var` - Annotations on the variables, e.g. whether a feature is a protein-coding gene or a non-coding gene. The index of this DataFrame must match `adata.var_names`.
- `adata.varm`, `adata.varp` - Analogous to the `obsm` and `obsp` attributes. Not commonly used.
- `adata.uns` - A dictionary to store any unstructured annotations associated with the data, e.g. a data version.
The following diagram shows these attributes and their relation to the data:
[Figure: schematic of the AnnData object and its attributes]
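To make these attributes concrete, here is a small sketch that builds a toy AnnData object from scratch (all names and values are made up for illustration):

import anndata as ad
import numpy as np
import pandas as pd

# Toy data: 3 cells (observations) x 2 genes (variables).
X = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]])
adata = ad.AnnData(
    X=X,
    obs=pd.DataFrame({'cell_type': ['B', 'T', 'B']},
                     index=['cell_1', 'cell_2', 'cell_3']),
    var=pd.DataFrame(index=['gene_a', 'gene_b']),
    uns={'dataset_version': 'toy_v1'},
)

print(adata.n_obs, adata.n_vars)  # 3 2
print(adata.obs_names.tolist())   # ['cell_1', 'cell_2', 'cell_3']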
AnnData objects can be opened in Python using the `anndata.read_h5ad()` function or the `scanpy.read_h5ad()` function. They can be opened in R using the `anndata::read_h5ad()` function. For the full AnnData API, please consult the documentation of the `anndata` package.
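For example, to read one of the sample files shipped with the starter kit:

import anndata as ad

adata = ad.read_h5ad(
    'sample_data/openproblems_bmmc_multiome_starter/'
    'openproblems_bmmc_multiome_starter.train_mod1.h5ad'
)
print(adata)          # summary of shapes and annotation keys
print(adata.X.shape)  # (n_obs, n_vars)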