Perturbation Prediction
Predicting how small molecules change gene expression in different cell types.
1 dataset · 6 methods · 6 control methods · 5 metrics
Human biology can be complex, in part due to the function and interplay of the body’s approximately 37 trillion cells, which are organized into tissues, organs, and systems. However, recent advances in single-cell technologies have provided unparalleled insight into the function of cells and tissues at the level of DNA, RNA, and proteins. Yet leveraging single-cell methods to develop medicines requires mapping causal links between chemical perturbations and the downstream impact on cell state. These experiments are costly and labor intensive, and not all cells and tissues are amenable to high-throughput transcriptomic screening. If data science could help accurately predict chemical perturbations in new cell types, it could accelerate and expand the development of new medicines.
Several methods have been developed for drug perturbation prediction, most of which are variations on the autoencoder architecture (Dr.VAE, scGEN, and ChemCPA). However, these methods lack proper benchmarking datasets with diverse cell types to determine how well they generalize. The largest available training dataset is the NIH-funded Connectivity Map (CMap), which comprises over 1.3M small molecule perturbation measurements. However, the CMap includes observations of only 978 genes, less than 5% of all genes. Furthermore, the CMap data consists almost entirely of measurements in cancer cell lines, which may not accurately represent human biology.
This task aims to predict how small molecules change gene expression in different cell types. This task was a Kaggle competition as part of the NeurIPS 2023 competition track.
The task is to predict the gene expression profile of a cell after a small molecule perturbation. For this competition, we designed and generated a novel single-cell perturbational dataset in human peripheral blood mononuclear cells (PBMCs). We selected 144 compounds from the Library of Integrated Network-Based Cellular Signatures (LINCS) Connectivity Map dataset (PMID: 29195078) and measured single-cell gene expression profiles after 24 hours of treatment. The experiment was repeated in three healthy human donors, and the compounds were selected based on diverse transcriptional signatures observed in CD34+ hematopoietic stem cells (data not released). We performed this experiment in human PBMCs because the cells are commercially available with pre-obtained consent for public release and PBMCs are a primary, disease-relevant tissue that contains multiple mature cell types (including T-cells, B-cells, myeloid cells, and NK cells) with established markers for annotation of cell types. To supplement this dataset, we also measured cells from each donor at baseline with joint scRNA and single-cell chromatin accessibility measurements using the 10x Multiome assay. We hope that the addition of rich multi-omic data for each donor and cell type at baseline will help establish biological priors that explain the susceptibility of particular genes to exhibit perturbation responses in different biological contexts.
Results
Results table of the scores per method, dataset and metric (after scaling). Use the filters to make a custom subselection of methods and datasets. The “Overall mean” dataset is the mean value across all datasets.
Dataset info
NeurIPS2023 scPerturb DGE
Differential gene expression sign(logFC) * -log10(p-value) values after 24 hours of treatment with 144 compounds in human PBMCs.
For this competition, we designed and generated a novel single-cell perturbational dataset in human peripheral blood mononuclear cells (PBMCs). We selected 144 compounds from the Library of Integrated Network-Based Cellular Signatures (LINCS) Connectivity Map dataset (PMID: 29195078) and measured single-cell gene expression profiles after 24 hours of treatment. The experiment was repeated in three healthy human donors, and the compounds were selected based on diverse transcriptional signatures observed in CD34+ hematopoietic stem cells (data not released). We performed this experiment in human PBMCs because the cells are commercially available with pre-obtained consent for public release and PBMCs are a primary, disease-relevant tissue that contains multiple mature cell types (including T-cells, B-cells, myeloid cells, and NK cells) with established markers for annotation of cell types. To supplement this dataset, we also measured cells from each donor at baseline with joint scRNA and single-cell chromatin accessibility measurements using the 10x Multiome assay. We hope that the addition of rich multi-omic data for each donor and cell type at baseline will help establish biological priors that explain the susceptibility of particular genes to exhibit perturbation responses in different biological contexts.
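The dataset values are described above as sign(logFC) * -log10(p-value). The following is a minimal sketch of that transformation for a single gene, assuming the log fold change and p-value have already been produced by a differential expression test. The function name `signed_log10_pvalue` and the handling of a zero fold change are illustrative assumptions, not taken from the benchmark code.

```python
import math


def signed_log10_pvalue(log_fc: float, p_value: float) -> float:
    """Return sign(logFC) * -log10(p-value) for one gene in one comparison."""
    if not 0.0 < p_value <= 1.0:
        raise ValueError("p_value must lie in (0, 1]")
    if log_fc == 0:
        # Assumption: a zero fold change carries no direction, so the score is 0.
        return 0.0
    sign = 1.0 if log_fc > 0 else -1.0
    return sign * -math.log10(p_value)


up = signed_log10_pvalue(1.5, 0.001)    # up-regulated gene, positive score
down = signed_log10_pvalue(-0.8, 0.01)  # down-regulated gene, negative score
print(up, down)
```

Large positive values therefore indicate confident up-regulation, large negative values confident down-regulation, and values near zero indicate little evidence of change.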
Method info
LSTM-GRU-CNN Ensemble
An ensemble of LSTM, GRU, and 1D CNN models. Links: Docs.
An ensemble of LSTM, GRU, and 1D CNN models with a variety of input features derived from ChemBERTa embeddings, one-hot encoding of cell type/small molecule pairs, and various statistical measures of target gene expression. The models were trained with a combination of MSE, MAE, LogCosh, and BCE loss functions to improve their robustness and predictive performance. The approach also included data augmentation techniques to ensure generalization and account for noise in the data.
NN retraining with pseudolabels
Neural networks with pseudolabeling and ensemble modelling. Links: Docs.
The prediction system has two stages, so I published two versions of the notebook. The first stage predicts pseudolabels. To be honest, if I had stopped at this version, I would not have placed third. In the second stage, the predicted pseudolabels for all test data (255 rows) are added to the training data.
Stage 1, preparing pseudolabels: The main part of this system is a neural network. Every neural network and its environment was optimized with Optuna. The optimized hyperparameters were the dropout value, the number of neurons in particular layers, the output dimension of the embedding layer, the number of epochs, the learning rate, the batch size, and the number of dimensions of the truncated singular value decomposition. The optimization was run on a custom 4-fold cross-validation. To keep Optuna from overfitting to the cross-validation, I applied 2 repeats for every fold and took the average; generally, the more repeats, the better. Optuna's criterion was MRRMSE. Finally, 7 models were ensembled, and Optuna was applied again to determine the best weights of their linear combination. The resulting test-set predictions are the pseudolabels used in the second stage.
Stage 2, retraining with pseudolabels: The pseudolabels (255 rows) were added to the training dataset. I applied 20 models with parameters optimized in different experiments for model diversity. Optuna again selected the optimal weights for the linear combination of their predictions. The models had high variance, so every model was trained 10 times on the full dataset and the median of the predictions was taken as the final prediction. The prediction was additionally clipped to the column-wise min and max.
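The last two steps of Stage 2 (taking the median over repeated training runs, then clipping column-wise) can be sketched as follows. This is an illustration under assumptions: predictions are plain lists of rows, the clipping bounds come from the training targets, and the helper names `median_ensemble` and `clip_columnwise` are hypothetical rather than the author's.

```python
from statistics import median


def median_ensemble(runs):
    """Element-wise median over repeated runs; each run is a rows x cols list of lists."""
    n_rows, n_cols = len(runs[0]), len(runs[0][0])
    return [
        [median(run[i][j] for run in runs) for j in range(n_cols)]
        for i in range(n_rows)
    ]


def clip_columnwise(pred, train):
    """Clip each predicted column to the [min, max] of that column in the training targets."""
    lows = [min(col) for col in zip(*train)]
    highs = [max(col) for col in zip(*train)]
    return [
        [min(max(value, lo), hi) for value, lo, hi in zip(row, lows, highs)]
        for row in pred
    ]


# Three hypothetical runs of one model, each predicting one row with two genes.
runs = [[[0.0, 5.0]], [[1.0, 9.0]], [[2.0, 7.0]]]
train_targets = [[-1.0, 0.0], [0.5, 6.0]]

final = clip_columnwise(median_ensemble(runs), train_targets)
print(final)
```

The median damps the effect of a single unstable run, and clipping keeps predictions inside the range actually observed for each gene.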
JN-AP-OP2
Deep learning architecture composed of 2 modules: a sample-centric MLP and a gene-centric MLP. Links: Docs.
We first encode each sample using a leave-one-out encoder based on compound and cell type. This produces X with dimensions (n_samples, n_genes, n_encode), where n_encode is 2. X is then passed sample-wise to MLP1, which takes an input of dimensions (n_samples, n_genes*n_encode) and outputs data of the same dimensions. The purpose of this MLP is to learn inter-gene relationships. We then group the output of MLP1 with X (the original encoded data) and feed it to MLP2, which receives an input of dimensions (n_samples*n_genes, n_encode + n_encode) and outputs n_samples*n_genes values. MLP2 trains on each (compound, cell_type, gene) combination, which overcomes the underdetermination problem caused by the lack of sufficient (compound, cell_type) samples.
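The encoding step can be illustrated with a small sketch. It assumes that "leave-one-out encoder" means leave-one-out target encoding: each sample is encoded, per gene, by the mean target of all other samples sharing its compound, and likewise for its cell type, which gives n_encode = 2. The function names and the global-mean fallback for categories seen only once are assumptions made for this example.

```python
def leave_one_out_encode(keys, targets):
    """For each sample, average the targets of all other samples sharing its key."""
    n_genes = len(targets[0])
    sums, counts = {}, {}
    for key, row in zip(keys, targets):
        acc = sums.setdefault(key, [0.0] * n_genes)
        for g, value in enumerate(row):
            acc[g] += value
        counts[key] = counts.get(key, 0) + 1
    # Assumption: a key seen only once falls back to the global per-gene mean.
    global_mean = [sum(col) / len(targets) for col in zip(*targets)]
    encoded = []
    for key, row in zip(keys, targets):
        if counts[key] > 1:
            encoded.append(
                [(sums[key][g] - row[g]) / (counts[key] - 1) for g in range(n_genes)]
            )
        else:
            encoded.append(list(global_mean))
    return encoded


def encode_samples(compounds, cell_types, targets):
    """Build X with shape (n_samples, n_genes, n_encode) where n_encode = 2."""
    by_compound = leave_one_out_encode(compounds, targets)
    by_cell_type = leave_one_out_encode(cell_types, targets)
    return [
        [[c, t] for c, t in zip(c_row, t_row)]
        for c_row, t_row in zip(by_compound, by_cell_type)
    ]


compounds = ["A", "A", "B"]
cell_types = ["T", "NK", "T"]
targets = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

X = encode_samples(compounds, cell_types, targets)
print(len(X), len(X[0]), len(X[0][0]))  # n_samples, n_genes, n_encode
```

Flattening the last two axes of X gives the (n_samples, n_genes*n_encode) input described for MLP1, while flattening the first two gives one row per (sample, gene) pair as described for MLP2.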
ScAPE
Neural network model for drug effect prediction (pablormier2023scape?). Links: Docs.
ScAPE utilises a neural network (NN) model to estimate drug effects on gene expression in peripheral blood mononuclear cells (PBMCs). The model took drug and cell features as input, with these features primarily derived from the median of signed log-pvalues and log fold-changes grouped by drug and cell type. The NN was trained using a leave-one-drug-out cross-validation strategy, focusing on NK cells as a representative cell type due to their similarity to B cells and Myeloid cells in principal component analysis. Model performance was evaluated by comparing its predictions against two baselines: predicting zero effect and predicting the median log-pvalue for each drug. The final submission combined predictions from models trained on different gene and drug subsets, aiming to enhance overall prediction accuracy.
Transformer Ensemble
An ensemble of four transformer models, trained on diverse feature sets, with a cluster-based sampling strategy and robust validation for optimal performance. Links: Docs.
This method employs an ensemble of four transformer models, each with different weights and trained on slightly varying feature sets. The feature engineering process involved one-hot encoding of categorical labels, target encoding using mean and standard deviation, and enriching the feature set with the standard deviation of target variables. Additionally, the dataset was carefully examined to ensure data cleanliness. A sophisticated sampling strategy based on K-Means clustering was employed to partition the data into training and validation sets, ensuring a representative distribution. The model architecture leveraged sparse and dense feature encoding, along with a transformer for effective learning.
Py-boost
Py-boost predicting t-scores. Links: Docs.
An ensemble of four models was considered:
- Py-boost (a ridge regression-based recommender system)
- ExtraTrees (a decision tree ensemble with target-encoded features)
- a k-nearest neighbors recommender system
- a ridge regression model
Each model offered distinct strengths and weaknesses: ExtraTrees and k-nearest neighbors were unable to extrapolate beyond the training data, while ridge regression provided extrapolation capability. To enhance model performance, data augmentation techniques were used, including averaging differential expressions for compound mixtures and adjusting cell counts to reduce biases.
In the end, only the Py-boost model was used for generating predictions.
Control method info
Ground truth
Returns the ground truth predictions
The identity function that returns the ground-truth information as the output.
Mean per gene
Baseline method that returns mean of gene’s outcomes
Baseline method that predicts, for each gene, the mean of its outcomes across all samples.
Mean per cell type and gene
Baseline method that returns mean of cell type’s outcomes
Baseline method that predicts, for each cell type and gene, the mean outcome across all compounds.
Mean per compound and gene
Baseline method that returns mean of compound’s outcomes
Baseline method that predicts, for each compound and gene, the mean outcome across all samples.
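The three mean baselines above differ only in how the training rows are grouped before averaging. A minimal sketch, assuming each training row is a dictionary with hypothetical keys `compound`, `cell_type`, and `y` (the per-gene outcomes); the function names are illustrative and not the benchmark's own:

```python
from collections import defaultdict


def mean_per_gene(train_rows):
    """Per-gene mean over all training samples."""
    columns = zip(*(row["y"] for row in train_rows))
    return [sum(col) / len(col) for col in columns]


def mean_per_group_and_gene(train_rows, group_key):
    """Per-gene mean within each group, e.g. group_key='cell_type' or 'compound'."""
    grouped = defaultdict(list)
    for row in train_rows:
        grouped[row[group_key]].append(row["y"])
    return {
        group: [sum(col) / len(col) for col in zip(*rows)]
        for group, rows in grouped.items()
    }


train = [
    {"compound": "A", "cell_type": "T", "y": [1.0, 2.0]},
    {"compound": "A", "cell_type": "NK", "y": [3.0, 4.0]},
    {"compound": "B", "cell_type": "T", "y": [5.0, 0.0]},
]

gene_means = mean_per_gene(train)
cell_type_means = mean_per_group_and_gene(train, "cell_type")
compound_means = mean_per_group_and_gene(train, "compound")
print(gene_means, cell_type_means["T"], compound_means["A"])
```

To predict a test row, each baseline simply looks up the mean vector for that row's group (or the single global vector for the per-gene baseline).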
Sample
Sample predictions from the training data
This method samples the training data to generate predictions.
Zeros
Baseline method that predicts all zeros
Baseline method that predicts all zeros.
Metric info
Mean Rowwise RMSE
The mean of the root mean squared error (RMSE) of each row in the matrix.
We use the Mean Rowwise Root Mean Squared Error to score submissions, computed as follows:
$$\textrm{MRRMSE} = \frac{1}{R}\sum_{i=1}^R\left(\frac{1}{n} \sum_{j=1}^{n} (y_{ij} - \widehat{y}_{ij})^2\right)^{1/2}$$
where $R$ is the number of scored rows, $y_{ij}$ and $\widehat{y}_{ij}$ are the actual and predicted values, respectively, for row $i$ and column $j$, and $n$ is the number of columns.
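As a worked illustration of MRRMSE, the sketch below computes the RMSE of each row and then averages over rows. It assumes the actual and predicted matrices are plain lists of equally long rows; the function name `mean_rowwise_rmse` is chosen for this example only.

```python
import math


def mean_rowwise_rmse(y_true, y_pred):
    """Mean over rows of the per-row root mean squared error."""
    row_scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        squared_errors = [(t - p) ** 2 for t, p in zip(true_row, pred_row)]
        row_scores.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(row_scores) / len(row_scores)


y_true = [[1.0, 1.0], [2.0, 2.0]]
y_pred = [[0.0, 0.0], [0.0, 0.0]]

# The per-row RMSEs are 1.0 and 2.0, so their mean is 1.5.
score = mean_rowwise_rmse(y_true, y_pred)
print(score)
```

Because the square root is taken per row before averaging, every row contributes on the same scale, and a single badly predicted row cannot dominate the score as much as it would in a global RMSE.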
Mean Rowwise MAE
The mean of the mean absolute error (MAE) of each row in the matrix.
We use the Mean Rowwise Mean Absolute Error to score submissions, computed as follows:
$$\textrm{MRMAE} = \frac{1}{R}\sum_{i=1}^R\left(\frac{1}{n} \sum_{j=1}^{n} \left|y_{ij} - \widehat{y}_{ij}\right|\right)$$
where $R$ is the number of scored rows, $y_{ij}$ and $\widehat{y}_{ij}$ are the actual and predicted values, respectively, for row $i$ and column $j$, and $n$ is the number of columns.
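A comparable sketch for MRMAE takes the mean of the absolute errors within each row, then averages over rows. The inputs are assumed to be plain lists of rows, and `mean_rowwise_mae` is an illustrative name.

```python
def mean_rowwise_mae(y_true, y_pred):
    """Mean over rows of the per-row mean absolute error."""
    row_scores = []
    for true_row, pred_row in zip(y_true, y_pred):
        absolute_errors = [abs(t - p) for t, p in zip(true_row, pred_row)]
        row_scores.append(sum(absolute_errors) / len(absolute_errors))
    return sum(row_scores) / len(row_scores)


y_true = [[1.0, -1.0], [2.0, 4.0]]
y_pred = [[0.0, 0.0], [0.0, 0.0]]

# The per-row MAEs are 1.0 and 3.0, so their mean is 2.0.
score = mean_rowwise_mae(y_true, y_pred)
print(score)
```

Without the absolute value, the positive and negative errors in the first row would cancel to zero, which is why the absolute errors are averaged rather than the raw differences.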
Mean Rowwise Pearson
The mean of Pearson correlations per row (perturbation).
The Mean Pearson Correlation is computed as follows:
$$\textrm{Mean-Pearson} = \frac{1}{R}\sum_{i=1}^R\frac{\textrm{Cov}(\mathbf{y}_i, \mathbf{\hat{y}}_i)}{\sqrt{\textrm{Var}(\mathbf{y}_i) \cdot \textrm{Var}(\mathbf{\hat{y}}_i)}}$$
where $R$ is the number of scored rows, and $\mathbf{y}_i$ and $\mathbf{\hat{y}}_i$ are the actual and predicted values, respectively, for row $i$.
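The row-wise Pearson metric can be sketched as below, using the standard definition of the Pearson correlation (covariance divided by the product of the standard deviations). Returning 0.0 for a constant row, where the correlation is undefined, is an assumption of this sketch and may differ from the benchmark's behaviour; the function names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: treat the undefined correlation of a constant row as 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mean_rowwise_pearson(y_true, y_pred):
    """Mean of the per-row Pearson correlations."""
    scores = [pearson(t, p) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


y_true = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
y_pred = [[2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]

# Row 1 is perfectly correlated (+1), row 2 perfectly anti-correlated (-1).
score = mean_rowwise_pearson(y_true, y_pred)
print(score)
```

The common 1/n factors in the covariance and variances cancel, so the sketch works with unnormalised sums.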
Mean Rowwise Spearman
The mean of Spearman correlations per row (perturbation).
The Mean Spearman Correlation is computed as follows:
$$\textrm{Mean-Spearman} = \frac{1}{R}\sum_{i=1}^R\frac{\textrm{Cov}(\mathbf{r}_i, \mathbf{\hat{r}}_i)}{\sqrt{\textrm{Var}(\mathbf{r}_i) \cdot \textrm{Var}(\mathbf{\hat{r}}_i)}}$$
where $R$ is the number of scored rows, and $\mathbf{r}_i$ and $\mathbf{\hat{r}}_i$ are the ranks of the actual and predicted values, respectively, for row $i$.
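Since the Spearman correlation is the Pearson correlation of the ranks, it can be sketched by ranking each row first. Assigning tied values their average rank and returning 0.0 for constant rows are assumptions of this sketch; the function names are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 for constant input (an assumption)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def mean_rowwise_spearman(y_true, y_pred):
    """Mean of the per-row Spearman correlations (Pearson on ranks)."""
    scores = [
        pearson(average_ranks(t), average_ranks(p)) for t, p in zip(y_true, y_pred)
    ]
    return sum(scores) / len(scores)


y_true = [[1.0, 10.0, 100.0], [1.0, 2.0, 3.0]]
y_pred = [[2.0, 3.0, 50.0], [9.0, 5.0, 1.0]]

# Row 1 preserves the ordering (+1), row 2 reverses it (-1).
score = mean_rowwise_spearman(y_true, y_pred)
print(score)
```

Unlike the Pearson variant, only the ordering of the genes within a row matters here, so the first row scores +1 even though the predicted values are far from the actual ones.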
Mean Rowwise Cosine
The mean of cosine similarities per row (perturbation).
The Mean Cosine Similarity is computed as follows:
$$\textrm{Mean-Cosine} = \frac{1}{R}\sum_{i=1}^R\frac{\mathbf{y}_i\cdot \mathbf{\hat{y}}_i}{\|\mathbf{y}_i\| \|\mathbf{\hat{y}}_i\|}$$
where $R$ is the number of scored rows, and $\mathbf{y}_i$ and $\mathbf{\hat{y}}_i$ are the actual and predicted values, respectively, for row $i$.
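A small sketch of the row-wise cosine metric follows. The cosine similarity is undefined when either vector has zero norm (as happens for an all-zeros prediction); returning 0.0 in that case is an assumption of this sketch, and the function names are illustrative.

```python
import math


def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: score an all-zero row as 0 instead of raising an error.
        return 0.0
    return dot / (norm_x * norm_y)


def mean_rowwise_cosine(y_true, y_pred):
    """Mean of the per-row cosine similarities."""
    scores = [cosine_similarity(t, p) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


y_true = [[1.0, 0.0], [1.0, 0.0]]
y_pred = [[2.0, 0.0], [0.0, 3.0]]

# Row 1 points in the same direction (1.0), row 2 is orthogonal (0.0).
score = mean_rowwise_cosine(y_true, y_pred)
print(score)
```

Cosine similarity ignores the magnitude of the prediction, so the first row scores 1.0 even though the predicted value is twice the actual one.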
Quality control results
Category | Name | Value | Condition | Severity |
---|---|---|---|---|
Method info | Pct 'paper_reference' missing | 0.4166667 | percent_missing(method_info, field) | ✗✗ |
Metric info | Pct 'paper_reference' missing | 1.0000000 | percent_missing(metric_info, field) | ✗✗ |
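The Condition column refers to a `percent_missing` check whose implementation is not shown on this page. The sketch below is one plausible reading, in which the check returns the fraction of records whose field is absent or empty. The record layout and the 7-to-5 split are invented purely to show how a value such as 0.4166667 can arise; they are not taken from the benchmark's metadata.

```python
def percent_missing(records, field):
    """Fraction (between 0 and 1) of records whose `field` is absent or empty."""
    if not records:
        raise ValueError("records must not be empty")
    missing = sum(1 for record in records if not record.get(field))
    return missing / len(records)


# Hypothetical metadata: 7 entries with a reference, 5 without.
method_info = (
    [{"method_id": f"m{i}", "paper_reference": "some_ref"} for i in range(7)]
    + [{"method_id": f"c{i}"} for i in range(5)]
)

fraction = percent_missing(method_info, "paper_reference")
print(round(fraction, 7))
```

Under this reading, the reported values are fractions rather than percentages, so 1.0000000 means that every metric entry lacks a paper reference.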