Multimodal Data Integration

Alignment of cellular profiles from two different modalities

3 datasets · 5 methods · 2 control methods · 2 metrics

Task info Method info Metric info Dataset info Results

Cellular function is regulated by the complex interplay of different types of biological molecules (DNA, RNA, proteins, etc.), which determine the state of a cell. Several recently described technologies allow for simultaneous measurement of different aspects of cellular state. For example, sci-CAR jointly profiles RNA expression and chromatin accessibility on the same cell and CITE-seq measures surface protein abundance and RNA expression from each cell. These technologies enable us to better understand cellular function, however datasets are still rare and there are tradeoffs that these measurements make for to profile multiple modalities.

Joint methods can be more expensive or lower throughput or more noisy than measuring a single modality at a time. Therefore it is useful to develop methods that are capable of integrating measurements of the same biological system but obtained using different technologies on different cells.

Here the goal is to learn a latent space where cells profiled by different technologies in different modalities are matched if they have the same state. We use jointly profiled data as ground truth so that we can evaluate when the observations from the same cell acquired using different modalities are similar. A perfect result has each of the paired observations sharing the same coordinates in the latent space.

Summary

poss_dataset_ids = dataset_info
  .map(d => d.dataset_id)
  .filter(d => results.map(r => r.dataset_id).includes(d))
poss_method_ids = method_info
  .map(d => d.method_id)
  .filter(d => results.map(r => r.method_id).includes(d))
poss_metric_ids = metric_info
  .map(d => d.metric_id)
  .filter(d => results.map(r => Object.keys(r.scaled_scores)).flat().includes(d))

results_long = results.flatMap(d => {
  return Object.entries(d.scaled_scores).map(([metric_id, value]) =>
    ({
      method_id: d.method_id,
      dataset_id: d.dataset_id,
      metric_id: metric_id,
      score: value
    })
  )
}).filter(d => method_ids.includes(d.method_id) && metric_ids.includes(d.metric_id) && dataset_ids.includes(d.dataset_id))

results_resources = results.flatMap(d => {
  return ({
    method_id: d.method_id,
    dataset_id: d.dataset_id,
    ...d.resources
  })
})

function label_time(time) {
  if (time < 1e-5) return "0s";
  if (time < 1) return "<1s";
  if (time < 60) return `${Math.floor(time)}s`;
  if (time < 3600) return `${Math.floor(time / 60)}m`;
  if (time < 3600 * 24) return `${Math.floor(time / 3600)}h`;
  if (time < 3600 * 24 * 7) return `${Math.floor(time / 3600 / 24)}d`;
  return ">7d"; // Assuming missing values are encoded as NaN
}

function label_memory(x_mb, include_mb = true) {
  if (!include_mb && x_mb < 1e3) return "<1G";
  if (x_mb < 1) return "<1M";
  if (x_mb < 1e3) return `${Math.round(x_mb)}M`;
  if (x_mb < 1e6) return `${Math.round(x_mb / 1e3)}G`;
  if (x_mb < 1e9) return `${Math.round(x_mb / 1e6)}T`;
  return ">1P";
}

function aggregate_scores(obj) {
  return d3.mean(obj.map(val => {
    if (val.score === undefined || isNaN(val.score)) return 0;
    return Math.min(1, Math.max(0, val.score))
  }));
}

function mean_na_rm(x) {
  return d3.mean(x.filter(d => !isNaN(d)));
}

function transpose_list_of_objects(list) {
  return Object.fromEntries(Object.keys(list[0]).map(key => [key, list.map(d => d[key])]))
}

overall = d3.groups(results_long, d => d.method_id)
  .map(([method_id, values]) => ({method_id, mean_score: aggregate_scores(values)}))

per_dataset = d3.groups(results_long, d => d.method_id)
  .map(([method_id, values]) => {
    const datasets = d3.groups(values, d => d.dataset_id)
      .map(([dataset_id, values]) => ({["dataset_" + dataset_id]: aggregate_scores(values)}))
      .reduce((a, b) => ({...a, ...b}), {})
    return {method_id, ...datasets}
  })

per_metric = d3.groups(results_long, d => d.method_id)
  .map(([method_id, values]) => {
    const metrics = d3.groups(values, d => d.metric_id)
      .map(([metric_id, values]) => ({["metric_" + metric_id]: aggregate_scores(values)}))
      .reduce((a, b) => ({...a, ...b}), {})
    return {method_id, ...metrics}
  })

resources = d3.groups(results_resources, d => d.method_id)
  .map(([method_id, values]) => {
    const error_pct_oom = d3.mean(values, d => d.exit_code === 137)
    const error_pct_timeout = d3.mean(values, d => d.exit_code === 143)
    const error_pct_error = d3.mean(values, d => d.exit_code > 0) - error_pct_oom - error_pct_timeout
    const error_pct_ok = 1 - error_pct_oom - error_pct_timeout - error_pct_error
    const mean_peak_memory_mb = mean_na_rm(values.map(d => d.peak_memory_mb))
    const mean_disk_read_mb = mean_na_rm(values.map(d => d.disk_read_mb))
    const mean_disk_write_mb = mean_na_rm(values.map(d => d.disk_write_mb))
    const mean_duration_sec = mean_na_rm(values.map(d => d.duration_sec))
    return ({
      method_id,
      error_pct_error,
      error_pct_oom,
      error_pct_timeout,
      error_pct_ok,
      // error_reason: {
      //   "Memory limit exceeded": error_pct_oom,
      //   "Time limit exceeded": error_pct_timeout,
      //   "Execution error": error_pct_error,
      //   "No error": error_pct_ok
      // },
      error_reason: [error_pct_oom, error_pct_timeout, error_pct_error, error_pct_ok],
      mean_cpu_pct: mean_na_rm(values.map(d => d.cpu_pct)),
      mean_peak_memory_mb,
      mean_peak_memory_log: -Math.log10(mean_peak_memory_mb),
      mean_peak_memory_str: " " + label_memory(mean_peak_memory_mb) + " ",
      mean_disk_read_mb: mean_na_rm(values.map(d => d.disk_read_mb)),
      mean_disk_read_log: -Math.log10(mean_disk_read_mb),
      mean_disk_read_str: " " + label_memory(mean_disk_read_mb) + " ",
      mean_disk_write_mb: mean_na_rm(values.map(d => d.disk_write_mb)),
      mean_disk_write_log: -Math.log10(mean_disk_write_mb),
      mean_disk_write_str: " " + label_memory(mean_disk_write_mb) + " ",
      mean_duration_sec,
      mean_duration_log: -Math.log10(mean_duration_sec),
      mean_duration_str: " " + label_time(mean_duration_sec) + " "
    })
  })

summary_all = method_info
  .filter(d => show_con || !d.is_baseline)
  .filter(d => method_ids.includes(d.method_id))
  .map(method => {
    const method_id = method.method_id
    const method_name = method.method_name
    const mean_score = overall.find(d => d.method_id === method_id).mean_score
    const datasets = per_dataset.find(d => d.method_id === method_id)
    const metrics = per_metric.find(d => d.method_id === method_id)
    const resources_ = resources.find(d => d.method_id === method_id)
    return {method_id, method_name, mean_score, ...datasets, ...metrics, ...resources_}
  })
  .sort((a, b) => b.mean_score - a.mean_score)

// make sure the first entry contains all columns
column_info = [
  {id: "method_name", name: "Name", label: null, group: "method", geom: "text", palette: null},
  {id: "mean_score", name: "Score", group: "overall", geom: "bar", palette: "overall"},
  {id: "error_reason", name: "Error reason", group: "overall", geom: "pie", palette: "error_reason"},
  ...dataset_info
    .filter(d => dataset_ids.includes(d.dataset_id)).map(d => ({id: "dataset_" + d.dataset_id, name: d.dataset_name, group: "dataset", geom: "funkyrect", palette: "dataset"}))
    .sort((a, b) => a.name.localeCompare(b.name)),
  ...metric_info
    .filter(d => metric_ids.includes(d.metric_id)).map(d => ({id: "metric_" + d.metric_id, name: d.metric_name, group: "metric", geom: "funkyrect", palette: "metric"}))
    .sort((a, b) => a.name.localeCompare(b.name)),
  {id: "mean_cpu_pct", name: "%CPU", group: "resources", geom: "funkyrect", palette: "resources"},
  {id: "mean_peak_memory_log", name: "Peak memory", label: "mean_peak_memory_str", group: "resources", geom: "rect", palette: "resources"},
  {id: "mean_disk_read_log", name: "Disk read", label: "mean_disk_read_str", group: "resources", geom: "rect", palette: "resources"},
  {id: "mean_disk_write_log", name: "Disk write", label: "mean_disk_write_str", group: "resources", geom: "rect", palette: "resources"},
  {id: "mean_duration_log", name: "Duration", label: "mean_duration_str", group: "resources", geom: "rect", palette: "resources"}
].map(d => {
  if (d.id === "method_name") {
    return {...d, options: {width: 15, hjust: 0}}
  } else if (d.id === "is_baseline") {
    return {...d, options: {width: 1}}
  } else if (d.geom === "bar") {
    return {...d, options: {width: 4}}
  } else {
    return d
  }
})

column_groups = [
  {group: "method", palette: null, level1: ""},
  {group: "overall", palette: "overall", level1: "Overall"},
  {group: "error_reason", palette: "error_reason", level1: "Error reason"},
  {group: "dataset", palette: "dataset", level1: dataset_info.length >= 3 ? "Datasets" : ""},
  {group: "metric", palette: "metric", level1: metric_info.length >= 3 ? "Metrics" : ""},
  {group: "resources", palette: "resources", level1: "Resources"}
]

palettes = [
  {
    overall: "Greys",
    dataset: "Blues",
    metric: "Reds",
    resources: "YlOrBr",
    error_reason: {
      colors: ["#8DD3C7", "#FFFFB3", "#BEBADA", "#FFFFFF"],
      names: ["Memory limit exceeded", "Time limit exceeded", "Execution error", "No error"]
    }
  }
][0]

funkyheatmap(
    transpose_list_of_objects(summary_all),
    transpose_list_of_objects(column_info),
    [],
    transpose_list_of_objects(column_groups),
    [],
    palettes,
    {
        fontSize: 14,
        rowHeight: 26,
        rootStyle: 'max-width: none',
        colorByRank: color_by_rank,
        theme: {
            oddRowBackground: 'var(--bs-body-bg)',
            evenRowBackground: 'var(--bs-button-hover)',
            textColor: 'var(--bs-body-color)',
            strokeColor: 'var(--bs-body-color)',
            headerColor: 'var(--bs-body-color)',
            hoverColor: 'var(--bs-body-color)'
        }
    },
    scale_column
);

Figure 1: Overview of the results per method. This figures shows the mean of the scaled scores (group Overall), the mean scores per dataset (group Dataset) and the mean scores per metric (group Metric).

Display settings

viewof color_by_rank = Inputs.toggle({label: "Color by rank:", value: true})
viewof scale_column = Inputs.toggle({label: "Minmax column:", value: false})
viewof show_con = Inputs.toggle({label: "Show control methods:", value: true})

Filter datasets

viewof dataset_ids = Inputs.checkbox(
  dataset_info.filter(d => poss_dataset_ids.includes(d.dataset_id)),
  {
    keyof: d => d.dataset_name,
    valueof: d => d.dataset_id,
    value: dataset_info.map(d => d.dataset_id),
    label: "Datasets:"
  }
)

Filter methods

viewof method_ids = Inputs.checkbox(
  method_info.filter(d => poss_method_ids.includes(d.method_id)),
  {
    keyof: d => d.method_name,
    valueof: d => d.method_id,
    value: method_info.map(d => d.method_id),
    label: "Methods:"
  }
)

Filter metrics

viewof metric_ids = Inputs.checkbox(
  metric_info.filter(d => poss_metric_ids.includes(d.metric_id)),
  {
    keyof: d => d.metric_name,
    valueof: d => d.metric_id,
    value: metric_info.map(d => d.metric_id),
    label: "Metrics:"
  }
)

funkyheatmap = (await require('d3@7').then(d3 => {
  window.d3 = d3;
  window._ = _;
  return import('https://unpkg.com/funkyheatmapjs@0.2.5');
})).default;

Results

Results table of the scores per method, dataset and metric (after scaling). Use the filters to make a custom subselection of methods and datasets. The “Overall mean” dataset is the mean value across all datasets.

Dataset info

Show

CITE-seq Cord Blood Mononuclear Cells

8k cord blood mononuclear cells sequenced by CITEseq, a multimodal addition to the 10x scRNA-seq platform that allows simultaneous measurement of RNA and protein (Stoeckius et al. 2017).

sciCAR Cell Lines

5k cells from a time-series of dexamethasone treatment sequenced by sci-CAR, a combinatorial indexing-based co-assay that jointly profiles chromatin accessibility and mRNA (Cao et al. 2018).

sciCAR Mouse Kidney

11k cells from adult mouse kidney sequenced by sci-CAR, a combinatorial indexing-based co-assay that jointly profiles chromatin accessibility and mRNA (Cao et al. 2018).

Method info

Show

Harmonic Alignment (log scran)

Harmonic alignment embeds cellular data from each modality into a common space by computing a mapping between the 100-dimensional diffusion maps of each modality. This mapping is computed by computing an isometric transformation of the eigenmaps, and concatenating the resulting diffusion maps together into a joint 200-dimensional space. This joint diffusion map space is used as output for the task (Stanley et al. 2020). Links: Docs.

Harmonic Alignment (sqrt CP10k)

Mutual Nearest Neighbors (log CP10k)

Mutual nearest neighbors (MNN) embeds cellular data from each modality into a common space by computing a mapping between modality-specific 100-dimensional SVD embeddings. The embeddings are integrated using the FastMNN version of the MNN algorithm, which generates an embedding of the second modality mapped to the SVD space of the first. This corrected joint SVD space is used as output for the task (Haghverdi et al. 2018). Links: Docs.

Mutual Nearest Neighbors (log scran)

Procrustes superimposition

Procrustes superimposition embeds cellular data from each modality into a common space by aligning the 100-dimensional SVD embeddings to one another by using an isomorphic transformation that minimizes the root mean squared distance between points. The unmodified SVD embedding and the transformed second modality are used as output for the task (Gower 1975). Links: Docs.

Control method info

Show

Random Features

20-dimensional SVD is computed on the first modality, and is then randomly permuted twice, once for use as the output for each modality, producing random features with no correlation between modalities

True Features

20-dimensional SVD is computed on the first modality, and this same embedding is used as output for both modalities, producing perfectly aligned features from each modality

Metric info

Show

kNN Area Under the Curve

Let f(i) ∈ F be the scRNA-seq measurement of cell i, and g(i) ∈ G be the scATAC- seq measurement of cell i. kNN-AUC calculates the average percentage overlap of neighborhoods of f(i) in F with neighborhoods of g(i) in G. Higher is better (Stanley et al. 2020).

Mean squared error

Mean squared error (MSE) is the average distance between each pair of matched observations of the same cell in the learned latent space. Lower is better (Lance et al. 2022).

Quality control results

Show

✓ All checks succeeded!

Normalisation visualisation

Show

Authors

Cao, Junyue, Darren A. Cusanovich, Vijay Ramani, Delasa Aghamirzaie, Hannah A. Pliner, Andrew J. Hill, Riza M. Daza, et al. 2018. “Joint Profiling of Chromatin Accessibility and Gene Expression in Thousands of Single Cells.” Science 361 (6409): 1380–85. https://doi.org/10.1126/science.aau0730.

Gower, J. C. 1975. “Generalized Procrustes Analysis.” Psychometrika 40 (1): 33–51. https://doi.org/10.1007/bf02291478.

Haghverdi, Laleh, Aaron T L Lun, Michael D Morgan, and John C Marioni. 2018. “Batch Effects in Single-Cell RNA-Sequencing Data Are Corrected by Matching Mutual Nearest Neighbors.” Nature Biotechnology 36 (5): 421–27. https://doi.org/10.1038/nbt.4091.

Lance, Christopher, Malte D. Luecken, Daniel B. Burkhardt, Robrecht Cannoodt, Pia Rautenstrauch, Anna Laddach, Aidyn Ubingazhibov, et al. 2022. “Multimodal Single Cell Data Integration Challenge: Results and Lessons Learned.” bioRxiv. https://doi.org/10.1101/2022.04.11.487796.

Stanley, Jay S., Scott Gigante, Guy Wolf, and Smita Krishnaswamy. 2020. “Harmonic Alignment.” In Proceedings of the 2020 SIAM International Conference on Data Mining, 316–24. Society for Industrial; Applied Mathematics. https://doi.org/10.1137/1.9781611976236.36.

Stoeckius, Marlon, Christoph Hafemeister, William Stephenson, Brian Houck-Loomis, Pratip K Chattopadhyay, Harold Swerdlow, Rahul Satija, and Peter Smibert. 2017. “Simultaneous Epitope and Transcriptome Measurement in Single Cells.” Nature Methods 14 (9): 865–68. https://doi.org/10.1038/nmeth.4380.