Critical Assessment through Spatially-resolved Transcriptomics

Single-cell technologies have made it possible to identify novel cell types, considerably advancing our knowledge of a tissue's constituents. Cell types are typically derived by clustering transcriptomics data, but the quality and granularity of these clusters remain difficult to assess. In the absence of gold-standard information, benchmarks commonly validate predicted cell types against author annotations, implicitly favoring approaches that mimic the original study. In CAST (Critical Assessment through Spatially-resolved Transcriptomics), we shed new light on clustering assessment by taking advantage of the rise of multi-modal technologies. We assess the validity of cell types using a completely independent modality, the spatial location of neurons. This heuristic is rooted in classical neuroscience, where different classes of neurons have specific spatial patterns (layer or brain-region specificity) that also map to the neurons' functions and connectivity patterns. Our rationale is that when a type is split into subtypes, the subtypes should have more refined spatial patterns than the original type. The CAST challenge consists of identifying cell types in a BARseq2 dataset that jointly measures transcriptomic signatures of marker genes and physical locations of neurons across a whole mouse cortex. Participants are asked to identify cell types based on the expression data alone; the cell types will then be assessed through their spatial patterns in the held-out physical locations.

Challenge data, with access instructions for R and Python, is downloadable at https://labshare.cshl.edu/shares/gillislab/resource/CellTypeMapping_2022/

User information

To upload submissions, please provide a name and email address. You may be asked to verify your identity later. If you are having issues uploading your results, try a different web browser or feel free to email them to John Lee at johlee@cshl.edu



                      

                      


The Task:

The task consists of identifying cell types in a BARseq2 dataset that jointly measures transcriptomic signatures of marker genes and physical locations of neurons across a whole mouse cortex. The dataset contains two matrices with matched samples: an expression matrix (dimension 109 x 642,340) and a physical location matrix (dimension 3 x 642,340). The cell types should be identified based on the expression data and will be validated using the held-out physical locations. We strongly encourage participants to train and evaluate their methods on published datasets with joint expression and physical information (e.g., PMID: 34616063).


We propose two subtasks:

  • Subtask 1: Mapping cells to a reference dataset (see below). Each cell should be mapped to one of the types of the reference dataset or labeled as “unassigned” if it does not seem to correspond to any of the reference cell types.
  • Subtask 2: De novo clustering of cells. Each cell can be mapped to an arbitrary label. Each type should contain at least 20 cells.

  • Expected output

    For both subtasks, the expected output is a two-column CSV file containing one row per cell (642,340 rows). The first column contains the sample ID of the cell (the column names of the expression matrix); the second column contains the cell type label of the cell (one of the reference types or “unassigned” for subtask 1, an arbitrary label for subtask 2).
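The format above can be checked before uploading. The following is a minimal sketch, not an official validator: the function names `write_submission` and `check_submission` are hypothetical, and it assumes the CSV has no header row (the task description does not say whether one is expected). The `min_cells` argument corresponds to the 20-cell minimum of subtask 2, and `allowed_labels` to the reference types of subtask 1.

```python
import csv
import io
from collections import Counter


def write_submission(sample_ids, labels, handle):
    """Write one (sample_id, label) row per cell; assumes no header row."""
    if len(sample_ids) != len(labels):
        raise ValueError("need exactly one label per cell")
    csv.writer(handle, lineterminator="\n").writerows(zip(sample_ids, labels))


def check_submission(rows, expected_ids, allowed_labels=None, min_cells=None):
    """Return a list of problems found in parsed CSV rows (empty list = OK)."""
    problems = []
    if any(len(row) != 2 for row in rows):
        return ["every row must have exactly two columns"]
    ids = [row[0] for row in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicate sample IDs")
    if set(ids) != set(expected_ids):
        problems.append("sample IDs do not match the expression matrix columns")
    counts = Counter(row[1] for row in rows)
    if allowed_labels is not None:
        # Subtask 1: labels must be reference types or "unassigned".
        unknown = sorted(set(counts) - set(allowed_labels) - {"unassigned"})
        if unknown:
            problems.append(f"labels not in reference: {unknown}")
    if min_cells is not None:
        # Subtask 2: each type should contain at least `min_cells` cells.
        small = sorted(label for label, n in counts.items() if n < min_cells)
        if small:
            problems.append(f"types with fewer than {min_cells} cells: {small}")
    return problems


# Toy example with made-up sample IDs and labels.
toy_ids = [f"cell_{i}" for i in range(6)]
toy_labels = ["A", "A", "A", "B", "B", "B"]
buffer = io.StringIO()
write_submission(toy_ids, toy_labels, buffer)
toy_rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(check_submission(toy_rows, toy_ids, min_cells=20))
```

For a real submission, `expected_ids` would be the column names of the expression matrix and `min_cells=20` would be passed for subtask 2 only.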


    Spatial assessment of cell types

    To evaluate predictions, we will rely only on BARseq's spatial data. We start by computing a score for each cell type according to the following criterion: cells from the same type should be colocalized (knn_score). We will measure co-localization using a kNN classification approach, scored as an AUROC (1-vs-all binary classification). The overall score will be a combination of the following criteria:

  • Underclustering score: Cell type predictions should be as specific as possible. We define a cell type's specificity as 1/#cells. We further require that the proposed cell types display significant co-localization, as evidenced by their knn_score. The final score is the average knn_score, weighted by the cell type's specificity.
  • Overclustering score: Cell types should be distinct from their closest neighbors. To this end, we will compute a stratified version of the knn_score, where we restrict the kNN classification to neighboring types (supertypes from the same subclass for subtask 1, worst 1-vs-1 score for subtask 2). The final score is the average stratified_knn_score, weighted by the number of cells in each cell type.
  • CCF score (subtask 1 only): We will measure the fraction of cells that overlap with plausible CCF areas (manually curated based on the description of reference types).
  • As a guideline for participants, we will provide baseline performance for default algorithms and priors (single-marker based cell typing, standard single cell pipelines, standard clustering algorithms).
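To make the scoring criteria above concrete, here is a minimal sketch of one way a kNN-based AUROC and the two weighted averages could be computed. It is an illustration under stated assumptions, not the organizers' scoring code: the predictor used here (the fraction of a cell's k nearest spatial neighbors that carry the target label), the value of k, and all function names are assumptions. The stratified score, the significance requirement, and the way the criteria are combined into an overall score are not specified above and are not implemented.

```python
import math


def knn_votes(coords, labels, target, k):
    """Fraction of each cell's k nearest neighbors (self excluded) labeled `target`."""
    if k >= len(coords):
        raise ValueError("k must be smaller than the number of cells")
    votes = []
    for i, point in enumerate(coords):
        dists = sorted(
            (math.dist(point, other), j)
            for j, other in enumerate(coords) if j != i
        )
        votes.append(sum(labels[j] == target for _, j in dists[:k]) / k)
    return votes


def auroc(scores, is_positive):
    """AUROC from the Mann-Whitney U statistic, using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1  # 1-based average rank
        start = end + 1
    n_pos = sum(is_positive)
    n_neg = len(scores) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative cell")
    rank_sum = sum(r for r, pos in zip(ranks, is_positive) if pos)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def knn_score(coords, labels, target, k=3):
    """1-vs-all AUROC of the kNN vote for `target` (brute force, toy scale only)."""
    votes = knn_votes(coords, labels, target, k)
    return auroc(votes, [label == target for label in labels])


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def underclustering_score(knn_scores, cell_counts):
    """Average knn_score, weighted by specificity = 1 / #cells."""
    types = sorted(knn_scores)
    return weighted_mean([knn_scores[t] for t in types],
                         [1 / cell_counts[t] for t in types])


def overclustering_score(stratified_scores, cell_counts):
    """Average stratified score, weighted by the number of cells per type."""
    types = sorted(stratified_scores)
    return weighted_mean([stratified_scores[t] for t in types],
                         [cell_counts[t] for t in types])


# Toy data: two spatially separated clumps, and one interleaved line of cells.
clump_coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0),
                (10, 10, 0), (11, 10, 0), (10, 11, 0), (11, 11, 0)]
clump_labels = ["A"] * 4 + ["B"] * 4
line_coords = [(i, 0, 0) for i in range(8)]
line_labels = ["A", "B"] * 4
print(knn_score(clump_coords, clump_labels, "A"))   # well colocalized
print(knn_score(line_coords, line_labels, "A", k=2))  # poorly colocalized
```

The brute-force neighbor search is quadratic in the number of cells, so at the scale of the challenge data a spatial index such as a KD-tree would be needed. Note how the two weightings pull in opposite directions: specificity weighting rewards small, well-colocalized types, while cell-count weighting penalizes splitting large types into subtypes that cannot be told apart from their neighbors.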


    Data description:

    We generated a dataset containing 1,259,256 cells using the BARseq2 technology (PMID: 33972801). BARseq uses in situ sequencing to jointly profile the physical location and the expression profile of individual cells. To generate the data, we hemisected the brain of an 8-week-old male C57BL/6J mouse, then cut 40 coronal sections from the left hemisphere. Each section is 20 µm thick, with 200 µm spacing from slice to slice (a 180 µm gap between neighboring slices). Cell segmentation was done using Cellpose (PMID: 33318659), using the DAPI stain as the nuclear label and the gene sequencing signals as the cytoplasmic label. We recorded the expression of 109 marker genes selected to maximally resolve cortical Glutamatergic subtypes, although the markers are also detected in GABAergic and glial cells. Based on our preliminary analysis, the dataset contains 642,340 Glutamatergic, 427,939 GABAergic and 188,977 “other” cells (post QC). For the critical assessment, we suggest using only Glutamatergic cells.

    The dataset poses multiple challenges: a low number of measured genes, lower sensitivity compared to scRNAseq (a median of 27 genes detected and 53 reads per cell) and different per-gene sensitivities compared to scRNAseq (some highly detected genes become lowly detected). Despite these challenges, standard unsupervised analyses are sufficient to resolve coarse Glutamatergic types (e.g., CT, PT, L4/5 IT) and even finer Glutamatergic types (region-specific L4/5 IT types and finer sublayer types, e.g., within L2/3, validated using the physical location of cells), suggesting that most cells can be annotated with relatively high confidence. We were also able to annotate cell types that were not originally targeted by our marker set, e.g., glutamatergic types from the hippocampus, the entorhinal cortex, the piriform cortex, the amygdala and the thalamus.


    Reference dataset:

    For subtask 1, cells should be mapped to the glutamatergic supertypes of the isocortex and hippocampus taxonomy from Yao et al. 2021 (PMID: 34004146), which serves as the reference dataset.

    The reference taxonomy contains cell type descriptions at varying levels of granularity: 388 clusters (finest types), 101 supertypes (medium resolution) and 42 subclasses (coarse types). The data is a compendium of 39 scRNAseq datasets (21 using the SmartSeq4 technology, 18 using 10Xv2) totalling ~1.3M cells. Each dataset samples a specific cortical or hippocampal brain region (e.g., MOp, VISp, HIP). The final taxonomy uses an integrated version of all datasets, enabling the identification of shared and region-specific types.

    The reference dataset is a subset of the BARseq dataset in terms of sampling, i.e., all reference cell types should appear in the BARseq dataset. In practice, however, some of the finer types won't be resolvable because the BARseq data only contains the expression of 109 genes. An additional challenge is that some of the BARseq cell types are not present in the reference dataset (e.g., piriform, thalamus or amygdala cell types) and should thus be left unannotated (or annotated as novel cell types).
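One simple way to handle cells with no counterpart in the reference is to map each cell to its best-correlated reference centroid and fall back to “unassigned” below a cutoff. The sketch below illustrates that idea only; it is not the organizers' baseline. The function names, the use of Pearson correlation against per-type mean expression, the `min_corr` cutoff of 0.5, and the toy profiles are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def map_to_reference(cell_profile, centroids, min_corr=0.5):
    """Return the best-correlated reference type, or "unassigned" below the cutoff.

    `centroids` maps a reference type name to its mean expression over the
    same genes, in the same order, as `cell_profile`. `min_corr` is an
    arbitrary illustrative cutoff, not a recommended value.
    """
    best_type, best_corr = "unassigned", min_corr
    for name, centroid in centroids.items():
        corr = pearson(cell_profile, centroid)
        if corr > best_corr:
            best_type, best_corr = name, corr
    return best_type


# Toy reference with two made-up types measured on four made-up genes.
toy_centroids = {"type_A": [5, 1, 0, 0], "type_B": [0, 0, 4, 6]}
toy_cells = {
    "cell_0": [4, 2, 0, 1],  # resembles type_A
    "cell_1": [0, 1, 5, 5],  # resembles type_B
    "cell_2": [1, 1, 1, 1],  # flat profile, matches neither type
}
toy_mapping = {cid: map_to_reference(p, toy_centroids) for cid, p in toy_cells.items()}
print(toy_mapping)
```

Given the low per-cell read counts described above, a real pipeline would likely need normalization and a cutoff tuned on data with known labels; neither is covered by this sketch.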

    Reference Mapping leaderboard:

    De novo Clustering leaderboard: