7.1 Bulk deconvolution¶
TL;DR We provide a brief overview of the basic concepts of cell type deconvolution, including input structure, data preprocessing, and analysis of the output data.
Background¶
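There are several important reasons for studying differences in cell type composition in bulk tissues. First, the interplay between cell types plays an important role in disease progression and recovery. Second, molecular patterns such as gene expression and protein abundance are often directly associated with the cell type composition of the tissue. Understanding these compositions is essential for investigating the molecular mechanisms underlying biological conditions such as disease. Third, uncovering disease-specific cell type patterns can better guide the selection of therapeutic targets, which is of considerable clinical utility.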
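Cell type deconvolution is a computational framework for inferring the composition of cell populations within complex, heterogeneous tissues {cite}Kuhn2012,Schwartz2010,Du2019,Zaitsev2019. Because measuring these compositions experimentally is time-consuming and expensive, deconvolution methods enable large-scale analysis of cell populations based on molecular data. Deconvolution is commonly formulated as a linear regression problem: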
$$y = bX$$
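where $y$ is the mixture of heterogeneous gene expression profiles measured with common molecular pipelines such as microarrays or RNA-seq, $X$ is the signature matrix containing homogeneous cell-type-specific profiles, and $b$ is the vector of cell proportions in the mixture that the deconvolution method infers {cite}Baron2016. To select the best deconvolution method for the biological condition of interest, several technical and biological factors should be considered, including the deconvolution method itself, missing or rare cell types in the reference data, data normalization, and the choice of features (markers).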
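To make the linear model concrete, the following minimal sketch simulates a noiseless mixture and recovers the proportions with non-negative least squares. It is not how any of the tools discussed in this chapter is implemented; the signature matrix, the proportions, and all variable names are made up for illustration, and $b$ is treated as a row vector so that $y = bX$ matches the equation above.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

n_cell_types, n_genes = 3, 50
# Hypothetical signature matrix X (cell types x genes) in linear scale.
X = rng.gamma(shape=2.0, scale=50.0, size=(n_cell_types, n_genes))

# Assumed ground-truth proportions b and the noiseless bulk profile y = bX.
b_true = np.array([0.6, 0.3, 0.1])
y = b_true @ X

# nnls solves min ||A z - y||_2 subject to z >= 0, so we pass A = X.T.
b_hat, residual_norm = nnls(X.T, y)
b_hat = b_hat / b_hat.sum()  # rescale so that the proportions sum to one

print(np.round(b_hat, 3))
```

Real data adds noise, platform differences, and missing cell types, which is why the dedicated methods described below use weighting and regularization on top of this basic idea.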
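The signature matrix $X$ used to infer cell composition reflects our best knowledge of the cellular heterogeneity within the tissue and strongly affects the success of the deconvolution process {cite}Aliee2021. Initially, signature matrices were generated by sorting cells from heterogeneous tissues (using FACS or CyTOF), which introduced inherent biases due to preselected cell type panels and the lack of suitable antibodies {cite}Monaco2019,Aran2017. Today, these matrices are mostly built from unbiased profiles generated with single-cell technologies, which allows signature matrices to be created across different organisms, tissues, and biological conditions {cite}Aliee2021,Newman2019.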
Approaches¶
Bulk deconvolution methods can be grouped into linear-regression-based, enrichment-based, non-linear deep-learning-based, and other approaches.
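The most common cell type deconvolution methods are based on linear regression. These methods try to solve the $y = bX$ equation directly, using different regularization schemes, and rely on a relatively large number of features. Examples of such tools include CIBERSORTx {cite}Newman2019, MuSiC {cite}WangXuran2019, dtangle {cite}Hunt2018, and DWLS {cite}Tsoucas2019.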
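Enrichment-based tools, on the other hand, compute an enrichment score for each cell type separately, based on a gene set that represents that cell type. The enrichment scores of all cell types are then combined and converted into compositions by a method-specific transformation function. Because these methods consider only one cell type at a time, they provide meaningful insights in simple cases but are less accurate when the reference data contains many cell types. xCell {cite}Aran2017 is an example of an enrichment-based tool.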
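A third option is non-linear deep learning methods, which aim to improve deconvolution accuracy while trying to retain high biological interpretability. It is currently too early to say whether these methods can outperform the other types of approaches. Scaden {cite}Menden2020 is an example of a deep-learning-based method.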
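Bulk deconvolution methods have been benchmarked extensively, and the results are largely consistent {cite}ShenOrr2013,Cobos2020,Jin2021,Nadel2021. Methods that use single-cell RNA-seq data perform well, whereas semi-supervised methods show higher error rates. Omitting cell types from the reference that are present in the mixture leads to worse results {cite}Cobos2020. Cobos et al. recommend the following: (1) keep the input data in linear scale; (2) avoid row scaling, column min-max scaling, column z-scoring, and quantile normalization; (3) regression-based bulk deconvolution methods such as CIBERSORTx or FARDEEP perform well, and if single-cell RNA-seq data is available, DWLS, MuSiC, or SCDC should be run as well to compare results; (4) use a stringent marker selection strategy that focuses on the difference between the two cell types with the highest expression values; and (5) use a comprehensive reference matrix that includes all relevant cell types present in the mixture {cite}Cobos2020.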
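Recommendation (1) follows from the linear model itself: a bulk profile is a weighted sum of cell-type profiles only in linear scale. The short check below uses made-up numbers (the signature, the proportions, and the variable names are all hypothetical) to show that the log of a mixture is not the mixture of the logs, so a linear regression on log-transformed data no longer fits the model it assumes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear-scale signature: 2 cell types x 5 genes.
X = rng.gamma(shape=2.0, scale=50.0, size=(2, 5))
b = np.array([0.7, 0.3])  # assumed true proportions

y_linear = b @ X                     # the bulk profile in linear scale
log_of_mixture = np.log1p(y_linear)  # what we get by log-transforming the bulk
mixture_of_logs = b @ np.log1p(X)    # what a linear model on log data assumes

# Because log1p is concave, the first is never smaller than the second.
gap = log_of_mixture - mixture_of_logs
print(np.round(gap, 4))
```

The gap is zero only for genes whose expression is identical in all cell types, which are exactly the genes that carry no information for deconvolution.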
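Regarding the impact of normalization strategies, Li et al. {cite}Li2016 reported that the normalization strategy has a large effect on the results, whereas the study by Cobos et al. could not confirm this {cite}Cobos2020.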
Because many bulk deconvolution tools, including the well-performing CIBERSORTx, are only available as web tools, we chose MuSiC to demonstrate a use case.
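Note that MuSiC requires multiple single-cell samples that contain the same cell types. In practice, the reference data may comprise only a single sample, or some cell types may be missing from several samples. In these cases, MuSiC will fail, and a different deconvolution method has to be chosen.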
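Before running MuSiC, it can help to check how many samples contribute cells to each cell type. The helper below is a hypothetical sketch that is not part of MuSiC; it only assumes a cell-level metadata table with one column for the sample and one for the cell type (for example `sampleID` and `cluster_labels_res.0.8` in the `obs` table used later). The threshold of two samples is an assumption chosen for illustration.

```python
import pandas as pd


def samples_per_cell_type(
    obs: pd.DataFrame, sample_key: str, cell_type_key: str
) -> pd.Series:
    """Count, for each cell type, the number of distinct samples that contain it."""
    return obs.groupby(cell_type_key, observed=True)[sample_key].nunique()


# Toy metadata: NK cells occur in a single sample only.
obs = pd.DataFrame(
    {
        "sampleID": ["s1", "s1", "s2", "s2", "s3", "s3"],
        "cell_type": ["B", "T", "B", "T", "B", "NK"],
    }
)
counts = samples_per_cell_type(obs, "sampleID", "cell_type")
single_sample_types = counts[counts < 2].index.tolist()
print(counts)
print(single_sample_types)
```

Cell types flagged this way are candidates for removal, for merging with a related type, or a reason to switch to a method that does not rely on cross-sample variance.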
Deconvolving bulk COVID-19 whole blood samples¶
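In the following use case, we apply MuSiC to deconvolve 49 whole blood RNA-seq samples collected from 39 COVID-19 patients and 10 healthy controls {cite}Aschenbrenner2020. As the single-cell reference dataset, we use whole blood single-cell RNA-seq data from COVID-19 patients {cite}Schulte-Schrepping2020.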
Environment setup¶
import scanpy as sc
import anndata
import numpy as np
import pandas as pd
import scipy as sci
import scipy.stats  # make sure the stats submodule is loaded for sci.stats.ttest_ind
Loading Data¶
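We start by reading the single-cell and bulk data. We do not scale the data, because studies have shown that deconvolution methods that use scRNA-seq data as a reference perform best on linear-scale data and that accuracy improves after library-size normalization {cite}Jin2021,Cobos2020.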
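As a reminder of what library-size normalization does, here is a minimal NumPy sketch. It assumes a dense matrix with one row per cell (or sample) and one column per gene; the function name and the target sum of 10,000 are illustrative choices, not requirements of any particular tool. The data stays in linear scale because no log transform is applied.

```python
import numpy as np


def library_size_normalize(counts: np.ndarray, target_sum: float = 1e4) -> np.ndarray:
    """Scale each row so that its counts sum to target_sum (no log transform)."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=1, keepdims=True)
    lib_sizes[lib_sizes == 0] = 1.0  # rows without any counts stay all zero
    return counts / lib_sizes * target_sum


counts = np.array([[10, 30, 60], [1, 1, 2], [0, 0, 0]])
normalized = library_size_normalize(counts)
print(normalized)
```

Note that the bulk table loaded below stores genes in rows and samples in columns, so it would have to be transposed before applying a row-wise function like this one.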
data_file = "/storage/groups/ml01/workspace/amit.frishberg/OriginalData/"
adata = sc.read(data_file + "seurat_COVID19_freshWB_PBMC_cohort2_incl_raw.h5ad")
adata.X = adata.layers["counts"]
adata = adata[adata.obs["cells"] == "Whole_blood"].copy()
adata
AnnData object with n_obs × n_vars = 89883 × 33417
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'percent.mito', 'percent.hb', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification', 'HTO_classification.global', 'hash.ID', 'demultID', 'donor', 'onset_of_symptoms', 'days_after_onset', 'sampleID', 'date_of_sampling', 'experiment', 'cartridge', 'platform', 'purification', 'cells', 'age', 'sex', 'group_per_sample', 'who_per_sample', 'disease_stage', 'diagnosis', 'oxygen', 'outcome', 'comorbidities', 'COVID.19.related_medication_and_anti.microbials', 'primary_complaint', 'RNA_snn_res.0.8', 'cluster_labels_res.0.8', 'new.order', 'hpca.labels', 'blueprint.labels', 'monaco.labels', 'immune.labels', 'dmap.labels', 'hemato.labels'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
    obsm: 'X_pca', 'X_umap'
    layers: 'counts'
adata.obs["cluster_labels_res.0.8"].value_counts()
Neutrophils_1             22714
Neutrophils_2             18675
Neutrophils_3              9986
CD4_T_cells_1              6278
CD14_Monocytes_1           5362
Neutrophils_4              4710
CD8_T_cells                3255
Megakaryocytes             3180
NK_cells                   2916
B_cells_1                  2561
CD4_T_cells_2              1917
Mixed_cells                1727
Immature Neutrophils_1     1317
CD16_Monocytes              983
Immature Neutrophils_2      981
CD14_Monocytes_3            632
Eosinophils                 579
CD14_Monocytes_2            556
CD4_T_cells_3               409
Plasmablast                 390
Prol. cells                 247
mDC                         235
B_cells_2                   138
pDC                          86
CD34+ GATA2+ cells           49
Name: cluster_labels_res.0.8, dtype: int64
bulk = pd.read_csv(data_file + "BulkSmall.txt", sep="\t", index_col=0)
metadata = pd.read_csv(data_file + "annoSmall.txt", sep="\t", index_col=0)
metadata.index = metadata.index.astype("str")
metadata = metadata.loc[bulk.transpose().index]
Data preprocessing¶
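After reading the data, we need to remove undefined cells from the single-cell reference. Here, we remove mixed cells and proliferating cells, because they are not restricted to a single cell type.
Note: Quality control has already been performed on the single-cell data by removing low-quality cells and genes. We therefore skip this step.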
adata = adata[
    ~adata.obs["cluster_labels_res.0.8"].isin(["None", "Mixed_cells", "Prol. cells"])
].copy()
ct_counts = adata.obs["cluster_labels_res.0.8"].value_counts()
adata.obs["cluster_labels_res.0.8"].value_counts()
Neutrophils_1             22714
Neutrophils_2             18675
Neutrophils_3              9986
CD4_T_cells_1              6278
CD14_Monocytes_1           5362
Neutrophils_4              4710
CD8_T_cells                3255
Megakaryocytes             3180
NK_cells                   2916
B_cells_1                  2561
CD4_T_cells_2              1917
Immature Neutrophils_1     1317
CD16_Monocytes              983
Immature Neutrophils_2      981
CD14_Monocytes_3            632
Eosinophils                 579
CD14_Monocytes_2            556
CD4_T_cells_3               409
Plasmablast                 390
mDC                         235
B_cells_2                   138
pDC                          86
CD34+ GATA2+ cells           49
Name: cluster_labels_res.0.8, dtype: int64
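Studies have shown that deconvolution methods often fail to predict the proportions of rare cell types {cite}Tsoucas2019. If you want to remove rare cell types, you can restrict the reference to cell types with more than `rare_ct_cut_off` cells. This cutoff is user-defined and should be chosen based on the data.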
# removing very rare cells
rare_ct_cut_off = 50 # This is a user-specific parameter based on the data
ct_to_keep = ct_counts[ct_counts > rare_ct_cut_off].index
adata = adata[adata.obs["cluster_labels_res.0.8"].isin(ct_to_keep)].copy()
adata.obs["cluster_labels_res.0.8"].value_counts()
Neutrophils_1             22714
Neutrophils_2             18675
Neutrophils_3              9986
CD4_T_cells_1              6278
CD14_Monocytes_1           5362
Neutrophils_4              4710
CD8_T_cells                3255
Megakaryocytes             3180
NK_cells                   2916
B_cells_1                  2561
CD4_T_cells_2              1917
Immature Neutrophils_1     1317
CD16_Monocytes              983
Immature Neutrophils_2      981
CD14_Monocytes_3            632
Eosinophils                 579
CD14_Monocytes_2            556
CD4_T_cells_3               409
Plasmablast                 390
mDC                         235
B_cells_2                   138
pDC                          86
Name: cluster_labels_res.0.8, dtype: int64
We need to filter for the shared genes across bulk and single-cell data before selecting highly variable genes.
bulk_sc_genes = np.intersect1d(bulk.index, adata.var_names)
bulk = bulk.loc[bulk_sc_genes, :].copy()
adata = adata[:, bulk_sc_genes].copy()
Visualization of single-cell data¶
For visualization of single-cell data, we first normalize the counts, select highly-variable genes and log transform the data.
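# Note: sc.pp.normalize_per_cell is deprecated in recent Scanpy releases in favor of sc.pp.normalize_total
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4, copy=False)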
sc.pp.highly_variable_genes(adata, flavor="cell_ranger", n_top_genes=5000)
adata_log = sc.pp.log1p(
    adata, copy=True
)  # logged counts are only used for visualisation (can also work with layers)
We then reduce the dimensionality using PCA and visualize the data using UMAP:
sc.tl.pca(adata_log)
adata_log.obsm["X_pca"] *= -1 # multiply by -1 to match Seurat
sc.pp.neighbors(adata_log, n_neighbors=30)
sc.tl.umap(adata_log)
sc.pl.umap(adata_log, color="cluster_labels_res.0.8")
Deconvolving using MuSiC¶
Loading R¶
# R interface
from rpy2.robjects import pandas2ri
from rpy2.robjects import r
import rpy2.rinterface_lib.callbacks
import anndata2ri
pandas2ri.activate()
anndata2ri.activate()
%load_ext rpy2.ipython
To use the R scripts, we need to convert the data from Python to R. In some cases, this requires subsampling the data to avoid memory limits.
import random
import itertools
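downSamplingSize = 80  # maximum number of cells to keep per cell type
downSamplingIndexes = [
    random.sample(
        np.where(currCell == adata.obs["cluster_labels_res.0.8"])[0].tolist(),
        # int() so that random.sample receives a plain Python integer
        int(
            np.min(
                [downSamplingSize, np.sum(currCell == adata.obs["cluster_labels_res.0.8"])]
            )
        ),
    )
    for currCell in np.unique(adata.obs["cluster_labels_res.0.8"])
]
# flatten the per-cell-type index lists into a single list
downSamplingIndexes = list(itertools.chain(*downSamplingIndexes))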
adata_r = adata[downSamplingIndexes].copy()
adata_r
AnnData object with n_obs × n_vars = 1809 × 26807
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'percent.mito', 'percent.hb', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification', 'HTO_classification.global', 'hash.ID', 'demultID', 'donor', 'onset_of_symptoms', 'days_after_onset', 'sampleID', 'date_of_sampling', 'experiment', 'cartridge', 'platform', 'purification', 'cells', 'age', 'sex', 'group_per_sample', 'who_per_sample', 'disease_stage', 'diagnosis', 'oxygen', 'outcome', 'comorbidities', 'COVID.19.related_medication_and_anti.microbials', 'primary_complaint', 'RNA_snn_res.0.8', 'cluster_labels_res.0.8', 'new.order', 'hpca.labels', 'blueprint.labels', 'monaco.labels', 'immune.labels', 'dmap.labels', 'hemato.labels'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
    obsm: 'X_pca', 'X_umap'
    layers: 'counts'
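The downsampling above draws a different subset on every run because the random generator is not seeded. The following sketch shows one way to make the per-cell-type subsampling reproducible. The function name and the toy labels are hypothetical, and it returns positional indices that could then be used to subset the AnnData object, for example with `adata[idx].copy()`.

```python
import numpy as np
import pandas as pd


def downsample_per_group(labels, max_per_group: int, seed: int = 0) -> np.ndarray:
    """Return sorted positional indices with at most max_per_group cells per label."""
    rng = np.random.default_rng(seed)  # fixed seed makes the subset reproducible
    labels = np.asarray(labels)
    keep = []
    for label in pd.unique(labels):
        positions = np.flatnonzero(labels == label)
        n_keep = min(max_per_group, positions.size)
        keep.append(rng.choice(positions, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))


# Toy labels: 5 cells of type A, 2 of type B, 10 of type C.
labels = pd.Series(["A"] * 5 + ["B"] * 2 + ["C"] * 10)
idx = downsample_per_group(labels, max_per_group=3)
print(labels.iloc[idx].value_counts())
```

Cell types with fewer cells than the cap are kept in full, which mirrors the behavior of the original list comprehension.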
If subsampling is not needed (the data is relatively small), the whole AnnData object can simply be converted to an R object.
adata_r = adata.copy()
adata_r
AnnData object with n_obs × n_vars = 87860 × 26807
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'percent.mito', 'percent.hb', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification', 'HTO_classification.global', 'hash.ID', 'demultID', 'donor', 'onset_of_symptoms', 'days_after_onset', 'sampleID', 'date_of_sampling', 'experiment', 'cartridge', 'platform', 'purification', 'cells', 'age', 'sex', 'group_per_sample', 'who_per_sample', 'disease_stage', 'diagnosis', 'oxygen', 'outcome', 'comorbidities', 'COVID.19.related_medication_and_anti.microbials', 'primary_complaint', 'RNA_snn_res.0.8', 'cluster_labels_res.0.8', 'new.order', 'hpca.labels', 'blueprint.labels', 'monaco.labels', 'immune.labels', 'dmap.labels', 'hemato.labels', 'n_counts'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg'
    obsm: 'X_pca', 'X_umap'
    layers: 'counts'
Running MuSiC¶
%%R
library(MuSiC)
library(Biobase)
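We first collect all remaining inputs required to run MuSiC:
- adata_r - the single-cell matrix that will be used as the reference.
- cell_subsets_r - the cell type identities.
- bulk - the bulk samples to deconvolve.
- sc_genes - the genes used in the MuSiC analysis.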
cell_subsets_r = adata_r.obs["cluster_labels_res.0.8"].astype(str).copy()
sc_genes = adata_r.var_names
MuSiC runs the deconvolution framework one sample at a time. Here, 9*** are the sample names.
%%R -i adata_r,cell_subsets_r,bulk,sc_genes -o musicRes
df = data.frame(cellNames = cell_subsets_r, Sample = factor(rep(1, dim(adata_r@colData)[1])))
row.names(df) = row.names(adata_r@colData)
df = new("AnnotatedDataFrame", data = df) # Cell type identities are stored as an AnnotatedDataFrame
# Creating an ExpressionSet from the single cell matrix
scDataMatrix = Matrix::as.matrix(adata_r@assays@data@listData[[1]])
row.names(scDataMatrix) = sc_genes
scDataMatrix = scDataMatrix[rowSums(scDataMatrix)>0,] # Removing genes with no reads
SCDataES <- Biobase::ExpressionSet(assayData=scDataMatrix,phenoData = df, protocolData = df)
bulkDataES <- Biobase::ExpressionSet(assayData=as.matrix(bulk)) # Creating an ExpressionSet from the bulk matrix
musicRes = MuSiC::music_prop(bulk.eset = bulkDataES, sc.eset = SCDataES, clusters = 'cellNames') # Running MuSiC
R[write to console]: Creating Relative Abundance Matrix... R[write to console]: Creating Variance Matrix... R[write to console]: Creating Library Size Matrix... R[write to console]: Used 20407 common genes... R[write to console]: Used 23 cell types in deconvolution... R[write to console]: 9088 has common genes 18250 ... R[write to console]: 9089 has common genes 17774 ... R[write to console]: 9091 has common genes 18598 ... R[write to console]: 9092 has common genes 16184 ... R[write to console]: 9093 has common genes 18585 ... R[write to console]: 9094 has common genes 18701 ... R[write to console]: 9095 has common genes 18646 ... R[write to console]: 9096 has common genes 17247 ... R[write to console]: 9097 has common genes 17922 ... R[write to console]: 9098 has common genes 17987 ... R[write to console]: 9099 has common genes 18223 ... R[write to console]: 9100 has common genes 18786 ... R[write to console]: 9101 has common genes 16458 ... R[write to console]: 9102 has common genes 17604 ... R[write to console]: 9103 has common genes 17429 ... R[write to console]: 9104 has common genes 18118 ... R[write to console]: 9105 has common genes 15286 ... R[write to console]: 9106 has common genes 17237 ... R[write to console]: 9107 has common genes 15735 ... R[write to console]: 9108 has common genes 17387 ... R[write to console]: 9109 has common genes 16863 ... R[write to console]: 9110 has common genes 17736 ... R[write to console]: 9112 has common genes 17513 ... R[write to console]: 9113 has common genes 14950 ... R[write to console]: 9114 has common genes 17318 ... R[write to console]: 9116 has common genes 14518 ... R[write to console]: 9117 has common genes 13708 ... R[write to console]: 9118 has common genes 17526 ... R[write to console]: 9119 has common genes 17274 ... R[write to console]: 9120 has common genes 16520 ... R[write to console]: 9121 has common genes 15434 ... R[write to console]: 9165 has common genes 16907 ... 
R[write to console]: 9166 has common genes 16534 ... R[write to console]: 9167 has common genes 17115 ... R[write to console]: 9168 has common genes 16621 ... R[write to console]: 9169 has common genes 17251 ... R[write to console]: 9170 has common genes 17353 ... R[write to console]: 9171 has common genes 14603 ... R[write to console]: 9172 has common genes 14141 ... R[write to console]: 9122 has common genes 17865 ... R[write to console]: 9123 has common genes 18009 ... R[write to console]: 9124 has common genes 18390 ... R[write to console]: 9125 has common genes 17235 ... R[write to console]: 9126 has common genes 18432 ... R[write to console]: 9127 has common genes 17662 ... R[write to console]: 9128 has common genes 16875 ... R[write to console]: 9129 has common genes 17396 ... R[write to console]: 9130 has common genes 19469 ... R[write to console]: 9131 has common genes 18756 ...
# Create the final output matrix
music_frac = pd.DataFrame(musicRes[0])
music_frac.index = bulk.columns
music_frac.columns = np.unique(cell_subsets_r)
Outputs and Validations¶
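The main output of all cell type deconvolution methods is an NxM matrix, where:
- N (rows) is the number of samples
- M (columns) is the number of cell types
Each entry of the matrix gives the share of a particular cell type in a particular sample. In most cases, the cell type composition of a sample is reported as fractions, which are therefore non-negative and sum to one.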
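These two properties are easy to verify before any downstream analysis. The helper below is a small sketch with a hypothetical name and toy values; it assumes a samples-by-cell-types DataFrame shaped like the output described above and reports, per sample, whether the fractions are non-negative and sum to one within a tolerance.

```python
import numpy as np
import pandas as pd


def check_fractions(frac: pd.DataFrame, atol: float = 1e-6) -> pd.DataFrame:
    """Report per sample whether fractions are non-negative and sum to one."""
    row_sums = frac.sum(axis=1)
    return pd.DataFrame(
        {
            "row_sum": row_sums,
            "non_negative": (frac >= 0).all(axis=1),
            "sums_to_one": np.isclose(row_sums, 1.0, atol=atol),
        }
    )


# Toy output: s2 contains a negative value, s3 does not sum to one.
frac = pd.DataFrame(
    [[0.2, 0.5, 0.3], [0.6, 0.6, -0.2], [0.1, 0.1, 0.1]],
    index=["s1", "s2", "s3"],
    columns=["B", "T", "NK"],
)
report = check_fractions(frac)
print(report)
```

Methods that return scores rather than fractions will fail the second check by design, so a failed check is a prompt to read the method's documentation rather than proof of an error.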
music_frac
Sample | B_cells_1 | B_cells_2 | CD14_Monocytes_1 | CD14_Monocytes_2 | CD14_Monocytes_3 | CD16_Monocytes | CD34+ GATA2+ cells | CD4_T_cells_1 | CD4_T_cells_2 | CD4_T_cells_3 | ... | Immature Neutrophils_2 | Megakaryocytes | NK_cells | Neutrophils_1 | Neutrophils_2 | Neutrophils_3 | Neutrophils_4 | Plasmablast | mDC | pDC |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9088 | 0.025266 | 0.000007 | 0.009212 | 0.007646 | 0.004111 | 0.000000 | 0.021343 | 0.055596 | 0.0 | 0.022603 | ... | 0.004530 | 0.037775 | 0.035206 | 0.181982 | 0.000000 | 0.000000 | 0.096936 | 0.007348 | 0.0 | 0.000000 |
9089 | 0.000000 | 0.000000 | 0.011156 | 0.013301 | 0.000000 | 0.000000 | 0.011891 | 0.000000 | 0.0 | 0.000000 | ... | 0.000594 | 0.012610 | 0.000000 | 0.402237 | 0.000000 | 0.000000 | 0.117283 | 0.000086 | 0.0 | 0.000000 |
9091 | 0.000091 | 0.000000 | 0.005645 | 0.005848 | 0.000309 | 0.000247 | 0.003656 | 0.000000 | 0.0 | 0.000000 | ... | 0.000000 | 0.063919 | 0.011228 | 0.343444 | 0.000000 | 0.002390 | 0.155726 | 0.000936 | 0.0 | 0.000003 |
9092 | 0.004650 | 0.000000 | 0.000000 | 0.000000 | 0.002024 | 0.000000 | 0.003379 | 0.000000 | 0.0 | 0.000000 | ... | 0.000000 | 0.091188 | 0.004478 | 0.294886 | 0.000000 | 0.000000 | 0.175727 | 0.001784 | 0.0 | 0.000000 |
9093 | 0.008431 | 0.000000 | 0.000000 | 0.000000 | 0.032441 | 0.000000 | 0.004801 | 0.020848 | 0.0 | 0.001814 | ... | 0.000000 | 0.025210 | 0.033820 | 0.100954 | 0.090013 | 0.000000 | 0.276946 | 0.000504 | 0.0 | 0.000000 |
9094 | 0.099594 | 0.000000 | 0.000425 | 0.003925 | 0.000000 | 0.000000 | 0.016660 | 0.016462 | 0.0 | 0.027290 | ... | 0.002317 | 0.000000 | 0.029776 | 0.020330 | 0.082443 | 0.000000 | 0.128015 | 0.000000 | 0.0 | 0.000000 |
9095 | 0.001680 | 0.000000 | 0.000000 | 0.048237 | 0.000000 | 0.000000 | 0.006444 | 0.000000 | 0.0 | 0.000000 | ... | 0.000000 | 0.017268 | 0.017518 | 0.278576 | 0.000000 | 0.000000 | 0.217223 | 0.000744 | 0.0 | 0.000000 |
9096 | 0.019351 | 0.000000 | 0.000785 | 0.028895 | 0.000000 | 0.000000 | 0.007645 | 0.015595 | 0.0 | 0.000000 | ... | 0.000000 | 0.003820 | 0.060052 | 0.277122 | 0.061254 | 0.000000 | 0.026074 | 0.002298 | 0.0 | 0.000000 |
9097 | 0.015769 | 0.000000 | 0.000399 | 0.031298 | 0.000000 | 0.000000 | 0.009512 | 0.028784 | 0.0 | 0.000000 | ... | 0.000000 | 0.005165 | 0.061176 | 0.275073 | 0.036719 | 0.000000 | 0.056550 | 0.005490 | 0.0 | 0.000000 |
9098 | 0.015624 | 0.000000 | 0.015810 | 0.011449 | 0.004735 | 0.000000 | 0.011961 | 0.038063 | 0.0 | 0.007417 | ... | 0.000000 | 0.012727 | 0.051597 | 0.230531 | 0.002209 | 0.000000 | 0.022288 | 0.013302 | 0.0 | 0.000000 |
9099 | 0.015419 | 0.000000 | 0.000125 | 0.009149 | 0.026507 | 0.000000 | 0.013332 | 0.088082 | 0.0 | 0.022198 | ... | 0.000000 | 0.013670 | 0.043131 | 0.073787 | 0.085893 | 0.000000 | 0.084937 | 0.001153 | 0.0 | 0.000000 |
9100 | 0.038787 | 0.000000 | 0.000000 | 0.006566 | 0.000000 | 0.000000 | 0.014090 | 0.035495 | 0.0 | 0.002291 | ... | 0.000648 | 0.000000 | 0.038413 | 0.170412 | 0.044578 | 0.000000 | 0.078720 | 0.008307 | 0.0 | 0.000000 |
9101 | 0.003037 | 0.000000 | 0.000000 | 0.025247 | 0.000000 | 0.000000 | 0.002739 | 0.000516 | 0.0 | 0.000000 | ... | 0.000000 | 0.063421 | 0.000801 | 0.360854 | 0.005524 | 0.000000 | 0.111516 | 0.000328 | 0.0 | 0.000000 |
9102 | 0.008490 | 0.000000 | 0.000000 | 0.025538 | 0.000577 | 0.000000 | 0.010780 | 0.028101 | 0.0 | 0.000000 | ... | 0.000156 | 0.025812 | 0.037690 | 0.284248 | 0.054286 | 0.000000 | 0.014266 | 0.001326 | 0.0 | 0.000000 |
9103 | 0.018500 | 0.013523 | 0.000000 | 0.013188 | 0.000000 | 0.000000 | 0.008010 | 0.079577 | 0.0 | 0.022855 | ... | 0.000000 | 0.001177 | 0.058535 | 0.160046 | 0.053514 | 0.000065 | 0.028745 | 0.000483 | 0.0 | 0.000000 |
9104 | 0.035979 | 0.000430 | 0.000000 | 0.021775 | 0.000000 | 0.000000 | 0.028406 | 0.039231 | 0.0 | 0.014846 | ... | 0.000000 | 0.000000 | 0.046156 | 0.097204 | 0.000000 | 0.000000 | 0.000000 | 0.005464 | 0.0 | 0.000370 |
9105 | 0.029334 | 0.000000 | 0.012962 | 0.000000 | 0.014837 | 0.028242 | 0.000414 | 0.025195 | 0.0 | 0.007911 | ... | 0.000000 | 0.017720 | 0.059066 | 0.177582 | 0.000000 | 0.000000 | 0.000000 | 0.000380 | 0.0 | 0.000000 |
9106 | 0.004070 | 0.000000 | 0.000000 | 0.016110 | 0.000000 | 0.000000 | 0.010171 | 0.027782 | 0.0 | 0.000945 | ... | 0.000000 | 0.000000 | 0.014647 | 0.072228 | 0.222037 | 0.000000 | 0.245327 | 0.002171 | 0.0 | 0.000000 |
9107 | 0.012894 | 0.000000 | 0.000733 | 0.030433 | 0.000000 | 0.000000 | 0.029637 | 0.064172 | 0.0 | 0.021531 | ... | 0.000000 | 0.025631 | 0.008832 | 0.267814 | 0.000000 | 0.000000 | 0.030072 | 0.000000 | 0.0 | 0.000000 |
9108 | 0.015228 | 0.000000 | 0.000170 | 0.022236 | 0.000000 | 0.000000 | 0.018144 | 0.004921 | 0.0 | 0.005200 | ... | 0.000000 | 0.000000 | 0.031351 | 0.211152 | 0.000000 | 0.000000 | 0.131391 | 0.003772 | 0.0 | 0.000000 |
9109 | 0.036592 | 0.000000 | 0.003224 | 0.031009 | 0.004927 | 0.000000 | 0.020842 | 0.071492 | 0.0 | 0.021438 | ... | 0.000212 | 0.001298 | 0.054454 | 0.162173 | 0.001041 | 0.000000 | 0.060488 | 0.003435 | 0.0 | 0.000000 |
9110 | 0.017529 | 0.000000 | 0.000000 | 0.017824 | 0.000000 | 0.000000 | 0.018764 | 0.000000 | 0.0 | 0.000000 | ... | 0.002869 | 0.040700 | 0.006030 | 0.246727 | 0.000000 | 0.000000 | 0.086884 | 0.003378 | 0.0 | 0.000000 |
9112 | 0.031905 | 0.003329 | 0.003657 | 0.038107 | 0.000000 | 0.000000 | 0.019290 | 0.045312 | 0.0 | 0.002296 | ... | 0.000000 | 0.019050 | 0.053040 | 0.092170 | 0.000000 | 0.000000 | 0.000000 | 0.001793 | 0.0 | 0.000000 |
9113 | 0.045796 | 0.000000 | 0.003949 | 0.002286 | 0.000000 | 0.000000 | 0.006261 | 0.027501 | 0.0 | 0.000000 | ... | 0.000000 | 0.025152 | 0.062904 | 0.257408 | 0.037749 | 0.000000 | 0.000000 | 0.001175 | 0.0 | 0.000000 |
9114 | 0.079590 | 0.006749 | 0.002371 | 0.021311 | 0.000000 | 0.000000 | 0.017618 | 0.111433 | 0.0 | 0.010972 | ... | 0.000000 | 0.000000 | 0.109078 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000221 | 0.0 | 0.000919 |
9116 | 0.000000 | 0.000000 | 0.000000 | 0.003446 | 0.000000 | 0.000000 | 0.005269 | 0.000000 | 0.0 | 0.000000 | ... | 0.000160 | 0.114610 | 0.022406 | 0.243280 | 0.014482 | 0.000000 | 0.112478 | 0.000000 | 0.0 | 0.000000 |
9117 | 0.001213 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.000000 | 0.040750 | 0.001302 | 0.397878 | 0.000000 | 0.000000 | 0.166156 | 0.000911 | 0.0 | 0.000000 |
9118 | 0.019181 | 0.000000 | 0.000000 | 0.041347 | 0.000000 | 0.000000 | 0.015371 | 0.040876 | 0.0 | 0.012439 | ... | 0.000219 | 0.000000 | 0.023484 | 0.157825 | 0.031902 | 0.000000 | 0.137689 | 0.004242 | 0.0 | 0.000000 |
9119 | 0.036951 | 0.002960 | 0.000000 | 0.021283 | 0.000000 | 0.000000 | 0.010837 | 0.062043 | 0.0 | 0.015091 | ... | 0.000076 | 0.006685 | 0.045383 | 0.197475 | 0.040867 | 0.000457 | 0.026496 | 0.001799 | 0.0 | 0.000200 |
9120 | 0.026611 | 0.000781 | 0.000000 | 0.000468 | 0.000000 | 0.000000 | 0.009929 | 0.040768 | 0.0 | 0.003021 | ... | 0.000000 | 0.024230 | 0.019282 | 0.235651 | 0.017742 | 0.000000 | 0.095742 | 0.002195 | 0.0 | 0.000066 |
9121 | 0.005433 | 0.000000 | 0.000454 | 0.056839 | 0.000000 | 0.000000 | 0.019102 | 0.052365 | 0.0 | 0.011261 | ... | 0.000000 | 0.048820 | 0.035044 | 0.133349 | 0.024641 | 0.000000 | 0.109806 | 0.002113 | 0.0 | 0.000000 |
9165 | 0.017479 | 0.000000 | 0.000000 | 0.008826 | 0.000000 | 0.000000 | 0.013337 | 0.013744 | 0.0 | 0.003507 | ... | 0.001174 | 0.056218 | 0.020538 | 0.298131 | 0.008824 | 0.000000 | 0.125729 | 0.002626 | 0.0 | 0.000000 |
9166 | 0.017747 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.013241 | 0.025715 | 0.0 | 0.000000 | ... | 0.000000 | 0.011393 | 0.015843 | 0.236452 | 0.071622 | 0.000000 | 0.133848 | 0.001551 | 0.0 | 0.000000 |
9167 | 0.046150 | 0.004153 | 0.000586 | 0.029131 | 0.000000 | 0.000000 | 0.015318 | 0.090483 | 0.0 | 0.048755 | ... | 0.000000 | 0.000000 | 0.097628 | 0.052031 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000393 |
9168 | 0.038030 | 0.000000 | 0.008009 | 0.032995 | 0.000000 | 0.000000 | 0.015055 | 0.019886 | 0.0 | 0.000122 | ... | 0.000750 | 0.000000 | 0.017958 | 0.237107 | 0.000000 | 0.000000 | 0.034764 | 0.000354 | 0.0 | 0.000000 |
9169 | 0.040401 | 0.000000 | 0.000000 | 0.017743 | 0.000000 | 0.000000 | 0.012677 | 0.018073 | 0.0 | 0.000072 | ... | 0.000000 | 0.000000 | 0.037869 | 0.152494 | 0.000000 | 0.000000 | 0.000000 | 0.002456 | 0.0 | 0.000000 |
9170 | 0.001406 | 0.000000 | 0.000000 | 0.024063 | 0.000000 | 0.000000 | 0.006337 | 0.000000 | 0.0 | 0.000000 | ... | 0.000000 | 0.034838 | 0.036429 | 0.237845 | 0.186743 | 0.000000 | 0.048438 | 0.000954 | 0.0 | 0.000000 |
9171 | 0.004296 | 0.000412 | 0.000000 | 0.037080 | 0.007700 | 0.000000 | 0.008648 | 0.013575 | 0.0 | 0.000000 | ... | 0.000353 | 0.000761 | 0.010418 | 0.191832 | 0.000000 | 0.001537 | 0.144852 | 0.001676 | 0.0 | 0.000000 |
9172 | 0.004607 | 0.000422 | 0.016581 | 0.029331 | 0.000000 | 0.000019 | 0.010006 | 0.031523 | 0.0 | 0.000000 | ... | 0.002156 | 0.018158 | 0.009102 | 0.241950 | 0.000000 | 0.004823 | 0.104858 | 0.015060 | 0.0 | 0.000000 |
9122 | 0.044896 | 0.006798 | 0.000000 | 0.013065 | 0.000000 | 0.000000 | 0.022354 | 0.067906 | 0.0 | 0.011042 | ... | 0.000000 | 0.000000 | 0.030763 | 0.148463 | 0.000000 | 0.000000 | 0.004873 | 0.000000 | 0.0 | 0.000000 |
9123 | 0.036793 | 0.021414 | 0.000000 | 0.013317 | 0.000000 | 0.000000 | 0.027294 | 0.108887 | 0.0 | 0.038598 | ... | 0.000000 | 0.000000 | 0.037017 | 0.085672 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
9124 | 0.047839 | 0.005392 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.023929 | 0.095896 | 0.0 | 0.000795 | ... | 0.000000 | 0.000000 | 0.036180 | 0.072006 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
9125 | 0.047328 | 0.008957 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.025473 | 0.122208 | 0.0 | 0.012059 | ... | 0.000000 | 0.000000 | 0.045437 | 0.036258 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
9126 | 0.052478 | 0.007149 | 0.000000 | 0.005149 | 0.000000 | 0.000000 | 0.026425 | 0.088092 | 0.0 | 0.008506 | ... | 0.000000 | 0.000000 | 0.081810 | 0.031593 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
9127 | 0.031883 | 0.000000 | 0.000000 | 0.007273 | 0.000000 | 0.000000 | 0.019330 | 0.083236 | 0.0 | 0.011144 | ... | 0.000000 | 0.000000 | 0.040736 | 0.128713 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
9128 | 0.049647 | 0.006472 | 0.000000 | 0.003832 | 0.000000 | 0.000000 | 0.024532 | 0.087951 | 0.0 | 0.010565 | ... | 0.000000 | 0.000000 | 0.069194 | 0.045462 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
9129 | 0.071568 | 0.000000 | 0.000000 | 0.015938 | 0.000000 | 0.000000 | 0.031499 | 0.087671 | 0.0 | 0.014011 | ... | 0.000000 | 0.000000 | 0.033312 | 0.047501 | 0.000000 | 0.000000 | 0.000000 | 0.000006 | 0.0 | 0.000000 |
9130 | 0.066304 | 0.000000 | 0.002818 | 0.001318 | 0.000000 | 0.000000 | 0.021337 | 0.090575 | 0.0 | 0.002119 | ... | 0.000000 | 0.000000 | 0.060295 | 0.080626 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
9131 | 0.039067 | 0.009778 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.023478 | 0.111008 | 0.0 | 0.018155 | ... | 0.000000 | 0.000000 | 0.040867 | 0.056370 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 |
49 rows × 23 columns
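If the true cell type proportions of the samples are known, we can validate the deconvolution results. Here, neutrophil counts were measured, and we indeed observe a positive correlation between the measured values and the deconvolution-based estimates for the two most correlated neutrophil clusters (Pearson r of about 0.45 for Neutrophils_1 and Neutrophils_4).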
neutCounts = metadata["Total.neutrophil.count...mm3."].astype(float)
subsetCorMuSiC = pd.Series(
np.corrcoef(
music_frac.to_numpy()[~np.isnan(neutCounts.to_numpy()), :].transpose(),
neutCounts[~np.isnan(neutCounts.to_numpy())].astype(float),
)[music_frac.shape[1], 0 : music_frac.shape[1]]
)
subsetCorMuSiC.index = music_frac.columns
subsetCorMuSiC.sort_values()
C:\Users\Shennor\AppData\Roaming\Python\Python38\site-packages\numpy\lib\function_base.py:2691: RuntimeWarning: invalid value encountered in true_divide c /= stddev[:, None] C:\Users\Shennor\AppData\Roaming\Python\Python38\site-packages\numpy\lib\function_base.py:2692: RuntimeWarning: invalid value encountered in true_divide c /= stddev[None, :]
NK_cells                 -0.511184
Eosinophils              -0.498070
CD4_T_cells_1            -0.403485
B_cells_1                -0.325301
B_cells_2                -0.281896
CD8_T_cells              -0.281529
pDC                      -0.264388
CD4_T_cells_3            -0.180847
CD16_Monocytes           -0.165263
CD14_Monocytes_2         -0.164895
CD34+ GATA2+ cells       -0.132763
CD14_Monocytes_3         -0.063592
Neutrophils_2            -0.058816
Immature Neutrophils_1   -0.004655
CD14_Monocytes_1         -0.002719
Plasmablast               0.015157
Neutrophils_3             0.057406
Megakaryocytes            0.292929
Immature Neutrophils_2    0.322763
Neutrophils_1             0.446441
Neutrophils_4             0.447348
CD4_T_cells_2                  NaN
mDC                            NaN
dtype: float64
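The NaN entries most likely arise because the estimated fractions of CD4_T_cells_2 and mDC are constant (zero) across samples, so their Pearson correlation is undefined. The sketch below is a more explicit variant of the correlation step on toy data. The function name and values are hypothetical, and it assumes that the fraction table and the measurement share the same sample index.

```python
import numpy as np
import pandas as pd


def correlate_with_measurement(frac: pd.DataFrame, measured: pd.Series) -> pd.Series:
    """Pearson correlation of each cell-type column with a measured quantity."""
    valid = measured.notna()  # drop samples without a measurement
    frac, measured = frac.loc[valid], measured.loc[valid].astype(float)
    out = {}
    for col in frac.columns:
        if frac[col].std() == 0:
            out[col] = np.nan  # undefined for a constant column
        else:
            out[col] = np.corrcoef(frac[col], measured)[0, 1]
    return pd.Series(out).sort_values()


samples = ["s1", "s2", "s3", "s4"]
frac = pd.DataFrame(
    {
        "Neutrophils": [0.2, 0.4, 0.6, 0.8],
        "mDC": [0.0, 0.0, 0.0, 0.0],
        "T": [0.8, 0.6, 0.4, 0.2],
    },
    index=samples,
)
measured = pd.Series([2000.0, 4000.0, np.nan, 8000.0], index=samples)
result = correlate_with_measurement(frac, measured)
print(result)
```

Handling constant columns explicitly avoids the division-by-zero warning and makes it clear why no correlation is reported for those cell types.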
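We can also look for significant changes in cell composition between patients and healthy controls. Here, for each cell type, we compute the p-value of this change using Student's t-test.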
healthy_vs_covid = pd.Series(
    [
        sci.stats.ttest_ind(
            music_frac[cell].to_numpy()[metadata["status"].to_numpy() == "covid"],
            music_frac[cell].to_numpy()[metadata["status"].to_numpy() == "healthy"],
        )[1]  # keep only the p-value
        for cell in music_frac.columns
    ]
)
healthy_vs_covid.index = music_frac.columns
healthy_vs_covid.sort_values()
CD4_T_cells_1             3.535377e-08
Eosinophils               2.174486e-07
CD34+ GATA2+ cells        1.313440e-06
B_cells_2                 4.179994e-05
Neutrophils_1             1.133780e-04
B_cells_1                 3.829060e-04
Neutrophils_4             4.508343e-04
CD14_Monocytes_2          1.010730e-02
Megakaryocytes            1.307137e-02
Plasmablast               1.902793e-02
Neutrophils_2             6.436426e-02
NK_cells                  1.110367e-01
Immature Neutrophils_1    1.321325e-01
CD14_Monocytes_1          1.452041e-01
CD4_T_cells_3             1.676061e-01
Immature Neutrophils_2    1.806682e-01
CD14_Monocytes_3          2.641251e-01
CD8_T_cells               3.427982e-01
pDC                       3.571902e-01
Neutrophils_3             4.002572e-01
CD16_Monocytes            6.143465e-01
CD4_T_cells_2                      NaN
mDC                                NaN
dtype: float64
Here is a boxplot presenting the differences between the two conditions for the cell type with the smallest p-value:
selected_cell = healthy_vs_covid.index[np.nanargmin(healthy_vs_covid.to_numpy())]
status_df = pd.DataFrame(metadata["status"])
status_df["cellFraction"] = music_frac[[selected_cell]]
status_df.boxplot(by="status")
[Figure: boxplot of cellFraction grouped by status]
Limitations and traps¶
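Although single-cell data is less likely to miss cell types than preselected sorted cells, handling missing cell types is still considered a major challenge in the field of cell type deconvolution {cite}Cobos2020,Jin2021. In addition, the number of cell types in the signature matrix has a major impact on deconvolution accuracy, as more cell types generally lead to lower accuracy {cite}Newman2019.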
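One pragmatic way to work at a coarser resolution is to sum the estimated fractions of closely related subclusters after deconvolution. This is an illustrative option rather than a recommendation from the cited benchmarks. The mapping, the function name, and the toy values below are assumptions, and whether merging is appropriate depends on the biological question.

```python
import pandas as pd


def merge_subclusters(frac: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Sum fraction columns that map to the same coarse cell type."""
    # Columns missing from the mapping keep their original name.
    return frac.T.groupby(lambda name: mapping.get(name, name)).sum().T


frac = pd.DataFrame(
    {
        "Neutrophils_1": [0.3, 0.1],
        "Neutrophils_2": [0.2, 0.6],
        "NK_cells": [0.5, 0.3],
    },
    index=["s1", "s2"],
)
mapping = {"Neutrophils_1": "Neutrophils", "Neutrophils_2": "Neutrophils"}
coarse_frac = merge_subclusters(frac, mapping)
print(coarse_frac)
```

Because the fractions are only summed, each sample's total is unchanged, so the merged table still satisfies the sum-to-one property of the original output.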
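Although several deconvolution methods correctly infer the proportions of the major cell components, their performance on rare or closely related components varies. To deal with collinearity between predictors, some methods perform feature selection before deconvolution by choosing a subset of genes (called a signature gene list) that minimizes the correlation between cell types.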
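The effect of such a feature selection can be illustrated on a simulated signature. In the sketch below, most genes behave like housekeeping genes with similar expression in every cell type, and a few marker genes are specific to one type. All sizes, values, and names are made up, and the marker set is known by construction rather than selected by an algorithm.

```python
import numpy as np


def max_offdiag_correlation(signature: np.ndarray) -> float:
    """Largest absolute Pearson correlation between two cell-type profiles (rows)."""
    corr = np.corrcoef(signature)
    np.fill_diagonal(corr, 0.0)
    return float(np.abs(corr).max())


rng = np.random.default_rng(0)
n_types, n_shared, n_markers = 3, 200, 10

# Shared genes: nearly the same expression in every cell type (collinear profiles).
shared = rng.gamma(2.0, 50.0, size=n_shared)
noise = rng.normal(1.0, 0.05, size=(n_types, n_shared))
shared_block = np.tile(shared, (n_types, 1)) * noise

# Marker genes: high in exactly one cell type and low in the others.
marker_block = np.kron(np.eye(n_types), np.full(n_markers, 100.0)) + 1.0
signature = np.hstack([shared_block, marker_block])

marker_idx = np.arange(n_shared, n_shared + n_types * n_markers)
corr_all = max_offdiag_correlation(signature)
corr_markers = max_offdiag_correlation(signature[:, marker_idx])
print(round(corr_all, 3), round(corr_markers, 3))
```

Restricting the signature to the marker genes lowers the correlation between cell-type profiles considerably, which is the property that feature selection methods try to achieve on real data without knowing the markers in advance.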
New directions¶
Beyond the methods described above, other tools can be used to improve deconvolution or to resolve higher-resolution states within cell types:
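AutoGeneS {cite}Aliee2021 proposes a multi-objective feature selection approach that can be integrated into deconvolution platforms. AutoGeneS requires no prior knowledge about marker genes and selects genes by optimizing several criteria simultaneously: minimizing the correlation and maximizing the distance between cell types. AutoGeneS can be applied to reference profiles from various sources, such as single-cell experiments or sorted cell populations.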
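CPM {cite}Frishberg2019 is a cell-state deconvolution method that uses the single-cell space to uncover compositional changes occurring within each cell type, typically capturing changes in cell composition along continuous cell trajectories. By focusing on variation within a cell type rather than between cell types (unlike most deconvolution methods), CPM can reveal compositional changes of distinct cell subpopulations or continuous changes along a specific cell trajectory.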
Key takeaways¶
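Use unbiased reference data without missing cell types.
Normalize the reference data using library-size normalization.
Feature selection before deconvolution can improve the accuracy of the results, especially for linear deconvolution methods.
Deconvolution is a challenging task (see Limitations and traps). We therefore recommend trying several deconvolution methods and choosing one after evaluating the results biologically.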