7.1 Bulk deconvolution¶

TL;DR We give a brief overview of the basic concepts of cell type deconvolution, including the input structure, data preprocessing, and analysis of the output.

Background¶

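There are several key motivations for studying differences in cell type composition in bulk tissue. First, interactions between cell types play an important role in disease progression and recovery. Second, molecular patterns such as gene expression and protein abundance are often directly linked to the composition of cell types within a tissue, so understanding these compositions is essential for studying the molecular mechanisms that underlie biological conditions such as disease. Third, uncovering disease-specific cell type patterns can better guide the selection of therapeutic targets, which provides important clinical utility.

Cell type deconvolution is a computational framework for inferring the composition of cell populations in complex, heterogeneous tissues {cite}Kuhn2012,Schwartz2010,Du2019,Zaitsev2019. Because measuring these compositions experimentally is time-consuming and expensive, deconvolution methods enable large-scale analysis of cell populations based on molecular data. Deconvolution is commonly formulated as a linear regression problem: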

$$y = bX$$

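where $y$ is the heterogeneous (mixed) gene expression profile measured with common molecular pipelines such as microarrays or RNA-seq, $X$ is the signature matrix containing homogeneous cell type-specific profiles, and $b$ is the vector of cell proportions in the mixture, inferred by the deconvolution method {cite}Baron2016. To choose the best deconvolution approach for the biological condition of interest, several technical and biological factors should be considered, including the deconvolution method itself, missing or rare cell types in the reference data, data normalization, and the selection of features (markers).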
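To make the linear model concrete, the following is a minimal sketch that simulates a bulk profile from known proportions and recovers them with non-negative least squares. The signature matrix, the proportions and the noise level are all synthetic assumptions, and `scipy.optimize.nnls` is used here only as a generic solver; it is not the algorithm of any particular deconvolution tool.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

n_genes, n_cell_types = 200, 4

# Hypothetical signature matrix X: one row per cell type, one column per gene.
X = rng.gamma(shape=2.0, scale=50.0, size=(n_cell_types, n_genes))

# Ground-truth proportions b (non-negative, summing to one).
b_true = np.array([0.5, 0.3, 0.15, 0.05])

# Simulated bulk profile y = bX plus a small amount of measurement noise.
y = b_true @ X + rng.normal(scale=1.0, size=n_genes)

# nnls solves min ||A x - y||_2 subject to x >= 0, so we pass A = X^T.
b_hat, residual_norm = nnls(X.T, y)

# Rescale so that the estimated proportions sum to one.
b_hat = b_hat / b_hat.sum()

print(np.round(b_hat, 3))
```

The non-negativity constraint and the final rescaling reflect that proportions cannot be negative and should sum to one; real methods add gene weighting or regularization on top of this basic idea.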

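The signature matrix $X$ used to infer cell composition reflects our best knowledge of cellular heterogeneity within the tissue and strongly affects the success of deconvolution {cite}Aliee2021. Originally, signature matrices were generated by sorting cells from heterogeneous tissue (using FACS or CyTOF), which introduced inherent bias due to the pre-selected cell type panel and the lack of suitable antibodies {cite}Monaco2019,Aran2017. Today, these matrices are mostly generated from unbiased single-cell profiles, which allows signature matrices to be built across different organisms, tissues and biological conditions {cite}Aliee2021,Newman2019.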
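As an illustration of how a signature matrix can be derived from single-cell data, the sketch below averages library-size-normalized profiles within each annotated cell type. The count table, gene names and labels are toy assumptions, and averaging is only one simple choice; published tools build their references in their own ways.

```python
import numpy as np
import pandas as pd

# Toy single-cell count matrix: rows are cells, columns are genes (hypothetical values).
counts = pd.DataFrame(
    [[10, 0, 2], [8, 1, 0], [0, 5, 7], [1, 7, 9], [0, 6, 8]],
    index=[f"cell{i}" for i in range(5)],
    columns=["geneA", "geneB", "geneC"],
)
cell_types = pd.Series(
    ["T_cell", "T_cell", "B_cell", "B_cell", "B_cell"], index=counts.index
)

# Library-size normalise each cell so that deeply sequenced cells do not dominate.
normalised = counts.div(counts.sum(axis=1), axis=0)

# Average the normalised profiles within each cell type:
# the result has one row per cell type and one column per gene.
signature = normalised.groupby(cell_types).mean()
print(signature)
```

Each row of `signature` plays the role of one cell type-specific profile in $X$.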

Approaches¶

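Bulk deconvolution methods can be grouped into linear regression-based, enrichment-based, non-linear deep learning-based, and other approaches.

The most common cell type deconvolution methods are based on linear regression. They attempt to solve the equation $y = bX$ directly, using different regularization schemes, and rely on a comparatively large number of features. Examples include CIBERSORTx {cite}Newman2019, MuSiC {cite}WangXuran2019, dtangle {cite}Hunt2018 and DWLS {cite}Tsoucas2019.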

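Enrichment-based tools, on the other hand, compute an enrichment score for each cell type separately, based on a gene set that represents that cell type. The enrichment scores of all cell types are then combined and converted into compositions by a method-specific transformation. Because these methods consider only one cell type at a time, they provide meaningful insights in simple cases but are less accurate when the reference contains many cell types. xCell {cite}Aran2017 is an example of an enrichment-based tool.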
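The idea of a per-cell-type enrichment score can be sketched as follows. This is a deliberately simplified stand-in (the mean per-gene z-score of a marker set), not the scoring or transformation used by xCell; the expression values and marker sets are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy bulk expression (genes x samples) on a linear scale; values are illustrative only.
bulk_toy = pd.DataFrame(
    {
        "s1": [90.0, 80.0, 10.0, 12.0, 50.0],
        "s2": [20.0, 25.0, 70.0, 85.0, 50.0],
        "s3": [55.0, 50.0, 40.0, 45.0, 50.0],
    },
    index=["CD3E", "CD2", "CD79A", "MS4A1", "ACTB"],
)

# Hypothetical marker gene sets, one per cell type.
marker_sets = {"T_cell": ["CD3E", "CD2"], "B_cell": ["CD79A", "MS4A1"]}


def enrichment_scores(expr, gene_sets):
    """Mean per-gene z-score (across samples) of each marker set."""
    std = expr.std(axis=1).replace(0, np.nan)  # constant genes carry no signal
    z = expr.sub(expr.mean(axis=1), axis=0).div(std, axis=0)
    return pd.DataFrame(
        {ct: z.loc[genes].mean(axis=0) for ct, genes in gene_sets.items()}
    )


scores = enrichment_scores(bulk_toy, marker_sets)
print(scores.round(2))
```

Each column is computed independently of the others, which is the property that makes such scores easy to interpret in simple cases but hard to convert into consistent proportions.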

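A third option is non-linear deep learning methods, which aim to improve deconvolution accuracy while trying to retain high biological interpretability. It is still too early to tell whether these methods can outperform the other types of approaches. Scaden {cite}Menden2020 is an example of a deep learning-based method.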

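Bulk deconvolution methods have been benchmarked extensively, with largely consistent results {cite}ShenOrr2013,Cobos2020,Jin2021,Nadel2021. Methods that use single-cell RNA-seq data perform well, whereas semi-supervised methods show higher error rates. Omitting cell types from the reference that are present in the mixture leads to poor results {cite}Cobos2020. Cobos et al. recommend: (1) keeping the input data on a linear scale, (2) avoiding row scaling, column min-max scaling, column z-scoring and quantile normalization, (3) using regression-based bulk deconvolution methods such as CIBERSORTx or FARDEEP, which perform well, and, if single-cell RNA-seq data are available, additionally running DWLS, MuSiC or SCDC to compare results, (4) using a stringent marker selection strategy that focuses on the difference between the two cell types with the highest expression values, and (5) using a comprehensive reference matrix that contains all relevant cell types present in the mixture {cite}Cobos2020.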
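A library-size normalization that keeps the data on a linear scale can be sketched as below. Counts per million is used here as one common choice; the function name and the toy table are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def counts_per_million(counts: pd.DataFrame) -> pd.DataFrame:
    """Scale each sample (column) of a genes x samples count table to sum to 1e6."""
    library_sizes = counts.sum(axis=0)
    return counts.div(library_sizes, axis=1) * 1e6


# Toy bulk counts (genes x samples); the numbers are made up for illustration.
toy_counts = pd.DataFrame(
    {"sampleA": [100, 300, 600], "sampleB": [10, 10, 80]},
    index=["gene1", "gene2", "gene3"],
)

cpm = counts_per_million(toy_counts)
print(cpm)
```

No log transform, z-scoring or quantile step is applied, so ratios between genes within a sample are preserved.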

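Regarding the impact of normalization, Li et al. {cite}Li2016 reported that the normalization strategy has a large effect on the results, a finding that Cobos et al. could not confirm {cite}Cobos2020.

Because many bulk deconvolution tools, including the well-performing CIBERSORTx, are available only as web tools, we demonstrate a use case with MuSiC.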

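Note that MuSiC requires multiple single-cell samples that contain the same cell types. In practice, the reference dataset may include only a single sample, or some cell types may be missing from several samples. In these cases MuSiC will fail, and another deconvolution method has to be chosen.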
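Whether a reference meets this requirement can be checked before running MuSiC. The sketch below counts, for each cell type, the number of distinct samples in which it occurs; the column names `sample` and `cell_type` and the toy annotation are hypothetical, and the threshold of two samples is an assumption chosen for illustration.

```python
import pandas as pd

# Toy cell-level annotation; the column names "sample" and "cell_type" are hypothetical.
obs = pd.DataFrame(
    {
        "sample": ["d1", "d1", "d2", "d2", "d2", "d3"],
        "cell_type": ["NK", "B", "NK", "B", "pDC", "NK"],
    }
)


def samples_per_cell_type(obs, sample_key, cell_type_key):
    """Number of distinct samples in which each cell type was observed."""
    return obs.groupby(cell_type_key)[sample_key].nunique()


n_samples = samples_per_cell_type(obs, "sample", "cell_type")

# Cell types seen in fewer than two samples give no between-sample variance estimate.
too_few = n_samples[n_samples < 2].index.tolist()
print(n_samples.to_dict(), too_few)
```

Cell types listed in `too_few` are candidates for removal or merging, or a reason to pick a different method.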

Deconvolving bulk COVID-19 whole blood samples¶

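Below we use MuSiC to deconvolve 49 whole-blood RNA-seq samples collected from 39 COVID-19 patients and 10 healthy controls {cite}Aschenbrenner2020. As the single-cell reference, we use whole-blood single-cell RNA-seq data from COVID-19 patients {cite}Schulte-Schrepping2020.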

Environment setup¶

In [1]:

import scanpy as sc
import anndata
import numpy as np
import pandas as pd
import scipy as sci
import scipy.stats  # explicitly load the submodule so that sci.stats is available

Loading Data¶

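We first read the single-cell and bulk data. We do not scale the data, since studies have shown that deconvolution methods using scRNA-seq data as a reference perform best on linear-scale data, and that accuracy improves after library size normalization {cite}Jin2021,Cobos2020.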

In [45]:

data_file = "/storage/groups/ml01/workspace/amit.frishberg/OriginalData/"
adata = sc.read(data_file + "seurat_COVID19_freshWB_PBMC_cohort2_incl_raw.h5ad")

adata.X = adata.layers["counts"]
adata = adata[adata.obs["cells"] == "Whole_blood"].copy()
adata

Out[45]:

AnnData object with n_obs × n_vars = 89883 × 33417
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'percent.mito', 'percent.hb', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification', 'HTO_classification.global', 'hash.ID', 'demultID', 'donor', 'onset_of_symptoms', 'days_after_onset', 'sampleID', 'date_of_sampling', 'experiment', 'cartridge', 'platform', 'purification', 'cells', 'age', 'sex', 'group_per_sample', 'who_per_sample', 'disease_stage', 'diagnosis', 'oxygen', 'outcome', 'comorbidities', 'COVID.19.related_medication_and_anti.microbials', 'primary_complaint', 'RNA_snn_res.0.8', 'cluster_labels_res.0.8', 'new.order', 'hpca.labels', 'blueprint.labels', 'monaco.labels', 'immune.labels', 'dmap.labels', 'hemato.labels'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
    obsm: 'X_pca', 'X_umap'
    layers: 'counts'

In [46]:

adata.obs["cluster_labels_res.0.8"].value_counts()

Out[46]:

Neutrophils_1             22714
Neutrophils_2             18675
Neutrophils_3              9986
CD4_T_cells_1              6278
CD14_Monocytes_1           5362
Neutrophils_4              4710
CD8_T_cells                3255
Megakaryocytes             3180
NK_cells                   2916
B_cells_1                  2561
CD4_T_cells_2              1917
Mixed_cells                1727
Immature Neutrophils_1     1317
CD16_Monocytes              983
Immature Neutrophils_2      981
CD14_Monocytes_3            632
Eosinophils                 579
CD14_Monocytes_2            556
CD4_T_cells_3               409
Plasmablast                 390
Prol. cells                 247
mDC                         235
B_cells_2                   138
pDC                          86
CD34+ GATA2+ cells           49
Name: cluster_labels_res.0.8, dtype: int64

In [4]:

bulk = pd.read_csv(data_file + "BulkSmall.txt", sep="\t", index_col=0)
metadata = pd.read_csv(data_file + "annoSmall.txt", sep="\t", index_col=0)
metadata.index = metadata.index.astype("str")
metadata = metadata.loc[bulk.transpose().index]

Data preprocessing¶

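After reading the data, we need to remove undefined cells from the single-cell reference. Here, we remove mixed and proliferating cells, since they are not restricted to a single cell type.

Note that quality control, that is, the removal of low-quality cells and genes, has already been performed on the single-cell data. We therefore skip this step.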

In [50]:

adata = adata[
    ~adata.obs["cluster_labels_res.0.8"].isin(["None", "Mixed_cells", "Prol. cells"])
].copy()
ct_counts = adata.obs["cluster_labels_res.0.8"].value_counts()
adata.obs["cluster_labels_res.0.8"].value_counts()

Out[50]:

Neutrophils_1             22714
Neutrophils_2             18675
Neutrophils_3              9986
CD4_T_cells_1              6278
CD14_Monocytes_1           5362
Neutrophils_4              4710
CD8_T_cells                3255
Megakaryocytes             3180
NK_cells                   2916
B_cells_1                  2561
CD4_T_cells_2              1917
Immature Neutrophils_1     1317
CD16_Monocytes              983
Immature Neutrophils_2      981
CD14_Monocytes_3            632
Eosinophils                 579
CD14_Monocytes_2            556
CD4_T_cells_3               409
Plasmablast                 390
mDC                         235
B_cells_2                   138
pDC                          86
CD34+ GATA2+ cells           49
Name: cluster_labels_res.0.8, dtype: int64

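Studies have shown that deconvolution methods often fail to predict the proportions of rare cells {cite}Tsoucas2019. To remove rare cell types, restrict the reference to cell types with more than rare_ct_cut_off cells. This cutoff is user-defined and should be chosen based on the data.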

In [59]:

# removing very rare cells
rare_ct_cut_off = 50  # This is a user-specific parameter based on the data
ct_to_keep = ct_counts[ct_counts > rare_ct_cut_off].index
adata = adata[adata.obs["cluster_labels_res.0.8"].isin(ct_to_keep)].copy()
adata.obs["cluster_labels_res.0.8"].value_counts()

Out[59]:

Neutrophils_1             22714
Neutrophils_2             18675
Neutrophils_3              9986
CD4_T_cells_1              6278
CD14_Monocytes_1           5362
Neutrophils_4              4710
CD8_T_cells                3255
Megakaryocytes             3180
NK_cells                   2916
B_cells_1                  2561
CD4_T_cells_2              1917
Immature Neutrophils_1     1317
CD16_Monocytes              983
Immature Neutrophils_2      981
CD14_Monocytes_3            632
Eosinophils                 579
CD14_Monocytes_2            556
CD4_T_cells_3               409
Plasmablast                 390
mDC                         235
B_cells_2                   138
pDC                          86
Name: cluster_labels_res.0.8, dtype: int64

We need to filter for the shared genes across bulk and single-cell data before selecting highly variable genes.

In [48]:

bulk_sc_genes = np.intersect1d(bulk.index, adata.var_names)
bulk = bulk.loc[bulk_sc_genes, :].copy()
adata = adata[:, bulk_sc_genes].copy()

Visualization of single-cell data¶

For visualization of single-cell data, we first normalize the counts, select highly-variable genes and log transform the data.

In [7]:

# normalize_per_cell is deprecated in recent scanpy releases;
# its documented replacement is sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4, copy=False)
sc.pp.highly_variable_genes(adata, flavor="cell_ranger", n_top_genes=5000)
adata_log = sc.pp.log1p(
    adata, copy=True
)  # logged counts are only used for visualisation (can also work with layers)

We then reduce the dimensionality using PCA and visualize the data using UMAP:

In [10]:

sc.tl.pca(adata_log)
adata_log.obsm["X_pca"] *= -1  # multiply by -1 to match Seurat
sc.pp.neighbors(adata_log, n_neighbors=30)
sc.tl.umap(adata_log)
sc.pl.umap(adata_log, color="cluster_labels_res.0.8")

Figure: UMAP of the single-cell reference, colored by cluster_labels_res.0.8.

Deconvolving using MuSiC¶

Loading R¶

In [8]:

# R interface
from rpy2.robjects import pandas2ri
from rpy2.robjects import r
import rpy2.rinterface_lib.callbacks
import anndata2ri

pandas2ri.activate()
anndata2ri.activate()
%load_ext rpy2.ipython

C:\Users\Shennor\.conda\envs\eharpy\lib\site-packages\rpy2\robjects\packages.py:365: UserWarning: The symbol 'quartz' is not in this R namespace/package.
  warnings.warn(

To use R scripts, we need to convert the data from Python to R. In some cases this requires subsampling the data to stay within memory limits.

In [51]:

import random
import itertools

# Keep at most `downSamplingSize` randomly chosen cells per cell type
downSamplingSize = 80
labels = adata.obs["cluster_labels_res.0.8"]
downSamplingIndexes = [
    random.sample(
        np.where(labels == currCell)[0].tolist(),  # positions of this cell type
        min(downSamplingSize, int(np.sum(labels == currCell))),  # cap at group size
    )
    for currCell in np.unique(labels)
]
# Flatten the per-cell-type lists into a single list of positions
downSamplingIndexes = list(itertools.chain(*downSamplingIndexes))

adata_r = adata[downSamplingIndexes].copy()
adata_r

Out[51]:

AnnData object with n_obs × n_vars = 1809 × 26807
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'percent.mito', 'percent.hb', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification', 'HTO_classification.global', 'hash.ID', 'demultID', 'donor', 'onset_of_symptoms', 'days_after_onset', 'sampleID', 'date_of_sampling', 'experiment', 'cartridge', 'platform', 'purification', 'cells', 'age', 'sex', 'group_per_sample', 'who_per_sample', 'disease_stage', 'diagnosis', 'oxygen', 'outcome', 'comorbidities', 'COVID.19.related_medication_and_anti.microbials', 'primary_complaint', 'RNA_snn_res.0.8', 'cluster_labels_res.0.8', 'new.order', 'hpca.labels', 'blueprint.labels', 'monaco.labels', 'immune.labels', 'dmap.labels', 'hemato.labels'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
    obsm: 'X_pca', 'X_umap'
    layers: 'counts'
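Because the subsampling above draws from an unseeded random number generator, repeated runs select different cells. A reproducible variant could look like the sketch below; the helper name `subsample_per_label`, its `seed` parameter and the toy labels are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np
import pandas as pd


def subsample_per_label(labels: pd.Series, max_cells: int, seed: int = 0) -> np.ndarray:
    """Return positional indices with at most `max_cells` cells per label."""
    rng = np.random.default_rng(seed)
    values = labels.to_numpy()
    keep = []
    for label in pd.unique(values):
        positions = np.flatnonzero(values == label)
        n_keep = min(max_cells, positions.size)
        keep.append(rng.choice(positions, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))


# Toy labels standing in for adata.obs["cluster_labels_res.0.8"].
toy_labels = pd.Series(["Neutrophils"] * 200 + ["pDC"] * 30 + ["mDC"] * 90)
idx = subsample_per_label(toy_labels, max_cells=80, seed=0)
print(toy_labels.iloc[idx].value_counts().to_dict())
```

Applied to the real data, the returned positions could be used in the same way as `downSamplingIndexes`, for example `adata[idx].copy()`.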

If subsampling is not needed (the data are relatively small), the entire AnnData object can simply be converted to an R object.

In [21]:

adata_r = adata.copy()
adata_r

Out[21]:

AnnData object with n_obs × n_vars = 87860 × 26807
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'nCount_HTO', 'nFeature_HTO', 'percent.mito', 'percent.hb', 'HTO_maxID', 'HTO_secondID', 'HTO_margin', 'HTO_classification', 'HTO_classification.global', 'hash.ID', 'demultID', 'donor', 'onset_of_symptoms', 'days_after_onset', 'sampleID', 'date_of_sampling', 'experiment', 'cartridge', 'platform', 'purification', 'cells', 'age', 'sex', 'group_per_sample', 'who_per_sample', 'disease_stage', 'diagnosis', 'oxygen', 'outcome', 'comorbidities', 'COVID.19.related_medication_and_anti.microbials', 'primary_complaint', 'RNA_snn_res.0.8', 'cluster_labels_res.0.8', 'new.order', 'hpca.labels', 'blueprint.labels', 'monaco.labels', 'immune.labels', 'dmap.labels', 'hemato.labels', 'n_counts'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'hvg'
    obsm: 'X_pca', 'X_umap'
    layers: 'counts'

Running MuSiC¶

In [24]:

%%R
library(MuSiC)
library(Biobase)

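We first collect all remaining inputs needed to run MuSiC:

adata_r - the single-cell matrix that will be used as the reference.
cell_subsets_r - the cell type identities.
bulk - the bulk samples to be deconvolved.
sc_genes - the genes (markers) used in the MuSiC analysis.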

In [52]:

cell_subsets_r = adata_r.obs["cluster_labels_res.0.8"].astype(str).copy()
sc_genes = adata_r.var_names

MuSiC runs the deconvolution one sample at a time. In the log below, the four-digit identifiers of the form 9*** are sample names.

In [53]:

%%R -i adata_r,cell_subsets_r,bulk,sc_genes -o musicRes
df = data.frame(cellNames = cell_subsets_r, Sample = factor(rep(1, dim(adata_r@colData)[1])))
row.names(df) = row.names(adata_r@colData)
df = new("AnnotatedDataFrame", data = df) # Cell type identities are stored as an AnnotatedDataFrame

# Creating an ExpressionSet from the single cell matrix
scDataMatrix = Matrix::as.matrix(adata_r@assays@data@listData[[1]])
row.names(scDataMatrix) = sc_genes
scDataMatrix = scDataMatrix[rowSums(scDataMatrix)>0,] # Removing genes with no reads
SCDataES <- Biobase::ExpressionSet(assayData=scDataMatrix,phenoData = df, protocolData = df)

bulkDataES <- Biobase::ExpressionSet(assayData=as.matrix(bulk)) # Creating an ExpressionSet from the bulk matrix
musicRes = MuSiC::music_prop(bulk.eset = bulkDataES, sc.eset = SCDataES, clusters = 'cellNames') # Running MuSiC

R[write to console]: Creating Relative Abundance Matrix...

R[write to console]: Creating Variance Matrix...

R[write to console]: Creating Library Size Matrix...

R[write to console]: Used 20407 common genes...

R[write to console]: Used 23 cell types in deconvolution...

R[write to console]: 9088 has common genes 18250 ...

R[write to console]: 9089 has common genes 17774 ...

R[write to console]: 9091 has common genes 18598 ...

R[write to console]: 9092 has common genes 16184 ...

R[write to console]: 9093 has common genes 18585 ...

R[write to console]: 9094 has common genes 18701 ...

R[write to console]: 9095 has common genes 18646 ...

R[write to console]: 9096 has common genes 17247 ...

R[write to console]: 9097 has common genes 17922 ...

R[write to console]: 9098 has common genes 17987 ...

R[write to console]: 9099 has common genes 18223 ...

R[write to console]: 9100 has common genes 18786 ...

R[write to console]: 9101 has common genes 16458 ...

R[write to console]: 9102 has common genes 17604 ...

R[write to console]: 9103 has common genes 17429 ...

R[write to console]: 9104 has common genes 18118 ...

R[write to console]: 9105 has common genes 15286 ...

R[write to console]: 9106 has common genes 17237 ...

R[write to console]: 9107 has common genes 15735 ...

R[write to console]: 9108 has common genes 17387 ...

R[write to console]: 9109 has common genes 16863 ...

R[write to console]: 9110 has common genes 17736 ...

R[write to console]: 9112 has common genes 17513 ...

R[write to console]: 9113 has common genes 14950 ...

R[write to console]: 9114 has common genes 17318 ...

R[write to console]: 9116 has common genes 14518 ...

R[write to console]: 9117 has common genes 13708 ...

R[write to console]: 9118 has common genes 17526 ...

R[write to console]: 9119 has common genes 17274 ...

R[write to console]: 9120 has common genes 16520 ...

R[write to console]: 9121 has common genes 15434 ...

R[write to console]: 9165 has common genes 16907 ...

R[write to console]: 9166 has common genes 16534 ...

R[write to console]: 9167 has common genes 17115 ...

R[write to console]: 9168 has common genes 16621 ...

R[write to console]: 9169 has common genes 17251 ...

R[write to console]: 9170 has common genes 17353 ...

R[write to console]: 9171 has common genes 14603 ...

R[write to console]: 9172 has common genes 14141 ...

R[write to console]: 9122 has common genes 17865 ...

R[write to console]: 9123 has common genes 18009 ...

R[write to console]: 9124 has common genes 18390 ...

R[write to console]: 9125 has common genes 17235 ...

R[write to console]: 9126 has common genes 18432 ...

R[write to console]: 9127 has common genes 17662 ...

R[write to console]: 9128 has common genes 16875 ...

R[write to console]: 9129 has common genes 17396 ...

R[write to console]: 9130 has common genes 19469 ...

R[write to console]: 9131 has common genes 18756 ...

In [54]:

# Create the final output matrix (samples x cell types)
# musicRes[0] is the first element of the list returned by music_prop
music_frac = pd.DataFrame(musicRes[0])
music_frac.index = bulk.columns
# Assumes the columns returned by MuSiC follow the same sorted order as np.unique
music_frac.columns = np.unique(cell_subsets_r)

Outputs and Validations¶

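The main output of all cell type deconvolution methods is an N x M matrix, where:

N (the number of rows) is the number of samples
M (the number of columns) is the number of cell types

Each entry of the matrix gives the share of one cell type in one sample. In most cases the cell type composition of a sample is reported as fractions, so the values are non-negative and sum to one.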

In [55]:

music_frac

Out[55]:

	B_cells_1	B_cells_2	CD14_Monocytes_1	CD14_Monocytes_2	CD14_Monocytes_3	CD16_Monocytes	CD34+ GATA2+ cells	CD4_T_cells_1	CD4_T_cells_3	...	Immature Neutrophils_2	Megakaryocytes	NK_cells	Neutrophils_1	Neutrophils_2	Neutrophils_3	Neutrophils_4	Plasmablast	pDC
9088	0.025266	0.000007	0.009212	0.007646	0.004111	0.000000	0.021343	0.055596	0.022603	...	0.004530	0.037775	0.035206	0.181982	0.000000	0.000000	0.096936	0.007348	0.000000
9089	0.000000	0.000000	0.011156	0.013301	0.000000	0.000000	0.011891	0.000000	0.000000	...	0.000594	0.012610	0.000000	0.402237	0.000000	0.000000	0.117283	0.000086	0.000000
9091	0.000091	0.000000	0.005645	0.005848	0.000309	0.000247	0.003656	0.000000	0.000000	...	0.000000	0.063919	0.011228	0.343444	0.000000	0.002390	0.155726	0.000936	0.000003
9092	0.004650	0.000000	0.000000	0.000000	0.002024	0.000000	0.003379	0.000000	0.000000	...	0.000000	0.091188	0.004478	0.294886	0.000000	0.000000	0.175727	0.001784	0.000000
9093	0.008431	0.000000	0.000000	0.000000	0.032441	0.000000	0.004801	0.020848	0.001814	...	0.000000	0.025210	0.033820	0.100954	0.090013	0.000000	0.276946	0.000504	0.000000
9094	0.099594	0.000000	0.000425	0.003925	0.000000	0.000000	0.016660	0.016462	0.027290	...	0.002317	0.000000	0.029776	0.020330	0.082443	0.000000	0.128015	0.000000	0.000000
9095	0.001680	0.000000	0.000000	0.048237	0.000000	0.000000	0.006444	0.000000	0.000000	...	0.000000	0.017268	0.017518	0.278576	0.000000	0.000000	0.217223	0.000744	0.000000
9096	0.019351	0.000000	0.000785	0.028895	0.000000	0.000000	0.007645	0.015595	0.000000	...	0.000000	0.003820	0.060052	0.277122	0.061254	0.000000	0.026074	0.002298	0.000000
9097	0.015769	0.000000	0.000399	0.031298	0.000000	0.000000	0.009512	0.028784	0.000000	...	0.000000	0.005165	0.061176	0.275073	0.036719	0.000000	0.056550	0.005490	0.000000
9098	0.015624	0.000000	0.015810	0.011449	0.004735	0.000000	0.011961	0.038063	0.007417	...	0.000000	0.012727	0.051597	0.230531	0.002209	0.000000	0.022288	0.013302	0.000000
9099	0.015419	0.000000	0.000125	0.009149	0.026507	0.000000	0.013332	0.088082	0.022198	...	0.000000	0.013670	0.043131	0.073787	0.085893	0.000000	0.084937	0.001153	0.000000
9100	0.038787	0.000000	0.000000	0.006566	0.000000	0.000000	0.014090	0.035495	0.002291	...	0.000648	0.000000	0.038413	0.170412	0.044578	0.000000	0.078720	0.008307	0.000000
9101	0.003037	0.000000	0.000000	0.025247	0.000000	0.000000	0.002739	0.000516	0.000000	...	0.000000	0.063421	0.000801	0.360854	0.005524	0.000000	0.111516	0.000328	0.000000
9102	0.008490	0.000000	0.000000	0.025538	0.000577	0.000000	0.010780	0.028101	0.000000	...	0.000156	0.025812	0.037690	0.284248	0.054286	0.000000	0.014266	0.001326	0.000000
9103	0.018500	0.013523	0.000000	0.013188	0.000000	0.000000	0.008010	0.079577	0.022855	...	0.000000	0.001177	0.058535	0.160046	0.053514	0.000065	0.028745	0.000483	0.000000
9104	0.035979	0.000430	0.000000	0.021775	0.000000	0.000000	0.028406	0.039231	0.014846	...	0.000000	0.000000	0.046156	0.097204	0.000000	0.000000	0.000000	0.005464	0.000370
9105	0.029334	0.000000	0.012962	0.000000	0.014837	0.028242	0.000414	0.025195	0.007911	...	0.000000	0.017720	0.059066	0.177582	0.000000	0.000000	0.000000	0.000380	0.000000
9106	0.004070	0.000000	0.000000	0.016110	0.000000	0.000000	0.010171	0.027782	0.000945	...	0.000000	0.000000	0.014647	0.072228	0.222037	0.000000	0.245327	0.002171	0.000000
9107	0.012894	0.000000	0.000733	0.030433	0.000000	0.000000	0.029637	0.064172	0.021531	...	0.000000	0.025631	0.008832	0.267814	0.000000	0.000000	0.030072	0.000000	0.000000
9108	0.015228	0.000000	0.000170	0.022236	0.000000	0.000000	0.018144	0.004921	0.005200	...	0.000000	0.000000	0.031351	0.211152	0.000000	0.000000	0.131391	0.003772	0.000000
9109	0.036592	0.000000	0.003224	0.031009	0.004927	0.000000	0.020842	0.071492	0.021438	...	0.000212	0.001298	0.054454	0.162173	0.001041	0.000000	0.060488	0.003435	0.000000
9110	0.017529	0.000000	0.000000	0.017824	0.000000	0.000000	0.018764	0.000000	0.000000	...	0.002869	0.040700	0.006030	0.246727	0.000000	0.000000	0.086884	0.003378	0.000000
9112	0.031905	0.003329	0.003657	0.038107	0.000000	0.000000	0.019290	0.045312	0.002296	...	0.000000	0.019050	0.053040	0.092170	0.000000	0.000000	0.000000	0.001793	0.000000
9113	0.045796	0.000000	0.003949	0.002286	0.000000	0.000000	0.006261	0.027501	0.000000	...	0.000000	0.025152	0.062904	0.257408	0.037749	0.000000	0.000000	0.001175	0.000000
9114	0.079590	0.006749	0.002371	0.021311	0.000000	0.000000	0.017618	0.111433	0.010972	...	0.000000	0.000000	0.109078	0.000000	0.000000	0.000000	0.000000	0.000221	0.000919
9116	0.000000	0.000000	0.000000	0.003446	0.000000	0.000000	0.005269	0.000000	0.000000	...	0.000160	0.114610	0.022406	0.243280	0.014482	0.000000	0.112478	0.000000	0.000000
9117	0.001213	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.040750	0.001302	0.397878	0.000000	0.000000	0.166156	0.000911	0.000000
9118	0.019181	0.000000	0.000000	0.041347	0.000000	0.000000	0.015371	0.040876	0.012439	...	0.000219	0.000000	0.023484	0.157825	0.031902	0.000000	0.137689	0.004242	0.000000
9119	0.036951	0.002960	0.000000	0.021283	0.000000	0.000000	0.010837	0.062043	0.015091	...	0.000076	0.006685	0.045383	0.197475	0.040867	0.000457	0.026496	0.001799	0.000200
9120	0.026611	0.000781	0.000000	0.000468	0.000000	0.000000	0.009929	0.040768	0.003021	...	0.000000	0.024230	0.019282	0.235651	0.017742	0.000000	0.095742	0.002195	0.000066
9121	0.005433	0.000000	0.000454	0.056839	0.000000	0.000000	0.019102	0.052365	0.011261	...	0.000000	0.048820	0.035044	0.133349	0.024641	0.000000	0.109806	0.002113	0.000000
9165	0.017479	0.000000	0.000000	0.008826	0.000000	0.000000	0.013337	0.013744	0.003507	...	0.001174	0.056218	0.020538	0.298131	0.008824	0.000000	0.125729	0.002626	0.000000
9166	0.017747	0.000000	0.000000	0.000000	0.000000	0.000000	0.013241	0.025715	0.000000	...	0.000000	0.011393	0.015843	0.236452	0.071622	0.000000	0.133848	0.001551	0.000000
9167	0.046150	0.004153	0.000586	0.029131	0.000000	0.000000	0.015318	0.090483	0.048755	...	0.000000	0.000000	0.097628	0.052031	0.000000	0.000000	0.000000	0.000000	0.000393
9168	0.038030	0.000000	0.008009	0.032995	0.000000	0.000000	0.015055	0.019886	0.000122	...	0.000750	0.000000	0.017958	0.237107	0.000000	0.000000	0.034764	0.000354	0.000000
9169	0.040401	0.000000	0.000000	0.017743	0.000000	0.000000	0.012677	0.018073	0.000072	...	0.000000	0.000000	0.037869	0.152494	0.000000	0.000000	0.000000	0.002456	0.000000
9170	0.001406	0.000000	0.000000	0.024063	0.000000	0.000000	0.006337	0.000000	0.000000	...	0.000000	0.034838	0.036429	0.237845	0.186743	0.000000	0.048438	0.000954	0.000000
9171	0.004296	0.000412	0.000000	0.037080	0.007700	0.000000	0.008648	0.013575	0.000000	...	0.000353	0.000761	0.010418	0.191832	0.000000	0.001537	0.144852	0.001676	0.000000
9172	0.004607	0.000422	0.016581	0.029331	0.000000	0.000019	0.010006	0.031523	0.000000	...	0.002156	0.018158	0.009102	0.241950	0.000000	0.004823	0.104858	0.015060	0.000000
9122	0.044896	0.006798	0.000000	0.013065	0.000000	0.000000	0.022354	0.067906	0.011042	...	0.000000	0.000000	0.030763	0.148463	0.000000	0.000000	0.004873	0.000000	0.000000
9123	0.036793	0.021414	0.000000	0.013317	0.000000	0.000000	0.027294	0.108887	0.038598	...	0.000000	0.000000	0.037017	0.085672	0.000000	0.000000	0.000000	0.000000	0.000000
9124	0.047839	0.005392	0.000000	0.000000	0.000000	0.000000	0.023929	0.095896	0.000795	...	0.000000	0.000000	0.036180	0.072006	0.000000	0.000000	0.000000	0.000000	0.000000
9125	0.047328	0.008957	0.000000	0.000000	0.000000	0.000000	0.025473	0.122208	0.012059	...	0.000000	0.000000	0.045437	0.036258	0.000000	0.000000	0.000000	0.000000	0.000000
9126	0.052478	0.007149	0.000000	0.005149	0.000000	0.000000	0.026425	0.088092	0.008506	...	0.000000	0.000000	0.081810	0.031593	0.000000	0.000000	0.000000	0.000000	0.000000
9127	0.031883	0.000000	0.000000	0.007273	0.000000	0.000000	0.019330	0.083236	0.011144	...	0.000000	0.000000	0.040736	0.128713	0.000000	0.000000	0.000000	0.000000	0.000000
9128	0.049647	0.006472	0.000000	0.003832	0.000000	0.000000	0.024532	0.087951	0.010565	...	0.000000	0.000000	0.069194	0.045462	0.000000	0.000000	0.000000	0.000000	0.000000
9129	0.071568	0.000000	0.000000	0.015938	0.000000	0.000000	0.031499	0.087671	0.014011	...	0.000000	0.000000	0.033312	0.047501	0.000000	0.000000	0.000000	0.000006	0.000000
9130	0.066304	0.000000	0.002818	0.001318	0.000000	0.000000	0.021337	0.090575	0.002119	...	0.000000	0.000000	0.060295	0.080626	0.000000	0.000000	0.000000	0.000000	0.000000
9131	0.039067	0.009778	0.000000	0.000000	0.000000	0.000000	0.023478	0.111008	0.018155	...	0.000000	0.000000	0.040867	0.056370	0.000000	0.000000	0.000000	0.000000	0.000000

49 rows × 23 columns
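The two properties expected of a fractions matrix (non-negative entries and rows that sum to one) are easy to verify programmatically. The sketch below uses a hypothetical helper `check_fractions` on a toy matrix; the values are illustrative and are not taken from the MuSiC output.

```python
import numpy as np
import pandas as pd


def check_fractions(frac: pd.DataFrame, tol: float = 1e-6) -> pd.DataFrame:
    """Summarise per sample whether the estimated fractions form a valid composition."""
    row_sums = frac.sum(axis=1)
    return pd.DataFrame(
        {
            "row_sum": row_sums,
            "sums_to_one": np.isclose(row_sums, 1.0, atol=tol),
            "non_negative": (frac >= 0).all(axis=1),
        }
    )


# Toy samples x cell types matrix (illustrative values, not taken from the MuSiC output).
toy_frac = pd.DataFrame(
    [[0.6, 0.3, 0.1], [0.5, 0.2, 0.2]],
    index=["sample1", "sample2"],
    columns=["Neutrophils", "T_cells", "B_cells"],
)
report = check_fractions(toy_frac)
print(report)
```

Calling `check_fractions(music_frac)` would apply the same check to the deconvolution result.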

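If the true proportions of a cell type are known for the individual samples, we can validate the deconvolution results. Here, neutrophil counts were measured, and the estimated fractions of the neutrophil clusters Neutrophils_1 and Neutrophils_4 show the strongest positive correlation with the measured counts (Pearson r of about 0.45).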

In [56]:

neutCounts = metadata["Total.neutrophil.count...mm3."].astype(float)
has_count = ~np.isnan(neutCounts.to_numpy())  # samples with a measured neutrophil count
n_cell_types = music_frac.shape[1]
# np.corrcoef stacks the cell type fractions (rows 0..n_cell_types-1) and the measured
# counts (last row); the last row holds each cell type's correlation with the counts
subsetCorMuSiC = pd.Series(
    np.corrcoef(
        music_frac.to_numpy()[has_count, :].transpose(),
        neutCounts[has_count],
    )[n_cell_types, 0:n_cell_types]
)
subsetCorMuSiC.index = music_frac.columns
subsetCorMuSiC.sort_values()

C:\Users\Shennor\AppData\Roaming\Python\Python38\site-packages\numpy\lib\function_base.py:2691: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[:, None]
C:\Users\Shennor\AppData\Roaming\Python\Python38\site-packages\numpy\lib\function_base.py:2692: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[None, :]

Out[56]:

NK_cells                 -0.511184
Eosinophils              -0.498070
CD4_T_cells_1            -0.403485
B_cells_1                -0.325301
B_cells_2                -0.281896
CD8_T_cells              -0.281529
pDC                      -0.264388
CD4_T_cells_3            -0.180847
CD16_Monocytes           -0.165263
CD14_Monocytes_2         -0.164895
CD34+ GATA2+ cells       -0.132763
CD14_Monocytes_3         -0.063592
Neutrophils_2            -0.058816
Immature Neutrophils_1   -0.004655
CD14_Monocytes_1         -0.002719
Plasmablast               0.015157
Neutrophils_3             0.057406
Megakaryocytes            0.292929
Immature Neutrophils_2    0.322763
Neutrophils_1             0.446441
Neutrophils_4             0.447348
CD4_T_cells_2                  NaN
mDC                            NaN
dtype: float64
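Two details of this validation can be illustrated on toy data. First, a column that is constant across samples (for instance, a cell type estimated as zero everywhere) has zero variance, so its Pearson correlation is undefined and reported as NaN, which is consistent with the runtime warnings above. Second, since the measurement is a total neutrophil count, it can be informative to sum the fractions of all neutrophil sub-clusters before correlating. All values below are synthetic assumptions.

```python
import numpy as np
import pandas as pd

# Toy estimated fractions (samples x cell types); values are illustrative only.
toy_frac = pd.DataFrame(
    {
        "Neutrophils_1": [0.30, 0.10, 0.25, 0.05],
        "Neutrophils_2": [0.20, 0.15, 0.30, 0.05],
        "NK_cells": [0.05, 0.20, 0.05, 0.30],
        "mDC": [0.0, 0.0, 0.0, 0.0],  # never detected: zero variance
    },
    index=["s1", "s2", "s3", "s4"],
)
# Toy measured neutrophil counts with one missing value.
toy_counts = pd.Series([5200.0, 2100.0, np.nan, 900.0], index=toy_frac.index)

measured = toy_counts.dropna()
estimated = toy_frac.loc[measured.index]

# Pearson correlation of each cell-type column with the measured counts.
per_type = estimated.corrwith(measured)

# Sum all neutrophil sub-clusters into one total before correlating.
neutrophil_cols = [c for c in estimated.columns if c.startswith("Neutrophils")]
total_corr = estimated[neutrophil_cols].sum(axis=1).corr(measured)

print(per_type.round(3).to_dict(), round(total_corr, 3))
```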

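We can also look for significant changes in cell composition between patients and healthy controls. Here, for each cell type, we compute a p-value for the change using Student's t-test.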

In [57]:

healty_vs_covid = pd.Series(
    [
        sci.stats.ttest_ind(
            music_frac[cell].to_numpy()[metadata["status"].to_numpy() == "covid"],
            music_frac[cell].to_numpy()[metadata["status"].to_numpy() == "healthy"],
        )[1]
        for cell in music_frac.columns
    ]
)
healty_vs_covid.index = music_frac.columns
healty_vs_covid.sort_values()

Out[57]:

CD4_T_cells_1             3.535377e-08
Eosinophils               2.174486e-07
CD34+ GATA2+ cells        1.313440e-06
B_cells_2                 4.179994e-05
Neutrophils_1             1.133780e-04
B_cells_1                 3.829060e-04
Neutrophils_4             4.508343e-04
CD14_Monocytes_2          1.010730e-02
Megakaryocytes            1.307137e-02
Plasmablast               1.902793e-02
Neutrophils_2             6.436426e-02
NK_cells                  1.110367e-01
Immature Neutrophils_1    1.321325e-01
CD14_Monocytes_1          1.452041e-01
CD4_T_cells_3             1.676061e-01
Immature Neutrophils_2    1.806682e-01
CD14_Monocytes_3          2.641251e-01
CD8_T_cells               3.427982e-01
pDC                       3.571902e-01
Neutrophils_3             4.002572e-01
CD16_Monocytes            6.143465e-01
CD4_T_cells_2                      NaN
mDC                                NaN
dtype: float64
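Since one test is run per cell type, the p-values may need to be adjusted for multiple testing before they are interpreted. The sketch below implements the Benjamini-Hochberg procedure as one common option; it is not part of the original analysis, and the p-values used are toy numbers.

```python
import numpy as np
import pandas as pd


def benjamini_hochberg(p_values: pd.Series) -> pd.Series:
    """Benjamini-Hochberg adjusted p-values; NaN entries are left as NaN."""
    valid = p_values.dropna().sort_values()
    m = len(valid)
    ranks = np.arange(1, m + 1)
    scaled = valid.to_numpy() * m / ranks
    # Enforce monotonicity from the largest p-value downwards and cap at 1.
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1].clip(max=1.0)
    out = pd.Series(np.nan, index=p_values.index, dtype=float)
    out.loc[valid.index] = adjusted
    return out


# Toy p-values (illustrative only), including one undefined test.
toy_p = pd.Series(
    [0.001, 0.02, 0.03, 0.5, np.nan],
    index=["typeA", "typeB", "typeC", "typeD", "typeE"],
)
q_values = benjamini_hochberg(toy_p)
print(q_values)
```

Undefined tests (NaN) are excluded from the number of hypotheses, which matches the situation where a cell type has a constant estimated fraction.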

The boxplot below compares the two conditions for the cell type with the smallest p-value.

In [58]:

# Pick the cell type with the smallest p-value (ignoring NaNs)
selected_cell = healty_vs_covid.index[np.nanargmin(healty_vs_covid.to_numpy())]
status_df = pd.DataFrame(metadata["status"])
status_df["cellFraction"] = music_frac[selected_cell]
status_df.boxplot(by="status")

Out[58]:

<AxesSubplot:title={'center':'cellFraction'}, xlabel='[status]'>

Limitations and traps¶

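Although single-cell data are less prone to missing cell types than pre-selected sorted cells, handling missing cell types is still considered a major challenge in cell type deconvolution {cite}Cobos2020,Jin2021. In addition, the number of cell types in the signature matrix strongly affects deconvolution accuracy, since more cell types generally lead to lower accuracy {cite}Newman2019.

While several deconvolution methods correctly infer the proportions of the major cell components, their performance varies for rare or closely related components. To deal with collinearity among the predictors, some methods perform feature selection before deconvolution by choosing a subset of genes, called the signature gene list, that minimizes the correlation between cell types.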
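The effect of feature selection on collinearity can be illustrated with the condition number of the signature matrix, a generic measure of how ill-conditioned the regression problem is. This is an illustrative diagnostic only, not the selection criterion of any specific tool, and the toy signature below is an assumption.

```python
import numpy as np
import pandas as pd


def condition_number(signature: pd.DataFrame) -> float:
    """2-norm condition number of a genes x cell types signature matrix."""
    return float(np.linalg.cond(signature.to_numpy()))


# Toy signature (genes x cell types): two housekeeping-like genes that are high in
# every cell type, and three marker-like genes that are each specific to one type.
toy_signature = pd.DataFrame(
    {
        "T_cell": [100.0, 90.0, 50.0, 1.0, 2.0],
        "B_cell": [95.0, 100.0, 2.0, 60.0, 1.0],
        "NK_cell": [105.0, 95.0, 1.0, 2.0, 55.0],
    },
    index=["hk1", "hk2", "markerT", "markerB", "markerNK"],
)

all_genes = condition_number(toy_signature)
markers_only = condition_number(toy_signature.loc[["markerT", "markerB", "markerNK"]])
print(round(all_genes, 2), round(markers_only, 2))
```

Genes that are expressed at similar levels in every cell type make the profiles nearly parallel and inflate the condition number; restricting the matrix to cell type-specific genes brings it close to one.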

New directions¶

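Beyond the methods described above, other tools can improve deconvolution or resolve higher-resolution states within cell types:

AutoGeneS {cite}Aliee2021 proposes a multi-objective feature selection approach that can be integrated into deconvolution platforms. AutoGeneS requires no prior knowledge of marker genes and selects genes by optimizing several criteria simultaneously: minimizing the correlation and maximizing the distance between cell types. It can be applied to reference profiles from various sources, such as single-cell experiments or sorted cell populations.

CPM {cite}Frishberg2019 is a cell state deconvolution method that uses the single-cell space to uncover compositional changes within each cell type, typically capturing changes in cell composition along continuous cell trajectories. By focusing on variation within cell types rather than between them (unlike most deconvolution methods), CPM can detect compositional changes across cell subpopulations or continuous changes along specific cell trajectories.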

Key takeaways¶

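Use unbiased reference data with no missing cell types.
Normalize the reference data using library size normalization.
Feature selection before deconvolution can improve accuracy, especially for linear deconvolution methods.

Deconvolution is a challenging task (see Limitations and traps). We therefore recommend trying several deconvolution methods and choosing one after evaluating the results biologically.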