Single Cell Transcriptomics (PBMC3k)#

[1]:
import scanpy as sc
import singleCellHaystack as hs

sc.set_figure_params(facecolor="white", dpi=90)

Load data#

We load the human PBMC data available from scanpy.

[2]:
adata = sc.datasets.pbmc3k_processed()
adata
[2]:
AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'
[3]:
sc.pl.umap(adata, color="louvain")
/Users/diez/miniconda3/envs/singleCellHaystack/lib/python3.10/site-packages/scanpy/plotting/_tools/scatterplots.py:392: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
../_images/tutorials_pbmc3k_5_1.png

Filter some cells#

We remove Megakaryocytes since they are not interesting in this example and they will drive the top results.

[4]:
mega = adata.obs.louvain.isin(["Megakaryocytes"])
mega.value_counts()
[4]:
False    2623
True       15
Name: louvain, dtype: int64
[5]:
adata = adata[~mega, :]
adata
[5]:
View of AnnData object with n_obs × n_vars = 2623 × 1838
    obs: 'n_genes', 'percent_mito', 'n_counts', 'louvain'
    var: 'n_cells'
    uns: 'draw_graph', 'louvain', 'louvain_colors', 'neighbors', 'pca', 'rank_genes_groups'
    obsm: 'X_pca', 'X_tsne', 'X_umap', 'X_draw_graph_fr'
    varm: 'PCs'
    obsp: 'distances', 'connectivities'

Run haystack#

We run haystack using PCA coordinates and otherwise default parameters. Since the data has been scaled, we need to pass the raw data to avoid negative numbers. Also this will allow haystack to run on all the genes, not only those to calculate the PCA coordinates.

[6]:
res = hs.haystack(adata.raw.to_adata(), coord="pca")
> starting haystack ...
> entering array method ...
> scaling coordinates ...
> calculating feature stds ...
> removing 6 genes with zero variance ...
> calculating grid points ...
> calculating distance to cells ...
> calculating densities ...
> calculating Q dist ...
> calculating KLD for 13708 features ...
100%|██████████| 13708/13708 [00:02<00:00, 6130.26it/s]
> calculating feature's CV ...
> selecting genes to randomize ...
> calculating randomized KLD ...
100%|██████████| 100/100 [00:02<00:00, 40.36it/s]
> calculating P values ...
> done.

Check some QC plots#

We can examine some of the QC plots. First the randomization fits. These are used to calculate KLD from randomized expression levels for a subset of genes, in order to estimate the values to the entire gene set.

[7]:
hs.plot_rand_fit(res, "mean")
hs.plot_rand_fit(res, "sd")
../_images/tutorials_pbmc3k_14_0.png
../_images/tutorials_pbmc3k_14_1.png

The ranking of logpval and distribution of pval gives us some idea of how many significant genes we can detect.

[8]:
hs.plot_pval_rank(res)
hs.plot_pval_hist(res)
../_images/tutorials_pbmc3k_16_0.png
../_images/tutorials_pbmc3k_16_1.png

Visualizing the results#

A pandas DataFrame with the results can be obtained. By default the results are sorted by logpval_adj.

[9]:
sum = res.top_features(n=10)
sum
[9]:
gene KLD pval pval_adj logpval logpval_adj
9290 GZMB 0.004207 2.671463e-67 3.662041e-63 -66.573251 -62.436277
3200 FGFBP2 0.003713 2.990241e-63 4.099023e-59 -62.524294 -58.387320
7149 PRF1 0.002337 2.624538e-57 3.597716e-53 -56.580947 -52.443973
12799 CD79A 0.002090 1.626349e-56 2.229399e-52 -55.788786 -51.651812
1802 GNLY 0.002155 3.416672e-56 4.683574e-52 -55.466397 -51.329423
... ... ... ... ... ... ...
13172 A1BG-AS1 0.000098 9.969061e-01 1.000000e+00 -0.001346 0.000000
344 SYNC 0.000088 9.974878e-01 1.000000e+00 -0.001092 0.000000
1731 PCYOX1 0.000065 9.980668e-01 1.000000e+00 -0.000840 0.000000
13247 CRKL 0.000038 9.988439e-01 1.000000e+00 -0.000502 0.000000
9925 STOML1 0.000099 9.989249e-01 1.000000e+00 -0.000467 0.000000

13708 rows × 6 columns

[10]:
sc.pl.umap(adata, color=sum.gene.iloc[:4], ncols=2, cmap="Spectral_r")
../_images/tutorials_pbmc3k_20_0.png

Clustering results into gene modules#

We can cluster the results using K-means. We need to pass the AnnData object, the haystack results and the number of clusters.

[11]:
gene_mods = hs.cluster_genes(adata.raw.to_adata(), res, n_clusters=4, random_state=1)
gene_mods.sort_values("cluster").groupby("cluster").head(3)
[11]:
gene cluster
0 GZMB 0
48 CCL5 0
46 CTSW 0
63 CEBPD 1
64 SPI1 1
71 NCF2 1
3 CD79A 2
96 PKIG 2
11 HLA-DQA1 2
61 CD1C 3
6 FCER1A 3
20 CLEC10A 3
[12]:
hs.plot_gene_clusters(adata.raw.to_adata(), gene_mods, basis="umap", ncols=2, figsize=[7, 6],   color_map="Spectral_r")
../_images/tutorials_pbmc3k_24_0.png

Check some of the genes in Module 0.

[13]:
mod0 = gene_mods.groupby("cluster").get_group(0)
mod0.head(4)
[13]:
gene cluster
0 GZMB 0
1 FGFBP2 0
2 PRF1 0
4 GNLY 0
[14]:
sc.pl.umap(adata, color=mod0.gene.iloc[:4], cmap="Spectral_r", ncols=2)
../_images/tutorials_pbmc3k_27_0.png

Export results#

[15]:
#sum.to_csv("pbmc3k-results.tsv")