Proportion analysis using scanpro

A notebook to test the scanpro tool.

The dataset used in this tutorial is a simulated dataset with three conditions and five cell clusters. The proportions of these clusters do not change significantly across conditions.

[1]:
import anndata as ad
import numpy as np

from scanpro import scanpro, anova, t_test
from scanpro.linear_model import create_design
from scanpro.get_transformed_props import get_transformed_props
[2]:
# Set seed to make simulations reproducible
np.random.seed(1)

Load anndata

[3]:
adata = ad.read_h5ad('scanpro_tutorial.h5ad')
adata
[3]:
AnnData object with n_obs × n_vars = 37079 × 10
    obs: 'sample', 'cluster', 'group'

The anndata.obs table has to include 3 columns, for example:

  • sample

  • biological condition / group

  • clusters/celltypes

[4]:
adata.obs
[4]:
sample cluster group
cells
cell_1 S1 c1 cond_1
cell_2 S1 c1 cond_1
cell_3 S1 c1 cond_1
cell_4 S1 c1 cond_1
cell_5 S1 c1 cond_1
... ... ... ...
cell_37075 S6 c5 cond_3
cell_37076 S6 c5 cond_3
cell_37077 S6 c5 cond_3
cell_37078 S6 c5 cond_3
cell_37079 S6 c5 cond_3

37079 rows × 3 columns


Run scanpro

  • You can use the wrapper function “scanpro”.

  • Parameters are:

    • anndata object

    • clusters: name of clusters/celltypes column in obs table

    • sample: name of sample column in obs table

    • cond: name of condition/group column in obs table

    • transform: type of transformation; logit or arcsin, default is logit

[5]:
out = scanpro(adata, clusters_col='cluster', samples_col='sample', conds_col='group', transform='logit')

out.results
There are more than 2 conditions. ANOVA will be performed...
Done!
[5]:
baseline_props mean_props_cond_1 mean_props_cond_2 mean_props_cond_3 f_statistics p_values adjusted_p_values
clusters
c1 0.409531 0.006399 0.007605 0.010391 1.015703 0.362148 0.452685
c2 0.374093 0.043683 0.075932 0.043750 1.907890 0.148393 0.370983
c3 0.156935 0.137301 0.157528 0.184273 0.503052 0.604682 0.604682
c4 0.051161 0.401103 0.270136 0.415870 2.019649 0.132702 0.370983
c5 0.008280 0.411515 0.488800 0.345716 1.347237 0.259957 0.433262
  • the scanpro function generates a ScanproResult object that has the following methods:

    • results: returns the results as a dataframe

    • props: returns clusters proportions as a dataframe

    • prop_trans: returns transformed proportions as a dataframe

    • counts: returns counts as a dataframe

    • design: returns the design matrix

    • plot and plot_samples: for plotting

[6]:
out.props
[6]:
cluster c1 c2 c3 c4 c5
sample
S1 0.005436 0.063044 0.150121 0.388767 0.392632
S2 0.007362 0.024321 0.124481 0.413438 0.430398
S3 0.010331 0.055854 0.097744 0.241583 0.594487
S4 0.004880 0.096009 0.217312 0.298688 0.383112
S5 0.010221 0.041330 0.202284 0.434774 0.311391
S6 0.010562 0.046169 0.166261 0.396966 0.380042

Visualisation of proportions for each condition in each cluster

  • You can plot proportion per condition using out.plot() function. Parameters are:

    • kind: kind of plot: stripplot, barplot or boxplot

    • clusters: specify which clusters you want to plot, default is all

    • save: path to where you want to save the plot

  • or out.plot_samples() to plot a bar plot of proportions of clusters in samples. Parameters are:

    • stacked: True for stacked bar plot

    • x: Whether samples or clusters should be on the x axis

    • save: path to where you want to save the plot

[7]:
out.plot()
_images/proportion_analysis_16_0.png
  • You can use kind to set different styles for the plot:

[8]:
out.plot(kind="boxplot")
_images/proportion_analysis_18_0.png
[9]:
out.plot(kind="barplot")
_images/proportion_analysis_19_0.png
  • You can use clusters to limit the clusters shown:

[10]:
out.plot(clusters=["c1"])
_images/proportion_analysis_21_0.png
  • Proportions of cells per sample can be plotted with plot_samples:

[11]:
out.plot_samples(stacked=True)
_images/proportion_analysis_23_0.png
[12]:
out.plot_samples(x="clusters")
_images/proportion_analysis_24_0.png

Test specific conditions

  • by using the parameter conditions you can specify conditions of interests you want to compare instead of comparing all the conditions

[13]:
out = scanpro(adata, clusters_col='cluster', samples_col='sample',
              conds_col='group', transform='logit', conditions=['cond_1', 'cond_2'])

out.results
There are 2 conditions. T-Test will be performed...
Done!
[13]:
baseline_props mean_props_cond_1 mean_props_cond_2 prop_ratio t_statistics p_values adjusted_p_values
clusters
c1 0.409531 0.006399 0.007605 0.841368 -0.346112 0.734065 0.806602
c2 0.374093 0.043683 0.075932 0.575292 -1.821747 0.088493 0.299088
c3 0.156935 0.137301 0.157528 0.871598 -0.249180 0.806602 0.806602
c4 0.051161 0.401103 0.270136 1.484819 1.650414 0.119635 0.299088
c5 0.008280 0.411515 0.488800 0.841889 -0.858832 0.403949 0.673249
[14]:
out.plot()
_images/proportion_analysis_28_0.png

Bootstrapping method

  • The bootstrapping method splits the original samples into n_reps replicated and run the analysis on the simulated samples.

  • The bootstrapping method will run for non-replicated datasets in the following scenarios:

    • If the number of samples in samples_col is equal to the number of conditions in conds_col

    • If samples_col is not given or set to None, the dataset is assumed to be not replicated and scanpro will run the bootstrapping method by setting conds_col temporarily as samples_col

[15]:
out_boot = scanpro(adata,
                   clusters_col='cluster',
                   conds_col='group',
                   # no sample column given,
                   transform='arcsin',
                   n_reps=8,  # number of replicates generated by bootstrapping
                   conditions=['cond_1', 'cond_2']  # optional: you can also specify conditions, or test all of them
                   )

out_boot.results
Your data doesn't have replicates! Artificial replicates will be simulated to run scanpro.
Simulation may take some minutes...
Generating 8 replicates and running 100 simulations...
Finished 100 simulations in 10.57 seconds
[15]:
baseline_props mean_props_cond_1 mean_props_cond_2 p_values
clusters
c1 0.409531 0.006574 0.013976 0.657169
c2 0.374093 0.045318 0.073735 0.429444
c3 0.156935 0.138966 0.143901 0.596598
c4 0.051161 0.402073 0.256700 0.110951
c5 0.008280 0.407068 0.511688 0.218473
  • To plot the bootstrapping results, use the plot function

[16]:
out_boot.plot()
_images/proportion_analysis_33_0.png

Partially replicated data

We now imagine that the cond_1 dataset was not replicated. This will trigger a partial bootstrapping method.

[17]:
adata_partial = adata.copy()
adata_partial.obs.loc[adata_partial.obs["group"] == "cond_1", "sample"] = "S1"
[18]:
out_boot = scanpro(adata_partial,
                   clusters_col='cluster',
                   conds_col='group',
                   samples_col='sample',
                   transform='arcsin',
                   n_reps=8,
                   conditions=['cond_1', 'cond_2']
                   )

out_boot.results
The following conditions don't have replicates: cond_1
Both normal scanpro and sim_scanpro will be performed.
Running scanpro with original replicates...
There are 2 conditions. T-Test will be performed...
Done!
Running scanpro with simulated replicates...
Generating 8 replicates and running 100 simulations...
Finished 100 simulations in 10.59 seconds
To access results for original replicates, run <out.results>, and <out.sim_results> for simulated results
[18]:
baseline_props mean_props_cond_1 mean_props_cond_2 prop_ratio t_statistics p_values adjusted_p_values
clusters
c1 0.409531 0.006264 0.007605 0.823593 -0.123120 0.906333 0.906333
c2 0.374093 0.046387 0.075932 0.610909 -0.999106 0.359144 0.743984
c3 0.156935 0.139092 0.157528 0.882967 -0.233504 0.823712 0.906333
c4 0.051161 0.399380 0.270136 1.478442 2.325340 0.062309 0.311547
c5 0.008280 0.408877 0.488800 0.836493 -0.819015 0.446390 0.743984
[20]:
out_boot.plot()
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
_images/proportion_analysis_39_1.png

Scanpro workflow

The scanpro function is a wrapper function, to have more controls you can perform the steps below and see how the scanpro function works

  • Get proportions and transformed proportions using get_transformed_props function

[21]:
counts, props, prop_trans = get_transformed_props(adata, sample_col='sample', cluster_col='cluster', transform='logit')
[22]:
counts
[22]:
cluster c1 c2 c3 c4 c5
sample
S1 45 522 1243 3219 3251
S2 46 152 778 2584 2690
S3 54 292 511 1263 3108
S4 16 315 713 980 1257
S5 68 275 1346 2893 2072
S6 78 341 1228 2932 2807
[23]:
props
[23]:
cluster c1 c2 c3 c4 c5
sample
S1 0.005436 0.063044 0.150121 0.388767 0.392632
S2 0.007362 0.024321 0.124481 0.413438 0.430398
S3 0.010331 0.055854 0.097744 0.241583 0.594487
S4 0.004880 0.096009 0.217312 0.298688 0.383112
S5 0.010221 0.041330 0.202284 0.434774 0.311391
S6 0.010562 0.046169 0.166261 0.396966 0.380042
[24]:
prop_trans
[24]:
cluster c1 c2 c3 c4 c5
sample
S1 -5.198679 -2.698113 -1.733536 -0.452733 -0.436502
S2 -4.893820 -3.688879 -1.950396 -0.350114 -0.280595
S3 -4.553587 -2.826251 -2.222005 -1.144124 0.381781
S4 -5.288267 -2.241485 -1.281479 -0.853905 -0.476938
S5 -4.566171 -3.142492 -1.372083 -0.262750 -0.793816
S6 -4.533900 -3.027010 -1.612283 -0.418396 -0.489623
[25]:
# proportions of each cluster in all samples
baseline_props = adata.obs['cluster'].value_counts()/adata.obs.shape[0]
baseline_props
[25]:
cluster
c5    0.409531
c4    0.374093
c3    0.156935
c2    0.051161
c1    0.008280
Name: count, dtype: float64
  • Create design matrix using create_design function. The matrix has samples as rows and conditions and columns, whith 1s and 0s to specify which condition belongs to which sample (independent variables). The design matrix will be used as the X matrix in the linear model fit.

  • Parameters:

    • data: anndata object, or a pandas dataframe

    • samples: name of samples column

    • conds: name of conditions/groups column

[26]:
design = create_design(data=adata, samples='sample', conds='group')
design
[26]:
group cond_1 cond_2 cond_3
sample
S1 1 0 0
S2 1 0 0
S3 0 1 0
S4 0 1 0
S5 0 0 1
S6 0 0 1
[27]:
# for anova -> more than 2 conditions/groups
# The coef parameter specify the columns in the design matrix containing conditions of interest
# that you want to compare
coef = np.arange(len(design.columns))
coef
[27]:
array([0, 1, 2])
  • Since we have more than 2 conditions, use the function anova to compare. If you have two conditions or want to only compare two, then specify a contras variable cont = [1, -1] with 1 and -1 match the columns of the design matrix with two conditions you want to compare. if the design matrix has more than two conditions but you only want to compare two, put 1 and -1 where they match the columns of interest and fill the rest with 0s, in this case, to compare 10x and celseq: cont = [1, -1, 0]

[28]:
anova(props, prop_trans, design, coef, robust=True)
[28]:
mean_props_cond_1 mean_props_cond_2 mean_props_cond_3 f_statistics p_values adjusted_p_values
clusters
c1 0.006399 0.007605 0.010391 1.015703 0.362148 0.452685
c2 0.043683 0.075932 0.043750 1.907890 0.148393 0.370983
c3 0.137301 0.157528 0.184273 0.503052 0.604682 0.604682
c4 0.401103 0.270136 0.415870 2.019649 0.132702 0.370983
c5 0.411515 0.488800 0.345716 1.347237 0.259957 0.433262
  • Additionally, you can specify cofactors of interest and add them to the design matrix, either specify cofactor columns in the adata.obs as parameter cofactor to the create_design function, or add it manually to the design matrix. In our case, we will add pair information that specifies which sample is a replicate

[29]:
X = design.iloc[:,coef]
N = len(X)  # number of samples
p = len(X.columns)  # number of conditions
[30]:
# add pair information to the design matrix as cofactor
cont = np.tile([1, 2], N//2)
new_design = design.copy()
new_design['pair'] = cont
new_design
[30]:
group cond_1 cond_2 cond_3 pair
sample
S1 1 0 0 1
S2 1 0 0 2
S3 0 1 0 1
S4 0 1 0 2
S5 0 0 1 1
S6 0 0 1 2
[31]:
# run anova, taking into account pair information
anova(props, prop_trans, new_design, coef, robust=True)
[31]:
mean_props_cond_1 mean_props_cond_2 mean_props_cond_3 f_statistics p_values adjusted_p_values
clusters
c1 0.006399 0.007605 0.010391 0.719279 0.487104 0.608879
c2 0.043683 0.075932 0.043750 1.351087 0.258959 0.608879
c3 0.137301 0.157528 0.184273 0.356240 0.700304 0.700304
c4 0.401103 0.270136 0.415870 1.430231 0.239254 0.608879
c5 0.411515 0.488800 0.345716 0.954057 0.385175 0.608879
[32]:
# to compare only cond_1 and cond_2, specify a contrast and run t_test
contr = [1,-1,0]
t_test(props, prop_trans, design, contr, robust=True)
[32]:
mean_props_cond_1 mean_props_cond_2 prop_ratio t_statistics p_values adjusted_p_values
clusters
c1 0.006399 0.007605 0.841368 -0.346112 0.734065 0.806602
c2 0.043683 0.075932 0.575292 -1.821747 0.088493 0.299088
c3 0.137301 0.157528 0.871598 -0.249180 0.806602 0.806602
c4 0.401103 0.270136 1.484819 1.650414 0.119635 0.299088
c5 0.411515 0.488800 0.841889 -0.858832 0.403949 0.673249
[ ]: