Proportion analysis using scanpro
A notebook to test the scanpro tool.
The dataset used in this tutorial is a simulated dataset with three conditions and five cell clusters. The proportions of these clusters do not change significantly across conditions.
[1]:
import anndata as ad
import numpy as np
from scanpro import scanpro, anova, t_test
from scanpro.linear_model import create_design
from scanpro.get_transformed_props import get_transformed_props
[2]:
# Set seed to make simulations reproducible
np.random.seed(1)
Load anndata
[3]:
adata = ad.read_h5ad('scanpro_tutorial.h5ad')
adata
[3]:
AnnData object with n_obs × n_vars = 37079 × 10
obs: 'sample', 'cluster', 'group'
The anndata.obs table has to include 3 columns, for example:
sample
biological condition / group
clusters/celltypes
[4]:
adata.obs
[4]:
| sample | cluster | group | |
|---|---|---|---|
| cells | |||
| cell_1 | S1 | c1 | cond_1 |
| cell_2 | S1 | c1 | cond_1 |
| cell_3 | S1 | c1 | cond_1 |
| cell_4 | S1 | c1 | cond_1 |
| cell_5 | S1 | c1 | cond_1 |
| ... | ... | ... | ... |
| cell_37075 | S6 | c5 | cond_3 |
| cell_37076 | S6 | c5 | cond_3 |
| cell_37077 | S6 | c5 | cond_3 |
| cell_37078 | S6 | c5 | cond_3 |
| cell_37079 | S6 | c5 | cond_3 |
37079 rows × 3 columns
Run scanpro
You can use the wrapper function “scanpro”.
Parameters are:
anndata object
clusters: name of clusters/celltypes column in obs table
sample: name of sample column in obs table
cond: name of condition/group column in obs table
transform: type of transformation; logit or arcsin, default is logit
[5]:
out = scanpro(adata, clusters_col='cluster', samples_col='sample', conds_col='group', transform='logit')
out.results
There are more than 2 conditions. ANOVA will be performed...
Done!
[5]:
| baseline_props | mean_props_cond_1 | mean_props_cond_2 | mean_props_cond_3 | f_statistics | p_values | adjusted_p_values | |
|---|---|---|---|---|---|---|---|
| clusters | |||||||
| c1 | 0.409531 | 0.006399 | 0.007605 | 0.010391 | 1.015703 | 0.362148 | 0.452685 |
| c2 | 0.374093 | 0.043683 | 0.075932 | 0.043750 | 1.907890 | 0.148393 | 0.370983 |
| c3 | 0.156935 | 0.137301 | 0.157528 | 0.184273 | 0.503052 | 0.604682 | 0.604682 |
| c4 | 0.051161 | 0.401103 | 0.270136 | 0.415870 | 2.019649 | 0.132702 | 0.370983 |
| c5 | 0.008280 | 0.411515 | 0.488800 | 0.345716 | 1.347237 | 0.259957 | 0.433262 |
the
scanprofunction generates a ScanproResult object that has the following methods:results: returns the results as a dataframe
props: returns clusters proportions as a dataframe
prop_trans: returns transformed proportions as a dataframe
counts: returns counts as a dataframe
design: returns the design matrix
plot and plot_samples: for plotting
[6]:
out.props
[6]:
| cluster | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| sample | |||||
| S1 | 0.005436 | 0.063044 | 0.150121 | 0.388767 | 0.392632 |
| S2 | 0.007362 | 0.024321 | 0.124481 | 0.413438 | 0.430398 |
| S3 | 0.010331 | 0.055854 | 0.097744 | 0.241583 | 0.594487 |
| S4 | 0.004880 | 0.096009 | 0.217312 | 0.298688 | 0.383112 |
| S5 | 0.010221 | 0.041330 | 0.202284 | 0.434774 | 0.311391 |
| S6 | 0.010562 | 0.046169 | 0.166261 | 0.396966 | 0.380042 |
Visualisation of proportions for each condition in each cluster
You can plot proportion per condition using
out.plot()function. Parameters are:kind: kind of plot:
stripplot,barplotorboxplotclusters: specify which clusters you want to plot, default is all
save: path to where you want to save the plot
or
out.plot_samples()to plot a bar plot of proportions of clusters in samples. Parameters are:stacked: True for stacked bar plot
x: Whether samples or clusters should be on the x axis
save: path to where you want to save the plot
[7]:
out.plot()
You can use
kindto set different styles for the plot:
[8]:
out.plot(kind="boxplot")
[9]:
out.plot(kind="barplot")
You can use
clustersto limit the clusters shown:
[10]:
out.plot(clusters=["c1"])
Proportions of cells per sample can be plotted with
plot_samples:
[11]:
out.plot_samples(stacked=True)
[12]:
out.plot_samples(x="clusters")
Test specific conditions
by using the parameter
conditionsyou can specify conditions of interests you want to compare instead of comparing all the conditions
[13]:
out = scanpro(adata, clusters_col='cluster', samples_col='sample',
conds_col='group', transform='logit', conditions=['cond_1', 'cond_2'])
out.results
There are 2 conditions. T-Test will be performed...
Done!
[13]:
| baseline_props | mean_props_cond_1 | mean_props_cond_2 | prop_ratio | t_statistics | p_values | adjusted_p_values | |
|---|---|---|---|---|---|---|---|
| clusters | |||||||
| c1 | 0.409531 | 0.006399 | 0.007605 | 0.841368 | -0.346112 | 0.734065 | 0.806602 |
| c2 | 0.374093 | 0.043683 | 0.075932 | 0.575292 | -1.821747 | 0.088493 | 0.299088 |
| c3 | 0.156935 | 0.137301 | 0.157528 | 0.871598 | -0.249180 | 0.806602 | 0.806602 |
| c4 | 0.051161 | 0.401103 | 0.270136 | 1.484819 | 1.650414 | 0.119635 | 0.299088 |
| c5 | 0.008280 | 0.411515 | 0.488800 | 0.841889 | -0.858832 | 0.403949 | 0.673249 |
[14]:
out.plot()
Bootstrapping method
The bootstrapping method splits the original samples into
n_repsreplicated and run the analysis on the simulated samples.The bootstrapping method will run for non-replicated datasets in the following scenarios:
If the number of samples in samples_col is equal to the number of conditions in conds_col
If samples_col is not given or set to None, the dataset is assumed to be not replicated and scanpro will run the bootstrapping method by setting conds_col temporarily as samples_col
[15]:
out_boot = scanpro(adata,
clusters_col='cluster',
conds_col='group',
# no sample column given,
transform='arcsin',
n_reps=8, # number of replicates generated by bootstrapping
conditions=['cond_1', 'cond_2'] # optional: you can also specify conditions, or test all of them
)
out_boot.results
Your data doesn't have replicates! Artificial replicates will be simulated to run scanpro.
Simulation may take some minutes...
Generating 8 replicates and running 100 simulations...
Finished 100 simulations in 10.57 seconds
[15]:
| baseline_props | mean_props_cond_1 | mean_props_cond_2 | p_values | |
|---|---|---|---|---|
| clusters | ||||
| c1 | 0.409531 | 0.006574 | 0.013976 | 0.657169 |
| c2 | 0.374093 | 0.045318 | 0.073735 | 0.429444 |
| c3 | 0.156935 | 0.138966 | 0.143901 | 0.596598 |
| c4 | 0.051161 | 0.402073 | 0.256700 | 0.110951 |
| c5 | 0.008280 | 0.407068 | 0.511688 | 0.218473 |
To plot the bootstrapping results, use the
plotfunction
[16]:
out_boot.plot()
Partially replicated data
We now imagine that the cond_1 dataset was not replicated. This will trigger a partial bootstrapping method.
[17]:
adata_partial = adata.copy()
adata_partial.obs.loc[adata_partial.obs["group"] == "cond_1", "sample"] = "S1"
[18]:
out_boot = scanpro(adata_partial,
clusters_col='cluster',
conds_col='group',
samples_col='sample',
transform='arcsin',
n_reps=8,
conditions=['cond_1', 'cond_2']
)
out_boot.results
The following conditions don't have replicates: cond_1
Both normal scanpro and sim_scanpro will be performed.
Running scanpro with original replicates...
There are 2 conditions. T-Test will be performed...
Done!
Running scanpro with simulated replicates...
Generating 8 replicates and running 100 simulations...
Finished 100 simulations in 10.59 seconds
To access results for original replicates, run <out.results>, and <out.sim_results> for simulated results
[18]:
| baseline_props | mean_props_cond_1 | mean_props_cond_2 | prop_ratio | t_statistics | p_values | adjusted_p_values | |
|---|---|---|---|---|---|---|---|
| clusters | |||||||
| c1 | 0.409531 | 0.006264 | 0.007605 | 0.823593 | -0.123120 | 0.906333 | 0.906333 |
| c2 | 0.374093 | 0.046387 | 0.075932 | 0.610909 | -0.999106 | 0.359144 | 0.743984 |
| c3 | 0.156935 | 0.139092 | 0.157528 | 0.882967 | -0.233504 | 0.823712 | 0.906333 |
| c4 | 0.051161 | 0.399380 | 0.270136 | 1.478442 | 2.325340 | 0.062309 | 0.311547 |
| c5 | 0.008280 | 0.408877 | 0.488800 | 0.836493 | -0.819015 | 0.446390 | 0.743984 |
[20]:
out_boot.plot()
No artists with labels found to put in legend. Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
Scanpro workflow
The scanpro function is a wrapper function, to have more controls you can perform the steps below and see how the scanpro function works
Get proportions and transformed proportions using
get_transformed_propsfunction
[21]:
counts, props, prop_trans = get_transformed_props(adata, sample_col='sample', cluster_col='cluster', transform='logit')
[22]:
counts
[22]:
| cluster | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| sample | |||||
| S1 | 45 | 522 | 1243 | 3219 | 3251 |
| S2 | 46 | 152 | 778 | 2584 | 2690 |
| S3 | 54 | 292 | 511 | 1263 | 3108 |
| S4 | 16 | 315 | 713 | 980 | 1257 |
| S5 | 68 | 275 | 1346 | 2893 | 2072 |
| S6 | 78 | 341 | 1228 | 2932 | 2807 |
[23]:
props
[23]:
| cluster | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| sample | |||||
| S1 | 0.005436 | 0.063044 | 0.150121 | 0.388767 | 0.392632 |
| S2 | 0.007362 | 0.024321 | 0.124481 | 0.413438 | 0.430398 |
| S3 | 0.010331 | 0.055854 | 0.097744 | 0.241583 | 0.594487 |
| S4 | 0.004880 | 0.096009 | 0.217312 | 0.298688 | 0.383112 |
| S5 | 0.010221 | 0.041330 | 0.202284 | 0.434774 | 0.311391 |
| S6 | 0.010562 | 0.046169 | 0.166261 | 0.396966 | 0.380042 |
[24]:
prop_trans
[24]:
| cluster | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|
| sample | |||||
| S1 | -5.198679 | -2.698113 | -1.733536 | -0.452733 | -0.436502 |
| S2 | -4.893820 | -3.688879 | -1.950396 | -0.350114 | -0.280595 |
| S3 | -4.553587 | -2.826251 | -2.222005 | -1.144124 | 0.381781 |
| S4 | -5.288267 | -2.241485 | -1.281479 | -0.853905 | -0.476938 |
| S5 | -4.566171 | -3.142492 | -1.372083 | -0.262750 | -0.793816 |
| S6 | -4.533900 | -3.027010 | -1.612283 | -0.418396 | -0.489623 |
[25]:
# proportions of each cluster in all samples
baseline_props = adata.obs['cluster'].value_counts()/adata.obs.shape[0]
baseline_props
[25]:
cluster
c5 0.409531
c4 0.374093
c3 0.156935
c2 0.051161
c1 0.008280
Name: count, dtype: float64
Create design matrix using
create_designfunction. The matrix has samples as rows and conditions and columns, whith 1s and 0s to specify which condition belongs to which sample (independent variables). The design matrix will be used as the X matrix in the linear model fit.Parameters:
data: anndata object, or a pandas dataframe
samples: name of samples column
conds: name of conditions/groups column
[26]:
design = create_design(data=adata, samples='sample', conds='group')
design
[26]:
| group | cond_1 | cond_2 | cond_3 |
|---|---|---|---|
| sample | |||
| S1 | 1 | 0 | 0 |
| S2 | 1 | 0 | 0 |
| S3 | 0 | 1 | 0 |
| S4 | 0 | 1 | 0 |
| S5 | 0 | 0 | 1 |
| S6 | 0 | 0 | 1 |
[27]:
# for anova -> more than 2 conditions/groups
# The coef parameter specify the columns in the design matrix containing conditions of interest
# that you want to compare
coef = np.arange(len(design.columns))
coef
[27]:
array([0, 1, 2])
Since we have more than 2 conditions, use the function
anovato compare. If you have two conditions or want to only compare two, then specify a contras variablecont = [1, -1]with 1 and -1 match the columns of the design matrix with two conditions you want to compare. if the design matrix has more than two conditions but you only want to compare two, put 1 and -1 where they match the columns of interest and fill the rest with 0s, in this case, to compare 10x and celseq:cont = [1, -1, 0]
[28]:
anova(props, prop_trans, design, coef, robust=True)
[28]:
| mean_props_cond_1 | mean_props_cond_2 | mean_props_cond_3 | f_statistics | p_values | adjusted_p_values | |
|---|---|---|---|---|---|---|
| clusters | ||||||
| c1 | 0.006399 | 0.007605 | 0.010391 | 1.015703 | 0.362148 | 0.452685 |
| c2 | 0.043683 | 0.075932 | 0.043750 | 1.907890 | 0.148393 | 0.370983 |
| c3 | 0.137301 | 0.157528 | 0.184273 | 0.503052 | 0.604682 | 0.604682 |
| c4 | 0.401103 | 0.270136 | 0.415870 | 2.019649 | 0.132702 | 0.370983 |
| c5 | 0.411515 | 0.488800 | 0.345716 | 1.347237 | 0.259957 | 0.433262 |
Additionally, you can specify cofactors of interest and add them to the design matrix, either specify cofactor columns in the adata.obs as parameter
cofactorto thecreate_designfunction, or add it manually to the design matrix. In our case, we will add pair information that specifies which sample is a replicate
[29]:
X = design.iloc[:,coef]
N = len(X) # number of samples
p = len(X.columns) # number of conditions
[30]:
# add pair information to the design matrix as cofactor
cont = np.tile([1, 2], N//2)
new_design = design.copy()
new_design['pair'] = cont
new_design
[30]:
| group | cond_1 | cond_2 | cond_3 | pair |
|---|---|---|---|---|
| sample | ||||
| S1 | 1 | 0 | 0 | 1 |
| S2 | 1 | 0 | 0 | 2 |
| S3 | 0 | 1 | 0 | 1 |
| S4 | 0 | 1 | 0 | 2 |
| S5 | 0 | 0 | 1 | 1 |
| S6 | 0 | 0 | 1 | 2 |
[31]:
# run anova, taking into account pair information
anova(props, prop_trans, new_design, coef, robust=True)
[31]:
| mean_props_cond_1 | mean_props_cond_2 | mean_props_cond_3 | f_statistics | p_values | adjusted_p_values | |
|---|---|---|---|---|---|---|
| clusters | ||||||
| c1 | 0.006399 | 0.007605 | 0.010391 | 0.719279 | 0.487104 | 0.608879 |
| c2 | 0.043683 | 0.075932 | 0.043750 | 1.351087 | 0.258959 | 0.608879 |
| c3 | 0.137301 | 0.157528 | 0.184273 | 0.356240 | 0.700304 | 0.700304 |
| c4 | 0.401103 | 0.270136 | 0.415870 | 1.430231 | 0.239254 | 0.608879 |
| c5 | 0.411515 | 0.488800 | 0.345716 | 0.954057 | 0.385175 | 0.608879 |
[32]:
# to compare only cond_1 and cond_2, specify a contrast and run t_test
contr = [1,-1,0]
t_test(props, prop_trans, design, contr, robust=True)
[32]:
| mean_props_cond_1 | mean_props_cond_2 | prop_ratio | t_statistics | p_values | adjusted_p_values | |
|---|---|---|---|---|---|---|
| clusters | ||||||
| c1 | 0.006399 | 0.007605 | 0.841368 | -0.346112 | 0.734065 | 0.806602 |
| c2 | 0.043683 | 0.075932 | 0.575292 | -1.821747 | 0.088493 | 0.299088 |
| c3 | 0.137301 | 0.157528 | 0.871598 | -0.249180 | 0.806602 | 0.806602 |
| c4 | 0.401103 | 0.270136 | 1.484819 | 1.650414 | 0.119635 | 0.299088 |
| c5 | 0.411515 | 0.488800 | 0.841889 | -0.858832 | 0.403949 | 0.673249 |
[ ]: