The package moanin was developed to provide a simple and efficient workflow for long time-course gene expression data. It makes it simple to do differential expression between conditions based on either individual condition comparisons per time point, or by fitting spline functions per gene. There are also functions to help with clustering and visualization.
We do not detail all of the features of the package here, but some of the main functions. See our more in-depth workflow paper at https://github.com/NelleV/2019timecourse-rnaseq-pipeline.
We will work with time course data available in the accompanying
timecoursedata
package. This is mRNA-Seq timecourse data on a plant,
sorghum. The plants were sampled every week for 17 weeks. The plants were of
two different varieties (i.e. different genotypes), and under three different
watering conditions:
This data is available on both the leaf and the root sampled from the same
plant. We will concentrate on the leaf samples, available as a dataset
varoquaux2019leaf
.
library(moanin)
library(timecoursedata)
data(varoquaux2019leaf)
names(varoquaux2019leaf)
## [1] "data" "meta"
The data
element contains the gene expression data, while the meta
data
consists of information regarding the samples.
For simplicity, we are going to focus on comparing the three different watering conditions within the “BT642” variety (the variety known for comparative tolerance of pre-flowering drought). We are also going to drop Week 2, which is only measured in the control samples.
whSamples<-with(varoquaux2019leaf$meta,which(Genotype=="BT642" & Week >2))
preData<-varoquaux2019leaf$data[,whSamples]
preMeta<-varoquaux2019leaf$meta[whSamples,]
dim(preData)
## [1] 34211 97
dim(preMeta)
## [1] 97 22
moanin
classThe first step is to create a moanin class, which uses our meta data to create
the needed information for future analysis. Specifically, in addition to
storing the meta information, the function create_moanin_model
will define a
splines basis for future calculations. By default the model is given as a
separate spline basis for each level of a factor Group
, plus a group
intercept with the following R formula
~Group:ns(Timepoint, df=4) + Group + 0
moaninObject <- create_moanin_model(data=preData, meta=preMeta,
group_variable="Condition", time_variable="Week")
moaninObject
## Moanin object on 97 samples containing the following information:
## Group variable given by 'Condition' with the following levels:
## Control Postflowering Preflowering
## 37 21 39
## Time variable given by 'Week'
## Basis matrix with 15 basis_matrix functions
## Basis matrix was constructed with the following spline_formula
## ~Condition + Condition:splines::ns(Week, df = 4) + 0
##
## Information about the data (a SummarizedExperiment object):
## class: SummarizedExperiment
## dim: 34211 97
## metadata(0):
## assays(1): ''
## rownames(34211): Sobic.001G000100.v3.1 Sobic.001G000200.v3.1 ...
## Sobic.K044420.v3.1 Sobic.K044505.v3.1
## rowData names(0):
## colnames(97): 0622162L06 0622162L14 ... 0928169L11 0928169L19
## colData names(23): Barcode libraryName ... Group WeeklyGroup
Moanin
extends the SummarizedExperiment
class, and as such the data as
well as the meta data are saved in one object. This means you can index the
Moanin
object like a regular matrix, and safely be indexing the related meta
data:
moaninObject[,1:2]
## Moanin object on 2 samples containing the following information:
## Group variable given by 'Condition' with the following levels:
## Control Postflowering Preflowering
## 0 0 2
## Time variable given by 'Week'
## Basis matrix with 15 basis_matrix functions
## Basis matrix was constructed with the following spline_formula
## ~Condition + Condition:splines::ns(Week, df = 4) + 0
##
## Information about the data (a SummarizedExperiment object):
## class: SummarizedExperiment
## dim: 34211 2
## metadata(0):
## assays(1): ''
## rownames(34211): Sobic.001G000100.v3.1 Sobic.001G000200.v3.1 ...
## Sobic.K044420.v3.1 Sobic.K044505.v3.1
## rowData names(0):
## colnames(2): 0622162L06 0622162L14
## colData names(23): Barcode libraryName ... Group WeeklyGroup
Because we have count data from mRNA-Seq, we set the argument to
log_transform=TRUE
. This means that many of the functions will internally
use the transformation log(x+1)
on the assay(moaninObject)
for
computations and visualizations. However, for some steps of the DE process the
counts will be used (see below). The user can also just transform the data
herself,
logMoaninObject<-moaninObject
assay(logMoaninObject)<-log(assay(moaninObject)+1)
One type of DE analysis we can do is to compare our watering conditions to
each other, for every time point. We do this via a call to limma
for the DE
analysis, but before we can do this, we need to set up the appropriate
contrasts. For example, “Preflowering - Control” (see ?makeContrasts
in the
limma package). We provide a function to do this for every week, so as to
avoid this step.
preContrasts <- create_timepoints_contrasts(moaninObject,"Preflowering", "Control")
Notice we also get a warning that timepoint 16 is missing in our Control Samples (in fact, some of these time points only have a single observations per time point, which is not particularly appropriate for DE analysis per week).
We can also create contrasts to compare Postflowering and Control, and Preflowering and Postflowering. We get many warnings here, because Post-flowering only has samples after week 9.
postContrasts <- create_timepoints_contrasts(
moaninObject, "Postflowering", "Control" )
prepostContrasts <- create_timepoints_contrasts(
moaninObject, "Postflowering", "Preflowering")
We can run the DE analysis, setting use_voom_weights=TRUE
to make use of
limma
correction for low-counts. We will subset to just the first 500 genes
for illustration purposes, though the full set of genes does not take very
long
weeklyPre <- DE_timepoints(moaninObject[1:500,],
contrasts=c(preContrasts,postContrasts,prepostContrasts),
use_voom_weights=TRUE)
By setting use_voom_weights=TRUE
, the DE step is done after an
upperquartile
normalization via limma
using voom weights (see
?DE_timepoints
for details). Therefore this should only be done if the input
data to the Moanin
class are counts (or on a count scale, such as expected
counts from TopHat); in this case the user should set log_transform=TRUE
in
the construction of the Moanin
object, so that other functions will take the
log appropriately.
The results give the raw p-value (_pval
), the FDR adjusted p-values
(_qval
), and the estimate of log-fold-change (_lfc
) for each week (3
columns for each of the 41.3333333 contrasts):
dim(weeklyPre)
## [1] 500 124
head(weeklyPre[,1:10])
## Preflowering.3-Control.3_pval
## Sobic.001G000100.v3.1 0.76935240
## Sobic.001G000200.v3.1 0.01705126
## Sobic.001G000300.v3.1 0.75896765
## Sobic.001G000400.v3.1 0.23770107
## Sobic.001G000501.v3.1 0.75896765
## Sobic.001G000700.v3.1 0.05734872
## Preflowering.3-Control.3_qval
## Sobic.001G000100.v3.1 0.9806377
## Sobic.001G000200.v3.1 0.1792723
## Sobic.001G000300.v3.1 0.9795879
## Sobic.001G000400.v3.1 0.6740071
## Sobic.001G000501.v3.1 0.9795879
## Sobic.001G000700.v3.1 0.3530203
## Preflowering.3-Control.3_stat
## Sobic.001G000100.v3.1 0.2944697
## Sobic.001G000200.v3.1 -2.4495620
## Sobic.001G000300.v3.1 0.3081522
## Sobic.001G000400.v3.1 -1.1919054
## Sobic.001G000501.v3.1 0.3081522
## Sobic.001G000700.v3.1 -1.9355197
## Preflowering.3-Control.3_lfc
## Sobic.001G000100.v3.1 0.1257285
## Sobic.001G000200.v3.1 -0.2771224
## Sobic.001G000300.v3.1 0.1257285
## Sobic.001G000400.v3.1 -0.3496625
## Sobic.001G000501.v3.1 0.1257285
## Sobic.001G000700.v3.1 -0.3271203
## Preflowering.4-Control.4_pval
## Sobic.001G000100.v3.1 0.2290217
## Sobic.001G000200.v3.1 0.9860256
## Sobic.001G000300.v3.1 0.2083560
## Sobic.001G000400.v3.1 0.9176221
## Sobic.001G000501.v3.1 0.2083560
## Sobic.001G000700.v3.1 0.1970270
## Preflowering.4-Control.4_qval
## Sobic.001G000100.v3.1 0.6639340
## Sobic.001G000200.v3.1 0.9996942
## Sobic.001G000300.v3.1 0.6398885
## Sobic.001G000400.v3.1 0.9921867
## Sobic.001G000501.v3.1 0.6398885
## Sobic.001G000700.v3.1 0.6282489
## Preflowering.4-Control.4_stat
## Sobic.001G000100.v3.1 -1.21449747
## Sobic.001G000200.v3.1 -0.01758376
## Sobic.001G000300.v3.1 -1.27092879
## Sobic.001G000400.v3.1 -0.10383876
## Sobic.001G000501.v3.1 -1.27092879
## Sobic.001G000700.v3.1 1.30364345
## Preflowering.4-Control.4_lfc
## Sobic.001G000100.v3.1 -0.579375694
## Sobic.001G000200.v3.1 -0.002219007
## Sobic.001G000300.v3.1 -0.579375694
## Sobic.001G000400.v3.1 -0.030471152
## Sobic.001G000501.v3.1 -0.579375694
## Sobic.001G000700.v3.1 0.238851909
## Preflowering.5-Control.5_pval
## Sobic.001G000100.v3.1 0.64964783
## Sobic.001G000200.v3.1 0.03384720
## Sobic.001G000300.v3.1 0.63455782
## Sobic.001G000400.v3.1 0.62792598
## Sobic.001G000501.v3.1 0.63455782
## Sobic.001G000700.v3.1 0.08759985
## Preflowering.5-Control.5_qval
## Sobic.001G000100.v3.1 0.9465706
## Sobic.001G000200.v3.1 0.2640320
## Sobic.001G000300.v3.1 0.9417509
## Sobic.001G000400.v3.1 0.9405412
## Sobic.001G000501.v3.1 0.9417509
## Sobic.001G000700.v3.1 0.4388142
Different types of contrasts are available within create_timepoint_contrasts
based on the type
argument. Previously we used the default (per_timepoint_group_diff
) which gives the group differences per timepoint.
We can also use the same kind of approach to compare within a group the difference between two timepoints. Here we look at the difference between adjacent timepoints in “Postflowering”,
preDiffContrasts <- create_timepoints_contrasts(
moaninObject, "Preflowering" ,type="per_group_timepoint_diff")
head(preDiffContrasts)
## [1] "Preflowering.4-Preflowering.3" "Preflowering.5-Preflowering.4"
## [3] "Preflowering.6-Preflowering.5" "Preflowering.7-Preflowering.6"
## [5] "Preflowering.8-Preflowering.7" "Preflowering.9-Preflowering.8"
We can also compare these time differences between the two groups, a contrast that takes the form of
\[(TP i - TP (i-1))[Group1] - (TP i - TP (i-1))[Group2]\]
These contrasts we can create by setting type="group_and_timepoint_diff"
preGroupDiffContrasts <- create_timepoints_contrasts(
moaninObject, "Preflowering", "Control" ,type="group_and_timepoint_diff")
head(preGroupDiffContrasts)
## [1] "Preflowering.4-Preflowering.3-Control.4+Control.3"
## [2] "Preflowering.5-Preflowering.4-Control.5+Control.4"
## [3] "Preflowering.6-Preflowering.5-Control.6+Control.5"
## [4] "Preflowering.7-Preflowering.6-Control.7+Control.6"
## [5] "Preflowering.8-Preflowering.7-Control.8+Control.7"
## [6] "Preflowering.9-Preflowering.8-Control.9+Control.8"
By default the consecutive times are compared, but we could instead compare non-consecutive time points by giving an explicit vector of the timepoints to compare:
preGroupDiffContrasts <- create_timepoints_contrasts(
moaninObject, "Preflowering", "Control" ,
type="group_and_timepoint_diff",timepoints_before=c(6,8),timepoints_after=c(8,9))
head(preDiffContrasts)
## [1] "Preflowering.4-Preflowering.3" "Preflowering.5-Preflowering.4"
## [3] "Preflowering.6-Preflowering.5" "Preflowering.7-Preflowering.6"
## [5] "Preflowering.8-Preflowering.7" "Preflowering.9-Preflowering.8"
We can again submit these contrasts to DE_timepoints
weeklyGroupDiffPre <- DE_timepoints(moaninObject[1:500,],
contrasts=preGroupDiffContrasts,
use_voom_weights=TRUE)
head(weeklyGroupDiffPre)
## Preflowering.8-Preflowering.6-Control.8+Control.6_pval
## Sobic.001G000100.v3.1 0.9615977
## Sobic.001G000200.v3.1 0.8134595
## Sobic.001G000300.v3.1 0.9598148
## Sobic.001G000400.v3.1 0.1576144
## Sobic.001G000501.v3.1 0.9598148
## Sobic.001G000700.v3.1 0.5463024
## Preflowering.8-Preflowering.6-Control.8+Control.6_qval
## Sobic.001G000100.v3.1 0.9872714
## Sobic.001G000200.v3.1 0.9872714
## Sobic.001G000300.v3.1 0.9872714
## Sobic.001G000400.v3.1 0.5710667
## Sobic.001G000501.v3.1 0.9872714
## Sobic.001G000700.v3.1 0.8872794
## Preflowering.8-Preflowering.6-Control.8+Control.6_stat
## Sobic.001G000100.v3.1 -0.04833771
## Sobic.001G000200.v3.1 0.23694118
## Sobic.001G000300.v3.1 -0.05058371
## Sobic.001G000400.v3.1 -1.42988964
## Sobic.001G000501.v3.1 -0.05058371
## Sobic.001G000700.v3.1 0.60654034
## Preflowering.8-Preflowering.6-Control.8+Control.6_lfc
## Sobic.001G000100.v3.1 -0.02919678
## Sobic.001G000200.v3.1 0.03717521
## Sobic.001G000300.v3.1 -0.02919678
## Sobic.001G000400.v3.1 -0.50347547
## Sobic.001G000501.v3.1 -0.02919678
## Sobic.001G000700.v3.1 0.12847281
## Preflowering.9-Preflowering.8-Control.9+Control.8_pval
## Sobic.001G000100.v3.1 0.90550605
## Sobic.001G000200.v3.1 0.25684052
## Sobic.001G000300.v3.1 0.89903123
## Sobic.001G000400.v3.1 0.26376568
## Sobic.001G000501.v3.1 0.89903123
## Sobic.001G000700.v3.1 0.07810927
## Preflowering.9-Preflowering.8-Control.9+Control.8_qval
## Sobic.001G000100.v3.1 0.9872714
## Sobic.001G000200.v3.1 0.6794723
## Sobic.001G000300.v3.1 0.9872714
## Sobic.001G000400.v3.1 0.6868898
## Sobic.001G000501.v3.1 0.9872714
## Sobic.001G000700.v3.1 0.4268266
## Preflowering.9-Preflowering.8-Control.9+Control.8_stat
## Sobic.001G000100.v3.1 -0.1191802
## Sobic.001G000200.v3.1 -1.1441107
## Sobic.001G000300.v3.1 -0.1273901
## Sobic.001G000400.v3.1 -1.1274440
## Sobic.001G000501.v3.1 -0.1273901
## Sobic.001G000700.v3.1 -1.7904741
## Preflowering.9-Preflowering.8-Control.9+Control.8_lfc
## Sobic.001G000100.v3.1 -0.07189418
## Sobic.001G000200.v3.1 -0.17983962
## Sobic.001G000300.v3.1 -0.07352414
## Sobic.001G000400.v3.1 -0.39655540
## Sobic.001G000501.v3.1 -0.07352414
## Sobic.001G000700.v3.1 -0.38191899
The results of such a weekly analysis can be quite difficult to interpret, the number of replicates per week can be quite low, and different numbers of replicates in different weeks can make it difficult to compare the analysis across weeks.
An alternative approach is to fit a smooth spline to each gene per group, and
perform differential tests as to whether there are differences between the
spline functions between two groups. We provide this function in
DE_timecourse
, where the user provides a string of comparisons that they
wish to make:
timecourseContrasts <- c("Preflowering-Control",
"Postflowering-Control",
"Postflowering-Preflowering")
splinePre <- DE_timecourse(moaninObject[1:500,], contrasts=timecourseContrasts,
use_voom_weights=TRUE)
Again, because our input data are counts, we set use_voom_weights=TRUE
to
weight our observations based on their variability. Because we set
log_transform=TRUE
in our creation of moaninObject
, the function will
internally take the log of the data for fitting the splines, but use the
counts for creating the voom weights (see ?DE_timecourse
).
Each contrast has a pval
and a qval
column:
head(splinePre)
## Preflowering-Control_stat Preflowering-Control_pval
## Sobic.001G000100.v3.1 0.01759734 0.9180482
## Sobic.001G000200.v3.1 0.04113499 0.6438478
## Sobic.001G000300.v3.1 NaN NaN
## Sobic.001G000400.v3.1 0.07363492 0.3129612
## Sobic.001G000501.v3.1 NaN NaN
## Sobic.001G000700.v3.1 0.01892117 0.9054591
## Preflowering-Control_qval Postflowering-Control_stat
## Sobic.001G000100.v3.1 0.9609494 40417.09
## Sobic.001G000200.v3.1 0.8361870 35044.71
## Sobic.001G000300.v3.1 NaN NaN
## Sobic.001G000400.v3.1 0.7363792 38323.23
## Sobic.001G000501.v3.1 NaN NaN
## Sobic.001G000700.v3.1 0.9559906 15228.13
## Postflowering-Control_pval Postflowering-Control_qval
## Sobic.001G000100.v3.1 2.789079e-187 1.632631e-186
## Sobic.001G000200.v3.1 9.661330e-185 4.933445e-184
## Sobic.001G000300.v3.1 NaN NaN
## Sobic.001G000400.v3.1 2.469784e-186 1.362639e-185
## Sobic.001G000501.v3.1 NaN NaN
## Sobic.001G000700.v3.1 6.688958e-170 1.680995e-169
## Postflowering-Preflowering_stat
## Sobic.001G000100.v3.1 43198.49
## Sobic.001G000200.v3.1 37305.70
## Sobic.001G000300.v3.1 NaN
## Sobic.001G000400.v3.1 40463.63
## Sobic.001G000501.v3.1 NaN
## Sobic.001G000700.v3.1 15754.90
## Postflowering-Preflowering_pval
## Sobic.001G000100.v3.1 1.821574e-188
## Sobic.001G000200.v3.1 7.444016e-186
## Sobic.001G000300.v3.1 NaN
## Sobic.001G000400.v3.1 2.660553e-187
## Sobic.001G000501.v3.1 NaN
## Sobic.001G000700.v3.1 1.659090e-170
## Postflowering-Preflowering_qval
## Sobic.001G000100.v3.1 1.066287e-187
## Sobic.001G000200.v3.1 3.801200e-185
## Sobic.001G000300.v3.1 NaN
## Sobic.001G000400.v3.1 1.451211e-186
## Sobic.001G000501.v3.1 NaN
## Sobic.001G000700.v3.1 4.126233e-170
There is not a log-fold change column, because it is more complicated to
define the log-fold change over a series of timepoints, where potentially the
means may even switch from being over-expressed to under-expressed. We provide
a function estimate_log_fold_change
which gives the option of estimating
several kinds of log-fold-change. We demonstrate with the method abs_sum
,
which gives the sum of the absolute difference in the means across time
points:
log_fold_change_timepoints <- estimate_log_fold_change(
moaninObject[1:500,], contrasts=timecourseContrasts, method="sum")
head(log_fold_change_timepoints)
## Preflowering-Control Postflowering-Control
## Sobic.001G000100.v3.1 -0.3333333 -0.3333333
## Sobic.001G000200.v3.1 212.6666667 7.0000000
## Sobic.001G000300.v3.1 0.0000000 0.0000000
## Sobic.001G000400.v3.1 103.0000000 78.3333333
## Sobic.001G000501.v3.1 0.0000000 0.0000000
## Sobic.001G000700.v3.1 -68.3333333 875.0000000
## Postflowering-Preflowering
## Sobic.001G000100.v3.1 0.00000
## Sobic.001G000200.v3.1 -205.66667
## Sobic.001G000300.v3.1 0.00000
## Sobic.001G000400.v3.1 -24.66667
## Sobic.001G000501.v3.1 0.00000
## Sobic.001G000700.v3.1 943.33333
See ?estimate_log_fold_change
to see the full set of methods.
The package moanin
also provides a simple utility function
(plot_splines_data
) to visualize gene time-course data from different
conditions. We use this to plot the 10 genes with the largest log-fold-change
in Preflowering-Control
contrast (since we only looked at the first top 500,
these aren't representative of the overall signal in the data set, which is
much stronger).
whSig <- which(splinePre[,"Preflowering-Control_qval"]<0.05)
deGenes <- order(abs(
log_fold_change_timepoints)[whSig,"Preflowering-Control"],
decreasing=TRUE)[1:10]
plot_splines_data(moaninObject[whSig, ][deGenes,],
smooth=TRUE, mar=c(1.5,2.5,2,0.1))
We can also cluster the data based on their spline fits. Here we would like to
work with the log transform of the counts which are stored in moaninObject
,
so we set log_transform=TRUE
.
# First fit the kmeans clusters
kmeans_clusters <- splines_kmeans(moaninObject[1:500,], n_clusters=3,
random_seed=42,
n_init=20)
We then use the plot_splines_data
function, only now applied to the
centroids of the kmeans clustering, to visualize the centroids of each cluster
obtained with the splines k-means model.
plot_splines_data(
data=kmeans_clusters$centroids, moaninObject,
smooth=TRUE)
Because there is variability in how well the genes fit a cluster, we would like to be able to score how well a gene fits a cluster. Furthermore, we often chose a subset of genes based on a filtering process, and we would like to have a mechanism to assign all genes to a cluster.
The function splines_kmeans_score_and_label
gives a goodness-of-fit score
between each gene and each cluster.
scores_and_labels <- splines_kmeans_score_and_label(
object=moaninObject,data=preData[1:2000,], kmeans_clusters=kmeans_clusters)
Notice that we used more genes here (first 2000), even though previously we only used first 500 to make the clusters. These choices were unrealistic, since in practice we would probably pick high variable genes or differentially expressed genes, rather than the first 500, but at least give a sense of how this works.
The result is a list with elements scores
and labels
. scores
gives the
goodness-of-fit score between each gene and each cluster
head(scores_and_labels$scores)
## [,1] [,2] [,3]
## Sobic.001G000100.v3.1 0.9979535 0.9867139 0.9999825
## Sobic.001G000200.v3.1 0.9999940 0.9030656 0.9999523
## Sobic.001G000300.v3.1 1.0000000 1.0000000 1.0000000
## Sobic.001G000400.v3.1 1.0000000 0.8524792 1.0000000
## Sobic.001G000501.v3.1 1.0000000 1.0000000 1.0000000
## Sobic.001G000700.v3.1 0.9689555 0.9146831 0.9965017
labels
gives the best cluster assignment for each gene, but only if its
score in its best cluster is above a certain threshold (see
?splines_kmeans_score_and_label
)
head(scores_and_labels$labels)
## Sobic.001G000100.v3.1 Sobic.001G000200.v3.1 Sobic.001G000300.v3.1
## NA 2 NA
## Sobic.001G000400.v3.1 Sobic.001G000501.v3.1 Sobic.001G000700.v3.1
## 2 NA 2
# How many are not assigned a label?
sum(is.na(scores_and_labels$labels))
## [1] 1000