DelayedDataFrame 1.4.0
As genetic and genomic data sets grow ever larger, the annotation
files that accompany them are also becoming much bigger than
expected. Memory limits in R have been an obstacle to fast and
efficient data processing, because most available R and Bioconductor
packages are built around in-memory data manipulation. With newer
on-disk formats such as HDF5 and GDS, and with the DelayedArray
interface, which represents on-disk data from different back-ends as
an R-user-friendly array structure (e.g., HDF5Array, GDSArray),
high-throughput genetic/genomic data can now be loaded and
manipulated easily within R. However, the annotation files for the
samples and features in high-throughput data are also growing
unexpectedly large. With an ordinary data.frame or DataFrame,
analysis within R becomes increasingly challenging. We have therefore
developed DelayedDataFrame, which behaves very much like data.frame
and DataFrame, while allowing every column to be optionally stored on
disk (e.g., as a DelayedArray with any back-end). Common operations
such as constructing, subsetting, splitting and combining work the
same way as for DataFrame. DelayedDataFrame thus enables efficient
on-disk reading and processing of large-scale annotation files, and
at the same time significantly saves memory, while keeping the
familiar DataFrame metaphor in R and Bioconductor.
Download the package from Bioconductor:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DelayedDataFrame")
The development version is also available from GitHub:
BiocManager::install("Bioconductor/DelayedDataFrame")
Load the package into the R session before use:
library(DelayedDataFrame)
DelayedDataFrame extends the DataFrame data structure with an
additional slot called lazyIndex, which stores the mapping indexes
for each column of the data inside the DelayedDataFrame. It is
similar to data.frame in terms of construction, subsetting,
splitting, combining, and so on. The rownames behave as in DataFrame:
they are not assigned automatically, but only when specified
explicitly in the constructor, DelayedDataFrame(, row.names=...), or
through the setter function rownames()<-.
Here we use GDSArray data as an example to show the characteristics
of DelayedDataFrame. GDSArray is a Bioconductor package that
represents GDS files as objects derived from the DelayedArray package
and the DelayedArray class. It carries the on-disk data path and
represents the GDS nodes in a DelayedArray-derived data structure.
The GDSArray() constructor takes 2 arguments: the file path and the
GDS node name inside the GDS file.
library(GDSArray)
## Loading required package: gdsfmt
##
## Attaching package: 'GDSArray'
## The following object is masked from 'package:utils':
##
## example
file <- SeqArray::seqExampleFileName("gds")
gdsnodes(file)
## [1] "sample.id" "variant.id" "position"
## [4] "chromosome" "allele" "genotype"
## [7] "annotation/id" "annotation/qual" "annotation/filter"
## [10] "annotation/info/AA" "annotation/info/AC" "annotation/info/AN"
## [13] "annotation/info/DP" "annotation/info/HM2" "annotation/info/HM3"
## [16] "annotation/info/OR" "annotation/info/GP" "annotation/info/BN"
## [19] "annotation/format/DP"
varid <- GDSArray(file, "annotation/id")
AA <- GDSArray(file, "annotation/info/AA")
We use the two GDSArray objects to construct a DelayedDataFrame
object.
ddf <- DelayedDataFrame(varid, AA)
The slots of a DelayedDataFrame can be accessed with the lazyIndex(),
nrow() and rownames() (if not NULL) functions. For a newly
constructed DelayedDataFrame object, the initial value of the
lazyIndex slot is NULL for all columns.
lazyIndex(ddf)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1
nrow(ddf)
## [1] 1348
rownames(ddf)
## NULL
lazyIndex slot
The lazyIndex slot is of class LazyIndex, which is defined in the
DelayedDataFrame package and extends the SimpleList class. The
listData slot stores the unique indexes across all columns, and the
index slot stores, for each column of the DelayedDataFrame object,
the position of its index in listData. In the above example, for a
freshly constructed DelayedDataFrame object, the index of every
column is NULL, so both columns point to the single NULL value that
sits in the first position of the listData slot of lazyIndex.
lazyIndex(ddf)@listData
## [[1]]
## NULL
lazyIndex(ddf)@index
## [1] 1 1
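The relationship between listData and index can be illustrated with a small base R sketch. This is not the package's implementation: the plain list `lazy` and the helper `column_index()` are hypothetical stand-ins, used only to show how several columns can share one stored index.

```r
# Hypothetical stand-in for a LazyIndex: a list of unique row indexes
# plus, for each column, the position of its index in that list.
lazy <- list(
    listData = list(NULL),   # a single shared index; NULL means "all rows"
    index    = c(1L, 1L)     # both columns point to listData[[1]]
)

# Look up the row index that applies to column j.
column_index <- function(lazy, j) lazy$listData[[lazy$index[j]]]

is.null(column_index(lazy, 1))   # column 1 uses all rows
is.null(column_index(lazy, 2))   # column 2 shares the same entry

# Two columns with different indexes need two entries in listData.
lazy2 <- list(listData = list(NULL, 1:5), index = c(1L, 2L))
column_index(lazy2, 2)
```

Storing each distinct index once keeps the structure small when many columns have been subsetted in the same way.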
Whenever an operation such as subsetting is performed, the listData
slot of the DelayedDataFrame stays the same and only the lazyIndex
slot is updated, so that the show method and any further statistical
calculation are applied to the subsetted data. For example, here we
subset the DelayedDataFrame object ddf to keep only the first 20
rows, and see how the lazyIndex works. As shown below, after
subsetting, the listData slot of ddf1 stays the same as that of ddf.
The subsetting operation is instead recorded in the lazyIndex slot,
and the lazyIndex, nrows and rownames (if not NULL) slots are all
updated. In this sense the subsetting operation is delayed.
ddf1 <- ddf[1:20,]
identical(ddf@listData, ddf1@listData)
## [1] TRUE
lazyIndex(ddf1)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## index of each column:
## [1] 1 1
nrow(ddf1)
## [1] 20
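The delayed behaviour shown above can be mimicked in base R, independently of the package. The functions `make_lazy()`, `lazy_subset()` and `realize()` below are hypothetical names invented for this sketch. They assume a single row index shared by all columns, which is simpler than the per-column LazyIndex described earlier.

```r
# Hypothetical lazy table: the column data are stored once and a single
# row index is kept beside them (NULL means "all rows").
make_lazy <- function(...) list(data = list(...), rows = NULL)

# Subsetting only records the requested rows; the data are not touched.
lazy_subset <- function(x, i) {
    current <- if (is.null(x$rows)) seq_along(x$data[[1]]) else x$rows
    x$rows <- current[i]
    x
}

# Realization applies the recorded index to every column.
realize <- function(x) {
    if (is.null(x$rows)) return(x$data)
    lapply(x$data, function(col) col[x$rows])
}

tbl  <- make_lazy(id = sprintf("rs%d", 1:100), allele = rep(c("A", "C"), 50))
tbl1 <- lazy_subset(tbl, 1:20)

identical(tbl$data, tbl1$data)   # the stored columns are unchanged
tbl1$rows                        # the operation is recorded as an index
length(realize(tbl1)$id)         # the index is applied on realization
```

The stored columns are only read when `realize()` is called, which is the point at which an on-disk back-end would actually be accessed.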
The lazyIndex is realized only when functions such as DataFrame() or
as.list() are called on the DelayedDataFrame. We show this
realization in the coercion methods section below.
The common methods for data.frame and DataFrame are also defined for
the DelayedDataFrame class, so that they behave similarly on
DelayedDataFrame objects.
Coercion methods between DelayedDataFrame and other data structures
are defined. When coercing from ANY to DelayedDataFrame, the
lazyIndex slot is added automatically, with an initial index of NULL
for each column.
as(letters, "DelayedDataFrame")
## DelayedDataFrame with 26 rows and 1 column
## X
## <character>
## 1 a
## 2 b
## 3 c
## ... ...
## 24 x
## 25 y
## 26 z
as(DataFrame(letters), "DelayedDataFrame")
## DelayedDataFrame with 26 rows and 1 column
## letters
## <character>
## 1 a
## 2 b
## 3 c
## ... ...
## 24 x
## 25 y
## 26 z
(a <- as(list(a=1:5, b=6:10), "DelayedDataFrame"))
## DelayedDataFrame with 5 rows and 2 columns
## a b
## <integer> <integer>
## 1 1 6
## 2 2 7
## 3 3 8
## 4 4 9
## 5 5 10
lazyIndex(a)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1
When a DelayedDataFrame is coerced into another data structure, the
lazyIndex slot is realized and the new data structure is returned.
For example, when a DelayedDataFrame is coerced into a DataFrame
object, the listData slot is updated according to the lazyIndex slot.
df1 <- as(ddf1, "DataFrame")
df1@listData
## $varid
## <20> array of class DelayedArray and type "character":
## 1 2 3 . 19
## "rs111751804" "rs114390380" "rs1320571" . "rs61751002"
## 20
## "rs6691840"
##
## $AA
## <20> array of class DelayedArray and type "character":
## 1 2 3 . 19 20
## "T" "G" "A" . "C" "C"
dim(df1)
## [1] 20 2
[
Two-dimensional [ subsetting of DelayedDataFrame objects works with
integer, character and logical subscripts.
ddf[, 1, drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## varid
## <GDSArray>
## 1 rs111751804
## 2 rs114390380
## 3 rs1320571
## ... ...
## 1346 rs8135982
## 1347 rs116581756
## 1348 rs5771206
ddf[, "AA", drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## AA
## <GDSArray>
## 1 T
## 2 G
## 3 A
## ... ...
## 1346 C
## 1347 G
## 1348 G
ddf[, c(TRUE,FALSE), drop=FALSE]
## DelayedDataFrame with 1348 rows and 1 column
## varid
## <GDSArray>
## 1 rs111751804
## 2 rs114390380
## 3 rs1320571
## ... ...
## 1346 rs8135982
## 1347 rs116581756
## 1348 rs5771206
When [ is used on an already subsetted DelayedDataFrame object, the
lazyIndex, nrows and rownames (if not NULL) slots are updated.
(a <- ddf1[1:10, 2, drop=FALSE])
## DelayedDataFrame with 10 rows and 1 column
## AA
## <DelayedArray>
## 1 T
## 2 G
## 3 A
## ... ...
## 8 C
## 9 G
## 10 G
lazyIndex(a)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## index of each column:
## [1] 1
nrow(a)
## [1] 10
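One way to see why the updated index is still a single vector: subsetting twice is equivalent to subsetting once with the first index subsetted by the second. The base R lines below illustrate this with made-up values (`first`, `second` and `x` are not taken from the example data).

```r
# Row index recorded by a first subset, e.g. rows 21 to 40 of the data.
first <- 21:40

# A second subset is expressed relative to the first result ...
second <- c(1, 3, 5)

# ... so the updated index is the first index subsetted by the second.
composed <- first[second]
composed   # 21 23 25

# Applying the composed index once is the same as subsetting twice.
x <- letters[rep(1:26, 2)]
identical(x[composed], x[first][second])
```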
[[
The [[ subsetting takes an integer or character column subscript and
returns the corresponding column in its original data format.
ddf[[1]]
## <1348> array of class GDSArray and type "character":
## 1 2 3 . 1347
## "rs111751804" "rs114390380" "rs1320571" . "rs116581756"
## 1348
## "rs5771206"
ddf[["varid"]]
## <1348> array of class GDSArray and type "character":
## 1 2 3 . 1347
## "rs111751804" "rs114390380" "rs1320571" . "rs116581756"
## 1348
## "rs5771206"
identical(ddf[[1]], ddf[["varid"]])
## [1] TRUE
rbind/cbind
When rbind is called, the lazyIndex of each input argument is
realized and a new DelayedDataFrame with a NULL lazyIndex is
returned.
ddf2 <- ddf[21:40, ]
(ddfrb <- rbind(ddf1, ddf2))
## DelayedDataFrame with 40 rows and 2 columns
## varid AA
## <DelayedArray> <DelayedArray>
## 1 rs111751804 T
## 2 rs114390380 G
## 3 rs1320571 A
## ... ... ...
## 38 rs1886116 C
## 39 rs115917561 G
## 40 rs61751016 T
lazyIndex(ddfrb)
## LazyIndex of length 1
## [[1]]
## NULL
##
## index of each column:
## [1] 1 1
cbind of DelayedDataFrame objects keeps the existing lazyIndex of
each input argument and carries it into the new DelayedDataFrame
object.
(ddfcb <- cbind(varid = ddf1[,1, drop=FALSE], AA=ddf1[, 2, drop=FALSE]))
## DelayedDataFrame with 20 rows and 2 columns
## varid AA.AA
## <DelayedArray> <DelayedArray>
## 1 rs111751804 T
## 2 rs114390380 G
## 3 rs1320571 A
## ... ... ...
## 18 rs115614983 T
## 19 rs61751002 C
## 20 rs6691840 C
lazyIndex(ddfcb)
## LazyIndex of length 1
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
##
## index of each column:
## [1] 1 1
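In the output above, both columns share a single stored index after cbind. A rough base R sketch of how such a merge could work is given below. The helper `combine_lazy()` and the plain-list representation are assumptions made for illustration and are not the package's code.

```r
# Hypothetical helper: merge the lazy indexes of two tables being
# column-bound, storing each distinct row index only once.
combine_lazy <- function(a, b) {
    listData <- a$listData
    index <- a$index
    for (j in seq_along(b$index)) {
        idx <- b$listData[[b$index[j]]]
        # Reuse an existing entry when the same index is already stored.
        hit <- which(vapply(listData, identical, logical(1), idx))
        if (length(hit) == 0L) {
            listData <- c(listData, list(idx))
            hit <- length(listData)
        }
        index <- c(index, hit[1])
    }
    list(listData = listData, index = index)
}

left  <- list(listData = list(1:20), index = 1L)
right <- list(listData = list(1:20), index = 1L)
same  <- combine_lazy(left, right)   # identical indexes are stored once

other <- list(listData = list(NULL), index = 1L)
mixed <- combine_lazy(left, other)   # distinct indexes are both kept
```

Here `same` holds one stored index referenced by both columns, while `mixed` holds two, one per column.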
sessionInfo()
## R version 4.0.0 (2020-04-24)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.11-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.11-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GDSArray_1.8.0 gdsfmt_1.24.0 DelayedDataFrame_1.4.0
## [4] DelayedArray_0.14.0 IRanges_2.22.0 matrixStats_0.56.0
## [7] S4Vectors_0.26.0 BiocGenerics_0.34.0 BiocStyle_2.16.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4.6 compiler_4.0.0 BiocManager_1.30.10
## [4] GenomeInfoDb_1.24.0 XVector_0.28.0 bitops_1.0-6
## [7] tools_4.0.0 zlibbioc_1.34.0 digest_0.6.25
## [10] evaluate_0.14 lattice_0.20-41 rlang_0.4.5
## [13] Matrix_1.2-18 SeqArray_1.28.0 yaml_2.2.1
## [16] xfun_0.13 GenomeInfoDbData_1.2.3 stringr_1.4.0
## [19] knitr_1.28 Biostrings_2.56.0 grid_4.0.0
## [22] rmarkdown_2.1 bookdown_0.18 magrittr_1.5
## [25] htmltools_0.4.0 GenomicRanges_1.40.0 SNPRelate_1.22.0
## [28] stringi_1.4.6 RCurl_1.98-1.2 crayon_1.3.4