The HDF5Array package provides an on-disk representation of large datasets without the need to load them into memory. Convenient lazy evaluation operations allow the user to manipulate such large data files based on metadata. The DelayedMatrix class in the HDF5Array package provides a way to connect to a large matrix that is stored on disk.
First, we create a small matrix from which to construct a DelayedMatrix object.
# 1e6 random values arranged as 50,000 rows by 20 columns
smallMatrix <- matrix(rnorm(1e6), ncol = 20)
We add row names and column names to the matrix object for compatibility with the MultiAssayExperiment representation.
rownames(smallMatrix) <- paste0("GENE", seq_len(nrow(smallMatrix)))
colnames(smallMatrix) <- paste0("SampleID", seq_len(ncol(smallMatrix)))
Here we use the DelayedArray constructor function to create a DelayedMatrix object.
smallMatrix <- DelayedArray(smallMatrix)
class(smallMatrix)
## [1] "DelayedMatrix"
## attr(,"package")
## [1] "HDF5Array"
head(smallMatrix)
## DelayedMatrix object of 6 x 20 doubles:
## SampleID1 SampleID2 SampleID3 . SampleID19
## GENE1 -2.420677630 -0.413782077 0.005015549 . -0.99317569
## GENE2 -0.337925198 0.637916088 0.868933789 . 1.46162274
## GENE3 1.910359030 -0.326556686 -0.701932040 . -1.23397671
## GENE4 0.558897427 -0.691196153 -2.206321593 . 0.09419008
## GENE5 0.003950360 1.396222911 -0.450121730 . -0.39701154
## GENE6 0.540605930 0.324914221 0.974030108 . 0.45119701
## SampleID20
## GENE1 0.32186753
## GENE2 0.28349618
## GENE3 0.29549465
## GENE4 0.58166981
## GENE5 0.16263374
## GENE6 0.62273026
dim(smallMatrix)
## [1] 50000 20
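To illustrate the lazy evaluation mentioned at the start of this section, here is a minimal sketch. It assumes the DelayedArray package is installed and uses a small stand-in matrix `m` (hypothetical data, not the `smallMatrix` above). Operations on the delayed object are recorded rather than computed, and `as.matrix()` realizes them in memory.

```r
library(DelayedArray)

# Small in-memory stand-in; values are random and purely illustrative
m <- matrix(rnorm(200), ncol = 20)
dm <- DelayedArray(m)

# These operations are recorded on the object, not computed yet
logged <- log2(abs(dm) + 1)
firstBlock <- logged[1:5, 1:3]
class(firstBlock)

# as.matrix() applies the recorded operations and returns an ordinary matrix
realized <- as.matrix(firstBlock)
dim(realized)
```

Only the requested 5 x 3 block is realized at the end; the same pattern applies when the underlying data live on disk.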
Note that a large matrix stored in an HDF5 file can also be loaded using the HDF5Dataset function. For example:
dataLocation <- system.file("extdata", "exMatrix.h5",
    package = "MultiAssayExperiment", mustWork = TRUE)
hdf5Data <- HDF5Dataset(file = dataLocation, name = "exMatrix")
newDelayedMatrix <- DelayedArray(hdf5Data)
class(newDelayedMatrix)
## [1] "DelayedMatrix"
## attr(,"package")
## [1] "HDF5Array"
head(newDelayedMatrix)
## DelayedMatrix object of 6 x 20 doubles:
## [,1] [,2] [,3] . [,19] [,20]
## [1,] 0.3261516 0.4149151 0.8154378 . -0.1876063 0.4156044
## [2,] 0.7243018 -0.9416687 -1.1290878 . -1.2820178 -0.3591841
## [3,] 1.5073255 0.7597899 -0.2756298 . -1.5666680 -0.1523462
## [4,] 0.1668286 1.2684049 0.9082990 . 0.3486139 1.8019041
## [5,] 0.5640491 -2.0222537 0.2881079 . 0.1210501 -1.4873598
## [6,] -0.3504778 -0.4149494 0.9145470 . 0.4291890 -0.4986399
Currently, the rhdf5 package does not store dimnames in the h5 file by default. A request for this feature has been sent to the maintainer of the rhdf5 package.
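One possible workaround is to attach dimnames after the data have been loaded. The sketch below is not taken from the package documentation. It assumes a recent HDF5Array release that provides `writeHDF5Array()` and the `HDF5Array()` constructor, and it uses a temporary file, a dataset name, and GENE/SampleID labels chosen only for illustration.

```r
library(HDF5Array)

# Write a small matrix without dimnames to a temporary HDF5 file
m <- matrix(rnorm(120), ncol = 20)
h5File <- tempfile(fileext = ".h5")
writeHDF5Array(m, filepath = h5File, name = "exMatrix")

# Connect to the on-disk dataset; the result behaves as a DelayedMatrix
fromDisk <- HDF5Array(h5File, name = "exMatrix")

# Attach dimnames to the R object; the file on disk is left unchanged
dimnames(fromDisk) <- list(
    paste0("GENE", seq_len(nrow(fromDisk))),
    paste0("SampleID", seq_len(ncol(fromDisk)))
)
head(rownames(fromDisk))
```

Setting dimnames this way is itself a delayed operation, so the labels exist only on the R object and must be reapplied in each new session.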
DelayedMatrix into a MultiAssayExperiment
The DelayedMatrix alone conforms to the MultiAssayExperiment requirements. As shown below, the DelayedMatrix can be put into a named list and passed to the MultiAssayExperiment constructor function.
HDF5MAE <- MultiAssayExperiment(experiments = list(smallMatrix = smallMatrix))
sampleMap(HDF5MAE)
## DataFrame with 20 rows and 3 columns
## assay primary colname
## <factor> <character> <character>
## 1 smallMatrix SampleID1 SampleID1
## 2 smallMatrix SampleID2 SampleID2
## 3 smallMatrix SampleID3 SampleID3
## 4 smallMatrix SampleID4 SampleID4
## 5 smallMatrix SampleID5 SampleID5
## ... ... ... ...
## 16 smallMatrix SampleID16 SampleID16
## 17 smallMatrix SampleID17 SampleID17
## 18 smallMatrix SampleID18 SampleID18
## 19 smallMatrix SampleID19 SampleID19
## 20 smallMatrix SampleID20 SampleID20
pData(HDF5MAE)
## DataFrame with 20 rows and 0 columns
SummarizedExperiment with DelayedMatrix backend
A more information-rich DelayedMatrix can be created when it is used in conjunction with the SummarizedExperiment class, and it can even include rowRanges. The flexibility of the MultiAssayExperiment API supports classes with minimal requirements. Additionally, this SummarizedExperiment with the DelayedMatrix backend can be part of a bigger MultiAssayExperiment object. Below is a minimal example of how this would work:
HDF5SE <- SummarizedExperiment(assays = smallMatrix)
assay(HDF5SE)
## DelayedMatrix object of 50000 x 20 doubles:
## SampleID1 SampleID2 SampleID3 . SampleID19
## GENE1 -2.420677630 -0.413782077 0.005015549 . -0.99317569
## GENE2 -0.337925198 0.637916088 0.868933789 . 1.46162274
## GENE3 1.910359030 -0.326556686 -0.701932040 . -1.23397671
## GENE4 0.558897427 -0.691196153 -2.206321593 . 0.09419008
## GENE5 0.003950360 1.396222911 -0.450121730 . -0.39701154
## ... . . . . .
## GENE49996 0.8448817 2.3808400 -0.7345404 . 1.6480479
## GENE49997 -1.0270949 -1.2596197 -0.7879921 . 0.1422407
## GENE49998 1.2478262 1.1423250 0.7554497 . 0.8450550
## GENE49999 0.1469239 -0.6815375 -0.2816237 . -1.5396909
## GENE50000 0.2950322 1.3312160 1.9201837 . -1.2363591
## SampleID20
## GENE1 0.32186753
## GENE2 0.28349618
## GENE3 0.29549465
## GENE4 0.58166981
## GENE5 0.16263374
## ... .
## GENE49996 0.7814358
## GENE49997 0.2526213
## GENE49998 -1.3794653
## GENE49999 0.8451531
## GENE50000 -0.6286885
MultiAssayExperiment(list(HDF5SE = HDF5SE))
## A MultiAssayExperiment object of 1 listed
## experiment with a user-defined name and respective class.
## Containing an ExperimentList class object of length 1:
## [1] HDF5SE: SummarizedExperiment with 50000 rows and 20 columns
## To access:
## experiments() - to obtain the ExperimentList instance
## pData() - for the primary/phenotype DataFrame
## sampleMap() - for the sample availability DataFrame
## metadata() - for the metadata object of ANY class
## See also: subsetByAssay(), subsetByRow(), subsetByColumn()
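The rowRanges mentioned above can be supplied when the SummarizedExperiment is built. The following is a minimal sketch, assuming the SummarizedExperiment and DelayedArray packages are installed. The matrix values, the assay name `exprs`, and the genomic coordinates are hypothetical and serve only to show the shape of the call.

```r
library(SummarizedExperiment)
library(DelayedArray)

# Small stand-in matrix with row and column names (illustrative values)
m <- matrix(rnorm(100), ncol = 20,
    dimnames = list(paste0("GENE", 1:5), paste0("SampleID", 1:20)))
dm <- DelayedArray(m)

# Made-up genomic ranges, one per row of the matrix
rr <- GRanges(seqnames = "chr1",
    ranges = IRanges(start = seq(1, by = 1000, length.out = 5), width = 100))
names(rr) <- rownames(dm)

# The assay stays a DelayedMatrix inside the resulting object
rangedSE <- SummarizedExperiment(assays = list(exprs = dm), rowRanges = rr)
class(assay(rangedSE))
rowRanges(rangedSE)
```

An object built this way can be placed in a named list and passed to the MultiAssayExperiment constructor in the same manner as HDF5SE above.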