1 Integrating an HDF5 backend for MultiAssayExperiment

1.1 HDF5Array::DelayedArray Constructor

The HDF5Array package provides an on-disk representation of large datasets, so they can be used without loading them fully into memory. Its lazy evaluation lets the user subset and manipulate such large data files while deferring the actual computation until the values are needed. The DelayedMatrix class in the HDF5Array package provides a way to connect to a large matrix that is stored on disk.

First, we create a small in-memory matrix from which to construct a DelayedMatrix object.

library(MultiAssayExperiment)
library(HDF5Array)
smallMatrix <- matrix(rnorm(10e5), ncol = 20)

We add rownames and column names to the matrix object for compatibility with the MultiAssayExperiment representation.

rownames(smallMatrix) <- paste0("GENE", seq_len(nrow(smallMatrix)))
colnames(smallMatrix) <- paste0("SampleID", seq_len(ncol(smallMatrix)))

Here we use the DelayedArray constructor function to create a DelayedMatrix object.

smallMatrix <- DelayedArray(smallMatrix)
class(smallMatrix)
## [1] "DelayedMatrix"
## attr(,"package")
## [1] "HDF5Array"
head(smallMatrix)
## DelayedMatrix object of 6 x 20 doubles:
##          SampleID1    SampleID2    SampleID3          .  SampleID19
## GENE1 -2.420677630 -0.413782077  0.005015549          . -0.99317569
## GENE2 -0.337925198  0.637916088  0.868933789          .  1.46162274
## GENE3  1.910359030 -0.326556686 -0.701932040          . -1.23397671
## GENE4  0.558897427 -0.691196153 -2.206321593          .  0.09419008
## GENE5  0.003950360  1.396222911 -0.450121730          . -0.39701154
## GENE6  0.540605930  0.324914221  0.974030108          .  0.45119701
##        SampleID20
## GENE1  0.32186753
## GENE2  0.28349618
## GENE3  0.29549465
## GENE4  0.58166981
## GENE5  0.16263374
## GENE6  0.62273026
dim(smallMatrix)
## [1] 50000    20
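The lazy evaluation described above can be illustrated with a short sketch. It assumes a recent Bioconductor release, where the DelayedArray constructor is exported by the DelayedArray package; the object names (m, dm, sub, scaled) are made up for this example.

```r
library(DelayedArray)

set.seed(1)
m <- matrix(rnorm(200), ncol = 20,
            dimnames = list(paste0("GENE", 1:10), paste0("SampleID", 1:20)))
dm <- DelayedArray(m)

# Subsetting and element-wise arithmetic are recorded, not executed.
sub <- dm[1:3, c("SampleID1", "SampleID2")]
scaled <- sub * 2

# The result is still a DelayedMatrix carrying the pending operations.
is(scaled, "DelayedMatrix")

# as.matrix() realizes the pending operations as an ordinary matrix.
realized <- as.matrix(scaled)
all.equal(realized, m[1:3, 1:2] * 2)
```

Only the call to as.matrix() touches the values, which is what makes the same pattern workable when the seed is a large on-disk dataset rather than an in-memory matrix.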

1.2 Importing HDF5 files

Note that a large matrix from an HDF5 file can also be loaded using the HDF5Dataset function.

For example:

dataLocation <- system.file("extdata", "exMatrix.h5",
                            package = "MultiAssayExperiment", mustWork = TRUE)
hdf5Data <- HDF5Dataset(file = dataLocation, name = "exMatrix")
newDelayedMatrix <- DelayedArray(hdf5Data)
class(newDelayedMatrix)
## [1] "DelayedMatrix"
## attr(,"package")
## [1] "HDF5Array"
head(newDelayedMatrix)
## DelayedMatrix object of 6 x 20 doubles:
##            [,1]       [,2]       [,3]     .      [,19]      [,20]
## [1,]  0.3261516  0.4149151  0.8154378     . -0.1876063  0.4156044
## [2,]  0.7243018 -0.9416687 -1.1290878     . -1.2820178 -0.3591841
## [3,]  1.5073255  0.7597899 -0.2756298     . -1.5666680 -0.1523462
## [4,]  0.1668286  1.2684049  0.9082990     .  0.3486139  1.8019041
## [5,]  0.5640491 -2.0222537  0.2881079     .  0.1210501 -1.4873598
## [6,] -0.3504778 -0.4149494  0.9145470     .  0.4291890 -0.4986399
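When the dataset name inside an HDF5 file is not known in advance, it can be looked up before connecting. The sketch below is an assumption-laden illustration, not the vignette's own code: it writes a small stand-in file with rhdf5 (the temporary path and the matrix m are hypothetical) and loads it with the HDF5Array() constructor available in recent releases of the HDF5Array package, rather than HDF5Dataset.

```r
library(HDF5Array)
library(rhdf5)

# Hypothetical stand-in for exMatrix.h5: a small matrix in a temporary file.
m <- matrix(rnorm(60), ncol = 20)
h5path <- tempfile(fileext = ".h5")
h5createFile(h5path)
h5write(m, h5path, "exMatrix")

# List the datasets in the file to find the name to connect to.
contents <- h5ls(h5path)
contents[, c("group", "name", "dim")]

# Connect to the dataset; the values stay on disk until they are requested.
h5m <- HDF5Array(h5path, "exMatrix")
dim(h5m)

# Realizing the object reads the data back for comparison with the original.
all.equal(as.matrix(h5m), m)
```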

1.2.1 Dimnames from HDF5 file

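Currently, the rhdf5 package does not store dimnames in the h5 file by default, which is why the matrix loaded above prints with positional indices instead of row and column names. A request for this feature has been sent to the maintainer of the rhdf5 package.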
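One possible workaround is to attach the names after loading, since MultiAssayExperiment relies on column names to map samples. The following is a minimal sketch under the same assumptions as any recent HDF5Array installation: the file, the dataset name, and the GENE/SampleID labels are made up for illustration.

```r
library(HDF5Array)
library(rhdf5)

# Hypothetical file holding a matrix that was written without dimnames.
m <- matrix(rnorm(60), ncol = 20)
h5path <- tempfile(fileext = ".h5")
h5createFile(h5path)
h5write(m, h5path, "exMatrix")

h5m <- HDF5Array(h5path, "exMatrix")

# Assign names on the R side; this is recorded as a delayed operation,
# so the file on disk is left unchanged.
rownames(h5m) <- paste0("GENE", seq_len(nrow(h5m)))
colnames(h5m) <- paste0("SampleID", seq_len(ncol(h5m)))

head(colnames(h5m))
```

The names live only in the R object, so they need to be reassigned (or kept alongside the file) in each new session.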

1.3 Inserting a DelayedMatrix into a MultiAssayExperiment

A DelayedMatrix on its own conforms to the MultiAssayExperiment requirements for an experiment. As shown below, it can be put into a named list and passed to the MultiAssayExperiment constructor function.

HDF5MAE <- MultiAssayExperiment(experiments = list(smallMatrix = smallMatrix))
sampleMap(HDF5MAE)
## DataFrame with 20 rows and 3 columns
##           assay     primary     colname
##        <factor> <character> <character>
## 1   smallMatrix   SampleID1   SampleID1
## 2   smallMatrix   SampleID2   SampleID2
## 3   smallMatrix   SampleID3   SampleID3
## 4   smallMatrix   SampleID4   SampleID4
## 5   smallMatrix   SampleID5   SampleID5
## ...         ...         ...         ...
## 16  smallMatrix  SampleID16  SampleID16
## 17  smallMatrix  SampleID17  SampleID17
## 18  smallMatrix  SampleID18  SampleID18
## 19  smallMatrix  SampleID19  SampleID19
## 20  smallMatrix  SampleID20  SampleID20
pData(HDF5MAE)
## DataFrame with 20 rows and 0 columns
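The phenotype table above is empty because only the experiments were supplied. A sketch of passing sample-level data alongside a DelayedMatrix follows. It assumes a later MultiAssayExperiment release, in which the phenotype slot and accessor are named colData rather than pData; the sex column and the four-sample matrix are invented for illustration.

```r
library(MultiAssayExperiment)
library(DelayedArray)

m <- matrix(rnorm(40), ncol = 4,
            dimnames = list(paste0("GENE", 1:10), paste0("SampleID", 1:4)))
dm <- DelayedArray(m)

# Hypothetical phenotype data; row names must match the assay column names
# so that the sample map can be generated automatically.
pheno <- S4Vectors::DataFrame(
    sex = c("F", "M", "F", "M"),
    row.names = colnames(dm)
)

mae <- MultiAssayExperiment(
    experiments = list(smallMatrix = dm),
    colData = pheno
)

colData(mae)
sampleMap(mae)
```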

1.3.1 SummarizedExperiment with DelayedMatrix backend

A more information-rich experiment can be created by using the DelayedMatrix as the assay of a SummarizedExperiment, which can additionally carry rowRanges. Because the MultiAssayExperiment API places only minimal requirements on the classes it holds, such a SummarizedExperiment with a DelayedMatrix backend can be part of a bigger MultiAssayExperiment object. Below is a minimal example of how this would work:

HDF5SE <- SummarizedExperiment(assays = smallMatrix)
assay(HDF5SE)
## DelayedMatrix object of 50000 x 20 doubles:
##              SampleID1    SampleID2    SampleID3          .  SampleID19
##     GENE1 -2.420677630 -0.413782077  0.005015549          . -0.99317569
##     GENE2 -0.337925198  0.637916088  0.868933789          .  1.46162274
##     GENE3  1.910359030 -0.326556686 -0.701932040          . -1.23397671
##     GENE4  0.558897427 -0.691196153 -2.206321593          .  0.09419008
##     GENE5  0.003950360  1.396222911 -0.450121730          . -0.39701154
##       ...            .            .            .          .           .
## GENE49996    0.8448817    2.3808400   -0.7345404          .   1.6480479
## GENE49997   -1.0270949   -1.2596197   -0.7879921          .   0.1422407
## GENE49998    1.2478262    1.1423250    0.7554497          .   0.8450550
## GENE49999    0.1469239   -0.6815375   -0.2816237          .  -1.5396909
## GENE50000    0.2950322    1.3312160    1.9201837          .  -1.2363591
##            SampleID20
##     GENE1  0.32186753
##     GENE2  0.28349618
##     GENE3  0.29549465
##     GENE4  0.58166981
##     GENE5  0.16263374
##       ...           .
## GENE49996   0.7814358
## GENE49997   0.2526213
## GENE49998  -1.3794653
## GENE49999   0.8451531
## GENE50000  -0.6286885
MultiAssayExperiment(list(HDF5SE = HDF5SE))
## A MultiAssayExperiment object of 1 listed
##  experiment with a user-defined name and respective class. 
##  Containing an ExperimentList class object of length 1: 
##  [1] HDF5SE: SummarizedExperiment with 50000 rows and 20 columns 
## To access: 
##  experiments() - to obtain the ExperimentList instance 
##  pData() - for the primary/phenotype DataFrame 
##  sampleMap() - for the sample availability DataFrame 
##  metadata() - for the metadata object of ANY class 
## See also: subsetByAssay(), subsetByRow(), subsetByColumn()
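The rowRanges mentioned earlier in this section can be attached as in the sketch below. The genomic coordinates are entirely hypothetical (ten evenly spaced 500 bp ranges on chr1), and the assay name counts is an arbitrary choice for this example.

```r
library(SummarizedExperiment)
library(DelayedArray)

m <- matrix(rnorm(40), ncol = 4,
            dimnames = list(paste0("GENE", 1:10), paste0("SampleID", 1:4)))
dm <- DelayedArray(m)

# Hypothetical coordinates, one range per row of the assay.
gr <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = seq(1, by = 1000, length.out = 10), width = 500)
)
names(gr) <- rownames(dm)

se <- SummarizedExperiment(assays = list(counts = dm), rowRanges = gr)

# The assay is still backed by a DelayedMatrix.
is(assay(se), "DelayedMatrix")

# Range-based subsetting works on the rows without realizing the assay.
nearStart <- subsetByOverlaps(se, GRanges("chr1", IRanges(1, 2500)))
dim(nearStart)
```

An object built this way can be passed to the MultiAssayExperiment constructor in a named list, in the same manner as HDF5SE above.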