Organization of files on a local machine can be cumbersome. This is especially true for local copies of remote resources that may periodically require a new download to have the most updated information available. BiocFileCache is designed to help manage local and remote resource files stored locally. It provides a convenient location to organize files, and once files are added to the cache, the package provides functions to determine whether remote resources are out of date and require a new download.
BiocFileCache is a Bioconductor package and can be installed through biocLite.
source("http://www.bioconductor.org/biocLite.R")
biocLite("BiocFileCache", dependencies = TRUE)
After the package is installed, it can be loaded into the R workspace with
library(BiocFileCache)
The initial step to utilizing BiocFileCache in managing files is to create a cache object specifying a location. We will create a temporary directory for use with examples in this vignette. If a path is not specified upon creation, the default location is a directory ~/.BiocFileCache.
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path)
If the path location exists and has been utilized to store files previously, the previous object will be loaded with any files saved to the cache.
Some utility functions to examine the cache are:
bfccache(bfc)
length(bfc)
show(bfc)
bfcinfo(bfc)
bfccache() will show the cache path. NOTE: Because we are using temporary directories, your path location will be different than shown.
bfccache(bfc)
## [1] "/tmp/Rtmp9F1wKW/tempCacheDir"
length(bfc)
## [1] 0
length() on a BiocFileCache will show the number of files currently being tracked by the BiocFileCache. For more detailed information on what is stored in the BiocFileCache object, there is a show method which will display the object class, cache path, and number of items currently being tracked.
bfc
## class: BiocFileCache
## bfccache: /tmp/Rtmp9F1wKW/tempCacheDir
## bfccount: 0
## For more information see: bfcinfo() or bfcquery()
bfcinfo() will list a table of BiocFileCache resource files being tracked in the cache. It returns a dplyr object of class tbl_sql.
bfcinfo(bfc)
## # Source: lazy query [?? x 8]
## # Database: sqlite 3.19.3
## # [/tmp/Rtmp9F1wKW/tempCacheDir/BiocFileCache.sqlite]
## # ... with 8 variables: rid <chr>, rname <chr>, create_time <dbl>,
## # access_time <dbl>, rpath <chr>, rtype <chr>, fpath <chr>,
## # last_modified_time <dbl>
The table of resource files includes the following information:
rid: resource id. Autogenerated. This is a unique identifier automatically generated when a resource is added to the cache.
rname: resource name. This is given by the user when a resource is added to the cache. It does not have to be unique and can be updated at anytime. We recommend descriptive key words and identifiers.
create_time: The date and time a resource is added to the cache.
access_time: The date and time a resource is utilized within the cache. The access time is updated when the resource is updated or accessed.
rpath: resource path. This is the path to the local file.
rtype: resource type. Either “local” or “web”, indicating if the resource has a remote origin.
fpath: If rtype is “web”, this is the link to the remote resource. It will be utilized to download the remote data.
last_modified_time: For a remote resource, the last_modified (if available) information for the local copy of the data. This information is checked against the remote resource to determine if the local copy is stale and needs to be updated.
Now that we have created the cache object and location, let’s explore adding files that the cache will manage!
Now that a BiocFileCache object and cache location have been created, files can be added to the cache for tracking. There are two functions to add a resource to the cache:
bfcnew()
bfcadd()
The difference between the options: bfcnew() creates an entry for a resource and returns a filepath to save to. As there are many types of data that can be saved in many different ways, bfcnew() allows you to save any R data object in the appropriate manner and still be able to track the saved file. bfcadd() should be utilized when a file already exists or a remote resource is being accessed.
bfcnew() takes the BiocFileCache object and a user-specified rname and returns a path location to save data to. Optionally, you can add the file extension if you know the type of file that will be saved:
savepath <- bfcnew(bfc, "NewResource", ext="RData")
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
savepath
## BFC1
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c21e1034e_527c21e1034e.RData"
## now we can use that path in any save function
m = matrix(1:12, nrow=3)
save(m, file=savepath)
## and that file will be tracked in the cache
bfcinfo(bfc)
## # Source: lazy query [?? x 8]
## # Database: sqlite 3.19.3
## # [/tmp/Rtmp9F1wKW/tempCacheDir/BiocFileCache.sqlite]
## rid rname create_time access_time
## <chr> <chr> <chr> <chr>
## 1 BFC1 NewResource 2017-06-21 00:57:09 2017-06-21 00:57:09
## # ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
## # last_modified_time <chr>
bfcadd() is for existing files or remote resources. The user still specifies an rname of their choosing but must also specify a path to a local file or web resource as fpath. If no fpath is given, the default is to assume the rname is also the path location. If the fpath is a local file, the action argument gives the user a few options: copy the existing file into the cache directory (copy), move the existing file into the cache directory (move), or leave the file wherever it is on the local system yet still track it through the cache object (asis). copy and move will rename the file to the generated cache file path. If the fpath is a remote source, a download is attempted; if it is successful, the file is saved in the cache location and tracked in the cache object, and the original source is added to the cache information as fpath. Relative path locations may also be used, specified with rtype = "relative". This will store a relative location for the file within the cache; only actions copy and move are available for relative paths.
First let’s use local files:
fl1 <- tempfile(); file.create(fl1)
## [1] TRUE
add2 <- bfcadd(bfc, "Test_addCopy", fl1) # copy
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
# returns filepath being tracked in cache
add2
## BFC2
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c57f55a22_file527c34a6d25f"
# the name is the unique rid in the cache
rid2 <- names(add2)
fl2 <- tempfile(); file.create(fl2)
## [1] TRUE
add3 <- bfcadd(bfc, "Test2_addMove", fl2, action="move") # move
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
rid3 <- names(add3)
fl3 <- tempfile(); file.create(fl3)
## [1] TRUE
add4 <- bfcadd(bfc, "Test3_addAsis", fl3, rtype="local",
action="asis") # reference
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
rid4 <- names(add4)
file.exists(fl1) # TRUE - copied from original location
## [1] TRUE
file.exists(fl2) # FALSE - moved from original location
## [1] FALSE
file.exists(fl3) # TRUE - left asis, original location tracked
## [1] TRUE
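The effect of the three actions on the original file can be mimicked with base R file operations. The following is only a sketch: place_file() is a hypothetical helper, the "id_" prefix stands in for the generated cache file name, and none of it is BiocFileCache code.

```r
# Hypothetical helper that mimics the three 'action' choices with base R.
# It is an illustration only, not how BiocFileCache is implemented.
place_file <- function(src, cache_dir, action = c("copy", "move", "asis")) {
    action <- match.arg(action)
    if (action == "asis")
        return(src)                        # track the file where it already lives
    # "id_" is a stand-in for the unique prefix the cache would generate
    dest <- file.path(cache_dir, paste0("id_", basename(src)))
    ok <- switch(action,
        copy = file.copy(src, dest),       # original stays in place
        move = file.rename(src, dest))     # original no longer exists
    stopifnot(ok)
    dest
}

cache_dir <- tempfile("sketchCache")
dir.create(cache_dir)
f_copy <- tempfile(); invisible(file.create(f_copy))
f_move <- tempfile(); invisible(file.create(f_move))
f_asis <- tempfile(); invisible(file.create(f_asis))

p_copy <- place_file(f_copy, cache_dir, "copy")
p_move <- place_file(f_move, cache_dir, "move")
p_asis <- place_file(f_asis, cache_dir, "asis")

# Which original files are still present after each action?
c(copy = file.exists(f_copy), move = file.exists(f_move), asis = file.exists(f_asis))
```

Only the moved file disappears from its original location, which matches the file.exists() checks on fl1, fl2, and fl3.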
Now let’s add some examples with remote sources:
url <- "http://httpbin.org/get"
add5 <- bfcadd(bfc, "TestWeb", fpath=url)
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
rid5 <- names(add5)
url2<- "https://en.wikipedia.org/wiki/Bioconductor"
add6 <- bfcadd(bfc, "TestWeb", fpath=url2)
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
rid6 <- names(add6)
# let's look at our BiocFileCache object now
bfc
## class: BiocFileCache
## bfccache: /tmp/Rtmp9F1wKW/tempCacheDir
## bfccount: 6
## For more information see: bfcinfo() or bfcquery()
bfcinfo(bfc)
## # Source: lazy query [?? x 8]
## # Database: sqlite 3.19.3
## # [/tmp/Rtmp9F1wKW/tempCacheDir/BiocFileCache.sqlite]
## rid rname create_time access_time
## <chr> <chr> <chr> <chr>
## 1 BFC1 NewResource 2017-06-21 00:57:09 2017-06-21 00:57:09
## 2 BFC2 Test_addCopy 2017-06-21 00:57:09 2017-06-21 00:57:09
## 3 BFC3 Test2_addMove 2017-06-21 00:57:09 2017-06-21 00:57:09
## 4 BFC4 Test3_addAsis 2017-06-21 00:57:09 2017-06-21 00:57:10
## 5 BFC5 TestWeb 2017-06-21 00:57:10 2017-06-21 00:57:10
## 6 BFC6 TestWeb 2017-06-21 00:57:10 2017-06-21 00:57:11
## # ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
## # last_modified_time <chr>
Now that we are tracking resources, let’s explore accessing their information!
Before we explore individual resources, we introduce a helper function. Most of the functions provided require the unique rid(s) assigned to a resource. bfcadd() and bfcnew() return the path as a named character vector; the name of the character vector is the rid. However, you may want to access a resource that you added some time ago.
bfcquery()
bfcquery() will take in a key word and search across the rname, rpath, and fpath for any matching entries.
bfcquery(bfc, "Web")
## # Source: lazy query [?? x 8]
## # Database: sqlite 3.19.3
## # [/tmp/Rtmp9F1wKW/tempCacheDir/BiocFileCache.sqlite]
## rid rname create_time access_time
## <chr> <chr> <chr> <chr>
## 1 BFC5 TestWeb 2017-06-21 00:57:10 2017-06-21 00:57:10
## 2 BFC6 TestWeb 2017-06-21 00:57:10 2017-06-21 00:57:11
## # ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
## # last_modified_time <chr>
bfcquery(bfc, "copy")
## # Source: lazy query [?? x 8]
## # Database: sqlite 3.19.3
## # [/tmp/Rtmp9F1wKW/tempCacheDir/BiocFileCache.sqlite]
## rid rname create_time access_time
## <chr> <chr> <chr> <chr>
## 1 BFC2 Test_addCopy 2017-06-21 00:57:09 2017-06-21 00:57:09
## # ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
## # last_modified_time <chr>
q1 <- bfcquery(bfc, "wiki")
q1
## # Source: lazy query [?? x 8]
## # Database: sqlite 3.19.3
## # [/tmp/Rtmp9F1wKW/tempCacheDir/BiocFileCache.sqlite]
## rid rname create_time access_time
## <chr> <chr> <chr> <chr>
## 1 BFC6 TestWeb 2017-06-21 00:57:10 2017-06-21 00:57:11
## # ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
## # last_modified_time <chr>
class(q1)
## [1] "tbl_bfc" "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"
As you can see above, bfcquery() returns an object of class tbl_sql that can be investigated further utilizing methods for these classes, such as the dplyr package methods. The rid can be seen in the first column of the table and used in other functions. To get a quick count of how many objects in the cache matched the query, use bfccount().
bfccount(q1)
## [1] 1
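The keyword search can be pictured as a pattern match over three columns. The sketch below uses a plain data frame with made-up paths and a hypothetical query_sketch() function; the real search runs against the SQLite table. Case-insensitive matching is an assumption here, inferred from "copy" matching Test_addCopy above.

```r
# Hypothetical stand-in for the resource table (paths are made up).
resources <- data.frame(
    rid   = c("BFC2", "BFC5", "BFC6"),
    rname = c("Test_addCopy", "TestWeb", "TestWeb"),
    rpath = c("cache/abc_file1", "cache/def_get", "cache/ghi_Bioconductor"),
    fpath = c(NA, "http://httpbin.org/get",
              "https://en.wikipedia.org/wiki/Bioconductor"),
    stringsAsFactors = FALSE
)

# Illustrative search: keep rows where the pattern occurs in rname, rpath,
# or fpath. grepl() returns FALSE for NA, so local entries without an
# fpath are handled without special cases.
query_sketch <- function(tbl, pattern) {
    hit <- grepl(pattern, tbl$rname, ignore.case = TRUE) |
        grepl(pattern, tbl$rpath, ignore.case = TRUE) |
        grepl(pattern, tbl$fpath, ignore.case = TRUE)
    tbl[hit, , drop = FALSE]
}

query_sketch(resources, "wiki")$rid   # matched through fpath
query_sketch(resources, "copy")$rid   # matched through rname
nrow(query_sketch(resources, "Web"))  # count of matches, in the spirit of bfccount()
```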
[
[ allows for subsetting of the BiocFileCache object. The output will be a BiocFileCacheReadOnly object. Users will still be able to query, remove (from the subset object only), and access resources of the subset; however, the resources cannot be updated.
bfcsubWeb = bfc[paste0("BFC", 5:6)]
bfcsubWeb
## class: BiocFileCacheReadOnly
## bfccache: /tmp/Rtmp9F1wKW/tempCacheDir
## bfccount: 2
## For more information see: bfcinfo() or bfcquery()
bfcinfo(bfcsubWeb)
## # Source: lazy query [?? x 8]
## # Database: sqlite 3.19.3
## # [/tmp/Rtmp9F1wKW/tempCacheDir/BiocFileCache.sqlite]
## rid rname create_time access_time
## <chr> <chr> <chr> <chr>
## 1 BFC5 TestWeb 2017-06-21 00:57:10 2017-06-21 00:57:10
## 2 BFC6 TestWeb 2017-06-21 00:57:10 2017-06-21 00:57:11
## # ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
## # last_modified_time <chr>
There are three methods for retrieving the BiocFileCache resource path location.
[[
bfcpath()
bfcrpath()
The [[ operator will access the rpath saved in the BiocFileCache. Retrieving this location will return the path to the local version of the resource, allowing the user to then use this path in any load/read methods most appropriate for the resource. bfcpath() returns a named character vector also displaying the local file that can be used for retrieval. If the resource is a remote resource, bfcpath() will also return the path to the original source saved as fpath. bfcrpath() returns a named character vector displaying only the local file. bfcrpath() can also be used to add a resource into the cache: it can take an argument rnames, and if an element in rnames is not found, it will try to add it to the cache with bfcadd().
bfc[["BFC2"]]
## [1] "/tmp/Rtmp9F1wKW/tempCacheDir/527c57f55a22_file527c34a6d25f"
bfcpath(bfc, "BFC2")
## BFC2
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c57f55a22_file527c34a6d25f"
bfcpath(bfc, "BFC5")
## BFC5
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c55d1aac8_get"
## fpath
## "http://httpbin.org/get"
bfcrpath(bfc, rids="BFC5")
## BFC5
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c55d1aac8_get"
bfcrpath(bfc)
## BFC1
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c21e1034e_527c21e1034e.RData"
## BFC2
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c57f55a22_file527c34a6d25f"
## BFC3
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c5d77acaa_file527c44f4f619"
## BFC4
## "/tmp/Rtmp9F1wKW/file527c18111999"
## BFC5
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c55d1aac8_get"
## BFC6
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c47ac992_Bioconductor"
bfcrpath(bfc, c("http://httpbin.org/get","Test3_addAsis"))
## BFC5
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c55d1aac8_get"
## BFC4
## "/tmp/Rtmp9F1wKW/file527c18111999"
Managing remote resources locally involves knowing when to update the local copy of the data.
bfcneedsupdate()
bfcneedsupdate() is a method that compares the last_modified tag of the local copy of the data with the last_modified tag of the remote source. The cache saves this information when the web resource is initially added. If the resource does not have a last_modified tag, the result is undetermined (NA).
Note: This function does not automatically download the remote source if it is out of date. Please see bfcdownload().
bfcneedsupdate(bfc, "BFC5")
## BFC5
## NA
bfcneedsupdate(bfc, "BFC6")
## BFC6
## FALSE
bfcneedsupdate(bfc)
## BFC5 BFC6
## NA FALSE
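Because the result can be TRUE, FALSE, or NA, the caller has to decide how to treat the undetermined case. The sketch below works on a hand-written logical vector shaped like the output above (BFC7 is a hypothetical extra entry) and contrasts two possible policies. It does not call the package.

```r
# Hypothetical values shaped like the output of bfcneedsupdate():
# a named logical vector where NA means "could not be determined".
status <- c(BFC5 = NA, BFC6 = FALSE, BFC7 = TRUE)

# Conservative policy: refresh anything not known to be current.
refresh_conservative <- names(status)[is.na(status) | status]

# Lenient policy: refresh only when the remote copy is known to be newer.
refresh_lenient <- names(status)[!is.na(status) & status]

refresh_conservative
refresh_lenient
```

The conservative policy costs extra downloads for resources without a last_modified tag; the lenient policy risks keeping a stale copy of them.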
Just as you could access the rpath, the local resource path can be set with
[[<-
The file must exist in order to be replaced in the BiocFileCache. If the user wishes to rename a file, they must first make a copy of the file (or touch the new file name).
fileBeingReplaced <- bfc[[rid3]]
fileBeingReplaced
## [1] "/tmp/Rtmp9F1wKW/tempCacheDir/527c5d77acaa_file527c44f4f619"
# fl3 was created when we were adding resources
fl3
## [1] "/tmp/Rtmp9F1wKW/file527c18111999"
bfc[[rid3]]<-fl3
## Warning in `[[<-`(`*tmp*`, rid3, value = "/tmp/Rtmp9F1wKW/
## file527c18111999"): updating rpath, changing rtype to 'local'
bfc[[rid3]]
## [1] "/tmp/Rtmp9F1wKW/file527c18111999"
The user may also wish to change the rname or fpath associated with a resource in addition to the rpath. This can be done with
bfcupdate()
Again, if changing the rpath, the file must exist. If an fpath is being updated, the data will be downloaded and will overwrite the current file specified in rpath.
bfcinfo(bfc, "BFC1")
## # Source: lazy query [?? x 8]
## # Database: sqlite 3.19.3
## # [/tmp/Rtmp9F1wKW/tempCacheDir/BiocFileCache.sqlite]
## rid rname create_time access_time
## <chr> <chr> <chr> <chr>
## 1 BFC1 NewResource 2017-06-21 00:57:09 2017-06-21 00:57:13
## # ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
## # last_modified_time <chr>
bfcupdate(bfc, "BFC1", rname="FirstEntry")
bfcinfo(bfc, "BFC1")
## # Source: lazy query [?? x 8]
## # Database: sqlite 3.19.3
## # [/tmp/Rtmp9F1wKW/tempCacheDir/BiocFileCache.sqlite]
## rid rname create_time access_time
## <chr> <chr> <chr> <chr>
## 1 BFC1 FirstEntry 2017-06-21 00:57:09 2017-06-21 00:57:15
## # ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
## # last_modified_time <chr>
Now let’s update a web resource:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:dbplyr':
##
## ident, sql
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
bfcinfo(bfc, "BFC6") %>% select(rid, rpath, fpath)
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.19.3
## # [/tmp/Rtmp9F1wKW/tempCacheDir/BiocFileCache.sqlite]
## rid rpath
## <chr> <chr>
## 1 BFC6 /tmp/Rtmp9F1wKW/tempCacheDir/527c47ac992_Bioconductor
## # ... with 1 more variables: fpath <chr>
bfcupdate(bfc, "BFC6", fpath=url, rname="Duplicate")
bfcinfo(bfc, "BFC6") %>% select(rid, rpath, fpath)
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.19.3
## # [/tmp/Rtmp9F1wKW/tempCacheDir/BiocFileCache.sqlite]
## rid rpath
## <chr> <chr>
## 1 BFC6 /tmp/Rtmp9F1wKW/tempCacheDir/527c47ac992_Bioconductor
## # ... with 1 more variables: fpath <chr>
Lastly, remote resources may require an update if the data is out of date (see bfcneedsupdate()). The bfcdownload() function will attempt to download from the original resource saved in the cache as fpath and overwrite the out-of-date file at rpath.
bfcdownload()
The following confirms that the resource needs updating, and then performs the update:
rid <- "BFC5"
test <- !identical(bfcneedsupdate(bfc, rid), FALSE) # 'TRUE' or 'NA'
if (test)
bfcdownload(bfc, rid)
## BFC5
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c55d1aac8_get"
Now that we have added resources, it is also possible to remove a resource.
bfcremove()
When you remove a resource from the cache, it will also delete the local file, but only if it is stored in the cache directory as given by bfccache(bfc). If it is a path to a file somewhere else on the user system, it will only be removed from the BiocFileCache object and the file will not be deleted.
# let's remind ourselves of our object
bfc
## class: BiocFileCache
## bfccache: /tmp/Rtmp9F1wKW/tempCacheDir
## bfccount: 6
## For more information see: bfcinfo() or bfcquery()
bfcremove(bfc, "BFC6")
bfcremove(bfc, "BFC1")
# let's look at our BiocFileCache object now
bfc
## class: BiocFileCache
## bfccache: /tmp/Rtmp9F1wKW/tempCacheDir
## bfccount: 4
## For more information see: bfcinfo() or bfcquery()
There is another helper function that may be of use:
bfcsync()
This function will check two things:
If any entries have an rpath that cannot be found (this would occur if bfcnew() is used and the path was not used to save an object)
If there are files in the cache directory (bfccache(bfc)) that are not being tracked by the BiocFileCache object
# create a new entry that hasn't been used
path <- bfcnew(bfc, "UseMe")
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
rmMe <- names(path)
# We also have a file not being tracked because we updated rpath
bfcsync(bfc)
## The following entries have local files specified but not found.
## Consider updating or removing:
##
## # Source: lazy query [?? x 8]
## # Database: sqlite 3.19.3
## # [/tmp/Rtmp9F1wKW/tempCacheDir/BiocFileCache.sqlite]
## rid rname create_time access_time
## <chr> <chr> <chr> <chr>
## 1 BFC7 UseMe 2017-06-21 00:57:17 2017-06-21 00:57:18
## # ... with 4 more variables: rpath <chr>, rtype <chr>, fpath <chr>,
## # last_modified_time <chr>
## The following entries are in the cache but not being tracked.
## Consider adding to cache with 'bfcadd()':
## /tmp/Rtmp9F1wKW/tempCacheDir/527c5d77acaa_file527c44f4f619
## [1] FALSE
# you can suppress the messages and just have a TRUE/FALSE
bfcsync(bfc, FALSE)
## [1] FALSE
#
# Let's do some cleaning to have a synced object
#
bfcremove(bfc, rmMe)
unlink(fileBeingReplaced)
bfcsync(bfc)
## [1] TRUE
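Both checks amount to comparing the set of tracked paths with the set of files on disk. The following base R sketch uses a made-up cache directory and made-up file names to show the idea. It ignores details of the real cache, such as the SQLite file that also lives in the cache directory, and is not the package's implementation.

```r
# Hypothetical cache directory with two files on disk.
cache_dir <- tempfile("syncSketch")
dir.create(cache_dir)
invisible(file.create(file.path(cache_dir, c("a.rds", "orphan.rds"))))

# Hypothetical tracked paths: one exists on disk, one was never written.
tracked <- file.path(cache_dir, c("a.rds", "never_saved.rds"))

# Check 1: tracked entries whose local file cannot be found.
missing_files <- tracked[!file.exists(tracked)]

# Check 2: files in the cache directory that no entry refers to.
untracked <- setdiff(list.files(cache_dir), basename(tracked))

# The cache is in sync only if both checks come back empty.
in_sync <- length(missing_files) == 0L && length(untracked) == 0L

basename(missing_files)
untracked
in_sync
```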
Finally, there are two functions involved with cleaning or deleting the cache:
cleanbfc()
removebfc()
cleanbfc() will evaluate the resources in the BiocFileCache object and determine which, if any, have not been accessed in a specified number of days. If ask=TRUE, the user is asked, for each entry above that threshold, whether it should be removed from the cache object and the file deleted (the file is only deleted if it is in the bfccache(bfc) location). If ask=FALSE, it does not ask about each file and automatically removes the entry and deletes the file. The default number of days is 120.
cleanbfc(bfc)
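The threshold test itself is simple date arithmetic. As a rough sketch with a hand-written access log (the rids, timestamps, and the fixed "now" are all made up for illustration), entries idle for more than 120 days can be identified like this:

```r
# Hypothetical access log; in a real cache this is the access_time column.
now <- as.POSIXct("2017-06-21 00:00:00", tz = "UTC")
access <- data.frame(
    rid = c("BFC1", "BFC2", "BFC3"),
    access_time = as.POSIXct(
        c("2017-06-20 12:00:00", "2017-04-02 08:30:00", "2016-11-15 17:45:00"),
        tz = "UTC"),
    stringsAsFactors = FALSE
)

days <- 120  # the default threshold stated for cleanbfc()

# Days since each resource was last accessed.
idle_days <- as.numeric(difftime(now, access$access_time, units = "days"))

# Candidates for removal: idle longer than the threshold.
stale <- access$rid[idle_days > days]
stale
```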
removebfc() will remove the BiocFileCache completely from the system. Any files saved in the bfccache(bfc) directory will also be deleted.
removebfc(bfc)
Note: Use with caution!
One use for BiocFileCache is to save local copies of remote resources. The benefits of this approach include reproducibility, faster access, and access (once cached) without need for an internet connection. An example is an Ensembl GTF file (also available via AnnotationHub):
## paste to avoid long line in vignette
url <- paste(
"ftp://ftp.ensembl.org/pub/release-71/gtf",
"homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz",
sep="/")
For a system-wide cache, simply load the BiocFileCache package and ask for the local resource path (rpath) of the resource.
library(BiocFileCache)
bfc <- BiocFileCache()
path <- bfcrpath(bfc, url)
Use the path returned by bfcrpath() as usual, e.g.,
gtf <- rtracklayer::import.gff(path)
A more compact use, the first or any time, is
gtf <- rtracklayer::import.gff(bfcrpath(BiocFileCache(), url))
Ensembl releases do not change with time, so there is no need to check whether the cached resource needs to be updated.
One might use BiocFileCache to cache results from experimental analysis. The rname field provides an opportunity to provide descriptive metadata to help manage collections of resources, without relying on cryptic file naming conventions.
Here we create or use a local file cache in the directory in which we are doing our analysis.
library(BiocFileCache)
bfc <- BiocFileCache("~/my-experiment/results")
We perform our analysis…
library(DESeq2)
library(airway)
data(airway)
dds <- DESeqDataSet(airway, design = ~ cell + dex)
result <- DESeq(dds)
…and then save our result in a location provided by BiocFileCache.
saveRDS(result, bfcnew(bfc, "airway / DESeq standard analysis"))
Retrieve the result at a later date
result <- readRDS(bfcrpath(bfc, "airway / DESeq standard analysis"))
One might imagine the following workflow:
library(BiocFileCache)
library(rtracklayer)
# load the cache
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path)
# the web resource of interest
url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"
# check if url is being tracked
res <- bfcquery(bfc, url)
if (bfccount(res) == 0L) {
    # if it is not in cache, add
    ans <- bfcadd(bfc, rname="ensembl, homo sapien", fpath=url)
} else {
    # if it is in cache, get path to load
    rid = res %>% filter(fpath == url) %>% collect(Inf) %>% `[[`("rid")
    ans <- bfcrpath(bfc, rid)
    # check to see if the resource needs to be updated
    check <- bfcneedsupdate(bfc, rid)
    # check can be NA if it cannot be determined, choose how to handle
    if (is.na(check)) check <- TRUE
    if (check){
        # assign the refreshed path (the original '< -' was a comparison, not an assignment)
        ans <- bfcdownload(bfc, rid)
    }
}
# ans is the path of the file to load
ans
# we know because we search for the url that the file is a .gtf.gz,
# if we searched on other terms we can use 'bfcpath' to see the
# original fpath to know the appropriate load/read/import method
bfcpath(bfc, names(ans))
temp = GTFFile(ans)
info = import(temp)
#
# A simpler test to see if something is in the cache
# and if not start tracking it is using `bfcrpath`
#
library(BiocFileCache)
library(rtracklayer)
## Loading required package: GenomicRanges
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## Filter, Find, Map, Position, Reduce, anyDuplicated, append,
## as.data.frame, cbind, colMeans, colSums, colnames, do.call,
## duplicated, eval, evalq, get, grep, grepl, intersect,
## is.unsorted, lapply, lengths, mapply, match, mget, order,
## paste, pmax, pmax.int, pmin, pmin.int, rank, rbind, rowMeans,
## rowSums, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which, which.max, which.min
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:dplyr':
##
## first, rename
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
##
## Attaching package: 'IRanges'
## The following objects are masked from 'package:dplyr':
##
## collapse, desc, slice
## Loading required package: GenomeInfoDb
# load the cache
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path)
# the web resources of interest
url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"
url2 <- "ftp://ftp.ensembl.org/pub/release-71/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_5.0.71.gtf.gz"
# if not in cache will download and create new entry
pathsToLoad <- bfcrpath(bfc, c(url, url2))
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
## Warning in rsqlite_fetch(res@ptr, n = n): Don't need to call dbFetch() for
## statements, only for queries
pathsToLoad
## BFC8
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c66222b10_Homo_sapiens.GRCh37.71.gtf.gz"
## BFC9
## "/tmp/Rtmp9F1wKW/tempCacheDir/527c2940e75b_Rattus_norvegicus.Rnor_5.0.71.gtf.gz"
# now load files as you see fit
info = import(GTFFile(pathsToLoad[1]))
class(info)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
summary(info)
## [1] "GRanges object with 2253155 ranges and 12 metadata columns"
#
# One could also imagine the following:
#
library(BiocFileCache)
# load the cache
bfc <- BiocFileCache()
#
# Do some work!
#
# add a location in the cache
filepath <- bfcnew(bfc, "R workspace")
save(list = ls(), file=filepath)
# now the R workspace is being tracked in the cache
It is our hope that this package allows for easier management of local and remote resources.