BiocNeighbors 1.12.0
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 9439 4415 4050 1824 7530 1765 1809 9042 4664 2563
## [2,] 2790 3423 3782 8726 625 2779 7927 7627 2499 9672
## [3,] 2457 7743 221 1509 7170 4612 357 1104 7804 4335
## [4,] 4969 3827 7582 434 4839 8877 8875 5642 8399 6484
## [5,] 3024 1233 9286 4642 7428 2218 2799 4303 3709 4587
## [6,] 970 8731 5495 8513 3823 1449 582 1072 4587 5913
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.9488623 0.9526768 0.9918012 1.0155424 1.018635 1.038617 1.060869
## [2,] 0.9491687 0.9837804 0.9908242 1.0480871 1.055283 1.068011 1.078402
## [3,] 0.9902376 1.0116328 1.0365927 1.0736140 1.083189 1.086472 1.097179
## [4,] 0.9866520 1.0091648 1.0289248 1.0347098 1.054599 1.058922 1.071430
## [5,] 0.9620222 0.9739777 0.9899967 0.9961459 1.009494 1.019439 1.021883
## [6,] 0.9976924 1.0015095 1.0361325 1.0463392 1.058199 1.059819 1.070786
## [,8] [,9] [,10]
## [1,] 1.061591 1.061877 1.079348
## [2,] 1.106313 1.106321 1.107904
## [3,] 1.113934 1.127635 1.131233
## [4,] 1.075089 1.079372 1.095900
## [5,] 1.036463 1.037183 1.043278
## [6,] 1.081499 1.115437 1.120747
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2637 872 1732 1461 4400
## [2,] 1826 156 3000 8512 2001
## [3,] 960 524 3618 2276 5273
## [4,] 84 2483 4965 1349 5243
## [5,] 8038 8264 4906 2087 1823
## [6,] 9085 1358 3541 8958 6405
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.9116162 0.9471505 0.9820164 0.9885749 1.0134385
## [2,] 0.9471223 1.0264651 1.0276965 1.0382000 1.0838469
## [3,] 0.9862803 1.0282627 1.0421263 1.0499210 1.0595300
## [4,] 0.7931599 0.8482375 0.8868514 0.9529580 0.9833588
## [5,] 0.9192116 0.9710984 0.9776919 0.9884884 1.0078245
## [6,] 0.9002737 1.0350211 1.0597970 1.0611564 1.0645769
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 On HPC file systems, you can change TEMPDIR
to a location that is more amenable to concurrent access. and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "/tmp/RtmpBdikDR/file2cc07639387048.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.12.0 knitr_1.36 BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.7 magrittr_2.0.1 BiocGenerics_0.40.0
## [4] BiocParallel_1.28.0 lattice_0.20-45 R6_2.5.1
## [7] rlang_0.4.12 fastmap_1.1.0 stringr_1.4.0
## [10] tools_4.1.1 parallel_4.1.1 grid_4.1.1
## [13] xfun_0.27 jquerylib_0.1.4 htmltools_0.5.2
## [16] yaml_2.2.1 digest_0.6.28 bookdown_0.24
## [19] Matrix_1.3-4 BiocManager_1.30.16 S4Vectors_0.32.0
## [22] sass_0.4.0 evaluate_0.14 rmarkdown_2.11
## [25] stringi_1.7.5 compiler_4.1.1 bslib_0.3.1
## [28] stats4_4.1.1 jsonlite_1.7.2