jvecfor is a high-performance KNN library for Bioconductor single-cell workflows. It wraps a Java library, built on top of jvector, which implements two algorithms:

- type = "ann" — approximate KNN via Hierarchical Navigable Small World (HNSW) graphs, with SIMD-accelerated distance computation (AVX2/AVX-512). O(log n) query time; ~2× faster than BiocNeighbors at n ≥ 100K cells.
- type = "knn" — exact KNN using vantage-point trees. Deterministic; ideal for small datasets or ground-truth validation.

The R package communicates with the JVM via system2() — no JNI, no rJava dependency. The JAR is bundled in inst/java/ and requires no separate installation step beyond having Java 20+ on your PATH.
| Function | Description | Returns |
|---|---|---|
| fastFindKNN() | Core KNN search | list($index, $distance) |
| fastMakeSNNGraph() | KNN → shared nearest-neighbor graph | igraph object |
| fastMakeKNNGraph() | KNN → k-nearest-neighbor graph | igraph object |
| jvecfor_setup() | Copy a custom JAR into the package | path (invisibly) |
The output of fastFindKNN() is a named list with
$index (n × k integer matrix, 1-based) and
$distance (n × k numeric matrix), identical in structure to
BiocNeighbors::findKNN().
jvecfor requires Java ≥ 20 on PATH. The
JAR uses the jdk.incubator.vector module for SIMD
acceleration. Check your installation:
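A quick check from within R, relying only on base R's system2() (the version string printed varies by JDK distribution):

```r
# Print the JVM version string; this errors if no `java` binary is on PATH
system2("java", "--version")
```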
If java is not found, install Eclipse Temurin or any OpenJDK 20+
distribution and ensure the java binary is on your
PATH.

## Package options
Two global options control runtime behaviour:
| Option | Default | Effect |
|---|---|---|
| jvecfor.verbose | FALSE | Pass --verbose to jvecfor, printing HNSW build progress to stderr. |
| jvecfor.jar | NULL | Use a custom JAR instead of the one bundled in inst/java/. |
Set them persistently in your .Rprofile or
per-session:
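For example, using the two option names documented in the table above:

```r
# Persistent: add to ~/.Rprofile. Per-session: run at the console.
options(
  jvecfor.verbose = TRUE,  # print HNSW build progress to stderr
  jvecfor.jar     = NULL   # NULL = use the JAR bundled with the package
)
getOption("jvecfor.verbose")
#> [1] TRUE
```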
library(jvecfor)
# Simulate 300 cells × 30 principal components
set.seed(42)
pca <- matrix(rnorm(300 * 30), nrow = 300, ncol = 30)
# Find 15 nearest neighbours (HNSW-DiskANN approximate, Euclidean)
nn <- fastFindKNN(pca, k = 15)
str(nn)
#> List of 2
#> $ index : int [1:300, 1:15] 271 263 125 158 271 241 210 2 153 235 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:15] "V1" "V2" "V3" "V4" ...
#> $ distance: num [1:300, 1:15] 0.0508 0.0396 0.0471 0.0409 0.0331 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:15] "V16" "V17" "V18" "V19" ...

The result is a two-element list. $index is a 300 × 15
integer matrix of 1-based neighbor indices; $distance is a
300 × 15 numeric matrix of Euclidean distances.
dim(nn$index) # 300 x 15
#> [1] 300 15
dim(nn$distance) # 300 x 15
#> [1] 300 15
class(nn$index) # "integer" — 1-based, same as BiocNeighbors
#> [1] "matrix" "array"
class(nn$distance) # "numeric"
#> [1] "matrix" "array"
# No zeros in index (1-based) and all values within [1, nrow(pca)]
range(nn$index)
#> [1]   1 298

This output is structurally identical to
BiocNeighbors::findKNN(), so existing code that passes
nn$index to bluster
or other Bioconductor tools works without modification.
# Approximate (HNSW-DiskANN) — default, fast, O(log n) query
nn_ann <- fastFindKNN(pca, k = 15, type = "ann")
# Exact (VP-tree) — deterministic, O(n log n) query
nn_knn <- fastFindKNN(pca, k = 15, type = "knn")
# Self-consistency check: no duplicate neighbour indices within any row
stopifnot(all(apply(nn_knn$index, 1, function(x) !anyDuplicated(x))))

When to choose ANN vs. exact:
| Criterion | ANN ("ann") | Exact ("knn") |
|---|---|---|
| Dataset size | n ≥ 5K | n < 5K, or benchmark ground truth |
| Recall | ~95–99 % (tunable) | 100 % |
| Speed | O(log n) | O(n log n) |
| Metrics | euclidean, cosine, dot_product | euclidean, cosine |
| Reproducibility | Deterministic (fixed graph) | Deterministic |
metric = "euclidean" (default): the L2 distance in PC space. Appropriate when principal components have been whitened (equal variance per component) or when the embedding preserves meaningful inter-cell distances.
metric = "cosine": measures the angle between cell vectors, ignoring magnitude. This is useful when only the direction of the embedding vector matters, for instance with log-normalised count embeddings where library-size effects inflate norms. Normalise rows to unit length before calling to ensure consistent behaviour:
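A minimal base-R sketch of row normalisation, applied to the pca matrix from the quick-start example; pca_norm is the unit-length version used in the dot-product example below:

```r
# Divide each row by its L2 norm so every cell vector has unit length
pca_norm <- pca / sqrt(rowSums(pca^2))

# Verify: all row norms are (numerically) 1
stopifnot(all(abs(sqrt(rowSums(pca_norm^2)) - 1) < 1e-8))
```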
metric = "dot_product": the inner product ⟨u, v⟩. A similarity measure (higher = more
similar) rather than a distance. Only valid for
type = "ann" — the VP-tree requires a proper metric.
Attempting to combine metric = "dot_product" with
type = "knn" raises an informative error:
# ANN with dot-product similarity
nn_dot <- fastFindKNN(pca_norm, k = 15, metric = "dot_product", type = "ann")

# Attempting exact + dot_product raises an error
fastFindKNN(pca, k = 15, metric = "dot_product", type = "knn")
#> Error in `.validate_knn_scalars()`:
#> ! 'dot_product' is not a proper metric and cannot be used with type='knn' (VP-tree exact search). Use type='ann', or switch to metric='euclidean' or metric='cosine' for exact search.

When you only need neighbor indices (e.g. for graph construction),
set get.distance = FALSE. This avoids allocating and
transmitting the n × k distance matrix, saving roughly 50 % of output
parsing time:
nn_idx <- fastFindKNN(pca, k = 15, get.distance = FALSE)
is.null(nn_idx$distance) # TRUE
#> [1] TRUE

The fastMakeSNNGraph() and
fastMakeKNNGraph() wrappers set
get.distance = FALSE internally.
| Parameter | Default | Tuning guidance |
|---|---|---|
| M | 16 | Connections per node. Increase to 32–64 for high-dimensional (> 50 dims) data; doubles memory and build time. |
| ef.search | 0 (auto) | Beam width. Auto = max(k + 1, 3k). Increase (e.g. 100–500) to trade speed for recall. |
| oversample.factor | 1.0 | Fetch ceil(ef × factor) candidates, keep top k. Values 1.5–3.0 improve recall at proportional cost. |
| pq.subspaces | 0 (off) | PQ subspaces. See section 5.3. 4–8× speedup with minimal recall loss. |
Use this when recall quality matters more than raw speed — for example when producing the KNN graph that will underpin a publication’s cluster assignments or trajectory analysis:
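For example, a higher-recall configuration might combine a wider beam with modest oversampling; the specific values below are illustrative starting points, not benchmarked recommendations:

```r
# Higher-recall ANN search: wider beam plus 2x oversampling
nn_hr <- fastFindKNN(
  pca,
  k = 15,
  type = "ann",
  ef.search = 200,        # beam width well above the auto default max(k + 1, 3k)
  oversample.factor = 2.0 # fetch ceil(ef x 2) candidates, keep the top 15
)
```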
Product Quantization (PQ) compresses vectors into compact codes
before distance computation, yielding a 4–8× search speedup with
negligible recall loss. Set pq.subspaces to
ncol(X) / 2 as a starting point; the value must evenly
divide ncol(X).
# 300 × 20 matrix — use 10 PQ subspaces (= 20 / 2)
set.seed(1)
pca20 <- matrix(rnorm(300 * 20), 300, 20)
nn_default <- fastFindKNN(pca20, k = 15) # no PQ
nn_pq <- fastFindKNN(pca20, k = 15, pq.subspaces = 10L) # PQ enabled
# Recall overlap (fraction of PQ neighbors that match the default)
shared <- mapply(function(a, b) length(intersect(a, b)),
split(nn_default$index, row(nn_default$index)),
split(nn_pq$index, row(nn_pq$index)))
message(sprintf("Mean PQ recall vs. default: %.1f%%", 100 * mean(shared / 15)))
#> Mean PQ recall vs. default: 100.0%

For small matrices (< 1K rows) the JVM startup cost dominates and PQ will not show a wall-clock benefit. The speedup becomes material at n ≥ 10K.
Avoid PQ when ncol(X) is very small (< 8 features): quantisation error dominates and recall drops sharply.
A timing comparison for large data:
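A sketch of such a comparison; the 100K × 50 input is simulated here, and absolute timings depend on hardware:

```r
# Simulated large input: 100K cells x 50 PCs
set.seed(7)
pca_big <- matrix(rnorm(1e5 * 50), 1e5, 50)

# pq.subspaces = 25 evenly divides ncol(pca_big) = 50
t_plain <- system.time(fastFindKNN(pca_big, k = 15))["elapsed"]
t_pq    <- system.time(fastFindKNN(pca_big, k = 15, pq.subspaces = 25L))["elapsed"]

message(sprintf("no PQ: %.1f s | PQ: %.1f s | speedup: %.1fx",
                t_plain, t_pq, t_plain / t_pq))
```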
jvecfor auto-detects the number of physical cores and uses
max(1, physical_cores − 1) threads by default. Override
with num.threads:
# Single-threaded (reproducible, good for shared HPC nodes)
nn_1t <- fastFindKNN(pca, k = 15, num.threads = 1L)
# Explicit thread count
nn_4t <- fastFindKNN(pca, k = 15, num.threads = 4L)
# Use verbose to confirm the thread count jvecfor actually uses
nn_v <- fastFindKNN(pca, k = 15, num.threads = 2L, verbose = TRUE)

On shared HPC nodes, set num.threads explicitly to avoid
consuming all physical cores and interfering with co-located jobs.
The fastMakeSNNGraph() and
fastMakeKNNGraph() wrappers build igraph objects
directly from the PC matrix, combining fastFindKNN() with
bluster’s
graph constructors.
The SNN graph is the standard input for graph-based clustering in single-cell analysis. Here we use the Louvain algorithm:
g_snn <- fastMakeSNNGraph(pca, k = 15)
louvain <- igraph::cluster_louvain(g_snn)
message(
"Number of detected communities: ",
length(igraph::communities(louvain))
)
#> Number of detected communities: 4
membership_vec <- igraph::membership(louvain)
table(membership_vec)
#> membership_vec
#> 1 2 3 4
#> 36 98 70 96

Leiden clustering (when available via igraph ≥ 1.3) gives finer control over community resolution:
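A sketch using igraph::cluster_leiden() on the same SNN graph; the modularity objective makes the result directly comparable to the Louvain run above:

```r
# Leiden with the modularity objective, analogous to the Louvain call
leiden <- igraph::cluster_leiden(g_snn, objective_function = "modularity")
table(igraph::membership(leiden))
```

Raising Leiden's resolution parameter yields more, smaller communities; see ?igraph::cluster_leiden for the exact argument name in your igraph version.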
This end-to-end example simulates three cell populations (clusters) embedded in 30 PCs and recovers them via SNN graph clustering. All steps use only jvecfor and igraph.
# Simulate 300 cells (100 per cluster) with separable cluster centres in PC1–PC2
set.seed(42)
n_per <- 100
centres <- list(c(5, 0), c(-5, 0), c(0, 5))
pca_clust <- do.call(rbind, lapply(seq_along(centres), function(i) {
m <- matrix(rnorm(n_per * 30), n_per, 30)
m[, 1] <- m[, 1] + centres[[i]][1]
m[, 2] <- m[, 2] + centres[[i]][2]
m
}))
true_labels <- rep(1:3, each = n_per)

# Step 2: build the SNN graph from the simulated PCs
g_wf <- fastMakeSNNGraph(pca_clust, k = 15)

# Step 3: Louvain community detection
lou_wf <- igraph::cluster_louvain(g_wf)
detected <- igraph::membership(lou_wf)
# Step 4: cross-tabulate detected vs. true labels
print(table(Detected = detected, True = true_labels))
#> True
#> Detected 1 2 3
#> 1 99 0 0
#> 2 1 0 100
#> 3 0 100 0

# Visualise in the first two PCs (base R — no extra dependencies)
cluster_cols <- c("#E64B35", "#4DBBD5", "#00A087", "#3C5488", "#F39B7F")
plot(
pca_clust[, 1], pca_clust[, 2],
col = cluster_cols[detected],
pch = 19, cex = 0.7,
xlab = "PC 1", ylab = "PC 2",
main = "Louvain clusters on SNN graph"
)
legend("topright",
legend = paste("Cluster", sort(unique(detected))),
col = cluster_cols[sort(unique(detected))],
pch = 19, bty = "n", cex = 0.85)

Louvain clusters (colours) in PC1–PC2 space. Three well-separated populations are recovered.
Swap the simulated matrix for reducedDim(sce, "PCA") to
apply the same pipeline to real data:
library(SingleCellExperiment)
# Assume sce is a SingleCellExperiment with PCA computed
# (e.g. via scater::runPCA)
pca_mat <- reducedDim(sce, "PCA")
# KNN search
nn_sce <- fastFindKNN(pca_mat, k = 15)
# Graph-based clustering
g_sce <- fastMakeSNNGraph(pca_mat, k = 15)
lou_sce <- igraph::cluster_louvain(g_sce)
sce$cluster <- as.factor(igraph::membership(lou_sce))

jvecfor is designed as a drop-in replacement for BiocNeighbors. The two calls below are functionally equivalent and return identically structured output:
library(BiocNeighbors)
# BiocNeighbors (Annoy backend, the default ANN method)
nn_bn <- BiocNeighbors::findKNN(pca, k = 15, BNPARAM = AnnoyParam())
# jvecfor (HNSW-DiskANN backend) — standalone function
nn_jv <- jvecfor::fastFindKNN(pca, k = 15)
# Same structure
identical(names(nn_bn), names(nn_jv)) # TRUE — both have $index, $distance
identical(dim(nn_bn$index), dim(nn_jv$index)) # TRUE

Results will not be identical because both methods are approximate — they use different ANN algorithms with different recall characteristics.
JvecforParam is a BiocNeighborParam
subclass, so you can pass it as the BNPARAM argument to any
Bioconductor function that accepts one:
library(BiocNeighbors)
# Use jvecfor through the standard BiocNeighbors interface
nn <- findKNN(pca, k = 15, BNPARAM = JvecforParam())
# Works with scran, scater, and other BNPARAM-aware packages:
# library(scran)
# g <- buildSNNGraph(sce, BNPARAM = JvecforParam())
#
# library(scater)
# sce <- runUMAP(sce, BNPARAM = JvecforParam())
# Customise algorithm parameters via the constructor
nn2 <- findKNN(pca, k = 15,
BNPARAM = JvecforParam(
type = "knn", # exact VP-tree search
distance = "Cosine",
M = 32L
  ))

To benchmark on your own data:
# Replace `pca_large` with your n × p matrix (e.g. n = 100K cells, p = 50 PCs)
t_bn <- system.time(
BiocNeighbors::findKNN(pca_large, k = 15, BNPARAM = AnnoyParam())
)
t_jv <- system.time(
fastFindKNN(pca_large, k = 15)
)
message(sprintf("BiocNeighbors: %.1f s", t_bn["elapsed"]))
message(sprintf("jvecfor: %.1f s", t_jv["elapsed"]))
message(sprintf("Speedup: %.1fx", t_bn["elapsed"] / t_jv["elapsed"]))

| Error message | Likely cause | Resolution |
|---|---|---|
| Java not found on PATH | java not in shell PATH | Install OpenJDK 20+ (adoptium.net); re-launch R after updating PATH. |
| Java >= 20 is required (found Java X) | Outdated JDK | Upgrade to OpenJDK 20+; verify with java --version. |
| jvecfor JAR not found. Run jvecfor_setup() | inst/java/ empty | Run jvecfor_setup() or reinstall. |
| jvecfor exited with status N | JVM crash or invalid input | Set verbose=TRUE to inspect Java stderr. |
| Very slow first call | JVM cold-start overhead | Expected; subsequent calls are faster. |
| Unexpected neighbour count | k >= nrow(X) | Reduce k; need at least k + 1 observations. |
Enable verbose output to see HNSW build progress and confirm threading:
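Either per call or globally, using the verbose argument and the jvecfor.verbose option described earlier (the log lines themselves come from the Java side and vary by version):

```r
# Per call:
nn_v <- fastFindKNN(pca, k = 15, verbose = TRUE)

# Or for the whole session:
options(jvecfor.verbose = TRUE)
```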
For production code that should degrade gracefully when Java is
unavailable, wrap calls with tryCatch:
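A sketch of such a wrapper; falling back to BiocNeighbors is an illustrative choice here, not a package requirement:

```r
# Try jvecfor first; on any error (e.g. missing JVM), fall back gracefully
nn <- tryCatch(
  fastFindKNN(pca, k = 15),
  error = function(e) {
    message("jvecfor unavailable (", conditionMessage(e), "); using BiocNeighbors")
    BiocNeighbors::findKNN(pca, k = 15)
  }
)
```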
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] igraph_2.3.0 jvecfor_1.0.0 BiocStyle_2.40.0
#>
#> loaded via a namespace (and not attached):
#> [1] Matrix_1.7-5 jsonlite_2.0.0 compiler_4.6.0
#> [4] BiocManager_1.30.27 Rcpp_1.1.1-1.1 parallel_4.6.0
#> [7] cluster_2.1.8.2 jquerylib_0.1.4 BiocParallel_1.46.0
#> [10] yaml_2.3.12 fastmap_1.2.0 lattice_0.22-9
#> [13] R6_2.6.1 generics_0.1.4 knitr_1.51
#> [16] BiocGenerics_0.58.0 bluster_1.22.0 maketools_1.3.2
#> [19] bslib_0.10.0 BiocNeighbors_2.6.0 rlang_1.2.0
#> [22] cachem_1.1.0 xfun_0.57 sass_0.4.10
#> [25] sys_3.4.3 cli_3.6.6 magrittr_2.0.5
#> [28] ps_1.9.3 digest_0.6.39 grid_4.6.0
#> [31] processx_3.9.0 lifecycle_1.0.5 S4Vectors_0.50.0
#> [34] evaluate_1.0.5 data.table_1.18.2.1 codetools_0.2-20
#> [37] buildtools_1.0.0 stats4_4.6.0 rmarkdown_2.31
#> [40] tools_4.6.0 pkgconfig_2.0.3 htmltools_0.5.9