In addition to implementing its own built-in functions, {tidytof} proposes a general framework for analyzing single-cell data using a tidy interface. This framework centers on the use of “verbs,” i.e. modular function families that represent specific data operations. Users may wish to extend {tidytof}’s existing functionality by writing functions that implement additional tidy interfaces to new algorithms or data analysis methods not currently included in {tidytof}.
If you’re interested in contributing new functions to {tidytof}, this vignette provides some details about how to do so.
To extend {tidytof} to include a new algorithm (for example, one that you’ve just developed), you can take one of two general strategies (and in some cases, you may take both!). The first is to write a {tidytof}-style verb for your algorithm that can be included in your own standalone package. The benefit of this approach is that taking advantage of {tidytof}’s design schema will make your algorithm easy for users to access without learning much (if any) new syntax, while still allowing you to maintain your code base independently of our team.
The second approach is to write a {tidytof}-style function that you’d like our team to add to {tidytof} itself in its next release. In this case, the code review process will take a bit of time, but it will also allow our teams to collaborate and provide a greater degree of critical feedback to one another, as well as to share the burden of code maintenance in the future.
In either case, you’re welcome to contact the {tidytof} team to review your code via a pull request and/or an issue on the {tidytof} GitHub page. This tutorial may be helpful if you don’t have a lot of experience collaborating with other programmers via GitHub.
After you open your request, you can submit code to our team to be reviewed. Whether you want your method to be incorporated into {tidytof} or you’re simply looking for external code review/feedback from our team, please mention this in your request.
{tidytof} uses the tidyverse style guide. Adhering to tidyverse style is something our team will expect for any code being incorporated into {tidytof}, and it’s also something we encourage for any functions you write for your own analysis packages. In our experience, the best code is written not just to be executed, but also to be read by other humans! There are also many tools you can use to lint or automatically style your R code, such as the {lintr} and {styler} packages.
In addition to writing well-styled code, we encourage you to write unit tests for every function you write. This is common practice in the software engineering world, but not as common as it probably should be(!) in the bioinformatics community. The {tidytof} team uses the {testthat} package for all of its unit tests, and there’s a great tutorial for doing so here.
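As a rough illustration of what such a unit test might look like, the sketch below checks the output contract of a small verb-style function. It assumes the {testthat} and {dplyr} packages are installed; `tof_example_rowsum` and the column names are hypothetical and are not part of {tidytof}.

```r
library(testthat)

# Hypothetical function used only for illustration; it is not part of {tidytof}.
# It returns a one-column tibble holding the row sums of the selected columns.
tof_example_rowsum <- function(tof_tibble, sum_cols) {
  selected <- dplyr::select(tof_tibble, {{ sum_cols }})
  dplyr::tibble(.row_sum = unname(rowSums(as.matrix(selected))))
}

test_that("tof_example_rowsum() returns one value per input row", {
  # Simulated input; the column names are made up for this example.
  input <- dplyr::tibble(
    cd4 = c(1, 2, 3),
    cd8 = c(4, 5, 6),
    sample = c("a", "a", "b")
  )
  result <- tof_example_rowsum(input, sum_cols = c(cd4, cd8))

  expect_s3_class(result, "tbl_df")
  expect_named(result, ".row_sum")
  expect_equal(nrow(result), nrow(input))
  expect_equal(result$.row_sum, c(5, 7, 9))
})
```

Tests like this one focus on the shape and type of the returned object, which is usually the part of a function's behavior that downstream code depends on.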
The most important part of writing a function that extends {tidytof} is to adhere to {tidytof} verb syntax. With very few exceptions, {tidytof} functions follow a specific, shared syntax that involves 3 types of arguments that always occur in the same order. These argument types are as follows:
1. Input data: For {tidytof} functions, the first argument is a data frame (or tibble). This enables the use of the pipe (|>) for multi-step calculations, which means that your first argument for most functions will be implicit (passed from the previous function using the pipe).
2. Column specifications: These arguments have names that end in the suffix _col or _cols. Column specifications are unquoted column names that tell a {tidytof} verb which columns to compute over for a particular operation. For example, the cluster_cols argument in tof_cluster allows the user to specify which columns in the input data frame should be used to perform the clustering. Regardless of which verb requires them, column specifications support tidyselect helpers and follow the same rules for tidyselection as tidyverse verbs like dplyr::select() and tidyr::pivot_longer().
3. Method specifications: The remaining arguments of a {tidytof} verb are called method specifications, and they comprise every argument that isn’t an input data frame or a column specification. Whereas column specifications represent which columns should be used to perform an operation, method specifications represent the details of how that operation should be performed. For example, the tof_cluster_phenograph() function requires the method specification num_neighbors, which specifies how many nearest neighbors should be used to construct the PhenoGraph algorithm’s k-nearest-neighbor graph.
With few exceptions, any {tidytof} extension should include the same 3 argument types (in the same order).
In addition, any functions that extend {tidytof} should have a name that starts with the prefix tof_. This will make it easier for users to find {tidytof} functions using the text completion functionality included in most development environments.
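The argument conventions described above can be sketched with a small example. This is only an illustration under stated assumptions: `tof_center_example`, its `center_cols` and `center_stat` arguments, and the simulated column names are all hypothetical and do not exist in {tidytof}. The sketch assumes {dplyr} is installed and R is version 4.1 or later (for `|>` and `\(x)`).

```r
library(dplyr)

# Hypothetical verb-style function, shown only to illustrate argument order;
# it is not part of {tidytof}.
tof_center_example <- function(
    tof_tibble,                          # 1. input data frame (or tibble)
    center_cols,                         # 2. column specification (unquoted)
    center_stat = c("mean", "median")    # 3. method specification
) {
  center_stat <- match.arg(center_stat)
  center_fun <- if (center_stat == "mean") mean else stats::median

  # Embracing with {{ }} forwards the unquoted selection to dplyr::across(),
  # so tidyselect helpers work in the column specification.
  tof_tibble |>
    dplyr::mutate(dplyr::across({{ center_cols }}, \(x) x - center_fun(x)))
}

# Simulated data; the column names are made up for this example.
cells <- dplyr::tibble(
  sample = c("a", "a", "b", "b"),
  cd4 = c(1, 2, 3, 4),
  cd8 = c(10, 20, 30, 40)
)

# The data frame arrives implicitly through the pipe.
centered <-
  cells |>
  tof_center_example(center_cols = starts_with("cd"), center_stat = "median")
```

Because the function name starts with `tof_`, it would also show up alongside other {tidytof}-style functions in editor autocompletion.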
Extending an existing {tidytof} verb
{tidytof} includes multiple verbs that perform fundamental single-cell data manipulation tasks. Currently, its extensible verbs are the following:
tof_analyze_abundance: Perform differential cluster abundance analysis
tof_analyze_expression: Perform differential marker expression analysis
tof_annotate_clusters: Annotate clusters with manual IDs
tof_batch_correct: Perform batch correction
tof_cluster: Cluster cells into subpopulations
tof_downsample: Subsample a dataset into a smaller number of cells
tof_extract: Calculate sample-level summary statistics
tof_metacluster: Metacluster clusters into a smaller number of subpopulations
tof_plot_cells: Plot cell-level data
tof_plot_clusters: Plot cluster-level data
tof_plot_model: Plot the results of a sample-level model
tof_read_data: Read data into memory from disk
tof_reduce_dimensions: Perform dimensionality reduction
tof_transform: Transform marker expression values in a vectorized fashion
tof_upsample: Assign new cells to existing clusters (defined on a downsampled dataset)
tof_write_data: Write data from memory to disk
Each {tidytof} verb wraps a family of related functions that all perform the same basic task. For example, the tof_cluster verb is a wrapper for the following functions: tof_cluster_ddpr, tof_cluster_flowsom, tof_cluster_kmeans, and tof_cluster_phenograph. All of these functions implement a different clustering algorithm, but they share an underlying logic that is standardized under the tof_cluster abstraction. In practice, this means that users can apply the DDPR, FlowSOM, K-means, and PhenoGraph clustering algorithms to their datasets either by calling one of the tof_cluster_* functions directly, or by calling tof_cluster with the method argument set to the appropriate value (“ddpr”, “flowsom”, “kmeans”, and “phenograph”, respectively).
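To make the relationship between a verb and its family of functions concrete, here is a minimal sketch of one way a verb could dispatch on a `method` argument. It is not {tidytof}'s actual implementation: `tof_group_example`, its two family members, and the `.group` output column are hypothetical names, and the clustering itself uses `stats::kmeans()` and `stats::hclust()` from base R as stand-ins. The sketch assumes {dplyr} is installed.

```r
library(dplyr)

# Hypothetical family members (not part of {tidytof}); each one takes the same
# input data and column specification and returns a one-column tibble.
tof_group_example_kmeans <- function(tof_tibble, group_cols, num_groups = 2L) {
  features <- dplyr::select(tof_tibble, {{ group_cols }})
  fit <- stats::kmeans(features, centers = num_groups)
  dplyr::tibble(.group = as.character(fit$cluster))
}

tof_group_example_hclust <- function(tof_tibble, group_cols, num_groups = 2L) {
  features <- dplyr::select(tof_tibble, {{ group_cols }})
  tree <- stats::hclust(stats::dist(features))
  dplyr::tibble(.group = as.character(stats::cutree(tree, k = num_groups)))
}

# Hypothetical verb: validates `method`, then forwards the shared arguments
# (and any method specifications in `...`) to the matching family member.
tof_group_example <- function(tof_tibble, group_cols,
                              method = c("kmeans", "hclust"), ...) {
  method <- match.arg(method)
  switch(
    method,
    kmeans = tof_group_example_kmeans(tof_tibble, {{ group_cols }}, ...),
    hclust = tof_group_example_hclust(tof_tibble, {{ group_cols }}, ...)
  )
}

# Simulated data with two well-separated groups; column names are made up.
cells <- dplyr::tibble(
  cd4 = c(0, 0.1, 10, 10.1),
  cd8 = c(0, 0.2, 10, 10.2)
)

# These two calls run the same underlying function.
via_verb <- tof_group_example(cells, group_cols = c(cd4, cd8), method = "hclust")
direct <- tof_group_example_hclust(cells, group_cols = c(cd4, cd8))
```

Keeping the argument names and the return shape identical across family members is what lets the verb stay a thin wrapper.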
To extend an existing {tidytof} verb, write a function whose name fits the pattern tof_{verb name}_*, where “*” represents the name of the algorithm being used to perform the computation. In the function definition, try to share as many arguments as possible with the {tidytof} verb you’re extending, and return the same output object as that described in the “Value” heading of the help file for the verb being extended.
For example, suppose I wanted to write a {tidytof}-style interface for my new clustering algorithm “supercluster”, which performs k-means clustering on a dataset twice and then outputs a final cluster assignment equal to the two k-means cluster assignments spliced together. To add the supercluster algorithm to {tidytof}, I might write a function like this:
#' Perform superclustering on high-dimensional cytometry data.
#'
#' This function applies the silly, hypothetical clustering algorithm
#' "supercluster" to high-dimensional cytometry data using user-specified
#' input variables/cytometry measurements.
#'
#' @param tof_tibble A `tof_tbl` or `tibble`.
#'
#' @param cluster_cols Unquoted column names indicating which columns in
#' `tof_tibble` to use in computing the supercluster clusters.
#' Supports tidyselect helpers.
#'
#' @param num_kmeans_clusters An integer indicating how many clusters should be
#' used for the two k-means clustering steps.
#'
#' @param sep A string to use when splicing the 2 k-means clustering assignments
#' to one another.
#'
#' @param ... Optional additional parameters to pass to
#' \code{\link[tidytof]{tof_cluster_kmeans}}
#'
#' @return A tibble with one column named `.supercluster_cluster` containing
#' a character vector of length `nrow(tof_tibble)` indicating the id of the
#' supercluster cluster to which each cell (i.e. each row) in `tof_tibble` was
#' assigned.
#'
#' @importFrom dplyr tibble
#'
tof_cluster_supercluster <-
  function(tof_tibble, cluster_cols, num_kmeans_clusters = 10L, sep = "_", ...) {
    kmeans_1 <-
      tof_tibble |>
      tof_cluster_kmeans(
        cluster_cols = {{ cluster_cols }},
        num_clusters = num_kmeans_clusters,
        ...
      )
    kmeans_2 <-
      tof_tibble |>
      tof_cluster_kmeans(
        cluster_cols = {{ cluster_cols }},
        num_clusters = num_kmeans_clusters,
        ...
      )
    final_result <-
      dplyr::tibble(
        .supercluster_cluster =
          paste(kmeans_1$.kmeans_cluster, kmeans_2$.kmeans_cluster, sep = sep)
      )
    return(final_result)
  }
In the example above, note that tof_cluster_supercluster is named using the tof_{verb name}_* style, that the function definition uses the same tof_tibble and cluster_cols arguments as tof_cluster, and that the returned output object is a tibble with a single column encoding the cluster ids for each of the cells in tof_tibble.
Creating a new {tidytof} verb
If you want to contribute a function to {tidytof} that represents a new operation not encompassed by any of the existing verbs above, you should include the suggestion to create a new verb in your pull request to the {tidytof} team. In this case, you’ll have considerably more flexibility to define the interface {tidytof} will use to implement your new verb, and the {tidytof} team is happy to work with you to figure out what makes the most sense (or at least to brainstorm together).
At this point in {tidytof}’s development, we don’t recommend extending its modeling functionality, as it is likely to be abstracted into its own standalone package (with an emphasis on interoperability with the tidymodels ecosystem) at some point in the future.
For general questions/comments/concerns about {tidytof}, feel free to reach out to our team on GitHub here.
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] HDCytoData_1.25.0 flowCore_2.18.0
#> [3] SummarizedExperiment_1.36.0 Biobase_2.66.0
#> [5] GenomicRanges_1.58.0 GenomeInfoDb_1.42.0
#> [7] IRanges_2.40.0 S4Vectors_0.44.0
#> [9] MatrixGenerics_1.18.0 matrixStats_1.4.1
#> [11] ExperimentHub_2.14.0 AnnotationHub_3.14.0
#> [13] BiocFileCache_2.14.0 dbplyr_2.5.0
#> [15] BiocGenerics_0.52.0 forcats_1.0.0
#> [17] ggplot2_3.5.1 dplyr_1.1.4
#> [19] tidytof_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] jsonlite_1.8.9 shape_1.4.6.1 magrittr_2.0.3
#> [4] farver_2.1.2 rmarkdown_2.28 zlibbioc_1.52.0
#> [7] vctrs_0.6.5 memoise_2.0.1 htmltools_0.5.8.1
#> [10] S4Arrays_1.6.0 curl_5.2.3 SparseArray_1.6.0
#> [13] sass_0.4.9 parallelly_1.38.0 bslib_0.8.0
#> [16] lubridate_1.9.3 cachem_1.1.0 commonmark_1.9.2
#> [19] igraph_2.1.1 mime_0.12 lifecycle_1.0.4
#> [22] iterators_1.0.14 pkgconfig_2.0.3 Matrix_1.7-1
#> [25] R6_2.5.1 fastmap_1.2.0 GenomeInfoDbData_1.2.13
#> [28] future_1.34.0 digest_0.6.37 colorspace_2.1-1
#> [31] AnnotationDbi_1.68.0 RSQLite_2.3.7 labeling_0.4.3
#> [34] filelock_1.0.3 cytolib_2.18.0 fansi_1.0.6
#> [37] yardstick_1.3.1 timechange_0.3.0 httr_1.4.7
#> [40] polyclip_1.10-7 abind_1.4-8 compiler_4.4.1
#> [43] bit64_4.5.2 withr_3.0.2 doParallel_1.0.17
#> [46] viridis_0.6.5 DBI_1.2.3 highr_0.11
#> [49] ggforce_0.4.2 MASS_7.3-61 lava_1.8.0
#> [52] rappdirs_0.3.3 DelayedArray_0.32.0 tools_4.4.1
#> [55] future.apply_1.11.3 nnet_7.3-19 glue_1.8.0
#> [58] grid_4.4.1 generics_0.1.3 recipes_1.1.0
#> [61] gtable_0.3.6 tzdb_0.4.0 class_7.3-22
#> [64] tidyr_1.3.1 data.table_1.16.2 hms_1.1.3
#> [67] tidygraph_1.3.1 utf8_1.2.4 XVector_0.46.0
#> [70] markdown_1.13 ggrepel_0.9.6 BiocVersion_3.20.0
#> [73] foreach_1.5.2 pillar_1.9.0 stringr_1.5.1
#> [76] RcppHNSW_0.6.0 splines_4.4.1 tweenr_2.0.3
#> [79] lattice_0.22-6 survival_3.7-0 bit_4.5.0
#> [82] RProtoBufLib_2.18.0 tidyselect_1.2.1 Biostrings_2.74.0
#> [85] knitr_1.48 gridExtra_2.3 xfun_0.48
#> [88] graphlayouts_1.2.0 hardhat_1.4.0 timeDate_4041.110
#> [91] stringi_1.8.4 UCSC.utils_1.2.0 yaml_2.3.10
#> [94] evaluate_1.0.1 codetools_0.2-20 ggraph_2.2.1
#> [97] tibble_3.2.1 BiocManager_1.30.25 cli_3.6.3
#> [100] rpart_4.1.23 munsell_0.5.1 jquerylib_0.1.4
#> [103] Rcpp_1.0.13 globals_0.16.3 png_0.1-8
#> [106] parallel_4.4.1 gower_1.0.1 readr_2.1.5
#> [109] blob_1.2.4 listenv_0.9.1 glmnet_4.1-8
#> [112] viridisLite_0.4.2 ipred_0.9-15 ggridges_0.5.6
#> [115] scales_1.3.0 prodlim_2024.06.25 purrr_1.0.2
#> [118] crayon_1.5.3 rlang_1.1.4 KEGGREST_1.46.0