KnowYourCG (KYCG) is a supervised learning framework designed for the functional analysis of DNA methylation data. Unlike existing tools that focus on genes or genomic intervals, KnowYourCG directly targets CpG dinucleotides, featuring automated supervised screenings of diverse biological and technical influences, including sequence motifs, transcription factor binding, histone modifications, replication timing, cell-type-specific methylation, and trait associations. KnowYourCG addresses the challenges of data sparsity in various methylation datasets, including low-pass Nanopore sequencing, single-cell DNA methylomes, 5-hydroxymethylation profiles, spatial DNA methylation maps, and array-based datasets for epigenome-wide association studies and epigenetic clocks.

The input to KYCG is a CpG set (the query). A CpG set can represent differentially methylated CpGs, hits from an epigenome-wide association study, or any other set derived from an analysis. For sequencing data, the preferred format is a YAME-compressed binary vector in which each entry is 0 or 1, indicating whether the corresponding CpG is in the set. This format assumes a fixed order of CpGs that follows their genomic coordinates. Because the vector itself stores no coordinates, the reference coordinate file that defines this order is critical. Please refer to the YAME documentation for details: https://zhou-lab.github.io/YAME/.
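To make the ordering requirement concrete, the sketch below builds a 0/1 membership vector by hand with awk for a toy five-CpG reference. The file names and coordinates are invented for illustration, and this is only a plain-text picture of what the binary vector encodes; in practice YAME does the packing and compression.

```shell
# Toy reference: CpG positions listed in genomic order (hypothetical coordinates).
workdir=$(mktemp -d)
printf 'chr1\t100\nchr1\t250\nchr1\t400\nchr2\t50\nchr2\t900\n' > "$workdir/ref.tsv"

# Toy query: the CpGs that belong to the set, in arbitrary order.
printf 'chr2\t50\nchr1\t250\n' > "$workdir/query.tsv"

# Print one 0/1 value per reference CpG, following the reference order.
membership_vector() {
  # $1 = query file, $2 = reference file
  awk -F'\t' '
    NR == FNR { inset[$1 FS $2] = 1; next }       # first file: remember query CpGs
    { key = $1 FS $2; print ((key in inset) ? 1 : 0) }' "$1" "$2"
}

vector=$(membership_vector "$workdir/query.tsv" "$workdir/ref.tsv" | tr -d '\n')
echo "$vector"
```

The same query checked against a reference with a different CpG order would give a different vector, which is why the query, knowledgebase and universe must all be packed against the same reference coordinate file.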

KYCG workflow

QUICK START

# Code that should run only on non-Windows systems
library(knowYourCG)

# Download query and knowledgebase datasets:
temp_dir <- tempdir()
knowledgebase <- file.path(temp_dir, "ChromHMM.20220414.cm")
query <- file.path(temp_dir, "single_cell_10_samples.cg")
knowledgebase_url <- "https://github.com/zhou-lab/KYCGKB_mm10/raw/refs/heads/main/ChromHMM.20220414.cm"
query_url <- "https://github.com/zhou-lab/YAME/raw/refs/heads/main/test/input/single_cell_10_samples.cg"
download.file(knowledgebase_url, destfile = knowledgebase)
download.file(query_url, destfile = query)

# Test enrichment (requires YAME to be installed and available in the shell)
res <- testEnrichment(query, knowledgebase)
KYCG_plotDot(res, short_label=TRUE)
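Because the enrichment step depends on an external command-line tool, it can help to confirm that the tool is on the PATH before starting R. The following is a minimal sketch; `check_tool` is a hypothetical helper name, and bedtools is included only because a later step in this document uses it.

```shell
# Report whether a command is available on the current PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found"
  else
    echo "missing"
  fi
}

for tool in yame bedtools; do
  echo "$tool: $(check_tool "$tool")"
done
```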

KNOWLEDGEBASES

The curated target features are called knowledgebase sets. We have curated a variety of knowledgebases that represent categorical and continuous methylation features, such as CpGs associated with chromatin states, technical artifacts, gene association and gene expression correlation, transcription factor binding sites, tissue-specific methylation, and CpG density.

Whole-genome knowledgebases are listed in the following table.

Assembly	Link
human (hg38)	https://github.com/zhou-lab/KYCGKB_hg38
mouse (mm10)	https://github.com/zhou-lab/KYCGKB_mm10

Curated CpG knowledgebases

INPUT FORMAT

For non-array data, CpG sets (query, knowledgebase, and universe) should be formatted using YAME. YAME supports a binary representation of a set (format “b”), optionally combined with a universe (format “d”). Format “d” can be created from format “b” using the yame mask function.

If your data are in BED format, you can pack them into YAME format “b” using the following pipeline. The pipeline requires BEDTools and a reference coordinate file (YAME format “r”), available below:

Assembly	Link
human (hg38)	https://github.com/zhou-lab/KYCGKB_hg38/raw/refs/heads/main/cpg_nocontig.cr
mouse (mm10)	https://github.com/zhou-lab/KYCGKB_mm10/raw/refs/heads/main/cpg_nocontig.cr

yame unpack cpg_nocontig.cr | bedtools intersect -a - -b [your_input.bed] -c -sorted |
  cut -f4 | yame pack -fb - > [your_input.cg]

The pipeline above assumes that your input BED file is already sorted. Consult the bedtools intersect documentation if you encounter problems at this step.
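If the input is not yet sorted, a sketch like the following sorts a BED file by chromosome and then by start position, which is the ordering bedtools documents for its -sorted mode. The file names and intervals are toy values. Whether this chromosome order matches the order in the reference coordinate file is an assumption that should be checked before packing.

```shell
workdir=$(mktemp -d)

# Toy unsorted BED file (hypothetical intervals).
printf 'chr2\t500\t600\nchr1\t300\t400\nchr1\t20\t80\n' > "$workdir/unsorted.bed"

# Sort by chromosome name (byte order), then numerically by start position.
LC_ALL=C sort -k1,1 -k2,2n "$workdir/unsorted.bed" > "$workdir/sorted.bed"

# sort -c exits non-zero if the file is not in the requested order.
if LC_ALL=C sort -c -k1,1 -k2,2n "$workdir/sorted.bed"; then
  sorted_ok=yes
else
  sorted_ok=no
fi

first_line=$(head -n 1 "$workdir/sorted.bed")
echo "sorted: $sorted_ok"
```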

ENRICHMENT TESTING

Enrichment testing is then a matter of running yame summary with a feature file passed via -m. We provide comprehensive feature files, which can be downloaded from the KYCG GitHub pages for mm10 and hg38. You can also create your own feature file with yame pack.

yame summary -m feature.cm yourfile.cg > yourfile.txt

Detailed descriptions of the output columns can be found on the yame summary page. In brief, a higher log2oddsratio indicates a stronger association between the tested feature and the query set. As a rule of thumb, a log2 odds ratio of about 2 or greater is considered large, values between 1 and 2 are potentially important and worth further investigation, and values around 0.5 indicate a small effect. For significance testing, the sesame R package provides the testEnrichmentFisherN function, which is also available on the YAME GitHub R page. Its four input parameters correspond to four columns of the yame summary output:

ND = N_mask
NQ = N_query
NDQ = N_overlap
NU = N_universe
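To show how the four counts relate to an odds ratio and a p-value, here is a small awk sketch. It fills a 2x2 table from ND, NQ, NDQ and NU, computes the log2 odds ratio, and sums a one-sided hypergeometric upper tail, which is the quantity a Fisher's exact test for enrichment reports. This is an illustration under those assumptions, not the code used by yame or sesame; `enrichment_stats` is a hypothetical name and the counts are toy values.

```shell
# Arguments: ND (feature size), NQ (query size), NDQ (overlap), NU (universe size).
# Prints: log2 odds ratio <TAB> one-sided p-value.
enrichment_stats() {
  awk -v ND="$1" -v NQ="$2" -v NDQ="$3" -v NU="$4" '
    # log(n!) by direct summation; adequate for small toy counts.
    function lfact(n,   i, s) { s = 0; for (i = 2; i <= n; i++) s += log(i); return s }
    function lchoose(n, k) { return lfact(n) - lfact(k) - lfact(n - k) }
    BEGIN {
      a = NDQ                   # in query and in feature
      b = NQ - NDQ              # in query only
      c = ND - NDQ              # in feature only
      d = NU - NQ - ND + NDQ    # in neither
      # Undefined when b or c is zero; real tools need to handle that case.
      log2or = log((a * d) / (b * c)) / log(2)
      # Upper tail P(X >= NDQ) for X ~ Hypergeometric(NU, ND, NQ).
      p = 0
      kmax = (ND < NQ) ? ND : NQ
      for (k = NDQ; k <= kmax; k++)
        p += exp(lchoose(ND, k) + lchoose(NU - ND, NQ - k) - lchoose(NU, NQ))
      printf "%.3f\t%.3g\n", log2or, p
    }'
}

stats=$(enrichment_stats 30 50 20 180)
log2or=$(printf '%s\n' "$stats" | cut -f1)
pvalue=$(printf '%s\n' "$stats" | cut -f2)
echo "log2 odds ratio: $log2or, p-value: $pvalue"
```

With these toy counts the 2x2 table is 20, 30, 10 and 120, so the odds ratio is (20 × 120) / (30 × 10) = 8 and the log2 odds ratio is 3. For genome-scale counts, a log-gamma based routine such as R's phyper is a better choice than the summation used here.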

A coarse differential methylation dataset can be created as follows:

yame pairwise -H 1 -c 10 sample1.cg sample2.cg -o output.cg

-H controls directionality and -c controls minimum coverage.

The output is a query CpG set with a proper universe background. Selecting an appropriate background for enrichment testing is crucial because it can significantly affect the interpretation of the results. Usually, the background is the set of CpGs measured in the experiment under the different conditions.

yame mask -c query.cg universe.cg | yame summary -m feature.cm - > yourfile.txt
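The effect of the background can be seen on a toy example. The sketch below counts the universe, query, feature and overlap sizes twice: once over every CpG and once restricted to CpGs inside the universe. It is a conceptual illustration in awk with invented values, not a description of how yame mask is implemented; `count_sets` is a hypothetical helper.

```shell
workdir=$(mktemp -d)

# One row per CpG. Columns: in_universe, in_query, in_feature (toy values).
printf '1\t1\t1\n1\t1\t0\n1\t0\t1\n1\t0\t0\n0\t1\t1\n0\t0\t0\n' > "$workdir/cpgs.tsv"

# Print NU, NQ, ND and NDQ. With mask=1, CpGs outside the universe are skipped.
count_sets() {
  # $1 = table, $2 = 1 to restrict to the universe, 0 to count every CpG
  awk -F'\t' -v mask="$2" '
    mask == 1 && $1 == 0 { next }
    { nu++; nq += $2; nd += $3; if ($2 == 1 && $3 == 1) ndq++ }
    END { print nu, nq, nd, ndq }' "$1"
}

unmasked=$(count_sets "$workdir/cpgs.tsv" 0)
masked=$(count_sets "$workdir/cpgs.tsv" 1)
echo "all CpGs:      $unmasked"
echo "universe only: $masked"
```

Here the overlap drops from 2 of 6 CpGs to 1 of 4 once unmeasured CpGs are excluded, so the same query and feature yield different counts, and therefore a different odds ratio, depending on the background.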

The following analysis example calls yame explicitly from R.

library(tibble)
# Run yame summary in the shell and parse its output (with header row) into a tibble
df <- tibble(read.table(text = system("yame summary -m ~/references/mm10/KYCGKB_mm10/stranded/kmer10.20231201.cm /mnt/isilon/zhou_lab/projects/20230727_all_public_WGBS/mm10_stranded/20231201_neuron_MeCP2.cg", intern = TRUE), header = TRUE))

SESSION INFO

sessionInfo()

## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] sesame_1.25.3               knitr_1.49                 
##  [3] gprofiler2_0.2.3            SummarizedExperiment_1.37.0
##  [5] Biobase_2.67.0              GenomicRanges_1.59.1       
##  [7] GenomeInfoDb_1.43.2         IRanges_2.41.2             
##  [9] S4Vectors_0.45.2            MatrixGenerics_1.19.0      
## [11] matrixStats_1.5.0           sesameData_1.25.0          
## [13] ExperimentHub_2.15.0        AnnotationHub_3.15.0       
## [15] BiocFileCache_2.15.0        dbplyr_2.5.0               
## [17] BiocGenerics_0.53.3         generics_0.1.3             
## [19] knowYourCG_1.3.15          
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3               bitops_1.0-9            rlang_1.1.4            
##  [4] magrittr_2.0.3          compiler_4.5.0          RSQLite_2.3.9          
##  [7] png_0.1-8               vctrs_0.6.5             reshape2_1.4.4         
## [10] stringr_1.5.1           pkgconfig_2.0.3         crayon_1.5.3           
## [13] fastmap_1.2.0           XVector_0.47.2          labeling_0.4.3         
## [16] fontawesome_0.5.3       rmarkdown_2.29          tzdb_0.4.0             
## [19] UCSC.utils_1.3.0        preprocessCore_1.69.0   purrr_1.0.2            
## [22] bit_4.5.0.1             xfun_0.50               cachem_1.1.0           
## [25] jsonlite_1.8.9          blob_1.2.4              DelayedArray_0.33.3    
## [28] BiocParallel_1.41.0     parallel_4.5.0          R6_2.5.1               
## [31] bslib_0.8.0             stringi_1.8.4           RColorBrewer_1.1-3     
## [34] jquerylib_0.1.4         Rcpp_1.0.13-1           wheatmap_0.2.0         
## [37] readr_2.1.5             Matrix_1.7-1            tidyselect_1.2.1       
## [40] abind_1.4-8             yaml_2.3.10             codetools_0.2-20       
## [43] curl_6.1.0              lattice_0.22-6          tibble_3.2.1           
## [46] plyr_1.8.9              withr_3.0.2             KEGGREST_1.47.0        
## [49] evaluate_1.0.1          Biostrings_2.75.3       pillar_1.10.1          
## [52] BiocManager_1.30.25     filelock_1.0.3          plotly_4.10.4          
## [55] RCurl_1.98-1.16         BiocVersion_3.21.1      hms_1.1.3              
## [58] ggplot2_3.5.1           munsell_0.5.1           scales_1.3.0           
## [61] glue_1.8.0              lazyeval_0.2.2          tools_4.5.0            
## [64] data.table_1.16.4       grid_4.5.0              tidyr_1.3.1            
## [67] AnnotationDbi_1.69.0    colorspace_2.1-1        GenomeInfoDbData_1.2.13
## [70] cli_3.6.3               rappdirs_0.3.3          S4Arrays_1.7.1         
## [73] viridisLite_0.4.2       dplyr_1.1.4             gtable_0.3.6           
## [76] sass_0.4.9              digest_0.6.37           SparseArray_1.7.2      
## [79] ggrepel_0.9.6           farver_2.1.2            htmlwidgets_1.6.4      
## [82] memoise_2.0.1           htmltools_0.5.8.1       lifecycle_1.0.4        
## [85] httr_1.4.7              bit64_4.5.2

Functional Analysis of DNAm Sequencing Data
