1 Exclude ranges

Genomic ranges of problematic genomic regions that should be avoided when working with genomic data. For human, mouse, and selected model organisms.

TL;DR - For human hg38 genome assembly, Anshul recommends ENCFF356LFX exclusion list regions.

BED files of exclusion regions are available on the ENCODE project website (Amemiya, Kundaje, and Boyle 2019). Human (hg19, hg38) and mouse (mm9, mm10) exclusion regions are available. However, exclusion lists generated by multiple labs often create uncertainty what to use. The purpose of this package is to provide a unified place for informed retrieval of exclusion regions.

Naming convention: <genome assembly>.<lab>.<original file name>, e.g., hg19.Birney.wgEncodeDacMapabilityConsensusExcludable.

See make-data.R how to create the excluderanges GRanges objects.

1.1 Install `excluderanges`

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
# Install the development version of Bioconductor (need 3.14 and above)
# BiocManager::install(version = "devel")
# Check that you have a valid Bioconductor installation
# BiocManager::valid()
# Install the package
BiocManager::install("excluderanges", version = "devel")

# BiocManager::install("mdozmorov/excluderanges")

1.2 Use excluderanges

Get an overview of what’s available

suppressMessages(library(AnnotationHub))
ah <- AnnotationHub()
#> snapshotDate(): 2021-10-18
query_data <- query(ah, "excluderanges")
query_data
#> AnnotationHub with 42 records
#> # snapshotDate(): 2021-10-18
#> # $dataprovider: UCSC, ENCODE, mitra.stanford.edu/kundaje/akundaje/release/b...
#> # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabdi...
#> # $rdataclass: GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH95908"]]' 
#> 
#>             title                                                    
#>   AH95908 | ce10.Kundaje.ce10-Excludable.rds                         
#>   AH95909 | dm3.Kundaje.dm3-Excludable.rds                           
#>   AH95910 | hg19.Bernstein.Mint_Excludable_hg19.rds                  
#>   AH95911 | hg19.Birney.wgEncodeDacMapabilityConsensusExcludable.rds 
#>   AH95912 | hg19.Crawford.wgEncodeDukeMapabilityRegionsExcludable.rds
#>   ...       ...                                                      
#>   AH95945 | mm10.UCSC.telomere.rds                                   
#>   AH95946 | mm9.UCSC.centromere.rds                                  
#>   AH95947 | mm9.UCSC.contig.rds                                      
#>   AH95948 | mm9.UCSC.fragment.rds                                    
#>   AH95949 | mm10.UCSC.scaffold.rds

hg38 excluderanges coordinates recommended by Anshul

# Check titles
# as.data.frame(mcols(query_data[1:10])["title"]) 
excludeGR.hg38.Kundaje.1 <- query_data[["AH95917"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
excludeGR.hg38.Kundaje.1
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#> 
#>     I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> GRanges object with 910 ranges and 0 metadata columns:
#>         seqnames            ranges strand
#>            <Rle>         <IRanges>  <Rle>
#>     [1]     chr1     628903-635104      *
#>     [2]     chr1   5850087-5850571      *
#>     [3]     chr1   8909610-8910014      *
#>     [4]     chr1   9574580-9574997      *
#>     [5]     chr1 32043823-32044203      *
#>     ...      ...               ...    ...
#>   [906]     chrY 11290797-11334278      *
#>   [907]     chrY 11493053-11592850      *
#>   [908]     chrY 11671014-11671046      *
#>   [909]     chrY 11721528-11749472      *
#>   [910]     chrY 56694632-56889743      *
#>   -------
#>   seqinfo: 24 sequences from hg38 genome

Save the data in a BED file, if needed.

rtracklayer::export(excludeGR.hg38.Kundaje.1, "hg38.Kundaje.GRCh38_unified_Excludable.bed", format = "bed")

We can load other excludable regions for the hg38 genome assembly and compare them.

query_data <- query(ah, c("excluderanges", "hg38", "Exclusion regions"))
query_data
#> AnnotationHub with 6 records
#> # snapshotDate(): 2021-10-18
#> # $dataprovider: ENCODE
#> # $species: Homo sapiens
#> # $rdataclass: GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH95915"]]' 
#> 
#>             title                                                       
#>   AH95915 | hg38.Bernstein.Mint_Excludable_GRCh38.rds                   
#>   AH95916 | hg38.Kundaje.GRCh38.Excludable.rds                          
#>   AH95917 | hg38.Kundaje.GRCh38_unified_Excludable.rds                  
#>   AH95918 | hg38.Reddy.wgEncodeDacMapabilityConsensusExcludable.hg38.rds
#>   AH95919 | hg38.Wold.hg38mitoExcludable.rds                            
#>   AH95920 | hg38.Yeo.eCLIP_Excludableregions.hg38liftover.bed.fixed.rds
excludeGR.hg38.Bernstein <- query_data[["AH95915"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
excludeGR.hg38.Kundaje.2 <- query_data[["AH95916"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
excludeGR.hg38.Reddy     <- query_data[["AH95918"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
excludeGR.hg38.Wold      <- query_data[["AH95919"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
excludeGR.hg38.Yeo       <- query_data[["AH95920"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache

Compare the number of excludable regions.

library(ggplot2)
mtx_to_plot <- data.frame(Count = c(length(excludeGR.hg38.Bernstein), 
                                    length(excludeGR.hg38.Kundaje.1), 
                                    length(excludeGR.hg38.Kundaje.2), 
                                    length(excludeGR.hg38.Reddy), 
                                    length(excludeGR.hg38.Wold), 
                                    length(excludeGR.hg38.Yeo)),
                          Source = c("Bernstein.Mint_Excludable_GRCh38", 
                                     "Kundaje.GRCh38_unified_Excludable", 
                                     "Kundaje.GRCh38.Excludable", 
                                     "Reddy.wgEncodeDacMapabilityConsensusExcludable", 
                                     "Wold.hg38mitoExcludable", 
                                     "Yeo.eCLIP_Excludableregions.hg38liftover.bed"))
# Order Source by the number of regions
mtx_to_plot$Source <- factor(mtx_to_plot$Source, levels = mtx_to_plot$Source[order(mtx_to_plot$Count)])

ggplot(mtx_to_plot, aes(x = Source, y = Count, fill = Source)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  theme_bw() + theme(legend.position = "none")

# ggsave("man/figures/excluderanges_hg38_count.png", width = 5.5, height = 2)

Compare the width of excludable regions. log2 scale because of heavy right tail distributions.

library(ggridges)
mtx_to_plot <- data.frame(Width = c(width(excludeGR.hg38.Bernstein), 
                                    width(excludeGR.hg38.Kundaje.1), 
                                    width(excludeGR.hg38.Kundaje.2), 
                                    width(excludeGR.hg38.Reddy), 
                                    width(excludeGR.hg38.Wold), 
                                    width(excludeGR.hg38.Yeo)),
                          Source = c(rep("Bernstein.Mint_Excludable_GRCh38", length(excludeGR.hg38.Bernstein)),
                                     rep("Kundaje.GRCh38_unified_Excludable", length(excludeGR.hg38.Kundaje.1)),
                                     rep("Kundaje.GRCh38.Excludable", length(excludeGR.hg38.Kundaje.2)),
                                     rep("Reddy.wgEncodeDacMapabilityConsensusExcludable", length(excludeGR.hg38.Reddy)),
                                     rep("Wold.hg38mitoExcludable", length(excludeGR.hg38.Wold)),
                                     rep("Yeo.eCLIP_Excludableregions.hg38liftover.bed", length(excludeGR.hg38.Yeo))))

ggplot(mtx_to_plot, aes(x = log2(Width), y = Source, fill = Source)) +
  geom_density_ridges() +
  theme_bw() + theme(legend.position = "none")
#> Picking joint bandwidth of 0.372

# ggsave("man/figures/excluderanges_hg38_width.png", width = 5.5, height = 2)

We can investigate the total width of each set of excludable ranges.

mtx_to_plot <- data.frame(TotalWidth = c(sum(width(excludeGR.hg38.Bernstein)), 
                                         sum(width(excludeGR.hg38.Kundaje.1)), 
                                         sum(width(excludeGR.hg38.Kundaje.2)), 
                                         sum(width(excludeGR.hg38.Reddy)), 
                                         sum(width(excludeGR.hg38.Wold)), 
                                         sum(width(excludeGR.hg38.Yeo))), 
                          Source = c("Bernstein.Mint_Excludable_GRCh38", 
                                     "Kundaje.GRCh38_unified_Excludable", 
                                     "Kundaje.GRCh38.Excludable", 
                                     "Reddy.wgEncodeDacMapabilityConsensusExcludable", 
                                     "Wold.hg38mitoExcludable", 
                                     "Yeo.eCLIP_Excludableregions.hg38liftover"))
ggplot(mtx_to_plot, aes(x = TotalWidth, y = Source, fill = Source)) + 
  geom_bar(stat="identity") + scale_x_log10() + scale_y_discrete(label=abbreviate) +
  xlab("log10 total width")

# ggsave("man/figures/excluderanges_hg38_sumwidth.png", width = 6.5, height = 2)

We can compare Jaccard overlap between those sets of excludable regions.

library(pheatmap)
library(stringr)
# Jaccard calculations
jaccard <- function(gr_a, gr_b) {
  intersects <- GenomicRanges::intersect(gr_a, gr_b, ignore.strand = TRUE)
  intersection <- sum(width(intersects))
  union <- sum(width(GenomicRanges::union(gr_a, gr_b, ignore.strand = TRUE)))
  DataFrame(intersection, union, 
            jaccard = intersection/union,
             n_intersections = length(intersects))
}
# List and names of all excludable regions
all_excludeGR_list <- list(excludeGR.hg38.Bernstein, 
                        excludeGR.hg38.Kundaje.1, 
                        excludeGR.hg38.Kundaje.2,
                        excludeGR.hg38.Reddy,
                        excludeGR.hg38.Wold,
                        excludeGR.hg38.Yeo)
all_excludeGR_name <- c("Bernstein.Mint_Excludable_GRCh38", 
                     "Kundaje.GRCh38_unified_Excludable", 
                     "Kundaje.GRCh38.Excludable", 
                     "Reddy.wgEncodeDacMapabilityConsensusExcludable", 
                     "Wold.hg38mitoExcludable", 
                     "Yeo.eCLIP_Excludableregions.hg38liftover")
# Correlation matrix, empty
mtx_to_plot <- matrix(data = 0, nrow = length(all_excludeGR_list), ncol = length(all_excludeGR_list))
# Fill it in
for (i in 1:length(all_excludeGR_list)) {
  for (j in 1:length(all_excludeGR_list)) {
    # If diagonal, set to zero
    if (i == j) mtx_to_plot[i, j] <- 0
    # Process only one half, the other is symmetric
    if (i > j) {
      mtx_to_plot[i, j] <- mtx_to_plot[j, i] <- jaccard(all_excludeGR_list[[i]], all_excludeGR_list[[j]])[["jaccard"]]
    }
  }
}
# Trim row/colnames
rownames(mtx_to_plot) <- colnames(mtx_to_plot) <- str_trunc(all_excludeGR_name, width = 25) 
# Save the plot
# png("man/figures/excluderanges_hg38_jaccard.png", width = 1000, height = 900, res = 200)
pheatmap(data.matrix(mtx_to_plot))

# dev.off()

Note that some excludable ranges objects contain six columns, implying there may be some interesting metadata. Let’s explore one.

mcols(excludeGR.hg38.Reddy)
#> DataFrame with 401 rows and 2 columns
#>                       name     score
#>                <character> <numeric>
#> 1   High_Mappability_isl..      1000
#> 2         Satellite_repeat      1000
#> 3                 BSR/Beta      1000
#> 4   Low_mappability_island      1000
#> 5                 (CATTC)n      1000
#> ...                    ...       ...
#> 397                   TAR1      1000
#> 398       Satellite_repeat      1000
#> 399               (CATTC)n      1000
#> 400               (CATTC)n      1000
#> 401                   TAR1      1000
mtx_to_plot <- as.data.frame(table(mcols(excludeGR.hg38.Reddy)[["name"]]))
colnames(mtx_to_plot) <- c("Type", "Number")
mtx_to_plot <- mtx_to_plot[order(mtx_to_plot$Number), ]
mtx_to_plot$Type <- factor(mtx_to_plot$Type, levels = mtx_to_plot$Type)
ggplot(mtx_to_plot, aes(x = Number, y = Type, fill = Type)) +
  geom_bar(stat="identity") +
  theme_bw() + theme(legend.position = "none")

# ggsave("man/figures/excluderanges_hg38_Reddy_metadata.png", width = 5, height = 2.5)

One may decide to combine the excludable ranges from all labs, although from previous results we may decide to follow Anshul’s advice advice about the ENCFF356LFX exclusion list regions and use the excludeGR.hg38.Kundaje.1 object.

excludeGR.hg38.all <- reduce(c(excludeGR.hg38.Bernstein, excludeGR.hg38.Kundaje.1, excludeGR.hg38.Kundaje.2, excludeGR.hg38.Reddy, excludeGR.hg38.Wold, excludeGR.hg38.Yeo))
#> Warning in valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE): GRanges object contains 1 out-of-bound range located on sequence
#>   chr4_GL000008v2_random. Note that ranges located on a sequence whose
#>   length is unknown (NA) or on a circular sequence are not considered
#>   out-of-bound (use seqlengths() and isCircular() to get the lengths and
#>   circularity flags of the underlying sequences). You can use trim() to
#>   trim these ranges. See ?`trim,GenomicRanges-method` for more
#>   information.
# Keep only standard chromosomes
excludeGR.hg38.all <- keepStandardChromosomes(excludeGR.hg38.all, pruning.mode = "coarse")
print(length(excludeGR.hg38.all))
#> [1] 13239
summary(width(excludeGR.hg38.all))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>       5    1778    2306    8153    2859 5407757

1.3 Centromeres, telomeres, etc.

Besides the ENCODE-produced excludable regions, we may want to exclude centromeres, telomeres, and other gap locations. The “Gap Locations” track for Homo Sapiens is available for the GRcH37/hg19 genome assembly as a UCSC ‘gap’ table. It can be retrieved from AnnotationHub, but lacks the metadata columns needed to decide the type of gaps.

# Search for the gap track
# ahData <- query(ah, c("gap", "Homo sapiens", "hg19"))
# ahData[ahData$title == "Gap"]
gaps <- ahData[["AH6444"]]

The UCSC ‘gap’ table provides better granularity about the types of gaps available. E.g., for human, hg19, we have the following types and the number of gaps.

Those objects are provided as individual GRanges.

Naming convention: <genome assembly>.UCSC.<gap type>, e.g., hg19.UCSC.gap_centromere.

We can similarly load any gap type object.

query_data <- query(ah, c("excluderanges", "UCSC", "Homo Sapiens", "hg19"))
query_data
#> AnnotationHub with 7 records
#> # snapshotDate(): 2021-10-18
#> # $dataprovider: UCSC
#> # $species: Homo sapiens
#> # $rdataclass: GRanges
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["AH95927"]]' 
#> 
#>             title                        
#>   AH95927 | hg19.UCSC.centromere.rds     
#>   AH95928 | hg19.UCSC.clone.rds          
#>   AH95929 | hg19.UCSC.contig.rds         
#>   AH95930 | hg19.UCSC.heterochromatin.rds
#>   AH95931 | hg19.UCSC.scaffold.rds       
#>   AH95932 | hg19.UCSC.short_arm.rds      
#>   AH95933 | hg19.UCSC.telomere.rds

gapsGR_hg19_centromere <- query_data[["AH95927"]]
#> downloading 1 resources
#> retrieving 1 resource
#> loading from cache
gapsGR_hg19_centromere
#> GRanges object with 24 ranges and 6 metadata columns:
#>       seqnames              ranges strand |       bin        ix           n
#>          <Rle>           <IRanges>  <Rle> | <numeric> <numeric> <character>
#>     2     chr1 121535434-124535434      * |        23      1270           N
#>   184    chr21   11288129-14288129      * |        10        22           N
#>   199    chr22   13000000-16000000      * |        10         3           N
#>   206    chr19   24681782-27681782      * |         1       410           N
#>   224     chrY   10104553-13104553      * |        10       105           N
#>   ...      ...                 ...    ... .       ...       ...         ...
#>   439     chr6   58830166-61830166      * |        16       628           N
#>   453     chr5   46405641-49405641      * |        14       452           N
#>   460     chr4   49660117-52660117      * |         1       447           N
#>   476     chr3   90504854-93504854      * |         2       784           N
#>   481     chr2   92326171-95326171      * |        20       770           N
#>            size        type      bridge
#>       <numeric> <character> <character>
#>     2     3e+06  centromere          no
#>   184     3e+06  centromere          no
#>   199     3e+06  centromere          no
#>   206     3e+06  centromere          no
#>   224     3e+06  centromere          no
#>   ...       ...         ...         ...
#>   439     3e+06  centromere          no
#>   453     3e+06  centromere          no
#>   460     3e+06  centromere          no
#>   476     3e+06  centromere          no
#>   481     3e+06  centromere          no
#>   -------
#>   seqinfo: 24 sequences from hg19 genome

1.3.1 Centromeres for the hg38 genome assembly

Note that the UCSC ‘gap’ table for the hg38 human genome assembly does not contain genomic coordinates for the “centromere” gap type. These can be obtained from the rCGH package as follows:

suppressPackageStartupMessages(library(rCGH))
suppressPackageStartupMessages(library(GenomicRanges))
# hg38 # data.frame
# Adjust chromosome names
hg38$chrom[hg38$chrom == 23] <- "X"
hg38$chrom[hg38$chrom == 24] <- "Y"
hg38$chrom <- paste0("chr", hg38$chrom)
# Make GRanges object
hg38.UCSC.centromere <- makeGRangesFromDataFrame(hg38, seqnames.field = "chrom", start.field = "centromerStart", end.field = "centromerEnd")
# Assign seqinfo data
seqlengths(hg38.UCSC.centromere) <- hg38$length
genome(hg38.UCSC.centromere)     <- "hg38"
# Resulting object
hg38.UCSC.centromere
#> GRanges object with 24 ranges and 0 metadata columns:
#>        seqnames              ranges strand
#>           <Rle>           <IRanges>  <Rle>
#>    [1]     chr1 121535434-124535434      *
#>    [2]     chr2   92326171-95326171      *
#>    [3]     chr3   90504854-93504854      *
#>    [4]     chr4   49660117-52660117      *
#>    [5]     chr5   46405641-49405641      *
#>    ...      ...                 ...    ...
#>   [20]    chr20   26369569-29369569      *
#>   [21]    chr21   11288129-14288129      *
#>   [22]    chr22   13000000-16000000      *
#>   [23]     chrX   58632012-61632012      *
#>   [24]     chrY   10104553-13104553      *
#>   -------
#>   seqinfo: 24 sequences from hg38 genome

The rCGH package also contains data for the hg19 and hg18 genomes. The hg19 centromere data is equivalent to the hg19.UCSC.centromere object provided in our excluderanges package.

1.4 Source data for the excludable regions

Object	Number.of.regions	Assembly	Lab	Number.of.columns	Source
ce10.Kundaje.ce10-Excludable.rds	122	ce10	Anshul Kundaje, Stanford	3	http://mitra.stanford.edu/kundaje/akundaje/release/Excludables/ce10-C.elegans
dm3.Kundaje.dm3-Excludable.rds	492	dm3	Anshul Kundaje, Stanford	3	http://mitra.stanford.edu/kundaje/akundaje/release/Excludables/dm3-D.melanogaster/
hg19.Bernstein.Mint_Excludable_hg19.rds	9035	hg19	Bradley Bernstein, Broad	6	https://www.encodeproject.org/files/ENCFF200UUD/
hg19.Birney.wgEncodeDacMapabilityConsensusExcludable.rds	411	hg19	Ewan Birney, EBI	6	https://www.encodeproject.org/files/ENCFF001TDO/
hg19.Crawford.wgEncodeDukeMapabilityRegionsExcludable.rds	1649	hg19	Gregory Crawford, Duke	6	https://www.encodeproject.org/files/ENCFF001THR/
hg19.Wold.hg19mitoExcludable.rds	295	hg19	Barbara Wold, Caltech	3	https://www.encodeproject.org/files/ENCFF055QTV/
hg19.Yeo.eCLIP_Excludableregions.hg19.rds	57	hg19	Gene Yeo, UCSD	6	https://www.encodeproject.org/files/ENCFF039QTN/
hg38.Bernstein.Mint_Excludable_GRCh38.rds	12052	hg38	Bradley Bernstein, Broad	6	https://www.encodeproject.org/files/ENCFF023CZC/
hg38.Kundaje.GRCh38.Excludable.rds	38	hg38	Anshul Kundaje, Stanford	3	https://www.encodeproject.org/files/ENCFF356LFX/
hg38.Kundaje.GRCh38_unified_Excludable.rds	910	hg38	Anshul Kundaje, Stanford	3	https://www.encodeproject.org/files/ENCFF419RSJ/
hg38.Reddy.wgEncodeDacMapabilityConsensusExcludable.hg38.rds	401	hg38	Tim Reddy, Duke	6	https://www.encodeproject.org/files/ENCFF220FIN/
hg38.Wold.hg38mitoExcludable.rds	299	hg38	Barbara Wold, Caltech	3	https://www.encodeproject.org/files/ENCFF940NTE/
hg38.Yeo.eCLIP_Excludableregions.hg38liftover.bed.fixed.rds	56	hg38	Gene Yeo, UCSD	6	https://www.encodeproject.org/files/ENCFF269URO/
mm10.Hardison.Excludable.full.rds	7865	mm10	Ross Hardison, PennState	3	https://www.encodeproject.org/files/ENCFF790DJT/
mm10.Hardison.psuExcludable.mm10.rds	5552	mm10	Ross Hardison, PennState	3	https://www.encodeproject.org/files/ENCFF226BDM/
mm10.Kundaje.anshul.Excludable.mm10.rds	3010	mm10	Anshul Kundaje, Stanford	3	https://www.encodeproject.org/files/ENCFF999QPV/
mm10.Kundaje.mm10.Excludable.rds	164	mm10	Anshul Kundaje, Stanford	3	https://www.encodeproject.org/files/ENCFF547MET/
mm10.Wold.mm10mitoExcludable.rds	123	mm10	Barbara Wold, Caltech	3	https://www.encodeproject.org/files/ENCFF759PJK/
mm9.Wold.mm9mitoExcludable.rds	123	mm9	Barbara Wold, Caltech	3	https://www.encodeproject.org/files/ENCFF299EZH/

Object	Number.of.regions	Assembly	Lab	Number.of.columns	Source
hg19.UCSC.centromere.rds	24	hg19	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
hg19.UCSC.clone.rds	207	hg19	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
hg19.UCSC.contig.rds	163	hg19	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
hg19.UCSC.heterochromatin.rds	12	hg19	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
hg19.UCSC.scaffold.rds	40	hg19	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
hg19.UCSC.short_arm.rds	5	hg19	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
hg19.UCSC.telomere.rds	46	hg19	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=hg19&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
hg38.UCSC.contig.rds	285	hg38	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
hg38.UCSC.heterochromatin.rds	11	hg38	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
hg38.UCSC.scaffold.rds	478	hg38	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
hg38.UCSC.short_arm.rds	5	hg38	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
hg38.UCSC.telomere.rds	48	hg38	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
mm10.UCSC.centromere.rds	20	mm10	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=mm10&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
mm10.UCSC.clone.rds	114	mm10	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=mm10&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
mm10.UCSC.contig.rds	104	mm10	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=mm10&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
mm10.UCSC.fragment.rds	1	mm10	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=mm10&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
mm10.UCSC.other.rds	384	mm10	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=mm10&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
mm10.UCSC.short_arm.rds	21	mm10	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=mm10&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
mm10.UCSC.telomere.rds	42	mm10	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=mm10&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
mm9.UCSC.centromere.rds	21	mm9	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=mm9&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
mm9.UCSC.contig.rds	281	mm9	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=mm9&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
mm9.UCSC.fragment.rds	709	mm9	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=mm9&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema
mm10.UCSC.scaffold.rds	48	mm10	UCSC	9	http://genome.ucsc.edu/cgi-bin/hgTables?db=mm10&hgta_group=map&hgta_track=gap&hgta_table=gap&hgta_doSchema=describe+table+schema

Download all data from the Google Drive folder

1.5 `R` session information.

#> R version 4.1.1 (2021-08-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.14-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.14-bioc/R/lib/libRlapack.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] rCGH_1.23.0          stringr_1.4.0        pheatmap_1.0.12     
#>  [4] ggridges_0.5.3       ggplot2_3.3.5        GenomicRanges_1.45.0
#>  [7] GenomeInfoDb_1.29.10 IRanges_2.27.2       S4Vectors_0.31.5    
#> [10] AnnotationHub_3.1.6  BiocFileCache_2.1.1  dbplyr_2.1.1        
#> [13] BiocGenerics_0.39.2  BiocStyle_2.21.4    
#> 
#> loaded via a namespace (and not attached):
#>   [1] colorspace_2.0-2                        
#>   [2] rjson_0.2.20                            
#>   [3] ellipsis_0.3.2                          
#>   [4] mclust_5.4.7                            
#>   [5] DNAcopy_1.67.0                          
#>   [6] XVector_0.33.0                          
#>   [7] farver_2.1.0                            
#>   [8] affyio_1.63.2                           
#>   [9] bit64_4.0.5                             
#>  [10] interactiveDisplayBase_1.31.2           
#>  [11] AnnotationDbi_1.55.2                    
#>  [12] fansi_0.5.0                             
#>  [13] xml2_1.3.2                              
#>  [14] splines_4.1.1                           
#>  [15] cachem_1.0.6                            
#>  [16] knitr_1.36                              
#>  [17] jsonlite_1.7.2                          
#>  [18] Rsamtools_2.9.1                         
#>  [19] cluster_2.1.2                           
#>  [20] png_0.1-7                               
#>  [21] shiny_1.7.1                             
#>  [22] TxDb.Hsapiens.UCSC.hg18.knownGene_3.2.2 
#>  [23] BiocManager_1.30.16                     
#>  [24] compiler_4.1.1                          
#>  [25] httr_1.4.2                              
#>  [26] assertthat_0.2.1                        
#>  [27] Matrix_1.3-4                            
#>  [28] fastmap_1.1.0                           
#>  [29] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 
#>  [30] limma_3.49.4                            
#>  [31] later_1.3.0                             
#>  [32] htmltools_0.5.2                         
#>  [33] prettyunits_1.1.1                       
#>  [34] tools_4.1.1                             
#>  [35] gtable_0.3.0                            
#>  [36] glue_1.4.2                              
#>  [37] GenomeInfoDbData_1.2.7                  
#>  [38] affy_1.71.0                             
#>  [39] dplyr_1.0.7                             
#>  [40] rappdirs_0.3.3                          
#>  [41] Rcpp_1.0.7                              
#>  [42] TxDb.Hsapiens.UCSC.hg38.knownGene_3.14.0
#>  [43] Biobase_2.53.0                          
#>  [44] jquerylib_0.1.4                         
#>  [45] vctrs_0.3.8                             
#>  [46] Biostrings_2.61.2                       
#>  [47] multtest_2.49.0                         
#>  [48] aCGH_1.71.0                             
#>  [49] preprocessCore_1.55.2                   
#>  [50] rtracklayer_1.53.1                      
#>  [51] xfun_0.27                               
#>  [52] mime_0.12                               
#>  [53] lifecycle_1.0.1                         
#>  [54] restfulr_0.0.13                         
#>  [55] XML_3.99-0.8                            
#>  [56] org.Hs.eg.db_3.14.0                     
#>  [57] MASS_7.3-54                             
#>  [58] zlibbioc_1.39.0                         
#>  [59] scales_1.1.1                            
#>  [60] hms_1.1.1                               
#>  [61] promises_1.2.0.1                        
#>  [62] MatrixGenerics_1.5.4                    
#>  [63] parallel_4.1.1                          
#>  [64] SummarizedExperiment_1.23.5             
#>  [65] RColorBrewer_1.1-2                      
#>  [66] yaml_2.2.1                              
#>  [67] curl_4.3.2                              
#>  [68] memoise_2.0.0                           
#>  [69] sass_0.4.0                              
#>  [70] biomaRt_2.49.7                          
#>  [71] stringi_1.7.5                           
#>  [72] RSQLite_2.2.8                           
#>  [73] BiocVersion_3.14.0                      
#>  [74] highr_0.9                               
#>  [75] BiocIO_1.3.0                            
#>  [76] GenomicFeatures_1.45.2                  
#>  [77] filelock_1.0.2                          
#>  [78] BiocParallel_1.27.17                    
#>  [79] rlang_0.4.12                            
#>  [80] pkgconfig_2.0.3                         
#>  [81] bitops_1.0-7                            
#>  [82] matrixStats_0.61.0                      
#>  [83] evaluate_0.14                           
#>  [84] lattice_0.20-45                         
#>  [85] purrr_0.3.4                             
#>  [86] GenomicAlignments_1.29.0                
#>  [87] labeling_0.4.2                          
#>  [88] bit_4.0.4                               
#>  [89] tidyselect_1.1.1                        
#>  [90] plyr_1.8.6                              
#>  [91] magrittr_2.0.1                          
#>  [92] bookdown_0.24                           
#>  [93] R6_2.5.1                                
#>  [94] magick_2.7.3                            
#>  [95] generics_0.1.0                          
#>  [96] DelayedArray_0.19.4                     
#>  [97] DBI_1.1.1                               
#>  [98] pillar_1.6.4                            
#>  [99] withr_2.4.2                             
#> [100] survival_3.2-13                         
#> [101] KEGGREST_1.33.0                         
#> [102] RCurl_1.98-1.5                          
#> [103] tibble_3.1.5                            
#> [104] crayon_1.4.1                            
#> [105] utf8_1.2.2                              
#> [106] rmarkdown_2.11                          
#> [107] progress_1.2.2                          
#> [108] grid_4.1.1                              
#> [109] blob_1.2.2                              
#> [110] digest_0.6.28                           
#> [111] xtable_1.8-4                            
#> [112] httpuv_1.6.3                            
#> [113] munsell_0.5.0                           
#> [114] bslib_0.3.1

References

Amemiya, Haley M, Anshul Kundaje, and Alan P Boyle. 2019. “The Encode Blacklist: Identification of Problematic Regions of the Genome.” Sci Rep 9 (1): 9354. https://doi.org/10.1038/s41598-019-45839-z.

Introduction to excluderanges

2021-10-20

1 Exclude ranges

1.1 Install `excluderanges`

1.2 Use excluderanges

1.3 Centromeres, telomeres, etc.

1.3.1 Centromeres for the hg38 genome assembly

1.4 Source data for the excludable regions

1.5 `R` session information.

References

Introduction to excluderanges

2021-10-20

1 Exclude ranges

1.1 Install excluderanges

1.2 Use excluderanges

1.3 Centromeres, telomeres, etc.

1.3.1 Centromeres for the hg38 genome assembly

1.4 Source data for the excludable regions

1.5 R session information.

References

1.1 Install `excluderanges`

1.5 `R` session information.