# Load and attach PRONE
library(PRONE)
Here, we are directly working with the SummarizedExperiment data. For more information on how to create the SummarizedExperiment from a proteomics data set, please refer to the “Get Started” vignette.
The example TMT data set originates from (Biadglegne et al. 2022).
data("tuberculosis_TMT_se")
se <- tuberculosis_TMT_se
This SummarizedExperiment object already includes data of different normalization methods. Since this vignette should show you how to use the PRONE workflow for novel proteomics data, we will remove the normalized data and only keep the raw and log2 data that are available after loading the data accordingly.
se <- subset_SE_by_norm(se, ain = c("raw", "log2"))
To get an overview on the number of NAs, you can simply use the function get_NA_overview()
:
knitr::kable(get_NA_overview(se, ain = "log2"))
Total.Values | NA.Values | NA.Percentage |
---|---|---|
6020 | 1945 | 32.30897 |
To get an overview on the number of samples per sample group or batch, you can simply use the function plot_condition_overview()
by specifying the column of the meta-data that should be used for coloring. By default (condition = NULL), the column specified in load_data()
will be used.
plot_condition_overview(se, condition = NULL)
#> Condition of SummarizedExperiment used!
plot_condition_overview(se, condition = "Pool")
A general overview of the protein intensities across the different samples is provided by the function plot_heatmap()
. The parameter “ain” specifies the data to plot, currently only “raw” and “log2” is available (names(assays(se)). Later if multiple normalization methods are executed, these will be saved as assays, and the normalized data can be visualized.
available_ains <- names(assays(se))
plot_heatmap(se, ain = "log2", color_by = c("Pool", "Group"),
label_by = NULL, only_refs = FALSE)
#> Label of SummarizedExperiment used!
#> $log2
Similarly, an upset plot can be generated to visualize the overlaps between sets defined by a specific column in the metadata. The sets are generated by using non-NA values.
plot_upset(se, color_by = NULL, label_by = NULL, mb.ratio = c(0.7,0.3),
only_refs = FALSE)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!
If you are interested in the intensities of specific biomarkers, you can use the plot_markers_boxplots()
function to compare the distribution of intensities per group. The plot can be generated per marker and facet by normalization method (facet_norm = TRUE) or by normalization method and facet by marker (facet_marker = TRUE).
p <- plot_markers_boxplots(se,
markers = c("Q92954;J3KP74;E9PLR3", "Q9Y6Z7"),
ain = "log2",
id_column = "Protein.IDs",
facet_norm = FALSE,
facet_marker = TRUE)
#> Condition of SummarizedExperiment used!
#> No shaping done.
p[[1]] + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5))
se <- filter_out_complete_NA_proteins(se)
#> 13 proteins were removed.
Typically proteins with “+” in the columns “Reverse”, “Only.identified.by.site”, and “Potential.contaminant” are removed in case of a MaxQuant proteinGroups.txt output file.
se <- filter_out_proteins_by_value(se, "Reverse", "+")
#> 17 proteins were removed.
se <- filter_out_proteins_by_value(se, "Only.identified.by.site", "+")
#> 1 proteins were removed.
#se <- filter_out_proteins_by_value(se, "Potential.contaminant", "+")
If you don’t want to remove for instance all proteins with “Potential.contaminant == +”, you can also first get the protein ID with the specific value, check them in Uniprot, and then remove only some by using the function filter_out_proteins_by_ID()
.
pot_contaminants <- get_proteins_by_value(se, "Potential.contaminant", "+")
#> 24 proteins were identified.
se <- filter_out_proteins_by_ID(se, pot_contaminants)
#> 24 proteins were removed.
Due to the high amount of missing values in MS-based proteomics data, it is important to explore the missing value pattern in the data. The function plot_NA_heatmap()
provides a heatmap of the proteins with at least one missing value across all samples.
plot_NA_heatmap(se, color_by = NULL, label_by = NULL, cluster_samples = TRUE,
cluster_proteins = TRUE)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!
Another way to explore the missing value pattern is to use the functions plot_NA_density()
and plot_NA_frequency()
.
plot_NA_density(se)
plot_NA_frequency(se)
To reduce the amount of missing values, it is possible to filter proteins by applying a missing value threshold. The function filter_out_NA_proteins_by_threshold()
removes proteins with more missing values than the specified threshold. The threshold is a value between 0 and 1, where 0.7, for instance, means that proteins with less than 70% of real values will be removed, i.e., proteins with more than 30% missing values will be removed.
se <- filter_out_NA_proteins_by_threshold(se, thr = 0.7)
#> 99 proteins were removed.
plot_NA_heatmap(se)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!
Following filtering proteins by different criteria, samples can be analyzed more in detail. PRONE provides some functions, such as plot_nr_prot_samples()
and plot_tot_int_samples()
, to get an overview of the number of proteins and the total intensity per sample, but also offers the automatic outlier detection method of POMA.
plot_nr_prot_samples(se, color_by = NULL, label_by = NULL)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!
plot_tot_int_samples(se, color_by = NULL, label_by = NULL)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!
Based on these plots, samples “1.HC_Pool1” and 1_HC_Pool2 seem to be outliers. You can easily remove samples manually by using the remove_samples_manually()
function.
se <- remove_samples_manually(se, "Label", c("1.HC_Pool1", "1.HC_Pool2"))
#> 2 samples removed.
And you can remove the reference samples directly using the function remove_reference_samples()
. But attention: possibly you need them for normalization! That is exactly why we currently keep them!
se_no_refs <- remove_reference_samples(se)
#> 2 reference samples removed from the SummarizedExperiment object.
The POMA R package provides a method to detect outliers in proteomics data. The function detect_outliers_POMA()
detects outliers in the data based on the POMA algorithm. The method calculates the euclidean distances (or maximum or manhattan distances) between all sample pairs and then group centroids are computed either as the average distance of group members or the spatial median. The distances of the group members to the corresponding group centroid are then calculated to apply the classical univariate outlier detection formula Q3 + 1.5 * IQR (1.5 is adjustable) to detect group-dependent outliers.
The function returns a list with the following elements: polygon plot, distance boxplot, and the outliers. For further information on the POMA algorithm, please refer to the original publication (Castellano-Escuder et al. 2021):
poma_res <- detect_outliers_POMA(se, ain = "log2")
#> Condition of SummarizedExperiment used!
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.
poma_res$polygon_plot
poma_res$distance_boxplot
knitr::kable(poma_res$outliers,
caption = "Outliers detected by the POMA algorithm.", digits = 2)
sample | groups | distance_to_centroid | limit_distance |
---|---|---|---|
Reporter.intensity.corrected.10.Pat_tbc_pool_2 | PTB | 11.05 | 10.52 |
To remove the outliers detected via the POMA algorithm, just put the data.table of the detect_outliers_POMA()
function into the remove_POMA_outliers()
function.
se <- remove_POMA_outliers(se, poma_res$outliers)
#> 1 outlier samples removed.
utils::sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] PRONE_1.1.0 SummarizedExperiment_1.37.0
#> [3] Biobase_2.67.0 GenomicRanges_1.59.0
#> [5] GenomeInfoDb_1.43.0 IRanges_2.41.0
#> [7] S4Vectors_0.45.0 BiocGenerics_0.53.0
#> [9] MatrixGenerics_1.19.0 matrixStats_1.4.1
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.9
#> [3] shape_1.4.6.1 MultiAssayExperiment_1.33.0
#> [5] magrittr_2.0.3 magick_2.8.5
#> [7] farver_2.1.2 MALDIquant_1.22.3
#> [9] rmarkdown_2.28 GlobalOptions_0.1.2
#> [11] zlibbioc_1.53.0 vctrs_0.6.5
#> [13] Cairo_1.6-2 htmltools_0.5.8.1
#> [15] S4Arrays_1.7.0 ComplexUpset_1.3.3
#> [17] SparseArray_1.7.0 mzID_1.45.0
#> [19] sass_0.4.9 bslib_0.8.0
#> [21] htmlwidgets_1.6.4 plyr_1.8.9
#> [23] impute_1.81.0 cachem_1.1.0
#> [25] igraph_2.1.1 lifecycle_1.0.4
#> [27] iterators_1.0.14 pkgconfig_2.0.3
#> [29] Matrix_1.7-1 R6_2.5.1
#> [31] fastmap_1.2.0 GenomeInfoDbData_1.2.13
#> [33] clue_0.3-65 digest_0.6.37
#> [35] pcaMethods_1.99.0 colorspace_2.1-1
#> [37] patchwork_1.3.0 crosstalk_1.2.1
#> [39] vegan_2.6-8 labeling_0.4.3
#> [41] fansi_1.0.6 httr_1.4.7
#> [43] abind_1.4-8 mgcv_1.9-1
#> [45] compiler_4.5.0 withr_3.0.2
#> [47] doParallel_1.0.17 BiocParallel_1.41.0
#> [49] UpSetR_1.4.0 highr_0.11
#> [51] MASS_7.3-61 DelayedArray_0.33.0
#> [53] rjson_0.2.23 permute_0.9-7
#> [55] mzR_2.41.0 tools_4.5.0
#> [57] PSMatch_1.11.0 glue_1.8.0
#> [59] nlme_3.1-166 QFeatures_1.17.0
#> [61] gridtext_0.1.5 grid_4.5.0
#> [63] cluster_2.1.6 reshape2_1.4.4
#> [65] generics_0.1.3 gtable_0.3.6
#> [67] preprocessCore_1.69.0 tidyr_1.3.1
#> [69] data.table_1.16.2 xml2_1.3.6
#> [71] utf8_1.2.4 XVector_0.47.0
#> [73] foreach_1.5.2 pillar_1.9.0
#> [75] stringr_1.5.1 limma_3.63.0
#> [77] splines_4.5.0 circlize_0.4.16
#> [79] dplyr_1.1.4 ggtext_0.1.2
#> [81] lattice_0.22-6 tidyselect_1.2.1
#> [83] ComplexHeatmap_2.23.0 knitr_1.48
#> [85] gridExtra_2.3 bookdown_0.41
#> [87] ProtGenerics_1.39.0 xfun_0.48
#> [89] statmod_1.5.0 MSnbase_2.33.0
#> [91] DT_0.33 stringi_1.8.4
#> [93] UCSC.utils_1.3.0 lazyeval_0.2.2
#> [95] yaml_2.3.10 NormalyzerDE_1.25.0
#> [97] evaluate_1.0.1 codetools_0.2-20
#> [99] MsCoreUtils_1.19.0 tibble_3.2.1
#> [101] BiocManager_1.30.25 cli_3.6.3
#> [103] affyio_1.77.0 munsell_0.5.1
#> [105] jquerylib_0.1.4 Rcpp_1.0.13
#> [107] png_0.1-8 XML_3.99-0.17
#> [109] parallel_4.5.0 ggplot2_3.5.1
#> [111] dendsort_0.3.4 AnnotationFilter_1.31.0
#> [113] scales_1.3.0 affy_1.85.0
#> [115] ncdf4_1.23 purrr_1.0.2
#> [117] crayon_1.5.3 POMA_1.17.0
#> [119] GetoptLong_1.0.5 rlang_1.1.4
#> [121] vsn_3.75.0