1 Load PRONE Package

# Load and attach PRONE
library(PRONE)

2 Load Data (TMT)

Here, we are directly working with the SummarizedExperiment data. For more information on how to create the SummarizedExperiment from a proteomics data set, please refer to the “Get Started” vignette.

The example TMT data set originates from (Biadglegne et al. 2022).

data("tuberculosis_TMT_se")
se <- tuberculosis_TMT_se

This SummarizedExperiment object already includes data of different normalization methods. Since this vignette should show you how to use the PRONE workflow for novel proteomics data, we will remove the normalized data and only keep the raw and log2 data that are available after loading the data accordingly.

se <- subset_SE_by_norm(se, ain = c("raw", "log2"))

3 Overview of the Data

To get an overview on the number of NAs, you can simply use the function get_NA_overview():

knitr::kable(get_NA_overview(se, ain = "log2"))

Total.Values	NA.Values	NA.Percentage
6020	1945	32.30897

To get an overview on the number of samples per sample group or batch, you can simply use the function plot_condition_overview() by specifying the column of the meta-data that should be used for coloring. By default (condition = NULL), the column specified in load_data() will be used.

plot_condition_overview(se, condition = NULL)
#> Condition of SummarizedExperiment used!

Figure 3.1: Overview barplot of the number of samples per condition.

plot_condition_overview(se, condition = "Pool")

Figure 3.2: Overview barplot of the number of samples per pool.

A general overview of the protein intensities across the different samples is provided by the function plot_heatmap(). The parameter “ain” specifies the data to plot, currently only “raw” and “log2” is available (names(assays(se)). Later if multiple normalization methods are executed, these will be saved as assays, and the normalized data can be visualized.

available_ains <- names(assays(se))

plot_heatmap(se, ain = "log2", color_by = c("Pool", "Group"), 
             label_by = NULL, only_refs = FALSE)
#> Label of SummarizedExperiment used!
#> $log2

Figure 3.3: Heatmap of the log2-protein intensities with columns and proteins being clustered with hclust.

Similarly, an upset plot can be generated to visualize the overlaps between sets defined by a specific column in the metadata. The sets are generated by using non-NA values.

plot_upset(se, color_by = NULL, label_by = NULL, mb.ratio = c(0.7,0.3), 
           only_refs = FALSE)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!

Figure 3.4: Upset plot of the non-NA protein intensities with sets defined by the Pool column.

If you are interested in the intensities of specific biomarkers, you can use the plot_markers_boxplots() function to compare the distribution of intensities per group. The plot can be generated per marker and facet by normalization method (facet_norm = TRUE) or by normalization method and facet by marker (facet_marker = TRUE).

p <- plot_markers_boxplots(se, 
                           markers = c("Q92954;J3KP74;E9PLR3", "Q9Y6Z7"), 
                           ain = "log2", 
                           id_column = "Protein.IDs", 
                           facet_norm = FALSE, 
                           facet_marker = TRUE)
#> Condition of SummarizedExperiment used!
#> No shaping done.
p[[1]] + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 90, vjust = 0.5))

Boxplot of the log2-protein intensities of the markers Q92954;J3KP74;E9PLR3, Q9Y6Z7, and Q68CQ4. If specific biomarkers are known to be upregulated in some conditions, you can use this plot to get an impression of the intensities before doing differential expression analysis.

Figure 3.5: Boxplot of the log2-protein intensities of the markers Q92954;J3KP74;E9PLR3, Q9Y6Z7, and Q68CQ4. If specific biomarkers are known to be upregulated in some conditions, you can use this plot to get an impression of the intensities before doing differential expression analysis.

4 Filter Proteins

4.1 Remove Proteins With Missing Values in ALL Samples

se <- filter_out_complete_NA_proteins(se)
#> 13 proteins were removed.

4.2 Remove Proteins With a Specific Value in a Specific Column

Typically proteins with “+” in the columns “Reverse”, “Only.identified.by.site”, and “Potential.contaminant” are removed in case of a MaxQuant proteinGroups.txt output file.

se <- filter_out_proteins_by_value(se, "Reverse", "+")
#> 17 proteins were removed.
se <- filter_out_proteins_by_value(se, "Only.identified.by.site", "+")
#> 1 proteins were removed.
#se <- filter_out_proteins_by_value(se, "Potential.contaminant", "+")

4.3 Remove Proteins by ID

If you don’t want to remove for instance all proteins with “Potential.contaminant == +”, you can also first get the protein ID with the specific value, check them in Uniprot, and then remove only some by using the function filter_out_proteins_by_ID().

pot_contaminants <- get_proteins_by_value(se, "Potential.contaminant", "+")
#> 24 proteins were identified.
se <- filter_out_proteins_by_ID(se, pot_contaminants)
#> 24 proteins were removed.

4.4 Explore Missing Value Pattern

Due to the high amount of missing values in MS-based proteomics data, it is important to explore the missing value pattern in the data. The function plot_NA_heatmap() provides a heatmap of the proteins with at least one missing value across all samples.

plot_NA_heatmap(se, color_by = NULL, label_by = NULL, cluster_samples = TRUE, 
                cluster_proteins = TRUE)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!

Figure 4.1: Heatmap of the missing value pattern. Proteins with at least one missing value are included in this heatmap.

Another way to explore the missing value pattern is to use the functions plot_NA_density() and plot_NA_frequency().

plot_NA_density(se)

Density plot of protein intensities of proteins with missing values and proteins without missing values. This can help to check whether the missing values are biased to lower intense proteins. In this case, the missing values are biased towards the lower intense proteins.

Figure 4.2: Density plot of protein intensities of proteins with missing values and proteins without missing values. This can help to check whether the missing values are biased to lower intense proteins. In this case, the missing values are biased towards the lower intense proteins.

plot_NA_frequency(se)

Figure 4.3: Protein identification overlap. This plot shows the number of identified proteins (y-axis) per number of samples (x-axis).

4.5 Filter Proteins By Applying a Missing Value Threshold

To reduce the amount of missing values, it is possible to filter proteins by applying a missing value threshold. The function filter_out_NA_proteins_by_threshold() removes proteins with more missing values than the specified threshold. The threshold is a value between 0 and 1, where 0.7, for instance, means that proteins with less than 70% of real values will be removed, i.e., proteins with more than 30% missing values will be removed.

se <- filter_out_NA_proteins_by_threshold(se, thr = 0.7) 
#> 99 proteins were removed.

plot_NA_heatmap(se)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!

Heatmap of the missing value pattern after filtering proteins with too many missing values (threshold was set to 70%). Proteins with at least one missing value are included in this heatmap.

Figure 4.4: Heatmap of the missing value pattern after filtering proteins with too many missing values (threshold was set to 70%). Proteins with at least one missing value are included in this heatmap.

5 Filter Samples

Following filtering proteins by different criteria, samples can be analyzed more in detail. PRONE provides some functions, such as plot_nr_prot_samples() and plot_tot_int_samples(), to get an overview of the number of proteins and the total intensity per sample, but also offers the automatic outlier detection method of POMA.

5.1 Quality Control

plot_nr_prot_samples(se, color_by = NULL, label_by = NULL)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!

Barplot on the number of non-zero proteins per samples colored by condition. This plot helps as quality control to check whether the number of proteins is similar across samples. In this case, only very few proteins have been detected for sample 1.HC_Pool1 compared to the other samples.

Figure 5.1: Barplot on the number of non-zero proteins per samples colored by condition. This plot helps as quality control to check whether the number of proteins is similar across samples. In this case, only very few proteins have been detected for sample 1.HC_Pool1 compared to the other samples.

plot_tot_int_samples(se, color_by = NULL, label_by = NULL)
#> Condition of SummarizedExperiment used!
#> Label of SummarizedExperiment used!

Barplot on the total protein intensity per samples colored by condition. This plot helps as quality control to check whether the total protein intensity is similar across samples. In this case, the total protein intensities of samples 1.HC_Pool1 and 1.HC_Pool2 are much lower compared to the other samples.

Figure 5.2: Barplot on the total protein intensity per samples colored by condition. This plot helps as quality control to check whether the total protein intensity is similar across samples. In this case, the total protein intensities of samples 1.HC_Pool1 and 1.HC_Pool2 are much lower compared to the other samples.

5.2 Remove Samples Manually

Based on these plots, samples “1.HC_Pool1” and 1_HC_Pool2 seem to be outliers. You can easily remove samples manually by using the remove_samples_manually() function.

se <- remove_samples_manually(se, "Label", c("1.HC_Pool1", "1.HC_Pool2"))
#> 2 samples removed.

5.3 Remove Reference Samples

And you can remove the reference samples directly using the function remove_reference_samples(). But attention: possibly you need them for normalization! That is exactly why we currently keep them!

se_no_refs <- remove_reference_samples(se)
#> 2 reference samples removed from the SummarizedExperiment object.

5.4 Outlier Detection via POMA R Package

The POMA R package provides a method to detect outliers in proteomics data. The function detect_outliers_POMA() detects outliers in the data based on the POMA algorithm. The method calculates the euclidean distances (or maximum or manhattan distances) between all sample pairs and then group centroids are computed either as the average distance of group members or the spatial median. The distances of the group members to the corresponding group centroid are then calculated to apply the classical univariate outlier detection formula Q3 + 1.5 * IQR (1.5 is adjustable) to detect group-dependent outliers.

The function returns a list with the following elements: polygon plot, distance boxplot, and the outliers. For further information on the POMA algorithm, please refer to the original publication (Castellano-Escuder et al. 2021):

poma_res <- detect_outliers_POMA(se, ain = "log2")
#> Condition of SummarizedExperiment used!
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.

poma_res$polygon_plot

Polygon plot of the principal coordinates calculated using the POMA algorithm. The original distances are reduced to principal coordinates (see POMA and the vegan::betadisper method).

Figure 5.3: Polygon plot of the principal coordinates calculated using the POMA algorithm. The original distances are reduced to principal coordinates (see POMA and the vegan::betadisper method).

poma_res$distance_boxplot

Figure 5.4: Boxplot of the distances of the samples to the group centroid calculated using POMA.

knitr::kable(poma_res$outliers, 
             caption = "Outliers detected by the POMA algorithm.", digits = 2)

Table 5.1: Outliers detected by the POMA algorithm.
sample	groups	distance_to_centroid	limit_distance
Reporter.intensity.corrected.10.Pat_tbc_pool_2	PTB	11.05	10.52

To remove the outliers detected via the POMA algorithm, just put the data.table of the detect_outliers_POMA() function into the remove_POMA_outliers() function.

se <- remove_POMA_outliers(se, poma_res$outliers)
#> 1 outlier samples removed.

6 Session Info

utils::sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] PRONE_1.1.0                 SummarizedExperiment_1.37.0
#>  [3] Biobase_2.67.0              GenomicRanges_1.59.0       
#>  [5] GenomeInfoDb_1.43.0         IRanges_2.41.0             
#>  [7] S4Vectors_0.45.0            BiocGenerics_0.53.0        
#>  [9] MatrixGenerics_1.19.0       matrixStats_1.4.1          
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3          jsonlite_1.8.9             
#>   [3] shape_1.4.6.1               MultiAssayExperiment_1.33.0
#>   [5] magrittr_2.0.3              magick_2.8.5               
#>   [7] farver_2.1.2                MALDIquant_1.22.3          
#>   [9] rmarkdown_2.28              GlobalOptions_0.1.2        
#>  [11] zlibbioc_1.53.0             vctrs_0.6.5                
#>  [13] Cairo_1.6-2                 htmltools_0.5.8.1          
#>  [15] S4Arrays_1.7.0              ComplexUpset_1.3.3         
#>  [17] SparseArray_1.7.0           mzID_1.45.0                
#>  [19] sass_0.4.9                  bslib_0.8.0                
#>  [21] htmlwidgets_1.6.4           plyr_1.8.9                 
#>  [23] impute_1.81.0               cachem_1.1.0               
#>  [25] igraph_2.1.1                lifecycle_1.0.4            
#>  [27] iterators_1.0.14            pkgconfig_2.0.3            
#>  [29] Matrix_1.7-1                R6_2.5.1                   
#>  [31] fastmap_1.2.0               GenomeInfoDbData_1.2.13    
#>  [33] clue_0.3-65                 digest_0.6.37              
#>  [35] pcaMethods_1.99.0           colorspace_2.1-1           
#>  [37] patchwork_1.3.0             crosstalk_1.2.1            
#>  [39] vegan_2.6-8                 labeling_0.4.3             
#>  [41] fansi_1.0.6                 httr_1.4.7                 
#>  [43] abind_1.4-8                 mgcv_1.9-1                 
#>  [45] compiler_4.5.0              withr_3.0.2                
#>  [47] doParallel_1.0.17           BiocParallel_1.41.0        
#>  [49] UpSetR_1.4.0                highr_0.11                 
#>  [51] MASS_7.3-61                 DelayedArray_0.33.0        
#>  [53] rjson_0.2.23                permute_0.9-7              
#>  [55] mzR_2.41.0                  tools_4.5.0                
#>  [57] PSMatch_1.11.0              glue_1.8.0                 
#>  [59] nlme_3.1-166                QFeatures_1.17.0           
#>  [61] gridtext_0.1.5              grid_4.5.0                 
#>  [63] cluster_2.1.6               reshape2_1.4.4             
#>  [65] generics_0.1.3              gtable_0.3.6               
#>  [67] preprocessCore_1.69.0       tidyr_1.3.1                
#>  [69] data.table_1.16.2           xml2_1.3.6                 
#>  [71] utf8_1.2.4                  XVector_0.47.0             
#>  [73] foreach_1.5.2               pillar_1.9.0               
#>  [75] stringr_1.5.1               limma_3.63.0               
#>  [77] splines_4.5.0               circlize_0.4.16            
#>  [79] dplyr_1.1.4                 ggtext_0.1.2               
#>  [81] lattice_0.22-6              tidyselect_1.2.1           
#>  [83] ComplexHeatmap_2.23.0       knitr_1.48                 
#>  [85] gridExtra_2.3               bookdown_0.41              
#>  [87] ProtGenerics_1.39.0         xfun_0.48                  
#>  [89] statmod_1.5.0               MSnbase_2.33.0             
#>  [91] DT_0.33                     stringi_1.8.4              
#>  [93] UCSC.utils_1.3.0            lazyeval_0.2.2             
#>  [95] yaml_2.3.10                 NormalyzerDE_1.25.0        
#>  [97] evaluate_1.0.1              codetools_0.2-20           
#>  [99] MsCoreUtils_1.19.0          tibble_3.2.1               
#> [101] BiocManager_1.30.25         cli_3.6.3                  
#> [103] affyio_1.77.0               munsell_0.5.1              
#> [105] jquerylib_0.1.4             Rcpp_1.0.13                
#> [107] png_0.1-8                   XML_3.99-0.17              
#> [109] parallel_4.5.0              ggplot2_3.5.1              
#> [111] dendsort_0.3.4              AnnotationFilter_1.31.0    
#> [113] scales_1.3.0                affy_1.85.0                
#> [115] ncdf4_1.23                  purrr_1.0.2                
#> [117] crayon_1.5.3                POMA_1.17.0                
#> [119] GetoptLong_1.0.5            rlang_1.1.4                
#> [121] vsn_3.75.0

References

Biadglegne, Fantahun, Johannes R. Schmidt, Kathrin M. Engel, Jörg Lehmann, Robert T. Lehmann, Anja Reinert, Brigitte König, Jürgen Schiller, Stefan Kalkhof, and Ulrich Sack. 2022. “Mycobacterium Tuberculosis Affects Protein and Lipid Content of Circulating Exosomes in Infected Patients Depending on Tuberculosis Disease State.” Biomedicines 10 (4): 783. https://doi.org/10.3390/biomedicines10040783.

Castellano-Escuder, Pol, Raúl González-Domínguez, Francesc Carmona-Pontaque, Cristina Andrés-Lacueva, and Alex Sánchez-Pla. 2021. “POMAShiny: A User-Friendly Web-Based Workflow for Metabolomics and Proteomics Data Analysis.” Edited by Manja Marz. PLOS Computational Biology 17 (7): e1009148. https://doi.org/10.1371/journal.pcbi.1009148.

Preprocessing

Lis Arend