1 mspms

1.1 Introduction

The goal of mspms is provide a R Package that can be used to easily and reproducibly analyze data resulting from the Multiplex Substrate Profiling by Mass Spectrometry (MSP-MS) method.

Additionally, we provide a graphical user interface powered by shiny apps that allows for a user to utilize the method without requiring any R coding knowledge.

Multiplex Substrate Profiling by Mass Spectrometry (MSP-MS) is a powerful method used to determine the substrate specificity of proteases. This method is of interest for groups interested in the study of proteases and their role as regulators of biological pathways, whether applied to the study of disease states, the development of diagnostic and prognostic tests, generation of tool compounds, or rational design of protease targeting therapeutics. Analysis of the MS based data produced by MSP-MS is a multi-step process involving detection and quantification of peptides, imputation, normalization, cleavage sequence identification, statistics, and data visualizations. This process can be challenging, especially for scientists with limited programming experience. To overcome these issues, we developed the mspms R package to facilitate the analysis of MSP-MS data utilizing good software/data analysis practices.

mspms differs from existing proteomics packages that are available in the Bioconductor project in that it is specifically designed to analyze MSP-MS data. This involves unique preprocessing steps and data visualizations tailored to this specific assay.

In order to do so, mspms uses several excellent packages from the Bioconductor project to provide a framework to easily and reproducibly analyze MSP-MS data. These include:

QFeatures: management and processing of peptide quantitative features and sample metadata.
MsCoreUtils: log2 transformation, imputation (QRILC), and normalization (center.median).

Some excellent non-Bioconductor packages are also utilized.

ggplot2: data visualizations
heatmaply: interactive heatmaps
rstatix: conduct statistics
ggseqlogo: iceLogo motif visualization.
downloadthis: gives users the ability to download data from standard .html report generated by mspms::generate_report().

Lastly, mspms takes advantage of iceLogos (Colaert, N. et al. Nature Methods 6, 786-787 (2009)) to viusalize overrepresented amino acid motifs relative to a background set by implementing components of the Java software in R.

1.2 Quickstart

To generate a general report using your own data, run the following code. It requires data that has been prepared for mspms data analysis by a converter function, and colData (sample metadata). For more information see subsequent sections.

mspms::generate_report(
  prepared_data = mspms::peaks_prepared_data,
  outdir = "../Desktop/mspms_report"
)

The above command will generate a .html file containing a generic mspms analysis.

There is much more that can be done using the mspms package- see the following sections for more information.

1.3 Overview

Functions in this package are logically divided into 4 broad types of functions.

Pre-processing data. These functions are focused on getting all of the data necessary for a mspms analysis together and in a consistent format.
Data processing. These functions allow the user to process the proteomics data (imputation, normalization, etc).
Statistics. These methods allow the user to perform basic statistics on the normalized/processed data.
Data visualization. These functions allow the user to visualize the data.

Note that only a subset of functions are exported to the user. Helper functions can be found in the preprocessing_helper_functions.R, processing_helper_functions.R,icelogo_helper_functions.R, statistics_helper_functions.R, and iceLogo_helper_functions.R files.

These functions were not intended to be directly used by the user and are not exported in an effort to make the package API cleaner and more intuitive.

1.4 Pre-processing data

Pre-processing data here refers to standardizing the data exported from different software solutions for analyzing mass spectrometry-based proteomics and adding useful information specific to the MSP-MS assay.

Specifically, the data from the file exported from the upstream software is parsed to contain the peptide (detected peptide sequence, cleavages are represented as “_”), library_id (name of the peptide in the peptide library each detected peptide maps to), and intensities for each sample.

Information about each peptide relative to the peptide it was derived from is then added. This includes the cleavage_seq (the motif the peptide a user specified n residues up and down from the cleavage event), cleavage_pos (position of the peptide library the cleavage product was cleaved at ), peptide_type (whether the detected peptide is a cleavage product of the peptide library, or a full length member of the peptide library).

All of the information is then loaded as a QFeatures object along with colData (metadata about each sample) where it is stored as a SummarizedExperiment object named “peptides”.

1.4.1 A note on colData

As briefly described above, the colData contains metadata describing the samples you are trying to analyze.

It is very important that this contains at least the following columns: quantCols”,“group”,“condition”,and “time”.

The quantCols field contains the name of each sample. These names must match whatever the samples are named in the files exported from the upstream proteomics software.

The group field contains text specifying what group each sample belongs to. All replicate samples should have the same group name

The condition field contains the experimental conditions that vary in your MSP-MS experiment (this depends on the experimental question, could be type of protease, the type of inhibitor, etc)

The time field contains the duration of incubation of your protease of interest with the peptide library for each sample.

An example of a valid colData file is shown below.

colData <- readr::read_csv(system.file("extdata/colData.csv",
  package = "mspms"
))
#> Rows: 12 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): quantCols, group, condition
#> dbl (1): time
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(colData)
#> # A tibble: 6 × 4
#>   quantCols   group     condition  time
#>   <chr>       <chr>     <chr>     <dbl>
#> 1 CatA_0000_1 CatA_0000 CatA          0
#> 2 CatA_0000_2 CatA_0000 CatA          0
#> 3 CatA_0000_3 CatA_0000 CatA          0
#> 4 CatA_0000_4 CatA_0000 CatA          0
#> 5 CatA_0030_1 CatA_0030 CatA         30
#> 6 CatA_0030_2 CatA_0030 CatA         30

1.4.2 PEAKS

Analysis using a .csv exported from PEAKS software is supported.

an exported .csv file can be generated from PEAKS as follows:

Create a New Project in PEAKS.
Select data to add to PEAKS project.
Add data, rename samples, set correct parameters. • Highlight all the added data.
• Select Create new sample for each file.
• Select appropriate instrument and fragmentation method.
• Set Enzyme to “None”.
• Rename samples according to enzymes, time points, and replicates.
• Can also ensure appropriate data is selected for these.
• Select “Data Refinement” to continue.
Data Refinement.
• On the top right, select MSP-MS as the predefined parameters.
• Parameters shown are what should be used for MSP-MS data analysis.
• Select “Identification” to continue.
Peptide Identification.
• On the top right, select MSP-MS as the predefined parameters.
• Parameters shown are what should be used for MSP-MS data analysis.
• Identification -> Select Database -> TPD_237.
• Identification -> no PTMs.
• Can remove the PTMs by highlighting them and selecting Remove
Label Free Quantification.
• Make sure Label Free is selected.
• Group samples -> add quadruplicates of samples to new group.


library(dplyr)
library(mspms)

# File path of peaks lfq file
lfq_filepath <- system.file("extdata/peaks_protein-peptides-lfq.csv",
  package = "mspms"
)

colData_filepath <- system.file("extdata/colData.csv", package = "mspms")

# Prepare the data
peaks_prepared_data <- mspms::prepare_peaks(lfq_filepath,
  colData_filepath,
  quality_threshold = 0.3,
  n_residues = 4
)
#> Rows: 2444 Columns: 32
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (6): Protein Accession, Peptide, Used, Candidate, Sample Profile (Ratio...
#> dbl (25): Protein Group, Protein ID, Quality, Significance, Avg. ppm, Avg. A...
#> lgl  (1): PTM
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> 841 peptides were removed because they had a quality score < 0.3 (34%)
#> 
#> Rows: 12 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): quantCols, group, condition
#> dbl (1): time
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Checking arguments.
#> 
#> Loading data as a 'SummarizedExperiment' object.
#> 
#> Formatting sample annotations (colData).
#> 
#> Formatting data as a 'QFeatures' object.

peaks_prepared_data
#> An instance of class QFeatures containing 1 assays:
#>  [1] peptides: SummarizedExperiment with 2071 rows and 12 columns

1.4.3 Fragpipe

Data exported from Fragpipe as a .tsv file is supported.

To generate this file, run a LFQ Fragpipe workflow using your desired parameters (we recommend using match between runs).

You will find a file named “combined_peptide.tsv” in the output folder from Fragpipe.

This can be loaded into mspms as follows:

combined_peptide_filepath <- system.file("extdata/fragpipe_combined_peptide.tsv",
  package = "mspms"
)

colData_filepath <- system.file("extdata/colData.csv", package = "mspms")


fragpipe_prepared_data <- prepare_fragpipe(
  combined_peptide_filepath,
  colData_filepath
)
#> Rows: 1847 Columns: 63
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (19): Peptide Sequence, Prev AA, Next AA, Protein, Protein ID, Entry Nam...
#> dbl (40): Start, End, Peptide Length, CatA_0000_1 Spectral Count, CatA_0000_...
#> num  (1): Charges
#> lgl  (3): Gene, Mapped Genes, Mapped Proteins
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 12 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): quantCols, group, condition
#> dbl (1): time
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Checking arguments.
#> 
#> Loading data as a 'SummarizedExperiment' object.
#> 
#> Formatting sample annotations (colData).
#> 
#> Formatting data as a 'QFeatures' object.

fragpipe_prepared_data
#> An instance of class QFeatures containing 1 assays:
#>  [1] peptides: SummarizedExperiment with 1694 rows and 12 columns

1.4.4 Proteome Discoverer

Analysis of a PeptideGroups.txt file exported from proteome discoverer is supported.

To generate this .txt file: 1. Ensure that non-normalized/ non-scaled abundances are included in your PeptideGroup table.
2. File > export > To Text (tab delimited) > Items to be exported - check the Peptide Groups box > Export.

Note that the fileIDs of each sample in proteome discoverer (By default F1 - Fn) must match the quantCols in colData.

peptide_groups_filepath <- system.file(
  "extdata/proteome_discoverer_PeptideGroups.txt",
  package = "mspms"
)

colData_filepath <- system.file("extdata/proteome_discover_colData.csv",
  package = "mspms"
)

prepared_pd_data <- prepare_pd(peptide_groups_filepath, colData_filepath)
#> Rows: 1207 Columns: 68
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (17): Confidence, Annotated Sequence, Master Protein Accessions, Positio...
#> dbl (48): Peptide Groups Peptide Group ID, Qvality PEP, Qvality q-value, # P...
#> lgl  (3): Checked, Modifications, Modifications in Master Proteins
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Rows: 12 Columns: 4
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (3): quantCols, group, condition
#> dbl (1): time
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> Checking arguments.
#> 
#> Loading data as a 'SummarizedExperiment' object.
#> 
#> Formatting sample annotations (colData).
#> 
#> Formatting data as a 'QFeatures' object.

prepared_pd_data
#> An instance of class QFeatures containing 1 assays:
#>  [1] peptides: SummarizedExperiment with 1161 rows and 12 columns

1.5 Data Processing

Data processing includes imputation and normalization of the proteomics data. This is all done in a QFeatures object using MSCoreUtils methods. Data from each step is stored as a SummarizedExperiment within the QFeatures object with the indicated name.

Data is first log2 transformed (“peptides_log”).
Data is normalized using the center.median method (“peptides_log_norm”).
Missing values are then imputed using the QRILC method (“peptides_log_impute_norm”).
Data is then reverse log2 transformed (“peptides_norm”).

set.seed(2)
processed_qf <- process_qf(peaks_prepared_data)
#> Loading required namespace: imputeLCMD
#> Imputing along margin 2 (samples/columns).
#> Your row data contain missing values. Please read the relevant
#> section(s) in the aggregateFeatures manual page regarding the effects
#> of missing values on data aggregation.
processed_qf
#> An instance of class QFeatures containing 5 assays:
#>  [1] peptides: SummarizedExperiment with 2071 rows and 12 columns 
#>  [2] peptides_log: SummarizedExperiment with 2071 rows and 12 columns 
#>  [3] peptides_log_norm: SummarizedExperiment with 2071 rows and 12 columns 
#>  [4] peptides_log_impute_norm: SummarizedExperiment with 2071 rows and 12 columns 
#>  [5] peptides_norm: SummarizedExperiment with 2071 rows and 12 columns

1.6 Statistics

MSP-MS analysis has traditionally used T-tests to what peptides are significantly different relative to T0. The authors of the assay suggest using a T-test derived FDR adjusted p value threshold <= 0.05 and log2 fold change threshold >= 3.

We provide a user-friendly implementation of this statistical test as follows:

log2fc_t_test_data <- mspms::log2fc_t_test(processed_qf = processed_qf)
#> Joining with `by = join_by(quantCols)`
#> Joining with `by = join_by(quantCols)`

Please note that statistical analyses are one of the strengths of the Bioconductor project and several packages provide alternative statistical approaches that are more flexible than t-tests and could be used with the QFeatures objects produced by mspms.

This includes packages using empirical Bayesian approaches (see limma) and mixed modelling (see msqrob2), that are suited for proteomics data.

Users are encouraged to use these excellent packages.

1.7 Data Visualizations

We provide a number of data visualizations as part of the mspms package.

1.7.1 QC Checks

A critical initial step in MSP-MS data analysis is a rigorous quality control assessment. Given the reliance on a specific peptide library, calculating the percentage of library peptides detected in each sample provides a straightforward yet powerful indicator of data integrity. Anomalously low detection rates signal potential instrument or experimental problems.

We can see the percentage of the peptide library that were undetected among each group of replicates as follows (for full length peptides a threshold of ~< 10 is considered good while for cleavage products a threshold ~< 5 is good):

plot_qc_check(processed_qf,
  full_length_threshold = 10,
  cleavage_product_threshold = 5
)
#> Joining with `by = join_by(quantCols)`

We can also see in what percentage of samples each peptide in the peptide library were undetected in.

plot_nd_peptides(processed_qf)
#> Joining with `by = join_by(quantCols)`

1.7.2 Tidying our data

We can then convert our QFeatures object into a “long” tibble using the mspms_tidy function.

This makes the data easy to manipulate prior to ggplot2 or plotly based plotting functions.

mspms_tidy_data <- mspms_tidy(processed_qf)
#> Joining with `by = join_by(quantCols)`

1.7.3 Heatmap

We can inspect our data using an interactive heatmap as follows:

plot_heatmap(mspms_tidy_data)

note that the interactive plot is not returned in this vignette due to size constraints.

1.7.4 PCA

We can also inspect our data using a PCA as follows.

plot_pca(mspms_tidy_data, value_colname = "peptides_norm")

1.7.5 Volcano plots

We can generate volcano plots showing the significant peptides from each condition as follows:

plot_volcano(log2fc_t_test_data)

1.7.6 Cleavage Position Plots

We can also create plots showing the number of times we observe a cleavage at each position of the peptide_library among a set of peptides of interest.

We can do this as follows:

sig_cleavage_data <- log2fc_t_test_data %>%
  dplyr::filter(p.adj <= 0.05, log2fc > 3)

p1 <- mspms::plot_cleavages_per_pos(sig_cleavage_data)

p1

1.7.7 iceLogo

Icelogos are a method for visualizing conserved patterns in protein sequences through probability theory.

As part of the mspms R package, we have implemented the logic to generate iceLogos in R.

To use this, we first have to define a vector of cleavage sequences that we are interested in. We will do this for the significant CatA cleavages below.

catA_sig_cleavages <- log2fc_t_test_data %>%
  dplyr::filter(p.adj <= 0.05, log2fc > 3) %>%
  dplyr::filter(condition == "CatA") %>%
  dplyr::pull(cleavage_seq) %>%
  unique()

We also need to know all of the possible cleavages we could see among our peptide library. We can generate that as below:

all_possible_8mers_from_228_library <- calculate_all_cleavages(
  mspms::peptide_library$library_real_sequence,
  n_AA_after_cleavage = 4
)

We can then generate an iceLogo relative to the background of all possible cleavages in our peptide library like so:

plot_icelogo(catA_sig_cleavages,
  background_universe = all_possible_8mers_from_228_library
)
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.

We also provide a function that conveniently plots all icelogos from all conditions from the experiment all together.

sig_cleavage_data <- log2fc_t_test_data %>%
  dplyr::filter(p.adj <= 0.05, log2fc > 3)

plot_all_icelogos(sig_cleavage_data)
#> Scale for x is already present.
#> Adding another scale for x, which will replace the existing scale.

1.7.8 Plotting a Time Course

We can also plot the intensities of peptides of interest over time.

First we need to define which peptides we are interested in. Here we will use the t-test results, and look at the peptide with the max log2fc.

max_log2fc_pep <- log2fc_t_test_data %>%
  dplyr::filter(p.adj <= 0.05, log2fc > 3) %>%
  dplyr::filter(log2fc == max(log2fc)) %>%
  pull(peptide)

We can then filter our tidy mspms data to only include this peptide and plot

p1 <- mspms_tidy_data %>%
  dplyr::filter(peptide == max_log2fc_pep) %>%
  plot_time_course()

p1

sessionInfo()
#> R Under development (unstable) (2024-10-21 r87258)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_GB              LC_COLLATE=C              
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.4      mspms_0.99.7     BiocStyle_2.35.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] gridExtra_2.3               sandwich_3.1-1             
#>   [3] rlang_1.1.4                 magrittr_2.0.3             
#>   [5] clue_0.3-66                 matrixStats_1.4.1          
#>   [7] compiler_4.5.0              vctrs_0.6.5                
#>   [9] reshape2_1.4.4              stringr_1.5.1              
#>  [11] ProtGenerics_1.39.0         pkgconfig_2.0.3            
#>  [13] crayon_1.5.3                fastmap_1.2.0              
#>  [15] magick_2.8.5                backports_1.5.0            
#>  [17] XVector_0.47.0              labeling_0.4.3             
#>  [19] utf8_1.2.4                  rmarkdown_2.29             
#>  [21] tzdb_0.4.0                  UCSC.utils_1.3.0           
#>  [23] tinytex_0.54                purrr_1.0.2                
#>  [25] bit_4.5.0.1                 xfun_0.49                  
#>  [27] MultiAssayExperiment_1.33.1 ggseqlogo_0.2              
#>  [29] zlibbioc_1.53.0             cachem_1.1.0               
#>  [31] GenomeInfoDb_1.43.2         jsonlite_1.8.9             
#>  [33] gmm_1.8                     DelayedArray_0.33.3        
#>  [35] broom_1.0.7                 parallel_4.5.0             
#>  [37] cluster_2.1.6               R6_2.5.1                   
#>  [39] bslib_0.8.0                 stringi_1.8.4              
#>  [41] car_3.1-3                   GenomicRanges_1.59.1       
#>  [43] jquerylib_0.1.4             Rcpp_1.0.13-1              
#>  [45] bookdown_0.41               SummarizedExperiment_1.37.0
#>  [47] knitr_1.49                  zoo_1.8-12                 
#>  [49] readr_2.1.5                 IRanges_2.41.2             
#>  [51] Matrix_1.7-1                igraph_2.1.1               
#>  [53] tidyselect_1.2.1            abind_1.4-8                
#>  [55] yaml_2.3.10                 lattice_0.22-6             
#>  [57] tibble_3.2.1                plyr_1.8.9                 
#>  [59] Biobase_2.67.0              withr_3.0.2                
#>  [61] evaluate_1.0.1              tmvtnorm_1.6               
#>  [63] archive_1.1.10              norm_1.0-11.1              
#>  [65] ggpubr_0.6.0                pillar_1.9.0               
#>  [67] BiocManager_1.30.25         MatrixGenerics_1.19.0      
#>  [69] carData_3.0-5               stats4_4.5.0               
#>  [71] generics_0.1.3              vroom_1.6.5                
#>  [73] S4Vectors_0.45.2            hms_1.1.3                  
#>  [75] ggplot2_3.5.1               munsell_0.5.1              
#>  [77] scales_1.3.0                glue_1.8.0                 
#>  [79] lazyeval_0.2.2              tools_4.5.0                
#>  [81] QFeatures_1.17.0            ggsignif_0.6.4             
#>  [83] imputeLCMD_2.1              mvtnorm_1.3-2              
#>  [85] cowplot_1.1.3               grid_4.5.0                 
#>  [87] impute_1.81.0               tidyr_1.3.1                
#>  [89] MsCoreUtils_1.19.0          colorspace_2.1-1           
#>  [91] GenomeInfoDbData_1.2.13     Formula_1.2-5              
#>  [93] cli_3.6.3                   fansi_1.0.6                
#>  [95] S4Arrays_1.7.1              AnnotationFilter_1.31.0    
#>  [97] pcaMethods_1.99.0           gtable_0.3.6               
#>  [99] rstatix_0.7.2               sass_0.4.9                 
#> [101] digest_0.6.37               BiocGenerics_0.53.3        
#> [103] SparseArray_1.7.2           farver_2.1.2               
#> [105] htmltools_0.5.8.1           lifecycle_1.0.4            
#> [107] httr_1.4.7                  bit64_4.5.2                
#> [109] MASS_7.3-61

mspms_vignette

3 December 2024

Contents