Foreword

MSnbase is free and open-source software. If you use it, please support the project by citing it in publications:

Gatto L, Lilley KS. MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry data visualization, processing and quantitation. Bioinformatics. 2012 Jan 15;28(2):288-9. doi: 10.1093/bioinformatics/btr645. PMID: 22113085.

MSnbase, efficient and elegant R-based processing and visualisation of raw mass spectrometry data. Laurent Gatto, Sebastian Gibb, Johannes Rainer. bioRxiv 2020.04.29.067868; doi: https://doi.org/10.1101/2020.04.29.067868

Questions and bugs

For bugs, typos, suggestions or other questions, please file an issue in our tracking system (https://github.com/lgatto/MSnbase/issues) providing as much information as possible, a reproducible example and the output of sessionInfo().

If you don’t have a GitHub account or wish to reach a broader audience for general questions about proteomics analysis using R, you may want to use the Bioconductor support site: https://support.bioconductor.org/.

1 Overview

MSnbase aims to facilitate the reproducible analysis of mass spectrometry data within the R environment, from raw data import and processing to feature quantification and statistical analysis of the results (Gatto and Lilley 2012). Data import functions for several formats are provided and intermediate or final results can also be saved or exported. These capabilities are presented below.

2 Data input

Raw data

Data stored in one of the published XML-based formats, i.e. mzXML (Pedrioli et al. 2004), mzData (Orchard et al. 2007) or mzML (Martens et al. 2010), can be imported with the readMSData method, which makes use of the mzR package to create MSnExp objects. The files can be in profile or centroided mode. See ?readMSData for details.
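As a minimal sketch of raw data import, the snippet below reads the small mzXML example file distributed with MSnbase and inspects the resulting object. The choice of file (dummyiTRAQ.mzXML) and of the on-disk mode are assumptions made for illustration; any mzXML, mzData or mzML file could be used instead.

```r
library("MSnbase")

## Path to a small mzXML file distributed with MSnbase
## (assumption: used here only as a convenient example file)
rawfile <- system.file("extdata", "dummyiTRAQ.mzXML", package = "MSnbase")

## mode = "onDisk" keeps the peak data on disk and loads it on demand;
## mode = "inMemory" (the default) loads the spectra at once
raw <- readMSData(rawfile, mode = "onDisk", verbose = FALSE)

class(raw)           ## class of the returned experiment object
length(raw)          ## number of spectra
table(msLevel(raw))  ## number of spectra per MS level
```

The on-disk mode is generally the more economical choice for large files, at the cost of re-reading peak data when it is needed.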

Data from mzML files containing chromatographic data (e.g. generated in SRM/MRM experiments) can be imported with the readSRMData function, which returns the chromatographic data as an MChromatograms object. See ?readSRMData for more details.

Peak lists

Peak lists in the mgf format (http://www.matrixscience.com/help/data_file_help.html) can be imported using the readMgfData function. In this case, the peak data has generally been pre-processed by other software. See ?readMgfData for details.

Quantitation data

Quantitative data generated by third-party software can be exported as a spreadsheet (generally in comma- or tab-separated format). This data, as well as any additional meta-data, can be imported with the readMSnSet function. See ?readMSnSet for details.

MSnbase also supports the mzTab format (https://github.com/HUPO-PSI/mzTab), a light-weight, tab-delimited file format for proteomics data developed within the Proteomics Standards Initiative (PSI). mzTab files can be read into R with readMzTabData to create an MSnSet instance.

MSnbase input capabilities. The white and red boxes represent R functions/methods and objects respectively. The blue boxes represent different disk storage formats.

3 Data output

RData files

R objects can most easily be stored on disk with the save function. It creates compressed binary images of the data representation that can later be read back from the file with the load function.
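The round trip can be sketched with base R alone. The small matrix below is a made-up stand-in for any R object (such as an MSnSet), and the temporary file name is chosen only for illustration.

```r
## A toy object standing in for any R object (hypothetical values)
quant <- matrix(c(0.379, 0.420, 0.281, 0.210), nrow = 2,
                dimnames = list(c("FBgn0001104", "FBgn0000044"),
                                c("X114", "X115")))

rdafile <- tempfile(fileext = ".rda")
save(quant, file = rdafile)   ## write a compressed binary image

original <- quant
rm(quant)                     ## remove the object from the session
loaded <- load(rdafile)       ## restores 'quant' under its original name
loaded                        ## names of the restored objects
identical(quant, original)    ## compare with the copy kept in memory
```

Note that load restores objects under the names they had when saved, which can silently overwrite objects of the same name in the current session.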

mzML/mzXML files

MSnExp and OnDiskMSnExp objects can be written to MS data files in mzML or mzXML format with the writeMSData method. See ?writeMSData for details.

Peak lists

MSnExp instances as well as individual spectra can be written as mgf files with the writeMgfData method. Note that the meta-data in the original R object cannot be included in the file. See ?writeMgfData for details.
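A possible export and re-import round trip is sketched below. It assumes the itraqdata example data set distributed with MSnbase and writes to a temporary file; both choices are ours and serve only as an illustration.

```r
library("MSnbase")

## Example MS2 data set distributed with MSnbase
data(itraqdata)

mgffile <- tempfile(fileext = ".mgf")
writeMgfData(itraqdata, con = mgffile)          ## export spectra as a peak list

peaks <- readMgfData(mgffile, verbose = FALSE)  ## re-import the peak list
length(itraqdata)  ## number of spectra exported
length(peaks)      ## number of spectra read back
```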

Quantitation data

Quantitation data can be exported to spreadsheet files with the write.exprs method. Feature meta-data can be appended to the feature intensity values. See ?write.exprs for details.

MSnSet instances can also be exported to mzTab files using the writeMzTabData function; note that this function is deprecated.

MSnbase output capabilities. The white and red boxes represent R functions/methods and objects respectively. The blue boxes represent different disk storage formats.

4 Creating MSnSet from text spreadsheets

This section describes the generation of MSnSet objects using data available in a text-based spreadsheet. This entry point into R and MSnbase makes it possible to import data processed by any of the available third-party mass spectrometry processing software and to proceed with data exploration, normalisation and statistical analysis using functions available in MSnbase and the numerous Bioconductor packages.

4.1 A complete work flow

The following section describes a work flow that uses three input files to create the MSnSet. These files respectively describe the quantitative expression data, the sample meta-data and the feature meta-data. It is taken from the pRoloc tutorial and uses example files from the pRolocdata package.

We start by reading the original csv file with the read.csv function to describe the data used as input.

## The original data for replicate 1, available
## from the pRolocdata package
f0 <- dir(system.file("extdata", package = "pRolocdata"),
          full.names = TRUE,
          pattern = "pr800866n_si_004-rep1.csv")
csv <- read.csv(f0)

The first three lines of the original spreadsheet, containing the data for replicate one, are shown below (using the function head). It contains 888 rows (proteins) and 16 columns, including protein identifiers, database accession numbers, gene symbols, reporter ion quantitation values, information related to protein identification, …

head(csv, n=3)
##   Protein.ID        FBgn Flybase.Symbol No..peptide.IDs Mascot.score
## 1    CG10060 FBgn0001104    G-ialpha65A               3       179.86
## 2    CG10067 FBgn0000044         Act57B               5       222.40
## 3    CG10077 FBgn0035720        CG10077               5       219.65
##   No..peptides.quantified area.114 area.115 area.116 area.117
## 1                       1 0.379000 0.281000 0.225000 0.114000
## 2                       9 0.420000 0.209667 0.206111 0.163889
## 3                       3 0.187333 0.167333 0.169667 0.476000
##   PLS.DA.classification Peptide.sequence Precursor.ion.mass
## 1                    PM                                    
## 2                    PM                                    
## 3                                                          
##   Precursor.ion.charge pd.2013 pd.markers
## 1                           PM    unknown
## 2                           PM    unknown
## 3                      unknown    unknown

Below, we read in turn the spreadsheets that contain the quantitation data (exprsFile.csv), feature meta-data (fdataFile.csv) and sample meta-data (pdataFile.csv).

## The quantitation data, from the original data
f1 <- dir(system.file("extdata", package = "pRolocdata"),
          full.names = TRUE, pattern = "exprsFile.csv")
exprsCsv <- read.csv(f1)
## Feature meta-data, from the original data
f2 <- dir(system.file("extdata", package = "pRolocdata"),
          full.names = TRUE, pattern = "fdataFile.csv")
fdataCsv <- read.csv(f2)
## Sample meta-data, a new file
f3 <- dir(system.file("extdata", package = "pRolocdata"),
          full.names = TRUE, pattern = "pdataFile.csv")
pdataCsv <- read.csv(f3)

exprsFile.csv contains the quantitation (expression) data for the 888 proteins and 4 reporter tags.

head(exprsCsv, n = 3)
##          FBgn     X114     X115     X116     X117
## 1 FBgn0001104 0.379000 0.281000 0.225000 0.114000
## 2 FBgn0000044 0.420000 0.209667 0.206111 0.163889
## 3 FBgn0035720 0.187333 0.167333 0.169667 0.476000

fdataFile.csv contains meta-data for the 888 features (here proteins).

head(fdataCsv, n = 3)
##          FBgn ProteinID FlybaseSymbol NoPeptideIDs MascotScore
## 1 FBgn0001104   CG10060   G-ialpha65A            3      179.86
## 2 FBgn0000044   CG10067        Act57B            5      222.40
## 3 FBgn0035720   CG10077       CG10077            5      219.65
##   NoPeptidesQuantified PLSDA
## 1                    1    PM
## 2                    9    PM
## 3                    3

pdataFile.csv contains the sample (here fraction) meta-data. This simple file has been created manually.

pdataCsv
##   sampleNames Fractions
## 1        X114       4/5
## 2        X115     12/13
## 3        X116        19
## 4        X117        21

The self-contained MSnSet can now easily be generated using the readMSnSet constructor, providing the respective csv file names shown above and specifying that the data is comma-separated (with sep = ","). Below, we call that object res and display its content.

library("MSnbase")
res <- readMSnSet(exprsFile = f1,
                  featureDataFile = f2,
                  phenoDataFile = f3,
                  sep = ",")
res
## MSnSet (storageMode: lockedEnvironment)
## assayData: 888 features, 4 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: X114 X115 X116 X117
##   varLabels: Fractions
##   varMetadata: labelDescription
## featureData
##   featureNames: FBgn0001104 FBgn0000044 ... FBgn0001215 (888 total)
##   fvarLabels: ProteinID FlybaseSymbol ... PLSDA (6 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
## Quantitation data loaded: Tue Oct 29 18:30:42 2024  using readMSnSet. 
##  MSnbase version: 2.33.0

4.1.1 The MSnSet class

Although there are specific sub-containers for additional meta-data (for instance to make the object MIAPE compliant), the feature meta-data (the featureData sub-container, or slot) and the sample meta-data (the phenoData slot) are the most important ones. They need to meet the following validity requirements (see figure below):

  • the number of rows in the expression/quantitation data and feature data must be equal and the row names must match exactly, and

  • the number of columns in the expression/quantitation data and number of rows in the sample meta-data must be equal and the column/row names must match exactly.

A detailed description of the MSnSet class is available by typing ?MSnSet in the R console.
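The two requirements can be checked by hand before building an object. The sketch below uses base R only; the toy values and the helper name dims_match are hypothetical and are not part of MSnbase.

```r
## Toy inputs (hypothetical values, same layout as the csv files above)
e <- matrix(c(0.379, 0.420, 0.281, 0.210), nrow = 2,
            dimnames = list(c("FBgn0001104", "FBgn0000044"),
                            c("X114", "X115")))
fd <- data.frame(ProteinID = c("CG10060", "CG10067"),
                 row.names = c("FBgn0001104", "FBgn0000044"))
pd <- data.frame(Fractions = c("4/5", "12/13"),
                 row.names = c("X114", "X115"))

## Hypothetical helper: TRUE only if both requirements are met
dims_match <- function(e, fd, pd) {
    identical(rownames(e), rownames(fd)) &&   ## features: rows vs rows
        identical(colnames(e), rownames(pd))  ## samples: columns vs rows
}

dims_match(e, fd, pd)                       ## names agree
dims_match(e, fd[2:1, , drop = FALSE], pd)  ## feature rows in a different order
```

Comparing the names with identical covers both the dimension and the naming requirement, since vectors of different lengths or in a different order are never identical.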

Dimension requirements for the respective expression, feature and sample meta-data slots.

The individual parts of this data object can be accessed with their respective accessor methods:

  • the quantitation data can be retrieved with exprs(res),
  • the feature meta-data with fData(res) and
  • the sample meta-data with pData(res).
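The accessors can be tried on a small object without any input file. The sketch below builds a toy MSnSet with the MSnSet constructor from made-up values; the object name toy and its content are assumptions used only for illustration.

```r
library("MSnbase")

## Toy inputs with matching row and column names (hypothetical values)
e <- matrix(c(0.379, 0.420, 0.281, 0.210), nrow = 2,
            dimnames = list(c("FBgn0001104", "FBgn0000044"),
                            c("X114", "X115")))
fd <- data.frame(ProteinID = c("CG10060", "CG10067"),
                 row.names = rownames(e))
pd <- data.frame(Fractions = c("4/5", "12/13"),
                 row.names = colnames(e))

toy <- MSnSet(exprs = e, fData = fd, pData = pd)

exprs(toy)  ## quantitation data, a numeric matrix
fData(toy)  ## feature meta-data, a data.frame
pData(toy)  ## sample meta-data, a data.frame
```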

4.2 A shorter work flow

The readMSnSet2 function provides a simplified import workflow. It takes a single spreadsheet as input (default is csv) and extracts the columns identified by ecol to create the expression data, while the others are used as feature meta-data. ecol can be a character with the respective column labels or a numeric with their indices. In the former case, it is important to make sure that the names match exactly. Special characters like '-' or '(' will be transformed by R into '.' when the csv file is read in. Optionally, one can also specify a column to be used as feature names. Note that these must be unique to guarantee the final object validity.

ecol <- paste("area", 114:117, sep = ".")
fname <- "Protein.ID"
eset <- readMSnSet2(f0, ecol, fname)
eset
## MSnSet (storageMode: lockedEnvironment)
## assayData: 888 features, 4 samples 
##   element names: exprs 
## protocolData: none
## phenoData: none
## featureData
##   featureNames: CG10060 CG10067 ... CG9983 (888 total)
##   fvarLabels: Protein.ID FBgn ... pd.markers (12 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
##  MSnbase version: 2.33.0
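The transformation of special characters in column labels can be verified with base R before choosing ecol. The sketch below writes a tiny, made-up csv file with headers similar to those of the original spreadsheet and shows the names produced by read.csv.

```r
## A tiny csv file with headers like those of the original spreadsheet
## (hypothetical content, written to a temporary file)
csvfile <- tempfile(fileext = ".csv")
writeLines(c('"Protein ID","area 114","area 115","PLS-DA classification"',
             '"CG10060",0.379,0.281,"PM"'),
           csvfile)

## Column names after import: spaces and '-' are replaced by '.'
names(read.csv(csvfile))

## make.names() applies the same transformation to arbitrary labels
make.names(c("Protein ID", "area 114", "PLS-DA classification"))

## Character ecol values must match the transformed names exactly
ecol <- paste("area", 114:115, sep = ".")
all(ecol %in% names(read.csv(csvfile)))
```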

The ecol columns can also be queried interactively from R using the getEcols and grepEcols functions. The former returns a character with all column names, given a splitting character, i.e. the separation value of the spreadsheet (typically "," for csv, "\t" for tsv, …). The latter can be used to grep a pattern of interest to obtain the relevant column indices.

getEcols(f0, ",")
##  [1] "\"Protein ID\""              "\"FBgn\""                   
##  [3] "\"Flybase Symbol\""          "\"No. peptide IDs\""        
##  [5] "\"Mascot score\""            "\"No. peptides quantified\""
##  [7] "\"area 114\""                "\"area 115\""               
##  [9] "\"area 116\""                "\"area 117\""               
## [11] "\"PLS-DA classification\""   "\"Peptide sequence\""       
## [13] "\"Precursor ion mass\""      "\"Precursor ion charge\""   
## [15] "\"pd.2013\""                 "\"pd.markers\""
grepEcols(f0, "area", ",")
## [1]  7  8  9 10
e <- grepEcols(f0, "area", ",")
readMSnSet2(f0, e)
## MSnSet (storageMode: lockedEnvironment)
## assayData: 888 features, 4 samples 
##   element names: exprs 
## protocolData: none
## phenoData: none
## featureData
##   featureNames: 1 2 ... 888 (888 total)
##   fvarLabels: Protein.ID FBgn ... pd.markers (12 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:  
## - - - Processing information - - -
##  MSnbase version: 2.33.0
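Conceptually, this kind of column discovery amounts to splitting the header line on the separator and matching a pattern against the resulting labels. The base R sketch below illustrates the idea on a made-up file; it is not the MSnbase implementation of getEcols or grepEcols.

```r
## A tiny csv file (hypothetical content, written to a temporary file)
hdrfile <- tempfile(fileext = ".csv")
writeLines(c('"Protein ID","FBgn","area 114","area 115"',
             '"CG10060","FBgn0001104",0.379,0.281'),
           hdrfile)

## Read only the header line and split it on the separator
hdr <- strsplit(readLines(hdrfile, n = 1), ",", fixed = TRUE)[[1]]
hdr                ## raw labels, quotes included
grep("area", hdr)  ## indices of the columns matching the pattern
```

Working on the raw header line explains why the labels keep their quotes and spaces, unlike the names returned by read.csv.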

The phenoData slot can now be updated accordingly using the replacement functions phenoData<- or pData<- (see ?MSnSet for details).
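As a minimal sketch of such an update, the code below adds a Fractions column to the sample meta-data of a toy MSnSet through pData<-. The toy object, its values and the initial Tag column are assumptions made for illustration.

```r
library("MSnbase")

## Toy MSnSet with minimal sample meta-data (hypothetical values)
e <- matrix(c(0.379, 0.420, 0.281, 0.210), nrow = 2,
            dimnames = list(c("CG10060", "CG10067"),
                            c("area.114", "area.115")))
fd <- data.frame(FBgn = c("FBgn0001104", "FBgn0000044"),
                 row.names = rownames(e))
pd <- data.frame(Tag = c("114", "115"),
                 row.names = colnames(e))
toy <- MSnSet(exprs = e, fData = fd, pData = pd)

## Add a sample annotation; this calls the pData<- replacement function
pData(toy)$Fractions <- c("4/5", "12/13")

pData(toy)
validObject(toy)  ## the updated object still meets the validity requirements
```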

5 Session information

sessionInfo()
## R Under development (unstable) (2024-10-21 r87258)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.21-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] gplots_3.2.0         msdata_0.45.0        pRoloc_1.47.0       
##  [4] BiocParallel_1.41.0  MLInterfaces_1.87.0  cluster_2.1.6       
##  [7] annotate_1.85.0      XML_3.99-0.17        AnnotationDbi_1.69.0
## [10] IRanges_2.41.0       pRolocdata_1.43.3    Rdisop_1.67.0       
## [13] zoo_1.8-12           MSnbase_2.33.0       ProtGenerics_1.39.0 
## [16] S4Vectors_0.45.0     mzR_2.41.0           Rcpp_1.0.13         
## [19] Biobase_2.67.0       BiocGenerics_0.53.0  ggplot2_3.5.1       
## [22] BiocStyle_2.35.0    
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.5.0               bitops_1.0-9               
##   [3] filelock_1.0.3              tibble_3.2.1               
##   [5] hardhat_1.4.0               preprocessCore_1.69.0      
##   [7] pROC_1.18.5                 rpart_4.1.23               
##   [9] lifecycle_1.0.4             httr2_1.0.5                
##  [11] doParallel_1.0.17           globals_0.16.3             
##  [13] lattice_0.22-6              MASS_7.3-61                
##  [15] MultiAssayExperiment_1.33.0 dendextend_1.18.1          
##  [17] magrittr_2.0.3              limma_3.63.0               
##  [19] plotly_4.10.4               sass_0.4.9                 
##  [21] rmarkdown_2.28              jquerylib_0.1.4            
##  [23] yaml_2.3.10                 MsCoreUtils_1.19.0         
##  [25] DBI_1.2.3                   RColorBrewer_1.1-3         
##  [27] lubridate_1.9.3             abind_1.4-8                
##  [29] zlibbioc_1.53.0             GenomicRanges_1.59.0       
##  [31] purrr_1.0.2                 mixtools_2.0.0             
##  [33] AnnotationFilter_1.31.0     nnet_7.3-19                
##  [35] rappdirs_0.3.3              ipred_0.9-15               
##  [37] lava_1.8.0                  GenomeInfoDbData_1.2.13    
##  [39] listenv_0.9.1               parallelly_1.38.0          
##  [41] ncdf4_1.23                  codetools_0.2-20           
##  [43] DelayedArray_0.33.0         xml2_1.3.6                 
##  [45] tidyselect_1.2.1            farver_2.1.2               
##  [47] UCSC.utils_1.3.0            viridis_0.6.5              
##  [49] matrixStats_1.4.1           BiocFileCache_2.15.0       
##  [51] jsonlite_1.8.9              caret_6.0-94               
##  [53] e1071_1.7-16                survival_3.7-0             
##  [55] iterators_1.0.14            foreach_1.5.2              
##  [57] segmented_2.1-3             tools_4.5.0                
##  [59] progress_1.2.3              glue_1.8.0                 
##  [61] prodlim_2024.06.25          gridExtra_2.3              
##  [63] SparseArray_1.7.0           mgcv_1.9-1                 
##  [65] xfun_0.48                   MatrixGenerics_1.19.0      
##  [67] GenomeInfoDb_1.43.0         dplyr_1.1.4                
##  [69] withr_3.0.2                 BiocManager_1.30.25        
##  [71] fastmap_1.2.0               fansi_1.0.6                
##  [73] caTools_1.18.3              digest_0.6.37              
##  [75] timechange_0.3.0            R6_2.5.1                   
##  [77] colorspace_2.1-1            gtools_3.9.5               
##  [79] lpSolve_5.6.21              biomaRt_2.63.0             
##  [81] RSQLite_2.3.7               utf8_1.2.4                 
##  [83] tidyr_1.3.1                 generics_0.1.3             
##  [85] hexbin_1.28.4               data.table_1.16.2          
##  [87] recipes_1.1.0               FNN_1.1.4.1                
##  [89] class_7.3-22                prettyunits_1.2.0          
##  [91] PSMatch_1.11.0              httr_1.4.7                 
##  [93] htmlwidgets_1.6.4           S4Arrays_1.7.0             
##  [95] ModelMetrics_1.2.2.2        pkgconfig_2.0.3            
##  [97] gtable_0.3.6                timeDate_4041.110          
##  [99] blob_1.2.4                  impute_1.81.0              
## [101] XVector_0.47.0              htmltools_0.5.8.1          
## [103] bookdown_0.41               MALDIquant_1.22.3          
## [105] clue_0.3-65                 scales_1.3.0               
## [107] png_0.1-8                   gower_1.0.1                
## [109] knitr_1.48                  reshape2_1.4.4             
## [111] coda_0.19-4.1               nlme_3.1-166               
## [113] curl_5.2.3                  proxy_0.4-27               
## [115] cachem_1.1.0                stringr_1.5.1              
## [117] KernSmooth_2.23-24          parallel_4.5.0             
## [119] mzID_1.45.0                 vsn_3.75.0                 
## [121] pillar_1.9.0                vctrs_0.6.5                
## [123] pcaMethods_1.99.0           randomForest_4.7-1.2       
## [125] dbplyr_2.5.0                xtable_1.8-4               
## [127] evaluate_1.0.1              magick_2.8.5               
## [129] tinytex_0.53                mvtnorm_1.3-1              
## [131] cli_3.6.3                   compiler_4.5.0             
## [133] rlang_1.1.4                 crayon_1.5.3               
## [135] future.apply_1.11.3         labeling_0.4.3             
## [137] LaplacesDemon_16.1.6        mclust_6.1.1               
## [139] QFeatures_1.17.0            affy_1.85.0                
## [141] plyr_1.8.9                  stringi_1.8.4              
## [143] viridisLite_0.4.2           munsell_0.5.1              
## [145] Biostrings_2.75.0           lazyeval_0.2.2             
## [147] Matrix_1.7-1                hms_1.1.3                  
## [149] bit64_4.5.2                 future_1.34.0              
## [151] KEGGREST_1.47.0             statmod_1.5.0              
## [153] highr_0.11                  SummarizedExperiment_1.37.0
## [155] kernlab_0.9-33              igraph_2.1.1               
## [157] memoise_2.0.1               affyio_1.77.0              
## [159] bslib_0.8.0                 sampling_2.10              
## [161] bit_4.5.0

References

Gatto, Laurent, and Kathryn S Lilley. 2012. “MSnbase – an R/Bioconductor Package for Isobaric Tagged Mass Spectrometry Data Visualization, Processing and Quantitation.” Bioinformatics 28 (2): 288–89. https://doi.org/10.1093/bioinformatics/btr645.
Martens, Lennart, Matthew Chambers, Marc Sturm, Darren Kessner, Fredrik Levander, Jim Shofstahl, Wilfred H Tang, et al. 2010. “mzML - a Community Standard for Mass Spectrometry Data.” Molecular & Cellular Proteomics : MCP. https://doi.org/10.1074/mcp.R110.000133.
Orchard, Sandra, Luisa Montechi-Palazzi, Eric W Deutsch, Pierre-Alain Binz, Andrew R Jones, Norman Paton, Angel Pizarro, David M Creasy, Jérôme Wojcik, and Henning Hermjakob. 2007. “Five Years of Progress in the Standardization of Proteomics Data 4th Annual Spring Workshop of the HUPO-Proteomics Standards Initiative April 23-25, 2007 Ecole Nationale Supérieure (ENS), Lyon, France.” Proteomics 7 (19): 3436–40. https://doi.org/10.1002/pmic.200700658.
Pedrioli, Patrick G A, Jimmy K Eng, Robert Hubley, Mathijs Vogelzang, Eric W Deutsch, Brian Raught, Brian Pratt, et al. 2004. “A Common Open Representation of Mass Spectrometry Data and Its Application to Proteomics Research.” Nat. Biotechnol. 22 (11): 1459–66. https://doi.org/10.1038/nbt1031.