
1 Introduction

The function of this R package is to assess the contribution of the targeted precursor in a fragmentation isolation window using a metric called “precursor purity”.

What we call “precursor purity” is a measure of the contribution of a selected precursor peak in an isolation window used for fragmentation. The simple calculation involves dividing the intensity of the selected precursor peak by the total intensity of the isolation window. When assessing MS/MS spectra, this calculation is done on the full-scan spectra acquired before and after the MS/MS scan of interest, and the purity is interpolated at the time of the MS/MS acquisition. The calculation is very similar to the “Precursor Ion Fraction” (PIF) metric described by Michalski, Cox, and Mann (2011) for proteomics, with the exception that purity here is interpolated at the recorded point of MS2 acquisition using bordering full-scan spectra. Additionally, low-abundance ions that are thought to contribute little to the resulting MS2 spectra are removed, and the calculation can optionally take into account the isolation efficiency of the mass spectrometer.
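The calculation described above can be sketched in base R. This is an illustrative sketch only, not the package's implementation: the peak lists, retention times, the 10 ppm matching tolerance, the helper name window_purity and the use of linear interpolation are all assumptions made for the example.

```r
# Toy isolation-window peak lists (m/z, intensity) from the two full scans
# that bracket an MS/MS scan. All values are invented for illustration.
scan_before <- data.frame(mz = c(149.0233, 149.3100), i = c(8500, 1500))
scan_after  <- data.frame(mz = c(149.0233, 149.3100), i = c(8000, 2000))

# Purity = intensity of the target peak / total intensity in the window.
# window_purity is a hypothetical helper, not an msPurity function.
window_purity <- function(peaks, target_mz, ppm = 10) {
  is_target <- abs(peaks$mz - target_mz) / target_mz * 1e6 <= ppm
  sum(peaks$i[is_target]) / sum(peaks$i)
}

p_before <- window_purity(scan_before, 149.0233)
p_after  <- window_purity(scan_after, 149.0233)

# Interpolate (linearly, by assumption) at the MS/MS acquisition time.
rt_before <- 10
rt_msms   <- 11
rt_after  <- 12
interpolated <- approx(x = c(rt_before, rt_after),
                       y = c(p_before, p_after),
                       xout = rt_msms)$y
print(c(before = p_before, after = p_after, interpolated = interpolated))
```

With these toy values the purity is 0.85 before and 0.80 after, so the value interpolated midway between the two scans is 0.825.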

There are two main use cases for the package:

  1. Assessing precursor purity of previously acquired MS2 spectra: A user has acquired either LC-MS2 or DIMS2 spectra and an assessment is made of the precursor purity for each MS2 scan (purityA).
  2. Assessing precursor purity of anticipated isolation windows for MS2 spectra: A user has acquired either LC-MS (purityX) or DIMS (purityD) full scan (MS1) data and an assessment is to be made of the precursor purity of detected features using anticipated or theoretical isolation windows. This information can then be used to guide further targeted MS2 experiments.

The package has been developed to be used with DI-MS or LC-MS data and has been checked to work with the following vendor files after conversion to mzML: Thermo, Agilent and AB Sciex.

2 Assessing precursor purity of previously acquired MS2 spectra

2.1 purityA

Given a vector of LC-MS/MS or DI-MS/MS mzML file paths, the function purityA will calculate the precursor purity of each MS/MS scan. The output is an S4 class object in which a dataframe of the purity results can be accessed using the appropriate slot (@puritydf).

The isolation widths will be determined automatically from the mzML file. For some mzML files this is not recorded and in these cases the offsets can be given as a parameter.

In the case of Agilent, only the “narrow” isolation is supported. This roughly equates to +/- 0.65 Da (depending on the instrument). If the file is detected as originating from an Agilent instrument, the isolation widths will automatically be set to +/- 0.65 Da (this can be overridden with the offsets argument).
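As a rough illustration of what the offsets mean, the sketch below derives the isolation window bounds for a single precursor. The precursor m/z is taken from the example output later in this section; the c(lower, upper) ordering of the offsets vector is an assumption, and the commented-out purityA call only shows where the offsets argument mentioned above would be supplied.

```r
# Precursor m/z taken from the example dataset output.
precursor_mz <- 391.2838

# Offsets in Da; the c(lower, upper) ordering is assumed here.
offsets <- c(0.65, 0.65)

# Isolation window implied by the offsets.
isolation_window <- c(lower = precursor_mz - offsets[1],
                      upper = precursor_mz + offsets[2])
print(isolation_window)

# With the package and data installed, the same offsets would be passed as:
# pa <- purityA(msmsPths, offsets = c(0.65, 0.65))
```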

The purity dataframe (pa@puritydf) consists of the following columns:

  • pid: unique id for MS/MS scan
  • fileid: unique id for file
  • seqNum: scan number
  • precursorIntensity: precursor intensity value as defined from mzML file
  • precursorMZ: precursor m/z value as defined from mzML file
  • precursorRT: precursor RT value as defined from mzML file
  • precursorScanNum: precursor scan number value as defined from mzML file
  • id: unique id (redundant)
  • filename: mzML filename
  • precursorNearest: MS1 scan nearest to this MS/MS scan
  • aMz: The m/z value in the precursorNearest scan which most closely matches the precursorMZ value provided from the mzML file
  • aPurity: The purity score for aMz
  • apkNm: The number of peaks in the isolation window for aMz
  • iMz: The m/z value in the precursorNearest scan that is the most intense within the isolation window.
  • iPurity: The purity score for iMz
  • ipkNm: The number of peaks in the isolation window for iMz
  • inPurity: The interpolated purity score
  • inPkNm: The interpolated number of peaks in the isolation window

library(msPurity)
msmsPths <- list.files(system.file("extdata", "lcms", "mzML", package="msPurityData"), full.names = TRUE, pattern = "MSMS")
msPths <- list.files(system.file("extdata", "lcms", "mzML", package="msPurityData"), full.names = TRUE, pattern = "LCMS_")
pa <- purityA(msmsPths)
print(head(pa@puritydf))
##   pid fileid seqNum acquisitionNum precursorIntensity precursorMZ
## 1   1      1      7              7          2338044.2    391.2838
## 2   2      1      8              8          1415939.8    149.0232
## 3   3      1      9              9          1319700.2    135.1015
## 4   4      1     10             10          1179373.9    219.1742
## 5   5      1     11             11          1065425.9    136.0200
## 6   6      1     13             13           817673.7    235.1690
##   precursorRT precursorScanNum id      filename precursorNearest      aMz
## 1    2.707016                6  7 LCMSMS_1.mzML                6 391.2838
## 2    2.707016                6  8 LCMSMS_1.mzML                6 149.0233
## 3    2.707016                6  9 LCMSMS_1.mzML                6 135.1015
## 4    2.707016                6 10 LCMSMS_1.mzML               12 219.1742
## 5    2.707016                6 11 LCMSMS_1.mzML               12 136.0215
## 6    3.583746               12 13 LCMSMS_1.mzML               12 235.1691
##     aPurity apkNm      iMz   iPurity ipkNm inPkNm  inPurity
## 1 1.0000000     1 391.2838 1.0000000     1      1 1.0000000
## 2 0.8535700     2 149.0233 0.8535700     2      2 0.8475240
## 3 0.7616688     4 135.1015 0.7616688     4      4 0.7558731
## 4 0.7173636     3 219.1742 0.7173636     3      3 0.7248489
## 5 0.8163521     4 136.0215 0.8163521     4      3 0.8247355
## 6 0.8312278     2 235.1691 0.8312278     2      2 0.8299369
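A common next step is to keep only the scans above a chosen purity. The sketch below rebuilds three columns of the printed output as a plain dataframe so that it runs without the package; the 0.8 cut-off is an arbitrary value chosen for illustration, not a recommendation from the package.

```r
# Subset of the columns printed above, re-entered by hand for illustration.
puritydf <- data.frame(
  pid         = 1:6,
  precursorMZ = c(391.2838, 149.0232, 135.1015, 219.1742, 136.0200, 235.1690),
  inPurity    = c(1.0000000, 0.8475240, 0.7558731, 0.7248489, 0.8247355, 0.8299369)
)

# Hypothetical purity cut-off.
purity_threshold <- 0.8

# Keep scans whose interpolated purity meets the cut-off.
pure_scans <- puritydf[puritydf$inPurity >= purity_threshold, ]
print(pure_scans)
```

On a real purityA result the same subsetting would be applied to pa@puritydf.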

2.2 Mapping XCMS features to fragmentation spectra

The MS/MS spectra can be assigned to an XCMS grouped feature using the frag4feature function.

First, an xcmsSet object of the same files is required:

library(xcms)

xset <- xcms::xcmsSet(msmsPths)
xset <- xcms::group(xset)
## Processing 3163 mz slices ... OK
xset <- xcms::retcor(xset)
## Performing retention time correction using 351 peak groups.
xset <- xcms::group(xset)
## Processing 3163 mz slices ... OK
pa <- frag4feature(pa, xset)

The slot grped_df is a dataframe of the grouped XCMS features, each linked to any MS/MS scans that fall within the full width of the XCMS feature in each file. The dataframe contains the following columns:

  • grpid: XCMS grouped feature id
  • mz: derived from XCMS peaklist
  • mzmin: derived from XCMS peaklist
  • mzmax: derived from XCMS peaklist
  • rt: derived from XCMS peaklist
  • rtmin: derived from XCMS peaklist
  • rtmax: derived from XCMS peaklist
  • into: derived from XCMS peaklist
  • intb: derived from XCMS peaklist
  • maxo: derived from XCMS peaklist
  • sn: derived from XCMS peaklist
  • sample: derived from XCMS peaklist
  • id: unique id of MS/MS scan
  • precurMtchID: Associated nearest precursor scan id (file specific)
  • precurMtchRT: Associated precursor scan RT
  • precurMtchMZ: Associated precursor m/z
  • precurMtchPPM: Associated precursor m/z parts per million (ppm) tolerance to XCMS feature m/z
  • inPurity: The interpolated purity score

print(head(pa@grped_df))
##     grpid       mz    mzmin    mzmax       rt    rtmin    rtmax      into
## 108     8 112.0508 112.0507 112.0872 67.60929 55.27690 80.36167  36223791
## 109     8 112.0509 112.0506 112.1205 67.51574 55.41402 80.55541  36139266
## 16     12 116.0706 116.0109 116.0709 47.68880 35.74403 60.00379 130337063
## 17     12 116.0706 116.0109 116.0709 47.78054 35.59266 59.86578 124086404
## 46     12 116.0706 116.0109 116.0709 47.68880 35.74403 60.00379 130337063
## 47     12 116.0706 116.0109 116.0709 47.78054 35.59266 59.86578 124086404
##          intf     maxo     maxf i       sn sample cid      filename
## 108 133522504  7158012  8555976 1 21.05495      2 398 LCMSMS_2.mzML
## 109 133395721  7426336  8522973 1 20.38040      1   9 LCMSMS_1.mzML
## 16  491415400 27555850 32138223 1 24.64411      1  13 LCMSMS_1.mzML
## 17  465322433 26501960 30360236 1 24.07917      2 402 LCMSMS_2.mzML
## 46  491415400 27555850 32138223 1 24.64411      1  13 LCMSMS_1.mzML
## 47  465322433 26501960 30360236 1 24.07917      2 402 LCMSMS_2.mzML
##     rtminCorrected rtmaxCorrected precurMtchID precurMtchScan precurMtchRT
## 108       55.37223       80.35088          466            462     64.42730
## 109       55.39383       80.40870          472            468     65.37616
## 16        35.55508       60.13504          277            276     41.09841
## 17        35.81265       59.93620          277            276     40.95480
## 46        35.55508       60.13504          343            342     49.31567
## 47        35.81265       59.93620          343            342     49.18240
##     precurMtchMZ precurMtchPPM  inPurity  pid
## 108     112.0507     0.5516288 1.0000000 1213
## 109     112.0507     1.0047960 1.0000000  389
## 16      116.0708     2.2246382 0.9893762  226
## 17      116.0708     1.7359354 0.9506322 1055
## 46      116.0708     2.2903689 1.0000000  281
## 47      116.0708     1.6702047 1.0000000 1110
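For orientation, a ppm difference between a precursor m/z and a feature m/z is commonly computed as in the sketch below. Which of the two values msPurity uses as the denominator is not stated here, so dividing by the feature m/z is an assumption, and ppm_diff is a hypothetical helper. Because the printed m/z values are rounded to four decimal places, the result will not reproduce the precurMtchPPM values in the table.

```r
# Hypothetical helper: absolute ppm difference relative to a reference m/z.
ppm_diff <- function(observed_mz, reference_mz) {
  abs(observed_mz - reference_mz) / reference_mz * 1e6
}

# Rounded values from the table above: precursor 116.0708, feature 116.0706.
example_ppm <- ppm_diff(observed_mz = 116.0708, reference_mz = 116.0706)
print(round(example_ppm, 2))
```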

The slot grped_ms2 is a list of the associated fragmentation spectra for the grouped features.

print(pa@grped_ms2[2:3])
## $`12`
## $`12`[[1]]
##          [,1]       [,2]
## [1,] 107.2701   1726.613
## [2,] 116.0164   2890.495
## [3,] 116.0709 100876.133
## [4,] 116.1072   2424.613
## 
## $`12`[[2]]
##          [,1]      [,2]
## [1,] 116.0168  3725.937
## [2,] 116.0709 97631.586
## [3,] 116.1071  3945.327
## 
## $`12`[[3]]
##          [,1]    [,2]
## [1,] 116.0709 1847703
## 
## $`12`[[4]]
##          [,1]        [,2]
## [1,] 103.1290    4419.712
## [2,] 116.0164    5682.144
## [3,] 116.0709 1782171.000
## [4,] 130.0276    4081.138
## 
## $`12`[[5]]
##          [,1]       [,2]
## [1,] 116.0166   4434.369
## [2,] 116.0709 165623.641
## [3,] 116.1073  11372.488
## 
## $`12`[[6]]
##          [,1]       [,2]
## [1,] 116.0168  14364.784
## [2,] 116.0709 149471.266
## [3,] 116.1074   8359.903
## 
## 
## $`27`
## $`27`[[1]]
##          [,1]       [,2]
## [1,] 117.8772   5004.664
## [2,] 132.1019 273406.250
## 
## $`27`[[2]]
##          [,1]      [,2]
## [1,] 132.1020 402822.69
## [2,] 144.2789   7715.69
## [3,] 146.6829   7014.51
## 
## $`27`[[3]]
##         [,1]      [,2]
## [1,] 130.187  121726.4
## [2,] 132.102 3973065.5
## 
## $`27`[[4]]
##          [,1]      [,2]
## [1,] 104.9648  111113.7
## [2,] 132.1021 3328366.8
## 
## $`27`[[5]]
##          [,1]     [,2]
## [1,] 132.1021 77799.47
## 
## $`27`[[6]]
##          [,1]     [,2]
## [1,] 115.4372  2424.58
## [2,] 132.1020 67118.03
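The printed structure is a list keyed by grouped feature id, where each element is itself a list of two-column matrices (m/z, intensity). A minimal sketch of walking such a structure is shown below, using a hand-built list that copies a few of the spectra printed above rather than a real purityA object.

```r
# Hand-built stand-in for pa@grped_ms2 using a few spectra printed above.
# Each matrix has m/z in column 1 and intensity in column 2.
grped_ms2 <- list(
  "12" = list(
    matrix(c(107.2701, 116.0164, 116.0709, 116.1072,
             1726.613, 2890.495, 100876.133, 2424.613), ncol = 2),
    matrix(c(116.0709, 1847703), ncol = 2)
  ),
  "27" = list(
    matrix(c(117.8772, 132.1019, 5004.664, 273406.250), ncol = 2)
  )
)

# Number of MS/MS spectra linked to each grouped feature.
n_spectra <- vapply(grped_ms2, length, integer(1))

# m/z of the most intense peak (base peak) in every spectrum.
base_peak_mz <- lapply(grped_ms2, function(spectra) {
  vapply(spectra, function(s) s[which.max(s[, 2]), 1], numeric(1))
})

print(n_spectra)
print(base_peak_mz)
```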

3 Assessing precursor purity of anticipated isolation windows for MS2 spectra

3.1 purityX: Assessing anticipated purity of XCMS features from an LC-MS run

NOTE ON TERMINOLOGY: The terms ‘anticipated purity’ and ‘predicted purity’ are used interchangeably.

A processed xcmsSet object is required to determine the anticipated (predicted) precursor purity score from an LC-MS dataset. The offsets chosen in the parameters should reflect what settings would be used in a hypothetical fragmentation experiment.

The slot predictions provides the anticipated (predicted) purity scores for each feature. The dataframe contains the following columns:

  • grpid: XCMS grouped feature id
  • mean: Mean predicted purity of the feature
  • median: Median predicted purity of the feature
  • sd: Standard deviation of the predicted purity of the feature
  • stde: Standard error of the predicted purity of the feature
  • pknm: Median peak number in isolation window
  • RSD: Relative standard deviation of the predicted purity of the feature
  • i: Median intensity of the grouped feature. Uses XCMS “into” intensity value.
  • mz: m/z of the XCMS grouped feature

XCMS run on an LC-MS dataset

xset <- xcms::xcmsSet(msPths)
xset <- xcms::group(xset)
## Processing 3179 mz slices ... OK
xset <- xcms::retcor(xset)
## Performing retention time correction using 763 peak groups.
xset <- xcms::group(xset)
## Processing 3179 mz slices ... OK

Perform purity calculations

ppLCMS <- purityX(xset, offsets=c(0.5, 0.5), xgroups = c(1, 2))
## [1] 4
print(head(ppLCMS@predictions))
##   grpid mean median sd stde RSD pknm        i       mz
## 1     1    1      1  0    0   0    1 61925043 102.0916
## 2     2    1      1  0    0   0    1 25719001 103.0544

3.2 purityD: Assessing anticipated purity from a DI-MS run

The anticipated/predicted purity can be calculated for any DI-MS dataset consisting of multiple MS1 scans of the same mass range, i.e. it has not been developed to be used with any SIM stitching approach.

A number of simple data processing steps are performed on the mzML files to provide a DI-MS peak list (features) to perform the purity predictions on.

These data processing steps consist of:

  • Averaging peaks across multiple scans
  • Removing peaks below a signal to noise threshold [optional]
  • Removing peaks below an intensity threshold [optional]
  • Removing peaks above an RSD threshold for intensity [optional]
  • Where there is a blank, subtracting blank peaks [optional]

The averaged peaks before and after filtering are stored in the avPeaks slot of the purityPD S4 object.
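The averaging and filtering steps listed above can be pictured with a small base R sketch. It is not the package's algorithm: it assumes the peaks have already been aligned across scans (the package uses clustering for this), it skips the signal-to-noise and blank subtraction steps, and the intensities and thresholds are invented for illustration.

```r
# Rows = peaks already aligned across scans, columns = scans (invented values).
intens <- rbind(
  peakA = c(10000, 10500, 9800),
  peakB = c(3000, 9000, 1000),
  peakC = c(1000, 1100, 900)
)

# Average each peak across scans and compute the RSD (%) of its intensity.
av <- data.frame(
  i   = rowMeans(intens),
  rsd = apply(intens, 1, sd) / rowMeans(intens) * 100,
  row.names = rownames(intens)
)

# Illustrative thresholds: minimum mean intensity and maximum RSD.
intensity_thr <- 5000
rsd_thr <- 10

# Keep peaks that pass both filters.
filtered <- av[av$i >= intensity_thr & av$rsd <= rsd_thr, ]
print(filtered)
```

Here only peakA survives: peakB and peakC fall below the intensity threshold, and peakB also has a very high RSD.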

Get file dataframe: The purityD constructor requires a dataframe consisting of the following columns:

  • filepth
  • name
  • sampleType [either sample or blank]
  • class [for grouping samples together]
  • polarity [optional]

datapth <- system.file("extdata", "dims", "mzML", package="msPurityData")
inDF <- Getfiles(datapth, pattern=".mzML", check = FALSE)
ppDIMS <- purityD(inDF, mzML=TRUE)

Average spectra: The default averaging will use a hierarchical clustering approach. Noise filtering is also performed here.

ppDIMS <- averageSpectra(ppDIMS, snMeth = "median", snthr = 5)

Filter by RSD and Intensity

ppDIMS <- filterp(ppDIMS, thr=5000, rsd = 10)

Subtract blank

ppDIMS <- subtract(ppDIMS)

Predict purity

ppDIMS <- dimsPredictPurity(ppDIMS)

print(head(ppDIMS@avPeaks$processed$B02_Daph_TEST_pos))
##    peakID       mz          i        snr      rsd        inorm medianPurity
## 5       5 173.0806 11272447.0 216.506319 9.006126 0.0108585920    1.0000000
## 7       7 179.1177   606983.2  11.425825 6.019861 0.0005729283    1.0000000
## 10     10 217.1067 17770220.0 343.292914 8.602331 0.0171178067    0.7797864
## 15     15 235.1173  4950841.5  95.991762 6.302825 0.0047694791    1.0000000
## 16     16 236.1206   486912.0   9.270517 8.811437 0.0004638254    0.8818313
## 17     17 239.1485  2533134.5  48.892062 5.781277 0.0024401334    0.8123950
##    meanPurity   sdPurity cvPurity   sdePurity medianPeakNum
## 5   1.0000000 0.00000000 0.000000 0.000000000             1
## 7   1.0000000 0.00000000 0.000000 0.000000000             1
## 10  0.7808917 0.01261501 1.615462 0.005641605             2
## 15  1.0000000 0.00000000 0.000000 0.000000000             1
## 16  0.8755873 0.01056807 1.206969 0.004726184             2
## 17  0.8229505 0.04384595 5.327896 0.019608505             2
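The relationship between the summary columns can be checked by hand. The sketch below takes meanPurity and sdPurity for peak 10 from the output above and recomputes the standard error and coefficient of variation using the standard formulas. The number of scans (five) is an assumption inferred from the printed values, not something stated in the output.

```r
# Values for peakID 10, copied from the output above.
mean_purity <- 0.7808917
sd_purity   <- 0.01261501

# Assumed number of scans contributing to the statistics.
n_scans <- 5

# Standard error of the mean and coefficient of variation (%).
sde_purity <- sd_purity / sqrt(n_scans)
cv_purity  <- sd_purity / mean_purity * 100

print(c(sdePurity = sde_purity, cvPurity = cv_purity))
```

Under that assumption the results agree with the printed sdePurity (0.005641605) and cvPurity (1.615462) to within rounding.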

3.3 Calculating the anticipated (predicted) purity from a known m/z target list for DI-MS

The data processing steps carried out through purityPD can be bypassed if the peaks (m/z values) of interest are already known. The function dimsPredictPuritySingle() can be used to predict the purity of a list of m/z values in a chosen mzML file.

mzpth <- system.file("extdata", "dims", "mzML", "B02_Daph_TEST_pos.mzML", package="msPurityData")
predicted <- dimsPredictPuritySingle(filepth = mzpth, mztargets = c(111.0436, 113.1069))
print(predicted)
##   medianPurity meanPurity  sdPurity  cvPurity  sdePurity medianPeakNum
## 1    0.6390276  0.6251787 0.0356821  5.707505 0.01595752             5
## 2    0.7453778  0.7619277 0.1008513 13.236338 0.04510209             5

References

Michalski, Annette, Juergen Cox, and Matthias Mann. 2011. “More than 100,000 detectable peptide species elute in single shotgun proteomics runs but the majority is inaccessible to data-dependent LC-MS/MS.” Journal of Proteome Research 10 (4):1785–93. https://doi.org/10.1021/pr101060v.