proteinDiscover: new functions

Introduction

Apart from a big bunch of bug-fixes and tweaks to the functions, there are a couple of new functions related to psms and spectra information.

MS² fragmentation spectra are sort of hidden in a separate raw column in the MassSpectrumItems table in .pdResult files. It turned out that the data is actually xml data in a compressed (zip) format.

Spectra data

The following functions deal with this type raw data:

Name	Remark
transformSpectrumRaw	takes raw spectrum data and ‘translates’ it into an xml object for further processing. All other functions work with the XML object coming from transformSpectrumRaw
spectrum.header
spectrum.scanEvent
spectrum.centroid	The only format which seems to be present in these spectra is centroided
spectrum.profile	As I have not encountered any profile data yet, this function always returns NA
spectrum.precursor.header
spectrum.precursor.scanEvent
spectrum.precursor.info
spectrum.precursor.centroid	The only format which seems to be present in these spectra is centroided
spectrum.precursor.profile	As I have not encountered any profile data yet, this function always returns NA
spectrum.precursor.additionalInfo

Please note that the transformSpectrumRaw function writes a temporary file and then deletes it again due to having to decompress the data.

Examples

library(proteinDiscover)
library(dplyr)
library(stringr)
db <- dbOpen(fileName = "hela001.pdResult")

Get the 5 proteins with the highest (summed) Sequest score:

best5 <- dbGetProteinTable(db,
                           columns = c("Accession", "Description",
                                       "ProteinGroupIDs", "ScoreSequestHT")) %>%
  dfTransformRaws() %>%
  arrange(desc(ScoreSequestHT_1)) %>%
  slice(1:5) %>%
  dplyr::mutate(
    Description = substr(
      str_replace_all(Description, pattern = "\\[.*\\]", replacement = ""), 1, 30))

best5 %>%
  tibble()

## # A tibble: 5 × 4
##   Accession Description                      ProteinGroupIDs ScoreSequestHT_1
##   <chr>     <chr>                            <chr>                      <dbl>
## 1 P60709    "Actin, cytoplasmic 1 OS=Homo s" 2218                       1278.
## 2 P06733    "Alpha-enolase OS=Homo sapiens " 3518                       1200.
## 3 P21333    "Filamin-A OS=Homo sapiens OX=9" 1179                        983.
## 4 O75369    "Filamin-B OS=Homo sapiens OX=9" 3093                        932.
## 5 Q09666    "Neuroblast differentiation-ass" 2062                        834.

Then get the best peptides for the first protein

peptide5 <- dbGetPeptideTable(db = db, proteinGroupIDs = 2218) %>%
  dfTransformRaws() %>%
  dplyr::select(PeptideGroupID, Sequence, Modifications, PsmCount, XCorrbySearchEngine_1) %>%
  dplyr::arrange(desc(XCorrbySearchEngine_1)) %>%
  slice(1:5)

peptide5 %>%
  tibble()

## # A tibble: 5 × 5
##   PeptideGroupID Sequence           Modifications PsmCount XCorrbySearchEngine_1
##            <int> <chr>              <chr>            <int>                 <dbl>
## 1          68421 TTGIVMDSGDGVTHTVP… ""                  24                  8.39
## 2           7425 CPEALFQPSFLGMESCG… "Carbamidome…       70                  8.19
## 3          10695 DLYANTVLSGGTTMYPG… ""                  69                  7.29
## 4          10697 DLYANTVLSGGTTMYPG… "1×Oxidation…       30                  7.23
## 5           7427 CPEALFQPSFLGMESCG… "Carbamidome…       33                  6.27

Now retrieve the PSM’s for the first peptide

psms5 <- dbGetPsmTable(db = db, peptideGroupIDs = 10695) %>%
  dplyr::select(PeptideID, Charge, MassOverCharge, XCorr) %>%
  dplyr::arrange(desc(XCorr)) %>%
  slice(1:5)

psms5

##   PeptideID Charge MassOverCharge    XCorr
## 1    616594      2      1108.0387 7.286522
## 2    610318      2      1108.0387 6.880840
## 3    613318      2      1108.0392 6.832521
## 4    617110      3       739.0286 6.653973
## 5    619826      2      1108.0388 6.643941

with the PeptideID we can now retrieve information on the psms spectra

MSnInfo <- dbGetMSnSpectrumInfo(db = db, peptideID = psms5$PeptideID)

MSnInfo %>%
  dplyr::select(RetentionTime, MasterScanNumbers, ScanNumbers, PercentIsolationInterference)

##   RetentionTime MasterScanNumbers ScanNumbers PercentIsolationInterference
## 1      75.61137             43595       43645                    43.630718
## 2      75.93290             43852       43890                     3.204555
## 3      76.27213             44110       44147                     0.000000
## 4      76.32849             44146       44187                     3.603348
## 5      76.60362             44352       44394                    15.745718

MSnInfo %>%
  dplyr::select(IonInjectTime, MassOverCharge, Charge, OriginalPrecursorMass, Intensity)

##   IonInjectTime MassOverCharge Charge OriginalPrecursorMass Intensity
## 1        35.000      1108.0387      2             1108.0387   1323099
## 2         1.543      1108.0392      2             1108.0392 339010432
## 3        30.875      1108.0387      2             1108.0387  20638660
## 4        35.000       739.0286      3              739.0286   5674726
## 5        35.000      1108.0388      2             1108.0388   8470223

…and we can get the Mass Spectrum Info:

spectrumInfo <-  dbGetMassSpectrumItems(db = db, peptideID = psms5$PeptideID)

spectrumInfo %>% 
  dplyr::select(RetentionTime, ResolutionAtMass200, IsolationWidth, Spectrum)

##   RetentionTime ResolutionAtMass200 IsolationWidth       Spectrum
## 1      75.61137               30000            1.6 blob[12.26 kB]
## 2      75.93290               30000            1.6  blob[9.64 kB]
## 3      76.27213               30000            1.6 blob[11.81 kB]
## 4      76.32849               30000            1.6 blob[11.77 kB]
## 5      76.60362               30000            1.6 blob[10.07 kB]

Most interesting in this last table is obviously the spectrum object which is in a raw (blob) format. With the transformSpectrumRaw function we can translate this into useful information:

theSpectrum <- transformSpectrumRaw(spectrumInfo$Spectrum[[2]])

class(theSpectrum)

## [1] "list"

names(theSpectrum)

## [1] "Header"        "ScanEvent"     "PrecursorInfo" "PeakCentroids"
## [5] "ProfilePoints"

And with the helper functions we can get the info out of this somewhat cumbersome list object

spectrum.header(theSpectrum) %>%
  dplyr::select(DataType, RetentionTime, BasePeakPosition, BasePeakIntensity, TotalIntensity)

##   DataType  RetentionTime BasePeakPosition BasePeakIntensity  TotalIntensity
## 1 Centroid 75.93290145335 628.341003417969          31201228 464254629.15625

And of course the spectrum (used for identification) too:

centroided <- spectrum.centroid(theSpectrum)

centroided %>%
  dplyr::slice(1:10)

##          mz  intensity charge resolution signalToNoise
## 1  156.1659   84623.61      0      26800           2.2
## 2  157.1081  113200.70      0      27800           2.9
## 3  158.0924  259970.20      0      37600           6.6
## 4  158.9321   75278.96      0      28300           1.9
## 5  163.4241   66772.50      0      23200           1.7
## 6  173.1283  203509.00      3      26904           4.8
## 7  173.4685 1646527.00      0      41700          38.9
## 8  175.1187 5782023.00      0      37900         135.5
## 9  176.1223  183542.80      0      27400           4.3
## 10 183.1125  219083.90      0      33900           4.9

Which we can then plot!

(the plotSpectrum function comes from the MS-Analysis repo)

plotSpectrum(centroided, centroidPlot = T, labels = NA)

With the aid of the MassSpectrometryR package we can calculate the y- and b-series of ions (single charged)

library(massSpectrometryR)
peptide$new(sequence = peptide5$Sequence[2])$fragments()

## # A tibble: 28 × 13
##     aions aionsH2O aionsNH3 bions bionsH2O bionsNH3 bionsH2Oadd cions xions
##     <dbl>    <dbl>    <dbl> <dbl>    <dbl>    <dbl>       <dbl> <dbl> <dbl>
##  1   76.0     74.0     59.0  104.     86.0     87.0        122.  121.  173.
##  2  173.     155.     156.   201.    183.     184.         219.  218.  304.
##  3  302.     284.     285.   330.    312.     313.         348.  347.  417.
##  4  373.     355.     356.   401.    383.     384.         419.  418.  504.
##  5  486.     468.     469.   514.    496.     497.         532.  531.  618.
##  6  633.     615.     616.   661.    643.     644.         679.  678.  765.
##  7  761.     743.     744.   789.    771.     772.         807.  806.  866.
##  8  858.     840.     841.   886.    868.     869.         904.  903.  967.
##  9  945.     927.     928.   973.    955.     956.         991.  990. 1096.
## 10 1093.    1075.    1075.  1121.   1103.    1103.        1139. 1138. 1234.
## # ℹ 18 more rows
## # ℹ 4 more variables: yions <dbl>, yionsH2O <dbl>, yionsNH3 <dbl>, zions <dbl>

yFragments <- data.frame(
  yion = createIonSeries(theSeries = 1:21),
  mzy = peptide$new(sequence = peptide5$Sequence[3])$fragments()$yions,
  bion = createIonSeries(whichSeries = "b",theSeries = 1:21, whichCharge = 1),
  mzb = peptide$new(sequence = peptide5$Sequence[3])$fragments(chargeState = 1)$bions)

yFragments %>%
  dplyr::slice(5:0)

##   yion      mzy bion      mzb
## 1 y+_5 531.2885 b+_5 577.2617
## 2 y+_4 474.2671 b+_4 463.2187
## 3 y+_3 361.1830 b+_3 392.1816
## 4 y+_2 290.1459 b+_2 229.1183
## 5 y+_1 175.1190 b+_1 116.0342

Using the spectrumAnnotation routines we can overlay (existing) ions as a series, similair to how most

theOverlay <- list()
theOverlay[[1]] <- spectrumAnnotation$new(annotationName = "y 1+",
                                          mzs = yFragments$mzy,
                                          labels = yFragments$yion)
theOverlay[[2]] <- spectrumAnnotation$new(annotationName = "b 1+",
                                          mzs = yFragments$mzb,
                                          labels = yFragments$bion)
theOverlay[[1]]$check(spectrum = centroided)
theOverlay[[2]]$check(spectrum = centroided)

theOverlay[[2]]$color <- "blue"
# give the second label series a 'ramp'
theOverlay[[2]]$labelWhere <- 0.7 + (0.005 * 1:(length(theOverlay[[2]]$mz)))

plotSpectrum(centroided, centroidPlot = T, annotateMz = theOverlay[1])

plotSpectrum(centroided, centroidPlot = T, annotateMz = theOverlay)

With the aid of the plotly library you can make the plots interactive to be zoom in on ions/masses, though at the loss of some functionality

# turning of the 'ramp': doesn't work when ggplotly is used
theOverlay[[2]]$labelWhere <- 0.8

library(plotly)
ggplotly(plotSpectrum(centroided, centroidPlot = T, annotateMz = theOverlay))

dbClose(db)

proteinDiscover: new functions

2023-10-31

Introduction

Spectra data

Examples