proteinDiscover Extra

Introduction

This manual gives a few examples for the ‘extensions’ in the R functions provided by proteinDiscoverExtra. Some of these functions may eventually be included in the package itself, though most of them will not. The reason for this is that the functions are not very general; they’re usually specific for something I needed at some point when using the package itself. Interference Free Index calculations are an exception as they are already in the package.

Basic use

library(proteinDiscover)
library(BBPersonalR)

The library BBPersonalR is needed for some of the (graphic) routines. It should be easy to replace them with your own favourite scatterplot/boxplot/etc.

For the functions to be usable, nothing but proteinDiscover itself and simple call to the source function is needed

library(purrr)
walk(list.files(path = "../proteinDiscoverExtra/R", full.names = TRUE), source)

TMT Quantification

arsnPlot

This function creates a signal to noise plot for the reporter ions of TMT analysis. First the pdResult files is opened, and then the RetentionTime and AverageReporterSN columns from the TargetPsms table are read and plotted.

db <- dbOpen("/home/ben/Documents/Thermo/TKO-Example/TKOTMT10plex.pdResult")
arsnPlot(db = db)

Note: the file used is an example file which Thermo Scientific uses in their courses. It contains a measurement of TMT10-plex sample of a triple knockout proteomics standard.

aboveARSN

This function reads the AverageReporterSN data from the TargetPsms table and calculates which proportion is above a user-set level.

aboveARSN(db = db, cutOff = 10)

## [1] 99.75847

Together with the arsnPlot they can be used for semi-automated quality control.

pPlot

Essentially this is a scatter plot function to be able to quickly plot some of the columns in database tables. The (Default) plot below shows the delta m/z of all peptide spectral matches (PSM’s) in the analysis.

pPlot(db = db)

Note that the info from the databases does not get translated via dfTransformRaw, but with some extra code you can still plot data:

pPlot(db = db, tData = dbGetPeptideTable(db = db) %>% dfTransformRaws(),
      xColumn = "RTminbySearchEngine_1",
      yColumn = "DeltaMppmbySearchEngine_1",
      xLabel = "RetentionTime",
      yLabel = "DeltaMassInPPM")

Interference Free Index

The Interference Free Index (IFI) is a tool to see if there is ratio distortion due to ‘background’ signal. An article describing what is meant exactly can be found: Paulo et al, Journal American Society Mass Spectrometry, 2016, vol 27, issue 10, page 1620-1625. The formula is:

\[ \begin{aligned} IFI = 1 - {\frac{average S/N_{knockout}}{average S/N_{not knockout}}} \end{aligned} \]

Calculating this quality measure is not difficult, it just take some messing with the data in the tables. You can calculate it from the protein table, but unfortunately this does not give you a sense of variability. That’s why the calcIFIs & calcAllIFIs functions work with the peptide tables: for each peptide of a protein the IFI is calculated (using the abundances in the channels). These values can be used to calculated an average protein IFI while also allowing to calculate the variation.

calcIFIs simply calculates the IFI of a single knockout protein across the TMT channels. For the His4 protein in the example file, which uses TMT10 plex:

IFIHis4 <- calcIFIs(db, selected = "His4", groups = tmt10Channels())
tibble(IFIHis4)

## # A tibble: 22 × 2
##    Short   IFI
##    <chr> <dbl>
##  1 His4  0.569
##  2 His4  0.480
##  3 His4  0.906
##  4 His4  0.641
##  5 His4  0.491
##  6 His4  0.158
##  7 His4  0.944
##  8 His4  0.913
##  9 His4  0.833
## 10 His4  0.593
## # ℹ 12 more rows

Which gives us the average 0.6495927 with standard deviation 0.2108972 from 22 peptides of the protein His4 (Uniprot Accession: P00815).

Using the cakcAllIFIs function we can calculate the IFI’s for all three strains quickly and create a boxplot type of graph:

allIFIs <- calcAllIFIs(db = db, groups = tmt10Channels())
statBoxPlotMultiple(data = allIFIs, column = "IFI", melted = TRUE, varColumn = "Short",
                    showLegend = FALSE, yDefault = FALSE,
                    yLimits = c(0,1),yExpand = c(0,0.05),
                    yLabel = "Interference Free Index", xLabel = "Strain",
                    title = paste0("MS^2 ", getAcquistionDateTime(db = db)[1]))

The result may seem only soso, but this example pdResult file was coming from an MS2 type of analysis. An MS3 analysis would result in a much better IFI plot.

To see the normalized abundances of the separate TMT channels we can provide the function getPeptideInfo with the protein accessions we wish to query. Below a boxplot of one of the ‘knocked out’ proteins and a ‘non knocked out’ protein can be seen.

peptidesTKO <- getPeptideInfo(db = db, columns = "AbundancesNormalized",
                              proteinAccessions = knockOutProteins()$Accession)
statBoxPlotMultiple(data = peptidesTKO$P00815 %>%
                    dplyr::select(starts_with("AbundancesNormalized_")) %>%
                    rename_with(~as.character(1:10)), legend.title = "TMT Channel",
                    title = "Protein P00815, His4 knockout")

statBoxPlotMultiple(data = peptidesTKO$P00925 %>%
                    dplyr::select(starts_with("AbundancesNormalized_")) %>%
                    rename_with(~as.character(1:10)),  legend.title = "TMT Channel",
                    title = "Protein P00925, Eno2 (non knockout)")

For any interesting protein in a TMT experiment these plots can relatively easy be made. The volcano plot may indicate a protein as significantly up or down regulated, but how good/bad is this significance? Plots of the individual channels per ‘interesting’ protein may help.

dbClose(db)

Modifciations

The next few functions deal with modification data. The example file (not available at this moment) contains data from a Proteome Discoverer with label free quantification (minora feature detection) search of a 2 hour HPLC run of a HeLa Digest (commercially available). The Modification Sites node in the consensus workflow of Proteome Discoverer also needs to be present.

db <- dbOpen(fileName = "hela001.pdResult")

In the search the following dynamic modifications were used:

((workflowInfo(db)$nodeInfo$Processing %>%
    nodes())$`Sequest HT` %>%
   dplyr::filter(grepl(name,pattern = "Dynamic"), value != "None") %>%
   dplyr::slice(-1) %>%
   dplyr::select(value))$value

## [1] "Oxidation / +15.995 Da (M)"       "Deamidated / +0.984 Da (N)"      
## [3] "Carbamidomethyl / +57.021 Da (C)"

The 10 proteins with the best (highest summed) Sequest score are:

protsHela <- dbGetProteinTable(db = db) %>% dfTransformRaws() %>%
  dplyr::arrange(desc(ScoreSequestHT_1)) %>%
  dplyr::select(Accession, Description, ProteinGroupIDs, ScoreSequestHT_1, Coverage) %>%
  dplyr::slice(1:10)
tibble(protsHela)

## # A tibble: 10 × 5
##    Accession Description               ProteinGroupIDs ScoreSequestHT_1 Coverage
##    <chr>     <chr>                     <chr>                      <dbl>    <dbl>
##  1 P60709    Actin, cytoplasmic 1 OS=… 2218                       1278.     58.4
##  2 P06733    Alpha-enolase OS=Homo sa… 3518                       1200.     68.7
##  3 P21333    Filamin-A OS=Homo sapien… 1179                        983.     55.9
##  4 O75369    Filamin-B OS=Homo sapien… 3093                        932.     62.7
##  5 Q09666    Neuroblast differentiati… 2062                        834.     53.1
##  6 P78527    DNA-dependent protein ki… 3605                        762.     39.0
##  7 P31327    Carbamoyl-phosphate synt… 600                         761.     62.5
##  8 P14618    Pyruvate kinase PKM OS=H… 3733                        741.     64.4
##  9 P07437    Tubulin beta chain OS=Ho… 3730                        731.     79.1
## 10 P04406    Glyceraldehyde-3-phospha… 2407                        713.     70.7

If we want to get a closer look at the peptide modifications of the first 2 proteins:

pepts <- 
lapply(protsHela$ProteinGroupIDs[1:2], function(x){
dbGetPeptideIDs(db = db, proteinGroupIDs = x) %>%
  dbGetPeptideTable(db = db) %>%
  dfTransformRaws()
})
names(pepts) <- protsHela$Accession[1:2]

Part of the peptide table of protein P60709 :

tibble(
pepts$P60709 %>% dplyr::select(Sequence, Modifications) %>%
  slice(32:37)
)

## # A tibble: 6 × 2
##   Sequence                 Modifications                              
##   <chr>                    <chr>                                      
## 1 GYSFTTTAER               ""                                         
## 2 HTFYNELR                 ""                                         
## 3 LCYVALDFEQEMATAASSSSLEK  "1×Carbamidomethyl [C2]"                   
## 4 LCYVALDFEQEMATAASSSSLEK  "1×Carbamidomethyl [C2]; 1×Oxidation [M12]"
## 5 LFQPSFLGMESCGIHETTFNSIMK "1×Carbamidomethyl [C12]"                  
## 6 IWHHTFYNELR              ""

calculatePositionPercentage

To find out the ‘amount’ of oxidation at position 227 of this protein:

calculatePositionPercentage(db = db,
                            peptideTable = pepts$P60709,
                            modificationLocation = 227,
                            modificationName = "Oxidation",
                            accession = "P60709")

##   modification location percentage
## 1    Oxidation      227  0.8948326

To get a better view of the peptides involved in this calculation:

calculatePositionPercentage(db = db,
                            peptideTable = pepts$P60709,
                            modificationLocation = 227,
                            modificationName = "Oxidation",
                            accession = "P60709", giveTable = TRUE) %>%
  dplyr::select(-c(PeptideGroupID, startLocation, endLocation, peptideLocation, modificationPresent)) %>%
  tibble()

## # A tibble: 3 × 3
##   Sequence                Modificationsallpossiblesites             Abundances_1
##   <chr>                   <chr>                                            <dbl>
## 1 LCYVALDFEQEMATAASSSSLEK "1×Carbamidomethyl [C2]"                     78716943.
## 2 LCYVALDFEQEMATAASSSSLEK "1×Carbamidomethyl [C2]; 1×Oxidation [M1…      838098.
## 3 VALDFEQEMATAASSSSLEK    ""                                           14104666.

As another example, part of the peptide table of protein P06733 :

tibble(
  pepts$P06733 %>% dplyr::select(Sequence, Modifications) %>%
    slice(25:35)
)

## # A tibble: 11 × 2
##    Sequence               Modifications                               
##    <chr>                  <chr>                                       
##  1 LMIEMDGTENK            "1×Oxidation [M]"                           
##  2 MILPVGAANFR            ""                                          
##  3 MILPVGAANFR            "1×Oxidation [M1]"                          
##  4 NFRNPLAK               ""                                          
##  5 SCNCLLLK               "2×Carbamidomethyl [C2; C4]"                
##  6 SGETEDTFIADLVVGLCTGQIK ""                                          
##  7 SGETEDTFIADLVVGLCTGQIK "1×Carbamidomethyl [C17]"                   
##  8 SGKYDLDFK              ""                                          
##  9 TIAPALVSK              ""                                          
## 10 VNQIGSVTESLQACK        "1×Carbamidomethyl [C14]"                   
## 11 VNQIGSVTESIQACK        "1×Carbamidomethyl [C14]; 1×Deamidated [N2]"

And the peptides containing location 389:

calculatePositionPercentage(db = db,
                            peptideTable = pepts$P06733,
                            modificationLocation = 389,
                            modificationName = "Carbamidomethyl",
                            accession = "P06733", giveTable = TRUE) %>%
  dplyr::select(-c(PeptideGroupID, startLocation, endLocation, peptideLocation, modificationPresent)) %>%
  tibble()

## # A tibble: 2 × 3
##   Sequence               Modificationsallpossiblesites Abundances_1
##   <chr>                  <chr>                                <dbl>
## 1 SGETEDTFIADLVVGLCTGQIK ""                                 448209.
## 2 SGETEDTFIADLVVGLCTGQIK "1×Carbamidomethyl [C17]"        15261784.

…which shows that carbamidomethylation was not 100% in this case.

proteinModificationTable

To get a complete modification table for protein P06733:

modsHela <- proteinModificationTable(db = db,
                                     modificationTable = dbGetModificationsSitesIDs(db = db,
                                                                                    proteinUniqueIDs = dbGetProteinUniqueSequenceIDs(db = db, accession = "P06733")) %>%
                                       dbGetModificationsTable(db = db),
                                     peptideTable = pepts$P06733,
                                     accession = "P06733")
modsHela %>% dplyr::arrange(location)

##       modification location   percentage
## 1        Oxidation       94   5.79275735
## 2        Oxidation       97   5.79275735
## 3  Carbamidomethyl      119 100.00000000
## 4       Deamidated      151  21.74990484
## 5       Deamidated      154  21.74990484
## 6       Deamidated      161  22.20960751
## 7        Oxidation      165   9.15137617
## 8        Oxidation      169   5.68686496
## 9       Deamidated      177   0.02789415
## 10   Glu->pyro-Glu      222   7.53846164
## 11       Oxidation      244  15.54165606
## 12      Deamidated      324   0.00000000
## 13 Carbamidomethyl      337 100.00000000
## 14 Carbamidomethyl      339 100.00000000
## 15      Deamidated      345   1.33814185
## 16 Carbamidomethyl      357 100.00000000
## 17      Deamidated      363  12.07074366
## 18       Oxidation      368   1.97680383
## 19 Carbamidomethyl      389  97.14698319

Please note:

these modification functions work with both peptide & psms tables
peptide & psms tables can contain different annotations for modifications
- peptide tables use the Name, as found in the FoundModifications table, for modifications
- psms tables use the Abbreviation, as found in the FoundModifications table, for modifications
- For these functions it’s better to replace the names in peptide tables with the abbreviations, otherwise the function output may not be correct
Proteome Discoverer is not always able to distinguish between two adjacent positions (eg CC, or MM), this will lead to oddities and should be corrected/handled manually
these functions can be rather slow due to repeated database queries. A solution to this may be to load the relevant tables to a database in memory (:memory:) and to work from there.

dbClose(db)

Closing remarks

Hopefully these examples show some of the things that can be done with the proteinDiscover package. I admit that especially the modification part is rather confusing. ‘Soon’ a shiny application will be released which can help automate some of the work.