This manual gives a few examples for the ‘extensions’ in the R functions provided by proteinDiscoverExtra. Some of these functions may eventually be included in the package itself, though most of them will not. The reason for this is that the functions are not very general; they’re usually specific for something I needed at some point when using the package itself. Interference Free Index calculations are an exception as they are already in the package.
library(proteinDiscover)
library(BBPersonalR)
The library BBPersonalR is needed for some of the (graphic) routines. It should be easy to replace them with your own favourite scatterplot/boxplot/etc.
For the functions to be usable, nothing but proteinDiscover itself and simple call to the source function is needed
library(purrr)
walk(list.files(path = "../proteinDiscoverExtra/R", full.names = TRUE), source)
This function creates a signal to noise plot for the reporter ions of TMT analysis. First the pdResult files is opened, and then the RetentionTime and AverageReporterSN columns from the TargetPsms table are read and plotted.
db <- dbOpen("/home/ben/Documents/Thermo/TKO-Example/TKOTMT10plex.pdResult")
arsnPlot(db = db)
Note: the file used is an example file which Thermo Scientific uses in their courses. It contains a measurement of TMT10-plex sample of a triple knockout proteomics standard.
This function reads the AverageReporterSN data from the TargetPsms table and calculates which proportion is above a user-set level.
aboveARSN(db = db, cutOff = 10)
## [1] 99.75847
Together with the arsnPlot they can be used for semi-automated quality control.
Essentially this is a scatter plot function to be able to quickly plot some of the columns in database tables. The (Default) plot below shows the delta m/z of all peptide spectral matches (PSM’s) in the analysis.
pPlot(db = db)
Note that the info from the databases does not get translated via dfTransformRaw, but with some extra code you can still plot data:
pPlot(db = db, tData = dbGetPeptideTable(db = db) %>% dfTransformRaws(),
xColumn = "RTminbySearchEngine_1",
yColumn = "DeltaMppmbySearchEngine_1",
xLabel = "RetentionTime",
yLabel = "DeltaMassInPPM")
The Interference Free Index (IFI) is a tool to see if there is ratio distortion due to ‘background’ signal. An article describing what is meant exactly can be found: Paulo et al, Journal American Society Mass Spectrometry, 2016, vol 27, issue 10, page 1620-1625. The formula is:
\[ \begin{aligned} IFI = 1 - {\frac{average S/N_{knockout}}{average S/N_{not knockout}}} \end{aligned} \]
Calculating this quality measure is not difficult, it just take some messing with the data in the tables. You can calculate it from the protein table, but unfortunately this does not give you a sense of variability. That’s why the calcIFIs & calcAllIFIs functions work with the peptide tables: for each peptide of a protein the IFI is calculated (using the abundances in the channels). These values can be used to calculated an average protein IFI while also allowing to calculate the variation.
calcIFIs simply calculates the IFI of a single knockout protein across the TMT channels. For the His4 protein in the example file, which uses TMT10 plex:
IFIHis4 <- calcIFIs(db, selected = "His4", groups = tmt10Channels())
tibble(IFIHis4)
## # A tibble: 22 × 2
## Short IFI
## <chr> <dbl>
## 1 His4 0.569
## 2 His4 0.480
## 3 His4 0.906
## 4 His4 0.641
## 5 His4 0.491
## 6 His4 0.158
## 7 His4 0.944
## 8 His4 0.913
## 9 His4 0.833
## 10 His4 0.593
## # ℹ 12 more rows
Which gives us the average 0.6495927 with standard deviation 0.2108972 from 22 peptides of the protein His4 (Uniprot Accession: P00815).
Using the cakcAllIFIs function we can calculate the IFI’s for all three strains quickly and create a boxplot type of graph:
allIFIs <- calcAllIFIs(db = db, groups = tmt10Channels())
statBoxPlotMultiple(data = allIFIs, column = "IFI", melted = TRUE, varColumn = "Short",
showLegend = FALSE, yDefault = FALSE,
yLimits = c(0,1),yExpand = c(0,0.05),
yLabel = "Interference Free Index", xLabel = "Strain",
title = paste0("MS^2 ", getAcquistionDateTime(db = db)[1]))
The result may seem only soso, but this example pdResult file was coming from an MS2 type of analysis. An MS3 analysis would result in a much better IFI plot.
To see the normalized abundances of the separate TMT channels we can provide the function getPeptideInfo with the protein accessions we wish to query. Below a boxplot of one of the ‘knocked out’ proteins and a ‘non knocked out’ protein can be seen.
peptidesTKO <- getPeptideInfo(db = db, columns = "AbundancesNormalized",
proteinAccessions = knockOutProteins()$Accession)
statBoxPlotMultiple(data = peptidesTKO$P00815 %>%
dplyr::select(starts_with("AbundancesNormalized_")) %>%
rename_with(~as.character(1:10)), legend.title = "TMT Channel",
title = "Protein P00815, His4 knockout")
statBoxPlotMultiple(data = peptidesTKO$P00925 %>%
dplyr::select(starts_with("AbundancesNormalized_")) %>%
rename_with(~as.character(1:10)), legend.title = "TMT Channel",
title = "Protein P00925, Eno2 (non knockout)")
For any interesting protein in a TMT experiment these plots can relatively easy be made. The volcano plot may indicate a protein as significantly up or down regulated, but how good/bad is this significance? Plots of the individual channels per ‘interesting’ protein may help.
dbClose(db)
The next few functions deal with modification data. The example file (not available at this moment) contains data from a Proteome Discoverer with label free quantification (minora feature detection) search of a 2 hour HPLC run of a HeLa Digest (commercially available). The Modification Sites node in the consensus workflow of Proteome Discoverer also needs to be present.
db <- dbOpen(fileName = "hela001.pdResult")
In the search the following dynamic modifications were used:
((workflowInfo(db)$nodeInfo$Processing %>%
nodes())$`Sequest HT` %>%
dplyr::filter(grepl(name,pattern = "Dynamic"), value != "None") %>%
dplyr::slice(-1) %>%
dplyr::select(value))$value
## [1] "Oxidation / +15.995 Da (M)" "Deamidated / +0.984 Da (N)"
## [3] "Carbamidomethyl / +57.021 Da (C)"
The 10 proteins with the best (highest summed) Sequest score are:
protsHela <- dbGetProteinTable(db = db) %>% dfTransformRaws() %>%
dplyr::arrange(desc(ScoreSequestHT_1)) %>%
dplyr::select(Accession, Description, ProteinGroupIDs, ScoreSequestHT_1, Coverage) %>%
dplyr::slice(1:10)
tibble(protsHela)
## # A tibble: 10 × 5
## Accession Description ProteinGroupIDs ScoreSequestHT_1 Coverage
## <chr> <chr> <chr> <dbl> <dbl>
## 1 P60709 Actin, cytoplasmic 1 OS=… 2218 1278. 58.4
## 2 P06733 Alpha-enolase OS=Homo sa… 3518 1200. 68.7
## 3 P21333 Filamin-A OS=Homo sapien… 1179 983. 55.9
## 4 O75369 Filamin-B OS=Homo sapien… 3093 932. 62.7
## 5 Q09666 Neuroblast differentiati… 2062 834. 53.1
## 6 P78527 DNA-dependent protein ki… 3605 762. 39.0
## 7 P31327 Carbamoyl-phosphate synt… 600 761. 62.5
## 8 P14618 Pyruvate kinase PKM OS=H… 3733 741. 64.4
## 9 P07437 Tubulin beta chain OS=Ho… 3730 731. 79.1
## 10 P04406 Glyceraldehyde-3-phospha… 2407 713. 70.7
If we want to get a closer look at the peptide modifications of the first 2 proteins:
pepts <-
lapply(protsHela$ProteinGroupIDs[1:2], function(x){
dbGetPeptideIDs(db = db, proteinGroupIDs = x) %>%
dbGetPeptideTable(db = db) %>%
dfTransformRaws()
})
names(pepts) <- protsHela$Accession[1:2]
Part of the peptide table of protein P60709 :
tibble(
pepts$P60709 %>% dplyr::select(Sequence, Modifications) %>%
slice(32:37)
)
## # A tibble: 6 × 2
## Sequence Modifications
## <chr> <chr>
## 1 GYSFTTTAER ""
## 2 HTFYNELR ""
## 3 LCYVALDFEQEMATAASSSSLEK "1×Carbamidomethyl [C2]"
## 4 LCYVALDFEQEMATAASSSSLEK "1×Carbamidomethyl [C2]; 1×Oxidation [M12]"
## 5 LFQPSFLGMESCGIHETTFNSIMK "1×Carbamidomethyl [C12]"
## 6 IWHHTFYNELR ""
To find out the ‘amount’ of oxidation at position 227 of this protein:
calculatePositionPercentage(db = db,
peptideTable = pepts$P60709,
modificationLocation = 227,
modificationName = "Oxidation",
accession = "P60709")
## modification location percentage
## 1 Oxidation 227 0.8948326
To get a better view of the peptides involved in this calculation:
calculatePositionPercentage(db = db,
peptideTable = pepts$P60709,
modificationLocation = 227,
modificationName = "Oxidation",
accession = "P60709", giveTable = TRUE) %>%
dplyr::select(-c(PeptideGroupID, startLocation, endLocation, peptideLocation, modificationPresent)) %>%
tibble()
## # A tibble: 3 × 3
## Sequence Modificationsallpossiblesites Abundances_1
## <chr> <chr> <dbl>
## 1 LCYVALDFEQEMATAASSSSLEK "1×Carbamidomethyl [C2]" 78716943.
## 2 LCYVALDFEQEMATAASSSSLEK "1×Carbamidomethyl [C2]; 1×Oxidation [M1… 838098.
## 3 VALDFEQEMATAASSSSLEK "" 14104666.
As another example, part of the peptide table of protein P06733 :
tibble(
pepts$P06733 %>% dplyr::select(Sequence, Modifications) %>%
slice(25:35)
)
## # A tibble: 11 × 2
## Sequence Modifications
## <chr> <chr>
## 1 LMIEMDGTENK "1×Oxidation [M]"
## 2 MILPVGAANFR ""
## 3 MILPVGAANFR "1×Oxidation [M1]"
## 4 NFRNPLAK ""
## 5 SCNCLLLK "2×Carbamidomethyl [C2; C4]"
## 6 SGETEDTFIADLVVGLCTGQIK ""
## 7 SGETEDTFIADLVVGLCTGQIK "1×Carbamidomethyl [C17]"
## 8 SGKYDLDFK ""
## 9 TIAPALVSK ""
## 10 VNQIGSVTESLQACK "1×Carbamidomethyl [C14]"
## 11 VNQIGSVTESIQACK "1×Carbamidomethyl [C14]; 1×Deamidated [N2]"
And the peptides containing location 389:
calculatePositionPercentage(db = db,
peptideTable = pepts$P06733,
modificationLocation = 389,
modificationName = "Carbamidomethyl",
accession = "P06733", giveTable = TRUE) %>%
dplyr::select(-c(PeptideGroupID, startLocation, endLocation, peptideLocation, modificationPresent)) %>%
tibble()
## # A tibble: 2 × 3
## Sequence Modificationsallpossiblesites Abundances_1
## <chr> <chr> <dbl>
## 1 SGETEDTFIADLVVGLCTGQIK "" 448209.
## 2 SGETEDTFIADLVVGLCTGQIK "1×Carbamidomethyl [C17]" 15261784.
…which shows that carbamidomethylation was not 100% in this case.
To get a complete modification table for protein P06733:
modsHela <- proteinModificationTable(db = db,
modificationTable = dbGetModificationsSitesIDs(db = db,
proteinUniqueIDs = dbGetProteinUniqueSequenceIDs(db = db, accession = "P06733")) %>%
dbGetModificationsTable(db = db),
peptideTable = pepts$P06733,
accession = "P06733")
modsHela %>% dplyr::arrange(location)
## modification location percentage
## 1 Oxidation 94 5.79275735
## 2 Oxidation 97 5.79275735
## 3 Carbamidomethyl 119 100.00000000
## 4 Deamidated 151 21.74990484
## 5 Deamidated 154 21.74990484
## 6 Deamidated 161 22.20960751
## 7 Oxidation 165 9.15137617
## 8 Oxidation 169 5.68686496
## 9 Deamidated 177 0.02789415
## 10 Glu->pyro-Glu 222 7.53846164
## 11 Oxidation 244 15.54165606
## 12 Deamidated 324 0.00000000
## 13 Carbamidomethyl 337 100.00000000
## 14 Carbamidomethyl 339 100.00000000
## 15 Deamidated 345 1.33814185
## 16 Carbamidomethyl 357 100.00000000
## 17 Deamidated 363 12.07074366
## 18 Oxidation 368 1.97680383
## 19 Carbamidomethyl 389 97.14698319
Please note:
dbClose(db)
Hopefully these examples show some of the things that can be done with the proteinDiscover package. I admit that especially the modification part is rather confusing. ‘Soon’ a shiny application will be released which can help automate some of the work.