The aim of this package is to enable working with the results generated by the Proteome Discoverer software released by Thermo Scientific.
Many things can be done in Proteome Discoverer itself: ordering, filtering, making graphs, etc. It is, however, not really possible (to the best of my knowledge) to analyse the results in a programmatic or semi-automated way. R scripts can be integrated in the workflow by creating scripting nodes, but this approach is rather limited.
This package allows for that kind of analysis, all directly from R.
The proteinDiscover package enables reading the data from a .pdResult file and using it in an R script or program. This is possible because the data in those files (and also in .msf files) is stored in an SQLite database. It is not quite as simple as that may seem, however:
Raw data in tables can be 'translated' in a lot of cases, but not always, since the format of some fields is unknown. XML data can be extracted, but it takes some effort to figure out exactly what information is stored and where.
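For reference, the underlying database can also be inspected directly with the DBI and RSQLite packages. This is independent of proteinDiscover, and which tables are present depends on the Proteome Discoverer version and workflow:

library(DBI)
library(RSQLite)

# Open the .pdResult file (here the example file used further below) as a
# plain SQLite database and list the tables it contains
con <- dbConnect(SQLite(), "TKOTMT10plex.pdResult")
head(dbListTables(con))
dbDisconnect(con)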
The most important data is, however, readily available through this package.
Some functions are still somewhat experimental; in particular, the ones dealing with the extraction of Proteome Discoverer methods, parameters, etc. have not been fully tested yet. This will improve over time.
Since the current version is still in the development stage, the package can only be installed via GitHub:
devtools::install_github("BenBruyneel/proteinDiscover")
Another possibility is to download it from GitHub and then install it as a local package.
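For example, a sketch of that route (the local path is just a placeholder for wherever the downloaded or cloned repository ends up):

# install from a local copy of the repository; the path is a placeholder
devtools::install("path/to/proteinDiscover")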
As usual, the package first needs to be loaded, after which a .pdResult file can be opened. For this, the wrapper function dbOpen can be used, which opens the SQLite database via the pool package.
library(proteinDiscover)
db <- dbOpen(fileName = "TKOTMT10plex.pdResult")
Note: this is an example file that Thermo Scientific uses in their courses. It contains a measurement of a TMT10-plex sample of a triple-knockout proteomics standard.
To load the protein table:
library(dplyr)
proteins <- dbGetProteinTable(db = db, columns = c("Accession",
"Description",
"ScoreSequestHT",
"AbundanceRatios",
"AbundanceRatioPValue"))
head(tibble(proteins))
## # A tibble: 6 × 5
## Accession Description ScoreSequestHT AbundanceRatios AbundanceRatioPValue
## <chr> <chr> <blob> <blob> <blob>
## 1 Q07457 E3 ubiquitin-pr… <raw 9 B> <raw 27 B> <raw 27 B>
## 2 P39704 Protein ERP2 [O… <raw 9 B> <raw 27 B> <raw 27 B>
## 3 E7QIA1 Protein HRI1 [O… <raw 9 B> <raw 27 B> <raw 27 B>
## 4 P46955 Beta-glucosidas… <raw 9 B> <raw 27 B> <raw 27 B>
## 5 P06634 ATP-dependent R… <raw 9 B> <raw 27 B> <raw 27 B>
## 6 P04076 argininosuccina… <raw 9 B> <raw 27 B> <raw 27 B>
As can be seen, three of the columns are in raw (<blob>) format. These can be 'translated' with the function dfTransformRaws:
library(knitr)
library(kableExtra)
proteins <- proteins %>% dfTransformRaws()
colnames(proteins)
## [1] "Accession" "Description" "ScoreSequestHT_1"
## [4] "AbundanceRatios_1" "AbundanceRatios_2" "AbundanceRatios_3"
## [7] "AbundanceRatioPValue_1" "AbundanceRatioPValue_2" "AbundanceRatioPValue_3"
proteins %>%
dplyr::select(Accession, ScoreSequestHT_1, starts_with("AbundanceRatios_")) %>%
arrange(desc(ScoreSequestHT_1)) %>%
slice(1:6) %>%
kbl(align = "ccccc",
digits = c(0,2,3,3,3),
col.names = c("Accession", "Sequest Score", "Ratio 1", "Ratio 2", "Ratio 3")) %>%
kable_minimal()
Accession | Sequest Score | Ratio 1 | Ratio 2 | Ratio 3 |
---|---|---|---|---|
P00359 | 538.30 | 0.907 | 0.865 | 0.861 |
P00925 | 384.58 | 0.935 | 0.850 | 0.925 |
P00924 | 333.73 | 1.075 | 1.448 | 1.143 |
P00358 | 333.16 | 0.912 | 0.953 | 0.937 |
P07149 | 324.33 | 1.027 | 1.126 | 1.133 |
P19097 | 290.25 | 1.032 | 1.148 | 1.123 |
The function dfTransformRaws attempts to guess what type (numeric or integer) a raw column might be, but it can fail or guess wrong. Use the other arguments of the function to correct errors or even to 'enforce' a particular conversion.
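To illustrate what such a 'translation' amounts to: a raw column typically packs one or more 8-byte doubles into a blob. Below is a minimal base R sketch of decoding such a blob by hand; the exact byte layout used by Proteome Discoverer is an assumption here, and dfTransformRaws normally takes care of all of this:

# Illustration only: interpret a raw blob as 64-bit doubles, optionally
# skipping a few leading header bytes (the layout is an assumption)
decodeRawDoubles <- function(rawBytes, skip = 0) {
  payload <- rawBytes[(skip + 1):length(rawBytes)]
  readBin(payload, what = "double", n = length(payload) %/% 8, size = 8)
}

# round trip with three doubles packed into a 24-byte blob
blob <- writeBin(c(0.907, 0.865, 0.861), raw())
decodeRawDoubles(blob)  # 0.907 0.865 0.861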
Having this information, it's quite easy to make, e.g., volcano plots:
plot(log2(proteins$AbundanceRatios_1), -log10(proteins$AbundanceRatioPValue_1),
xlab = "log2 abundance ratios", ylab = " -log10 p-Value",
main = "Volcano plot protein ratios 1")
A bit more fancy:
The code for this plot uses ggplot2 & grid and is a bit too long to include here.
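As an impression, a minimal ggplot2 sketch (not the original plot; the dashed significance/fold-change guides are arbitrary additions) could look like this:

library(ggplot2)

# minimal ggplot2 volcano plot sketch; the column names are the ones produced
# by dfTransformRaws() above, the dashed threshold lines are arbitrary
ggplot(proteins, aes(x = log2(AbundanceRatios_1),
                     y = -log10(AbundanceRatioPValue_1))) +
  geom_point(alpha = 0.5) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed") +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  labs(x = "log2 abundance ratios", y = "-log10 p-Value",
       title = "Volcano plot protein ratios 1") +
  theme_minimal()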
We can also make PCA plots with the normalized protein abundances:
proteins <- dbGetProteinTable(db = db, columns = c("Accession",
"AbundancesNormalized")) %>%
dfTransformRaws()
library(factoextra)
library(FactoMineR)
tproteins <- t(proteins %>%
dplyr::select(-Accession) %>%
na.omit())
samples <- c("His4","His4","His4",
"Met6","Met6","Met6",
"Ura2","Ura2","Ura2",
"Wildtype")
pcaprot <- PCA(tproteins[,-ncol(tproteins)],
scale.unit = TRUE, graph = FALSE, ncp = 20)
fviz_pca_ind(pcaprot,
geom = c("point"),
repel = TRUE,
col.ind = samples,
pointsize = 3,
mean.point = FALSE,
pointshape = 16,
axes = c(1,2),
title = "Score plot PCA of protein quantification")
Plotting the (peptide spectrum match) average reporter signal-to-noise during the analysis:
psms <- dbGetPsmTable(db = db, columns = c("RetentionTime", "AverageReporterSN", "DeltaMassInPPM"))
plot(psms$RetentionTime, psms$AverageReporterSN, log = "y",
xlab = "Retention Time (min)", ylab = "Average Reporter S/N",
main = "Average Reporter S/N vs Retention Time")
A bit more fancy:
The code for this plot uses ggplot2 & grid and is a bit too long to include here.
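As an impression of what such a plot could look like, here is a short ggplot2 sketch (again not the original plot; the log10 y-axis and LOESS trend line are additions for illustration):

library(ggplot2)

# sketch: same data as the base R plot above, with a log10 y-axis and a
# smoothed trend line added for illustration
ggplot(psms, aes(x = RetentionTime, y = AverageReporterSN)) +
  geom_point(alpha = 0.3, size = 0.8) +
  geom_smooth(method = "loess", se = FALSE) +
  scale_y_log10() +
  labs(x = "Retention Time (min)", y = "Average Reporter S/N",
       title = "Average Reporter S/N vs Retention Time") +
  theme_minimal()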
Plotting the (peptide spectrum match) delta mass (in ppm) during the analysis:
plot(psms$RetentionTime, psms$DeltaMassInPPM,
xlab = "Retention Time (min)", ylab = "Delta Mass (ppm)",
main = "Delta Mass vs Retention Time")
A more complicated plot:
The code for this plot uses ggplot2 & grid and is a bit too long to include here.
It should be clear from these examples that the package gives access to a lot of data through relatively simple R commands, thus opening the way to full statistics and quality control in a (semi-)automated way.
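For example, a quick quality-control check of the mass accuracy takes only a couple of lines once the PSM table has been loaded (the 5 ppm cut-off is just an arbitrary example value):

# summary of the PSM mass accuracy and the fraction of PSMs within +/- 5 ppm
summary(psms$DeltaMassInPPM)
mean(abs(psms$DeltaMassInPPM) <= 5, na.rm = TRUE)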
There are two functions that extract information from the .pdResult file about the setup of the study:
adef <- proteinDiscover::analysisDefinition(db = db)
awf <- workflowInfo(db = db)
The first one will extract information which can then be processed further with the aid of another set of functions.
studyDefinitionFactors shows the setup of the study factors:
studyDefinitionFactors(adef) %>%
dplyr::select(-id)
## # A tibble: 1 × 4
## type name description factors
## <chr> <chr> <chr> <list>
## 1 Categorical strain "" <tibble [4 × 3]>
studyDefinitionFactors(adef)$factors[[1]] %>% dplyr::select(-id)
## # A tibble: 4 × 2
## type value
## <chr> <chr>
## 1 Categorical met6
## 2 Categorical his4
## 3 Categorical ura2
## 4 Categorical wildtype
studyDefinitionFileSets shows some data on the (raw) files used in the study:
studyDefinitionFileSets(adef) %>%
dplyr::select(Name, FileName, FileTime, FileSize)
## # A tibble: 1 × 4
## Name FileName FileTime FileSize
## <chr> <chr> <chr> <chr>
## 1 TKOTMT10plex_50min_120k_45k_86ms_top20_APDoff "C:\\WORK\\sw… 11/23/2… 1085937…
studyDefinitionExtensionSettings gives information on the ratios used for the relative quantification:
library(purrr)
adefExt <- studyDefinitionExtensionSettings(adef)
Here are the ratio definitions:
unname(map_chr(adefExt$QuanRatios, ~.x$RatioString))
## [1] "his4/wildtype" "met6/wildtype" "ura2/wildtype"
And info on the numerators and denominators of the “his4/wildtype” ratio:
adefExt$QuanRatios$`his4/wildtype`$NumeratorSamples %>% dplyr::select(-c(2:4))
## # A tibble: 3 × 4
## SampleName Quan_Channel Sample_Type strain
## <chr> <chr> <chr> <chr>
## 1 TKOTMT10plex_50min_120k_45k_86ms_top20_APDoff… 128N Sample his4
## 2 TKOTMT10plex_50min_120k_45k_86ms_top20_APDoff… 128C Sample his4
## 3 TKOTMT10plex_50min_120k_45k_86ms_top20_APDoff… 129N Sample his4
adefExt$QuanRatios$`his4/wildtype`$DenominatorSamples %>% dplyr::select(-c(2:4))
## # A tibble: 1 × 4
## SampleName Quan_Channel Sample_Type strain
## <chr> <chr> <chr> <chr>
## 1 TKOTMT10plex_50min_120k_45k_86ms_top20_APDoff… 131 Sample wildt…
It should be clear that most (if not all) information on the study definition can be extracted.
Warning: these study definition functions have not been tested extensively yet, so there may be instances where they do not work properly. Newer versions of the package may include some redesign of the code, but most likely the functions in their current form will be kept for the sake of continuity.
The workflowInfo function also extracts information that can then be processed further.
Some information is directly available in the result of the function:
tibble(awf$workflowInfo %>% dplyr::select(name, template, pc, type))
## # A tibble: 2 × 4
## name template pc type
## <chr> <chr> <chr> <chr>
## 1 TKOTMT10plex CWF_Comprehensive_Enhanced Annotation_Reporter_Quan DEBRE-… Cons…
## 2 TKOTMT10plex PWF_Fusion_TMT_Quan_MS2_SequestHT_Percolator DEBRE-… Proc…
tibble(awf$workflowInfo %>% dplyr::select(name, type, startDate, numberOfNodes))
## # A tibble: 2 × 4
## name type startDate numberOfNodes
## <chr> <chr> <chr> <int>
## 1 TKOTMT10plex Consensus 2020-01-15 07:10:55.3301677Z 14
## 2 TKOTMT10plex Processing 2020-01-15 07:10:51Z 5
Use nodeTable to get an overview of the consensus part of the workflow:
nodeTable(awf$nodeInfo$Consensus) %>%
dplyr::select(node, name, description)
## # A tibble: 14 × 3
## node name description
## <chr> <chr> <chr>
## 1 0 MSF Files "Selects the .msf file(s) to be processe…
## 2 1 PSM Grouper "Groups redundantly identified PSMs into…
## 3 2 Peptide Validator "Calculates confidences for PSMs and pep…
## 4 3 Peptide and Protein Filter "Filters the peptides and proteins accor…
## 5 4 Protein Scorer "Calculates the protein scores using the…
## 6 5 Protein Grouping "Groups the proteins into protein groups…
## 7 6 Peptide in Protein Annotation "Annotates the flanking residues, the po…
## 8 10 Reporter Ions Quantifier "Calculates quan ratios for PSMs, peptid…
## 9 7 Protein FDR Validator "Assigns a protein confidence based on t…
## 10 8 Protein Annotation "Retrieves proteins-centric annotation u…
## 11 9 Protein Marker "Marks proteins and their connected pept…
## 12 11 Result Statistics "Creates simple descriptive summary stat…
## 13 12 Display Settings "Persists a user default display filter …
## 14 13 Data Distributions "Visualizes data distributions for resul…
library(DiagrammeR)
nodeTable(awf$nodeInfo$Consensus) %>%
createDiagrammeRString() %>%
grViz()
Use the nodes function to get the settings used in one or more of the nodes in the workflow:
names(nodes(awf$nodeInfo$Consensus))
## [1] "MSF Files" "PSM Grouper"
## [3] "Peptide Validator" "Peptide and Protein Filter"
## [5] "Protein Scorer" "Protein Grouping"
## [7] "Peptide in Protein Annotation" "Reporter Ions Quantifier"
## [9] "Protein FDR Validator" "Protein Annotation"
## [11] "Protein Marker" "Result Statistics"
## [13] "Display Settings" "Data Distributions"
nodes(awf$nodeInfo$Consensus)$`Peptide Validator` %>%
dplyr::select(-c(category,advanced))
## # A tibble: 8 × 2
## name value
## <chr> <chr>
## 1 Validation Mode Auto…
## 2 Target FDR (Strict) for PSMs 0.01
## 3 Target FDR (Relaxed) for PSMs 0.05
## 4 Target FDR (Strict) for Peptides 0.01
## 5 Target FDR (Relaxed) for Peptides 0.05
## 6 Validation Based on q-Va…
## 7 Target/Decoy Selection for PSM Level FDR Calculation Based on Score Auto…
## 8 Reset Confidences for Nodes without Decoy Search (Fixed score threshold… False
nodes(awf$nodeInfo$Consensus)$`Reporter Ions Quantifier` %>%
dplyr::select(-c(category,advanced)) %>%
print(n=31)
## # A tibble: 31 × 2
## name value
## <chr> <chr>
## 1 Reporter Abundance Based On "Automatic"
## 2 Apply Quan Value Corrections "False"
## 3 Co-Isolation Threshold "50"
## 4 Average Reporter S/N Threshold "10"
## 5 SPS Mass Matches [%] Threshold "65"
## 6 Peptides to Use "Unique"
## 7 Consider Protein Groups for Peptide Uniqueness "True"
## 8 Use Shared Quan Results "True"
## 9 Reject Quan Results with Missing Channels "False"
## 10 Normalization Mode "Total Peptide Amount"
## 11 Proteins For Normalization ""
## 12 Scaling Mode "On All Average"
## 13 For Normalization "Use All Peptides"
## 14 For Protein Roll-Up "Use All Peptides"
## 15 For Pairwise Ratios "Exclude Modified"
## 16 1. Considered Peptide Modification "None"
## 17 2. Considered Peptide Modification "None"
## 18 3. Considered Peptide Modification "None"
## 19 N-Terminal Considered Peptide Modification "None"
## 20 C-Terminal Considered Peptide Modification "None"
## 21 N-Terminal Considered Protein Modification "None"
## 22 C-Terminal Considered Protein Modification "None"
## 23 Protein Ratio Calculation "Pairwise Ratio Based"
## 24 Maximum Allowed Fold Change "100"
## 25 Imputation Mode "None"
## 26 Hypothesis Test "t-test (Background Based)"
## 27 1st Fold Change Threshold "2"
## 28 2nd Fold Change Threshold "4"
## 29 3rd Fold Change Threshold "6"
## 30 4th Fold Change Threshold "8"
## 31 5th Fold Change Threshold "10"
Note: since the .pdResult file is a database, it should be closed properly when no more queries are going to be made:
dbClose(db)
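In longer scripts it can be convenient to guarantee that the connection is closed even when an error occurs along the way. A minimal sketch of such a helper (not part of the package; it only uses dbOpen, dbClose and dbGetProteinTable as shown above):

# hypothetical helper (not part of proteinDiscover): open a .pdResult file,
# run some code on it and make sure the connection gets closed afterwards
withPdResult <- function(fileName, expr) {
  db <- dbOpen(fileName = fileName)
  on.exit(dbClose(db), add = TRUE)
  expr(db)
}

# usage: retrieve protein accessions and descriptions, then close the file
proteinInfo <- withPdResult("TKOTMT10plex.pdResult", function(db) {
  dbGetProteinTable(db = db, columns = c("Accession", "Description"))
})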
Obviously this package is a work in progress, and a lot more testing is still required. New functions and procedures are usually added whenever a particular graph or data analysis is needed. Debugging is a continuous process, which will result in smaller updates every now and then. Serious errors and bugs are corrected as soon as possible.
Contact info: benbruyneel@gmail.com