proteinDiscover package

Introduction

Aim of the package

The aim for this package is to enable working with the results generated by the Proteome Discoverer software released by Thermo Scientific.

Many things can be done in Proteome Discoverer: ordering, filtering, making graphs, etc etc. It is however not really possible (to the best of my knowledge) to analyse the results in a programming or semi-automated way. You can integrate R scripts in the workflow by creating scripting nodes, but this is kind of limited.

This package allows for all that, including:

(partly) recalculating results
making complicated graphs
doing advanced analyses not available in Proteome Discoverer
creating markdown documents (html, pdf, word)
making shiny applicaions using the data

All without having:

to export the data via proteome discoverer
to work with any spreadsheet

The proteinDiscover package enables reading the data from a .pdResult file and using it in an r script or program. This is possible because the data in those files (and also in .msf files) is stored in a sqlite database. It’s not as simple as this may seem however:

some data is in a raw format which needs ‘translation’ to a numeric format
info is sometimes stored in <xml> format

Raw data in tables can be ‘translated’ in a lot of cases, but not always since the format of some fields is unknown. <xml> data can be extracted but it takes some effort to figure out what information exactly is stored and where.

The most important data is readily available through this package:

Proteome Discoverer methods (used to generate the .pdResult file)
proteins
peptides
peptide spectrum matches (psm)
consensus features
quantification spectrum information

Some functions are still in a somewhat experimental stage, especially the ones dealing with the extraction of the Proteome Discoverer methods, parameters, etc are not fully tested yet. As time goes on, this will be improved.

Installation instructions

Since the current version is still at the development stage, the package can only be installed via github:

devtools::install_github("BenBruyneel/proteinDiscover")

Another possibility is to download it from github and then install it from as a local package.

General use

Opening result file

As per usual the package first needs to be loaded, after which a .pdResult file can be opened. For this the wrapper function dbOpen can be used which opens the sqlite database via the pool package.

library(proteinDiscover)
db <- dbOpen(fileName = "TKOTMT10plex.pdResult")

Note: this is an example file which Thermo Scientific uses in their courses. It contains a measurement of TMT10-plex sample of a triple knockout proteomics standard.

To load the proteintable:

library(dplyr)
proteins <- dbGetProteinTable(db = db, columns = c("Accession",
                                                   "Description",
                                                   "ScoreSequestHT",
                                                   "AbundanceRatios",
                                                   "AbundanceRatioPValue"))
head(tibble(proteins))

## # A tibble: 6 × 5
##   Accession Description      ScoreSequestHT AbundanceRatios AbundanceRatioPValue
##   <chr>     <chr>                    <blob>          <blob>               <blob>
## 1 Q07457    E3 ubiquitin-pr…      <raw 9 B>      <raw 27 B>           <raw 27 B>
## 2 P39704    Protein ERP2 [O…      <raw 9 B>      <raw 27 B>           <raw 27 B>
## 3 E7QIA1    Protein HRI1 [O…      <raw 9 B>      <raw 27 B>           <raw 27 B>
## 4 P46955    Beta-glucosidas…      <raw 9 B>      <raw 27 B>           <raw 27 B>
## 5 P06634    ATP-dependent R…      <raw 9 B>      <raw 27 B>           <raw 27 B>
## 6 P04076    argininosuccina…      <raw 9 B>      <raw 27 B>           <raw 27 B>

Translation of raw columns

As can be seen three of the columns are in raw (<blob>) format. These can be ‘translated’ with the function dfTransformRaws

library(knitr)
library(kableExtra)
proteins <- proteins %>% dfTransformRaws()
colnames(proteins)

## [1] "Accession"              "Description"            "ScoreSequestHT_1"      
## [4] "AbundanceRatios_1"      "AbundanceRatios_2"      "AbundanceRatios_3"     
## [7] "AbundanceRatioPValue_1" "AbundanceRatioPValue_2" "AbundanceRatioPValue_3"

proteins %>%
  dplyr::select(Accession, ScoreSequestHT_1, starts_with("AbundanceRatios_")) %>%
  arrange(desc(ScoreSequestHT_1)) %>%
  slice(1:6) %>%
  kbl(align = "ccccc",
      digits = c(0,2,3,3,3),
      col.names = c("Accession", "Sequest Score", "Ratio 1", "Ratio 2", "Ratio 3")) %>%
  kable_minimal()

Accession	Sequest Score	Ratio 1	Ratio 2	Ratio 3
P00359	538.30	0.907	0.865	0.861
P00925	384.58	0.935	0.850	0.925
P00924	333.73	1.075	1.448	1.143
P00358	333.16	0.912	0.953	0.937
P07149	324.33	1.027	1.126	1.133
P19097	290.25	1.032	1.148	1.123

The function dfTransRaws attempts to guess what type (numeric or integer) a raw column might be, but can fail or be wrong. Use the other settings of the function to correct errors or even to ‘enforce’ the correct conversion.

Analysis of Results

Example volcano plot

Having this information, it’s quite easy to make eg volcano plots:

plot(log2(proteins$AbundanceRatios_1), -log10(proteins$AbundanceRatioPValue_1),
     xlab = "log2 abundance ratios", ylab = " -log10 p-Value",
     main = "Volcano plot protein ratios 1")

A bit more fancy:

Code for this plot uses ggplot2 & grid and is a bit too long too include

In this plot:

bottom: density plot of the log2 abundance ratios, showing their distribution
left: density plot of the -log10 p-values, showing their distribution
bottom-left: qq-plot to examine the ‘normality’ of the log2 abundance ratio distribution
red dots: significant (p-Value < 0.05), and fold change >= 2 (log2 = 1) or fold change <= 0.5 (log2 = -1)

Example PCA plot

We can also make PCA plots with the normalized protein abundances:

proteins <- dbGetProteinTable(db = db, columns = c("Accession",
                                                   "AbundancesNormalized")) %>%
  dfTransformRaws()
library(factoextra)
library(FactoMineR)

tproteins <- t(proteins %>%
                 dplyr::select(-Accession) %>%
                 na.omit())

samples <- c("His4","His4","His4",
             "Met6","Met6","Met6",
             "Ura2","Ura2","Ura2",
             "Wildtype")

pcaprot <- PCA(tproteins[,-ncol(tproteins)],
               scale.unit = TRUE, graph = FALSE, ncp = 20)

fviz_pca_ind(pcaprot,
             geom = c("point"),
             repel = TRUE,
             col.ind = samples,
             pointsize = 3,
             mean.point = FALSE,
             pointshape = 16,
             axes = c(1,2),
             title = "Score plot PCA of protein quantification")

Examples quality control

Plotting (peptide spectrum match) average signal to noise during the analysis:

psms <- dbGetPsmTable(db = db, columns = c("RetentionTime", "AverageReporterSN", "DeltaMassInPPM"))
plot(psms$RetentionTime, psms$AverageReporterSN, log = "y",
     xlab = "Retention Time (min)", ylab = "Average Reporter S/N",
     main = "Average Reporter S/N vs Retention Time")

A bit more fancy:

Code for this plot uses ggplot2 & grid and is a bit too long too include

Plotting (peptide spectrum match) average signal to noise during the analysis:

plot(psms$RetentionTime, psms$DeltaMassInPPM,
     xlab = "Retention Time (min)", ylab = "Delta Mass (ppm)",
     main = "Delta Mass vs Retention Time")

A more complicated plot:

Code for this plot uses ggplot2 & grid and is a bit too long too include

It should be clear from these examples that the package gives access to a lot of data by relative simple R commands, thus opening the way to full statistics and quality control in a (semi-) automated way.

Proteome Discover methods

General

There are two functions that extract information from the .pdResult file about the setup of the study:

analysisDefinition
workflowInfo

adef <- proteinDiscover::analysisDefinition(db = db)
awf <- workflowInfo(db = db)

Study Definition

The first one will extract information which can then be processed further with the aid of another set of functions.

studyDefinitionFactors will show the setup of the study factors

studyDefinitionFactors(adef) %>%
  dplyr::select(-id)

## # A tibble: 1 × 4
##   type        name   description factors         
##   <chr>       <chr>  <chr>       <list>          
## 1 Categorical strain ""          <tibble [4 × 3]>

studyDefinitionFactors(adef)$factors[[1]] %>% dplyr::select(-id)

## # A tibble: 4 × 2
##   type        value   
##   <chr>       <chr>   
## 1 Categorical met6    
## 2 Categorical his4    
## 3 Categorical ura2    
## 4 Categorical wildtype

studyDefinitionFileSet shows some data on the (raw) files used in the study

studyDefinitionFileSets(adef) %>%
  dplyr::select(Name, FileName, FileTime, FileSize)

## # A tibble: 1 × 4
##   Name                                          FileName       FileTime FileSize
##   <chr>                                         <chr>          <chr>    <chr>   
## 1 TKOTMT10plex_50min_120k_45k_86ms_top20_APDoff "C:\\WORK\\sw… 11/23/2… 1085937…

studyDefinitionExtensionSettings will give information of the ratios used for the relative quantification

library(purrr)
adefExt <- studyDefinitionExtensionSettings(adef)

Here are the ratio definitions:

unname(map_chr(adefExt$QuanRatios, ~.x$RatioString))

## [1] "his4/wildtype" "met6/wildtype" "ura2/wildtype"

And info on the numerators and denominators of the “his4/wildtype” ratio:

adefExt$QuanRatios$`his4/wildtype`$NumeratorSamples %>% dplyr::select(-c(2:4))

## # A tibble: 3 × 4
##   SampleName                                     Quan_Channel Sample_Type strain
##   <chr>                                          <chr>        <chr>       <chr> 
## 1 TKOTMT10plex_50min_120k_45k_86ms_top20_APDoff… 128N         Sample      his4  
## 2 TKOTMT10plex_50min_120k_45k_86ms_top20_APDoff… 128C         Sample      his4  
## 3 TKOTMT10plex_50min_120k_45k_86ms_top20_APDoff… 129N         Sample      his4

adefExt$QuanRatios$`his4/wildtype`$DenominatorSamples %>% dplyr::select(-c(2:4))

## # A tibble: 1 × 4
##   SampleName                                     Quan_Channel Sample_Type strain
##   <chr>                                          <chr>        <chr>       <chr> 
## 1 TKOTMT10plex_50min_120k_45k_86ms_top20_APDoff… 131          Sample      wildt…

It should be clear that most (if not all) info on study definition can be extracted.

Warning: these study definition functions have not been tested extensively yet, so there may be instances where it doesn’t work properly. Newer versions of the package may have some redesign of the code. Most likely the functions in their current form will be kept for the sake of continuity.

Workflows

The workflowInfo function also extracts information that can then be processed further.

Some info comes immediately out of the result of the function:

tibble(awf$workflowInfo %>% dplyr::select(name, template, pc, type))

## # A tibble: 2 × 4
##   name         template                                            pc      type 
##   <chr>        <chr>                                               <chr>   <chr>
## 1 TKOTMT10plex CWF_Comprehensive_Enhanced Annotation_Reporter_Quan DEBRE-… Cons…
## 2 TKOTMT10plex PWF_Fusion_TMT_Quan_MS2_SequestHT_Percolator        DEBRE-… Proc…

tibble(awf$workflowInfo %>% dplyr::select(name, type, startDate, numberOfNodes))

## # A tibble: 2 × 4
##   name         type       startDate                    numberOfNodes
##   <chr>        <chr>      <chr>                                <int>
## 1 TKOTMT10plex Consensus  2020-01-15 07:10:55.3301677Z            14
## 2 TKOTMT10plex Processing 2020-01-15 07:10:51Z                     5

Use nodeTable to get an overview of the consensus part of the workflow:

nodeTable(awf$nodeInfo$Consensus) %>%
  dplyr::select(node, name, description)

## # A tibble: 14 × 3
##    node  name                          description                              
##    <chr> <chr>                         <chr>                                    
##  1 0     MSF Files                     "Selects the .msf file(s) to be processe…
##  2 1     PSM Grouper                   "Groups redundantly identified PSMs into…
##  3 2     Peptide Validator             "Calculates confidences for PSMs and pep…
##  4 3     Peptide and Protein Filter    "Filters the peptides and proteins accor…
##  5 4     Protein Scorer                "Calculates the protein scores using the…
##  6 5     Protein Grouping              "Groups the proteins into protein groups…
##  7 6     Peptide in Protein Annotation "Annotates the flanking residues, the po…
##  8 10    Reporter Ions Quantifier      "Calculates quan ratios for PSMs, peptid…
##  9 7     Protein FDR Validator         "Assigns a protein confidence based on t…
## 10 8     Protein Annotation            "Retrieves proteins-centric annotation u…
## 11 9     Protein Marker                "Marks proteins and their connected pept…
## 12 11    Result Statistics             "Creates simple descriptive summary stat…
## 13 12    Display Settings              "Persists a user default display filter …
## 14 13    Data Distributions            "Visualizes data distributions for resul…

library(DiagrammeR)
nodeTable(awf$nodeInfo$Consensus) %>%
  createDiagrammeRString() %>%
  grViz()

Use the nodes function to get the settings used in one or more of the nodes in the workflow

names(nodes(awf$nodeInfo$Consensus))

##  [1] "MSF Files"                     "PSM Grouper"                  
##  [3] "Peptide Validator"             "Peptide and Protein Filter"   
##  [5] "Protein Scorer"                "Protein Grouping"             
##  [7] "Peptide in Protein Annotation" "Reporter Ions Quantifier"     
##  [9] "Protein FDR Validator"         "Protein Annotation"           
## [11] "Protein Marker"                "Result Statistics"            
## [13] "Display Settings"              "Data Distributions"

nodes(awf$nodeInfo$Consensus)$`Peptide Validator` %>%
  dplyr::select(-c(category,advanced))

## # A tibble: 8 × 2
##   name                                                                     value
##   <chr>                                                                    <chr>
## 1 Validation Mode                                                          Auto…
## 2 Target FDR (Strict) for PSMs                                             0.01 
## 3 Target FDR (Relaxed) for PSMs                                            0.05 
## 4 Target FDR (Strict) for Peptides                                         0.01 
## 5 Target FDR (Relaxed) for Peptides                                        0.05 
## 6 Validation Based on                                                      q-Va…
## 7 Target/Decoy Selection for PSM Level FDR Calculation Based on Score      Auto…
## 8 Reset Confidences for Nodes without Decoy Search (Fixed score threshold… False

nodes(awf$nodeInfo$Consensus)$`Reporter Ions Quantifier` %>%
  dplyr::select(-c(category,advanced)) %>%
  print(n=31)

## # A tibble: 31 × 2
##    name                                           value                      
##    <chr>                                          <chr>                      
##  1 Reporter Abundance Based On                    "Automatic"                
##  2 Apply Quan Value Corrections                   "False"                    
##  3 Co-Isolation Threshold                         "50"                       
##  4 Average Reporter S/N Threshold                 "10"                       
##  5 SPS Mass Matches [%] Threshold                 "65"                       
##  6 Peptides to Use                                "Unique"                   
##  7 Consider Protein Groups for Peptide Uniqueness "True"                     
##  8 Use Shared Quan Results                        "True"                     
##  9 Reject Quan Results with Missing Channels      "False"                    
## 10 Normalization Mode                             "Total Peptide Amount"     
## 11 Proteins For Normalization                     ""                         
## 12 Scaling Mode                                   "On All Average"           
## 13 For Normalization                              "Use All Peptides"         
## 14 For Protein Roll-Up                            "Use All Peptides"         
## 15 For Pairwise Ratios                            "Exclude Modified"         
## 16 1. Considered Peptide Modification             "None"                     
## 17 2. Considered Peptide Modification             "None"                     
## 18 3. Considered Peptide Modification             "None"                     
## 19 N-Terminal Considered Peptide Modification     "None"                     
## 20 C-Terminal Considered Peptide Modification     "None"                     
## 21 N-Terminal Considered Protein Modification     "None"                     
## 22 C-Terminal Considered Protein Modification     "None"                     
## 23 Protein Ratio Calculation                      "Pairwise Ratio Based"     
## 24 Maximum Allowed Fold Change                    "100"                      
## 25 Imputation Mode                                "None"                     
## 26 Hypothesis Test                                "t-test (Background Based)"
## 27 1st Fold Change Threshold                      "2"                        
## 28 2nd Fold Change Threshold                      "4"                        
## 29 3rd Fold Change Threshold                      "6"                        
## 30 4th Fold Change Threshold                      "8"                        
## 31 5th Fold Change Threshold                      "10"

Finishing

Note: since the .pdResult file is a database, it should be closed nicely when no more requests are going to be made

dbClose(db)

Closing remarks

Obviously this package is a work in progress. A lot more testing is still required. Usually there’s a need for some graph or data analysis, which causes functions/procedures to be added to the package. Debugging is a continuous process which will cause smaller updates every now and then. Serious errors and bugs are corrected as soon as possible.

Contact info: benbruyneel@gmail.com

proteinDiscover package

Ben Bruyneel

31/10/2023

Introduction

Aim of the package

Installation instructions

General use

Opening result file

Translation of raw columns

Analysis of Results

Example volcano plot

Example PCA plot

Examples quality control

Proteome Discover methods

General

Study Definition

Workflows

Finishing

Closing remarks