::install_github("BenBruyneel/proteinAnnotation") devtools
Protein Annotation
An R-package for the display of protein information
Introduction
Aim of this package
The main reason for writing this package stems from my attempts to mine and visualize experimental protein data. It draws some inspiration from the drawProteins package, though the package here is more geared towards working with the data generated from protein digest experiments.
Since I work mainly with proteomics data coming from the Proteome Discoverer software of ThermoFisher Scientific, a lot of the package’s functions are geared to preparing/wrangling data/results coming from that particular software. It should be relatively easy to create similar functions for other data formats.
Installation instructions
Since the current version is still at the development stage, the package can only be installed via github:
General use
library(proteinAnnotation)
Test data
First we need to get the protein sequence data into a data.frame. The package contains the a function standardProtein
which returns the amino acid sequence for either Ovalbumin (OVA) or Bovine Serum Albumin (BSA). This function is only a helper function for examples & testing.
# Amino acid sequence of Ovalbumin
standardProtein()
[1] "MGSIGAASMEFCFDVFKELKVHHANENIFYCPIAIMSALAMVYLGAKDSTRTQINKVVRFDKLPGFGDSIEAQCGTSVNVHSSLRDILNQITKPNDVYSFSLASRLYAEERYPILPEYLQCVKELYRGGLEPINFQTAADQARELINSWVESQTNGIIRNVLQPSSVDSQTAMVLVNAIVFKGLWEKAFKDEDTQAMPFRVTEQESKPVQMMYQIGLFRVASMASEKMKILELPFASGTMSMLVLLPDEVSGLEQLESIINFEKLTEWTSSNVMEERKIKVYLPRMKMEEKYNLTSVLMAMGITDVFSSSANLSGISSAESLKISQAVHAAHAEINEAGREVVGSAEAGVDAASVSEEFRADHPFLFCIKHIATNAVLFFGRCVSP"
Another helper function with some test data is the OVATable
function which generates a protein table or a peptide table with some (fake) experimental data. The format of the these data.frames is the same as what comes out of Proteome Discoverer. The names of the columns may differ a bit depending on how you get the data out of that program. What is generated here is what you get when you use the proteinDiscover package. The protein/peptide data is all related to Ovalbumin.
library(dplyr)
library(knitr)
# Protein Table
OVATable() |> tibble()
# A tibble: 1 × 3
Sequence Accession SequenceCoverage
<chr> <chr> <chr>
1 GSIGAASMEFCFDVFKELKVHHANENIFYCPIAIMSALAMVYLGAKDSTR… P01012 385;0-380
# Peptide table
OVATable("peptide") |> head() |> kable()
Sequence | Modifications | PositionsinProteins | XCorrbySearchEngine_1 | Abundances_1 | Abundances_2 |
---|---|---|---|---|---|
ISQAVHAAHAEINEAGR | P00000 [323-339]; P01012 [323-339] | 7.746809 | 6922540038 | 7817180355 | |
ADHPFLFCIK | 1×Carbamidomethyl [C8] | P00000 [360-369]; P01012 [360-369] | 4.090933 | 6459800097 | 11641298738 |
AFKDEDTQAMPFR | P00000 [187-199]; P01012 [187-199] | 5.751740 | 5561482511 | 8982076379 | |
LTEWTSSNVMEER | P00000 [264-276]; P01012 [264-276] | 5.085020 | 4562266198 | 5516495806 | |
GGLEPINFQTAADQAR | P00000 [127-142]; P01012 [127-142] | 5.391547 | 4475159720 | 5717628078 | |
YPILPEYLQCVK | 1×Carbamidomethyl [C10] | P00000 [111-122]; P01012 [111-122] | 5.270951 | 2387118310 | 5403322654 |
Protein data
Create protein data
createProteinData
is an important function in this package because it takes a protein sequence and creates a data.frame which can then be used for the rest of available functions.
createProteinData(sequence = standardProtein(), start = 2, nterm = 1) |> head()
position sequence
1 1
2 2 G
3 3 S
4 4 I
5 5 G
6 6 A
createProteinData(sequence = standardProtein("bsa"), start = 25) |> head()
position sequence
1 25 D
2 26 T
3 27 H
4 28 K
5 29 S
6 30 E
Protein Data manipulation
These data.frame’s can be modified to add data. This can be both numerical and non-numerical data.
An example of adding all positions in the protein where the amino acid lysine (K) is present:
<- createProteinData(sequence = OVATable("protein")$Sequence)
ovaProtein $K <- ifelse(ovaProtein$sequence == "K", "Lysine", " ")
ovaProtein|> slice(15:25) ovaProtein
position sequence K
1 15 F
2 16 K Lysine
3 17 E
4 18 L
5 19 K Lysine
6 20 V
7 21 H
8 22 H
9 23 A
10 24 N
11 25 E
Another way of adding data to a protein data data.frame is using the addDataToProtein
function. It takes a data.frame with a set of positions and the data to put ‘onto’ those positions.
# find all positions of K or R in the protein
<- data.frame(position = ovaProtein$position[
trypticSites $sequence == "K" | ovaProtein$sequence == "R"],
ovaProteindata = "tryptic")
|> head() trypticSites
position data
1 16 tryptic
2 19 tryptic
3 46 tryptic
4 50 tryptic
5 55 tryptic
6 58 tryptic
# add the new data to the protein data
<- addDataToProtein(proteinDF = ovaProtein, dataframe = trypticSites,
ovaProtein dataColumn = "data", dataName = "trypticSite", NAValue = "")
276:285,] ovaProtein[
position sequence K trypticSite
276 276 R tryptic
277 277 K Lysine tryptic
278 278 I
279 279 K Lysine tryptic
280 280 V
281 281 Y
282 282 L
283 283 P
284 284 R tryptic
285 285 M
Obviously the tryptic sites as determined by the above code leave out that if Lysine (K) or Arginine (R) follows a Proline (P) it will not be digested (by Trypsin). Of course you can code that in a somewhat more elaborate piece of R code, but the function getPeptideStart
can help as it allows to find patterns in a protein sequence and regular expressions can be used!
Note: the data to be added can be of any class!
# find all tryptic sites w/o a preceding Proline (P)
<- getPeptideStart(proteinSequence = OVATable("protein")$Sequence,
trypticSites peptideSequence = "(?<!P)R|(?<!P)K")
|> head() trypticSites
[1] 16 19 46 50 55 58
<- addDataToProtein(proteinDF = ovaProtein,
ovaProtein dataframe = data.frame(position = trypticSites,
data = "tryptic"),
dataColumn = "data",
dataName = "realTrypticSite", NAValue = "non tryptic")
276:285,] ovaProtein[
position sequence K trypticSite realTrypticSite
276 276 R tryptic tryptic
277 277 K Lysine tryptic tryptic
278 278 I non tryptic
279 279 K Lysine tryptic tryptic
280 280 V non tryptic
281 281 Y non tryptic
282 282 L non tryptic
283 283 P non tryptic
284 284 R tryptic non tryptic
285 285 M non tryptic
Adding peptide information
Data can also be added via the mapPeptidesToProtein
function. For this function you need the information to be added in a data.frame with three required columns: Sequence, Peptide Position and the data to be added. The Sequence and Peptide Position are needed to figure out the positions of the amino acids in the protein sequence. The peptide position coming from Proteome Discoverer has a certain format: <Accession> [start-end] ; ...
. This is used to extract the start position. An example of this info:
# info needed to map information to protein position
OVATable("peptide") %>%
select(Sequence, PositionsinProteins) %>%
slice(1:5) %>% tibble()
# A tibble: 5 × 2
Sequence PositionsinProteins
<chr> <chr>
1 ISQAVHAAHAEINEAGR P00000 [323-339]; P01012 [323-339]
2 ADHPFLFCIK P00000 [360-369]; P01012 [360-369]
3 AFKDEDTQAMPFR P00000 [187-199]; P01012 [187-199]
4 LTEWTSSNVMEER P00000 [264-276]; P01012 [264-276]
5 GGLEPINFQTAADQAR P00000 [127-142]; P01012 [127-142]
# peptide information in the data.frame()
OVATable("peptide") %>%
select(-c(Sequence, PositionsinProteins)) %>%
slice(1:5) %>% tibble()
# A tibble: 5 × 4
Modifications XCorrbySearchEngine_1 Abundances_1 Abundances_2
<chr> <dbl> <dbl> <dbl>
1 "" 7.75 6922540038 7817180355
2 "1×Carbamidomethyl [C8]" 4.09 6459800097 11641298738
3 "" 5.75 5561482511 8982076379
4 "" 5.09 4562266198 5516495806
5 "" 5.39 4475159720 5717628078
To map the info in the data.frame to the protein data, we can use:
# mapping Abundance information
<- mapPeptidesToProtein(proteinDF = ovaProtein, peptideTable = OVATable("peptide"),
ovaProtein Accession = "P01012", positionColumn = "PositionsinProteins", variable = "Abundances_1",
dataName = "Abundance", combineFunction = sum)
# mapping Score information
<- mapPeptidesToProtein(proteinDF = ovaProtein, peptideTable = OVATable("peptide"),
ovaProtein Accession = "P01012", positionColumn = "PositionsinProteins", variable = "XCorrbySearchEngine_1",
dataName = "Score", combineFunction = max)
35:45,] |> tibble() ovaProtein[
# A tibble: 11 × 7
position sequence K trypticSite realTrypticSite Abundance Score
<int> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 35 M " " "" non tryptic 0 0
2 36 S " " "" non tryptic 0 0
3 37 A " " "" non tryptic 0 0
4 38 L " " "" non tryptic 0 0
5 39 A " " "" non tryptic 45376732 2.06
6 40 M " " "" non tryptic 45376732 2.06
7 41 V " " "" non tryptic 45376732 2.06
8 42 Y " " "" non tryptic 45376732 2.06
9 43 L " " "" non tryptic 45376732 2.06
10 44 G " " "" non tryptic 45376732 2.06
11 45 A " " "" non tryptic 45376732 2.06
As can be seen, position 35-38 was not present in any of the peptides in the peptide information, so both Score and Abundance are both zero. Position 39-45 were present, so they get the score and abundances from the peptides that contain them. They way the function in the above code-block was called:
- Score at a position = maximum XCorr score of all peptides containing the position
- Abundance at a position = sum of all peptides containing the position
Displaying protein data
Plotting protein data
Obviously base R can be used to get some insights into the peptide info across protein positions
par(mfrow = c(2,1))
plot(ovaProtein$position, ovaProtein$Abundance, type = "l",
main = "Peptide Abundance across the protein",
xlab = "Position", ylab = "Abundance")
plot(ovaProtein$position, ovaProtein$Score, type = "l",
main = "Peptide Score across the protein",
xlab = "Position", ylab = "Score")
par(mfrow = c(1,1))
To make a bit prettier graphs, the function plotProteinVariable
can be used.
plotProteinVariable(proteinDF = ovaProtein, column = "Score", yLabel = "Score",
title = "Peptide Score across the protein")
plotProteinVariable(proteinDF = ovaProtein, column = "Abundance",
yLabel = "Abundance (log10)", yLog = TRUE,
ylogSetValue = 1E4, yLimits = c(1E3, 2E10),
drawYzero = TRUE, zeroType = "dotted", zeroAlpha = 0.5,
title = "Peptide Abundance across the protein",
caption = "Due to log10 transformation all 'zero' values were set to 10000")
Protein coverage
proteinCoverage
is a special function which can map peptide (or protein) information on to protein data. It takes a similar data.frame as mapPeptidesToProtein
, but uses only the Sequence and peptide position information to setup a so called coverage column which specifies positions in the protein are found in the peptide information data.frame. Example
proteinCoverage(sequence = standardProtein(),
peptideTable = OVATable("peptide"),
start = 2, nterm = 1,
Accession = "P01012", positionColumn = "PositionsinProteins") %>%
slice(56:65)
position sequence coverage
1 56 K 0
2 57 V 0
3 58 V 0
4 59 R 1
5 60 F 1
6 61 D 1
7 62 K 1
8 63 L 1
9 64 P 1
10 65 G 1
As can be seen, position 56 to 58 were ‘tagged’ with a 0 indicating none of the peptides in the peptide table contained these protein positions. The positions marked 1 were found in at least one of the peptides.
<- proteinCoverage(sequence = OVATable("protein")$Sequence,
ovaCoverage peptideTable = OVATable("peptide"),
Accession = "P01012", positionColumn = "PositionsinProteins")
<- bind_cols(ovaProtein,
ovaProtein %>% select(coverage))
ovaCoverage 56:65,] %>% tibble() ovaProtein[
# A tibble: 10 × 8
position sequence K trypticSite realTrypticSite Abundance Score coverage
<int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr>
1 56 V " " "" non tryptic 0 0 0
2 57 V " " "" non tryptic 0 0 0
3 58 R " " "tryptic" tryptic 0 0 0
4 59 F " " "" non tryptic 78966057 10.7 1
5 60 D " " "" non tryptic 78966057 10.7 1
6 61 K "Lysi… "tryptic" tryptic 78966057 10.7 1
7 62 L " " "" non tryptic 135071761 11.0 1
8 63 P " " "" non tryptic 135071761 11.0 1
9 64 G " " "" non tryptic 135071761 11.0 1
10 65 F " " "" non tryptic 135071761 11.0 1
Displaying protein Info
There’s two main functions for this:
displayProtein
for graphical display (via ggplot2’sgeom_tile
function)displayProteinText
for displaying the protein information as colored text in the console or in HTML format
Examples of displaying protein information in different ways…
displayProtein(ovaProtein, columns = 50,
textColorColumn = "coverage", textColors = c("black","red"))
For Quarto/R Markdown you can’t really use displayProteinText
except when using html-output with for example ‘html’ tables as with the gt library
displayProteinText & gt tables
<- displayProteinText(ovaProtein, columns = 50,
coverageProtein textColorColumn = "coverage", textColors = c("black","red"),
HTMLoutput = TRUE)
<- ovaProtein %>%
trypticProtein mutate(realTrypticSite = as.integer(realTrypticSite == "tryptic")) %>%
displayProteinText(columns = 50,
textColorColumn = "realTrypticSite", textColors = c("black","red"),
HTMLoutput = TRUE)
<- ovaProtein %>%
combinedProtein mutate(realTrypticSite = as.integer(realTrypticSite == "tryptic")) %>%
displayProteinText(columns = 50,
textColorColumn = "coverage", textColors = c("black","red"),
textBackgroundColorColumn = "realTrypticSite",
textBackgroundColors = c("white","steelblue"),
HTMLoutput = TRUE)
library(gt)
data.frame(Description = c("Coverage",
"Tryptic sites",
"Coverage & Tryptic sites"),
Sequence = c(coverageProtein,
trypticProtein,%>%
combinedProtein)) ::gt() %>%
gt::fmt_markdown(columns = Sequence) %>%
gt::tab_style(
gtstyle = list(
::cell_text(font = "courier")
gt
),locations = gt::cells_body(
columns = "Sequence"
) )
Description | Sequence |
---|---|
Coverage | GSIGAASMEFCFDVFKELKVHHANENIFYCPIAIMSALAMVYLGAKDSTR |
Tryptic sites | GSIGAASMEFCFDVFKELKVHHANENIFYCPIAIMSALAMVYLGAKDSTR |
Coverage & Tryptic sites | GSIGAASMEFCFDVFKELKVHHANENIFYCPIAIMSALAMVYLGAKDSTR |
To display the coverage and tryptic sites with displayProtein
displayProtein(ovaProtein, columns = 50,
textColorColumn = "coverage", textColors = c("black","red"),
fillColorColumn = "realTrypticSite",
fillColors = c("white", "steelblue"), replaceFillNA = "non tryptic",
fillColorLevels = c("non tryptic","tryptic"), fillLegendTitle = "Tryptic sites")
Originally the background color of the letters (tiles) was meant to be of a continous type (like a gradient)
displayProtein(ovaProtein, columns = 50,
textColorColumn = "coverage", textColors = c("black","red"),
fillColorColumn = "Score",
fillColors = c("white", "blue"), alpha = 0.55,
fillLegendTitle = "Score")
displayProtein(ovaProtein, columns = 50,
textColorColumn = "coverage", textColors = c("black","red"),
fillColorColumn = "Abundance",
fillColors = c("white", "blue"), alpha = 0.55,
fillLegendTitle = "Abundance")
Since the range of Abundance is rather big, the ‘fillColor’ can be made logarithmic to show lower ‘levels’ of Abundance a bit better.
displayProtein(ovaProtein, columns = 50,
textColorColumn = "coverage", textColors = c("black","red"),
fillColorColumn = "Abundance",
fillColors = c("white", "blue"), alpha = 0.55,
fillTransformation = "log10",
fillLegendTitle = "Abundance")
Modifications
To display modifications, there are a number of options. A few are displayed below.
This code shows, based on the example peptide table, where there are ‘Carbamidomethyl’ (C) or “Phospho” (S/T/Y) modifications. The column clear shows if the position is known:
- means exact position known
- exact position not known, the actual location may be another position (amino acid) usually close by
- combination of 1 & 2: there are peptide hits with exact known positions and there are those where it’s not clear where the modifications is located precisely
getAllModPositionInProtein(peptideTable = OVATable("peptide"),
Accession = "P01012",
whichModification = c("Carbamidomethyl","Phospho"),
positionColumn = "PositionsinProteins",
sumPositions = TRUE)
$Carbamidomethyl
position clear
1 73 1
2 120 1
3 367 1
$Phospho
position clear
1 68 3
2 75 2
3 76 2
4 81 2
5 82 2
6 344 1
To use the info we must map the info onto the protein data
<- getAllModPositionInProtein(peptideTable = OVATable("peptide"),
newData Accession = "P01012",
whichModification = "Phospho",
positionColumn = "PositionsinProteins",
sumPositions = TRUE)
<- addDataToProtein(proteinDF = ovaProtein,
ovaProtein dataframe = newData$Phospho,
dataColumn = "clear", dataName = "Phospho", NAValue = "0")
%>%
ovaProtein select(position, sequence, coverage, Phospho) %>%
slice(66:75) %>%
tibble()
# A tibble: 10 × 4
position sequence coverage Phospho
<int> <chr> <chr> <chr>
1 66 G 1 0
2 67 D 1 0
3 68 S 1 3
4 69 I 1 0
5 70 E 1 0
6 71 A 1 0
7 72 Q 1 0
8 73 C 1 0
9 74 G 1 0
10 75 T 1 2
And to display it:
displayProtein(ovaProtein, columns = 50,
textColorColumn = "coverage", textColors = c("black","red"),
fillColorColumn = "Phospho",
fillColors = c("white", "lightblue","steelblue","blue"),
fillColorLevels = c("0","1","2","3"), replaceFillNA = "0",
fillColorLabels = c("No label","Clear", "Unclear", "Clear & Unclear"),
fillLegendTitle = "Phosphorylation sites")
Another way to show abundance and the modification positions at the same time:
$Phospho <- ifelse(ovaProtein$Phospho == "0", "", "*")
ovaProtein
plotProteinVariable(proteinDF = ovaProtein, column = "Abundance",
yLabel = "Abundance (log10)", yLog = TRUE,
ylogSetValue = 1E4, yLimits = c(1E3, 2E10),
drawYzero = TRUE, zeroType = "dotted", zeroAlpha = 0.5,
title = "Peptide Abundance & Phosphorylation across the protein",
addSequence = TRUE, sequenceColumn = "Phospho",
textSize = 4, textColor = "blue",
caption = "Due to log10 transformation all 'zero' values were set to 10000",
)
3-D visualization
It’s also possible to map the information onto 3D displays of protein structures via the r3dmol library. Main difference here is that since ggplot2 is not used the colors need to be calculated beforehand and then mapped onto the protein structure. The proteinAnnotation package provides a set of color ‘translation’ functions.
library(r3dmol)
# get pdb file from AlphaFold: https://alphafold.ebi.ac.uk/entry/P01012
<- tempfile(fileext = ".pdb")
downloadedFile download.file("https://alphafold.ebi.ac.uk/files/AF-P01012-F1-model_v4.pdb", destfile = downloadedFile)
# generate a 3d object
<- r3dmol(viewer_spec = m_viewer_spec(
p3d cartoonQuality = 20,
lowerZoomLimit = 50,
upperZoomLimit = 350
%>%
)) m_add_model(downloadedFile) %>%
m_set_style(style = m_style_cartoon(arrows = F)) %>%
m_zoom_to()
# display w/o modifications
p3d
Since the protein data we’ve worked with so far is actually missing the Methionine residue at the n-terminus (since what you work with in the lab usually is the mature version of a protein), we need to add it to the protein data
<- rbind(data.frame(position = 0, sequence = "M", K = " ", trypticSite = "",
ovaProtein realTrypticSite = "non tryptic", Abundance = 0,
Score = 0, coverage = 0, Phospho = "") , ovaProtein)
$position <- 1:386 ovaProtein
To map the tryptic sites on the 3D structure:
<- translateRangeToColors(rangeValues = "tryptic", insideColor = "red", outsideColor = "white", inbetween = FALSE)
colorFunction
$realTrypticSiteColor <- colorFunction(ovaProtein$realTrypticSite)
ovaProtein
addStyleElement(p3d, proteinDF = ovaProtein, colorColumn = "realTrypticSiteColor", arrows = FALSE)
Of course we can still add styles/items to the 3D display
addStyleElement(p3d, proteinDF = ovaProtein, colorColumn = "realTrypticSiteColor", arrows = FALSE) %>%
m_add_style(
style = c(
m_style_stick(colorScheme = "default")
),sel = m_sel(resi = ovaProtein$position[ovaProtein$realTrypticSite == "tryptic"])
)
To map, as an example, the peptide abundance information we can use the translateToGradientColors function
<- translateToGradientColors(minValue = min(ovaProtein$Abundance),
linearColors maxValue = max(ovaProtein$Abundance),
colors = c("white", "darkblue"))
$AbundanceColors <- linearColors(ovaProtein$Abundance)
ovaProtein
addStyleElement(p3d, proteinDF = ovaProtein, colorColumn = "AbundanceColors", arrows = FALSE)
To use a ‘logarithmic’ color scale:
<- translateToGradientColors(minValue = min(ovaProtein$Abundance),
logColors maxValue = max(ovaProtein$Abundance),
colors = c("white", "darkblue"),
transformFunction = log10p)
$AbundanceLogColors <- logColors(ovaProtein$Abundance)
ovaProtein
addStyleElement(p3d, proteinDF = ovaProtein, colorColumn = "AbundanceLogColors", arrows = FALSE)
Of course we can also map other data, such as the b-values which are present in the pdb file itself. The b-values are a measure for the model confidence of AlphaFold, see also: Ovalbumin AlphaFold entry. This code reproduces what is seen in the 3D viewer on the AlphaFold site:
<- bio3d::read.pdb(downloadedFile)
pdbInfo
<- as.data.frame(pdbInfo$atom) %>%
atoms ::distinct(resno, .keep_all = T)
dplyr
$b <- atoms$b
ovaProtein
<- data.frame(rangeValues = NA,
dfColors insideColors = c("#0053D6", "#65CBF3", "#FFDB13", "#FF7D45"),
columnName = "b", inbetween = T)
$rangeValues <- list(c(90,Inf), c(70,90), c(50, 70), c(-Inf, 50))
dfColors
$bColor <- translateRangesToColors(ranges = dfColors,
ovaProteinproteinData = ovaProtein, outsideColor = "purple")
addStyleElement(p3d, proteinDF = ovaProtein, colorColumn = "bColor", arrows = FALSE)
And of course we can make the whole thing spin around:
addStyleElement(p3d, proteinDF = ovaProtein, colorColumn = "bColor", arrows = FALSE) %>%
m_rotate(angle = 90, axis = "y") %>%
m_spin(speed = 1)
Closing remarks
This package is a work in progress. One of the things that needs adjustment is the naming conventions of especially the function arguments.
I noticed myself that all functions are very dependent on using the right sequence, especially when creating the protein data. If care is not taken it may easily result in shifting of data. I intend to build some safeguards for this into the functions, but for now it’s recommended to double-check everything that comes out.
Contact Info: benbruyneel@gmail.com