SRM Analysis, revisiting correlations between abundance & environment
Taking a simpler approach to identifying correlations between environmental data and protein abundances. Goals of today was to:
- Subselect salinity data, generate summary stats, to be able to include in analysis
- Generate correlation plots with R^2 and P-values between differentially abundant proteins and environmental summary stats
- From plots, select env. summary stats to use in 1) multiple linear regression model & 2) structural equation model
- Generate correlation plots with selected env. variables to identify potential interactions
- Run multiple linear regression model
- Interpret results
- Revise paper to include results from these tasks
1. Subselected salinity data, generate summary stats, include in analysis
The environmental data manipulation referenced in this post was executed in the Stats 2 R script.
In one of our team meetings we made the determination to ignore salinity data for the time being, since several probes malfunctioned. Up until this point I have performed all tasks/stats on salinity data alongside the other parameters, but ignored salinity when analyzing correlations. Today I generated correlation plots and found salinity may play a role. So, I decided to clean up salinity data as much as I could to include in my analysis. To do so, I reviewed raw salinity data via this plotly plot. Determined that the following salinity data needed to be removed due to probe malfunction: All CIE; FBB > 7/3 @ 09:50; WBB > 6/25 @ 05:30. I then removed all outliers from the updated salinity time-series data, and re-plotted. This is the resulting plot
Raw salinity data:
Bad data & outliers removed:
2. Generated correlation plots with R^2 and P-values between differentially abundant proteins and environmental summary stats
The rest of this post refers to actions taken in the Stats 4 R script.
Discovered the excellent ggscatter
function in the ggpubr
library to generate correlation plots. I generated correlation plots for the 3 differentially abundant proteins (HSP90, Puromycin-sensitive aminopeptidase, Trifunctional Enzyme). At first I was going to just use Pep1 (the most abundant peptide from each protein) to generate these plots. Instead, I z-transformed all peptide abundance data so they were all on the same scale, and thus could regress all peptides from each protein against environmental paramters.
It’s a good idea to quickly summarize how I’ve manipulated peptide abundance data in this project:
- lambda-transformed abundance by each protein to create normal distributions, then compared across habitats, sites, and regions using ANOVA
- z-transformed by each peptide ((X-mean)/SD), then ran correlation analysis against each environmental parameter
3. From plots, selected env. summary stats to use in models
Selected the following environmental parameters per the respective proteins for model constructions, using an approximate alpha=0.01 as cutoff: % Growth, DO.sd, DO.var, pH.sd.1, salinity.mean, salinity.sd.1. Here are the correlation plots for these parameters, using HSP90 for the purposes of this example:
4. Generate correlation plots with selected env. variables to identify potential interactions
Generated a correlation plot with the pairs()
function to identify any environmental parameters that are highly correlated; if identified, I can leave one of the two correlated metrics out of my linear model.
From the correlation plot, it appears that:
- DO.sd & DO.var highly correlated. Select DO.sd since it’s correlation p=value is slightly smaller
- pH.sd & pH.var highly correlated. Select pH.var since it’s correlation p=value is smaller and r value higher
5. Run multiple linear regression model
Ran the following model and viewed summary:
Call:
lm(formula = value ~ DO.mean.loc.Z + DO.sd.loc.Z + pH.var.loc.Z +
Salinity.mean.loc.Z + Temperature.mean.loc.Z, data = data.pepsum.Env.Stats.HSP90.Z,
singular.ok = TRUE)
Residuals:
Min 1Q Median 3Q Max
-1.7794 -0.4787 0.1177 0.5790 1.2928
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2450 0.5825 0.421 0.67510
DO.mean.loc.Z 0.6813 1.1876 0.574 0.56776
DO.sd.loc.Z -1.0086 0.9314 -1.083 0.28206
pH.var.loc.Z 0.3494 0.2268 1.541 0.12723
Salinity.mean.loc.Z 1.0207 0.3477 2.935 0.00432 **
Temperature.mean.loc.Z NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7499 on 82 degrees of freedom
(54 observations deleted due to missingness)
Multiple R-squared: 0.4062, Adjusted R-squared: 0.3772
F-statistic: 14.02 on 4 and 82 DF, p-value: 9.27e-09
Also ran anova()
on the model:
> anova(lm.HSP90)
Analysis of Variance Table
Response: value
Df Sum Sq Mean Sq F value Pr(>F)
DO.mean.loc.Z 1 25.041 25.0415 44.5306 2.74e-09 ***
DO.sd.loc.Z 1 0.575 0.5754 1.0232 0.314744
pH.var.loc.Z 1 1.077 1.0767 1.9147 0.170197
Salinity.mean.loc.Z 1 4.846 4.8458 8.6171 0.004319 **
Residuals 82 46.112 0.5623
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
The coefficient significances are not adequate, and one parameter (T) was left out due to singularity. After messing with models for weeks I have decided to not include them in my paper. I’ve made this decision for several reasons: 1) I don’t believe that I have all influential parameters at hand to construct a representative model, 2) With the exiting environmental data, I believe that correlation analysis is adequate for the purposes of our paper, and perhaps most importantly 3) I don’t know what I’m doing so unable to make informed decisions nor interpret results (I hope to take the regression course next quarter). I also attempted to build a Structural Equation Model, but again, failed.