Refining growth/condition marker discovery in Pacific cod, GWAS Day 4

I am interested to see whether I can use the putative growth/condition markers detected in experimental Pacific cod to predict “performance” in our reference/wild caught cod from different marine regions (Northern Bering Sea, Aleutians, Eastern Bering Sea / western Gulf of Alaska, eastern Gulf of Alaska). I have a beagle file containing genotype likelihoods for sites detected in both experimental and reference fish. I filtered that beagle file for only our putative growth/performance markers, then ran PCAngsd to generate a covariance matrix, then PCA in R. My thought is that I could potentially use a growth/condition gradient along a PC axis from experimental fish to identify reference fish populations/regions that lie on the high growth/condition end (i.e. potential high performers).

Here is the PCA of those ~270 putative growth/condition markers with growth/condition index of experimental fish indicated by the gradient, and all reference fish in gray. Second figure is a regression of growth/condition index ~ PC1 scores for experimental fish only, which I could possibly use to potentially predict performance in reference fish - the more negative PC1 score the better?. The boxplot shows PC1 scores for reference fish, potentially indicating eGOA fish have the most negative PC1 score / best performance indicator and tagged/NBS fish have the highest PC1 scor.

What’s happening here

This is strange, though. When I plot PC1 scores for experimental fish alongside reference fish, obviously the variability is much higher in experimental fish. BUT also - there are differences among temperature treatments, which isn’t a good sign. I believe this indicates that the marker discovery is being influenced by growth/condition differences among treatments. This suggests to me that I need to run my GWAS analyses separately for each temperature and look for 1) common markers among treatments that could indicate general growth/condition performance, and 2) unique markers among treatments that could reflect temperature-specific markers.

Re-run GWAS separately for each temperature

Here I perform four separate GWAS runs for each temperature treatment (0,5,9,16), with n=40/treatment for all but the 16C group (n=37, unfortunately we don’t have good DNA data for three that died!). I created four gwas subdirectories (temp-#), re-ran ANGSD to generate beagle files with genotype likelihoods, then ran angsd -doAsso. Here is an example set of scripts used for 0C group, located in the directory /home/lspencer/pcod-lcwgs-2023/analysis-20240606/experimental/gwas/temp-0/:

First, resolving GWAS issue

For some unknown reason the GWAS ran successfully on raw genotype likelihoods for 0C, 5C, and 9C, but NOT for 16C. I kept encountering a “segmentation fault” error, which I’ve previously encountered with angsd v0.940 (but it’s weird that it’s only happening with one set of samples). After TOO MUCH TIME troubleshooting, I’ve decided to pivot to another approach - using angsd v0.933 instead. That version kept throwing an error related to improperly handing chromosome name; it couldn’t get past the “” in each chromosome id_site (NC_XXXXX.1#####) in the beagle files, so I never attempted to use it. But now I have to! So, I’ll modify the input beagles and genome reference to remove that first “_”.

Modify pcod genome file to remove “_” in chromosome ID

sed '/^>/ s/NC_/NC/' /home/lspencer/references/pcod-ncbi/GCF_031168955.1_ASM3116895v1_genomic_rehead.fa > pcod-genome.fa  #use sed to replace "NC_" with "NC" in header lines

(base) [lspencer@node28 gwas]$ head pcod-genome.fa #view results
>NC082382.1
CAGACCTCCAATAGAGAGCTGCTCCCCCTTCGCACAAAGCCGCTGCTGGTGAATGCTCGAAGCGTTGTGTGATTGAATCG
CTTTAATGCCGTTCCATGTCACGTTGATCGTTTTTTTGCACAACGAGCAAAAAGCTTCCCGGTCATTTCCCATCACGGGT
TTCAGCCAGTCTTTAAATGTCGGGTCGTCAACCCATGTTTGCGAaaacttgcatttacccattcTCTCATAGAGCCAGAA
ACTCAGCCGTACAGAACTCTGAGGGGAAAAGGCGGAAAATGTTGctggtgttacaccgaattctgtgacccaccgaaaag
tgtggcCCCAGGGGGTGcagttgcacattttctgcaacatgcacgtgtgcattataacaaaaaacataaataaaacataa
gatctctggtgggtctttctttttttgatactaacattaattctaaacactattttacacagtagcctatactaggaaat
gctcaaaggtaaatcagcagAATTTAACcatgactgtagctttttatgggtaaagaagcaacggtgaaggacccTACTAA
ACGCccgtctgtaaaatgctaatcaaatcaatcaaatgtatttattaagcacctttaataaaaaggtaattggttggggg
gagatgttatgttttatgtatgtttttgtcatgcctgtctacgcttcaaactcgtgcatattgcagtaaatatgcaacag

(base) [lspencer@node28 gwas]$ samtools faidx pcod-genome.fa #index new genome file 

(base) [lspencer@node28 gwas]$ cat pcod-genome.fa.fai #view index file 
NC082382.1      26289739        12      80      81
NC082383.1      23805147        26618385        80      81
NC082384.1      26984552        50721109        80      81
NC082385.1      35077496        78042980        80      81
NC082386.1      22175970        113558957       80      81
NC082387.1      27900680        136012139       80      81
NC082388.1      28186669        164261590       80      81
NC082389.1      23760065        192800605       80      81
NC082390.1      23393856        216857683       80      81
NC082391.1      23728004        240543975       80      81
NC082392.1      26267057        264568592       80      81
NC082393.1      27587113        291164000       80      81
NC082394.1      22312454        319095964       80      81
NC082395.1      26048790        341687336       80      81
NC082396.1      24646214        368061748       80      81
NC082397.1      29952890        393016052       80      81
NC082398.1      16295523        423343366       80      81
NC082399.1      21077965        439842596       80      81
NC082400.1      17993618        461184048       80      81
NC082401.1      23303860        479402599       80      81
NC082402.1      19478734        502997770       80      81
NC082403.1      19411328        522720001       80      81
NC082404.1      20003359        542373983       80      81
NC036931.1      16569   562627396       80      81

angsdARRAY.sh - generate GLs (and some other files) for each chromosome with desired ANGSD settings

#!/bin/bash

#SBATCH --cpus-per-task=10
#SBATCH --time=7-00:00:00
#SBATCH --job-name=angsd_0
#SBATCH --output=/home/lspencer/pcod-lcwgs-2023/analysis-20240606/experimental/gwas/temp-0/angsd_%A-%a.out
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=laura.spencer@noaa.gov
#SBATCH --array=1-24%6

## ANGSD Settings
# -setminDepth 50 is based on 50% of the individuals and the average sequencing coverage of 3X (i.e. 50 ~ 3x * 0.5*40)
# -setmaxDepth 300 is based on a very generous 2 times all the individuals and the average depth, to avoid sites overly sequenced (i.e. 300 >~ 2 * 40 * 3x)
# -minInd 20 sets calling only variants with ~50% of individuals
# -minMaqQ was 15, now 30
# -minQ was 15, now 33
# -GL was 1, now 2

module unload bio/angsd/0.933
module load bio/angsd/0.933
base=/home/lspencer/pcod-lcwgs-2023/analysis-20240606/experimental
temp="0"
workdir=${base}/gwas/temp-${temp}
bams=${workdir}/bamslist_${temp}.txt

# array input is chromosomes
JOBS_FILE=${base}/scripts/pcod-exp_angsdARRAY_input.txt
IDS=$(cat ${JOBS_FILE})

for sample_line in ${IDS}
do
        job_index=$(echo ${sample_line} | awk -F ":" '{print $1}')
        contig=$(echo ${sample_line} | awk -F ":" '{print $2}')
        if [[ ${SLURM_ARRAY_TASK_ID} == ${job_index} ]]; then
                break
        fi
done

angsd -b ${bams} \
-ref /home/lspencer/references/pcod-ncbi/GCF_031168955.1_ASM3116895v1_genomic.fa \
-r ${contig}: \
-out ${workdir}/Chr${job_index}_${temp} \
-nThreads 10 \
-uniqueOnly 1 \
-remove_bads 1 \
-trim 0 \
-C 50 \
-minMapQ 30 \
-minQ 33 \
-doCounts 1 \
-setminDepth 50 \
-setmaxDepth 300 \
-GL 2 \
-doBcf 1 \
-doPost 1 \
-doGeno 32 \
-doGlf 2 \
-doMaf 1 \
-doMajorMinor 1 \
-doDepth 1 \
-dumpCounts 3 \
-only_proper_pairs 1 \
-minInd 20 \
-minmaf 0.05 \
-SNP_pval 1e-10 \
-baq 1

concat.sh - concatenage beagles (containing GLs) into one whole genome beagle file

#!/bin/bash

#SBATCH --cpus-per-task=10
#SBATCH --job-name=concat-beagles-0
#SBATCH --mail-type=ALL
#SBATCH --mail-user=laura.spencer@noaa.gov
#SBATCH --output=/home/lspencer/pcod-lcwgs-2023/analysis-20240606/experimental/gwas/temp-0/concat.out

# Define the input directory containing all .beagle.gz files
temp="0"
input_dir="/home/lspencer/pcod-lcwgs-2023/analysis-20240606/experimental/gwas/temp-${temp}"
output_file="${input_dir}/wholegenome-${temp}.beagle"

# Extract the header from the first .beagle.gz file
zcat "${input_dir}/Chr1_0.beagle.gz" | head -n 1 > "${output_file}"

# Loop through chromosomes 1 to 24 and concatenate their contents (excluding headers)
for i in {1..24}; do
    file="${input_dir}/Chr${i}_0.beagle.gz"
    if [[ -f "$file" ]]; then
        echo "Processing $file..."
        zcat "$file" | tail -n +2 >> "${output_file}"
    else
        echo "Warning: File $file not found. Skipping..."
    fi
done

# Compress the resulting .beagle file
gzip "${output_file}"

echo "Concatenation complete. Output written to ${output_file}.gz"

gwas.sh - edit beagle chrom IDs, impute genotype probabilities, and perform association tests using various traits using both “raw” GLs and imputed GPs

NOTE: here I only show code for GWAS using the raw GLs due to space, but code is same just with $imputed beagle file variable and different output names)

#!/bin/bash
#SBATCH --time=0-20:00:00
#SBATCH -p himem
#SBATCH --mem=100G
#SBATCH --job-name=gwas-16
#SBATCH --mail-type=ALL
#SBATCH --mail-user=laura.spencer@noaa.gov
#SBATCH --output=/home/lspencer/pcod-lcwgs-2023/analysis-20240606/experimental/gwas/temp-16/gwas-16.out

module load bio/angsd/0.933 #important to use this version

temp="16"
base=/home/lspencer/pcod-lcwgs-2023/analysis-20240606/experimental/gwas
beagle=${base}/temp-${temp}/wholegenome-${temp}.beagle.gz
fai=${base}/pcod-genome.fa.fai

# Remove "_" from chromosome IDs in beagle file (it confuses angsd)
zcat ${beagle} | sed 's/NC_/NC/g' | gzip > ${base}/temp-${temp}/wholegenome-${temp}-rehead.beagle.gz
beagle_rehead=${base}/temp-${temp}/wholegenome-${temp}-rehead.beagle.gz

# FIRST PERFORM IMPUTATION (to fill in NA values) with beagle

# All experimental fish
java -Xmx15000m -jar /home/lspencer/programs/beagle.jar \
like=${beagle_rehead} \
out=${base}/temp-${temp}/imputed

imputed=${base}/temp-${temp}/imputed.wholegenome-${temp}-rehead.beagle.gz.gprobs.gz

# ---------------------
#  CPI from all 4 growth and condition metrics (SGR-sl, SGR-ww, HSI, Kwet)
angsd \
-doMaf 4 \
-beagle ${beagle_rehead} \
-yQuant ${base}/temp-${temp}/${temp}-pi.grow.cond1.txt \
-doAsso 4 \
-out ${base}/temp-${temp}/${temp}-gwas-pi-grow-cond1.out \
-fai ${fai}

# ------------------
#  CPI from 3 growth and condition metrics (SGR-sl, SGR-ww, HSI)
angsd \
-doMaf 4 \
-beagle ${beagle_rehead} \
-yQuant ${base}/temp-${temp}/${temp}-pi.grow.cond2.txt \
-doAsso 4 \
-out ${base}/temp-${temp}/${temp}-gwas-pi-grow-cond2.out \
-fai ${fai}

# ------------------
#  CPI from 2 growth and condition metrics (SGR-ww, HSI)
angsd \
-doMaf 4 \
-beagle ${beagle_rehead} \
-yQuant ${base}/temp-${temp}/${temp}-pi.grow.cond3.txt \
-doAsso 4 \
-out ${base}/temp-${temp}/${temp}-gwas-pi-grow-cond3.out \
-fai ${fai}

# ------------------
# SGR - standard length
angsd \
-doMaf 4 \
-beagle ${beagle_rehead} \
-yQuant ${base}/temp-${temp}/${temp}-sgr-sl.txt \
-doAsso 4 \
-out ${base}/temp-${temp}/${temp}-gwas-sgr-sl.out \
-fai ${fai}

# SGR - wet weight
angsd \
-doMaf 4 \
-beagle ${beagle_rehead} \
-yQuant ${base}/temp-${temp}/${temp}-sgr-ww.txt \
-doAsso 4 \
-out ${base}/temp-${temp}/${temp}-gwas-sgr-ww.out \
-fai ${fai}

# HSI
angsd \
-doMaf 4 \
-beagle ${beagle_rehead} \
-yQuant ${base}/temp-${temp}/${temp}-hsi.txt \
-doAsso 4 \
-out ${base}/temp-${temp}/${temp}-gwas-hsi.out \
-fai ${fai}

# KWet
angsd \
-doMaf 4 \
-beagle ${beagle_rehead} \
-yQuant ${base}/temp-${temp}/${temp}-kwet.txt \
-doAsso 4 \
-out ${base}/temp-${temp}/${temp}-gwas-kwet.out \
-fai ${fai}

Examine GWAS results in R

Here is R Code I used to import, filter, view, and annotate GWAS-identified “putative markers”:

temp="16"
trait.label="Composite index #3, growth rates & liver index" #
trait="pi.grow.cond2" #"SGR.ww.trt" #SGR.sl.trt, Kwet, pi.grow.cond1, hsi, pi.grow.cond3
trait.file="pi-grow-cond2" #sgr-sl, sgr-ww, kwet, pi-grow-cond1, hsi, pi-grow-cond3

# Here i use imputed genotypes 
lrt.temp<-read.table(gzfile(paste0("../integrated/gwas/by-temp/", temp, "-gwas-impute-", trait.file, ".out.lrt0.gz")), header=T, sep="\t") %>% 
  mutate(Chromosome=gsub("NC", "NC_", Chromosome)) #important to add back the "_" in the chromosome name or else no markers will be annotated 

#we have a few LRT values that are -999, we should remove them. 
length(lrt.temp$LRT) # number of loci 
length(which(lrt.temp$LRT == -999)) #number that will be removed 

length(which(lrt.temp$LRT != -999))

#remove the values that are not -999 and that are negative
lrt.temp_filt<-lrt.temp[-c(which(lrt.temp$LRT == -999),which(lrt.temp$LRT <= 0)),]

#add snp "name" from rownumber 
lrt.temp_filt$SNP<-paste("r",1:length(lrt.temp_filt$Chromosome), sep="")

# get pvalues from chisq 
lrt.temp_filt$pvalue<-pchisq(lrt.temp_filt$LRT, df=1, lower=F)

hist(lrt.temp_filt$LRT, breaks=50)
hist(lrt.temp_filt$pvalue, breaks=50)
qqnorm(lrt.temp_filt$pvalue)

# #we also need to make sure we don't have any tricky values like those below
# lrt.temp_filt<-lrt.temp_filt[-c(which(lrt.temp_filt$pvalue == "NaN" ),
#                       which(lrt.temp_filt$pvalue == "Inf"),
#                       which(lrt.temp_filt$LRT == "Inf")),]
                      
# jpeg(filename = paste0("../integrated/gwas/by-temp/", temp, "-growth-condition-logp3-manhattan.jpeg"),
#      width = 900, height = 450, quality = 100)
manhattan(lrt.temp_filt %>% mutate(chr=as.numeric(as.factor(Chromosome))), 
          chr="chr", bp="Position", p="pvalue", genomewideline = 3, cex=1, suggestiveline = F,
          main=paste0("Putative markers associated with\n", trait.label, ", Temperature = ", temp), 
          highlight = (lrt.temp_filt %>% mutate(logp=-log10(pvalue)) %>% filter(logp>3))$SNP)
#dev.off()

lrt.temp_filt %>% mutate(logp=-log10(pvalue)) %>% filter(logp>3) %>% arrange(desc(LRT)) %>%
  unite("marker", Chromosome:Position, sep="_") %>% select(marker) %>% 
  write_delim(file=paste0("../integrated/gwas/", temp, "-gc1-markers-logp3.txt"),  col_names = F)

# did this already in previous code chunk 
# # Prepare gene regions with a 2000 bp flank
# genes4gwas <- pcod.gtf %>% mutate(ncbi_id=paste0("GeneID:", ncbi_id)) %>% 
#   filter(feature=="gene") %>% 
#   mutate(start_flank=start-2000) %>%
#   mutate(start_flank=as.integer(case_when(start_flank<0~0, TRUE~start_flank))) #%>% 
# 
# gene_ranges <- GRanges(
#   seqnames =genes4gwas$seqname,
#   ranges=IRanges(
#   start=genes4gwas$start_flank,
# #  start = genes4gwas$start,
#   end=genes4gwas$end),
#   ncbi_id=genes4gwas$ncbi_id,
#   gene_id=genes4gwas$gene_id)

# Prepare SNP sites
gc1.sites.temp <- lrt.temp_filt %>% mutate(logp=-log10(pvalue)) %>% filter(logp>3) %>% mutate(marker=paste0(Chromosome, "_", Position))

snp_ranges.temp <- GRanges(
  seqnames = gc1.sites.temp$Chromosome,
  ranges=IRanges(
    start = gc1.sites.temp$Position,
    end = gc1.sites.temp$Position),
  pvalue=gc1.sites.temp$pvalue,
  Major=gc1.sites.temp$Major,
  Minor=gc1.sites.temp$Minor)

# Find overlaps between gene flanks and SNP sites
overlaps.temp <- findOverlaps(gene_ranges, snp_ranges.temp)

# Extract matching rows
overlapping_genes.temp <- genes4gwas[queryHits(overlaps.temp), ]
overlapping_snps.temp <- gc1.sites.temp[subjectHits(overlaps.temp), ]

# Combine results into a single data frame
result.temp <- cbind(overlapping_genes.temp, overlapping_snps.temp) %>% 
  mutate(feature=case_when(Position>=start~"gene", Position<start~"upstream")) %>% 
  dplyr::select(-c(score, frame, attributes,biotype, exon)) %>% 
#  dplyr::select(feature, seqname, strand, start_flank, start, end, Position, gene_id, ncbi_id, gene_id) %>% 
  left_join(pcod.blast.GO[c("ncbi_id", "spid", "gene", "protein_names", "gene_ontology_biological_process")] %>% 
              mutate(ncbi_id=paste0("GeneID:", ncbi_id)), by="ncbi_id")

# result.temp %>% dplyr::select(-Position) %>% distinct() %>% group_by(feature) %>%  tally() %>% 
#   mutate(total=nrow(genes4gwas)) %>% mutate(perc=round(100*n/total, 2))

# Table of putative markers 
result.temp %>% dplyr::select(Chromosome, Position, LRT, pvalue, spid, protein_names) %>%
ungroup() %>% 
distinct() %>% arrange(desc(LRT)) %>% 
mutate(across(c(LRT:pvalue), ~ signif(.x, 3))) #%>% write_clip()

# NOTE: this beagle contains GLs for ALL treatments, and is missing some markers that were identified in the treatment-specific ANGSD run 
exp.beagle.gc1.sites.temp <- 
  exp.beagle %>% filter(marker %in% gc1.sites.temp$marker) %>% 
  pivot_longer(X100_1:X9_3, names_to = "individual.allele", values_to = "probability") %>% 
  separate(individual.allele, into=c("individual", "allele"), sep = "_") %>% 
  mutate(allele=case_when(
    allele==1 ~ paste0(allele1, "/", allele1),
    allele==2 ~ paste0(allele1, "/", allele2),
    allele==3 ~ paste0(allele2, "/", allele2))) %>% 
    mutate(allele_base = allele %>%
           str_replace_all("0", "A") %>%
           str_replace_all("1", "C") %>%
           str_replace_all("2", "G") %>%
           str_replace_all("3", "T")) %>% 
  mutate(probability=round(as.numeric(probability), digits=4)) %>% 
  group_by(marker, individual) %>% 
  filter(!(n() == 3 & all(probability == 0.3333))) %>%  
  slice_max(probability,n=1) %>% 
    mutate(individual=gsub("X", "", individual), allele_base) %>%
    left_join(phenotypes2, by=c("individual"="genetic_sampling_count")) %>%
mutate(temperature = replace_na(temperature, "16")) #%>% 
#  filter(temperature==temp)


gc1.markers.temp <- (gc1.sites.temp %>% filter(marker %in% exp.beagle$marker) %>% arrange(desc(LRT)))$marker
for (i in 1:length(gc1.markers.temp)) {
  meta <- exp.beagle.gc1.sites.temp %>% filter(marker==gc1.markers.temp[i]) %>% ungroup() %>%  
  select(marker) %>% distinct() %>%  
  left_join(result.temp %>% mutate(marker=paste0(Chromosome, "_", Position))) 
  
  print(exp.beagle.gc1.sites.temp %>% filter(marker==gc1.markers.temp[i]) %>% #filter(probability>0.75) %>% 
#  filter(!is.na(allele_base)) %>% droplevels() %>% 
    ggplot() + geom_boxplot(aes(x=allele_base, 
                                y=!!sym(trait),
                                fill=temperature), alpha=0.7, outlier.shape = NA) + 
    geom_point(aes(x=allele_base, 
                   y=!!sym(trait), 
                   color=point), position=position_jitter(w = 0.2,h = 0)) + 
  theme_minimal() + facet_wrap(~temperature, scales="free_y", nrow = 2, ncol=2) +
  scale_fill_manual(name="temperature", 
                   values=c("0"="#2c7bb6", "5"="#abd9e9", "9"="#fdae61", "16"="#d7191c"),
                   labels=c("0"="0degC","5"="5degC","9"="9degC (optimal)","16"="16degC")) +
  scale_color_manual(name=NULL, values=c("No.eGOA"="red", "No.wGOA"="black", "Yes.wGOA"="green2"), 
                     labels=c("No.wGOA"="Survived",  "No.eGOA"="eGOA cluster in PCA", "Yes.wGOA"="Died")) + 
  ggtitle(paste0("Marker: ", meta$marker, "\n Gene ID: ", meta$gene_id, "\n Protein: ", meta$protein_names)))
}

Putative markers associated with growth rate (standard length) in warm-exposed fish

Chromosome	Position	LRT	pvalue	spid	protein_names
Chr22	2266449	13.5	0.000238	P12107	Collagen alpha-1(XI) chain
Chr6	16384831	13.1	0.000295	Q6XJU9	Osteoclast-stimulating factor 1
Chr22	2266770	12.9	0.000332	P12107	Collagen alpha-1(XI) chain
Chr4	29206268	12.6	0.000383	Q5DU56	Protein NLRC3
Chr22	2266677	12.5	0.000399	P12107	Collagen alpha-1(XI) chain
Chr4	33251932	12.5	0.000415	Q5R5U3	Zinc finger protein 271
Chr10	14094944	11.8	0.000577	Q91Y13	Protocadherin alpha-7 (PCDH-alpha-7)
Chr12	25112546	11.3	0.000767	P55199	RNA polymerase II elongation factor ELL
Chr23	8966730	11.3	0.000786	Q6P5D8	Structural maintenance of chromosomes flexible hinge domain-containing protein 1
Chr20	18401423	11.1	0.000871	Q6GMI9	UDP-glucuronic acid decarboxylase 1
Chr7	8158249	10.8	0.000995	O43264	Centromere/kinetochore protein zw10 homolog

Putative markers associated with growth rate (wet weight) in warm-exposed fish

Chromosome	Position	LRT	pvalue	spid	protein_names
Chr20	18401423	19.8	8.71E-06	Q6GMI9	UDP-glucuronic acid decarboxylase 1
Chr22	2266449	18.5	1.71E-05	P12107	Collagen alpha-1(XI) chain
Chr22	2266770	17.3	3.24E-05	P12107	Collagen alpha-1(XI) chain
Chr22	2266677	16.6	4.65E-05	P12107	Collagen alpha-1(XI) chain
Chr22	2871726	13.1	0.00029	Q95460	Major histocompatibility complex class I-related gene protein
Chr4	29206268	12.7	0.000369	Q5DU56	Protein NLRC3
Chr10	13841648	12.6	0.000383	Q5DRB8	Protocadherin gamma-A2
Chr10	13841648	12.6	0.000383	Q9Y5G7	Protocadherin gamma-A6
Chr7	8158249	12.2	0.000479	O43264	Centromere/kinetochore protein zw10 homolog
Chr1	10698321	12.2	0.000485	Q6RY07	Acidic mammalian chitinase (AMCase) (EC 3.2.1.14)
Chr5	20744974	12.1	0.000499	Q7RTR2	NLR family CARD domain-containing protein 3
Chr8	6194665	12	0.000543	Q8R4G8	BTB/POZ domain-containing protein KCTD1 (Vitamin A-deficient testicular protein 6)
Chr22	2871709	12	0.000545	Q95460	Major histocompatibility complex class I-related gene protein
Chr9	5952237	11.5	0.000685	Q8AXZ4	Contactin-1a (F3/F11/contactin) (Neural cell recognition molecule F11)
Chr7	1643219	11.5	0.000709	P59046	NACHT, LRR and PYD domains-containing protein 12 (Monarch-1)
Chr7	1643219	11.5	0.000709	Q7RTR2	NLR family CARD domain-containing protein 3 (CARD15-like protein)
Chr15	14084438	11.4	0.000716	Q7LFX5	Carbohydrate sulfotransferase 15

Putative markers associated with hepatosomatic index in warm-exposed fish

Chromosome	Position	LRT	pvalue	spid	protein_names
Chr10	26512	24.5	7.50E-07	Q5SVR0	TBC1 domain family member 9B
Chr8	7686394	20.6	5.69E-06	Q90XF2	Protein kinase C iota type
Chr12	8663963	20.3	6.69E-06	P15924	Desmoplakin (DP)
Chr12	8663139	20.1	7.20E-06	P15924	Desmoplakin (DP)
Chr8	7684131	19.6	9.48E-06	Q90XF2	Protein kinase C iota type
Chr8	7688322	18.6	1.62E-05	Q90XF2	Protein kinase C iota type
Chr15	5279980	17.2	3.41E-05	Q5RC21	Translin-associated protein X
Chr15	5279992	17.1	3.54E-05	Q5RC21	Translin-associated protein X
Chr10	26590	16.5	4.85E-05	Q5SVR0	TBC1 domain family member 9B
Chr18	18184067	15.2	9.66E-05	Q9ULK0	Glutamate receptor ionotropic, delta-1
Chr7	24047591	14.6	0.000134	Q2WG80	Protein ripply1
Chr7	24047555	14.5	0.000138	Q2WG80	Protein ripply1
Chr7	23954622	14.4	0.000151	O88491	Histone-lysine N-methyltransferase, H3 lysine-36 specific
Chr12	8664098	14.2	0.00016	P15924	Desmoplakin (DP)
Chr14	6110294	14.2	0.000161	Q7SXF1	7-dehydrocholesterol reductase
Chr12	8666427	14.1	0.000171	P15924	Desmoplakin (DP)
Chr8	7688395	14	0.000185	Q90XF2	Protein kinase C iota type
Chr7	22172083	12.9	0.000325	Q6ZMW2	Zinc finger protein 782
Chr7	11635733	12.5	0.000403	Q3U3V8	X-ray radiation resistance-associated protein 1
Chr8	7688446	12.3	0.00046	Q90XF2	Protein kinase C iota type
Chr4	9882897	12.1	0.000498	Q3USH5	Splicing factor, suppressor of white-apricot homolog
Chr3	16934535	11.8	0.00058	A2Q0R8	Small ribosomal subunit protein eS1
Chr3	16934535	11.8	0.00058	Q5HYK7	SH3 domain-containing protein 19
Chr1	6185889	11.8	0.000581	Q86YJ5	E3 ubiquitin-protein ligase MARCHF9
Chr12	6825238	11.8	0.00059	Q9QYW0	Protein AATF (Apoptosis-antagonizing transcription factor)
Chr9	20279221	11.7	0.000632	A1L271	Guanine nucleotide-binding protein subunit beta-5a
Chr3	16934470	11.7	0.000642	A2Q0R8	Small ribosomal subunit protein eS1
Chr3	16934470	11.7	0.000642	Q5HYK7	SH3 domain-containing protein 19
Chr18	18184044	11.6	0.000674	Q9ULK0	Glutamate receptor ionotropic, delta-1
Chr7	22188972	11.6	0.000676	Q6ZMW2	Zinc finger protein 782
Chr7	22182763	11.5	0.000684	Q6ZMW2	Zinc finger protein 782
Chr7	11635676	11.5	0.000709	Q3U3V8	X-ray radiation resistance-associated protein 1
Chr7	22174547	11.3	0.000771	Q6ZMW2	Zinc finger protein 782
Chr9	20279076	11.2	0.000829	A1L271	Guanine nucleotide-binding protein subunit beta-5a
Chr14	19635364	11	0.000893	Q6DJE5	Dysbindin domain-containing protein 1
Chr14	19635364	11	0.000893	Q0P464	Methenyltetrahydrofolate synthase domain-containing protein
Chr14	19635364	11	0.000893	O95995	Dynein regulatory complex subunit 4 (Growth arrest-specific protein 11)
Chr18	18193173	10.9	0.000938	Q9ULK0	Glutamate receptor ionotropic, delta-1
Chr2	21411415	10.9	0.000968	Q7Z410	Transmembrane protease serine 9
Chr7	22186315	10.9	0.00097	Q6ZMW2	Zinc finger protein 782
Chr18	18185188	10.8	0.000992	Q9ULK0	Glutamate receptor ionotropic, delta-1

Working on pcod genotype imputation pipeline

I previously performed genotype imputation without any reference panel/map. I did some reading, and it looks like I could greatly improve imputation accuracy if I provide phased haplotype reference panel, built from other Pacific cod WGS data. I happen to have a lcWGS data from ~600 reference fish (depth ~3x) AND the big/little fish from 2021 (depth ~14x). I’m going to see if I can use those datasets to build the phased reference panel.

NOTE: phased reference panel: genetic variants are assigned to their respective chromosomes, distinguishing which alleles came from one parent versus the other.

NOTE: I tried to use a beagle file (merged to have both ref & big/little data) as a reference panel, but no dice. I need a VCF file, so I’ll build that from “scratch” using angsd’s doVCF option

Run ANGSD to generate VCF with genotype probabilities from reference population and Big/Little 2021 cod

I’m working in this repon on Sedna: /home/lspencer/pcod-general/imputation-ref-panel, and ran the below code in the script angsd-impute-ref-panel.sh. It produced BCF files for each chromosome.

#!/bin/bash

#SBATCH --cpus-per-task=10
#SBATCH --time=0-20:00:00
#SBATCH --job-name=angsd
#SBATCH --output=/home/lspencer/pcod-general/imputation-ref-panel/angsd-ref-panel_%A-%a.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=lspencer@noaa.gov
#SBATCH --array=1-24%24

module unload bio/angsd/0.940
module load bio/angsd/0.940

JOBS_FILE=/home/lspencer/pcod-general/imputation-ref-panel/angsdARRAY_input.txt
IDS=$(cat ${JOBS_FILE})

for sample_line in ${IDS}
do
        job_index=$(echo ${sample_line} | awk -F ":" '{print $1}')
        contig=$(echo ${sample_line} | awk -F ":" '{print $2}')
        if [[ ${SLURM_ARRAY_TASK_ID} == ${job_index} ]]; then
                break
        fi
done

angsd -b /home/lspencer/pcod-general/imputation-ref-panel/bamslist-ref-panel.txt -ref /home/lspencer/references/pcod-ncbi/GCF_031168955.1_ASM3116895v1_genomic.fa \
-r ${contig}: -out /home/lspencer/pcod-general/imputation-ref-panel/panel_${contig}_polymorphic \
-nThreads 10 -uniqueOnly 1 -doCounts 1 -setMinDepth 700 -remove_bads 1 -trim 0 -C 50 -minMapQ 35 -minQ 30 \
-dobcf 1 -doGlf 2 -GL 1 -doMaf 1 -doMajorMinor 1 -dopost 1 -dogeno 32 \
-minMaf 0.05 -SNP_pval 1e-10 -only_proper_pairs 1
(base) [lspencer@node28 imputation-ref-panel]$ pwd
/home/lspencer/pcod-general/imputation-ref-panel

I then concatenated the BCF files into one using bcftools concat, then converted to VCF, filtered for minor allele frequency > 5% and max missingness 15%. This produced the file whole-genome_miss15.vcf.gz. There are still some missing genotypes, so here I try to impute those using BEAGLE v5.4:

java -jar /home/lspencer/programs/beagle.v5.4.jar gt=/home/lspencer/pcod-general/imputation-ref-panel/whole-genome_miss15.vcf.gz gp=true impute=true out=/home/lspencer/pcod-general/imputed

beagle.29Oct24.c8e.jar (version 5.4)
Copyright (C) 2014-2022 Brian L. Browning
Enter "java -jar beagle.29Oct24.c8e.jar" to list command line argument
Start time: 11:17 AM PST on 25 Nov 2024

Command line: java -Xmx1136m -jar beagle.29Oct24.c8e.jar
  gt=/home/lspencer/pcod-general/imputation-ref-panel/whole-genome_miss15.vcf.gz
  gp=true
  impute=true
  out=/home/lspencer/pcod-general/imputed
  nthreads=1

No genetic map is specified: using 1 cM = 1 Mb

Reference samples:                    0
Study     samples:                  708

Window 1 [NC_036931.1:47-16349]
Study     markers:                   84

Burnin  iteration 1:           5 seconds
Burnin  iteration 2:           1 second
Burnin  iteration 3:           1 second

Estimated ne:                  5216323
Estimated err:                 6.4e-03

Phasing iteration 1:           0 seconds
Phasing iteration 2:           0 seconds
Phasing iteration 3:           0 seconds
Phasing iteration 4:           0 seconds
Phasing iteration 5:           0 seconds
Phasing iteration 6:           0 seconds
Phasing iteration 7:           0 seconds
Phasing iteration 8:           0 seconds
Phasing iteration 9:           0 seconds
Phasing iteration 10:          0 seconds
Phasing iteration 11:          0 seconds
Phasing iteration 12:          0 seconds

Window 2 [NC_082382.1:8368-26260017]
Study     markers:               17,298

...

yada yada yada

This should produce a VCF file with phased haplotypes, and a log file. I should next assess the phasing quality using tools like SHAPEIT’s quality checks or Beagle’s output metrics.

Written on November 21, 2024