Methylation Island Functional Analysis

In last week’s Epigenetic Reading Group we discussed this Long et al. 2013 paper, which found some interesting associations between the size of a non-methylated island (NMI) across 7 vertebrate taxa, and the function of genes within islands of various sizes. They found that “[conserved] NMIs tended to be longer, have higher CpG density, and are generally found associated with the TSSs of protein coding genes.”, and “In contrast, tissue-specific NMIs that are differentially methylated tended to be shorter, have lower CpG density and are found away from protein-coding gene TSSs.”

To investigate any association between methylated island size and function in the Olympia oyster (Ostrea lurida), I identified methylated islands using the MBD-Seq data, extracted long and short islands (>1000bp, <1000bp), identified annotated genes that overlap with the methyation islands, then identified enriched biological and molecular functions in short and long methylated islands.

In this first round analysis I chose to assign “long” methylation islands as those >1000 bp, and “short” as those <1000bp (and in this case, greater than 500bp). This threshold was based on the size distribution of unique (shorter) and shared (longer) non-methylated islands in the vertebrates that Long et al. found (see screen shot below). Obviously, the size distribution of non-methylated islands in vertebrates could be very different from methylated islands in invertebrates. But, one has to start somewhere!

image

I performed my analysis in the RMarkdown notebook 01b-General-Methylation-Patterns.Rmd. I’m pasting my code and results here:


Methylation islands functional analysis

Here I will find methylation islands, measure their length, then bin them into short islands & long islands, identify genes that overlap with the islands, then see whether short/long islands are enriched for certain biological functions or molecular processes. I will use Yaamini’s script, and the script from Jeong et al. 2018 code, which I saved in the resources/ subdirectory of this repo. See also the Jeong et al. paper, and this GitHub issue where Yaamini & Steven discuss the optimal settings to use when running the script.

Usage of the script:

  • ./methyl_island_sliding_window.pl
  • window size - starting size of the methylation island window. I will use 500 bp, as per Yaamini & Steven’s analysis.
  • mCpG fraction - the minimum fraction of methylated CpGs required within the window to be accepted. I will use 0.02, aka 2%
  • step size - base pairs to extend an accepted window by (continues extending by the step size as long as the mCpG fraction is met). I will use 50 bp
  • mCpG File - input file with a list of all methylated CpGs in the genome, sorted by scaffold/chromosome and position

Create input file - methylated loci

File needs to have chromosome and locus (aka start bp)

awk '{print $1"\t"$2}' ../analyses/methylation-characteristics/all_methylated_5x.bed > ../analyses/methylation-characteristics/methylation-islands/all_methylated_5x-reduced.bed
head ../analyses/methylation-characteristics/methylation-islands/all_methylated_5x-reduced.bed
Contig0	161
Contig0	349
Contig0	481
Contig0	514
Contig0	532
Contig0	583
Contig0	742
Contig0	775
Contig0	782
Contig0	844

Run Methylation Island Analysis

../analyses/methylation-characteristics/methylation-islands/methyl_island_sliding_window.pl 500 0.02 50 \ ../analyses/methylation-characteristics/methylation-islands/all_methylated_5x-reduced.bed > \
../analyses/methylation-characteristics/methylation-islands/methylation-islands_500-02-50.tab

Resulting file structure and number of methylation islands

# chr, star, end, number mCpG
head ../analyses/methylation-characteristics/methylation-islands/methylation-islands_500-02-50.tab
# Number of methylation islands
wc -l ../analyses/methylation-characteristics/methylation-islands/methylation-islands_500-02-50.tab
Contig0	481	1471	21
Contig0	8123	8407	10
Contig0	13826	14435	13
Contig0	35182	35743	16
Contig0	61419	61752	10
Contig0	64123	64431	14
Contig0	65414	65584	10
Contig0	66900	67377	11
Contig0	67564	67974	18
Contig0	69160	69640	11
   48455 ../analyses/methylation-characteristics/methylation-islands/methylation-islands_500-02-50.tab

Filter by MI length (>500 bp) and include MI length in a new column

awk '{if ($3-$2 >= 500) { print $1"\t"$2"\t"$3"\t"$4"\t"$3-$2}}' ../analyses/methylation-characteristics/methylation-islands/methylation-islands_500-02-50.tab \
> ../analyses/methylation-characteristics/methylation-islands/methylation-islands_500-02-50_filtered.tab

#preview resulting file, and count how many MI remain 
head ../analyses/methylation-characteristics/methylation-islands/methylation-islands_500-02-50_filtered.tab
wc -l ../analyses/methylation-characteristics/methylation-islands/methylation-islands_500-02-50_filtered.tab

Count max & min mCpG in an island

awk 'NR==1{max = $4 + 0; next} {if ($4 > max) max = $4;} END {print max}' \
../analyses/methylation-characteristics/methylation-islands/methylation-islands_500-02-50_filtered.tab
awk 'NR==1{min = $4 + 0; next} {if ($4 < min) min = $4;} END {print min}' \
../analyses/methylation-characteristics/methylation-islands/methylation-islands_500-02-50_filtered.tab
720
11

Find average length of islands

awk '{ total += $2 } END { print total/NR }' ../analyses/methylation-characteristics/methylation-islands/methylation-islands_500-02-50_filtered.tab
8983.82

Inspect results of Methylation Island analysis

meth.islands <- read_delim(file=here::here("analyses", "methylation-characteristics", "methylation-islands", "methylation-islands_500-02-50_filtered.tab"), delim = "\t", col_names=FALSE) %>% 
  setNames(c("contig_island", "start_island", "end_island", "n_mCpG_island", "length_island"))

# Save ALL methylated islands to .bed file (for gene enrichment analysis)
write_delim(meth.islands, here::here("analyses", "methylation-characteristics", "methylation-islands", "meth-island-all.bed"), delim = '\t', col_names = F)

# Check out length distribution 
hist(log(meth.islands$length_island))

image

See distribution of LONG methylated islands, count and save as .bed

hist(subset(meth.islands, length_island>1000)$length_island, breaks = 200)
nrow(subset(meth.islands, length_island>1000))
write_delim(subset(meth.islands, length_island>1000), here::here("analyses", "methylation-characteristics", "methylation-islands", "meth-island-long.bed"), delim = '\t', col_names = F)
7379

image

See distribution of SHORT methylated islands, count and save as .bed

hist(subset(meth.islands, length_island<1000)$length_island, breaks = 200)
nrow(subset(meth.islands, length_island<1000))
write_delim(subset(meth.islands, length_island<1000), here::here("analyses", "methylation-characteristics", "methylation-islands", "meth-island-short.bed"), delim = '\t', col_names = F)

image

19142

Find genes located within methylated islands

# LONG methylated islands (MI)
bedtools intersect -wb -a "../analyses/methylation-characteristics/methylation-islands/meth-island-long.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/methylation-characteristics/methylation-islands/meth-island-long-genes.bed

# SHORT methylated islands (MI)
bedtools intersect -wb -a "../analyses/methylation-characteristics/methylation-islands/meth-island-short.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/methylation-characteristics/methylation-islands/meth-island-short-genes.bed

# ALL methylated islands (MI)
bedtools intersect -wb -a "../analyses/methylation-characteristics/methylation-islands/meth-island-all.bed" -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" >  ../analyses/methylation-characteristics/methylation-islands/meth-island-all-genes.bed

Read in genes within methylated island, split “Notes” column and extract Uniprot ID

# Copy Uniprot SPID of genes overlapping with LONG meth islands 
read_delim(here::here("analyses", "methylation-characteristics", "methylation-islands", "meth-island-long-genes.bed"), delim = "\t", col_names = FALSE) %>% 
  setNames(c(colnames(meth.islands), 
             "contig_gene", "source", "feature", "start_gene", "end_gene", 
             "unknown1", "strand", "unknown2", "notes_gene")) %>% 
  mutate(SPID=str_extract(notes_gene, "SPID=(.*?);")) %>% 
  mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>% 
  select(SPID) %>%   na.omit() %>% as.vector() %>% write_clip()

# Copy Uniprot SPID of genes overlapping with SHORT meth islands 
read_delim(here::here("analyses", "methylation-characteristics", "methylation-islands", "meth-island-short-genes.bed"), delim = "\t", col_names = FALSE) %>% select(X14) %>%
  mutate(SPID=str_extract(X14, "SPID=(.*?);")) %>% 
  mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>% 
  select(SPID) %>%   na.omit() %>% as.vector() %>% write_clip()

# Copy Uniprot SPID of genes overlapping with ALL meth islands (my background list)
read_delim(here::here("analyses", "methylation-characteristics", "methylation-islands", "meth-island-all-genes.bed"), delim = "\t", col_names = FALSE) %>% select(X14) %>%
  mutate(SPID=str_extract(X14, "SPID=(.*?);")) %>% 
  mutate(SPID=str_remove(SPID, "SPID=")) %>% mutate(SPID=str_remove(SPID, ";")) %>% 
  select(SPID) %>%   na.omit() %>% as.vector() %>% write_clip()

# Question: many genes contain multiple "short" methylation islands. Do I include those genes multiple times in enrichment analysis? Or do I only include them once? 

Merge enriched GO terms with GO Slims

# Read in the GO Slim table  
GOSlim <- read_delim(here::here("resources", "GO-GOslim.sorted"), delim = '\t', col_names = FALSE) %>% 
  setNames(c("GO", "term", "slim", "category")) %>% 
  mutate_at(vars(category, slim), as.factor)

GO.enriched.MI.short <- read_delim(here::here("analyses", "methylation-characteristics", "methylation-islands", "meth-island-short-enriched-BF.txt"), delim = '\t', col_names = TRUE) %>% 
   separate(Term, into=c("GO", "term"), remove=TRUE,sep = "~") %>% 
   left_join(GOSlim, by=c("GO", "term")) #add slim 

GO.enriched.MI.long <- read_delim(here::here("analyses", "methylation-characteristics", "methylation-islands", "meth-island-long-enriched-BF.txt"), delim = '\t', col_names = TRUE) %>% 
   separate(Term, into=c("GO", "term"), remove=TRUE,sep = "~") %>% 
   left_join(GOSlim, by=c("GO", "term")) #add slim 

Enriched Biological Functions in LONG methylation islands (>1000 bp)

Term Count % PValue List Total Pop Hits FDR
GO:0007156~homophilic cell adhesion via plasma membrane adhesion molecules 34 1.5 4.3 1961 40 0.019897411819458567
GO:0006351~transcription, DNA-templated 252 11.1 0.002 1961 469 1.0
GO:0007155~cell adhesion 60 2.7 0.005 1961 97 1.0
GO:0007186~G-protein coupled receptor signaling pathway 33 1.4 0.005 1961 48 1.0
GO:0007507~heart development 41 1.8 0.01 1961 64 1.0
GO:0007169~transmembrane receptor protein tyrosine kinase signaling pathway 18 0.8 0.02 1961 24 1.0
GO:0050852~T cell receptor signaling pathway 10 0.4 0.02 1961 11 1.0
GO:0006355~regulation of transcription, DNA-templated 206 9.1 0.02 1961 393 1.0
GO:0045944~positive regulation of transcription from RNA polymerase II promoter 106 4.7 0.03 1961 193 1.0
GO:0008203~cholesterol metabolic process 17 0.8 0.03 1961 23 1.0
GO:0008104~protein localization 20 0.9 0.04 1961 29 1.0
GO:0006357~regulation of transcription from RNA polymerase II promoter 47 2.1 0.04 1961 80 1.0
GO:0006397~mRNA processing 49 2.2 0.04 1961 84 1.0
GO:0034220~ion transmembrane transport 17 0.8 0.04 1961 24 1.0
GO:0060071~Wnt signaling pathway, planar cell polarity pathway 7 0.3 0.05 1961 7 1.0
GO:0007528~neuromuscular junction development 14 0.6 0.05 1961 19 1.0
GO:0018108~peptidyl-tyrosine phosphorylation 20 0.9 0.06 1961 30 1.0
GO:0006511~ubiquitin-dependent protein catabolic process 39 1.7 0.06 1961 66 1.0
GO:0006366~transcription from RNA polymerase II promoter 39 1.7 0.06 1961 66 1.0
GO:0042127~regulation of cell proliferation 22 1.0 0.06 1961 34 1.0
GO:0043966~histone H3 acetylation 8 0.4 0.07 1961 9 1.0
GO:0050982~detection of mechanical stimulus 8 0.4 0.07 1961 9 1.0
GO:0072661~protein targeting to plasma membrane 8 0.4 0.07 1961 9 1.0
GO:0008355~olfactory learning 8 0.4 0.07 1961 9 1.0
GO:0007411~axon guidance 34 1.5 0.07 1961 57 1.0
GO:0045893~positive regulation of transcription, DNA-templated 59 2.6 0.07 1961 106 1.0
GO:0007275~multicellular organism development 94 4.2 0.08 1961 176 1.0
GO:0007257~activation of JUN kinase activity 9 0.4 0.08 1961 11 1.0
GO:0030501~positive regulation of bone mineralization 9 0.4 0.08 1961 11 1.0
GO:0046426~negative regulation of JAK-STAT cascade 6 0.3 0.09 1961 6 1.0
GO:0044331~cell-cell adhesion mediated by cadherin 6 0.3 0.09 1961 6 1.0
GO:0070593~dendrite self-avoidance 6 0.3 0.09 1961 6 1.0
GO:0060021~palate development 14 0.6 0.09 1961 20 1.0
GO:0000398~mRNA splicing, via spliceosome 33 1.5 0.09 1961 56 1.0
GO:0070588~calcium ion transmembrane transport 23 1.0 0.09 1961 37 1.0
GO:0016339~calcium-dependent cell-cell adhesion via plasma membrane cell adhesion molecules 10 0.4 0.1 1961 13 1.0
GO:0007157~heterophilic cell-cell adhesion via plasma membrane cell adhesion molecules 10 0.4 0.1 1961 13 1.0
GO:0021591~ventricular system development 10 0.4 0.1 1961 13 1.0

Enriched Biological Functions in SHORT methylation islands (<1000 bp)

Term Count % PValue List Total Pop Hits FDR
GO:0042493~response to drug 55 1.423027166882277 0.022317643564457215 3363 59 1.0
GO:0006974~cellular response to DNA damage stimulus 105 2.716688227684347 0.027144714536471236 3363 118 1.0
GO:0007018~microtubule-based movement 39 1.0090556274256144 0.08008950477964477 3363 42 1.0
GO:0051726~regulation of cell cycle 39 1.0090556274256144 0.08008950477964477 3363 42 1.0
GO:0006979~response to oxidative stress 33 0.8538163001293662 0.08106833809744202 3363 35 1.0
GO:0007264~small GTPase mediated signal transduction 60 1.5523932729624839 0.09132361327223226 3363 67 1.0
GO:0006260~DNA replication 65 1.6817593790426906 0.09411699904720267 3363 73 1.0

Enrichment Analysis using GO MWU

Make input files for GO_MWU

Save loci IDS and GO TERMS for All GENES that were identified that overlap with methylated islands (including +/- 2kb)

read_delim(here::here("analyses", "methylation-characteristics", "methylation-islands", "meth-island-all-genes.bed"), delim = "\t", col_names = FALSE) %>% 
  setNames(c(colnames(meth.islands), 
             "contig_gene", "source", "feature", "start_gene", "end_gene", 
             "unknown1", "strand", "unknown2", "notes_gene")) %>% 
  mutate(island_bps=paste(contig_island, start_island, end_island, sep="_")) %>%
  mutate(GO=str_extract(notes_gene, "Ontology_term=(.*?);")) %>% 
mutate(GO = str_replace(GO, pattern="Ontology_term=",replacement = "")) %>%
  mutate(GO = str_replace(GO, pattern=";",replacement = "")) %>%  
  mutate(GO = str_replace_all(GO, pattern=",",replacement = ";")) %>%
  select(island_bps, GO) %>% drop_na(GO) %>%
  write.table(here::here("analyses", "GO_MWU", "GO_MWU_GO-terms_meth-islands"),sep="\t",quote = F,row.names = F, col.names=F)

Save loci IDS and SIGNIFICANCE (0=significant, 1=not significant) for SHORT methylated islands

# Create a dataframe of "contig_bps" for all genes that overlap with methylation islands (aka the background list of genes), and add a column to indicate significance. Start by adding "1" to all.  
go.mwu.islands.all <- read_delim(here::here("analyses", "methylation-characteristics", "methylation-islands", "meth-island-all-genes.bed"), delim = "\t", col_names = FALSE) %>% 
  setNames(c(colnames(meth.islands), 
             "contig_gene", "source", "feature", "start_gene", "end_gene", 
             "unknown1", "strand", "unknown2", "notes_gene")) %>% 
  mutate(island_bps=paste(contig_island, start_island, end_island, sep="_")) %>% 
  select(island_bps) %>% add_column(sig = c(1)) 

# Generate vector of genes that overlap with long methylated islands, then short islands 
long <- read_delim(here::here("analyses", "methylation-characteristics", "methylation-islands", "meth-island-long-genes.bed"), delim = "\t", col_names = FALSE) %>% 
  mutate(island_bps=paste(X1, X2, X3, sep="_")) %>% select(island_bps) %>% unlist() %>% as.vector()
short <- read_delim(here::here("analyses", "methylation-characteristics", "methylation-islands", "meth-island-short-genes.bed"), delim = "\t", col_names = FALSE) %>% 
  mutate(island_bps=paste(X1, X2, X3, sep="_")) %>% select(island_bps) %>% unlist() %>% as.vector()

# Replace 0's in the significance column for any "contig_bps" that was a LONG meth island, then with a SHORT one 
go.mwu.islands.long <- go.mwu.islands.all
go.mwu.islands.long[which(go.mwu.islands.long$island_bps %in% long),]$sig <- 0

go.mwu.islands.short <- go.mwu.islands.all
go.mwu.islands.short[which(go.mwu.islands.short$island_bps %in% short),]$sig <- 0

# Save significance files for GO MWU 
write.csv(go.mwu.islands.long, here::here("analyses", "GO_MWU", "GO_MWU_signif_meth-islands-long"),quote = F,row.names = F)
write.csv(go.mwu.islands.short, here::here("analyses", "GO_MWU", "GO_MWU_signif_meth-islands-short"),quote = F,row.names = F)

To run GO MWU analysis, open the R file “GO_MWU-islands.R” and follow prompt.

RESULTS of GO MWU analysis: 1 significant GO term in the SHORT methylation islands = cell adhesion (p-adjusted=0.001192251)

Re-do enrichment analysis with revised bp threshold (2000bp) for long vs. short methylation islands

Out of curiosity, I re-did the entire enrichment analysis with LONG methylation islands >2000bp, and SHORT islands <2000 bp. No functions were enriched according to GO MWU. Here are the DAVID results:

LONG islands (>2kb)

image

SHORT islands (<2kb, but >500bp)

No enriched biological functions.

Written on November 30, 2020