Tutorial

For this tutorial we are going to use ultra-long nanopore data from GM12878 aligned to chromosome 20 only. The modbam was generated using megalodon and haplotype tags were assigned using PEPPER.

Download Data

You can download all the files needed to go through this tutorial here.

After extractiong, you should have the following files:

.
├── gencode.v38.annotation.sorted.gtf.gz
├── gencode.v38.annotation.sorted.gtf.gz.tbi
├── genes.bed
├── gm12878_H3K27ac_ENCFF798KYP.bigWig
├── gm12878_H3K4me1_ENCFF190RZM.bigWig
├── gm12878_ul_sup_megalodon_HP_chr20.bam
├── gm12878_ul_sup_megalodon_HP_chr20.bam.bai
└── promoters.bed

Plotting

Basic plot

First let's start with plotting an imprinted region for GNAS gene:

modbamtools plot -r chr20:58815000-58895000 \
    --gtf gencode.v38.annotation.sorted.gtf.gz \
    --out . \
    --prefix gm12878_GNAS \
    --samples GM12878 \
    --track-titles Genes\
    gm12878_ul_sup_megalodon_HP_chr20.bam 

gm12878_GNAS

Haplotype separation

We can group the reads based on HP tag by using --hap option:

modbamtools plot -r chr20:58815000-58895000 \
    --gtf gencode.v38.annotation.sorted.gtf.gz \
    --out . \
    --hap \
    --prefix gm12878_GNAS \
    --samples GM12878 \
    --track-titles Genes\
    gm12878_ul_sup_megalodon_HP_chr20.bam 

gm12878_GNAS_hap

Bigwig tracks

Bigwig tracks can also be added to the plot with --bigwig:

modbamtools plot -r chr20:58820000-58895000 \
    --gtf gencode.v38.annotation.sorted.gtf.gz \
    --out . \
    --hap \
    --bigwig gm12878_H3K27ac_ENCFF798KYP.bigWig \
    --bigwig gm12878_H3K4me1_ENCFF190RZM.bigWig \
    --prefix gm12878_GNAS_hap_h3k27ac_h3k4me1 \
    --samples GM12878 \
    --track-titles Genes,H3K27ac,H3k4me1\
    gm12878_ul_sup_megalodon_HP_chr20.bam 

gm12878_GNAS_hap_h3k27ac_h3k4me1

Cluster plot

We can use --cluster option to perform clustering on the go and group reads based on the assigned cluster. To show an example, we focus on gene FRG1EP (you can find unclustered plot here). For more information about the clustering algorithm see section clustering below.

modbamtools plot -r chr20:29460146-29507179 \
    --gtf gencode.v38.annotation.sorted.gtf.gz \
    --out . \
    --cluster \
    --prefix gm12878_FRG1EP_cluster \
    --track-titles Genes\
    gm12878_ul_sup_megalodon_HP_chr20.bam

gm12878_FRG1EP_cluster

Multiple input regions (Bed)

If we are interested in plotting multiple regions simultaneously, we can feed modbamtools a bed file. Output formats can either be html or pdf in a single report document.

modbamtools plot --batch genes.bed \
    --gtf gencode.v38.annotation.sorted.gtf.gz \
    --out . \
    --hap \
    --prefix gm12878_batch \
    --samples GM12878 \
    --track-titles Genes\
    gm12878_ul_sup_megalodon_HP_chr20.bam

gm12878_batch

Clustering

modbamtools uses a hierarichal density-based spatial clustering of noisy data algorithm (HDBSCAN) to cluster reads based on their modification state. modbamtools cluster takes a bed file as input and appends each line with two columns : number of clusters found, and number of reads used for clustering after a quality control step

Let's try it on our genes.bed file:

modbamtools cluster --bed genes.bed \
    --threads 3 \
    --out genes_clustered.bed \
    gm12878_ul_sup_megalodon_HP_chr20.bam

We can see from the output below that all 3 regions have 2 clusters. We can confirm this by using the plotting commands above.

chr20   29460146    29507179    FRG1EP  unprocessed_pseudogene  ENSG00000282995.1   2   42
chr20   58818918    58850903    GNAS-AS1    lncRNA  ENSG00000235590.7   2   30
chr20   63556085    63576239    HELZ2   protein_coding  ENSG00000130589.16  2   37

Regional methylation calculation

modbamtools calcMeth will accept a bed file as input and append modification stats as extra columns to each region of interest. For each region, Average percent modification is calculated for each read that spans a minimum portion of the region (This can be tweaked by --min_cov. Default is 80%). Each read has to have at least 5 calls to be considered valid for analysis (This can also be changed with --min_calls). Then, Average percent modifcation for the region is calculated using valid reads. The appended columns are outlined below:

Average Percent Modification Standrad Deviation for Avg Percent Modification Coverage

Let's try it on our promoters.bed file:

modbamtools calcMeth --bed promoters.bed \
    --threads 3 \
    --out promoters_calcMeth.bed \
    gm12878_ul_sup_megalodon_HP_chr20.bam
chr20   29496679    29498179    FRG1EP  unprocessed_pseudogene  ENSG00000282995.1   63.00634042910756   23.160648801539175  79
chr20   58850403    58851903    GNAS-AS1    lncRNA  ENSG00000235590.7   58.26586037568643   30.18888731341584   48
chr20   63573739    63575239    HELZ2   protein_coding  ENSG00000130589.16  36.52218479283339   8.51755978542393    44

Haplotype stats

We can also output stats for each haplotype based on HP tag in the modbam with --hap. The added columns when this option is used are the following:

Average Percent Modification Standrad Deviation for Avg Percent Modification Coverage Average Percent Modification (haplotype 1) Standrad Deviation for Avg Percent Modification (haplotype 1) Coverage (haplotype 1) Average Percent Modification (haplotype 2) Standrad Deviation for Avg Percent Modification (haplotype 2) Coverage (haplotype 2)
modbamtools calcMeth --bed promoters.bed \
    --threads 3 \
    --hap \
    --out promoters_calcMeth_hap.bed \
    gm12878_ul_sup_megalodon_HP_chr20.bam
chr20   29496679    29498179    FRG1EP  unprocessed_pseudogene  ENSG00000282995.1   63.00634042910756   23.160648801539175  79  44.27765063921806   18.419436878488273  36  80.732273703281 10.455159906458562  36
chr20   58850403    58851903    GNAS-AS1    lncRNA  ENSG00000235590.7   58.26586037568643   30.18888731341584   48  26.62614126968194   6.853899031338103   22  86.90174211774595   5.326307202363974   20
chr20   63573739    63575239    HELZ2   protein_coding  ENSG00000130589.16  36.52218479283339   8.51755978542393    44  38.93859253773206   8.05201039505814    16  35.09734468342445   8.949332073699683   25