Transcriptomes

RNAvigate has some functionality to extract transcript-coordinate data from genomic-coordinate data files.

[ ]:

import rnavigate as rnav

Transcripts

First, we need to set up the genome and transcriptome annotations, then we can retreive information about our transcript(s) of interest, here SERPINA1 (Ensembl ID: ENST00000393087.9).

As we’ll see later, this Transcript object provides useful tools on it’s own, and can be used with BED files to extract transcript-coordinate profiles or annotations.

[ ]:

GRCh38 = rnav.transcriptomics.Transcriptome(
    genome="GCF_000001405.26_GRCh38_genomic.fna",
    annotation="MANE.GRCh38.v1.0.ensembl_genomic.gtf"
)

SERPINA1 = GRCh38.get_transcript("ENST00000393087.9")

eCLIP Peaks

RNAvigate parses BED6 and narrowPeak (BED6+4) files, and includes specific functions to download peak files from the ENCORE eCLIP database.

First, we can use rnav.transcriptomics.download_eclip_peaks to retreive the eCLIP peaks from ENCORE. This downloads one narrowPeak file for each combination of protein target and cell line (K562 and HepG2). We only need to do this once. The data can be saved to a central location and reused in other notebooks.

With these files, we can create the eCLIP “database” using rnav.transcriptomics.eCLIPDatabase.

To help us to start thinking about this data, we can display all of the proteins that bind SERPINA1. Binding sites will be displayed in transcript coordinates.

[ ]:

eclip_path = "../../../reference_data/eCLIP_downloads"
# rnav.transcriptomics.download_eclip_peaks(outpath=eclip_path)
# rnav.transcriptomics.create_eclip_table(inpath=eclip_path, outpath=eclip_path)
eclip = rnav.transcriptomics.eCLIPDatabase(inpath=eclip_path)

eclip.print_all_peaks(SERPINA1)

Creating annotations and profiles

We will use the methods of Transcript and eCLIPDatabase to create annotations and profiles, and assign these directly to data keywords. We can use any data keywords we like for this assignment.

eclip.get_eclip_density will create a per-nucleotide profile. The value of each nucleotide is the total number of eCLIP peaks overlapping that position. This can be useful to get a sense of overall protein binding and which regions may be functional protein-binding scaffolds.

eclip.get_annotation will create an annotation of protein binding regions for a given protein target and cell line.

transcript.get_cds_annotation creates a span annotation to highlight the coding sequence.

transcript.get_junctions_annotation creates a span annotation to highlight exon-exon junctions. Each span is two nucleotides: the 3’ end of the 5’ exon, and the 5’ end of the 3’ exon.

transcript.get_exon_annotation creates a span annotation to highlight a specified exon.

[ ]:

test = rnav.Sample(
    sample="SERPINA1 mRNA",
    SERPINA1=SERPINA1,
    eCLIP=eclip.get_eclip_density(transcript=SERPINA1, cell_line="HepG2"),
    cds=SERPINA1.get_cds_annotation(color="red"),
    ddx3x=eclip.get_annotation(SERPINA1, "HepG2", "DDX3X", color="blue"),
    junctions=SERPINA1.get_junctions_annotation(color="black"),
    exon3=SERPINA1.get_exon_annotation(3),
)

Plotting

With these profiles and annotations, we can start creating plots.

For example, here a profile of eCLIP peak density over SERPINA1.

red bar: coding sequence
blue bars: DDX3X binding regions (in the 5’ UTR)

[ ]:

plot = rnav.plot_profile(
    [test],
    sequence="SERPINA1",
    profile="eCLIP",
    annotations=["cds", "ddx3x"],
)