{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transcriptomes\n",
    "==============\n",
    "\n",
    "RNAvigate has some functionality to extract transcript-coordinate data from\n",
    "genomic-coordinate data files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import rnavigate as rnav\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Transcripts\n",
    "-----------\n",
    "\n",
    "First, we need to set up the genome and transcriptome annotations, then we can retreive information about our transcript(s) of interest, here SERPINA1 (Ensembl ID: ENST00000393087.9).\n",
    "\n",
    "As we'll see later, this `Transcript` object provides useful tools on it's own, and can be used with BED files to extract transcript-coordinate profiles or annotations.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "GRCh38 = rnav.transcriptomics.Transcriptome(\n",
    "    genome=\"GCF_000001405.26_GRCh38_genomic.fna\",\n",
    "    annotation=\"MANE.GRCh38.v1.0.ensembl_genomic.gtf\"\n",
    ")\n",
    "\n",
    "SERPINA1 = GRCh38.get_transcript(\"ENST00000393087.9\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "eCLIP Peaks\n",
    "-----------\n",
    "\n",
    "RNAvigate parses BED6 and narrowPeak (BED6+4) files, and includes specific functions to download peak files from the ENCORE eCLIP database.\n",
    "\n",
    "First, we can use `rnav.transcriptomics.download_eclip_peaks` to retreive the eCLIP peaks from\n",
    "[ENCORE](https://www.encodeproject.org/encore-matrix/?type=Experiment&status=released&internal_tags=ENCORE).\n",
    "This downloads one narrowPeak file for each combination of protein target and cell line (K562 and HepG2).\n",
    "We only need to do this once.\n",
    "The data can be saved to a central location and reused in other notebooks.\n",
    "\n",
    "With these files, we can create the eCLIP \"database\" using `rnav.transcriptomics.eCLIPDatabase`.\n",
    "\n",
    "To help us to start thinking about this data, we can display all of the proteins that bind SERPINA1. Binding sites will be displayed in transcript coordinates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eclip_path = \"../../../reference_data/eCLIP_downloads\"\n",
    "# rnav.transcriptomics.download_eclip_peaks(outpath=eclip_path)\n",
    "# rnav.transcriptomics.create_eclip_table(inpath=eclip_path, outpath=eclip_path)\n",
    "eclip = rnav.transcriptomics.eCLIPDatabase(inpath=eclip_path)\n",
    "\n",
    "eclip.print_all_peaks(SERPINA1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating annotations and profiles\n",
    "---------------------------------\n",
    "\n",
    "We will use the methods of `Transcript` and `eCLIPDatabase` to create annotations and profiles, and assign these directly to data keywords.\n",
    "We can use any data keywords we like for this assignment.\n",
    "\n",
    "`eclip.get_eclip_density` will create a per-nucleotide profile.\n",
    "The value of each nucleotide is the total number of eCLIP peaks overlapping that position.\n",
    "This can be useful to get a sense of overall protein binding and which regions may be functional protein-binding scaffolds.\n",
    "\n",
    "`eclip.get_annotation` will create an annotation of protein binding regions for a given protein target and cell line.\n",
    "\n",
    "`transcript.get_cds_annotation` creates a span annotation to highlight the coding sequence.\n",
    "\n",
    "`transcript.get_junctions_annotation` creates a span annotation to highlight exon-exon junctions.\n",
    "Each span is two nucleotides: the 3' end of the 5' exon, and the 5' end of the 3' exon.\n",
    "\n",
    "`transcript.get_exon_annotation` creates a span annotation to highlight a specified exon.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test = rnav.Sample(\n",
    "    sample=\"SERPINA1 mRNA\",\n",
    "    SERPINA1=SERPINA1,\n",
    "    eCLIP=eclip.get_eclip_density(transcript=SERPINA1, cell_line=\"HepG2\"),\n",
    "    cds=SERPINA1.get_cds_annotation(color=\"red\"),\n",
    "    ddx3x=eclip.get_annotation(SERPINA1, \"HepG2\", \"DDX3X\", color=\"blue\"),\n",
    "    junctions=SERPINA1.get_junctions_annotation(color=\"black\"),\n",
    "    exon3=SERPINA1.get_exon_annotation(3),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plotting\n",
    "--------\n",
    "\n",
    "With these profiles and annotations, we can start creating plots.\n",
    "\n",
    "For example, here a profile of eCLIP peak density over SERPINA1.\n",
    "\n",
    "- red bar: coding sequence\n",
    "- blue bars: DDX3X binding regions (in the 5' UTR)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot = rnav.plot_profile(\n",
    "    [test],\n",
    "    sequence=\"SERPINA1\",\n",
    "    profile=\"eCLIP\",\n",
    "    annotations=[\"cds\", \"ddx3x\"],\n",
    ")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "RNAvigate",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.2"
  },
  "nbsphinx": {
   "execute": "never"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}