Loading data

RNAvigate is built around the Sample, which is a grouping of datasets that came from a single RNA studied under a single set of experimental conditions. For example, a Sample could contain a sequence, primer location annotations, a ShapeMapper profile, and a predicted secondary structure for an in-vitro structure probing experiment. A second Sample could contain the same data for an in-vivo experiment.

Creating a Sample and assigning data files to it using data keywords accomplishes 4 tasks:

  1. The data are organized as a Sample.

  2. The data are easy to access via the assigned data keywords.

  3. The data keyword tells RNAvigate how to parse and represent the data as one of the data classes described below.

  4. The data are then compatible with all of RNAvigate’s visualization and analysis tools.

Data class            Description
--------------------  -----------------------------------------------------
sequence              an RNA sequence
annotation            sites or regions of interest along an RNA sequence
secondary structure   the base-pairing pattern of an RNA sequence
tertiary structure    the 3D atomic coordinates of an RNA sequence
profile               per-nucleotide measurements along an RNA sequence
interactions          inter-nucleotide measurements within an RNA sequence

Creating and using a Sample

Samples are created using rnav.Sample().

import rnavigate as rnav               # Load RNAvigate and give it the alias "rnav"

my_sample = rnav.Sample(               # create a new sample
   sample="My sample name",            # provide a name for plot labels
   data_keyword="my_data.txt",         # load data file 1
   data_keyword2="my_other_data.txt",  # load data file 2
)

Above, sample="My sample name" provides a label, to appear in plot titles and legends, for any data that came from this sample. "My sample name" should be replaced with any string that uniquely and succinctly identifies this sample. A sample label is always required.

data_keyword should be replaced with a data keyword appropriate for your specific data (see below).

Then, visualizing this data would look something like this:

plot = rnav.plot_arcs(         # represent my data as an arc plot
   samples=[my_sample],        # visualize my_sample
   sequence="data_keyword",    # positionally align all data to this sequence
   profile="data_keyword",     # display profile data
   structure="data_keyword2",  # display secondary structure
)

plot_arcs can be replaced with other plotting functions, which are introduced in the next guide: Visualizing data.

Before we get into data keywords, rnav.Sample accepts two other arguments: inherit and keep_inherited_defaults. These are used to share data between samples, e.g. a literature-accepted structure shared between experimental samples. This sharing saves on memory and computation time.

Example usage:

shared_data = rnav.Sample(
   sample='shared data',            # a sample label is always required
   keyword1='big_structure.pdb')

sample1 = rnav.Sample(
   sample='knockout',
   inherit=shared_data,             # share keyword1 with this sample
   keyword2='sample1-data.txt')

sample2 = rnav.Sample(
   sample='control',
   inherit=shared_data,             # share keyword1 with this sample
   keyword2='sample2-data.txt')

sample1 and sample2 now both have keyword1, which is shared, and keyword2, which is not.
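To illustrate what sharing means here, the following is a minimal plain-Python sketch. It is not RNAvigate's implementation: MiniSample is a hypothetical stand-in class, used only to show that inherited data can be referenced rather than copied, which is where the savings in memory would come from.

```python
class MiniSample:
    """Hypothetical stand-in for a sample; not part of RNAvigate."""

    def __init__(self, sample, inherit=None, **data):
        self.sample = sample
        self.data = {}
        if inherit is not None:
            # Reuse the parent's data objects instead of copying them.
            self.data.update(inherit.data)
        self.data.update(data)


big_structure = ["coordinates"] * 1000  # placeholder for an expensive object
shared = MiniSample("shared data", keyword1=big_structure)
knockout = MiniSample("knockout", inherit=shared, keyword2="sample1-data")
control = MiniSample("control", inherit=shared, keyword2="sample2-data")

# Both samples point at the very same object; nothing was duplicated.
same_object = knockout.data["keyword1"] is control.data["keyword1"]
```

In this sketch, keyword1 is loaded once and referenced twice, while keyword2 remains distinct for each sample.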

At the moment, default keywords are only used to simplify data keyword inputs. For example, the ringmap data keyword uses the sequence provided by default_profile, which is the first profile-type data provided to the Sample.

Data keywords

A data keyword is either an arbitrary keyword or a standard keyword:

Arbitrary data keywords

An arbitrary keyword is useful if you are loading 2 or more of the same data type into a single sample. Arbitrary keywords must follow some simple rules:

  1. Cannot conflict with a given sample’s other data keywords.

  2. Cannot be inherit or keep_inherited_defaults

  3. Cannot consist only of valid nucleotides: AUCGTaucgt

  4. Cannot start with a number: 0123456789

  5. Must only contain numbers, letters and underscores.
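The rules above can be checked before creating a Sample. The function below is a sketch written for this guide, not an RNAvigate function; its name and the existing_keywords parameter are hypothetical.

```python
import re

# Argument names that rule 2 reserves for rnav.Sample itself.
RESERVED_ARGUMENTS = {"inherit", "keep_inherited_defaults"}


def is_valid_arbitrary_keyword(keyword, existing_keywords=()):
    """Return True if keyword satisfies the five rules listed above."""
    if keyword in existing_keywords:                # rule 1: no conflicts
        return False
    if keyword in RESERVED_ARGUMENTS:               # rule 2: not reserved
        return False
    if re.fullmatch(r"[AUCGTaucgt]+", keyword):     # rule 3: not a sequence
        return False
    if re.match(r"[0-9]", keyword):                 # rule 4: no leading digit
        return False
    # rule 5: only letters, numbers and underscores (also rejects "")
    return re.fullmatch(r"[A-Za-z0-9_]+", keyword) is not None
```

Note that rule 3 also excludes short words such as "cat" or "tag", because every letter in them is a valid nucleotide.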

If an arbitrary data keyword is used, a dictionary must be provided, specifying the standard data keyword to use for parsing inputs.

Example:

my_sample = rnav.Sample(
   sample="example",
   standard_keyword="input_file_1.txt",
   arbitrary_keyword={"standard_keyword": "input_file_2.txt"}
)

Standard data keywords

Sequence data

sequence

an RNA sequence

example uses:

  • aligning data between sequences

  • all data in RNAvigate is associated with a sequence and can be aligned to other data, or vice versa.

input explanation:

  • Input should be a fasta file, a sequence string, or another data keyword. If another data keyword is provided, the sequence from that data is retrieved.
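Because one argument accepts three kinds of input, something has to tell them apart. The sketch below shows one way this could be done; the function name and the order of the checks are assumptions made for illustration and may not match what RNAvigate does internally. It also suggests why an arbitrary data keyword cannot consist only of nucleotide letters.

```python
NUCLEOTIDES = set("AUCGTaucgt")


def classify_sequence_input(value, data_keywords=()):
    """Guess whether value is a data keyword, a sequence string, or a file."""
    if value in data_keywords:
        return "data keyword"
    if value and set(value) <= NUCLEOTIDES:
        # Every character is a nucleotide, so treat it as a literal sequence.
        return "sequence string"
    # Anything else is assumed to be a path to a fasta file.
    return "file path"
```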

example inputs:

# fasta file
my_sample = rnav.Sample(
   sample="example",
   sequence="path/to/my_sequence.fa",
)

# sequence string
my_sample = rnav.Sample(
   sample="example",
   sequence="AUCAGCGCUAUGACUGCGAUGACUGA",
)

# data keyword
my_sample = rnav.Sample(
   sample="example",
   data_keyword="some_data_with_a_sequence",
   sequence="data_keyword",
)


Annotation data

motif

Annotation of occurrences of a sequence motif

example uses:

  • highlighting nucleotides in skyline, profile, arc, circle, or secondary structure diagram plots

  • coloring nucleotides in circle plots, secondary structure diagrams, 3D molecule renderings, or linear regression scatter plots

input explanation:

  • Input should be a dictionary containing:

    • "motif": a string that uses the nucleotide alphabet

      • e.g.: “DRACH” for potential m6A modification sites

      Alphabet    Meaning     Matches
      ----------  ----------  ----------
      A, U, C, G  identity    A, U, C, G
      B           not A       U/C/G
      D           not C       A/U/G
      H           not G       A/U/C
      V           not U       A/C/G
      W           weak        A/U
      S           strong      C/G
      M           amino       A/C
      K           ketone      U/G
      R           purine      A/G
      Y           pyrimidine  U/C
      N           any         A/U/C/G

    • "sequence": same as sequence keyword

    • "color": a valid color or hexcode, e.g. "blue", "grey", or "#fa4ce2"

    • "name": an arbitrary name to use on plots
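The motif alphabet maps naturally onto regular expressions. The sketch below is an illustration only, not RNAvigate's own motif search: the find_motif function and the IUPAC dictionary are names made up for this example. It reports matches as 1-indexed, inclusive spans, and assumes the motif is written in the RNA alphabet from the table.

```python
import re

# One regular-expression fragment per letter of the motif alphabet.
IUPAC = {
    "A": "A", "U": "U", "C": "C", "G": "G",
    "B": "[UCG]", "D": "[AUG]", "H": "[AUC]", "V": "[ACG]",
    "W": "[AU]", "S": "[CG]", "M": "[AC]", "K": "[UG]",
    "R": "[AG]", "Y": "[UC]", "N": "[AUCG]",
}


def find_motif(sequence, motif):
    """Return 1-indexed, inclusive [start, end] spans of every motif match."""
    pattern = "".join(IUPAC[letter] for letter in motif.upper())
    # A lookahead lets overlapping occurrences be reported as well.
    lookahead = re.compile(f"(?=({pattern}))")
    sequence = sequence.upper().replace("T", "U")
    return [
        [match.start() + 1, match.start() + len(motif)]
        for match in lookahead.finditer(sequence)
    ]


spans = find_motif("GGACUAAGGACAUU", "DRACH")  # → [[1, 5], [8, 12]]
```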

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   m6A={
      "motif": "DRACH",
      "sequence": "my_rna.fa",
      "color": "blue",
      "name": "m6A motif"
      }
   )


orfs

Annotation of open-reading frames

example uses:

  • same as motif

  • coming soon: displaying amino acid translation and codon usage scores

input explanation:

  • Input should be a dictionary containing:

    • "orfs":

      • "all" annotates all open-reading frames

      • "longest" annotates only the longest open reading frame

    • "sequence": same as sequence keyword

    • "color": a valid color or hexcode, e.g. "blue", "grey", or "#fa4ce2"

    • "name": an arbitrary name to use on plots

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   main_orf={
      "orfs": "longest",
      "sequence": "my_sequence.fa",
      "color": "green",
      "name": "Longest ORF"
      }
   )


spans

Annotation of any regions of interest

example uses:

input explanation:

  • input is a dictionary containing

    • "spans": a list of lists of 2 integers. Each inner list specifies a start and end position of a span (1-indexed, inclusive)

    • "sequence": same as sequence keyword

    • "color": a valid color or hexcode, e.g. "blue", "grey", or "#fa4ce2"

    • "name": an arbitrary name to use on plots

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   regions={
      "spans": [[10, 13], [65, 72]],
      "sequence": "my_sequence.fa",
      "color": "purple",
      "name": "interesting regions"
      }
   )


sites

Annotation of any sites of interest

example uses:

input explanation:

  • input is a dictionary containing

    • "sites": a list of nucleotide positions (1-indexed, inclusive)

    • "sequence": same as sequence keyword

    • "color": a valid color or hexcode, e.g. "blue", "grey", or "#fa4ce2"

    • "name": an arbitrary name to use on plots

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   m6a_sites={
      "sites": [10, 13, 65, 72],
      "sequence": "my_sequence.fa",
      "color": "purple",
      "name": "m6A sites"
      }
   )


group

Annotation of any group of nucleotides, such as a binding pocket

example uses:

input explanation:

  • input is a dictionary containing

    • "group": a list of nucleotide positions (1-indexed, inclusive)

    • "sequence": same as sequence keyword

    • "color": a valid color or hexcode, e.g. "blue", "grey", or "#fa4ce2"

    • "name": an arbitrary name to use on plots

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   ligand_pocket={
      "group": [10, 13, 65, 72],
      "sequence": "my_sequence.fa",
      "color": "purple",
      "name": "ligand-binding pocket"
      }
   )


primers

Annotation of primer binding sites

example uses:

input explanation:

  • input is a dictionary containing

    • "primers": a list of lists of 2 integers. Each inner list specifies a start and end position of a primer (1-indexed, inclusive). A reverse primer is specified by listing the 3’ -> 5’ start and end, e.g. [300, 278]

    • "sequence": same as sequence keyword

    • "color": a valid color or hexcode, e.g. "blue", "grey", or "#fa4ce2"

    • "name": an arbitrary name to use on plots
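Positions in these annotations are 1-indexed and inclusive, whereas Python slices are 0-indexed and exclude the end. The helper below sketches the conversion and reads a primer's direction from the order of its two positions. The function name is hypothetical and the code is not taken from RNAvigate.

```python
def span_to_subsequence(sequence, span):
    """Return the nucleotides covered by a span and the span's direction."""
    start, end = span
    direction = "reverse" if start > end else "forward"
    low, high = min(start, end), max(start, end)
    # 1-indexed inclusive [low, high] becomes the 0-indexed slice [low-1:high].
    return sequence[low - 1:high], direction


forward = span_to_subsequence("AUCGGCUAAG", [2, 4])  # → ("UCG", "forward")
reverse = span_to_subsequence("AUCGGCUAAG", [9, 7])  # → ("UAA", "reverse")
```

The returned string is the region of the RNA that the primer covers, read 5' to 3'; it is not the reverse complement.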

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   pcr_primers={
      "primers": [[1, 22], [300, 278]],
      "sequence": "my_sequence.fa",
      "color": "purple",
      "name": "primer-binding sites"
      }
   )


domains

Annotation of RNA domains

example uses:

  • same as motif

  • plus: labelling domains across the x-axis of skyline, profile, and arc plots

input explanation:

  • input is a dictionary containing

    • "domains": a list of lists of 2 integers. Each inner list specifies a start and end position of a domain (1-indexed, inclusive).

    • "sequence": same as sequence keyword

    • "colors": a list of valid colors or hexcodes, one per domain, e.g. ["blue", "grey", "#fa4ce2"]

    • "names": a list of arbitrary names to use on plots, one per domain

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   mrna_domains={
      "domains": [[1, 62], [63, 205], [206, 300]],
      "sequence": "my_rna.fa",
      "colors": ["purple", "green", "orange"],
      "names": ["5'UTR", "CDS", "3'UTR"],
      }
   )


Secondary structure data

ss

A secondary structure with optional diagram drawing

example uses:

  • visualizing base pairs on arc plots, circle plots, and secondary structure diagrams

  • calculating contact distances

    • the shortest path between nucleotides in a secondary structure graph

  • determining how well per-nucleotide data predict base pairing status (AUROC)
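Contact distance, as described above, is a shortest path through a graph of nucleotides. The sketch below is a minimal breadth-first search, assuming that every backbone link and every base pair counts as one step. That weighting, along with the function name and arguments, is an assumption made for illustration, and RNAvigate's own calculation may differ.

```python
from collections import deque


def contact_distance(length, base_pairs, start, end):
    """Shortest path (in steps) between two 1-indexed nucleotides."""
    neighbors = {i: set() for i in range(1, length + 1)}
    for i in range(1, length):          # backbone edges: i to i+1
        neighbors[i].add(i + 1)
        neighbors[i + 1].add(i)
    for i, j in base_pairs:             # base-pair edges: i to j
        neighbors[i].add(j)
        neighbors[j].add(i)

    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == end:
            return distances[current]
        for neighbor in neighbors[current]:
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return None  # only reached if end is not a position in the graph


# A 10-nucleotide hairpin with three base pairs.
hairpin = [(1, 10), (2, 9), (3, 8)]
across_helix = contact_distance(10, hairpin, 1, 10)  # → 1
into_loop = contact_distance(10, hairpin, 1, 5)      # → 4
```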

input explanation:

  • Input should be one of the following formats:

    • secondary structure files (no diagram)

      • connection table (.ct)

      • dotbracket notation (.dot, .dbn, etc.)

    • secondary structure diagram files

      • StructureEditor (.nsd or .cte)

      • XRNA (.xrna)

      • VARNA (.varna)

      • FORNA (.json)

        • click “add molecule” and paste in a dotbracket notation structure.

        • arrange it how you like

        • click the download button in the lower-right, then click “json”

      • R2DT (.json)

        • Type in an RNA sequence, R2DT creates the secondary structure

        • Click on the R2DT paper link to learn more about how it works

        • Once the structure is drawn, click “Edit in XRNA”

        • Arrange it how you like it

        • In the upper-left, type in a file name, choose “json”, click “download”

  • Note: The file format is determined by the file extension. Since FORNA and R2DT both produce .json files, specify which one you have using the "extension" input (see other optional inputs, below).
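To make the note above concrete, here is a sketch of choosing a parser from a file extension. The lookup table and the guess_format function are hypothetical, and raising an error for an ambiguous .json file is a design choice made for this example, not documented RNAvigate behavior.

```python
from pathlib import Path

# Illustrative mapping built from the formats listed above.
FORMAT_BY_EXTENSION = {
    ".ct": "connection table",
    ".dot": "dotbracket",
    ".dbn": "dotbracket",
    ".nsd": "StructureEditor",
    ".cte": "StructureEditor",
    ".xrna": "XRNA",
    ".varna": "VARNA",
}


def guess_format(filename, extension=None):
    """Pick a format from the file name unless one is given explicitly."""
    if extension is not None:
        return extension
    suffix = Path(filename).suffix.lower()
    if suffix == ".json":
        # FORNA and R2DT both write json, so the suffix alone is not enough.
        raise ValueError("'.json' is ambiguous: pass extension explicitly")
    return FORMAT_BY_EXTENSION[suffix]
```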

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   ss="my_structure.ct"
   )

other optional inputs:

  • "extension" is used to specify a file extension, e.g. "r2dt" to distinguish an R2DT json file from a FORNA json file.

  • "autoscale" is used to scale coordinates to look good in RNAvigate plots

  • "structure_number" is used to specify which structure to load if the file contains multiple structures (0-indexed). Default is to load the first structure. This currently only works with dotbracket, ct, and NSD files.

typical optional argument examples:

# specify r2dt vs forna json
my_sample = rnav.Sample(
   sample="example",
   ss={
      "ss": "my_rna.json",
      "extension": "r2dt",
   }
)

# specify structure number
my_sample = rnav.Sample(
   sample="example",
   ss={
      "ss": "my_rna.ct",
      "structure_number": 3,
   }
)


Tertiary structure data

pdb

A tertiary structure with atomic coordinates

example uses:

  • rendering 3D molecules with data overlaid

  • computing 3D distances between nucleotides

input explanation:

  • Input should be a dictionary containing these keys:

    • "pdb": a standard PDB file (.pdb or .cif)

    • "chain": a chain ID

    • "sequence": same inputs as sequence keyword

      • This is not needed if a sequence is found in the file header.

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   pdb={
      "pdb": "my_structure.pdb",
      "chain": "X",
      "sequence": "my_rna.fa" # not needed if sequence in pdb header
      }
   )


Profile data

profile

per-nucleotide data that does not have a more specific data keyword:

  • shapemap for SHAPE-, DMS-, or other MaP method

  • dancemap for DanceMapper reactivities

  • rnpmap for RNPMapper data

  • profile for everything else

example uses:

  • visualizing per-nucleotide data on profile, skyline, or arc plots.

  • coloring nucleotides in a secondary structure diagram, circle plot, or 3D molecular rendering

  • calculating profile-to-profile linear regressions

  • calculating ROC curves

  • renormalizing per-nucleotide data

input explanation:

  • These inputs allow a lot of customization in loading data.

  • For a full explanation, see Custom profiles


shapemap

SHAPE, DMS, or other reagent per-nucleotide reactivities

Two similar data keywords:

  • dmsmap applies DMS-MaP normalization to the profile when it is loaded.

  • shapemap_rnaframework accepts an RNAframework xml file.

dmsmap="path/to/shapemap_profile.txt"

is equivalent to

dmsmap={"shapemap": "path/to/shapemap_profile.txt", "normalize": "DMS"}

example uses:

  • same as profile

  • plus: visualizing quality control metrics if a log file is specified

input explanation:

  • Input should be a ShapeMapper2 profile.txt file. This file contains the most complete per-nucleotide data from a ShapeMapper2 run.

example inputs:

my_sample = rnav.Sample(
   sample="example",
   shapemap="shapemap_profile.txt"
)

other optional inputs:

  • "normalize":

    • Defaults to not performing any renormalization.

    • "DMS" will perform DMS-MaP renormalization

    • "eDMS" will perform eDMS-MaP renormalization

    • "boxplot" will perform ShapeMapper2 renormalization (with 1 improvement)

    • By default, renormalization is performed on the HQ_profile and HQ_stderr columns, and overwrites the Norm_profile and Norm_stderr columns

    • Normalization can also be done after rnav.Sample creation. For details, type: help(rnav.data.Profile.normalize)

  • "log" is used to specify a log file. If ShapeMapper2 was run with the --per-read-histograms flag, this file will contain the read length distribution and the mutations-per-read distribution. These data can then be visualized with rnav.plot_QC.

  • "metric", "metric_defaults", "sequence", and "read_table_kw" are explained in Custom profiles, but are not recommended for standard ShapeMapper2 files.

typical optional input example:

my_sample = rnav.Sample(
   sample="example",
   shapemap={
      "shapemap": "shapemap_profile.txt",
      "log": "shapemap_log.txt",
   }
)


dancemap

Reactivity profile of a single component of a DanceMapper model

example uses:

input explanation:

  • Input should be a dictionary containing:

    • "dancemap": the DanceMapper reactivities.txt file

    • "component": which component of the DANCE model to load

  • This works best if each component is a separate rnav.Sample

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   dancemap={
      "dancemap": "mydancemap_reactivities.txt",
      "component": 0},
   )
my_sample_2 = rnav.Sample(
   sample="example2",
   dancemap={
      "dancemap": "mydancemap_reactivities.txt",
      "component": 1},
   )

other optional inputs:

  • "metric", "metric_defaults", "sequence", and "read_table_kw" are explained in Custom profiles, but are not recommended for standard DanceMapper files.


rnpmap

RNP-MaP per-nucleotide reactivities.

RNPMapper software

example uses:

input explanation:

  • Input should be the output csv file from RNPMapper

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   rnpmap="myrnpmap_output.csv",
   )

other optional inputs:

  • "metric", "metric_defaults", "sequence", and "read_table_kw" are explained in Custom profiles, but are not recommended for standard RNPMapper files.


Interactions data

interactions

inter-nucleotide data that does not have a more specific data keyword:

  • ringmap for RingMapper correlations

  • pairmap for PairMapper correlations

  • shapejump for ShapeJump deletion events

  • pairprob for pairing probabilities

  • allpossible for every possible nucleotide pairing from a sequence

  • interactions for everything else

example uses:

  • visualizing interaction networks in arc and circle plots, secondary structure diagrams and 3D molecule renderings

  • filtering interactions based on many different factors.

    • see the filters guide (/guides/filters).

  • calculating a distance distribution histogram of a set of interactions

input explanation:

  • These inputs allow a lot of customization in loading data.

  • For a full explanation, see Custom interactions


ringmap

single-molecule correlations from a DMS- or eDMS-MaP experiment

RingMapper software

example uses:

input explanation:

  • Input should be the correlations file output from RingMapper.

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   ringmap="myringmap_corrs.txt",
   )

other optional inputs:

  • "sequence" is used to specify the sequence, and accepts the same inputs as the sequence keyword

  • "metric", "metric_defaults", "read_table_kw", and "window" are explained in Custom interactions, but are generally not recommended for RingMapper files.

typical optional input example:

my_sample = rnav.Sample(
   sample="example",
   ringmap={
      "ringmap": "myringmap_corrs.txt",
      "sequence": "my_rna.fa"
      }
   )


pairmap

single-molecule correlations from a DMS- or eDMS-MaP experiment reflective of base pairing

PairMapper software (part of RingMapper)

example uses:

input explanation:

  • Input should be the pairmap.txt output file from PairMapper.

example inputs:

my_sample = rnav.Sample(
   sample="example",
   pairmap="mydata_pairmap.txt",
   )

other optional inputs:

  • "sequence" is used to specify the sequence, and accepts the same inputs as the sequence keyword

  • "metric", "metric_defaults", "read_table_kw", and "window" are explained in Custom interactions, but are generally not recommended for PairMapper files

typical optional input example:

my_sample = rnav.Sample(
   sample="example",
   pairmap={
      "pairmap": "mydata_pairmap.txt",
      "sequence": "my_rna.fa",
   }
)


shapejump

ShapeJump inter-nucleotide RT deletion events

ShapeJumper software

example uses:

input explanation:

  • Input should be the deletions.txt output file from ShapeJumper.

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   shapejump="mydata_deletions.txt",
   )

other optional inputs:

  • "sequence" is used to specify the sequence, and accepts the same inputs as the sequence keyword

  • "metric", "metric_defaults", "read_table_kw", and "window" are explained in Custom interactions, but are generally not recommended for ShapeJumper files.

typical optional argument example:

my_sample = rnav.Sample(
   sample="example",
   shapejump={
      "shapejump": "mydata_deletions.txt",
      "sequence": "my_rna.fa",
   }
)


pairprob

inter-nucleotide predicted pairing probabilities

RNAstructure software

example uses:

input explanation:

  • Input should be a dotplot plain text file from running RNAstructure partition followed by ProbabilityPlot with the -t option

For example:

partition my_sequence.fa pair_probabilities.dp
ProbabilityPlot -t pair_probabilities.dp pair_probabilities.txt

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   pairprob={"pairprob": "pair_probabilities.txt",
            "sequence": "my_sequence.fa"}
   )

other optional inputs:

  • "sequence" is used to specify the sequence, and accepts the same inputs as the sequence keyword

  • "metric", "metric_defaults", "read_table_kw", and "window" are explained in Custom interactions, but are generally not recommended for these files.

typical optional argument example:

my_sample = rnav.Sample(
   sample="example",
   pairprob={
      "pairprob": "pair_probabilities.txt",
      "sequence": "my_rna.fa",
   }
)


allpossible

All possible inter-nucleotide pairings for a given sequence

example uses:

  • same as interactions

  • plus: calculating the expected distance distribution of a filtering scheme

input explanation:

  • This keyword has the same expected inputs as the sequence keyword.

  • Note: the size of the data increases with the sequence length squared.
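The quadratic growth mentioned in the note is easy to see by counting pairs. The sketch below assumes the default window of 1 and counts each unordered pair (i, j) with i < j once. The all_possible_pairs function is a hypothetical illustration, and how RNAvigate enumerates pairs internally is not shown here.

```python
from itertools import combinations


def all_possible_pairs(sequence):
    """List every unordered pair of 1-indexed positions in a sequence."""
    positions = range(1, len(sequence) + 1)
    return list(combinations(positions, 2))


# Number of pairs for sequences of 10, 100 and 1000 nucleotides.
sizes = {n: len(all_possible_pairs("A" * n)) for n in (10, 100, 1000)}
# → {10: 45, 100: 4950, 1000: 499500}
```

A tenfold increase in sequence length gives roughly a hundredfold increase in the number of pairs, so this keyword is best used on shorter RNAs or together with filtering.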

example inputs:

my_sample_1 = rnav.Sample(
   sample="example1",
   allpossible="my_rna.fa"
)

other optional inputs:

  • "window" is used to specify the window size of the interacting regions.

    • Default: "window": 1 means nucleotide i to nucleotide j

    • "window": 3 means nucleotides i:i+3 to nucleotides j:j+3

  • "metric", "metric_defaults", and "read_table_kw" are explained in Custom interactions.

typical optional argument example:

my_sample = rnav.Sample(
   sample="example",
   allpossible={
      "allpossible": "my_rna.fa",
      "window": 3,
   }
)
