rnavigate.data package
Submodules
rnavigate.data.alignments module
Alignment objects map coordinates, vectors, and dataframes to a new sequence
Classes
- BaseAlignment (ABC)
abstract base class for alignments
- SequenceAlignment (BaseAlignment)
aligns one sequence to another sequence
- RegionAlignment (BaseAlignment)
cuts a sequence between a start and end position
- AlignmentChain (BaseAlignment)
allows chaining of above alignments
- class rnavigate.data.alignments.AlignmentChain(*alignments)
Bases: BaseAlignment
Combines a list of alignments into one.
Parameters
- alignments : list of Alignment objects
the alignments to chain together
Attributes
- alignments : list
the constituent alignments
- starting_sequence : str
starting sequence of alignments[0]
- target_sequence : str
target sequence of alignments[-1]
- mapping : numpy.array
an array which maps from starting_sequence to target_sequence: index i of starting_sequence corresponds to index mapping[i] of target_sequence
- get_inverse_alignment()
Alignments require a method to get the inverted alignment
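To illustrate how chaining can work, here is a minimal sketch that composes index mappings in plain Python. It assumes each mapping is a list in which -1 marks an unmapped position; `chain_mappings` is a hypothetical helper, not part of RNAvigate.

```python
def chain_mappings(mappings):
    """Compose index mappings in order; -1 marks an unmapped position."""
    combined = list(mappings[0])
    for next_map in mappings[1:]:
        # follow each mapped index through the next alignment
        combined = [next_map[i] if i != -1 else -1 for i in combined]
    return combined


# first alignment: 4-nt sequence -> 3-nt sequence (index 1 is unmapped)
first = [0, -1, 1, 2]
# second alignment: 3-nt sequence -> 5-nt sequence (index 2 is unmapped)
second = [2, 3, -1]
chained = chain_mappings([first, second])
print(chained)  # [2, -1, 3, -1]
```

A position that is unmapped at any step stays unmapped in the combined mapping.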
- class rnavigate.data.alignments.BaseAlignment(starting_sequence, target_length)
Bases: ABC
Abstract base class for alignments
Parameters
- starting_sequence : string
the sequence to be aligned
- target_length : int
the length of the target sequence
Attributes
- starting_sequence : string
the beginning sequence
- mapping : numpy.array
the alignment map array: index i of starting_sequence corresponds to index mapping[i] of target_sequence
- target_sequence : string
the portion of starting sequence that is mapped
- target_length : integer
the length of the target sequence
- abstractmethod get_inverse_alignment()
Alignments require a method to get the inverted alignment
- abstractmethod get_mapping()
Alignments require a mapping from starting to target sequence
- get_target_sequence()
Gets the portion of starting sequence that fits the alignment
- map_dataframe(dataframe, position_columns)
Takes a dataframe and maps position columns to target sequence.
Rows with unmapped positions are dropped.
Parameters
- dataframe : pandas.DataFrame
a dataframe with position columns
- position_columns : list of str
a list of columns containing positions to map
Returns
- pandas.DataFrame
a new dataframe (copy) with position columns mapped or dropped
- map_indices(indices, keep_minus_one=True)
Takes a list of indices (0-indexed) and maps them to the target sequence
Parameters
- indices : int or list of int
a single index or a list of integer indices
- keep_minus_one : bool, defaults to True
whether to keep unmapped starting sequence indices (-1) in the returned array.
Returns
- numpy.array
the equivalent indices in target sequence
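The behavior described above can be sketched without NumPy. This is an illustration only: the `mapping` list and the standalone `map_indices` function are assumptions made for the example, and the real method returns a numpy array.

```python
def map_indices(mapping, indices, keep_minus_one=True):
    """Map 0-indexed positions through a mapping list (-1 = unmapped)."""
    if isinstance(indices, int):
        indices = [indices]
    mapped = [mapping[i] for i in indices]
    if not keep_minus_one:
        # drop indices that have no equivalent in the target sequence
        mapped = [i for i in mapped if i != -1]
    return mapped


mapping = [0, -1, 1, 2]
kept = map_indices(mapping, [0, 1, 3])
dropped = map_indices(mapping, [0, 1, 3], keep_minus_one=False)
print(kept)     # [0, -1, 2]
print(dropped)  # [0, 2]
```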
- map_nucleotide_dataframe(dataframe, position_column='Nucleotide', sequence_column='Sequence')
Takes a per-nt dataframe and maps it to the target sequence.
Dataframe must have 1 row per nucleotide in starting sequence, with a position column and a sequence column. Dataframe is mapped to have the same format, but for target sequence nucleotides and positions.
Parameters
- dataframe : pandas.DataFrame
a per-nucleotide dataframe
- position_column : string, defaults to “Nucleotide”
name of the position column.
- sequence_column : string, defaults to “Sequence”
name of the sequence column.
Returns
- pandas.DataFrame
a new dataframe (copy) mapped to target sequence. Unmapped starting sequence positions are dropped and unmapped target sequence positions are filled.
- map_positions(positions, keep_zero=True)
Takes a list of positions (1-indexed) and maps them to the target sequence
Parameters
- positions : int or list of int
a single position or a list of integer positions
- keep_zero : bool, defaults to True
whether to keep unmapped starting sequence positions (0) in the returned array.
Returns
- numpy.array
the equivalent positions in target sequence
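Positions are 1-indexed, so an unmapped position comes back as 0 rather than -1. The sketch below shows one way to express that shift on top of a 0-indexed mapping; the function and the example mapping are hypothetical.

```python
def map_positions(mapping, positions, keep_zero=True):
    """Map 1-indexed positions using a 0-indexed mapping (-1 = unmapped)."""
    if isinstance(positions, int):
        positions = [positions]
    # shift to 0-index, look up, shift back; unmapped (-1) becomes 0
    mapped = [mapping[p - 1] + 1 for p in positions]
    if not keep_zero:
        mapped = [p for p in mapped if p != 0]
    return mapped


mapping = [0, -1, 1, 2]
with_zero = map_positions(mapping, [1, 2, 4])
without_zero = map_positions(mapping, [1, 2, 4], keep_zero=False)
print(with_zero)     # [1, 0, 3]
print(without_zero)  # [1, 3]
```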
- map_values(values, fill=nan)
Takes an array with one value per starting sequence position and maps it to the target sequence. Unmapped positions in the starting sequence are dropped, and unmapped positions in the target sequence are filled with the fill value.
Parameters
- values : iterable
values to map to target sequence.
- fill : any, defaults to np.nan
a value for unmapped positions in target sequence.
Returns
- numpy.array
an array of values equal in length to target sequence
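A minimal sketch of this drop-and-fill behavior, using plain lists instead of numpy arrays. The explicit `target_length` argument is an assumption made to keep the example self-contained.

```python
import math


def map_values(mapping, values, target_length, fill=math.nan):
    """Place per-position values onto the target sequence."""
    mapped = [fill] * target_length
    for start_idx, target_idx in enumerate(mapping):
        if target_idx != -1:  # unmapped starting positions are dropped
            mapped[target_idx] = values[start_idx]
    return mapped


mapping = [0, -1, 1, 3]
result = map_values(mapping, [0.1, 0.2, 0.3, 0.4], target_length=5, fill=None)
print(result)  # [0.1, 0.3, None, 0.4, None]
```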
- class rnavigate.data.alignments.SequenceAlignment(sequence1, sequence2, align_kwargs=None, full=False, use_previous=True)
Bases: BaseAlignment
The most useful feature of RNAvigate. Maps positions from one sequence to a totally different sequence using user-defined pairwise alignment or automatic pairwise alignment.
Parameters
- sequence1 : string
the sequence to be aligned
- sequence2 : string
the sequence to align to
- align_kwargs : dict, defaults to None
a dictionary of arguments to pass to pairwise2.align.globalms
- full : bool, defaults to False
whether to keep unmapped starting sequence positions.
- use_previous : bool, defaults to True
whether to use previously set alignments
Attributes
- sequence1 : str
the sequence to be aligned
- sequence2 : str
the sequence to align to
- alignment1 : str
the alignment string matching sequence1 to sequence2
- alignment2 : str
the alignment string matching sequence2 to sequence1
- starting_sequence : str
sequence1
- target_sequence : str
sequence2 if full is False, else alignment2
- mapping : numpy.array
the alignment map array: index i of starting_sequence corresponds to index mapping[i] of target_sequence
- get_alignment()
Gets an alignment that has either been user-defined or previously calculated or produces a new pairwise alignment between two sequences.
Returns
- alignment1, alignment2 : tuple of 2 str
the alignment strings matching sequence1 and sequence2, respectively.
- get_inverse_alignment()
Gets an alignment that maps from sequence2 to sequence1.
- get_mapping()
Calculates a mapping from starting sequence to target sequence.
Returns
- mapping : numpy.array
an array that maps each index of starting_sequence to an index of target_sequence: index i of starting_sequence corresponds to index mapping[i] of target_sequence
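One way such a mapping can be derived from two gapped alignment strings is sketched below. It assumes the target is the ungapped sequence2 (the `full=False` case) and uses -1 for unmapped positions; `mapping_from_alignment` is a hypothetical name, not RNAvigate's implementation.

```python
def mapping_from_alignment(alignment1, alignment2):
    """Build a starting-to-target index mapping from gapped strings."""
    mapping = []
    target_idx = 0
    for nt1, nt2 in zip(alignment1, alignment2):
        if nt1 != "-":
            # a starting nucleotide: mapped unless it faces a gap
            mapping.append(target_idx if nt2 != "-" else -1)
        if nt2 != "-":
            target_idx += 1
    return mapping


mapping = mapping_from_alignment("AC-GU", "A-UGU")
print(mapping)  # [0, -1, 2, 3]
```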
- print(print_format='full')
Print the alignment in a human-readable format.
Parameters
- print_format : “full”, “cigar”, “long” or “short”, defaults to “full”
how to format the alignment.
“full”: the full length alignment with changes labeled “X”
“cigar”: the CIGAR string
“long”: locations and sequences of each change
“short”: total number of matches, mismatches, and indels
- print_all_changes()
Print location and sequence of all changes.
- print_cigar()
Print the CIGAR string
- print_number_of_changes()
Print the total numbers of matches, mismatches, and indels.
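For readers unfamiliar with CIGAR strings, the sketch below run-length encodes the columns of a gapped alignment. The operation letters follow the SAM convention ("=" match, "X" mismatch, "I" insertion, "D" deletion) with sequence2 treated as the reference; that choice is an assumption, and RNAvigate's own output may differ.

```python
from itertools import groupby


def cigar_from_alignment(alignment1, alignment2):
    """Run-length encode alignment columns as a CIGAR-like string."""
    ops = []
    for nt1, nt2 in zip(alignment1, alignment2):
        if nt1 == "-":
            ops.append("D")  # present only in sequence2
        elif nt2 == "-":
            ops.append("I")  # present only in sequence1
        elif nt1 == nt2:
            ops.append("=")
        else:
            ops.append("X")
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


cigar = cigar_from_alignment("AAGC-UU", "AUGCAU-")
print(cigar)  # 1=1X2=1D1=1I
```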
- class rnavigate.data.alignments.StructureAlignment(sequence1, sequence2, structure1=None, structure2=None, full=False)
Bases: BaseAlignment
Experimental secondary structure alignment based on the RNAlign2D algorithm (https://doi.org/10.1186/s12859-021-04426-8)
Parameters
- sequence1 : string
the sequence to be aligned
- sequence2 : string
the sequence to align to
- structure1 : string, defaults to None
the secondary structure of sequence1
- structure2 : string, defaults to None
the secondary structure of sequence2
- full : bool, defaults to False
whether to align to full length of sequence2 or just mapped length
Attributes
- sequence1 : str
the sequence to be aligned
- sequence2 : str
the sequence to align to
- structure1 : str
the secondary structure of sequence1
- structure2 : str
the secondary structure of sequence2
- alignment1 : str
the alignment string matching sequence1 to sequence2
- alignment2 : str
the alignment string matching sequence2 to sequence1
- starting_sequence : str
sequence1
- target_sequence : str
sequence2 if full is False, else alignment2
- mapping : numpy.array
the alignment map array: index i of starting_sequence corresponds to index mapping[i] of target_sequence
- get_alignment()
Aligns pseudo-amino-acid sequences according to RNAlign2D rules.
Returns
- alignment1, alignment2 : tuple of 2 str
the alignment strings matching sequence1 and sequence2, respectively.
- get_inverse_alignment()
Gets an alignment that maps from sequence2 to sequence1.
- get_mapping()
Calculates a mapping from starting sequence to target sequence.
Returns
- mapping : numpy.array
an array which maps indices of the starting sequence to the target sequence: starting_sequence[idx] == target_sequence[self.mapping[idx]]
- set_as_default_alignment()
Set this as the default alignment between sequence1 and sequence2.
- rnavigate.data.alignments.convert_sequence(aas, nts, dbn)
Convert pseudo-amino-acid sequence to nucleotide and dotbracket or vice versa.
Parameters
- aas : string or True
the amino acid sequence. If True, returns the amino acid translation of nts and dbn.
- nts : string or True
the nucleotide sequence. If True, returns the nucleotide translation of aas.
- dbn : string or True
the dot-bracket notation string. If True, returns the dot-bracket translation of aas.
Returns
- string
sequence of the specified translation. If nts and dbn are True, returns a tuple.
Example
convert_sequence(aas="ACDEFGHIKLMNPQRSTVWY", nts=True, dbn=True) returns ("AAAAACCCCCUUUUUGGGGG", "([.])([.])([.])([.])")
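The example above implies a one-to-one table between pseudo-amino acids and (nucleotide, dot-bracket) pairs. The sketch below rebuilds that table from the example strings and converts in both directions. The helper names are hypothetical, and the table is inferred from the documented example rather than taken from the RNAvigate source.

```python
AAS = "ACDEFGHIKLMNPQRSTVWY"
NTS = "AAAAACCCCCUUUUUGGGGG"
DBN = "([.])([.])([.])([.])"

# each pseudo-amino acid stands for one (nucleotide, dot-bracket) pair
AA_TO_PAIR = {aa: (nt, db) for aa, nt, db in zip(AAS, NTS, DBN)}
PAIR_TO_AA = {pair: aa for aa, pair in AA_TO_PAIR.items()}


def to_pseudo_amino_acids(nts, dbn):
    """Encode a sequence and its structure as one pseudo-amino-acid string."""
    return "".join(PAIR_TO_AA[(nt, db)] for nt, db in zip(nts, dbn))


def from_pseudo_amino_acids(aas):
    """Decode a pseudo-amino-acid string into (nts, dbn)."""
    pairs = [AA_TO_PAIR[aa] for aa in aas]
    return "".join(nt for nt, _ in pairs), "".join(db for _, db in pairs)


encoded = to_pseudo_amino_acids("GGAACC", "((..))")
decoded = from_pseudo_amino_acids(encoded)
print(encoded)  # SSDDLL
print(decoded)  # ('GGAACC', '((..))')
```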
- rnavigate.data.alignments.lookup_alignment(sequence1, sequence2, t_or_u='U')
Look up a previously set alignment in the _alignments_cache
Parameters
- sequence1 : string
The first sequence to align
- sequence2 : string
The second sequence to be aligned to
- t_or_u : “T”, “U”, or False, defaults to “U”
“T” converts “U”s to “T”s
“U” converts “T”s to “U”s
False does nothing
Returns
- dictionary, if an alignment is found, otherwise None
{“seqA”: sequence1 with gap characters representing alignment,
“seqB”: sequence2 with gap characters representing alignment}
- rnavigate.data.alignments.set_alignment(sequence1, sequence2, alignment1, alignment2, t_or_u='U')
Add an alignment to be used as the default between two sequences.
When objects with these sequences are aligned for visualization, RNAvigate uses this alignment instead of an automated pairwise sequence alignment. alignment1 and alignment2 must have matching lengths. alignment(1,2) and sequence(1,2) must differ only by dashes “-”.
e.g.:
sequence1 ="AAGCUUCGGUACAUGCAAGAUGUAC"
sequence2 ="AUCGAUCGAGCUGCUGUGUACGUAC"
alignment1="AAGCUUCG---------GUACAUGCAAGAUGUAC"
alignment2="AUCGAUCGAGCUGCUGUGUAC---------GUAC"
             |mm|   | indel |    | indel |
Parameters
- sequence1 : string
the first sequence
- sequence2 : string
the second sequence
- alignment1 : string
first sequence, plus dashes “-” indicating indels
- alignment2 : string
second sequence, plus dashes “-” indicating indels
- t_or_u : “T”, “U”, or False, defaults to “U”
“T” converts “U”s to “T”s, “U” converts “T”s to “U”s, False does nothing
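The two rules stated above (equal alignment lengths, and alignments that differ from their sequences only by dashes) can be checked before an alignment is registered. The sketch below is a hypothetical validator, not an RNAvigate function. It uses the example sequences from this entry, with each gap written as nine dashes, a length inferred so that both rules hold.

```python
def check_alignment(sequence1, sequence2, alignment1, alignment2):
    """Raise ValueError if the alignment strings break either rule."""
    if len(alignment1) != len(alignment2):
        raise ValueError("alignment1 and alignment2 must have the same length")
    if alignment1.replace("-", "") != sequence1:
        raise ValueError("alignment1 must equal sequence1 plus dashes")
    if alignment2.replace("-", "") != sequence2:
        raise ValueError("alignment2 must equal sequence2 plus dashes")
    return True


sequence1 = "AAGCUUCGGUACAUGCAAGAUGUAC"
sequence2 = "AUCGAUCGAGCUGCUGUGUACGUAC"
alignment1 = "AAGCUUCG" + "-" * 9 + "GUACAUGCAAGAUGUAC"
alignment2 = "AUCGAUCGAGCUGCUGUGUAC" + "-" * 9 + "GUAC"
is_valid = check_alignment(sequence1, sequence2, alignment1, alignment2)
print(is_valid)  # True
```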
- rnavigate.data.alignments.set_multiple_sequence_alignment(fasta, set_pairwise=False)
Set alignments from a multiple sequence alignment Pearson fasta file.
Sets alignments to a base sequence, then returns the base sequence, to be used when a multiple sequence alignment plot is desired. Also sets all pairwise alignments, if desired. When setting pairwise alignments, dashes that are shared between pairwise sequences are removed first.
Parameters
- fasta : string
location of Pearson fasta file
- set_pairwise : bool, defaults to False
whether to set every pairwise alignment as well as the multiple sequence alignment.
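Removing shared dashes means dropping every alignment column in which both sequences have a gap, since such columns only exist to accommodate other sequences in the multiple sequence alignment. A minimal sketch, with a hypothetical helper name:

```python
def remove_shared_gaps(aligned1, aligned2):
    """Drop columns where both aligned sequences contain a dash."""
    kept = [
        (a, b)
        for a, b in zip(aligned1, aligned2)
        if not (a == "-" and b == "-")
    ]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


pair = remove_shared_gaps("AC--GU-A", "A-C-GUCA")
print(pair)  # ('AC-GU-A', 'A-CGUCA')
```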
rnavigate.data.annotation module
annotations.py contains Annotations and subclasses.
- class rnavigate.data.annotation.Annotation(input_data, annotation_type, sequence, name=None, color='blue')
Bases: Sequence
Basic annotation class to store 1D features of an RNA sequence
Each feature type must be a separate instance. Feature types include:
- a group of separated nucleotides (e.g. binding pocket)
- regions of interest (e.g. coding sequence, Alu elements)
- sites of interest (e.g. m6A locations)
- primer binding sites
Parameters
- input_data : list
List will be treated according to annotation_type argument. Expected behaviors for each value of annotation_type:
“sites” or “group”: 1-indexed location of sites of interest
example: [1, 10, 20, 30] is four sites, 1, 10, 20, and 30
“spans”: 1-indexed, inclusive locations of spans of interest
example: [[1, 10], [20, 30]] is two spans, 1 to 10 and 20 to 30
“primers”: Similar to spans, but 5’/3’ direction is preserved.
example: [[1, 10], [30, 20]] forward 1 to 10, reverse 30 to 20
- annotation_type : “group”, “sites”, “spans”, or “primers”
The type of annotation.
- sequence : str or pandas.DataFrame
Nucleotide sequence, path to fasta file, or dataframe containing a “Sequence” column.
- name : str, defaults to None
Name of annotation.
- color : matplotlib color-like, defaults to “blue”
Color to be used for displaying this annotation on plots.
Attributes
- data : pandas.DataFrame
Stores the list of sites or regions
- name : str
The label for this annotation for use on plots
- color : valid matplotlib color
Color to represent annotation on plots
- sequence : str
The reference sequence string
- property boolean
Return a boolean array of the annotation on the sequence.
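As an illustration of how 1-indexed sites and inclusive spans translate into a per-nucleotide boolean array, consider the following sketch. The helper functions are hypothetical and use plain lists rather than the class's dataframe.

```python
def boolean_from_sites(sites, length):
    """Mark 1-indexed sites as True in a list of the given length."""
    mask = [False] * length
    for site in sites:
        mask[site - 1] = True
    return mask


def boolean_from_spans(spans, length):
    """Mark 1-indexed, inclusive spans as True."""
    mask = [False] * length
    for start, end in spans:
        start, end = sorted((start, end))  # primers may be listed 3' to 5'
        for position in range(start, end + 1):
            mask[position - 1] = True
    return mask


site_mask = boolean_from_sites([1, 4], length=5)
span_mask = boolean_from_spans([[2, 3], [5, 4]], length=5)
print(site_mask)  # [True, False, False, True, False]
print(span_mask)  # [False, True, True, True, True]
```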
- classmethod from_boolean_array(values, sequence, annotation_type, name, color='blue', window=1)
Create an Annotation from an array of boolean values.
True values are used to create the Annotation.
Parameters
- values : list of True or False
the boolean array
- sequence : string or rnav.data.Sequence
the sequence of the Annotation
- annotation_type : “spans”, “sites”, “primers”, or “group”
the type of the new annotation. If “spans” or “primers”, adjacent True values, or values within window, are collapsed into a region.
- name : string
a name for labelling the annotation.
- color : string, defaults to “blue”
a color for plotting the annotation
- window : integer, defaults to 1
a window around True values to include in the annotation.
Returns
- rnavigate.data.Annotation
the new Annotation
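Collapsing adjacent True values into spans can be sketched as a single pass over the array. This example ignores the window argument, whose exact semantics are not spelled out here, and the function name is hypothetical.

```python
def spans_from_boolean(values):
    """Collapse runs of True into 1-indexed, inclusive [start, end] spans."""
    spans = []
    start = None
    for position, value in enumerate(values, start=1):
        if value and start is None:
            start = position  # a new run begins
        elif not value and start is not None:
            spans.append([start, position - 1])  # the run just ended
            start = None
    if start is not None:
        spans.append([start, len(values)])  # run reaches the last position
    return spans


spans = spans_from_boolean([True, True, False, False, True, False, True, True])
print(spans)  # [[1, 2], [5, 5], [7, 8]]
```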
- from_sites(sites)
Create the self.data dataframe from a list of sites.
- from_spans(spans)
Create the self.data dataframe from a list of spans.
- get_aligned_data(alignment)
Aligns this Annotation to a new sequence and returns a copy.
Parameters
- alignment : rnavigate.data.Alignment
Alignment object used to align to a new sequence.
Returns
- rnavigate.data.Annotation
A new Annotation with the same name, color, and annotation type, but with the input data aligned to the target sequence.
- get_sites()
Returns a list of nucleotide positions included in this annotation.
Returns
- sites : tuple
a list of nucleotide positions
- get_subsequences(buffer=0)
- class rnavigate.data.annotation.Motif(input_data, sequence, name=None, color='blue')
Bases: Annotation
Automatically annotates the occurrences of a sequence motif as spans.
Parameters
- input_data : str
sequence motif to search for. Uses conventional nucleotide codes. e.g. “DRACH” = [AGTU] [AG] A C [ATUC]
- sequence : str or pandas.DataFrame
Nucleotide sequence, path to fasta file, or dataframe containing a “Sequence” column.
- name : str, defaults to None
Name of annotation.
- color : matplotlib color-like, defaults to “blue”
Color to be used for displaying this annotation on plots.
Attributes
- data : pandas.DataFrame
Stores the list of regions that match the motif
- name : str
The label for this annotation for use on plots
- color : valid matplotlib color
Color to represent annotation on plots
- sequence : str
The reference sequence string
- get_aligned_data(alignment)
Searches the new sequence for the motif and returns a new Motif annotation.
Parameters
- alignment : rnavigate.data.Alignment
Alignment object used to align to a new sequence.
Returns
- rnavigate.data.Motif
A new Motif with the same name, color, and motif but with the input data aligned to the target sequence.
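A motif written in conventional (IUPAC) nucleotide codes can be turned into a regular expression, as the “DRACH” example suggests. The sketch below shows one way to do this and report 1-indexed, inclusive spans. The code table and function name are illustrative assumptions, not the Motif class's implementation, and reporting overlapping matches is a choice made for this example.

```python
import re

# IUPAC nucleotide codes; T and U are treated as interchangeable
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "TU", "T": "TU",
    "R": "AG", "Y": "CTU", "S": "CG", "W": "ATU", "K": "GTU", "M": "AC",
    "B": "CGTU", "D": "AGTU", "H": "ACTU", "V": "ACG", "N": "ACGTU",
}


def motif_spans(motif, sequence):
    """Return 1-indexed, inclusive spans of every motif occurrence."""
    pattern = "".join(f"[{IUPAC[code]}]" for code in motif.upper())
    # the lookahead lets overlapping occurrences be reported as well
    regex = re.compile(f"(?=({pattern}))")
    return [
        [match.start() + 1, match.start() + len(motif)]
        for match in regex.finditer(sequence.upper())
    ]


spans = motif_spans("DRACH", "GGACUAAACA")
print(spans)  # [[1, 5], [6, 10]]
```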
- class rnavigate.data.annotation.ORFs(input_data, name=None, sequence=None, color='blue')
Bases: Annotation
Automatically annotates occurrences of open reading frames (ORFs) as spans.
Parameters
- input_data : “longest” or “all”
which ORFs to annotate. “longest” annotates the longest ORF. “all” annotates all potential ORFs.
- sequence : str or pandas.DataFrame
Nucleotide sequence, path to fasta file, or dataframe containing a “Sequence” column.
- name : str, defaults to None
Name of annotation.
- color : matplotlib color-like, defaults to “blue”
Color to be used for displaying this annotation on plots.
Attributes
- data : pandas.DataFrame
Stores the list of regions that match the motif
- name : str
The label for this annotation for use on plots
- color : valid matplotlib color
Color to represent annotation on plots
- sequence : str
The reference sequence string
- get_aligned_data(alignment)
Searches the new sequence for ORFs and returns a new ORF annotation.
Parameters
- alignment : rnavigate.data.Alignment
Alignment object used to align to a new sequence.
Returns
- rnavigate.data.ORFs
A new ORFs annotation with the same name, color, and input_data but with the input data aligned to the target sequence.
- get_spans_from_orf(sequence, which='all')
Given a sequence string, returns spans for specified ORFs
Parameters
- sequence : string
RNA nucleotide sequence
- which : “longest” or “all”, defaults to “all”
“all” returns all spans, “longest” returns the longest span
Returns
- list of tuples
(start, end) position of each ORF, 1-indexed, inclusive
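For orientation, a simple ORF search can be sketched as follows. It assumes that an ORF starts at AUG, ends at the first in-frame UAA, UAG, or UGA, and that the reported span includes the stop codon. These are assumptions for illustration; the method's actual rules may differ.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def get_orf_spans(sequence, which="all"):
    """Return 1-indexed, inclusive (start, end) spans of candidate ORFs."""
    sequence = sequence.upper().replace("T", "U")
    spans = []
    for start in range(len(sequence) - 2):
        if sequence[start:start + 3] != "AUG":
            continue
        # walk codon by codon in the same frame as the start codon
        for codon_start in range(start + 3, len(sequence) - 2, 3):
            if sequence[codon_start:codon_start + 3] in STOP_CODONS:
                spans.append((start + 1, codon_start + 3))
                break
    if which == "longest" and spans:
        return [max(spans, key=lambda span: span[1] - span[0])]
    return spans


all_orfs = get_orf_spans("AUGAAAUGCCCUAAGUGA")
longest = get_orf_spans("AUGAAAUGCCCUAAGUGA", which="longest")
print(all_orfs)  # [(1, 18), (6, 14)]
print(longest)   # [(1, 18)]
```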
- rnavigate.data.annotation.domains(input_data, names, colors, sequence)
Create a list of Annotations from a list of spans.
Currently, domains functionality in RNAvigate just uses a list of spans. In the future, this should be a dedicated class. Generally, domains should cover an entire sequence without overlap, but this is not enforced. e.g. [[1, 100], [101, 200]] for a 200 nt sequence.
Parameters
- input_data : list of lists
list of spans for each domain
- names : list of strings
list of names for each domain
- colors : list of valid matplotlib colors
list of colors for each domain
- sequence : string
sequence to be annotated
Returns
- list of rnavigate.data.Annotation
list of Annotations
rnavigate.data.colors module
- class rnavigate.data.colors.ScalarMappable(cmap, normalization, values, title='', tick_labels=None, **cbar_args)
Bases: _ScalarMappable
Used to map scalar values to a color and to create a colorbar plot.
Parameters
- cmap : str, tuple, float, or list
A valid mpl color, list of valid colors or a valid colormap name
- normalization : “min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- values : list
The values to use when normalizing the data
- title : str, defaults to “”
The title of the colorbar.
- tick_labels : list, defaults to None
The labels to use for the colorbar ticks. If None, values are determined automatically.
- **cbar_args : dict
Additional arguments to pass to the colorbar function
Attributes
- rnav_norm : str
The type of normalization to use when mapping values to colors
- rnav_vals : list
The values to use when normalizing the data
- rnav_cmap : list
The colors to use when mapping values to colors
- cbar_args : dict
Additional arguments to pass to the colorbar function
- tick_labels : list
The labels to use for the colorbar ticks. If None, values are determined automatically.
- title : str
The title of the colorbar.
- get_cmap(cmap)
Converts a cmap specification to a matplotlib colormap object.
Parameters
- cmap : string, tuple, float, or list
A valid mpl color, list of valid colors or a valid colormap name
Returns
- matplotlib colormap
a colormap matching the input
- get_norm(normalization, values, cmap)
Given a normalization type and values, return a normalization object.
Parameters
- normalization : “min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- values : list
The values to use when normalizing the data
- cmap : matplotlib colormap
The colormap to use when normalizing the data
Returns
- matplotlib.colors normalization object
Used to normalize data before mapping to colors
rnavigate.data.data module
Classes for storing and manipulating data for RNAvigate.
This module contains the base classes for RNAvigate data classes:
- Sequence: represents a nucleotide sequence
- Data: represents a data table with a sequence
- class rnavigate.data.data.Data(input_data, sequence, metric, metric_defaults, read_table_kw=None, name=None)
Bases: Sequence
The base class for RNAvigate Profile and Interactions classes.
Parameters
- input_data : pandas.DataFrame or str
a pandas dataframe or path to a data file
- sequence : string or rnavigate.data.Sequence
the sequence to use for the data
- metric : string or dict
the column of the dataframe to use as the default metric to visualize
- metric_defaults : dict
a dictionary of metric defaults
- read_table_kw : dict, optional
kwargs dictionary passed to pd.read_table
- name : string, optional
the name of the data, defaults to None
Attributes
- data : pandas.DataFrame
the data table
- filepath : string
the path to the data file
- sequence : string or rnavigate.data.Sequence
the sequence to use for the data
- metric : string or dict
the column of the dataframe to use as the metric to visualize
- metric_defaults : dict
A dictionary of metric values and default settings for visualization
- default_metric : string
the default metric to use for visualization
- add_metric_defaults(metric_defaults)
Add metric defaults to self.metric_defaults
- property cmap
Get the colormap to use for colorbars and to retrieve colors.
- property color_column
Get the column of the dataframe to use as the color for visualization.
- property colors
Get one matplotlib color-like value for each nucleotide in self.sequence.
- property error_column
Get the column of the dataframe to use as the error for visualization.
- property metric
Get the column of the dataframe to use as the metric for visualization.
- class rnavigate.data.data.Sequence(input_data, name=None, entry=0)
Bases: object
A class for storing and manipulating RNA sequences.
Parameters
- input_data : string or pandas.DataFrame
sequence string, fasta file, or a Pandas dataframe containing a “Sequence” column
- name : string, optional
The name of the sequence, defaults to None
- entry : int, defaults to 0
The index of the sequence in the fasta file if a fasta file is provided
Attributes
- sequence : string
The sequence string
- name : string
The name of the sequence
- other_info : dict
A dictionary of additional information about the sequence
- null_alignment : SequenceAlignment
An alignment of the sequence to itself
- get_aligned_data(alignment)
Get a copy of the sequence positionally aligned to another sequence.
Parameters
- alignment : rnavigate.data.Alignment
the alignment to use
Returns
- aligned_sequence : rnavigate.data.Sequence
the aligned sequence
- get_colors(source, pos_cmap='rainbow', profile=None, structure=None, annotations=None)
Get colors and colormap representing information about the sequence.
Parameters
- source : str, list, or matplotlib color-like
the source of the color information
if a string, must be one of: “sequence”, “position”, “profile”, “structure”, “annotations”
if a list, must be a list of matplotlib color-like values, colormap will be None.
if a matplotlib color-like value, all nucleotides will be colored that color, colormap will be None.
- pos_cmap : str, defaults to “rainbow”
cmap used for position colors if source is “position”
- profile : rnavigate.data.Profile, optional
the profile to use to get colors if source is “profile”
- structure : rnavigate.data.SecondaryStructure, optional
the structure to use to get colors if source is “structure”
- annotations : list of rnavigate.data.Annotations, optional
the annotations to use to get colors if source is “annotations”
Returns
- colors : numpy array
one matplotlib color-like value for each nucleotide in self.sequence
- colormap : rnavigate.data.ScalarMappable
a colormap used for creating a colorbar
- get_colors_from_annotations(annotations, default_color='gray')
Get colors and colormap representing sequence annotations.
Parameters
- annotations : list of rnavigate.data.Annotations
the annotations to use to get colors.
- default_color : matplotlib color-like, defaults to “gray”
the color to use for nucleotides not in any annotation
Returns
- colors : numpy array
one matplotlib color-like value for each nucleotide in self.sequence
- colormap : rnavigate.data.ScalarMappable
a colormap used for creating a colorbar
- get_colors_from_positions(pos_cmap='rainbow')
Get colors and colormap representing the nucleotide position.
Parameters
- pos_cmap : str, defaults to “rainbow”
cmap used for position colors
Returns
- colors : numpy array
one matplotlib color-like value for each nucleotide in self.sequence
- colormap : rnavigate.data.ScalarMappable
a colormap used for creating a colorbar
- get_colors_from_profile(profile)
Get colors and colormap representing per-nucleotide data.
Parameters
- profile : rnavigate.data.Profile
the profile to use to get colors.
Returns
- colors : numpy array
one matplotlib color-like value for each nucleotide in self.sequence
- colormap : rnavigate.data.ScalarMappable
a colormap used for creating a colorbar
- get_colors_from_sequence()
Get colors and colormap representing the nucleotide sequence.
Returns
- colors : numpy array
one matplotlib color-like value for each nucleotide in self.sequence
- colormap : rnavigate.data.ScalarMappable
a colormap used for creating a colorbar
- get_colors_from_structure(structure)
Get colors and colormap representing base-pairing status.
Parameters
- structure : rnavigate.data.SecondaryStructure
the structure to use to get colors.
Returns
- colors : numpy array
one matplotlib color-like value for each nucleotide in self.sequence
- colormap : rnavigate.data.ScalarMappable
a colormap used for creating a colorbar
- get_region(region='all')
Checks region input for validity and returns start and end positions.
If region is “all”, returns 1, self.length. Otherwise, ensures that region is between these values and returns the values, sorted.
Parameters
- region : list of 2 int, defaults to “all”
start and end positions of the region
Returns
- start, end : int, int
the starting and ending positions
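The validation described above can be sketched as a small standalone function. Passing the sequence length explicitly and raising ValueError for out-of-range regions are assumptions made for this example.

```python
def get_region(length, region="all"):
    """Return sorted, validated 1-indexed start and end positions."""
    if region == "all":
        return 1, length
    start, end = sorted(region)
    if start < 1 or end > length:
        raise ValueError(f"region must lie between 1 and {length}")
    return start, end


whole = get_region(100)
part = get_region(100, [50, 10])
print(whole)  # (1, 100)
print(part)   # (10, 50)
```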
- get_region_data(region='all')
Get a copy of the data object containing only the specified region.
Parameters
- region : list of 2 int, defaults to “all”
start and end positions of the region
Returns
- region_data : rnavigate.data.Sequence
the sequence containing only the specified region
- get_seq_from_dataframe(dataframe)
Parse a dataframe for the sequence string, store as self.sequence.
Parameters
- dataframe : pandas.DataFrame
must contain a “Sequence” column
- normalize_sequence(t_or_u='U', uppercase=True)
Converts sequence to all uppercase nucleotides and corrects T or U.
Parameters
- t_or_u : “T”, “U”, or False, defaults to “U”
“T” converts “U”s to “T”s
“U” converts “T”s to “U”s
False does nothing.
- uppercase : bool, defaults to True
Whether to make sequence all uppercase
- rnavigate.data.data.normalize_sequence(sequence, t_or_u='U', uppercase=True)
Returns sequence as all uppercase nucleotides and/or corrects T or U.
Parameters
- sequence : string or RNAvigate Sequence
The sequence. If given an RNAvigate Sequence, the sequence string is retrieved.
- t_or_u : “T”, “U”, or False, defaults to “U”
“T” converts “U”s to “T”s
“U” converts “T”s to “U”s
False does nothing
- uppercase : bool, defaults to True
Whether to make sequence all uppercase
Returns
- string
the cleaned-up sequence string
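A minimal sketch of the clean-up described above, written as a standalone function that accepts only plain strings (handling of RNAvigate Sequence objects is omitted):

```python
def normalize_sequence(sequence, t_or_u="U", uppercase=True):
    """Optionally uppercase a sequence and unify T/U usage."""
    if uppercase:
        sequence = sequence.upper()
    if t_or_u == "U":
        sequence = sequence.replace("T", "U").replace("t", "u")
    elif t_or_u == "T":
        sequence = sequence.replace("U", "T").replace("u", "t")
    return sequence


as_rna = normalize_sequence("acgt")
as_dna = normalize_sequence("ACGU", t_or_u="T")
untouched = normalize_sequence("acgt", t_or_u=False, uppercase=False)
print(as_rna, as_dna, untouched)  # ACGU ACGT acgt
```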
rnavigate.data.interactions module
- class rnavigate.data.interactions.AllPossible(sequence, metric='data', input_data=None, metric_defaults=None, read_table_kw=None, window=1, name=None)
Bases: Interactions
A class for storing and manipulating all possible interactions.
Parameters
- sequence : string or rnavigate.data.Sequence
The sequence string corresponding to the pairing probability data.
- metric : string, defaults to “data”
The column name to use for visualization.
- metric_defaults : dict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column” : string
the column name to use for visualization
- “cmap” : string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization” : “min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values” : list of float
The values to use with normalization of the data
- “title” : string
the title to use for colorbars
- “extend” : “min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels” : list of string
- read_table_kw : dict, optional
kwargs passed to pandas.read_table() when reading input_data.
- window : int, defaults to 1
The window size used to generate the pairing probability data.
- name : str, optional
A name for the AllPossible object.
Attributes
- data : pandas.DataFrame
The pairing probability data.
- class rnavigate.data.interactions.Interactions(input_data, sequence, metric, metric_defaults, read_table_kw=None, window=1, name=None)
Bases: Data
A class for storing and manipulating interactions data.
Parameters
- input_data : string or pandas.DataFrame
If string, a path to a file containing interactions data. If dataframe, the dataframe containing interactions data. The dataframe must contain columns “i”, “j”, and self.metric. Dataframe may also include other columns.
- sequence : string or rnavigate.data.Sequence
The sequence string corresponding to the interactions data.
- metric : string
The column name to use for visualization.
- metric_defaults : dict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column” : string
the column name to use for visualization
- “cmap” : string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization” : “min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values” : list of float
The values to use with normalization of the data
- “title” : string
the title to use for colorbars
- “extend” : “min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels” : list of string
- read_table_kw : dict
kwargs passed to pandas.read_table() when reading input_data.
- window : int
The window size used to generate the interactions data.
- name : str
The name of the data object.
Attributes
- data : pandas.DataFrame
The interactions data.
- window : int
The window size that is being represented by i-j pairs.
- copy(apply_filter=False)
Returns a copy of the interactions, optionally with masked rows removed.
Parameters
- apply_filter : bool, defaults to False
If True, masked rows (“mask” == False) are dropped.
Returns
- rnavigate.data.Interactions
A copy of the interactions.
- count_filter(**kwargs)
Counts the number of interactions that pass the given filters.
- data_specific_filter(**kwargs)
Does nothing for the base Interactions class; can be overwritten in subclasses.
Returns
- dict
dictionary of keyword argument pairs
- filter(prefiltered=False, reset_filter=True, structure=None, min_cd=None, max_cd=None, paired_only=False, ss_only=False, ds_only=False, profile=None, min_profile=None, max_profile=None, compliments_only=False, nts=None, max_distance=None, min_distance=None, exclude_nts=None, isolate_nts=None, resolve_conflicts=None, **kwargs)
Convenience function that applies the above filters simultaneously.
Parameters
- prefilteredbool, defaults to False
If True, the mask is not updated.
- reset_filterbool, defaults to True
If True, the mask is reset before applying filters.
- structurernavigate.data.SecondaryStructure, defaults to None
The structure to use for filtering.
- min_cdint, defaults to None
The minimum contact distance to allow.
- max_cdint, defaults to None
The maximum contact distance to allow.
- paired_onlybool, defaults to False
If True, only keep interactions that are paired in the structure.
- ss_onlybool, defaults to False
If True, only keep interactions between single-stranded nucleotides.
- ds_onlybool, defaults to False
If True, only keep interactions between double-stranded nucleotides.
- profilernavigate.data.Profile, defaults to None
The profile to use for masking.
- min_profilefloat, defaults to None
The minimum profile value to allow.
- max_profilefloat, defaults to None
The maximum profile value to allow.
- compliments_onlybool, defaults to False
If True, only keep interactions where i and j are complimentary nucleotides.
- ntsstr, defaults to None
If compliment_only is False, only keep interactions where i and j are in nts.
- max_distanceint, defaults to None
The maximum distance to allow. If None, no maximum distance is set.
- min_distanceint, defaults to None
The minimum distance to allow. If None, no minimum distance is set.
- exclude_ntslist of int, defaults to None
A list of positions to exclude.
- isolate_ntslist of int, defaults to None
A list of positions to isolate.
- resolve_conflictsstr, defaults to None
If not None, conflicting windows are resolved using the Maximal Weighted Independent Set. The weights are taken from the metric value. The graph is first broken into components to speed up the identification of the MWIS. Then the mask is updated to only include the MWIS.
- **kwargsdict
Each keyword should have the format “column_operator” where column is a valid column name of the dataframe and operator is one of:
“ge”: greater than or equal to
“le”: less than or equal to
“gt”: greater than
“lt”: less than
“eq”: equal to
“ne”: not equal to
The values given to these keywords are then used in the comparison, and rows for which the comparison is False are filtered out. For example, self.mask_on_values(Statistic_ge=23) evaluates to self.update_mask(self.data["Statistic"] >= 23).
Returns
- masknumpy array
a boolean array of the same length as self.data
- get_aligned_data(alignment, apply_filter=True)
Returns a copy mapped to a new sequence with masked rows removed.
Parameters
- alignmentrnavigate.data.SequenceAlignment
The alignment to use for mapping the interactions.
- apply_filterbool, defaults to True
If True, masked rows (“mask” == False) are dropped.
Returns
- rnavigate.data.Interactions
Interactions mapped to a new sequence.
- get_ij_colors()
Gets i, j, and colors lists for plotting interactions.
i and j are the 5’ and 3’ ends of each interaction, and colors is the color to use for each interaction. Values of self.data[self.metric] are normalized to 0 to 1, which correspond to self.min_max values. These are then mapped to a color using self.cmap.
Returns
- ilist
5’ ends of each interaction
- jlist
3’ ends of each interaction
- colorslist
colors to use for each interaction
- get_sorted_data()
Returns a copy of the data sorted by self.metric.
Returns
- pandas.DataFrame
a copy of the data sorted by self.metric
- mask_on_distance(max_dist=None, min_dist=None)
Mask interactions based on their distance in sequence space.
Parameters
- max_distint, defaults to None
The maximum distance to allow. If None, no maximum distance is set.
- min_distint, defaults to None
The minimum distance to allow. If None, no minimum distance is set.
Returns
- masknumpy array
a boolean array of the same length as self.data
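As a rough illustration of masking on distance in sequence space, the sketch below assumes the distance of an interaction is the absolute difference between its i and j positions and that both bounds are inclusive. Neither assumption is confirmed by this reference, and distance_mask is a hypothetical helper, not part of rnavigate.

```python
def distance_mask(pairs, min_dist=None, max_dist=None):
    """Return one boolean per (i, j) pair; True means the pair is kept.

    Assumption: distance is |j - i| and both bounds are inclusive.
    """
    mask = []
    for i, j in pairs:
        distance = abs(j - i)
        keep = True
        if min_dist is not None and distance < min_dist:
            keep = False
        if max_dist is not None and distance > max_dist:
            keep = False
        mask.append(keep)
    return mask


pairs = [(1, 5), (10, 60), (20, 25)]  # distances 4, 50, and 5
print(distance_mask(pairs, min_dist=5))
print(distance_mask(pairs, max_dist=10))
```

A bound left as None is simply skipped, which matches the "no maximum/minimum distance is set" wording above.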
- mask_on_position(exclude=None, isolate=None)
Mask interactions based on their i and j positions.
Parameters
- excludelist of int, defaults to None
A list of positions to exclude.
- isolatelist of int, defaults to None
A list of positions to isolate.
Returns
- masknumpy array
a boolean array of the same length as self.data
- mask_on_profile(profile, min_profile=None, max_profile=None)
Masks interactions based on per-nucleotide measurements.
Parameters
- profilernavigate.data.Profile
The profile to use for masking.
- min_profilefloat, defaults to None
The minimum profile value to allow.
- max_profilefloat, defaults to None
The maximum profile value to allow.
Returns
- masknumpy array
a boolean array of the same length as self.data
- mask_on_sequence(compliment_only=None, nts=None)
Mask interactions based on sequence.
Parameters
- compliment_onlybool, defaults to None
If True, only keep interactions where i and j are complementary nucleotides.
- ntsstr, defaults to None
If compliment_only is False, only keep interactions where i and j are in nts.
Returns
- numpy array
a boolean array of the same length as self.data
- mask_on_structure(structure, min_cd=None, max_cd=None, ss_only=False, ds_only=False, paired_only=False)
Masks interactions based on a secondary structure.
Parameters
- structurernavigate.data.SecondaryStructure
The secondary structure to use for masking.
- min_cdint, defaults to None
The minimum contact distance to allow.
- max_cdint, defaults to None
The maximum contact distance to allow.
- ss_onlybool, defaults to False
If True, only keep interactions between single-stranded nucleotides.
- ds_onlybool, defaults to False
If True, only keep interactions between double-stranded nucleotides.
- paired_onlybool, defaults to False
If True, only keep interactions that are paired in the structure.
Returns
- masknumpy array
a boolean array of the same length as self.data
- mask_on_values(**kwargs)
Mask interactions based on values in self.data.
Parameters
- kwargsdict
Each keyword should have the format “column_operator” where column is a valid column name of the dataframe and operator is one of:
“ge”: greater than or equal to
“le”: less than or equal to
“gt”: greater than
“lt”: less than
“eq”: equal to
“ne”: not equal to
The values given to these keywords are then used in the comparison, and rows for which the comparison is False are filtered out. For example, self.mask_on_values(Statistic_ge=23) evaluates to self.update_mask(self.data["Statistic"] >= 23).
Returns
- masknumpy array
a boolean array of the same length as self.data
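The “column_operator” keyword format can be sketched in plain Python as follows. This is an illustration only: mask_from_keywords and OPERATORS are hypothetical names, a dict of lists stands in for the dataframe, and splitting the keyword at its last underscore is an assumption about how the column name and operator are separated.

```python
import operator

# Hypothetical lookup table: operator suffix -> comparison function.
OPERATORS = {
    "ge": operator.ge,
    "le": operator.le,
    "gt": operator.gt,
    "lt": operator.lt,
    "eq": operator.eq,
    "ne": operator.ne,
}


def mask_from_keywords(columns, **kwargs):
    """Return one boolean per row; True where every comparison holds.

    `columns` maps column names to equal-length lists of values.
    """
    n_rows = len(next(iter(columns.values())))
    mask = [True] * n_rows
    for key, threshold in kwargs.items():
        # Assumption: split at the last underscore, so column names
        # may themselves contain underscores.
        column, _, suffix = key.rpartition("_")
        compare = OPERATORS[suffix]
        mask = [
            keep and compare(value, threshold)
            for keep, value in zip(mask, columns[column])
        ]
    return mask


data = {"Statistic": [10, 23, 40], "Zij": [1.5, 2.5, 0.5]}
print(mask_from_keywords(data, Statistic_ge=23))
print(mask_from_keywords(data, Statistic_ge=23, Zij_gt=1))
```

Several keywords combine with a logical AND, so a row survives only if it passes every comparison.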
- print_new_file(outfile=None)
Create a new file with mapped and filtered interactions.
Parameters
- outfilestr, defaults to None
path to an output file. If None, file string is printed to console.
- reset_mask()
Resets the mask to all True (removes previous filters)
- resolve_conflicts(metric=None)
Uses an experimental method to resolve conflicts.
Resolves conflicting windows using the Maximal Weighted Independent Set. The weights are taken from the metric value. The graph is first broken into components to speed up the identification of the MWIS. Then the mask is updated to only include the MWIS. This method is computationally expensive for large or dense datasets.
Parameters
- metricstr, defaults to None
The metric to use for weighting the graph. If None, self.metric is used.
Returns
- masknumpy array
a boolean array of the same length as self.data
- set_3d_distances(pdb, atom)
Wrapper for set_distances for backwards compatibility.
- set_distances(structure, atom="O2'")
Sets the Distance column value based on nt distances in the given structure.
If structure is a SecondaryStructure, contact distances are calculated, and if structure is a PDB, 3D distances are calculated. These distances are averaged across the window and stored in a new “Distance” column in self.data.
Parameters
- structurernavigate.data.SecondaryStructure or rnavigate.data.PDB
Structure object to use for calculating distances
- atomstr
atom id to use for calculating distances in a PDB structure
- update_mask(mask)
Updates the mask by ANDing the current mask with the given mask.
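A minimal sketch of the AND-based update, using plain Python lists in place of the boolean array. The function below is a stand-in written for illustration and is not the library method.

```python
def update_mask(current, new):
    """Combine two boolean masks element-wise with logical AND."""
    return [a and b for a, b in zip(current, new)]


mask = [True, True, True, True]                        # state after a reset
mask = update_mask(mask, [True, False, True, True])   # first filter
mask = update_mask(mask, [True, True, False, True])   # second filter
print(mask)
```

Because each update can only turn True into False, successive filters accumulate until the mask is reset.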
- class rnavigate.data.interactions.PAIRMaP(input_data, sequence=None, metric='Class', metric_defaults=None, read_table_kw=None, window=1, name=None)
Bases: RINGMaP
A class for storing and manipulating PAIRMaP data.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing PAIRMaP data. If dataframe, the dataframe containing PAIRMaP data. The dataframe must contain columns “i”, “j”, “Statistic”, and “Class”. Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the PAIRMaP data.
- metricstring, defaults to “Class”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use when normalizing the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
The labels to use for the colorbar ticks.
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the PAIRMaP data. If an input file is provided, this value is overwritten by the value in the header.
- namestr, optional
A name for the interactions object.
Attributes
- datapandas.DataFrame
The PAIRMaP data.
- data_specific_filter(all_pairs=False, **kwargs)
Used by Interactions.filter(). By default, pairs that are neither primary nor secondary are removed. Setting all_pairs=True keeps all pairs.
Parameters
- all_pairsbool, defaults to False
whether to include all PAIRs.
Returns
- kwargsdict
any additional keyword-argument pairs are returned
- masknumpy array
a boolean array of the same length as self.data
- class rnavigate.data.interactions.PairingProbability(input_data, extension=None, sequence=None, metric='Probability', metric_defaults=None, read_table_kw=None, window=1, name=None)
Bases: Interactions
A class for storing and manipulating pairing probability data.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing pairing probability data. If dataframe, the dataframe containing pairing probability data. The dataframe must contain columns “i”, “j”, “Probability”, and “log10p”. Dataframe may also include other columns.
- extensionstring, defaults to None
The file extension of the input_data. If None, the extension is determined from the input_data string. Options are “.bps”, “.txt”, and “.dp”. If the extension is “.bps”, the sequence is parsed from the file. If the extension is “.txt” or “.dp”, the sequence must be provided via the sequence argument.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the pairing probability data.
- metricstring, defaults to “Probability”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use when normalizing the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
The labels to use for the colorbar ticks.
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the pairing probability data.
- namestr, optional
A name for the PairingProbability object.
Attributes
- datapandas.DataFrame
The pairing probability data.
- data_specific_filter(**kwargs)
By default, interactions with probabilities less than 0.03 are removed.
Returns
- kwargsdict
any additional keyword-argument pairs are returned
- masknumpy array
a boolean array of the same length as self.data
- get_entropy_profile(print_out=False, save_file=None)
Calculates per-nucleotide Shannon entropy from pairing probabilities.
Parameters
- print_outbool, defaults to False
If True, entropy values are printed to console.
- save_filestr, defaults to None
If not None, entropy values are saved to this file.
Returns
- rnavigate.data.Profile
a Profile object containing the entropy data
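The exact formula behind get_entropy_profile is not given in this reference. The sketch below shows one common way to compute a per-nucleotide Shannon entropy from pairing probabilities. It assumes each nucleotide's entropy is the sum of -p * log10(p) over the pairs it takes part in, and that the unpaired probability is ignored. entropy_profile is a hypothetical function and the probabilities are made up.

```python
import math


def entropy_profile(pairs, length):
    """Per-nucleotide entropy from (i, j, probability) tuples (1-indexed).

    Assumption: entropy = sum of -p * log10(p) over each nucleotide's pairs.
    """
    entropy = [0.0] * length
    for i, j, p in pairs:
        if p <= 0:
            continue  # a zero-probability pair contributes nothing
        term = -p * math.log10(p)
        entropy[i - 1] += term
        entropy[j - 1] += term
    return entropy


# Made-up probabilities for a 10-nucleotide RNA.
pairs = [(1, 10, 0.5), (1, 9, 0.25), (2, 8, 1.0)]
for position, value in enumerate(entropy_profile(pairs, 10), start=1):
    print(position, round(value, 4))
```

Nucleotides with a single, certain partner (probability 1.0) get an entropy of 0, while nucleotides that split their probability among several partners score higher.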
- class rnavigate.data.interactions.RINGMaP(input_data, sequence=None, metric='Statistic', metric_defaults=None, read_table_kw=None, window=1, name=None)
Bases: Interactions
A class for storing and manipulating RINGMaP data.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing RINGMaP data. If dataframe, the dataframe containing RINGMaP data. The dataframe must contain columns “i”, “j”, “Statistic”, and “Zij”. Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the RINGMaP data.
- metricstring, defaults to “Statistic”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use when normalizing the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
The labels to use for the colorbar ticks.
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the RINGMaP data. If an input file is provided, this value is overwritten by the value in the header.
- namestr, optional
A name for the interactions object.
Attributes
- datapandas.DataFrame
The RINGMaP data.
- data_specific_filter(positive_only=False, negative_only=False, **kwargs)
Adds filters for the “Sign” column to the parent filter() function.
Parameters
- positive_onlybool, defaults to False
If True, only keep positive correlations.
- negative_onlybool, defaults to False
If True, only keep negative correlations.
Returns
- kwargsdict
any additional keyword-argument pairs are returned
- masknumpy array
a boolean array of the same length as self.data
- get_sorted_data()
Sorts on the product of the self.metric and “Sign” columns, except when self.metric is “Distance”.
Returns
- pandas.DataFrame
a copy of the data sorted by (self.metric * “Sign”) columns
- read_file(filepath, read_table_kw=None)
Parses a RINGMaP correlations file and stores data as a dataframe.
Also sets self.window (usually 1, from header).
Parameters
- filepathstr
path to correlations file.
- read_table_kwdict, defaults to None
kwargs passed to pandas.read_table().
Returns
- pandas.DataFrame
the RINGMaP data
- class rnavigate.data.interactions.SHAPEJuMP(input_data, sequence=None, metric='Percentile', metric_defaults=None, read_table_kw=None, window=1, name=None)
Bases: Interactions
A class for storing and manipulating SHAPEJuMP data.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing SHAPEJuMP data. If dataframe, the dataframe containing SHAPEJuMP data. The dataframe must contain columns “i”, “j”, “Metric” (JuMP rate) and “Percentile” (percentile ranking). Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the SHAPEJuMP data.
- metricstring, defaults to “Percentile”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use when normalizing the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
The labels to use for the colorbar ticks.
- read_table_kwdict
kwargs passed to pandas.read_table() when reading input_data.
- windowint
The window size used to generate the SHAPEJuMP data.
- namestr
A name for the interactions object.
Attributes
- datapandas.DataFrame
The SHAPEJuMP data.
- read_file(input_data, read_table_kw=None)
Parses a deletions.txt file and stores it as a dataframe.
Also calculates a “Percentile” column.
Parameters
- input_datastr
path to deletions.txt file
- read_table_kwdict, defaults to None
kwargs passed to pandas.read_table().
Returns
- pandas.DataFrame
the SHAPEJuMP data
- class rnavigate.data.interactions.StructureAsInteractions(input_data, sequence, metric=None, metric_defaults=None, window=1, name=None)
Bases: Interactions
A class for storing and manipulating structure data.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing structure data. If dataframe, the dataframe containing structure data. The dataframe must contain columns “i”, “j”, and “Structure”. Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the structure data.
- metricstring, defaults to “Structure”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use when normalizing the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
The labels to use for the colorbar ticks.
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the structure data.
- namestr, optional
A name for the StructureAsInteractions object.
Attributes
- datapandas.DataFrame
The structure data.
- class rnavigate.data.interactions.StructureCompareMany(input_data, sequence, metric=None, metric_defaults=None, window=1, name=None)
Bases: Interactions
A class for storing and manipulating a comparison of many structures.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing structure data. If dataframe, the dataframe containing structure data. The dataframe must contain columns “i”, “j”, and “Structure”. Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the structure data.
- metricstring, defaults to “Structure”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use when normalizing the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
The labels to use for the colorbar ticks.
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the structure data.
- namestr, optional
A name for the StructureCompareMany object.
Attributes
- datapandas.DataFrame
The structure data.
- class rnavigate.data.interactions.StructureCompareTwo(input_data, sequence, metric=None, metric_defaults=None, window=1, name=None)
Bases: Interactions
A class for storing and manipulating a comparison of two structures.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing structure data. If dataframe, the dataframe containing structure data. The dataframe must contain columns “i”, “j”, and “Structure”. Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the structure data.
- metricstring, defaults to “Structure”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use when normalizing the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
The labels to use for the colorbar ticks.
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the structure data.
- namestr, optional
A name for the StructureCompareTwo object.
Attributes
- datapandas.DataFrame
The structure data.
rnavigate.data.pdb module
The PDB object to represent tertiary structures with atomic coordinates.
This data can be used to filter interactions by 3D distance, and to visualize profile and interactions data on interactive 3D structures.
- class rnavigate.data.pdb.PDB(input_data, chain, sequence=None, name=None)
Bases: Sequence
A class to represent RNA tertiary structures with atomic coordinates.
This data can be used to filter interactions by 3D distance, and to visualize profile and interactions data on interactive 3D structures.
Parameters
- input_datastr
path to a PDB or CIF file
- chainstr
chain identifier of RNA of interest
- sequencernavigate.Sequence or str, optional
A sequence to use as the reference sequence. This is required if the sequence cannot be found in the header. Defaults to None.
- namestr, optional
A name for the data set. Defaults to None.
Attributes
- sequencestr
The RNA sequence
- lengthint
The length of the RNA sequence
- namestr
A name for the data set
- pathstr
The path to the PDB or CIF file
- chainstr
The chain identifier of the RNA of interest
- offsetint
The offset between the sequence positions and the PDB residue indices
- pdbBio.PDB.Structure.Structure
The PDB structure
- pdb_idxnp.array
The PDB indices of the RNA
- pdb_seqnp.array
The PDB sequence of the RNA
- distance_matrixdict
A dictionary of distance matrices for each atom type
- get_distance(i, j, atom="O2'")
Get the distance between given atom in nucleotides i and j (1-indexed).
Parameters
- iint
The first nucleotide
- jint
The second nucleotide
- atomstring or dict, defaults to “O2’”
The atom to use for distance calculations. If a string, the same atom will be used for all residues. If a dict, the atom will be chosen based on the nucleotide type. If “DMS”, the N1 atom will be used for A and G, and the N3 atom will be used for U and C.
Returns
- distancefloat
The distance between the atoms
- get_distance_matrix(atom="O2'")
Get the pairwise atomic distance matrix for all residues.
Parameters
- atomstring or dict, defaults to “O2’”
The atom to use for distance calculations. If a string, the same atom will be used for all residues. If a dict, the atom will be chosen based on the nucleotide type. If “DMS”, the N1 atom will be used for A and G, and the N3 atom will be used for U and C.
Returns
- matrixNxN numpy.ndarray
A 2D array of pairwise distances. N is the length of the RNA.
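To illustrate what a pairwise distance matrix contains, the sketch below builds one from a list of (x, y, z) coordinates, one per residue. It is independent of the PDB class: distance_matrix is a hypothetical function, the coordinates are made up, and a nested list stands in for the numpy array.

```python
import math


def distance_matrix(coords):
    """Return an N x N nested list of Euclidean distances between points."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            # The matrix is symmetric, so each distance is computed once.
            d = math.dist(coords[a], coords[b])
            matrix[a][b] = d
            matrix[b][a] = d
    return matrix


# Made-up coordinates for the chosen atom of three residues.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
for row in distance_matrix(coords):
    print(row)
```

Entry [a][b] holds the distance between residues a and b (0-indexed here), and the diagonal is zero.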
- get_pdb_idx(seq_idx)
Return the PDB index given the sequence index (0-indexed).
- get_seq_idx(pdb_idx)
Return the sequence index given the PDB index.
- get_sequence(pdb)
Find the sequence in the provided CIF or PDB file.
Parameters
- pdbstr
path to a PDB or CIF file
Returns
- sequencestring
The RNA sequence
- get_sequence_from_seqres(seqres)
Used by get_sequence to parse the SEQRES entries.
Parameters
- seqreslist
A list of SEQRES entries for the RNA chain of interest
Returns
- sequencestring
The RNA sequence
- get_xyz_coord(nt, atom)
Return the x, y, and z coordinates for a given residue and atom.
Parameters
- ntint
The nucleotide of interest (1-indexed)
- atomstring or dict, defaults to “O2’”
The atom to use for distance calculations. If a string, the same atom will be used for all residues. If a dict, the atom will be chosen based on the nucleotide type. If “DMS”, the N1 atom will be used for A and G, and the N3 atom will be used for U and C.
Returns
- xyzlist
A list of x, y, and z coordinates
- is_valid_idx(pdb_idx=None, seq_idx=None)
Determines if a PDB or sequence index is in the PDB structure.
Parameters
- pdb_idxint, optional
A PDB index (1-indexed). Defaults to None.
- seq_idxint, optional
A sequence index (1-indexed). Defaults to None.
Returns
- bool
True if the index is in the PDB structure, False otherwise.
- read_pdb(pdb)
Read a PDB or CIF file into the data structure.
Parameters
- pdbstr
path to a PDB or CIF file
- set_indices()
Uses self.data and self.sequence to set self.offset
rnavigate.data.profile module
- class rnavigate.data.profile.DanceMaP(input_data, component, read_table_kw=None, sequence=None, metric='Norm_profile', metric_defaults=None, name=None)
Bases: SHAPEMaP
A class to represent per-nucleotide DanceMaP data.
Parameters
- input_datastr or pandas.DataFrame
path to a DanceMapper reactivities.txt file or a pandas DataFrame
- componentint
Which component of the DanceMapper ensemble to read in (0-indexed).
- read_table_kwdict, optional
Keyword arguments to pass to pandas.read_table. These are not necessary for reactivities.txt files. Defaults to None.
- sequencernavigate.Sequence or str, optional
A sequence to use as the reference sequence. This is not necessary for reactivities.txt files. Defaults to None.
- metricstr, defaults to “Norm_profile”
The name of the set of value-to-color options to use.
- read_file(input_data, read_table_kw={})
Convert data file to pandas dataframe and store as self.data
Parameters
- input_datastring
path to the data file
- read_table_kwdict
kwargs dictionary passed to pd.read_table
Returns
- dataframepandas.DataFrame
the data table
- property recreation_kwargs
A dictionary of keyword arguments to pass when recreating the object.
- class rnavigate.data.profile.DeltaProfile(profile1, profile2, metric=None, metric_defaults=None, name=None)
Bases: Profile
A class to represent the difference between two profiles.
Parameters
- profile1Profile
The first profile to compare.
- profile2Profile
The second profile to compare.
- metricstr, optional
The name of the metric to use. Defaults to the metric of profile1.
- metric_defaultsdict, optional
Keys are metric names, to be used with metric. Values are dictionaries of plotting parameters. Defaults to None.
- namestr, optional
A name for the data set. Defaults to None.
- class rnavigate.data.profile.Profile(input_data, metric='default', metric_defaults=None, read_table_kw=None, sequence=None, name=None)
Bases: Data
A class to represent per-nucleotide data.
Parameters
- input_datastr or pandas.DataFrame
Path to a csv or tab file, or a pandas DataFrame. The table must have 1 row for each nucleotide in the sequence and must contain these columns:
- A nucleotide position column labelled “Nucleotide”.
- A sequence column labelled “Sequence” with 1 of (A, C, G, U, T) per row.
These will be added to the table if sequence is provided.
- A data measurement column labelled “Profile” with a float or integer. The label may be another name if specified in metric_defaults.
- Optionally, a measurement error column. The label must be specified in metric_defaults.
- Other columns may be present and set up using metric_defaults. See metric_defaults for more information.
- read_table_kwdict, optional
Keyword arguments to pass to pandas.read_table. Defaults to None.
- sequencernavigate.Sequence or str, optional
A sequence to use as the reference sequence. This is required if input_data does not contain a “Sequence” column. Defaults to None.
- metricstr, defaults to “default”
The name of the set of value-to-color options to use.
- “default” specifies:
The “Profile” column is used. No error rates are present. Values are normalized to the range [0, 1]. Values are mapped to colors using the “viridis” colormap.
- “Distance” specifies:
The (3-D) “Distance” column is used. No error rates are present. Values in the range [5, 50] are normalized to the range [0, 1]. Values are mapped to colors using the “cool” colormap.
Other options may be defined in metric_defaults.
- metric_defaultsdict, optional
Keys are metric names, to be used with metric. Values are dictionaries of plotting parameters:
- “metric_column”str
The name of the column to use as the metric. Plots and analyses that use per-nucleotide data will use this column. If “color_column” is not provided, this column also defines colors.
- “error_column”str or None
The name of the column to use as the error. If None, no error bars are plotted.
- “color_column”str or None
The name of the column to use for coloring. If None, colors are defined by “metric_column”.
- “cmap”str or list
The name of the colormap to use. If a list, the list of colors to use.
- “normalization”str
The type of normalization to use. In order to be used with colormaps, values are normalized to either be integers for categorical colormaps, or floats in the range [0, 1] for continuous colormaps.
- “none”: no normalization is performed
- “min_max”: values are scaled to floats in the range [0, 1] based on the upper and lower bounds defined in “values”
- “0_1”: values are scaled to floats in the range [0, 1] based on the minimum and maximum values in the data
- “bins”: values are scaled to an integer based on bins defined by the list of bounds defined in “values”
- “percentiles”: values are scaled to floats in the range [0, 1] based on upper and lower percentile bounds defined by “values”
- “values”list or None
The values to use when normalizing the data.
- if “normalization” is “min_max”, this should be a list of two values defining the upper and lower bounds.
- if “normalization” is “bins”, this should be a list of values of length 1 less than the length of cmap. Example: [5, 10, 20] defines 4 bins: (-infinity, 5), [5, 10), [10, 20), [20, infinity)
- if “normalization” is “percentiles”, this should be a list of two values defining the upper and lower percentile bounds.
- if “normalization” is “0_1” or “none”, this should be None.
- “title”str, defaults to “”
The title of the colorbar.
- “ticks”list, defaults to None
The tick locations to use for the colorbar. If None, values are determined automatically.
- “tick_labels”list, defaults to None
The labels to use for the colorbar ticks. If None, values are determined automatically from “ticks”.
- “extend”“neither”, “both”, “min”, or “max”, defaults to “neither”
Which ends of the colorbar to extend (places an arrow head).
Defaults to None.
- namestr, optional
A name for the data set. Defaults to None.
Attributes
- datapandas.DataFrame
The data table
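The “min_max”, “0_1”, and “bins” normalization modes described under metric_defaults can be sketched as below. The function names are hypothetical. The sketch assumes bounds are given as [lower, upper] and that out-of-range values are clipped to [0, 1]; neither detail is stated in this reference.

```python
import bisect


def normalize_min_max(values, bounds):
    """Scale values to [0, 1] between bounds = [lower, upper].

    Assumption: values outside the bounds are clipped to 0 or 1.
    """
    lower, upper = bounds
    scaled = [(v - lower) / (upper - lower) for v in values]
    return [min(1.0, max(0.0, s)) for s in scaled]


def normalize_0_1(values):
    """Scale values to [0, 1] using the data's own minimum and maximum."""
    return normalize_min_max(values, [min(values), max(values)])


def normalize_bins(values, bounds):
    """Assign each value an integer bin index from a sorted list of bounds.

    [5, 10, 20] gives (-inf, 5) -> 0, [5, 10) -> 1, [10, 20) -> 2,
    [20, inf) -> 3.
    """
    return [bisect.bisect_right(bounds, v) for v in values]


print(normalize_min_max([5, 27.5, 50, 60], [5, 50]))
print(normalize_0_1([2, 4, 6]))
print(normalize_bins([4.9, 5, 10, 19.9, 20, 100], [5, 10, 20]))
```

The bin indices can then pick entries from a list-type cmap, which is why “values” should be one element shorter than the color list.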
- calculate_gini_index(values)
Calculate the Gini index of an array of values.
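For reference, one standard definition of the Gini index is the mean absolute difference between all pairs of values, divided by twice the mean. The sketch below implements that definition; the estimator used by calculate_gini_index may differ in detail, and gini_index is a hypothetical stand-in.

```python
def gini_index(values):
    """Gini index via the mean absolute difference (O(n^2) for clarity)."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0  # all-zero input: treat as perfectly even
    total = sum(abs(x - y) for x in values for y in values)
    return total / (2 * n * n * mean)


print(gini_index([1, 1, 1, 1]))  # perfectly even values
print(gini_index([0, 0, 0, 1]))  # all signal on one value
```

A value of 0 means the values are evenly distributed; values approaching 1 mean a few positions carry most of the total.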
- calculate_windows(column, window, method='median', new_name=None, minimum_points=None, mask_na=True)
Calculates a windowed operation over a column of data.
Result is stored in a new column. Value of each window is assigned to the center position of the window.
Parameters
- columnstr
name of column to perform operation on
- windowint
window size, must be an odd number
- methodstring or function, defaults to “median”
operation to perform over windows. If a string, must be “median”, “mean”, “minimum”, or “maximum”. If a function, must take a 1D numpy array as input and return a scalar.
- new_namestr, defaults to f"{method}_{window}_nt"
name of new column for stored result.
- minimum_pointsint, defaults to value of window
minimum number of points within each window.
- mask_nabool, defaults to True
whether to mask the result of the operation where the original column has a nan value.
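The centered-window behavior can be sketched without pandas as follows. windowed is a hypothetical function working on a plain list. The sketch assumes windows are truncated at the ends of the sequence, that NaN values are left out of each window, and that a window with fewer than minimum_points values yields NaN.

```python
import math
import statistics


def windowed(values, window, func=statistics.median,
             minimum_points=None, mask_na=True):
    """Apply func over centered windows; result[k] belongs to position k."""
    if window % 2 == 0:
        raise ValueError("window must be an odd number")
    if minimum_points is None:
        minimum_points = window
    half = window // 2
    result = []
    for center, value in enumerate(values):
        start = max(0, center - half)
        # Assumption: NaN values do not count toward the window.
        chunk = [v for v in values[start:center + half + 1]
                 if not math.isnan(v)]
        too_few = len(chunk) < minimum_points
        masked = mask_na and math.isnan(value)
        result.append(math.nan if too_few or masked else func(chunk))
    return result


print(windowed([1.0, 5.0, 2.0, 8.0, 3.0], 3))
print(windowed([1.0, 5.0, 2.0, 8.0, 3.0], 3, minimum_points=2))
```

With the default minimum_points, the first and last positions have incomplete windows and come back as NaN; lowering minimum_points fills them from the truncated windows.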
- copy()
Returns a copy of the Profile.
- classmethod from_array(input_data, sequence, **kwargs)
Construct a Profile object from an array of values.
Parameters
- input_datalist or np.array
A list or array of values to use as the metric.
- sequencestr
The RNA sequence.
- **kwargs
Additional keyword arguments to pass to the Profile constructor.
Returns
- Profile
A Profile object with the provided values.
- get_aligned_data(alignment)
Returns a new Profile object with the data aligned to a sequence.
Parameters
- alignmentrnavigate.data.SequenceAlignment
The alignment to use to map rows of self.data to a new sequence.
Returns
- Profile
A new Profile object with the data aligned to the sequence in the alignment.
- get_plotting_dataframe()
Returns a dataframe with the data to be plotted.
Returns
- pandas.DataFrame
A dataframe with the columns “Nucleotide”, “Values”, “Errors”, and “Colors”.
- norm_boxplot(values)
Removes outliers (> 1.5 * IQR) and scales the mean to 1.
NOTE: This method varies slightly from the normalization method used in the ShapeMapper pipeline. ShapeMapper sets undefined values to 0, and then uses these values when computing the IQR and 90th percentile. Including these values can skew the result. This method excludes such nan values. Other elements are the same.
Parameters
- values1D numpy array
values to normalize
Returns
- (float, float)
scaling factor and error propagation factor
- norm_eDMS(values)
Calculates norm factors following the eDMS per-nt scheme in ShapeMapper 2.2
Parameters
- values1D numpy array
values to normalize
Returns
- (float, float)
scaling factor and error propagation factor
- norm_percentiles(values, lower_bound=90, upper_bound=99, median_or_mean='mean')
Calculates factors to scale the median between percentile bounds to 1.
Parameters
- values1D numpy array
values to normalize
- lower_boundint or float, optional
percentile of lower bound. Defaults to 90.
- upper_boundint or float, optional
percentile of upper bound. Defaults to 99.
- median_or_meanstring, optional
whether to use the median or mean of the values between the bounds.
Returns
- (float, float)
scaling factor and error propagation factor
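A rough sketch of the percentile idea, assuming NumPy is available: find the values lying between the two percentile bounds and take their mean or median. The name `percentile_center`, the inclusive bounds, and returning only this single value are assumptions for illustration. The error propagation factor and the exact convention RNAvigate uses for the scaling factor are not shown.

```python
import numpy as np


def percentile_center(values, lower_bound=90, upper_bound=99, median_or_mean="mean"):
    """Mean or median of the values between two percentile bounds (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # undefined values are excluded
    low, high = np.percentile(values, [lower_bound, upper_bound])
    in_bounds = values[(values >= low) & (values <= high)]
    if median_or_mean == "mean":
        return float(np.mean(in_bounds))
    return float(np.median(in_bounds))
```

Dividing a profile by this value places the chosen summary of the bounded values at 1.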
- normalize(profile_column=None, new_profile=None, error_column=None, new_error=None, norm_method='boxplot', nt_groups=None, profile_factors=None, **norm_kwargs)
Normalize values in a column, and store in a new column.
By default, performs ShapeMapper2 boxplot normalization on self.metric and stores the result as “Norm_profile”.
Parameters
- profile_columnstring, defaults to self.metric
column name of values to normalize
- new_profilestring, defaults to “Norm_profile”
column name of new normalized values
- error_columnstring, defaults to self.error_column
column name of error values to propagate
- new_errorstring, defaults to “Norm_error”
column name of new propagated error values
- norm_methodstring, defaults to “boxplot”
normalization method to use.
- “DMS” uses self.norm_percentiles and nt_groups=[‘AC’, ‘UG’]
scales the median of 90th to 95th percentiles to 1. As and Cs are normalized separately from Us and Gs.
- “eDMS” uses self.norm_eDMS and nt_groups=[‘A’, ‘U’, ‘C’, ‘G’]
applies the new eDMS-MaP normalization. Each nucleotide is normalized separately.
- “boxplot” uses self.norm_boxplot and nt_groups=[‘AUCG’]
removes outliers (> 1.5 IQR) and scales median to 1. Scales nucleotides together unless specified with nt_groups.
- “percentile” uses self.norm_percentiles and nt_groups=[‘AUCG’]
scales the median of 90th to 95th percentiles to 1. Scales nucleotides together unless specified with nt_groups.
Defaults to “boxplot”: the default normalization of ShapeMapper
- nt_groupslist of strings, defaults to None
A list of nucleotides to group, e.g.:
- [‘AUCG’] groups all nts together
- [‘AC’, ‘UG’] groups As with Cs and Us with Gs
- [‘A’, ‘C’, ‘U’, ‘G’] scales each nt separately
Default depends on norm_method
- profile_factorsdictionary, defaults to None
A scaling factor (float) for each nucleotide. Keys must be ‘A’, ‘C’, ‘U’, and ‘G’.
Note: using this argument overrides any calculation of scaling factors. Defaults to None.
- **norm_kwargs
these are passed to the norm_method function
Returns
- profile_factorsdict
the new profile scaling factors dictionary
- normalize_external(profiles, **kwargs)
Normalize reactivities using other profiles to compute the normalization factors.
Parameters
- profileslist of rnavigate.data.Profile
a list of other profiles used to compute scaling factors
Returns
- profile_factorsdict
the new profile scaling factors dictionary
- normalize_sequence(t_or_u='U', uppercase=True)
Changes the values in self.data[“Sequence”] to the normalized sequence.
Parameters
- t_or_u“T” or “U”, Defaults to “U”.
Whether to replace T with U or U with T.
- uppercasebool, Defaults to True.
Whether to convert the sequence to uppercase.
- property recreation_kwargs
A dictionary of keyword arguments to pass when recreating the object.
- winsorize(column, lower_bound=None, upper_bound=None)
Winsorize the data between bounds.
If either bound is set to None, one-sided Winsorization is performed.
Parameters
- columnstring
the column of data to be winsorized
- lower_boundNumber or None, defaults to None
Data below this value is set to this value. If None, no lower bound is applied.
- upper_boundNumber or None, defaults to None
Data above this value is set to this value. If None, no upper bound is applied.
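The effect of Winsorization on a list of values can be sketched as below. The function `winsorize_values` is a hypothetical stand-in that works on a plain list and returns a new one. The documented method operates on a column of self.data instead.

```python
def winsorize_values(values, lower_bound=None, upper_bound=None):
    """Clamp values to the given bounds; a bound of None is not applied."""
    result = []
    for value in values:
        if lower_bound is not None and value < lower_bound:
            value = lower_bound
        if upper_bound is not None and value > upper_bound:
            value = upper_bound
        result.append(value)
    return result
```

For example, `winsorize_values([-1.0, 0.5, 3.0], lower_bound=0, upper_bound=2)` clamps the first and last values and leaves the middle one unchanged.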
- class rnavigate.data.profile.RNPMaP(input_data, read_table_kw=None, sequence=None, metric='NormedP', metric_defaults=None, name=None)
Bases: Profile
Represents per-nucleotide RNPMaP data.
Parameters
- input_datastr or pandas.DataFrame
path to an RNAModMapper reactivities.txt file or a pandas DataFrame
- read_table_kwdict, optional
Keyword arguments to pass to pandas.read_table. These are not necessary for reactivities.txt files. Defaults to None.
- sequencernavigate.Sequence or str, optional
A sequence to use as the reference sequence. This is not necessary for reactivities.txt files. Defaults to None.
- metricstr, defaults to “NormedP”
The name of the set of value-to-color options to use.
- metric_defaultsdict, optional
Keys are metric names, to be used with metric. Values are dictionaries of plotting parameters. Defaults to None.
- namestr, optional
A name for the data set. Defaults to None.
- class rnavigate.data.profile.SHAPEMaP(input_data, normalize=None, read_table_kw=None, sequence=None, metric='Norm_profile', metric_defaults=None, log=None, name=None)
Bases: Profile
A class to represent per-nucleotide SHAPE-MaP data.
Parameters
- input_datastr or pandas.DataFrame
path to a ShapeMapper2 profile.txt or .map file or a pandas DataFrame
- normalize“DMS”, “eDMS”, “boxplot”, “percentiles”, or None, defaults to None
The normalization method to use.
- “DMS” uses self.norm_percentiles and nt_groups=[‘AC’, ‘UG’]
scales the median of 90th to 95th percentiles to 1. As and Cs are normalized separately from Us and Gs.
- “eDMS” uses self.norm_eDMS and nt_groups=[‘A’, ‘U’, ‘C’, ‘G’]
applies the new eDMS-MaP normalization. Each nucleotide is normalized separately.
- “boxplot” uses self.norm_boxplot and nt_groups=[‘AUCG’]
removes outliers (> 1.5 IQR) and scales median to 1. Scales nucleotides together unless specified with nt_groups.
- “percentiles” uses self.norm_percentiles and nt_groups=[‘AUCG’]
scales the median of 90th to 95th percentiles to 1. Scales nucleotides together unless specified with nt_groups.
Defaults to None: no normalization is performed
- read_table_kwdict, optional
Keyword arguments to pass to pandas.read_table. These are not necessary for profile.txt and .map files. Defaults to None.
- sequencernavigate.Sequence or str, optional
A sequence to use as the reference sequence. This is not necessary for profile.txt and .map files. Defaults to None.
- metricstr, defaults to “Norm_profile”
The name of the set of value-to-color options to use. “Norm_profile” specifies:
- The “Norm_profile” column is used.
- The “Norm_stderr” column is used for error bars.
- Values are normalized to bins: (-inf, -0.4), [-0.4, 0.4), [0.4, 0.85), [0.85, 2), [2, inf)
- Bins are mapped to “grey”, “black”, “orange”, “red”, “red”
Other options may be defined in metric_defaults.
- metric_defaultsdict, optional
Keys are metric names, to be used with metric. Values are dictionaries of plotting parameters:
- “metric_column”str
The name of the column to use as the metric. Plots and analyses that use per-nucleotide data will use this column. If “color_column” is not provided, this column also defines colors.
- “error_column”str or None
The name of the column to use as the error. If None, no error bars are plotted.
- “color_column”str or None
The name of the column to use for coloring. If None, colors are defined by “metric_column”.
- “cmap”str or list
The name of the colormap to use. If a list, the list of colors to use.
- “normalization”str
The type of normalization to use. In order to be used with colormaps, values are normalized to either be integers for categorical colormaps, or floats in the range [0, 1] for continuous colormaps.
- “none”: no normalization is performed
- “min_max”: values are scaled to floats in the range [0, 1] based on the upper and lower bounds defined in “values”
- “0_1”: values are scaled to floats in the range [0, 1] based on the minimum and maximum values in the data
- “bins”: values are scaled to an integer based on bins defined by the list of bounds defined in “values”
- “percentiles”: values are scaled to floats in the range [0, 1] based on upper and lower percentile bounds defined by “values”
- “values”list or None
The values to use when normalizing the data.
- if “normalization” is “min_max”, this should be a list of two values defining the upper and lower bounds.
- if “normalization” is “bins”, this should be a list of values of length 1 less than the length of cmap. Example: [5, 10, 20] defines 4 bins: (-infinity, 5), [5, 10), [10, 20), [20, infinity)
- if “normalization” is “percentiles”, this should be a list of two values defining the upper and lower percentile bounds.
- if “normalization” is “0_1” or “none”, this should be None.
- “title”str, defaults to “”
The title of the colorbar.
- “ticks”list, defaults to None
The tick locations to use for the colorbar. If None, values are determined automatically.
- “tick_labels”list, defaults to None
The labels to use for the colorbar ticks. If None, values are determined automatically from “ticks”.
- “extend”“neither”, “both”, “min”, or “max”, defaults to “neither”
Which ends of the colorbar to extend (places an arrow head).
Defaults to None.
- logstr, optional
Path to a ShapeMapper v2 shapemap_log.txt file with mutations-per-molecule and read-length histograms. These will be present if the --per-read-histogram flag was used when running ShapeMapper v2. Currently, this is not working with ShapeMapper v2.2 files. Defaults to None.
- namestr, optional
A name for the data set. Defaults to None.
Attributes
- datapandas.DataFrame
The data table
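To make the “bins” normalization concrete, here is a small sketch that maps a reactivity value to a color using the “Norm_profile” bounds and colors listed above. The lookup via the standard library `bisect` module and the names `BOUNDS`, `COLORS`, and `bin_color` are assumptions for illustration, not the RNAvigate implementation.

```python
from bisect import bisect_right

# Bounds and colors as listed for the "Norm_profile" metric.
BOUNDS = [-0.4, 0.4, 0.85, 2]
COLORS = ["grey", "black", "orange", "red", "red"]


def bin_color(value, bounds=BOUNDS, colors=COLORS):
    """Return the color of the bin containing value; bins are closed on the left."""
    return colors[bisect_right(bounds, value)]
```

Because there is always one more bin than there are bounds, the list of colors must be exactly one element longer than the list of bounds.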
- classmethod from_rnaframework(input_data, normalize=None)
Construct a SHAPEMaP object from an RNAFramework output file.
Parameters
- input_datastr
path to an RNAFramework .xml reactivities file
- normalize“DMS”, “eDMS”, “boxplot”, “percentiles”, or None, defaults to None
The normalization method to use.
- “DMS” uses self.norm_percentiles and nt_groups=[‘AC’, ‘UG’]
scales the median of 90th to 95th percentiles to 1. As and Cs are normalized separately from Us and Gs.
- “eDMS” uses self.norm_eDMS and nt_groups=[‘A’, ‘U’, ‘C’, ‘G’]
applies the new eDMS-MaP normalization. Each nucleotide is normalized separately.
- “boxplot” uses self.norm_boxplot and nt_groups=[‘AUCG’]
removes outliers (> 1.5 IQR) and scales median to 1. Scales nucleotides together unless specified with nt_groups.
- “percentiles” uses self.norm_percentiles and nt_groups=[‘AUCG’]
scales the median of 90th to 95th percentiles to 1. Scales nucleotides together unless specified with nt_groups.
Defaults to None: no normalization is performed
Returns
- SHAPEMaP
A SHAPEMaP object with the provided values.
- read_log(log)
Read the ShapeMapper log file.
Parameters
- logstr
Path to a ShapeMapper v2 shapemap_log.txt file with mutations-per-molecule and read-length histograms.
Returns
- read_lengthspandas.DataFrame
A dataframe with the columns “Read_length”, “Modified_read_length”, and “Untreated_read_length”.
- mutations_per_moleculepandas.DataFrame
A dataframe with the columns “Mutation_count”, “Modified_mutations_per_molecule”, and “Untreated_mutations_per_molecule”.
rnavigate.data.secondary_structure module
- class rnavigate.data.secondary_structure.SecondaryStructure(input_data, extension=None, autoscale=True, name=None, **kwargs)
Bases: Sequence
Base class for secondary structures.
Parameters
- input_datastr or pandas.DataFrame
A dataframe or filepath containing a secondary structure.
DataFrame should contain these columns: [“Nucleotide”, “Sequence”, “Pair”]. The “Pair” column must be redundant.
Filepath parsing is determined by file extension: varna, xrna, nsd, cte, ct, dbn, bracket, json (R2DT), forna
- extensionstr, optional
The file extension of the input_data file. If not provided, the extension will be inferred from the input_data filepath.
- autoscalebool, optional
Whether to automatically scale the x and y coordinates. Defaults to True.
- namestr, optional
The name of the RNA sequence. Defaults to None.
Attributes
- datapandas.DataFrame
DataFrame storing base-pairs
- filepathstr
The path to the input file, if provided, otherwise “dataframe”
- sequencestr
The RNA sequence
- ntsnumpy.array
The “Nucleotide” column of data
- pair_ntsnumpy.array
The “Pair” column of data
- headerstr
Header information from CT file
- xcoordinatesnumpy.array
The “X_coordinate” column of data
- ycoordinatesnumpy.array
The “Y_coordinate” column of data
- distance_matrixnumpy.array
The contact distance matrix of the RNA structure
- add_pairs(pairs, break_conflicting_pairs=False)
Add base pairs to current secondary structure.
Parameters
- pairslist
1-indexed list of paired residues. e.g. [(1, 20), (2, 19)]
- break_conflicting_pairsbool, defaults to False
Whether to break existing pairs if there is a conflict
- as_interactions(structure2=None)
Returns an rnavigate.Interactions representation of this structure, or of this and additional structures.
Parameters
- structure2SecondaryStructure or list of these, defaults to None
If provided, basepairs from all structures are included and labeled by which structures contain them and how many structures contain them.
- property boolean
Return a boolean array of paired and unpaired nucleotides.
- break_noncanonical_pairs()
Removes non-canonical basepairs from the secondary structure.
WARNING: this deletes information.
- break_pairs_nts(nt_positions)
Breaks base pairs at the given list of positions.
WARNING: this deletes information.
Parameters
- nt_positionslist of int
1-indexed positions to break pairs
- break_pairs_region(start, end, break_crossing=True, inverse=False)
Removes pairs from the specified region (1-indexed, inclusive).
WARNING: this deletes information
Parameters
- startint
start position (1-indexed, inclusive)
- endint
end position (1-indexed, inclusive)
- break_crossingbool, defaults to True
Whether to keep pairs that cross over the specified region
- inversebool, defaults to False
Invert the behavior, i.e. remove pairs that are not in this region
- break_singleton_pairs()
Removes singleton basepairs from the secondary structure.
WARNING: This deletes information.
- compute_ppv_sens(structure2, exact=True)
Compute the PPV and sensitivity between this and another structure.
True and False are determined from this structure. Positive and Negative are determined from structure2.
PPV = TP / (TP + FP)
Sensitivity = TP / (TP + FN)
Parameters
- structure2SecondaryStructure
The SecondaryStructure to compare to.
- exactbool, defaults to True
True requires BPs to be exactly correct. False allows +/-1 bp slippage.
Returns
- float
sensitivity
- float
PPV
- 3-tuple of floats
(TP, TP+FP, TP+FN)
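The two formulas can be illustrated on plain lists of base pairs. This sketch covers only the exact-match case (no +/-1 slippage), and `ppv_sensitivity` with its two arguments is a hypothetical helper. Following the convention above, the reference pairs play the role of this structure and the predicted pairs play the role of structure2.

```python
def ppv_sensitivity(reference_pairs, predicted_pairs):
    """Exact-match sensitivity and PPV between two lists of (i, j) base pairs."""
    reference = {tuple(sorted(pair)) for pair in reference_pairs}
    predicted = {tuple(sorted(pair)) for pair in predicted_pairs}
    true_positives = len(reference & predicted)
    # TP + FP is every predicted pair; TP + FN is every reference pair.
    ppv = true_positives / len(predicted) if predicted else 0.0
    sensitivity = true_positives / len(reference) if reference else 0.0
    return sensitivity, ppv, (true_positives, len(predicted), len(reference))
```

Returning 0.0 when either structure has no pairs is a choice made for this sketch to avoid dividing by zero.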
- contact_distance(i, j)
Returns the contact distance between positions i and j
- copy()
- fill_mismatches(mismatch=1)
Adds base pairs to fill 1,1 and optionally 2,2 mismatches.
Parameters
- mismatchint, defaults to 1
1 will fill only 1,1 mismatches. 2 will fill 1,1 and 2,2 mismatches.
- classmethod from_pairs_list(input_data, sequence)
Creates a SecondaryStructure from a list of pairs and a sequence.
Parameters
- input_datalist
1-indexed list of base pairs. e.g. [(1, 20), (2, 19)]
- sequencestr
The RNA sequence. e.g., “AUCGUGUCAUGCUA”
- classmethod from_sequence(input_data)
Creates a SecondaryStructure from a sequence string.
This structure is initialized with no base pairs. If base pairs are needed, use SecondaryStructure.from_pairs_list().
- get_aligned_data(alignment)
Returns a new SecondaryStructure object matching the alignment target.
Parameters
- alignmentdata.Alignment
An alignment object used to map values
- get_distance_matrix(recalculate=False, max_cd=50)
Get a matrix of pair-wise shortest path distances through the structure.
This function uses a BFS algorithm. The structure is represented as a graph with nucleotides as vertices and base-pairs and backbone as edges. All edges are length 1. Matrix is stored as an attribute for future use.
If the attribute is set (not None) and recalculate is False, the attribute will be returned.
Based on Tom’s contact_distance, but expanded to return the pairwise matrix. New contact_distance method added to return the distance between two positions.
By default, the maximum contact distance is set to 50. This will be the maximum value reported in the matrix, i.e. a value of 50 in the matrix means >= 50. This prevents the algorithm from running for a very long time on long RNAs. If you need a larger value, set max_cd to a higher value.
Parameters
- recalculatebool, defaults to False
Set to True to recalculate the matrix even if the attribute is set.
- max_cdint, defaults to 50
The maximum contact distance to calculate.
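The breadth-first search described above can be sketched for a single starting position as follows. This is a simplified illustration, not the RNAvigate implementation: `contact_distances` is a hypothetical function that returns one row of the matrix, with unreached positions reported as max_cd.

```python
from collections import deque


def contact_distances(length, pairs, source, max_cd=50):
    """Distances from source (1-indexed) to every position, capped at max_cd."""
    partner = {}
    for i, j in pairs:
        partner[i], partner[j] = j, i
    distances = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if distances[current] + 1 >= max_cd:
            continue  # anything farther away is reported as max_cd
        # Edges of length 1: backbone neighbors and the base-pairing partner.
        for neighbor in (current - 1, current + 1, partner.get(current)):
            if neighbor is None or not 1 <= neighbor <= length:
                continue
            if neighbor not in distances:
                distances[neighbor] = distances[current] + 1
                queue.append(neighbor)
    return [distances.get(position, max_cd) for position in range(1, length + 1)]
```

Stopping the search once the cap is reached is what keeps the run time bounded on long RNAs.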
- get_dotbracket()
Get a dotbracket notation string representing the secondary structure.
- Pseudoknot levels:
1: (), 2: [], 3: {}, 4: <>, 5: Aa, 6: Bb, 7: Cc, etc.
Returns
- str
A dot-bracket representation of the secondary structure
- get_helices(fill_mismatches=True, split_bulge=True, keep_singles=False)
Get a dictionary of helices from the secondary structure.
Keys are equivalent to list indices. Values are lists of paired nucleotides (1-indexed) in that helix, e.g. {0: [(1, 50), (2, 49), (3, 48)]}
Parameters
- fill_mismatchesbool, defaults to True
Whether 1-1 and 2-2 bulges are replaced with base pairs
- split_bulgebool, defaults to True
Whether to split helices on bulges
- keep_singlesbool, defaults to False
Whether to return helices that contain only 1 base-pair
Returns
- dict
A dictionary of helices
- get_human_dotbracket()
Get a human-readable dotbracket string representing the secondary structure.
This is an experimental format designed to be more human readable, i.e. no counting of brackets required.
Letters, instead of brackets, are used to denote nested base pairs.
Each helix is assigned a letter, which is incremented one letter alphabetically from the nearest enclosing stem.
Non-nested helices (pseudoknots) are assigned canonical brackets.
- From this canonical dbn string:
how many bases are in the base stem? how many nested helices are there?
((((....(((.[[..)))))(((...(((..]].))))))))
- Same question, new format:
AABB....CCC.[[..cccbbBBB...CCC..]].cccbbbaa
- Read this as:
((_______________________________________)) (level 1 = A)
  ((_______________))(((______________)))   (level 2 = B)
        (((_____)))         (((_____)))     (level 3 = C)
            [[__________________]]          (pseudoknot = [])
- Pseudoknot levels:
1: Aa, Bb, Cc, etc., 2: [], 3: {}, 4: <>
- get_interactions_df()
Returns a DataFrame of i, j basepairs.
Returns
- pandas.DataFrame
A DataFrame with columns:
- i: the 5’ (1-indexed) position of the base pair
- j: the 3’ (1-indexed) position of the base pair
- Structure: always 1
- get_junction_nts()
Get a list of junction nucleotides (paired, but at the end of a chain).
Returns
- list
A list of 1-indexed positions of junction nucleotides
- get_nonredundant_ct()
Returns the ct attribute in a non-redundant form.
Only returns pairs in which i < j. For example:
self.ct[i-1] == j
self.ct[j-1] == i
BUT self.get_nonredundant_ct()[j-1] == 0
Returns
- numpy.array
A non-redundant array of base pairs
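The relationship between the redundant and non-redundant forms can be shown on a plain list. In this sketch, `nonredundant_ct` is a hypothetical function, and the input is assumed to hold the 1-indexed partner of each position, with 0 for unpaired positions.

```python
def nonredundant_ct(ct):
    """Keep only the 5' side of each pair: entry i-1 is j when i < j, otherwise 0."""
    return [j if i < j else 0 for i, j in enumerate(ct, start=1)]


# A 6-nt hairpin in which 1 pairs with 6 and 2 pairs with 5.
redundant = [6, 5, 0, 0, 2, 1]
pairs = [(i, j) for i, j in enumerate(nonredundant_ct(redundant), start=1) if j]
```

Listing each pair once in this way yields tuples with i < j, similar in shape to what get_pairs() is documented to return.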
- get_paired_nts()
Get a list of residues that are paired.
Returns
- list
A list of 1-indexed positions of paired nucleotides
- get_pairs()
Get a non-redundant list of base pairs (i < j) as a list of tuples.
Returns
- list
A list of 1-indexed positions. e.g., [(1, 50), (2, 49), …]
- get_pseudoknots(fill_mismatches=True)
Get the pk1 and pk2 pairs from the secondary structure.
Ignores single base pairs. PK1 is defined as the helix crossing the most other bps. If there is a tie, the most 5’ helix is called pk1. Returns pk1 and pk2 as lists of base pairs, e.g. [(1, 10), (2, 9), …]
Parameters
- fill_mismatchesbool, defaults to True
Whether 1-1 and 2-2 bulges are replaced with base pairs
Returns
- list of 2 lists of 2-tuples
A list of base pairs for pk1 and pk2
- get_structure_elements()
This code is not yet implemented.
Returns a string with a character for each nucleotide, indicating what kind of structure element it is a part of.
- Characters:
Dangling Ends (E), Stems (S), Hairpin Loops (H), Bulges (B), Internal Loops (I), MultiLoops (M), External Loops (X), Pseudoknot (P)
- get_unpaired_nts()
Get a list of residues that are unpaired.
Returns
- list
A list of 1-indexed positions of unpaired nucleotides
- normalize_dtypes()
Convert dtypes of SecondaryStructure dataframe for consistency.
- normalize_sequence(t_or_u='U', uppercase=True)
Normalize the sequence attribute (fix case and/or U <-> T).
- property nts
- property pair_nts
- read_ct(structure_number=0)
Loads secondary structure information from a given ct file.
Requires a properly formatted header.
Parameters
- structure_numberint, defaults to 0
0-indexed structure number to load from the ct file.
- read_cte()
Generates SecondaryStructure object data from a CTE file
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- read_dotbracket()
Generates SecondaryStructure object data from a dot-bracket file.
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- read_forna()
Generates SecondaryStructure object data from a FORNA JSON file.
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- read_nsd(structure_number=0)
Generates SecondaryStructure object data from an NSD file.
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- read_r2dt()
Generates SecondaryStructure object data from an R2DT JSON file.
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- read_varna()
Generates SecondaryStructure object data from a VARNA file.
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- read_xrna()
Generates SecondaryStructure object data from an XRNA file.
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- transform_coordinates(flip=None, scale=None, center=None, rotate_degrees=None)
Perform transformations on X and Y structure coordinates.
To achieve vertical and horizontal flip together, rotate 180 degrees.
Parameters
- flipstr, optional
“horizontal” or “vertical”
- scalefloat, optional
new median distance of basepairs
- centertuple of floats, optional
new center x and y coordinate
- rotate_degreesfloat, optional
number of degrees to rotate structure
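The note that a 180 degree rotation equals a horizontal plus a vertical flip can be checked with a small sketch. The functions `rotate` and `flip` are hypothetical helpers working on plain lists. Rotating counterclockwise about the origin and mirroring across the axes are assumptions, since the rotation center and direction used by RNAvigate are not stated here.

```python
import math


def rotate(x, y, degrees):
    """Rotate points counterclockwise about the origin."""
    angle = math.radians(degrees)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    new_x = [xi * cos_a - yi * sin_a for xi, yi in zip(x, y)]
    new_y = [xi * sin_a + yi * cos_a for xi, yi in zip(x, y)]
    return new_x, new_y


def flip(x, y, horizontal=True):
    """Mirror points across the y-axis (horizontal) or the x-axis (vertical)."""
    if horizontal:
        return [-xi for xi in x], list(y)
    return list(x), [-yi for yi in y]
```

Applying `flip` twice, once with horizontal=True and once with horizontal=False, gives the same coordinates as `rotate(x, y, 180)` up to floating-point rounding.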
- write_ct(out_file)
Write structure to a ct file.
- write_cte(out_file)
Write structure to CTE format for Structure Editor.
- write_dbn(rna_name, region='all', out_file=None)
Write the structure to a dot-bracket file.
Parameters
- rna_namestr
The name of the RNA sequence
- regionlist of 2 integers, optional
The region (start and end positions) of the RNA to write to file. Defaults to “all”.
- out_filestr, optional
The name of the output file. If not provided, the dbn file is printed.
- write_sto(out_file, name='seq')
Write structure to Stockholm (STO) file to use in infernal searches.
- property xcoordinates
- property ycoordinates
- class rnavigate.data.secondary_structure.SequenceCircle(input_data, gap=30, name=None, **kwargs)
Bases: SecondaryStructure
A circular SecondaryStructure-like representation of RNA sequence.
- class rnavigate.data.secondary_structure.StructureCoordinates(x, y, pairs=None)
Bases: object
Helper class to perform structure coordinate transformations
Parameters
- xnumpy.array
x coordinates
- ynumpy.array
y coordinates
- pairslist of pairs, optional
list of base-paired positions; required if scaling coordinates
- center(x=0, y=0)
Center structure on the given x, y coordinate
Parameters
- xint, defaults to 0
x coordinate of structure center
- yint, defaults to 0
y coordinate of structure center
- flip(horizontal=True)
Flip structure vertically or horizontally.
Parameters
- horizontalbool, defaults to True
whether to flip structure horizontally, otherwise vertically
- get_center_point()
Get the x, y coordinates for the center of structure.
Returns
- float
x coordinate of structure center
- float
y coordinate of structure center
Module contents
- class rnavigate.data.AlignmentChain(*alignments)
Bases: BaseAlignment
Combines a list of alignments into one.
Parameters
- alignmentslist of Alignment objects
the alignments to chain together
Attributes
- alignmentslist
the constituent alignments
- starting_sequencestr
starting sequence of alignments[0]
- target_sequencestr
target sequence of alignments[-1]
- mappingnumpy.array
an array which maps from starting_sequence to target_sequence. index of starting_sequence is mapping[index] of target sequence
- get_inverse_alignment()
Alignments require a method to get the inverted alignment
- class rnavigate.data.AllPossible(sequence, metric='data', input_data=None, metric_defaults=None, read_table_kw=None, window=1, name=None)
Bases: Interactions
A class for storing and manipulating all possible interactions.
Parameters
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the pairing probability data.
- metricstring, defaults to “data”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use with normalization of the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the pairing probability data.
- namestr, optional
A name for the AllPossible object.
Attributes
- datapandas.DataFrame
The pairing probability data.
- class rnavigate.data.Annotation(input_data, annotation_type, sequence, name=None, color='blue')
Bases: Sequence
Basic annotation class to store 1D features of an RNA sequence
Each feature type must be a separate instance. Feature types include:
- a group of separated nucleotides (e.g. binding pocket)
- regions of interest (e.g. coding sequence, Alu elements)
- sites of interest (e.g. m6A locations)
- primer binding sites
Parameters
- input_datalist
List will be treated according to annotation_type argument. Expected behaviors for each value of annotation_type:
- “sites” or “group”: 1-indexed location of sites of interest
example: [1, 10, 20, 30] is four sites, 1, 10, 20, and 30
- “spans”: 1-indexed, inclusive locations of spans of interest
example: [[1, 10], [20, 30]] is two spans, 1 to 10 and 20 to 30
- “primers”: Similar to spans, but 5’/3’ direction is preserved.
example: [[1, 10], [30, 20]] forward 1 to 10, reverse 30 to 20
- annotation_type“group”, “sites”, “spans”, or “primers”
The type of annotation.
- sequencestr or pandas.DataFrame
Nucleotide sequence, path to fasta file, or dataframe containing a “Sequence” column.
- namestr, defaults to None
Name of annotation.
- colormatplotlib color-like, defaults to “blue”
Color to be used for displaying this annotation on plots.
Attributes
- datapandas.DataFrame
Stores the list of sites or regions
- namestr
The label for this annotation for use on plots
- colorvalid matplotlib color
Color to represent annotation on plots
- sequencestr
The reference sequence string
- property boolean
Return a boolean array of the annotation on the sequence.
- classmethod from_boolean_array(values, sequence, annotation_type, name, color='blue', window=1)
Create an Annotation from an array of boolean values.
True values are used to create the Annotation.
Parameters
- valueslist of True or False
the boolean array
- sequencestring or rnav.data.Sequence
the sequence of the Annotation
- annotation_type“spans”, “sites”, “primers”, or “group”
the type of the new annotation. If “spans” or “primers”, adjacent True values, or values within window, are collapsed to a region.
- namestring
a name for labelling the annotation.
- colorstring, defaults to “blue”
a color for plotting the annotation
- windowinteger, defaults to 1
a window around True values to include in the annotation.
Returns
- rnavigate.data.Annotation
the new Annotation
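How adjacent True values collapse into spans can be sketched as below. The function `boolean_to_spans` is a hypothetical helper that only handles directly adjacent True values. Merging values that fall within a larger window is not shown.

```python
def boolean_to_spans(values):
    """Collapse runs of adjacent True values into 1-indexed, inclusive [start, end] spans."""
    spans = []
    start = None
    for position, value in enumerate(values, start=1):
        if value and start is None:
            start = position  # a new run of True values begins here
        elif not value and start is not None:
            spans.append([start, position - 1])
            start = None
    if start is not None:
        spans.append([start, len(values)])  # the last run reaches the end
    return spans
```

The resulting list has the same shape as the “spans” input format described for the Annotation constructor.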
- from_sites(sites)
Create the self.data dataframe from a list of sites.
- from_spans(spans)
Create the self.data dataframe from a list of spans.
- get_aligned_data(alignment)
Aligns this Annotation to a new sequence and returns a copy.
Parameters
- alignmentrnavigate.data.Alignment
Alignment object used to align to a new sequence.
Returns
- rnavigate.data.Annotation
A new Annotation with the same name, color, and annotation type, but with the input data aligned to the target sequence.
- get_sites()
Returns a list of nucleotide positions included in this annotation.
Returns
- sitestuple
a list of nucleotide positions
- get_subsequences(buffer=0)
- class rnavigate.data.DanceMaP(input_data, component, read_table_kw=None, sequence=None, metric='Norm_profile', metric_defaults=None, name=None)
Bases: SHAPEMaP
A class to represent per-nucleotide DanceMaP data.
Parameters
- input_datastr or pandas.DataFrame
path to a DanceMapper reactivities.txt file or a pandas DataFrame
- componentint
Which component of the DanceMapper ensemble to read in (0-indexed).
- read_table_kwdict, optional
Keyword arguments to pass to pandas.read_table. These are not necessary for reactivities.txt files. Defaults to None.
- sequencernavigate.Sequence or str, optional
A sequence to use as the reference sequence. This is not necessary for reactivities.txt files. Defaults to None.
- metricstr, defaults to “Norm_profile”
The name of the set of value-to-color options to use.
- read_file(input_data, read_table_kw={})
Convert data file to pandas dataframe and store as self.data
Parameters
- input_datastring
path to data file containing interactions
- read_table_kwdict
kwargs dictionary passed to pd.read_table
Returns
- dataframepandas.DataFrame
the data table
- property recreation_kwargs
A dictionary of keyword arguments to pass when recreating the object.
- class rnavigate.data.Data(input_data, sequence, metric, metric_defaults, read_table_kw=None, name=None)
Bases: Sequence
The base class for RNAvigate Profile and Interactions classes.
Parameters
- input_datapandas.DataFrame or str
a pandas dataframe or path to a data file
- sequencestring or rnavigate.data.Sequence
the sequence to use for the data
- metricstring or dict
the column of the dataframe to use as the default metric to visualize
- metric_defaultsdict
a dictionary of metric defaults
- read_table_kwdict, optional
kwargs dictionary passed to pd.read_table
- namestring, optional
the name of the data, defaults to None
Attributes
- datapandas.DataFrame
the data table
- filepathstring
the path to the data file
- sequencestring or rnavigate.data.Sequence
the sequence to use for the data
- metricstring or dict
the column of the dataframe to use as the metric to visualize
- metric_defaultsdict
A dictionary of metric values and default settings for visualization
- default_metricstring
the default metric to use for visualization
- add_metric_defaults(metric_defaults)
Add metric defaults to self.metric_defaults
- property cmap
Get the colormap to use for colorbars and to retrieve colors.
- property color_column
Get the column of the dataframe to use as the color for visualization.
- property colors
Get one matplotlib color-like value for each nucleotide in self.sequence.
- property error_column
Get the column of the dataframe to use as the error for visualization.
- property metric
Get the column of the dataframe to use as the metric for visualization.
- class rnavigate.data.DeltaProfile(profile1, profile2, metric=None, metric_defaults=None, name=None)
Bases:
Profile
A class to represent the difference between two profiles.
Parameters
- profile1Profile
The first profile to compare.
- profile2Profile
The second profile to compare.
- metricstr, optional
The name of the metric to use. Defaults to the metric of profile1.
- metric_defaultsdict, optional
Keys are metric names, to be used with metric. Values are dictionaries of plotting parameters. Defaults to None.
- namestr, optional
A name for the data set. Defaults to None.
- class rnavigate.data.Interactions(input_data, sequence, metric, metric_defaults, read_table_kw=None, window=1, name=None)
Bases:
Data
A class for storing and manipulating interactions data.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing interactions data. If dataframe, the dataframe containing interactions data. The dataframe must contain columns “i”, “j”, and self.metric. Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the interactions data.
- metricstring
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use with normalization of the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
The labels to use for the colorbar ticks.
- read_table_kwdict
kwargs passed to pandas.read_table() when reading input_data.
- windowint
The window size used to generate the interactions data.
- namestr
The name of the data object.
Attributes
- datapandas.DataFrame
The interactions data.
- windowint
The window size that is being represented by i-j pairs.
- copy(apply_filter=False)
Returns a copy of the interactions, optionally with masked rows removed.
Parameters
- apply_filterbool, defaults to False
If True, masked rows (“mask” == False) are dropped.
Returns
- rnavigate.data.Interactions
A copy of the interactions.
- count_filter(**kwargs)
Counts the number of interactions that pass the given filters.
- data_specific_filter(**kwargs)
Does nothing for the base Interactions class, can be overwritten in subclasses.
Returns
- kwargsdict
dictionary of keyword argument pairs
- filter(prefiltered=False, reset_filter=True, structure=None, min_cd=None, max_cd=None, paired_only=False, ss_only=False, ds_only=False, profile=None, min_profile=None, max_profile=None, compliments_only=False, nts=None, max_distance=None, min_distance=None, exclude_nts=None, isolate_nts=None, resolve_conflicts=None, **kwargs)
Convenience function that applies the individual mask_on_* filters in a single call.
Parameters
- prefilteredbool, defaults to False
If True, the mask is not updated.
- reset_filterbool, defaults to True
If True, the mask is reset before applying filters.
- structurernavigate.data.SecondaryStructure, defaults to None
The structure to use for filtering.
- min_cdint, defaults to None
The minimum contact distance to allow.
- max_cdint, defaults to None
The maximum contact distance to allow.
- paired_onlybool, defaults to False
If True, only keep interactions that are paired in the structure.
- ss_onlybool, defaults to False
If True, only keep interactions between single-stranded nucleotides.
- ds_onlybool, defaults to False
If True, only keep interactions between double-stranded nucleotides.
- profilernavigate.data.Profile, defaults to None
The profile to use for masking.
- min_profilefloat, defaults to None
The minimum profile value to allow.
- max_profilefloat, defaults to None
The maximum profile value to allow.
- compliments_onlybool, defaults to False
If True, only keep interactions where i and j are complementary nucleotides.
- ntsstr, defaults to None
If compliments_only is False, only keep interactions where i and j are in nts.
- max_distanceint, defaults to None
The maximum distance to allow. If None, no maximum distance is set.
- min_distanceint, defaults to None
The minimum distance to allow. If None, no minimum distance is set.
- exclude_ntslist of int, defaults to None
A list of positions to exclude.
- isolate_ntslist of int, defaults to None
A list of positions to isolate.
- resolve_conflictsstr, defaults to None
If not None, conflicting windows are resolved using the Maximal Weighted Independent Set. The weights are taken from the metric value. The graph is first broken into components to speed up the identification of the MWIS. Then the mask is updated to only include the MWIS.
- **kwargsdict
Each keyword should have the format “column_operator” where column is a valid column name of the dataframe and operator is one of:
- “ge”: greater than or equal to
- “le”: less than or equal to
- “gt”: greater than
- “lt”: less than
- “eq”: equal to
- “ne”: not equal to
The values given to these keywords are then used in the comparison, and rows for which the comparison is False are filtered out. e.g.:
self.mask_on_values(Statistic_ge=23) evaluates to:
self.update_mask(self.data[“Statistic”] >= 23)
Returns
- masknumpy array
a boolean array of the same length as self.data
- get_aligned_data(alignment, apply_filter=True)
Returns a copy mapped to a new sequence with masked rows removed.
Parameters
- alignmentrnavigate.data.SequenceAlignment
The alignment to use for mapping the interactions.
- apply_filterbool, defaults to True
If True, masked rows (“mask” == False) are dropped.
Returns
- rnavigate.data.Interactions
Interactions mapped to a new sequence.
- get_ij_colors()
Gets i, j, and colors lists for plotting interactions.
i and j are the 5’ and 3’ ends of each interaction, and colors is the color to use for each interaction. Values of self.data[self.metric] are normalized to 0 to 1, which correspond to self.min_max values. These are then mapped to a color using self.cmap.
Returns
- ilist
5’ ends of each interaction
- jlist
3’ ends of each interaction
- colorslist
colors to use for each interaction
- get_sorted_data()
Returns a copy of the data sorted by self.metric.
Returns
- pandas.DataFrame
a copy of the data sorted by self.metric
- mask_on_distance(max_dist=None, min_dist=None)
Mask interactions based on their distance in sequence space.
Parameters
- max_distint, defaults to None
The maximum distance to allow. If None, no maximum distance is set.
- min_distint, defaults to None
The minimum distance to allow. If None, no minimum distance is set.
Returns
- masknumpy array
a boolean array of the same length as self.data
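The distance bounds described for mask_on_distance can be pictured with a small sketch. This is not RNAvigate's implementation: it assumes that "distance in sequence space" means |j - i| and that both bounds are inclusive, and distance_mask is a hypothetical helper name.

```python
def distance_mask(pairs, min_dist=None, max_dist=None):
    """Flag (i, j) pairs whose sequence separation lies within the bounds."""
    mask = []
    for i, j in pairs:
        separation = abs(j - i)  # assumed definition of sequence distance
        keep = True
        if min_dist is not None and separation < min_dist:
            keep = False
        if max_dist is not None and separation > max_dist:
            keep = False
        mask.append(keep)
    return mask


pairs = [(1, 4), (10, 60), (5, 25)]
print(distance_mask(pairs, min_dist=6, max_dist=40))  # [False, False, True]
```

A bound left as None is simply skipped, which mirrors the "no maximum/minimum distance is set" behavior described in the parameter list.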
- mask_on_position(exclude=None, isolate=None)
Mask interactions based on their i and j positions.
Parameters
- excludelist of int, defaults to None
A list of positions to exclude.
- isolatelist of int, defaults to None
A list of positions to isolate.
Returns
- masknumpy array
a boolean array of the same length as self.data
- mask_on_profile(profile, min_profile=None, max_profile=None)
Masks interactions based on per-nucleotide measurements.
Parameters
- profilernavigate.data.Profile
The profile to use for masking.
- min_profilefloat, defaults to None
The minimum profile value to allow.
- max_profilefloat, defaults to None
The maximum profile value to allow.
Returns
- masknumpy array
a boolean array of the same length as self.data
- mask_on_sequence(compliment_only=None, nts=None)
Mask interactions based on sequence.
Parameters
- compliment_onlybool, defaults to None
If True, only keep interactions where i and j are complementary nucleotides.
- ntsstr, defaults to None
If compliment_only is False, only keep interactions where i and j are in nts.
Returns
- numpy array
a boolean array of the same length as self.data
- mask_on_structure(structure, min_cd=None, max_cd=None, ss_only=False, ds_only=False, paired_only=False)
Masks interactions based on a secondary structure.
Parameters
- structurernavigate.data.SecondaryStructure
The secondary structure to use for masking.
- min_cdint, defaults to None
The minimum contact distance to allow.
- max_cdint, defaults to None
The maximum contact distance to allow.
- ss_onlybool, defaults to False
If True, only keep interactions between single-stranded nucleotides.
- ds_onlybool, defaults to False
If True, only keep interactions between double-stranded nucleotides.
- paired_onlybool, defaults to False
If True, only keep interactions that are paired in the structure.
Returns
- masknumpy array
a boolean array of the same length as self.data
- mask_on_values(**kwargs)
Mask interactions based on values in self.data.
Parameters
- kwargsdict
Each keyword should have the format “column_operator” where column is a valid column name of the dataframe and operator is one of:
- “ge”: greater than or equal to
- “le”: less than or equal to
- “gt”: greater than
- “lt”: less than
- “eq”: equal to
- “ne”: not equal to
The values given to these keywords are then used in the comparison, and rows for which the comparison is False are filtered out. e.g.:
self.mask_on_values(Statistic_ge=23) evaluates to:
self.update_mask(self.data[“Statistic”] >= 23)
Returns
- masknumpy array
a boolean array of the same length as self.data
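The “column_operator” keyword format used by mask_on_values can be illustrated with a standalone sketch. It works on a plain list of dictionaries rather than a dataframe, and mask_from_keywords is a hypothetical name; the real method may parse keywords differently.

```python
import operator

# Suffix-to-comparison table, mirroring the operators listed above.
OPERATORS = {
    "ge": operator.ge,
    "le": operator.le,
    "gt": operator.gt,
    "lt": operator.lt,
    "eq": operator.eq,
    "ne": operator.ne,
}


def mask_from_keywords(rows, **kwargs):
    """Return one boolean per row; True means the row passes every filter."""
    mask = [True] * len(rows)
    for key, threshold in kwargs.items():
        # Split on the last underscore so column names may contain underscores.
        column, _, suffix = key.rpartition("_")
        compare = OPERATORS[suffix]
        mask = [keep and compare(row[column], threshold)
                for keep, row in zip(mask, rows)]
    return mask


rows = [
    {"i": 1, "j": 20, "Statistic": 30.0},
    {"i": 5, "j": 40, "Statistic": 12.5},
    {"i": 9, "j": 60, "Statistic": 23.0},
]
print(mask_from_keywords(rows, Statistic_ge=23))  # [True, False, True]
```

Several keywords combine with a logical AND, in the same spirit as update_mask.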
- print_new_file(outfile=None)
Create a new file with mapped and filtered interactions.
Parameters
- outfilestr, defaults to None
path to an output file. If None, file string is printed to console.
- reset_mask()
Resets the mask to all True (removes previous filters)
- resolve_conflicts(metric=None)
Uses an experimental method to resolve conflicts.
Resolves conflicting windows using the Maximal Weighted Independent Set. The weights are taken from the metric value. The graph is first broken into components to speed up the identification of the MWIS. Then the mask is updated to only include the MWIS. This method is computationally expensive for large or dense datasets.
Parameters
- metricstr, defaults to None
The metric to use for weighting the graph. If None, self.metric is used.
Returns
- masknumpy array
a boolean array of the same length as self.data
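The component-wise search described for resolve_conflicts can be sketched as follows. This is an illustration only: it assumes the conflict graph (which interactions conflict with which) is already known and that all weights are positive, and it uses brute force inside each component, which is not necessarily what RNAvigate does. Both function names are hypothetical.

```python
from itertools import combinations


def connected_components(nodes, edges):
    """Group nodes into connected components of the conflict graph."""
    neighbors = {node: set() for node in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen, components = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, component = [node], []
        seen.add(node)
        while stack:
            current = stack.pop()
            component.append(current)
            for other in neighbors[current] - seen:
                seen.add(other)
                stack.append(other)
        components.append(component)
    return components, neighbors


def max_weight_independent_set(weights, edges):
    """Brute-force the best conflict-free subset within each component."""
    components, neighbors = connected_components(list(weights), edges)
    chosen = set()
    for component in components:
        best, best_weight = (), 0.0
        for size in range(1, len(component) + 1):
            for subset in combinations(component, size):
                members = set(subset)
                if any(neighbors[n] & members for n in subset):
                    continue  # two members of the subset conflict
                weight = sum(weights[n] for n in subset)
                if weight > best_weight:
                    best, best_weight = subset, weight
        chosen.update(best)
    return chosen


# Interaction 1 conflicts with 0 and 2; interaction 3 conflicts with nothing.
weights = {0: 5.0, 1: 8.0, 2: 5.0, 3: 2.0}
print(max_weight_independent_set(weights, [(0, 1), (1, 2)]))  # {0, 2, 3}
```

Splitting into components keeps each exhaustive search small, but the cost still grows exponentially with component size, which is consistent with the warning about large or dense datasets.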
- set_3d_distances(pdb, atom)
Wrapper for set_distances for backwards compatibility.
- set_distances(structure, atom="O2'")
Sets the Distance column value based on nt distances in the given structure.
If structure is a SecondaryStructure, contact distances are calculated, and if structure is a PDB, 3D distances are calculated. These distances are averaged across the window and stored in a new “Distance” column in self.data.
Parameters
- structurernavigate.data.SecondaryStructure or rnavigate.data.PDB
Structure object to use for calculating distances
- atomstr
atom id to use for calculating distances in a PDB structure
- update_mask(mask)
Updates the mask by ANDing the current mask with the given mask.
- class rnavigate.data.Motif(input_data, sequence, name=None, color='blue')
Bases:
Annotation
Automatically annotates the occurrences of a sequence motif as spans.
Parameters
- input_datastr
sequence motif to search for. Uses conventional nucleotide codes. e.g. “DRACH” = [AGTU] [AG] A C [ATUC]
- sequencestr or pandas.DataFrame
Nucleotide sequence, path to fasta file, or dataframe containing a “Sequence” column.
- namestr, defaults to None
Name of annotation.
- colormatplotlib color-like, defaults to “blue”
Color to be used for displaying this annotation on plots.
Attributes
- datapandas.DataFrame
Stores the list of regions that match the motif
- namestr
The label for this annotation for use on plots
- colorvalid matplotlib color
Color to represent annotation on plots
- sequencestr
The reference sequence string
- get_aligned_data(alignment)
Searches the new sequence for the motif and returns a new Motif annotation.
Parameters
- alignmentrnavigate.data.Alignment
Alignment object used to align to a new sequence.
Returns
- rnavigate.data.Motif
A new Motif with the same name, color, and motif but with the input data aligned to the target sequence.
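To make the motif search concrete, here is a minimal sketch of how a motif written in conventional nucleotide codes, such as “DRACH”, could be expanded into a regular expression and located in a sequence. The helper find_motif_spans is hypothetical, and reporting overlapping matches as 1-indexed inclusive spans is an assumption rather than documented behavior.

```python
import re

# IUPAC nucleotide codes; T and U are treated as interchangeable.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "U": "[UT]", "T": "[UT]",
    "R": "[AG]", "Y": "[CUT]", "S": "[GC]", "W": "[AUT]",
    "K": "[GUT]", "M": "[AC]", "B": "[CGUT]", "D": "[AGUT]",
    "H": "[ACUT]", "V": "[ACG]", "N": "[ACGUT]",
}


def find_motif_spans(motif, sequence):
    """Return 1-indexed, inclusive (start, end) spans of every match."""
    pattern = "".join(IUPAC[code] for code in motif.upper())
    # A lookahead lets overlapping occurrences be reported as well.
    regex = re.compile(f"(?=({pattern}))")
    return [(m.start() + 1, m.start() + len(motif))
            for m in regex.finditer(sequence.upper())]


print(find_motif_spans("DRACH", "CCGGACUAAACAGG"))  # [(3, 7), (8, 12)]
```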
- class rnavigate.data.ORFs(input_data, name=None, sequence=None, color='blue')
Bases:
Annotation
Automatically annotates occurrences of open reading frames as spans.
Parameters
- input_data“longest” or “all”
which ORFs to annotate. “longest” annotates the longest ORF. “all” annotates all potential ORFs.
- sequencestr or pandas.DataFrame
Nucleotide sequence, path to fasta file, or dataframe containing a “Sequence” column.
- namestr, defaults to None
Name of annotation.
- colormatplotlib color-like, defaults to “blue”
Color to be used for displaying this annotation on plots.
Attributes
- datapandas.DataFrame
Stores the list of regions that are annotated as ORFs
- namestr
The label for this annotation for use on plots
- colorvalid matplotlib color
Color to represent annotation on plots
- sequencestr
The reference sequence string
- get_aligned_data(alignment)
Searches the new sequence for ORFs and returns a new ORF annotation.
Parameters
- alignmentrnavigate.data.Alignment
Alignment object used to align to a new sequence.
Returns
- rnavigate.data.ORFs
A new ORFs annotation with the same name, color, and input_data but with the input data aligned to the target sequence.
- get_spans_from_orf(sequence, which='all')
Given a sequence string, returns spans for specified ORFs
Parameters
- sequencestring
RNA nucleotide sequence
- which“longest” or “all”, defaults to “all”
“all” returns all spans, “longest” returns the longest span
Returns
- list of tuples
(start, end) position of each ORF 1-indexed, inclusive
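The span convention returned by get_spans_from_orf (1-indexed, inclusive) can be illustrated with a small sketch. It assumes that an ORF runs from an AUG start codon through the first in-frame stop codon (UAA, UAG, or UGA); the exact definition used by RNAvigate is not stated above, and orf_spans is a hypothetical stand-in.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def orf_spans(sequence, which="all"):
    """Return 1-indexed, inclusive (start, end) spans of AUG...stop ORFs."""
    seq = sequence.upper().replace("T", "U")
    spans = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "AUG":
            continue
        # Walk codon by codon in the frame set by this start codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                spans.append((start + 1, pos + 3))
                break
    if which == "longest" and spans:
        return [max(spans, key=lambda span: span[1] - span[0])]
    return spans


print(orf_spans("AUGAUGUAACC"))                   # [(1, 9), (4, 9)]
print(orf_spans("AUGAUGUAACC", which="longest"))  # [(1, 9)]
```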
- class rnavigate.data.PAIRMaP(input_data, sequence=None, metric='Class', metric_defaults=None, read_table_kw=None, window=1, name=None)
Bases:
RINGMaP
A class for storing and manipulating PAIRMaP data.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing PAIRMaP data. If dataframe, the dataframe containing PAIRMaP data. The dataframe must contain columns “i”, “j”, “Statistic”, and “Class”. Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the PAIRMaP data.
- metricstring, defaults to “Class”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use with normalization of the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
The labels to use for the colorbar ticks.
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the PAIRMaP data. If an input file is provided, this value is overwritten by the value in the header.
- namestr, optional
A name for the interactions object.
Attributes
- datapandas.DataFrame
The PAIRMaP data.
- data_specific_filter(all_pairs=False, **kwargs)
Used by Interactions.filter(). By default, pairs that are neither primary nor secondary are removed. Setting all_pairs=True keeps all pairs.
Parameters
- all_pairsbool, defaults to False
whether to include all PAIRs.
Returns
- kwargsdict
any additional keyword-argument pairs are returned
- masknumpy array
a boolean array of the same length as self.data
- class rnavigate.data.PDB(input_data, chain, sequence=None, name=None)
Bases:
Sequence
A class to represent RNA tertiary structures with atomic coordinates.
This data can be used to filter interactions by 3D distance, and to visualize profile and interactions data on interactive 3D structures.
Parameters
- input_datastr
path to a PDB or CIF file
- chainstr
chain identifier of RNA of interest
- sequencernavigate.Sequence or str, optional
A sequence to use as the reference sequence. This is required if the sequence cannot be found in the header. Defaults to None.
- namestr, optional
A name for the data set. Defaults to None.
Attributes
- sequencestr
The RNA sequence
- lengthint
The length of the RNA sequence
- namestr
A name for the data set
- pathstr
The path to the PDB or CIF file
- chainstr
The chain identifier of the RNA of interest
- offsetint
The offset between the sequence positions and the PDB residue indices
- pdbBio.PDB.Structure.Structure
The PDB structure
- pdb_idxnp.array
The PDB indices of the RNA
- pdb_seqnp.array
The PDB sequence of the RNA
- distance_matrixdict
A dictionary of distance matrices for each atom type
- get_distance(i, j, atom="O2'")
Get the distance between given atom in nucleotides i and j (1-indexed).
Parameters
- iint
The first nucleotide
- jint
The second nucleotide
- atomstring or dict, defaults to “O2’”
The atom to use for distance calculations. If a string, the same atom will be used for all residues. If a dict, the atom will be chosen based on the nucleotide type. If “DMS”, the N1 atom will be used for A and G, and the N3 atom will be used for U and C.
Returns
- distancefloat
The distance between the atoms
- get_distance_matrix(atom="O2'")
Get the pairwise atomic distance matrix for all residues.
Parameters
- atomstring or dict, defaults to “O2’”
The atom to use for distance calculations. If a string, the same atom will be used for all residues. If a dict, the atom will be chosen based on the nucleotide type. If “DMS”, the N1 atom will be used for A and G, and the N3 atom will be used for U and C.
Returns
- matrixNxN numpy.ndarray
A 2D array of pairwise distances. N is the length of the RNA.
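As a rough picture of what a pairwise distance matrix contains, the sketch below computes Euclidean distances between one chosen atom per residue. The coordinates are invented for illustration, the result is a nested list rather than a numpy array, and distance_matrix is a hypothetical helper, not the method documented above.

```python
import math


def distance_matrix(coords):
    """Return an N x N nested list of Euclidean distances between points."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            # Distances are symmetric, so each pair is computed once.
            dist = math.dist(coords[a], coords[b])
            matrix[a][b] = matrix[b][a] = dist
    return matrix


# Hypothetical O2' coordinates (in angstroms) for three residues.
o2_coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (3.0, 4.0, 12.0)]
matrix = distance_matrix(o2_coords)
print(matrix[0][1], matrix[1][2], matrix[0][2])  # 5.0 12.0 13.0
```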
- get_pdb_idx(seq_idx)
Return the PDB index given the sequence index (0-indexed).
- get_seq_idx(pdb_idx)
Return the sequence index given the PDB index.
- get_sequence(pdb)
Find the sequence in the provided CIF or PDB file.
Parameters
- pdbstr
path to a PDB or CIF file
Returns
- sequencestring
The RNA sequence
- get_sequence_from_seqres(seqres)
Used by get_sequence to parse the SEQRES entries.
Parameters
- seqreslist
A list of SEQRES entries for the RNA chain of interest
Returns
- sequencestring
The RNA sequence
- get_xyz_coord(nt, atom)
Return the x, y, and z coordinates for a given residue and atom.
Parameters
- ntint
The nucleotide of interest (1-indexed)
- atomstring or dict
The atom to use for distance calculations. If a string, the same atom will be used for all residues. If a dict, the atom will be chosen based on the nucleotide type. If “DMS”, the N1 atom will be used for A and G, and the N3 atom will be used for U and C.
Returns
- xyzlist
A list of x, y, and z coordinates
- is_valid_idx(pdb_idx=None, seq_idx=None)
Determines if a PDB or sequence index is in the PDB structure.
Parameters
- pdb_idxint, optional
A PDB index (1-indexed). Defaults to None.
- seq_idxint, optional
A sequence index (1-indexed). Defaults to None.
Returns
- bool
True if the index is in the PDB structure, False otherwise.
- read_pdb(pdb)
Read a PDB or CIF file into the data structure.
Parameters
- pdbstr
path to a PDB or CIF file
- set_indices()
Uses self.data and self.sequence to set self.offset
- class rnavigate.data.PairingProbability(input_data, extension=None, sequence=None, metric='Probability', metric_defaults=None, read_table_kw=None, window=1, name=None)
Bases:
Interactions
A class for storing and manipulating pairing probability data.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing pairing probability data. If dataframe, the dataframe containing pairing probability data. The dataframe must contain columns “i”, “j”, “Probability”, and “log10p”. Dataframe may also include other columns.
- extensionstring, defaults to None
The file extension of the input_data. If None, the extension is determined from the input_data string. Options are “.bps”, “.txt”, and “.dp”. If the extension is “.bps”, the sequence is parsed from the file. If the extension is “.txt” or “.dp”, the sequence must be provided via the sequence argument.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the pairing probability data.
- metricstring, defaults to “Probability”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use with normalization of the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
The labels to use for the colorbar ticks.
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the pairing probability data.
- namestr, optional
A name for the PairingProbability object.
Attributes
- datapandas.DataFrame
The pairing probability data.
- data_specific_filter(**kwargs)
By default, interactions with probabilities less than 0.03 are removed.
Returns
- kwargsdict
any additional keyword-argument pairs are returned
- masknumpy array
a boolean array of the same length as self.data
- get_entropy_profile(print_out=False, save_file=None)
Calculates per-nucleotide Shannon entropy from pairing probabilities.
Parameters
- print_outbool, defaults to False
If True, entropy values are printed to console.
- save_filestr, defaults to None
If not None, entropy values are saved to this file.
Returns
- rnavigate.data.Profile
a Profile object containing the entropy data
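The idea behind a per-nucleotide Shannon entropy can be sketched as below. This assumes the entropy of a nucleotide is the sum of -p * log10(p) over all pairs that involve it, with no term for the unpaired state; the log base and the exact terms used by get_entropy_profile are not stated above, so treat both as assumptions. entropy_profile is a hypothetical helper.

```python
import math


def entropy_profile(pairs, length):
    """Sum -p * log10(p) over every pair that involves each nucleotide."""
    entropy = [0.0] * length
    for i, j, probability in pairs:
        if probability <= 0:
            continue  # p * log(p) tends to 0 as p approaches 0
        term = -probability * math.log10(probability)
        # i and j are 1-indexed positions; both partners receive the term.
        entropy[i - 1] += term
        entropy[j - 1] += term
    return entropy


pairs = [(1, 6, 0.5), (1, 5, 0.1), (2, 4, 1.0)]
profile = entropy_profile(pairs, length=6)
print([round(value, 3) for value in profile])
# [0.251, 0.0, 0.0, 0.0, 0.1, 0.151]
```

A nucleotide with a single certain partner (probability 1.0) contributes zero entropy, while one that splits its probability across several partners scores higher.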
- class rnavigate.data.Profile(input_data, metric='default', metric_defaults=None, read_table_kw=None, sequence=None, name=None)
Bases:
Data
A class to represent per-nucleotide data.
Parameters
- input_datastr or pandas.DataFrame
path to a csv or tab file, or a pandas DataFrame. The table must have 1 row for each nucleotide in the sequence and must contain these columns:
- A nucleotide position column labelled “Nucleotide”
- A sequence column labelled “Sequence” with 1 of (A, C, G, U, T) per row
These will be added to the table if sequence is provided.
- A data measurement column labelled “Profile” with a float or integer
The label may be another name if specified in metric_defaults.
- Optionally: a measurement error column.
The label must be specified in metric_defaults.
- Other columns may be present, and set up using metric_defaults.
See metric_defaults for more information.
- read_table_kwdict, optional
Keyword arguments to pass to pandas.read_table. Defaults to None.
- sequencernavigate.Sequence or str, optional
A sequence to use as the reference sequence. This is required if input_data does not contain a “Sequence” column. Defaults to None.
- metricstr, defaults to “default”
The name of the set of value-to-color options to use.
- “default” specifies:
“Profile” column is used
No error rates are present
Values are normalized to the range [0, 1]
Values are mapped to colors using the “viridis” colormap
- “Distance” specifies:
(3-D) “Distance” column is used
No error rates are present
Values in the range [5, 50] are normalized to the range [0, 1]
Values are mapped to colors using the “cool” colormap
Other options may be defined in metric_defaults.
- metric_defaultsdict, optional
Keys are metric names, to be used with metric. Values are dictionaries of plotting parameters:
- “metric_column”str
The name of the column to use as the metric. Plots and analyses that use per-nucleotide data will use this column. If “color_column” is not provided, this column also defines colors.
- “error_column”str or None
The name of the column to use as the error. If None, no error bars are plotted.
- “color_column”str or None
The name of the column to use for coloring. If None, colors are defined by “metric_column”.
- “cmap”str or list
The name of the colormap to use. If a list, the list of colors to use.
- “normalization”str
The type of normalization to use. In order to be used with colormaps, values are normalized to either be integers for categorical colormaps, or floats in the range [0, 1] for continuous colormaps.
- “none”: no normalization is performed
- “min_max”: values are scaled to floats in the range [0, 1] based on the upper and lower bounds defined in “values”
- “0_1”: values are scaled to floats in the range [0, 1] based on the minimum and maximum values in the data
- “bins”: values are scaled to an integer based on bins defined by the list of bounds defined in “values”
- “percentiles”: values are scaled to floats in the range [0, 1] based on upper and lower percentile bounds defined by “values”
- “values”list or None
The values to use when normalizing the data.
- if “normalization” is “min_max”, this should be a list of two values defining the upper and lower bounds.
- if “normalization” is “bins”, this should be a list of values of length 1 less than the length of cmap. Example: [5, 10, 20] defines 4 bins: (-infinity, 5), [5, 10), [10, 20), [20, infinity)
- if “normalization” is “percentiles”, this should be a list of two values defining the upper and lower percentile bounds.
- if “normalization” is “0_1” or “none”, this should be None.
- “title”str, defaults to “”
The title of the colorbar.
- “ticks”list, defaults to None
The tick locations to use for the colorbar. If None, values are determined automatically.
- “tick_labels”list, defaults to None
The labels to use for the colorbar ticks. If None, values are determined automatically from “ticks”.
- “extend”“neither”, “both”, “min”, or “max”, defaults to “neither”
Which ends of the colorbar to extend (places an arrow head).
Defaults to None.
- namestr, optional
A name for the data set. Defaults to None.
Attributes
- datapandas.DataFrame
The data table
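The “min_max” and “bins” normalization modes described under metric_defaults can be sketched in a few lines. This is an illustration under assumptions: the bounds are taken in [lower, upper] order, values outside the bounds are clipped, and both function names are hypothetical rather than part of RNAvigate.

```python
from bisect import bisect_right


def normalize_min_max(value, bounds):
    """Scale value to [0, 1] between the two bounds, clipping outside them."""
    lower, upper = bounds
    scaled = (value - lower) / (upper - lower)
    return min(1.0, max(0.0, scaled))


def normalize_bins(value, bounds):
    """Return the integer index of the bin that contains value."""
    # bisect_right places a value equal to a bound in the bin above it,
    # matching the half-open bins [5, 10), [10, 20), ...
    return bisect_right(bounds, value)


print(normalize_min_max(27.5, [5, 50]))  # 0.5
print([normalize_bins(v, [5, 10, 20]) for v in (2, 5, 19.9, 20, 300)])
# [0, 1, 2, 3, 3]
```

With three bounds there are four possible bin indices, which is why the list of bounds must be one shorter than the list of colors.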
- calculate_gini_index(values)
Calculate the Gini index of an array of values.
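For reference, a Gini index can be computed as in the sketch below. It uses the standard rank-weighted formula on sorted, non-negative values; whether calculate_gini_index uses this exact formulation is an assumption, and gini_index is a hypothetical stand-in.

```python
def gini_index(values):
    """Gini index of non-negative values: 0 is uniform, near 1 is unequal."""
    data = sorted(values)
    n = len(data)
    total = sum(data)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form of the mean absolute difference definition.
    weighted = sum(rank * value for rank, value in enumerate(data, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


print(gini_index([1, 1, 1, 1]))  # 0.0
print(gini_index([0, 0, 0, 4]))  # 0.75
```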
- calculate_windows(column, window, method='median', new_name=None, minimum_points=None, mask_na=True)
Calculates a windowed operation over a column of data.
Result is stored in a new column. Value of each window is assigned to the center position of the window.
Parameters
- columnstr
name of column to perform operation on
- windowint
window size, must be an odd number
- methodstring or function, defaults to “median”
operation to perform over windows.
- if string, must be “median”, “mean”, “minimum”, or “maximum”
- if function, must take a 1D numpy array as input and return a scalar
- new_namestr, defaults to f”{method}_{window}_nt”
name of new column for stored result.
- minimum_pointsint, defaults to value of window
minimum number of points within each window.
- mask_nabool, defaults to True
whether to mask the result of the operation where the original column has a nan value.
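The windowing behavior described for calculate_windows (odd window, result assigned to the center position, a minimum number of valid points, optional masking of NaN positions) can be sketched on a plain list. The function windowed is hypothetical, and leaving positions near the ends as NaN is an assumption about edge handling.

```python
import math
import statistics


def windowed(values, window, method=statistics.median,
             minimum_points=None, mask_na=True):
    """Apply method over centered windows; NaN where data is insufficient."""
    if window % 2 == 0:
        raise ValueError("window must be an odd number")
    if minimum_points is None:
        minimum_points = window
    half = window // 2
    result = [math.nan] * len(values)
    for center in range(half, len(values) - half):
        if mask_na and math.isnan(values[center]):
            continue  # keep NaN where the original value is NaN
        chunk = values[center - half:center + half + 1]
        valid = [v for v in chunk if not math.isnan(v)]
        if len(valid) >= minimum_points:
            # The window's value is stored at its center position.
            result[center] = method(valid)
    return result


reactivity = [0.1, 0.9, 0.2, math.nan, 0.4, 0.3, 0.8]
print(windowed(reactivity, window=3, minimum_points=2))
```

Passing a different callable as method, for example statistics.mean, changes the operation without changing the windowing logic.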
- copy()
Returns a copy of the Profile.
- classmethod from_array(input_data, sequence, **kwargs)
Construct a Profile object from an array of values.
Parameters
- input_datalist or np.array
A list or array of values to use as the metric.
- sequencestr
The RNA sequence.
- **kwargs
Additional keyword arguments to pass to the Profile constructor.
Returns
- Profile
A Profile object with the provided values.
- get_aligned_data(alignment)
Returns a new Profile object with the data aligned to a sequence.
Parameters
- alignmentrnavigate.data.SequenceAlignment
The alignment to use to map rows of self.data to a new sequence.
Returns
- Profile
A new Profile object with the data aligned to the sequence in the alignment.
- get_plotting_dataframe()
Returns a dataframe with the data to be plotted.
Returns
- pandas.DataFrame
A dataframe with the columns “Nucleotide”, “Values”, “Errors”, and “Colors”.
- norm_boxplot(values)
Removes outliers (> 1.5 * IQR) and scales the mean to 1.
NOTE: This method varies slightly from the normalization method used in the ShapeMapper pipeline. ShapeMapper sets undefined values to 0, and then uses these values when computing the IQR and 90th percentile. Including these values can skew the result. This method excludes such nan values. Other elements are the same.
Parameters
- values1D numpy array
values to normalize
Returns
- (float, float)
scaling factor and error propagation factor
- norm_eDMS(values)
Calculates normalization factors following the eDMS per-nucleotide scheme in ShapeMapper 2.2.
Parameters
- values1D numpy array
values to normalize
Returns
- (float, float)
scaling factor and error propagation factor
- norm_percentiles(values, lower_bound=90, upper_bound=99, median_or_mean='mean')
Calculates factors to scale the median between percentile bounds to 1.
Parameters
- values1D numpy array
values to normalize
- lower_boundint or float, optional
percentile of lower bound. Defaults to 90.
- upper_boundint or float, optional
percentile of upper bound. Defaults to 99.
- median_or_meanstring, optional
whether to use the median or mean of the values between the bounds. Defaults to “mean”.
Returns
- (float, float)
scaling factor and error propagation factor
- normalize(profile_column=None, new_profile=None, error_column=None, new_error=None, norm_method='boxplot', nt_groups=None, profile_factors=None, **norm_kwargs)
Normalize values in a column, and store in a new column.
By default, performs ShapeMapper2 boxplot normalization on self.metric and stores the result as “Norm_profile”.
Parameters
- profile_columnstring, defaults to self.metric
column name of values to normalize
- new_profilestring, defaults to “Norm_profile”
column name of new normalized values
- error_columnstring, defaults to self.error_column
column name of error values to propagate
- new_errorstring, defaults to “Norm_error”
column name of new propagated error values
- norm_methodstring, defaults to “boxplot”
normalization method to use.
- “DMS” uses self.norm_percentiles and nt_groups=[‘AC’, ‘UG’]
scales the median of 90th to 95th percentiles to 1. As and Cs are normalized separately from Us and Gs.
- “eDMS” uses self.norm_eDMS and nt_groups=[‘A’, ‘U’, ‘C’, ‘G’]
applies the new eDMS-MaP normalization. Each nucleotide is normalized separately.
- “boxplot” uses self.norm_boxplot and nt_groups=[‘AUCG’]
removes outliers (> 1.5 IQR) and scales median to 1. Scales nucleotides together unless specified with nt_groups.
- “percentile” uses self.norm_percentiles and nt_groups=[‘AUCG’]
scales the median of 90th to 95th percentiles to 1. Scales nucleotides together unless specified with nt_groups.
Defaults to “boxplot”: the default normalization of ShapeMapper
- nt_groupslist of strings, defaults to None
A list of nucleotides to group, e.g.
- [‘AUCG’] groups all nts together
- [‘AC’, ‘UG’] groups As with Cs and Us with Gs
- [‘A’, ‘C’, ‘U’, ‘G’] scales each nt separately
Default depends on norm_method
- profile_factorsdictionary, defaults to None
a scaling factor (float) for each nucleotide. Keys must be: ‘A’, ‘C’, ‘U’, ‘G’
Note: using this argument overrides any calculation of scaling. Defaults to None.
- **norm_kwargs
these are passed to the norm_method function
Returns
- profile_factorsdict
the new profile scaling factors dictionary
- normalize_external(profiles, **kwargs)
Normalize reactivities using scaling factors computed from other profiles.
Parameters
- profileslist of rnavigate.data.Profile
a list of other profiles used to compute scaling factors
Returns
- profile_factorsdict
the new profile scaling factors dictionary
- normalize_sequence(t_or_u='U', uppercase=True)
Changes the values in self.data[“Sequence”] to the normalized sequence.
Parameters
- t_or_u“T” or “U”, Defaults to “U”.
Whether to replace T with U or U with T.
- uppercasebool, Defaults to True.
Whether to convert the sequence to uppercase.
- property recreation_kwargs
A dictionary of keyword arguments to pass when recreating the object.
- winsorize(column, lower_bound=None, upper_bound=None)
Winsorize the data between bounds.
If either bound is set to None, one-sided Winsorization is performed.
Parameters
- columnstring
the column of data to be winsorized
- lower_boundNumber or None, defaults to None
Data below this value is set to this value. If None, no lower bound is applied.
- upper_boundNumber or None, defaults to None
Data above this value is set to this value. If None, no upper bound is applied.
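A minimal sketch of the clamping behavior described above, written on a plain list rather than on a dataframe column. The function here is a hypothetical stand-in and not the library method.

```python
def winsorize(values, lower_bound=None, upper_bound=None):
    """Clamp values to [lower_bound, upper_bound]; a bound of None is not applied."""
    clamped = []
    for value in values:
        if lower_bound is not None and value < lower_bound:
            value = lower_bound
        if upper_bound is not None and value > upper_bound:
            value = upper_bound
        clamped.append(value)
    return clamped


raw = [-0.8, 0.1, 1.2, 5.0]
two_sided = winsorize(raw, lower_bound=0.0, upper_bound=2.0)
one_sided = winsorize(raw, upper_bound=2.0)  # lower tail is left untouched
```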
- class rnavigate.data.RINGMaP(input_data, sequence=None, metric='Statistic', metric_defaults=None, read_table_kw=None, window=1, name=None)
Bases: Interactions
A class for storing and manipulating RINGMaP data.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing RINGMaP data. If dataframe, the dataframe containing RINGMaP data. The dataframe must contain columns “i”, “j”, “Statistic”, and “Zij”. Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the RINGMaP data.
- metricstring, defaults to “Statistic”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use with normalization of the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the RINGMaP data. If an input file is provided, this value is overwritten by the value in the header.
- namestr, optional
A name for the interactions object.
Attributes
- datapandas.DataFrame
The RINGMaP data.
- data_specific_filter(positive_only=False, negative_only=False, **kwargs)
Adds filters for “Sign” column to parent filter() function
Parameters
- positive_onlybool, defaults to False
If True, only keep positive correlations.
- negative_onlybool, defaults to False
If True, only keep negative correlations.
Returns
- kwargsdict
any additional keyword-argument pairs are returned
- masknumpy array
a boolean array of the same length as self.data
- get_sorted_data()
Sorts on the product of self.metric and “Sign” columns.
Except when self.metric is “Distance”.
Returns
- pandas.DataFrame
a copy of the data sorted by (self.metric * “Sign”) columns
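The sort key described above can be sketched on a list of row dictionaries. The ascending order, and sorting on the metric alone when it is “Distance”, are assumptions of this sketch; the description does not state either.

```python
def sort_rows(rows, metric):
    """Return a new list of rows sorted by metric * Sign (or by metric for "Distance")."""
    if metric == "Distance":
        return sorted(rows, key=lambda row: row[metric])
    return sorted(rows, key=lambda row: row[metric] * row["Sign"])


# Hypothetical RINGMaP-like rows
rows = [
    {"i": 1, "j": 10, "Statistic": 30.0, "Sign": 1},
    {"i": 2, "j": 9, "Statistic": 50.0, "Sign": -1},
    {"i": 3, "j": 8, "Statistic": 20.0, "Sign": 1},
]
ordered = sort_rows(rows, "Statistic")
```

With this key, strong negative correlations sort to one end and strong positive correlations to the other.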
- read_file(filepath, read_table_kw=None)
Parses a RINGMaP correlations file and stores data as a dataframe.
Also sets self.window (usually 1, from header).
Parameters
- filepathstr
path to correlations file.
- read_table_kwdict, defaults to {}
kwargs passed to pandas.read_table().
Returns
- pandas.DataFrame
the RINGMaP data
- class rnavigate.data.RNPMaP(input_data, read_table_kw=None, sequence=None, metric='NormedP', metric_defaults=None, name=None)
Bases: Profile
Represents per-nucleotide RNPMaP data.
Parameters
- input_datastr or pandas.DataFrame
path to an RNAModMapper reactivities.txt file or a pandas DataFrame
- read_table_kwdict, optional
Keyword arguments to pass to pandas.read_table. These are not necessary for reactivities.txt files. Defaults to None.
- sequencernavigate.Sequence or str, optional
A sequence to use as the reference sequence. This is not necessary for reactivities.txt files. Defaults to None.
- metricstr, defaults to “NormedP”
The name of the set of value-to-color options to use.
- metric_defaultsdict, optional
Keys are metric names, to be used with metric. Values are dictionaries of plotting parameters. Defaults to None.
- namestr, optional
A name for the data set. Defaults to None.
- class rnavigate.data.SHAPEJuMP(input_data, sequence=None, metric='Percentile', metric_defaults=None, read_table_kw=None, window=1, name=None)
Bases: Interactions
A class for storing and manipulating SHAPEJuMP data.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing SHAPEJuMP data. If dataframe, the dataframe containing SHAPEJuMP data. The dataframe must contain columns “i”, “j”, “Metric” (JuMP rate) and “Percentile” (percentile ranking). Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the SHAPEJuMP data.
- metricstring, defaults to “Percentile”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use with normalization of the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
- read_table_kwdict
kwargs passed to pandas.read_table() when reading input_data.
- windowint
The window size used to generate the SHAPEJuMP data.
- namestr
A name for the interactions object.
Attributes
- datapandas.DataFrame
The SHAPEJuMP data.
- read_file(input_data, read_table_kw=None)
Parses a deletions.txt file and stores it as a dataframe.
Also calculates a “Percentile” column.
Parameters
- input_datastr
path to deletions.txt file
- read_table_kwdict, defaults to {}
kwargs passed to pandas.read_table().
Returns
- pandas.DataFrame
the SHAPEJuMP data
- class rnavigate.data.SHAPEMaP(input_data, normalize=None, read_table_kw=None, sequence=None, metric='Norm_profile', metric_defaults=None, log=None, name=None)
Bases: Profile
A class to represent per-nucleotide SHAPE-MaP data.
Parameters
- input_datastr or pandas.DataFrame
path to a ShapeMapper2 profile.txt or .map file or a pandas DataFrame
- normalize“DMS”, “eDMS”, “boxplot”, “percentiles”, or None, defaults to None
The normalization method to use.
- “DMS” uses self.norm_percentile and nt_groups=[‘AC’, ‘UG’]
Scales the median of 90th to 95th percentiles to 1. As and Cs are normalized separately from Us and Gs.
- “eDMS” uses self.norm_eDMS and nt_groups=[‘A’, ‘U’, ‘C’, ‘G’]
Applies the new eDMS-MaP normalization. Each nucleotide is normalized separately.
- “boxplot” uses self.norm_boxplot and nt_groups=[‘AUCG’]
Removes outliers (> 1.5 IQR) and scales median to 1. Scales nucleotides together unless specified with nt_groups.
- “percentiles” uses self.norm_percentile and nt_groups=[‘AUCG’]
Scales the median of 90th to 95th percentiles to 1. Scales nucleotides together unless specified with nt_groups.
Defaults to None: no normalization is performed
- read_table_kwdict, optional
Keyword arguments to pass to pandas.read_table. These are not necessary for profile.txt and .map files. Defaults to None.
- sequencernavigate.Sequence or str, optional
A sequence to use as the reference sequence. This is not necessary for profile.txt and .map files. Defaults to None.
- metricstr, defaults to “Norm_profile”
The name of the set of value-to-color options to use. “Norm_profile” specifies:
- “Norm_profile” column is used
- “Norm_stderr” column is used for error bars
- Values are normalized to bins: (-inf, -0.4), [-0.4, 0.4), [0.4, 0.85), [0.85, 2), [2, inf)
- Bins are mapped to “grey”, “black”, “orange”, “red”, “red”
Other options may be defined in metric_defaults.
- metric_defaultsdict, optional
Keys are metric names, to be used with metric. Values are dictionaries of plotting parameters:
- “metric_column”str
The name of the column to use as the metric. Plots and analyses that use per-nucleotide data will use this column. If “color_column” is not provided, this column also defines colors.
- “error_column”str or None
The name of the column to use as the error. If None, no error bars are plotted.
- “color_column”str or None
The name of the column to use for coloring. If None, colors are defined by “metric_column”.
- “cmap”str or list
The name of the colormap to use. If a list, the list of colors to use.
- “normalization”str
The type of normalization to use. In order to be used with colormaps, values are normalized to either be integers for categorical colormaps, or floats in the range [0, 1] for continuous colormaps.
- “none”: no normalization is performed
- “min_max”: values are scaled to floats in the range [0, 1] based on the upper and lower bounds defined in “values”
- “0_1”: values are scaled to floats in the range [0, 1] based on the minimum and maximum values in the data
- “bins”: values are mapped to an integer based on bins defined by the list of bounds defined in “values”
- “percentiles”: values are scaled to floats in the range [0, 1] based on upper and lower percentile bounds defined by “values”
- “values”list or None
The values to use when normalizing the data.
- if “normalization” is “min_max”, this should be a list of two values defining the upper and lower bounds.
- if “normalization” is “bins”, this should be a list of values of length 1 less than the length of cmap. Example: [5, 10, 20] defines 4 bins: (-infinity, 5), [5, 10), [10, 20), [20, infinity)
- if “normalization” is “percentiles”, this should be a list of two values defining the upper and lower percentile bounds.
- if “normalization” is “0_1” or “none”, this should be None.
- “title”str, defaults to “”
The title of the colorbar.
- “ticks”list, defaults to None
The tick locations to use for the colorbar. If None, values are determined automatically.
- “tick_labels”list, defaults to None
The labels to use for the colorbar ticks. If None, values are determined automatically from “ticks”.
- “extend”“neither”, “both”, “min”, or “max”, defaults to “neither”
Which ends of the colorbar to extend (places an arrow head).
Defaults to None.
- logstr, optional
Path to a ShapeMapper v2 shapemap_log.txt file with mutations-per-molecule and read-length histograms. These will be present if the --per-read-histogram flag was used when running ShapeMapper v2. Currently, this does not work with ShapeMapper v2.2 files. Defaults to None.
- namestr, optional
A name for the data set. Defaults to None.
Attributes
- datapandas.DataFrame
The data table
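To make the “bins” normalization concrete, the sketch below writes the “Norm_profile” settings listed under the metric parameter as a metric_defaults-style dictionary and looks up the color for a value. The dictionary name and helper functions are hypothetical, only the keys stated in that description are filled in, and the lookup is a plain-Python illustration rather than the library's own color mapping.

```python
from bisect import bisect_right

# Settings taken from the "Norm_profile" description; the variable name is hypothetical.
norm_profile_defaults = {
    "metric_column": "Norm_profile",
    "error_column": "Norm_stderr",
    "cmap": ["grey", "black", "orange", "red", "red"],
    "normalization": "bins",
    "values": [-0.4, 0.4, 0.85, 2],
}


def bin_index(value, bounds):
    """Index of the half-open bin [bounds[k-1], bounds[k]) that contains value."""
    return bisect_right(bounds, value)


def bin_color(value, defaults):
    """Color assigned to value under a "bins" normalization."""
    return defaults["cmap"][bin_index(value, defaults["values"])]


colors = [bin_color(v, norm_profile_defaults) for v in (-1.0, 0.0, 0.5, 1.0, 3.0)]
```

Four bounds define five bins, which is why “values” must be one element shorter than “cmap”.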
- classmethod from_rnaframework(input_data, normalize=None)
Construct a SHAPEMaP object from an RNAFramework output file.
Parameters
- input_datastr
path to an RNAFramework .xml reactivities file
- normalize“DMS”, “eDMS”, “boxplot”, “percentiles”, or None, defaults to None
The normalization method to use.
- “DMS” uses self.norm_percentile and nt_groups=[‘AC’, ‘UG’]
Scales the median of 90th to 95th percentiles to 1. As and Cs are normalized separately from Us and Gs.
- “eDMS” uses self.norm_eDMS and nt_groups=[‘A’, ‘U’, ‘C’, ‘G’]
Applies the new eDMS-MaP normalization. Each nucleotide is normalized separately.
- “boxplot” uses self.norm_boxplot and nt_groups=[‘AUCG’]
Removes outliers (> 1.5 IQR) and scales median to 1. Scales nucleotides together unless specified with nt_groups.
- “percentiles” uses self.norm_percentile and nt_groups=[‘AUCG’]
Scales the median of 90th to 95th percentiles to 1. Scales nucleotides together unless specified with nt_groups.
Defaults to None: no normalization is performed
Returns
- SHAPEMaP
A SHAPEMaP object with the provided values.
- read_log(log)
Read the ShapeMapper log file.
Parameters
- logstr
Path to a ShapeMapper v2 shapemap_log.txt file with mutations-per-molecule and read-length histograms.
Returns
- read_lengthspandas.DataFrame
A dataframe with the columns “Read_length”, “Modified_read_length”, and “Untreated_read_length”.
- mutations_per_moleculepandas.DataFrame
A dataframe with the columns “Mutation_count”, “Modified_mutations_per_molecule”, and “Untreated_mutations_per_molecule”.
- class rnavigate.data.ScalarMappable(cmap, normalization, values, title='', tick_labels=None, **cbar_args)
Bases: _ScalarMappable
Used to map scalar values to a color and to create a colorbar plot.
Parameters
- cmapstr, tuple, float, or list
A valid mpl color, list of valid colors or a valid colormap name
- normalization“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- valueslist
The values to use when normalizing the data
- titlestr, defaults to “”
The title of the colorbar.
- tick_labelslist, defaults to None
The labels to use for the colorbar ticks. If None, values are determined automatically.
- **cbar_argsdict
Additional arguments to pass to the colorbar function
Attributes
- rnav_normstr
The type of normalization to use when mapping values to colors
- rnav_valslist
The values to use when normalizing the data
- rnav_cmaplist
The colors to use when mapping values to colors
- cbar_argsdict
Additional arguments to pass to the colorbar function
- tick_labelslist
The labels to use for the colorbar ticks. If None, values are determined automatically.
- titlestr
The title of the colorbar.
- get_cmap(cmap)
Converts a cmap specification to a matplotlib colormap object.
Parameters
- cmapstring, tuple, float, or list
A valid mpl color, list of valid colors or a valid colormap name
Returns
- matplotlib colormap
a colormap matching the input
- get_norm(normalization, values, cmap)
Given a normalization type and values, return a normalization object.
Parameters
- normalization“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- valueslist
The values to use when normalizing the data
- cmapmatplotlib colormap
The colormap to use when normalizing the data
Returns
- matplotlib.colors normalization object
Used to normalize data before mapping to colors
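For the continuous options, “min_max” scales against the bounds given in values while “0_1” scales against the minimum and maximum of the data, as described under the SHAPEMaP metric_defaults. A plain-Python sketch of that distinction follows. The function name is hypothetical, and clipping out-of-range values to [0, 1] is an assumption of this sketch.

```python
def scale_to_unit(data, normalization, values=None):
    """Scale data to floats in [0, 1] for use with a continuous colormap."""
    if normalization == "min_max":
        lower, upper = sorted(values)          # bounds supplied by the caller
    elif normalization == "0_1":
        lower, upper = min(data), max(data)    # bounds taken from the data
    else:
        raise ValueError(f"unsupported normalization: {normalization}")
    span = upper - lower
    if span == 0:
        return [0.0 for _ in data]
    # Clipping to [0, 1] is an assumption made for this illustration.
    return [min(1.0, max(0.0, (x - lower) / span)) for x in data]


from_data = scale_to_unit([0, 5, 10], "0_1")
from_bounds = scale_to_unit([-1, 1, 3], "min_max", values=[0, 2])
```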
- class rnavigate.data.SecondaryStructure(input_data, extension=None, autoscale=True, name=None, **kwargs)
Bases: Sequence
Base class for secondary structures.
Parameters
- input_datastr or pandas.DataFrame
A dataframe or filepath containing a secondary structure.
A DataFrame should contain these columns: [“Nucleotide”, “Sequence”, “Pair”]. The “Pair” column must be redundant, i.e. each base pair is listed at both of its positions.
Filepath parsing is determined by file extension: varna, xrna, nsd, cte, ct, dbn, bracket, json (R2DT), forna
- extensionstr, optional
The file extension of the input_data file. If not provided, the extension will be inferred from the input_data filepath.
- autoscalebool, optional
Whether to automatically scale the x and y coordinates. Defaults to True.
- namestr, optional
The name of the RNA sequence. Defaults to None.
Attributes
- datapandas.DataFrame
DataFrame storing base-pairs
- filepathstr
The path to the input file, if provided, otherwise “dataframe”
- sequencestr
The RNA sequence
- ntsnumpy.array
The “Nucleotide” column of data
- pair_ntsnumpy.array
The “Pair” column of data
- headerstr
Header information from CT file
- xcoordinatesnumpy.array
The “X_coordinate” column of data
- ycoordinatesnumpy.array
The “Y_coordinate” column of data
- distance_matrixnumpy.array
The contact distance matrix of the RNA structure
- add_pairs(pairs, break_conflicting_pairs=False)
Add base pairs to current secondary structure.
Parameters
- pairslist
1-indexed list of paired residues. e.g. [(1, 20), (2, 19)]
- break_conflicting_pairsbool, defaults to False
Whether to break existing pairs if there is a conflict
- as_interactions(structure2=None)
Returns an rnavigate.Interactions representation of this structure, optionally combined with one or more other structures.
Parameters
- structure2SecondaryStructure or list of these, defaults to None
If provided, basepairs from all structures are included and labeled by which structures contain them and how many structures contain them.
- property boolean
Return a boolean array of paired and unpaired nucleotides.
- break_noncanonical_pairs()
Removes non-canonical basepairs from the secondary structure.
WARNING: this deletes information.
- break_pairs_nts(nt_positions)
Break base pairs at the given list of positions.
WARNING: this deletes information.
Parameters
- nt_positionslist of int
1-indexed positions to break pairs
- break_pairs_region(start, end, break_crossing=True, inverse=False)
Removes pairs from the specified region (1-indexed, inclusive).
WARNING: this deletes information
Parameters
- startint
start position (1-indexed, inclusive)
- endint
end position (1-indexed, inclusive)
- break_crossingbool, defaults to True
Whether to also break pairs that cross over the specified region
- inversebool, defaults to False
Invert the behavior, i.e. remove pairs that are not in this region
- break_singleton_pairs()
Removes singleton basepairs from the secondary structure.
WARNING: This deletes information.
- compute_ppv_sens(structure2, exact=True)
Compute the PPV and sensitivity between this and another structure.
True and False are determined from this structure. Positive and Negative are determined from structure2.
PPV = TP / (TP + FP)
Sensitivity = TP / (TP + FN)
Parameters
- structure2SecondaryStructure
The SecondaryStructure to compare to.
- exactbool, defaults to True
True requires BPs to be exactly correct. False allows +/-1 bp slippage.
Returns
- float
sensitivity
- float
PPV
- 3-tuple of floats
(TP, TP+FP, TP+FN)
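The two formulas above can be worked through on plain lists of base pairs. In this sketch the first argument plays the role of this structure (the reference that defines True and False) and the second plays the role of structure2 (the structure that defines Positive and Negative). Only exact matching is covered; the +/-1 slippage of exact=False is not attempted. The function name is hypothetical.

```python
def ppv_sens_from_pairs(reference_pairs, predicted_pairs):
    """Return (sensitivity, PPV, (TP, TP+FP, TP+FN)) using exact pair matches."""
    reference = {tuple(sorted(pair)) for pair in reference_pairs}
    predicted = {tuple(sorted(pair)) for pair in predicted_pairs}
    tp = len(reference & predicted)   # pairs present in both structures
    fp = len(predicted - reference)   # predicted pairs absent from the reference
    fn = len(reference - predicted)   # reference pairs that were not predicted
    ppv = tp / (tp + fp) if predicted else 0.0
    sensitivity = tp / (tp + fn) if reference else 0.0
    return sensitivity, ppv, (tp, tp + fp, tp + fn)


reference = [(1, 20), (2, 19), (3, 18), (4, 17)]
predicted = [(1, 20), (2, 19), (5, 15)]
sensitivity, ppv, counts = ppv_sens_from_pairs(reference, predicted)
```

Here two of three predicted pairs are correct, and two of four reference pairs are recovered.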
- contact_distance(i, j)
Returns the contact distance between positions i and j
- copy()
- fill_mismatches(mismatch=1)
Adds base pairs to fill 1,1 and optionally 2,2 mismatches.
Parameters
- mismatchint, defaults to 1
1 will fill only 1,1 mismatches. 2 will fill 1,1 and 2,2 mismatches.
- classmethod from_pairs_list(input_data, sequence)
Creates a SecondaryStructure from a list of pairs and a sequence.
Parameters
- input_datalist
1-indexed list of base pairs. e.g. [(1, 20), (2, 19)]
- sequencestr
The RNA sequence. e.g., “AUCGUGUCAUGCUA”
- classmethod from_sequence(input_data)
Creates a SecondaryStructure from a sequence string.
This structure is initialized with no base pairs. If base pairs are needed, use SecondaryStructure.from_pairs_list().
- get_aligned_data(alignment)
Returns a new SecondaryStructure object matching the alignment target.
Parameters
- alignmentdata.Alignment
An alignment object used to map values
- get_distance_matrix(recalculate=False, max_cd=50)
Get a matrix of pair-wise shortest path distances through the structure.
This function uses a BFS algorithm. The structure is represented as a graph with nucleotides as vertices and base-pairs and backbone as edges. All edges are length 1. Matrix is stored as an attribute for future use.
If the attribute is set (not None) and recalculate is False, the attribute will be returned.
Based on Tom’s contact_distance, but expanded to return the pairwise matrix. New contact_distance method added to return the distance between two positions.
By default, the maximum contact distance is set to 50. This will be the maximum value reported in the matrix, i.e. a value of 50 in the matrix means >= 50. This prevents the algorithm from running for a very long time on long RNAs. If you need a larger value, set max_cd to a higher value.
Parameters
- recalculatebool, defaults to False
Set to True to recalculate the matrix even if the attribute is set.
- max_cdint, defaults to 50
The maximum contact distance to calculate.
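A minimal standalone sketch of the breadth-first search described above: nucleotides are vertices, backbone and base-pair links are unit-length edges, and distances are capped at max_cd. The function name and the list-of-lists output are assumptions of this sketch; caching and recalculation are not modeled.

```python
from collections import deque


def contact_distance_matrix(length, pairs, max_cd=50):
    """Pairwise shortest-path distances through a structure, capped at max_cd."""
    neighbors = [[] for _ in range(length)]
    for i in range(length - 1):           # backbone edges
        neighbors[i].append(i + 1)
        neighbors[i + 1].append(i)
    for i, j in pairs:                    # base-pair edges (1-indexed input)
        neighbors[i - 1].append(j - 1)
        neighbors[j - 1].append(i - 1)

    # Every entry starts at the cap; BFS overwrites entries closer than max_cd.
    matrix = [[max_cd] * length for _ in range(length)]
    for start in range(length):
        matrix[start][start] = 0
        visited = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            distance = matrix[start][node]
            if distance + 1 >= max_cd:
                continue                  # anything further stays at the cap
            for neighbor in neighbors[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    matrix[start][neighbor] = distance + 1
                    queue.append(neighbor)
    return matrix


# Hypothetical 10-nt hairpin with three base pairs
hairpin = contact_distance_matrix(10, [(1, 10), (2, 9), (3, 8)])
```

Stopping the search at max_cd is what keeps the run time bounded on long RNAs, at the cost of reporting every longer distance as max_cd.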
- get_dotbracket()
Get a dotbracket notation string representing the secondary structure.
- Pseudoknot levels:
1: (), 2: [], 3: {}, 4: <>, 5: Aa, 6: Bb, 7: Cc, etc.
Returns
- str
A dot-bracket representation of the secondary structure
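One way to build such a string from a list of pairs is sketched below. Each pair is greedily placed in the lowest level where it crosses no pair already at that level, and each level is drawn with the bracket symbols listed above. The greedy assignment is an assumption of this sketch; the library may assign pseudoknot levels differently.

```python
import string

# Bracket symbols per pseudoknot level: (), [], {}, <>, Aa, Bb, Cc, ...
BRACKETS = ["()", "[]", "{}", "<>"] + [
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
]


def crosses(pair_a, pair_b):
    """True if the two pairs (i < j and k < l) are not nested or disjoint."""
    (i, j), (k, l) = pair_a, pair_b
    return i < k < j < l or k < i < l < j


def pairs_to_dotbracket(length, pairs):
    """Dot-bracket string for 1-indexed pairs, using greedy level assignment."""
    levels = []  # levels[n] holds the pairs drawn with BRACKETS[n]
    for pair in sorted(tuple(sorted(p)) for p in pairs):
        for level in levels:
            if not any(crosses(pair, other) for other in level):
                level.append(pair)
                break
        else:
            levels.append([pair])
    if len(levels) > len(BRACKETS):
        raise ValueError("too many pseudoknot levels for the available symbols")
    symbols = ["."] * length
    for level, (open_symbol, close_symbol) in zip(levels, BRACKETS):
        for i, j in level:
            symbols[i - 1] = open_symbol
            symbols[j - 1] = close_symbol
    return "".join(symbols)


pseudoknotted = pairs_to_dotbracket(14, [(1, 10), (2, 9), (4, 13), (5, 12)])
```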
- get_helices(fill_mismatches=True, split_bulge=True, keep_singles=False)
Get a dictionary of helices from the secondary structure.
Keys are equivalent to list indices. Values are lists of paired nucleotides (1-indexed) in that helix, e.g. {0: [(1, 50), (2, 49), (3, 48)]}
Parameters
- fill_mismatchesbool, defaults to True
Whether 1-1 and 2-2 bulges are replaced with base pairs
- split_bulgebool, defaults to True
Whether to split helices on bulges
- keep_singlesbool, defaults to False
Whether to return helices that contain only 1 base-pair
Returns
- dict
A dictionary of helices
- get_human_dotbracket()
Get a human-readable dotbracket string representing the secondary structure.
This is an experimental format designed to be more human readable, i.e. no counting of brackets required.
Letters, instead of brackets, are used to denote nested base pairs.
Each helix is assigned a letter, which is incremented one letter alphabetically from the nearest enclosing stem.
Non-nested helices (pseudoknots) are assigned canonical brackets.
- From this canonical dbn string:
how many bases are in the base stem? how many nested helices are there?
((((....(((.[[..)))))(((...(((..]].))))))))
- Same question, new format:
AABB....CCC.[[..cccbbBBB...CCC..]].cccbbbaa
- Read this as:
((_______________________________________)) (level 1 = A)
  ((_______________))(((______________))) (level 2 = B)
        (((_____)))        (((_____))) (level 3 = C)
            [[__________________]] (pseudoknot = [])
- Pseudoknot levels:
1: Aa, Bb, Cc, etc., 2: [], 3: {}, 4: <>
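The lettering rule can be sketched as a conversion from a canonical dot-bracket string. A pair stacked directly on its enclosing pair continues that helix, and any other pair starts a new helix one letter deeper. This is an interpretation of the description above, not the library's code. Pseudoknot symbols are passed through unchanged, and well-formed input with at most 26 nesting levels is assumed.

```python
import string


def human_dotbracket(dotbracket):
    """Relabel nested "()" pairs with helix-level letters; other symbols pass through."""
    # First pass: find the closing partner of every "(".
    partner = {}
    stack = []
    for position, symbol in enumerate(dotbracket):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            partner[stack.pop()] = position

    # Second pass: assign a level to each pair from its enclosing pair.
    result = list(dotbracket)
    enclosing = []  # currently open pairs as (open, close, level)
    for position, symbol in enumerate(dotbracket):
        if symbol == "(":
            close = partner[position]
            if not enclosing:
                level = 0
            else:
                parent_open, parent_close, parent_level = enclosing[-1]
                stacked = position == parent_open + 1 and close == parent_close - 1
                level = parent_level if stacked else parent_level + 1
            enclosing.append((position, close, level))
            result[position] = string.ascii_uppercase[level]
            result[close] = string.ascii_lowercase[level]
        elif symbol == ")":
            enclosing.pop()
    return "".join(result)


# The example string from the description, split into chunks for readability
canonical = "((((....(((.[[..)))))" "(((...(((..]].)))" ")))))"
readable = human_dotbracket(canonical)
```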
- get_interactions_df()
Returns a DataFrame of i, j basepairs.
Returns
- pandas.DataFrame
- A DataFrame with columns:
i: the 5’ (1-indexed) position of the base pair
j: the 3’ (1-indexed) position of the base pair
Structure: always 1
- get_junction_nts()
Get a list of junction nucleotides (paired, but at the end of a chain).
Returns
- list
A list of 1-indexed positions of junction nucleotides
- get_nonredundant_ct()
Returns the ct attribute in a non-redundant form.
Only returns pairs in which i < j. For example:
self.ct[i-1] == j
self.ct[j-1] == i
BUT self.get_nonredundant_ct()[j-1] == 0
Returns
- numpy.array
A non-redundant array of base pairs
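The redundant and non-redundant forms can be compared on a small example. This sketch uses a plain list in place of the numpy array, with 0 marking an unpaired position. The helper names are hypothetical.

```python
def nonredundant_ct(ct):
    """Keep the partner only at the 5' position of each pair (i < j); 0 elsewhere."""
    return [j if i < j else 0 for i, j in enumerate(ct, start=1)]


def pairs_from_ct(ct):
    """List each base pair once as a 1-indexed (i, j) tuple with i < j."""
    return [(i, j) for i, j in enumerate(ct, start=1) if i < j]


ct = [6, 5, 0, 0, 2, 1]       # redundant: pairs (1, 6) and (2, 5) appear twice
reduced = nonredundant_ct(ct)
pairs = pairs_from_ct(ct)
```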
- get_paired_nts()
Get a list of residues that are paired.
Returns
- list
A list of 1-indexed positions of paired nucleotides
- get_pairs()
Get a non-redundant list of base pairs (i < j) as a list of tuples.
Returns
- list
A list of 1-indexed positions. e.g., [(1, 50), (2, 49), …]
- get_pseudoknots(fill_mismatches=True)
Get the pk1 and pk2 pairs from the secondary structure.
Ignores single base pairs. PK1 is defined as the helix crossing the most other bps. If there is a tie, the most 5’ helix is called pk1. Returns pk1 and pk2 as lists of base pairs, e.g. [(1,10),(2,9)…
Parameters
- fill_mismatchesbool, defaults to True
Whether 1-1 and 2-2 bulges are replaced with base pairs
Returns
- list of 2 lists of 2-tuples
A list of base pairs for pk1 and pk2
- get_structure_elements()
This code is not yet implemented.
Returns a string with a character for each nucleotide, indicating what kind of structure element it is a part of.
- Characters:
Dangling Ends (E), Stems (S), Hairpin Loops (H), Bulges (B), Internal Loops (I), MultiLoops (M), External Loops (X), Pseudoknot (P)
- get_unpaired_nts()
Get a list of residues that are unpaired.
Returns
- list
A list of 1-indexed positions of unpaired nucleotides
- normalize_dtypes()
Convert dtypes of SecondaryStructure dataframe for consistency.
- normalize_sequence(t_or_u='U', uppercase=True)
Normalize the sequence attribute (fix case and/or U <-> T).
- property nts
- property pair_nts
- read_ct(structure_number=0)
Loads secondary structure information from a given ct file.
Requires a properly formatted header.
Parameters
- structure_numberint, defaults to 0
0-indexed structure number to load from the ct file.
- read_cte()
Generates SecondaryStructure object data from a CTE file
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- read_dotbracket()
Generates SecondaryStructure object data from a dot-bracket file.
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- read_forna()
Generates SecondaryStructure object data from a FORNA JSON file.
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- read_nsd(structure_number=0)
Generates SecondaryStructure object data from an NSD file.
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- read_r2dt()
Generates SecondaryStructure object data from an R2DT JSON file.
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- read_varna()
Generates SecondaryStructure object data from a VARNA file.
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- read_xrna()
Generates SecondaryStructure object data from an XRNA file.
Resulting SecondaryStructure object will include nucleotide x and y coordinates and is compatible with plot_ss.
- transform_coordinates(flip=None, scale=None, center=None, rotate_degrees=None)
Perform transformations on X and Y structure coordinates.
To achieve vertical and horizontal flip together, rotate 180 degrees.
Parameters
- flipstr, optional
“horizontal” or “vertical”
- scalefloat, optional
new median distance of basepairs
- centertuple of floats, optional
new center x and y coordinate
- rotate_degreesfloat, optional
number of degrees to rotate structure
- write_ct(out_file)
Write structure to a ct file.
- write_cte(out_file)
Write structure to CTE format for Structure Editor.
- write_dbn(rna_name, region='all', out_file=None)
Write the structure to a dot-bracket file.
Parameters
- rna_namestr
The name of the RNA sequence
- regionlist of 2 integers, optional
The region (start and end positions) of the RNA to write to file. Defaults to “all”.
- out_filestr, optional
The name of the output file. If not provided, the dbn file is printed.
- write_sto(out_file, name='seq')
Write structure to Stockholm (STO) file to use in infernal searches.
- property xcoordinates
- property ycoordinates
- class rnavigate.data.Sequence(input_data, name=None, entry=0)
Bases: object
A class for storing and manipulating RNA sequences.
Parameters
- input_datastring or pandas.DataFrame
sequence string, fasta file, or a Pandas dataframe containing a “Sequence” column
- namestring, optional
The name of the sequence, defaults to None
- entryint, defaults to 0
The index of the sequence in the fasta file if a fasta file is provided
Attributes
- sequencestring
The sequence string
- namestring
The name of the sequence
- other_infodict
A dictionary of additional information about the sequence
- null_alignmentSequenceAlignment
An alignment of the sequence to itself
- get_aligned_data(alignment)
Get a copy of the sequence positionally aligned to another sequence.
Parameters
- alignmentrnavigate.data.Alignment
the alignment to use
Returns
- aligned_sequencernavigate.data.Sequence
the aligned sequence
- get_colors(source, pos_cmap='rainbow', profile=None, structure=None, annotations=None)
Get colors and colormap representing information about the sequence.
Parameters
- sourcestr, list, or matplotlib color-like
the source of the color information.
- if a string, must be one of: “sequence”, “position”, “profile”, “structure”, “annotations”
- if a list, must be a list of matplotlib color-like values; colormap will be None.
- if a matplotlib color-like value, all nucleotides will be colored that color; colormap will be None.
- pos_cmapstr, defaults to “rainbow”
cmap used for position colors if source is “position”
- profilernavigate.data.Profile, optional
the profile to use to get colors if source is “profile”
- structurernavigate.data.SecondaryStructure, optional
the structure to use to get colors if source is “structure”
- annotationslist of rnavigate.data.Annotations, optional
the annotations to use to get colors if source is “annotations”
Returns
- colorsnumpy array
one matplotlib color-like value for each nucleotide in self.sequence
- colormaprnavigate.data.ScalarMappable
a colormap used for creating a colorbar
- get_colors_from_annotations(annotations, default_color='gray')
Get colors and colormap representing sequence annotations.
Parameters
- annotationslist of rnavigate.data.Annotations
the annotations to use to get colors.
- default_colormatplotlib color-like, defaults to “gray”
the color to use for nucleotides not in any annotation
Returns
- colorsnumpy array
one matplotlib color-like value for each nucleotide in self.sequence
- colormaprnavigate.data.ScalarMappable
a colormap used for creating a colorbar
- get_colors_from_positions(pos_cmap='rainbow')
Get colors and colormap representing the nucleotide position.
Parameters
- pos_cmapstr, defaults to “rainbow”
cmap used for position colors
Returns
- colorsnumpy array
one matplotlib color-like value for each nucleotide in self.sequence
- colormaprnavigate.data.ScalarMappable
a colormap used for creating a colorbar
- get_colors_from_profile(profile)
Get colors and colormap representing per-nucleotide data.
Parameters
- profilernavigate.data.Profile
the profile to use to get colors.
Returns
- colorsnumpy array
one matplotlib color-like value for each nucleotide in self.sequence
- colormaprnavigate.data.ScalarMappable
a colormap used for creating a colorbar
- get_colors_from_sequence()
Get colors and colormap representing the nucleotide sequence.
Returns
- colorsnumpy array
one matplotlib color-like value for each nucleotide in self.sequence
- colormaprnavigate.data.ScalarMappable
a colormap used for creating a colorbar
- get_colors_from_structure(structure)
Get colors and colormap representing base-pairing status.
Parameters
- structurernavigate.data.SecondaryStructure
the structure to use to get colors.
Returns
- colorsnumpy array
one matplotlib color-like value for each nucleotide in self.sequence
- colormaprnavigate.data.ScalarMappable
a colormap used for creating a colorbar
- get_region(region='all')
Checks region input for validity and returns start and end positions.
If region is “all”, returns 1, self.length. Otherwise, ensures that region is between these values and returns the values, sorted.
Parameters
- regionlist of 2 int
start and end positions of the region
Returns
- start, endint, int
the starting and ending positions
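A sketch of the region check described above, taking the sequence length as an explicit argument. Raising ValueError for an out-of-range region is an assumption of this sketch; the description only says that the region is ensured to lie between 1 and the sequence length.

```python
def get_region(length, region="all"):
    """Return (start, end), 1-indexed and sorted, for a sequence of the given length."""
    if region == "all":
        return 1, length
    start, end = sorted(region)
    if start < 1 or end > length:
        # Assumed behavior: reject regions that fall outside the sequence.
        raise ValueError(f"region {region} is outside 1..{length}")
    return start, end


whole = get_region(100)
swapped = get_region(100, [40, 10])   # bounds given in reverse order
```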
- get_region_data(region='all')
Get a copy of the data object containing only the specified region.
Parameters
- regionlist of 2 int, defaults to “all”
start and end positions of the region
Returns
- region_datarnavigate.data.Sequence
the sequence containing only the specified region
- get_seq_from_dataframe(dataframe)
Parse a dataframe for the sequence string, store as self.sequence.
Parameters
- dataframepandas.DataFrame
must contain a “Sequence” column
- normalize_sequence(t_or_u='U', uppercase=True)
Converts sequence to all uppercase nucleotides and corrects T or U.
Parameters
- t_or_u“T”, “U”, or False, defaults to “U”
“T” converts “U”s to “T”s. “U” converts “T”s to “U”s. False does nothing.
- uppercasebool, defaults to True
Whether to make sequence all uppercase
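The effect of the two options can be sketched with plain string operations. This standalone function takes the sequence as an argument and returns the result, whereas the method acts on the object's own sequence.

```python
def normalize_sequence(sequence, t_or_u="U", uppercase=True):
    """Optionally uppercase a sequence and convert T to U or U to T."""
    if uppercase:
        sequence = sequence.upper()
    if t_or_u == "U":
        sequence = sequence.replace("T", "U").replace("t", "u")
    elif t_or_u == "T":
        sequence = sequence.replace("U", "T").replace("u", "t")
    return sequence  # t_or_u=False leaves T and U untouched


as_rna = normalize_sequence("augcTt")
as_dna = normalize_sequence("AUGC", t_or_u="T")
unchanged = normalize_sequence("acgt", t_or_u=False, uppercase=False)
```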
- class rnavigate.data.SequenceAlignment(sequence1, sequence2, align_kwargs=None, full=False, use_previous=True)
Bases: BaseAlignment
The most useful feature of RNAvigate. Maps positions from one sequence to a totally different sequence using user-defined pairwise alignment or automatic pairwise alignment.
Parameters
- sequence1string
the sequence to be aligned
- sequence2string
the sequence to align to
- align_kwargsdict, defaults to None
a dictionary of arguments to pass to pairwise2.align.globalms
- fullbool, defaults to False
whether to keep unmapped starting sequence positions.
- use_previousbool, defaults to True
whether to use previously set alignments
Attributes
- sequence1str
the sequence to be aligned
- sequence2str
the sequence to align to
- alignment1str
the alignment string matching sequence1 to sequence2
- alignment2str
the alignment string matching sequence2 to sequence1
- starting_sequencestr
sequence1
- target_sequencestr
sequence2 if full is False, else alignment2
- mappingnumpy.array
the alignment map array. index of starting_sequence is mapping[index] of target_sequence
- get_alignment()
Gets an alignment that was user-defined or previously calculated; if none exists, produces a new pairwise alignment between the two sequences.
Returns
- alignment1, alignment2tuple of 2 str
the alignment strings matching sequence1 and sequence2, respectively.
- get_inverse_alignment()
Gets an alignment that maps from sequence2 to sequence1.
- get_mapping()
Calculates a mapping from starting sequence to target sequence.
Returns
- mappingnumpy.array
an array that maps to an index of target sequence. index of starting_sequence is mapping[index] of target_sequence
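As a rough illustration of how such a mapping can be derived from two gapped alignment strings, consider the sketch below. It is not the library's code: it returns a plain list rather than a numpy array, uses 0-based indices, and marks unmapped positions with -1, all of which are assumptions made for this example.

```python
def get_mapping(alignment1, alignment2, full=False):
    """Map 0-based indices of sequence1 to 0-based indices of the target.

    The target is sequence2 when full is False, or the full-length
    alignment when full is True.
    """
    mapping = []
    target_index = 0
    for nt1, nt2 in zip(alignment1, alignment2):
        # A column belongs to the target if it holds a sequence2 base,
        # or unconditionally when the full alignment is the target.
        in_target = full or nt2 != "-"
        if nt1 != "-":
            mapping.append(target_index if in_target else -1)
        if in_target:
            target_index += 1
    return mapping


print(get_mapping("AC-GU", "ACCG-"))             # [0, 1, 3, -1]
print(get_mapping("AC-GU", "ACCG-", full=True))  # [0, 1, 3, 4]
```

With `full=False`, the final U of sequence1 has no partner in sequence2, so it is flagged as unmapped; with `full=True`, every position of sequence1 keeps a place in the target.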
- print(print_format='full')
Print the alignment in a human-readable format.
Parameters
- print_format“full”, “cigar”, “long” or “short”, defaults to “full”
how to format the alignment. “full”: the full length alignment with changes labeled “X”; “cigar”: the CIGAR string; “long”: locations and sequences of each change; “short”: total number of matches, mismatches, and indels.
- print_all_changes()
Print location and sequence of all changes.
- print_cigar()
Print the CIGAR string
- print_number_of_changes()
Print the total numbers of matches, mismatches, and indels.
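To make the “cigar” and “short” formats concrete, the following sketch derives both from a pair of gapped alignment strings. The operation letters (`=` match, `X` mismatch, `I` base present only in sequence2, `D` base present only in sequence1) follow the extended CIGAR convention and are an assumption here; the letters RNAvigate actually prints may differ.

```python
from collections import Counter
from itertools import groupby


def column_op(nt1, nt2):
    """Classify one alignment column."""
    if nt1 == "-":
        return "I"  # assumption: insertion relative to sequence1
    if nt2 == "-":
        return "D"  # assumption: deletion relative to sequence1
    return "=" if nt1 == nt2 else "X"


def cigar(alignment1, alignment2):
    """Run-length encode the per-column operations."""
    ops = (column_op(a, b) for a, b in zip(alignment1, alignment2))
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


def count_changes(alignment1, alignment2):
    """Return total matches, mismatches, and indels."""
    counts = Counter(column_op(a, b) for a, b in zip(alignment1, alignment2))
    return {
        "matches": counts["="],
        "mismatches": counts["X"],
        "indels": counts["I"] + counts["D"],
    }


print(cigar("ACGU--ACGU", "AGGUCCAC--"))          # 1=1X2=2I2=2D
print(count_changes("ACGU--ACGU", "AGGUCCAC--"))
```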
- class rnavigate.data.SequenceCircle(input_data, gap=30, name=None, **kwargs)
Bases:
SecondaryStructure
A circular SecondaryStructure-like representation of RNA sequence.
- class rnavigate.data.StructureAlignment(sequence1, sequence2, structure1=None, structure2=None, full=False)
Bases:
BaseAlignment
Experimental secondary structure alignment based on the RNAlign2D algorithm (https://doi.org/10.1186/s12859-021-04426-8).
Parameters
- sequence1string
the sequence to be aligned
- sequence2string
the sequence to align to
- structure1string, defaults to None
the secondary structure of sequence1
- structure2string, defaults to None
the secondary structure of sequence2
- fullbool, defaults to False
whether to align to full length of sequence2 or just mapped length
Attributes
- sequence1str
the sequence to be aligned
- sequence2str
the sequence to align to
- structure1str
the secondary structure of sequence1
- structure2str
the secondary structure of sequence2
- alignment1str
the alignment string matching sequence1 to sequence2
- alignment2str
the alignment string matching sequence2 to sequence1
- starting_sequencestr
sequence1
- target_sequencestr
sequence2 if full is False, else alignment2
- mappingnumpy.array
the alignment map array. index of starting_sequence is mapping[index] of target_sequence
- get_alignment()
Aligns pseudo-amino-acid sequences according to RNAlign2D rules.
Returns
- alignment1, alignment2tuple of 2 str
the alignment strings matching sequence1 and sequence2, respectively.
- get_inverse_alignment()
Gets an alignment that maps from sequence2 to sequence1.
- get_mapping()
Calculates a mapping from starting sequence to target sequence.
Returns
- mappingnumpy.array
an array which maps indices of the starting sequence to indices of the target sequence. starting_sequence[idx] == target_sequence[self.mapping[idx]]
- set_as_default_alignment()
Set this as the default alignment between sequence1 and sequence2.
- class rnavigate.data.StructureAsInteractions(input_data, sequence, metric=None, metric_defaults=None, window=1, name=None)
Bases:
Interactions
A class for storing and manipulating structure data.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing structure data. If dataframe, the dataframe containing structure data. The dataframe must contain columns “i”, “j”, and “Structure”. Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the structure data.
- metricstring, defaults to “Structure”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use with normalization of the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the structure data.
- namestr, optional
A name for the StructureAsInteractions object.
Attributes
- datapandas.DataFrame
The structure data.
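For orientation, the sketch below builds rows with the required “i”, “j”, and “Structure” columns from a dot-bracket string. It is only one plausible way to prepare such a table: the helper name is hypothetical, the value 1 in the “Structure” column is an assumption, and pseudoknot brackets are not handled.

```python
def pairs_from_dotbracket(dotbracket):
    """Return one row per base pair, with 1-indexed positions i < j."""
    stack, rows = [], []
    for position, symbol in enumerate(dotbracket, start=1):
        if symbol == "(":
            stack.append(position)
        elif symbol == ")":
            # Pair the closing bracket with the most recent open bracket.
            rows.append({"i": stack.pop(), "j": position, "Structure": 1})
    return sorted(rows, key=lambda row: row["i"])


for row in pairs_from_dotbracket("((..))"):
    print(row)
```

A list of dictionaries like this can be handed to `pandas.DataFrame(rows)` to obtain a dataframe with the three columns.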
- class rnavigate.data.StructureCompareMany(input_data, sequence, metric=None, metric_defaults=None, window=1, name=None)
Bases:
Interactions
A class for storing and manipulating a comparison of many structures.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing structure data. If dataframe, the dataframe containing structure data. The dataframe must contain columns “i”, “j”, and “Structure”. Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the structure data.
- metricstring, defaults to “Structure”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use with normalization of the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the structure data.
- namestr, optional
A name for the StructureCompareMany object.
Attributes
- datapandas.DataFrame
The structure data.
- class rnavigate.data.StructureCompareTwo(input_data, sequence, metric=None, metric_defaults=None, window=1, name=None)
Bases:
Interactions
A class for storing and manipulating a comparison of two structures.
Parameters
- input_datastring or pandas.DataFrame
If string, a path to a file containing structure data. If dataframe, the dataframe containing structure data. The dataframe must contain columns “i”, “j”, and “Structure”. Dataframe may also include other columns.
- sequencestring or rnavigate.data.Sequence
The sequence string corresponding to the structure data.
- metricstring, defaults to “Structure”
The column name to use for visualization.
- metric_defaultsdict
Keys are metric names and values are dictionaries of metric-specific defaults. These defaults include:
- “metric_column”string
the column name to use for visualization
- “cmap”string or matplotlib.colors.Colormap
the colormap to use for visualization
- “normalization”“min_max”, “0_1”, “none”, or “bins”
The type of normalization to use when mapping values to colors
- “values”list of float
The values to use with normalization of the data
- “title”string
the title to use for colorbars
- “extend”“min”, “max”, “both”, or “neither”
Which ends to extend when drawing the colorbar.
- “tick_labels”list of string
- read_table_kwdict, optional
kwargs passed to pandas.read_table() when reading input_data.
- windowint, defaults to 1
The window size used to generate the structure data.
- namestr, optional
A name for the StructureCompareTwo object.
Attributes
- datapandas.DataFrame
The structure data.
- class rnavigate.data.StructureCoordinates(x, y, pairs=None)
Bases:
object
Helper class to perform structure coordinate transformations.
Parameters
- xnumpy.array
x coordinates
- ynumpy.array
y coordinates
- pairslist of pairs, optional
list of base-paired positions; required only if scaling coordinates
- center(x=0, y=0)
Center structure on the given x, y coordinate
Parameters
- xint, defaults to 0
x coordinate of structure center
- yint, defaults to 0
y coordinate of structure center
- flip(horizontal=True)
Flip structure vertically or horizontally.
Parameters
- horizontalbool, defaults to True
whether to flip structure horizontally, otherwise vertically
- get_center_point()
Get the x, y coordinates for the center of structure.
Returns
- float
x coordinate of structure center
- float
y coordinate of structure center
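The three transformations can be pictured with a small pure-Python stand-in. The class below is hypothetical and uses lists instead of numpy arrays; in particular, defining the center as the midpoint of the bounding box and implementing a horizontal flip by negating x values are assumptions, since the reference above does not say how RNAvigate computes either.

```python
class Coordinates:
    """Hypothetical stand-in for a structure-coordinate helper."""

    def __init__(self, x, y):
        self.x = list(x)
        self.y = list(y)

    def get_center_point(self):
        # Assumption: the center is the midpoint of the bounding box.
        x_center = (min(self.x) + max(self.x)) / 2
        y_center = (min(self.y) + max(self.y)) / 2
        return x_center, y_center

    def center(self, x=0, y=0):
        # Shift every point so the current center lands on (x, y).
        x_center, y_center = self.get_center_point()
        self.x = [xi - x_center + x for xi in self.x]
        self.y = [yi - y_center + y for yi in self.y]

    def flip(self, horizontal=True):
        # Assumption: horizontal mirrors x values, vertical mirrors y values.
        if horizontal:
            self.x = [-xi for xi in self.x]
        else:
            self.y = [-yi for yi in self.y]


coords = Coordinates([0, 2, 4], [1, 1, 3])
coords.center()
print(coords.get_center_point())  # (0.0, 0.0)
coords.flip()
print(coords.x)
```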
- rnavigate.data.domains(input_data, names, colors, sequence)
Create a list of Annotations from a list of spans.
Currently, domains functionality in RNAvigate just uses a list of spans. In the future, this should be a dedicated class. Generally, domains should cover an entire sequence without overlap, e.g. [[1, 100], [101, 200]] for a 200 nt sequence, but this is not enforced.
Parameters
- input_datalist of lists
list of spans for each domain
- nameslist of strings
list of names for each domain
- colorslist of valid matplotlib colors
list of colors for each domain
- sequencestring
sequence to be annotated
Returns
- list of rnavigate.data.Annotation
list of Annotations
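Because full, non-overlapping coverage is not enforced, it can be worth checking the spans before passing them in. The helper below is a hypothetical pre-check written for this purpose, not part of RNAvigate; it assumes spans are 1-indexed and inclusive, as in the example above.

```python
def check_domains(spans, names, colors, sequence):
    """Return a list of problems; an empty list means the spans tile the sequence."""
    problems = []
    if not (len(spans) == len(names) == len(colors)):
        problems.append("spans, names, and colors differ in length")
    expected_start = 1
    for start, end in sorted(spans):
        if start > expected_start:
            problems.append(f"gap before position {start}")
        elif start < expected_start:
            problems.append(f"overlap at position {start}")
        expected_start = max(expected_start, end + 1)
    if expected_start <= len(sequence):
        problems.append(f"positions {expected_start}-{len(sequence)} are uncovered")
    return problems


sequence = "A" * 200
print(check_domains([[1, 100], [101, 200]], ["5' domain", "3' domain"], ["C0", "C1"], sequence))
print(check_domains([[1, 100], [90, 200]], ["5' domain", "3' domain"], ["C0", "C1"], sequence))
```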
- rnavigate.data.lookup_alignment(sequence1, sequence2, t_or_u='U')
Looks up a previously set alignment in the _alignments_cache.
Parameters
- sequence1string
The first sequence to align
- sequence2string
The second sequence to be aligned to
- t_or_u“T”, “U”, or False, defaults to “U”
“T” converts “U”s to “T”s; “U” converts “T”s to “U”s; False does nothing
Returns
- dictionary if an alignment is found, otherwise None
{“seqA”: sequence1 with gap characters representing alignment, “seqB”: sequence2 with gap characters representing alignment}
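The set-then-look-up pattern can be sketched with an ordinary dictionary. Everything here is illustrative: `cache` is a hypothetical stand-in for the library's private cache, the key layout is an assumption, and this sketch stores only the sequence1-to-sequence2 direction, which may not match what RNAvigate does.

```python
cache = {}  # hypothetical stand-in for the library's private alignment cache


def normalize(sequence, t_or_u="U"):
    """Uppercase the sequence and unify T/U so equivalent inputs share a key."""
    sequence = sequence.upper()
    if t_or_u == "U":
        return sequence.replace("T", "U")
    if t_or_u == "T":
        return sequence.replace("U", "T")
    return sequence


def set_alignment(sequence1, sequence2, alignment1, alignment2, t_or_u="U"):
    key = (normalize(sequence1, t_or_u), normalize(sequence2, t_or_u))
    cache[key] = {
        "seqA": normalize(alignment1, t_or_u),
        "seqB": normalize(alignment2, t_or_u),
    }


def lookup_alignment(sequence1, sequence2, t_or_u="U"):
    key = (normalize(sequence1, t_or_u), normalize(sequence2, t_or_u))
    return cache.get(key)  # None when nothing has been set


set_alignment("ACGU", "ACU", "ACGU", "AC-U")
print(lookup_alignment("acgt", "act"))  # found: inputs are normalized first
print(lookup_alignment("ACU", "ACGU"))  # None: only one direction is stored here
```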
- rnavigate.data.normalize_sequence(sequence, t_or_u='U', uppercase=True)
Returns sequence as all uppercase nucleotides and/or corrects T or U.
Parameters
- sequencestring or RNAvigate Sequence
The sequence. If given an RNAvigate Sequence, the sequence string is retrieved.
- t_or_u“T”, “U”, or False, defaults to “U”
“T” converts “U”s to “T”s; “U” converts “T”s to “U”s; False does nothing
- uppercase bool, defaults to True
Whether to make sequence all uppercase
Returns
- string
the cleaned-up sequence string
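The behavior of the two options can be summarized in a short stand-alone function. This is a sketch that accepts plain strings only; converting lowercase “t” and “u” when `uppercase=False` is an assumption about how the library treats mixed-case input.

```python
def normalize_sequence(sequence, t_or_u="U", uppercase=True):
    """Return the sequence with consistent case and T/U usage."""
    if uppercase:
        sequence = sequence.upper()
    if t_or_u == "U":
        # Assumption: lowercase letters are converted as well.
        sequence = sequence.replace("T", "U").replace("t", "u")
    elif t_or_u == "T":
        sequence = sequence.replace("U", "T").replace("u", "t")
    return sequence


print(normalize_sequence("acgt"))                # ACGU
print(normalize_sequence("ACGU", t_or_u="T"))    # ACGT
print(normalize_sequence("acgt", t_or_u=False))  # ACGT
```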
- rnavigate.data.set_alignment(sequence1, sequence2, alignment1, alignment2, t_or_u='U')
Add an alignment to be used as the default between two sequences.
When objects with these sequences are aligned for visualization, RNAvigate uses this alignment instead of an automated pairwise sequence alignment. alignment1 and alignment2 must have matching lengths, and each alignment string must differ from its sequence (alignment1 from sequence1, alignment2 from sequence2) only by dashes “-”.
- e.g.:
sequence1 ="AAGCUUCGGUACAUGCAAGAUGUAC"
sequence2 ="AUCGAUCGAGCUGCUGUGUACGUAC"
alignment1="AAGCUUCG---------GUACAUGCAAGAUGUAC"
alignment2="AUCGAUCGAGCUGCUGUGUAC---------GUAC"
             |mm|   | indel |    | indel |
Parameters
- sequence1string
the first sequence
- sequence2string
the second sequence
- alignment1string
first sequence, plus dashes “-” indicating indels
- alignment2string
second sequence, plus dashes “-” indicating indels
- t_or_u“T”, “U”, or False, defaults to “U”
“T” converts “U”s to “T”s; “U” converts “T”s to “U”s; False does nothing
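The two requirements on the alignment strings are easy to verify before calling the function. The validator below is a hypothetical helper, not an RNAvigate function. It uses the example sequences from above and assumes each gap run in that example is nine dashes long, which is the length at which the two alignment strings match.

```python
def validate_alignment(sequence1, sequence2, alignment1, alignment2):
    """Raise ValueError if the alignment strings break either requirement."""
    if len(alignment1) != len(alignment2):
        raise ValueError("alignment1 and alignment2 must have matching lengths")
    if alignment1.replace("-", "") != sequence1:
        raise ValueError("alignment1 must equal sequence1 once dashes are removed")
    if alignment2.replace("-", "") != sequence2:
        raise ValueError("alignment2 must equal sequence2 once dashes are removed")


sequence1 = "AAGCUUCGGUACAUGCAAGAUGUAC"
sequence2 = "AUCGAUCGAGCUGCUGUGUACGUAC"
# Assumption: nine dashes per indel, written with "-" * 9 to avoid miscounting.
alignment1 = "AAGCUUCG" + "-" * 9 + "GUACAUGCAAGAUGUAC"
alignment2 = "AUCGAUCGAGCUGCUGUGUAC" + "-" * 9 + "GUAC"

validate_alignment(sequence1, sequence2, alignment1, alignment2)
print(len(alignment1), len(alignment2))  # 34 34
```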
- rnavigate.data.set_multiple_sequence_alignment(fasta, set_pairwise=False)
Set alignments from a multiple sequence alignment Pearson fasta file.
Sets alignments to a base sequence, then returns the base sequence to be used when a multiple sequence alignment plot is desired. Also sets all pairwise alignments, if desired. When setting pairwise alignments, dashes that are shared between pairwise sequences are removed first.
Parameters
- fastastring
location of Pearson fasta file
- set_pairwisebool, defaults to False
whether to set every pairwise alignment as well as the multiple sequence alignment.
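The removal of shared dashes can be illustrated as follows. When two rows are taken out of a multiple sequence alignment, any column where both rows have a gap carries no pairwise information and is dropped. The helper name and the small in-memory alignment are hypothetical; reading the Pearson fasta file is left out of this sketch.

```python
from itertools import combinations


def remove_shared_gaps(aligned1, aligned2):
    """Drop alignment columns in which both rows contain a dash."""
    kept = [(a, b) for a, b in zip(aligned1, aligned2) if not (a == "-" and b == "-")]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


# Hypothetical rows of a multiple sequence alignment, keyed by name.
msa = {"rna1": "AC--GU-A", "rna2": "A-C-GUCA", "rna3": "AC--GUCA"}

pairwise = {
    (name1, name2): remove_shared_gaps(msa[name1], msa[name2])
    for name1, name2 in combinations(msa, 2)
}
for names, alignment in pairwise.items():
    print(names, alignment)
```

Each resulting pair of strings has equal length and differs from its ungapped sequence only by dashes, so it satisfies the requirements stated for set_alignment.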