File formats for Multiple Sequence Alignments (MSA)
ClustalW format
The ClustalW format is a relatively simple text file format that contains a single multiple sequence alignment of DNA, RNA, or protein sequences. It was first used as an output format for the clustalw programs, but nowadays it may also be generated by various other sequence alignment tools. The specification is straightforward:
The first line starts with the words
CLUSTAL W
or
CLUSTALW
After the above header there is at least one empty line
Finally, one or more blocks of sequence data follow, where consecutive blocks are separated by at least one empty line
Each line in a block of sequence data consists of the sequence name followed by the sequence symbols, separated by at least one whitespace character. Usually, the length of a sequence in one block does not exceed 60 symbols. Optionally, an additional whitespace-separated cumulative residue count may follow the sequence symbols. Optionally, a block may be followed by a line depicting the degree of conservation of the respective alignment columns.
Note
Sequence names and the sequences must not contain whitespace characters! Allowed gap symbols are the hyphen ("-"), and dot (".").
Warning
Please note that many programs that output this format tend to truncate the sequence names to a limited number of characters, for instance the first 15 characters. This can destroy the uniqueness of identifiers in your MSA.
Sequence names must not contain whitespace characters. Otherwise, the parts after the first whitespace will be dropped. The only allowed gap character is the hyphen ("-").
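As an illustration of the ClustalW rules above, the following Python sketch collects the blocks of such a file into one aligned string per sequence. The function name parse_clustal is hypothetical and not part of RNAlib, and the sketch assumes that conservation lines begin with whitespace and that sequence names are unique.

```python
def parse_clustal(text):
    """Parse ClustalW-formatted text into a dict {name: aligned sequence}."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(("CLUSTAL W", "CLUSTALW")):
        raise ValueError("missing 'CLUSTAL W' / 'CLUSTALW' header line")

    alignment = {}
    for line in lines[1:]:
        # Skip block separators and (assumption) conservation lines,
        # which are taken to start with whitespace.
        if not line.strip() or line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            raise ValueError("expected '<name> <symbols>' in: %r" % line)
        name, symbols = fields[0], fields[1]
        # fields[2], if present, is the optional cumulative residue count.
        alignment[name] = alignment.get(name, "") + symbols
    return alignment
```

Because the sequence name is the dictionary key, names that were truncated to the same prefix by the producing program would silently be merged here, which is exactly the hazard the warning above describes.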
MAF format
The multiple alignment format (MAF) is usually used to store multiple alignments at the DNA level between entire genomes. It consists of independent blocks of aligned sequences which are annotated by their genomic location. Consequently, a MAF-formatted MSA file may contain multiple records. MAF files start with a line
##maf
which is optionally extended by whitespace delimited key=value pairs. Lines starting with the character ("#") are considered comments and usually ignored.
A MAF block starts with the character ("a") at the beginning of a line, optionally followed by whitespace delimited key=value pairs. The following lines start with the character ("s") and contain sequence information of the form
s src start size strand srcSize sequence
where
src is the name of the sequence source
start is the start of the aligned region within the source (0-based)
size is the length of the aligned region without gap characters
strand is either ("+") or ("-"), depicting the location of the aligned region relative to the source
srcSize is the size of the entire sequence source, e.g. the full chromosome
sequence is the aligned sequence including gaps depicted by the hyphen ("-")
##maf version=1 scoring=tba.v8
# tba.v8 (((human chimp) baboon) (mouse rat))
# multiz.v7
# maf_project.v5 _tba_right.maf3 mouse _tba_C
# single_cov2.v4 single_cov2 /dev/stdin
a score=23262.0
s hg16.chr7 27578828 38 + 158545518 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s panTro1.chr6 28741140 38 + 161576975 AAA-GGGAATGTTAACCAAATGA---ATTGTCTCTTACGGTG
s baboon 116834 38 + 4622798 AAA-GGGAATGTTAACCAAATGA---GTTGTCTCTTATGGTG
s mm4.chr6 53215344 38 + 151104725 -AATGGGAATGTTAAGCAAACGA---ATTGTCTCTCAGTGTG
s rn3.chr4 81344243 40 + 187371129 -AA-GGGGATGCTAAGCCAATGAGTTGTTGTCTCTCAATGTG
a score=5062.0
s hg16.chr7 27699739 6 + 158545518 TAAAGA
s panTro1.chr6 28862317 6 + 161576975 TAAAGA
s baboon 241163 6 + 4622798 TAAAGA
s mm4.chr6 53303881 6 + 151104725 TAAAGA
s rn3.chr4 81444246 6 + 187371129 taagga
a score=6636.0
s hg16.chr7 27707221 13 + 158545518 gcagctgaaaaca
s panTro1.chr6 28869787 13 + 161576975 gcagctgaaaaca
s baboon 249182 13 + 4622798 gcagctgaaaaca
s mm4.chr6 53310102 13 + 151104725 ACAGCTGAAAATA
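The MAF layout described above can be read with a few lines of Python. The sketch below is only an illustration: the function name parse_maf and the dictionary keys are hypothetical, only "a" and "s" lines are interpreted, and all other line types are ignored. It also checks that the size field matches the number of non-gap symbols.

```python
def parse_maf(text):
    """Return a list of alignment blocks; each block is a list of 's' records."""
    blocks = []
    for line in text.splitlines():
        # The '##maf' header and '#' comment lines both start with '#'.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if fields[0] == "a":
            blocks.append([])  # an 'a' line opens a new alignment block
        elif fields[0] == "s" and blocks:
            src, start, size, strand, src_size, seq = fields[1:7]
            record = {
                "src": src,
                "start": int(start),        # 0-based start within the source
                "size": int(size),          # aligned length without gaps
                "strand": strand,
                "src_size": int(src_size),  # length of the entire source
                "sequence": seq,
            }
            if len(seq.replace("-", "")) != record["size"]:
                raise ValueError("size does not match sequence of " + src)
            blocks[-1].append(record)
    return blocks
```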
File formats to manipulate the RNA folding grammar
Command Files
The RNAlib and many programs of the ViennaRNA Package can parse and apply data from so-called command files. These commands may refer to structure constraints or even extensions of the RNA folding grammar (such as Unstructured Domains). Commands are given as a line of whitespace delimited data fields. The syntax we use extends the constraint definitions used in the mfold / UNAfold software, where each line begins with a command character followed by a set of positions.
However, we introduce several new commands, and allow for an optional loop type context specifier in the form of a sequence of characters, as well as an orientation flag that makes it possible to force a nucleotide to pair upstream or downstream.
Constraint commands
The following set of commands is recognized:
F Force
P Prohibit
C Conflicts/Context dependency
A Allow (for non-canonical pairs)
E Soft constraints for unpaired position(s), or base pair(s)
The optional loop type context specifier [LOOP] may be a combination of the following:
E Exterior loop
H Hairpin loop
I Interior loop
M Multibranch loop
A All loops
For structure constraints, we additionally allow one to address base pairs enclosed by a particular kind of loop, which results in the specifier [WHERE] which consists of [LOOP] plus the following character:
i enclosed pair of an Interior loop
m enclosed pair of a Multibranch loop
If no [LOOP] or [WHERE] flags are set, all contexts are considered (equivalent to A).
Controlling the orientation of base pairing
For particular nucleotides that are forced to pair, the following [ORIENTATION] flags may be used:
U Upstream
D Downstream
If no [ORIENTATION] flag is set, both directions are considered.
Sequence coordinates
Sequence positions of nucleotides/base pairs are 1-based and consist of three positions i, j, and k. Alternatively, four positions may be provided as a pair of two position ranges [i:j] and [k:l], using the '-' sign as delimiter within each range, i.e. i-j and k-l.
Valid constraint commands
Below are resulting general cases that are considered valid constraints:
"Forcing a range of nucleotide positions to be paired":
Syntax:
F i 0 k [WHERE] [ORIENTATION]
Description:
Enforces the set of k consecutive nucleotides starting at position i to be paired. The optional loop type specifier [WHERE] can be used to force them to appear as closing/enclosed pairs of certain types of loops.
"Forcing a set of consecutive base pairs to form":
Syntax:
F i j k [WHERE]
Description:
Enforces the base pairs (i,j), ..., (i+(k-1), j-(k-1)) to form. The optional loop type specifier [WHERE] can be used to specify in which loop context the base pairs must appear.
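To make the meaning of the three positions concrete, the following Python sketch enumerates the base pairs addressed by a command of the form i j k, reading them as a helix of k stacked pairs that starts with (i, j) and moves inward. The helper name expand_pairs is hypothetical and not part of RNAlib.

```python
def expand_pairs(i, j, k):
    """List the k consecutive base pairs (i, j), (i+1, j-1), ..., (i+k-1, j-k+1)."""
    if k < 1 or i + (k - 1) >= j - (k - 1):
        raise ValueError("a helix of length k does not fit between i and j")
    return [(i + n, j - n) for n in range(k)]
```

For instance, the command line F 1 20 3 would then address the pairs (1,20), (2,19), and (3,18).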
"Prohibiting a range of nucleotide positions to be paired":
Syntax:
P i 0 k [WHERE]
Description:
Prohibits a set of k consecutive nucleotides, starting at position i, from participating in base pairing, i.e. makes these positions unpaired. The optional loop type specifier [WHERE] can be used to force the nucleotides to appear within loops of specific types.
"Prohibiting a set of consecutive base pairs to form":
Syntax:
P i j k [WHERE]
Description:
Prohibits the base pairs (i,j), ..., (i+(k-1), j-(k-1)) from forming. The optional loop type specifier [WHERE] can be used to specify the type of loop for which they are disallowed as closing or enclosed pair.
"Prohibiting two ranges of nucleotides to pair with each other":
Syntax:
P i-j k-l [WHERE]
Description:
Prohibits any nucleotide p in the range [i:j] from pairing with any other nucleotide q in the range [k:l]. The optional loop type specifier [WHERE] can be used to specify the type of loop for which they are disallowed as closing or enclosed pair.
"Enforce a loop context for a range of nucleotide positions":
Syntax:
C i 0 k [WHERE]
Description:
This command enforces nucleotides to be unpaired, similar to prohibiting nucleotides to be paired as described above. It too marks the corresponding nucleotides as unpaired; however, the [WHERE] flag can be used to enforce the specific loop types the nucleotides must appear in.
"Remove pairs that conflict with a set of consecutive base pairs":
Syntax:
C i j k
Description:
Removes all base pairs that conflict with a set of consecutive base pairs (i,j), ..., (i+(k-1), j-(k-1)). Two base pairs (i,j) and (p,q) conflict with each other if i < p < j < q, or p < i < q < j.
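The conflict criterion can be written as a one-line predicate. The Python sketch below assumes that a conflict means the two pairs cross each other, i.e. exactly one end of the second pair lies strictly between the ends of the first, and that each pair is given as a tuple (5' position, 3' position). The function name pairs_conflict is hypothetical.

```python
def pairs_conflict(pair_a, pair_b):
    """Return True if base pairs (i, j) and (p, q) cross each other."""
    i, j = pair_a
    p, q = pair_b
    return i < p < j < q or p < i < q < j
```

Nested pairs such as (1,20) and (5,10), and disjoint pairs such as (1,5) and (7,12), do not conflict under this criterion.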
"Allow a set of consecutive (non-canonical) base pairs to form":
Syntax:
A i j k [WHERE]
Description:
This command enables the formation of the consecutive base pairs (i,j), ..., (i+(k-1), j-(k-1)), no matter if they are canonical or non-canonical. In contrast to the above F and W commands, which remove conflicting base pairs, the A command does not. Therefore, it may be used to allow non-canonical base pair interactions. Since the RNAlib does not contain free energy contributions for non-canonical base pairs, they are scored as the maximum of similar, known contributions. In terms of a Nussinov-like scoring function the free energy of non-canonical base pairs is therefore estimated as
The optional loop type specifier [WHERE] allows to specify in which loop context the base pair may appear.
"Apply pseudo free energy to a range of unpaired nucleotide positions":
Syntax:
E i 0 k e
Description:
Use this command to apply a pseudo free energy of e to the set of k consecutive nucleotides, starting at position i. The pseudo free energy is applied only if these nucleotides are considered unpaired in the recursions, or evaluations, and is expected to be given in kcal/mol.
"Apply pseudo free energy to a set of consecutive base pairs":
Syntax:
E i j k e
Description:
Use this command to apply a pseudo free energy of e to the set of base pairs (i,j), ..., (i+(k-1), j-(k-1)). Energies are expected to be given in kcal/mol.
Valid domain extensions commands
"Add ligand binding to unpaired motif (a.k.a. unstructured domains)":
Syntax:
UD m e [LOOP]
Description:
Add ligand binding to unpaired sequence motif m (given in IUPAC format, capital letters) with binding energy e in particular loop type(s).
Example:
UD AAA -5.0 A
The above example applies a binding free energy of -5 kcal/mol for a motif AAA that may be present in all loop types.
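The command syntax described in this section can be tokenized with a short Python sketch. This is not the parser used by RNAlib; the function name parse_command and the returned dictionary layout are hypothetical, and no semantic validation (for example of the loop type characters) is attempted.

```python
import re

RANGE = re.compile(r"^(\d+)-(\d+)$")  # a position range such as '3-10'

def parse_command(line):
    """Split one command file line into its fields (illustrative only)."""
    fields = line.split()
    cmd, rest = fields[0], fields[1:]
    if cmd == "UD":  # UD m e [LOOP]
        return {"command": cmd, "motif": rest[0], "energy": float(rest[1]),
                "loop": rest[2] if len(rest) > 2 else "A"}
    if cmd not in ("F", "P", "C", "A", "E"):
        raise ValueError("unknown command: " + cmd)

    first = RANGE.match(rest[0])
    if first:  # two ranges: i-j k-l
        second = RANGE.match(rest[1])
        if second is None:
            raise ValueError("expected a second range in: " + line)
        positions = tuple(int(x) for x in first.groups() + second.groups())
        extra = rest[2:]
    else:      # three positions: i j k
        positions = tuple(int(x) for x in rest[:3])
        if len(positions) != 3:
            raise ValueError("expected three positions in: " + line)
        extra = rest[3:]

    result = {"command": cmd, "positions": positions}
    if cmd == "E":   # E i j k e
        result["energy"] = float(extra[0])
    else:            # optional [WHERE] and [ORIENTATION] flags
        for flag in extra:
            key = "orientation" if flag in ("U", "D") else "where"
            result[key] = flag
    return result
```

The orientation flags U and D do not overlap with the loop type characters, which is why a single pass over the trailing fields is enough to tell them apart.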
File Formats for Energy Parameters
JSON Parameter Files for Modified Bases
The functions vrna_sc_mod(), vrna_sc_mod_json() and the like implement an energy correction framework to account for modified bases in the secondary structure predictions. To supply these functions with the energy parameters and general specifications of the base modification, the following JSON data format may be used:
JSON data must consist of a header section modified_base. This header is an object with the mandatory keys:
name specifying a name of the modified base
unmodified that consists of a single upper-case letter of the unmodified version of this base,
the one_letter_code key to specify which letter is used for the modified bases in the subsequent energy parameters, and
an array of pairing_partners.
The latter must be uppercase characters. An optional sources key may contain an array of related publications, e.g. those the parameters have been derived from.
Next to the header, additional keys may follow that specify the actual energy contributions of the modified base in various loop contexts. All energy contributions must be specified as free energies in units of kcal/mol. To allow for rescaling of the free energies at temperatures that differ from the default (37 °C), enthalpy parameters may be specified as well. Those, however, are optional. The keys for free energy (at 37 °C) and enthalpy parameters have the suffixes _energies and _enthalpies, respectively.
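To illustrate why enthalpies are needed for rescaling, the following Python sketch applies the common linear model in which enthalpy and entropy are treated as temperature independent. Whether the library uses exactly this formula is an assumption here, as are the reference temperature of 37 °C, the use of the same energy unit for both parameters, and the hypothetical function name rescale_energy.

```python
def rescale_energy(dg37, dh, temperature_c):
    """Rescale a free energy given at 37 degrees C to another temperature.

    dg37: free energy at 37 degrees C, dh: enthalpy (same unit as dg37),
    temperature_c: target temperature in degrees Celsius.
    """
    t37 = 37.0 + 273.15           # reference temperature in Kelvin
    t = temperature_c + 273.15    # target temperature in Kelvin
    ds = (dh - dg37) / t37        # entropy, assumed temperature independent
    return dh - t * ds            # dG(T) = dH - T * dS
```

At 37 °C the function returns the tabulated free energy unchanged, so an entry without a matching enthalpy cannot be rescaled under this model.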
The parser and underlying framework currently supports the following loop contexts:
base pair stacks (via the stacking key prefix).
This key must point to an object with one key value pair for each stacking interaction for which data is provided. Here, the key consists of four upper-case characters denoting the interacting bases, where the first two represent one strand in 5' to 3' direction and the last two the opposite strand in 3' to 5' direction. The values are energies in kcal/mol.
terminal mismatches (via the mismatch key prefix)
This key points to an object with key value pairs for each mismatch energy parameter that is available. Keys are four-character strings of nucleotide one-letter codes, as used in base pair stacks above. The second and fourth character denote the two unpaired mismatching bases, while the other two represent the closing base pair.
dangling ends (via the dangle5 and dangle3 key prefixes)
The object behind these keys, again, consists of key value pairs for each dangling end energy parameter. Keys are three characters long, where the first two represent the two nucleotides that form the base pair, and the third is the unpaired base that either stacks on the 3' or 5' end of the enclosed part of the base pair.
terminal pairs (via the terminal key prefix)
Terminal base pairs, such as AU or GU, sometimes receive an additional energy penalty. The object behind this key may list energy parameters to apply whenever particular base pairs occur at the end of a helix. Each of those parameters is specified as key value pair, where the key consists of two upper-case characters denoting the terminal base pair.
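The orientation of the four-letter stacking keys is easy to misread, so the following Python sketch spells it out. It follows the description above, where the first two characters are one strand in 5' to 3' direction and the last two the opposite strand in 3' to 5' direction, which means that characters at the same offset face each other. The function name decode_stacking_key is hypothetical.

```python
def decode_stacking_key(key):
    """Split a stacking key such as 'APUA' into its two stacked base pairs."""
    if len(key) != 4 or not key.isupper():
        raise ValueError("stacking keys consist of four upper-case characters")
    strand = key[:2]      # 5' -> 3'
    opposite = key[2:]    # 3' -> 5'
    # Characters at the same offset on the two strands form a base pair.
    return [(strand[0], opposite[0]), (strand[1], opposite[1])]
```

With this reading, the key APUA describes the pair A-U stacked on the pair P-A.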
Below is a JSON template specifying most of the possible input parameters. Actual energy parameter files can be found in the source code tarball within the misc/ subdirectory.
{
"modified_base" : {
"name" : "My modification (M)",
"sources" : [
{
"authors" : "Author 1, Author 2",
"title" : "UV-melting of modified oligos",
"journal" : "Some journal",
"year" : 2022,
"doi" : "10.0000/000000"
}
],
"unmodified" : "G",
"pairing_partners" : [
"U","A"
],
"one_letter_code" : "M",
"fallback" : "G",
"stacking_energies" : {
"MAUU" : -1.2,
"AGMC" : -2.73
},
"stacking_enthalpies" : {
"MAUU" : -11.1,
"AGMC" : -9.73
},
"terminal_energies" : {
"MU" : 0.5,
"UM" : 0.5
},
"terminal_enthalpies" : {
"MU" : 2.0,
"UM" : 2.0
},
"mismatch_energies" : {
"CMGM" : -1.11,
"AGUM" : -0.73
},
"mismatch_enthalpies" : {
"CMGM" : -11.11,
"AGUM" : -7.73
},
"dangle5_energies" : {
"UAM" : -1.01
},
"dangle5_enthalpies" : {
"UAM" : -6.01
},
"dangle3_energies" : {
"CGM" : -2.1,
"GCM" : -1.3
}
}
}
See also
misc/rna_mod_template_parameters.json in the source code tarball
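Before handing such a file to the library, it can be useful to check the mandatory header keys. The Python sketch below is a minimal, standalone check and not part of RNAlib; the function name check_modified_base is hypothetical, and the top-level key modified_base is taken from the template above.

```python
import json

MANDATORY_KEYS = ("name", "unmodified", "one_letter_code", "pairing_partners")

def check_modified_base(json_text):
    """Return the header object if all mandatory keys are present."""
    header = json.loads(json_text)["modified_base"]
    missing = [key for key in MANDATORY_KEYS if key not in header]
    if missing:
        raise ValueError("missing mandatory keys: " + ", ".join(missing))
    # Pairing partners must be single upper-case characters.
    for partner in header["pairing_partners"]:
        if len(partner) != 1 or not partner.isupper():
            raise ValueError("invalid pairing partner: %r" % partner)
    return header
```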
An actual example of real-world data may look like
{
"modified_base" : {
"name" : "Pseudouridine",
"sources" : [
{
"authors": "Graham A. Hudson, Richard J. Bloomingdale, and Brent M. Znosko",
"title" : "Thermodynamic contribution and nearest-neighbor parameters of pseudouridine-adenosine base pairs in oligoribonucleotides",
"journal" : "RNA 19:1474-1482",
"year" : 2013,
"doi" : "10.1261/rna.039610.113"
}
],
"unmodified" : "U",
"pairing_partners" : [
"A"
],
"one_letter_code" : "P",
"fallback" : "U",
"stacking_energies" : {
"APUA" : -2.8,
"CPGA" : -2.77,
"GPCA" : -3.29,
"UPAA" : -1.62,
"PAAU" : -2.10,
"PCAG" : -2.49,
"PGAC" : -2.2,
"PUAA" : -2.74
},
"stacking_enthalpies" : {
"APUA" : -22.08,
"CPGA" : -16.23,
"GPCA" : -24.07,
"UPAA" : -20.81,
"PAAU" : -12.47,
"PCAG" : -17.29,
"PGAC" : -11.19,
"PUAA" : -26.94
},
"terminal_energies" : {
"PA" : 0.31,
"AP" : 0.31
},
"terminal_enthalpies" : {
"PA" : -2.04,
"AP" : -2.04
},
"duplexes" : {
"CGAPACGGCUAUGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -9.93,
"dG37_p" : -10.12
},
"CGCPACGGCGAUGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -10.96,
"dG37_p" : -11.17
},
"CGGPACGGCCAUGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -11.71,
"dG37_p" : -11.53
},
"CGUPACGGCAAUGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -9.10,
"dG37_p" : -8.83
},
"CGAPCCGGCUAGGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -11.92,
"dG37_p" : -11.53
},
"CGCPCCGGCGAGGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -12.93,
"dG37_p" : -12.57
},
"CGGPCCGGCCAGGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -12.76,
"dG37_p" : -12.94
},
"CGUPCCGGCAAGGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -9.76,
"dG37_p" : -10.24
},
"CGAPGCGGCUACGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -11.45,
"dG37_p" : -11.40
},
"CGCPGCGGCGACGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -12.35,
"dG37_p" : -12.45
},
"CGGPGCGGCCACGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -12.59,
"dG37_p" : -12.81
},
"CGUPGCGGCAACGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -10.34,
"dG37_p" : -10.11
},
"CGAPUCGGCUAAGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -10.42,
"dG37_p" : -10.86
},
"CGCPUCGGCGAAGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -12.06,
"dG37_p" : -11.91
},
"CGGPUCGGCCAAGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -12.51,
"dG37_p" : -12.27
},
"CGUPUCGGCAAAGC" : {
"length1" : 7,
"length2" : 7,
"dG37" : -9.51,
"dG37_p" : -9.58
},
"GCGCAPCGCGUA" : {
"length1" : 6,
"length2" : 6,
"dG37" : -9.90,
"dG37_p" : -9.71
},
"GCGCCPCGCGGA" : {
"length1" : 6,
"length2" : 6,
"dG37" : -10.63,
"dG37_p" : -10.84
},
"GCGCGPCGCGCA" : {
"length1" : 6,
"length2" : 6,
"dG37" : -10.43,
"dG37_p" : -10.46
},
"GCGCUPCGCGAA" : {
"length1" : 6,
"length2" : 6,
"dG37" : -8.55,
"dG37_p" : -8.50
},
"PAGCGCAUCGCG" : {
"length1" : 6,
"length2" : 6,
"dG37" : -8.93,
"dG37_p" : -8.99
},
"PCGCGCAGCGCG" : {
"length1" : 6,
"length2" : 6,
"dG37" : -9.56,
"dG37_p" : -9.66
},
"PGGCGCACCGCG" : {
"length1" : 6,
"length2" : 6,
"dG37" : -10.30,
"dG37_p" : -10.27
},
"PUGCGCAACGCG" : {
"length1" : 6,
"length2" : 6,
"dG37" : -9.77,
"dG37_p" : -9.65
}
}
}
}
See also
misc/rna_mod_pseudouridine_parameters.json in the source code tarball
Generated on Wed Jul 19 2023 20:07:30 for RNAlib-2.6.3 by Doxygen 1.9.7