HyperMotif:
hypergeometric-based analysis of sequence motifs
USER GUIDE
Short linear motifs constitute an
important group of functional elements at the level of DNA, RNA, and
protein. At the level of DNA and RNA, the most important class of linear
motifs encompasses cis-regulatory
elements that are recognized and bound by trans-acting
elements. At the protein level, short linear motifs are involved in a
variety of processes, such as mediating protein-protein interactions,
interacting with ligands at protein active sites, mediating
enzymatic reactions, and modulating protein activity via post-translational
modifications. Functional elements often show a non-random pattern of
distribution among defined groups of genes and proteins. For example, cis-regulatory elements are over-represented in
certain groups of genes if genes are clustered based on co-expression. As
another example, signature peptides of protein active sites are
over-represented among proteins that have similar molecular functions.
Therefore, by finding motifs that are over-represented in a particular
group of genes or proteins, we can identify functional sequence elements.
HyperMotif
is a package that facilitates identification of group-specific short linear
motifs. The programs within this package implement the concept of finding
over-represented motifs using tests based on the hypergeometric
distribution. HyperMotif provides the flexibility
to search both nucleic acid and protein sequences for identification of
functional sequence motifs. Furthermore, this package allows prediction of
protein/gene categories based on the discovered short linear motifs.
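To make the idea of a hypergeometric over-representation test concrete, here is a minimal Python sketch. It is an illustration only: the function name and the example numbers are hypothetical, and this guide does not state the exact statistic or corrections that HyperMotif applies internally.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """Return P(X >= observed) for a hypergeometric variable X.

    population: total number of sequences
    successes:  sequences that contain the motif
    draws:      sequences that belong to the category
    observed:   category members that contain the motif
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(observed, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 1000 proteins, 50 of which contain the motif;
# the category has 40 members, 10 of which contain the motif.
p_value = hypergeom_upper_tail(1000, 50, 40, 10)
print(p_value)
```

A small p-value means that the motif occurs in the category more often than expected if category membership were unrelated to the motif.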
This
is a list of the components of the HyperMotif
package, along with a brief description of what they do:
HyperMotif: This
program is able to discover, de novo,
the motifs that are over-represented in predefined sets of DNA/RNA/protein
sequences.
NBPreCat: This
program is able to integrate the motifs that are discovered by HyperMotif into near optimal naïve Bayesian networks
that can be used for function prediction. It can also use predefined
classifiers (such as classifiers previously found by NBPreCat)
to identify the likely functions of novel proteins.
MotifScan: This
program can be used to search for instances of the motifs that are
discovered by HyperMotif in a set of
DNA/RNA/protein sequences.
ProMatch: This
program is able to identify PROSITE patterns that significantly overlap the
motifs discovered by HyperMotif. Any other set of
patterns can also be examined by this program as long as they are presented
in the same format as PROSITE patterns. ProMatch
is also able to identify the overlap among HyperMotif
patterns.
We
have included pre-compiled files for Windows, Linux, and Mac OS. All that
needs to be done is to download the compressed file and extract the package
on a hard drive, and then copy the appropriate binary files to the root
folder of HyperMotif. Binary files are located in
a folder named “bin”. For example, in Windows, if you have extracted the
package on drive C, you need to copy all the binary files that are located
in C:\HyperMotif\bin\win into C:\HyperMotif. In the end, you will have four
binary files in the root folder: C:\HyperMotif\HyperMotif.exe, C:\HyperMotif\NBPreCat.exe,
C:\HyperMotif\MotifScan.exe, and C:\HyperMotif\ProMatch.exe.
If
needed, you can use the provided source code to compile your own version
of HyperMotif that is compatible with the machine
you are using. In most cases, all that needs to be done is to use the makefile in the source_codes
folder to create the binary files using a GNU compiler. GNU compilers are
standard components of Linux and Mac OS. For Windows, you need to download
and install a GNU compiler, such as MinGW. The makefile in
the source_codes folder calls “mingw32-make” by default
to create the different components of the HyperMotif
package:
MAKE = mingw32-make
If
a different make program is used, the first line of source_codes/Makefile needs to be modified accordingly. For example,
in Linux, the first line should be this:
MAKE = make
The binary files will be created in
the root folder of HyperMotif, and will be ready
to use.
The
following are the file formats that the HyperMotif
package uses as the input of its programs. The output formats are described
separately for each program.
→ FASTA: FASTA is a standard file format for storage of
DNA/RNA/protein sequences. For a description of the FASTA file format, see this article on
Wikipedia. FASTA files cannot have lines longer than 999 characters when
read by HyperMotif. Longer sequences should be
broken into more than one line.
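If your FASTA files contain very long sequence lines, they can be re-wrapped before being passed to HyperMotif. The following Python sketch shows one way to do this; the function name and the default width of 80 are arbitrary choices, and header lines are passed through unchanged (so an over-long header would still need manual shortening).

```python
def wrap_fasta(lines, width=80):
    """Yield FASTA lines with each sequence re-wrapped to `width` characters."""
    seq = []

    def flush():
        # Emit the sequence collected so far in chunks of `width` characters.
        joined = "".join(seq)
        seq.clear()
        for i in range(0, len(joined), width):
            yield joined[i:i + width]

    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            yield from flush()
            yield line
        elif line:
            seq.append(line.strip())
    yield from flush()


example = [">seq1", "ACGTACGTAC", "GT", ">seq2", "TT"]
for out_line in wrap_fasta(example, width=5):
    print(out_line)
```

Any width below 1000 satisfies the 999-character limit described above.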
→ BLAST9: BLAST9 is the output format of the NCBI BLAST program
when the “-m” parameter is set to 9. In this tab-delimited format, each
line indicates a pair of homologous genes, followed by 10 numbers that are
ignored by HyperMotif, and thus can be anything.
All lines that start with the number sign (#) are considered as comment
lines and are ignored. This is an example of a BLAST9 file:
# BLASTP 2.2.17
# Query: Q5ADX5
# Database: GO.20101023.inMolFunc.fasta
Q5ADX5 P32783 0 0 0 0 0 0 0 0 0 0
This
example text simply indicates that P32783 is a homolog of Q5ADX5. The rest
of the lines are ignored.
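The reading rules above (skip comment lines, keep only the first two columns) can be sketched in Python as follows. This is not the parser used by HyperMotif; the function name is hypothetical, and splitting on any whitespace rather than strictly on tabs is an assumption made for convenience.

```python
def read_blast9_pairs(lines):
    """Return the set of (query, homolog) pairs found in BLAST9-style lines."""
    pairs = set()
    for line in lines:
        line = line.strip()
        # Lines starting with '#' are comments.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            # The remaining numeric columns are ignored.
            pairs.add((fields[0], fields[1]))
    return pairs


example = [
    "# BLASTP 2.2.17",
    "# Query: Q5ADX5",
    "Q5ADX5\tP32783\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0",
]
print(read_blast9_pairs(example))
```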
→ TABx2: TABx2 is a tab-delimited file with two columns. This
file format is used to indicate the gene/protein categories. In each line,
the first entry is the gene or protein symbol/ID, and the second entry is
the category symbol/ID. The symbols and IDs cannot contain spaces. This is
an example TABx2 file:
Tb09.160.0390 GO:0004497
Tb09.160.0450 GO:0004672
Tb09.160.0450 GO:0005524
In
this example text, it is indicated that Tb09.160.0390 belongs to the category
GO:0004497, and Tb09.160.0450 belongs to the two
categories GO:0004672 and GO:0005524. TABx2 files are used to define the
gold standard sets which will be used by HyperMotif
for motif discovery and by NBPreCat for function
prediction.
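As an illustration of how a TABx2 file maps genes or proteins to categories, the following Python sketch groups the entries by category. The function name is hypothetical, and the sketch relies on the rule stated above that symbols and IDs contain no spaces.

```python
from collections import defaultdict


def read_tabx2(lines):
    """Return a dict mapping each category ID to the set of its members."""
    members = defaultdict(set)
    for line in lines:
        fields = line.split()
        # Each valid line has exactly two entries: gene/protein and category.
        if len(fields) == 2:
            gene, category = fields
            members[category].add(gene)
    return dict(members)


example = [
    "Tb09.160.0390\tGO:0004497",
    "Tb09.160.0450\tGO:0004672",
    "Tb09.160.0450\tGO:0005524",
]
for category, genes in sorted(read_tabx2(example).items()):
    print(category, sorted(genes))
```

A gene that belongs to several categories simply appears in several of the resulting sets.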
→ OBO: OBO is a standard format for defining ontologies. OBO files are used by HyperMotif
and NBPreCat to properly construct the positive
and negative gold standard sets based on the protein-category associations
of the TABx2 file. For a description of the OBO file format, see this article.
→ SE: SE is a tab-delimited file format used by NBPreCat for defining the features of genes/proteins,
based on which functions are predicted. SE files can be automatically
generated using the MotifScan program, as
described later. SE files have four columns. The first column is the gene
or protein symbol/ID. The second column is a description, which cannot have
spaces, and is ignored by the program. The third column is the feature
associated with that gene/protein. The fourth column is the status of the
feature; if the status is “T”, that line is considered; if the status is
“F”, that line is ignored. Also, lines that start with “$” are considered
as comment lines and are ignored. This is an example SE file:
$Protein Method Motif Status
Q59J86 HyperMotif E.KKK T
P26725 HyperMotif GV..LL F
P26725 HyperMotif SGKT T
This
example text indicates that Q59J86 has an instance of the EXKKK motif, and
P26725 has an instance of each of GVXXLL and SGKT motifs, but the GVXXLL
motif should be ignored here.
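The SE rules above (four columns, “$” comments, and the T/F status) can be sketched in Python as shown here. The function name is hypothetical and the code only mirrors the description in this guide; it is not taken from the NBPreCat source.

```python
def read_se(lines):
    """Return a dict mapping each gene/protein to its set of active features."""
    features = {}
    for line in lines:
        line = line.strip()
        # Lines starting with '$' are comments.
        if not line or line.startswith("$"):
            continue
        fields = line.split()
        if len(fields) != 4:
            continue
        protein, _description, feature, status = fields
        # Only lines whose status is "T" are considered.
        if status == "T":
            features.setdefault(protein, set()).add(feature)
    return features


example = [
    "$Protein\tMethod\tMotif\tStatus",
    "Q59J86\tHyperMotif\tE.KKK\tT",
    "P26725\tHyperMotif\tGV..LL\tF",
    "P26725\tHyperMotif\tSGKT\tT",
]
print(read_se(example))
```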
→ NOA: NOA is the file format that is used by MotifScan as the source of motif definitions in order
to find the motif instances. NOA files are originally used by Cytoscape to
describe node attributes. We have adapted the same format to describe
motifs and their nature (linear RNA, DNA, or protein motifs). The first line
of the NOA files that are used by MotifScan must
be this:
$SequenceElementType (class=java.lang.String)
The
rest of the lines define the motifs:
[Motif Sequence] = [Motif Type]
where
instead of [Motif Sequence] and [Motif Type] appropriate strings must be
written. Here is an example:
$SequenceElementType (class=java.lang.String)
G.G[LI]KT = PMOTIF
ATCG[CG]AG = 5'MOTIF
AUAUU.GAU = 3'MOTIF
This
example text defines three motifs: GXG[LI]KT is a
protein motif, ATCG[CG]AG is a DNA motif, and AUAUUNGAU is an RNA motif.
Protein motifs may contain the letter ‘X’ and nucleotide motifs may contain
the letter ‘N’, indicating a fully degenerate position. The character ‘.’ can be
used for both nucleotide and protein motifs, again
indicating fully degenerate sites. Degenerate sites can also be specified
using square brackets. For example, [LI] means a position where either ‘L’
or ‘I’ can occur. For protein motifs, B
is equal to [DN], and Z is equal to [EQ]. ‘U’ and ‘T’ can be used for both
DNA and RNA motifs.
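The motif syntax above maps naturally onto regular expressions. The following Python sketch shows one possible translation; it is an illustration under stated assumptions, not the matching code used by MotifScan. In particular, the function name is hypothetical, ‘U’ and ‘T’ are assumed to be interchangeable in nucleotide motifs, and matching is assumed to be done on upper-case sequences.

```python
import re


def motif_to_regex(motif, motif_type):
    """Compile a motif string into a regular expression.

    motif_type is "PMOTIF" for proteins; any other value is treated
    as a nucleotide motif (for example 5'MOTIF or 3'MOTIF).
    """
    if motif_type == "PMOTIF":
        aliases = {"X": ".", "B": "[DN]", "Z": "[EQ]"}
    else:
        aliases = {"N": ".", "U": "[TU]", "T": "[TU]"}
    parts = []
    in_class = False
    for ch in motif.upper():
        if ch == "[":
            in_class = True
            parts.append(ch)
        elif ch == "]":
            in_class = False
            parts.append(ch)
        elif ch in aliases and not (in_class and aliases[ch] == "."):
            repl = aliases[ch]
            # Inside square brackets, add the member letters without brackets.
            parts.append(repl.strip("[]") if in_class else repl)
        else:
            parts.append(ch)
    return re.compile("".join(parts))


protein_motif = motif_to_regex("G.G[LI]KT", "PMOTIF")
print(bool(protein_motif.search("MKAGAGLKTLL")))
print(motif_to_regex("AUAUU.GAU", "3'MOTIF").pattern)
```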
→ PFF: PFF is one of the output formats of ps_scan.pl from
PROSITE. PFF files describe the instances of PROSITE patterns in query
protein sequences. ProMatch uses PFF files as its
input to identify significant overlaps between PROSITE patterns and other
short protein motifs, such as those discovered by HyperMotif.
PFF files are tab-delimited files with four columns. The first column is
the protein name; the second column is the pattern start position; the
third column is the pattern end position; and the fourth column is the
pattern PROSITE accession number.
→ NB: NB files describe the naïve Bayesian networks for
prediction of gene/protein categories. These files are both generated and
read by NBPreCat. Each classifier starts with the
letter ‘>’, followed by the name of the corresponding category. The next
line, which must start with the string “#Cutoff:”,
determines the likelihood cutoff above which an object (gene/protein) is
deemed to belong to this category. The next lines describe the features
that are used by this naïve Bayesian classifier. These lines are
tab-delimited, with six columns: (1) the name of the feature, (2) the
p-value of association of this feature with this category, (3) the q-value
(FDR) of association, (4) the likelihood of a protein being in this category
given that the protein has this feature, (5) the likelihood of a protein
being in this category given that the protein does NOT have this feature,
(6) the table that describes the number of gold standard positive/negative
proteins which had/did not have this feature. When reading an NB file, NBPreCat only considers the first, fourth and fifth
columns, and ignores the rest. The following is an example NB file:
$ELEMENT P-VALUE Q-VALUE L(C|E) L(C|E') (C'&E')/(C'&E)/(C&E')/(C&E)
>GO:0004672
#Cutoff: 48.1217
EV.II 1.72171e-006 0.0168177 10.2413 0.917813 3157/6/65/6
GKT..L 3.26958e-005 0.159686 7.22036 0.908947 3139/24/64/7
LKK.LS 3.83083e-005 0.124732 8.61331 0.944352 3162/1/67/4
R.S.RY 3.83902e-005 0.0937488 8.66575 0.931825 3157/6/66/5
>GO:0005089
#Cutoff: 32.8219
F.SK.S 2.42838e-006 0.0237204 8.42924 0.386629 3172/18/2/4
TEIS.A 1.95618e-005 0.0955396 6.92188 0.538929 3187/3/3/3
In this example
text, two naïve Bayesian classifiers are defined, one for GO:0004672 and one for GO:0005089. Lines that start with
‘$’ are considered as comment lines and are ignored.
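For readers who want to inspect NB files programmatically, here is a minimal Python sketch that keeps only the information NBPreCat is described as reading: the category name, the cutoff, and the first, fourth and fifth columns of each feature line. The function name and the returned data layout are hypothetical. How the per-feature likelihoods are combined into a final score is not described in this guide, so the sketch does not attempt it.

```python
def read_nb(lines):
    """Return {category: {"cutoff": float, "features": {name: (L_with, L_without)}}}."""
    classifiers = {}
    current = None
    for line in lines:
        line = line.strip()
        # Lines starting with '$' are comments.
        if not line or line.startswith("$"):
            continue
        if line.startswith(">"):
            current = {"cutoff": None, "features": {}}
            classifiers[line[1:].strip()] = current
        elif current is None:
            raise ValueError("data line found before the first '>' header")
        elif line.startswith("#Cutoff:"):
            current["cutoff"] = float(line.split(":", 1)[1])
        else:
            fields = line.split()
            # Columns 1, 4 and 5: feature name, L(C|E), L(C|E').
            current["features"][fields[0]] = (float(fields[3]), float(fields[4]))
    return classifiers


example = [
    "$ELEMENT\tP-VALUE\tQ-VALUE\tL(C|E)\tL(C|E')\tTABLE",
    ">GO:0005089",
    "#Cutoff: 32.8219",
    "F.SK.S\t2.42838e-006\t0.0237204\t8.42924\t0.386629\t3172/18/2/4",
    "TEIS.A\t1.95618e-005\t0.0955396\t6.92188\t0.538929\t3187/3/3/3",
]
print(read_nb(example))
```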
HyperMotif is a program for de novo discovery of short linear
protein and nucleotide motifs that are specific to pre-determined sets of
gene/protein categories. The input of HyperMotif
is a list of genes/proteins with their corresponding categories, their
associated sequences, a file that describes the homologous gene/protein
pairs, and a file or, alternatively, a set of MySQL
tables that describe the category ontology relationships. HyperMotif has the following parameters:
-fasta [string]: The FASTA
input file containing the sequences. The maximum number of sequences is
100000, and the maximum sequence length is 30000.
-cat [string]:
A TABx2 file in which each line contains the name of a protein followed by
the name of the category to which it belongs. The maximum number of
categories is 100000.
-sql [string]: The folder
that contains the MySQL tables for category
ontology relationships. This folder needs to have two files, one named
“term.txt”, and one named “graph_path.txt”. “term.txt” contains the
following entries per line: [id], [name], [term_type],
[acc], [is_obsolete], [is_root],
[is_relation].
“graph_path.txt” contains the following entries per line: [id], [term1_id],
[term2_id], [relationship_type_id], [distance], [relation_distance]. This
parameter is optional, and is an alternative to the parameter -obo.
-obo [string]:
The OBO file that describes the category ontology relationships. This
parameter is optional, and is an alternative to the parameter -sql.
-only [string]:
If this parameter is used, the motifs will be found only for the specified
category.
-blast [string]:
The BLAST9 file describing homologous pairs of genes/proteins. This
parameter is optional.
-minsize [integer]: The
minimum size of a category to be considered. Categories whose number of
members is less than this number will be ignored. The default value is 10.
-window
[integer]: The window size for scanning the query
sequences in order to find linear motifs. The default value is 6, the
maximum allowed value is 8 and the minimum is 2.
-mininf [integer]: The minimum
number of informative residues per motif. This parameter cannot be larger
than -window. The minimum value is 2, and the default is 4 unless -window
is smaller than 4.
-q [double]:
The q-value (FDR) cutoff for motifs. The default value is 0.01, the minimum
is 0 and the maximum is 1.
-out [string]:
The generic output name. HyperMotif creates
several output files, the names of which start with this string.
This
is an example usage of HyperMotif:
HyperMotif -fasta proteins.fasta -cat categories.tab -only GO:0005089 -sql SQLFolder -blast homologs.blast -minsize 15 -window 7 -mininf 4 -q 0.1 -out test.output
The
output of HyperMotif consists of three files:
[output].motifs.tab:
This file contains the motif-category associations found by HyperMotif. The file has five columns: (1) the motif,
(2) the associated category, (3) the p-value of association, (4) the
q-value (FDR) of association, (5) the table that describes the number of
proteins that are/are not in that category and contain/do not contain the
motif. The elements of this table are as follows: The number of proteins
that are not in the category and do not have the motif / the number of
proteins that are in the category but do not have the motif / the number of
proteins that are not in the category but have the motif / the number of
proteins that are in the category and have the motif.
[output].motifs.noa:
This NOA file can later be used by MotifScan in
order to find motif instances in other query protein or nucleotide
sequences. By default, all motifs are annotated as PMOTIF. The user must
change this manually if it is not correct.
[output].se: This SE file
can later be used by NBPreCat in order to find
classifiers that are able to predict categories.
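The count table in the fifth column of [output].motifs.tab can be unpacked with a few lines of Python. The sketch below follows the element order given above for that file; the class and function names are hypothetical, and the summary fractions are only a convenience for reading the table, not values reported by HyperMotif.

```python
from typing import NamedTuple


class MotifCounts(NamedTuple):
    out_without: int  # not in the category, no motif
    in_without: int   # in the category, no motif
    out_with: int     # not in the category, has the motif
    in_with: int      # in the category, has the motif


def parse_counts(field):
    """Parse a slash-separated count table such as '3000/60/10/8'."""
    return MotifCounts(*(int(x) for x in field.split("/")))


def motif_fractions(counts):
    """Return the fraction of proteins with the motif inside and outside the category."""
    in_total = counts.in_with + counts.in_without
    out_total = counts.out_with + counts.out_without
    frac_in = counts.in_with / in_total if in_total else 0.0
    frac_out = counts.out_with / out_total if out_total else 0.0
    return frac_in, frac_out


# Hypothetical table: 3000/60/10/8
counts = parse_counts("3000/60/10/8")
print(counts, motif_fractions(counts))
```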
NBPreCat
finds near-optimal sets of features that can be used in the context of
naïve Bayesian networks to predict the categories to which genes/proteins
belong. It then creates and validates these naïve Bayesian networks, and
uses them to make new predictions for previously uncharacterized
genes/proteins. NBPreCat takes the following
parameters:
-se [string]:
The feature input file in the SE format. The maximum number of
genes/proteins is 100000, and the maximum number of features is 10000.
-cat [string]:
A TABx2 file in which each line contains the name of a protein followed by
the name of the category to which it belongs. The maximum number of
categories is 100000.
-sql [string]: The folder
that contains the MySQL tables for category
ontology relationships. This folder needs to have two files, one named
“term.txt”, and one named “graph_path.txt”. “term.txt” contains the
following entries per line: [id], [name], [term_type],
[acc], [is_obsolete], [is_root],
[is_relation].
“graph_path.txt” contains the following entries per line: [id], [term1_id],
[term2_id], [relationship_type_id], [distance], [relation_distance]. This
parameter is optional, and is an alternative to the parameter -obo.
-obo [string]:
The OBO file that describes the category ontology relationships. This
parameter is optional, and is an alternative to the parameter -sql.
-only [string]:
If this parameter is used, one classifier will be trained only for the
specified category.
-blast [string]:
The BLAST9 file describing homologous pairs of genes/proteins. This
parameter is optional.
-q [double]:
The q-value (FDR) cutoff for finding features that are significantly
associated with the categories. The default value is 0.5, the minimum is 0
and the maximum is 1. For each category, only features whose q-values are
below this number will be considered.
-ppv [double]: The precision
cutoff for prediction of gene/protein-category assignments. The minimum
value is 0, the maximum is 1, and the default value is 0.5. If a classifier
does not meet this precision at any sensitivity, the classifier will not be
written in the output.
-class [string]:
The classifier file in the NB format, created previously by NBPreCat. If this parameter is provided, instead of
training a classifier, NBPreCat will simply apply
the provided classifier for making new predictions. If this parameter is
provided, the following parameters will be ignored: -blast, -q, -ppv. Also, if this parameter is used, the parameter
-cat becomes optional.
-out [string]:
The generic output name. NBPreCat creates several
output files, the names of which start with this string.
This
is an example usage of NBPreCat:
NBPreCat -se features.se -cat categories.tab -only GO:0005089 -sql SQLFolder -blast homologs.blast -q 0.1 -ppv 0.8 -out test.output
This
is another example:
NBPreCat -se features.se -class classifiers.tab -out test.output
The
output of NBPreCat consists of four files:
[output].all.tab:
This file contains the results of cross-validation of all the trained
classifiers, regardless of whether the classifiers meet the precision
criterion or not. Each classifier starts with ‘>’ followed by the name
of the corresponding category. In the next lines, the likelihoods that are
calculated by this classifier for each gene/protein are indicated. Each
line has five columns: (1) the name of the gene/protein, (2) whether the
protein belongs to this category (TP), does not belong to this category
(FP), or is uncharacterized for this category (UNKNOWN), (3) whether this
protein is included in the training/validation set (+) or is not included
(-), (4) the score of association with this category, and (5) the
likelihood, which is the same as the score in column (4). Lines that start with ‘$’ are comments.
[output].classifiers.tab:
The NB file containing the trained classifiers that meet the precision
cutoff.
[output].roc: The ROC
curves for classifiers that meet the precision cutoff. Each classifier
begins with ‘#’ followed by the name of the associated category. The next
line indicates the different likelihood values, with the likelihood value
that results in the desired precision marked by an asterisk ‘*’ on top of
it. The next three lines give the number of TP, the number of FP, and the
precision value (PPV) associated with each likelihood value.
[output].predictions.tab:
This file describes the predictions made by the classifiers that meet the
precision cutoff. Each line has six entries: (1) the gene/protein name, (2)
the predicted category, (3) the status of the prediction (TP: true
positive, FP: false positive, and UNKNOWN), (4) the likelihood of this
gene/protein-category association, (5) the estimated precision at this
likelihood, (6) the probability that a random classifier would achieve the
same sensitivity as this classifier at the specified precision.
It
should be noted that if the parameter -class is used, of the above four
outputs only the last one will be created.
MotifScan
is a simple tool to search DNA, RNA and protein sequences for instances of
short linear motifs. MotifScan takes the
following parameters:
-protein
[string]: A FASTA file that contains a set of
protein sequences based on which the profiles of protein motifs will be
determined.
-5utr [string]:
A FASTA file that contains a set of nucleotide sequences based on which the
profiles of 5' UTR motifs will be determined. Note that these sequences will
be treated as DNA sequences. Therefore, both the forward and reverse
strands will be searched for motif instances.
-3utr [string]:
A FASTA file that contains a set of nucleotide sequences based on which the
profiles of 3' UTR motifs will be determined. These sequences will be
treated as RNA sequences, and therefore only the forward strand will be
searched for motif instances.
-noa [string]: The NOA file
that contains short linear motifs and their types (PMOTIF, 5'MOTIF,
3'MOTIF).
-out [string]:
The output file in the SE format.
NOTE:
The maximum number of sequences, including DNA sequences, reverse
complements of DNA sequences, RNA sequences, and protein sequences is
100000. The maximum sequence length is 10000.
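The difference between the DNA-style search (both strands) and the RNA-style search (forward strand only) can be illustrated with the Python sketch below. It is a simplified illustration, not MotifScan's implementation: the function names are hypothetical, the motif is assumed to have already been written as a regular expression, and only non-overlapping matches are reported.

```python
import re

# Complement table for upper-case nucleotides; 'U' is complemented to 'A'.
_COMPLEMENT = str.maketrans("ACGTUN", "TGCAAN")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def find_on_both_strands(pattern, seq):
    """Return (strand, start) pairs; start is 0-based on the scanned strand."""
    regex = re.compile(pattern)
    hits = [("+", m.start()) for m in regex.finditer(seq.upper())]
    hits += [("-", m.start()) for m in regex.finditer(reverse_complement(seq))]
    return hits


# The motif ATCG[CG]AG occurs only on the reverse strand of this sequence.
print(find_on_both_strands("ATCG[CG]AG", "TTCTCCGATAA"))
```

For a forward-strand-only search, as described for -3utr, the second `finditer` call would simply be omitted.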
This
is an example usage of MotifScan:
MotifScan -protein proteins.fasta -noa motifs.noa -out test.output
MotifScan
automatically adds the extension .se to the output file name.
ProMatch
is a tool to compare a set of motifs with PROSITE patterns. This program
identifies motifs whose instances significantly overlap those of PROSITE
patterns. The input parameters of ProMatch are as
follows:
-fasta [string]: A FASTA file
that contains a set of protein sequences based on which the profiles of
protein motifs will be determined. The maximum number of protein sequences
is 100000, and the maximum sequence length is 100000.
-noa [string]: The NOA file
that contains short linear protein motifs. Only motifs of the type PMOTIF
will be considered. The maximum number of motifs is 100000.
-prosite [string]: A PFF file
describing the occupancy profile of PROSITE patterns in the proteins. The
maximum number of pattern instances per protein is 1000. This parameter is
optional. If this parameter is not used, ProMatch
will identify the significant overlaps among the motifs that are read from
the NOA file (redundancy-check).
-out [string]:
The output tab-delimited file.
This
is an example usage of ProMatch:
ProMatch -fasta proteins.fasta -noa motifs.noa -prosite prosite.pff -out test.output
This
is another example:
ProMatch -fasta proteins.fasta -noa motifs.noa -out test.output
ProMatch
output is a tab-delimited file with four columns. Each line represents a
significant match between a motif and a pattern (or a motif and another
motif if the -prosite parameter is not used). The
first column is the motif, the second column is the PROSITE pattern, the
third column is the p-value of the overlap between the motif and the
PROSITE pattern, and the fourth column describes the overlap of the motif
instances and PROSITE pattern instances in more detail: the total number
of sliding windows whose length is equal to the motif / the total number of
sliding windows that overlap the PROSITE pattern / the total number of
sliding windows that match the motif sequence / the total number of sliding
windows that overlap the PROSITE pattern and also match the motif sequence.
We
have analyzed the whole GO database using HyperMotif
to identify function-specific short protein motifs, and have used NBPreCat to create naïve Bayesian classifiers that can
predict protein molecular functions using these motifs. The function-specific
motifs are stored in data/GO/motifs/GO.MolFunc.motifs.noa
and the classifiers are stored in data/GO/motifs/GO.MolFunc.classifiers.ppv80.tab.
An example job has been provided that demonstrates how these files can be
used to predict molecular functions of uncharacterized proteins. This
example job can be found in a Windows batch file in jobs/tbur.GOMolFunc. Here, we explain the steps required for
predicting molecular functions.
The
first step is to use MotifScan to identify the
instances of the function-specific motifs in the query proteins. The
proteins need to be provided in a FASTA file. For example, if the proteins
are stored in myproteins.fasta, this command
needs to be executed:
MotifScan -protein myproteins.fasta -noa data/GO/motifs/GO.MolFunc.motifs.noa -out myproteins.motifs.se
This
will create a file named myproteins.motifs.se, which contains the instances
of the function-specific short protein motifs in the query proteins. Note
that if the extension .se is not provided, MotifScan
will automatically add it to the end of the output file name.
The
next step is to use NBPreCat to predict the
protein molecular functions based on motif instances:
NBPreCat -se myproteins.motifs.se -class data/GO/motifs/GO.MolFunc.classifiers.ppv80.tab -out myproteins.predictions.tab
This
will invoke NBPreCat to use the naïve Bayesian
networks that we have provided in combination with the motif instances in
order to predict the functions of the query proteins. The predictions will
be stored in myproteins.predictions.tab. Note
that if the extension .predictions.tab is not
provided, NBPreCat will automatically add it to
the end of the output file. For example, the following command will result
in the same output file:
NBPreCat -se myproteins.motifs.se -class data/GO/motifs/GO.MolFunc.classifiers.ppv80.tab -out myproteins
If
some of the query proteins already have some annotations, these annotations
can be provided to NBPreCat in order to identify
which predictions are true positives (TP) or false positives (FP). For
example, if the TABx2 file myannotations.tab contains the molecular
function annotations based on GO terms, the following command can be used:
NBPreCat -se myproteins.motifs.se -cat myannotations.tab -sql data/GO/go-seqdb-tables -class data/GO/motifs/GO.MolFunc.classifiers.ppv80.tab -out myproteins.predictions.tab
The
parameter “-sql data/GO/go-seqdb-tables”
indicates that there is a folder, named go-seqdb-tables, containing MySQL
tables that describe the term ontology relationships (this folder is
included in the package). This helps NBPreCat to
correctly assign proteins with known annotations to all the appropriate GO terms
at different levels in order to correctly identify true positives and false
positives. Alternatively, you can use an OBO file.
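The two steps described above can also be chained from a script. The following Python sketch builds the two command lines with the parameters documented in this guide and runs them in order. The function names are hypothetical, and the sketch assumes that the MotifScan and NBPreCat binaries can be found on the system PATH (otherwise, replace the program names with full paths).

```python
import subprocess


def build_pipeline(fasta, noa, classifiers, prefix):
    """Return the MotifScan and NBPreCat argument lists for one prediction job."""
    se_file = f"{prefix}.motifs.se"
    scan = ["MotifScan", "-protein", fasta, "-noa", noa, "-out", se_file]
    predict = ["NBPreCat", "-se", se_file, "-class", classifiers, "-out", prefix]
    return scan, predict


def run_pipeline(fasta, noa, classifiers, prefix):
    """Run both steps in order, stopping if either program reports an error."""
    for command in build_pipeline(fasta, noa, classifiers, prefix):
        subprocess.run(command, check=True)


scan_cmd, predict_cmd = build_pipeline(
    "myproteins.fasta",
    "data/GO/motifs/GO.MolFunc.motifs.noa",
    "data/GO/motifs/GO.MolFunc.classifiers.ppv80.tab",
    "myproteins",
)
print(" ".join(scan_cmd))
print(" ".join(predict_cmd))
```

Calling `run_pipeline` with the same four arguments would execute both commands; it is not called here so that the example can be inspected without the binaries installed.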