Salavati Lab

 

Institute of Parasitology

 

McGill University

HyperMotif: hypergeometric-based analysis of sequence motifs

USER GUIDE

 

 

INTRODUCTION

 

Short linear motifs constitute an important group of functional elements at the level of DNA, RNA, and protein. At the level of DNA and RNA, the most important class of linear motifs encompasses cis-regulatory elements that are recognized and bound by trans-acting elements. At the protein level, short linear motifs are involved in a variety of processes, such as mediating protein-protein interactions, interacting with ligands at protein active sites, mediating enzymatic reactions, and modulating protein activity via post-translational modifications. Functional elements often show a non-random pattern of distribution among defined groups of genes and proteins. For example, cis-regulatory elements are over-represented in certain groups of genes when genes are clustered based on co-expression. As another example, signature peptides of protein active sites are over-represented among proteins that have similar molecular functions. Therefore, by finding motifs that are over-represented in a particular group of genes or proteins, we can identify functional sequence elements.

HyperMotif is a package that facilitates identification of group-specific short linear motifs. The programs within this package implement the concept of finding over-represented motifs based on tests of hypergeometric distribution. HyperMotif provides the flexibility to search both nucleic acid and protein sequences for identification of functional sequence motifs. Furthermore, this package allows prediction of protein/gene categories based on the discovered short linear motifs.

 

COMPONENTS

 

This is a list of the components of the HyperMotif package, along with a brief description of what they do:

HyperMotif: This program is able to discover, de novo, the motifs that are over-represented in predefined sets of DNA/RNA/protein sequences.

NBPreCat: This program is able to integrate the motifs that are discovered by HyperMotif into near-optimal naïve Bayesian networks that can be used for function prediction. It can also use predefined classifiers (such as classifiers previously found by NBPreCat) to identify the likely functions of novel proteins.

MotifScan: This program can be used to search for instances of the motifs that are discovered by HyperMotif in a set of DNA/RNA/protein sequences.

ProMatch: This program is able to identify PROSITE patterns that significantly overlap the motifs discovered by HyperMotif. Any other set of patterns can also be examined by this program as long as they are presented in the same format as PROSITE patterns. ProMatch is also able to identify the overlap among HyperMotif patterns.

 

INSTALLATION

 

We have included pre-compiled files for Windows, Linux, and Mac OS. All that needs to be done is to download the compressed file and extract the package on a hard drive, and then copy the appropriate binary files to the root folder of HyperMotif. Binary files are located in a folder named “bin”. For example, in Windows, if you have extracted the package on drive C, you need to copy all the binary files that are located in C:\HyperMotif\bin\win into C:\HyperMotif. In the end, you will have four binary files in the root folder: C:\HyperMotif\HyperMotif.exe, C:\HyperMotif\NBPreCat.exe, C:\HyperMotif\MotifScan.exe, and C:\HyperMotif\ProMatch.exe.

If needed, you can use the provided source code to compile your own version of HyperMotif that is compatible with the machine you are using. In most cases, all that needs to be done is to use the makefile in the source_codes folder to create the binary files using a GNU compiler. GNU compilers are standard components of Linux and Mac OS. For Windows, you need to download and install a GNU compiler, such as MinGW. The makefile in the source_codes folder calls “mingw32-make” by default to create the different components of the HyperMotif package:

MAKE = mingw32-make

If a different make program is used, the first line of source_codes/Makefile needs to be modified accordingly. For example, in Linux, the first line should be this:

MAKE = make

The binary files will be created in the root folder of HyperMotif, and will be ready to use.

 

FILE FORMATS

 

The following are the file formats that the HyperMotif package uses as the input of its programs. The output formats are described separately for each program.

FASTA: FASTA is a standard file format for storage of DNA/RNA/protein sequences. For a description of the FASTA file format, see this article on Wikipedia. FASTA files cannot have lines longer than 999 characters when read by HyperMotif. Longer sequences should be broken into more than one line.
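Sequences stored on a single long line can be re-wrapped before they are given to HyperMotif. The following Python sketch shows one way to do this. The function name wrap_fasta and the default width of 60 are illustrative choices and are not part of the package; header lines are passed through unchanged, so an overly long header would still need to be shortened by hand.

```python
def wrap_fasta(lines, width=60):
    """Yield FASTA lines, splitting sequence lines longer than `width`.

    `width` is an illustrative default; any value up to 999 respects
    the line-length limit stated in this guide.
    """
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">") or len(line) <= width:
            # Header lines and short sequence lines are left as they are.
            yield line
        else:
            # Break a long sequence line into chunks of `width` characters.
            for start in range(0, len(line), width):
                yield line[start:start + width]


if __name__ == "__main__":
    example = [">protein1", "MKT" * 50]
    for out_line in wrap_fasta(example):
        print(out_line)
```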

BLAST9: BLAST9 is the output format of the NCBI BLAST program when the “-m” parameter is set to 9. In this tab-delimited format, each line indicates a pair of homologous genes, followed by 10 numbers that are ignored by HyperMotif, and thus can be anything. All lines that start with the number sign (#) are treated as comment lines and are ignored. This is an example of a BLAST9 file:

# BLASTP 2.2.17
# Query: Q5ADX5
# Database: GO.20101023.inMolFunc.fasta
Q5ADX5   P32783   0   0   0   0   0   0   0   0   0    0

This example text simply indicates that P32783 is a homolog of Q5ADX5. The rest of the lines are ignored.
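As an illustration of how little of a BLAST9 file is actually used, the following Python sketch extracts only the homologous pairs. The function name read_blast9_pairs is hypothetical, and splitting on any whitespace is an assumption that holds as long as the identifiers contain no spaces.

```python
def read_blast9_pairs(lines):
    """Return (query, homolog) pairs from BLAST9-formatted lines.

    Comment lines starting with '#' are skipped, and only the first
    two columns are kept, mirroring the description in this guide.
    """
    pairs = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


if __name__ == "__main__":
    example = [
        "# BLASTP 2.2.17",
        "# Query: Q5ADX5",
        "Q5ADX5\tP32783\t0\t0\t0\t0\t0\t0\t0\t0\t0\t0",
    ]
    print(read_blast9_pairs(example))
```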

TABx2: TABx2 is a tab-delimited file with two columns. This file format is used to indicate the gene/protein categories. In each line, the first entry is the gene or protein symbol/ID, and the second entry is the category symbol/ID. The symbols and IDs cannot contain spaces. This is an example TABx2 file:

Tb09.160.0390    GO:0004497
Tb09.160.0450    GO:0004672
Tb09.160.0450    GO:0005524

In this example text, it is indicated that Tb09.160.0390 belongs to the category GO:0004497, and Tb09.160.0450 belongs to the two categories GO:0004672 and GO:0005524. TABx2 files are used to define the gold standard sets which will be used by HyperMotif for motif discovery and by NBPreCat for function prediction.
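The one-to-many relationship between proteins and categories can be made explicit with a small reader. This Python sketch collects the categories of each gene/protein into a set; the function name read_tabx2 is hypothetical and the code is not taken from the package.

```python
from collections import defaultdict


def read_tabx2(lines):
    """Map each gene/protein ID to the set of categories it belongs to."""
    categories = defaultdict(set)
    for line in lines:
        # IDs cannot contain spaces, so splitting on whitespace is safe.
        fields = line.split()
        if len(fields) < 2:
            continue
        categories[fields[0]].add(fields[1])
    return dict(categories)


if __name__ == "__main__":
    example = [
        "Tb09.160.0390\tGO:0004497",
        "Tb09.160.0450\tGO:0004672",
        "Tb09.160.0450\tGO:0005524",
    ]
    print(read_tabx2(example))
```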

OBO: OBO is a standard format for defining ontologies. OBO files are used by HyperMotif and NBPreCat to properly construct the positive and negative gold standard sets based on the protein-category associations of the TABx2 file. For a description of the OBO file format, see this article.

SE: SE is a tab-delimited file format used by NBPreCat for defining the features of genes/proteins, based on which functions are predicted. SE files can be automatically generated using the MotifScan program, as described later. SE files have four columns. The first column is the gene or protein symbol/ID. The second column is a description, which cannot have spaces, and is ignored by the program. The third column is the feature associated with that gene/protein. The fourth column is the status of the feature; if the status is “T”, that line is considered; if the status is “F”, that line is ignored. Also, lines that start with “$” are considered as comment lines and are ignored. This is an example SE file:

$Protein    Method       Motif    Status
Q59J86      HyperMotif   E.KKK    T
P26725      HyperMotif   GV..LL   F
P26725      HyperMotif   SGKT     T

This example text indicates that Q59J86 has an instance of the EXKKK motif, and P26725 has an instance of each of the GVXXLL and SGKT motifs, but the GVXXLL motif should be ignored here.
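The rules for comment lines and the status column can be summarized in a few lines of Python. This is a sketch of a reader that follows the description above, not the parser used by NBPreCat; the function name read_se is hypothetical.

```python
from collections import defaultdict


def read_se(lines):
    """Map each gene/protein ID to the set of its active features.

    Lines starting with '$' are comments; lines whose fourth column
    is not 'T' are ignored, as described for the SE format.
    """
    features = defaultdict(set)
    for line in lines:
        if line.startswith("$"):
            continue
        fields = line.split()
        if len(fields) < 4 or fields[3] != "T":
            continue
        # Column 2 (the description) is deliberately not used.
        features[fields[0]].add(fields[2])
    return dict(features)


if __name__ == "__main__":
    example = [
        "$Protein\tMethod\tMotif\tStatus",
        "Q59J86\tHyperMotif\tE.KKK\tT",
        "P26725\tHyperMotif\tGV..LL\tF",
        "P26725\tHyperMotif\tSGKT\tT",
    ]
    print(read_se(example))
```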

NOA: NOA is the file format that is used by MotifScan as the source of motif definitions in order to find the motif instances. NOA files were originally used by Cytoscape to describe node attributes. We have adapted the same format to describe motifs and their nature (linear RNA, DNA, or protein motifs). The first line of the NOA files that are used by MotifScan must be this:

$SequenceElementType (class=java.lang.String)

The rest of the lines define the motifs:

[Motif Sequence] = [Motif Type]

where instead of [Motif Sequence] and [Motif Type] appropriate strings must be written. Here is an example:

$SequenceElementType (class=java.lang.String)
G.G[LI]KT = PMOTIF
ATCG[CG]AG = 5'MOTIF
AUAUU.GAU = 3'MOTIF

This example text defines three motifs: GXG[LI]KT is a protein motif, ATCG[CG]AG is a DNA motif, and AUAUUNGAU is an RNA motif. Protein motifs may contain the letter ‘X’ and nucleotide motifs may contain the letter ‘N’, each indicating a fully degenerate position. The character ‘.’ can be used for both nucleotide and protein motifs, again indicating a fully degenerate position. Degenerate positions can also be specified using square brackets. For example, [LI] means a position where either ‘L’ or ‘I’ can occur. For protein motifs, B is equal to [DN], and Z is equal to [EQ]. ‘U’ and ‘T’ can be used interchangeably in both DNA and RNA motifs.
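The motif syntax maps naturally onto regular expressions. The Python sketch below reads the motif lines of a NOA file and translates each motif into a regular expression according to the rules above. The function names read_noa and motif_to_regex are hypothetical, and this is not how MotifScan is implemented; in particular, treating ‘T’ and ‘U’ as matching either letter in the sequence is an interpretation of the last rule.

```python
import re

PROTEIN_CODES = {"B": "DN", "Z": "EQ"}     # ambiguity codes for protein motifs
NUCLEOTIDE_CODES = {"T": "TU", "U": "TU"}  # T and U treated as equivalent


def read_noa(lines):
    """Return (motif, type) pairs; the first line is the NOA header."""
    motifs = []
    for line in list(lines)[1:]:
        if "=" not in line:
            continue
        motif, motif_type = line.rsplit("=", 1)
        motifs.append((motif.strip(), motif_type.strip()))
    return motifs


def motif_to_regex(motif, motif_type):
    """Translate a motif into a compiled regular expression."""
    protein = motif_type == "PMOTIF"
    codes = PROTEIN_CODES if protein else NUCLEOTIDE_CODES
    wildcard = "X" if protein else "N"
    parts = []
    in_class = False  # True while inside square brackets
    for ch in motif:
        if ch == "[":
            in_class = True
            parts.append(ch)
        elif ch == "]":
            in_class = False
            parts.append(ch)
        elif ch == "." or (ch == wildcard and not in_class):
            parts.append(".")
        elif ch in codes:
            # Inside brackets, add the letters; outside, add a new class.
            parts.append(codes[ch] if in_class else "[" + codes[ch] + "]")
        else:
            parts.append(ch)
    return re.compile("".join(parts))


if __name__ == "__main__":
    noa = [
        "$SequenceElementType (class=java.lang.String)",
        "G.G[LI]KT = PMOTIF",
        "AUAUU.GAU = 3'MOTIF",
    ]
    for motif, motif_type in read_noa(noa):
        print(motif, motif_type, motif_to_regex(motif, motif_type).pattern)
```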

PFF: PFF is one of the output formats of ps_scan.pl from PROSITE. PFF files describe the instances of PROSITE patterns in query protein sequences. ProMatch uses PFF files as its input to identify significant overlaps between PROSITE patterns and other short protein motifs, such as those discovered by HyperMotif. PFF files are tab-delimited files with four columns. The first column is the protein name; the second column is the pattern start position; the third column is the pattern end position; and the fourth column is the pattern PROSITE accession number.

NB: NB files describe the naïve Bayesian networks for prediction of gene/protein categories. These files are both generated and read by NBPreCat. Each classifier starts with the letter ‘>’, followed by the name of the corresponding category. The next line, which must start with the string “#Cutoff:”, determines the likelihood cutoff above which an object (gene/protein) is deemed to belong to this category. The next lines describe the features that are used by this naïve Bayesian classifier. These lines are tab-delimited, with six columns: (1) the name of the feature, (2) the p-value of association of this feature with this category, (3) the q-value (FDR) of association, (4) the likelihood of a protein being in this category given that the protein has this feature, (5) the likelihood of a protein being in this category given that the protein does NOT have this feature, (6) the table that describes the number of gold standard positive/negative proteins which had/did not have this feature. When reading an NB file, NBPreCat only considers the first, fourth and fifth columns, and ignores the rest. The following is an example NB file:

$ELEMENT P-VALUE        Q-VALUE     L(C|E)    L(C|E')    (C'&E')/(C'&E)/(C&E')/(C&E)

>GO:0004672
#Cutoff: 48.1217
EV.II    1.72171e-006   0.0168177   10.2413   0.917813   3157/6/65/6
GKT..L   3.26958e-005   0.159686    7.22036   0.908947   3139/24/64/7
LKK.LS   3.83083e-005   0.124732    8.61331   0.944352   3162/1/67/4
R.S.RY   3.83902e-005   0.0937488   8.66575   0.931825   3157/6/66/5

>GO:0005089
#Cutoff: 32.8219
F.SK.S   2.42838e-006   0.0237204   8.42924   0.386629   3172/18/2/4
TEIS.A   1.95618e-005   0.0955396   6.92188   0.538929   3187/3/3/3

In this example text, two naïve Bayesian classifiers are defined, one for GO:0004672 and one for GO:0005089. Lines that start with ‘$’ are considered as comment lines and are ignored.
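To make the structure of an NB file concrete, the following Python sketch reads the classifiers into a dictionary, keeping only the cutoff and the first, fourth and fifth columns of each feature line, as described above. The function name read_nb and the layout of the returned dictionary are illustrative assumptions; how NBPreCat combines the likelihoods into a score is not covered here.

```python
def read_nb(lines):
    """Parse NB-formatted lines into {category: {"cutoff", "features"}}.

    Each feature maps to a pair: (likelihood with the feature,
    likelihood without the feature), i.e. columns 4 and 5.
    """
    classifiers = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("$"):
            continue  # blank lines and comment lines
        if line.startswith(">"):
            current = {"cutoff": None, "features": {}}
            classifiers[line[1:].strip()] = current
        elif current is None:
            continue  # data before the first classifier header
        elif line.startswith("#Cutoff:"):
            current["cutoff"] = float(line.split(":", 1)[1])
        else:
            fields = line.split()
            current["features"][fields[0]] = (float(fields[3]), float(fields[4]))
    return classifiers


if __name__ == "__main__":
    example = [
        ">GO:0005089",
        "#Cutoff: 32.8219",
        "F.SK.S\t2.42838e-006\t0.0237204\t8.42924\t0.386629\t3172/18/2/4",
        "TEIS.A\t1.95618e-005\t0.0955396\t6.92188\t0.538929\t3187/3/3/3",
    ]
    print(read_nb(example))
```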

 

HyperMotif

 

HyperMotif is a program for de novo discovery of short linear protein and nucleotide motifs that are specific to pre-determined sets of gene/protein categories. The input of HyperMotif consists of a list of genes/proteins with their corresponding categories, their associated sequences, a file that describes the homologous gene/protein pairs, and either a file or a set of MySQL tables that describe the category ontology relationships. HyperMotif has the following parameters:

-fasta [string]: The FASTA input file containing the sequences. The maximum number of sequences is 100000, and the maximum sequence length is 30000.

-cat [string]: A TABx2 file in which each line contains the name of a protein followed by the name of the category to which it belongs. The maximum number of categories is 100000.

-sql [string]: The folder that contains the MySQL tables for category ontology relationships. This folder needs to have two files, one named “term.txt”, and one named “graph_path.txt”. “term.txt” contains the following entries per line: [id], [name], [term_type], [acc], [is_obsolete], [is_root], [is_relation]. “graph_path.txt” contains the following entries per line: [id], [term1_id], [term2_id], [relationship_type_id], [distance], [relation_distance]. This parameter is optional, and is alternative to the parameter -obo.

-obo [string]: The OBO file that describes the category ontology relationships. This parameter is optional, and is alternative to the parameter -sql.

-only [string]: If this parameter is used, the motifs will be found only for the specified category.

-blast [string]: The BLAST9 file describing homologous pairs of genes/proteins. This parameter is optional.

-minsize [integer]: The minimum size of a category to be considered. Categories whose number of members is less than this number will be ignored. The default value is 10.

-window [integer]: The window size for scanning the query sequences in order to find linear motifs. The default value is 6, the maximum allowed value is 8 and the minimum is 2.

-mininf [integer]: The minimum number of informative residues per motif. This parameter cannot be larger than -window. The minimum value is 2, and the default is 4 unless -window is smaller than 4.

-q [double]: The q-value (FDR) cutoff for motifs. The default value is 0.01, the minimum is 0 and the maximum is 1.

-out [string]: The output generic name. HyperMotif creates several output files, the name of which starts with this string.

 

This is an example usage of HyperMotif:

HyperMotif -fasta proteins.fasta -cat categories.tab -only GO:0005089 -sql SQLFolder -blast homologs.blast -minsize 15 -window 7 -mininf 4 -q 0.1 -out test.output

 

The output of HyperMotif consists of three files:

[output].motifs.tab: This file contains the motif-category associations found by HyperMotif. The file has five columns: (1) the motif, (2) the associated category, (3) the p-value of association, (4) the q-value (FDR) of association, (5) the table that describes the number of proteins that are/are not in that category and contain/do not contain the motif. The elements of this table are as follows: the number of proteins that are not in the category and do not have the motif / the number of proteins that are in the category but do not have the motif / the number of proteins that are not in the category but have the motif / the number of proteins that are in the category and have the motif.

[output].motifs.noa: This NOA file can later be used by MotifScan in order to find motif instances in other query protein or nucleotide sequences. By default, all motifs are annotated as PMOTIF. The user must change this manually if it is not correct.

[output].se: This SE file can later be used by NBPreCat in order to find classifiers that are able to predict categories.
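The four counts in the fifth column of [output].motifs.tab are enough to recompute a hypergeometric over-representation p-value. The Python sketch below computes the standard upper-tail probability from a table string written in the order described for that file. The function name hypergeom_tail is hypothetical, and the result is not guaranteed to match the p-value reported by HyperMotif, which may treat homologous proteins or other details differently.

```python
from math import comb


def hypergeom_tail(table):
    """Upper-tail hypergeometric p-value from an 'a/b/c/d' table string.

    Order of the counts, as described for [output].motifs.tab:
      a: not in category, no motif    b: in category, no motif
      c: not in category, has motif   d: in category, has motif
    Returns P(X >= d), the chance of seeing at least d motif-containing
    proteins in the category if motifs were distributed at random.
    """
    a, b, c, d = (int(x) for x in table.split("/"))
    total = a + b + c + d   # all proteins
    in_cat = b + d          # proteins in the category
    with_motif = c + d      # proteins that contain the motif
    upper = min(in_cat, with_motif)
    favourable = sum(
        comb(in_cat, k) * comb(total - in_cat, with_motif - k)
        for k in range(d, upper + 1)
    )
    return favourable / comb(total, with_motif)


if __name__ == "__main__":
    # Toy table: 5 of 10 proteins are in the category and all 5 carry the motif.
    print(hypergeom_tail("5/0/0/5"))
```

Exact integer arithmetic with math.comb avoids the rounding problems that factorial-based formulas can have for large protein sets.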

 

NBPreCat

 

NBPreCat finds near-optimal sets of features that can be used in the context of naïve Bayesian networks to predict the categories to which genes/proteins belong. It then creates and validates these naïve Bayesian networks, and uses them to make new predictions for previously uncharacterized genes/proteins. NBPreCat takes the following parameters:

-se [string]: The feature input file in the SE format. The maximum number of genes/proteins is 100000, and the maximum number of features is 10000.

-cat [string]: A TABx2 file in which each line contains the name of a protein followed by the name of the category to which it belongs. The maximum number of categories is 100000.

-sql [string]: The folder that contains the MySQL tables for category ontology relationships. This folder needs to have two files, one named “term.txt”, and one named “graph_path.txt”. “term.txt” contains the following entries per line: [id], [name], [term_type], [acc], [is_obsolete], [is_root], [is_relation]. “graph_path.txt” contains the following entries per line: [id], [term1_id], [term2_id], [relationship_type_id], [distance], [relation_distance]. This parameter is optional, and is alternative to the parameter -obo.

-obo [string]: The OBO file that describes the category ontology relationships. This parameter is optional, and is alternative to the parameter -sql.

-only [string]: If this parameter is used, one classifier will be trained only for the specified category.

-blast [string]: The BLAST9 file describing homologous pairs of genes/proteins. This parameter is optional.

-q [double]: The q-value (FDR) cutoff for finding features that are significantly associated with the categories. The default value is 0.5, the minimum is 0 and the maximum is 1. For each category, only features whose q-values are below this cutoff will be considered.

-ppv [double]: The precision cutoff for prediction of gene/protein-category assignments. The minimum value is 0, the maximum is 1, and the default value is 0.5. If a classifier does not meet this precision at any sensitivity, the classifier will not be written in the output.

-class [string]: The classifier file in the NB format, created previously by NBPreCat. If this parameter is provided, instead of training a classifier, NBPreCat will simply apply the provided classifier for making new predictions. If this parameter is provided, the following parameters will be ignored: -blast, -q, -ppv. Also, if this parameter is used, the parameter -cat becomes optional.

-out [string]: The output generic name. NBPreCat creates several output files, the name of which starts with this string.

 

This is an example usage of NBPreCat:

NBPreCat -se features.se -cat categories.tab -only GO:0005089 -sql SQLFolder -blast homologs.blast -q 0.1 -ppv 0.8 -out test.output

This is another example:

NBPreCat -se features.se -class classifiers.tab -out test.output

 

The output of NBPreCat consists of four files:

[output].all.tab: This file contains the results of cross-validation of all the trained classifiers, regardless of whether the classifiers meet the precision criterion or not. Each classifier starts with ‘>’ followed by the name of the corresponding category. The following lines give the likelihoods that this classifier calculates for each gene/protein. Each line has five columns: (1) the name of the gene/protein, (2) whether the protein belongs to this category (TP), does not belong to this category (FP), or is uncharacterized for this category (UNKNOWN), (3) whether this protein is included in the training/validation set (+) or is not included (-), (4) the score of association with this category, and (5) the likelihood, which is the same as the score in column (4). Lines that start with ‘$’ are comments.

[output].classifiers.tab: The NB file containing the trained classifiers that meet the precision cutoff.

[output].roc: The ROC curves for classifiers that meet the precision cutoff. Each classifier begins with ‘#’ followed by the name of the associated category. The next line indicates the different likelihood values, with the likelihood value that results in the desired precision marked by an asterisk ‘*’ on top of it. The next three lines include the number of TP and FP and the precision value (PPV) associated with each likelihood value.

[output].predictions.tab: This file describes the predictions made by the classifiers that meet the precision cutoff. Each line has six entries: (1) the gene/protein name, (2) the predicted category, (3) the status of the prediction (TP: true positive, FP: false positive, and UNKNOWN), (4) the likelihood of this gene/protein-category association, (5) the estimated precision at this likelihood, (6) the probability that a random classifier would achieve the same sensitivity as this classifier at the specified precision.

It should be noted that if the parameter -class is used, of the above four outputs only the last one will be created.

 

MotifScan

 

MotifScan is a simple tool to search DNA, RNA and protein sequences for instances of short linear motifs. MotifScan takes the following parameters:

-protein [string]: A FASTA file that contains a set of protein sequences based on which the profiles of protein motifs will be determined.

-5utr [string]: A FASTA file that contains a set of nucleotide sequences based on which the profiles of 5' UTR motifs will be determined. Note that these sequences will be treated as DNA sequences. Therefore, both the forward and reverse strands will be searched for motif instances.

-3utr [string]: A FASTA file that contains a set of nucleotide sequences based on which the profiles of 3' UTR motifs will be determined. These sequences will be treated as RNA sequences, and therefore only the forward strand will be searched for motif instances.

-noa [string]: The NOA file that contains short linear motifs and their types (PMOTIF, 5'MOTIF, 3'MOTIF).

-out [string]: The output file in the SE format.

NOTE: The maximum number of sequences, including DNA sequences, reverse complements of DNA sequences, RNA sequences, and protein sequences is 100000. The maximum sequence length is 10000.

 

This is an example usage of MotifScan:

MotifScan -protein proteins.fasta -noa motifs.noa -out test.output

 

MotifScan automatically adds the extension .se to the output file name.

 

ProMatch

ProMatch is a tool to compare a set of motifs with PROSITE patterns. This program identifies motifs whose instances significantly overlap those of PROSITE patterns. The input parameters of ProMatch are as follows:

-fasta [string]: A FASTA file that contains a set of protein sequences based on which the profiles of protein motifs will be determined. The maximum number of protein sequences is 100000, and the maximum sequence length is 100000.

-noa [string]: The NOA file that contains short linear protein motifs. Only motifs of the type PMOTIF will be considered. The maximum number of motifs is 100000.

-prosite [string]: A PFF file describing the occupancy profile of PROSITE patterns in the proteins. The maximum number of pattern instances per protein is 1000. This parameter is optional. If this parameter is not used, ProMatch will identify the significant overlaps among the motifs that are read from the NOA file (redundancy-check).

-out [string]: The output tab-delimited file.

 

This is an example usage of ProMatch:

ProMatch -fasta proteins.fasta -noa motifs.noa -prosite prosite.pff -out test.output

This is another example:

ProMatch -fasta proteins.fasta -noa motifs.noa -out test.output

 

ProMatch output is a tab-delimited file with four columns. Each line represents a significant match between a motif and a pattern (or between a motif and another motif if the -prosite parameter is not used). The first column is the motif, the second column is the PROSITE pattern, the third column is the p-value of the overlap between the motif and the PROSITE pattern, and the fourth column describes the overlap of the motif instances and PROSITE pattern instances in more detail: the total number of sliding windows whose length is equal to the motif / the total number of sliding windows that overlap the PROSITE pattern / the total number of sliding windows that match the motif sequence / the total number of sliding windows that overlap the PROSITE pattern and also match the motif sequence.

 

Predicting protein molecular functions

 

We have analyzed the whole GO database using HyperMotif to identify function-specific short protein motifs, and have used NBPreCat to create naïve Bayesian classifiers that can predict protein molecular functions using these motifs. The function-specific motifs are stored in data/GO/motifs/GO.MolFunc.motifs.noa and the classifiers are stored in data/GO/motifs/GO.MolFunc.classifiers.ppv80.tab. An example job has been provided that demonstrates how these files can be used to predict molecular functions of uncharacterized proteins. This example job can be found in a Windows batch file in jobs/tbur.GOMolFunc. Here, we explain the steps required for predicting molecular functions.

The first step is to use MotifScan to identify the instances of the function-specific motifs in the query proteins. The proteins need to be provided in a FASTA file. For example, if the proteins are stored in myproteins.fasta, this command needs to be executed:

MotifScan -protein myproteins.fasta -noa data/GO/motifs/GO.MolFunc.motifs.noa -out myproteins.motifs.se

This will create a file named myproteins.motifs.se, which contains the instances of the function-specific short protein motifs in the query proteins. Note that if the extension .se is not provided, MotifScan will automatically add it to the end of the output file name.

The next step is to use NBPreCat to predict the protein molecular functions based on motif instances:

NBPreCat -se myproteins.motifs.se -class data/GO/motifs/GO.MolFunc.classifiers.ppv80.tab -out myproteins.predictions.tab

This will invoke NBPreCat to use the naïve Bayesian networks that we have provided in combination with the motif instances in order to predict the functions of the query proteins. The predictions will be stored in myproteins.predictions.tab. Note that if the extension .predictions.tab is not provided, NBPreCat will automatically add it to the end of the output file. For example, the following command will result in the same output file:

NBPreCat  -se myproteins.motifs.se -class data/GO/motifs/GO.MolFunc.classifiers.ppv80.tab -out myproteins

If some of the query proteins already have some annotations, these annotations can be provided to NBPreCat in order to identify which predictions are true positives (TP) or false positives (FP). For example, if the TABx2 file myannotations.tab contains the molecular function annotations based on GO terms, the following command can be used:

NBPreCat -se myproteins.motifs.se -cat myannotations.tab -sql data/GO/go-seqdb-tables -class data/GO/motifs/GO.MolFunc.classifiers.ppv80.tab -out myproteins.predictions.tab

The parameter “-sql data/GO/go-seqdb-tables” indicates that there is a folder, named go-seqdb-tables, containing MySQL tables that describe the term ontology relationships (this folder is included in the package). This helps NBPreCat to correctly assign proteins with known annotations to all the appropriate GO terms at different levels in order to correctly identify true positives and false positives. Alternatively, you can use an OBO file.
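The two-step procedure can also be scripted. The Python sketch below builds the MotifScan and NBPreCat command lines described in this section and runs them in order. The function names are hypothetical, the default paths are the ones quoted in this section, and the sketch assumes that the script is started from the root folder of HyperMotif and that the binaries can be found by the operating system (for example, because that folder is on the PATH).

```python
import subprocess

# Files shipped with the package, as named in this section.
DEFAULT_NOA = "data/GO/motifs/GO.MolFunc.motifs.noa"
DEFAULT_CLASSIFIERS = "data/GO/motifs/GO.MolFunc.classifiers.ppv80.tab"


def build_prediction_commands(fasta, prefix, noa=DEFAULT_NOA,
                              classifiers=DEFAULT_CLASSIFIERS):
    """Return the MotifScan and NBPreCat argument lists for one job."""
    se_file = prefix + ".motifs.se"
    scan = ["MotifScan", "-protein", fasta, "-noa", noa, "-out", se_file]
    # NBPreCat appends .predictions.tab to the output name by itself.
    predict = ["NBPreCat", "-se", se_file, "-class", classifiers,
               "-out", prefix]
    return scan, predict


def run_prediction(fasta, prefix):
    """Run both steps, stopping if either program reports an error."""
    for command in build_prediction_commands(fasta, prefix):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in build_prediction_commands("myproteins.fasta", "myproteins"):
        print(" ".join(command))
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces.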

 

 

 

 

Reza Salavati’s profile at McGill

Last updated on 6/5/2011 9:36:50 PM