MouseFunc I
A critical assessment of quantitative gene function assignment
from
genomic datasets in M. musculus
(Jul 17 - Oct 13, 2006)
Why predict function? Determination of gene function
is the
central goal in the field of functional genomics. Genomics experiments
have proven valuable in suggesting hypotheses that can be tested by
follow-up experimentation. Computational predictions of gene function
can serve as a statistically sound form of triage, focusing
experimental resources on the hypotheses (predictions) that are more
likely to be true. Among strong predictions, the most interesting can
be chosen by individual investigators with intuition and specialized
knowledge.
The role of prediction in gene function databases.
Model
organism databases in the Gene Ontology (GO) Consortium (e.g., SGD,
FlyBase, and MGI) track the types of evidence that support function
annotations. A substantial fraction of annotated genes are annotated
solely by virtue of predictions (ISS or IEA evidence codes). In 2004,
this fraction was 5% for S. cerevisiae, 50% for D. melanogaster, and
72% for M. musculus. Although predictions have a substantial role in gene
function databases, they are not typically assigned measures of
confidence.
The need for measures of confidence in prediction.
Biologists browsing a gene function annotation database react
negatively to annotations that are presented as fact but are not
confidently known to be true, since this is misleading. When a
prediction of gene function is placed alongside conclusions derived
from direct experimentation, it should either be high confidence or be
clearly labeled as a prediction (or both). To address this issue,
model organism databases have developed evidence codes to label
annotations based on prediction or on uncritical transfer from other
sources. Unfortunately, this provides no guidance as to which
predictions are confident and which are weak, and the user is prone to
“throw the baby out with the bathwater” by ignoring all predictions.
Furthermore, the tolerance of researchers for false positives depends
on a complex tradeoff between the importance of the biological question
and the cost of follow-up experiments. Thus, to achieve their full
potential value, predictions should be provided with interpretable
levels of confidence, e.g., an estimated probability that the
prediction is correct.
Why compare? Assessment of performance of
different methods on
a standardized data set according to standardized performance criteria
is the only way to draw meaningful conclusions about the strengths and
weaknesses of the algorithms employed. Just as the fields of protein
structure prediction, machine learning, and natural language processing
have benefited from competitions, we hope that an organized comparison
will motivate bioinformatics groups to think deeply about an important
problem, and that it will provide a focus around which ideas can be
exchanged between diverse groups. Furthermore, we expect that the
simple act of sharing prediction results in a common format will make
it possible for these results to be compiled and shared with
experimental biologists in a transparent and useful way, perhaps via
model organism databases. There will be a period of comment and
discussion on the dataset, on procedures used to share data, methods,
and results, and on measures used to evaluate gene function predictions.
Step 1: Organization.
A period of invitation, comment and discussion (via MouseFunc@googlegroups.com),
and commitment to participate (ending July 14).
Step 2: Release of the training data (July 17).
Briefly, training data consists of
- A specified collection of GO terms to be predicted, divided into
12 categories according to GO branch and specificity (number of genes
annotated).
- A specified set of genes*,
divided into a
training and a test set (10% of genes). All gene IDs will be anonymized**.
- A variety of gene*
properties.
- Biological relationships between genes*.
* For simplicity, properties of
proteins encoded by a
given gene will be mapped to that gene ID.
** Anonymization precludes
sequence-based
prediction methods beyond presence/absence of protein sequence
patterns. (Participants agree on their honor not to attempt to decode
the IDs.)
A more complete description of
the training data can
be found here.
Step 3. Submit methods and predictions (September 29) Extended to Friday Oct 13th, 2006!
3a: Code sharing. Each participant posts all code
used to
generate and apply predictive models, together with relevant parameters
and the resulting model. It should be possible to generate the final
score matrix from input data by running a single script/executable. If
the code uses random initializations then the random seeds should be
included.
3b: Predictions. Each participant submits a
matrix of scores
(ranging from 0 to 1) for each gene and GO term to be predicted.
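For concreteness, here is a minimal Python sketch (not official
tooling) of writing such a matrix in the tab-delimited layout described
under the Submission Instructions below; genes, go_terms, and scores
are hypothetical names for a participant's own data, and the exact
header layout should be checked against the provided sample file.

    import gzip

    # Hypothetical inputs: genes and go_terms are lists of IDs as given
    # in the released files; scores[gene][term] is a model output.
    def write_score_matrix(path, genes, go_terms, scores):
        with gzip.open(path, "wt") as out:
            out.write("\t".join(["GeneID"] + go_terms) + "\n")
            for g in genes:
                row = ["%.6g" % min(max(scores[g][t], 0.0), 1.0)
                       for t in go_terms]
                out.write("\t".join([g] + row) + "\n")

The clamping to [0, 1] simply enforces the required score range.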
Step 4. Performance assessment (Oct 16 - Oct 30)
All predictions will be deanonymized and performance will be assessed
on both a) the held-out collection of genes and b) novel predictions
for all genes.
4a. Predictions on held-out genes. A
variety of performance
measures will be applied: area under the ROC curve (AUC), precision at
1% recall (P01R), precision at 10% recall (P10R), precision at 50%
recall (P50R), and precision at 80% recall (P80R). These measures will
be applied to each GO term individually, and median performance values
will be calculated for 12 categories of GO terms (with the indicated #
of GO terms in each):
Specificity*** |  BP |  CC |  MF   (GO branch)
3-10           | 952 | 151 | 475
11-30          | 435 |  97 | 142
31-100         | 239 |  48 | 111
101-300        | 100 |  30 |  35
*** Here, specificity is defined as the number of genes in the
training set assigned to a particular GO term.
Median performance values will also be calculated for the GO
terms
in each row and column of the above table (i.e., for GO terms of a
given specificity or in a given branch).
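To make these measures concrete, here is an illustrative Python sketch
for a single GO term (the organizers' getAUC.pl, listed under Software
below, is the reference implementation; this version ignores tied
scores and reports precision at the first rank where each recall level
is reached).

    # scores and labels are parallel lists over genes for one GO term;
    # labels are 1 (annotated) or 0. The term must have at least one
    # positive and one negative gene.
    def term_performance(scores, labels,
                         recall_levels=(0.01, 0.10, 0.50, 0.80)):
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        n_pos = sum(labels)
        n_neg = len(labels) - n_pos
        tp = fp = above = 0
        prec_at = {}
        for i in order:
            if labels[i]:
                tp += 1
            else:
                fp += 1
                above += tp  # positives ranked above this negative
            for r in recall_levels:
                if r not in prec_at and tp / n_pos >= r:
                    prec_at[r] = tp / (tp + fp)
        auc = above / (n_pos * n_neg)  # fraction of correctly ordered pairs
        return auc, prec_at

Median values across the GO terms in a category can then be taken
with, e.g., statistics.median.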
4b. Novel predictions. Predictions for
gene/GO term
combinations not annotated in the training set will be judged on the
basis of new GO term annotations in the ~8 months since data set
assembly. The same measures described above will be applied.
Step 5. Publish the results (by Dec 1).
Each participant provides a description of their approach (ideally
brief enough to also permit publication elsewhere) together with
references. We will write the paper summarizing performance, and submit
to Nature Biotechnology (previous interest expressed by editor G.
Taroncher).
All files are tab-delimited. Gene IDs have been anonymized.
All matrices can be downloaded here
in a tar.gz file (33Mb)
- All Gene IDs. This file indicates what data is available per gene
and is fully populated (i.e., it includes all IDs present in the data
and in the test set). File: GenesIDs_and_Summary.txt.gz, 886Kb
- Test Set. List of gene IDs in the test set. File: TestSet.txt.gz, 11.7Kb
- GO categories to be predicted per hierarchy. The following
gene-count ranges will be considered when evaluating performance:
3-10, 11-30, 31-100, 101-300.
DATA:
- Functional Annotations.- The binary matrices contain all
non-IEA annotations from the Gene Ontology available as of February
2006, separated by hierarchy (i.e., Biological Process, Cellular
Component, and Molecular Function). In the matrices, a "1" indicates
that a gene is annotated as having that function. The annotations were
up-propagated, which means that genes are assigned all ancestor terms
of their annotated GO terms (see the sketch below).
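A minimal sketch of the up-propagation step, assuming a hypothetical
parents mapping from each GO term to its direct parents in the GO
graph:

    # gene_terms: gene -> set of directly annotated GO terms.
    # Returns gene -> set that also includes every ancestor term.
    def up_propagate(gene_terms, parents):
        def ancestors(term, seen):
            for p in parents.get(term, ()):
                if p not in seen:
                    seen.add(p)
                    ancestors(p, seen)
            return seen
        return {gene: set().union(*(ancestors(t, {t}) for t in terms))
                for gene, terms in gene_terms.items()}

For example, up_propagate({"g": {"GO:b"}}, {"GO:b": ["GO:a"]}) assigns
gene "g" both GO:b and its ancestor GO:a.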
- Gene Expression.- The matrices contain only the data for
probes/tags mapping to MGI genes tested in the corresponding study.
The relation of probe/tag to gene is either one-to-one or many probes
to one gene (i.e., in some cases there are multiple rows for the same
gene). In the case of the SAGE data, there are different
representations of the data: individual tag counts, the average tag
count over tags mapped to the same gene, and the sum of the tag counts
over tags mapped to the same gene (see the collapsing sketch below).
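As an illustration of the averaging and summing just described, a
small Python sketch collapsing many-probes-to-one-gene rows (the row
format here is a hypothetical simplification of the real files):

    from collections import defaultdict

    # rows: iterable of (gene_id, list_of_expression_values), possibly
    # with several rows per gene; returns one combined row per gene.
    def collapse_rows(rows, how="mean"):
        grouped = defaultdict(list)
        for gene, values in rows:
            grouped[gene].append(values)
        combined = {}
        for gene, vectors in grouped.items():
            totals = [sum(col) for col in zip(*vectors)]
            combined[gene] = (totals if how == "sum"
                              else [t / len(vectors) for t in totals])
        return combined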
- Protein Annotations.- Binary matrices indicating the protein
pattern annotations available from Pfam and InterPro (3133 from Pfam
and 5404 from InterPro).
- Protein-Protein Interactions.- Interactions were obtained by
orthology (provided by MGI) from known and predicted human
interactions available at OPHID. These data are represented as a
binary matrix and as a distance matrix. The distance matrix indicates
how far apart two genes are in the interaction network (“inf” if the
genes are unreachable, i.e., they belong to disconnected sub-networks;
see the sketch below). In the PPI matrices, known and predicted
interactions are combined.
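The relationship between the two PPI matrices can be sketched as
follows: each distance is a shortest-path length over the binary
interaction network (a minimal breadth-first-search sketch, with a
hypothetical adjacency mapping adj built from the binary matrix):

    from collections import deque

    # adj: gene -> set of interaction partners. Returns shortest path
    # lengths from source; float("inf") marks genes in disconnected
    # sub-networks, matching the "inf" entries described above.
    def network_distances(adj, source):
        dist = {g: float("inf") for g in adj}
        dist[source] = 0
        queue = deque([source])
        while queue:
            g = queue.popleft()
            for partner in adj[g]:
                if dist[partner] == float("inf"):
                    dist[partner] = dist[g] + 1
                    queue.append(partner)
        return dist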
- Phenotype Ontology (source: MGI)
- Phylogenetic Profile.- This data indicates whether a gene has
putative orthologues in different species. There is a binary matrix,
where "1" means that a gene has an orthologous gene in the
corresponding species, and there is a score matrix (see the sketch
below).
- Source: BioMart - 18 different species (from yeast to human).
Orthology in BioMart is based on reciprocal BLAST hits and synteny.
The score is given by coverage times identity.
- Score Matrix - file: phylogenetic_scores.txt.gz, 1.1Mb (score =
coverage * identity).
- Source: Inparanoid - Inparanoid data includes inparalogs and
excludes outparalogs. Data are provided for 21 different species,
including some prokaryotes. The score is the Inparanoid score, which
indicates how similar (1 = identical) an inparalog is to the inparalog
that is the main ortholog in the cluster.
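A minimal sketch of how one row of the two phylogenetic-profile
matrices could be derived from per-species hits; the field names for
the BLAST statistics are hypothetical:

    # hits: species -> {"coverage": c, "identity": i} for one gene's
    # best putative orthologue; a species absent from hits means no
    # orthologue was found there.
    def profile_row(hits, species_list):
        scores = [hits[s]["coverage"] * hits[s]["identity"]
                  if s in hits else 0.0
                  for s in species_list]
        binary = [1 if v > 0 else 0 for v in scores]
        return scores, binary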
- Diseases.- Matrix (source: OMIM via homology) - file:
omim_disease.txt.gz, 9.3Mb
Below are the participants who successfully submitted their predictions. Congratulations!
Submitted
- Yanjun Qi [1], Judith Klein-Seetharaman [1,2] & Ziv Bar-Joseph [1] ([1] Carnegie Mellon University, [2] University of Pittsburgh)
- Sara Mostafavi, David Warde-Farley, Chris Grouios & Quaid Morris (University of Toronto)
- Guillaume Obozinski, Charles Grant, Gert Lanckriet, Jian Qiu, Michael Jordan [1] & William Stafford Noble (University of Washington, [1] University of California - Berkeley)
- Murat Tasan, Weidong Tian, Frank Gibbons & Fritz Roth (Harvard Medical School)
- Hyunju Lee, Minghua Deng, Ting Chen & Fengzhu Sun (University of Southern California)
- Yuanfang Guan, Chad L. Myers & Olga G. Troyanskaya (Princeton University)
- Michele Leone & Andrea Pagnani (Institute for Scientific Interchange, Turin, Italy)
- Trupti Joshi, Chao Zhang, Guan Ning Lin & Dong Xu (University of Missouri-Columbia)
- Wan Kyu Kim, Chase Krumpelman, & Edward Marcotte (University of Texas, Austin)
Issues arising before and during the competition will be discussed on
a Google Discussion Group.
NOTE: The discussion group is no longer active.
Software
- getAUC.tar.- Perl program to obtain the ROC curve, AUC_ROC, and
precision at several recalls for a list of GO categories, and to
compute the median values of those measures across the given GO
categories.
- getAUC.Readme
- Command to run getAUC with the test files supplied:
- getAUC.pl path/goBP_2distribute.txt.gz goCategories.txt.gz genesFile.txt.gz
- checkFormat_scoreMatrix.pl.- Perl program to verify the format of a score matrix. Usage: checkFormat_scoreMatrix.pl scoreMatrix_file.txt.gz
Submission Instructions
Please follow the submission instructions below exactly.
The submission files should follow the
filename scheme below:
For score matrix:
FirstAuthorInitials-SecondAuthorInitials-ithAuthorInitials-"result".txt.gz.zip.
example: JP-FD-result.txt.gz.zip
For code:
FirstAuthorInitials-SecondAuthorInitials-ithAuthorInitials-"code".tar.zip.
example: JP-FD-code.tar.zip
The score matrix file should contain the tab-delimited score values
for each gene (one per line) and each GO term to be predicted. The IDs
of the GO terms and genes should be exactly the same as in the files
provided. The score matrix should contain a line for each gene ID in
the file “GenesIDs_and_Summary.txt.gz”. The score values are the
output of the model and should be in the 0 to 1 range; the higher the
score, the higher the probability of the gene having the corresponding
function.
The score values should contain only digits and at most one decimal
point ("."). The score values can also be in scientific notation,
e.g., 1.23456e-04. The file “sample_scoreMatrix.txt.gz” shows what the
result file could look like. In addition, the Perl script
“checkFormat_scoreMatrix.pl” available here can be used to verify the
format of the score matrix before submission.
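For reference, the same kinds of checks can be sketched in Python
(this is not a substitute for checkFormat_scoreMatrix.pl; it assumes
the matrix has no header row, so adapt it to the sample file's actual
layout):

    import gzip

    # required_genes: all gene IDs from GenesIDs_and_Summary.txt.gz.
    def check_score_matrix(path, required_genes):
        seen = set()
        with gzip.open(path, "rt") as fh:
            for line in fh:
                fields = line.rstrip("\n").split("\t")
                gene, values = fields[0], fields[1:]
                seen.add(gene)
                for v in values:
                    x = float(v)  # accepts plain and scientific notation
                    assert 0.0 <= x <= 1.0, "score out of range: %s %s" % (gene, v)
        missing = set(required_genes) - seen
        assert not missing, "%d required gene IDs missing" % len(missing)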
Please include a README file (see
sample here) in the code submission file with indications on how to
compile (if necessary) and run the code, and on which systems the
code has been tested. In particular, if you use some standard
programs or require some libraries, please indicate where they can be
obtained and the versions that you have used. Please ensure that all
parameters used in running the code have been provided, as well as
random seeds if any randomization has been used. If you have
makefiles, please include them as well.
The submission deadline is 29th
September 2006 (Any time zone).
Extended to Friday Oct 13th, 2006!
Submissions should be made by uploading
the files here.
Only the last submission before the
deadline will be evaluated and all other submissions will be
discarded.
Methods section
A more complete description of the methods, suitable for inclusion as
Supplementary Information in the resulting manuscript, should be
submitted by Friday October 20, 2006 (extended!). The description
should include brief comparisons and references to prior work. You can
also refer to “(unpublished results)” if you think you may publish
this work separately outside of the competition summary paper. Note
that there will be an opportunity to revise this section later, but
you are encouraged to submit a draft while your memory of the methods
is still fresh.
The submission file should follow the
filename scheme below:
FirstAuthorInitials-SecondAuthorInitials-ithAuthorInitials-"methods".*
example: JP-FD-methods.*
“*”: the file format can be either .pdf, .txt.gz, or .doc
Please upload your methods file here.
To simplify subsequent analyses for ourselves and other investigators,
we derived a single set of prediction scores from the set of submitted
scores: for each evaluation category, we adopted the scores from the
submission with the best Precision at 20% Recall (P20R) value in that
category, evaluated using held-out genes (see the sketch below). The
combined predictions averaged 41% precision at 20% recall, with 26% of
GO terms having a P20R value greater than 90%.
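A compact sketch of this combination rule, with hypothetical lookups
(p20r[submission][category] holding the held-out P20R values and
scores[submission][category] the corresponding score columns):

    # For each evaluation category, keep the score columns from the
    # submission that achieved the best held-out P20R in that category.
    def combine(p20r, scores, categories, submissions):
        return {cat: scores[max(submissions, key=lambda s: p20r[s][cat])][cat]
                for cat in categories}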
Set of predictions from individual groups
Last updated: Wed 3 Sep, 2008