MouseFunc I
A critical assessment of quantitative gene function assignment
from
genomic datasets in M. musculus
(Jul 17 - Oct 13, 2006)
Why predict function? Determination of gene function
is the
central goal in the field of functional genomics. Genomics experiments
have proven valuable in suggesting hypotheses that can be tested by
follow-up experimentation. Computational predictions of gene function
can serve as a statistically sound form of triage, focusing
experimental resources on the hypotheses (predictions) that are more
likely to be true. Among strong predictions, the most interesting can
be chosen by individual investigators with intuition and specialized
knowledge.
The role of prediction in gene function databases.
Model
organism databases in the Gene Ontology (GO) Consortium (e.g., SGD,
FlyBase, and MGI) track the types of evidence that support function
annotations. A substantial fraction of annotated genes are annotated
solely by virtue of predictions (ISS or IEA evidence codes). In 2004,
this fraction was 5% for S. cerevisiae, 50% for D. melanogaster, and
72% for M. musculus. Although predictions have a substantial role in gene
function databases, they are not typically assigned measures of
confidence.
The need for measures of confidence in prediction.
Biologists browsing a gene function annotation database react
negatively to annotations that are presented as fact but are not
confidently known to be true, since this is misleading. When a
prediction of gene function is placed alongside conclusions derived
from direct experimentation, it should either be high confidence or be
clearly labeled as a prediction (or both). To address this issue,
model organism databases have developed evidence codes to label
annotations based on prediction or on uncritical transfer from other
sources. Unfortunately, this provides no guidance as to which
predictions are confident and which are weak, and the user is prone to
“throw the baby out with the bathwater” by ignoring all predictions.
Furthermore, the tolerance of researchers for false positives depends
on a complex tradeoff between the importance of the biological question
and the cost of follow-up experiments. Thus, to achieve their full
potential value, predictions should be provided with interpretable
levels of confidence, e.g., an estimated probability that the
prediction is correct.
Why compare? Assessment of performance of
different methods on
a standardized data set according to standardized performance criteria
is the only way to draw meaningful conclusions about the strengths and
weaknesses of the algorithms employed. Just as the fields of protein
structure prediction, machine learning, and natural language processing
have benefited from competitions, we hope that an organized comparison
will motivate bioinformatics groups to think deeply about an important
problem, and that it will provide a focus around which ideas can be
exchanged between diverse groups. Furthermore, we expect that the
simple act of sharing prediction results in a common format will make
it possible for these results to be compiled and shared with
experimental biologists in a transparent and useful way, perhaps via
model organism databases. There will be a period of comment and
discussion on the dataset, on procedures used to share data, methods,
and results, and on measures used to evaluate gene function predictions.
Step 1: Organization.
A period of invitation, comment and discussion (via MouseFunc@googlegroups.com),
and commitment to participate (ending July 14).
Step 2: Release of the training data (July 17).
Briefly, training data consists of
- A specified collection of GO terms to be predicted, divided into
12 categories according to GO branch and specificity (number of genes
annotated).
- A specified set of genes*,
divided into a
training and a test set (10% of genes). All gene IDs will be anonymized**.
- A variety of gene*
properties.
- Biological relationships between genes*.
* For simplicity, properties of
proteins encoded by a
given gene will be mapped to that gene ID.
** Anonymization precludes
sequence-based
prediction methods beyond presence/absence of protein sequence
patterns. (Participants agree on their honor not to attempt to decode
the IDs.)
A more complete description of
the training data can
be found here.
Step 3. Submit methods and predictions (September 29) Extended to Friday Oct 13th, 2006!
3a: Code sharing. Each participant posts all code
used to
generate and apply predictive models, together with relevant parameters
and the resulting model. It should be possible to generate the final
score matrix from input data by running a single script/executable. If
the code uses random initializations then the random seeds should be
included.
3b: Predictions. Each participant submits a
matrix of scores
(ranging from 0 to 1) for each gene and GO term to be predicted.
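For concreteness, here is a minimal Python sketch (not official
tooling) of writing such a matrix in the tab-delimited layout described
under the Submission Instructions below; genes, go_terms, and scores
are hypothetical names for a participant's own data, and the exact
header layout should be checked against the provided sample file.

    import gzip

    # Hypothetical inputs: genes and go_terms are lists of IDs as given
    # in the released files; scores[gene][term] is a model output.
    def write_score_matrix(path, genes, go_terms, scores):
        with gzip.open(path, "wt") as out:
            out.write("\t".join(["GeneID"] + go_terms) + "\n")
            for g in genes:
                row = ["%.6g" % min(max(scores[g][t], 0.0), 1.0)
                       for t in go_terms]
                out.write("\t".join([g] + row) + "\n")

The clamping to [0, 1] simply enforces the required score range.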
Step 4. Performance assessment (Oct 16 - Oct 30)
All predictions will be deanonymized and performance will be assessed
on both a) the held-out collection of genes and b) novel predictions
for all genes.
4a. Predictions on held-out genes. A
variety of performance
measures will be applied: area under the ROC curve (AUC), precision at
1% recall (P01R), precision at 10% recall (P10R), precision at 50%
recall (P50R), and precision at 80% recall (P80R). These measures will
be applied to each GO term individually, and median performance values
will be calculated for 12 categories of GO terms (with the indicated #
of GO terms in each):
Specificity*** |  BP |  CC |  MF   (GO branch)
3-10           | 952 | 151 | 475
11-30          | 435 |  97 | 142
31-100         | 239 |  48 | 111
101-300        | 100 |  30 |  35
*** Here, specificity is defined as the number of genes in the
training set assigned to a particular GO term.
Median performance values will also be calculated for the GO
terms
in each row and column of the above table (i.e., for GO terms of a
given specificity or in a given branch).
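To make these measures concrete, here is an illustrative Python sketch
for a single GO term (the organizers' getAUC.pl, listed under Software
below, is the reference implementation; this version ignores tied
scores and reports precision at the first rank where each recall level
is reached).

    # scores and labels are parallel lists over genes for one GO term;
    # labels are 1 (annotated) or 0. The term must have at least one
    # positive and one negative gene.
    def term_performance(scores, labels,
                         recall_levels=(0.01, 0.10, 0.50, 0.80)):
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        n_pos = sum(labels)
        n_neg = len(labels) - n_pos
        tp = fp = above = 0
        prec_at = {}
        for i in order:
            if labels[i]:
                tp += 1
            else:
                fp += 1
                above += tp  # positives ranked above this negative
            for r in recall_levels:
                if r not in prec_at and tp / n_pos >= r:
                    prec_at[r] = tp / (tp + fp)
        auc = above / (n_pos * n_neg)  # fraction of correctly ordered pairs
        return auc, prec_at

Median values across the GO terms in a category can then be taken
with, e.g., statistics.median.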
4b. Novel predictions. Predictions for
gene/GO term
combinations not annotated in the training set will be judged on the
basis of new GO term annotations in the ~8 months since data set
assembly. The same measures described above will be applied.
Step 5. Publish the results (by Dec 1).
Each participant provides a description of their approach (ideally
brief enough to also permit publication elsewhere) together with
references. We will write the paper summarizing performance, and submit
to Nature Biotechnology (previous interest expressed by editor G.
Taroncher).
All files are tab-delimited. Gene IDs have been anonymized.
All matrices can be downloaded here
in a tar.gz file (33Mb)
- All Gene IDs. This file indicates what data is available per gene
and is fully populated (i.e., it includes all IDs present in the data
and in the test set). File: GenesIDs_and_Summary.txt.gz, 886Kb
- Test Set. List of gene IDs in the test set. File: TestSet.txt.gz, 11.7Kb
- GO categories to be predicted per hierarchy. The following
gene-count ranges will be considered when evaluating performance:
3-10, 11-30, 31-100, 101-300.
DATA:
- Functional Annotations.- The binary matrices contain all
non-IEA annotations from the Gene Ontology available as of February
2006, separated by hierarchy (i.e., Biological Process, Cellular
Component, and Molecular Function). In the matrices, a "1" indicates
that a gene is annotated as having that function. The annotations were
up-propagated, which means that genes are assigned all ancestor terms
of their annotated GO terms (see the sketch below).
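A minimal sketch of the up-propagation step, assuming a hypothetical
parents mapping from each GO term to its direct parents in the GO
graph:

    # gene_terms: gene -> set of directly annotated GO terms.
    # Returns gene -> set that also includes every ancestor term.
    def up_propagate(gene_terms, parents):
        def ancestors(term, seen):
            for p in parents.get(term, ()):
                if p not in seen:
                    seen.add(p)
                    ancestors(p, seen)
            return seen
        return {gene: set().union(*(ancestors(t, {t}) for t in terms))
                for gene, terms in gene_terms.items()}

For example, up_propagate({"g": {"GO:b"}}, {"GO:b": ["GO:a"]}) assigns
gene "g" both GO:b and its ancestor GO:a.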
- Gene Expression.- The matrices contain only the data for
probes/tags mapping to MGI genes tested in the corresponding study.
The relation of probe/tag to gene is either one-to-one or many probes
to one gene (i.e., in some cases there are multiple rows for the same
gene). In the case of the SAGE data, there are different
representations of the data: individual tag counts, the average tag
count over tags mapped to the same gene, and the sum of the tag counts
over tags mapped to the same gene (see the collapsing sketch below).
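As an illustration of the averaging and summing just described, a
small Python sketch collapsing many-probes-to-one-gene rows (the row
format here is a hypothetical simplification of the real files):

    from collections import defaultdict

    # rows: iterable of (gene_id, list_of_expression_values), possibly
    # with several rows per gene; returns one combined row per gene.
    def collapse_rows(rows, how="mean"):
        grouped = defaultdict(list)
        for gene, values in rows:
            grouped[gene].append(values)
        combined = {}
        for gene, vectors in grouped.items():
            totals = [sum(col) for col in zip(*vectors)]
            combined[gene] = (totals if how == "sum"
                              else [t / len(vectors) for t in totals])
        return combined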
- Protein Annotations.- Binary matrices indicating the protein
pattern annotations available from Pfam and InterPro (3133 from Pfam
and 5404 from InterPro).
- Protein-Protein Interactions.- Interactions were obtained by
orthology (provided by MGI) from known and predicted human
interactions available at OPHID. These data are represented as a
binary matrix and as a distance matrix. The distance matrix indicates
how far apart two genes are in the interaction network (“inf” if the
genes are unreachable, i.e., they belong to disconnected sub-networks;
see the sketch below). In the PPI matrices, known and predicted
interactions are combined.
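The relationship between the two PPI matrices can be sketched as
follows: each distance is a shortest-path length over the binary
interaction network (a minimal breadth-first-search sketch, with a
hypothetical adjacency mapping adj built from the binary matrix):

    from collections import deque

    # adj: gene -> set of interaction partners. Returns shortest path
    # lengths from source; float("inf") marks genes in disconnected
    # sub-networks, matching the "inf" entries described above.
    def network_distances(adj, source):
        dist = {g: float("inf") for g in adj}
        dist[source] = 0
        queue = deque([source])
        while queue:
            g = queue.popleft()
            for partner in adj[g]:
                if dist[partner] == float("inf"):
                    dist[partner] = dist[g] + 1
                    queue.append(partner)
        return dist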
- Phenotype Ontology (source: MGI)
- Phylogenetic Profile.- This data indicates whether a gene has
putative orthologues in different species. There is a binary matrix,
where "1" means that a gene has an orthologous gene in the
corresponding species, and there is a score matrix (see the sketch
below).
- Source: BioMart - 18 different species (from yeast to human).
Orthology in BioMart is based on reciprocal BLAST hits and synteny.
The score is given by coverage times identity.
- Score Matrix - file: phylogenetic_scores.txt.gz, 1.1Mb (score =
coverage * identity).
- Source: Inparanoid - Inparanoid data includes inparalogs and
excludes outparalogs. Data are provided for 21 different species,
including some prokaryotes. The score is the Inparanoid score, which
indicates how similar (1 = identical) an inparalog is to the inparalog
that is the main ortholog in the cluster.
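A minimal sketch of how one row of the two phylogenetic-profile
matrices could be derived from per-species hits; the field names for
the BLAST statistics are hypothetical:

    # hits: species -> {"coverage": c, "identity": i} for one gene's
    # best putative orthologue; a species absent from hits means no
    # orthologue was found there.
    def profile_row(hits, species_list):
        scores = [hits[s]["coverage"] * hits[s]["identity"]
                  if s in hits else 0.0
                  for s in species_list]
        binary = [1 if v > 0 else 0 for v in scores]
        return scores, binary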
- Diseases.- Matrix (source: OMIM via homology) - file:
omim_disease.txt.gz, 9.3Mb
Below are the participants who successfully submitted their predictions. Congratulations!
Submitted
- Yanjun Qi [1], Judith Klein-Seetharaman [1,2] & Ziv Bar-Joseph [1] ([1] Carnegie Mellon University, [2] University of Pittsburgh)
- Sara Mostafavi, David Warde-Farley, Chris Grouios & Quaid Morris (University of Toronto)
- Guillaume Obozinski, Charles Grant, Gert Lanckriet, Jian Qiu, Michael Jordan [1] & William Stafford Noble (University of Washington, [1] University of California - Berkeley)
- Murat Tasan, Weidong Tian, Frank Gibbons & Fritz Roth (Harvard Medical School)
- Hyunju Lee, Minghua Deng, Ting Chen & Fengzhu Sun (University of Southern California)
- Yuanfang Guan, Chad L. Myers & Olga G. Troyanskaya (Princeton University)
- Michele Leone & Andrea Pagnani (Institute for Scientific Interchange, Turin, Italy)
- Trupti Joshi, Chao Zhang, Guan Ning Lin & Dong Xu (University of Missouri-Columbia)
- Wan Kyu Kim, Chase Krumpelman, & Edward Marcotte (University of Texas, Austin)
Issues arising before and during the competition will be discussed on
a Google Discussion Group.
NOTE: The discussion group is no longer active.
Software
- getAUC.tar.- Perl program to obtain the ROC curve, AUC_ROC, and
precision at several recalls for a list of GO categories, and to
compute the median values of those measures across the given GO
categories.
- getAUC.Readme
- Command to run getAUC with the test files supplied:
- getAUC.pl path/goBP_2distribute.txt.gz goCategories.txt.gz genesFile.txt.gz
- checkFormat_scoreMatrix.pl.- Perl program to verify the format of a score matrix. Usage: checkFormat_scoreMatrix.pl scoreMatrix_file.txt.gz
Submission Instructions
Please follow the submission instructions below exactly.
The submission files should follow the
filename scheme below:
For score matrix:
FirstAuthorInitials-SecondAuthorInitials-ithAuthorInitials-"result".txt.gz.zip.
example: JP-FD-result.txt.gz.zip
For code:
FirstAuthorInitials-SecondAuthorInitials-ithAuthorInitials-"code".tar.zip.
example: JP-FD-code.tar.zip
The score matrix file should contain the tab-delimited score values
for each gene (one per line) and each GO term to be predicted. The IDs
of the GO terms and genes should be exactly the same as in the files
provided. The score matrix should contain a line for each gene ID in
the file “GenesIDs_and_Summary.txt.gz”. The score values are the
output of the model and should be in the 0 to 1 range; the higher the
score, the higher the probability of the gene having the corresponding
function.
The score values should contain only digits and at most one decimal
point ("."). The score values can also be in scientific notation,
e.g., 1.23456e-04. The file “sample_scoreMatrix.txt.gz” shows what the
result file could look like. In addition, the Perl script
“checkFormat_scoreMatrix.pl” available here can be used to verify the
format of the score matrix before submission.
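For reference, the same kinds of checks can be sketched in Python
(this is not a substitute for checkFormat_scoreMatrix.pl; it assumes
the matrix has no header row, so adapt it to the sample file's actual
layout):

    import gzip

    # required_genes: all gene IDs from GenesIDs_and_Summary.txt.gz.
    def check_score_matrix(path, required_genes):
        seen = set()
        with gzip.open(path, "rt") as fh:
            for line in fh:
                fields = line.rstrip("\n").split("\t")
                gene, values = fields[0], fields[1:]
                seen.add(gene)
                for v in values:
                    x = float(v)  # accepts plain and scientific notation
                    assert 0.0 <= x <= 1.0, "score out of range: %s %s" % (gene, v)
        missing = set(required_genes) - seen
        assert not missing, "%d required gene IDs missing" % len(missing)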
Please include a README file (see
sample here) in the code submission file with indications on how to
compile (if necessary) and run the code, and on which systems the
code has been tested. In particular, if you use some standard
programs or require some libraries, please indicate where they can be
obtained and the versions that you have used. Please ensure that all
parameters used in running the code have been provided, as well as
random seeds if any randomization has been used. If you have
makefiles, please include them as well.
The submission deadline is 29th
September 2006 (Any time zone).
Extended to Friday Oct 13th, 2006!
Submissions should be made by uploading
the files here.
Only the last submission before the
deadline will be evaluated and all other submissions will be
discarded.
Methods section
A more complete description of the methods, suitable for inclusion as
Supplementary Information in the resulting manuscript, should be
submitted by Friday October 20, 2006 (extended!). The description
should include brief comparisons and references to prior work. You can
also refer to “(unpublished results)” if you think you may publish
this work separately outside of the competition summary paper. Note
that there will be an opportunity to revise this section later, but
you are encouraged to submit a draft while your memory of the methods
is still fresh.
The submission file should follow the
filename scheme below:
FirstAuthorInitials-SecondAuthorInitials-ithAuthorInitials-"methods".*
example: JP-FD-methods.*
“*”: the file format can be either .pdf, .txt.gz, or .doc
Please upload your methods file here.
To simplify subsequent analyses for ourselves and other investigators,
we derived a single set of prediction scores from the set of submitted
scores: for each evaluation category, we adopted the scores from the
submission with the best Precision at 20% Recall (P20R) value in that
category, evaluated using held-out genes (see the sketch below). The
combined predictions averaged 41% precision at 20% recall, with 26% of
GO terms having a P20R value greater than 90%.
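A compact sketch of this combination rule, with hypothetical lookups
(p20r[submission][category] holding the held-out P20R values and
scores[submission][category] the corresponding score columns):

    # For each evaluation category, keep the score columns from the
    # submission that achieved the best held-out P20R in that category.
    def combine(p20r, scores, categories, submissions):
        return {cat: scores[max(submissions, key=lambda s: p20r[s][cat])][cat]
                for cat in categories}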
Set of predictions from individual groups
Last updated: Wed 3 Sep, 2008