Gene Function Prediction (examples.gene_func_prediction)

As background reading for this example, we recommend [Schietgat2010] and [Schachtner2008]. The former studies decision-tree-based models for predicting multiple gene functions; the latter uses unsupervised matrix factorization techniques to extract marker genes from gene expression profiles for classification into diagnostic categories.

This example from functional genomics deals with predicting gene functions. The two main characteristics of the gene function prediction task are:

  1. a single gene can have multiple functions,
  2. the functions are organized in a hierarchy, in particular a hierarchy structured as a rooted tree, the MIPS Functional Catalogue (FunCat). A gene related to some function is automatically related to all of its ancestor functions. The data set used in this example originates from S. cerevisiae and has annotations from the MIPS Functional Catalogue.

This problem setting is known as hierarchical multi-label classification (HMC).
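
The ancestor rule from item 2 can be sketched in a few lines. This is a minimal illustration, not code from the example module: the helper name with_ancestors is hypothetical, and it assumes class codes are written as dot-separated levels (for instance 01.03.16), which may differ from the separator used in the data set files.

```python
def with_ancestors(labels, sep="."):
    """Return the given hierarchical class codes plus all of their ancestors.

    Assumption: each code lists its levels from the root down, joined by `sep`.
    """
    closed = set()
    for code in labels:
        parts = code.split(sep)
        # "01.03.16" contributes "01", "01.03" and "01.03.16".
        for depth in range(1, len(parts) + 1):
            closed.add(sep.join(parts[:depth]))
    return closed


direct = {"01.03.16", "11.02"}
print(sorted(with_ancestors(direct)))
# ['01', '01.03', '01.03.16', '11', '11.02']
```

The codes in direct play the role of direct class memberships; the codes added by the closure are the indirect ones.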

Note

The S. cerevisiae FunCat annotated data set used in this example is not included in the datasets. To perform the gene function prediction experiments, first download the data set; in particular, the D1 (FC) seq data set must be available for the example to run. Download links are listed in the datasets. To run the example, uncompress the data and place it in the corresponding data directory: the extracted data set must exist in the S_cerevisiae_FC directory under datasets. Once the data is installed, you can run the experiments.

Here is the outline of this gene function prediction task.

  1. Reading S. cerevisiae sequence data, i.e. the train, validation and test sets. Reading meta data, attributes’ labels and class labels. Weights are used to distinguish direct and indirect class memberships of genes in gene function classes according to FunCat annotations.
  2. Preprocessing, i.e. normalizing the data matrix of the test data and the data matrix of the joined train and validation data.
  3. Factorization of the train data matrix. We used the SNMF/L factorization algorithm for the train data.
  4. Factorization of the test data matrix. We used the SNMF/L factorization algorithm for the test data.
  5. Application of rules for class assignments. Three rules can be used: average correlation and maximal correlation, as in [Schachtner2008], and threshold maximal correlation. All class assignment rules are generalized to meet the hierarchy constraint imposed by the rooted tree structure of the MIPS Functional Catalogue.
  6. Precision-recall (PR) evaluation measures.

To run the example simply type:

python gene_func_prediction.py

or call the module’s function:

import nimfa.examples
nimfa.examples.gene_func_prediction.run()

Note

This example uses the matplotlib library to produce visualizations.

nimfa.examples.gene_func_prediction.assign_labels(corrs, train, idx2class, method=0.0)

Apply rules for class assignments. In [Schachtner2008] two rules are proposed, average correlation and maximal correlation. Both rules are implemented here and can be selected through the :param:`method` parameter. In addition, the threshold maximal correlation rule is available. The class assignment rules are generalized to multi-label classification incorporating hierarchy constraints.

The user can select one of the following rules:
  1. average correlation,
  2. maximal correlation,
  3. threshold maximal correlation.

Although any method based on a similarity measure can be used, we estimate correlation coefficients. Let w be the gene profile of the test basis matrix for which we want to predict gene functions. For each class C, an index set A is created that contains all indices m for which the m-th profile of the train basis matrix has label C. Index set B contains all remaining indices. The average correlation coefficient between w and the profiles indexed by A is computed, and likewise the average correlation coefficient between w and the profiles indexed by B. Finally, w is assigned label C if the former average is greater than the latter.

Note

The rule described above assigns the class label according to the average correlation of the test vector with all vectors belonging to one or the other index set. A minor modification of this rule is to assign the class label according to the maximal correlation occurring between the test vector and the members of each index set.

Note

As noted before, the main problem of this example is the HMC (hierarchical multi-label classification) setting. We therefore generalized the concepts from articles describing the use of factorization for binary classification problems to multi-label classification. Additionally, we use weights for class memberships to incorporate the hierarchical structure of the MIPS Functional Catalogue.

Return a mapping of gene functions to genes.

Parameters:
  • corrs (dict) – Estimated correlation coefficients between profiles of train basis matrix and profiles of test basis matrix.
  • train (dict) – Class information of train data set.
  • idx2class (dict) – Mapping between classes’ indices and classes’ labels.
  • method (float or str) – Type of rule for class assignments. Specify average or maximal to select the average correlation or maximal correlation rule, respectively. To use the threshold maximal correlation rule, specify the threshold value instead. By default the threshold rule is applied.
Return type:

dict
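
How a single method argument can select between the three rules is sketched below. This is an illustration only: the helper label_applies is hypothetical, it works on correlations that have already been computed for one test profile and one class, and the reading of the threshold rule (the largest in-class correlation must exceed the threshold) is an assumption rather than something the text above spells out.

```python
from statistics import fmean


def label_applies(corr_in, corr_out, method=0.0):
    """Decide one class label from precomputed correlations.

    corr_in:  correlations of the test profile with train profiles in the class
    corr_out: correlations with all remaining train profiles
    method:   "average", "maximal", or a numeric threshold
    """
    if not corr_in:
        return False  # no train profile carries this label
    if method == "average":
        return bool(corr_out) and fmean(corr_in) > fmean(corr_out)
    if method == "maximal":
        return bool(corr_out) and max(corr_in) > max(corr_out)
    if isinstance(method, (int, float)):
        # Assumed threshold rule: the best in-class match must exceed the threshold.
        return max(corr_in) > float(method)
    raise ValueError(f"unknown method: {method!r}")


corr_in = [0.2, 0.9]
corr_out = [0.6, 0.7, 0.65]
print(label_applies(corr_in, corr_out, "average"))  # False: 0.55 is below 0.65
print(label_applies(corr_in, corr_out, "maximal"))  # True: 0.9 is above 0.7
print(label_applies(corr_in, corr_out, 0.95))       # False: 0.9 is below the threshold
```

The toy values show that the average and maximal rules can disagree when a class contains one strong match and several weak ones.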

nimfa.examples.gene_func_prediction.compute_correlations(train, test)

Estimate correlation coefficients between profiles of train basis matrix and profiles of test basis matrix.

Return the estimated correlation coefficients of the features (variables).

Parameters:
  • train (dict) – Factorization matrix factors of train data set.
  • test (dict) – Factorization matrix factors of test data set.
Return type:

numpy.matrix
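
For orientation, the quantity being estimated can be written out explicitly. The sketch below computes Pearson correlation coefficients between every train profile and every test profile using plain lists; the names pearson and cross_correlations are hypothetical, the orientation of the result (rows for train profiles, columns for test profiles) is an assumption, and returning 0.0 for a constant profile is a convention chosen here.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dev_x = [a - mean_x for a in x]
    dev_y = [b - mean_y for b in y]
    denom = sqrt(sum(a * a for a in dev_x) * sum(b * b for b in dev_y))
    if denom == 0.0:
        return 0.0  # convention for constant profiles, where the coefficient is undefined
    return sum(a * b for a, b in zip(dev_x, dev_y)) / denom


def cross_correlations(train_profiles, test_profiles):
    """Entry [i][j] is the correlation of train profile i with test profile j."""
    return [[pearson(tr, te) for te in test_profiles] for tr in train_profiles]


corrs = cross_correlations([[1, 2, 3], [3, 2, 1]], [[2, 4, 6]])
print(corrs)  # first row close to 1.0, second row close to -1.0
```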

nimfa.examples.gene_func_prediction.factorize(data)

Perform factorization on S. cerevisiae FunCat annotated sequence data set (D1 FC seq).

Return the factorized data, that is, the matrix factors resulting from the factorization (basis and mixture matrix).

Parameters:
  • data (tuple) – Transformed data set containing attributes’ values, class information and possibly additional meta information.

nimfa.examples.gene_func_prediction.plot(func2gene, test, idx2class)

Report the performance with the precision-recall (PR) based evaluation measures.

Besides PR, ROC-based evaluations have also been used to evaluate gene function prediction approaches. PR-based evaluation better suits the characteristics of the common HMC task, in which many classes are infrequent, with only a small number of genes having a particular function. That is, for most classes the number of negative instances exceeds the number of positive instances. It is therefore sometimes preferred to recognize the positive instances rather than to correctly predict the negative ones (i.e. that a gene does not have a particular function). ROC curves may be less suited to the task because they reward a learner for correctly predicting negative instances.
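
A small numeric illustration of this argument, using made-up counts rather than results from the example: for a rare function, the false positive rate used by ROC curves stays low even when most predicted positives are wrong, whereas precision exposes the problem. The helper name rates is hypothetical.

```python
def rates(tp, fp, fn, tn):
    """Return (precision, recall, false positive rate) from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # also the true positive rate on a ROC curve
    fpr = fp / (fp + tn)      # the other axis of a ROC curve
    return precision, recall, fpr


# Hypothetical rare class: 10 of 1000 genes have the function.
# The predictor flags 30 genes, of which 5 are correct.
precision, recall, fpr = rates(tp=5, fp=25, fn=5, tn=965)
print(f"precision={precision:.3f} recall={recall:.3f} fpr={fpr:.3f}")
# precision=0.167 recall=0.500 fpr=0.025
```

The 965 correctly rejected genes keep the false positive rate small, yet only one in six predicted annotations is right.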

Return PR evaluation measures.

Parameters:
  • func2gene (dict) – Mapping of predicted gene functions to genes.
  • test (dict) – Class information of test data set.
  • idx2class (dict) – Mapping between classes’ indices and classes’ labels.
Return type:

tuple

nimfa.examples.gene_func_prediction.preprocess(data)

Preprocess the S. cerevisiae FunCat annotated sequence data set (D1 FC seq). The preprocessing step includes data normalization.

Return preprocessed data.

Parameters:
  • data (tuple) – Transformed data set containing attributes’ values, class information and possibly additional meta information.

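
The text does not say which normalization is applied. As one plausible sketch, min-max scaling maps all entries into the range [0, 1], which also yields the nonnegative input that nonnegative matrix factorization requires. The function name min_max_normalize and the choice of scheme are assumptions made for illustration.

```python
def min_max_normalize(matrix):
    """Rescale all entries of a non-empty list-of-rows matrix to the range [0, 1]."""
    flat = [value for row in matrix for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant matrix has no spread to rescale; map everything to 0.
        return [[0.0 for _ in row] for row in matrix]
    return [[(value - lo) / (hi - lo) for value in row] for row in matrix]


print(min_max_normalize([[2, 4], [6, 10]]))
# [[0.0, 0.25], [0.5, 1.0]]
```
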
nimfa.examples.gene_func_prediction.read()

Read S. cerevisiae FunCat annotated sequence data set (D1 FC seq).

Return attributes’ values and class information of the test data set and of the joined train and validation data set. Mappings between attributes’ names and indices and between classes’ names and indices are returned as well.

nimfa.examples.gene_func_prediction.run()

Run the gene function prediction example on the S. cerevisiae sequence data set (D1 FC seq).

The methodology is as follows:
  1. Reading S. cerevisiae sequence data, i.e. the train, validation and test sets. Reading meta data, attributes’ labels and class labels.
  2. Preprocessing, i.e. normalizing the data matrix of the test data and the data matrix of the joined train and validation data.
  3. Factorization of the train data matrix. We used the SNMF/L factorization algorithm for the train data.
  4. Factorization of the test data matrix. We used the SNMF/L factorization algorithm for the test data.
  5. Application of rules for class assignments. Three rules can be used: average correlation and maximal correlation, as in [Schachtner2008], and threshold maximal correlation. All class assignment rules are generalized to meet the hierarchy constraint imposed by the rooted tree structure of the MIPS Functional Catalogue.
  6. PR evaluation measures.

nimfa.examples.gene_func_prediction.transform_data(path, include_meta=False)

Read data in the ARFF format and transform it into a matrix suitable for the factorization process. For each feature, update direct and indirect class information by exploiting properties of the Functional Catalogue hierarchy.

Return attributes’ values and class information. If :param:`include_meta` is specified, additional mappings are provided from indices to attributes’ names and from indices to classes’ names.

Parameters:
  • path (str) – Path of directory with sequence data set (D1 FC seq).
  • include_meta (bool) – Specify if the header of the ARFF file should be skipped. The header of the ARFF file contains the name of the relation, a list of the attributes and their types. Default value is False.
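
To make the ARFF layout concrete, here is a minimal reader for the parts of the format mentioned above: a header with the relation name and attribute declarations, followed by comma-separated data rows. It is a sketch, not the module's parser. The name parse_arff and the toy relation are hypothetical, quoted values and sparse rows are not handled, and the class column is shown as a plain string even though the D1 FC seq files may encode classes differently.

```python
def parse_arff(lines):
    """Split ARFF text into (attribute names, data rows); '?' becomes None."""
    names, rows, in_data = [], [], False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or ARFF comment
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                names.append(line.split(None, 2)[1])  # second token is the name
            elif lower.startswith("@data"):
                in_data = True
            continue  # other header lines, such as @relation, are skipped
        values = [v.strip() for v in line.split(",")]
        rows.append([None if v == "?" else v for v in values])
    return names, rows


sample = """\
% toy relation, for illustration only
@relation toy
@attribute length numeric
@attribute charge numeric
@attribute class string
@data
1.5,?,01.03
2.0,0.3,11.02
""".splitlines()

names, rows = parse_arff(sample)
print(names)  # ['length', 'charge', 'class']
print(rows)   # [['1.5', None, '01.03'], ['2.0', '0.3', '11.02']]
```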