Welcome to Nimfa

Nimfa is a Python library for nonnegative matrix factorization. It includes implementations of several factorization methods, initialization approaches, and quality scoring. Both dense and sparse matrix representations are supported.

Nimfa is distributed under the BSD license.


A sample script that applies Nimfa to medulloblastoma gene expression data is given below. It uses alternating least squares nonnegative matrix factorization with the projected gradient method for subproblems [Lin2007] and the Random Vcol initialization algorithm [Albright2006]. The returned object is a fitted factorization model through which the user can access matrix factors and estimate quality measures.

import nimfa

V = nimfa.examples.medulloblastoma.read(normalize=True)

lsnmf = nimfa.Lsnmf(V, seed='random_vcol', rank=50, max_iter=100)
lsnmf_fit = lsnmf()

print('Rss: %5.4f' % lsnmf_fit.fit.rss())
print('Evar: %5.4f' % lsnmf_fit.fit.evar())
print('K-L divergence: %5.4f' % lsnmf_fit.distance(metric='kl'))
print('Sparseness, W: %5.4f, H: %5.4f' % lsnmf_fit.fit.sparseness())

Running this script produces the following output, where slight differences in reported scores across different runs can be attributed to the randomness of the Random Vcol initialization method:

Rss: 0.2668
Evar: 0.9997
K-L divergence: 38.8744
Sparseness, W: 0.7297, H: 0.8796
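
Since Random Vcol draws its starting columns through NumPy, fixing NumPy's global random state before the run should make the reported scores repeatable. A minimal sketch; that seeding numpy.random is sufficient is our assumption, not a documented guarantee:

import numpy as np

import nimfa

# Assumption: Random Vcol seeding draws from NumPy's global RNG,
# so fixing it makes the factorization reproducible across runs
np.random.seed(42)

V = nimfa.examples.medulloblastoma.read(normalize=True)

lsnmf = nimfa.Lsnmf(V, seed='random_vcol', rank=50, max_iter=100)
lsnmf_fit = lsnmf()
print('Rss: %5.4f' % lsnmf_fit.fit.rss())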

An IPython notebook shows how to use Nimfa to analyze breast cancer transcriptome data sets from the International Cancer Genome Consortium (ICGC). The analysis of cancer data was highlighted in the ACM XRDS magazine.

Content

Factorization Algorithms

  • BD - Bayesian nonnegative matrix factorization Gibbs sampler [Schmidt2009]
  • BMF - Binary matrix factorization [Zhang2007]
  • ICM - Iterated conditional modes nonnegative matrix factorization [Schmidt2009]
  • LFNMF - Fisher nonnegative matrix factorization for learning local features [Wang2004], [Li2001]
  • LSNMF - Alternating nonnegative least squares matrix factorization using projected gradient method for subproblems [Lin2007]
  • NMF - Standard nonnegative matrix factorization with Euclidean / Kullback-Leibler update equations and Frobenius / divergence / connectivity cost functions [Lee2001], [Brunet2004]
  • NSNMF - Nonsmooth nonnegative matrix factorization [Montano2006]
  • PMF - Probabilistic nonnegative matrix factorization [Laurberg2008], [Hansen2008]
  • PSMF - Probabilistic sparse matrix factorization [Dueck2005], [Dueck2004], [Srebro2001], [Li2007]
  • SNMF - Sparse nonnegative matrix factorization based on alternating nonnegativity constrained least squares [Park2007]
  • SNMNMF - Sparse network-regularized multiple nonnegative matrix factorization [Zhang2011]
  • PMFCC - Penalized matrix factorization for constrained clustering [FWang2008]
  • SepNMF - Separable nonnegative matrix factorization [Gillis2014], [Kumar2013], [Tepper2015], [Kapralov2016]
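
All listed algorithms are exposed through the same interface: construct a factorization object from the target matrix and options, call it to run the factorization, and inspect the returned fit. A minimal sketch of swapping algorithms under this shared pattern:

import numpy as np

import nimfa

V = np.random.rand(40, 100)

# Each algorithm class follows the same construct / call / inspect pattern,
# so they can be exchanged without touching the surrounding code
for method in (nimfa.Nmf, nimfa.Lsnmf, nimfa.Bmf):
    fit = method(V, rank=10, max_iter=20)()
    print('%s: %5.3f' % (method.__name__, fit.distance(metric='euclidean')))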

Initialization Algorithms

  • Random
  • Fixed
  • NNDSVD [Boutsidis2007]
  • Random C [Albright2006]
  • Random Vcol [Albright2006]
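
An initialization algorithm is chosen through the seed argument of any factorization class, using the lowercase names with underscores ('random', 'fixed', 'nndsvd', 'random_c', 'random_vcol'); a minimal sketch:

import numpy as np

import nimfa

V = np.random.rand(30, 20)

# NNDSVD gives a deterministic, SVD-based starting point [Boutsidis2007]
nmf = nimfa.Nmf(V, seed='nndsvd', rank=4, max_iter=100)
nmf_fit = nmf()
print('Initialization type: %s' % nmf_fit.seeding)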

Quality Measures

  • Distance
  • Residuals
  • Connectivity matrix
  • Consensus matrix
  • Entropy of the fitted NMF model [Park2007]
  • Dominant basis components computation
  • Explained variance
  • Feature score computation, representing the specificity of each feature to the basis vectors [Park2007]
  • Computation of the most basis-specific features for each basis vector [Park2007]
  • Purity [Park2007]
  • Residual sum of squares (rank estimation) [Hutchins2008], [Frigyesi2008]
  • Sparseness [Hoyer2004]
  • Cophenetic correlation coefficient of consensus matrix (rank estimation) [Brunet2004]
  • Dispersion [Park2007]
  • Factorization rank estimation
  • Measures specific to the selected matrix factorization method
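
Most of these measures can be read directly off the fitted model or collected in one pass through summary(), as in the usage examples below. A minimal sketch; treat the connectivity() accessor as an assumption inferred from the measure names above:

import numpy as np

import nimfa

V = np.random.rand(30, 20)

nmf_fit = nimfa.Nmf(V, seed='random_vcol', rank=4, max_iter=50)()

# Individual measures off the fitted model
print('Rss: %5.3f' % nmf_fit.fit.rss())
print('Evar: %5.3f' % nmf_fit.fit.evar())
# Assumed accessor name, mirroring the 'Connectivity matrix' entry above
print('Connectivity matrix:\n%s' % nmf_fit.fit.connectivity())

# Or everything at once
sm = nmf_fit.summary()
print('Sparseness, W: %5.3f, H: %5.3f' % sm['sparseness'])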

Utils

  • Fitted factorization model tracker across multiple runs
  • Residuals tracker across multiple factorizations / runs
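
Both trackers are enabled through constructor flags, as the LSNMF error-tracking and BMF factor-tracking examples below show. A condensed sketch combining the two; that both flags can be combined in a single call is our assumption:

import numpy as np

import nimfa

V = np.random.rand(30, 20)

# track_error records the objective value at every iteration;
# track_factor keeps factors from each of the n_run runs
lsnmf = nimfa.Lsnmf(V, rank=4, max_iter=10, n_run=3,
                    track_error=True, track_factor=True)
lsnmf_fit = lsnmf()

print('Error tracking:\n%s' % lsnmf_fit.fit.tracker.get_error())
print('Cophenetic correlation: %5.3f' % lsnmf_fit.summary()['cophenetic'])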

Installation

Nimfa is compatible with both Python 2 and Python 3. The recommended way to install Nimfa is by issuing:

pip install nimfa

from the command line.

Nimfa makes extensive use of the SciPy and NumPy libraries for fast and convenient dense and sparse matrix manipulation and some linear algebra operations. There are no additional prerequisites.

Alternatively, you can download the source code from GitHub.

Unzip the archive. To build and install, run:

python setup.py install

Configuration

Methods are configured through:

  1. runtime specific options (e.g. tracking the fitted model across multiple runs, tracking residuals across iterations, etc.);
  2. algorithm specific options (e.g. prior information with PSMF, type of update equations with NMF, initial value for noise variance with ICM, etc.).

For details and descriptions of algorithm specific options, see the documentation of the specific algorithm.
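
Both kinds of options are passed as keyword arguments to the factorization constructor; a minimal sketch mixing runtime options (n_run, track_factor) with NMF-specific ones (update, objective), all of which appear in the examples below:

import numpy as np

import nimfa

V = np.random.rand(30, 20)

nmf = nimfa.Nmf(V, rank=4, max_iter=100,
                # runtime specific: repeat the factorization and track factors
                n_run=5, track_factor=True,
                # algorithm specific: NMF update equations and cost function
                update='divergence', objective='div')
nmf_fit = nmf()
print('Cophenetic correlation: %5.3f' % nmf_fit.summary()['cophenetic'])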

Usage

The following basic usage examples employ different implemented factorization algorithms.

Standard NMF - Divergence on scipy.sparse matrix with matrix factors estimation.

import numpy as np
import scipy.sparse as spr

import nimfa

V = spr.csr_matrix([[1, 0, 2, 4], [0, 0, 6, 3], [4, 0, 5, 6]])
print('Target:\n%s' % V.todense())

nmf = nimfa.Nmf(V, max_iter=200, rank=2, update='divergence', objective='div')
nmf_fit = nmf()

W = nmf_fit.basis()
print('Basis matrix:\n%s' % W.todense())

H = nmf_fit.coef()
print('Mixture matrix:\n%s' % H.todense())

print('Euclidean distance: %5.3f' % nmf_fit.distance(metric='euclidean'))

sm = nmf_fit.summary()
print('Sparseness Basis: %5.3f  Mixture: %5.3f' % (sm['sparseness'][0], sm['sparseness'][1]))
print('Iterations: %d' % sm['n_iter'])
print('Target estimate:\n%s' % np.dot(W.todense(), H.todense()))

LSNMF on numpy dense matrix with quality and performance measures.

import numpy as np

import nimfa

V = np.array([[1, 2, 3], [4, 5, 6], [6, 7, 8]])
print('Target:\n%s' % V)

lsnmf = nimfa.Lsnmf(V, max_iter=10, rank=3)
lsnmf_fit = lsnmf()

W = lsnmf_fit.basis()
print('Basis matrix:\n%s' % W)

H = lsnmf_fit.coef()
print('Mixture matrix:\n%s' % H)

print('K-L divergence: %5.3f' % lsnmf_fit.distance(metric='kl'))

print('Rss: %5.3f' % lsnmf_fit.fit.rss())
print('Evar: %5.3f' % lsnmf_fit.fit.evar())
print('Iterations: %d' % lsnmf_fit.n_iter)
print('Target estimate:\n%s' % np.dot(W, H))

LSNMF with Random Vcol initialization and error tracking.

import numpy as np

import nimfa

V = np.array([[1, 2, 3], [4, 5, 6], [6, 7, 8]])
print('Target:\n%s' % V)

lsnmf = nimfa.Lsnmf(V, seed='random_vcol', max_iter=10, rank=3, track_error=True)
lsnmf_fit = lsnmf()

W = lsnmf_fit.basis()
print('Basis matrix:\n%s' % W)

H = lsnmf_fit.coef()
print('Mixture matrix:\n%s' % H)

# Objective function value for each iteration
print('Error tracking:\n%s' % lsnmf_fit.fit.tracker.get_error())

sm = lsnmf_fit.summary()
print('Rss: %5.3f' % sm['rss'])
print('Evar: %5.3f' % sm['evar'])
print('Iterations: %d' % sm['n_iter'])

LSNMF with Random Vcol initialization and rank estimation.

import numpy as np

import nimfa

V = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]])
print('Target:\n%s' % V)

lsnmf = nimfa.Lsnmf(V, seed='random_vcol', max_iter=10, rank=3, track_error=True)
lsnmf_fit = lsnmf()

W = lsnmf_fit.basis()
print('Basis matrix:\n%s' % W)

H = lsnmf_fit.coef()
print('Mixture matrix:\n%s' % H)

r = lsnmf.estimate_rank(rank_range=[2, 3, 4], what=['rss'])
pp_r = '\n'.join('%d: %5.3f' % (rank, vals['rss']) for rank, vals in r.items())
print('Rank estimate:\n%s' % pp_r)
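
Continuing the example above: by default each candidate rank is fitted once, and repeating runs while requesting additional measures should give a more stable estimate. A sketch under the assumption that estimate_rank accepts an n_run keyword and the measure names listed under Quality Measures:

# Assumptions: n_run repeats each rank; 'evar' and 'cophenetic' are valid measure names
r = lsnmf.estimate_rank(rank_range=[2, 3, 4], n_run=5,
                        what=['rss', 'evar', 'cophenetic'])
for rank, vals in r.items():
    print('%d: rss=%5.3f evar=%5.3f cophenetic=%5.3f'
          % (rank, vals['rss'], vals['evar'], vals['cophenetic']))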

ICM with Random C initialization and a user-passed initialization callback function.

import numpy as np

import nimfa

V = np.array([[1, 2, 3], [4, 5, 6], [6, 7, 8]])
print('Target:\n%s' % V)

# Initialization callback function
def init_info(model):
    print('Initialized basis matrix:\n%s' % model.basis())
    print('Initialized mixture matrix:\n%s' % model.coef())

# Callback is called after initialization and prior to factorization in each run
icm = nimfa.Icm(V, seed='random_c', max_iter=10, rank=3, callback_init=init_info)
icm_fit = icm()

W = icm_fit.basis()
print('Basis matrix:\n%s' % W)

H = icm_fit.coef()
print('Mixture matrix:\n%s' % H)

sm = icm_fit.summary()
print('Rss: %5.3f' % sm['rss'])
print('Evar: %5.3f' % sm['evar'])
print('Iterations: %d' % sm['n_iter'])
print('K-L divergence: %5.3f' % sm['kl'])
print('Euclidean distance: %5.3f' % sm['euclidean'])

BMF with default parameters, multiple runs and factor tracking for consensus matrix computation.

import numpy as np

import nimfa

V = np.random.rand(23, 200)

# Factorization will be run 3 times (n_run) and factors will be tracked for computing
# cophenetic correlation. Note the increased time and space complexity.
bmf = nimfa.Bmf(V, max_iter=10, rank=30, n_run=3, track_factor=True)
bmf_fit = bmf()

print('K-L divergence: %5.3f' % bmf_fit.distance(metric='kl'))

sm = bmf_fit.summary()
print('Rss: %5.3f' % sm['rss'])
print('Evar: %5.3f' % sm['evar'])
print('Iterations: %d' % sm['n_iter'])
print('Cophenetic correlation: %5.3f' % sm['cophenetic'])

Standard NMF - Euclidean update equations and fixed initialization (passed matrix factors).

import numpy as np

import nimfa

V = np.random.rand(30, 20)

init_W = np.random.rand(30, 4)
init_H = np.random.rand(4, 20)

# Fixed initialization of latent matrices
nmf = nimfa.Nmf(V, seed='fixed', W=init_W, H=init_H, rank=4)
nmf_fit = nmf()

print("Euclidean distance: %5.3f" % nmf_fit.distance(metric="euclidean"))
print('Initialization type: %s' % nmf_fit.seeding)
print('Iterations: %d' % nmf_fit.n_iter)

References

[Schmidt2009] Mikkel N. Schmidt, Ole Winther and Lars K. Hansen. Bayesian non-negative matrix factorization. In Proceedings of the 9th International Conference on Independent Component Analysis and Signal Separation, pages 540-547, Paraty, Brazil, 2009.
[Zhang2007] Zhongyuan Zhang, Tao Li, Chris H. Q. Ding and Xiangsun Zhang. Binary matrix factorization with applications. In Proceedings of the 7th IEEE International Conference on Data Mining, pages 391-400, Omaha, USA, 2007.
[Wang2004] Yuan Wang, Yunde Jia, Changbo Hu and Matthew Turk. Fisher non-negative matrix factorization for learning local features. In Proceedings of the 6th Asian Conference on Computer Vision, pages 27-30, Jeju, Korea, 2004.
[Li2001] Stan Z. Li, Xinwen Hou, Hongjiang Zhang and Qian S. Cheng. Learning spatially localized, parts-based representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 207-212, Kauai, USA, 2001.
[Lin2007] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10): 2756-2779, 2007.
[Lee2001] Daniel D. Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In Proceedings of Neural Information Processing Systems, pages 556-562, Vancouver, Canada, 2001.
[Lee1999] Daniel D. Lee and H. Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755): 788-791, 1999.
[Brunet2004] Jean-P. Brunet, Pablo Tamayo, Todd R. Golub and Jill P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the National Academy of Sciences of the USA, 101(12): 4164-4169, 2004.
[Montano2006] Alberto Pascual-Montano, J. M. Carazo, Kieko Kochi, Dietrich Lehmann and Roberto D. Pascual-Marqui. Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3): 403-415, 2006.
[Laurberg2008] Hans Laurberg, Mads G. Christensen, Mark D. Plumbley, Lars K. Hansen and Soren H. Jensen. Theorems on positive data: on the uniqueness of NMF. Computational Intelligence and Neuroscience, doi: 10.1155/2008/764206, 2008.
[Hansen2008] Lars K. Hansen. Generalization in high-dimensional factor models. Web: http://www.stanford.edu/group/mmds/slides2008/hansen.pdf, 2008.
[Dueck2005] Delbert Dueck, Quaid D. Morris and Brendan J. Frey. Multi-way clustering of microarray data using probabilistic sparse matrix factorization. Bioinformatics, 21(Suppl 1): 144-151, 2005.
[Dueck2004] Delbert Dueck and Brendan J. Frey. Probabilistic sparse matrix factorization. Technical Report PSI-2004-23, Probabilistic and Statistical Inference Group, University of Toronto, 2004.
[Srebro2001] Nathan Srebro and Tommi Jaakkola. Sparse matrix factorization of gene expression data. Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 2001.
[Li2007] Huan Li, Yu Sun and Ming Zhan. The discovery of transcriptional modules by a two-stage matrix decomposition approach. Bioinformatics, 23(4): 473-479, 2007.
[Park2007] Hyunsoo Kim and Haesun Park. Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics, 23(12): 1495-1502, 2007.
[Zhang2011] Shihua Zhang, Qingjiao Li and Xianghong J. Zhou. A novel computational framework for simultaneous integration of multiple types of genomic data to identify microRNA-gene regulatory modules. Bioinformatics, 27(13): 401-409, 2011.
[Boutsidis2007] Christos Boutsidis and Efstratios Gallopoulos. SVD-based initialization: A head start for nonnegative matrix factorization. Pattern Recognition, 41(4): 1350-1362, 2008.
[Albright2006] Russell Albright, Carl D. Meyer and Amy N. Langville. Algorithms, initializations, and convergence for the nonnegative matrix factorization. NCSU Technical Report Math 81706, NC State University, Raleigh, USA, 2006.
[Hoyer2004] Patrik O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5: 1457-1469, 2004.
[Hutchins2008] Lucie N. Hutchins, Sean P. Murphy, Priyam Singh and Joel H. Graber. Position-dependent motif characterization using non-negative matrix factorization. Bioinformatics, 24(23): 2684-2690, 2008.
[Frigyesi2008] Attila Frigyesi and Mattias Hoglund. Non-negative matrix factorization for the analysis of complex gene expression data: identification of clinically relevant tumor subtypes. Cancer Informatics, 6: 275-292, 2008.
[FWang2008] Fei Wang, Tao Li and Changshui Zhang. Semi-supervised clustering via matrix factorization. In Proceedings of the SIAM International Conference on Data Mining, pages 1-12, 2008.
[Schietgat2010] Leander Schietgat, Celine Vens, Jan Struyf, Hendrik Blockeel, Dragi Kocev and Saso Dzeroski. Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics, 11(2), 2010.
[Schachtner2008] R. Schachtner, D. Lutter, P. Knollmueller, A. M. Tome, F. J. Theis, G. Schmitz, M. Stetter, P. Gomez Vilda and E. W. Lang. Knowledge-based gene expression classification via matrix factorization. Bioinformatics, 24(15): 1688-1697, 2008.
[Damle2014] Anil Damle and Yuekai Sun. A geometric approach to archetypal analysis and non-negative matrix factorization. arXiv preprint arXiv:1405.4275, 2014.
[Benson2014] Austin R. Benson, Jason D. Lee, Bartek Rajwa and David F. Gleich. Scalable methods for nonnegative matrix factorizations of near-separable tall-and-skinny matrices. In Proceedings of Neural Information Processing Systems, pages 945-953, 2014.
[Kumar2013] Abhishek Kumar, Vikas Sindhwani and Prabhanjan Kambadur. Fast conical hull algorithms for near-separable non-negative matrix factorization. In Proceedings of the International Conference on Machine Learning, pages 231-239, 2013.
[Gillis2014] Nicolas Gillis and Stephen A. Vavasis. Fast and robust recursive algorithms for separable nonnegative matrix factorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(4): 698-714, 2014.
[Tepper2015] Mariano Tepper and Guillermo Sapiro. Compressed nonnegative matrix factorization is fast and accurate. IEEE Transactions on Signal Processing, 64(9): 2269-2283, 2016.
[Kapralov2016] Michael Kapralov, Vamsi Potluru and David Woodruff. How to fake multiply by a Gaussian matrix. In Proceedings of the International Conference on Machine Learning, pages 2101-2110, 2016.

Citation

@article{Zitnik2012,
  title     = {Nimfa: A Python Library for Nonnegative Matrix Factorization},
  author    = {Zitnik, Marinka and Zupan, Blaz},
  journal   = {Journal of Machine Learning Research},
  volume    = {13},
  pages     = {849-853},
  year      = {2012}
}

Disclaimer

This software and data are provided as-is, and there are no guarantees that they fit your purposes or that they are bug-free. Use them at your own risk!