NMF


$\text{The interpretation of } W \text{ is that each column is a basis element}$.

$\text{The interpretation of } H \text{ is that each column gives the ‘coordinates of a data point’ in the basis}$

Features matrix for W, and coefficients matrix for H

Clustering property

It automatically clusters the columns of input data V. If we impose an orthogonality constraint on H matrix, i.e. $HH^T=I$, then the above minimization is mathematically equivalent to the minimization of K-means clustering. Furthermore, the computed H gives the cluster membership, i.e., if $H_{kj}>H_{ij}$ for all $i \not= k$, this suggests that the input data $v_j$ belongs to k-th cluster.

Convex non-negative matrix factorization

In standard NMF, matrix factor $W \in R_{+}^{m \times k}$, i.e., W can be anything in that space. Convex NMF restricts the columns of W to convex combinations of the input data vectors. This greatly improves the quality of data representation of W. Furthermore, the resulting matrix factor H becomes more sparse and orthogonal.

Cost functions and regularizations

minimize the function $F(W, H)=||V - WH||^2_F$

Online NMF

Collaborative filtering in recommendation systems, where there may be many users and many items to recommend, and it would be inefficient to recalculate everything when one user or one item is added to the system.

Algorithms

  1. Initialize: W and H non negative
  2. Update until W and H are stable

The two multiplicative factors for W and H are matrices of ones when $V = WH$

Alternative approach:

  1. H is fixed and W found by a non-negative least squares solver, then W is fixed and H is found analogously.

Sequential NMF

THe contributioin from the NMF components ranked empirically when they are constructed one by one (sequentially), i.e., learn the (n+1)-th component with the first n components constructed.

Exact NMF

Exact solutions for the variants of NMF can be expected (in polynomial time) when additional constraints hold for matrix V. A polynomial time algorithm for solving nonnegative rank factorization if V contains a monomial sub matrix of rank equal to its rank was given by Compbell and Poole in 1981.

Application in Bioinformatics

NMF has been successfully applied in bioinformatics for clustering gene expression and DNA methylation data and finding the genes most representative of the clusters. In the analysis of cancer mutations it has been used to identify common patterns of mutations that occur in many cancers and that probably have distinct causes. NMF techniques can identify sources of variation such as cell types, disease subtypes, population stratification, tissue composition, and tumor clonality.

Implement in cellchat selectK

Cophenetic

In the clustering of biological information such as data from microarray experiments, the cophenetic similarity or cophenetic distance of two object is a measure of how similar those two objects have to be grouped into the same cluster. The cophenetic distance between two objects is the height of the dendrogram where the two branches that include the two objects merge into a single branch. Outside the context of a dendrogram, it is the distance between the largest two clusters that contains the two objects individually when they are merged into a single cluster that contains both.

Cophenetic correlation coefficient:

  • $x(i,j)=|X_i-X_j|$, the Euclidean distance between the ith and jth observations
  • $t(i,j)$, the dendrogrammatic distance between the model points Ti and Tj. This distance is the height of the node at which these two points are first joined together.

Author: Wulilichao
Reprint policy: All articles in this blog are used except for special statements CC BY 4.0 reprint policy. If reproduced, please indicate source Wulilichao !
  TOC