# (C) Mathieu Blondel, November 2013
# License: BSD 3 clause

import numpy as np
from scipy.linalg import svd


def frequent_directions(A, ell, verbose=False):
    """
    Return the sketch of matrix A.

    Parameters
    ----------
    A : array-like, shape = [n_samples, n_features]
        Input matrix.

    ell : int
        Number of rows in the matrix sketch.

    Returns
    -------
    B : array-like, shape = [ell, n_features]
        Matrix sketch such that B^T B approximates A^T A.

    Reference
    ---------
    Edo Liberty, "Simple and Deterministic Matrix Sketching", ACM SIGKDD, 2013.
    """
    if ell % 2 == 1:
        raise ValueError("ell must be an even number.")

    n_samples, n_features = A.shape

    if ell > n_samples:
        raise ValueError("ell must not be greater than n_samples.")

    if ell > n_features:
        raise ValueError("ell must not be greater than n_features.")
    B = np.zeros((ell, n_features), dtype=np.float64)
    ind = np.arange(ell)

    for i in range(n_samples):
        # Rows of B that are still (numerically) all zero.
        zero_rows = ind[np.sum(np.abs(B) <= 1e-12, axis=1) == n_features]

        if len(zero_rows) >= 1:
            # The sketch has room left: insert the row directly.
            B[zero_rows[0]] = A[i]
        else:
            # The sketch is full: shrink the singular values, which zeroes
            # out at least half of the rows, then insert the current row
            # into a freed-up row, as in the paper's pseudocode.
            U, sigma, V = svd(B, full_matrices=False)
            delta = sigma[ell // 2 - 1] ** 2
            sigma = np.sqrt(np.maximum(sigma ** 2 - delta, 0))
            B = np.dot(np.diag(sigma), V)
            B[ell // 2 - 1] = A[i]
        if verbose:
            # Track how well B^T B approximates A^T A so far.
            AA = np.dot(A.T, A)
            BB = np.dot(B.T, B)
            error = np.sum((AA - BB) ** 2)
            if i == 0:
                error_first = error
            print(error / error_first)

    return B


if __name__ == '__main__':
    np.random.seed(0)
    A = np.random.random((100, 20))
    B = frequent_directions(A, 10, verbose=True)
    print(B.shape)
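As a quick sanity check of the sketch property (this snippet is an illustration added here, not part of the original gist), the relative error between A^T A and B^T B should generally shrink as ell grows:

import numpy as np

np.random.seed(0)
A = np.random.random((100, 20))

for ell in (4, 8, 12, 16):
    B = frequent_directions(A, ell)
    AA = np.dot(A.T, A)
    rel = np.linalg.norm(AA - np.dot(B.T, B)) / np.linalg.norm(AA)
    print(ell, rel)  # relative Frobenius error, expected to decrease with ell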
Interesting. Isn't the paper about an online ("streaming") algorithm, however?
@samueljohn Added a license header, thanks!
@zygmuntz Indeed. I guess A could be an iterator instead of a NumPy array.
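Sketching that idea (my own illustration, not from the gist): the update only ever needs one row at a time, so the same logic works over any iterable of rows, e.g.:

import numpy as np
from scipy.linalg import svd

def frequent_directions_stream(rows, ell, n_features):
    # Same algorithm, but `rows` can be any iterable of
    # length-n_features vectors (e.g. a generator), so the
    # full data matrix never has to fit in memory.
    if ell % 2 == 1:
        raise ValueError("ell must be an even number.")
    B = np.zeros((ell, n_features), dtype=np.float64)
    ind = np.arange(ell)
    for row in rows:
        zero_rows = ind[np.sum(np.abs(B) <= 1e-12, axis=1) == n_features]
        if len(zero_rows) >= 1:
            B[zero_rows[0]] = row
        else:
            U, sigma, V = svd(B, full_matrices=False)
            delta = sigma[ell // 2 - 1] ** 2
            sigma = np.sqrt(np.maximum(sigma ** 2 - delta, 0))
            B = np.dot(np.diag(sigma), V)
            B[ell // 2 - 1] = row  # insert into a freed-up row
    return B

# e.g. B = frequent_directions_stream((row for row in A), 10, A.shape[1])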
@mblondel Out of curiosity, have you been able to leverage such sketching successfully to speed up some machine-learning-related task?
@ogrisel Nope. My main motivation for implementing sketching was to visualize what kind of points it learns. In particular, I was hoping that it would learn "representative" points. However, this method is based on the SVD, so the number of points selected can be at most min(n_samples, n_features). For 2d data, this means that it can choose at most 2 points...
Another remark is that it approximates the covariance matrix A^T A, so it is most suited for speeding up algorithms that rely on the covariance matrix, such as PCA.
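For instance, a rough sketch of what that could look like for PCA (illustrative only; the data and the alignment check are arbitrary choices, not from the gist):

import numpy as np

np.random.seed(0)
A = np.random.random((100, 20))
A -= A.mean(axis=0)  # PCA works on centered data
B = frequent_directions(A, 10)

# Principal directions from the exact covariance and from the sketch.
_, vecs_exact = np.linalg.eigh(np.dot(A.T, A))
_, vecs_sketch = np.linalg.eigh(np.dot(B.T, B))

# Compare the top principal direction (eigh sorts eigenvalues ascending);
# the absolute dot product is 1 for perfect alignment, up to sign.
print(abs(np.dot(vecs_exact[:, -1], vecs_sketch[:, -1])))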
Overall, I am not so convinced that this is a useful method in practice.
Alright thanks.
This is amazing.
Beautiful. License?