SVD Imputation
Often matrices and n-dimensional arrays with missing values can be imputed via a low rank approximation. Impute.jl provides one such method using a singular value decomposition (SVD). The general idea is to:
1. Fill the missing values with some rough approximates (e.g., mean, median, rand)
2. Reconstruct this "completed" matrix with a low rank SVD approximation (i.e., k largest singular values)
3. Replace our initial estimates with the reconstructed values
4. Repeat steps 1-3 until convergence (update difference is below a tolerance)
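The steps above can be sketched in plain Julia using only the LinearAlgebra and Statistics standard libraries. This is a minimal illustration under stated assumptions, not Impute.jl's implementation: the function name `svd_impute_sketch`, its keyword arguments, the mean-based initial fill, and the relative-change stopping rule are all hypothetical choices made for the example.

```julia
using LinearAlgebra, Statistics

# Hypothetical helper (not part of Impute.jl): iterative low rank SVD imputation.
# Assumes `X` is a matrix of Float64 values, possibly mixed with `missing`,
# and that at least one value is observed.
function svd_impute_sketch(X::AbstractMatrix; k::Int=1, tol::Real=1e-6, maxiter::Int=100)
    mask = ismissing.(X)                      # true where a value must be imputed
    # Step 1: rough initial fill using the mean of the observed values.
    filled = Float64.(coalesce.(X, mean(skipmissing(X))))
    r = min(k, minimum(size(filled)))         # rank cannot exceed the smaller dimension

    for _ in 1:maxiter
        # Step 2: rank-r reconstruction from the r largest singular values.
        F = svd(filled)
        approx = F.U[:, 1:r] * Diagonal(F.S[1:r]) * F.Vt[1:r, :]

        # Relative change of the imputed entries between iterations.
        delta = norm(approx[mask] - filled[mask]) / max(norm(filled[mask]), eps())

        # Step 3: overwrite only the missing positions; observed values stay fixed.
        filled[mask] = approx[mask]

        # Step 4: stop once the update difference is below the tolerance.
        delta < tol && break
    end
    return filled
end

# Usage on a small rank-1 matrix with one entry removed.
truth = [1.0, 2.0, 3.0, 4.0] * [1.0, 0.5, 2.0]'
X = Matrix{Union{Missing, Float64}}(truth)
X[2, 3] = missing
Y = svd_impute_sketch(X; k=1)
```

Only the masked positions are ever overwritten, so the observed entries of `Y` match `X` exactly. How closely the imputed entry approaches the true value depends on `k`, `tol`, and how much low rank structure the data has.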
To demonstrate how this is useful, let’s load a reduced MNIST dataset. We’ll want both the completed dataset and another dataset with 25% of the values set to -1.0 (indicating missingness).
TODO: Update example with a more realistic dataset, such as microarray data
using Distances, Impute, Plots, Statistics, DataFrames, CSV

mnist = Impute.dataset("test/matrix/mnist");
completed, incomplete = mnist[0.0], mnist[0.25];
Alright, before we get started, let’s have a look at what our incomplete data looks like:
heatmap(incomplete; color=:greys);
Okay, so as we’d expect, there’s a reasonable bit of structure we can exploit. So how does the SVD method compare against other common, yet simpler, methods?
data = Impute.declaremissings(incomplete; values=-1.0)

# NOTE: SVD performance is almost identical regardless of the `init` setting.
imputors = [
    "0.5" => Impute.Replace(; values=0.5),
    "median" => Impute.Substitute(),
    "svd" => Impute.SVD(; tol=1e-2),
]

results = map(last.(imputors)) do imp
    r = Impute.impute(data, imp; dims=:)
    return nrmsd(completed, r)
end

bar(first.(imputors), results);
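The comparison above scores each imputor with `nrmsd`, where lower is better. As a rough guide to what that score measures, here is a minimal sketch that assumes the usual definition of normalized root-mean-square deviation: the RMSD divided by the range of the reference values. The name `nrmsd_sketch` is hypothetical and is shown only for illustration; the example itself relies on the `nrmsd` function from Distances.jl, whose exact behaviour should be checked against that package's documentation.

```julia
using Statistics

# Hypothetical re-implementation for illustration only.
# Assumes nrmsd = RMSD / (max(reference) - min(reference)).
function nrmsd_sketch(reference::AbstractArray, estimate::AbstractArray)
    rmsd = sqrt(mean(abs2, reference .- estimate))   # root-mean-square deviation
    return rmsd / (maximum(reference) - minimum(reference))
end

reference = [0.0, 1.0, 2.0]
perfect = nrmsd_sketch(reference, reference)         # identical arrays
shifted = nrmsd_sketch(reference, reference .+ 1.0)  # every value off by 1
```

Under this definition a perfect reconstruction scores 0, and the normalization by the data range makes scores comparable across datasets with different scales.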