Imputation

Страница в процессе перевода.

# Impute.Imputor — Type

Imputor

An imputor stores information about imputing values in AbstractArrays and Tables.tables. New imputation methods are expected to subtype Imputor and, at minimum, implement the _impute!(data::AbstractArrays, imp::<MyImputor>) method.

While fallback impute and impute! methods are provided to extend your _impute! methods to n-dimensional arrays and tables, you can always override these methods to change the behaviour as necessary.

# Impute.impute! — Method

impute!(data::AbstractArray, imp) -> data

Just returns the data when the array doesn’t contain missings

# Impute.impute! — Method

impute!(data::AbstractArray{Missing}, imp) -> data

Just return the data when the array only contains missings

# Impute.impute! — Method

impute!(data::A, imp; dims=:, kwargs...) -> A

Impute the missing values in the array data using the imputor imp. Optionally, you can specify the dimension to impute along.

Arguments

data::AbstractArray{Union{T, Missing}}: the data to be impute along dimensions dims
imp::Imputor: the Imputor method to use

Keyword Arguments

dims=:: The dimension to impute along. :rows and :cols are also supported for matrices.

Returns

AbstractArray{Union{T, Missing}}: the input data with values imputed

NOTES

Matrices have a deprecated dims=2 special case as dims=: is a breaking change
Mutation isn’t guaranteed for all array types, hence we return the result
eachslice is used internally which requires Julia 1.1

Example

julia> using Impute: Interpolate, impute!

julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0   missing  missing  5.0
 1.1  2.2  3.3       missing  5.5

julia> impute!(M, Interpolate(); dims=1)
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0  3.0  4.0  5.0
 1.1  2.2  3.3  4.4  5.5

julia> M
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0  3.0  4.0  5.0
 1.1  2.2  3.3  4.4  5.5

# Impute.impute! — Method

impute!(table, imp; cols=nothing) -> table

Imputes the data in a table by imputing the values 1 column at a time; if this is not the desired behaviour custom imputor methods should overload this method.

Arguments

imp::Imputor: the Imputor method to use
table: the data to impute

Keyword Arguments

cols: The columns to impute along (default is to impute all columns)

Returns

the input data with values imputed

Example

julia> using DataFrames; using Impute: Interpolate, impute


julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
 Row │ a          b
     │ Float64?   Float64?
─────┼──────────────────────
   1 │       1.0        1.1
   2 │       2.0        2.2
   3 │ missing          3.3
   4 │ missing    missing
   5 │       5.0        5.5

julia> impute(df, Interpolate())
5×2 DataFrame
 Row │ a         b
     │ Float64?  Float64?
─────┼────────────────────
   1 │      1.0       1.1
   2 │      2.0       2.2
   3 │      3.0       3.3
   4 │      4.0       4.4
   5 │      5.0       5.5

# Impute.impute! — Method

impute!(data::T, imp; kwargs...) -> T where T <: AbstractVector{<:NamedTuple}

Special case rowtables which are arrays, but we want to fallback to the tables method.

# Impute.impute — Method

impute(data::T, imp; kwargs...) -> T

Returns a new copy of the data with the missing data imputed by the imputor imp. For matrices and tables, data is imputed one variable/column at a time. If this is not the desired behaviour then you should overload this method or specify a different dims value.

Arguments

data: the data to be impute
imp::Imputor: the Imputor method to use

Returns

the input data with values imputed

Example

julia> using Impute: Interpolate, impute

julia> v = [1.0, 2.0, missing, missing, 5.0]
5-element Vector{Union{Missing, Float64}}:
 1.0
 2.0
  missing
  missing
 5.0

julia> impute(v, Interpolate())
5-element Vector{Union{Missing, Float64}}:
 1.0
 2.0
 3.0
 4.0
 5.0

Interpolation

# Impute.interp — Function

Impute.interp(data; dims=1)

Performs linear interpolation between the nearest values in an vector. See Impute.Interpolate for details.

Example

julia> using DataFrames; using Impute: Impute


julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
 Row │ a          b
     │ Float64?   Float64?
─────┼──────────────────────
   1 │       1.0        1.1
   2 │       2.0        2.2
   3 │ missing          3.3
   4 │ missing    missing
   5 │       5.0        5.5

julia> Impute.interp(df)
5×2 DataFrame
 Row │ a         b
     │ Float64?  Float64?
─────┼────────────────────
   1 │      1.0       1.1
   2 │      2.0       2.2
   3 │      3.0       3.3
   4 │      4.0       4.4
   5 │      5.0       5.5

# Impute.Interpolate — Type

Interpolate(; limit=nothing)

Performs linear interpolation between the nearest values in an vector. The current implementation is univariate, so each variable in a table or matrix will be handled independently.

!!! Missing values at the head or tail of the array cannot be interpolated if there are no existing values on both sides. As a result, this method does not guarantee that all missing values will be imputed.

Keyword Arguments

limit::Union{UInt, Nothing}: Optionally limit the gap sizes that can be interpolated.

Example

julia> using Impute: Interpolate, impute

julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
 1.0  2.0       missing   missing   missing  6.0
 1.1   missing  missing  4.4       5.5       6.6

julia> impute(M, Interpolate(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
 1.0  2.0  3.0  4.0  5.0  6.0
 1.1  2.2  3.3  4.4  5.5  6.6

julia> impute(M, Interpolate(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
 1.0  2.0   missing   missing   missing  6.0
 1.1  2.2  3.3       4.4       5.5       6.6

K-Nearest Neighbors (KNN)

# Impute.knn — Function

Impute.knn(; k=1, threshold=0.5, dist=Euclidean())

Imputation using k-Nearest Neighbor algorithm.

Keyword Arguments

k::Int: number of nearest neighbors
dist::MinkowskiMetric: distance metric suppports by NearestNeighbors.jl (Euclidean, Chebyshev, Minkowski and Cityblock)
threshold::AbsstractFloat: thershold for missing neighbors

Reference

Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.

Example

julia> using Impute, Missings

julia> data = allowmissing(reshape(sin.(1:20), 5, 4)); data[[2, 3, 7, 9, 13, 19]] .= missing; data
5×4 Matrix{Union{Missing, Float64}}:
  0.841471  -0.279415  -0.99999   -0.287903
   missing    missing  -0.536573  -0.961397
   missing   0.989358    missing  -0.750987
 -0.756802    missing   0.990607    missing
 -0.958924  -0.544021   0.650288   0.912945

julia> result = Impute.knn(data; dims=:cols)
5×4 Matrix{Union{Missing, Float64}}:
  0.841471  -0.279415  -0.99999   -0.287903
 -0.756802   0.989358  -0.536573  -0.961397
 -0.756802   0.989358  -0.536573  -0.750987
 -0.756802  -0.544021   0.990607   0.912945
 -0.958924  -0.544021   0.650288   0.912945

# Impute.KNN — Type

KNN(; kwargs...)

Imputation using k-Nearest Neighbor algorithm.

Keyword Arguments

k::Int: number of nearest neighbors
dist::MinkowskiMetric: distance metric suppports by NearestNeighbors.jl (Euclidean, Chebyshev, Minkowski and Cityblock)
threshold::AbstractFloat: threshold for missing neighbors

Reference

Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.

Last Observation Carried Forward (LOCF)

# Impute.locf — Function

Impute.locf(data; dims=1)

Iterates forwards through the data and fills missing data with the last existing observation. See Impute.LOCF for details.

Example

julia> using DataFrames; using Impute: Impute


julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
 Row │ a          b
     │ Float64?   Float64?
─────┼──────────────────────
   1 │       1.0        1.1
   2 │       2.0        2.2
   3 │ missing          3.3
   4 │ missing    missing
   5 │       5.0        5.5

julia> Impute.locf(df)
5×2 DataFrame
 Row │ a         b
     │ Float64?  Float64?
─────┼────────────────────
   1 │      1.0       1.1
   2 │      2.0       2.2
   3 │      2.0       3.3
   4 │      2.0       3.3
   5 │      5.0       5.5

# Impute.LOCF — Type

LOCF(; limit=nothing)

Last observation carried forward (LOCF) iterates forwards through the data and fills missing data with the last existing observation. The current implementation is univariate, so each variable in a table or matrix will be handled independently.

Next Observation Carried Backward (NOCB)

# Impute.nocb — Function

Impute.nocb(data; dims=1)

Iterates backwards through the data and fills missing data with the next existing observation. See Impute.NOCB for details.

Example

julia> using DataFrames; using Impute: Impute


julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
 Row │ a          b
     │ Float64?   Float64?
─────┼──────────────────────
   1 │       1.0        1.1
   2 │       2.0        2.2
   3 │ missing          3.3
   4 │ missing    missing
   5 │       5.0        5.5

julia> Impute.nocb(df)
5×2 DataFrame
 Row │ a         b
     │ Float64?  Float64?
─────┼────────────────────
   1 │      1.0       1.1
   2 │      2.0       2.2
   3 │      5.0       3.3
   4 │      5.0       5.5
   5 │      5.0       5.5

# Impute.NOCB — Type

NOCB(; limit=nothing)

Next observation carried backward (NOCB) iterates backwards through the data and fills missing data with the next existing observation.

Replacement

# Impute.replace — Function

Impute.replace(data; values)

Replace missings with one of the specified constant values, depending on the input type. If multiple values of the same type are provided then the first one will be used. If the input data is of a different type then the no replacement will be performed.

Keyword Arguments

values::Tuple: A scalar or tuple of different values that should be used to replace missings. Typically, one value per type you’re considering imputing for.

Example

julia> using DataFrames, Impute

julia> df = DataFrame(
           :a => [1.1, 2.2, missing, missing, 5.5],
           :b => [1, 2, 3, missing, 5],
           :c => ["v", "w", "x", "y", missing],
       )
5×3 DataFrame
 Row │ a          b        c
     │ Float64?   Int64?   String?
─────┼─────────────────────────────
   1 │       1.1        1  v
   2 │       2.2        2  w
   3 │ missing          3  x
   4 │ missing    missing  y
   5 │       5.5        5  missing

julia> Impute.replace(df; values=(NaN, -9999, "NULL"))
5×3 DataFrame
 Row │ a         b       c
     │ Float64?  Int64?  String?
─────┼───────────────────────────
   1 │      1.1       1  v
   2 │      2.2       2  w
   3 │    NaN         3  x
   4 │    NaN     -9999  y
   5 │      5.5       5  NULL

# Impute.Replace — Type

Replace(; value)

Keyword Arguments

values::Tuple: A scalar or tuple of different values that should be used to replace missings. Typically, one value per type you’re considering imputing for.

Example

julia> using Impute: Replace, impute

julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0   missing  missing  5.0
 1.1  2.2  3.3       missing  5.5

julia> impute(M, Replace(; values=0.0); dims=2)
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0  0.0  0.0  5.0
 1.1  2.2  3.3  0.0  5.5

Simple Random Sample (SRS)

# Impute.srs — Function

Impute.srs(data; rng=Random.GLOBAL_RNG)

Simple Random Sampling (SRS) imputation is a method for imputing both continuous and categorical variables. Furthermore, it completes imputation while preserving the distributional properties of the variables (e.g., mean, standard deviation).

Example

julia> using DataFrames; using Random; using Impute: Impute

julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
│ Row │ a        │ b        │
│     │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1   │ 1.0      │ 1.1      │
│ 2   │ 2.0      │ 2.2      │
│ 3   │ missing  │ 3.3      │
│ 4   │ missing  │ missing  │
│ 5   │ 5.0      │ 5.5      │

julia> Impute.srs(df; rng=MersenneTwister(1234))
5×2 DataFrame
│ Row │ a        │ b        │
│     │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1   │ 1.0      │ 1.1      │
│ 2   │ 2.0      │ 2.2      │
│ 3   │ 1.0      │ 3.3      │
│ 4   │ 2.0      │ 3.3      │
│ 5   │ 5.0      │ 5.5      │

# Impute.SRS — Type

SRS(; rng=Random.GLOBAL_RNG)

The basic idea is that for a given variable, x, with missing data, we randomly draw from the observed values of x to impute the missing elements. Since the random draws from x for imputation are done in proportion to the frequency distribution of the values in x, the univariate distributional properties are generally not impacted; this is true for both categorical and continuous data.

Keyword Arguments

rng::AbstractRNG: A random number generator to use for observation selection

Example

julia> using Random; using Impute: SRS, impute

julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0   missing  missing  5.0
 1.1  2.2  3.3       missing  5.5

julia> impute(M, SRS(; rng=MersenneTwister(1234)); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0  1.0  2.0  5.0
 1.1  2.2  3.3  3.3  5.5

Substitute

# Impute.substitute — Function

Impute.substitute(data; statistic=nothing)
Impute.substitute(data; weights=nothing)

Substitute missing values with a summary statistic over the non-missing values.

Keyword Arguments

statistic: A summary statistic function to be applied to the non-missing values. This function should return a value of the same type as the input data eltype. If this function isn’t passed in then the defaultstats function is used to make a best guess.
weights: A set of statistical weights to apply to the mean or median in defaultstats.

See Substitute for details on substitution rules defined in defaultstats.

Example

julia> using DataFrames, Impute


julia> df = DataFrame(
                  :a => [8.9, 2.2, missing, missing, 1.3, 6.2, 3.7, 4.8],
                  :b => [2, 6, 3, missing, 7, 1, 9, missing],
                  :c => [true, false, true, true, false, missing, false, true],
              )
8×3 DataFrame
 Row │ a          b        c
     │ Float64?   Int64?   Bool?
─────┼─────────────────────────────
   1 │       8.9        2     true
   2 │       2.2        6    false
   3 │ missing          3     true
   4 │ missing    missing     true
   5 │       1.3        7    false
   6 │       6.2        1  missing
   7 │       3.7        9    false
   8 │       4.8  missing     true

julia> Impute.substitute(df)
8×3 DataFrame
 Row │ a         b       c
     │ Float64?  Int64?  Bool?
─────┼─────────────────────────
   1 │     8.9        2   true
   2 │     2.2        6  false
   3 │     4.25       3   true
   4 │     4.25       4   true
   5 │     1.3        7  false
   6 │     6.2        1   true
   7 │     3.7        9  false
   8 │     4.8        4   true

# Impute.Substitute — Type

Substitute(; statistic=Impute.defaultstats)

Substitute missing values with a summary statistic over the non-missing values.

Keyword Arguments

statistic: A summary statistic function to be applied to the non-missing values. This function should return a value of the same type as the input data eltype. If this function isn’t passed in then the Impute.defaultstats function is used to make a best guess.

Example

julia> using Statistics; using Impute: Substitute, impute

julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0   missing  missing  5.0
 1.1  2.2  3.3       missing  5.5

julia> impute(M, Substitute(); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0  2.0  2.0   5.0
 1.1  2.2  3.3  2.75  5.5

julia> impute(M, Substitute(; statistic=mean); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0  2.66667  2.66667  5.0
 1.1  2.2  3.3      3.025    5.5

# Impute.defaultstats — Function

defaultstats(data[, wv])

A set of default substitution rules using either median or mode based on the eltype of the input data. Specific rules are summarized as follows.

Bool elements use mode
Real elements use median
Integer elements where nunique(data) / length(data) < 0.25 use mode (ratings, categorical codings, etc)
Integer elements with mostly unique values use median
!Number (non-numeric) elements use mode as the safest fallback

SVD

# Impute.svd — Function

Impute.svd(; kwargs...)

Imputes the missing values in a matrix using an expectation maximization (EM) algorithm over low-rank SVD approximations.

Keyword Arguments

init::Imputor: initialization method for missing values (default: Substitute())
rank::Union{Int, Nothing}: rank of the SVD approximation (default: nothing meaning start and 0 and increase)
tol::Float64: convergence tolerance (default: 1e-10)
maxiter::Int: Maximum number of iterations if convergence is not achieved (default: 100)
limits::Unoin{Tuple{Float64, Float64}, Nothing}: Bound the possible approximation values (default: nothing)
verbose::Bool: Whether to display convergence progress (default: true)

References

Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.

Example

julia> using Impute, Missings

julia> data = allowmissing(reshape(sin.(1:20), 5, 4)); data[[2, 3, 7, 9, 13, 19]] .= missing; data
5×4 Matrix{Union{Missing, Float64}}:
  0.841471  -0.279415  -0.99999   -0.287903
   missing    missing  -0.536573  -0.961397
   missing   0.989358    missing  -0.750987
 -0.756802    missing   0.990607    missing
 -0.958924  -0.544021   0.650288   0.912945

julia> result = Impute.svd(data; dims=:cols)
5×4 Matrix{Union{Missing, Float64}}:
  0.841471  -0.279415  -0.99999   -0.287903
  0.220258   0.555829  -0.536573  -0.961397
 -0.372745   0.989358   0.533193  -0.750987
 -0.756802   0.253309   0.990607   0.32315
 -0.958924  -0.544021   0.650288   0.912945

# Impute.SVD — Type

SVD(; kwargs...)

Imputes the missing values in a matrix using an expectation maximization (EM) algorithm over low-rank SVD approximations.

Keyword Arguments

init::Imputor: initialization method for missing values (default: Substitute())
rank::Union{Int, Nothing}: rank of the SVD approximation (default: nothing meaning start and 0 and increase)
tol::Float64: convergence tolerance (default: 1e-10)
maxiter::Int: Maximum number of iterations if convergence is not achieved (default: 100)
limits::Union{Tuple{Float64, Float64}, Nothing}: Bound the possible approximation values (default: nothing)
verbose::Bool: Whether to display convergence progress (default: true)

References

Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.