Документация Engee

Imputation

Страница в процессе перевода.

Imputor

An imputor stores information about imputing values in AbstractArrays and Tables.tables. New imputation methods are expected to subtype Imputor and, at minimum, implement the _impute!(data::AbstractArrays, imp::<MyImputor>) method.

While fallback impute and impute! methods are provided to extend your _impute! methods to n-dimensional arrays and tables, you can always override these methods to change the behaviour as necessary.

impute!(data::AbstractArray, imp) -> data

Just returns the data when the array doesn’t contain missings

impute!(data::AbstractArray{Missing}, imp) -> data

Just return the data when the array only contains missings

impute!(data::A, imp; dims=:, kwargs...) -> A

Impute the missing values in the array data using the imputor imp. Optionally, you can specify the dimension to impute along.

Arguments

  • data::AbstractArray{Union{T, Missing}}: the data to be impute along dimensions dims

  • imp::Imputor: the Imputor method to use

Keyword Arguments

  • dims=:: The dimension to impute along. :rows and :cols are also supported for matrices.

Returns

  • AbstractArray{Union{T, Missing}}: the input data with values imputed

NOTES

  1. Matrices have a deprecated dims=2 special case as dims=: is a breaking change

  2. Mutation isn’t guaranteed for all array types, hence we return the result

  3. eachslice is used internally which requires Julia 1.1

Example

julia> using Impute: Interpolate, impute!

julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0   missing  missing  5.0
 1.1  2.2  3.3       missing  5.5

julia> impute!(M, Interpolate(); dims=1)
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0  3.0  4.0  5.0
 1.1  2.2  3.3  4.4  5.5

julia> M
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0  3.0  4.0  5.0
 1.1  2.2  3.3  4.4  5.5
impute!(table, imp; cols=nothing) -> table

Imputes the data in a table by imputing the values 1 column at a time; if this is not the desired behaviour custom imputor methods should overload this method.

Arguments

  • imp::Imputor: the Imputor method to use

  • table: the data to impute

Keyword Arguments

  • cols: The columns to impute along (default is to impute all columns)

Returns

  • the input data with values imputed

Example

julia> using DataFrames; using Impute: Interpolate, impute


julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
 Row │ a          b
     │ Float64?   Float64?
─────┼──────────────────────
   1 │       1.0        1.1
   2 │       2.0        2.2
   3 │ missing          3.3
   4 │ missing    missing
   5 │       5.0        5.5

julia> impute(df, Interpolate())
5×2 DataFrame
 Row │ a         b
     │ Float64?  Float64?
─────┼────────────────────
   1 │      1.0       1.1
   2 │      2.0       2.2
   3 │      3.0       3.3
   4 │      4.0       4.4
   5 │      5.0       5.5
impute!(data::T, imp; kwargs...) -> T where T <: AbstractVector{<:NamedTuple}

Special case rowtables which are arrays, but we want to fallback to the tables method.

impute(data::T, imp; kwargs...) -> T

Returns a new copy of the data with the missing data imputed by the imputor imp. For matrices and tables, data is imputed one variable/column at a time. If this is not the desired behaviour then you should overload this method or specify a different dims value.

Arguments

  • data: the data to be impute

  • imp::Imputor: the Imputor method to use

Returns

  • the input data with values imputed

Example

julia> using Impute: Interpolate, impute

julia> v = [1.0, 2.0, missing, missing, 5.0]
5-element Vector{Union{Missing, Float64}}:
 1.0
 2.0
  missing
  missing
 5.0

julia> impute(v, Interpolate())
5-element Vector{Union{Missing, Float64}}:
 1.0
 2.0
 3.0
 4.0
 5.0

Interpolation

Impute.interp(data; dims=1)

Performs linear interpolation between the nearest values in an vector. See Impute.Interpolate for details.

Example

julia> using DataFrames; using Impute: Impute


julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
 Row │ a          b
     │ Float64?   Float64?
─────┼──────────────────────
   1 │       1.0        1.1
   2 │       2.0        2.2
   3 │ missing          3.3
   4 │ missing    missing
   5 │       5.0        5.5

julia> Impute.interp(df)
5×2 DataFrame
 Row │ a         b
     │ Float64?  Float64?
─────┼────────────────────
   1 │      1.0       1.1
   2 │      2.0       2.2
   3 │      3.0       3.3
   4 │      4.0       4.4
   5 │      5.0       5.5
Interpolate(; limit=nothing)

Performs linear interpolation between the nearest values in an vector. The current implementation is univariate, so each variable in a table or matrix will be handled independently.

!!! Missing values at the head or tail of the array cannot be interpolated if there are no existing values on both sides. As a result, this method does not guarantee that all missing values will be imputed.

Keyword Arguments

  • limit::Union{UInt, Nothing}: Optionally limit the gap sizes that can be interpolated.

Example

julia> using Impute: Interpolate, impute

julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
 1.0  2.0       missing   missing   missing  6.0
 1.1   missing  missing  4.4       5.5       6.6

julia> impute(M, Interpolate(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
 1.0  2.0  3.0  4.0  5.0  6.0
 1.1  2.2  3.3  4.4  5.5  6.6

julia> impute(M, Interpolate(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
 1.0  2.0   missing   missing   missing  6.0
 1.1  2.2  3.3       4.4       5.5       6.6

K-Nearest Neighbors (KNN)

Impute.knn(; k=1, threshold=0.5, dist=Euclidean())

Imputation using k-Nearest Neighbor algorithm.

Keyword Arguments

  • k::Int: number of nearest neighbors

  • dist::MinkowskiMetric: distance metric suppports by NearestNeighbors.jl (Euclidean, Chebyshev, Minkowski and Cityblock)

  • threshold::AbsstractFloat: thershold for missing neighbors

Reference

  • Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.

Example

julia> using Impute, Missings

julia> data = allowmissing(reshape(sin.(1:20), 5, 4)); data[[2, 3, 7, 9, 13, 19]] .= missing; data
5×4 Matrix{Union{Missing, Float64}}:
  0.841471  -0.279415  -0.99999   -0.287903
   missing    missing  -0.536573  -0.961397
   missing   0.989358    missing  -0.750987
 -0.756802    missing   0.990607    missing
 -0.958924  -0.544021   0.650288   0.912945

julia> result = Impute.knn(data; dims=:cols)
5×4 Matrix{Union{Missing, Float64}}:
  0.841471  -0.279415  -0.99999   -0.287903
 -0.756802   0.989358  -0.536573  -0.961397
 -0.756802   0.989358  -0.536573  -0.750987
 -0.756802  -0.544021   0.990607   0.912945
 -0.958924  -0.544021   0.650288   0.912945
KNN(; kwargs...)

Imputation using k-Nearest Neighbor algorithm.

Keyword Arguments

  • k::Int: number of nearest neighbors

  • dist::MinkowskiMetric: distance metric suppports by NearestNeighbors.jl (Euclidean, Chebyshev, Minkowski and Cityblock)

  • threshold::AbstractFloat: threshold for missing neighbors

Reference

  • Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.

Last Observation Carried Forward (LOCF)

Impute.locf(data; dims=1)

Iterates forwards through the data and fills missing data with the last existing observation. See Impute.LOCF for details.

Example

julia> using DataFrames; using Impute: Impute


julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
 Row │ a          b
     │ Float64?   Float64?
─────┼──────────────────────
   1 │       1.0        1.1
   2 │       2.0        2.2
   3 │ missing          3.3
   4 │ missing    missing
   5 │       5.0        5.5

julia> Impute.locf(df)
5×2 DataFrame
 Row │ a         b
     │ Float64?  Float64?
─────┼────────────────────
   1 │      1.0       1.1
   2 │      2.0       2.2
   3 │      2.0       3.3
   4 │      2.0       3.3
   5 │      5.0       5.5
LOCF(; limit=nothing)

Last observation carried forward (LOCF) iterates forwards through the data and fills missing data with the last existing observation. The current implementation is univariate, so each variable in a table or matrix will be handled independently.

See also:

!!! Missing elements at the head of the array may not be imputed if there is no existing observation to carry forward. As a result, this method does not guarantee that all missing values will be imputed.

Keyword Arguments

  • limit::Union{UInt, Nothing}: Optionally limits the amount of consecutive missing values to replace.

Example

julia> using Impute: LOCF, impute

julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
 1.0  2.0       missing   missing   missing  6.0
 1.1   missing  missing  4.4       5.5       6.6

julia> impute(M, LOCF(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
 1.0  2.0  2.0  2.0  2.0  6.0
 1.1  1.1  1.1  4.4  5.5  6.6

julia> impute(M,  LOCF(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
 1.0  2.0  2.0  2.0   missing  6.0
 1.1  1.1  1.1  4.4  5.5       6.6

Next Observation Carried Backward (NOCB)

Impute.nocb(data; dims=1)

Iterates backwards through the data and fills missing data with the next existing observation. See Impute.NOCB for details.

Example

julia> using DataFrames; using Impute: Impute


julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
 Row │ a          b
     │ Float64?   Float64?
─────┼──────────────────────
   1 │       1.0        1.1
   2 │       2.0        2.2
   3 │ missing          3.3
   4 │ missing    missing
   5 │       5.0        5.5

julia> Impute.nocb(df)
5×2 DataFrame
 Row │ a         b
     │ Float64?  Float64?
─────┼────────────────────
   1 │      1.0       1.1
   2 │      2.0       2.2
   3 │      5.0       3.3
   4 │      5.0       5.5
   5 │      5.0       5.5
NOCB(; limit=nothing)

Next observation carried backward (NOCB) iterates backwards through the data and fills missing data with the next existing observation.

See also:

!!! Missing elements at the tail of the array may not be imputed if there is no existing observation to carry backward. As a result, this method does not guarantee that all missing values will be imputed.

Keyword Arguments

  • limit::Union{UInt, Nothing}: Optionally limits the amount of consecutive missing values to replace.

Example

julia> using Impute: NOCB, impute

julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
 1.0  2.0       missing   missing   missing  6.0
 1.1   missing  missing  4.4       5.5       6.6

julia> impute(M, NOCB(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
 1.0  2.0  6.0  6.0  6.0  6.0
 1.1  4.4  4.4  4.4  5.5  6.6

julia> impute(M,  NOCB(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
 1.0  2.0   missing  6.0  6.0  6.0
 1.1  4.4  4.4       4.4  5.5  6.6

Replacement

Impute.replace(data; values)

Replace missings with one of the specified constant values, depending on the input type. If multiple values of the same type are provided then the first one will be used. If the input data is of a different type then the no replacement will be performed.

Keyword Arguments

  • values::Tuple: A scalar or tuple of different values that should be used to replace missings. Typically, one value per type you’re considering imputing for.

Example

julia> using DataFrames, Impute

julia> df = DataFrame(
           :a => [1.1, 2.2, missing, missing, 5.5],
           :b => [1, 2, 3, missing, 5],
           :c => ["v", "w", "x", "y", missing],
       )
5×3 DataFrame
 Row │ a          b        c
     │ Float64?   Int64?   String?
─────┼─────────────────────────────
   1 │       1.1        1  v
   2 │       2.2        2  w
   3 │ missing          3  x
   4 │ missing    missing  y
   5 │       5.5        5  missing

julia> Impute.replace(df; values=(NaN, -9999, "NULL"))
5×3 DataFrame
 Row │ a         b       c
     │ Float64?  Int64?  String?
─────┼───────────────────────────
   1 │      1.1       1  v
   2 │      2.2       2  w
   3 │    NaN         3  x
   4 │    NaN     -9999  y
   5 │      5.5       5  NULL
Replace(; value)

Replace missings with one of the specified constant values, depending on the input type. If multiple values of the same type are provided then the first one will be used. If the input data is of a different type then the no replacement will be performed.

Keyword Arguments

  • values::Tuple: A scalar or tuple of different values that should be used to replace missings. Typically, one value per type you’re considering imputing for.

Example

julia> using Impute: Replace, impute

julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0   missing  missing  5.0
 1.1  2.2  3.3       missing  5.5

julia> impute(M, Replace(; values=0.0); dims=2)
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0  0.0  0.0  5.0
 1.1  2.2  3.3  0.0  5.5

Simple Random Sample (SRS)

Impute.srs(data; rng=Random.GLOBAL_RNG)

Simple Random Sampling (SRS) imputation is a method for imputing both continuous and categorical variables. Furthermore, it completes imputation while preserving the distributional properties of the variables (e.g., mean, standard deviation).

Example

julia> using DataFrames; using Random; using Impute: Impute

julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
│ Row │ a        │ b        │
│     │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1   │ 1.0      │ 1.1      │
│ 2   │ 2.0      │ 2.2      │
│ 3   │ missing  │ 3.3      │
│ 4   │ missing  │ missing  │
│ 5   │ 5.0      │ 5.5      │

julia> Impute.srs(df; rng=MersenneTwister(1234))
5×2 DataFrame
│ Row │ a        │ b        │
│     │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1   │ 1.0      │ 1.1      │
│ 2   │ 2.0      │ 2.2      │
│ 3   │ 1.0      │ 3.3      │
│ 4   │ 2.0      │ 3.3      │
│ 5   │ 5.0      │ 5.5      │
SRS(; rng=Random.GLOBAL_RNG)

Simple Random Sampling (SRS) imputation is a method for imputing both continuous and categorical variables. Furthermore, it completes imputation while preserving the distributional properties of the variables (e.g., mean, standard deviation).

The basic idea is that for a given variable, x, with missing data, we randomly draw from the observed values of x to impute the missing elements. Since the random draws from x for imputation are done in proportion to the frequency distribution of the values in x, the univariate distributional properties are generally not impacted; this is true for both categorical and continuous data.

Keyword Arguments

  • rng::AbstractRNG: A random number generator to use for observation selection

Example

julia> using Random; using Impute: SRS, impute

julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0   missing  missing  5.0
 1.1  2.2  3.3       missing  5.5

julia> impute(M, SRS(; rng=MersenneTwister(1234)); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0  1.0  2.0  5.0
 1.1  2.2  3.3  3.3  5.5

Substitute

Impute.substitute(data; statistic=nothing)
Impute.substitute(data; weights=nothing)

Substitute missing values with a summary statistic over the non-missing values.

Keyword Arguments

  • statistic: A summary statistic function to be applied to the non-missing values. This function should return a value of the same type as the input data eltype. If this function isn’t passed in then the defaultstats function is used to make a best guess.

  • weights: A set of statistical weights to apply to the mean or median in defaultstats.

See Substitute for details on substitution rules defined in defaultstats.

Example

julia> using DataFrames, Impute


julia> df = DataFrame(
                  :a => [8.9, 2.2, missing, missing, 1.3, 6.2, 3.7, 4.8],
                  :b => [2, 6, 3, missing, 7, 1, 9, missing],
                  :c => [true, false, true, true, false, missing, false, true],
              )
8×3 DataFrame
 Row │ a          b        c
     │ Float64?   Int64?   Bool?
─────┼─────────────────────────────
   1 │       8.9        2     true
   2 │       2.2        6    false
   3 │ missing          3     true
   4 │ missing    missing     true
   5 │       1.3        7    false
   6 │       6.2        1  missing
   7 │       3.7        9    false
   8 │       4.8  missing     true

julia> Impute.substitute(df)
8×3 DataFrame
 Row │ a         b       c
     │ Float64?  Int64?  Bool?
─────┼─────────────────────────
   1 │     8.9        2   true
   2 │     2.2        6  false
   3 │     4.25       3   true
   4 │     4.25       4   true
   5 │     1.3        7  false
   6 │     6.2        1   true
   7 │     3.7        9  false
   8 │     4.8        4   true
Substitute(; statistic=Impute.defaultstats)

Substitute missing values with a summary statistic over the non-missing values.

Keyword Arguments

  • statistic: A summary statistic function to be applied to the non-missing values. This function should return a value of the same type as the input data eltype. If this function isn’t passed in then the Impute.defaultstats function is used to make a best guess.

Example

julia> using Statistics; using Impute: Substitute, impute

julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0   missing  missing  5.0
 1.1  2.2  3.3       missing  5.5

julia> impute(M, Substitute(); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0  2.0  2.0   5.0
 1.1  2.2  3.3  2.75  5.5

julia> impute(M, Substitute(; statistic=mean); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0  2.66667  2.66667  5.0
 1.1  2.2  3.3      3.025    5.5
defaultstats(data[, wv])

A set of default substitution rules using either median or mode based on the eltype of the input data. Specific rules are summarized as follows.

  • Bool elements use mode

  • Real elements use median

  • Integer elements where nunique(data) / length(data) < 0.25 use mode (ratings, categorical codings, etc)

  • Integer elements with mostly unique values use median

  • !Number (non-numeric) elements use mode as the safest fallback

SVD

Impute.svd(; kwargs...)

Imputes the missing values in a matrix using an expectation maximization (EM) algorithm over low-rank SVD approximations.

Keyword Arguments

  • init::Imputor: initialization method for missing values (default: Substitute())

  • rank::Union{Int, Nothing}: rank of the SVD approximation (default: nothing meaning start and 0 and increase)

  • tol::Float64: convergence tolerance (default: 1e-10)

  • maxiter::Int: Maximum number of iterations if convergence is not achieved (default: 100)

  • limits::Unoin{Tuple{Float64, Float64}, Nothing}: Bound the possible approximation values (default: nothing)

  • verbose::Bool: Whether to display convergence progress (default: true)

References

  • Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.

Example

julia> using Impute, Missings

julia> data = allowmissing(reshape(sin.(1:20), 5, 4)); data[[2, 3, 7, 9, 13, 19]] .= missing; data
5×4 Matrix{Union{Missing, Float64}}:
  0.841471  -0.279415  -0.99999   -0.287903
   missing    missing  -0.536573  -0.961397
   missing   0.989358    missing  -0.750987
 -0.756802    missing   0.990607    missing
 -0.958924  -0.544021   0.650288   0.912945

julia> result = Impute.svd(data; dims=:cols)
5×4 Matrix{Union{Missing, Float64}}:
  0.841471  -0.279415  -0.99999   -0.287903
  0.220258   0.555829  -0.536573  -0.961397
 -0.372745   0.989358   0.533193  -0.750987
 -0.756802   0.253309   0.990607   0.32315
 -0.958924  -0.544021   0.650288   0.912945
SVD(; kwargs...)

Imputes the missing values in a matrix using an expectation maximization (EM) algorithm over low-rank SVD approximations.

Keyword Arguments

  • init::Imputor: initialization method for missing values (default: Substitute())

  • rank::Union{Int, Nothing}: rank of the SVD approximation (default: nothing meaning start and 0 and increase)

  • tol::Float64: convergence tolerance (default: 1e-10)

  • maxiter::Int: Maximum number of iterations if convergence is not achieved (default: 100)

  • limits::Union{Tuple{Float64, Float64}, Nothing}: Bound the possible approximation values (default: nothing)

  • verbose::Bool: Whether to display convergence progress (default: true)

References

  • Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.