Imputation
Страница в процессе перевода. |
#
Impute.Imputor
— Type
Imputor
An imputor stores information about imputing values in AbstractArray
s and Tables.table
s. New imputation methods are expected to subtype Imputor
and, at minimum, implement the _impute!(data::AbstractArrays, imp::<MyImputor>)
method.
While fallback impute
and impute!
methods are provided to extend your _impute!
methods to n-dimensional arrays and tables, you can always override these methods to change the behaviour as necessary.
#
Impute.impute!
— Method
impute!(data::AbstractArray, imp) -> data
Just returns the data
when the array doesn’t contain missing
s
#
Impute.impute!
— Method
impute!(data::AbstractArray{Missing}, imp) -> data
Just return the data
when the array only contains missing
s
#
Impute.impute!
— Method
impute!(data::A, imp; dims=:, kwargs...) -> A
Impute the missing
values in the array data
using the imputor imp
. Optionally, you can specify the dimension to impute along.
Arguments
-
data::AbstractArray{Union{T, Missing}}
: the data to be impute along dimensionsdims
-
imp::Imputor
: the Imputor method to use
Keyword Arguments
-
dims=:
: The dimension to impute along.:rows
and:cols
are also supported for matrices.
Returns
-
AbstractArray{Union{T, Missing}}
: the inputdata
with values imputed
NOTES
-
Matrices have a deprecated
dims=2
special case asdims=:
is a breaking change -
Mutation isn’t guaranteed for all array types, hence we return the result
-
eachslice
is used internally which requires Julia 1.1
Example
julia> using Impute: Interpolate, impute!
julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5
julia> impute!(M, Interpolate(); dims=1)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 3.0 4.0 5.0
1.1 2.2 3.3 4.4 5.5
julia> M
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 3.0 4.0 5.0
1.1 2.2 3.3 4.4 5.5
#
Impute.impute!
— Method
impute!(table, imp; cols=nothing) -> table
Imputes the data in a table by imputing the values 1 column at a time; if this is not the desired behaviour custom imputor methods should overload this method.
Arguments
-
imp::Imputor
: the Imputor method to use -
table
: the data to impute
Keyword Arguments
-
cols
: The columns to impute along (default is to impute all columns)
Returns
-
the input
data
with values imputed
Example
julia> using DataFrames; using Impute: Interpolate, impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼──────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ missing 3.3
4 │ missing missing
5 │ 5.0 5.5
julia> impute(df, Interpolate())
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ 3.0 3.3
4 │ 4.0 4.4
5 │ 5.0 5.5
#
Impute.impute!
— Method
impute!(data::T, imp; kwargs...) -> T where T <: AbstractVector{<:NamedTuple}
Special case rowtables which are arrays, but we want to fallback to the tables method.
#
Impute.impute
— Method
impute(data::T, imp; kwargs...) -> T
Returns a new copy of the data
with the missing data imputed by the imputor imp
. For matrices and tables, data is imputed one variable/column at a time. If this is not the desired behaviour then you should overload this method or specify a different dims
value.
Arguments
-
data
: the data to be impute -
imp::Imputor
: the Imputor method to use
Returns
-
the input
data
with values imputed
Example
julia> using Impute: Interpolate, impute
julia> v = [1.0, 2.0, missing, missing, 5.0]
5-element Vector{Union{Missing, Float64}}:
1.0
2.0
missing
missing
5.0
julia> impute(v, Interpolate())
5-element Vector{Union{Missing, Float64}}:
1.0
2.0
3.0
4.0
5.0
Interpolation
#
Impute.interp
— Function
Impute.interp(data; dims=1)
Performs linear interpolation between the nearest values in an vector. See Impute.Interpolate
for details.
Example
julia> using DataFrames; using Impute: Impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼──────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ missing 3.3
4 │ missing missing
5 │ 5.0 5.5
julia> Impute.interp(df)
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ 3.0 3.3
4 │ 4.0 4.4
5 │ 5.0 5.5
#
Impute.Interpolate
— Type
Interpolate(; limit=nothing)
Performs linear interpolation between the nearest values in an vector. The current implementation is univariate, so each variable in a table or matrix will be handled independently.
!!! Missing values at the head or tail of the array cannot be interpolated if there are no existing values on both sides. As a result, this method does not guarantee that all missing values will be imputed.
Keyword Arguments
-
limit::Union{UInt, Nothing}
: Optionally limit the gap sizes that can be interpolated.
Example
julia> using Impute: Interpolate, impute
julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing missing 6.0
1.1 missing missing 4.4 5.5 6.6
julia> impute(M, Interpolate(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 3.0 4.0 5.0 6.0
1.1 2.2 3.3 4.4 5.5 6.6
julia> impute(M, Interpolate(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing missing 6.0
1.1 2.2 3.3 4.4 5.5 6.6
K-Nearest Neighbors (KNN)
#
Impute.knn
— Function
Impute.knn(; k=1, threshold=0.5, dist=Euclidean())
Imputation using k-Nearest Neighbor algorithm.
Keyword Arguments
-
k::Int
: number of nearest neighbors -
dist::MinkowskiMetric
: distance metric suppports byNearestNeighbors.jl
(Euclidean, Chebyshev, Minkowski and Cityblock) -
threshold::AbsstractFloat
: thershold for missing neighbors
Reference
-
Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.
Example
julia> using Impute, Missings
julia> data = allowmissing(reshape(sin.(1:20), 5, 4)); data[[2, 3, 7, 9, 13, 19]] .= missing; data
5×4 Matrix{Union{Missing, Float64}}:
0.841471 -0.279415 -0.99999 -0.287903
missing missing -0.536573 -0.961397
missing 0.989358 missing -0.750987
-0.756802 missing 0.990607 missing
-0.958924 -0.544021 0.650288 0.912945
julia> result = Impute.knn(data; dims=:cols)
5×4 Matrix{Union{Missing, Float64}}:
0.841471 -0.279415 -0.99999 -0.287903
-0.756802 0.989358 -0.536573 -0.961397
-0.756802 0.989358 -0.536573 -0.750987
-0.756802 -0.544021 0.990607 0.912945
-0.958924 -0.544021 0.650288 0.912945
#
Impute.KNN
— Type
KNN(; kwargs...)
Imputation using k-Nearest Neighbor algorithm.
Keyword Arguments
-
k::Int
: number of nearest neighbors -
dist::MinkowskiMetric
: distance metric suppports byNearestNeighbors.jl
(Euclidean, Chebyshev, Minkowski and Cityblock) -
threshold::AbstractFloat
: threshold for missing neighbors
Reference
-
Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.
Last Observation Carried Forward (LOCF)
#
Impute.locf
— Function
Impute.locf(data; dims=1)
Iterates forwards through the data
and fills missing data with the last existing observation. See Impute.LOCF
for details.
Example
julia> using DataFrames; using Impute: Impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼──────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ missing 3.3
4 │ missing missing
5 │ 5.0 5.5
julia> Impute.locf(df)
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ 2.0 3.3
4 │ 2.0 3.3
5 │ 5.0 5.5
#
Impute.LOCF
— Type
LOCF(; limit=nothing)
Last observation carried forward (LOCF) iterates forwards through the data
and fills missing data with the last existing observation. The current implementation is univariate, so each variable in a table or matrix will be handled independently.
See also:
-
Impute.NOCB
: Next Observation Carried Backward
!!! Missing elements at the head of the array may not be imputed if there is no existing observation to carry forward. As a result, this method does not guarantee that all missing values will be imputed.
Keyword Arguments
-
limit::Union{UInt, Nothing}
: Optionally limits the amount of consecutive missing values to replace.
Example
julia> using Impute: LOCF, impute
julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing missing 6.0
1.1 missing missing 4.4 5.5 6.6
julia> impute(M, LOCF(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 2.0 2.0 2.0 6.0
1.1 1.1 1.1 4.4 5.5 6.6
julia> impute(M, LOCF(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 2.0 2.0 missing 6.0
1.1 1.1 1.1 4.4 5.5 6.6
Next Observation Carried Backward (NOCB)
#
Impute.nocb
— Function
Impute.nocb(data; dims=1)
Iterates backwards through the data
and fills missing data with the next existing observation. See Impute.NOCB
for details.
Example
julia> using DataFrames; using Impute: Impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼──────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ missing 3.3
4 │ missing missing
5 │ 5.0 5.5
julia> Impute.nocb(df)
5×2 DataFrame
Row │ a b
│ Float64? Float64?
─────┼────────────────────
1 │ 1.0 1.1
2 │ 2.0 2.2
3 │ 5.0 3.3
4 │ 5.0 5.5
5 │ 5.0 5.5
#
Impute.NOCB
— Type
NOCB(; limit=nothing)
Next observation carried backward (NOCB) iterates backwards through the data
and fills missing data with the next existing observation.
See also:
-
Impute.LOCF
: Last Observation Carried Forward
!!! Missing elements at the tail of the array may not be imputed if there is no existing observation to carry backward. As a result, this method does not guarantee that all missing values will be imputed.
Keyword Arguments
-
limit::Union{UInt, Nothing}
: Optionally limits the amount of consecutive missing values to replace.
Example
julia> using Impute: NOCB, impute
julia> M = [1.0 2.0 missing missing missing 6.0; 1.1 missing missing 4.4 5.5 6.6]
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing missing 6.0
1.1 missing missing 4.4 5.5 6.6
julia> impute(M, NOCB(); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 6.0 6.0 6.0 6.0
1.1 4.4 4.4 4.4 5.5 6.6
julia> impute(M, NOCB(; limit=2); dims=:rows)
2×6 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing 6.0 6.0 6.0
1.1 4.4 4.4 4.4 5.5 6.6
Replacement
#
Impute.replace
— Function
Impute.replace(data; values)
Replace missing
s with one of the specified constant values, depending on the input type. If multiple values of the same type are provided then the first one will be used. If the input data is of a different type then the no replacement will be performed.
Keyword Arguments
-
values::Tuple
: A scalar or tuple of different values that should be used to replace missings. Typically, one value per type you’re considering imputing for.
Example
julia> using DataFrames, Impute
julia> df = DataFrame(
:a => [1.1, 2.2, missing, missing, 5.5],
:b => [1, 2, 3, missing, 5],
:c => ["v", "w", "x", "y", missing],
)
5×3 DataFrame
Row │ a b c
│ Float64? Int64? String?
─────┼─────────────────────────────
1 │ 1.1 1 v
2 │ 2.2 2 w
3 │ missing 3 x
4 │ missing missing y
5 │ 5.5 5 missing
julia> Impute.replace(df; values=(NaN, -9999, "NULL"))
5×3 DataFrame
Row │ a b c
│ Float64? Int64? String?
─────┼───────────────────────────
1 │ 1.1 1 v
2 │ 2.2 2 w
3 │ NaN 3 x
4 │ NaN -9999 y
5 │ 5.5 5 NULL
#
Impute.Replace
— Type
Replace(; value)
Replace missing
s with one of the specified constant values, depending on the input type. If multiple values of the same type are provided then the first one will be used. If the input data is of a different type then the no replacement will be performed.
Keyword Arguments
-
values::Tuple
: A scalar or tuple of different values that should be used to replace missings. Typically, one value per type you’re considering imputing for.
Example
julia> using Impute: Replace, impute
julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5
julia> impute(M, Replace(; values=0.0); dims=2)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 0.0 0.0 5.0
1.1 2.2 3.3 0.0 5.5
Simple Random Sample (SRS)
#
Impute.srs
— Function
Impute.srs(data; rng=Random.GLOBAL_RNG)
Simple Random Sampling (SRS) imputation is a method for imputing both continuous and categorical variables. Furthermore, it completes imputation while preserving the distributional properties of the variables (e.g., mean, standard deviation).
Example
julia> using DataFrames; using Random; using Impute: Impute
julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1 │ 1.0 │ 1.1 │
│ 2 │ 2.0 │ 2.2 │
│ 3 │ missing │ 3.3 │
│ 4 │ missing │ missing │
│ 5 │ 5.0 │ 5.5 │
julia> Impute.srs(df; rng=MersenneTwister(1234))
5×2 DataFrame
│ Row │ a │ b │
│ │ Float64? │ Float64? │
├─────┼──────────┼──────────┤
│ 1 │ 1.0 │ 1.1 │
│ 2 │ 2.0 │ 2.2 │
│ 3 │ 1.0 │ 3.3 │
│ 4 │ 2.0 │ 3.3 │
│ 5 │ 5.0 │ 5.5 │
#
Impute.SRS
— Type
SRS(; rng=Random.GLOBAL_RNG)
Simple Random Sampling (SRS) imputation is a method for imputing both continuous and categorical variables. Furthermore, it completes imputation while preserving the distributional properties of the variables (e.g., mean, standard deviation).
The basic idea is that for a given variable, x
, with missing data, we randomly draw from the observed values of x
to impute the missing elements. Since the random draws from x
for imputation are done in proportion to the frequency distribution of the values in x
, the univariate distributional properties are generally not impacted; this is true for both categorical and continuous data.
Keyword Arguments
-
rng::AbstractRNG
: A random number generator to use for observation selection
Example
julia> using Random; using Impute: SRS, impute
julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5
julia> impute(M, SRS(; rng=MersenneTwister(1234)); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 1.0 2.0 5.0
1.1 2.2 3.3 3.3 5.5
Substitute
#
Impute.substitute
— Function
Impute.substitute(data; statistic=nothing)
Impute.substitute(data; weights=nothing)
Substitute missing values with a summary statistic over the non-missing values.
Keyword Arguments
-
statistic
: A summary statistic function to be applied to the non-missing values. This function should return a value of the same type as the input dataeltype
. If this function isn’t passed in then thedefaultstats
function is used to make a best guess. -
weights
: A set of statistical weights to apply to themean
ormedian
indefaultstats
.
See Substitute for details on substitution rules defined in defaultstats
.
Example
julia> using DataFrames, Impute
julia> df = DataFrame(
:a => [8.9, 2.2, missing, missing, 1.3, 6.2, 3.7, 4.8],
:b => [2, 6, 3, missing, 7, 1, 9, missing],
:c => [true, false, true, true, false, missing, false, true],
)
8×3 DataFrame
Row │ a b c
│ Float64? Int64? Bool?
─────┼─────────────────────────────
1 │ 8.9 2 true
2 │ 2.2 6 false
3 │ missing 3 true
4 │ missing missing true
5 │ 1.3 7 false
6 │ 6.2 1 missing
7 │ 3.7 9 false
8 │ 4.8 missing true
julia> Impute.substitute(df)
8×3 DataFrame
Row │ a b c
│ Float64? Int64? Bool?
─────┼─────────────────────────
1 │ 8.9 2 true
2 │ 2.2 6 false
3 │ 4.25 3 true
4 │ 4.25 4 true
5 │ 1.3 7 false
6 │ 6.2 1 true
7 │ 3.7 9 false
8 │ 4.8 4 true
#
Impute.Substitute
— Type
Substitute(; statistic=Impute.defaultstats)
Substitute missing values with a summary statistic over the non-missing values.
Keyword Arguments
-
statistic
: A summary statistic function to be applied to the non-missing values. This function should return a value of the same type as the input dataeltype
. If this function isn’t passed in then theImpute.defaultstats
function is used to make a best guess.
Example
julia> using Statistics; using Impute: Substitute, impute
julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 missing missing 5.0
1.1 2.2 3.3 missing 5.5
julia> impute(M, Substitute(); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 2.0 2.0 5.0
1.1 2.2 3.3 2.75 5.5
julia> impute(M, Substitute(; statistic=mean); dims=:rows)
2×5 Matrix{Union{Missing, Float64}}:
1.0 2.0 2.66667 2.66667 5.0
1.1 2.2 3.3 3.025 5.5
#
Impute.defaultstats
— Function
defaultstats(data[, wv])
A set of default substitution rules using either median
or mode
based on the eltype
of the input data
. Specific rules are summarized as follows.
-
Bool
elements usemode
-
Real
elements usemedian
-
Integer
elements wherenunique(data) / length(data) < 0.25
usemode
(ratings, categorical codings, etc) -
Integer
elements with mostly unique values usemedian
-
!Number
(non-numeric) elements usemode
as the safest fallback
SVD
#
Impute.svd
— Function
Impute.svd(; kwargs...)
Imputes the missing values in a matrix using an expectation maximization (EM) algorithm over low-rank SVD approximations.
Keyword Arguments
-
init::Imputor
: initialization method for missing values (default: Substitute()) -
rank::Union{Int, Nothing}
: rank of the SVD approximation (default: nothing meaning start and 0 and increase) -
tol::Float64
: convergence tolerance (default: 1e-10) -
maxiter::Int
: Maximum number of iterations if convergence is not achieved (default: 100) -
limits::Unoin{Tuple{Float64, Float64}, Nothing}
: Bound the possible approximation values (default: nothing) -
verbose::Bool
: Whether to display convergence progress (default: true)
References
-
Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.
Example
julia> using Impute, Missings
julia> data = allowmissing(reshape(sin.(1:20), 5, 4)); data[[2, 3, 7, 9, 13, 19]] .= missing; data
5×4 Matrix{Union{Missing, Float64}}:
0.841471 -0.279415 -0.99999 -0.287903
missing missing -0.536573 -0.961397
missing 0.989358 missing -0.750987
-0.756802 missing 0.990607 missing
-0.958924 -0.544021 0.650288 0.912945
julia> result = Impute.svd(data; dims=:cols)
5×4 Matrix{Union{Missing, Float64}}:
0.841471 -0.279415 -0.99999 -0.287903
0.220258 0.555829 -0.536573 -0.961397
-0.372745 0.989358 0.533193 -0.750987
-0.756802 0.253309 0.990607 0.32315
-0.958924 -0.544021 0.650288 0.912945
#
Impute.SVD
— Type
SVD(; kwargs...)
Imputes the missing values in a matrix using an expectation maximization (EM) algorithm over low-rank SVD approximations.
Keyword Arguments
-
init::Imputor
: initialization method for missing values (default: Substitute()) -
rank::Union{Int, Nothing}
: rank of the SVD approximation (default: nothing meaning start and 0 and increase) -
tol::Float64
: convergence tolerance (default: 1e-10) -
maxiter::Int
: Maximum number of iterations if convergence is not achieved (default: 100) -
limits::Union{Tuple{Float64, Float64}, Nothing}
: Bound the possible approximation values (default: nothing) -
verbose::Bool
: Whether to display convergence progress (default: true)
References
-
Troyanskaya, Olga, et al. "Missing value estimation methods for DNA microarrays." Bioinformatics 17.6 (2001): 520-525.