Документация Engee

Validation

Страница в процессе перевода.

Validator

An Validator stores settings for checking the validity of a AbstractArray or Tables.table containing missing values. New validations are expected to subtype Impute.Validator and, at minimum, implement the _validate(data::AbstractArray{Union{T, Missing}}, ::<MyValidator>) method.

validate(data::AbstractArray, v::Validator; dims=:)

If the validator v fails then an error is thrown, otherwise the data provided is returned without mutation. See Validator for the minimum internal _validate call requirements.

Arguments

  • data::AbstractArray: the data to be impute along dimensions dims

  • v::Validator: the validator to apply

Keywords

  • dims: The dimension to apply the _validate along (default is :)

Returns

  • the input data if no error is thrown.

Throws

  • An error when the test fails

julia> using Test; using Impute: Threshold, ThresholdError, validate

julia> M = [1.0 2.0 missing missing 5.0; 1.1 2.2 3.3 missing 5.5]
2×5 Matrix{Union{Missing, Float64}}:
 1.0  2.0   missing  missing  5.0
 1.1  2.2  3.3       missing  5.5

julia> @test_throws ThresholdError validate(M, Threshold())
Test Passed
      Thrown: ThresholdError
validate(table, v::Validator; cols=nothing)

Applies the validator v to the table 1 column at a time; if this is not the desired behaviour custom validate methods should overload this method. See Validator for the minimum internal _validate call requirements.

Arguments

  • table: the data to impute

  • v: the validator to apply

Keyword Arguments

  • cols: The columns to impute along (default is to impute all columns)

Returns

  • the input data if no error is thrown.

Throws

  • An error when any column doesn’t pass the test

Example

julia> using DataFrames, Test; using Impute: Threshold, ThresholdError, validate


julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrame
 Row │ a          b
     │ Float64?   Float64?
─────┼──────────────────────
   1 │       1.0        1.1
   2 │       2.0        2.2
   3 │ missing          3.3
   4 │ missing    missing
   5 │       5.0        5.5

julia> @test_throws ThresholdError validate(df, Threshold())
Test Passed
      Thrown: ThresholdError

Thresholds

Impute.threshold(data; limit=0.1, kwargs...)

Assert that proportion of missing values in the data do not exceed the limit.

Examples

julia> using DataFrames, Impute

julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrames.DataFrame
│ Row │ a        │ b        │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 1.0      │ 1.1      │
│ 2   │ 2.0      │ 2.2      │
│ 3   │ missing  │ 3.3      │
│ 4   │ missing  │ missing  │
│ 5   │ 5.0      │ 5.5      │

julia> Impute.threshold(df)
ERROR: ThresholdError: Missing data limit exceeded 0.1 (0.4)
Stacktrace:
...

julia> Impute.threshold(df; limit=0.8)
5×2 DataFrames.DataFrame
│ Row │ a        │ b        │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 1.0      │ 1.1      │
│ 2   │ 2.0      │ 2.2      │
│ 3   │ missing  │ 3.3      │
│ 4   │ missing  │ missing  │
│ 5   │ 5.0      │ 5.5      │
Threshold(; limit=0.1)

Assert that the ratio of missing values in the provided dataset does not exceed to specified limit.

Keyword Arguments

  • limit::Real: Allowed proportion of missing values (should be between 0.0 and 1.0).

Weighted Thresholds

Impute.wthreshold(data; ratio, weights, kwargs...)

Assert that the weighted proportion of missing values in the data do not exceed the limit.

Examples

julia> using DataFrames, Impute

julia> df = DataFrame(:a => [1.0, 2.0, missing, missing, 5.0], :b => [1.1, 2.2, 3.3, missing, 5.5])
5×2 DataFrames.DataFrame
│ Row │ a        │ b        │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 1.0      │ 1.1      │
│ 2   │ 2.0      │ 2.2      │
│ 3   │ missing  │ 3.3      │
│ 4   │ missing  │ missing  │
│ 5   │ 5.0      │ 5.5      │

julia> Impute.wthreshold(df; limit=0.4, weights=0.1:0.1:0.5)
ERROR: ThresholdError: Missing data limit exceeded 0.4 (0.4666666666666666)
Stacktrace:
...

julia> Impute.wthreshold(df; limit=0.4, weights=0.5:-0.1:0.1)
5×2 DataFrames.DataFrame
│ Row │ a        │ b        │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 1.0      │ 1.1      │
│ 2   │ 2.0      │ 2.2      │
│ 3   │ missing  │ 3.3      │
│ 4   │ missing  │ missing  │
│ 5   │ 5.0      │ 5.5      │
WeightedThreshold(; limit, weights)

Assert that the weighted proportion missing values in the provided dataset does not exceed to specified limit. The weighed proportion is calculated as sum(weights[ismissing.(data)]) / sum(weights)

Keyword Arguments

  • limit::Real: Allowed proportion of missing values (should be between 0.0 and 1.0).

  • weights::AbstractWeights: A set of statistical weights to use when evaluating the importance of each observation.