Missing values
Julia provides support for representing missing values in the statistical sense. This is intended for situations where no value is available for a variable in an observation, but a valid value theoretically exists. Missing values are represented by the `missing` object, which is the singleton instance of the type `Missing`. `missing` is equivalent to https://en.wikipedia.org/wiki/NULL_(SQL) [`NULL` in SQL] and https://cran.r-project.org/doc/manuals/r-release/R-lang.html#NA-handling [`NA` in R], and behaves like them in most situations.
Propagation of missing values
Missing values are propagated automatically when passed to standard mathematical operators and functions. For these functions, uncertainty about the value of one of the operands induces uncertainty about the result. In practice, this means that a mathematical operation involving a `missing` value generally returns `missing`:
julia> missing + 1
missing
julia> "a" * missing
missing
julia> abs(missing)
missing
Since `missing` is a normal Julia object, this propagation rule only works for functions that have opted in to this behavior. This can be achieved by:
- adding a specific method defined for arguments of type `Missing`;
- accepting arguments of this type and passing them to functions that propagate them (for example, standard mathematical operators).
Packages should consider whether it makes sense to propagate missing values when defining new functions and, if so, define methods accordingly. Passing a `missing` value to a function that has no method accepting arguments of type `Missing` throws a `MethodError`, just as for any other type.
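The two opt-in approaches listed above can be sketched as follows. The function names `halve`, `affine` and `strict_double` are hypothetical and exist only for this illustration.

```julia
# Hypothetical functions, used only to illustrate the two opt-in approaches.

# Approach 1: add a dedicated method for arguments of type Missing.
halve(x::Number) = x / 2
halve(::Missing) = missing

# Approach 2: accept any argument and delegate to operators that already propagate.
affine(x) = 2 * x + 1

# A function with no method accepting Missing throws a MethodError.
strict_double(x::Number) = 2 * x

r1 = halve(4)           # 2.0
r2 = halve(missing)     # missing
r3 = affine(missing)    # missing

threw_method_error = try
    strict_double(missing)
    false
catch err
    err isa MethodError
end
```

The second approach needs no extra method, but it only works as long as every operation inside the function body propagates `missing` itself.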
Functions that do not propagate missing values can be made to do so by wrapping them in the `passmissing` function provided by the https://github.com/JuliaData/Missings.jl [Missings.jl] package. For example, `f(x)` becomes `passmissing(f)(x)`.
Equality and comparison operators
The standard equality and comparison operators follow the propagation rule presented above: if any of the operands is `missing`, the result is `missing`. Here are a few examples:
julia> missing == 1
missing
julia> missing == missing
missing
julia> missing < 1
missing
julia> 2 >= missing
missing
In particular, note that `missing == missing` returns `missing`, so `==` cannot be used to test whether a value is missing. To test whether `x` is `missing`, use `ismissing(x)`.
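As a small illustration of the difference, the sketch below counts and flags missing entries with `ismissing`, and shows that an elementwise `==` comparison does not answer the question. The vector `data` is made up for the example.

```julia
data = [1, missing, 3, missing]          # made-up sample data

n_missing = count(ismissing, data)       # number of missing entries
flags = ismissing.(data)                 # elementwise test, always Bool

# Comparing with == propagates missing instead of returning true or false.
compared = data .== missing
all_undetermined = all(ismissing, compared)
```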
The special comparison operators `isequal` and `===` are exceptions to the propagation rule. They always return a `Bool` value, even in the presence of missing values, treating `missing` as equal to `missing` and as different from any other value. They can therefore be used to test whether a value is `missing`:
julia> missing === 1
false
julia> isequal(missing, 1)
false
julia> missing === missing
true
julia> isequal(missing, missing)
true
The `isless` operator is another exception: `missing` is considered greater than any other value. This operator is used by `sort!`, which therefore places `missing` values after all other values:
julia> isless(1, missing)
true
julia> isless(missing, Inf)
false
julia> isless(missing, missing)
false
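The effect of this ordering on sorting can be checked directly. The following sketch uses a made-up vector; with `rev=true` the order is reversed, so `missing` comes first.

```julia
v = [3, missing, 1, 2]                  # made-up sample data

ascending = sort(v)                     # missing is placed last
descending = sort(v, rev=true)          # reversed order: missing is placed first
```

Because `==` would return `missing` here, `isequal` is the appropriate way to compare the results with an expected array.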
Logical operators
The logical (or boolean) operators `|`, `&` and `xor` are another special case, since they propagate missing values only when it is logically required. For these operators, whether or not the result is uncertain depends on the particular operation. This follows the well-established rules of https://en.wikipedia.org/wiki/Three-valued_logic [three-valued logic], which are also implemented by, for example, `NULL` in SQL and `NA` in R. This abstract definition corresponds to relatively natural behavior, which is best explained through concrete examples.
Let us illustrate this principle with the logical "or" operator `|`. Following the rules of boolean logic, if one of the operands is `true`, the value of the other operand does not influence the result, which is always `true`:
julia> true | true
true
julia> true | false
true
julia> false | true
true
Based on this observation, we can conclude that if one of the operands is `true` and the other is `missing`, the result is `true` despite the uncertainty about the actual value of one of the operands. If we had been able to observe the actual value of the second operand, it could only be `true` or `false`, and in both cases the result would be `true`. Therefore, in this particular case, missingness does not propagate:
julia> true | missing
true
julia> missing | true
true
Conversely, if one of the operands is `false`, the result can be either `true` or `false`, depending on the value of the other operand. Therefore, if that other operand is `missing`, the result must be `missing` too:
julia> false | true
true
julia> true | false
true
julia> false | false
false
julia> false | missing
missing
julia> missing | false
missing
The behavior of the logical "and" operator `&` is similar to that of the `|` operator, with the difference that missingness does not propagate when one of the operands is `false`. For example, when that is the case for the first operand:
julia> false & false
false
julia> false & true
false
julia> false & missing
false
On the other hand, missingness propagates when one of the operands is `true`, for example the first one:
julia> true & true
true
julia> true & false
false
julia> true & missing
missing
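To see all combinations at once, the rules above can be collected into a small truth table. This is only an illustrative sketch; the names `vals` and `table` are not part of any API.

```julia
vals = (true, false, missing)

# Each row is (a, b, a | b, a & b, xor(a, b)) for one pair of operands.
table = [(a, b, a | b, a & b, xor(a, b)) for a in vals for b in vals]

or_true_missing   = true | missing      # true: the other operand cannot change the result
or_false_missing  = false | missing     # missing: the result depends on the unknown value
and_false_missing = false & missing     # false: the other operand cannot change the result
and_true_missing  = true & missing      # missing
xor_true_missing  = xor(true, missing)  # missing: xor always depends on both operands
```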
Control flow and short-circuiting operators
Control flow constructs, including `if`, `while` and the ternary operator `x ? y : z`, do not allow missing values. This is because of the uncertainty about whether the actual value would be `true` or `false` if we could observe it, which implies that we do not know how the program should behave. In this case, a `TypeError` is thrown as soon as a `missing` value is encountered in this context:
julia> if missing
           println("here")
       end
ERROR: TypeError: non-boolean (Missing) used in boolean context
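One possible way to avoid this error is to decide explicitly how a missing condition should be treated before it reaches the `if`. The sketch below uses the Base function `coalesce`, which returns its first non-missing argument; the function `describe` and its return strings are hypothetical.

```julia
# Hypothetical helper: classify a value that may be missing.
function describe(x)
    # x > 0 may be missing; coalesce replaces a missing condition by false.
    if coalesce(x > 0, false)
        return "positive"
    elseif ismissing(x)
        return "unknown"
    else
        return "not positive"
    end
end

d1 = describe(5)
d2 = describe(-1)
d3 = describe(missing)
```

Whether a missing condition should count as `false`, as `true`, or be handled in a separate branch depends on the application.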
For the same reason, unlike the logical operators presented above, the short-circuiting boolean operators (`&&` and `||`) do not allow missing values in situations where the value of the operand determines whether the next operand is evaluated or not. For example:
julia> missing || false
ERROR: TypeError: non-boolean (Missing) used in boolean context
julia> missing && false
ERROR: TypeError: non-boolean (Missing) used in boolean context
julia> true && missing && false
ERROR: TypeError: non-boolean (Missing) used in boolean context
By contrast, no error is thrown when the result can be determined without the missing values. This is the case when the code short-circuits before evaluating the `missing` operand, and when the `missing` operand is the last one:
julia> true && missing
missing
julia> false && missing
false
Arrays with missing values
Arrays containing missing values can be created like other arrays:
julia> [1, missing]
2-element Vector{Union{Missing, Int64}}:
1
missing
As this example shows, the element type of such arrays is `Union{Missing, T}`, where `T` is the type of the non-missing values. This reflects the fact that array entries can be either of type `T` (here, `Int64`) or of type `Missing`. This kind of array uses an efficient memory storage equivalent to an `Array{T}` holding the actual values combined with an `Array{UInt8}` indicating the type of each entry (that is, whether it is `Missing` or `T`).
Arrays allowing for missing values can be constructed with the standard syntax. Use `Array{Union{Missing, T}}(missing, dims)` to create arrays filled with missing values:
julia> Array{Union{Missing, String}}(missing, 2, 3)
2×3 Matrix{Union{Missing, String}}:
missing missing missing
missing missing missing
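A typical use of such an array is to allocate it up front and fill in values as they become available. The following sketch assumes a made-up vector `scores` of three measurements, one of which is never recorded.

```julia
# Allocate a vector of three entries, all initially missing.
scores = Vector{Union{Missing, Float64}}(missing, 3)

# Fill in the values that are known; the second entry stays missing.
scores[1] = 0.5
scores[3] = 1.5

element_type = eltype(scores)
n_unknown = count(ismissing, scores)
```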
An array whose element type allows missing entries (for example, `Vector{Union{Missing, T}}`) but which does not contain any missing entries can be converted to an array type that does not allow them (for example, `Vector{T}`) using `convert`. If the array contains `missing` values, a `MethodError` is thrown during conversion:
julia> x = Union{Missing, String}["a", "b"]
2-element Vector{Union{Missing, String}}:
"a"
"b"
julia> convert(Array{String}, x)
2-element Vector{String}:
"a"
"b"
julia> y = Union{Missing, String}[missing, "b"]
2-element Vector{Union{Missing, String}}:
missing
"b"
julia> convert(Array{String}, y)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type String
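If the error is not desirable, the conversion can be guarded by a check for missing entries. The helper `narrow` below is hypothetical and only sketches the idea; it relies on the Base function `nonmissingtype` to obtain `T` from `Union{Missing, T}`.

```julia
# Hypothetical helper: narrow the element type only when it is safe to do so.
function narrow(v::AbstractVector)
    # Leave arrays that actually contain missing entries untouched.
    any(ismissing, v) && return v
    T = nonmissingtype(eltype(v))
    return convert(Vector{T}, v)
end

x = Union{Missing, String}["a", "b"]
y = Union{Missing, String}[missing, "b"]

narrowed = narrow(x)     # converted to Vector{String}
unchanged = narrow(y)    # returned as is, since it contains missing
```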
Skipping missing values
Since missing values propagate with standard mathematical operators, reduction functions return `missing` when called on arrays that contain missing values:
julia> sum([1, missing])
missing
In this situation, use the `skipmissing` function to skip missing values:
julia> sum(skipmissing([1, missing]))
1
This convenience function returns an iterator that filters out missing values efficiently. It can therefore be used with any function that supports iterators:
julia> x = skipmissing([3, missing, 2, 1])
skipmissing(Union{Missing, Int64}[3, missing, 2, 1])
julia> maximum(x)
3
julia> sum(x)
6
julia> mapreduce(sqrt, +, x)
4.146264369941973
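Because the result is an ordinary iterator, it also works in generators and comprehensions. As a sketch, the average of the non-missing entries of a made-up vector `readings` can be computed with Base functions only:

```julia
readings = [2.0, missing, 4.0, 6.0]      # made-up sample data
valid = skipmissing(readings)

n_valid = count(!ismissing, readings)    # number of non-missing entries
avg = sum(valid) / n_valid               # average over non-missing entries only

squares = [r^2 for r in valid]           # comprehensions accept the iterator too
```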
Objects created by calling `skipmissing` on an array can be indexed using indices from the parent array. Indices corresponding to missing values are not valid for these objects, and an error is thrown when trying to use them (they are also skipped by `keys` and `eachindex`):
julia> x[1]
3
julia> x[2]
ERROR: MissingException: the value at index (2,) is missing
[...]
This allows functions that operate on indices to work in combination with `skipmissing`. This is notably the case for search and find functions. These functions return indices that are valid for the object returned by `skipmissing`, and that are also the indices of the matching entries in the parent array:
julia> findall(==(1), x)
1-element Vector{Int64}:
4
julia> findfirst(!iszero, x)
1
julia> argmax(x)
1
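The correspondence between the two sets of indices can be checked with a short sketch. The name `parent_array` is chosen here for illustration only.

```julia
parent_array = [3, missing, 2, 1]
x = skipmissing(parent_array)

# eachindex skips the positions that hold missing values.
valid_indices = collect(eachindex(x))

# An index returned for x can be used directly on the parent array.
i = argmax(x)
largest = parent_array[i]
```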
Use `collect` to extract non-missing values and store them in an array:
julia> collect(x)
3-element Vector{Int64}:
3
2
1
Logical operations with arrays
The three-valued logic described above for logical operators is also used by logical functions applied to arrays. Thus, array equality tests using the `==` operator return `missing` whenever the result cannot be determined without knowing the actual value of the missing entry. In practice, this means that `missing` is returned if all non-missing values of the compared arrays are equal, but one or both arrays contain missing values (possibly at different positions):
julia> [1, missing] == [2, missing]
false
julia> [1, missing] == [1, missing]
missing
julia> [1, 2, missing] == [1, missing, 2]
missing
As for single values, use `isequal` to treat `missing` values as equal to other `missing` values, but different from non-missing values:
julia> isequal([1, missing], [1, missing])
true
julia> isequal([1, 2, missing], [1, missing, 2])
false
The functions `any` and `all` also follow the rules of three-valued logic. Thus, they return `missing` when the result cannot be determined:
julia> all([true, missing])
missing
julia> all([false, missing])
false
julia> any([true, missing])
true
julia> any([false, missing])
missing
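The same rules apply when `any` and `all` are called with a predicate. The sketch below, on made-up data, also shows that combining them with `skipmissing` yields a definite `Bool` by restricting the test to the observed values.

```julia
obs = [1, missing, 7]                           # made-up sample data

all_positive = all(>(0), obs)                   # missing: the unknown entry could be <= 0
any_large = any(>(5), obs)                      # true: 7 settles the question
all_positive_observed = all(>(0), skipmissing(obs))  # true: only observed values are tested
```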