Missing data in Julia

Introduction

Dealing with missing data is a common task in data preprocessing. While sometimes missing values indicate a meaningful event in the data, they often represent unreliable or unusable data points. In any case, Julia has many options for dealing with missing data.

Creating missing data

Missing data in Julia can be represented in several forms. One of them is the floating-point value NaN (not a number), which is also available under the name NaN64:

In [ ]:
x_64 = [NaN, 8, 15, 16, 23, 42]
println("Type of $(x_64[1]): ", typeof(x_64[1]))

x_64 = [NaN64, 8, 15, 16, 23, 42]
print("Type of $(x_64[1]): ", typeof(x_64[1]))
Type of NaN: Float64
Type of NaN: Float64

If necessary, NaN can also be specified in a narrower floating-point type (NaN32 or NaN16):

In [ ]:
x_32 = [NaN32, 8, 15, 16, 23, 42]
Out[0]:
6-element Vector{Float32}:
 NaN
   8.0
  15.0
  16.0
  23.0
  42.0
In [ ]:
x_16 = [NaN16, 8, 15, 16, 23, 42]
Out[0]:
6-element Vector{Float16}:
 NaN
   8.0
  15.0
  16.0
  23.0
  42.0

In essence, NaN values are concrete values of a floating-point type that propagate through computations. They often appear as the result of undefined operations:

In [ ]:
x_nan = Inf*0
Out[0]:
NaN
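
The propagation described above can be checked directly. The following is a minimal sketch using only Base functions; the variable names are illustrative.

```julia
# NaN propagates through arithmetic and reductions.
x = [NaN, 8.0, 15.0]
total = sum(x)
println("sum(x) = ", total)

# NaN is not equal to itself under ==, so test for it with isnan.
println("NaN == NaN: ", NaN == NaN)
println("isnan(total): ", isnan(total))

# isequal, unlike ==, treats NaN values as equal to each other.
println("isequal(NaN, NaN): ", isequal(NaN, NaN))
```

Because NaN == NaN is false, a check such as x .== NaN never finds anything; isnan is the reliable test.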

In Julia, NaN is generally used to propagate the result of an undefined numeric computation. For genuinely missing data it is more appropriate to use the special object missing, which also propagates through computations. It is the only instance of the type Missing:

In [ ]:
typeof(missing)
Out[0]:
Missing
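
How missing propagates can be seen in a few Base operations. This is a small sketch; the variable names are illustrative.

```julia
# Arithmetic and comparisons that involve missing return missing.
a = missing + 1
b = missing == 1
println("missing + 1  -> ", a)
println("missing == 1 -> ", b)

# Logical operators follow three-valued logic: the result is known
# only when the other operand already decides it.
c = true | missing     # true
d = false & missing    # false
e = true & missing     # missing

# ismissing and isequal always return a Bool.
println("ismissing(a): ", ismissing(a))
println("isequal(missing, missing): ", isequal(missing, missing))
```

Since missing == 1 is itself missing rather than false, conditions in if statements should use ismissing or isequal.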

Because missing has its own type, an array that contains missing alongside other values has a Union element type:

In [ ]:
x_missing = [missing, 8, 15, 16, 23, 42]
Out[0]:
6-element Vector{Union{Missing, Int64}}:
   missing
  8
 15
 16
 23
 42
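
The Union element type can be inspected and constructed with Base functions alone. A minimal sketch, with illustrative variable names:

```julia
x = [missing, 8, 15]
T = eltype(x)
println(T)                    # Union{Missing, Int64} on a 64-bit system
println(nonmissingtype(T))    # the element type with Missing removed

# A plain Vector{Int} cannot store missing: push!(plain, missing) would throw.
# Converting to a Union element type first makes room for it.
plain = [8, 15]
wide = convert(Vector{Union{Missing, Int}}, plain)
push!(wide, missing)
println(wide)
```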

The Missings.jl library provides additional tools for working with missing. For example, it can create arrays filled with missing:

In [ ]:
import Pkg; Pkg.add("Missings") # install the package
In [ ]:
using Missings # load the package

# create arrays of missing values:
# arrays with element type Missing
@show missings(1)
@show missings(3)
@show missings(3,1)

# array with a Union element type
@show missings(Int,3,3); 
missings(1) = [missing]
missings(3) = [missing, missing, missing]
missings(3, 1) = [missing; missing; missing;;]
missings(Int, 3, 3) = Union{Missing, Int64}[missing missing missing; missing missing missing; missing missing missing]
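
If installing a package is not an option, similar arrays can be built with Base constructors. The sketch below is an alternative to missings(), not a description of how Missings.jl implements it.

```julia
# Array constructors accept missing as the initial value
# when the element type allows it.
v = Vector{Union{Missing, Int}}(missing, 3)
m = Matrix{Union{Missing, Float64}}(missing, 2, 2)

# The Union element type leaves room for real values later.
v[1] = 42
println(v)

# fill gives an array whose element type is just Missing,
# so it cannot hold anything else.
f = fill(missing, 3)
println(eltype(f))
```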

Sorting missing data

When sorting arrays with missing data, keep in mind that isless() treats missing as greater than any other value:

In [ ]:
isless(Inf, missing)
Out[0]:
true

As a result, missing values are grouped together automatically and end up at the end of an ascending sort (and at the beginning of a descending one). To change where missing values are placed, pass the keyword argument lt = missingsmallest, provided by Missings.jl:

In [ ]:
sort(x_missing, rev=true, lt = missingsmallest)
Out[0]:
6-element Vector{Union{Missing, Int64}}:
 42
 23
 16
 15
  8
   missing
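
The default ordering, and a way to change it without Missings.jl, can be sketched in Base. The comparator missing_first below is a hypothetical helper written for this illustration; it is not part of any library.

```julia
v = [missing, NaN, 1.0, Inf]

# Default order under isless: ordinary numbers, then NaN, then missing.
asc = sort(v)
desc = sort(v, rev = true)
println(asc)     # [1.0, Inf, NaN, missing]
println(desc)    # [missing, NaN, Inf, 1.0]

# Hypothetical comparator that treats missing as the smallest value.
missing_first(a, b) =
    ismissing(a) ? !ismissing(b) : (!ismissing(b) && isless(a, b))

front = sort(v, lt = missing_first)
println(front)   # [missing, 1.0, Inf, NaN]
```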

Normalisation of missing data

Let's load the DataFrames.jl and Statistics.jl libraries to look at examples of handling missing values in tabular data.

In [ ]:
Pkg.add(["DataFrames", "Statistics"])
using DataFrames, Statistics

Let's create a table of test data for processing:

In [ ]:
# column names: имя = name, возраст = age, рост = height
df_missing = DataFrame(
    имя     = ["NULL", "Коля", "Юра", "Миша"],
    возраст = [16, NaN, missing, 15],
    рост    = [171, 162, 999, 165],
)
Out[0]:
4×3 DataFrame
 Row │ имя     возраст    рост
     │ String  Float64?   Int64
─────┼──────────────────────────
   1 │ NULL         16.0    171
   2 │ Коля        NaN      162
   3 │ Юра     missing      999
   4 │ Миша         15.0    165

When data from different sources are combined in one table, missing data may be encoded in different ways (here NaN, the sentinel number 999 and the string "NULL"). To standardise these encodings it is convenient to use the declaremissings() function from the Impute.jl library:

In [ ]:
Pkg.add("Impute")
In [ ]:
using Impute

df_missing = Impute.declaremissings(df_missing; values=(NaN, 999, "NULL"))
Out[0]:
4×3 DataFrame
 Row │ имя      возраст    рост
     │ String?  Float64?   Int64?
─────┼─────────────────────────────
   1 │ missing       16.0      171
   2 │ Коля     missing        162
   3 │ Юра      missing    missing
   4 │ Миша          15.0      165

As we can see, all the missing data are now represented in a single form: the object missing.
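
For a single vector, the same kind of standardisation can be sketched with Base functions only. This is an illustration of the idea, not of how Impute.jl works internally; the vectors are hypothetical stand-ins for table columns.

```julia
height = [171, 162, 999, 165]   # 999 is used as a sentinel for "unknown"
age = [16.0, NaN, 15.0]

# replace returns a new vector whose element type is widened to allow missing.
height_clean = replace(height, 999 => missing)

# For NaN, an explicit isnan test in a comprehension is the most direct form.
age_clean = [isnan(a) ? missing : a for a in age]

println(height_clean)
println(age_clean)
```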

Searching for missing data

To check whether a value is NaN, use the function isnan():

In [ ]:
isnan.(x_64)
Out[0]:
6-element BitVector:
 1
 0
 0
 0
 0
 0

A similar function, ismissing(), detects missing objects:

In [ ]:
ismissing.(x_missing)
Out[0]:
6-element BitVector:
 1
 0
 0
 0
 0
 0
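
Besides a Boolean mask, the number and positions of missing values are often needed. A short sketch using Base functions, with an illustrative vector that mixes NaN and missing:

```julia
v = [missing, 8.0, NaN, missing, 42.0]

n_missing = count(ismissing, v)      # how many values are missing
positions = findall(ismissing, v)    # their indices
println(n_missing, " missing at ", positions)

# isnan(missing) returns missing rather than false, so guard the NaN test
# with ismissing when a vector may contain both.
is_nan = [!ismissing(x) && isnan(x) for x in v]
println(is_nan)
```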

Let's determine the location of missing values in the table:

In [ ]:
df_mask = ismissing.(df_missing)
Out[0]:
4×3 DataFrame
 Row │ имя    возраст  рост
     │ Bool   Bool     Bool
─────┼───────────────────────
   1 │  true    false  false
   2 │ false     true  false
   3 │ false     true   true
   4 │ false    false  false
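
A per-column count of missing values is often more useful than the full mask. The sketch below avoids DataFrames.jl and uses a NamedTuple of vectors as a hypothetical stand-in for the table, with English field names chosen for this illustration.

```julia
# Stand-in for the table: one vector per column.
columns = (
    name   = [missing, "Коля", "Юра", "Миша"],
    age    = [16.0, missing, missing, 15.0],
    height = [171, 162, missing, 165],
)

# map over a NamedTuple keeps the field names in the result.
missing_per_column = map(col -> count(ismissing, col), columns)
println(missing_per_column)

# Names of the columns that contain at least one missing value.
columns_with_gaps = [k for (k, n) in pairs(missing_per_column) if n > 0]
println(columns_with_gaps)
```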

missing objects often cause problems in data processing. To avoid them, missing values can be skipped, dropped or replaced.

Skipping missing data

To remove NaN values from a vector, use the filter() function from Julia's Base library.

In [ ]:
filter(!isnan, x_64)
Out[0]:
5-element Vector{Float64}:
  8.0
 15.0
 16.0
 23.0
 42.0

When missing is filtered out of an array with filter(), the element type of the result does not change and still allows missing. The disallowmissing() function from the Missings.jl library converts the array to the element type of the remaining values.

In [ ]:
@show x = filter(!ismissing, x_missing)
disallowmissing(x)
x = filter(!ismissing, x_missing) = Union{Missing, Int64}[8, 15, 16, 23, 42]
Out[0]:
5-element Vector{Int64}:
  8
 15
 16
 23
 42

The following line of code shows how to filter missing from tabular data using the function filter().

In [ ]:
filter(:имя => !ismissing, df_missing)
Out[0]:
3×3 DataFrame
 Row │ имя      возраст    рост
     │ String?  Float64?   Int64?
─────┼─────────────────────────────
   1 │ Коля     missing        162
   2 │ Юра      missing    missing
   3 │ Миша          15.0      165

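For a vector, filtering and narrowing the element type can be done in one step with the skipmissing() function from the Base library. It returns a lazy iterator, which collect() turns into an array: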

In [ ]:
collect(skipmissing(x_missing))
Out[0]:
5-element Vector{Int64}:
  8
 15
 16
 23
 42
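
Since skipmissing() returns an iterator, it can be passed straight to reductions without collecting first. A minimal sketch with the same values as x_missing (Statistics is part of the standard library):

```julia
using Statistics

v = [missing, 8, 15, 16, 23, 42]

# A reduction over data containing missing is itself missing.
println(sum(v))

# Wrapping the data in skipmissing gives the result for the known values only.
total = sum(skipmissing(v))
avg = mean(skipmissing(v))
println("total = ", total, ", mean = ", avg)
```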

For tabular data, the DataFrames.jl library provides a more convenient function, dropmissing(), which removes every row that contains at least one missing:

In [ ]:
dropmissing(df_missing)
Out[0]:
1×3 DataFrame
 Row │ имя     возраст  рост
     │ String  Float64  Int64
─────┼────────────────────────
   1 │ Миша       15.0    165

To drop only the rows that have missing in a particular column, pass the name of that column as the second argument:

In [ ]:
dropmissing(df_missing, :имя)
Out[0]:
3×3 DataFrame
 Row │ имя     возраст    рост
     │ String  Float64?   Int64?
─────┼────────────────────────────
   1 │ Коля    missing        162
   2 │ Юра     missing    missing
   3 │ Миша         15.0      165

In both cases dropmissing() returns a new table and leaves the original unchanged. If the original table does not need to be preserved, rows with missing data can be deleted in place with dropmissing!().
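
The row-dropping logic can be imitated in Base on a vector of named tuples. This is only an analogy for illustration, assuming rows stored as NamedTuples with English field names; it says nothing about how DataFrames.jl implements dropmissing().

```julia
# Each row of the cleaned table as a NamedTuple.
rows = [
    (name = missing, age = 16.0,    height = 171),
    (name = "Коля",  age = missing, height = 162),
    (name = "Юра",   age = missing, height = missing),
    (name = "Миша",  age = 15.0,    height = 165),
]

# Keep only rows without any missing field (like dropmissing(df)).
complete = filter(r -> !any(ismissing, values(r)), rows)
println(length(complete), " complete row(s)")

# filter returned a new vector above; filter! modifies rows itself,
# here keeping rows with a known name (like dropmissing!(df, :имя)).
filter!(r -> !ismissing(r.name), rows)
println(length(rows), " row(s) with a name")
```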

Replacing missing data

If you need to replace missing data in an array, it is convenient to use the function Missings.replace():

In [ ]:
рост = collect(Missings.replace(df_missing.рост, 170))
Out[0]:
4-element Vector{Int64}:
 171
 162
 170
 165

To replace missing data in a table column in place, you can use the function replace!():

In [ ]:
replace!(df_missing.возраст, missing => 15)
Out[0]:
4-element Vector{Union{Missing, Float64}}:
 16.0
 15.0
 15.0
 15.0
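
For comparison, the non-mutating replace() from Base returns a new vector, and when missing is replaced its element type no longer allows missing. A small sketch with an illustrative copy of the column values:

```julia
age = Union{Missing, Float64}[16.0, missing, missing, 15.0]

# Non-mutating: a new vector with a plain Float64 element type.
filled = replace(age, missing => 15.0)
println(typeof(filled))

# Mutating: the values change, but the container keeps its Union element type.
replace!(age, missing => 15.0)
println(typeof(age))
```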

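Note that replace!() leaves the element type of the column unchanged: it is still Union{Missing, Float64}. Another way to replace missing data is to broadcast the function coalesce(), which returns its first non-missing argument: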

In [ ]:
df_missing.имя = coalesce.(df_missing.имя, "Ваня")
df_missing.рост = coalesce.(df_missing.рост, mean(skipmissing(df_missing.рост)))
df_missing
Out[0]:
4×3 DataFrame
 Row │ имя     возраст   рост
     │ String  Float64?  Real
─────┼─────────────────────────
   1 │ Ваня        16.0  171
   2 │ Коля        15.0  162
   3 │ Юра         15.0  166.0
   4 │ Миша        15.0  165
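
The рост column ends up with the abstract element type Real because integer values are mixed with a floating-point fill value. One possible way to keep a concrete element type, sketched here on a standalone vector with illustrative names, is to convert the data to floating point before filling:

```julia
using Statistics

height = [171, 162, missing, 165]
fill_value = mean(skipmissing(height))    # a Float64

# Mixing Int values with a Float64 fill value gives mixed elements.
mixed = coalesce.(height, fill_value)
println(typeof(mixed))

# float leaves missing untouched, so converting first keeps everything Float64.
uniform = coalesce.(float.(height), fill_value)
println(typeof(uniform))
```

Whether to convert to floating point or to round the fill value to an integer depends on what the column is used for afterwards.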

Conclusion

This case study explored how to create, sort, standardise, find, skip and replace missing data in Julia.