
Missing data in Julia

Introduction

Dealing with missing data is a common task in data preprocessing. Although sometimes missing values indicate a significant event in the data, they often represent unreliable or unusable data points. In any case, Julia has many features for working with missing data.

Creating missing data

Missing data in Julia can be represented in several forms. One of them is the value NaN (NaN64), [not a number](https://engee.com/helpcenter/stable/ru/julia/base/numbers.html#Base.NaN):

In [ ]:
x_64 = [NaN, 8, 15, 16, 23, 42]
println("Type of $(x_64[1]): ", typeof(x_64[1]))

x_64 = [NaN64, 8, 15, 16, 23, 42]
print("Type of $(x_64[1]): ", typeof(x_64[1]))
Type of NaN: Float64
Type of NaN: Float64

If necessary, the NaN value can be specified as a floating-point number of lower precision:

In [ ]:
x_32 = [NaN32, 8, 15, 16, 23, 42]
Out[0]:
6-element Vector{Float32}:
 NaN
   8.0
  15.0
  16.0
  23.0
  42.0
In [ ]:
x_16 = [NaN16, 8, 15, 16, 23, 42]
Out[0]:
6-element Vector{Float16}:
 NaN
   8.0
  15.0
  16.0
  23.0
  42.0

In essence, this form of missing data is a special value of a floating-point type that propagates through calculations. Such values often arise as the result of undefined operations:

In [ ]:
x_nan = Inf*0
Out[0]:
NaN
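
The propagation described above can be checked directly. The following is a minimal sketch using only Base; the variable names are illustrative and not part of the original example.

```julia
# NaN is an ordinary Float64 value, so arithmetic on it does not throw.
a = NaN + 1.0              # NaN
b = sum([1.0, NaN, 3.0])   # NaN: a single NaN contaminates the whole reduction
c = NaN == NaN             # false: NaN is not equal to anything, itself included
d = isequal(NaN, NaN)      # true: isequal treats NaN values as identical
e = isnan(0 / 0)           # true: 0/0 is another undefined operation

println((a, b, c, d, e))
```

Because NaN == NaN is false, a test such as x == NaN never detects NaN; use isnan() or isequal() instead.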

As a rule, NaN values in Julia are used to propagate the indeterminacy of a numerical result through calculations. For working with missing data specifically, it is more appropriate to use the special object missing, which also propagates through calculations. It is the only instance of the type Missing:

In [ ]:
typeof(missing)
Out[0]:
Missing
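
To illustrate how missing propagates, here is a small sketch in Base Julia. It also shows the three-valued logic that applies to logical operators; the variable names are chosen for illustration only.

```julia
s = missing + 1                    # missing: arithmetic propagates missing
eq = missing == 1                  # missing, not false
t = true | missing                 # true: the result does not depend on the unknown value
f = false & missing                # false, for the same reason
u = true & missing                 # missing: the result depends on the unknown value
same = isequal(missing, missing)   # true: isequal never returns missing

println((s, eq, t, f, u, same))
```

Since missing == 1 evaluates to missing rather than to a Bool, it cannot be used as the condition of an if statement; ismissing() or isequal() is the safer choice there.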

This means, in particular, that arrays containing missing alongside other values have a heterogeneous element type, namely a union of types:

In [ ]:
x_missing = [missing, 8, 15, 16, 23, 42]
Out[0]:
6-element Vector{Union{Missing, Int64}}:
   missing
  8
 15
 16
 23
 42
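
The union element type can be inspected and taken apart with functions from Base. The sketch below assumes a 64-bit system (where Int is Int64); the names x, T, S and plain are illustrative.

```julia
x = [missing, 8, 15]
T = eltype(x)               # Union{Missing, Int64} on a 64-bit system
S = nonmissingtype(T)       # Int64: the element type with Missing removed

# A plain Vector{Int} cannot store missing; the assignment throws a MethodError.
plain = [8, 15]
rejected = try
    plain[1] = missing
    false
catch err
    err isa MethodError
end

println((T, S, rejected))
```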

The library [Missings.jl](https://engee.com/helpcenter/stable/ru/julia/DataFrames/man/missing.html) can be useful when working with missing. For example, you can use it to create arrays of such objects:

In [ ]:
import Pkg; Pkg.add("Missings") # install the package
In [ ]:
using Missings # load the package

# create arrays of missing values:
# arrays with element type Missing
@show missings(1)
@show missings(3)
@show missings(3,1)

# array with a union element type
@show missings(Int,3,3);
missings(1) = [missing]
missings(3) = [missing, missing, missing]
missings(3, 1) = [missing; missing; missing;;]
missings(Int, 3, 3) = Union{Missing, Int64}[missing missing missing; missing missing missing; missing missing missing]

Sorting missing data

When sorting arrays with missing data, keep in mind that the missing object is considered greater than any value it is compared with:

In [ ]:
isless(Inf, missing)
Out[0]:
true

Therefore, in an ascending sort the missing values are automatically placed at the end. To change where missing values land, pass the keyword argument lt = missingsmallest (from Missings.jl), which treats missing as smaller than any other value. In the descending sort below, this moves missing to the end instead of the beginning:

In [ ]:
sort(x_missing, rev=true, lt = missingsmallest)
Out[0]:
6-element Vector{Union{Missing, Int64}}:
 42
 23
 16
 15
  8
   missing
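
For comparison, the default placement of missing can be seen without any packages. The sketch below uses only Base; the by-function in the last call is an illustrative alternative to missingsmallest and is not taken from the original example.

```julia
x = [3, missing, 1, 2]

asc = sort(x)                # [1, 2, 3, missing]: missing is greatest, so it goes last
desc = sort(x, rev = true)   # [missing, 3, 2, 1]: reversing puts it first

# Sort by a (flag, value) tuple: false < true, so missing entries come first.
missing_first = sort(x, by = v -> (!ismissing(v), v))   # [missing, 1, 2, 3]

println(asc)
println(desc)
println(missing_first)
```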

Normalization of missing data

To look at examples of handling gaps in tabular data, we will load the libraries DataFrames.jl and Statistics.jl.

In [ ]:
Pkg.add(["DataFrames", "Statistics"])
using DataFrames, Statistics

Let's create a table of test data for processing:

In [ ]:
df_missing = DataFrame(
    имя     = ["NULL", "Коля", "Юра", "Миша"],
    возраст = [16, NaN, missing, 15],
    рост    = [171, 162, 999, 165],
)
Out[0]:
4×3 DataFrame
 Row │ имя     возраст    рост
     │ String  Float64?   Int64
─────┼──────────────────────────
   1 │ NULL         16.0    171
   2 │ Коля        NaN      162
   3 │ Юра     missing      999
   4 │ Миша         15.0    165

When data from different sources is combined into a single table, missing data may be encoded by different values. To standardize these values, it is convenient to use the function declaremissings() from the library Impute.jl:

In [ ]:
Pkg.add("Impute")
In [ ]:
using Impute

df_missing = Impute.declaremissings(df_missing; values=(NaN, 999, "NULL"))
Out[0]:
4×3 DataFrame
 Row │ имя      возраст    рост
     │ String?  Float64?   Int64?
─────┼─────────────────────────────
   1 │ missing       16.0      171
   2 │ Коля     missing        162
   3 │ Юра      missing    missing
   4 │ Миша          15.0      165

Now, as we can see, all missing data has been reduced to a single form: the missing object.
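
The same standardization can be sketched for individual vectors with replace() from Base, which matches values using isequal() and therefore also finds NaN. This is an illustrative alternative, not the Impute.jl implementation; the vectors mirror the columns of the test table, with names transliterated.

```julia
age = [16, NaN, missing, 15]                 # Vector{Union{Missing, Float64}}
height = [171, 162, 999, 165]
name = ["NULL", "Kolya", "Yura", "Misha"]

# Each call returns a new vector whose element type is widened to allow missing.
age_clean = replace(age, NaN => missing)
height_clean = replace(height, 999 => missing)
name_clean = replace(name, "NULL" => missing)

println(age_clean)
println(height_clean)
println(name_clean)
```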

Finding missing data

To determine whether a value is NaN, it is convenient to use the function isnan():

In [ ]:
isnan.(x_64)
Out[0]:
6-element BitVector:
 1
 0
 0
 0
 0
 0

A similar function, ismissing(), identifies missing objects:

In [ ]:
ismissing.(x_missing)
Out[0]:
6-element BitVector:
 1
 0
 0
 0
 0
 0
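
Beyond an element-wise mask, the same predicate can be passed to other Base functions to count or locate missing values. A minimal sketch with an illustrative vector:

```julia
x = [missing, 8, 15, missing, 23]

n = count(ismissing, x)        # 2: how many values are missing
idx = findall(ismissing, x)    # [1, 4]: positions of the missing values
has = any(ismissing, x)        # true: at least one value is missing

println((n, idx, has))
```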

Let's determine the location of the missing values in the table:

In [ ]:
df_mask = ismissing.(df_missing)
Out[0]:
4×3 DataFrame
 Row │ имя    возраст  рост
     │ Bool   Bool     Bool
─────┼──────────────────────
   1 │  true    false  false
   2 │ false     true  false
   3 │ false     true   true
   4 │ false    false  false

The missing objects often cause problems when processing data. To avoid this, you can skip, exclude, or replace them.

Skipping missing data

To remove NaN values from a vector, it is enough to use the function filter() from the Julia base library.

In [ ]:
filter(!isnan, x_64)
Out[0]:
5-element Vector{Float64}:
  8.0
 15.0
 16.0
 23.0
 42.0

When an array containing missing is filtered, the element type of the resulting array does not change. To narrow the array type to the type of the remaining values, you can use the function disallowmissing() from the library Missings.jl.

In [ ]:
@show x = filter(!ismissing, x_missing)
disallowmissing(x)
x = filter(!ismissing, x_missing) = Union{Missing, Int64}[8, 15, 16, 23, 42]
Out[0]:
5-element Vector{Int64}:
  8
 15
 16
 23
 42

The following line of code shows how to filter out rows with missing in a given column of tabular data using the function filter().

In [ ]:
filter(:имя => !ismissing, df_missing)
Out[0]:
3×3 DataFrame
 Row │ имя      возраст    рост
     │ String?  Float64?   Int64?
─────┼─────────────────────────────
   1 │ Коля     missing        162
   2 │ Юра      missing    missing
   3 │ Миша          15.0      165

A similar result is obtained with the function skipmissing() from the base library:

In [ ]:
collect(skipmissing(x_missing))
Out[0]:
5-element Vector{Int64}:
  8
 15
 16
 23
 42
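
skipmissing() returns a lazy iterator, so it can be passed straight to reductions without collecting it first. The sketch below uses Base and the Statistics standard library on the same values as x_missing; the variable names are illustrative.

```julia
using Statistics

x = [missing, 8, 15, 16, 23, 42]

total_raw = sum(x)                  # missing: the reduction propagates missing
total = sum(skipmissing(x))         # 104
avg = mean(skipmissing(x))          # 20.8
largest = maximum(skipmissing(x))   # 42

println((total_raw, total, avg, largest))
```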

For dropping rows with missing from tabular data, the library DataFrames.jl provides a more convenient function, dropmissing():

In [ ]:
dropmissing(df_missing)
Out[0]:
1×3 DataFrame
 Row │ имя     возраст  рост
     │ String  Float64  Int64
─────┼────────────────────────
   1 │ Миша       15.0    165

To drop only the rows in which a specific column contains missing, pass the name of that column as the second argument:

In [ ]:
dropmissing(df_missing, :имя)
Out[0]:
3×3 DataFrame
 Row │ имя     возраст    рост
     │ String  Float64?   Int64?
─────┼────────────────────────────
   1 │ Коля    missing        162
   2 │ Юра     missing    missing
   3 │ Миша         15.0      165

In both cases, the function dropmissing() returns a new table. If the original table does not need to be preserved for the task at hand, rows with missing data can be deleted in place using dropmissing!().
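
A minimal sketch of the in-place variant, assuming DataFrames.jl is installed. The small table and its column names are made up for illustration.

```julia
using DataFrames

df = DataFrame(name = ["Kolya", missing, "Misha"],
               height = [162, 171, missing])

# Remove rows where :name is missing; df itself is modified.
dropmissing!(df, :name)
rows_after_first = nrow(df)      # 2

# Remove rows with missing in any column.
dropmissing!(df)
rows_after_second = nrow(df)     # 1

println(df)
```

As with dropmissing(), the affected columns are converted to element types that no longer allow missing.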

Replacing missing data

To replace the missing data in an array, it is convenient to use the function Missings.replace(). It returns a lazy iterator, which is why the result is passed to collect():

In [ ]:
рост = collect(Missings.replace(df_missing.рост, 170))
Out[0]:
4-element Vector{Int64}:
 171
 162
 170
 165

To replace the missing data in a table column in place, you can use the function replace!():

In [ ]:
replace!(df_missing.возраст, missing => 15)
Out[0]:
4-element Vector{Union{Missing, Float64}}:
 16.0
 15.0
 15.0
 15.0
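
For comparison, the non-mutating replace() from Base allocates a new array and, when missing is replaced, narrows the element type. The following sketch uses an illustrative vector rather than the table column.

```julia
age = [16.0, missing, missing, 15.0]     # Vector{Union{Missing, Float64}}

# replace() returns a new vector; Missing is removed from its element type.
filled = replace(age, missing => 15.0)
T_new = eltype(filled)                   # Float64

# replace!() modifies the vector in place, so its element type stays the same.
replace!(age, missing => 15.0)
T_inplace = eltype(age)                  # Union{Missing, Float64}

println((T_new, T_inplace))
```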

Note that because replace!() works in place, the element type of the resulting column does not change: it still allows missing. Another way is to use the function coalesce():

In [ ]:
df_missing.имя = coalesce.(df_missing.имя, "Ваня")
df_missing.рост = coalesce.(df_missing.рост, mean(skipmissing(df_missing.рост)))
df_missing
Out[0]:
4×3 DataFrame
 Row │ имя     возраст    рост
     │ String  Float64?   Real
─────┼──────────────────────────
   1 │ Ваня         16.0    171
   2 │ Коля         15.0    162
   3 │ Юра          15.0  166.0
   4 │ Миша         15.0    165
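
The Real element type of the last column arises because the integer heights are mixed with a Float64 mean. One possible way to keep a concrete type, sketched here with Base and the Statistics standard library, is to round the replacement value or to convert the column to floating point first. The vector h is illustrative.

```julia
using Statistics

h = [171, 162, missing, 165]
m = mean(skipmissing(h))              # 166.0, a Float64

mixed = coalesce.(h, m)               # Int64 and Float64 values: eltype is Real
as_int = coalesce.(h, round(Int, m))  # replacement rounded to Int: eltype is Int64
as_float = coalesce.(float.(h), m)    # column converted first: eltype is Float64

println((eltype(mixed), eltype(as_int), eltype(as_float)))
```

Which variant is appropriate depends on whether the column should stay integer-valued.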

Conclusion

In this example, we discussed ways to create, organize, normalize, search, skip, and replace missing data in Julia.