Missing data in Julia
Introduction
Dealing with missing data is a common task in data preprocessing. Although sometimes missing values indicate a significant event in the data, they often represent unreliable or unusable data points. In any case, Julia has many features for working with missing data.
Creating missing data
Missing data in Julia can be represented in several forms. One of them is the value NaN (NaN64), which stands for [not a number](https://engee.com/helpcenter/stable/ru/julia/base/numbers.html#Base.NaN):
x_64 = [NaN, 8, 15, 16, 23, 42]
println("Type of $(x_64[1]): ", typeof(x_64[1]))
x_64 = [NaN64, 8, 15, 16, 23, 42]
print("Type of $(x_64[1]): ", typeof(x_64[1]))
If necessary, NaN can also be given as a floating-point value of lower precision:
x_32 = [NaN32, 8, 15, 16, 23, 42]
x_16 = [NaN16, 8, 15, 16, 23, 42]
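As a quick check of the point above, the following sketch (Base Julia only) shows that the NaN literal determines the element type of the whole array, because the integers next to it are promoted to that floating-point type. The variable names simply mirror the ones used above.

```julia
# Each NaN literal has its own floating-point type
@show typeof(NaN)    # Float64 (NaN and NaN64 are the same value)
@show typeof(NaN32)  # Float32
@show typeof(NaN16)  # Float16

# Integers in the array literal are promoted to the float type of the NaN
x_32 = [NaN32, 8, 15, 16, 23, 42]
x_16 = [NaN16, 8, 15, 16, 23, 42]
@show eltype(x_32)   # Float32
@show eltype(x_16)   # Float16
```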
In essence, these are special values of the floating-point types that propagate through calculations. They often appear as the result of undefined operations:
x_nan = Inf*0
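A minimal sketch of how NaN arises and behaves, using only Base Julia. The vector x is an illustrative assumption, not data from this example.

```julia
# Undefined floating-point operations produce NaN
@show Inf * 0
@show 0 / 0
@show Inf - Inf

# NaN propagates through arithmetic and reductions
x = [NaN, 8, 15]
@show NaN + 1             # NaN
@show sum(x)              # NaN

# NaN is not equal to itself under ==, so test with isnan (or isequal)
@show NaN == NaN          # false
@show isnan(NaN)          # true
@show isequal(NaN, NaN)   # true
```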
As a rule, NaN values in Julia are used to propagate the uncertainty of a numerical result through calculations. For missing data specifically, it is more appropriate to use the special object missing, which also propagates through calculations. This object is the only instance of the type Missing:
typeof(missing)
This means, in particular, that an array containing missing alongside other values has a heterogeneous element type:
x_missing = [missing, 8, 15, 16, 23, 42]
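The sketch below, in Base Julia only, illustrates both points: the element type of such an array is a Union, and missing propagates through arithmetic, comparisons, and reductions. Int64 in the comments assumes a 64-bit system.

```julia
x_missing = [missing, 8, 15, 16, 23, 42]

# The element type is a Union of Missing and the type of the other values
@show eltype(x_missing)    # Union{Missing, Int64}

# missing propagates through arithmetic, comparisons and reductions
@show missing + 1          # missing
@show missing == 1         # missing (not false)
@show sum(x_missing)       # missing

# Logical operators follow three-valued logic
@show true | missing       # true
@show false & missing      # false
@show true & missing       # missing
```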
The [Missings.jl](https://engee.com/helpcenter/stable/ru/julia/DataFrames/man/missing.html) library can be useful when working with missing. For example, it can create arrays of such objects:
import Pkg; Pkg.add("Missings") # install the package
using Missings # load the package
# creating arrays of missing values:
# arrays of type Missing
@show missings(1)
@show missings(3)
@show missings(3,1)
# array with a Union element type
@show missings(Int,3,3);
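For comparison, similar arrays can be built without Missings.jl. The sketch below uses only Base constructors. It is an alternative shown for illustration and makes no claim about how missings() works internally.

```julia
# A vector whose only possible value is missing
v = fill(missing, 3)
@show typeof(v)            # Vector{Missing}

# A 3x3 matrix that can hold Int or missing, initialised with missing
m = Array{Union{Missing, Int}}(missing, 3, 3)
@show eltype(m)
@show all(ismissing, m)    # true

# Such an array accepts ordinary values later on
m[1, 1] = 42
@show m[1, 1]              # 42
```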
Ordering missing data
When sorting arrays that contain missing data, keep in mind that missing is considered greater than any object it is compared with:
isless(Inf, missing)
Therefore, missing values are automatically placed at the end of an ascending sort. To change where missing values land, pass the keyword argument lt = missingsmallest (a function from Missings.jl that treats missing as the smallest value):
sort(x_missing, rev=true, lt = missingsmallest)
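The effect of the comparison function can be sketched in Base Julia alone. The comparator missing_first below is a hypothetical stand-in written for illustration. It is assumed to behave like missingsmallest, but it is not the library's implementation.

```julia
x = [15, missing, 8, 42]

# Default ordering: missing counts as the largest value
@show sort(x)              # [8, 15, 42, missing]
@show sort(x, rev=true)    # [missing, 42, 15, 8]

# Hypothetical Base-only comparator that treats missing as the smallest value
missing_first(a, b) = ismissing(a) ? !ismissing(b) : (!ismissing(b) && isless(a, b))

@show sort(x, lt=missing_first)            # [missing, 8, 15, 42]
@show sort(x, lt=missing_first, rev=true)  # [42, 15, 8, missing]
```

With rev=true, the custom comparator moves missing from the front of the descending result to its end.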
Normalization of missing data
To look at examples of handling gaps in tabular data, we load the DataFrames.jl and Statistics.jl libraries.
Pkg.add(["DataFrames", "Statistics"])
using DataFrames, Statistics
Creating a table of test data for processing:
df_missing = DataFrame(
    имя     = ["NULL", "Коля", "Юра", "Миша"],
    возраст = [16, NaN, missing, 15],
    рост    = [171, 162, 999, 165],
)
When data in different formats are combined into a single table, missing data may be encoded by different values. To standardize these values, it is convenient to use the function declaremissings() from the Impute.jl library:
Pkg.add("Impute")
using Impute
df_missing = Impute.declaremissings(df_missing; values=(NaN, 999, "NULL"))
As we can see, the missing data is now reduced to a single form: the object missing.
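The idea behind this kind of standardization can be sketched without any packages. The helper to_missing and the tuple sentinels below are hypothetical names introduced for illustration. This is not how Impute.jl is implemented, only a Base Julia approximation of the same effect on plain vectors.

```julia
# Hypothetical sentinel values that encode "no data" in the raw input
sentinels = (NaN, 999, "NULL")

# Hypothetical helper: map any sentinel to missing, leave other values as-is.
# isequal is used because NaN == NaN is false.
to_missing(x) = any(s -> isequal(x, s), sentinels) ? missing : x

age    = [16, NaN, missing, 15]     # element type Union{Missing, Float64}
height = [171, 162, 999, 165]
name   = ["NULL", "Коля", "Юра", "Миша"]

clean_age    = map(to_missing, age)
clean_height = map(to_missing, height)
clean_name   = map(to_missing, name)

@show clean_age
@show clean_height
@show clean_name
```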
Search for missing data
To determine whether a value is NaN, it is convenient to use the function isnan():
isnan.(x_64)
A similar function, ismissing(), identifies missing objects:
ismissing.(x_missing)
Let's determine the location of the missing values in the table:
df_mask = ismissing.(df_missing)
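Beyond a Boolean mask, it is often useful to know whether, how many, and where values are missing. A minimal Base Julia sketch on an illustrative vector (the data is assumed, not taken from the table above):

```julia
x = [missing, 8, 15, missing, 23]

@show any(ismissing, x)       # true: at least one value is missing
@show count(ismissing, x)     # 2
@show findall(ismissing, x)   # [1, 4]

# Comparing with == does not work for detection: the result is itself missing
@show x[1] == missing         # missing
```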
The missing object often causes problems during data processing. To deal with this, you can skip, drop, or replace such values.
Skipping missing data
To remove NaN values from a vector, it is enough to use the function filter() from Julia's Base library:
filter(!isnan, x_64)
When an array containing missing is filtered, the element type of the resulting array does not change. To narrow the element type to that of the remaining values, you can use the function disallowmissing() from the Missings.jl library:
@show x = filter(!ismissing, x_missing)
disallowmissing(x)
The following line shows how to filter missing out of tabular data using the function filter():
filter(:имя => !ismissing, df_missing)
The function skipmissing() from the Base library gives a similar result:
collect(skipmissing(x_missing))
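The difference between the two approaches shows up in the element type of the result. The sketch below uses Base Julia only. Int64 in the comments assumes a 64-bit system.

```julia
x = [missing, 8, 15, 16, 23, 42]

# filter keeps the original element type, which still allows missing
filtered = filter(!ismissing, x)
@show eltype(filtered)       # Union{Missing, Int64}

# collect(skipmissing(...)) yields an array of the non-missing type
collected = collect(skipmissing(x))
@show eltype(collected)      # Int64

# skipmissing is lazy, so reductions can consume it without an intermediate array
@show sum(skipmissing(x))    # 104
```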
For skipping missing in tabular data, the DataFrames.jl library offers a more convenient function, dropmissing():
dropmissing(df_missing)
To drop only the rows that have missing in a specific column, pass the name of that column as the second argument:
dropmissing(df_missing, :имя)
Note that dropmissing() returns a new table. If the original table does not need to be kept for the task at hand, you can delete the rows with missing data in place using dropmissing!().
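To make the row-dropping logic concrete, here is a Base Julia sketch that imitates it on a vector of named tuples instead of a DataFrame. The helper drop_rows and the sample rows are hypothetical. They only approximate the behaviour described above and are not the DataFrames.jl implementation.

```julia
# A tiny table as a vector of rows (named tuples); values are illustrative
rows = [
    (name = missing, age = 16,      height = 171),
    (name = "Коля",  age = missing, height = 162),
    (name = "Миша",  age = 15,      height = 165),
]

# Hypothetical helpers mirroring the two call forms described above
drop_rows(rows) = filter(r -> !any(ismissing, values(r)), rows)
drop_rows(rows, col::Symbol) = filter(r -> !ismissing(getproperty(r, col)), rows)

complete   = drop_rows(rows)          # rows with no missing in any column
with_names = drop_rows(rows, :name)   # rows with no missing in :name
@show length(complete)      # 1
@show length(with_names)    # 2
@show length(rows)          # 3: the original is untouched

# In-place variant, analogous in spirit to dropmissing!: modifies rows itself
filter!(r -> !any(ismissing, values(r)), rows)
@show length(rows)          # 1
```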
Replacing missing data
To replace missing data in an array, it is convenient to use the function Missings.replace():
рост = collect(Missings.replace(df_missing.рост, 170))
To replace missing data in a table column, you can use the function replace!():
replace!(df_missing.возраст, missing => 15)
Note that the element type of the resulting column does not change: it still allows missing. Another way is to use the function coalesce():
df_missing.имя = coalesce.(df_missing.имя, "Ваня")
df_missing.рост = coalesce.(df_missing.рост, mean(skipmissing(df_missing.рост)))
df_missing
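The replacement options differ mainly in the element type of the result. The following sketch compares them on a plain vector, using Base Julia and the Statistics standard library. The vector height and the fill value 170.0 are illustrative assumptions.

```julia
using Statistics  # standard library, provides mean

height = [171.0, 162.0, missing, 165.0]

# replace! works in place, so the element type still allows missing
h1 = copy(height)
replace!(h1, missing => 170.0)
@show eltype(h1)    # Union{Missing, Float64}

# replace returns a new array whose element type no longer includes Missing
h2 = replace(height, missing => 170.0)
@show eltype(h2)    # Float64

# coalesce returns its first non-missing argument; broadcast it over the vector
avg = mean(skipmissing(height))   # 166.0
h3 = coalesce.(height, avg)
@show h3            # [171.0, 162.0, 166.0, 165.0]
```

Filling with the mean of the observed values keeps the column mean unchanged, which is one common reason to prefer it over a fixed constant.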
Conclusion
In this example, we discussed ways to create, organize, normalize, search, skip, and replace missing data in Julia.