Missing data in Julia¶
Introduction¶
Dealing with missing data is a common task in data preprocessing. While sometimes missing values indicate a meaningful event in the data, they often represent unreliable or unusable data points. In any case, Julia has many options for dealing with missing data.
Creating missing data¶
Missing data in Julia can be represented in several forms. One of them is the floating-point value NaN (not a number), which is also available under the name NaN64:
x_64 = [NaN, 8, 15, 16, 23, 42]
println("Type of $(x_64[1]): ", typeof(x_64[1]))
x_64 = [NaN64, 8, 15, 16, 23, 42]
print("Type of $(x_64[1]): ", typeof(x_64[1]))
The value NaN can also be written as a narrower floating-point type if necessary:
x_32 = [NaN32, 8, 15, 16, 23, 42]
x_16 = [NaN16, 8, 15, 16, 23, 42]
In essence, NaN is a concrete floating-point value that propagates through computations. Such values often appear as the result of undefined operations:
x_nan = Inf*0
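The propagation of NaN described above can be sketched with Base Julia alone. The variable names below are illustrative and are not used elsewhere in this article.

```julia
# NaN is produced by undefined floating-point operations
nan_from_product = Inf * 0
nan_from_division = 0 / 0

# Any arithmetic that involves NaN yields NaN again
total = sum([1.0, nan_from_product, 3.0])

# NaN is not equal to itself, so test for it with isnan() rather than ==
equal_to_itself = (nan_from_product == nan_from_product)

println(isnan(nan_from_division), " ", isnan(total), " ", equal_to_itself)
```

Because a single NaN contaminates the whole sum, it is usually worth checking for it with isnan() before aggregating.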
In general, Julia uses NaN to propagate the result of an undefined numeric computation. For missing data specifically, it is more appropriate to use the special object missing, which also propagates through computations. It is the only instance of the type Missing:
typeof(missing)
This means, in particular, that an array containing missing alongside other values has a union element type, such as Union{Missing, Int64}:
x_missing = [missing, 8, 15, 16, 23, 42]
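As a minimal sketch of how missing propagates and how it affects the element type of an array, the following uses only Base Julia. The variable names are illustrative.

```julia
# Arithmetic and comparisons with missing return missing
sum_with_missing = 1 + missing
comparison = (missing == missing)

# Logical operators follow three-valued logic
known_true = true | missing     # true whatever the missing value would be
known_false = false & missing   # false whatever the missing value would be
unknown = true & missing        # depends on the missing value, so it stays missing

# A vector that mixes integers and missing gets a union element type
v = [missing, 8, 15]
println(eltype(v))
println(nonmissingtype(eltype(v)))
```

Note that missing == missing is itself missing rather than true, which is why ismissing() is the reliable way to test for it.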
The Missings.jl library provides additional tools for working with missing. For example, it can create arrays filled with missing:
import Pkg; Pkg.add("Missings") # install the package
using Missings # load the package
# creating arrays with missing values:
# arrays of type Missing
@show missings(1)
@show missings(3)
@show missings(3,1)
# array with a union element type
@show missings(Int,3,3);
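For comparison, similar arrays can be built without any extra package. This is a sketch using Base constructors only; it is not how Missings.jl is implemented, and the variable names are illustrative.

```julia
# A vector that can only hold missing (element type Missing)
only_missing = fill(missing, 3)

# A 3x3 matrix that can hold Int or missing, initially all missing
int_or_missing = Array{Union{Missing, Int}}(missing, 3, 3)

# Values can be assigned later because the element type allows Int
int_or_missing[1, 1] = 42

println(eltype(only_missing))
println(eltype(int_or_missing))
```

Declaring the union element type up front matters: an array whose element type is just Missing cannot store any other value later.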
Sorting missing data¶
When sorting arrays with missing data, keep in mind that isless() treats the object missing as greater than any other value:
isless(Inf, missing)
Therefore, missing values are grouped together automatically and end up at the end of an ascending sort. To place them at the other end, pass the keyword argument lt = missingsmallest (from Missings.jl), which treats missing as smaller than any other value. In a descending sort, for example, this keeps missing at the end:
sort(x_missing, rev=true, lt = missingsmallest)
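The effect of the ordering on the position of missing can be sketched with Base Julia. The comparator missing_first below is a hypothetical helper written for this illustration; it only mimics the idea of ordering missing before all other values and is not the Missings.jl implementation.

```julia
v = [15, missing, 8, 42]

# Default ordering: isless() puts missing after every other value
ascending = sort(v)                 # 8, 15, 42, missing
descending = sort(v, rev=true)      # missing, 42, 15, 8

# Hypothetical comparator (not part of any library) that orders missing first
missing_first(a, b) = !ismissing(b) && (ismissing(a) || isless(a, b))

ascending_missing_first = sort(v, lt=missing_first)            # missing, 8, 15, 42
descending_missing_last = sort(v, rev=true, lt=missing_first)  # 42, 15, 8, missing
```

Combining rev=true with a comparator that treats missing as the smallest value is what moves missing to the end of a descending sort.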
Normalisation of missing data¶
Let's load the DataFrames.jl and Statistics.jl libraries to look at examples of handling missing data in tables.
Pkg.add(["DataFrames", "Statistics"])
using DataFrames, Statistics
Let's create a table of test data for processing:
df_missing = DataFrame(
имя = ["NULL", "Коля", "Юра", "Миша"],
возраст = [16, NaN, missing, 15],
рост = [171, 162, 999, 165],
)
When data from different sources is combined in one table, missing values may be encoded in different ways: in the table above, as NaN, as the sentinel number 999, as the string "NULL", and as missing itself. To standardise them, it is convenient to use the declaremissings() function from the Impute.jl library:
Pkg.add("Impute")
using Impute
df_missing = Impute.declaremissings(df_missing; values=(NaN, 999, "NULL"))
As we can see, all missing data is now represented in one form: the object missing.
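The idea behind this kind of normalisation can be sketched for plain vectors in Base Julia. The helper to_missing below is hypothetical and is not the Impute.jl implementation; it only shows why sentinel values should be compared with isequal() rather than ==.

```julia
# Hypothetical helper (not the Impute.jl implementation): map sentinel values to missing.
# isequal() is used because NaN == NaN is false, while isequal(NaN, NaN) is true.
to_missing(data, sentinels) =
    [any(s -> isequal(x, s), sentinels) ? missing : x for x in data]

sentinels = (NaN, 999, "NULL")

name_col = to_missing(["NULL", "A", "B"], sentinels)
age_col = to_missing([16, NaN, missing, 15], sentinels)
height_col = to_missing([171, 162, 999, 165], sentinels)

println(name_col)
println(age_col)
println(height_col)
```

Values that are already missing pass through unchanged, so the helper can be applied to columns that mix several encodings.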
Searching for missing data¶
To check whether a value is NaN, it is convenient to use the function isnan():
isnan.(x_64)
A similar function, ismissing(), detects the object missing:
ismissing.(x_missing)
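Beyond an element-wise mask, it is often enough to know whether, how many, and where values are missing. A short sketch with Base functions, using an illustrative vector:

```julia
v = [missing, 8, 15, missing, 23]

has_missing = any(ismissing, v)      # is there at least one missing value?
n_missing = count(ismissing, v)      # how many values are missing?
positions = findall(ismissing, v)    # indices of the missing values

println(has_missing, " ", n_missing, " ", positions)
```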
Let's determine the location of missing values in the table:
df_mask = ismissing.(df_missing)
The object missing often causes problems in data processing. To avoid them, missing values can be skipped, dropped or replaced.
Skipping missing data¶
To remove NaN values from a vector, use the filter() function from Julia's base library:
filter(!isnan, x_64)
When an array containing missing is filtered, the element type of the result does not change: it still allows missing. The disallowmissing() function from the Missings.jl library converts the array to the type of the remaining values:
@show x = filter(!ismissing, x_missing)
disallowmissing(x)
The filter() function can also remove rows with missing from tabular data, here based on a single column:
filter(:имя => !ismissing, df_missing)
For arrays, the skipmissing() function from the base library gives a similar result; collecting it yields an array whose element type no longer allows missing:
collect(skipmissing(x_missing))
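The difference between the two approaches for arrays shows up in the element type of the result. A minimal sketch in Base Julia, with illustrative variable names:

```julia
v = [missing, 8, 15, 16]

filtered = filter(!ismissing, v)      # values removed, element type unchanged
collected = collect(skipmissing(v))   # element type narrowed to Int

println(eltype(filtered))
println(eltype(collected))

# skipmissing() is lazy, so reductions can consume it without an intermediate array
total = sum(skipmissing(v))
largest = maximum(skipmissing(v))
```

When only an aggregate is needed, passing skipmissing(v) directly to the reduction avoids building a filtered copy.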
To drop missing from tabular data, the DataFrames.jl library has a more convenient function, dropmissing():
dropmissing(df_missing)
To drop only the rows that have missing in a particular column, pass the name of that column as the second argument:
dropmissing(df_missing, :имя)
In both cases, the function dropmissing() returns a new table. If the original table does not need to be preserved, rows with missing data can be removed in place using dropmissing!().
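Conceptually, dropping rows means keeping only those in which the inspected columns are all present. The sketch below illustrates this on a NamedTuple of vectors used as a stand-in for a table; the helper complete_rows is hypothetical and this is not how DataFrames.jl implements dropmissing().

```julia
# A column table as a NamedTuple of vectors (a stand-in for a DataFrame)
table = (name = ["A", missing, "C", "D"], age = [16, 15, missing, 14])

# Hypothetical helper: true for each row in which none of the given columns is missing
complete_rows(columns...) = [!any(ismissing, row) for row in zip(columns...)]

# Drop rows with missing in any column
keep_all = complete_rows(values(table)...)
cleaned = map(col -> col[keep_all], table)

# Drop rows with missing in the name column only
keep_name = complete_rows(table.name)
cleaned_by_name = map(col -> col[keep_name], table)

println(cleaned)
println(cleaned_by_name)
```

As with dropmissing(), the original table is left untouched and a new one is returned.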
Replacing missing data¶
To replace missing data in an array, it is convenient to use the function Missings.replace(), which returns an iterator that can be collected into an array:
рост = collect(Missings.replace(df_missing.рост, 170))
To replace missing data in a table column in place, you can use the function replace!():
replace!(df_missing.возраст, missing => 15)
Note that the element type of the column does not change: it still allows missing. Another way is to use the function coalesce(), which returns its first non-missing argument:
df_missing.имя = coalesce.(df_missing.имя, "Ваня")
df_missing.рост = coalesce.(df_missing.рост, mean(skipmissing(df_missing.рост)))
df_missing
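The replacement options above can also be compared on a plain vector. This is a sketch that uses only Base and the Statistics standard library; the vector and variable names are illustrative.

```julia
using Statistics  # standard library, provides mean()

heights = [171.0, 162.0, missing, 165.0]

# replace() returns a new array; replace!() would modify heights in place
fixed_value = replace(heights, missing => 170.0)

# coalesce() returns its first non-missing argument
first_known = coalesce(missing, missing, 3)

# Mean imputation: fill the gap with the mean of the known values
mean_height = mean(skipmissing(heights))
imputed = coalesce.(heights, mean_height)

println(fixed_value)
println(imputed)
```

Filling with a fixed value is simple but arbitrary, while filling with the mean preserves the column average at the cost of understating its spread.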
Conclusion¶
This case study explored how to create, sort, normalise, find, skip and replace missing data in Julia.