Missing data in Julia¶

Introduction¶

Dealing with missing data is a common task in data preprocessing. While sometimes missing values indicate a meaningful event in the data, they often represent unreliable or unusable data points. In any case, Julia has many options for dealing with missing data.

Creating missing data¶

Missing data in Julia can be represented in several forms. One of them is the floating-point value NaN (not a number), which is also available under the name NaN64:

x_64 = [NaN, 8, 15, 16, 23, 42]
println("Type of $(x_64[1]): ", typeof(x_64[1]))

x_64 = [NaN64, 8, 15, 16, 23, 42]
print("Type of $(x_64[1]): ", typeof(x_64[1]))

Type of NaN: Float64
Type of NaN: Float64

NaN is also available in lower-precision floating-point types when needed:

x_32 = [NaN32, 8, 15, 16, 23, 42]

6-element Vector{Float32}:
 NaN
   8.0
  15.0
  16.0
  23.0
  42.0

x_16 = [NaN16, 8, 15, 16, 23, 42]

6-element Vector{Float16}:
 NaN
   8.0
  15.0
  16.0
  23.0
  42.0

In essence, NaN is a concrete floating-point value that propagates through computations. Such values often arise as the result of undefined operations:

x_nan = Inf*0

NaN
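
The propagation described above can be illustrated with a minimal sketch that uses only Base functions; the variable names are illustrative. It also shows a common pitfall: NaN is not equal to itself under ==, so isnan() or isequal() should be used to detect it.

```julia
x = NaN

y = x + 1              # arithmetic with NaN gives NaN
z = sqrt(x)            # most numeric functions also return NaN

eq = (x == x)          # false: under ==, NaN is not equal to itself
same = isequal(x, x)   # true: isequal treats NaN as equal to NaN

println((y, z, eq, same))
```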

In Julia, NaN generally serves to propagate the result of an undefined numeric calculation. For data that is actually missing, it is more appropriate to use the special object missing, which also propagates through computations. missing is the only instance of the type Missing:

typeof(missing)

Missing
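
As a rough illustration of how missing propagates, the sketch below uses only Base operations; the variable names are illustrative. Note that comparisons return missing rather than false, and that logical operators follow three-valued logic.

```julia
a = missing + 1                 # missing: arithmetic propagates missing
b = (missing == 1)              # missing, not false

c = true | missing              # true: the result does not depend on the unknown value
d = false & missing             # false, for the same reason

e = isequal(missing, missing)   # true: isequal always returns a Bool

println((a, b, c, d, e))
```

All of this behaviour is attached to the single type Missing.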

Because missing has its own type, an array that contains missing alongside other values has a Union element type:

x_missing = [missing, 8, 15, 16, 23, 42]

6-element Vector{Union{Missing, Int64}}:
   missing
  8
 15
 16
 23
 42
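
The element type can be inspected directly. The following sketch, using only Base functions and illustrative variable names, shows the Union element type, how to recover the underlying type with nonmissingtype(), and that a plain Int vector does not accept missing.

```julia
v = [missing, 8, 15]

T = eltype(v)            # Union{Missing, Int}
S = nonmissingtype(T)    # Int: the element type without Missing

push!(v, missing)        # allowed, because the element type includes Missing

# A plain Vector{Int} cannot store missing: push! throws an error.
w = [8, 15]
rejected = try
    push!(w, missing)
    false
catch
    true
end

println((T, S, length(v), rejected))
```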

The Missings.jl package provides additional tools for working with missing. For example, its missings() function creates arrays filled with missing:

import Pkg; Pkg.add("Missings") # install the package

using Missings # load the package

# create arrays of missing values:
# arrays of type Missing
@show missings(1)
@show missings(3)
@show missings(3,1)

# array with a union element type
@show missings(Int,3,3);

missings(1) = [missing]
missings(3) = [missing, missing, missing]
missings(3, 1) = [missing; missing; missing;;]
missings(Int, 3, 3) = Union{Missing, Int64}[missing missing missing; missing missing missing; missing missing missing]
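
A typical use of such an array is to allocate it first and fill in values as they become known. The sketch below does this with Base only, using the constructor Vector{Union{Missing, Int}}(missing, 3) in place of missings(Int, 3); treating the two as interchangeable here is an assumption made for the sake of a self-contained example.

```julia
# Allocate a vector of three missing values that can also hold Int
v = Vector{Union{Missing, Int}}(missing, 3)

# Fill in the values that are known
v[1] = 8
v[3] = 42

n_missing = count(ismissing, v)   # number of values still missing

println(v, " ", n_missing)
```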

Sorting missing data¶

When sorting arrays with missing data, keep in mind that isless() treats missing as greater than any other value:

isless(Inf, missing)

true

Consequently, missing values are grouped together when sorting and end up at the end of an ascending sort. To change where they are placed, pass the keyword argument lt = missingsmallest (from Missings.jl), which treats missing as smaller than any other value. Combined with rev=true, this gives a descending sort with missing at the end:

sort(x_missing, rev=true, lt = missingsmallest)

6-element Vector{Union{Missing, Int64}}:
 42
 23
 16
 15
  8
   missing
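
For comparison, the sketch below shows the default placement of missing using Base only. The comparator missing_first is a hypothetical hand-written helper, not part of any library; it only imitates the idea of ordering missing before every other value.

```julia
v = [missing, 8, 15, 16, 23, 42]

asc  = sort(v)             # default: missing comes last
desc = sort(v, rev=true)   # reversed default: missing comes first

# Hypothetical comparator: missing is smaller than anything else,
# other values are compared with isless.
missing_first(a, b) = ismissing(a) ? !ismissing(b) : (!ismissing(b) && isless(a, b))

custom = sort(v, lt=missing_first)   # ascending, with missing first

println(asc)
println(desc)
println(custom)
```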

Standardising missing data¶

Let's load the DataFrames.jl and Statistics.jl libraries to look at examples of working with missing data in tabular form.

Pkg.add(["DataFrames", "Statistics"])
using DataFrames, Statistics

Let's create a table of test data for processing:

df_missing = DataFrame(
    имя     = ["NULL", "Коля", "Юра", "Миша"],
    возраст = [16, NaN, missing, 15],
    рост    = [171, 162, 999, 165],
)

When data from different sources are combined into one table, missing data may be encoded with different values. To convert all of them to missing, it is convenient to use the declaremissings() function from the Impute.jl library:

Pkg.add("Impute")

using Impute

df_missing = Impute.declaremissings(df_missing; values=(NaN, 999, "NULL"))

As we can see, all missing data now has a single representation: the object missing.
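
The idea behind this step can be sketched for plain vectors with Base only. The helper declare_missing below is hypothetical and is not how Impute.jl is implemented; it compares with isequal() so that NaN matches NaN.

```julia
# Hypothetical helper: replace every sentinel value with missing.
# isequal is used because NaN == NaN is false.
declare_missing(v, sentinels) =
    [any(s -> isequal(x, s), sentinels) ? missing : x for x in v]

sentinels = (NaN, 999, "NULL")

name_col   = declare_missing(["NULL", "Коля"], sentinels)
age_col    = declare_missing([16.0, NaN, 15.0], sentinels)
height_col = declare_missing([171, 162, 999, 165], sentinels)

println(name_col)
println(age_col)
println(height_col)
```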

Searching for missing data¶

To check whether a value is NaN, it is convenient to use the function isnan():

isnan.(x_64)

6-element BitVector:
 1
 0
 0
 0
 0
 0

A similar function, ismissing(), detects missing objects:

ismissing.(x_missing)

6-element BitVector:
 1
 0
 0
 0
 0
 0
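
Beyond building a mask, it is often useful to get the positions or the number of missing values. The sketch below uses Base functions only; is_absent is a hypothetical helper that treats both missing and NaN as absent. It also shows that isnan() applied to missing returns missing, so the two checks have to be combined with care.

```julia
v = [missing, 8.0, NaN, 16.0]

idx_missing = findall(ismissing, v)   # positions of missing values
n_missing   = count(ismissing, v)     # how many values are missing

nan_mask = isnan.(v)                  # first element is missing, not true or false

# Hypothetical helper: ismissing is checked first, so isnan never sees missing
is_absent(x) = ismissing(x) || isnan(x)
idx_absent = findall(is_absent, v)    # positions of missing or NaN values

println((idx_missing, n_missing, idx_absent))
```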

Let's determine the location of missing values in the table:

df_mask = ismissing.(df_missing)

missing objects often cause problems in data processing. For this reason they can be skipped, dropped or replaced.

Skipping missing data¶

To filter NaN values out of a vector, use the filter() function from Julia's Base library:

filter(!isnan, x_64)

5-element Vector{Float64}:
  8.0
 15.0
 16.0
 23.0
 42.0

Filtering missing objects out of an array does not change its element type. To narrow the element type to that of the remaining values, you can use the disallowmissing() function from the Missings.jl library:

@show x = filter(!ismissing, x_missing)
disallowmissing(x)

x = filter(!ismissing, x_missing) = Union{Missing, Int64}[8, 15, 16, 23, 42]

5-element Vector{Int64}:
  8
 15
 16
 23
 42
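
The same narrowing can be sketched with Base only, assuming the filtered vector really contains no missing values. Here convert() stands in for disallowmissing() purely for illustration.

```julia
x = filter(!ismissing, [missing, 8, 15, 16])

type_before = eltype(x)              # still Union{Missing, Int}

# convert succeeds because no element of x is missing
narrowed = convert(Vector{Int}, x)
type_after = eltype(narrowed)        # Int

println((type_before, type_after))
```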

The filter() function can also remove rows with missing from tabular data. The following line keeps only the rows where the column имя is not missing:

filter(:имя => !ismissing, df_missing)

For vectors, the skipmissing() function from the Base library gives the same result as the combination of filter() and disallowmissing():

collect(skipmissing(x_missing))

5-element Vector{Int64}:
  8
 15
 16
 23
 42
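
skipmissing() returns a lazy iterator, so it can be passed directly to reductions without calling collect(). A minimal sketch, using Base and the Statistics standard library with illustrative variable names:

```julia
using Statistics

v = [missing, 8, 15, 16, 23, 42]

total_raw = sum(v)                 # missing: a single missing value propagates
total     = sum(skipmissing(v))    # sum of the five present values
avg       = mean(skipmissing(v))   # mean of the five present values

println((total_raw, total, avg))
```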

To drop rows with missing from tabular data, the DataFrames.jl library provides a more convenient function, dropmissing():

dropmissing(df_missing)

To drop only the rows where a particular column contains missing, pass the name of that column as the second argument:

dropmissing(df_missing, :имя)

Here dropmissing() returns a new table. If the original table does not need to be preserved, dropmissing!() removes the rows with missing data in place.

Replacing missing data¶

To replace missing data in an array, it is convenient to use the function Missings.replace(), which returns a lazy iterator; collect() turns it into a vector:

рост = collect(Missings.replace(df_missing.рост, 170))

4-element Vector{Int64}:
 171
 162
 170
 165
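
A Base-only alternative is the replace() function, sketched below on an illustrative vector. The non-mutating form returns a new vector whose element type no longer includes Missing, while the mutating form replace!() keeps the original element type.

```julia
heights = [171, 162, missing, 165]

# Non-mutating: new vector, element type narrowed to Int
filled = replace(heights, missing => 170)

# Mutating: the same vector is modified, so the Union element type stays
in_place = replace!(copy(heights), missing => 170)

println((eltype(filled), eltype(in_place)))
```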

To replace missing data in a table column in place, you can use the function replace!():

replace!(df_missing.возраст, missing => 15)

4-element Vector{Union{Missing, Float64}}:
 16.0
 15.0
 15.0
 15.0

Note that the element type of the column does not change: it still allows missing. Another way is to use the function coalesce():

df_missing.имя = coalesce.(df_missing.имя, "Ваня")
df_missing.рост = coalesce.(df_missing.рост, mean(skipmissing(df_missing.рост)))
df_missing
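
coalesce() returns its first argument that is not missing. When an integer column is filled with a floating-point mean, the result mixes Int and Float64 values, which widens the element type. The sketch below, with illustrative variable names, converts the column with float.() first so that the result is a plain Float64 vector; this is one possible approach, not the only one.

```julia
using Statistics

first_value = coalesce(missing, missing, 3)   # first non-missing argument

heights = [171, 162, missing, 165]
avg = mean(skipmissing(heights))              # mean of the present values

# Convert to floating point first, then fill in the mean
filled = coalesce.(float.(heights), avg)

println(first_value, " ", filled)
```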

Conclusion¶

This case study explored how to create, sort, standardise, find, skip and replace missing data in Julia.

df_missing as created:

Row	имя	возраст	рост
	String	Float64?	Int64
1	NULL	16.0	171
2	Коля	NaN	162
3	Юра	missing	999
4	Миша	15.0	165

df_missing after Impute.declaremissings():

Row	имя	возраст	рост
	String?	Float64?	Int64?
1	missing	16.0	171
2	Коля	missing	162
3	Юра	missing	missing
4	Миша	15.0	165

df_mask = ismissing.(df_missing):

Row	имя	возраст	рост
	Bool	Bool	Bool
1	true	false	false
2	false	true	false
3	false	true	true
4	false	false	false

df_missing after replace!() and coalesce():

Row	имя	возраст	рост
	String	Float64?	Real
1	Ваня	16.0	171
2	Коля	15.0	162
3	Юра	15.0	166.0
4	Миша	15.0	165