Engee documentation
Notebook

Uploading data and processing passes

This example will demonstrate the process of downloading data from the XLSX format and filling in the gaps using the Impact and DataInterpolations libraries.

The data is an archive of observations of weather events at one weather station over the past 5 years.
In the example, only daily temperature measurements will be used.

Installing libraries necessary for downloading and processing data:

In [ ]:
Pkg.add(["Statistics", "XLSX", "Impute", "CSV", "DataInterpolations"])
In [ ]:
Pkg.add( "Impute" ); #загрузка библиотеки обработки данных
Pkg.add( "DataInterpolations" );
   Resolving package versions...
  No Changes to `/user/.project/Project.toml`
  No Changes to `/user/.project/Manifest.toml`
   Resolving package versions...
  No Changes to `/user/.project/Project.toml`
  No Changes to `/user/.project/Manifest.toml`

Calling the libraries required for loading and processing data:

In [ ]:
using DataFrames, CSV, XLSX, Plots, Impute, DataInterpolations, Statistics
using Impute: Substitute, impute

Reading data from a file to a variable:

In [ ]:
xf_missing = XLSX.readxlsx("$(@__DIR__)/data_for_analysis_missing.xlsx");

Viewing sheet names in uploaded data:

In [ ]:
XLSX.sheetnames(xf_missing)
Out[0]:
1-element Vector{String}:
 "data"

Defining data from a file as a dataframe:

In [ ]:
df_missing = DataFrame(XLSX.readtable("$(@__DIR__)/data_for_analysis_missing.xlsx", "data"));

Enabling a backend graphics display method:

In [ ]:
gr()
Out[0]:
Plots.GRBackend()

Determination of variables characterizing the data - time and temperature:

In [ ]:
x = df_missing.Time;
y = df_missing.T;

Plotting the temperature versus time dependence based on the initial data:

In [ ]:
plot(x, y, labels="Температура", title="График зависимости температуры от времени")
Out[0]:

The graph shows that there are gaps in the data, they can be filled using the libraries Impact and DataInterpolations.

Using the Impact Library:

Defining a vector and a data matrix:

In [ ]:
vectorT = df_missing[:,2]
matrixT = df_missing[:,1:2]
typeof(vectorT)
Out[0]:
Vector{Any} (alias for Array{Any, 1})

Converting vector and matrix to a format acceptable for the functions of the Impact library:

In [ ]:
vectorT = convert(Vector{Union{Missing,Float64}}, vectorT);
matrixT[:,2] = convert(Vector{Union{Missing,Float64}}, matrixT[:,2]);

Filling in gaps using interpolation, filtering, and averages:

In [ ]:
lin_inter_vectorT = Impute.interp(vectorT); #заполнение пропущенных значений интерполированными (для сигналов)
filter_matrixT = Impute.filter(matrixT; dims=:rows); #удаление объектов/наблюдений с пропущенными данными
mean_matrixT = impute(matrixT[:,2], Substitute(; statistic=mean)); #заполнение пропущенных значений средними значениями (подходит для статистических данных)

Plotting graphs with corrected data:

In [ ]:
p2 = plot(df_missing[:,1], lin_inter_vectorT, xlabel="Дата", ylabel="Температура", title="Заполнение пропусков линейной интерполяцией", titlefont=font(10));
p3 = plot(df_missing[:,1], mean_matrixT, xlabel="Дата", ylabel="Температура", title="Заполнение пропусков средним значением", titlefont=font(10), guidefont=font(8));
p1 = scatter(df_missing[:,1], df_missing[:,2], markersize=2, xlabel="Дата", ylabel="Температура", title="Исходные данные", titlefont=font(10), guidefont=font(8))
plot(p1, p2, p3,  layout=(3, 1), legend=false)
Out[0]:

Using the DataInterpolations Library

Data preparation for the use of interpolation methods:

In [ ]:
days = [x for x in 1:length(df_missing[:,2])] #определение вектора от 1 до значения длины массива с данными
t = days
u = reverse(df_missing[:,2]) #сортировка измерений температуры в обратном порядке, от раннего к позднему
u = convert(Vector{Union{Missing,Float64}}, u); #конвертация данных в нужный, для используемых методов, формат

Filling in gaps using linear interpolation and plotting with corrected data:

In [ ]:
A = LinearInterpolation(u,t)
scatter(t, u, markersize=2, label="Исходные данные") #вывод точечного графика
plot!(A, label="Линейная интерполяция", xlabel="Время", ylabel="Температура") #вывод зависимости температуры от времени
Out[0]:

Filling in gaps using quadratic interpolation and plotting with corrected data:

In [ ]:
B = QuadraticInterpolation(u,t)
scatter(t, u, markersize=2, label="Исходные данные")#вывод точечного графика
plot!(B, label="Квадратичная интерполяция", xlabel="Время", ylabel="Температура")#вывод зависимости температуры от времени
Out[0]:

Filling in gaps by interpolating with the latest constant values and plotting with corrected data:

In [ ]:
C = ConstantInterpolation(u,t)
scatter(t, u, markersize=2, label="Исходные данные")#вывод точечного графика
plot!(C, label="Последние значения", xlabel="Время", ylabel="Температура")#вывод зависимости температуры от времени
Out[0]:

Conclusion:

In this example, temperature measurement data was uploaded and preprocessed. Interpolation and filtering methods were applied.

According to the graphs obtained, it can be seen that some methods have limitations in their application to different types of data.

So, replacing omissions with an average value is more suitable for statistical analysis, where the characteristics of the data will not change much.

In the case of quadratic interpolation, there is a strong change in the magnitude of the signal in relatively large missing ranges, so that it is more applicable to small gaps.