Data loading and handling of missing values¶
This example demonstrates how to load data from an XLSX file and fill gaps in the data using the Impute and DataInterpolations libraries.
The data is an archive of weather observations from a single weather station over the last 5 years. Only the daily temperature measurements are used in this example.
Installation of libraries required for data loading and processing:
import Pkg
Pkg.add(["DataFrames", "Statistics", "XLSX", "Impute", "CSV", "Plots", "DataInterpolations"]) #install the data loading, processing and plotting libraries used below
Loading the libraries required for data loading and processing:
using DataFrames, CSV, XLSX, Plots, Impute, DataInterpolations, Statistics
using Impute: Substitute, impute
Reading data from a file into a variable:
xf_missing = XLSX.readxlsx("$(@__DIR__)/data_for_analysis_missing.xlsx");
Viewing sheet names in loaded data:
XLSX.sheetnames(xf_missing)
Defining data from a file as a dataframe:
df_missing = DataFrame(XLSX.readtable("$(@__DIR__)/data_for_analysis_missing.xlsx", "data"));
Selecting the plotting backend, which determines how graphs are rendered:
gr()
Defining variables characterising the data - time and temperature:
x = df_missing.Time;
y = df_missing.T;
Plotting temperature versus time using the original data:
plot(x, y, label="Temperature", title="Temperature versus time")
The graph shows that there are gaps in the data. They can be filled using the Impute and DataInterpolations libraries.
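Before choosing a filling method, it can help to quantify the gaps. The following is a minimal sketch using only Base Julia; the vector `temps` and the helper `longest_gap` are hypothetical and stand in for the station data and whatever diagnostics are actually needed.

```julia
# Hypothetical toy series standing in for df_missing.T
temps = Union{Missing,Float64}[1.5, missing, missing, 4.0, missing, 6.5]

# Total number of missing measurements
n_missing = count(ismissing, temps)

# Indices at which the measurements are missing
missing_idx = findall(ismissing, temps)

# Length of the longest run of consecutive missing values
function longest_gap(v)
    longest = 0
    current = 0
    for x in v
        current = ismissing(x) ? current + 1 : 0
        longest = max(longest, current)
    end
    return longest
end

println("missing: ", n_missing, ", at indices ", missing_idx,
        ", longest gap: ", longest_gap(temps))
```

The length of the longest gap is worth knowing in advance, because some interpolation methods behave poorly over long runs of missing values.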
Using the Impute library¶
Defining a temperature vector and a two-column table (time and temperature) with the data:
vectorT = df_missing[:,2]
matrixT = df_missing[:,1:2]
typeof(vectorT)
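Converting the temperature values to an element type accepted by the Impute library functions:
vectorT = convert(Vector{Union{Missing,Float64}}, vectorT);
matrixT[!,2] = convert(Vector{Union{Missing,Float64}}, matrixT[:,2]); #`!` replaces the column; `matrixT[:,2] = ...` would write into the existing column and keep its old element type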
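The reason for the conversion can be seen on a small example. This sketch assumes that a spreadsheet column may arrive with element type `Any`; the vector `raw` is hypothetical and only Base Julia is used.

```julia
# Hypothetical column as it might arrive from a spreadsheet: element type Any
raw = Any[1.5, missing, 3, 4.0]

# Narrow the element type; numeric entries are converted to Float64,
# missing entries stay missing
typed = convert(Vector{Union{Missing,Float64}}, raw)

println(eltype(raw))
println(eltype(typed))
println(typed)
```

With a concrete `Union{Missing,Float64}` element type, numeric routines can distinguish real measurements from gaps instead of receiving untyped values.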
Filling gaps using interpolation, filtering and mean:
lin_inter_vectorT = Impute.interp(vectorT); #fill missing values with interpolated ones (suitable for signals)
filter_matrixT = Impute.filter(matrixT; dims=:rows); #remove observations (rows) that contain missing data
mean_matrixT = impute(matrixT[:,2], Substitute(; statistic=mean)); #fill missing values with the mean (suitable for statistical data)
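The difference between the two filling strategies is easier to see on a few numbers. This is a minimal sketch on a hypothetical toy vector `v`, using the same Impute calls as above.

```julia
using Impute, Statistics
using Impute: Substitute, impute

# Hypothetical toy series with a two-point gap between 1.0 and 4.0
v = Union{Missing,Float64}[1.0, missing, missing, 4.0, 10.0]

# Gap filled along the straight line joining its neighbours
by_interp = Impute.interp(v)

# Gap filled with the mean of the observed values
by_mean = impute(v, Substitute(; statistic=mean))

println(by_interp)
println(by_mean)
```

Interpolation uses only the values adjacent to the gap, while mean substitution uses every observed value, including the distant 10.0, so the two results differ inside the gap.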
Plotting graphs with corrected data:
p2 = plot(df_missing[:,1], lin_inter_vectorT, xlabel="Date", ylabel="Temperature", title="Filling gaps by linear interpolation", titlefont=font(10));
p3 = plot(df_missing[:,1], mean_matrixT, xlabel="Date", ylabel="Temperature", title="Filling gaps with the mean value", titlefont=font(10), guidefont=font(8));
p1 = scatter(df_missing[:,1], df_missing[:,2], markersize=2, xlabel="Date", ylabel="Temperature", title="Original data", titlefont=font(10), guidefont=font(8))
plot(p1, p2, p3, layout=(3, 1), legend=false)
Using the DataInterpolations library¶
Preparing data for interpolation methods:
days = collect(1:length(df_missing[:,2])) #vector from 1 to the length of the data array
t = days
u = reverse(df_missing[:,2]) #reverse the temperature measurements so that they run from earliest to latest
u = convert(Vector{Union{Missing,Float64}}, u); #convert the data to the element type required by the methods used below
Filling in the gaps using linear interpolation and plotting a graph with the corrected data:
A = LinearInterpolation(u,t)
scatter(t, u, markersize=2, label="Original data") #scatter plot of the measurements
plot!(A, label="Linear interpolation", xlabel="Time", ylabel="Temperature") #temperature versus time
Filling in the gaps using quadratic interpolation and plotting a graph with corrected data:
B = QuadraticInterpolation(u,t)
scatter(t, u, markersize=2, label="Original data") #scatter plot of the measurements
plot!(B, label="Quadratic interpolation", xlabel="Time", ylabel="Temperature") #temperature versus time
Filling gaps by holding the last known value (piecewise-constant interpolation) and plotting a graph with the corrected data:
C = ConstantInterpolation(u,t)
scatter(t, u, markersize=2, label="Original data") #scatter plot of the measurements
plot!(C, label="Last known values", xlabel="Time", ylabel="Temperature") #temperature versus time
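The objects returned by these constructors are callable, so the three methods can also be compared numerically at a point inside a gap. This is a minimal sketch on hypothetical toy data (samples of t^2 with one value missing); the names `lin`, `quad` and `konst` are illustrative.

```julia
using DataInterpolations

# Hypothetical toy data: samples of t^2 with the value at t = 3 missing
t = [1.0, 2.0, 3.0, 4.0, 5.0]
u = Union{Missing,Float64}[1.0, 4.0, missing, 16.0, 25.0]

lin   = LinearInterpolation(u, t)
quad  = QuadraticInterpolation(u, t)
konst = ConstantInterpolation(u, t)

# Evaluate all three interpolants at the missing point
for (name, itp) in (("linear", lin), ("quadratic", quad), ("constant", konst))
    println(name, ": ", itp(3.0))
end
```

Because the toy data follow a parabola, the quadratic interpolant recovers the missing value most closely here. Real measurements give no such guarantee, which is why the plots above are worth inspecting.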
Conclusion¶
In this example, temperature measurement data was loaded and preprocessed, and interpolation and filtering techniques were applied to it.
The graphs show that each method has limitations that depend on the type of data.
For example, replacing gaps with the mean value is better suited to statistical analysis, where the overall characteristics of the data should not change much.
Quadratic interpolation can produce large swings in signal magnitude across relatively long gaps, so it is better suited to short gaps.
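The effect of mean substitution on summary statistics can be checked directly. This is a minimal sketch on a hypothetical toy vector; the filling is done by hand with `coalesce` from Base rather than through Impute, purely for illustration.

```julia
using Statistics

# Hypothetical toy series with gaps
v = Union{Missing,Float64}[2.0, missing, 6.0, missing, 10.0]

observed = collect(skipmissing(v))

# Replace every gap with the mean of the observed values
filled = coalesce.(v, mean(observed))

println("mean before: ", mean(observed), ", after: ", mean(filled))
println("std  before: ", std(observed), ", after: ", std(filled))
```

The mean is preserved, but the spread shrinks because every filled value sits exactly at the centre of the data. This is acceptable for some aggregate statistics and misleading for a time signal.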