Reading data from Excel and building models¶

Very often the data that needs to be analysed is not taken from some abstract repository or database, but lies in some tabular format file. In some industries the preferred format is CSV, in other cases the data is created in standard table editors.

In this demonstration, we will learn how to load and graph data from an Excel spreadsheet and build a model from it.

Reading a whole file¶

For most of the commands we will need the following libraries, which are already built into the environment. We only need to specify that we are going to use commands from the namespace of these libraries.

Pkg.add(["Interpolations", "XLSX", "GLM", "Optim"])

using XLSX, DataFrames

An Excel file may contain several spreadsheets. Most often we use only the first of these sheets, which is called Лист1 or Sheet1 (but may be called something else).

Reading the whole file will return us an object with several tables inside, each on a separate sheet.

xf = XLSX.readxlsx( "table_simple_example.xlsx" )

XLSXFile("table_simple_example.xlsx") containing 1 Worksheet
            sheetname size          range        
-------------------------------------------------
                Лист1 4x3           A1:C4

We can see this again by displaying the list of sheets.

XLSX.sheetnames(xf)

1-element Vector{String}:
 "Лист1"

If you have already opened the file with the command readxlsx, you can separate the required sheet with an additional command.

sh = xf["Лист1"]

4×3 XLSX.Worksheet: ["Лист1"](A1:C4)

Or you can use commands that perform this process faster.

Reading a single sheet from a table¶

You can read each sheet individually using the standard library commands XLSX.

XLSX.readtable( "table_simple_example.xlsx", "Лист1" )

XLSX.DataTable(Any[Any[1, 4, 7], Any[2, 5, 8], Any[3, 6, 9]], [:a, :b, :c], Dict(:a => 1, :b => 2, :c => 3))

You can put a table sheet into the format DataFrame and process the data using the means of this library.

DataFrame( XLSX.readtable( "table_simple_example.xlsx", "Лист1" ) )

xdf = Float32.(DataFrame( XLSX.readtable( "table_simple_example.xlsx", "Лист1" ) ))

But it makes sense when the data represents a set of objects and a list of their attributes. Sometimes the table contains just a matrix with numbers, and then it is better to work with it using the standard data type Matrix.

xm = XLSX.readdata( "table_simple_example.xlsx", "Лист1", "A2:C4" )

3×3 Matrix{Any}:
 1  2  3
 4  5  6
 7  8  9

Choosing the right data type simplifies further work with the table or matrix.

gr()
wireframe( xm )

Calculating data on a new calculation grid (interpolation)¶

Now, we have at our disposal a table of data defined in some coordinate system, we can interpolate them using a spline and build a model.

xm = vec( Float32.( XLSX.readdata("table_matrix_example.xlsx", "Лист1", "B1:D1") ))
ym = vec( Float32.( XLSX.readdata("table_matrix_example.xlsx", "Лист1", "A2:A4") ))
zm = Float32.( XLSX.readdata("table_matrix_example.xlsx", "Лист1", "B2:D4") )

wireframe( xm, ym, zm )

using Interpolations

# Переводим описание системы координат из векторов в набор диапазонов
xSrc = range( minimum(xm), maximum(xm), step=xm[2]-xm[1] )
ySrc = range( minimum(ym), maximum(ym), step=ym[2]-ym[1] )

# Строим модель, которая предскажет нам значения на новой сетке
cubInt = cubic_spline_interpolation( (xSrc, ySrc), zm )

# Задаем новую расчетную сетку
xRange = range( minimum(xm), maximum(xm), 50 )
yRange = range( minimum(ym), maximum(ym), 50 )
zIntRes = cubInt(xRange, yRange)

# Визуализируем результат
surface( xRange, yRange, zIntRes, fillalpha=0.2 )
wireframe!( xRange, yRange, zIntRes )

We have interpolated with cubic splines. If the original data points were located in a coordinate system, but with non-regular X and Y notches, we would only be able to build a model based on linear interpolation (or a "nearest neighbours" model).

Functional approximation and minimum search¶

After visually analysing the data, you can make a guess as to which function would be convenient to approximate the data. This requires some experience and interactive work with the data.

Suppose we have verified that the data is well represented by a quadratic function. Let us construct a polynomial that will allow us to more accurately find the expected values of the output parameters in which the target value of the process under study is minimised.

using GLM

# Подготовим данные: представим таблицу как список отдельных точек данных
xx = Float64.(repeat( collect(xSrc), outer = length(ySrc) ))
yy = Float64.(repeat( collect(ySrc), inner = length(xSrc) ))
zz = Float64.(vec(reshape(zm', 1, :)))

# Создадим объект DataFrame
data = DataFrame( X = xx, Y = yy, Z = zz )

# Построим модель с двумя входными параметрами
model = lm( @formula(Z ~ X + Y + X^2 + Y^2 + X*Y), data )

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

Z ~ 1 + X + Y + :(X ^ 2) + :(Y ^ 2) + X & Y

Coefficients:
──────────────────────────────────────────────────────────────────────────────
                  Coef.   Std. Error      t  Pr(>|t|)    Lower 95%   Upper 95%
──────────────────────────────────────────────────────────────────────────────
(Intercept)  26.1111     10.0158       2.61    0.0799  -5.76373     57.986
X            -2.2         0.96622     -2.28    0.1072  -5.27494      0.874942
Y            -0.433333    0.292604    -1.48    0.2352  -1.36453      0.497864
X ^ 2         0.0533333   0.0237268    2.25    0.1102  -0.0221761    0.128843
Y ^ 2         0.0133333   0.00593171   2.25    0.1102  -0.00554402   0.0322107
X & Y         0.01        0.0083887    1.19    0.3189  -0.0166966    0.0366966
──────────────────────────────────────────────────────────────────────────────

# Задаем новую расчетную сетку
dx = 2 * abs(maximum(xm) - minimum(xm))
dy = 2 * abs(maximum(ym) - minimum(ym))
xRange = range( minimum(xm) - dx, maximum(xm) + dx, 50 )
yRange = range( minimum(ym) - dy, maximum(ym) + dy, 20 )

# Визуализируем результат
xx = repeat( collect(xRange), outer = length(yRange) )
yy = repeat( collect(yRange), inner = length(xRange) )
zz = reshape( GLM.predict(model, (;X=xx,Y=yy)), length(xRange), :)

# Построим графики
surface( xRange, yRange, zz', fillalpha=0.2, xlabel="x", ylabel="y", zlabel="z" )
wireframe!( xRange, yRange, zz' )

The models created with GLM accept a point cloud as input (as you have seen, we used data in the form of one-dimensional vectors Vector) and allow us to extrapolate without any additional settings.

Finding the point of minimum¶

Both objects allow us to find the point of minimum of the function, the description of which we got from the table.

using Optim
opt = optimize( x->GLM.predict( model,(;X=[x[1]],Y=[x[2]]) )[1], [0.0,0.0] )
(x_min, y_min) = Optim.minimizer( opt )

surface( xRange, yRange, zz', xlabel="x", ylabel="y", zlabel="z", c=:viridis, zcolor=zz' )
scatter!( [x_min], [y_min], [GLM.predict( model,(;X=[x_min],Y=[y_min]) )[1]],
    camera = (30,50), legend=false, cbar=false )

We have obtained a graph where the minimum is marked.

Writing the data to an Excel file¶

To save the analysis results to another Excel spreadsheet, use the following command:

XLSX.openxlsx( "table_write_example_1.xlsx", mode="w" ) do xf
    
    # Поместим матрицу на первый лист
    XLSX.rename!( xf[1], "Лист1" )

    xw = reshape(xx, 1, :) # Сделаем матрицы из векторов
    for ind in CartesianIndices(xw)
        XLSX.setdata!( xf["Лист1"], XLSX.CellRef( ind[1], ind[2]+1 ), xw[ind])
    end

    yw = reshape(yy, :, 1)
    for ind in CartesianIndices(yw)
        XLSX.setdata!( xf["Лист1"], XLSX.CellRef( ind[1]+1, ind[2] ), yw[ind])
    end

    zw = zz
    for ind in CartesianIndices(zw)
        XLSX.setdata!( xf["Лист1"], XLSX.CellRef( ind[1]+1, ind[2]+1 ), zw[ind])
    end

    
    # На второй лист поместим DataFrame
    XLSX.addsheet!( xf, "Лист2" )
    XLSX.writetable!( xf["Лист2"], eachcol(xdf), names(xdf); anchor_cell=XLSX.CellRef("A1"))
    
end

Or, much shorter - to save DataFrame in a one-sheet table:

XLSX.writetable( "table_write_example_2.xlsx", xdf, overwrite=true )

To save objects Matrix or Vector to a table format, it is much more common to use the format CSV.

Conclusion¶

In this example, we learnt how to read files in Excel spreadsheet format. We converted them to different formats:

the format DataFrame, which is best suited for analysing tabular data
Matrix format, which allows you to work with a group of numbers from the source table as a matrix.

We also demonstrated how to interpolate and approximate data read from an Excel file and how to save the results.

	a	b	c
	Float32	Float32	Float32
1	1.0	2.0	3.0
2	4.0	5.0	6.0
3	7.0	8.0	9.0

	a	b	c
	Any	Any	Any
1	1	2	3
2	4	5	6
3	7	8	9