Engee documentation
Notebook

Determining the distribution parameters for a partial sample

In this example, we will divide the sample into two parts and approximate each half with a separate probability density function using the least squares method to find the parameters of each function.

Creating a selection

Our sample will be a model of the Brownian process in the interval [-20, 20]. The particle transition occurs with equal probability to the left and to the right, the path length is determined by an exponential distribution with a given unit average.

We will get a set of implementations of this process by launching 100,000 trajectories from the center of the area.

In [ ]:
Pkg.add(["Distributions", "Statistics", "Random",
         "StatsBase", "Measures", "HypothesisTests", "LsqFit"])
In [ ]:
a = -20
b = 20
λ = 1

using Random
Random.seed!(2)

Ntraj = 100_000  # Количество рассчитываемых траекторий
Nc = []          # Набор длин траекторий
NtrajList = []   # Список координат траектории частиц

for i in 1:Ntraj
    x = 0
    k = 0
    if length(NtrajList) < 200 push!( NtrajList, [] ); end
    while a < x < b
        r1,r2 = rand(2)
        dx = -log(r2) / λ
        if( r1 < 0.5 ) x -= dx; else x += dx; end;
        k += 1;
        # Сохраним первые 200 траекторий в память
        if length(NtrajList) < 200 push!( NtrajList[end], x ); end
    end
    Nc = push!( Nc, k )
end

Nc = Float64.( Nc );

We had to translate the vector Nc in type Float64, without which we would often receive warnings that the vector may contain data of the type Nothing or Missing even if they are not actually there.

Let's print the first couple hundred trajectories.

In [ ]:
gr()

plot()
for traj in NtrajList
    plot!( 1:length(traj), traj, c=1 )
end

using Measures
plot!( legend=:false, size=(1000,400), xlabel="Шаги симуляции", ylabel="Положение частицы", left_margin=10mm, bottom_margin=10mm )
Out[0]:

We can see that some particles have been in the controlled space for a very long time (up to 800 cycles or more), but mostly they left it during the first hundred steps from the start of the simulation.

Let's find the statistical parameters of this sample:

In [ ]:
using Statistics
Nave = mean( Nc ); # Среднее значение
Nd = std( Nc );    # Дисперсия

histogram( Nc, xlims=(0,1000), normalize=true, label="Данные" )
vline!( [Nave], lw=4, lc=2, label = "Среднее" )
vline!( [Nave+Nd], lw=3, lc=3, label=:none )
vline!( [Nave-Nd], lw=3, lc=3, label=:none )
Out[0]:

To examine this sample, the arithmetic mean is not enough. We need to guess which function is responsible for the distribution of the results of the observed process. To begin with, let's assume that we have a lognormal distribution in front of us.

In [ ]:
using Distributions

dist0 = fit( LogNormal, Nc )
x1 = 10:1000

histogram( Nc, xlims=(0,1000), normalize=true, label="Данные" )
plot!( x1, pdf.( dist0, x1 ), lw=4, lc=2, label="Логнормальное распределение" )
Out[0]:

The right tail of the histogram is noticeably different in appearance from the curve of the found lognormal distribution.

Let's divide the data into two samples

Let's try to study two samples separately from each other. To do this, we divide the sample into two halves (before and after the maximum) and find the parameters of the two distribution functions. First of all, where is the maximum of the distribution?

In [ ]:
xx = range( 10, 1000, step=8 )        # Интервалы гистограммы
xc = xx[1:end-1] .+ (xx[2]-xx[1])/2   # Центры интервалов

#dist_h = fit( Histogram, Nc, xx ).weights; # Максимум гистограммы дает не лучшую оценку из-за выбросов,
dist_h = pdf.( dist0, x1 )                  # поэтому возьмем максимум логнормальной функции

m,i = findmax( dist_h )
Nc_mid = x1[i]
Out[0]:
88

Let's separate the two subsamples:

In [ ]:
Nc1 = Nc[ Nc .<= Nc_mid ];
Nc2 = Nc[ Nc .> Nc_mid ];

We approximate these two samples independently, one after the other. Let's build histograms based on them. Pay attention to the argument bins – if you do not specify it, the system will select a different interval size for each sample, and since we have different sample sizes, the density graph will be abnormal relative to the interval size.

In [ ]:
histogram( Nc1 )
histogram!( Nc2, size=(600,150) )
Out[0]:
In [ ]:
histogram( Nc1, bins=1:10:Nc_mid, label="Левая половина выборки" )
histogram!( Nc2, bins=Nc_mid+1:10:1000, label="Правая половина выборки" )
Out[0]:

Now we have the weights of each interval obtained after plotting the histograms.

In [ ]:
using StatsBase

# Границы интервалов для гистограмм
x1 = 10:Nc_mid
x2 = Nc_mid+1:1000

# Центры интервалов
c1 = x1[1:end-1] .+ (x1[2]-x1[1])/2
c2 = x2[1:end-1] .+ (x2[2]-x2[1])/2

# Значения гистограммы (третий параметр – границы интервалов)
w1 = fit( Histogram, Nc1, x1 ).weights;
w2 = fit( Histogram, Nc2, x2 ).weights;

# Нормализованные гистограммы (важно нормализовать используя всю длину Nc, а не длину отдельных хвостов)
w1n = w1 ./ length(Nc);
w2n = w2 ./ length(Nc);

plot( c1, w1n, label="Левая половина выборки" )
plot!( c2, w2n, label="Правая половина выборки" )
Out[0]:

But each of them represents only a part of the data for constructing a histogram. Based on these data, it is impossible to fully find the distribution parameters, since we have a limited sample.

In [ ]:
plot( c1, w1n, label="Левая половина выборки" )
plot!( c2, w2n, label="Правая половина выборки", size=(600,180) )

dist1 = fit( LogNormal, Nc1 )
dist2 = fit( Exponential, Nc2 )

plot!( c1, pdf.( dist1, c1 ), lw=4, label="Логнормальное распределение" )
plot!( c2, pdf.( dist2, c2 ), lw=4, label="Экспоненциальное распределение" )
Out[0]:

But on the other hand, we can find the distribution parameters if we set it as a function that must pass through the presented points.

Parameters of the left selection

Let's imagine that the left side of the distribution follows a lognormal law. This distribution has the following probability density formula:

This formula has two parameters – and , which we will have to form the elements of the vector p (p[1] and p[2]).

Let's find the parameters of this curve with the condition that it should pass through the available points as closely as possible.

In [ ]:
# Впишем функцию средствами подбора параметров функций (а не распределений)
using LsqFit

# Эти функции должны быть написаны для векторных аргументов, поэтому @. вначале (чтобы все операции были .*, .+ и т.д.)
@. lognormDist( x, p ) = (1 / (x * p[2] * sqrt(2π) )) * exp(-(log(x) - p[1])^2 / (2p[2]^2));

p0 = [1.0, 1.0];
af1 = curve_fit( lognormDist, c1, w1n, p0 );

af1.param
Out[0]:
2-element Vector{Float64}:
 5.0813235614097465
 0.8009541756702706

We have selected distribution parameters that give us a probability density very similar to that given by the built-in function. LogNormal.

In [ ]:
plot( c1, lognormDist(c1, af1.param), label="Наша функция LogNorm", legend=:topleft )
plot!( c1, pdf.(LogNormal(af1.param[1], af1.param[2]), c1), label="Штатная функция для плотности Lognorm" )
plot!( c1, w1n, label="Гистограмма по которой мы строили распределение", legend=:bottomright )
Out[0]:

Parameters of the right selection

Let's try to fit the points of the "right" half of the sample into an exponential probability density function.:

Sometimes this distribution is constructed using a one-parameter function, but in our case, the results are better if .

In [ ]:
@. expDist( x, p ) = p[1] * exp( -p[2]*x );

p0 = [ 0.1, 0.1]; # Все очень сильно зависит от начального параметра. При значениях выше 0.5 модель не сходится
af2 = curve_fit( expDist, c2, w2n, p0 );

af2.param
Out[0]:
2-element Vector{Float64}:
 0.00685261267119607
 0.005432207598447344

Since we use an iterative method to approximate the sample using a given function, we need to specify the initial approximation of the solution. p_0, the choice of which often determines whether our algorithm will converge or not.

In [ ]:
plot( c2, expDist(c2, af2.param), label="Экспоненциальное распределение" )
plot!( c2, w2n, label="Гистограмма, по которой мы искали параметры" )
Out[0]:

Both parts are compatible

If we put both functions on the same graph, we will get a pretty good match with the original histogram, as we can verify with the help of the package HypothesisTests.

In [ ]:
stephist( Nc, xlims=(10,1000), bins = 10:5:1000, normalize=true, label="Исходные данные", fill=true, fillalpha=0.3, color=:black, fillcolor=1 )
plot!( [c1; c2], pdf.( dist0, [c1; c2] ), lw=3, lc=2, ls=:dash, label="Логнормальное распределение" )
plot!( [c1; c2], [lognormDist(c1, af1.param); expDist(c2, af2.param)], lw=3, lc=:black, ls=:dash, label="Найденная плотность распределения" )
Out[0]:

Let's evaluate the quality of this solution.

We have actually written a piecewise linear function into the sample. Now let's evaluate the residual error, both visually and using the Student's criterion.

In [ ]:
using HypothesisTests

yy1 = pdf.( dist0, [c1; c2] )
yy2 = [ lognormDist(c1, af1.param); expDist(c2, af2.param) ]
yy3 = [w1n; w2n]

plot(
    plot( yy1 .- yy3, size=(600,250), lc=2,
        label="Lognorm (pvalue = $(round(pvalue(OneSampleTTest(yy1 .- yy3)),digits=3)))" ),
    plot( yy2 .- yy3, size=(600,250), lc=1,
        label="Lognorm + Exponential (pvalue=$(round(pvalue(OneSampleTTest(yy2 .- yy3)),digits=3)))" ),
    layout=(2,1)
)
Out[0]:

The second error graph looks more uniform.

-the value shows that we have taken into account more systematic effects, and that the error between the sample and the function we have compiled is closer to a normally distributed random variable than the error between the sample and the distribution function. LogNorm.

Conclusion

We solved an application problem for which we needed to fit several functions into the samples. As a result, we have constructed a complex distribution function to describe the results of a physical experiment.