Determination of distribution parameters from partial sampling¶

In this example, we will divide the sample into two parts and approximate each half with a different probability density function, using the least squares method to find the parameters of each function.

Creating a sample¶

Our sample will be a Brownian process model on the segment [-20,20]. The transition of the particle occurs with equal probability to the left and to the right, the length of the run is determined by an exponential distribution with a given unit mean.

Let us obtain a set of realisations of this process by launching 100000 trajectories from the centre of the region.

Pkg.add(["Distributions", "Statistics", "Random",
         "StatsBase", "Measures", "HypothesisTests", "LsqFit"])

a = -20
b = 20
λ = 1

using Random
Random.seed!(2)

Ntraj = 100_000  # Количество рассчитываемых траекторий
Nc = []          # Набор длин траекторий
NtrajList = []   # Список координат траектории частиц

for i in 1:Ntraj
    x = 0
    k = 0
    if length(NtrajList) < 200 push!( NtrajList, [] ); end
    while a < x < b
        r1,r2 = rand(2)
        dx = -log(r2) / λ
        if( r1 < 0.5 ) x -= dx; else x += dx; end;
        k += 1;
        # Сохраним первые 200 траекторий в память
        if length(NtrajList) < 200 push!( NtrajList[end], x ); end
    end
    Nc = push!( Nc, k )
end

Nc = Float64.( Nc );

We had to convert the vector Nc to the type Float64, without which we would often get warnings that the vector may contain data of the type Nothing or Missing, even if it is not actually there.

Let's output the first couple of hundred trajectories.

gr()

plot()
for traj in NtrajList
    plot!( 1:length(traj), traj, c=1 )
end

using Measures
plot!( legend=:false, size=(1000,400), xlabel="Шаги симуляции", ylabel="Положение частицы", left_margin=10mm, bottom_margin=10mm )

We see that some particles were in the controlled space for a very long time (up to 800 cycles and more), but mostly they left it within the first hundred steps from the start of the simulation.

Let us find statistical parameters of this sample:

using Statistics
Nave = mean( Nc ); # Среднее значение
Nd = std( Nc );    # Дисперсия

histogram( Nc, xlims=(0,1000), normalize=true, label="Данные" )
vline!( [Nave], lw=4, lc=2, label = "Среднее" )
vline!( [Nave+Nd], lw=3, lc=3, label=:none )
vline!( [Nave-Nd], lw=3, lc=3, label=:none )

To investigate this sample, the arithmetic mean is not enough. We need to guess what function the distribution of the results of the observed process obeys. First, let us assume that we have a lognormal distribution.

using Distributions

dist0 = fit( LogNormal, Nc )
x1 = 10:1000

histogram( Nc, xlims=(0,1000), normalize=true, label="Данные" )
plot!( x1, pdf.( dist0, x1 ), lw=4, lc=2, label="Логнормальное распределение" )

The right tail of the histogram differs markedly from the curve of the found lognormal distribution.

Let's divide the data into two samples¶

Let's try to study two samples separately from each other. To do this, let's divide the sample into two halves (before and after the maximum) and find the parameters of the two distribution functions. First of all, where is the maximum of the distribution?

xx = range( 10, 1000, step=8 )        # Интервалы гистограммы
xc = xx[1:end-1] .+ (xx[2]-xx[1])/2   # Центры интервалов

#dist_h = fit( Histogram, Nc, xx ).weights; # Максимум гистограммы дает не лучшую оценку из-за выбросов,
dist_h = pdf.( dist0, x1 )                  # поэтому возьмем максимум логнормальной функции

m,i = findmax( dist_h )
Nc_mid = x1[i]

88

Let's separate the two subsamples:

Nc1 = Nc[ Nc .<= Nc_mid ];
Nc2 = Nc[ Nc .> Nc_mid ];

Approximate these two samples independently one from the other. Let's build histograms on them. Pay attention to the argument bins - if it is not specified, the system will choose a different interval size for each sample, and since we have different sample sizes, the density plot will be non-normalised with respect to the interval size.

histogram( Nc1 )
histogram!( Nc2, size=(600,150) )

histogram( Nc1, bins=1:10:Nc_mid, label="Левая половина выборки" )
histogram!( Nc2, bins=Nc_mid+1:10:1000, label="Правая половина выборки" )

Now we have the weights of each interval, obtained after histogram construction.

using StatsBase

# Границы интервалов для гистограмм
x1 = 10:Nc_mid
x2 = Nc_mid+1:1000

# Центры интервалов
c1 = x1[1:end-1] .+ (x1[2]-x1[1])/2
c2 = x2[1:end-1] .+ (x2[2]-x2[1])/2

# Значения гистограммы (третий параметр – границы интервалов)
w1 = fit( Histogram, Nc1, x1 ).weights;
w2 = fit( Histogram, Nc2, x2 ).weights;

# Нормализованные гистограммы (важно нормализовать используя всю длину Nc, а не длину отдельных хвостов)
w1n = w1 ./ length(Nc);
w2n = w2 ./ length(Nc);

plot( c1, w1n, label="Левая половина выборки" )
plot!( c2, w2n, label="Правая половина выборки" )

But each of them represents only a part of the data for histogram construction. We cannot fully find the distribution parameters from this data, because we have a limited sample.

plot( c1, w1n, label="Левая половина выборки" )
plot!( c2, w2n, label="Правая половина выборки", size=(600,180) )

dist1 = fit( LogNormal, Nc1 )
dist2 = fit( Exponential, Nc2 )

plot!( c1, pdf.( dist1, c1 ), lw=4, label="Логнормальное распределение" )
plot!( c2, pdf.( dist2, c2 ), lw=4, label="Экспоненциальное распределение" )

However, we can find the parameters of the distribution if we define it as a function that must pass through the presented points.

Parameters of the left sample¶

Let us imagine that the left side of the distribution follows a lognormal law. This distribution has the following probability density formula:

$$P(x) = \exp \left ( - \left [ \frac{ \ln (x) - \mu } {\sigma}\right ]^2 \bigg/ 2 \right ) \Bigg / (x \sigma \sqrt{2 \pi})$$

This formula has two parameters - $\mu$ and $\sigma$, which will form the elements of the vector p (p[1] and p[2]).

Let's find the parameters of this curve with the condition that it should pass through the available points as close as possible.

# Впишем функцию средствами подбора параметров функций (а не распределений)
using LsqFit

# Эти функции должны быть написаны для векторных аргументов, поэтому @. вначале (чтобы все операции были .*, .+ и т.д.)
@. lognormDist( x, p ) = (1 / (x * p[2] * sqrt(2π) )) * exp(-(log(x) - p[1])^2 / (2p[2]^2));

p0 = [1.0, 1.0];
af1 = curve_fit( lognormDist, c1, w1n, p0 );

af1.param

2-element Vector{Float64}:
 5.0813235614097465
 0.8009541756702706

We have selected the distribution parameters that give us a probability density very similar to that given by the built-in function LogNormal.

plot( c1, lognormDist(c1, af1.param), label="Наша функция LogNorm", legend=:topleft )
plot!( c1, pdf.(LogNormal(af1.param[1], af1.param[2]), c1), label="Штатная функция для плотности Lognorm" )
plot!( c1, w1n, label="Гистограмма по которой мы строили распределение", legend=:bottomright )

Parameters of the right sample¶

Let's try to fit the points of the "right" half of the sample into an exponential probability density function:

$$P(x) = \lambda_1 e^{-\lambda_2 x}$$

Sometimes this distribution is constructed through a one-parameter function, but in our case the results are better if $\lambda_1 \neq \lambda_2$.

@. expDist( x, p ) = p[1] * exp( -p[2]*x );

p0 = [ 0.1, 0.1]; # Все очень сильно зависит от начального параметра. При значениях выше 0.5 модель не сходится
af2 = curve_fit( expDist, c2, w2n, p0 );

af2.param

2-element Vector{Float64}:
 0.00685261267119607
 0.005432207598447344

Since we use an iterative method to approximate the sample using a given function, we need to specify the initial approximation of the solution p_0, the choice of which often determines whether our algorithm will converge or not.

plot( c2, expDist(c2, af2.param), label="Экспоненциальное распределение" )
plot!( c2, w2n, label="Гистограмма, по которой мы искали параметры" )

Let's combine both parts¶

If we plot both functions on the same graph, we get a pretty good match with the original histogram, which we can verify using the package HypothesisTests.

stephist( Nc, xlims=(10,1000), bins = 10:5:1000, normalize=true, label="Исходные данные", fill=true, fillalpha=0.3, color=:black, fillcolor=1 )
plot!( [c1; c2], pdf.( dist0, [c1; c2] ), lw=3, lc=2, ls=:dash, label="Логнормальное распределение" )
plot!( [c1; c2], [lognormDist(c1, af1.param); expDist(c2, af2.param)], lw=3, lc=:black, ls=:dash, label="Найденная плотность распределения" )

Let us evaluate the quality of this solution¶

We have effectively fitted a piecewise linear function to the sample. Now let's estimate the residual error - both visually and using the Student's criterion.

using HypothesisTests

yy1 = pdf.( dist0, [c1; c2] )
yy2 = [ lognormDist(c1, af1.param); expDist(c2, af2.param) ]
yy3 = [w1n; w2n]

plot(
    plot( yy1 .- yy3, size=(600,250), lc=2,
        label="Lognorm (pvalue = $(round(pvalue(OneSampleTTest(yy1 .- yy3)),digits=3)))" ),
    plot( yy2 .- yy3, size=(600,250), lc=1,
        label="Lognorm + Exponential (pvalue=$(round(pvalue(OneSampleTTest(yy2 .- yy3)),digits=3)))" ),
    layout=(2,1)
)

The second error plot looks more homogeneous.

$p$The -value shows that we have accounted for more systematic effects, and that the error between the sample and the function we compiled is closer to a normally distributed random variable than the error between the sample and the distribution function LogNorm.

Conclusion¶

We solved an applied problem for which we needed to fit several functions into samples. As a result, we constructed a complex distribution function to describe the results of a physical experiment.