Engee documentation
Notebook

Parallel computing demonstrated by calculating the number pi

In today's world, parallelization has become not just an optimization but a necessity. As data volumes and computational complexity grow, sequential algorithms often turn out to be insufficiently efficient. The Julia language, designed specifically for high-performance computing, offers powerful tools for parallel work that can significantly speed up task execution. Using a practical example of calculating the number pi with the Monte Carlo method, this article demonstrates various approaches to parallel computing in Julia. We will cover:

  • Basic sequential implementations for performance comparison

  • Several parallelization methods using Julia's built-in tools

  • Problems that arise when parallel computing is improperly organized

  • Comparing the effectiveness of different approaches

  • Practical recommendations for choosing a parallelization strategy

The code presented in the article shows how to correctly organize parallel computing, avoiding common mistakes, and demonstrates the real speedup that can be achieved with proper use of multithreading in Julia.

The Monte Carlo method for calculating pi was chosen as a clear example that is easy to understand and scales well with parallelization, allowing you to clearly demonstrate the advantages of different approaches to organizing parallel computing.

Libraries used in the code:

  • using Base.Threads - for multithreaded computing (creating and managing threads)
  • using Random - for generating random numbers (the basis of the Monte Carlo method)
  • using Statistics - for statistical operations (summation, comparison of results)
In [ ]:
using Base.Threads
using Random
using Statistics
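
A quick check before we start: the parallel versions below only give a speedup if Julia itself was started with several threads, for example with julia --threads=auto or with the JULIA_NUM_THREADS environment variable. The cell below simply shows how many threads are available in the current session.

In [ ]:
# Number of threads available to this Julia session; the parallel versions
# below can only scale up to this value
Threads.nthreads()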

Monte Carlo method for calculating pi: naive sequential version

This function returns an approximate value of the number π with an accuracy that improves as the number of random points increases. When using 100 million points, we can expect a result of about 3.14159 with a typical error of about 0.0001. The more points are generated, the closer the result is to the true value of π.

The algorithm is based on geometric probability and uses a simple but powerful idea: a ratio of areas. We consider the unit square with coordinates from 0 to 1 on both axes. A quarter of the unit circle centered at the origin fits inside this square.

The function works according to the following principle: for each of the n points, two random x and y coordinates are generated in the range from 0 to 1. Then it is checked whether the point falls inside a quarter circle by checking the condition that the sum of the squares of the coordinates does not exceed 1. If the point falls inside the circle, the counter is incremented.

The final value of π is calculated as four times the ratio of the number of points inside the circle to the total number of points generated. This follows from the ratio of the areas: the area of the square is 1, the area of the quarter circle is π/4, so the fraction of points that land inside the circle approaches π/4.

The basis of the method is the ratio between the areas of geometric shapes. The area of the unit square is 1, and the area of the quarter of the unit circle is π/4. With a uniform random distribution of points over a square, the probability of a point falling into a circle is equal to the ratio of the areas, that is, π/4. Therefore, a statistical estimate of this probability multiplied by 4 gives an approximate value of π.

This approach is extremely easy to implement and understand. It does not require complex mathematical calculations or special knowledge. However, the accuracy of the method directly depends on the quality of the random number generator and the number of points. The algorithm has linear time complexity and constant memory complexity, which makes it effective for demonstration purposes.

The convergence rate of the method is approximately 1/√n, which means that to reduce the error by a factor of 10, it is necessary to increase the number of points by a factor of 100. Despite its relatively slow convergence, the method is well suited for demonstrating the principles of statistical modeling and parallel computing, as it is easy to parallelize and modify.
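
The same reasoning in formula form (a short restatement of the paragraphs above, with p denoting the probability that a point lands inside the quarter circle):

$$
p = \frac{\pi/4}{1} \quad\Rightarrow\quad \pi \approx 4\,\frac{\text{count}}{n},
\qquad
\sigma_{\hat{\pi}} \approx 4\sqrt{\frac{p(1-p)}{n}} \approx \frac{1.6}{\sqrt{n}}.
$$

For n = 10⁸ this gives a typical error of about 1.6·10⁻⁴, in line with the figure of roughly 0.0001 mentioned above.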

In [ ]:
function compute_pi_serial_naive(n)
    count = 0
    for i in 1:n
        # Random point in the unit square [0, 1] x [0, 1]
        x = rand()
        y = rand()
        # The point is counted if it falls inside the quarter circle of radius 1
        if x^2 + y^2 <= 1.0
            count += 1
        end
    end
    # Ratio of areas (π/4 to 1) gives the factor of 4
    return 4.0 * count / n
end
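
A quick sanity check on a smaller sample (illustrative only; the exact value fluctuates from run to run):

In [ ]:
# Even with one million points the estimate is already close to 3.14
compute_pi_serial_naive(1_000_000)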

Vectorized implementation

It works like the previous method, but with key differences in implementation: this function calculates pi using the same Monte Carlo method, but replaces the element-by-element loop with a vectorized approach.

Instead of processing the points one at a time in a loop, the function first generates two large arrays with all the random x and y coordinates. It then computes the squared distances for all points at once using element-wise vector operations. After that, the number of points that satisfy the circle condition is counted in a single operation.

This approach makes better use of the capabilities of modern processors for parallel data processing and is optimized for working with large arrays, but requires more memory to store intermediate arrays.

In [ ]:
function compute_pi_serial_vectorized(n)
    # All random coordinates are generated at once as two n-element arrays
    x_arr = rand(n)
    y_arr = rand(n)
    # Element-wise squared distance from the origin for every point
    distances_sq = x_arr.^2 .+ y_arr.^2
    # Count the points inside the quarter circle in a single operation
    count = sum(distances_sq .<= 1.0)
    return 4.0 * count / n
end
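
The extra memory cost mentioned above is easy to observe (an illustrative check; the exact figures depend on the machine and on n): @time reports the temporary arrays allocated by the vectorized version, while the naive loop allocates essentially nothing.

In [ ]:
# Compare allocations of the two sequential versions on a smaller sample
@time compute_pi_serial_naive(10_000_000)
@time compute_pi_serial_vectorized(10_000_000)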

Parallel implementation with reduction to a shared counter

This version works on the same Monte Carlo principle, but with parallel computing: it uses multithreading to speed up the calculation by distributing the work across multiple processor cores.

The algorithm splits the total number of points into approximately equal chunks based on the number of available threads. Each thread works on its own portion of the data and uses its own random number generator with a unique seed, so that the random sequences are independent. The threads independently count the points that fall inside the circle in their own ranges, and the partial results are then added to a shared counter. Because several threads update that counter, the addition has to be synchronized; here an atomic counter is used, so each thread otherwise works only with local variables and only the final reduction touches shared state. Using a separate random number generator per thread keeps the statistical sampling correct when executed in parallel.

In [ ]:
function compute_pi_parallel_reduction(n)
    nthreads = Threads.nthreads()
    chunk_size = n ÷ nthreads
    # Atomic counter: several threads add their partial results to it,
    # so the addition must be synchronized to avoid a data race
    total_count = Threads.Atomic{Int}(0)
    @threads for chunk in 1:nthreads
        local_count = 0
        start_idx = (chunk - 1) * chunk_size + 1
        end_idx = chunk == nthreads ? n : chunk * chunk_size
        
        # Per-chunk RNG with a unique seed keeps the random streams independent
        rng = MersenneTwister(chunk + 42)
        for i in start_idx:end_idx
            x = rand(rng)
            y = rand(rng)
            if x^2 + y^2 <= 1.0
                local_count += 1
            end
        end
        Threads.atomic_add!(total_count, local_count)
    end
    return 4.0 * total_count[] / n
end
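
For contrast with the atomic addition above, here is a minimal sketch (an anti-example for illustration only, not part of the benchmark) of the common mistake mentioned in the introduction: updating a plain shared counter from several threads. The code runs, but increments are silently lost, so the estimate is typically far below π.

In [ ]:
# ANTI-EXAMPLE, do not use: unsynchronized read-modify-write of a shared counter.
# Several threads can read the same old value of racy_count and overwrite each
# other's increments, so points inside the circle are undercounted.
function compute_pi_parallel_racy(n)
    racy_count = 0
    @threads for i in 1:n
        x = rand()
        y = rand()
        if x^2 + y^2 <= 1.0
            racy_count += 1   # data race
        end
    end
    return 4.0 * racy_count / n
end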

Parallel implementation with per-thread result storage

It works using the same Monte Carlo method, but with batch separation of the work and explicit storage of the partial results: this version also uses multithreading, but differs in how the results are aggregated between threads.

Instead of directly adding the results to a shared variable, each thread writes its partial counter into a pre-allocated results array. Each element of the array corresponds to one thread, which rules out conflicts during writing. After all threads have finished, the array elements are summed. This approach gives a clear separation of data between threads and removes the need for synchronization during the computation, since each thread works only with its own array element. The method is especially effective when the number of threads is known in advance and stays constant during the computation.

In [ ]:
function compute_pi_parallel_batch(n)
    nthreads = Threads.nthreads()
    chunk_size = n ÷ nthreads
    # One slot per thread: each thread writes only its own element,
    # so no synchronization is needed during the loop
    results = zeros(Int64, nthreads)
    
    @threads for thread in 1:nthreads
        local_count = 0
        start_idx = (thread - 1) * chunk_size + 1
        end_idx = thread == nthreads ? n : thread * chunk_size
        
        # Per-thread RNG with a unique seed
        rng = MersenneTwister(thread + 100)
        for i in start_idx:end_idx
            x = rand(rng)
            y = rand(rng)
            local_count += (x^2 + y^2 <= 1.0)
        end
        results[thread] = local_count
    end
    return 4.0 * sum(results) / n
end
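
A useful side effect of the fixed per-thread seeds and fixed index ranges is reproducibility: for the same n (and the same number of threads) this version returns exactly the same estimate on every run, regardless of how the threads are scheduled. A small illustrative check:

In [ ]:
# Deterministic chunking + fixed seeds => identical results across runs
compute_pi_parallel_batch(1_000_000) == compute_pi_parallel_batch(1_000_000)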

Parallel implementation with asynchronous tasks (@spawn)

It uses the same Monte Carlo method, but with asynchronous tasks instead of direct thread control: this version implements parallel computing through the task mechanism, which offers a more flexible approach to parallelization.

Instead of explicitly distributing the work across threads with the @threads directive, the function creates independent asynchronous tasks with the @spawn macro. Each task computes its own portion of points and returns a local counter. All tasks run concurrently, and their results are collected with fetch, which waits for each task to finish and retrieves its result. This approach lets the Julia scheduler distribute tasks across the available threads more flexibly and make better use of system resources, especially when the tasks take unequal time. It also makes it easier to scale and to organize more complex parallel workflows.

In [ ]:
function compute_pi_parallel_spawn(n)
    nchunks = Threads.nthreads()
    chunk_size = n ÷ nchunks
    tasks = Task[]
    
    for chunk in 1:nchunks
        start_idx = (chunk - 1) * chunk_size + 1
        end_idx = chunk == nchunks ? n : chunk * chunk_size
        
        # Launch an independent task per chunk; the scheduler assigns
        # the tasks to the available threads
        task = Threads.@spawn begin
            local_count = 0
            rng = MersenneTwister(chunk + 200)
            for i in start_idx:end_idx
                x = rand(rng)
                y = rand(rng)
                local_count += (x^2 + y^2 <= 1.0)
            end
            local_count
        end
        push!(tasks, task)
    end
    
    # fetch waits for each task to finish and returns its local counter
    total_count = sum(fetch.(tasks))
    return 4.0 * total_count / n
end
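
Because @spawn leaves scheduling to the Julia runtime, the number of chunks does not have to match the number of threads. The sketch below (a hypothetical variant with the same Monte Carlo kernel, not used in the benchmark) splits the work into more, smaller tasks, which lets the scheduler balance the load when chunks take unequal time.

In [ ]:
# Hypothetical variant for illustration: more tasks than threads
function compute_pi_parallel_spawn_chunked(n; nchunks = 4 * Threads.nthreads())
    chunk_size = n ÷ nchunks
    tasks = map(1:nchunks) do chunk
        start_idx = (chunk - 1) * chunk_size + 1
        end_idx = chunk == nchunks ? n : chunk * chunk_size
        Threads.@spawn begin
            rng = MersenneTwister(chunk + 300)   # per-task RNG, as above
            local_count = 0
            for i in start_idx:end_idx
                x = rand(rng)
                y = rand(rng)
                local_count += (x^2 + y^2 <= 1.0)
            end
            local_count
        end
    end
    return 4.0 * sum(fetch.(tasks)) / n
end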

Benchmarking all versions

Next comes a function that tests and compares all versions of the algorithm: it runs every implementation of the Monte Carlo method for calculating pi and evaluates both the performance and the accuracy of each approach.

The script sequentially runs all five versions of the algorithm with the same number of points, measuring the execution time of each function. For each implementation, the approximate value of π and the calculation time are stored. After all the tests are completed, a comparative performance analysis is performed, where the speedup is calculated relative to the naive sequential version, which is taken as the baseline.

The function reports two key indicators for each version: the speedup factor relative to the baseline method and the absolute calculation error compared to the exact value of π. This makes it possible to evaluate both the effectiveness of parallelization and the statistical reliability of each method, identifying the best approaches for different use cases.

In [ ]:
function benchmark_all_versions_fixed(n = 100_000_000)
    println("Number of threads: ", nthreads())
    println("Number of iterations: ", n)
    println()
    
    versions = [
        ("Naive and consistent", compute_pi_serial_naive),
        ("Vectorized sequential", compute_pi_serial_vectorized),
        ("Parallel (reduction)", compute_pi_parallel_reduction),
        ("Parallel (batch)", compute_pi_parallel_batch),
        ("Parallel (spawn)", compute_pi_parallel_spawn)
    ]
    
    results = []
    times = []
    
    for (name, func) in versions
        print("Testing the $name... ")
        time_taken = @elapsed begin
            pi_approx = func(n)
        end
        push!(results, pi_approx)
        push!(times, time_taken)
        println("π ≈ $pi_approx, time: $(round(time_taken, digits=3)) sec")
    end
    
    println("\n" * "="^50)
    println("PERFORMANCE COMPARISON:")
    base_time = times[1]
    
    for (i, (name, _)) in enumerate(versions)
        speedup = base_time / times[i]
        println("$name: acceleration $(round(speedup, digits=2))x")
    end
    
    println("\n" * "="^50)
    println("CHECKING ACCURACY:")
    actual_pi = π
    for (i, (name, _)) in enumerate(versions)
        error = abs(results[i] - actual_pi)
        println("$name: error $(round(error, digits=6))")
    end
end

benchmark_all_versions_fixed(100_000_000)
Number of threads: 7
Number of iterations: 100000000

Testing Naive sequential... π ≈ 3.14135832, time: 0.364 sec
Testing Vectorized sequential... π ≈ 3.14162076, time: 2.385 sec
Testing Parallel (reduction)... π ≈ 3.141488, time: 0.266 sec
Testing Parallel (batch)... π ≈ 3.14217288, time: 0.225 sec
Testing Parallel (spawn)... π ≈ 3.14138508, time: 0.2 sec

==================================================
PERFORMANCE COMPARISON:
Naive sequential: speedup 1.0x
Vectorized sequential: speedup 0.15x
Parallel (reduction): speedup 1.37x
Parallel (batch): speedup 1.62x
Parallel (spawn): speedup 1.82x

==================================================
CHECKING ACCURACY:
Naive sequential: error 0.000234
Vectorized sequential: error 2.8e-5
Parallel (reduction): error 0.000105
Parallel (batch): error 0.00058
Parallel (spawn): error 0.000208

Conclusions

Performance winner: the parallel version with @spawn showed the best result, a speedup of 1.82x over the naive sequential baseline. Note that this is still well below the ideal 7x that 7 threads could provide in theory.

Accuracy of the methods: All implementations provided acceptable accuracy with errors of the order of 10⁻⁴, which corresponds to the theoretical error of the Monte Carlo method for 100 million points.

Parallelization efficiency: all three parallel versions showed a clear speedup over the sequential implementations, demonstrating the practical benefit of multithreading in Julia.

Why is the vectorized version the slowest?

The problem of excessive memory allocation and copying:

  • Temporary arrays are created for x_arr.^2 and y_arr.^2
  • An intermediate array distances_sq is formed
  • A Boolean array is created for the condition .<= 1.0

Inefficient cache usage:

  • Large arrays do not fit into the processor cache
  • Constant access to RAM instead of the fast cache
  • Poor data locality

Overhead of memory management:

  • Allocation and deallocation of temporary arrays
  • Redundant data movement operations

In the element-by-element loop, the conditional branch is predicted efficiently by modern processors, so hiding it behind vectorized operations brings little benefit. Vectorization pays off for complex mathematical operations, but for simple arithmetic with a condition, the memory-management overhead outweighs the gains from processing the data in bulk.
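
A practical takeaway (a sketch, not part of the benchmark above): the same estimate can be written with an iterator instead of temporary arrays, keeping the code as compact as the vectorized version while allocating nothing per point.

In [ ]:
# Same Monte Carlo logic with `count` over a lazy range: no n-element
# temporary arrays are created, unlike compute_pi_serial_vectorized
compute_pi_no_alloc(n) = 4.0 * count(_ -> rand()^2 + rand()^2 <= 1.0, 1:n) / n

@time compute_pi_no_alloc(10_000_000)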