Implementation of the Nyquist filter for pulse signal processing

In modern digital communication systems, transmission quality depends directly on how efficiently the signal is processed, and pulse-shaping filters play a key role here. Root-raised-cosine (RRC) filters are particularly important because they offer a good compromise between spectral efficiency and resistance to intersymbol interference (ISI). These filters have become the de facto standard in many modern communication systems, including 5G cellular networks, satellite communications, and high-speed modems.

However, designing RRC filters is a complex engineering task that involves many parameters: the roll-off factor, the filter span in symbols, the number of samples per symbol, and the precision of the data representation. The traditional approach is a laborious iterative process: calculating coefficients, choosing an implementation structure, optimizing word lengths, and verifying the characteristics. Quantization effects add further complexity when moving to a fixed-point hardware implementation.

This article demonstrates a practical implementation using an RRC filter with a roll-off factor of 0.2, a span of 10 symbols, and 8 samples per symbol as the example. It compares the direct and transposed structures and verifies the frequency-domain and time-domain characteristics.

Let's move on to development. The code below calculates the coefficients of an FIR filter with a root-raised-cosine (RRC) response, which is the main shaping filter in modern digital communication systems. The function rrcosfilter computes the impulse response of the filter from three parameters: the roll-off factor, the filter span in symbols, and the number of samples per symbol. A matched pair of such filters, one in the transmitter and one in the receiver, shapes the spectrum of the transmitted signal, limiting its bandwidth while minimizing intersymbol interference. Calculating the coefficients is the first stage of filter design and the basis for everything that follows.
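For reference, the three branches of the function correspond to the standard closed-form RRC impulse response, written here with symbol period T and roll-off factor β. The listing uses T = 1 and afterwards scales the taps to unit energy, which this expression does not include:

```latex
h(t) =
\begin{cases}
\dfrac{1}{T}\left(1 - \beta + \dfrac{4\beta}{\pi}\right), & t = 0, \\[2ex]
\dfrac{\beta}{T\sqrt{2}}\left[\left(1 + \dfrac{2}{\pi}\right)\sin\dfrac{\pi}{4\beta}
  + \left(1 - \dfrac{2}{\pi}\right)\cos\dfrac{\pi}{4\beta}\right], & t = \pm\dfrac{T}{4\beta}, \\[2ex]
\dfrac{1}{T}\,
\dfrac{\sin\!\left(\pi\frac{t}{T}(1-\beta)\right) + 4\beta\frac{t}{T}\cos\!\left(\pi\frac{t}{T}(1+\beta)\right)}
      {\pi\frac{t}{T}\left(1 - \left(4\beta\frac{t}{T}\right)^{2}\right)}, & \text{otherwise.}
\end{cases}
```

The first two cases remove the 0/0 singularities of the general expression. With β = 0.2 and 8 samples per symbol, the points t = ±1.25T fall exactly on the sampling grid, so the second case is actually exercised in this example.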

using FFTW, DSP, Plots
function rrcosfilter(β::Float64, span::Int, sps::Int)
    T = 1.0
    t = range(-span/2, stop=span/2, length=span*sps+1)
    h = similar(t)
    ϵ = 1e-10
    
    for (i, τ) in enumerate(t)
        if abs(τ) < ϵ
            h[i] = 1.0 - β + 4β/π
        elseif abs(abs(τ) - T/(4β)) < ϵ
            h[i] = (β/√2) * ((1 + 2/π)*sin(π/(4β)) + (1 - 2/π)*cos(π/(4β)))
        else
            num = sin(π*τ/T*(1-β)) + 4β*τ/T * cos(π*τ/T*(1+β))
            den = π*τ/T * (1 - (4β*τ/T)^2)
            h[i] = num / den
        end
    end
    return h / √sum(h.^2)
end

rolloff_factor = 0.2
filter_span = 10
samples_per_symbol = 8
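# Full-rate impulse response: span * sps + 1 = 81 taps
coef = rrcosfilter(rolloff_factor, filter_span, samples_per_symbol)
# Keep every second tap: 81 -> 41 coefficients
# (the retained taps are spaced 1/4 of a symbol apart)
coef = coef[1:2:end]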
println("Rolloff factor: ", rolloff_factor)
println("Filter span in symbols: ", filter_span)
println("Output samples per symbol: ", samples_per_symbol)
println("Number of coefficients: ", length(coef))
plot(coef)

Rolloff factor: 0.2
Filter span in symbols: 10
Output samples per symbol: 8
Number of coefficients: 41
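One way to see the band-limiting effect of the filter is to evaluate the magnitude response of the taps at a few frequencies. The sketch below is an illustration only: rrc_taps repeats the closed-form expression used in rrcosfilter, freq_mag and gain_db are hypothetical helper names, and the full 81-tap response at 8 samples per symbol is assumed (before every second tap is dropped).

```julia
# RRC taps with unit energy (T = 1, time measured in symbol periods)
function rrc_taps(β::Float64, span::Int, sps::Int)
    t = range(-span/2, stop=span/2, length=span*sps + 1)
    h = zeros(length(t))
    for (i, τ) in enumerate(t)
        if abs(τ) < 1e-10
            h[i] = 1 - β + 4β/π
        elseif abs(abs(τ) - 1/(4β)) < 1e-10
            h[i] = (β/sqrt(2)) * ((1 + 2/π)*sin(π/(4β)) + (1 - 2/π)*cos(π/(4β)))
        else
            h[i] = (sin(π*τ*(1-β)) + 4β*τ*cos(π*τ*(1+β))) / (π*τ*(1 - (4β*τ)^2))
        end
    end
    return h ./ sqrt(sum(abs2, h))
end

# Magnitude of the frequency response at normalized frequency f (cycles/sample),
# computed directly from the definition of the discrete-time Fourier transform
freq_mag(h, f) = abs(sum(h[n] * cis(-2π*f*(n-1)) for n in eachindex(h)))

# Gain in dB relative to the gain at f = 0
gain_db(h, f) = 20*log10(freq_mag(h, f) / freq_mag(h, 0.0))

β, sps = 0.2, 8
h = rrc_taps(β, 10, sps)
f_edge = (1 + β) / (2 * sps)        # upper edge of the roll-off region

for f in (0.0, 0.5/sps, f_edge, 2*f_edge, 0.25)
    println("f = ", round(f, digits=4), " cycles/sample, gain = ",
            round(gain_db(h, f), digits=1), " dB")
end
```

At half the symbol rate (0.5/sps) an ideal RRC response is about 3 dB below its passband gain. Beyond f_edge the response of the truncated filter is limited by the sidelobes that truncation to 10 symbols introduces.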

The direct form is the classical structure of a finite impulse response (FIR) filter: the input signal passes through a chain of delay elements, the output of each delay is multiplied by the corresponding filter coefficient, and the products are summed to produce the output signal.

Code Description:

This code automatically builds a direct-form FIR filter model from the RRC coefficients calculated earlier. First, a model name with a random numeric suffix is generated, and the signal data type is defined as a signed 16-bit fixed-point format with 14 fractional bits, fixdt(1, 16, 14). The coefficients are converted to a signed 16-bit fixed-point format with 15 fractional bits.
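To make the effect of these formats concrete, here is a minimal sketch of fixed-point quantization in plain Julia. It does not use the fi function from the listing; quantize and dequantize are hypothetical helpers, and round-to-nearest with saturation is an assumption about the conversion, not a statement about how fi behaves.

```julia
# Quantize x to a signed fixed-point word with `wl` total bits and `fl`
# fractional bits. Returns the stored integer value.
# Assumed behaviour: round to nearest, saturate at the limits of the word.
function quantize(x::Real, wl::Int, fl::Int)
    q = round(Int, x * 2.0^fl)
    lo = -(1 << (wl - 1))
    hi = (1 << (wl - 1)) - 1
    return clamp(q, lo, hi)
end

# Real value represented by a stored integer with `fl` fractional bits
dequantize(q::Integer, fl::Int) = q / 2.0^fl

# Example coefficients (arbitrary values inside the representable range)
c = [0.3, -0.05, 0.001]
q = quantize.(c, 16, 15)
err = abs.(dequantize.(q, 15) .- c)

println("stored integers:   ", q)
println("max rounding error: ", maximum(err), " (half a step is ", 2.0^-16, ")")
```

A signed 16-bit word with 15 fractional bits covers [-1, 1 - 2^-15] in steps of 2^-15, which suits coefficients with magnitude below one. With 14 fractional bits the range doubles to [-2, 2 - 2^-14], leaving one extra integer bit of headroom for the signal path.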

Then the filter structure is built in a loop: for each coefficient, Gain, Unit Delay, and Add blocks are added. A chain of delay elements is formed, in which each delayed signal is multiplied by the corresponding coefficient and added to the previous results. The first coefficient processes the current input sample, and the subsequent coefficients process delayed versions of the signal.

The input signal is applied simultaneously to the first gain block and the first delay, which starts the tapped delay line. All arithmetic is performed in fixed-point format to stay compatible with the hardware implementation. Once the complete structure is built, the model is saved to a file and loaded for further use, so the filter architecture is created automatically with minimal developer intervention.

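# Model name with a random numeric suffix, and the path of the model file
name_model = "fir_$(round(Int, rand() * 10000))"
Path = (@__DIR__) * "/" * name_model * ".engee"
println("Path: $Path")
# Signal data type: signed, 16-bit word, 14 fractional bits
FIXED_POINT_TYPE = "fixdt(1, 16, 14)"
# Coefficients: signed, 16-bit word, 15 fractional bits
coef = fi(coef, 1, 16, 15)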

engee.create(name_model) # Create a model
engee.add_block("/Basic/Ports & Subsystems/In1", name_model*"/")
engee.add_block("/Basic/Ports & Subsystems/Out1", name_model*"/")

# Setting the fixed-point data type for the input port
engee.set_param!(name_model*"/In1",
    "OutDataTypeStr" => "Fixed-point",
    "OutDataTypeStrFixed" => FIXED_POINT_TYPE)

# Number of filter coefficients (taps)
num_coef = length(coef)

@time for n in 1:num_coef-1 
    name_gain = "Gain-" * string(n)
    engee.add_block("/Basic/Math Operations/Gain", name_model*"/"*name_gain)
    engee.set_param!(name_model*"/"*name_gain, 
        "Gain" => coef[n],
        "OutDataTypeStr" => "Fixed-point",
        "OutDataTypeStrFixed" => FIXED_POINT_TYPE)

    name_delay = "Delay-" * string(n)
    engee.add_block("/Basic/Discrete/Unit Delay", name_model*"/"*name_delay)  # One-sample delay

    name_add = "Add-" * string(n)
    engee.add_block("/Basic/Math Operations/Add", name_model*"/"*name_add)
    engee.set_param!(name_model*"/"*name_add,
        "OutDataTypeStr" => "Fixed-point", 
        "OutDataTypeStrFixed" => FIXED_POINT_TYPE)

    if n == 1
        # First tap: Gain-1 drives the first input of the first adder
        engee.add_line(name_gain*"/1", name_add*"/1")
    else
        # Extend the delay line and feed tap n from the previous delay
        name_delay_prev = "Delay-" * string(n-1)
        engee.add_line(name_delay_prev*"/1", name_delay*"/1")
        engee.add_line(name_delay_prev*"/1", name_gain*"/1")
        
        # Chain the adders: Add-(n-1) sums the running total and tap n
        name_add_prev = "Add-" * string(n-1)
        engee.add_line(name_add_prev*"/1", name_add*"/1")
        engee.add_line(name_gain*"/1", name_add_prev*"/2")
    end
    
    if n == num_coef-1
        # Last tap: one extra Gain block closes the chain and drives the output port
        name_gain_last = "Gain-" * string(n+1)
        engee.add_block("/Basic/Math Operations/Gain", name_model*"/"*name_gain_last)
        engee.set_param!(name_model*"/"*name_gain_last,
            "Gain" => coef[n+1],
            "OutDataTypeStr" => "Fixed-point",
            "OutDataTypeStrFixed" => FIXED_POINT_TYPE)
            
        engee.add_line(name_delay*"/1", name_gain_last*"/1")
        engee.add_line(name_gain_last*"/1", name_add*"/2")
        engee.add_line(name_add*"/1", "Out1/1")
    end
end

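# Feed the input port into the first tap and the head of the delay line
engee.add_line("In1/1", "Gain-1/1")
engee.add_line("In1/1", "Delay-1/1")
# Save the model and load it back for further use
engee.save(Path)
model = engee.load(Path, force=true)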

Path: /user/my_projects/Demo/CIC/fir_6632.engee
  7.545665 seconds (1.46 M allocations: 90.422 MiB, 1.44% gc time, 41 lock conflicts, 7.68% compilation time)

Model(
    name: fir_6632,
    id: 542c729d-ba18-4c30-a6d4-85d7b6ad6024
)
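The behaviour of the direct-form structure can be checked against a small software model. The following is a minimal sketch in plain Julia with floating-point arithmetic and no Engee calls; fir_direct is a hypothetical name, and the model ignores fixed-point effects.

```julia
# Direct-form FIR: a tapped delay line in which every tap is multiplied by
# its coefficient and all products are summed for each output sample.
function fir_direct(x::AbstractVector{<:Real}, c::AbstractVector{<:Real})
    delay = zeros(Float64, length(c))   # delay[1] holds the current input sample
    y = zeros(Float64, length(x))
    for (k, xk) in enumerate(x)
        # Shift the delay line by one sample and insert the new input
        for i in length(c):-1:2
            delay[i] = delay[i-1]
        end
        delay[1] = xk
        # Multiply every tap by its coefficient and sum the products
        y[k] = sum(c[i] * delay[i] for i in eachindex(c))
    end
    return y
end

# Feeding a unit impulse returns the coefficients themselves
c = [0.5, 0.25, -0.125]
println(fir_direct([1.0, 0.0, 0.0, 0.0, 0.0], c))
```

The impulse response of an FIR filter equals its coefficient vector, which gives a quick sanity check for any implementation of the structure.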

The transposed form is an alternative FIR structure in which the input signal is multiplied by all coefficients in parallel, and the products are accumulated sequentially through a chain of adders and delay elements.

Code Description:

This code creates a transposed FIR filter structure, which is fundamentally different from the direct form. In this architecture, the input signal is applied simultaneously to all multiplication blocks (Gain), each of which multiplies it by the corresponding coefficient. The resulting products are then summed sequentially through a chain of adders with delay elements inserted between them. This pipelines the structure: data advances through the whole chain on every clock cycle.

Advantages of the transposed form for Verilog code generation:

- Short critical path: between any two registers the signal passes through only one multiplier and one adder, which allows a higher clock frequency
- Natural pipelining: the structure is well suited to pipelined processing, which increases throughput
- Parallel multiplication: all multiplications are performed simultaneously, which simplifies parallelizing the computation in hardware
- Simplified routing: a parallel structure fed from a single data source is easier to implement in FPGAs
- Better scalability: adding new coefficients does not lengthen the critical path

name_model = "firT_$(round(Int, rand() * 10000))"
Path = (@__DIR__) * "/" * name_model * ".engee"
println("Path: $Path")
FIXED_POINT_TYPE = "fixdt(1, 16, 14)"
coef = fi(coef, 1, 16, 15)

# Create the model and its input and output ports
engee.create(name_model)
engee.add_block("/Basic/Ports & Subsystems/In1", name_model*"/")
engee.add_block("/Basic/Ports & Subsystems/Out1", name_model*"/")

num_coef = length(coef)

# First stage: only a Gain block, because there is no earlier partial sum to add
engee.add_block("/Basic/Math Operations/Gain", name_model*"/Gain-1")
engee.set_param!(name_model*"/Gain-1", 
    "Gain" => coef[1],
    "OutDataTypeStr" => "Fixed-point",
    "OutDataTypeStrFixed" => FIXED_POINT_TYPE)
engee.add_line("In1/1", "Gain-1/1")

# Stages 2..num_coef: Delay -> Add, with Gain-n fed directly from the input
for n in 2:num_coef
    name_delay = "Delay-" * string(n-1)
    name_add = "Add-" * string(n)
    name_gain = "Gain-" * string(n)
    engee.add_block("/Basic/Discrete/Unit Delay", name_model*"/"*name_delay)
    engee.add_block("/Basic/Math Operations/Add", name_model*"/"*name_add)
    engee.add_block("/Basic/Math Operations/Gain", name_model*"/"*name_gain)
    
    engee.set_param!(name_model*"/"*name_gain, 
        "Gain" => coef[n],
        "OutDataTypeStr" => "Fixed-point",
        "OutDataTypeStrFixed" => FIXED_POINT_TYPE)
    
    # Fixed-point output type for the adder
    engee.set_param!(name_model*"/"*name_add,
        "OutDataTypeStr" => "Fixed-point",
        "OutDataTypeStrFixed" => FIXED_POINT_TYPE)
    
    # The delay takes the previous partial sum (the output of Gain-1 for the first delay)
    prev_out = n == 2 ? "Gain-1" : "Add-" * string(n-1)
    engee.add_line(prev_out*"/1", name_delay*"/1")
    
    engee.add_line(name_delay*"/1", name_add*"/1")
    engee.add_line("In1/1", name_gain*"/1")  # The input feeds ALL coefficients
    engee.add_line(name_gain*"/1", name_add*"/2")
end

# We take the output from the last adder
last_add = "Add-" * string(num_coef)
engee.add_line(last_add*"/1", "Out1/1")

engee.save(Path)
model = engee.load(Path, force=true)

Path: /user/my_projects/Demo/CIC/firT_4077.engee

Model(
    name: firT_4077,
    id: 554b0993-f71f-481b-8624-928c38e1270e
)

For Verilog code generation, the transposed form is preferable: it maps directly onto a pipelined, high-clock-rate hardware implementation, giving better timing and more efficient use of FPGA resources while remaining functionally equivalent to the direct form.
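The functional equivalence of the two forms can be illustrated with a short software model. This is a sketch in plain floating-point Julia, not the Engee model itself; fir_transposed and fir_reference are hypothetical names, and the coefficient indexing follows the textbook convention in which c[1] multiplies the current sample.

```julia
# Transposed-form FIR: the input is multiplied by all coefficients at once,
# and the products are accumulated through a chain of registers s.
function fir_transposed(x::AbstractVector{<:Real}, c::AbstractVector{<:Real})
    N = length(c)
    s = zeros(Float64, max(N - 1, 0))   # registers between the adders
    y = zeros(Float64, length(x))
    for (k, xk) in enumerate(x)
        # The output adder combines the first product with the first register
        y[k] = c[1] * xk + (N > 1 ? s[1] : 0.0)
        # Register update: s[i] <- c[i+1]*x + s[i+1] (old value of s[i+1])
        for i in 1:N-2
            s[i] = c[i+1] * xk + s[i+1]
        end
        if N > 1
            s[N-1] = c[N] * xk          # the last stage has no earlier sum
        end
    end
    return y
end

# Reference: the convolution sum written out directly
function fir_reference(x::AbstractVector{<:Real}, c::AbstractVector{<:Real})
    return [sum(c[i] * x[k-i+1] for i in 1:min(k, length(c))) for k in 1:length(x)]
end

c = [0.5, 0.25, -0.125]
x = [1.0, -2.0, 3.0, 0.5, 0.0, 4.0]
println(fir_transposed(x, c))
println(fir_reference(x, c))
```

Both functions compute the same convolution sum; only the order of the additions differs, which is why floating-point results may deviate in the last bits while fixed-point results depend on the word lengths chosen for the intermediate sums.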

In the course of the work, transmit-side and receive-side filtering of a single pulse was implemented. The initial pulse was defined by the vector impulse = [1.0; zeros(100)], which was then convolved with the filter coefficients coef to produce the transmitted signal tx_signal.

On the receiving side, the matching operation was performed: the received signal was convolved with the same coefficients, which gave the reconstructed signal rx_signal. Plotting the initial pulse together with the transmitted and received signals showed close agreement, which confirmed that the algorithm is implemented correctly.

impulse = [1.0; zeros(100)]
tx_signal = DSP.conv(impulse, coef)
rx_signal = DSP.conv(tx_signal, coef)

plot(impulse)
plot!(tx_signal)
plot!(rx_signal)
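The point of cascading two RRC filters is that the combined response is approximately a raised cosine, which is close to zero at every whole symbol period away from its peak. The sketch below checks this numerically in plain Julia. It is an illustration under stated assumptions: rrc_taps repeats the closed-form expression from rrcosfilter, conv_full is a hypothetical stand-in for DSP.conv, and the full 81-tap response at 8 samples per symbol is used, so the symbol spacing is 8 samples (it would be 4 samples for the 41-tap set).

```julia
# RRC taps with unit energy (T = 1, time measured in symbol periods)
function rrc_taps(β::Float64, span::Int, sps::Int)
    t = range(-span/2, stop=span/2, length=span*sps + 1)
    h = zeros(length(t))
    for (i, τ) in enumerate(t)
        if abs(τ) < 1e-10
            h[i] = 1 - β + 4β/π
        elseif abs(abs(τ) - 1/(4β)) < 1e-10
            h[i] = (β/sqrt(2)) * ((1 + 2/π)*sin(π/(4β)) + (1 - 2/π)*cos(π/(4β)))
        else
            h[i] = (sin(π*τ*(1-β)) + 4β*τ*cos(π*τ*(1+β))) / (π*τ*(1 - (4β*τ)^2))
        end
    end
    return h ./ sqrt(sum(abs2, h))
end

# Full linear convolution of two real vectors
function conv_full(a::AbstractVector{<:Real}, b::AbstractVector{<:Real})
    y = zeros(Float64, length(a) + length(b) - 1)
    for i in eachindex(a), j in eachindex(b)
        y[i + j - 1] += a[i] * b[j]
    end
    return y
end

sps = 8
h = rrc_taps(0.2, 10, sps)
g = conv_full(h, h)          # TX filter followed by the matched RX filter
peak = argmax(g)             # centre of the 161-sample combined response

# Samples a whole number of symbols away from the peak carry the residual ISI
isi = [g[peak + k*sps] for k in -10:10 if k != 0]

println("peak index: ", peak, ", peak value: ", round(g[peak], digits=4))
println("largest |ISI| relative to the peak: ",
        round(maximum(abs, isi) / g[peak], digits=4))
```

The residual is not exactly zero because the impulse response is truncated to 10 symbols; a longer span reduces it at the cost of more taps.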

A similar procedure was successfully reproduced in the model, using the filter structures implemented above for validation, which further verified the approach. The resulting model is a ready-made solution suitable for Verilog code generation.

# Defining the current directory
current_dir = @__DIR__
# Displaying the folder structure
println("Folder structure in the directory: $current_dir")
println()
for (root, dirs, files) in walkdir(current_dir)
    for dir in dirs
        if startswith(dir, "one_impulse_Nyquist_filter")
            println("$dir/")
            dir_path = joinpath(root, dir)
            for file in readdir(dir_path)
                println("  └── $file")
            end
            println()
        end
    end
end

Folder structure in the directory: /user/my_projects/Demo/FIR_gen_fi

one_impulse_Nyquist_filter_RX_code/
  └── Delay_40.v
  └── fir_liner_1.v
  └── one_impulse_Nyquist_filter_RX.v

one_impulse_Nyquist_filter_TX_code/
  └── Delay_40.v
  └── fir_liner_1.v
  └── one_impulse_Nyquist_filter_TX.v

The generated Verilog code is a hardware implementation of the transmitting (TX) and receiving (RX) Nyquist filters for pulse signal processing. Both filters use the same processing cores (FIR and Delay), but they differ in control logic and activation conditions. Main components:

Top-level modules (one_impulse_Nyquist_filter_TX/RX.v):
- Manage data synchronization and validation
- Contain a finite state machine for data flow control
- Integrate the FIR filter and the delay unit

FIR filter (fir_liner_1.v):
- Implements a linear FIR filter (41 coefficients)
- Uses fixed coefficients to generate the impulse response
- Processes signed 16-bit data

Delay unit (Delay_40.v):
- Implements a chain of 40 delay registers
- Synchronizes the valid signal with the filtering delay
- Provides time alignment of the data

Conclusion

The simulation confirmed that the pulse processing is implemented correctly. The agreement between the results obtained in the mathematical environment and in the hardware model supports the correctness of the algorithm and of the selected filter coefficients.

The generated hardware code provides complete transmitting and receiving modules. They are built on a common filter core and differ only in control logic, which ensures energy-efficient transmission and continuous reception. The code is modular and ready for synthesis on the target digital platform.