QAM16 Demodulator - Fixed-Point RTL Implementation

Overview

A fully synchronous, parameterizable 16-QAM demodulator implemented in synthesizable Verilog. The design closes the carrier-recovery and symbol-timing loops entirely in the digital domain using a fixed-point CORDIC engine; no floating-point arithmetic appears anywhere in the datapath. All angular quantities are in Q2.30 format; sample-level IQ data is Q1.30, except the NCO cos/sin outputs, which are truncated to Q1.14.

The demodulator simultaneously handles carrier frequency offsets (tested up to 0.020 rad/sample), static phase offsets, and additive white Gaussian noise. Lock is reported at exactly 32 symbols across all tested conditions (50 random LFSR seeds, frequency offsets up to 0.020 rad/sample). From 9 dB upward the BER curve follows the shape of theoretical 16-QAM performance with an implementation loss of roughly 2–3 dB, and the last nonzero measurement is ~2.6×10⁻⁴ at 15 dB, a residual attributed to CORDIC quantisation and Costas loop residual jitter.

Architecture

Signal Flow

IF Input (Q1.30, SPS=2)
        │
        ▼
  [cordic_nco]  ◄── freq_adj, phase_adj  (Costas feedback)
        │  nco_i / nco_q  (Q1.14)
        ▼
  [cordic_mixer]        complex down-conversion
        │  mix_i / mix_q  (Q1.30)
        ▼
  [gardner_timing]      symbol timing recovery  (SPS=2 → SPS=1)
        │  timed_i / timed_q  (Q1.30)
        ▼
  [qam16_slicer]        hard decision + ideal point reconstruction
        │  symbol[3:0], ideal_i, ideal_q
        ▼
  [costas_loop]         decision-directed phase error → PI filter
        │  freq_adj, phase_adj
        └──────────────────────────► [cordic_nco]  (feedback)

Numerically Controlled Oscillator — `cordic_nco`

The NCO maintains a 64-bit signed phase accumulator. A 33-bit accumulator would technically hold ±π in Q2.30, but TWO_PI = 6.75×10⁹ exceeds the signed 33-bit range (4.3×10⁹), making wrap arithmetic unsafe at the extremes. The 64-bit accumulator eliminates all overflow risk at negligible synthesis cost on modern FPGA targets.

Phase wrapping to (−π, π] is performed combinationally before each CORDIC invocation. The NCO then folds the wrapped phase further into (−π/2, π/2] to stay within the CORDIC convergence domain. A quadrant-negation flag is registered and applied to the output to recover correct cosine and sine for all four quadrants.
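
The wrap-and-fold step can be illustrated with a small Python model. This is a minimal sketch rather than the RTL: the names `wrap_pi`, `fold_half_pi`, and `nco_cos_sin` are hypothetical, and `math.cos`/`math.sin` stand in for the rotation-mode CORDIC.

```python
import math

FRAC = 30                                 # fractional bits of the angle format
PI_Q = round(math.pi * (1 << FRAC))       # pi as a Q2.30 integer
TWO_PI_Q = 2 * PI_Q
HALF_PI_Q = PI_Q // 2

def wrap_pi(phase):
    """Wrap an integer phase into (-pi, pi]."""
    phase %= TWO_PI_Q                     # Python's % yields a value in [0, 2*pi)
    if phase > PI_Q:
        phase -= TWO_PI_Q
    return phase

def fold_half_pi(phase):
    """Fold a wrapped phase into (-pi/2, pi/2]; return (folded, negate_flag)."""
    if phase > HALF_PI_Q:
        return phase - PI_Q, True
    if phase <= -HALF_PI_Q:
        return phase + PI_Q, True
    return phase, False

def nco_cos_sin(phase):
    """Cos/sin of an arbitrary accumulator value via wrap, fold, and negate."""
    folded, negate = fold_half_pi(wrap_pi(phase))
    angle = folded / (1 << FRAC)
    c, s = math.cos(angle), math.sin(angle)   # stand-in for the CORDIC (assumption)
    return (-c, -s) if negate else (c, s)
```

A single flag is enough because shifting the angle by π in either direction negates both cosine and sine, so both folded half-planes are recovered by the same output negation.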

A design-critical choice: the CORDIC is launched on the next phase accumulator value, computed combinationally, rather than the registered previous value. Without this look-ahead, the NCO introduces a one-sample lag of phase_inc per clock. Over a long symbol sequence under a nonzero frequency offset, this lag accumulates in the Costas integrator as a constant residual error, driving the phase trajectory toward a fold boundary and causing the loop to appear locked while producing intermittent phase slips.

Phase adjustments from the Costas loop are latched separately and consumed atomically on the next NCO enable pulse, preventing a race between the feedback write and the accumulator read.

The NCO outputs cos/sin in Q1.14 (right-shifted by 16 from the full Q1.30 CORDIC output). This truncation keeps all mixer partial products within 64 bits without additional scaling.

Complex Mixer — `cordic_mixer`

Implements conjugate multiplication to down-convert the incoming IF signal to baseband:

out_i = in_i × nco_i + in_q × nco_q
out_q = in_q × nco_i − in_i × nco_q

The four partial products (Q1.30 × Q1.14 = Q2.44, held in 64-bit registers) are right-shifted by 14 to produce Q1.30 baseband samples. One clock latency.
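
The mixer arithmetic can be modelled with plain Python integers. The sketch below is illustrative only: `to_q`, `q30_to_q14`, and `mixer` are hypothetical names, and applying the shift to the summed products (rather than to each product separately) is an assumption about the RTL.

```python
import math

def to_q(x, frac):
    """Quantise a float to a signed integer with `frac` fractional bits."""
    return int(round(x * (1 << frac)))

def q30_to_q14(v):
    """Truncate a Q1.30 value to Q1.14 (arithmetic right shift by 16)."""
    return v >> 16

def mixer(in_i, in_q, nco_i, nco_q):
    """Multiply the input by the conjugate of the NCO phasor.

    in_i/in_q are Q1.30 integers, nco_i/nco_q are Q1.14 integers.
    Shifting right by 14 removes the NCO's fractional bits, leaving Q1.30.
    """
    out_i = (in_i * nco_i + in_q * nco_q) >> 14
    out_q = (in_q * nco_i - in_i * nco_q) >> 14
    return out_i, out_q

# A sample at angle theta mixed with an NCO at the same angle lands on the I axis.
theta = 0.7
in_i, in_q = to_q(0.5 * math.cos(theta), 30), to_q(0.5 * math.sin(theta), 30)
nco_i = q30_to_q14(to_q(math.cos(theta), 30))
nco_q = q30_to_q14(to_q(math.sin(theta), 30))
out_i, out_q = mixer(in_i, in_q, nco_i, nco_q)
```

In this example `out_i / 2**30` comes out close to 0.5 and `out_q` close to zero; the small remainder reflects the 14-bit NCO truncation.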

At the top level, input samples in_i/in_q are registered into hold registers at valid_in and fed from there to the mixer. Without this, the mixer would see stale data during the ITER+3 clock cycles the CORDIC requires to produce nco_valid, since the external sample pointer advances long before the NCO output is ready.

Symbol Timing Recovery — `gardner_timing`

Implements the Gardner timing error detector, a non-data-aided algorithm suited to QAM. The Gardner TED operates at 2 samples per symbol:

ted = (x[n] − x[n−2]) × x[n−1]

Applied independently to I and Q, then summed.

This cross-correlation measures the zero-crossing position of the eye diagram without requiring knowledge of the transmitted symbols, making it immune to decision errors before carrier lock. The error is filtered through a PI loop with separately tunable proportional and integral coefficients.

The output is a linearly interpolated sample using fractional timing offset mu:

y = buf[1] + mu × (buf[0] − buf[1])

The first two symbol-boundary outputs are suppressed during warmup. buf[2] is zero-initialized; a Gardner error computed against a zero sample would corrupt the initial timing estimate.
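
The two timing formulas above translate directly into code. This is a floating-point sketch under stated assumptions: `gardner_ted` and `interpolate` are hypothetical names, the samples are passed as short lists ordered `[x[n-2], x[n-1], x[n]]`, and the PI loop and warmup logic are omitted.

```python
def gardner_ted(i_samples, q_samples):
    """Gardner error from the last three samples [x[n-2], x[n-1], x[n]] per rail."""
    ted_i = (i_samples[2] - i_samples[0]) * i_samples[1]
    ted_q = (q_samples[2] - q_samples[0]) * q_samples[1]
    return ted_i + ted_q

def interpolate(buf0, buf1, mu):
    """Linear interpolation between buf[1] and buf[0] with fractional offset mu."""
    return buf1 + mu * (buf0 - buf1)

# I rail: symbol transition +1 -> -1. Q rail: no transition.
on_time = gardner_ted([1.0, 0.0, -1.0], [0.5, 0.5, 0.5])  # mid-sample on the zero crossing
skewed = gardner_ted([1.0, 0.2, -1.0], [0.5, 0.5, 0.5])   # mid-sample off the zero crossing
```

With the mid-sample exactly on the zero crossing the error is zero; moving it off the crossing produces a nonzero error whose sign tells the loop which way to shift `mu`. A rail without a transition contributes nothing, which is why the detector needs no symbol decisions.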

16-QAM Slicer — `qam16_slicer`

The constellation is normalized to unit RMS power per symbol, placing amplitude levels at ±1/√10 and ±3/√10 — the standard ETSI/3GPP normalization.

Gray coding per axis ensures adjacent decision regions differ by exactly one bit:

< −2d       →  00  (ideal: −3d)
−2d .. 0    →  01  (ideal: −1d)
  0 .. +2d  →  11  (ideal: +1d)
> +2d       →  10  (ideal: +3d)

d = 1/√10 ≈ 0.31623

The slicer outputs the 4-bit hard decision and the ideal noiseless constellation point in Q1.30. All threshold and level constants are pre-computed as Q1.30 integers — no runtime multiplication in the slicer datapath.
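
A minimal Python model of the per-axis decision, assuming the I bits occupy `symbol[3:2]` and the Q bits `symbol[1:0]`, and assuming samples exactly on a threshold fall into the upper region; neither detail is stated above. The names `slice_axis` and `qam16_slice` are hypothetical.

```python
import math

FRAC = 30
D = round((1 << FRAC) / math.sqrt(10))    # d = 1/sqrt(10) as a Q1.30 integer
TWO_D, THREE_D = 2 * D, 3 * D             # threshold and outer level, pre-computed

def slice_axis(v):
    """Return (2-bit Gray code, ideal level) for one axis; v is Q1.30."""
    if v < -TWO_D:
        return 0b00, -THREE_D
    if v < 0:
        return 0b01, -D
    if v < TWO_D:
        return 0b11, D
    return 0b10, THREE_D

def qam16_slice(i, q):
    """Return (4-bit symbol, ideal_i, ideal_q) using comparisons only."""
    bits_i, ideal_i = slice_axis(i)
    bits_q, ideal_q = slice_axis(q)
    return (bits_i << 2) | bits_q, ideal_i, ideal_q
```

For example, a sample at (0.9, −0.1) falls in the outer positive I region and the inner negative Q region, so under the bit ordering assumed here it slices to `0b1001` with ideal point (+3d, −1d).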

Costas Loop — `costas_loop`

Phase error is computed via a full atan2 using a CORDIC instance in vectoring mode, rather than a small-angle cross-product approximation. The inputs are:

x = Re{ conj(decision) × received }  =  dec_i×rx_i + dec_q×rx_q   (dot product)
y = Im{ conj(decision) × received }  =  dec_i×rx_q − dec_q×rx_i   (cross product)

The vectoring-mode CORDIC drives y → 0 and accumulates the rotation in z_out, which equals atan2(y, x) — the true phase error in radians. This remains well-behaved for large initial errors during acquisition, unlike the linear approximation which saturates and provides incorrect gradient information during pull-in.
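
The difference between the full atan2 and the cross-product approximation is easy to see numerically. The sketch below uses floating point and `math.atan2` in place of the vectoring CORDIC; `phase_error` and `rotate` are hypothetical helpers, and the example assumes the slicer decision is correct even for the large rotation.

```python
import math

def phase_error(dec_i, dec_q, rx_i, rx_q):
    """Phase of conj(decision) * received, in radians."""
    x = dec_i * rx_i + dec_q * rx_q     # dot product (real part)
    y = dec_i * rx_q - dec_q * rx_i     # cross product (imaginary part)
    return math.atan2(y, x)

def rotate(i, q, angle):
    """Rotate the point (i, q) counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return i * c - q * s, i * s + q * c

d = 1 / math.sqrt(10)
dec = (3 * d, d)                        # one 16-QAM constellation point
small = phase_error(*dec, *rotate(*dec, 0.05))
large = phase_error(*dec, *rotate(*dec, 1.2))

# Normalised cross product for the same large rotation, for comparison.
rx = rotate(*dec, 1.2)
cross = (dec[0] * rx[1] - dec[1] * rx[0]) / (dec[0] ** 2 + dec[1] ** 2)
```

Here `small` and `large` recover 0.05 and 1.2 rad, while the normalised cross product for the 1.2 rad case is sin(1.2) ≈ 0.93, illustrating the compression of large errors that the atan2 detector avoids.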

The loop filter is a standard PI structure:

integrator  +=  Kd × phase_error     (frequency path)
freq_adj     =  integrator            (drives NCO freq offset)
phase_adj    =  Kp × phase_error      (fast phase correction)

The integrator is clamped to ±π (Q2.30) as anti-windup. Lock detection monitors delta(freq_adj): when it stays below 0.01 rad/sample for 32 consecutive symbols, locked asserts.
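
A floating-point sketch of the loop filter, clamp, and lock detector described above. The class name `CostasLoopFilter` and its attributes are hypothetical; the gains are the float equivalents from the Parameters table, and treating `locked` as sticky once asserted is an assumption.

```python
import math

class CostasLoopFilter:
    def __init__(self, kp=0.05, kd=0.005, lock_thresh=0.01, lock_count=32):
        self.kp, self.kd = kp, kd           # KP: proportional gain, KD: integral gain
        self.lock_thresh = lock_thresh      # rad/sample
        self.lock_count = lock_count        # consecutive symbols required
        self.integrator = 0.0
        self.prev_freq = 0.0
        self.run = 0
        self.locked = False

    def update(self, phase_error):
        """Process one symbol's phase error; return (freq_adj, phase_adj)."""
        self.integrator += self.kd * phase_error
        # Anti-windup: clamp the integrator to +/- pi
        self.integrator = max(-math.pi, min(math.pi, self.integrator))
        freq_adj = self.integrator
        phase_adj = self.kp * phase_error
        # Lock detector: count consecutive symbols with a small change in freq_adj
        delta = abs(freq_adj - self.prev_freq)
        self.prev_freq = freq_adj
        self.run = self.run + 1 if delta < self.lock_thresh else 0
        if self.run >= self.lock_count:
            self.locked = True              # sticky once asserted (assumption)
        return freq_adj, phase_adj
```

In this model the earliest possible lock is on the 32nd update, which happens whenever `delta` stays under the threshold from the first symbol.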

CORDIC Core — `cordic_core`

Iterative shift-add CORDIC with configurable width and iteration count. Supports rotation mode (compute cos/sin of a given angle) and vectoring mode (compute atan2 of a vector). Operates with ITER clock cycles per result, controlled by a start/busy/valid handshake.

Vectoring mode pre-conditions the input: if x_init is negative, both x and y are negated before iteration to fold the input into the right half-plane (CORDIC convergence domain). The atan lookup table holds atan(2⁻ⁱ) for i = 0..ITER−1 in Q2.30, auto-generated alongside cordic_params.vh and cordic_consts.vh.
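
The table generation and the vectoring iteration can be sketched in a few lines of Python. This is an integer model under stated assumptions, not the RTL: it covers only inputs already in the right half-plane (x > 0), it ignores the CORDIC gain on x, and the names `ATAN_LUT` and `cordic_vectoring` are hypothetical.

```python
import math

FRAC = 30
ITER = 16
# atan(2^-i) for i = 0..ITER-1, stored with 30 fractional bits
ATAN_LUT = [round(math.atan(2.0 ** -i) * (1 << FRAC)) for i in range(ITER)]

def cordic_vectoring(x, y):
    """Drive y toward 0 by shift-add rotations; return z ~ atan2(y, x).

    x and y are signed integers with x > 0; the result has 30 fractional bits.
    """
    z = 0
    for i in range(ITER):
        if y >= 0:
            # Rotate clockwise by atan(2^-i) and record the angle
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_LUT[i]
        else:
            # Rotate counter-clockwise by atan(2^-i)
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_LUT[i]
    return z
```

The residual angle after the last iteration is bounded by the final table entry, atan(2⁻¹⁵) ≈ 3.05×10⁻⁵ rad for ITER=16.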

Fixed-Point Format Summary

Signal	Format	Notes
`in_i`, `in_q`	Q1.30	Input samples
`phase_inc`, `freq_adj`, `phase_adj`	Q2.30	Angular quantities, range ±π
`nco_i`, `nco_q`	Q1.14	Truncated from CORDIC Q1.30 output
`mix_i`, `mix_q`, `timed_i`, `timed_q`	Q1.30	Baseband samples throughout
`ideal_i`, `ideal_q`	Q1.30	Slicer reference points for Costas
CORDIC internal	Q1.30	Matches WIDTH=32, ITER=16

Parameters

Parameter	Default	Float equiv.	Description
`WIDTH`	32	—	Datapath width, set via `cordic_params.vh`
`ITER`	16	—	CORDIC iterations (precision vs. latency)
`KP`	53687091	0.05	Costas proportional gain
`KD`	5368709	0.005	Costas integral gain
`KP_T`	10737418	0.01	Gardner proportional gain
`KD_T`	107374	0.0001	Gardner integral gain
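
The integer defaults in the table are the float equivalents scaled by 2³⁰. A short sketch of the conversion, assuming truncation toward zero (rounding to nearest gives the same integers for these four values); `to_fixed` and `to_float` are hypothetical helper names.

```python
FRAC = 30  # fractional bits shared by the gain constants

def to_fixed(value):
    """Float -> integer with 30 fractional bits (truncating toward zero)."""
    return int(value * (1 << FRAC))

def to_float(raw):
    """Integer with 30 fractional bits -> float."""
    return raw / (1 << FRAC)

GAINS = {"KP": 0.05, "KD": 0.005, "KP_T": 0.01, "KD_T": 0.0001}
RAW = {name: to_fixed(value) for name, value in GAINS.items()}
```

`RAW` reproduces the Default column, so a retuned gain can be converted the same way before it is written into the parameter file.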

Directed Tests

Four test vectors, each transmitting 300 symbols at SPS=2. The Verilog testbench reports PASS/FAIL per test; the Python verification script scores 278 symbols after discarding the acquisition transient (skip=20).

Test	Freq Offset	Phase Offset	Noise σ	Locked	Lock Sym	BER
Ideal	0	0	0	✓	32	0.00e+00
Phase_Offset_0p3	0	+0.3 rad	0	✓	32	1.53e−02
Freq_Offset_0p005	+0.005 rad/samp	0	0	✓	32	2.34e−02
Combined	+0.003 rad/samp	+0.2 rad	0.01	✓	32	1.44e−02

============================================================
Directed Tests
============================================================
  [Ideal                         ]  locked=True  lock_sym=  32  pipe_dly=1  mean_EVM=0.04503  BER=0.00e+00  phase_rms=3.085°
  [Phase_Offset_0p3              ]  locked=True  lock_sym=  32  pipe_dly=1  mean_EVM=0.12672  BER=1.53e-02  phase_rms=7.696°
  [Freq_Offset_0p005             ]  locked=True  lock_sym=  32  pipe_dly=1  mean_EVM=0.11065  BER=2.34e-02  phase_rms=7.191°
  [Combined                      ]  locked=True  lock_sym=  32  pipe_dly=1  mean_EVM=0.10934  BER=1.44e-02  phase_rms=6.613°

The 4.5% EVM on the Ideal test is the CORDIC quantisation noise floor — irreducible at ITER=16, WIDTH=32. All non-ideal tests show higher EVM driven by residual loop phase error, not channel noise.

Constellation Analysis

Detailed per-test constellation breakdown extracted automatically by analyze_qam.py.

Ideal

  EVM Statistics:
    Mean   = 0.04503  (4.50%)       Median = 0.03543  (3.54%)
    Std    = 0.03411                 95th % = 0.12110  (12.11%)
    Max    = 0.19748  (19.75%)       Min    = 0.00160  (0.16%)
  Phase Error (degrees):
    RMS    = 3.085°    Mean = -0.693°  (DC bias)    Std = 3.006°
    Max    = 7.927°    Min  = -8.359°
  BER = 0.000e+00  (0 bit errors / 1112 bits)

  Symbol Occupancy (I-axis=cols, Q-axis=rows):
    Q\I      I=-3D     I=-1D     I=+1D     I=+3D
    Q=+3D       18        17        15        18
    Q=+1D       17        17        21        17
    Q=-1D       19        20        16        16
    Q=-3D       13        18        19        17

  Top-5 Worst EVM:
    #163  I=-0.811  Q=+1.090  EVM=0.197  →  (-0.9487,+0.9487)
    #147  I=+0.813  Q=+1.072  EVM=0.183  →  (+0.9487,+0.9487)

Occupancy is statistically uniform (~17 per point for n=278). The listed worst-EVM points map to the ±3D outer corners, (±0.9487, +0.9487): a given phase error produces the largest displacement at the outer constellation points, which is expected. Phase DC bias of −0.69° is negligible.
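
The EVM figures in the worst-case list can be reproduced by hand. A minimal sketch, assuming EVM is the Euclidean distance between the received sample and the slicer's ideal point with no further normalisation (the constellation already has unit RMS power); `evm` is a hypothetical helper, not a function taken from analyze_qam.py.

```python
import math

def evm(rx_i, rx_q, ideal_i, ideal_q):
    """Error vector magnitude of one symbol against its ideal point."""
    return math.hypot(rx_i - ideal_i, rx_q - ideal_q)

# Symbols #163 and #147 from the listing above
evm_163 = evm(-0.811, +1.090, -0.9487, +0.9487)
evm_147 = evm(+0.813, +1.072, +0.9487, +0.9487)
```

Under that assumption the two values come out at about 0.197 and 0.183, matching the logged figures to the precision of the printed coordinates.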

Phase_Offset_0p3

  EVM Statistics:
    Mean   = 0.12672  (12.67%)       95th % = 0.26934  (26.93%)
    Max    = 0.31375  (31.38%)
  Phase Error:
    RMS = 7.696°    Mean = -1.225°  (residual offset)    Max = 15.942°

  Symbol Occupancy:
    Q\I      I=-3D     I=-1D     I=+1D     I=+3D
    Q=+3D       15        20        15        13
    Q=+1D       17        17        21        22
    Q=-1D       23        20        16        16
    Q=-3D        9        18        24        12

Phase DC bias of −1.2° confirms the 2nd-order Costas loop leaves a small systematic residual after acquiring a 0.3 rad (17.2°) initial offset — expected for finite loop bandwidth. The occupancy skew (Q=−3D row: 9, 18, 24, 12) is caused by phase rotation smearing symbols across decision boundaries. All worst-EVM points cluster in the Q2/Q3 (−I) half-plane, consistent with phase rotation pushing the constellation clockwise.

Freq_Offset_0p005

  EVM Statistics:
    Mean   = 0.11065  (11.06%)       Std = 0.09243  (high variance)
    Max    = 0.35427  (35.43%)       95th % = 0.31169  (31.17%)
  Phase Error:
    RMS = 7.191°    Mean = -0.758°    Std = 7.151°
    Max = 19.695°   (larger than Phase_Offset case)

The std ≈ RMS (7.15° ≈ 7.19°) confirms the mean is near zero: the frequency offset produces a spinning phase error rather than a static offset, and the loop is tracking it with residual jitter. The larger max phase error (19.7° vs 15.9° for Phase_Offset) reflects that frequency tracking is harder than static phase correction. Worst-EVM symbols #57, #59, #61 fall within a five-symbol window, pointing to a momentary loop disturbance in which phase error spiked over a few symbols, a characteristic signature of frequency-offset tracking.

Combined

  Phase Error:
    Mean = +1.252°  (positive — frequency and phase offsets partially cancel)
  BER = 1.44e-02  (lower than Freq_Offset alone at 2.34e-02)

The +1.25° DC bias (vs −1.2° for Phase_Offset alone) confirms the two offsets partially cancel in steady state, explaining why Combined BER is better than Freq_Offset alone despite having both stresses active simultaneously. The high EVM Std (0.097) reflects time-varying phase error from the frequency component.

Constellation Plots

Figures: Ideal constellation, Phase Offset constellation, Freq Offset constellation, Combined constellation.

BER vs Eb/N0

BER vs Eb/N0 sweep...
  Eb/N0= 5 dB  sigma=0.19882  n=2000  HW=3.926e-01  Theory=4.189e-02  ✓
  Eb/N0= 7 dB  sigma=0.15793  n=2000  HW=3.378e-01  Theory=1.697e-02  ✓
  Eb/N0= 9 dB  sigma=0.12545  n=2000  HW=2.682e-02  Theory=4.390e-03  ✓
  Eb/N0=11 dB  sigma=0.09964  n=2000  HW=9.754e-03  Theory=5.647e-04  ✓
  Eb/N0=13 dB  sigma=0.07915  n=1000  HW=5.274e-04  Theory=2.423e-05  ✓
  Eb/N0=15 dB  sigma=0.06287  n=1000  HW=2.637e-04  Theory=1.842e-07  ✓
  Eb/N0=17 dB  sigma=0.04994  n=1000  HW=0.000e+00  Theory=9.072e-11  ✓  [no errors — upper bound]
  Eb/N0=19 dB  sigma=0.03967  n=1000  HW=0.000e+00  Theory=5.874e-16  ✓  [no errors — upper bound]

The per-dimension AWGN sigma for unit-power 16-QAM (4 bits/symbol) is:

σ = sqrt( 1 / (8 × Eb/N0_linear) )

The theoretical BER reference is:

BER_theory = (3/8) × erfc( sqrt( 4 × Eb/N0_linear / 10 ) )
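
Both formulas follow from unit symbol energy: with Es = 1 and 4 bits per symbol, Eb = 1/4, and with N0 = 2σ², Eb/N0 = 1/(8σ²). The sketch below evaluates them with the standard library only; the function names `ebn0_to_sigma` and `ber_theory` are hypothetical and not taken from the verification script.

```python
import math

def ebn0_to_sigma(ebn0_db):
    """Per-dimension AWGN sigma for unit-power 16-QAM at a given Eb/N0 in dB."""
    ebn0_lin = 10 ** (ebn0_db / 10)
    return math.sqrt(1 / (8 * ebn0_lin))

def ber_theory(ebn0_db):
    """Approximate Gray-coded 16-QAM bit error rate at a given Eb/N0 in dB."""
    ebn0_lin = 10 ** (ebn0_db / 10)
    return (3 / 8) * math.erfc(math.sqrt(4 * ebn0_lin / 10))

sigma_5db = ebn0_to_sigma(5)     # compare with the 5 dB row of the sweep
ber_9db = ber_theory(9)          # compare with the 9 dB Theory column
```

Evaluated at 5 dB and 9 dB, these give σ ≈ 0.1988 and BER ≈ 4.39×10⁻³, in line with the corresponding rows of the sweep log.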

Three operating regions are visible on the plot:

Loop stress (5–7 dB): σ ≈ 0.16–0.20, comparable to the inner eye opening (D = 0.316). The Costas loop acquires but the phase estimate has large variance, pulling the constellation around. HW BER is 0.33–0.39 vs theory 0.017–0.042 — the demodulator is operating at the edge of its phase-tracking capability, not simply receiving noise.

Transition (9–13 dB): The curve converges toward theory as channel noise falls and the loop's phase estimate steadies. Interpolating the Theory column, the HW BER at 9, 11, and 13 dB matches the theoretical BER at roughly 6, 8, and 11 dB, an implementation loss of about 2–3 dB that narrows as Eb/N0 rises.

Noise floor (≥15 dB): The measured BER of ~2.6×10⁻⁴ at 15 dB is more than three orders of magnitude above the theoretical 1.8×10⁻⁷, so it is set by the implementation rather than by channel noise. The contributors are CORDIC angular quantisation (ITER=16) and Costas loop residual jitter (about 3° RMS phase error in the Ideal test). The directed-test Ideal EVM of 4.5% (EVM² ≈ 2×10⁻³ effective noise power) is consistent with a residual of this kind. Increasing ITER or tightening the loop bandwidth would push it lower.

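The upper-bound markers at 17 and 19 dB are correct: with about 4000 bits tested per point, a single bit error corresponds to a BER of about 2.5×10⁻⁴, so zero observed errors only bounds the BER below roughly that value. The 15 dB measurement of ~2.6×10⁻⁴ sits at this same resolution, about one bit error, so the floor estimate carries wide statistical uncertainty. These points confirm zero observed errors, not zero actual BER.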

EVM Summary

Figure: EVM vs Eb/N0

EVM vs Eb/N0 sweep...
  Eb/N0= 5 dB   mean EVM=0.23202
  Eb/N0= 7 dB   mean EVM=0.17480
  Eb/N0= 9 dB   mean EVM=0.15409
  Eb/N0=11 dB   mean EVM=0.12729
  Eb/N0=13 dB   mean EVM=0.10737
  Eb/N0=15 dB   mean EVM=0.09351
  Eb/N0=17 dB   mean EVM=0.08293
  Eb/N0=19 dB   mean EVM=0.07320

EVM continues to decrease across the full sweep range without flattening at 19 dB. The ~7.3% EVM at 19 dB is still above the 4.5% measured in the noise-free Ideal test, so channel noise still contributes at this point and the implementation limit has not been reached. The curve is expected to flatten at approximately 23–25 dB, where channel noise drops below the CORDIC floor; the floor itself will not decrease without increasing ITER.

Frequency Acquisition Range

Figure: Acquisition Range

Frequency acquisition range sweep...
  freq_offset=0.000 rad/samp  locked=True
  freq_offset=0.002 rad/samp  locked=True
  freq_offset=0.004 rad/samp  locked=True
  freq_offset=0.006 rad/samp  locked=True
  freq_offset=0.008 rad/samp  locked=True
  freq_offset=0.010 rad/samp  locked=True
  freq_offset=0.012 rad/samp  locked=True
  freq_offset=0.014 rad/samp  locked=True
  freq_offset=0.016 rad/samp  locked=True
  freq_offset=0.018 rad/samp  locked=True
  freq_offset=0.020 rad/samp  locked=True
  → Max acquisition range: 0.020 rad/sample

Lock achieved across the full tested range of 0 to 0.020 rad/sample. Since 0.020 rad/sample is 0.020/2π ≈ 3.2×10⁻³ cycles per sample, this corresponds to a carrier frequency offset of about 0.32% of the sample rate, or about 0.64% of the symbol rate at SPS=2. This is a range typical for FPGA-based IF receivers where the reference oscillator tolerance is in the hundreds of ppm range.

Lock Time

Figure: Lock Time Distribution

Lock time distribution (50 random seeds)...
  Lock time  mean=32.0  min=32  max=32 symbols

Lock time is exactly 32 symbols across all 50 LFSR seeds. This equals the lock-detection window of 32 consecutive symbols with delta(freq_adj) < 0.01 rad/sample, so in every run the criterion was satisfied from the first symbol onward. The zero variance therefore shows that the reported lock time is bounded by the detector, not by loop convergence.

Phase Error

Phase error distributions per directed test. The Ideal test shows symmetric noise centered near 0° (mean −0.69°, 3.1° RMS), the CORDIC quantisation jitter floor. The Phase_Offset test shows a wider distribution (7.7° RMS) with a slightly larger negative DC bias (−1.2°), confirming the Costas loop compensates but does not fully zero the static offset. The Freq_Offset test shows the longest tails (maximum 19.7°) with a mean near zero (−0.76°): a frequency offset produces time-varying (spinning) phase error that averages out over the observation window.

Verification Summary

============================================================
SUMMARY
============================================================

Directed Tests:
  Test                            Locked  LockSym  PipeDly    MeanEVM         BER   PhaseRMS°
  Ideal                             True       32        1    0.04503    0.00e+00       3.085
  Phase_Offset_0p3                  True       32        1    0.12672    1.53e-02       7.696
  Freq_Offset_0p005                 True       32        1    0.11065    2.34e-02       7.191
  Combined                          True       32        1    0.10934    1.44e-02       6.613

Frequency acquisition range : 0.020 rad/sample
Lock time (50 seeds)        : mean=32.0  min=32  max=32 symbols

BER vs Eb/N0:
   5 dB  HW=3.926e-01  Theory=4.189e-02
   7 dB  HW=3.378e-01  Theory=1.697e-02
   9 dB  HW=2.682e-02  Theory=4.390e-03
  11 dB  HW=9.754e-03  Theory=5.647e-04
  13 dB  HW=5.274e-04  Theory=2.423e-05
  15 dB  HW=2.637e-04  Theory=1.842e-07
  17 dB  HW=0.000e+00  Theory=9.072e-11   [upper bound < 2.5e-04]
  19 dB  HW=0.000e+00  Theory=5.874e-16   [upper bound < 2.5e-04]

Metric	Value
Pipeline delay	1 symbol (constant)
Lock time	32 symbols (deterministic)
Acquisition range	≥ 0.020 rad/sample
Ideal EVM (noise floor)	4.50%
RTL BER floor (≥15 dB)	~2.6×10⁻⁴
BER deviation from theory (9–13 dB)	≈ 2–3 dB
BER at 9 dB	2.68×10⁻² (theory: 4.39×10⁻³)
BER at 13 dB	5.27×10⁻⁴ (theory: 2.42×10⁻⁵)