IIT Bhubaneswar
Research Intern, School of Electrical and Computer Sciences
May 2026 - July 2026
Supervisor
Dr. Ayan Palchaudhuri, Assistant Professor, SECS
Digital VLSI Architecture Design: High Performance Computer Arithmetic, Testable Architectures, Reconfigurable Computing (FPGA)
Goal
Contribute to an active research problem in VLSI architecture / computer arithmetic and co-author a paper
2026-05-12 → 2026-07-13 · 42% through
Progress so far
11 May 2026
Arrived in Bhubaneswar, hostel registration and settling in
12 May 2026
Day 1 at IITBBS: met faculty, other interns, and attended Dr. Palchaudhuri's opening lecture on FPGA fabric and ripple carry adders
13 May 2026
Deep-dive into 2's complement adder/subtractor, carry-lookahead adder, and optimised equality checker (A+B==K) using carry chains; assigned two papers on in-system testing and primitive polynomials
14 May 2026
Worked through assigned papers (Rajski et al., Mrugalski et al.); also sourced and read two additional references independently for background on LFSRs and primitive polynomials
15 May 2026
Met Dr. Palchaudhuri in person for clarification; PCs allocated to interns; received revised research plan covering cellular automata, automation of ring/polynomial structures, and FPGA optimisation as a later stage
16 May 2026
Received three textbooks from Dr. Palchaudhuri: Wang/Wu/Wen VLSI Test Principles, Jha/Gupta Testing of Digital Systems, and Hurst VLSI Testing. Started writing mathematical models for LFSR and RG; read additional theory on HRG. Built and verified standard LFSR RTL, then automated parameterised RTL generation for arbitrary length. After midnight, shifted to the modular LFSR variant by adapting the parameterised standard-form RTL directly rather than starting fresh.
17 May 2026
Worked on RG implementation (faster than expected) and HRG (majority of the day). After getting HRG working, refactored Python code with AI assistance to make the math reusable across files and chained LFSR/RG/HRG generators for comparison. After midnight: fixed regex and matrix bugs in HRG, ran Yosys synthesis with synth_xilinx and collected area stats, hit the limitation that Yosys/ABC re-optimises everything so meaningful timing and fanout differentiation is not recoverable from it alone. nextpnr + prjxray attempt deferred to 18 May.
18 May 2026
Got the full nextpnr-xilinx flow working on Artix-7 (xc7a35tcsg324-1). Generated chipdb from prjxray-db, ran place-and-route on all four netlists, and got post-route Fmax numbers: LFSR standard 368 MHz, LFSR modular 332 MHz, RG 390 MHz, HRG 385 MHz. Numbers align with the structural analysis from lfsrgen compare: standard LFSR is slower due to 3 series XOR gates, modular is limited by fanout-4 on the feedback FF, RG and HRG are fastest because max fanout is 2 and at most 1 XOR gate in the critical path. Later that day: met Dr. Palchaudhuri again, showed him the Fmax results. He directed attention to cellular automata as the next topic and pointed to specific sections in the assigned textbooks covering the relevant CA theory. Explained periodic boundary conditions, the 90/150 rule, universal cell construction, and alternating-rule arrangements for better randomness. Explained how to extend the 3-input neighbourhood (q_{i-1}, q_i, q_{i+1}) with mode signals R, S, L (right, self, left) feeding into XOR pairs to allow inversion of any of the three feedback signals; XOR-ing all three pairs and gating with a mode bit M gives a 7-input, 1-output combinational function per cell. Also covered serial seed initialisation and thinking about the FF array spatially in 2D rather than 1D.
19 May 2026
Self-directed: went through AMD documentation and additional CA resources to understand FPGA-aware CA design. Reviewed 7-series slice structure in the context of mapping the 7-input CA cell function. Began building a mathematical model for CAs in parallel. Also ran synthesis and PnR in Vivado as a cross-check; results came out close to the nextpnr numbers. Identified that nextpnr has limited constraint support relative to Vivado for this device family; both flows will be maintained going forward. Met Dr. Palchaudhuri again in person for a clarification session on design constraints. He raised the question of where hardware reuse makes sense in the CA design and where apparent resource increase is actually justified by measurable improvement. Said he will discuss FPGA-specific design optimisation and direct RTL-to-hardware mapping in more detail in upcoming sessions.
20 May 2026
Continued automation tool development toward primitive instantiation. Got the CA working with alternating 90/150 rules (not polynomial-driven); built a universal CA cell and automated generation in the tool. Output does not produce an M-sequence yet, which is expected. Dr. Palchaudhuri explained FPGA-targeted arithmetic: the y=(a+b|s=0, c+d|s=1) example showing how a MUX-adder maps to a single LUT6_2 using I5 as the shared select, O5 for the (b,d) MUX output, O6 for the XOR of the two MUX outputs, and the carry chain for the sum bit.
21 May 2026
Dr. Palchaudhuri held a session on primitive instantiation using the Virtex-6 HDL Coding Guide. Covered FDRE, FDSE, CARRY4 (CIN, CYINIT, DI, S ports), LUT primitive INIT parameter. Noted FDSE is preferred for LFSRs to avoid the dead all-zeros state. Attempted to apply primitive instantiation in the automation tool; tested in Vivado due to nextpnr limitations with Xilinx primitives. Later session: Dr. Palchaudhuri raised long untapped FF chains in LFSRs and suggested SRLC32E as a replacement; sent a few pages from a 2018 CA textbook covering hardware implementation of CA structures.
22 May 2026
Went through three AMD documents on SLICEM shift register modes (UG473, UG474, and a third guide) plus the nandland LFSR reference. Designed a 15-bit LFSR (x^15+x^14+1) using an SRL16E at address 13 (14 stages) and a DFF, and a 52-bit LFSR (x^52+x^49+1) using three SRL16Es and four DFFs. Dr. Palchaudhuri explained LUT shift register mode in detail, clarified why SRL64 does not exist, and asked me to investigate at what chain length synthesis switches to SRL primitives. Tested the parameterised shift register module for N=8,16,32,33,34,35,36 in Vivado. All configurations stay at 1 SLICEM + 1 SLICEL. N=8,16 map to SRL16E (O5). N=32 maps to SRLC32E (O6). N=33 still SRLC32E. N=34,35 use SRLC32E + SRL primitives with different O5/O6 usage.
23 May 2026
Added polynomial-driven M-sequence support for CA structures to the tool; this has been a parallel workstream across the past week. Now working on structural comparisons between CA-based and LFSR/RG/HRG-based structures.
24 May 2026
Not a productive day for forward progress. Found and fixed several bugs in the CA implementation. The correctness checker at this point is still brute-force: it tests all possible initial states exhaustively. Complexity is O(2^n); in practice the work is roughly halved because every valid M-sequence has a reciprocal pair, so only half the candidate states need to be tested independently. That is a constant-factor saving, so the bound stays O(2^n).
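A minimal sketch of the brute-force check and the reciprocal pairing, under one reading of the note above: the pairing is taken at the polynomial level (a polynomial and its coefficient-reversed partner have the same period). The function names and the integer bitmask encoding (bit i = coefficient of x^i) are assumptions for illustration, not the tool's actual API.

```python
def reciprocal(poly: int, n: int) -> int:
    """Coefficient reversal x^n * p(1/x) of a degree-n polynomial over GF(2)."""
    return sum(((poly >> i) & 1) << (n - i) for i in range(n + 1))


def period(poly: int, n: int) -> int:
    """Order of x modulo poly by brute-force stepping; 0 if x never returns to 1."""
    state = 1
    for steps in range(1, 1 << n):
        state <<= 1                 # multiply by x
        if state >> n:              # degree reached n: reduce modulo poly
            state ^= poly
        if state == 1:
            return steps
    return 0


def is_primitive(poly: int, n: int) -> bool:
    """Primitive iff the period is the maximum 2^n - 1."""
    return period(poly, n) == (1 << n) - 1
```

For example, x^4+x+1 (0b10011) and its reciprocal x^4+x^3+1 (0b11001) both reach period 15, so testing one of the pair is enough.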
25 May 2026
Dr. Palchaudhuri clarified the direction for the final paper. He suggested looking at published papers in the area to understand what constitutes novelty -- specifically what the reviewers accepted, why, and which axis of improvement (area, timing, scalability, sequence quality) was the contribution. The goal is to identify a gap that can be filled or extended with something new. Completed the full port to primitive-style instantiation; the tool now generates RTL using behavioural stubs for Vivado simulation (primitives not needed for simulation in Vivado). No hardware synthesis optimisation has been done yet from the polynomial-to-RTL path or at the RTL level itself. Mathematical models are complete for all four structures: standard LFSR, RG, HRG, and CA. By mathematical model the intent is: given a polynomial, the tool determines tap positions, checks whether the polynomial meets required conditions (primitivity, decomposability for HRG), reduces the search space with those checks, and maps the result to the correct hardware topology -- where to place XOR gates, where to place rings, which cells take rule 90 vs 150. The search is still largely brute-force after the reduction.
26 May 2026
Revisited all papers read so far. Primary focus was on two references: Wang/Touba et al. UT-CERC-12-03 (https://www.cerc.utexas.edu/reports/UT-CERC-12-03.pdf), which covers LFSR variants (standard, modular, RG, HRG, minLFSR), gives the minimum XOR gate bound for each, and includes a CA resource comparison; and the CA tutorial by Serra et al. (https://webhome.cs.uvic.ca/~mserra/AttachedFiles/CA_Tutorial.pdf), which covers the CA synthesis process mathematically including the connection between characteristic polynomials and 90/150 rule sequences.
27 May 2026
Studied the CA synthesis process in detail: given a primitive polynomial, how to derive a rule sequence of 90s and 150s where 1 maps to rule 150 and 0 maps to rule 90. Began automating this derivation. For now, each CA cell maps to one LUT. Started work on a more aggressive packing target: fitting two adjacent 150-rule cells into a single LUT6_2. Two consecutive cells i and i+1 share two neighbourhood inputs (q_i and q_{i+1} are common to both), so the combined function has 4 distinct inputs (q_{i-1}, q_i, q_{i+1}, q_{i+2}) plus 1 for seed-toggle mode and 1 for seed-in, giving 6 inputs total -- exactly the LUT6_2 width. This means a pair of adjacent CA cells can be packed into a single LUT6_2.
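A minimal sketch of the forward direction of this mapping (rule sequence to characteristic polynomial), assuming a null-boundary 1D CA, where the transition matrix is tridiagonal and the determinant recurrence Delta_k = (x + d_k) * Delta_{k-1} + Delta_{k-2} holds over GF(2). The function names are hypothetical, and the inverse step (polynomial to rule sequence) that the tool automates is not shown.

```python
def ca_char_poly(rules):
    """rules[i] = 1 for rule 150, 0 for rule 90; returns the polynomial as a bitmask."""
    prev, cur = 0, 1                                   # Delta_{-1} = 0, Delta_0 = 1
    for d in rules:
        nxt = (cur << 1) ^ (cur if d else 0) ^ prev    # (x + d) * cur + prev
        prev, cur = cur, nxt
    return cur


def ca_step(state, rules):
    """One clock of a null-boundary hybrid 90/150 CA (missing neighbours read as 0)."""
    n = len(state)
    out = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        out.append(left ^ right ^ (rules[i] & state[i]))
    return out


def ca_period(rules):
    """Cycle length from the seed 100...0; 0 if the seed is never revisited."""
    start = [1] + [0] * (len(rules) - 1)
    state = start
    for steps in range(1, 1 << len(rules)):
        state = ca_step(state, rules)
        if state == start:
            return steps
    return 0
```

With rules (90, 150, 90, 150) the recurrence gives x^4+x+1, and the simulated cycle length is 15, the M-sequence length for degree 4.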
28 May 2026
Automated the revised CA approach using primitive instantiation. Resolved tool constraint issues (dont_touch and keep attributes behaving unexpectedly). Automated the full synthesis process from polynomial to RTL. Wrote down and encoded all conditions a polynomial or rule sequence must satisfy for the hardware to produce a valid M-sequence; these are based on UT-CERC-12-03 and the Serra et al. tutorial, with additional conditions clarified by Dr. Palchaudhuri. Removed unnecessary hardware: the seed non-zero checker is gone, the enable signal is gone, and the active-low global reset is replaced with active-high FDRE reset (the active-low variant was wasting one LUT on an inverter). With these changes, an 8-degree polynomial now fits under 14 LUTs and 16 FFs.
29 May 2026
Optimised the 8-degree CA to 4 LUTs and 8 FFs using tool constraints (dont_touch, keep_hierarchy) and RTL rewrites, while simultaneously automating the optimisation process. Final result fits in one Xilinx 7-series slice. Timing analysis with appropriate constraints gives 467 MHz for this configuration. Explained the initialisation calculations and RTL generation process to Dr. Palchaudhuri, including how the automation derives the rule sequence, applies the condition checks, and emits primitive RTL. He confirmed the approach. Also completed hardware synthesis optimisation for HRG using the decomposability conditions (top-to-bottom and bottom-to-top decomposition). Then began writing down FPGA-specific optimisation strategies for each structure. For standard LFSR: long untapped FF chains use SRLC32E; sections with taps use LUTs to XOR all tapped FF outputs and feed back into the first FF. For modular (Galois) LFSR: each tap position XORs one fixed feedback bit into its FF input, which is a single-fixed-input XOR -- the carry chain XOR can potentially exploit this since one input per stage is the feedback wire and the other varies, which is the condition carry chain XOR is designed for. For LUT-based modular LFSR packing, each LUT6_2 can give two independent XOR outputs (O5 and O6) to handle two tap positions, but the two-output constraint limits how adjacent taps share a LUT given the required fanout from the feedback wire. RG and HRG optimisation strategies are still being worked out. Dr. Palchaudhuri also asked to add ASIC implementation as a parallel workstream, noting that writing behavioural RTL for ASIC is not much additional work given what is already done.
30 May 2026
Started automating LFSR primitive instantiation for both standard and modular forms. The carry chain approach for modular LFSR has not been tested yet; uncertainty is whether driving the carry chain XOR requires an O6 output from a LUT, which would consume the output needed for the XOR result and negate the saving. HRG optimisation using primitive instantiation produced correct output for the 8-degree test polynomial but broke scalability -- the 16-degree polynomial now has 17 mismatches in the output sequence against the reference, indicating a stuck signal in the generated RTL. The unoptimised HRG still produces correct sequences for all tested polynomials. RG optimisation has not been started yet.
Currently working on
FPGA primitive optimisation for all four structures; HRG scalability bug in optimised path
CA optimisation is at 4 LUTs and 8 FFs for an 8-degree polynomial, fitting in one slice at 467 MHz (29 May). LFSR primitive instantiation automation is in progress for both standard and modular forms (30 May). Modular LFSR carry chain approach is under investigation. HRG optimised path produces correct output for degree 8 but has 17 output mismatches for degree 16 -- stuck signal, root cause not yet identified. Unoptimised HRG remains correct for all polynomials. RG optimisation not yet started.
in progress
Goals
To do
Complete the NTU VLSI Testing playlist (Prof. James Chien-Mo Li, NTU)
Work through Wang/Wu/Wen and Jha/Gupta textbooks systematically, with focus on CA-related chapters
Go through AMD 7-series datasheet, slice and FF sections (carrying forward)
Build out mathematical model for cellular automata (90/150 rules, periodic boundary, universal cell)
Implement CA RTL with the 7-input universal cell mapped to LUT6; verify sequence properties
Work out serial seed initialisation in RTL
Explore 2D spatial arrangements of the FF array
Extend timing analysis to larger n (n=64, n=128) for LFSR/RG/HRG as secondary task
Document the LFSR/RG/HRG structural comparison as a clean report for Dr. Palchaudhuri
Doing
Building mathematical model for CAs (90/150 rules, mode signals R/S/L, periodic boundary)
Reading assigned textbooks and following NTU course
Maintaining both Vivado and nextpnr flows for cross-validation
Done
Hostel registration and settling in
Reported to department on Day 1
Understood FPGA fabric: CLB, LUT (as SRAM/MUX), LUT6_2 (O5/O6), wide-function MUXes, carry chain, DFFs, routing types
Worked out ripple carry adder FPGA mapping with carry chain
Worked out 2's complement adder/subtractor and understood why full subtractor maps poorly to FPGA carry chains
Completed optimised A+B==K equality checker: 4 LUT6s + 4 MUXes, carry-save approach
Read Rajski et al. (2025) on hybrid ring generators for in-system testing
Read Mrugalski et al. (2026) on primitive polynomials over GF(2), degree 661-1200
Self-sourced and read Arndt (2010) on LFSRs and Brent/Zimmermann on primitive polynomials
Set up paper reading log
PCs allocated; development environment accessible
Built parameterised standard-form LFSR RTL; automated generation for arbitrary polynomial and length; verified sequence
Built modular-form LFSR RTL by adapting standard-form parameterised RTL; both variants in separate directories with independent testbenches
Built RG and HRG RTL; refactored shared GF(2) math into common Python module; chained generators for sequence comparison
Ran Yosys synth_xilinx on all variants; collected area stats; documented limitation of ABC re-optimisation for structural comparison
Generated Artix-7 chipdb from prjxray-db via bbaexport + bbasm; ran nextpnr-xilinx place-and-route on all four netlists
Collected post-route Fmax: LFSR standard 368 MHz, LFSR modular 332 MHz, RG 390 MHz, HRG 385 MHz; results consistent with structural analysis
Cross-validated nextpnr results in Vivado; numbers close; identified nextpnr constraint coverage gap for this device
Got direction on cellular automata from Dr. Palchaudhuri (18 May): periodic boundary, 90/150 rules, universal cell with R/S/L mode signals, 7-input LUT6 mapping, serial seed init, 2D spatial thinking
Built CA RTL with alternating 90/150 rules and universal CA cell; automated cell generation in tool (20 May)
Understood LUT6_2 mapping for MUX-adder function: I5 as shared select, O5 for the (b,d) MUX output, O6 for the XOR of the two MUX outputs, carry chain for sum bit (20 May)
Attended primitive instantiation session: FDRE, FDSE, CARRY4 ports, LUT INIT parameter, Virtex-6 HDL Coding Guide (21 May)
Understood FDSE vs FDRE preference for LFSR structures and SRLC32E for long untapped FF chains (21 May)
Designed 15-bit LFSR (x^15+x^14+1) and 52-bit LFSR (x^52+x^49+1) using SRL16E and DFFs (22 May)
Investigated SRL inference thresholds in Vivado: tested N=8,16,32,33,34,35,36; documented primitive mapping and O5/O6 usage per configuration (22 May)
Added polynomial-driven M-sequence support for CA structures to the tool (23 May)
Found and fixed bugs in CA implementation; correctness checker working (O(2^n) brute-force, halved by reciprocal pair symmetry) (24 May)
Ported tool to full primitive-style instantiation with behavioural stubs for Vivado simulation (25 May)
Completed mathematical models for all four structures: LFSR, RG, HRG, CA (25 May)
Revisited UT-CERC-12-03 (Wang/Touba et al.) and Serra et al. CA tutorial; these are the primary references for the synthesis-from-polynomial process (26 May)
Derived and automated CA rule sequence from polynomial: 1 maps to rule 150, 0 maps to rule 90; automated derivation in tool (27 May)
Worked out two-cell-per-LUT6_2 packing for adjacent 150-rule cells: 4 distinct neighbourhood inputs + seed-toggle + seed-in = 6 inputs (27 May)
Automated revised CA with primitive instantiation; encoded all polynomial/sequence conditions for valid M-sequence; removed redundant hardware (seed non-zero check, enable signal, active-low reset inverter) (28 May)
Achieved 14 LUTs and 16 FFs for 8-degree polynomial (28 May)
Optimised 8-degree CA to 4 LUTs and 8 FFs (one slice); timing: 467 MHz (29 May)
Explained initialisation and RTL generation automation to Dr. Palchaudhuri (29 May)
Completed HRG hardware synthesis optimisation using decomposability conditions (29 May)
Wrote down FPGA optimisation strategies for standard LFSR (SRLC32E for untapped chains, LUT-based feedback XOR), modular LFSR (carry chain XOR for fixed-input taps, LUT6_2 dual-output packing), and started analysis for RG/HRG (29 May)
Started LFSR primitive instantiation automation for both standard and modular forms (30 May)
Journal
11 May 2026
Reached Bhubaneswar. Hostel registration done. Place is decent.
12 May 2026
First day. Met Dr. Palchaudhuri and the other interns. Sir started straight with a lecture, handed out 2 pages from the AMD 7-series FPGA datasheet and spent the whole session walking us through what's in it. Covered CLBs, LUTs (as SRAM and as MUX), LUT6_2 with O5/O6 outputs, wide-function MUXes, carry chains, DFFs, and routing types. Then went into ripple carry adder: generate/propagate equations, cascading, and how each block maps to the carry chain (MUX for carry, XOR from LUTs). Ended with a take-home: work out the 2's complement adder/subtractor and read the slice/FF section. Campus is much bigger than expected.
13 May 2026
Morning session: discussed 2's complement adder/subtractor. Key insight: a full subtractor is bad for FPGA because its structure is not cascadable and cannot exploit the hardwired carry chain, adding delay. Looked at further optimisations, SLICELs vs SLICEMs, and wide-function MUXes. Afternoon: tackled the A+B==K equality checker. Instead of computing the sum and then comparing, you work with interim sums and carries directly (carry-save idea). Each block takes a_i, b_i, k_i and produces the required c_{i-1} and c_i. The stage-wise AND products turn out to be functions of only a_i, b_i, k_i, so the whole thing maps to just 4 LUT6s and 4 MUXes, with the last LUT6 needing only 3 inputs (a_0, b_0, k_0). Clean result. Later, sir assigned two papers: Rajski et al. on hybrid ring generators for in-system testing, and Mrugalski et al. on new primitive polynomials over GF(2), both from Journal of Electronic Testing, relevant to testable architectures.
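A minimal bit-level sketch of the carry-free equality idea, as I understand it: the carry that must enter bit i for the sum bit to equal k_i is a_i XOR b_i XOR k_i, and the carry leaving bit i under that assumption depends only on a_i, b_i, k_i. The check is that each required carry-in matches the carry-out of the bit below. The function name and the word-parallel formulation are illustrative assumptions; this does not reproduce the 4-LUT6 mapping itself.

```python
def sum_equals_k(a: int, b: int, k: int, n: int) -> bool:
    """True iff (a + b) mod 2^n == k, without propagating a carry chain."""
    mask = (1 << n) - 1
    a, b, k = a & mask, b & mask, k & mask
    need = a ^ b ^ k                       # required carry into each bit
    # Carry out of each bit if the required carry came in:
    # generate (a & b), or propagate (a ^ b) with carry-in p ^ k, i.e. p & ~k.
    cout = (a & b) | ((a ^ b) & ~k)
    # Carry into bit i+1 must equal carry out of bit i; carry into bit 0 must be 0.
    return ((cout << 1) & mask) == need
```

Each bit position is compared independently, which is why the per-stage terms depend only on the local a_i, b_i, k_i.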
14 May 2026
Spent the day going through the assigned papers. Rajski et al. 2025 deals with hybrid ring generators (HRGs): ring generators where feedback taps alternate direction (top-to-bottom and bottom-to-top), enabling faster internal data circulation, lower aliasing transient in MISRs, and reduced XOR gate count. Mrugalski et al. 2026 is a tabulation paper giving new primitive polynomials over GF(2) of degree 661 through 1200, extending earlier known tables. To build better background, sourced three additional references: Mukherjee et al. 2011 (IEEE Computer) for the foundational ring generator architecture -- single XOR gate depth, max fanout 2, regular layout, O(n^2) simulation via superposition; Wang, Touba, Brent et al. 2011 (UT-CERC-12-01) for the theoretical basis of hybrid ring generators -- the (k+1)/2 XOR gate reduction, Theorem 1 proving this is a lower bound for k=1,3,5, and the primitive polynomial appendix up to degree 800; and Brent/Zimmermann (ANU rpb243) for algorithms for testing primitivity over GF(2). These were not assigned but are the primary prior work that the assigned papers build on.
15 May 2026
Sent an email to Dr. Palchaudhuri asking for clarification on the direction of work, specifically whether it would make sense to begin automating smaller structures such as ring generation for a given polynomial while continuing to read, and what the expected output form should be (RTL, scripts, or otherwise). He met me in person the same day and gave a clearer plan. He explained cellular automata at a conceptual level: unlike LFSRs or HRGs where feedback connections are global across the register, cellular automata use local neighbourhood rules, making them more area-efficient and structured differently. He said he will send more resources. He directed me to the NTU VLSI Testing course by Prof. James Chien-Mo Li (YouTube playlist) and asked me to go through it. The revised plan is: go through background material and the playlist, build a mathematical model for automating ring generation in parallel, and then move to FPGA optimisation once that is done. PCs were also allocated to the interns today.
16 May 2026
Dr. Palchaudhuri sent three textbooks as supporting material for the VLSI testing background work: Wang, Wu, and Wen (VLSI Test Principles and Architectures, Morgan Kaufmann, 2006), Jha and Gupta (Testing of Digital Systems), and Hurst (VLSI Testing: Digital and Mixed Analogue/Digital Techniques, IEE, 1998). These cover the theoretical and architectural foundations of digital testing, DFT, and BIST, and are meant to be read alongside the NTU course. Starting to go through Wang/Wu/Wen. Also began writing mathematical models for LFSR and RG in parallel, and read more theory on HRG to understand the feedback structure before writing anything. For LFSR, started with the standard (Fibonacci/external XOR) form: wrote a basic RTL first, verified the output sequence with a testbench, then wrote a script to automate RTL file generation for that variant. After confirming it worked, parameterised the RTL so the polynomial coefficients and register length are configurable, then extended the automation script to generate a standard-form LFSR of arbitrary length from a given primitive polynomial. Verified this end-to-end. After midnight (technically 17th), moved to the modular (Galois/internal XOR) variant. Instead of starting over, directly edited the parameterised standard-form RTL to change the feedback logic from external to internal XOR placement. Both variants now live in separate directories with independent testbenches and generation scripts; nothing is shared between them at this stage.
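A minimal Python sketch of the two LFSR forms described above, useful as a software reference for the RTL testbenches. The bitmask encoding (bit i = coefficient of x^i, top bit included) and the function names are assumptions for illustration and are not taken from the generation scripts.

```python
def fibonacci_step(state: int, poly: int, n: int) -> int:
    """Standard (external XOR) form: XOR the tapped bits, shift the result in."""
    mask = (1 << n) - 1
    fb = bin(state & poly & mask).count("1") & 1   # parity of tapped bits
    return (state >> 1) | (fb << (n - 1))


def galois_step(state: int, poly: int, n: int) -> int:
    """Modular (internal XOR) form: shift, then fold the feedback bit into the taps."""
    state <<= 1
    if state >> n:
        state ^= poly
    return state


def period(step, poly: int, n: int, seed: int = 1) -> int:
    """Clocks until the seed state recurs; 0 if it does not recur within 2^n steps."""
    state = seed
    for count in range(1, (1 << n) + 1):
        state = step(state, poly, n)
        if state == seed:
            return count
    return 0
```

For the primitive polynomial x^4+x+1 (0b10011) both forms cycle through all 15 non-zero states, and the all-zeros state maps to itself in both, which is the lock-up state to avoid at reset.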
17 May 2026
Started with the RG (ring generator) implementation. The math translated to RTL relatively quickly. HRG took most of the day; the feedback structure is more involved and it took multiple iterations to get the connections right, but it is working now. After that, used AI assistance to refactor the Python side: pulled the shared GF(2) polynomial and matrix math into a common module so it is reusable across the LFSR, RG, and HRG scripts, then chained all three generators together so their output sequences can be compared directly. After midnight (18th): went back to HRG and fixed a couple of bugs, one in a regex pattern used to parse the polynomial input and one in the matrix construction for the state transition. Then attempted Yosys synthesis with synth_xilinx enabled to get area and LUT stats for each variant. Stats came out, but the useful comparison breaks down here: ABC re-optimises the netlist during synthesis, so LUT count differences between standard and modular LFSRs, or between LFSR and HRG, do not reflect the structural differences in the original RTL. Timing and fanout data from Yosys alone is not meaningful for this. Plan is to go through nextpnr and prjxray for a proper place-and-route pass on actual Xilinx primitives, left that for 18th.
18 May 2026
Morning: got the nextpnr-xilinx flow working end-to-end on Artix-7 (xc7a35tcsg324-1). The setup took a while because openxc7 is a snap package and the chipdb generation step is not obvious from the documentation. Had to run bbaexport.py against the prjxray-db bundled with the snap to produce xc7a35t.bba, then convert it to the binary format with bbasm. The resulting xc7a35t.bin is 89 MB. Storage was tight on the laptop the whole time; had to clean up intermediate files between runs to get through all four netlists. One other issue: synth_xilinx does not handle asynchronous reset (always @(posedge clk or negedge rst_n)), so had to write a small wrapper script that regex-substitutes the always block to synchronous form before passing to Yosys. Simulation RTL is untouched; the synthesis copies are separate. XDC generation was also manual: needed IOSTANDARD constraints for every port including all 32 state bits, otherwise nextpnr refuses to proceed. After all that was sorted, PnR ran on all four structures and post-route Fmax numbers came out. LFSR standard: 368 MHz. LFSR modular: 332 MHz. Ring generator: 390 MHz. HRG: 385 MHz. These are consistent with what lfsrgen compare reported for structural metrics: the standard LFSR has 3 XOR gates in series in the feedback path (logic levels = 3), which limits its Fmax relative to RG and HRG where the critical path through any single XOR stage is just 1 level. The modular LFSR has only 1 logic level but the feedback FF drives 4 nodes (fanout = k+1 = 4 for k=3 taps in this polynomial), and the routing overhead for that fanout is what pulls its Fmax below even the standard form. RG hits 390 MHz because fanout stays at 2 across all FFs and there is exactly 1 XOR gate per stage. HRG is 385 MHz, marginally lower than RG, but uses 2 XOR gates in total vs 3 for RG and both LFSRs, making it the most area-efficient of the four. 
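A minimal sketch of the kind of regex substitution the wrapper script performs on the synthesis copies, assuming the sensitivity list has exactly the form quoted above. The pattern and function name are hypothetical; the actual wrapper may differ.

```python
import re

# Matches: always @(posedge <clk> or negedge <rst>)
ASYNC_ALWAYS = re.compile(
    r"always\s*@\s*\(\s*posedge\s+(\w+)\s+or\s+negedge\s+\w+\s*\)"
)


def to_sync_reset(verilog_src: str) -> str:
    """Drop the reset edge from the sensitivity list; the reset test in the body stays."""
    return ASYNC_ALWAYS.sub(r"always @(posedge \1)", verilog_src)
```

The `if (!rst_n)` branch inside the block is left untouched, so the reset is simply sampled on the clock edge in the synthesis copy.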
The numbers are from a single placement seed per structure; nextpnr reports two passes (routing iterations) and the second one is what is recorded here. Running with --randomize-seed a few times would give a tighter picture, but the relative ordering is already clear. Later that day: met Dr. Palchaudhuri again and walked through the Fmax results with him. He then redirected the topic to cellular automata and pointed to relevant sections in the textbooks he had sent earlier. The CA explanation covered a lot of ground. Periodic boundary conditions: the leftmost and rightmost cells wrap around so the neighbourhood is always well-defined. Feedback structure: for cell i, the inputs to the flip-flop are q_{i+1}, q_i, q_{i-1} from the neighbours and itself. Rule 90 computes q_{i+1} XOR q_{i-1} (no self-term); rule 150 computes q_{i+1} XOR q_i XOR q_{i-1} (includes self). Alternating 90 and 150 rules across cells is the standard approach for better sequence quality compared to uniform rule assignment. Universal cell: instead of hardwiring a rule, you extend the three neighbourhood inputs with mode signals R (right), S (self), and L (left). Each mode signal gates whether the corresponding neighbour's contribution is direct or inverted: the cell computes (q_{i+1} XOR R) XOR (q_i XOR S) XOR (q_{i-1} XOR L). An additional mode bit M lets you invert the entire output. This gives a 7-input 1-output combinational function per cell, which fits directly in a single LUT6 (6 data inputs used for q_{i+1}, q_i, q_{i-1}, R, S, L; M is handled by the carry chain XOR or MUXF7). In a 7-series slice with 4 LUT6s and a carry chain, all 4 cells in the slice share the same M value, so they are not independently controllable in M within a slice, but you can vary M across slices without restriction. He also covered serial seed initialisation: shift the seed in over several clock cycles before switching to CA mode. 
He also noted that thinking about the FF array in 2D rather than 1D opens up more neighbourhood options and spatial locality arguments.
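A minimal sketch of one clock of the hybrid 90/150 CA with the periodic boundary described in this session. The list-of-bits state and the rule encoding (1 for rule 150, 0 for rule 90) are assumptions for illustration; the universal cell with R/S/L/M mode signals is not modelled here.

```python
def ca_step_periodic(state, rules):
    """Next state of a 1D CA whose end cells wrap around to each other."""
    n = len(state)
    nxt = []
    for i in range(n):
        left = state[(i - 1) % n]           # q_{i-1}, wrapping at the ends
        right = state[(i + 1) % n]          # q_{i+1}, wrapping at the ends
        self_term = rules[i] & state[i]     # q_i contributes only under rule 150
        nxt.append(left ^ right ^ self_term)
    return nxt
```

For the state [1, 0, 0, 0], uniform rule 90 gives [0, 1, 0, 1] and uniform rule 150 gives [1, 1, 0, 1]; the only difference is the self-term in cell 0.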
19 May 2026
Spent the day on two tracks in parallel: FPGA tool flow and building the CA mathematical model. For the tool flow, went through several AMD/Xilinx documents to understand how to design with the 7-series slice structure directly in mind, particularly the LUT6 packing rules, carry chain XOR placement, and how the MUXF7/MUXF8 cascade interacts with the carry chain when M is shared across a slice. Also ran synthesis and place-and-route in Vivado as a cross-check on the nextpnr numbers; results came out close, within a few MHz for the same structures. From this point, Vivado is the primary flow for Xilinx-specific work because nextpnr-xilinx has limited support for placement and routing constraints on this device family, which will matter once the CA design needs controlled slice packing. nextpnr runs will be kept for open-source reproducibility where possible. For the CA model: started writing the mathematical framework for 1D CAs with periodic boundary, formalising the 90/150 rule transition matrices over GF(2), and encoding the mode signal extension. Met Dr. Palchaudhuri in person again later in the day for a clarification on design constraints. He raised the question of where hardware reuse is appropriate in the CA structure and where what looks like resource overhead is actually justified by the result, noting that counter-intuitive resource usage often shows better empirical behaviour. Said he will cover FPGA-optimised design techniques and direct RTL-to-hardware mapping approaches in more detail in upcoming sessions. 
Resources reviewed today (partial, not fully read through): [AMD 7 Series FPGAs Memory Resources User Guide (UG473)](https://docs.amd.com/api/khub/documents/ryI8c~ZyeXQJ4T1NZ6C57w/content), [AMD 7 Series FPGAs Configurable Logic Block User Guide (UG474)](https://docs.amd.com/api/khub/documents/V0Hleb3wGhMz~zhUvqVT3A/content), and a [CA-on-FPGA technical reference](https://24d5bf36-481e-43c0-a614-99b97d38513c.filesusr.com/ugd/efcde7_86275faf3fa846d5bda7727f4ac7221d.pdf) covering hardware mapping for cellular automata.
20 May 2026
Two threads running in parallel: making the automation tool more targeted toward primitive instantiation, and understanding how to structure designs for FPGA specifically. Went through several resources on FPGA-targeted RTL to get the theory behind mapping logic to primitives rather than letting synthesis decide. On the CA side: got the CA working with alternating 90/150 rules, not polynomial-driven. Built a universal CA cell and integrated it into the tool by automating the cell generation step. The output does not produce an M-sequence, which is expected at this stage since the rule assignment is not yet tied to a primitive polynomial. Separately, asked Dr. Palchaudhuri about targeting the design to FPGA primitives. He walked through a worked example. The circuit computes y, which equals a+b when s=0 and c+d when s=1. Two implementations: (1) compute both sums first, then route both results into a single 2:1 MUX with select line s; (2) use two MUXes, one for (a,c) and one for (b,d), both with the same select line s, followed by one adder. In either case the critical path from any input to y is one MUX delay plus one adder delay. The second option looks more complex because it uses two MUXes rather than one, but the analysis changes on FPGA. With LUT6_2: let s be I5 (the 6th input), which acts as a select line for both MUXes simultaneously. The output of the (b,d) MUX goes to O5 and the XOR of both MUX outputs goes to O6. For the sum output Si, the carry chain is used: O6 drives the carry chain select input, O5 goes to the 0-input of the carry chain MUX, and the final XOR of O6 with the carry chain 1-input produces Si. The full adder-with-MUX function up to the XOR stage maps into a single LUT6_2. This is more resource-efficient than the naive two-MUX-plus-adder count suggests, but the mapping has its own constraints (shared I5 means s cannot vary independently between the two mux functions within one LUT).
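A minimal bit-level model of the second implementation (two MUXes, then one adder) as mapped above, with one LUT6_2 per bit feeding the carry chain. The variable names o5/o6 follow the explanation; the function name is hypothetical, and the model checks the logic only, not any placement or timing behaviour.

```python
def mux_add(a: int, b: int, c: int, d: int, s: int, n: int) -> int:
    """y = a + b when s = 0, c + d when s = 1 (mod 2^n), modelled bit by bit."""
    carry, y = 0, 0
    for i in range(n):
        ai, bi, ci, di = ((v >> i) & 1 for v in (a, b, c, d))
        x = ci if s else ai              # (a, c) MUX, selected by s
        o5 = di if s else bi             # (b, d) MUX output, sent to O5
        o6 = x ^ o5                      # XOR of both MUX outputs, sent to O6
        y |= (o6 ^ carry) << i           # carry chain XOR produces sum bit S_i
        carry = carry if o6 else o5      # carry chain MUX: O6 selects CIN, else O5
    return y
```

When o6 is 0 the two operand bits are equal, so passing O5 through the carry MUX yields the generate term without a separate AND gate.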
21 May 2026
Dr. Palchaudhuri held a session on primitive instantiation, using the [Virtex-6 HDL Coding Guide](https://docs.amd.com/v/u/en-US/virtex6_hdl) as the reference. Covered direct instantiation of LUTs, FFs, and carry chains in RTL. FF types covered: FDRE (D flip-flop with synchronous reset and clock enable), FDSE (D flip-flop with synchronous set and clock enable). He pointed out that FDSE is preferred for LFSR structures because FDRE initialises to 0 and a fully zeroed state is a dead state for an LFSR -- it will never leave all-zeros under normal feedback. FDSE with the set line forces a known non-zero start. Also went through CARRY4 primitive: the CYINIT port and CIN port for carry chain initialisation, and the 4-bit granularity per slice. After the session, took those concepts and tried to apply them in the automation work from the previous day -- attempting to wire the CA cell instantiation to actual slice-level primitives rather than behavioural RTL. Tested directly in Vivado because nextpnr-xilinx has limited support for Xilinx-specific primitives and gives misleading or incomplete results for primitive-level netlists. Later that day, Dr. Palchaudhuri came back with a separate point: in LFSR structures where many consecutive FFs are not tapped (outputs not used for feedback), a long chain of FF registers may be unnecessarily expensive in terms of slice usage. He showed this in Vivado's implementation view and pointed to SRLC32E as an alternative -- a 32-bit shift register primitive that fits in a single LUT6 and can replace a chain of up to 32 untapped FFs. Also sent a few pages from a book on cellular automata (2018 textbook), specifically the hardware implementation section, which covers why CA structures are favourable compared to other MLSG structures from a hardware perspective.
22 May 2026
Started the day going through three AMD documents on shift register modes in SLICEMs: [UG473 (Memory Resources)](https://docs.amd.com/api/khub/documents/m~KNAwVVZRVbV1RhEPIdMg/content), [UG474 (CLB User Guide)](https://docs.amd.com/api/khub/documents/xCnlWl7UTyYLFfT9gii3mw/content), and the [7-series SelectIO guide](https://docs.amd.com/api/khub/documents/zElbBP_Wfzhfnn~F~QLOCg/content). Also went through the [nandland LFSR reference](https://nandland.com/lfsr-linear-feedback-shift-register/). From these, designed a 15-bit LFSR using SRL16 with address 13 and a DFF for the remaining stage (polynomial x^15 + x^14 + 1, two taps). Extended the same approach to a 52-bit LFSR (x^52 + x^49 + 1): three SRL16s with address 15 and four DFFs, XOR for feedback. When shown, Dr. Palchaudhuri pointed out that the 52-bit design should also consider SRLC32E (following up from the 21st). He then gave a more detailed explanation of how LUT-based shift registers work in SLICEM: the LUT acts as a 16-deep or 32-deep shift register where the address input selects the tap, and the SRL output can chain into an adjacent SRL or DFF. Also clarified why SRL64 does not exist as a standalone primitive given the 6-input LUT limit. Put the question: at what chain length does the synthesiser switch from discrete FFs to shift register primitives when mapping a straight FF chain? To investigate, he suggested checking what the following always block maps to: always @(posedge clk) q <= {q[30:0], sd}. Then extended the question to a parameterised shift register module with configurable DEPTH and a clock enable, and asked to test N = 8, 16, 32, 33, 34, 35, 36. Results from Vivado implementation: N=8 maps to SRL16E using O5 only, 1 SLICEM + 1 SLICEL. N=16 same as N=8. N=32 maps to SRLC32E using O6 only, same slice count. N=33 still SRLC32E on O6, similar to N=32 -- the cascade output (Q31) of the SRL chains into a single DFF, no extra LUT needed. N=34 and N=35 both use SRLC32E plus a second SRL (SRL16E or SRLC32E). N=34 uses O6 only throughout. N=35 uses one LUT with O5 only and one with O6 only. Slice count stays at 1 SLICEM + 1 SLICEL across all tested values.
23 May 2026
Back on the tool. Added polynomial-driven support for M-sequence generation using CA structures; this has been running in parallel with the primitive instantiation work across the past week. Now working on comparisons to check for structural differences between CA-based and LFSR/RG/HRG-based structures.
24 May 2026
Not a good day progress-wise. Found a few bugs in the CA tool and spent most of the day tracking them down and fixing them. The correctness checker is still brute-force at this point: it tests all possible initial seeds exhaustively, which is O(2^n). In practice the search is halved because for every valid M-sequence there is a reciprocal pair, so the two sequences from a seed and its reciprocal counterpart can be treated as one case. Still a brute-force approach, just with a constant factor improvement.
25 May 2026
Dr. Palchaudhuri spent time clarifying the direction for the final paper output. His suggestion was to go back through the papers that have been published in this area and figure out specifically what made each one novel -- which axis the contribution was on (area, timing, scalability, sequence quality, something else), why the reviewers accepted it, and where the gaps or open directions are. The idea is to approach it analytically rather than just reading for background, and use that to find where we can bring something genuinely new or extend what exists. Separately, completed the port to primitive-style instantiation. The tool now generates RTL using behavioural stubs for Vivado simulation, since Vivado does not need primitives at simulation time. No synthesis-level optimisation has been applied yet to either the polynomial-to-hardware mapping or the RTL itself. Mathematical models for all four structures are done by this point -- here "mathematical model" means: given a polynomial, the tool identifies where to put XOR gates (taps), where to put rings, which cells get rule 90 vs 150 for CA, and which hardware structure to generate. It does some condition checking to reduce the search space (primitivity check, decomposability for HRG, sequence conditions for CA), but the actual search is still brute-force after the reduction.
26 May 2026
Went back through all papers read so far with the paper-analysis framing from Sir. Spent most of the time on two references that are most directly relevant to the synthesis question. First, Wang, Touba et al. UT-CERC-12-03 (https://www.cerc.utexas.edu/reports/UT-CERC-12-03.pdf): this covers standard LFSR, modular LFSR, RG, HRG, and minLFSR, gives minimum XOR gate counts and area comparisons for each, and also has a section on CA resource usage as a comparison point -- this is the closest thing to a unified comparison in the literature. Second, the CA tutorial by Serra et al. (https://webhome.cs.uvic.ca/~mserra/AttachedFiles/CA_Tutorial.pdf): this covers the mathematical synthesis process in detail, including how to go from a characteristic polynomial to a valid 90/150 rule sequence.
27 May 2026
Went through the CA synthesis process from the Serra et al. tutorial carefully: given a primitive polynomial, the rule sequence is derived from the polynomial coefficients, with 1 mapping to rule 150 and 0 mapping to rule 90. Began automating this in the tool. At this point each cell still maps to one LUT. But started thinking about a more efficient packing: two adjacent 150-rule cells can potentially share a LUT6_2. The reason is that consecutive cells i and i+1 have overlapping neighbourhoods: cell i reads q_{i-1}, q_i, q_{i+1} and cell i+1 reads q_i, q_{i+1}, q_{i+2}, so q_i and q_{i+1} appear in both computations. The combined function for two cells therefore has 4 unique neighbourhood inputs (q_{i-1}, q_i, q_{i+1}, q_{i+2}), plus 1 for seed-toggle mode and 1 for seed-in, giving 6 inputs total -- which is exactly what a LUT6_2 provides. So a pair of adjacent CA cells fits in one LUT6_2. At best this halves the LUT count for 150-dominated rule sequences.
28 May 2026
Automated the revised two-cell-per-LUT6_2 CA with primitive instantiation. Hit several tool constraint issues with dont_touch and keep attributes not behaving as expected in the context of the automation script -- took a while to resolve. Once that was done, automated the full polynomial-to-RTL synthesis path. Wrote down all conditions a polynomial or rule sequence must satisfy for the hardware to produce a valid M-sequence, drawing from UT-CERC-12-03 and the Serra et al. tutorial, plus conditions discussed with Dr. Palchaudhuri. Took feedback from Sir and removed hardware that was adding cost without being necessary: removed the seed non-zero checker, removed the enable signal, and switched from active-low reset with an inverter to active-high FDRE reset directly. The inverter on the active-low reset path was consuming an extra LUT. After these removals, an 8-degree polynomial fits in 14 LUTs and 16 FFs.
29 May 2026
Continued optimising the 8-degree CA. Using a combination of tool constraints (dont_touch, keep_hierarchy) and RTL rewrites, and automating those rewrites at the same time, brought it down to 4 LUTs and 8 FFs. That is one Xilinx 7-series slice. Ran timing analysis with appropriate constraints and got 467 MHz. Walked Sir through the initialisation calculations and the RTL generation process -- how the tool derives the rule sequence, applies the condition checks, and emits the primitive RTL from the polynomial. He confirmed it looked correct. Also completed the HRG synthesis optimisation using the decomposability conditions: the top-to-bottom and bottom-to-top decomposition of the characteristic polynomial determines which taps go in which direction, and getting that right is the key to achieving the minimum XOR gate count. Dr. Palchaudhuri also asked to add ASIC implementation as a parallel workstream, noting that behavioural RTL for ASIC is not a significant additional effort given what is already in the tool. After that, started writing down the FPGA-specific optimisation strategies structure by structure. Standard LFSR: SRLC32E for long untapped FF chains, LUT-based XOR of all tapped outputs feeding back into the first FF for sections with taps. Modular LFSR: each tap XORs one fixed feedback bit into the corresponding FF input, so each XOR gate has one input fixed (the feedback wire) and one varying -- that is exactly the condition for using the carry chain XOR, since the carry chain XOR has one input from the MUX select and one from the carry-in, one of which can be the fixed feedback. Alternatively, using LUT6_2 with O5 and O6 as two independent XOR outputs handles two tap positions per LUT, but the two-output constraint interacts with the fanout structure of the feedback wire. RG and HRG strategies are still being worked out.
30 May 2026
Started automating LFSR primitive instantiation for both standard and modular forms. The carry chain idea for modular LFSR has not been tested in hardware yet. The uncertainty is whether the carry chain XOR can be driven without the LUT6 consuming its O6 output to reach the carry chain input -- if it does, then the O6 output is no longer available for the actual XOR result, which removes the resource benefit. Need to check the carry chain CARRY4 port connections more carefully. For HRG, the optimised primitive RTL produces the correct sequence for the 8-degree polynomial but breaks for the 16-degree polynomial: there are 17 mismatches in the output sequence against the reference unoptimised version. Something is stuck -- either a signal is not propagating or a constraint is over-constraining a net. The unoptimised HRG still works correctly for all polynomials tested. Did not get to RG optimisation today.
Today
Debug stuck signal in optimised HRG for 16-degree polynomial (17 output mismatches vs reference)
Test carry chain XOR approach for modular LFSR and determine if O6 output is consumed
Work out RG optimisation strategy (carry chain or LUT6_2 packing)
Continue reading Wang/Wu/Wen alongside NTU course
Daily record
12 May 2026
FPGA fabric: CLB contains LUTs, wide-function MUXes, carry chain, and DFFs
LUT functions both as SRAM (stores truth table) and as MUX (selects output by input address)
LUT6_2: dual-output LUT, O5 (5-input function) and O6 (6-input function) from the same cell
Carry chain implements the MUX for carry-out and a dedicated XOR for the sum; the LUT supplies the propagate term, so a ripple carry adder maps very naturally
Ripple carry adder: generate G = AB, propagate P = A⊕B; carry chain cascades these across slices
Types of routing in FPGA: local, direct, long-line (general-purpose interconnect fabric)
13 May 2026
2's complement subtractor: using XOR to invert B and feeding +1 via carry-in works well with the carry chain, maps cleanly, no extra LUTs
Full subtractor structure is not cascadable and cannot exploit the hardwired carry chain, higher delay, worse area
SLICEL vs SLICEM: SLICEM has additional LUT-RAM and shift-register capability; SLICEL is logic-only
Wide-function MUX (F7MUX, F8MUX) allows combining LUT outputs to implement wider functions
A+B==K equality checker: naive approach, compute sum then compare to K, wastes resources
Optimised: work with interim carry/sum bits; each stage (ai, bi, ki) produces the carry conditions; AND products reduce to 3-variable functions
Final mapping: 4 LUT6s + 4 MUXes for a full equality check; last LUT6 uses only 3 inputs (a0, b0, k0)
Using the carry chain in place of a separate AND gate (3rd LUT6) saves resources; first two LUT6s compare si and ki (6 bits each)
14 May 2026
HRGs (Hybrid Ring Generators, Rajski et al. 2025): ring generators where feedback taps alternate direction -- some top-to-bottom (+), some bottom-to-top (-); this accelerates internal data circulation and reduces aliasing transient in MISRs compared to conventional RGs with same-direction taps
Aliasing in BIST: when a faulty circuit's output compresses to the same signature as a fault-free circuit; HRGs reduce both aliasing probability and the transient period before it reaches steady state
Primitive polynomials over GF(2): required for maximum-length sequences; Mrugalski et al. 2026 extends known tables up to degree 1200 for use in large ring generators and HRGs
Ring generator architecture (Mukherjee et al. 2011): feedback taps formed by encompassing k adjacent FFs in a ring; no two feedback lines cross; max fanout 2, single XOR gate delay between any two FFs; O(n^2) fast simulation via lookup table and superposition
HRG lower bound (Wang/Touba et al. 2011, UT-CERC-12-01): for a conventional RG using k XOR gates, an HRG using the same characteristic polynomial needs at minimum (k+1)/2 XOR gates when k=1,3,5; this is achieved when the polynomial is fully decomposable
Brent/Zimmermann (rpb243): O(n^2) primitivity test using prime factors of 2^n-1; verify that LFSR initialized to a non-zero seed returns to it after 2^n-1 steps but not after (2^n-1)/p steps for each prime factor p
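The order-based primitivity check in the last note can be sketched with polynomials over GF(2) stored as integer bitmasks (bit i is the coefficient of x^i). This is a minimal illustration, not the method from the report: the function names are hypothetical, and the trial-division factoring of 2^n - 1 is an assumption that only suits small n. Instead of clocking an LFSR, it raises x to the required powers modulo the polynomial, which tests the same condition.

```python
def poly_mulmod(a, b, f):
    """Multiply GF(2) polynomials a*b modulo f (a and b already reduced)."""
    n = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:          # degree reached n: subtract (XOR) f
            a ^= f
    return r

def poly_powmod(base, e, f):
    """Square-and-multiply exponentiation modulo f."""
    r = 1
    while e:
        if e & 1:
            r = poly_mulmod(r, base, f)
        base = poly_mulmod(base, base, f)
        e >>= 1
    return r

def prime_factors(m):
    """Distinct prime factors by trial division (small inputs only)."""
    out, p = [], 2
    while p * p <= m:
        if m % p == 0:
            out.append(p)
            while m % p == 0:
                m //= p
        p += 1
    if m > 1:
        out.append(m)
    return out

def is_primitive(f):
    """True if x has multiplicative order exactly 2^n - 1 modulo f."""
    n = f.bit_length() - 1
    if n < 1 or not f & 1:        # constant term must be 1
        return False
    order = (1 << n) - 1
    x = 2 if n > 1 else 2 ^ f     # the polynomial x, reduced modulo f
    if poly_powmod(x, order, f) != 1:
        return False
    return all(poly_powmod(x, order // p, f) != 1 for p in prime_factors(order))

print(is_primitive(0b10011), is_primitive(0b11111))  # prints True False
```

The two printed cases are x^4 + x + 1 (primitive) and x^4 + x^3 + x^2 + x + 1 (irreducible, but x only has order 5).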
15 May 2026
Cellular automata (CA): each cell updates based on a local neighbourhood rule, unlike LFSRs where feedback taps can span the full register
CAs are more area-efficient on FPGA/ASIC because all interconnects are local; no long routing paths for feedback
CA rules can be uniform (same rule for all cells) or hybrid (different rules per cell); hybrid CAs offer more flexibility in sequence generation
CA-based test structures are a distinct approach from LFSRs and HRGs, with different coverage and aliasing characteristics
Research direction: automate ring generation for a given polynomial as a mathematical model first, then target FPGA optimisation
16 May 2026
Three standard textbooks assigned as background: Wang/Wu/Wen covers DFT and BIST architectures; Jha/Gupta covers fault models and test algorithms; Hurst covers both digital and mixed-signal testing
Starting with Wang/Wu/Wen as the primary text alongside the NTU course
Standard-form LFSR (Fibonacci): feedback XOR is external; outputs of the tapped stages are XORed together outside the register and the result re-enters at the first stage
Modular-form LFSR (Galois): XOR gates are distributed inside the register, each tap position XORs the output bit directly into that stage; both produce the same maximum-length sequence for the same primitive polynomial
Automating parameterised RTL generation from a polynomial: encode tap positions as a parameter vector, emit XOR assignments conditionally per bit index; verified sequence length matches 2^n - 1
Modular variant can be derived from standard-form RTL by changing feedback logic; polynomial determines tap positions in both cases, only the XOR placement differs
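The two LFSR forms and the period check noted above can be modelled in a few lines. This is a software sketch rather than the generated RTL: the state is packed into an integer, the polynomial is a bitmask with bit i as the coefficient of x^i, and the function names and bit ordering are assumptions chosen for illustration. Both step functions assume the polynomial has constant term 1.

```python
def fibonacci_step(state, poly, n):
    """Standard form: XOR the tapped bits together, shift, feed the result back in."""
    taps = poly & ((1 << n) - 1)              # coefficients c_0 .. c_{n-1}
    fb = bin(state & taps).count("1") & 1     # parity of the tapped bits
    return (state >> 1) | (fb << (n - 1))

def galois_step(state, poly, n):
    """Modular form: shift, then XOR the feedback bit into every tapped stage."""
    state <<= 1
    if (state >> n) & 1:                      # the bit leaving the register
        state ^= poly                         # clears bit n and applies the taps
    return state

def period(step, poly, n, seed=1):
    """Number of steps until the seed recurs (capped at 2**n steps)."""
    state, count = step(seed, poly, n), 1
    while state != seed and count < (1 << n):
        state = step(state, poly, n)
        count += 1
    return count

poly = 0b10011                                # x^4 + x + 1
print(period(fibonacci_step, poly, 4), period(galois_step, poly, 4))  # prints 15 15
```

Both forms map the all-zero state to itself, which is the dead state that has to be avoided at initialisation.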
17 May 2026
RG (ring generator): feedback structure based on the characteristic polynomial, similar in concept to LFSR but with ring topology; implementation was straightforward once the math was clear
HRG (hybrid ring generator): a ring generator where feedback taps alternate direction -- some go top-to-bottom (bottom taps, +) and others go bottom-to-top (top taps, -); all taps still span k adjacent flip-flops but in opposite directions; faster internal data circulation than a conventional RG where all taps go the same way
Yosys synth_xilinx maps to Xilinx primitives (LUT6, FDRE, etc.) and runs ABC for logic optimisation; area stats are collectable but the ABC pass flattens and re-optimises logic, so structural differences between LFSR variants are not preserved in the netlist
Yosys alone cannot give meaningful timing or fanout differentiation between structurally different generators of the same polynomial degree; the optimiser equalises them
For a valid structural comparison, place-and-route through nextpnr and back-annotation via prjxray is needed to see actual routing delays and fanout on Xilinx fabric
Refactoring shared GF(2) math (polynomial arithmetic, state transition matrix construction) into a common Python module avoids duplication and makes sequence comparison across generator types easier to script
18 May 2026
nextpnr-xilinx requires a pre-compiled chipdb binary per device; generated from prjxray-db via bbaexport.py (produces .bba) then bbasm (produces .bin); xc7a35t.bin is 89 MB, one-time cost
synth_xilinx does not accept asynchronous reset (always @(posedge clk or negedge rst_n)); must convert to synchronous reset before synthesis, simulation files are independent
nextpnr-xilinx requires IOSTANDARD XDC constraints on every port including all state bits; missing constraints cause it to abort; generated with a Python one-liner over the port list
Post-route Fmax from nextpnr-xilinx (xc7a35tcsg324-1, n=32, poly x^32+x^28+x^27+x+1): LFSR standard 368 MHz, LFSR modular 332 MHz, RG 390 MHz, HRG 385 MHz
LFSR standard is Fmax-limited by 3 series XOR gates in the feedback path (logic levels = k = 3 taps in the polynomial)
LFSR modular has 1 logic level but the feedback FF drives 4 nodes (fanout = k+1 for k=3); routing overhead for fanout-4 costs more Fmax than the logic level saving gains
RG achieves the highest Fmax (390 MHz): feedback FF drives at most 2 nodes, 1 XOR gate per stage, no long routing paths
HRG achieves 385 MHz with only 2 total XOR gates vs 3 for RG and both LFSRs; best area efficiency of the four structures for this polynomial
Yosys area stats are consistent: all four structures map to 32 FDRE + 32-33 LUTs; ABC cannot differentiate them further, which is why Yosys alone is insufficient for this comparison
nextpnr runs two placement iterations per design; the second post-route Max frequency line is the result to record; first is pre-routing estimate
CA neighbourhood for cell i: q_{i+1}, q_i, q_{i-1} feed back as inputs to the ith FF. Periodic boundary wraps these for cells at the array edges
Rule 90: next state = q_{i+1} XOR q_{i-1}. Rule 150: next state = q_{i+1} XOR q_i XOR q_{i-1}. Alternating 90/150 across cells improves sequence quality over uniform rule assignment
Universal CA cell: extend each neighbourhood input with a mode signal (R for right, S for self, L for left); each pair computes (q XOR mode) so the mode bit controls whether the contribution is direct or inverted. XOR all three pairs, then gate with M to optionally invert the whole output. 7-input 1-output function per cell
LUT6 mapping: 6 inputs cover q_{i+1}, q_i, q_{i-1}, R, S, L. M goes to the carry chain XOR or MUXF7. In a 7-series slice (4 LUT6s + carry chain), all 4 cells share the same M, so M is not independently settable within a slice, but can vary between slices
Serial seed initialisation: shift seed in over a fixed number of clock cycles before switching to CA update mode
2D spatial thinking: arranging the FF array in 2D rather than 1D expands neighbourhood options and makes local routing arguments stronger
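The rule 90 and rule 150 updates noted for this day can be tried out with a short simulation. This is a sketch only: the list-of-bits state, the function names, and the optional null boundary (missing neighbours read as 0) are assumptions added for illustration, while the periodic boundary follows the notes.

```python
def ca_step(state, rules, periodic=True):
    """One synchronous update of a 1-D hybrid CA; state is a list of 0/1, rules a list of 90/150."""
    n = len(state)
    nxt = []
    for i, rule in enumerate(rules):
        if periodic:
            left, right = state[i - 1], state[(i + 1) % n]
        else:                                  # null boundary: edge neighbours are 0
            left = state[i - 1] if i > 0 else 0
            right = state[i + 1] if i < n - 1 else 0
        bit = left ^ right                     # rule 90
        if rule == 150:
            bit ^= state[i]                    # rule 150 adds the cell's own state
        nxt.append(bit)
    return nxt

def ca_period(seed, rules, periodic=True):
    """Steps until the seed recurs, or None if it never does."""
    state = seed
    for count in range(1, 2 ** len(seed) + 1):
        state = ca_step(state, rules, periodic)
        if state == seed:
            return count
    return None

rules = [90, 150, 90, 150]
print(ca_step([1, 0, 0, 0], rules))                    # prints [0, 1, 0, 1]
print(ca_period([1, 0, 0, 0], rules, periodic=False))  # prints 15
```

With the null boundary this particular 4-cell rule vector happens to reach the maximal period of 15; that is a property of this small example, not a claim about alternating rules in general.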
19 May 2026
Vivado gives results consistent with nextpnr for these structures (within a few MHz); to be used as primary flow going forward because of better constraint support for 7-series
nextpnr-xilinx constraint support for this device is limited relative to Vivado; will remain in use for open-source reproducibility but not as the primary reference for timing
CA 90/150 rule transition matrices over GF(2): rule 90 gives a tridiagonal matrix with 1s on the super- and sub-diagonals; rule 150 adds 1s on the main diagonal as well; periodic boundary wraps the corners
Mode signal encoding R/S/L: effectively selects which cells contribute as non-inverted vs inverted to the XOR chain; controlling this per-cell and per-slice is the key degree of freedom in universal CA design
Resource overhead in CA design: apparent LUT count increase from universal cell (7-input vs 3-input for basic 90/150) may be acceptable or even preferable depending on sequence properties; Prof. Palchaudhuri noted designs that look wasteful often show better empirical test results
Direct RTL-to-hardware mapping on FPGA: structuring RTL to match the target slice architecture (LUT6 width, carry chain topology, MUXF7/F8 placement) rather than letting synthesis decide; this is what the coming sessions will address
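The transition matrices described in the CA note above can be built directly, and for the purely tridiagonal case the characteristic polynomial follows from a short recurrence. This is a sketch under stated assumptions: rows are stored as integer bitmasks, the function names are hypothetical, and `char_poly_null` covers only the null-boundary case (no corner entries), which is not the periodic boundary used in the notes.

```python
def ca_matrix(rules, periodic=False):
    """Rows of the GF(2) transition matrix as bitmasks; bit j of row i is T[i][j]."""
    n = len(rules)
    rows = []
    for i, rule in enumerate(rules):
        row = (1 << i) if rule == 150 else 0   # main diagonal only for rule 150
        for j in (i - 1, i + 1):               # sub- and super-diagonal
            if periodic:
                row ^= 1 << (j % n)            # wraps into the corners
            elif 0 <= j < n:
                row ^= 1 << j
        rows.append(row)
    return rows

def char_poly_null(rules):
    """Characteristic polynomial (bitmask) of a null-boundary 90/150 CA.

    Tridiagonal recurrence over GF(2): D_k = (x + d_k) * D_{k-1} + D_{k-2},
    with d_k = 1 for rule 150 and 0 for rule 90.
    """
    prev, cur = 0, 1                           # D_{-1} = 0, D_0 = 1
    for rule in rules:
        nxt = (cur << 1) ^ prev                # x * D_{k-1} + D_{k-2}
        if rule == 150:
            nxt ^= cur                         # + d_k * D_{k-1}
        prev, cur = cur, nxt
    return cur

print([bin(r) for r in ca_matrix([90, 150, 90])])   # prints ['0b10', '0b111', '0b10']
print(bin(char_poly_null([90, 150, 90, 150])))      # prints 0b10011
```

The second result is x^4 + x + 1, so this small null-boundary rule vector has a primitive characteristic polynomial.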
20 May 2026
LUT6_2 dual-output mapping for MUX-adder: I5 serves as a common select line for both MUXes within the LUT6_2, so the select signal s is shared; (b,d) MUX output goes to O5, XOR of both MUX outputs goes to O6
Carry chain integration for sum output: O6 acts as the carry chain MUX select, O5 feeds the 0-input of the carry chain MUX, and the final XOR of O6 with the 1-input of the carry chain MUX gives Si; this chains the full adder output through without an extra LUT
The full MUX-adder function (y = (a+b) when s=0, (c+d) when s=1) maps into a single LUT6_2 up to the XOR stage, with the carry chain handling the sum bit; fewer LUTs than a naive two-MUX-plus-adder implementation
The efficiency gain from the shared I5 select line has a constraint: s must be the same signal for both mux functions within that LUT; they cannot be independently controlled within one LUT6_2
Both implementation options (pre-compute both sums then MUX, vs MUX inputs then add) have the same critical path depth: one MUX delay plus one adder delay from any input to y
CA with alternating 90/150 rules and a universal cell is buildable without tying the rule assignment to a primitive polynomial; the output will not be an M-sequence until the polynomial connection is made
Automating universal CA cell generation: parameterise mode signals per cell index and emit LUT6 instantiations for each cell based on its index; alternating rule assignment encodes directly as a function of cell index parity
21 May 2026
FDRE: D flip-flop with synchronous reset and clock enable; resets to 0 on assertion of R; dead state for LFSRs because all-zeros state has no escape under normal XOR feedback
FDSE: D flip-flop with synchronous set and clock enable; sets to 1 on assertion of S; preferred for LFSR structures where the all-zeros state must be avoided
CARRY4 primitive: 4-bit carry chain slice; CYINIT sets the initial carry of the first CARRY4 in a chain (constant 0 or 1, or a fabric signal); CI takes the cascaded carry from the CARRY4 below and chains carry across slices; DI inputs are the data inputs of the carry MUXes; S inputs are the carry MUX selects and also feed the sum XORs
Primitive instantiation syntax: component-level in VHDL, module instantiation in Verilog; all ports named explicitly; INIT parameter on LUT primitives is a hex string encoding the truth table
SRLC32E as replacement for long untapped FF chains: a 32-deep shift register in a single LUT6 in a SLICEM; Q31 output is the end-of-chain tap used to cascade into the next SRL or a DFF; Q is an address-selected tap
In LFSR structures, long consecutive FF chains where no intermediate tap is needed are wasteful; replacing with SRLC32E reduces slice usage and is the intended use of SLICEM shift register mode
Vivado is the correct tool for primitive-level work on Xilinx devices; nextpnr-xilinx does not handle Xilinx-specific primitives (FDRE, FDSE, CARRY4, SRL16E, SRLC32E) correctly in all cases
22 May 2026
LUT in shift register mode (SRL): the LUT6 in a SLICEM can act as a 16-deep (SRL16E) or 32-deep (SRLC32E) shift register; the address input A selects which delay tap is presented at Q; data shifts in on D at each clock edge
SRLC32E: 32-bit SRL with cascade output (Q31); Q31 always outputs the last stage regardless of address; Q is address-selectable; CE is clock enable
SRL16E address 13 plus 1 DFF implements a 15-stage delay (14 stages in SRL + 1 in DFF); for x^15 + x^14 + 1, the feedback tap at position 14 is taken from SRL address 13 (0-indexed)
52-bit LFSR with x^52 + x^49 + 1: 3 SRL16Es (address 15 each, 48 stages) plus 4 DFFs; feedback XOR taken from the tap at position 49 (SRL address of the appropriate intermediate stage)
SRL64 does not exist as a primitive because a LUT6 has 6 inputs; in shift register mode the 5 address bits give a maximum of 32 addressable stages (2^5); a 64-deep SRL would need 6 address bits, leaving no input for data
Synthesiser maps straight FF chains to SRL primitives when the chain is long enough and no intermediate taps are required; threshold depends on tool heuristics but consistent SRL mapping seen for N>=8 with CE in the parameterised test
N=8 and N=16: SRL16E, O5 output only, 1 SLICEM + 1 SLICEL
N=32: SRLC32E, O6 output only, same slice count
N=33: SRLC32E, O6 only; the extra stage is handled by Q31 cascading into a DFF, no additional LUT needed
N=34 and N=35: SRLC32E plus a second SRL (SRL16E or SRLC32E); N=34 uses O6 only throughout; N=35 uses one LUT with O5 only and one with O6 only
Slice count stays constant at 1 SLICEM + 1 SLICEL across N=8 through N=36
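The SRL-plus-DFF arrangement for x^15 + x^14 + 1 noted above can be sanity-checked with a behavioural model. This is a Python sketch, not the primitive netlist: the `SRL` class and function name are hypothetical, the SRL contents are assumed to start at 0, and the DFF is assumed to start at 1 so the register is not in the all-zero state.

```python
class SRL:
    """Behavioural shift-register LUT: stage 0 holds the newest bit, q(addr) reads stage addr."""
    def __init__(self, depth):
        self.bits = [0] * depth

    def q(self, addr):
        return self.bits[addr]

    def shift(self, d):
        self.bits = [d] + self.bits[:-1]

def srl_lfsr_period(addr=13, max_steps=1 << 16):
    """Period of: SRL tapped at addr -> DFF, with feedback = SRL tap XOR DFF output."""
    srl, dff = SRL(16), 1
    start = (tuple(srl.bits[:addr + 1]), dff)   # only stages 0..addr are live state
    for step in range(1, max_steps + 1):
        tap = srl.q(addr)                       # 14th stage (address 13, 0-indexed)
        fb = tap ^ dff                          # two-tap feedback XOR
        dff = tap                               # DFF is the 15th stage
        srl.shift(fb)                           # feedback enters the SRL
        if (tuple(srl.bits[:addr + 1]), dff) == start:
            return step
    return None

print(srl_lfsr_period())  # prints 32767
```

The result is 2^15 - 1, consistent with a 15-stage register (14 stages in the SRL plus 1 DFF) running a maximal-length sequence.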
23 May 2026
Polynomial-driven M-sequence generation using CA structures requires tying the rule assignment (90/150 per cell) to a primitive polynomial; adding this to the tool closes the gap between the universal cell construction and verified M-sequence output
CA vs LFSR/RG/HRG structural comparison: with both classes of structure now automatable, a direct comparison of LUT count, critical path depth, fanout distribution, and routing locality is tractable
24 May 2026
CA correctness checker complexity: O(2^n) brute-force over all initial seeds, reduced by a factor of 2 because every valid M-sequence has a reciprocal pair -- the sequence generated from seed s and its reciprocal counterpart are the same M-sequence run in reverse, so only half the seed space needs to be tested independently
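One possible shortcut for the checker, sketched below, is to follow a single non-zero seed instead of all of them. This is not what the tool currently does; it relies on the assumption that the update is linear over GF(2) (true for 90/150 rules), so the all-zero state maps to itself. If one orbit then has length 2^n - 1, it already contains every non-zero state, and every other seed lies on the same cycle. The function names and the packed-integer 4-cell example are hypothetical.

```python
def is_maximal_length(step, n, seed=1):
    """True if the orbit of one non-zero seed has length exactly 2**n - 1."""
    limit = (1 << n) - 1
    state = seed
    for count in range(1, limit + 1):
        state = step(state)
        if state == seed:
            return count == limit   # first return must happen at the full length
    return False

def ca4_step(s):
    """4-cell null-boundary CA, rules 90/150/90/150, cell i stored in bit i."""
    left, right = (s << 1) & 0xF, s >> 1
    return left ^ right ^ (s & 0b1010)   # self term only on the rule-150 cells

def rule90_step(s):
    """4-cell null-boundary CA with uniform rule 90."""
    return ((s << 1) & 0xF) ^ (s >> 1)

print(is_maximal_length(ca4_step, 4), is_maximal_length(rule90_step, 4))  # prints True False
```

This still costs O(2^n) steps in the maximal-length case, but only once rather than once per seed.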
25 May 2026
Novelty analysis approach for paper writing: for each published paper in the area, identify specifically which axis the contribution is on (area, timing, scalability, sequence quality), why that was accepted as novel, and where the gap to the next improvement is
Behavioural stubs for Vivado simulation: Vivado does not require primitive instantiation at simulation time; generating behavioural stubs alongside primitive RTL allows simulation and synthesis from the same tool flow without maintaining two separate RTL trees
Mathematical model scope: tap position derivation, primitivity/decomposability condition checking, rule sequence derivation for CA, and hardware topology emission; search is still brute-force after condition-based reduction
26 May 2026
UT-CERC-12-03 (Wang/Touba et al.): covers standard LFSR, modular LFSR, RG, HRG, and minLFSR; gives minimum XOR gate count for each; includes CA as a comparison point on area; the most unified comparison in the literature reviewed so far (https://www.cerc.utexas.edu/reports/UT-CERC-12-03.pdf)
Serra et al. CA tutorial (https://webhome.cs.uvic.ca/~mserra/AttachedFiles/CA_Tutorial.pdf): covers the mathematical synthesis process for CA from a characteristic polynomial to a valid 90/150 rule sequence; primary reference for the polynomial-to-CA-hardware mapping
27 May 2026
CA rule sequence derivation from polynomial: polynomial coefficients map directly to rules -- coefficient 1 gives rule 150, coefficient 0 gives rule 90; this determines the rule at each cell position
Two-cell LUT6_2 packing for adjacent 150-rule cells: for cells i and i+1, the neighbourhood inputs are q_{i-1}, q_i, q_{i+1}, q_{i+2}; q_i and q_{i+1} appear in both; combined with seed-toggle (1 input) and seed-in (1 input), total unique inputs is 6, which matches LUT6_2 width exactly
This packing halves LUT count for pairs of adjacent 150-rule cells; for rule sequences with many 150s this is a significant reduction
28 May 2026
Tool constraints (dont_touch, keep) do not always behave predictably when applied from an automation script; constraints need to be verified against the elaborated netlist in Vivado to confirm they are being honoured
Redundant hardware removal: seed non-zero checker, enable signal, and active-low reset inverter all add LUTs without benefit for the target use case; removing them brought an 8-degree CA down to 14 LUTs and 16 FFs
Active-low reset with FDRE requires an inverter in the reset path, consuming one additional LUT; switching to active-high reset (FDRE R port directly) eliminates this
All polynomial/sequence conditions for valid M-sequence from a CA: primitivity of the polynomial, rule sequence correctness from the polynomial coefficients, periodic boundary conditions, and seed non-zero requirement (enforced at initialisation, not at runtime)
29 May 2026
8-degree CA: 4 LUTs and 8 FFs achievable in one 7-series slice using dont_touch and keep_hierarchy constraints plus targeted RTL rewrites; 467 MHz with appropriate timing constraints
HRG decomposability conditions: the characteristic polynomial must decompose top-to-bottom and bottom-to-top; the direction of each tap (+ or -) is determined by which decomposition it belongs to; this is the key to achieving (k+1)/2 XOR gates
Standard LFSR FPGA optimisation: replace long untapped FF chains with SRLC32E; tapped sections use LUT-based XOR of all tapped FF outputs feeding into the first FF input
Modular LFSR carry chain XOR opportunity: each tap XOR has one fixed input (the feedback wire) and one varying input (the FF output); the CARRY4 XOR stage takes one input from S (the carry MUX select) and one from the carry-in of that stage; mapping the fixed feedback wire onto one of these could eliminate LUTs for each tap, but requires confirming that O6 is not consumed by the carry chain driver
LUT6_2 dual-output packing for modular LFSR: O5 and O6 can each implement one independent XOR for two different tap positions, but only if the two taps can share the LUT6_2 input ports without violating the 6-input constraint given the fanout of the feedback wire
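The carry chain XOR idea for the modular LFSR can at least be checked at the logic level with a behavioural CARRY4 model. This is a sketch under assumptions: it models only the per-stage behaviour (sum XOR and carry MUX), the function names are hypothetical, and `tap_xor_via_carry` is one hypothetical way of holding the carry at the feedback value, not a tested mapping. It says nothing about whether S and DI can be driven in hardware without using up LUT outputs, which is the open question.

```python
def carry4(s, di, cyinit=0):
    """Behavioural carry chain: per stage, O = S XOR carry-in; CO = carry-in if S else DI."""
    carry, o, co = cyinit, [], []
    for s_i, di_i in zip(s, di):
        o.append(s_i ^ carry)
        carry = carry if s_i else di_i
        co.append(carry)
    return o, co

def tap_xor_via_carry(ff_bits, fb):
    """Hold every stage's carry at fb so that O[i] = ff_bits[i] XOR fb."""
    o, _ = carry4(s=ff_bits, di=[fb] * len(ff_bits), cyinit=fb)
    return o

print(tap_xor_via_carry([1, 0, 1, 1], 1))  # prints [0, 1, 0, 0]
```

Because DI and the initial carry both equal the feedback bit, each carry MUX passes that bit on unchanged whichever way it selects, so every stage XORs its own S input with the same fixed value.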
30 May 2026
Carry chain XOR for modular LFSR: open question is whether the LUT6 O6 output is required to drive the CARRY4 S input; if so, O6 is occupied by the carry chain driver and cannot simultaneously output the XOR result, eliminating the saving; need to check CARRY4 port connections against UG474
HRG optimised path scalability issue: 17 output mismatches for 16-degree polynomial vs reference; unoptimised path correct for all tested polynomials; likely a constraint (dont_touch or keep) over-constraining or incorrectly scoping a net in the generated RTL for the larger polynomial, causing a signal to be stuck
Debugging approach: compare elaborated netlists between 8-degree (working) and 16-degree (failing) optimised HRG; identify which net is stuck and whether the constraint is applied to a net that should not be constrained at that degree
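Before diffing netlists, a trace-level check can narrow the search. The Python sketch below assumes the reference and optimised HRG outputs are available as one integer state per clock cycle (for example, exported from simulation); the function names and the sample data are hypothetical. It lists the cycles that mismatch and the bit positions that never toggle, which are candidates for a stuck net.

```python
from typing import Dict, List, Sequence


def mismatch_indices(ref: Sequence[int], dut: Sequence[int]) -> List[int]:
    """Cycle indices where the two traces differ."""
    return [i for i, (r, d) in enumerate(zip(ref, dut)) if r != d]


def stuck_bits(states: Sequence[int], width: int) -> Dict[int, int]:
    """Map bit position -> constant value for bits that never toggle in the trace."""
    stuck = {}
    for b in range(width):
        values = {(s >> b) & 1 for s in states}
        if len(values) == 1:
            stuck[b] = values.pop()
    return stuck


# Made-up 4-bit example: the second trace has bit 1 held at 0.
ref = [0b1000, 0b0100, 0b1110, 0b1111]
dut = [0b1000, 0b0100, 0b1100, 0b1101]

print(mismatch_indices(ref, dut))  # -> [2, 3]
print(stuck_bits(dut, width=4))    # -> {1: 0}
```

A bit that is constant in the failing trace but toggles in the reference points at the register whose input net is worth inspecting first in the elaborated netlist.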
Reading list
Assigned
Xilinx · UG474 (v1.8) September 27, 2016 · assigned 12 May 2026
Janusz Rajski, Maciej Trawka, Jerzy Tyszer, Bartosz Włodarczak · Journal of Electronic Testing (2025) 41:241-253 · assigned 13 May 2026
Grzegorz Mrugalski, Janusz Rajski, Maciej Trawka, Jerzy Tyszer · Journal of Electronic Testing (2026) · assigned 13 May 2026
VLSI Test Principles and Architectures: Design for Testability
Laung-Terng Wang, Cheng-Wen Wu, Xiaoqing Wen · Morgan Kaufmann, 2006 · assigned 16 May 2026
Testing of Digital Systems
N. K. Jha, S. Gupta · Cambridge University Press · assigned 16 May 2026
VLSI Testing: Digital and Mixed Analogue/Digital Techniques
Stanley Leonard Hurst · IEE Circuits and Systems Series 9, 1998 · assigned 16 May 2026
Prof. James Chien-Mo Li, National Taiwan University · YouTube / NTU Lab of Dependable Systems · assigned 15 May 2026
Xilinx / AMD · UG359 (Virtex-6) · assigned 21 May 2026
Cellular Automata (2018): Hardware Implementation Section
Unknown · 2018_Book_CellularAutomata (excerpt) · assigned 21 May 2026
Self-sourced
Jörg Arndt · IEEE Xplore, 2010 · sourced 14 May 2026
Richard Brent, Paul Zimmermann · ANU Technical Report, rpb243 · sourced 14 May 2026
Nilanjan Mukherjee, Janusz Rajski, Grzegorz Mrugalski, Artur Pogiel, Jerzy Tyszer · IEEE Computer, vol. 44, no. 6, pp. 64-71, June 2011 · sourced 14 May 2026
Laung-Terng Wang, Nur A. Touba, Richard P. Brent, Hui Wang, Hui Xu · UT-CERC-12-01, University of Texas at Austin, October 2011 · sourced 14 May 2026
Xilinx / AMD · UG473 · sourced 19 May 2026
Xilinx / AMD · UG474 · sourced 19 May 2026
Unknown · filesusr.com technical reference · sourced 19 May 2026
Xilinx / AMD · UG473 · sourced 22 May 2026
Xilinx / AMD · UG474 · sourced 22 May 2026
Xilinx / AMD · UG471 · sourced 22 May 2026
nandland · nandland.com · sourced 22 May 2026
Laung-Terng Wang, Nur A. Touba, et al. · Computer Engineering Research Center, University of Texas at Austin, UT-CERC-12-03 · sourced 26 May 2026
Marcelo Serra, Tim Slater, Iang Chu Yu, David M. Miller · University of Victoria (webhome.cs.uvic.ca) · sourced 26 May 2026
