Element-Wise Processing Block
Overview
The Element-Wise (EW) block operates on convolution outputs. It applies an arithmetic operation between two input feature-map streams, or an activation function to a single stream. It consists of:
Input reordering modules
FIFO buffering subsystem
EltWise controller that handles scheduling, read-enable generation, and valid-signal timing
An array of element_wise_op compute units (N parallel lanes)
Optional LUT-based activation compute blocks
Output aggregation and validity signalling
This block supports:
Addition
Subtraction
Multiplication
Activation functions (Sigmoid / Tanh)
The block also automatically manages fp_cast quantization to adjust the output precision.
System Architecture
The EW processing pipeline includes the following stages:
Input Reordering
Operand FIFO Buffering
EltWise Controller
Parallel Element-Wise Compute Array
Output Formatting
Input Reordering (conv_output_reorder_EW)
The input operands arrive in convolution output order. conv_output_reorder_EW maps
the incoming tensor fragments into a lane-major ordering expected by the EW block.
For each FIFO column and row pair, input indices are remapped according to:
This ensures that each of the \(N\) lanes of the element-wise processing block receives the correct per-pixel elements.
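The exact remapping formula is not reproduced in this section. As a hedged illustration of what a lane-major reorder does, the Python sketch below assumes a hypothetical layout. In that layout, each convolution-output row carries one element per FIFO column, and FIFO column `col` feeds compute lane `col`, so the remap reduces to a transpose. The function name `reorder_to_lane_major` is made up for this example and is not part of conv_output_reorder_EW.

```python
from typing import List, Sequence


def reorder_to_lane_major(conv_out: Sequence[int], rows: int,
                          n_lanes: int) -> List[List[int]]:
    """Hypothetical conv-output-order to lane-major remap (illustration only).

    Assumes conv_out is row-major with one element per FIFO column per row
    (flat index = row * n_lanes + col) and that FIFO column `col` feeds lane `col`.
    """
    if len(conv_out) != rows * n_lanes:
        raise ValueError("conv_out must hold rows * n_lanes elements")
    lanes: List[List[int]] = [[] for _ in range(n_lanes)]
    for row in range(rows):
        for col in range(n_lanes):
            # Element (row, col) becomes the row-th element seen by lane col.
            lanes[col].append(conv_out[row * n_lanes + col])
    return lanes


print(reorder_to_lane_major([0, 1, 2, 3, 4, 5], rows=2, n_lanes=3))
# prints [[0, 3], [1, 4], [2, 5]]
```

The real mapping may interleave rows and columns differently. The point of the sketch is only that each lane ends up with a contiguous per-pixel stream.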
FIFO Subsystem
Each operand stream passes through an array of dram_fifo instances.
These FIFOs perform:
Burst-based buffering
Multi-port read-out using demultiplexed element_rd_en control
Empty/full flag generation
Per-FIFO valid signalling
The FIFOs output:
LeftOperand_data_out
RightOperand_data_out
FIFO status flags used by the controller for scheduling
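A small behavioural model can make the FIFO flags and the per-FIFO read enables concrete. The Python below is a hypothetical stand-in for one dram_fifo instance and its neighbours. The class name `FifoModel`, the `depth` parameter, the `write_burst` method, and the `read_ports` helper are assumptions for illustration. Real hardware updates these flags on clock edges, not on method calls.

```python
from collections import deque
from typing import Deque, List, Optional, Sequence


class FifoModel:
    """Hypothetical behavioural model of one operand FIFO (not the dram_fifo RTL)."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self._buf: Deque[int] = deque()

    @property
    def empty(self) -> bool:
        return len(self._buf) == 0

    @property
    def full(self) -> bool:
        return len(self._buf) >= self.depth

    def write_burst(self, burst: Sequence[int]) -> int:
        """Accept burst words until full; return how many were written."""
        written = 0
        for word in burst:
            if self.full:
                break  # back-pressure: remaining words are not accepted
            self._buf.append(word)
            written += 1
        return written

    def read(self, rd_en: bool) -> Optional[int]:
        """Pop one word if rd_en is high and data exists; None models valid = 0."""
        if rd_en and not self.empty:
            return self._buf.popleft()
        return None


def read_ports(fifos: List[FifoModel], element_rd_en: Sequence[bool]) -> List[Optional[int]]:
    # One demultiplexed read enable per FIFO; None marks a port with valid low.
    return [f.read(en) for f, en in zip(fifos, element_rd_en)]


left = [FifoModel(depth=4) for _ in range(2)]
left[0].write_burst([10, 11])
left[1].write_burst([20])
print(read_ports(left, [True, True]))  # prints [10, 20]
print(read_ports(left, [True, True]))  # prints [11, None]
```

The second read shows the per-FIFO valid signalling. The port whose FIFO has run empty reports no data while the other port still delivers a word.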
EltWise Controller
The controller orchestrates lane-wise reads from both operand FIFOs and ensures correct
alignment of values forwarded into the element_wise_op compute lanes.
The main controller responsibilities include:
Maintaining read-cycle index (\(cycle\_idx\))
Handling image-size driven termination (\(EW\_done\))
Applying modulo padding when feature-map dimensions do not match
Managing state transitions through four phases:
State 0: Normal read and operand forwarding
State 1: Wait state after full image region read
State 2: Flush cycles for modulo padding
State 3: Stall until output FIFOs drain
Generating element_rd_en using demux_param1
Handling activation-only cases where RightOperand is unused
Producing data_valid, which enables the compute lanes
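The four controller phases can be sketched as a cycle-by-cycle trace. This is a hypothetical Python model, not the RTL, and it rests on several assumptions:

- Each State 0 cycle forwards one full group of \(N\) elements.
- State 1 lasts a fixed `wait_cycles`.
- Modulo padding means zero-filling a final partial group when the element count is not a multiple of \(N\), forwarded in one State 2 cycle. This is only one reading of the description.
- The output FIFOs drain one entry per cycle during State 3.

The names `EWState`, `controller_trace`, `wait_cycles` and `out_fifo_level` are made up for this sketch.

```python
from enum import IntEnum
from typing import List, Tuple


class EWState(IntEnum):
    READ = 0   # State 0: normal read and operand forwarding
    WAIT = 1   # State 1: wait after the full image region has been read
    FLUSH = 2  # State 2: flush for modulo padding (one cycle assumed here)
    STALL = 3  # State 3: stall until the output FIFOs drain


def controller_trace(num_elements: int, n_lanes: int, wait_cycles: int = 1,
                     out_fifo_level: int = 0) -> List[Tuple[EWState, bool]]:
    """Return (state, data_valid) per cycle for a hypothetical EW controller."""
    trace: List[Tuple[EWState, bool]] = []
    full_groups, remainder = divmod(num_elements, n_lanes)

    # State 0: cycle_idx counts full N-wide reads; data_valid is high.
    for _cycle_idx in range(full_groups):
        trace.append((EWState.READ, True))

    # State 1: fixed wait once the image region has been read.
    trace.extend([(EWState.WAIT, False)] * wait_cycles)

    # State 2: forward the partial group, zero-padded up to n_lanes (assumed).
    if remainder:
        trace.append((EWState.FLUSH, True))

    # State 3: stall while the output FIFOs still hold data (1 entry/cycle).
    while out_fifo_level > 0:
        trace.append((EWState.STALL, False))
        out_fifo_level -= 1

    return trace  # EW_done would be asserted after the last entry


for state, valid in controller_trace(10, n_lanes=4, wait_cycles=1, out_fifo_level=2):
    print(state.name, int(valid))
# prints READ 1, READ 1, WAIT 0, FLUSH 1, STALL 0, STALL 0 (one per line)
```

With 10 elements and 4 lanes, two full reads leave a remainder of 2, so one padded flush cycle is needed.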
The controller also detects activation mode (Sigmoid/Tanh) via:
Parallel Element-Wise Compute Array
The EW block instantiates \(N\) parallel element-wise compute lanes. Lanes
0 to 7 use the standard element_wise_op implementation,
while lanes 8 to \(N-1\) use the LUT-based variant element_wise_op_lut.
Each lane receives:
\(DATA\_WIDTH\)-wide LeftOperand
\(DATA\_WIDTH\)-wide RightOperand or zero (activation mode)
data_valid gating
EltWise_type
Scaling and zero-point metadata
Each lane produces:
\(DATA\_WIDTH\_OB\)-wide output
A per-lane EltWise_valid pulse
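To tie the per-lane inputs and outputs together, the hedged Python sketch below models one cycle of the lane array. The lane-to-variant split follows the text: lanes 0 to 7 are standard and lanes 8 to \(N-1\) are LUT-based. Everything else is an assumption:

- `lane_variant` and `run_lanes` are hypothetical names.
- The `compute` callback stands in for either module.
- Gated-off lanes are modelled as outputting 0.
- Scaling and zero-point metadata are omitted.

```python
import operator
from typing import Callable, List, Optional, Sequence, Tuple

STANDARD_LANES = 8  # lanes 0-7: element_wise_op; lanes 8..N-1: element_wise_op_lut


def lane_variant(lane: int) -> str:
    """Module variant used by a given lane index (hypothetical helper)."""
    return "element_wise_op" if lane < STANDARD_LANES else "element_wise_op_lut"


def run_lanes(left: Sequence[int], right: Optional[Sequence[int]], data_valid: bool,
              compute: Callable[[int, int], int]) -> List[Tuple[int, bool]]:
    """One cycle of a hypothetical N-lane array; each entry is (output, EltWise_valid).

    right=None models activation mode, where every lane gets zero as RightOperand.
    """
    rhs = [0] * len(left) if right is None else list(right)
    if len(rhs) != len(left):
        raise ValueError("operand vectors need one element per lane")
    if not data_valid:
        # Lanes gated off: no EltWise_valid pulse (output modelled as 0 here).
        return [(0, False)] * len(left)
    return [(compute(a, b), True) for a, b in zip(left, rhs)]


print([lane_variant(i) for i in (0, 7, 8)])
# prints ['element_wise_op', 'element_wise_op', 'element_wise_op_lut']
print(run_lanes([1, 2, 3], [10, 20, 30], data_valid=True, compute=operator.add))
# prints [(11, True), (22, True), (33, True)]
print(run_lanes([1, 2, 3], None, data_valid=True, compute=operator.add))
# prints [(1, True), (2, True), (3, True)]
```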
Element-Wise Operation (element_wise_op)
Overview
The element_wise_op module performs the final arithmetic or activation function
processing on a per-pixel basis. This includes:
Zero-point shifting
Scaling
Operation selection (Add/Sub/Mul/Activation)
Activation functions (Sigmoid / Tanh)
fp-cast quantization
Input Preprocessing
Zero-Point Shifting
Each operand is first shifted by its respective zero-point:
Scaling
The shifted operands are scaled:
This produces extended-width intermediate values that retain numerical precision.
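The shift-then-scale preprocessing can be sketched in integer arithmetic. The equations for this step are not shown in this section, so the Python below assumes the common affine-quantization convention. The zero point is subtracted, and the scale is an integer multiplier applied without any right shift, so the intermediate grows wider instead of losing precision. `preprocess`, `bits_needed`, and the example scale of 16384 (1.0 with 14 fractional bits) are hypothetical.

```python
def preprocess(x: int, zero_point: int, scale: int) -> int:
    """Hypothetical zero-point shift followed by integer scaling.

    Assumes shifted = x - zero_point and scaled = shifted * scale, with no
    truncation, so the result is wider than DATA_WIDTH.
    """
    shifted = x - zero_point  # zero-point shifting (sign convention assumed)
    return shifted * scale    # scaling into an extended-width intermediate


def bits_needed(value: int) -> int:
    """Signed two's-complement width required to hold `value`."""
    if value >= 0:
        return value.bit_length() + 1
    return (-value - 1).bit_length() + 1


# 8-bit operand 127 with zero point -128, scaled by 1.0 in a 14-fractional-bit format.
left = preprocess(127, zero_point=-128, scale=16384)
print(left, bits_needed(left))  # prints 4177920 23
```

The 8-bit input needs 23 bits after scaling. This illustrates why the intermediate values are extended-width.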
Operation Selection
Depending on EltWise_type, one of the following is applied:
Addition
Subtraction
Multiplication
Tanh / Sigmoid activation (refer to the Sigmoid/Tanh module)
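Operation selection is a dispatch on EltWise_type. The numeric encoding of EltWise_type is not given here, so the enum values below are hypothetical. The activation branches use floating-point `math.tanh` and a logistic function only as a numerical reference; in hardware, these functions come from the Sigmoid/Tanh module. Activation reads only the left operand, which matches the zero RightOperand in activation mode.

```python
import math
from enum import Enum


class EltWiseType(Enum):
    # Hypothetical encoding: the real EltWise_type values are not given here.
    ADD = 0
    SUB = 1
    MUL = 2
    SIGMOID = 3
    TANH = 4


def _sigmoid(x: float) -> float:
    # Numerically stable logistic function (floating-point reference only).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def select_op(op: EltWiseType, a: float, b: float = 0.0) -> float:
    """Apply the EltWise_type operation to already preprocessed operands."""
    if op is EltWiseType.ADD:
        return a + b
    if op is EltWiseType.SUB:
        return a - b
    if op is EltWiseType.MUL:
        return a * b
    if op is EltWiseType.SIGMOID:
        return _sigmoid(a)   # RightOperand is unused in activation mode
    if op is EltWiseType.TANH:
        return math.tanh(a)  # RightOperand is unused in activation mode
    raise ValueError(f"unsupported EltWise_type: {op}")


print(select_op(EltWiseType.SUB, 5, 3))     # prints 2
print(select_op(EltWiseType.SIGMOID, 0.0))  # prints 0.5
```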
Quantization
The final result is quantized using fp_cast in the Tail Block:
fp_cast is mode-dependent:
10 bits for Add/Sub/Mul
16 bits for Sigmoid/Tanh
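The section does not define what the 10-bit and 16-bit settings control. Purely as an illustration, the Python sketch below assumes they are the number of fractional bits removed by a round-half-up right shift. After the shift, the result is saturated to a signed \(DATA\_WIDTH\_OB\)-bit output. Both that interpretation and the names `FP_CAST_BITS` and `fp_cast_model` are hypothetical, not the Tail Block's actual fp_cast.

```python
# Hypothetical mapping from operation mode to the fp_cast bit setting in the text.
FP_CAST_BITS = {"add": 10, "sub": 10, "mul": 10, "sigmoid": 16, "tanh": 16}


def fp_cast_model(value: int, shift: int, out_width: int) -> int:
    """Round-half-up arithmetic right shift, then signed saturation.

    Assumes `shift` is the number of fractional bits dropped and that
    `out_width` plays the role of DATA_WIDTH_OB.
    """
    if shift > 0:
        value = (value + (1 << (shift - 1))) >> shift  # Python >> is arithmetic
    hi = (1 << (out_width - 1)) - 1
    lo = -(1 << (out_width - 1))
    return max(lo, min(hi, value))  # saturate rather than wrap


print(fp_cast_model(3 << 10, FP_CAST_BITS["add"], out_width=8))      # prints 3
print(fp_cast_model(1000 << 10, FP_CAST_BITS["mul"], out_width=8))   # prints 127
print(fp_cast_model(1 << 15, FP_CAST_BITS["sigmoid"], out_width=8))  # prints 1
```

Saturation is used here because clipping is the usual choice for narrowing activations. The real fp_cast may round or overflow differently.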
Element-Wise Block diagram: