Element-Wise Processing Block

Overview

The Element-Wise (EW) block operates on convolution outputs. It performs either an arithmetic operation between two input feature-map streams or an activation function on a single stream. It consists of:

  • Input reordering modules

  • FIFO buffering subsystem

  • EltWise controller that performs scheduling, read enable generation, and valid timing

  • An array of element_wise_op compute units (N parallel lanes)

  • Optional LUT-based activation compute blocks

  • Output aggregation and validity signalling

This block supports:

  • Addition

  • Subtraction

  • Multiplication

  • Activation functions (Sigmoid / Tanh)

The block also applies fp_cast quantization automatically to adjust the output precision.

System Architecture

The EW processing pipeline includes the following stages:

  1. Input Reordering

  2. Operand FIFO Buffering

  3. EltWise Controller

  4. Parallel Element-Wise Compute Array

  5. Output Formatting

Input Reordering (conv_output_reorder_EW)

The input operands arrive in convolution output order. conv_output_reorder_EW maps the incoming tensor fragments into a lane-major ordering expected by the EW block.

For each FIFO column and row pair, input indices are remapped according to:

\[out\_idx = (col \times N) + row\]
\[in\_idx = (row \times FIFO\_NO) + col\]

This ensures that each of the \(N\) lanes of the element-wise processing block receives the correct per-pixel elements.
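The remapping above can be modelled in software as a plain permutation. The following is a minimal behavioural sketch, not the RTL: the function name `reorder_conv_output` and the flat-list representation are assumptions for illustration, and it assumes `row` ranges over the \(N\) lanes and `col` over the `FIFO_NO` FIFOs.

```python
def reorder_conv_output(in_data, n_lanes, fifo_no):
    """Behavioural model of the index remap (hypothetical helper).

    Applies out[col * N + row] = in[row * FIFO_NO + col] for every
    (col, row) pair, assuming 0 <= row < N and 0 <= col < FIFO_NO.
    """
    if len(in_data) != n_lanes * fifo_no:
        raise ValueError("expected n_lanes * fifo_no elements")
    out = [None] * len(in_data)
    for col in range(fifo_no):
        for row in range(n_lanes):
            out_idx = col * n_lanes + row
            in_idx = row * fifo_no + col
            out[out_idx] = in_data[in_idx]
    return out


# With N = 2 lanes and FIFO_NO = 3, the six inputs are regrouped so that
# each consecutive pair holds the two lane values for one FIFO column.
example = reorder_conv_output([0, 1, 2, 3, 4, 5], n_lanes=2, fifo_no=3)
print(example)  # [0, 3, 1, 4, 2, 5]
```

Under these assumptions the remap is equivalent to transposing an \(N \times FIFO\_NO\) matrix stored in row-major order.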

FIFO Subsystem

Each operand stream passes through an array of dram_fifo instances. These FIFOs perform:

  • Burst-based buffering

  • Multi-port read-out using demultiplexed element_rd_en control

  • Empty/full flag generation

  • Per-FIFO valid signalling

The FIFOs output:

  • LeftOperand_data_out

  • RightOperand_data_out

  • FIFO status flags used by the controller for scheduling

EltWise Controller

The controller orchestrates lane-wise reads from both operand FIFOs, and ensures correct alignment of values forwarded into the element_wise_op compute lanes.

The main controller responsibilities include:

  • Maintaining read-cycle index (\(cycle\_idx\))

  • Handling image-size driven termination (\(EW\_done\))

  • Applying modulo padding when feature-map dimensions do not match

  • Managing state transitions through four phases:

    State 0: Normal read and operand forwarding

    State 1: Wait state after full image region read

    State 2: Flush cycles for modulo padding

    State 3: Stall until output FIFOs drain

  • Generating element_rd_en using demux_param1

  • Handling activation-only cases where RightOperand is unused

  • Producing data_valid that enables the compute lanes

The controller also detects activation mode (Sigmoid/Tanh) via:

\[tanh\_switch = (EltWise\_type == ELTWISE\_SIG) \lor (EltWise\_type == ELTWISE\_TANH)\]
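As a rough software illustration of the activation-mode check, the sketch below evaluates `tanh_switch` and shows the right operand being replaced by zero in activation mode, as the lane input description in this document states. The numeric encodings and the `ELTWISE_ADD`, `ELTWISE_SUB` and `ELTWISE_MUL` names are placeholders; only `ELTWISE_SIG` and `ELTWISE_TANH` appear in the source.

```python
from enum import Enum


class EltWiseType(Enum):
    # Placeholder encodings: the real values are not given in this document.
    ELTWISE_ADD = 0   # hypothetical name
    ELTWISE_SUB = 1   # hypothetical name
    ELTWISE_MUL = 2   # hypothetical name
    ELTWISE_SIG = 3
    ELTWISE_TANH = 4


def tanh_switch(eltwise_type):
    """True when the selected operation is an activation (Sigmoid or Tanh)."""
    return eltwise_type in (EltWiseType.ELTWISE_SIG, EltWiseType.ELTWISE_TANH)


def right_operand_for_lane(eltwise_type, right_operand):
    """Right operand forwarded to a lane: zero in activation mode."""
    return 0 if tanh_switch(eltwise_type) else right_operand
```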

Parallel Element-Wise Compute Array

The EW block instantiates \(N\) parallel element-wise compute lanes. Lanes 0 to 7 use the standard element_wise_op implementation, while lanes 8 to \(N-1\) use the LUT-based variant, element_wise_op_lut.
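The lane split can be pictured as a simple selection on the lane index. This sketch reads the split as lanes 0 to 7 versus lanes 8 to \(N-1\); the helper name `lane_module` is hypothetical and the function only returns module names, it does not model the compute units themselves.

```python
STANDARD_LANES = 8  # lanes 0..7 use the standard implementation (assumed reading)


def lane_module(lane_idx, n_lanes):
    """Return the name of the module assumed to be instantiated for a lane."""
    if not 0 <= lane_idx < n_lanes:
        raise ValueError("lane index out of range")
    if lane_idx < STANDARD_LANES:
        return "element_wise_op"
    return "element_wise_op_lut"


# Module chosen for each lane of a hypothetical 16-lane configuration.
lanes = [lane_module(i, 16) for i in range(16)]
```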

Each lane receives:

  • \(DATA\_WIDTH\)-wide LeftOperand

  • \(DATA\_WIDTH\)-wide RightOperand or zero (activation mode)

  • data_valid gating

  • EltWise_type

  • Scaling and zero-point metadata

Each lane produces:

  • \(DATA\_WIDTH\_OB\)-wide output

  • A per-lane EltWise_valid pulse

Element-Wise Operation (element_wise_op)

Overview

The element_wise_op module performs the final arithmetic or activation function processing on a per-pixel basis. This includes:

  • Zero-point shifting

  • Scaling

  • Operation selection (Add/Sub/Mul/Activation)

  • Activation functions (Sigmoid / Tanh)

  • fp-cast quantization

Input Preprocessing

Zero-Point Shifting

Each operand is first shifted by its respective zero-point:

\[LeftOperand\_shifted = LeftOperand - zp_L\]
\[RightOperand\_shifted = RightOperand - zp_R\]

Scaling

The shifted operands are scaled:

\[LeftOperand\_scaled = LeftOperand\_shifted \times LeftOperand\_Scale\]
\[RightOperand\_scaled = RightOperand\_shifted \times RightOperand\_Scale\]

This produces extended-width intermediate values that retain numerical precision.
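The two preprocessing steps can be sketched as integer arithmetic. This is a behavioural model only: the function name `preprocess_operand` is hypothetical, operands and scales are assumed to be integers, and Python's unbounded integers stand in for the extended-width intermediate values.

```python
def preprocess_operand(operand, zero_point, scale):
    """Zero-point shift followed by scaling, as described above."""
    shifted = operand - zero_point   # e.g. LeftOperand - zp_L
    scaled = shifted * scale         # e.g. LeftOperand_shifted * LeftOperand_Scale
    return scaled


# Illustrative values only; they are not taken from the hardware configuration.
left_scaled = preprocess_operand(operand=130, zero_point=128, scale=77)
right_scaled = preprocess_operand(operand=100, zero_point=128, scale=3)
print(left_scaled, right_scaled)  # 154 -84
```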

Operation Selection

Depending on EltWise_type, one of the following is applied:

Quantization

The final result is quantized using fp_cast in the Tail Block:

\[output = \frac{result \times quant\_scale}{2^{fp\_cast}}\]

fp_cast is mode-dependent:

  • 10 bits for Add/Sub/Mul

  • 16 bits for Sigmoid/Tanh
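The quantization step can be sketched as a multiply followed by a right shift. The rounding behaviour is an assumption: this sketch truncates with an arithmetic shift (floor division by \(2^{fp\_cast}\)), and it does not model any saturation to the output width, since neither is specified here. The mode strings and the name `quantize` are hypothetical.

```python
# fp_cast bit counts per mode, as listed above. The string keys are
# hypothetical labels for this sketch.
FP_CAST_BITS = {
    "add": 10,
    "sub": 10,
    "mul": 10,
    "sigmoid": 16,
    "tanh": 16,
}


def quantize(result, quant_scale, mode):
    """Compute (result * quant_scale) / 2**fp_cast with floor rounding (assumed)."""
    fp_cast = FP_CAST_BITS[mode]
    return (result * quant_scale) >> fp_cast


print(quantize(154, 2048, "add"))   # 308
print(quantize(154, 2048, "tanh"))  # 4
```

With floor rounding, small negative results map to -1 rather than 0; hardware that rounds to nearest or toward zero would differ on such inputs.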

Element-Wise Block diagram:

../_images/eltwise_op.svg