Gati
####

.. toctree::
    :hidden:

    input_blocks
    sa
    quantization
    dram
    configuration-block
    DRAM-controller
    adder_tree
    mega_pool
    reshape_transpose
    Sigmoid
    Resize
    eltwise_op
    DWP
    dispatcher
    nms
    concat
.. contents:: Table of Contents
   :local:
   :depth: 1

Here's a Bird's eye view picture of the entire CNN architecture:

.. image:: _static/Overview.png
   :width: 100%
   :align: center

Following sections describe what each block in the image above does.

ONNX
****

:term:`ONNX` involves reading the model file on the CPU, transforming (eg, from
:term:`Row Major Order (NCHW)` to :term:`Channel First Layout (NHWC)`),
optimizing (eg, operator fusion), reading images from the user and trasmitting
it to the FPGA. This process happens exclusively on the CPU (:term:`RK3399`).

CPU <-> FPGA
************

Vaaman has the arrangement:

.. image:: _static/vaaman-arch.svg
   :width: 80%
   :align: center

Communication b/w the CPU and FPGA are carried out by the `Rah
<https://github.com/bojle/macacetamol>`_ library. Rah abstracts the underlying
:term:`MIPI` interface. 

Input Blocks
************

The input block includes the blocks that read (in most cases) from the DRAM
and bring data to the Systolic array. This includes:

1. Inputs
2. Weights
3. Biases
4. Partial Sums (Accumulants)

Please see :ref:`input_blocks` for more information.

Systolic Array
**************

Gati currently assumes to have 8 units 9x8 weight stationary systolic
array. Each of these units is called a compute engine. A compute engine
is a 2D grid of processing elements arranged in 9 rows and 8 columns.
our choice of 9 rows is because of filter size of VGG16, i.e., 3x3 -
having a compute engine that is coherent in size with filter size
simplifies the dataflow design; however this could be extended to other
filter sizes. each 3x3 filter here can be visualized as a column of 9
elements. Thus all 9 weights of a filter can be exactly fit to compute
engine’s column. in 8 columns of compute engine 8 unique filters can be
pre-loaded. so, in each of 9x8, first 8 filters are loaded, respective
to the engine. After completion of loading weights, each compute engine
is set to accept inputs. 8 engines in-parallel accept first 8 channels.
partial-sums are collected (and added) before passing to the tail
blocks. Tail blocks apply activation functions (e.g. relu), dropout, and
perform operations like downsampling (e.g. maxpooling); in some cases
(transform to row-major format). Finally, the data is staged in FIFOs to
be written back to DRAM.

Systolic Array here is combination of one or many compute engines.
current version of SA assumes a weight stationary Processing element for
convolution layers and output stationary for fully connected layers.
configuration block instructs to switch weight stationary to output
stationary. exploring other dataflows (e.g. row stationary) for
convolution layers is a future work.

Refer to :ref:`sa` for more info.

Adder Tree
==========

Refer to :ref:`adder_tree`

Output Block
************

.. TODO
   a good diagram here would be very nice

TODO

Tail Blocks
***********

.. TODO:
   These sections

BatchNorm
=========

:term:`BatchNorm` is a weighted (4 different weights: mean, var, alpha and beta) block just like
the Bias block except the weights are tensors equal in dimension to the previous
layer. Batchnorm requires a multiplication and a division of input (x) with
constants.  This type of operation can usually be fused into previous
convolution layers thus reducing the need for a hard-implementation. For eg, an
Ofmap of size (96,7,7) would need a batchnorm of dimension (96, 4).


ReLU
====

:term:`Relu` is a simple piecewise activation function.

.. image:: _static/Activation.png
   :width: 50%
   :align: center

ReLU is implemented as a pipelined block within the hardware accelerator. It
processes the output of the convolution operation directly in the pipeline,
eliminating the need to write convolution outputs to DRAM and read them back for
ReLU computation. This approach minimizes time penalties associated with memory
access, ensuring higher efficiency and faster data processing.

Bias
====

Bias is scalar addition operation of a constant with incoming value. 

.. image:: _static/Bias_Addition_1.png
   :width: 40%
   :align: center

The bias addition is implemented as a pipelined operation within the hardware
accelerator. The bias values, fetched from DRAM by a dedicated bias controller,
are added directly to the convolution outputs in the pipeline, ensuring
efficient and seamless data processing without additional memory access
overhead.

Element Wise Operations
***********************

The element_wise_op block includes multiple operations that can be done on individual inputs. 
The current supported element wise operations include:

1. Addition 
2. Subtraction
3. multiplication 
4. Sigmoid/Tanh (Refer to :ref:`Sigmoid` for more info.)

For more info on Element wise operation megablock, refer :ref:`eltwise_op`


Resize Operator
**************

Refer :ref:`Resize`.

Quantization
============

.. image:: _static/Quantization1.png
   :width: 70%
   :align: center

:term:`Quantization` is needed because partial sums from the SA are the result
of MAC of multiple 8-bit elements which results in a number that does not
fit in 8 bits. This block makes a PS of larger bit-width fit in 8 bits. 

Refer to :ref:`quantization` for more info.

Pooling Network
***************

Pool Movement
=============

:term:`Pooling` can be understood as two tasks: movement and action.
The movement has parameters: window size, stride and padding that dictate how
big the kernel is and how it should be moved across the Ifmap. Action is
what has to be done to the values in the kernel. Commonly found actions
are Max and Average which gives the name of two popular pool layers: maxpool and
average pool.

Following image shows the pooling network:

.. image:: _static/Generalized_Pool.png
   :width: 30%
   :align: center

The action block can be replaced by any action while leaving the movement
(everything other than action) untouched. 

Assume a pool of window size (KW, KH), stride (S) and padding (P). Movement works thusly:

1. Input I (a scalar value) arrives out from the output fifo 2 into the action
   block.

2. The action block (discussed later) emits another scale value (after some
   cycles) and stores into F1.

3. Once an entire row has been processed, F1 should be filled with some elements
   and F2 should be empty.

4. For second and all subsequent rows (till KH), values from action are sent
   to F2

5. Once a value enters F2, one value from both fifos F1 and F2 (in the diagram,
   the values a1 and b1) are sent to the second action block which runs the
   action on it.

6. Value from this action block is written back into F1 if the current row is
   not the last row, else it is sent out from the pooling network.

Pool Actions
============

Max
---

The max operation takes max b/w two values at a time and stores it 
in a register to use the same value for next comparison. Initially,
the value of reg would be 0. This operation is carried out KW times,
then the value of reg is emitted out of the Max (action) block.

Average
-------

Average b/w N elements requires division by N (a variable) which is not very
convenient on the FPGA. Average of a N element array can be cheaply calculated
by calculating average of 2 values at a time then averaging these averages. This
results in a tree like structure (as represented in lower right corner of the
image). Moreover, division by 2 is simply a right shift by 1.

.. TODO
   add running_averag script to vaaman-vgg-benchmarks

Consider a window size of 6. We need to take 4 averages to calculate an average
of 6 elements. Average block works thusly:

1. Avg of i1 and i2 is calculated (a1) and push to a fifo. In subsequent
   cycles, average of i3 and i4 is calculated (a2) and also pushed to the fifo. 

2. If the fifo has 2 values, average b/w the two is taken and pushed in the
   fifo. 

3. This is done till there is only one value left in the fifo. This is the
   average.

For a odd-numbered window size, say 5, nothing changes except we only have to
take one less average. The extra element is pushed as is in the fifo.

Right shift by 2 of a integer divides it but gets rid of the decimal part (.5)
which may cause a loss in precision. Empirical evaluation shows that the loss
occured is 0.5 to 1.0% of the original which should be acceptable.

Transpose
*********

See: :ref:`reshape_transpose`

DRAM
****

.. TODO
   add an image depicting the complete layout of memory

In the current setting Vaaman's FPGA (:term:`Trion120`) has a discrete DRAM
attached to it. This is not shared with the CPU (:term:`RK3399`). DRAM is used
to store different types of data in different layouts. These include:

1. Inputs (images)
2. Outputs (what becomes the inputs to next layers)
3. Weights
4. Accumulants (partial sums b/w iterations that are not yet outputs)

The architecture substantially affects the layout of the DRAM. So, one layout
would not work for every model. Weights are read-only i.e. once written in the
DRAM at the beginning of the computation, they are only read by the FPGA, never
written to. Therefore weight data can be transposed in expected order by the
CPU, and sent to the FPGA. Inputs/Outputs are read/write, therefore
transpositions on them happens once, at the start, on CPU and later by the FPGA.

For concrete details on the layout and access pattern, see :ref:`ddr_layout_and_access`.

For implementation of memory controller, see :ref:`DRAM_controller`

Configuration Block/Bus master controller
*****************************************

Configuration block stores required configurations for each layers and
programs input, output, and tail blocks ahead of time so that they can
immediately switch to new settings after completion of the current layer
and start processing next layer. 

Each table above shows a config packet of 256 bits. Understand these
packets as instructions where the instruction width is 256. None of the
above configs currently take all 256 bits, this is not a problem, these
least significant remaining bits can be assumed to be reserved.

The Bus Master Controller facilitates communication between a master 
device and multiple slave devices within a system. It transmits the 
instruction set from the config block to different compute block.

For implementation details of config block/Bus master controller, see
:ref:`configuration_block`
or implementation of memory controller, see :ref:`DRAM_controller`

.. include:: instructions/inst.rst

.. _flattening:

Flattening
**********

In a network, when the inputs to FC are the outputs of a convolution operation,
a "flattening" operation needs to be performed on the outputs. Reason is the
order in which the SA that carries out convolution outputs to the DRAM. If, for 
example, a 9x4x4 SA is used, the outputs a NHW4C4, i.e. first four elements of 
channel one, first four elements of channel two, and so on till channel four after
which next four elements of channel one and this continues. FC expects inputs
in the form of 1xN where N is all the elements in row-major order. To deal with
this, when reading NHW4C4 outputs of a convolution, the flattening controller
is used to flatten it to a 1xN so that it can be input to FC. Note that the
flatten controller need not be always enabled, as any FC layers following an FC
layer will have their inputs already flattened. When and when not to enable
flattening is conveniently provided by the software through the 'Flatten' field
in the FC instruction.

Following image shows the flattening process:

.. image:: _static/Flattening.png
   :width: 60%
   :align: center


FC inputs obtained from DDR are storred in local on-chip memory (BRAM). Here 'M'
represents the number of columns in each systolic arrays and 'N' represents the
number of systolic engine.

The flattening controller works thusly:

Each bank is supposed to house a single channel. The NHW4C4 outputs are read,
and split into sections of 4 each. These 4 values are then put into their
respective banks. Then the banks are read out, one after another in serial
fashion which flattens them. 

FC Engine
*********

The FC engine very similar to the SA except for one small difference is the 
dataflow. It is a grid very much like the SA but it works in 'output stationary'
manner i.e. what is being 'stored' inside the PEs is not weights (like Conv
SA) but outputs. Both 'weights' and 'inputs' are continuosly fed to the PE grid.

PE Grid
=======

The inputs to the FC engine would be of the form:

.. code::

   1xN NxM

The output would be of dim:

.. code::

   1xM

`1xN` is the inputs and `NxM` is the weight. 

Based on this the shape of the PE grid can be figured out. 

Since, input is 1 dimensional, we only need one row in the grid. So, the
size should be `1xP`. 

What should `P` be? The DRAM can return a finite number of bytes (elements) (32
bytes for vaaman) in a cycle, so `P` cannot exceed the DRAM bandwidth. The
minimum can be decided based on resource constraints. A good configuration
(which is being used in Gati as of v0.2.4) is 1x32. 

How FC Engine Functions
=======================

.. figure:: _static/FC_Engine.png
   :width: 80%
   :align: center

   *FC Engine - High-Level Architecture*

The weight matrix (`NxM`) is continuosly sent from the weight fifos (that the FC
engine shares with the SA). The inputs are fully stored on-chip in the input
fifos and also sent continuosly. The FC engine processes `P` columns of the
weight matrix at a time. This means that an FC operations takes ~ `Nx(M/P)`
cycles to complete. The weights are arranged and aligned in the order of `P`.
If `M` is not evenly divisible by `P`, extra columns of only zeros are padded
to the weight matrix by the compiler. The outputs are accumulated in the 
accumulator registers. At the end of each iteration, these outputs are sent to
the tail block to be processed further and ultimately end up in the DRAM via
the output block. The process starting from tail block is the same as that of 
conv outputs after vector addition.

Decoding the FC instruction
===========================

The instruction consists of the usual opcodes, input start/end address, weight 
start/end address, and sizes for inputs/weights. 

Explanation of less-standard fields follows:

#. **Flatten**
    If the preceding layer of this FC is a convolution, its outputs (present in
    NHW4C4 order in DRAM) need to be re-arranged in row-major-FC-engine-friendly
    order. This bit signals the config block to enable the :ref:`flattening`
    controller.

#. **ImageDim**
    If flatten is enabled this field is the product of ROW and COLUMN field of
    the previous convolution operation.

#. **Vec2MatCols**
    The input to FC is a vector of size 1xM, the flattening controllers has
    `P` fifos. Vec2MatCols gives the number of elements belonging in each
    fifo aligned to word size.

    .. code::

       vec2matcols = align(M/P, WORD_SIZE)

    For an input tensor, which is the output of a convolution of size
    12x43x43 (CHW), the alignment would be thusly:

    .. code::

      vec2matcols = align(align(43*43, WORD_SIZE) * align(12, WORD_SIZE), WORD_SIZE)


DRAM write protocol
*******************

Storing data on the DRAM needs two main things: data and address. In Gati,
the instruction blob (i.e. set of all instructions), and all the weights (and
biases)
are first stored in the DRAM. Consider a neural net with 5 layers, each layer
having a weight and a bias. In this case, there are total 10 (weight + bias) + 1
(instructions) distinct pieces of data. The software is responsible for figuring
out where each distinct piece of data should be stored i.e. the addresses. Where
to store these distinct packets is communicated to the FPGA through a protocol
called DWP (DRAM Write Protocol). 

Here's the protocol: 

.. image:: _static/dwp_packet.png
  :width: 110%
  :align: center

It's a simple packet-based protocol with these fields: SOP (Start Of Packet), DS
(Data Size), and DRAM Address, followed by variable length data (payload). SOP
differentiates two packets.  DS is the size (in bytes) of the following payload.
Address is where the payload should be written in the DRAM. The DWP decoder on
the FPGA interprets these packets and write the data into DRAM. DWP is a 32bit
protocol as the DRAM operates on boundries of 32. All addresses are aligned to
this constraint by the software. 

.. TODO: memory segmentation diagram

For implementation of DWP, see :ref:`DWP`

Dispatch Block 
**************

Once compute is complete, the results need to be sent back to the CPU. 
The Dispatcher takes care of that. All megablocks have a output instruction
sent along with it. This is because all outputs are centrally managed by the
output block. The instruction is really meant for the output block. It contains,
among addresses and sizes, a flag to indicate whether computed outputs need to
be sent to the CPU. This is the `dispatch` flag. If an output instruction
has this enabled, the outputs shall be dispatched back to the CPU. The software
provides a way to enabled dispatch on any megablock layers of the model during
compilation. See user manual of software for more details. 

As a result, the dispatcher is flexible in that it can provide the final results 
after computation has ended, or be used for debugging intermidiate layers.

For more Abstract view of Dispatcher, see :ref:`dispatcher`

NMS
***

For more info see :ref:`nms`

CONCAT
******

For info on cancat see :ref:`concat`