This project explores the hardware acceleration of 2D image convolution using Xilinx Vitis HLS, Vivado, and a PYNQ-based Python runtime. The goal is to study how different hardware architectures affect latency, throughput, and resource utilization, then compare the final streaming accelerator against a software baseline across multiple image sizes and kernels.
The final system is a DMA-connected streaming convolution accelerator with AXI-Stream input/output, AXI-Lite control, and a Python benchmarking pipeline. In addition to HLS design-space exploration, the project now includes tiled execution for large images and a lightweight RGB extension built by reusing the grayscale pipeline across channels.
The project includes six convolution implementations:
- conv_baseline – naive nested-loop convolution
- conv_pipeline – pipelined version of the baseline
- conv_linebuffer – line-buffer-based convolution for improved data reuse
- conv_dataflow – sliding-window + dataflow architecture
- conv_dataflow_stream – AXI-streamed hardware accelerator with 32-bit packed pixel transfers and `uint8_t` output formatting
- conv_dataflow_stream_int – streamed verification version with `int` output for debugging and validation
The final streamed design is integrated into a Vivado block design with:
- Zynq UltraScale+ Processing System
- AXI DMA
- custom HLS convolution IP
The software side uses a Jupyter notebook to:
- configure the hardware accelerator
- send image data over DMA
- receive and unpack output
- compare against a software reference
- benchmark software vs hardware performance
- evaluate tiled execution for oversized images
- benchmark RGB execution through three grayscale passes
This project takes a software image-processing operation, 2D convolution, and implements it as a hardware accelerator using HLS design-space exploration.
Vivado block design showing the Zynq processing system, AXI DMA, and streamed convolution IP.
- Implemented multiple HLS convolution architectures
- Used compile-time constants for maximum image size and kernel size to improve HLS optimization
- Built a streaming convolution IP with:
- AXI-Stream input/output
- AXI-Lite control interface
- 32-bit packed input/output words
- Integrated the HLS IP into a Vivado design with AXI DMA
- Generated a `.bit` and `.hwh` for runtime execution from Python using PYNQ
The deployed accelerator was synthesized with:
- maximum input size: 512 × 512
- kernel size: 3 × 3
Although the code structure can be adapted for other kernel sizes, changing the kernel size requires updating the compile-time constants in the HLS header and regenerating the synthesized design and bitstream.
```
.
├── README.md
├── notebooks
│   └── stream_convolution_analysis.ipynb
├── results
│   ├── plots
│   │   ├── grayscale
│   │   ├── hls
│   │   └── tiling
│   └── tables
│       ├── grayscale
│       ├── hls
│       ├── rgb
│       └── tiling
├── source
│   ├── conv_baseline.cpp
│   ├── conv_pipeline.cpp
│   ├── conv_linebuffer.cpp
│   ├── conv_dataflow.cpp
│   ├── conv_dataflow_stream.cpp
│   └── conv_kernels.h
├── testbench
│   └── conv_kernels_tb.cpp
└── vivado
    ├── fpga_2d_convolution.bit
    ├── fpga_2d_convolution.hwh
    ├── fpga_2d_convolution.pdf
    ├── fpga_2d_convolution.png
    └── fpga_2d_convolution.tcl
```
The benchmarking pipeline uses randomly generated grayscale images created with NumPy.
Square image sizes are evaluated in powers of two:
- 4×4
- 8×8
- 16×16
- 32×32
- 64×64
- 128×128
- 256×256
- 512×512
- 1024×1024
- 2048×2048
- 4096×4096
- 8192×8192
Software execution and untiled hardware execution are performed only up to 512×512, which matches the deployed hardware’s maximum single-pass input size. Larger images are evaluated using tiled execution.
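A minimal sketch of how such benchmark inputs could be generated with NumPy (the notebook's exact generation code may differ):

```python
import numpy as np

# Square grayscale test images at power-of-two sizes, values in [0, 255].
sizes = [2 ** n for n in range(2, 14)]  # 4x4 up to 8192x8192
images = {s: np.random.randint(0, 256, size=(s, s), dtype=np.uint8) for s in sizes}
```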
The following 3×3 kernels are used:

Sobel X

```
[-1, 0, 1]
[-2, 0, 2]
[-1, 0, 1]
```

Sobel Y

```
[-1, -2, -1]
[ 0,  0,  0]
[ 1,  2,  1]
```

Gaussian Blur

```
[1, 2, 1]
[2, 4, 2]
[1, 2, 1]
```

The runtime uses an integer normalization parameter, `norm_shift`, which right-shifts the convolution sum by a fixed number of bits after accumulation. This provides efficient integer-only normalization in both software and hardware. The Gaussian coefficients sum to 16, so a shift of 4 divides by 16; the Sobel kernels require no normalization.

- Sobel X: `norm_shift = 0`
- Sobel Y: `norm_shift = 0`
- Gaussian Blur: `norm_shift = 4`
Correctness was verified at multiple levels.
The C++ testbench:
- Generates random input images.
- Computes a golden reference convolution.
- Runs each HLS design.
- Compares outputs element-by-element.
This verified correctness for:
- Baseline
- Pipeline
- Linebuffer
- Dataflow
- Streamed `int` design
- Streamed `uint8_t` design
The Jupyter notebook:
- Computes a software reference result in Python.
- Runs the hardware accelerator through DMA.
- Unpacks hardware output.
- Checks exact equality using `np.array_equal(...)`.
The software reference matches the hardware behavior by applying:
- Normalization shift
- Clamping to `[0, 255]`
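A minimal sketch of such a bit-exact integer reference, assuming zero-padded borders for illustration (the deployed design's border handling may differ):

```python
import numpy as np

def conv3x3_reference(img, kernel, norm_shift):
    """Integer 3x3 convolution followed by the accelerator's shift-and-clamp step."""
    h, w = img.shape
    padded = np.pad(img.astype(np.int32), 1)      # zero-padding is an assumption here
    out = np.zeros((h, w), dtype=np.int32)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    out >>= norm_shift                             # integer normalization
    return np.clip(out, 0, 255).astype(np.uint8)   # clamp to [0, 255]

# Exact-equality check against hardware output, as done in the notebook:
# assert np.array_equal(conv3x3_reference(img, gaussian_kernel, 4), hw_output)
```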
For tiled execution, tiled outputs were compared against untiled hardware outputs for image sizes within the untiled hardware limit. Where applicable, the software reference was also used to confirm correctness of tile extraction, cropping, and stitching.
The following table summarizes the HLS synthesis results for the 512×512 case.
| Design | Data Type | II | Latency (Cycles) | Latency (ns) | BRAM | DSP | FF | LUT |
|---|---|---|---|---|---|---|---|---|
| conv_baseline | int | 9 | 2340922 | 2.34E+07 | 4 | 9 | 3561 | 5327 |
| conv_pipeline | int | 9 | 2340921 | 2.34E+07 | 4 | 10 | 3848 | 5119 |
| conv_linebuffer | int | 3 | 786452 | 7.87E+06 | 6 | 20 | 4031 | 5205 |
| conv_dataflow | int | 1 | 262164 | 2.62E+06 | 6 | 30 | 3288 | 4119 |
| conv_dataflow_stream | uint8_t | 1 | 265731 | 2.66E+06 | 2 | 21 | 1459 | 2972 |
| conv_dataflow_stream_int | int | 1 | 265730 | 2.66E+06 | 2 | 18 | 1361 | 2537 |
Area-latency tradeoff across HLS convolution designs.
Area estimate note: The area-versus-latency plot uses the course area model, where
Estimated Area = max(LUT, FF) + 100 × DSP
This approximation treats LUTs and FFs as paired logic resources and models each DSP as equivalent to 100 logic blocks.
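As a worked example using the table above, the deployed conv_dataflow_stream design scores:

```python
# Course area model applied to conv_dataflow_stream (values from the table above)
lut, ff, dsp = 2972, 1459, 21
estimated_area = max(lut, ff) + 100 * dsp   # 2972 + 2100 = 5072
```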
- The dataflow-style architectures achieved the best initiation interval, reaching II = 1.
- The linebuffer design significantly reduced latency compared to the baseline and pipelined versions.
- The final streaming architecture maintained near-dataflow latency while integrating AXI streaming and DMA compatibility.
- The deployed `conv_dataflow_stream` design was selected because it combines strong throughput with deployable system-level interfaces.
The notebook benchmarks:
- Software convolution time
- Total hardware time
- Hardware compute/DMA time
- Packing time
- Unpacking time
- Speedup
- Overhead ratio
- Compute ratio
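A minimal sketch of how these derived metrics could be computed from the measured times; the definitions below are plausible assumptions for illustration, and the notebook's exact formulas may differ:

```python
def derived_metrics(sw_time, hw_total, hw_compute_dma, pack_time, unpack_time):
    """Derive benchmark ratios from measured times (assumed definitions)."""
    speedup = sw_time / hw_total                            # software vs end-to-end hardware
    compute_ratio = hw_compute_dma / hw_total               # fraction spent in FPGA compute + DMA
    overhead_ratio = (pack_time + unpack_time) / hw_total   # software-side data handling
    return speedup, compute_ratio, overhead_ratio
```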
| Kernel | Size | SW Time (s) | HW Time (s) | Speedup |
|---|---|---|---|---|
| Sobel X | 8×8 | 0.00278 | 0.002378 | 1.15× |
| Sobel Y | 8×8 | 0.00268 | 0.001416 | 1.89× |
| Gaussian | 8×8 | 0.00267 | 0.001481 | 1.80× |
| Sobel X | 512×512 | 18.8096 | 0.007803 | 2410× |
| Sobel Y | 512×512 | 18.8151 | 0.007313 | 2573× |
| Gaussian | 512×512 | 18.9801 | 0.007276 | 2609× |
The complete benchmark results are stored in:
results/tables/grayscale/grayscale_base_results.csv
Hardware speedup versus image size for Gaussian grayscale benchmarking.
- For all three kernels, hardware becomes advantageous by 8×8.
- Speedup grows rapidly with image size.
- At 512×512, end-to-end speedup reaches roughly 2400×–2610×.
- The three kernels show very similar runtime behavior at larger sizes.
Even in the untiled case, end-to-end hardware runtime still includes substantial software-side overhead from:
- Packing pixels into 32-bit stream words.
- Unpacking output back into `uint8_t`.
- DMA transfer orchestration.
The average untiled hardware compute ratios were approximately:
- Sobel X: 0.278792
- Sobel Y: 0.247149
- Gaussian: 0.280351
This shows that even in the untiled case, a substantial fraction of end-to-end runtime is spent outside raw FPGA computation, confirming that system-level performance depends not only on the convolution engine itself, but also on the surrounding data movement path.
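A minimal sketch of the kind of packing and unpacking this overhead refers to, assuming four `uint8_t` pixels per 32-bit stream word in native (little-endian) byte order; the notebook's exact packing code and the hardware's expected byte order may differ:

```python
import numpy as np

def pack_pixels(img_u8):
    """Pack a uint8 image into 32-bit words, 4 pixels per word."""
    flat = img_u8.reshape(-1)
    assert flat.size % 4 == 0, "pixel count must be a multiple of 4"
    return flat.view(np.uint32)            # reinterpret each group of 4 bytes as one word

def unpack_pixels(words_u32, shape):
    """Reverse the packing: 32-bit words back to a uint8 image."""
    return words_u32.view(np.uint8).reshape(shape)
```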
Because the deployed hardware supports a maximum single-pass input size of 512×512, larger images are processed using tiled execution.
The initial tile sizes are:
- 32
- 64
- 128
- 256
- 512
To avoid excessive tile counts for large images, smaller tiles are progressively evicted as image size increases. In the final benchmarking policy, a tile size is evicted once the image side length exceeds 16 times that tile size, while preserving at least the three largest candidate tiles.
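A minimal sketch of that eviction policy; the helper name and exact fallback are illustrative and may differ from the notebook's implementation:

```python
def candidate_tiles(image_side, tiles=(32, 64, 128, 256, 512), keep_at_least=3):
    """Drop tile sizes once the image side exceeds 16x the tile size."""
    kept = [t for t in tiles if image_side <= 16 * t]
    if len(kept) < keep_at_least:
        kept = sorted(tiles)[-keep_at_least:]   # always preserve the largest candidates
    return kept

# e.g. candidate_tiles(8192) keeps only 512 by the threshold,
# then backfills to [128, 256, 512] to preserve three candidates.
```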
Best tile size versus image size for Gaussian tiled execution.
Compute ratio versus overhead ratio across tile sizes for Gaussian tiled execution.
- When an image fits within the 512×512 hardware limit, untiled execution is always fastest.
- For tiled execution, the best tile is always the largest tile tested.
- For image sizes beyond 512×512, the best tile is consistently 512.
- Larger tiles perform better because they reduce the number of tile-level hardware invocations, DMA transfers, and software-side packing/unpacking operations.
Average tile-size runtime must be interpreted carefully, because smaller tile sizes are progressively evicted and therefore are often measured only on smaller workloads. For this reason, the most meaningful tiling trends are:
- best tile versus image size
- compute ratio versus overhead ratio
- scaling at large image sizes
Tiling enables the accelerator to scale successfully beyond the single-pass hardware limit, including image sizes up to 8192×8192. At those large sizes, larger tiles remain preferable because they amortize runtime overhead more effectively.
A lightweight RGB extension was implemented by applying the validated grayscale pipeline independently to the red, green, and blue channels, then recombining the filtered channels.
This approach avoids redesigning the accelerator as a native multi-channel kernel and instead reuses the existing grayscale hardware path three times.
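A minimal sketch of this per-channel reuse, assuming a hypothetical run_grayscale_hw() wrapper around the existing hardware path:

```python
import numpy as np

def run_rgb_hw(rgb_img, kernel, norm_shift, run_grayscale_hw):
    """Filter an RGB image by running the grayscale accelerator once per channel."""
    channels = [run_grayscale_hw(rgb_img[:, :, c], kernel, norm_shift) for c in range(3)]
    return np.dstack(channels)   # recombine the filtered R, G, B planes
```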
RGB benchmarking was performed at 512×512 without tiling.
Representative results:
| Kernel | SW Time (s) | HW Time (s) | Speedup |
|---|---|---|---|
| Sobel X | 56.5264 | 0.022309 | 2534× |
| Sobel Y | 56.5586 | 0.022377 | 2528× |
| Gaussian | 57.3790 | 0.022288 | 2574× |
As expected, RGB runtime is approximately three times the grayscale cost because the grayscale pipeline is reused once per channel.
The most important result from this project is:
The hardware convolution pipeline is efficient, but end-to-end performance is strongly influenced by software-side data handling overhead.
The streamed accelerator achieves high throughput because of its pipelined II = 1 architecture, but overall performance is still affected by:
- Packing pixels into 32-bit transfer words.
- Unpacking hardware output back into `uint8_t`.
- DMA transfer overhead.
- Repeated invocation overhead during tiling.
This reflects an important system-design lesson:
Accelerator performance depends not only on computation, but also on the cost of moving and formatting data between hardware and software.
- Open Vitis HLS.
- Create a new project.
- Select target device `xczu3eg-sfvc784-2-e`.
- Add the following source files from `source/`: `conv_baseline.cpp`, `conv_pipeline.cpp`, `conv_linebuffer.cpp`, `conv_dataflow.cpp`, `conv_dataflow_stream.cpp`, `conv_kernels.h`
- Add the testbench: `testbench/conv_kernels_tb.cpp`
- Select the top function you want to synthesize: `conv_baseline`, `conv_pipeline`, `conv_linebuffer`, `conv_dataflow`, `conv_dataflow_stream`, or `conv_dataflow_stream_int`
- Run:
  - C Simulation for correctness
  - C Synthesis for performance/resource results
The provided bitstream was generated for the Zynq UltraScale+ AUP-ZU3 4GB Development Board.
Requirements:
- PYNQ-enabled environment
- bitstream (`.bit`) and hardware handoff (`.hwh`) file in the `vivado/` directory

1. Open `notebooks/stream_convolution_analysis.ipynb`.
2. Run the notebook cells in order.
Important: Keep the repository folder structure and relative file paths the same as provided. Otherwise, notebook file paths must be updated manually.
The provided bitstream is configured for a 3×3 convolution kernel.
While the HLS code is parameterized over kernel size, changing the kernel size requires:
- Modifying the kernel size in the C++ source by updating the compile-time constants in the HLS header.
- Re-running HLS C-synthesis.
- Regenerating the Vivado design and bitstream.
The current overlay will only work correctly with 3×3 kernels.
The notebook:
- Loads the overlay
- Configures the HLS IP through AXI-Lite
- Sends image data using DMA
- Receives hardware output
- Unpacks the 32-bit stream data
- Compares output against software
- Benchmarks across multiple image sizes and kernels
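A minimal sketch of what this runtime flow could look like with the PYNQ API; the IP instance names are hypothetical (the notebook uses the names generated by the Vivado design), and AXI-Lite register offsets must come from the generated driver:

```python
import numpy as np
from pynq import Overlay, allocate

rows, cols = 512, 512
img = np.random.randint(0, 256, (rows, cols), dtype=np.uint8)

ol = Overlay("vivado/fpga_2d_convolution.bit")   # matching .hwh is found automatically
dma = ol.axi_dma_0                               # hypothetical instance name
conv_ip = ol.conv_dataflow_stream_0              # hypothetical instance name

# Configure the accelerator over AXI-Lite; offsets are design-specific,
# e.g. via conv_ip.register_map or conv_ip.write(offset, value).

in_buf = allocate(shape=(rows * cols // 4,), dtype=np.uint32)
out_buf = allocate(shape=(rows * cols // 4,), dtype=np.uint32)
in_buf[:] = img.reshape(-1).view(np.uint32)      # 4 packed pixels per 32-bit word

dma.sendchannel.transfer(in_buf)
dma.recvchannel.transfer(out_buf)
dma.sendchannel.wait()
dma.recvchannel.wait()

result = out_buf.view(np.uint8).reshape(rows, cols)   # unpack back to uint8 pixels
```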
The project provides:
- HLS synthesis data in CSV form
- Software vs Hardware benchmark data in CSV form
- Tiled benchmark summaries
- RGB benchmark summaries
- Saved plots in `results/plots/`
- Saved tables in `results/tables/`
- Vivado schematic PDF
- Bitstream (`.bit`) and hardware handoff (`.hwh`) files for hardware execution
The project now includes:
- HLS architecture exploration
- Vivado DMA integration
- Python runtime execution
- Correctness verification
- Software vs Hardware benchmarking
- Tiled execution for large images
- RGB extension benchmarking
The design is functionally correct, deployable, and demonstrates very large speedup over the software baseline once workloads are large enough to amortize software-side overhead.