Coarse-grained FPGA overlay architectures paired with general purpose processors offer a number of advantages for general purpose hardware acceleration because of software-like programmability, fast compilation, application portability, and improved design productivity. However, the area overheads of these overlays, and in particular architectures with island-style interconnect, negate many of these advantages, preventing their use in practical FPGA-based systems. Crucially, the interconnect flexibility provided by these overlay architectures is normally over-provisioned for accelerators based on feed-forward pipelined datapaths, which in many cases have the general shape of inverted cones. We propose DeCO, a cone shaped cluster of FUs utilizing a simple linear interconnect between them. This reduces the area overheads for implementing compute kernels extracted from compute-intensive applications represented as directed acyclic dataflow graphs, while still allowing high data throughput. We perform design space exploration by modeling programmability overhead as a function of overlay design parameters, and compare to the programmability overhead of island-style overlays. We observe 87% savings in LUT requirements using the proposed approach compared to DSP block based island-style overlays. Our experimental evaluation shows that the proposed overlay exhibits an achievable frequency of 395 MHz, close to the DSP theoretical limit on the Xilinx Zynq. We also present an automated tool flow that provides a rapid and vendor-independent mapping of the high level compute kernel code to the proposed overlay.
|Original language||English (US)|
|Title of host publication||Proceedings - 24th IEEE International Symposium on Field-Programmable Custom Computing Machines, FCCM 2016|
|Publisher||Institute of Electrical and Electronics Engineers Inc.|
|Number of pages||8|
|State||Published - Aug 16 2016|