Resumen
In recent years, computational capacity of single Field Programmable Gate Array (FPGA) devices as well as their versatility have increased significantly. Adding to that fact, the High Level Synthesis frameworks allowing to program such processors in a high-level language like C++, makes modern FPGA devices a serious candidate as building blocks of a general-purpose High Performance Computing solution. In this contribution we describe benchmarks which we performed using a kernel from the Lattice QCD code, a highly compute-demanding HPC academic code for elementary particle simulations on the newest device from Xilinx, the U250 accelerator card. We describe the architecture of our solution and benchmark its performance on a single FPGA device running in two modes: using either external or embedded memory. We discuss both approaches in detail and provide assessment for the necessary memory throughput and the minimal amount of resources needed to deliver optimal performance depending on the available hardware. Our considerations can be used as guidelines for estimating the performance of some larger, manynode systems.