Hardware/Software Integration for FPGA-based All-Pairs Shortest-Paths

Uday Bondhugula (1), Ananth Devulapalli (2), James Dinan (1), Joseph Fernando (2),
Pete Wyckoff (3), Eric Stahlberg (3), and P. Sadayappan (1)

(1) Department of Computer Science & Engineering
    The Ohio State University
    Columbus, OH 43210
    {bondhugu,dinan,saday}@cse.ohio-state.edu

(2) Ohio Supercomputer Center, Springfield
    1 South Limestone St., Suite 310
    Springfield, OH 45502
    {ananth,fernando}@osc.edu

(3) Ohio Supercomputer Center
    1224 Kinnear Road
    Columbus, OH 43212
    {pw,eas}@osc.edu
Abstract
Field-Programmable Gate Arrays (FPGAs) are being employed in high performance computing systems owing to their potential to accelerate a wide variety of long-running routines. Parallel FPGA-based designs often yield a very high speedup. Applications using these designs on reconfigurable supercomputers involve software on the system managing computation on the FPGA. To extract maximum performance from an FPGA design at the application level, it becomes necessary to minimize associated data movement costs on the system. We address this hardware/software integration challenge in the context of the All-Pairs Shortest-Paths (APSP) problem in a directed graph. We employ a parallel FPGA-based design using a blocked algorithm to solve large instances of APSP. With appropriate design choices and optimizations, experimental results on the Cray XD1 show that the FPGA-based implementation sustains an application-level speedup of 15 over an optimized CPU-based implementation.
1. Introduction
Field-Programmable Gate Arrays (FPGAs) have long been used in embedded image and signal processing applications. With rapid advances in modern VLSI technology, FPGAs are becoming increasingly attractive to a much wider audience, in particular, to the High Performance Computing (HPC) community. Modern FPGAs have abundant resources in the form of tens of thousands of Configurable Logic Blocks (CLBs), a large amount of on-chip memory, and growing numbers of other special-purpose resources. High bandwidth to off-chip memory is sustained through parallel memory access using the large number of I/O pins available on-chip. All of these factors allow FPGAs to extract a large amount of parallelism apart from effectively reusing data. Reconfigurability allows very efficient use of available resources tailored to the needs of an application. This makes it possible for custom-designed parallel FPGA implementations to achieve significant speedup over modern general-purpose processors for many long-running routines. This increase in performance has led to the use of FPGAs in HPC systems. Cray and SRC Computers already offer FPGA-based high-performance computer systems that couple general-purpose processors with reconfigurable application accelerators [3, 9].

* This research is supported in part by the Department of Energy's ...

A complete FPGA-based solution on reconfigurable supercomputers like the Cray XD1 involves an FPGA design integrated with a user application running on the system. Limited resources on the FPGA often necessitate a blocked algorithm for the problem. This is particularly true for many linear algebra routines [16, 17]. Large instances of a problem are solved by employing a parallel FPGA design iteratively with a blocked algorithm.
Parallel FPGA designs often yield significant speedup. Therefore, in many cases, it becomes important to orchestrate the movement of subsets of data (tiles) so that benefits are reflected at the application level. For some scientific routines involving double-precision floating-point arithmetic, the performance of current-day FPGAs may not be high enough to make system-side optimizations crucial. However, with current trends in the double-precision floating-point performance of FPGAs [11, 12], FPGA performance will eventually reach a level that makes these optimizations beneficial for all FPGA-based routines.
The all-pairs shortest-paths problem is to find a shortest path between each pair of vertices in a weighted directed graph. The Floyd-Warshall (FW) algorithm used to solve this problem involves nested loop code that exhibits a regular access pattern with significant data dependences. In the context of FW, we propose approaches that solve the following problem: given an FPGA kernel that achieves a high speedup on a small dataset (tile/block), what optimizations on the system side help obtain a high percentage of this speedup at the application level for large problem sizes? The parallel kernel we use is from our previous work [1]. With appropriate design choices and optimizations, we show through experimental results on the Cray XD1 that the application-level speedup for FPGA-based FW improves from 4 to 15. We build a model that accurately captures the performance of the FPGA-based implementation and provides insights into the factors affecting it.
The rest of this paper is organized as follows: In Section 3, we give an overview of the Cray XD1, followed by an overview of the FW algorithm and the FPGA-based FW design developed in [1]. In Sections 4 and 5, we discuss design choices and propose techniques to reduce data movement costs on the system. In Section 6, we present results of our experiments on the Cray XD1 and analyze performance in detail.
2. Related work
The Floyd-Warshall algorithm was first proposed by Robert Floyd [5]. Floyd based his algorithm on a theorem of Warshall [14] that describes how to compute the transitive closure of boolean matrices. Venkataraman et al. proposed a blocked implementation of the algorithm to optimize it for the cache hierarchy of modern processors [13].
Apart from the Floyd-Warshall algorithm, all-pairs shortest-paths (APSP) algorithms with lower time complexity exist. Karger [6] solved undirected APSP with non-negative edge weights in O(Mn + n² log n) time, where n is the number of vertices in the graph and M is the number of edges participating in shortest paths. Zwick [8] obtained an O(n^2.575) APSP algorithm where the dependence of the running time is polynomial in the maximum magnitude of the edge weights. It is thus only effective when the edge weights are integers of small absolute value.
Researchers have recently demonstrated the competitiveness of FPGAs with modern microprocessors for double-precision floating-point arithmetic and linear algebra routines [11, 12, 16, 17]. All of these works propose parallel designs and project performance. Tripp et al. [10] study aspects of hardware/software integration on reconfigurable supercomputers with a traffic simulation FPGA kernel. Our work addresses optimizing data movement, and other issues that often arise in mapping nested loops.
In [1], we proposed a parallel FPGA-based design for FW to process a tile efficiently; the kernel is suitable for accelerated solution of large APSP problem instances using a blocked algorithm [13].
3. Overview

In this section, we give an overview of the Cray XD1 system, the Floyd-Warshall (FW) algorithm, the parallel FPGA-based FW kernel, and the blocked FW algorithm.
3.1 Cray XD1

The Cray XD1 system is composed of multiple chassis, each containing up to six compute blades. Each compute blade contains two single- or dual-core 64-bit AMD Opteron processors, a RapidArray processor which provides two 2 GB/s RapidArray links to the switch fabric, and an application acceleration module [3]. The application acceleration module is an FPGA-based reconfigurable computing module that provides an FPGA complemented with a RapidArray Transport (RT) core providing a programmable clock source, and four banks of Quad Data Rate (QDR) II SRAM.
Figure 1. The Cray XD1 system
3.2 The Floyd-Warshall algorithm
Given a weighted, directed graph G = (V, E) with a weight function w : E → R that maps edges to real-valued weights, we wish to find, for every pair of vertices u, v ∈ V, a shortest (least-weight) path from u to v, where the weight of a path is the sum of the weights of its constituent edges. The output is typically desired in tabular form: the entry in u's row and v's column should be the weight of a shortest path from u to v.
1: for k ← 1, N do
2:   for i ← 1, N do
3:     for j ← 1, N do
4:       d[i, j] ← min(d[i, j], d[i, k] + d[k, j])
5:     end for
6:   end for
7: end for
8: Output: d

Figure 2. The Floyd-Warshall algorithm
The Floyd-Warshall algorithm uses a dynamic programming approach to solve the all-pairs shortest-paths problem on a directed graph [2, 5, 14]. It runs in Θ(|V|³) time.
Let w_ij be the weight of edge (i, j); w_ij is 0 when i = j, and is ∞ when (i, j) ∉ E. Let d^(k)_ij be the weight of a shortest path from vertex i to vertex j for which all intermediate vertices are in the set {1, 2, ..., k}. For k = 0, we have d^(0)_ij = w_ij. A recursive definition from the above formulation is given by:

    d^{(k)}_{ij} =
    \begin{cases}
      w_{ij} & \text{if } k = 0 \\
      \min\bigl( d^{(k-1)}_{ij},\; d^{(k-1)}_{ik} + d^{(k-1)}_{kj} \bigr) & \text{if } k \ge 1
    \end{cases}

The matrix {d^(N)_ij}, 1 ≤ i, j ≤ N, gives the final result. The above recursive definition can be written as a bottom-up procedure as shown in Fig. 2. The code is tight with no elaborate data structures, and so the constant hidden in the Θ-notation is small. Unlike many graph algorithms, the absence of the need to implement any complex abstract data types makes FW a good candidate for acceleration with an FPGA.
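As a concrete illustration of the procedure in Fig. 2, the following C sketch runs the bottom-up computation on a row-major distance matrix with 16-bit weights (the precision used later in our experiments). The function name and the use of UINT16_MAX as ∞ are illustrative assumptions, not part of the FPGA implementation.

    #include <stdint.h>

    #define INF UINT16_MAX   /* "infinity" within the supported 16-bit precision */

    /* Bottom-up Floyd-Warshall on an n x n distance matrix stored in row-major
     * order: d[i*n + j] holds the current best path weight from i to j. */
    static void floyd_warshall(uint16_t *d, int n)
    {
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++) {
                    /* accumulate in 32 bits so INF + INF cannot wrap around */
                    uint32_t via_k = (uint32_t)d[i*n + k] + (uint32_t)d[k*n + j];
                    if (via_k < d[i*n + j])
                        d[i*n + j] = (uint16_t)via_k;
                }
    }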
3.3 Parallel FW design for a blocked algorithm

In this section, we give a brief description of the parallel FPGA-based FW kernel that we designed in [1].
In the FW computation, we have exactly N² data elements, but Θ(N³) computations to perform. Hence, there is high temporal locality that a custom design can exploit. The FW nested loop code has significant data dependences. Extracting parallelism in the presence of these dependences, without data access conflicts and while making maximum use of available FPGA resources, is a major challenge.

Figure 3. Parallel FW kernel architecture
Let d be the distance matrix of a directed graph of B nodes. In the nested loop code shown in Fig. 2, at any iteration k = r of the outer loop, the vectors d[r, ∗] and d[∗, r] update the whole matrix d. We call this row and column the pivot row and the pivot column, respectively. In order to extract parallelism from the outermost k loop, the computation was reorganized into a sequence of two passes: in the first pass, compute the set of pivot rows and columns, and then use the stored pivot rows/columns to compute the updates to matrix elements in a streamed fashion. This approach enabled the creation of a simple and modular design that maximizes parallelism. The design scales well when (1) larger FPGAs are employed, or (2) greater I/O bandwidth to the system is available.
Fig. 3 shows the architecture. A linear array of B PEs is used to perform FW on a B x B matrix. Each PE has l operators, where each operator comprises a comparator and an adder. The r-th PE stores the r-th pre-computed pivot row and column, and the work it performs corresponds to the computation in the iteration k = r of the outer loop of FW. The first and the last PEs read and write data, respectively, from/to the I/O engine. The design is a streaming one, with read, compute and write pipelined. The latency to process a single tile is given by:

    L = \frac{3B^2}{l} + 3B - 1 \quad \text{cycles}
Here, l is the amount of doAll parallelism in each PE and is governed by I/O bandwidth. B is the amount of pipelined parallelism and is constrained by the amount of FPGA resources. The product of B and l is the total parallelism available in our design, and it was maximized under resource constraints using a model.
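As a worked instance of the formula, with the configuration determined later in this section for the XD1 FPGA (B = 32, l = 4), the latency evaluates to L = 3·32²/4 + 3·32 − 1 = 768 + 95 = 863 cycles for one tile.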
The FW kernel described above can handle only matrices of size B x B. For the general case of N x N matrices, we extend the kernel for the blocked algorithm proposed by Venkataraman et al. [13]. We provide a brief description of the same here.
Figure 4. Tiled FW algorithm (tile types: self-dependent, partially row/column-dependent, and doubly-dependent)
Consider the sequential code in Fig. 2. At each iteration of the outermost loop, each element in the entire matrix is updated (if necessary) to the sum of its projections onto the pivot row and pivot column for that iteration. Now, consider the matrix to be composed of B x B tiles as opposed to individual elements, so that there are (N/B)² tiles, and consider a similar operation being performed on these tiles. In this case, we have an entire tile that needs to be updated by the projection of its elements onto pivot rows and columns that come from a pivot row-block (d[t...t+B−1][1...N]) and a pivot column-block (d[1...N][t...t+B−1]) for the outermost loop k = t to t+B−1. The pivot rows and columns used to update a particular tile may (1) come from the same tile (self-dependent), (2) only the pivot rows come from a different tile (partially row-dependent), (3) only the pivot columns come from a different tile (partially column-dependent), or (4) both the pivot rows and the pivot columns come from different tiles (doubly-dependent). Fig. 4 shows these different types of tiles for a single round. The partially row/column-dependent tiles require the self-dependent tile to be processed first. Similarly, the doubly-dependent tiles depend on the row-dependent and column-dependent tiles (Fig. 4). In each of the N/B rounds, we process (N/B)² tiles using the FPGA design. Hence, a total of (N/B)³ tiles are processed for an N x N matrix. The design described in the previous section is used for processing self-dependent tiles; it was extended to process the other three kinds of tiles with minor changes.
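The round structure implied by these dependences can be summarized in the following C sketch. The process_* helpers are hypothetical placeholders for offloading one tile (together with its pivot rows and columns) to the FPGA kernel; they are not the paper's actual host code.

    /* Hypothetical per-tile offload routines; (i, j) are block indices. */
    void process_self_dependent(int i, int j);
    void process_col_dependent(int i, int j);
    void process_row_dependent(int i, int j);
    void process_doubly_dependent(int i, int j, int r);

    /* One round per pivot tile: T = N / B tiles per dimension. */
    void blocked_fw(int T)
    {
        for (int r = 0; r < T; r++) {
            /* 1. Self-dependent tile (r, r): its pivot rows and columns are its own. */
            process_self_dependent(r, r);

            /* 2. Partially dependent tiles in the pivot row-block and column-block. */
            for (int t = 0; t < T; t++) {
                if (t == r) continue;
                process_col_dependent(r, t);   /* tile (r, t): pivot columns come from (r, r) */
                process_row_dependent(t, r);   /* tile (t, r): pivot rows come from (r, r)    */
            }

            /* 3. Doubly-dependent tiles: both pivot rows and columns come from other tiles. */
            for (int i = 0; i < T; i++)
                for (int j = 0; j < T; j++)
                    if (i != r && j != r)
                        process_doubly_dependent(i, j, r);
        }
    }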
Table 1. FPGA-FW: resource utilization and speedup on the Xilinx XC2VP50

Tile size    FPGA-FW (measured)    CPU-FW (measured)    Speedup    Resource utilization
8x8          0.42 µs               1.60 µs              3.8x       34%
16x16        1.29 µs               14.1 µs              11x        52%
32x32        4.84 µs               106.5 µs             22x        90%
The optimal values of B and l for the XD1 FPGA (Xilinx XC2VP50) were determined to be 32 and 4, respectively, for a precision of 16 bits. Table 1 shows the speedup of the FPGA kernel for processing a single tile.
4. Hardware/Software Integration

In this section, we discuss some of the design and implementation issues in integrating the parallel FPGA kernel with software on the system.
4.1 Extending for any graph size

With the approach described in the previous section, we can handle graphs whose number of nodes is a multiple of the kernel tile size. Any arbitrary size can be trivially handled by padding the distance matrix with ∞ (the largest value within the supported precision) up to the next higher multiple of the tile size. This is equivalent to adding additional disconnected nodes to the graph.
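A minimal sketch of this padding step is shown below; it assumes a row-major 16-bit distance matrix, and the function name and allocation strategy are illustrative rather than taken from the paper's implementation.

    #include <stdint.h>
    #include <stdlib.h>

    #define INF UINT16_MAX  /* largest value within the supported 16-bit precision */

    /* Pad an n x n distance matrix up to the next multiple of the tile size B.
     * Padded off-diagonal entries are INF (disconnected dummy nodes); padded
     * diagonal entries are 0, consistent with w_ii = 0 in Section 3.2. */
    uint16_t *pad_matrix(const uint16_t *d, int n, int B, int *padded_n)
    {
        int np = ((n + B - 1) / B) * B;           /* next higher multiple of B */
        uint16_t *dp = malloc((size_t)np * np * sizeof(uint16_t));
        for (int i = 0; i < np; i++)
            for (int j = 0; j < np; j++) {
                if (i < n && j < n)      dp[i*np + j] = d[i*n + j];
                else if (i == j)         dp[i*np + j] = 0;
                else                     dp[i*np + j] = INF;
            }
        *padded_n = np;
        return dp;
    }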
4.2 Data movement and communication

A specially allocated communication buffer that is pinned in the system's memory is used for transferring data to/from the FPGA. The I/O engine (Fig. 3) on the FPGA transfers data of a specified length to/from contiguous regions in the communication buffer. We split the buffer into two parts as shown in Fig. 5. The source and destination buffer addresses, the amount of data to be transferred to/from these buffers, and the type of tile that is to be processed are communicated using a special set of registers. A write to the register meant for the destination buffer address triggers the computation on the FPGA. Completion is indicated by setting a register bit on the FPGA which the CPU polls. All of this overhead is incurred for each compute request made to the FPGA. Once the computation starts, all of the tiles placed in the source buffer are processed successively. As the computation for successive tiles is overlapped in the FPGA design (Sec. 3.3), the overhead associated with making the compute request is hidden when a large enough number of tiles is processed successively. Allowing the FPGA to do the reads and writes from/to the communication buffer in the system's memory frees the CPU for other tasks. In particular, it provides us the opportunity to use the CPU to copy tiles between the matrix and the communication buffer while the FPGA is computing. We discuss this in Sec. 5.2.
Figure 5. Communication buffer model

The parallel FPGA kernel described in Sec. 3.3 was used iteratively, taking care of dependences. All tiles of a particular kind (self-dependent, partially row/column-dependent and doubly-dependent) are processed successively. Hence, in each of the N/B rounds of the blocked algorithm, the self-dependent tile is processed first, followed by the sets of partially row- and column-dependent tiles. This is followed by the processing of the doubly-dependent tiles. In order to process a tile, the tile along with the pivot row and column elements is copied to the source buffer on the system, interleaved in the fashion required by the kernel (Sec. 3.3). The compute request is then made to the FPGA. Hence, for a set of k B x B tiles that need to be processed successively, 3B²k matrix elements are copied to the source buffer by the CPU. The FPGA reads from the source buffer and writes back the result tiles, comprising B²k elements, to the destination buffer, and sets the completion flag. The result tiles are then copied back to the matrix by the CPU. The number of copy operations for an N x N matrix is therefore 4N³/B, where each operation involves a distance matrix element. It is the time consumed in these copies between the matrix and the source/destination buffers that we try to minimize.
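The per-tile CPU-side staging step can be pictured with the following sketch. The exact interleaving expected by the kernel is design-specific, so a simple tile / pivot-row-block / pivot-column-block ordering is assumed here, and the function and parameter names are illustrative.

    #include <stdint.h>
    #include <string.h>

    /* Stage one B x B tile plus the B x B blocks holding its pivot rows and
     * pivot columns into the pinned source buffer: 3*B*B elements per tile,
     * matching the copy cost of 3B^2 k elements for a set of k tiles. */
    static void stage_tile(uint16_t *src_buf, const uint16_t *tile,
                           const uint16_t *pivot_rows, const uint16_t *pivot_cols, int B)
    {
        size_t n = (size_t)B * B;
        memcpy(src_buf,         tile,       n * sizeof(uint16_t));
        memcpy(src_buf + n,     pivot_rows, n * sizeof(uint16_t));
        memcpy(src_buf + 2 * n, pivot_cols, n * sizeof(uint16_t));
    }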
5. Optimizing data movement

In this section, we discuss optimizations on the system side to extract maximum performance from the FPGA kernel. The optimizations are partly model-driven and partly derived from analysis of the measured performance of the unoptimized version, which we refer to as FPGA-FW. In FPGA-FW, the three phases of copy, compute and write-back are done sequentially, with the distance matrix in the original row-major form.
The entire blocked FW for an N x N matrix comprises (N/B)³ tiled computations. Each of the B x B tiles in the matrix gets processed N/B times, once in each round. Hence, temporal locality can be exploited for copying across rounds.
The copy time is proportional to the square of the tile size (B), while the number of compute operations is Θ(B³). However, due to the FPGA kernel performing these operations in parallel, the compute time may be reduced to a level where it is comparable to the copy time. This makes several approaches that would reduce or hide the data movement costs beneficial to pursue.

Figure 6. Conflict misses in FPGA-FW-32
5.1 Layout transformation

In FPGA-FW, significant conflict misses occur as successive rows of a tile may get mapped to the same cache line, due to the size of the matrices being large and a power of two. Fig. 6 confirms this. This leads to loss of temporal locality across multiple rounds of the blocked algorithm. For column-wise copying, apart from temporal locality, spatial locality may be lost too. In addition, if the size of the cache line is greater than the size of a single row of the tile, as is the case for the 8x8 and 16x16 kernels, memory bandwidth is not effectively utilized. We therefore transform the layout of the input matrix so that B x B tiles of the matrix are contiguous in memory in row-major order (Fig. 7).

Figure 7. Transforming the layout of the distance matrix

After the layout transformation, the number of cache misses per B x B tile for row-wise as well as column-wise copying is exactly B²/L, where L is the cache line size. This is true for large matrices that are at least twice as large as the cache. The cost of transforming the layout of the matrix is a small fraction of the FW latency and does not affect performance. Padding the distance matrix is another alternative, but it is not as effective as the layout transformation (Sec. 6).
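One possible implementation of this transformation is sketched below; it assumes N is already a multiple of B, that each tile is itself stored row-major, and that tiles are laid out in row-major order of their block indices (the function name is illustrative).

    #include <stdint.h>
    #include <stddef.h>

    /* Convert an N x N row-major distance matrix d into a tiled ("bricked")
     * layout dt in which every B x B tile occupies a contiguous block of memory. */
    static void to_tiled_layout(const uint16_t *d, uint16_t *dt, int N, int B)
    {
        int T = N / B;                                     /* tiles per dimension */
        for (int bi = 0; bi < T; bi++)
            for (int bj = 0; bj < T; bj++) {
                uint16_t *tile = dt + ((size_t)bi * T + bj) * B * B;
                for (int i = 0; i < B; i++)
                    for (int j = 0; j < B; j++)
                        tile[i * B + j] = d[(size_t)(bi * B + i) * N + (bj * B + j)];
            }
    }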
5.2 Compute/copy overlap

Even after performing the layout transformation, the compute time may be comparable to the copy time for matrices that do not completely fit in the cache. As we make the FPGA responsible for transferring data between itself and the communication buffer, the host is free while the FPGA is computing. We overlap the FPGA computation for a set of k tiles with the write-back time and the copy time for the previous and next sets, respectively. Two buffers are used in an alternating fashion (each for a maximum of k tiles). We perform the compute/copy overlap only for doubly-dependent tiles, as the processing of these tiles dominates the compute latency for the 32x32 FPGA kernel (Fig. 4).
Optimal overlap chunk: Choosing a small chunk size (k) for the compute/copy overlap would not hide the overhead involved in requesting the FPGA to process a set of tiles. Using a large chunk size would increase the trailing non-overlapping copy and write-back time (the first copy and the last write-back cannot be hidden). The optimal value of the compute/copy overlap chunk, k, is higher for larger matrices. We determine this in the next section.
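The double-buffered overlap scheme can be sketched as follows. Here fpga_submit(), fpga_wait(), stage_chunk() and writeback_chunk() are hypothetical placeholders standing in for the vendor API calls and the CPU-side copies; they are not functions of the Cray User FPGA API.

    /* Hypothetical placeholders: submit/await a chunk of k tiles on buffer pair b,
     * and copy a chunk of tiles to / results from that buffer pair. */
    void fpga_submit(int b, int k);
    void fpga_wait(int b);
    void stage_chunk(int b, int chunk, int k);
    void writeback_chunk(int b, int chunk, int k);

    /* Process the doubly-dependent tiles in chunks of k, overlapping the FPGA
     * computation of one chunk with the write-back of the previous chunk and the
     * staging of the next one, using two alternating buffer pairs. */
    void overlap_doubly_dependent(int num_chunks, int k)
    {
        int cur = 0;                        /* buffer pair currently on the FPGA */

        stage_chunk(cur, 0, k);             /* first copy cannot be hidden */
        fpga_submit(cur, k);

        for (int c = 1; c < num_chunks; c++) {
            stage_chunk(1 - cur, c, k);     /* copy next chunk while FPGA computes */
            fpga_wait(cur);                 /* chunk c-1 finished                  */
            fpga_submit(1 - cur, k);        /* launch chunk c                      */
            writeback_chunk(cur, c - 1, k); /* write back results of chunk c-1     */
            cur = 1 - cur;
        }

        fpga_wait(cur);                     /* last write-back cannot be hidden */
        writeback_chunk(cur, num_chunks - 1, k);
    }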
6. Measurements and analysis

The measurements for the general-purpose processor case were taken on a 2.2 GHz 64-bit AMD Opteron (as found on the XD1) with a 64 KB L1 data cache and a 1 MB L2 cache with a cache-block size of 64 bytes. GCC 3.3 with "-O3" turned on was used for compilation. The FPGA on the XD1 is a Xilinx Virtex-II Pro XC2VP50. The FPGA design was clocked at 170 MHz. The version of the Cray User FPGA API used was 1.3. The API provides functions to program the FPGA and to write values and addresses to registers on the FPGA, taking care of virtual-to-physical address translation in the latter case. All measurements are for 16-bit edge weights. We use the following abbreviations to identify CPU and FPGA-based FW implementations with different optimizations throughout this section.
CPU-FW: Simple Floyd-Warshall (FW) implementation on the Opteron (three-way nested loop shown in Fig. 2).

CPU-FW-OPT: Optimized blocked implementation of FW on the Opteron (block size empirically optimized for best performance) [13]. We copy tiles to a contiguous buffer to eliminate conflict misses.

FPGA-FW: FPGA-based implementation of FW for an N x N matrix on the XD1 FPGA without any optimizations.

FPGA-FW-LT: FPGA-FW with the layout transformation explained in Sec. 5.1.

FPGA-FW-LTOV: FPGA-FW-LT with compute and copy overlapped as explained in Sec. 5.2.

A suffix of 8, 16, or 32 is used to distinguish implementations using kernels that process tiles of that size. In all figures and tables, copy time refers to the sum total of the copy time and the write-back time. All speedups mentioned in this section are over the optimized CPU implementation (CPU-FW-OPT).
For graphs with up to 256 nodes, the distance matrix and the communication buffer completely fit in the L2 cache (256² × 2 × 5 bytes < 1 MB). Hence, FPGA-FW, FPGA-FW-LT and FPGA-FW-LTOV perform equally well, as seen in Table 2. However, there is a sudden increase in copy time from 256 nodes to 512 nodes, and performance drops from there on for FPGA-FW and FPGA-FW-LT (Fig. 10).
6.1 Effect of Layout Transformation

By operating on the bricked layout of the distance matrix, we find that the copy time for large graphs is cut down by more than a factor of two. As shown in Table 2, the copy time for FPGA-FW-LT increases consistently by a factor of eight each time the problem size doubles, as opposed to the way it grows for FPGA-FW. This is along expected lines (Sec. 4.2), since the number of copy operations grows as 4N³/B.
6.2 Effect of compute/copy overlap

Even after the layout transformation, for graphs with 512 nodes or more, the copy time is comparable to the compute time. The compute/copy overlap thus leads to a speedup by a factor of two, hiding the copy time completely (Fig. 8). For the size of the overlap chunk (discussed in Sec. 5.2), we empirically find that a value of 32 works well for most problem sizes (Fig. 9). Thus, the total communication buffer requirement (including the alternating buffer) is 2 × 4 × k × B² × 2 bytes = 512 KB for k = 32 and B = 32.
For CPU-FW-OPT, the copy time is a very small fraction of the total latency (about 1%) for all problem sizes (Table 3). The compute and copy times for CPU-FW-OPT increase along expected lines (proportional to
