Analysis of Placement Methods for SCORE Computations

Michael James Wilson

Mentor: André DeHon

Implementation of Computation Laboratory, Computer Science Department, California Institute of Technology, Pasadena, California

Programs written for spatial, reconfigurable computing platforms benefit from being tailored to the resources available in the system.  However, a program may not know how many resources are free until it begins to run.  This forces placement (that is, mapping the computation onto the hardware) to occur during the runtime of the program.  The placement step cannot be neglected, since poor placement can slow the program down; yet time spent on placement also slows the program down.  Fast, high-quality online placement is therefore necessary to maximize the advantages of scalable, spatial computing.  The research in this report explores several methods of placement applied to SCORE, a compute model for spatial, reconfigurable systems.  Placement is performed on simple virtual hardware, and its quality is measured through several metrics.  Comparing the quality of the various placements to the runtimes of their algorithms may indicate a method that could be optimized for actual placement on SCORE or other reconfigurable hardware.

Computing with reconfigurable hardware can provide significant performance gains over conventional microprocessors.  The Stream Computations Organized for Reconfigurable Execution (SCORE) compute model breaks computations into compute pages, configurable memory blocks, and directed streams, which represent the basic logic, data-management, and interconnect resources of the computation, respectively [1].  The SCORE scheduler [2] produces a list of the nodes and edges of a computation.  This list must then be mapped onto hardware for execution.

No physical SCORE hardware currently exists.  The virtual SCORE hardware used in the analysis in this report is a fat tree, such as the one in Fig. 1.  Each leaf in the tree represents either a compute page (CP) or a configurable memory block (CMB); these are the fundamental units of computation and memory on the hardware, respectively.  In this hardware virtualization, each pair of leaves in a subtree of height one contains one CP and one CMB.

Figure 1 Graphical representation of simple, virtual SCORE hardware.  Dark squares represent compute pages (CPs) and light squares represent configurable memory blocks (CMBs).
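The description above implies a leaf numbering in which CPs and CMBs alternate.  As a concrete illustration, the slot lists used by the placement sketches later in this report could be generated in Python as follows (the function name and numbering convention are assumptions, not part of SCORE):

def leaf_slots(num_pairs):
    # Leaves alternate CP, CMB, CP, CMB, ... so that every height-one
    # subtree holds exactly one CP and one CMB.
    cp_slots = [2 * i for i in range(num_pairs)]       # even leaf indices
    cmb_slots = [2 * i + 1 for i in range(num_pairs)]  # odd leaf indices
    return cp_slots, cmb_slots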

If an entire computation cannot fit onto the available hardware, the scheduler breaks it into timeslices, which are then multiplexed onto the hardware.  This results in multiple netlists that must be placed on the same hardware in succession.  Each netlist is a list of connections between CPs and CMBs.  Each CP or CMB is given an identifier, and the same CP or CMB may exist in several timeslices.  Fig. 2 displays a graphical representation of some netlists.  The placement methods were run on sets of netlists generated by the SCORE scheduler on runs of a SCORE JPEG decoder (jpeg_decode_nopar), a JPEG encoder (jpeg_encode_nopar), and a wavelet encoder (wavelet).  Each application was run with varying numbers of CPs and CMBs.

Figure 2 Graphical representations of three timeslices from the SCORE wavelet application.  Dark squares represent compute pages, light squares represent configurable memory blocks, and thin black lines represent connections.  Lines touching the right side of a square represent outputs from, and lines touching the left side represent inputs to, each node.

Defining good placement

There are several factors to consider when defining good placement.  The lengths of the connections between connected nodes influence the execution time of the program, since longer wires mean longer signal delays; both the critical path delay (defined here as the longer of the longest path from a node with no inputs to a node with no outputs and the longest path from a node with no inputs to the completion of a cycle) and the total length of the connections between all the nodes therefore matter.  The runtime of the placement method itself is also important, since if the time spent on placement dominates the execution of the computation, it negates any gains from better placement.  The bandwidth requirements that the placement imposes on the hardware were also considered.  Finally, the number of extra times the hardware has to be reconfigured plays an especially important role.

Placement methods

Three general categories of placement were used:  random placement; constructive placement using spectral ordering, otherwise known as the eigenvalue method [3]; and iterative placement using simulated annealing [4].

The random placement simply assigned each node in the computation to a random available place on the hardware.  This method was primarily used to define lower bounds on placement quality and runtime; the other methods are expected to do consistently better in terms of quality but execute more slowly.

The spectral ordering used a one-dimensional ordering by the eigenvalue method to obtain a list of nodes.  The ordering was further broken into independent and merged types.  For independent ordering, the nodes in each timeslice were ordered separately, and then placed.  For merged ordering, the nodes in the entire computation were ordered, and then the nodes that existed in each timeslice were placed according to that order.
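As an illustration, a minimal Python sketch of the one-dimensional eigenvalue ordering [3], assuming a connected netlist graph given as a symmetric adjacency matrix (function and variable names are illustrative):

import numpy as np

def spectral_order(adjacency):
    # Laplacian L = D - A; the eigenvector with the second-smallest
    # eigenvalue (the Fiedler vector) gives a one-dimensional embedding.
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)  # ascending order
    fiedler = eigenvectors[:, 1]
    return np.argsort(fiedler)  # node indices in linear order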

The linear ordering of the nodes was then mapped to the hardware.  In the “first-fit” method, resources were placed, in order, in the first position in which they fit:

Algorithm:  Place First-Fit
  For each unplaced node:
    If it is a CP, place it in the next available CP position.
    Else, place it in the next available CMB position.
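A minimal Python sketch of this mapping, assuming order is a list of (node, kind) pairs from the linear ordering and the slot lists come from the leaf_slots sketch above:

def place_first_fit(order, cp_slots, cmb_slots):
    placement = {}
    next_cp = next_cmb = 0
    for node, kind in order:
        if kind == 'CP':
            placement[node] = cp_slots[next_cp]    # next available CP leaf
            next_cp += 1
        else:
            placement[node] = cmb_slots[next_cmb]  # next available CMB leaf
            next_cmb += 1
    return placement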

In the “semiclever” methods, the nondominant resource was spaced out in order to preserve the initial ordering better but still produce a physically possible placement:

Algorithm:  Place Semiclever
  For each unplaced node:
    If it is a CP, place it in the next available CP position.  If there is available CMB slack, skip spaces in the available CMB positions until the next available CP position and the next available CMB position are as close to each other as possible.
    Else, place it in the next available CMB position.  Skip CP positions similarly.
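A Python sketch under the same assumptions as the first-fit sketch; the exact skipping rule here is one plausible reading of the description, and the slack counters and closeness test are assumptions:

def place_semiclever(order, cp_slots, cmb_slots, cp_slack, cmb_slack):
    placement = {}
    next_cp = next_cmb = 0
    for node, kind in order:
        if kind == 'CP':
            placement[node] = cp_slots[next_cp]
            next_cp += 1
            # Spend CMB slack: skip CMB positions while doing so brings
            # the next available CMB closer to the next available CP.
            while (cmb_slack > 0 and next_cp < len(cp_slots)
                   and next_cmb + 1 < len(cmb_slots)
                   and abs(cmb_slots[next_cmb + 1] - cp_slots[next_cp])
                       < abs(cmb_slots[next_cmb] - cp_slots[next_cp])):
                next_cmb += 1
                cmb_slack -= 1
        else:
            placement[node] = cmb_slots[next_cmb]
            next_cmb += 1
            # Symmetrically spend CP slack.
            while (cp_slack > 0 and next_cmb < len(cmb_slots)
                   and next_cp + 1 < len(cp_slots)
                   and abs(cp_slots[next_cp + 1] - cmb_slots[next_cmb])
                       < abs(cp_slots[next_cp] - cmb_slots[next_cmb])):
                next_cp += 1
                cp_slack -= 1
    return placement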

In the unconstrained method, the order was maintained exactly by allowing CPs to be placed in CMB locations on the hardware, and vice versa.  This method cannot actually be used: the resulting placement is not physically possible, since some CPs may occupy positions reserved for CMBs, and vice versa:

Algorithm:  Place Unconstrained
  For each unplaced node:
    Place it in the next available position on the hardware, regardless of whether it is a CP position or a CMB position.
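In Python, the unconstrained mapping reduces to a one-liner (again with illustrative names):

def place_unconstrained(order, all_slots):
    # Assign nodes to leaves strictly in linear order, ignoring types.
    return {node: all_slots[i] for i, (node, _kind) in enumerate(order)}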

The unconstrained method is interesting because the spectral ordering algorithm is not designed to take two different resource types into account.  It shows how spectral ordering would perform if it were used as it was designed to be used.

Finally, simulated annealing [4] was used in two methods: one that tried to minimize the total wire length, and another that constrained movement to zero and tried to minimize the total wire length subject to that constraint.  The latter occasionally produced overlapping placements, which are not physically possible; if it were actually used, a further step would have to resolve the overlaps.

Results

Refer to Fig. 3 for charts of the summed values of all the metrics.  This data is the result of adding together the values generated by each placement method for several netlists generated by the three SCORE applications named previously.

Figure 3 Charts of metrics.

The bandwidth (“area”) requirements were based on Rent’s Rule, the empirical relation which states, for each node in the hardware tree:

$IO = C \cdot N^{p}$                                                                                                      (1)

where IO is the number of wires running through any node in the hardware tree (not just the leaves), N is the number of leaves contained in the subtree whose root is the current node, p is a growth factor (here one of 0.25, 0.33, 0.50, 0.67, or 0.75), and C is a constant multiplier.  The “area” metric measures values of C for various values of p.  For values of p of 0.67 or higher, all of the methods tended to reach a lower bound for C, implying that if the growth factor for the hardware is high enough, the method of placement has little impact on the constant multiplier.
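As an illustration, the C value for a given p can be computed in Python as below; reporting the maximum over the tree's nodes is an assumption about how a single C is extracted from equation (1):

def rent_constant(tree_nodes, p):
    # tree_nodes: (IO, N) pairs, one per node of the hardware tree, giving
    # the wire count through the node and the leaf count beneath it.
    return max(io / n ** p for io, n in tree_nodes)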

Overlapping placements should be counted as movements for the purposes of determining total movement.  The standard annealing did about as poorly as the random placement in terms of movement.  However, the low-movement annealing lived up to its name and also produced a low number of overlapping placements.  None of the spectral orderings produced low movement, but none of them takes movement into account.

Both annealing methods took orders of magnitude longer to execute than the other methods, making them impractical for actual use.  Since the execution of the algorithms was not optimized, and the measurement of runtimes was very rough, the runtime metric should be used mainly to get a general feel for the trends of the methods.

Total wire length and critical path delay tend to follow similar trends, with the independent, unconstrained spectral ordering and the annealer doing best.  The fact that the unconstrained spectral ordering did very well when resource constraints were lifted implies that it could be a good, fast solution for wire delays if it can be adapted to account for two different resource types.

Principal conclusions

Most importantly, movement needs to be considered from the beginning when designing future algorithms.  It is possible to constrain movement, especially with larger arrays.  Random placement is consistently poor, so placement algorithms do make a significant difference.  Finally, optimizing exclusively for delay and wire length tends to produce poor movement.

Suggestions for further research

There is a broad range of tasks to explore in the field of fast, high-quality online placement for SCORE computations.  Possibilities include: making the placer work more closely with the scheduler, or integrating the two; working with hardware virtualizations other than the simple fat tree; producing a method that takes movement into account while using a fast algorithm such as spectral ordering; working with very large netlists or netlists from various other applications; and eventually optimizing placement methods for fast execution in hardware.

Methods

Area

Rent’s constant for a particular interconnect growth was measured through the equation:

$IO = C \cdot N^{p}$                                                                                                       (1)

where IO is the number of wires running through a node of the hardware tree, N is the number of leaves in the subtree of which that node is the root, and p is the exponent describing the growth of the interconnect.

CMB Slack, CMBs, CP Slack, CPs

The numbers of CPs and CMBs actually used are found from the contents of the netlist files; these become the CP and CMB metrics.  The numbers of CPs and CMBs the computation was scheduled for are extracted from the filename of the netlist directory.  The slack for each resource is the difference between the number of that resource the computation was scheduled for and the number actually used.

Delay

The critical path delay was measured as the longer of the longest path from a node with no inputs to a node with no outputs and the longest path from a node with no inputs to the end of a cycle.  The distance between any two nodes is calculated by finding the smallest subtree that contains both nodes and applying the equation:

$\text{Distance} = 2h$                                                                                  (2)

where h is the height of the subtree.
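For leaves numbered left to right in a complete binary tree, the height of the smallest common subtree, and from it the distance, can be computed in Python as follows (a sketch; taking the distance as 2h, i.e., h hops up and h hops down, is the reading assumed here):

def distance(leaf_a, leaf_b):
    h = (leaf_a ^ leaf_b).bit_length()  # height of smallest common subtree
    return 2 * h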

Movement

Movement was measured for each timeslice on a node-by-node basis.  For each node in a timeslice, the previous timeslices are searched from the most recent back to the first for the first occurrence of a node with the same ID; if that occurrence is in a different position, one is charged to the movement metric.  If the node does not appear in any previous timeslice, it does not charge the movement metric.

For example, consider a simple two-timeslice graph.  Timeslice 1 contains nodes 11, 12, 13, and 14.  Timeslice 2 contains nodes 13, 15, 16, and 17.  Assume the hardware has four slots.  If the placement is, in order, 11, 12, 13, 14 for the first timeslice and 13, 15, 16, 17 for the second, the movement would be one (since the only node that exists in both timeslices is 13, and it appears in different positions).  However, if the placement for the first timeslice was 11, 12, 13, 14 and for the second timeslice was 16, 15, 13, 17, movement would be zero.
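This metric can be sketched in Python, with each timeslice represented as a dict mapping node ID to hardware position (an illustrative representation, not the actual data structure):

def movement(timeslices):
    moved = 0
    for i, current in enumerate(timeslices):
        for node, position in current.items():
            # Search earlier timeslices, most recent first.
            for previous in reversed(timeslices[:i]):
                if node in previous:
                    if previous[node] != position:
                        moved += 1
                    break
    return moved

For the example above, movement([{11: 0, 12: 1, 13: 2, 14: 3}, {13: 0, 15: 1, 16: 2, 17: 3}]) returns 1, and substituting {16: 0, 15: 1, 13: 2, 17: 3} for the second timeslice gives 0.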

Overlapping placements

The metric is charged one for each node beyond the first in the same position on the hardware array in a specific timeslice.  Any placement that has more than zero overlapping placements is not physically possible.

Runtime

The rough runtime of calculating the placement is measured in milliseconds, using the system clock of the machine running the program.

Total Wire Length

The total wire length is the sum of the distances between all pairs of connected nodes.  If two nodes are connected more than once, the metric is charged once per connection.
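A Python sketch using the distance function from the Delay section, with connections given as (source, sink) pairs repeated once per connection:

def total_wire_length(connections, placement):
    # placement maps each node to its leaf index on the hardware.
    return sum(distance(placement[a], placement[b]) for a, b in connections)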

Annealing Parameters

The annealing was tuned through various parameters.  Further tuning could improve the speed or quality of placement by simulated annealing.

The first annealer attempted a number of swaps per epoch equal to 10 times the number of CPs and CMBs in the graph.  The initial temperature was determined by trying 10,000 swaps using the method detailed on page 161 of VLSI Cell Placement Techniques [4].  The cost of a swap was the change in wire delay.  The cooling schedule was $T_{n+1} = 0.95\,T_n$.  The acceptance criterion was the standard Metropolis criterion [4].  The stopping criterion was a temperature less than or equal to 0.1.

The low-movement annealer attempted 100 swaps per epoch.  The initial temperature was determined in the same manner as for the first annealer.  The cost of a swap was the change in wire delay plus the change in the number of overlapping placements divided by the temperature.  The cooling schedule, acceptance criterion, and stopping criterion were the same as for the first annealer.
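For illustration, the shared structure of both annealers can be sketched in Python; random_swap and undo_swap are assumed helpers (a random position swap returning its change in cost, and its inverse), not part of the actual implementation:

import math, random

def anneal(placement, temperature, swaps_per_epoch):
    while temperature > 0.1:                # stopping criterion
        for _ in range(swaps_per_epoch):
            delta = random_swap(placement)  # change in cost
            # Metropolis criterion [4]: keep improvements; keep worsening
            # moves with probability exp(-delta / T).
            if delta > 0 and random.random() >= math.exp(-delta / temperature):
                undo_swap(placement)
        temperature *= 0.95                 # geometric cooling schedule
    return placement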


Acknowledgements I thank André DeHon for mentorship.  I also thank Yury Markovskiy for providing the SCORE runtime and scheduler and Rafi Rubin and Michael Wrighton for technical assistance and help with data analysis in the lab.

Competing interests statement The author declares that he has no competing financial interests.

Correspondence and requests for materials should be addressed to M.J.W. (e-mail: mwilson@caltech.edu).



[1]. Caspi, E., Chu, M., Huang, R., Yeh, J., Markovskiy, Y., DeHon, A. & Wawrzynek, J. Stream Computations Organized for Reconfigurable Execution (SCORE): Introduction and Tutorial. (2000).

[2]. Markovskiy, Y., Caspi, E., Huang, R., Yeh, J., Chu, M., Wawrzynek, J. & DeHon, A. Analysis of Quasi-Static Scheduling Techniques in a Virtualized Reconfigurable Machine. FPGA’02. February 24-26. (2002).

[3]. Hall, K. M. An r-dimensional quadratic placement algorithm. Management Science 17, 219-229 (1970).

[4]. Shahookar, K. & Mazumder, P. VLSI Cell Placement Techniques.  ACM Computing Surveys 23, 144-164 (1991).