Tiling Rectangles


Akshay Singh (akki)
June 1, 2011

Given a rectangular area with integral dimensions, that area can be subdivided into square subregions, also with integral dimensions. This process is known as tiling the rectangle. For such square-tiled rectangles, we can encode the tiling with a sequence of grouped integers. Starting from the upper horizontal side of the given rectangle, the squares are "read" from left to right and from top to bottom. The lengths of the sides of the squares sharing the same horizontal level (at the top of the tiling square) are grouped together by parentheses and listed in order from left to right. For example, a 4x7 rectangle tiled with the pattern shown in the figure would be encoded as (4 2 1) (1) (1 2) (1).

Problem Description

Write a threaded program to input an unknown number of sets of ordered integers and decide if there is some grouping of those integers that would form the correct encoding for some square-tiled rectangle. The sets for testing will be held within a text file listed first on the application's command line. For all valid encodings found, the output of the application will be the height and width of the rectangle and the properly formatted encoding of the tiling. Output will be stored in the second text file listed on the command line.

Input Description

The input to the program will be from a text file named on the command line of the application. Consecutive lines will correspond to the same potential tiling of a rectangle, in order. A '0' (zero) will be used to indicate the end of a tiling set. Input file lines will contain 20 integers per text line; the last line of a tiling may have fewer than 20 integers, and will terminate with a zero. End of file marks the end of the tiling sets.

Output Description

For each potential tiling set within the input file, the output to be generated by the application is the dimensions of the rectangles and the encoded tiling of each rectangle, written to the second file listed on the application command line. If there is no encoding possible from an input set, a message stating that fact should be printed. The order of the output sets must be the same as the input set order.

Example

Command line: tiling.exe setsin.txt rectout.txt

Input file: setsin.txt

Output file: rectout.txt

Set 1
  dimensions: 4 x 7
  (4 2 1) (1) (1 2) (1)
Set 2
  dimensions: 2 x 3
  (2 1) (1)
  dimensions: 3 x 2
  (2) (1 1)
Set 3
Cannot encode a rectangle
Set 4
  dimensions: 61 x 69
  (36 33) (5 28) (25 9 2) (7) (16)
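As a quick sanity check on the Set 1 example, the squares of the tile sides must add up to the rectangle area: 16 + 4 + 1 + 1 + 1 + 4 + 1 = 28 = 4 x 7. A tiny illustrative snippet (not part of the required program) that performs this check:

#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    // Tiles of the Set 1 example, read left to right, top to bottom.
    std::vector<int32_t> tiles = {4, 2, 1, 1, 1, 2, 1};
    int64_t area = 0;
    for (int32_t t : tiles) area += (int64_t)t * t;   // each tile covers t*t cells
    std::printf("total tile area = %lld (4 x 7 = 28)\n", (long long)area);
    return 0;
}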

Serial Algorithm

A serial algorithm to solve the rectangle tiling problem is described below. The following variables shall be used in the algorithm:

N       number of tiles
W       array containing the sequence of tiles specified in the problem statement
area    area of the rectangle: area = Σ W[i]², i ∈ [0, N-1]
cr, nr  arrays representing the current row and the next row respectively; each element of these arrays has height and width properties
minh    smallest height in cr
T       tile index
R       row number
start   array containing the tile index of the first tile in each row of the solution
n       number of tiles in the first row (1 ≤ n ≤ N)

1 Determine feasibility
  1.1 Choose the number of tiles n in the first row (1 ≤ n ≤ N)
  1.2 Compute width = Σ W[i], i ∈ [0, n-1]
  1.3 A solution is feasible if area is divisible by width

2 Construct the first row
  2.1 minh = ∞, T = 0, R = 0
  2.2 while T < n
        cr[T].width = cr[T].height = W[T]
        if W[T] < minh, minh = W[T]
        T++
  2.3 start[R] = 0, R++

3 Iteratively, attempt to plug in the next tiles in sequence
  3.1 start[R] = T, gap = 0, newminh = ∞, k = 0, R++
  3.2 for each tile t in cr, perform steps 3.3.1 - 3.3.3
      3.3.1 newh = t.height - minh
      3.3.2 if newh == 0, gap += t.width
      3.3.3 else
              if gap > 0, plug the gap as described in 4 below; gap = 0
              if newh < newminh, newminh = newh
              nr[k].width = t.width, nr[k].height = newh, k++
  3.4 if gap > 0, plug the gap as described in 4 below
  3.5 swap cr and nr
  3.6 minh = newminh
  3.7 if T < N, return to 3.1
  3.8 make sure all tiles in the last row have the same height:
      for each tile t in cr, if t.height ≠ minh, a rectangle of the selected width is not possible; return to 1.1
  3.9 start[R] = N
  3.10 use start to produce output
  3.11 return to 1.1

4 Plug a gap
  4.1 wsum = 0
  4.2 while wsum < gap and T < N
        wsum += W[T]
        nr[k].width = nr[k].height = W[T]
        if W[T] < newminh, newminh = W[T]
        k++, T++
  4.3 if wsum ≠ gap, the gap cannot be plugged; return to 1.1
  4.4 return
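To make steps 1 and 2 concrete, here is a small C++ sketch of the feasibility test and first-row construction. It restates the pseudocode above under an assumed Tile struct; it is an illustration, not the implementation used in the report.

#include <cstdint>
#include <limits>
#include <vector>

struct Tile { int32_t width; int32_t height; };   // assumed layout of cr / nr elements

// Step 1: a first row of n tiles is only feasible if the total area of all
// tiles is divisible by the width of that first row.
bool feasible(const std::vector<int32_t>& W, int n, int64_t& width, int64_t& height) {
    int64_t area = 0;
    for (int32_t w : W) area += (int64_t)w * w;   // area = Σ W[i]²
    width = 0;
    for (int i = 0; i < n; ++i) width += W[i];    // width = Σ W[i], i < n
    if (width == 0 || area % width != 0) return false;
    height = area / width;
    return true;
}

// Step 2: place the first n tiles side by side to form the first row and
// record the smallest height seen (minh in the pseudocode).
int32_t build_first_row(const std::vector<int32_t>& W, int n, std::vector<Tile>& cr) {
    cr.clear();
    int32_t minh = std::numeric_limits<int32_t>::max();
    for (int T = 0; T < n; ++T) {
        cr.push_back({W[T], W[T]});
        if (W[T] < minh) minh = W[T];
    }
    return minh;
}

Steps 3 and 4 then repeatedly reduce the height of every tile in cr by minh, carry tiles that still have height into nr, plug the gaps left by finished tiles with the next tiles from W, and abandon the candidate width as soon as a gap cannot be filled exactly.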

Parallel Algorithm

The solution to this problem can be easily parallelized. A suitable strategy may be adopted depending on the nature of the input file.

1. If the input file contains a large number of small problems, it makes sense to use static data decomposition, dividing the input into sets of problems and solving each set of problems in a separate thread using the serial algorithm.

2. On the other hand, the input file may contain relatively large problems. In this case, exploratory decomposition can be used, by statically dividing the number of tiles in the first row (n) among a set of threads and aggregating the results to produce the output.

Input and Output

Once an initial serial solution had been implemented, it was evident that the input and output phases were dominating the execution time. It was, therefore, necessary to improve the I/O performance. It was clear that stream-based I/O was too slow and that an alternative would be required. The obvious choice for improving input performance was memory mapping the input file and then using a custom scanner to construct the problem definitions.

ProblemDefinition* ProblemParser::next() {
    if(eof()) {
        // End of file
        return NULL;
    }

    ProblemDefinition* problem = new ProblemDefinition();
    // W array from the serial algorithm
    vector<int32_t>& tiles = problem->tiles;
    int64_t area = 0;
    int32_t value = 0;
    bool ws = false;
    // Number of bytes of input remaining
    int64_t remain = length - (int64_t)(fp - buffer);

    // Scan input
    do {
        if(isws(*fp)) {
            ws = true;
        } else {
            if(ws) {
                if(value == 0)
                    break;              // Stop when a 0 is found
                tiles.push_back(value); // Save tile
                area += (value * value);
                value = 0;
                ws = false;
            }
            // Construct a number
            value = value * 10 + *fp - '0';
        }
        ++fp;
    } while(--remain);

    if(eof() && area == 0) {
        // End of input
        return NULL;
    }

    problem->area = area;
    return problem;
}

For output, a fairly large output buffer was created, and a custom formatter was used to produce formatted output. The output was flushed to disk whenever the buffer was full.

void SolutionFormatter::format(ProblemDefinition& definition, Solution& solution) {
    appendstring("\n  dimensions: ", 15);
    appendint(solution.height);
    appendstring(" x ", 3);
    appendint(solution.width);
    appendstring("\n ", 2);

    // solution.row is the start array from the algorithm
    int32_t** prend = solution.row + solution.nrows;
    // definition.tiles is the W array from the algorithm
    int32_t* ptp = &definition.tiles.front();
    int32_t* ptpend;

    // Encode rectangle
    for(int32_t** pr = solution.row + 1; pr < prend; ++pr) {
        appendstring(" (", 2);
        appendint(*ptp);
        ++ptp;
        ptpend = *pr;
        for(; ptp < ptpend; ++ptp) {
            appendstring(" ", 1);
            appendint(*ptp);
        }
        appendstring(")", 1);
    }
    appendstring("\n", 1);
}

However, even after these optimizations, the time spent on I/O was significant. In particular, formatting and scanning were found to be expensive operations. Naturally, the next step was parallelizing these operations. This requirement gives us one set of threads that perform the following actions on different regions of the input file:

1. Scan input from the assigned region of the memory-mapped input file and construct problem definitions
2. Solve each problem, depending on its size, either in the same thread or by dividing the search space among multiple threads
3. Format the discovered solutions and write them to a buffer, adding new buffers when required
4. Notify the Collector thread (defined below) on completion

Let us call these the Solver threads.

Additionally, since sets of problems are being solved in parallel, it makes sense to have a dedicated thread for writing pre-formatted buffers to the output file as soon as the next expected set of solutions becomes available. The job of this thread would be to:

1. Collect formatted output from Solver threads
2. Keep track of the next set of solutions to be written to the output file
3. Dump buffers to the output file

Let us call this thread the Collector thread.

Modified Parallel Algorithm

After incorporating the I/O considerations, we arrive at the following simplified parallel algorithm:

1. Start the Collector thread
2. Open the input file as a memory-mapped file
3. Divide the memory-mapped input file into pieces of a predefined size, and create one Solver thread to work on each piece. Use a thread pool to restrict the number of concurrently running threads.
4. Wait for the Collector thread to terminate
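The report describes the Solver/Collector hand-off but does not list its code; the sketch below shows one way the in-order write-back could look. The Collector class, its member names and the use of std::mutex / std::condition_variable are assumptions for illustration, not the author's implementation.

// Sketch only: a dedicated Collector thread that writes pre-formatted
// buffers to the output file strictly in input-piece order.
#include <condition_variable>
#include <cstddef>
#include <cstdio>
#include <map>
#include <mutex>
#include <string>

class Collector {
public:
    Collector(FILE* out, size_t total_pieces)
        : out_(out), total_(total_pieces), next_(0) {}

    // Called by a Solver thread once its piece has been solved and formatted.
    void submit(size_t piece, std::string buffer) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_[piece] = std::move(buffer);
        ready_.notify_one();
    }

    // Body of the Collector thread.
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (next_ < total_) {
            // Wait until the next expected piece has been submitted.
            ready_.wait(lock, [&] { return pending_.count(next_) != 0; });
            std::string buf = std::move(pending_[next_]);
            pending_.erase(next_);
            ++next_;
            lock.unlock();               // do file I/O outside the lock
            fwrite(buf.data(), 1, buf.size(), out_);
            lock.lock();
        }
    }

private:
    FILE* out_;
    size_t total_;                           // number of input pieces
    size_t next_;                            // next piece expected on disk
    std::map<size_t, std::string> pending_;  // finished but unwritten pieces
    std::mutex mutex_;
    std::condition_variable ready_;
};

The main thread would start this with something like std::thread t(&Collector::run, &collector) before launching the Solver threads, and join it in step 4 of the modified algorithm above.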

Further Improvements

The following improvements to the basic parallel algorithm described above were identified and implemented:

1. Unequal problem set sizes

If the sizes of the input file pieces assigned to different Solver threads are the same, it is likely that all concurrently running Solver threads will complete their assigned tasks at the same time, thus defeating the purpose of the Collector thread. If, however, the sizes of the pieces assigned to the Solver threads are successively increased, we might expect the Solvers to complete execution in the appropriate sequence. Simultaneously, the Collector thread would be kept busy as well, helping us achieve better parallel performance.

2. Reduce the number of memory allocations / deallocations

Another factor that severely impacts the execution time, when dealing with a large number of small problems, is the number of allocations and deallocations. In order to reduce this impact, the memory allocated to represent the tile sequence (W), current row (cr), next row (nr) and row start index (start) is reused, as far as possible, across problem instances assigned to one thread.

3. Assign non-consecutive search areas when using multiple threads to solve a large problem

One of the problems used to test the solution was x! ones, i.e. an input consisting of x! tiles of side 1. The logic behind this choice was that the number of solutions would be equal to the number of divisors of the total area (i.e. x!). For example, consider the 3! ones problem. The area of this rectangle is 6 (3!). 6 is divisible by 1, 2, 3 and 6, and the corresponding 4 solutions are:

1. (1) (1) (1) (1) (1) (1)
2. (1 1) (1 1) (1 1)
3. (1 1 1) (1 1 1)
4. (1 1 1 1 1 1)

Since x! has a large number of small but distinct divisors, x! ones may be the worst case scenario. While solving such problems using multiple threads by assigning consecutive values of the number of first-row tiles (n in the serial algorithm) to each thread, it was observed that one thread would end up finding many more solutions and, consequently, run much longer than the others. For example, if we solve the 3! ones problem using 2 threads, and the search space division is {1, 2, 3} assigned to thread 1 and {4, 5, 6} assigned to thread 2, then thread 1 will find 3 solutions while thread 2 discovers only one. The imbalance becomes clearer when we consider the 10! ones problem solved using 4 threads: the number of solutions found per thread (thread/count) is 1/267, 2/2, 3/0, 4/1. An alternative is to initialize n = i for the i-th thread and increment n by y when using y threads; a sketch of this assignment follows.
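As an illustration of the strided assignment just described (not taken from the report's code), each worker could step through its share of first-row sizes as follows; feasible_with_first_row_count is a hypothetical hook into the serial algorithm with n fixed:

#include <cstdint>
#include <vector>

// Hypothetical hook into the serial algorithm: returns true if some
// square-tiled rectangle exists with exactly n tiles in the first row.
bool feasible_with_first_row_count(const std::vector<int32_t>& W, int n);

// Strided search-space division: thread i (0-based) of num_threads explores
// n = i + 1, i + 1 + num_threads, i + 1 + 2 * num_threads, ...
int count_solutions(const std::vector<int32_t>& W, int thread_index, int num_threads) {
    const int N = static_cast<int>(W.size());
    int found = 0;
    for (int n = thread_index + 1; n <= N; n += num_threads)
        if (feasible_with_first_row_count(W, n))
            ++found;
    return found;
}

With two threads on the 3! ones problem, thread 0 tests n = 1, 3, 5 and thread 1 tests n = 2, 4, 6, which is exactly the modified division discussed next.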

For the 3! ones problem, the modified search spaces are {1, 3, 5} and {2, 4, 6}. As a result, the solutions per thread would be 1/2, 2/2. For the 10! ones problem solved using 4 threads, we get 1/210, 2/30, 3/15, 4/15. There is still a significant imbalance, but the result is better than in the earlier case.

Graphical Overview

Performance

The input files used to test the solution were:

File       Description                                                                            O/P size
o19sisrs   Order 19 simple imperfect squared rectangles posted in the forum by john_e_lilley      3.6 MB
o20sisrs   Order 20 simple imperfect squared rectangles posted in the forum by john_e_lilley      13 MB
lcm16      As many ones as the least common multiple (LCM) of all numbers from 1 to 16 (720720)   337 MB
fact10     10! (3628800) ones                                                                     1.9 GB

[The original table also lists the number of problems, the number of solutions and the input file size for each file.]

Serial Performance

Here is a comparison of the execution times observed when using stream-based I/O vs. custom scanning and formatting on MTL.

[Table and chart: execution time in seconds for each input file, stream-based I/O vs. the custom scanner/formatter.]

Clearly, I/O using a custom scanner and formatter is approximately 6 times as fast as stream-based I/O.

Scaling

The following times were observed on MTL with a varying number of threads (1 Collector + Solvers):

[Table and charts: execution time in seconds vs. number of threads for o19sisrs, o20sisrs, lcm16 and fact10.]

Conclusion

For a file containing small problems, the scaling is variable. o19sisrs scales to 8 threads, producing a 300% speedup, while o20sisrs appears to scale to 16 threads, exhibiting a 400% speedup. This is because the variable partitioning scheme described earlier uses fixed partition size increments, which causes o19sisrs to be split into 6 parts while o20sisrs is split into 13 parts. Larger files are expected to produce better parallel performance.

The solution scales to 32 cores for large problems. The scaling is sub-linear, but the speed improves almost steadily, resulting in a 600% speedup for both lcm16 and fact10 using 32 threads. It is likely that the speedup in this case is limited due to the unavailability of processors on the MTL.
