Using CUDA for Solar Thermal Plant Computation


Instructor: Dr. Kwok-Bun Yue
Mentors: Dr. Michel Izygon, Peter Armstrong
Team: Sahithi Chalasani, Pranav Mantini, Claus Nilsson, Arunkumar Subramanian
Spring /4/2009

1.0 Abstract

Solar thermal power plants consist of a central tower surrounded by heliostats (mirrors). The heliostats focus sunlight on the tower, where the thermal energy is used to generate electricity. Each heliostat may be shaded by its neighbors, and the light it reflects may be blocked by them so that it does not reach the tower. The Solar Thermal Plant Computation application calculates the effective area of each heliostat. While the calculations needed to determine the shaded and blocked areas are relatively simple, the number of calculations needed to determine the interactions between the heliostats is immense even for relatively small fields.

The area of the representative heliostat that is shaded or blocked does not contribute to the power generated, so it is subtracted using a polygon clipper. The original program, designed by Tietronix Software, Inc., makes calls to the General Polygon Clipper (GPC) library. GPC is a large library, about 2500 lines of code, developed at The University of Manchester. Most of the processing time for calculating the coordinates is spent in GPC.

The GPC library could not be used for our purpose. Because the library resides on the host, its functions cannot be called from code running on the device. A polygon clipping algorithm more specific to the computation therefore had to be designed. The algorithm used in our design comes from the paper "Efficient clipping of arbitrary polygons" by Gunther Greiner and Kai Hormann. It was chosen because it is relatively more efficient than the more commonly used Sutherland-Hodgman algorithm. The data structures used for the polygons are very simple: each polygon is represented by a doubly linked list. The clipping algorithm calculates all the intersection points and then chooses among these points to create the desired polygon.

The current version of the application is single-threaded, which means that one run can take several hours. Since multiple runs are needed to judge the efficiency of a field layout throughout the year, testing design changes can be very time consuming. The purpose of this project is to demonstrate the feasibility of decreasing the application's run time by using Nvidia's CUDA (Compute Unified Device Architecture). CUDA allows an application to take advantage of the many cores of an Nvidia graphics processor to parallelize the calculations, thereby decreasing the time needed for each run.

Several issues with CUDA influenced our design. CUDA does not allow function calls from the GPU to the CPU, so using the GPC clipping library in its current form is not possible.

Table of Contents

1.0 Abstract
2.0 Introduction & Background
  2.1 Introduction
  2.2 Background
  2.3 Shading and blocking
3.0 Design and Implementation
  3.1 Technologies
  3.2 Architecture
  3.3 Design
    Clipping Algorithm
    CUDA Implementation
  3.4 Implementation Issues
4.0 Evaluation
5.0 Conclusion
  5.1 Further Work
References
Appendices
  A: Project Management and Team Information
  B: Major tasks and contributions
  C: Code comments
  D: Schedule

2.0 Introduction & Background

2.1 Introduction

Tietronix, Inc. has a single-threaded application used to calculate the efficiency of a solar thermal power plant at a given position, date, and time. To find an optimal positioning of the heliostats in the field, the efficiency of the field must be calculated many times, for different days of the year and times of day. Currently, calculating the efficiency of a single layout takes so long that exploring multiple layouts is very time consuming, which hinders the usability of the application. Tietronix therefore suggested using Nvidia's CUDA technology to decrease the time needed to calculate the efficiency of a field at a given time and date, in the hope that it could be reduced enough to make exploring multiple layouts of a field practical rather than just possible.

2.2 Background

Solar Thermal Plant: Concentrating solar power plants produce electricity by reflecting sunlight onto a central receiver, where the energy is used to heat a medium which ultimately drives electrical generators. Sunlight is reflected toward the receiver by mirrored devices called heliostats. Heliostats, sometimes numbering in the tens of thousands, are organized into fields around a tower holding the receiver at the appropriate height above the ground. Each heliostat tracks the movement of the sun and orients its mirror to redirect the sunlight to the central receiver.

Figure 2 - View of Solucar PS10 near Seville, Spain [3]

To create the optimum layout for a solar thermal power plant, the individual contribution of each heliostat must be calculated to ensure that as little energy (sunlight) as possible is wasted. An application like this one is used to generate a preliminary design as an initial layout. The design is preliminary because effectiveness is calculated for only a minority of the heliostats.

Figure 1 - A cell with the representative heliostat (in red) surrounded by neighboring heliostats. Tietronix [1]

The field is expressed as a grid, with the receiver in one cell surrounded by heliostats in all the other cells. Each cell has one representative heliostat, which stands in for the actual heliostats that will be located in that section of the field, and a number (8 to 84) of neighboring heliostats. The effectiveness of the representative heliostat is limited by the shadows cast on it by its neighbors, as well as by the portion of the light it reflects toward the receiver that is blocked by one or more of its neighbors. The cells are treated as independent and do not affect each other. Ultimately the total effective area is calculated for the entire grid, giving an effectiveness rating for a given time and day. It is this total effective area that the user wants to maximize in order to get the best value out of the power plant.

Factors which affect the calculations for a field include the size and shape of the mirrors, the spacing between heliostats, how the heliostats are placed, the time of day, the day of the year, and the geographical location of the plant. This application lets the user develop an overall layout before proceeding to another application which calculates the effectiveness of every heliostat in the field based on how each heliostat is influenced by its neighbors.

2.3 Shading and blocking

A field of heliostats suffers losses caused by shading and blocking by neighboring heliostats. Shading occurs, typically at low sun angles, when a heliostat casts its shadow on another heliostat located behind it. Blocking occurs when a heliostat in front of another heliostat blocks the reflected sunlight on its way to the receiver. The amount of sunlight reflected onto the central receiver depends on the total area of the heliostats that is neither shaded nor blocked. To optimize the energy generated by a solar thermal plant, the total area of the heliostats that is shaded or blocked must therefore be calculated. Figure 3 illustrates the concept of shading and blocking losses.
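As a rough illustration of the quantity being computed, the sketch below takes the polygon that remains after the shaded and blocked regions have been clipped away and expresses its area as a fraction of the full mirror. The names `Point2`, `polygon_area`, and `effective_fraction` are hypothetical and do not come from the Tietronix code; in the application the remaining polygon would be produced by the clipping step.

```c
#include <assert.h>
#include <stddef.h>

/* A 2D point in the plane of the representative heliostat's mirror. */
typedef struct { double x, y; } Point2;

/* Area of a simple polygon via the shoelace formula.
   Vertices may be listed clockwise or counter-clockwise. */
static double polygon_area(const Point2 *v, size_t n)
{
    double twice_area = 0.0;
    for (size_t i = 0; i < n; ++i) {
        size_t j = (i + 1) % n;               /* next vertex, wrapping around */
        twice_area += v[i].x * v[j].y - v[j].x * v[i].y;
    }
    if (twice_area < 0.0)
        twice_area = -twice_area;             /* orientation does not matter */
    return twice_area / 2.0;
}

/* Fraction of the mirror that still contributes power, given the polygon
   left over once shaded and blocked regions have been removed. */
static double effective_fraction(const Point2 *mirror, size_t n_mirror,
                                 const Point2 *remaining, size_t n_remaining)
{
    double full = polygon_area(mirror, n_mirror);
    assert(full > 0.0);                       /* a mirror must have area */
    return polygon_area(remaining, n_remaining) / full;
}
```

For a 4 by 2 mirror whose rightmost quarter is shaded, the remaining polygon is 3 by 2 and `effective_fraction` returns 0.75.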

Figure 3 - Shading and blocking losses. [5]

3.0 Design and Implementation

Tietronix was interested in determining whether using CUDA could improve the runtime of their current implementation. Peter Armstrong of Tietronix provided the team with the algorithm and equations needed to calculate the layout (the positions of the heliostats) of a solar thermal plant. The output of this algorithm is then used as input for the clipping algorithm, which calculates the percentage of each heliostat's mirror that actively contributes to the working of the power plant.

Currently Tietronix uses a code library called GPC (General Polygon Clipper) to handle clipping for their application and then calculates the percentages from the clipping results. CUDA does not allow function calls from the device (the GPU) to host (CPU) side functions such as GPC, so the team had to either call the clipping function from the host side, which would result in an application much like the current one, or implement the clipping functionality on the device. We chose to implement clipping on the device and started looking for a suitable algorithm. Initially we looked at porting the GPC code to run on CUDA, but decided against it for two reasons. First, the library is fairly large, with about 2500 lines of code. Second, it makes heavy use of dynamic memory allocation, which CUDA does not support on the device.

3.1 Technologies

The project requirements specified the use of CUDA to decrease the runtime of the current application. Tietronix also requested a Windows application, which led to the initial selection of Microsoft's Visual Studio 2008 as the IDE. Due to complications integrating the CUDA API with Visual Studio 2008, Visual Studio 2005 was chosen instead, as it provided the needed functionality and did not have issues with CUDA. The team used two Nvidia GPUs to test the CUDA code: an 8600M running on Windows XP Pro and an 8800 GTS running on 64-bit Windows Vista Ultimate.

CUDA is an extension to the C programming language that lets programmers take advantage of the floating-point computing power of a modern Nvidia graphics processing unit (GPU). CUDA applications divide into two distinct parts: code which runs on the host (the CPU) and code which runs on the device (the GPU). Host-side code can use the full range of C functionality plus CUDA-specific extensions. Device-side code is a subset of C extended with some device-specific commands. Using CUDA, a developer can convert parts (or all) of an application to execute in parallel on the GPU and achieve a performance gain with relatively little work. CUDA manages all creation, maintenance, and destruction of threads, leaving the developer to focus on optimizing the application to use the available resources in the most efficient manner.

3.2 Architecture

Modern graphics processors can have hundreds of thread processors and are capable of processing thousands of threads concurrently. While individually these processors are slower and less capable than a CPU core, their sheer number allows the GPU to churn through a large number of floating-point calculations in short order. Nvidia GPUs are also designed around very lightweight threads, which makes switching between threads very cheap and helps boost the efficiency of the GPU.

Nvidia GPUs are divided into multiple thread processors, each of which can run multiple threads concurrently. The number of threads and the number of thread processors vary from product to product, making it vital to tailor one's program to the exact model of GPU in order to achieve the maximum efficiency for an application.

3.3 Design

The project has two major parts. The first is a single-threaded C application used to implement the algorithms in a familiar environment for debugging and testing. By leaving CUDA out of the code, this single-threaded version cut down on the number of unknowns presented by the project. The second is the multithreaded version implemented using Nvidia's CUDA platform. Since CUDA is an extension to C, some of the code from the single-threaded application could be copied directly into the CUDA kernels, thereby reducing the amount of new, unproven code in the new environment.

Clipping Algorithm

Because a large part of the computation time is taken by the clipping algorithm, the team initially started to design a clipping algorithm more specific to the requirements. However, because the number of shading and blocking events that can occur in a cell is not known in advance, the number of vertices that the polygon clipping code takes as input is unknown. For this reason a more general algorithm had to be chosen. The clipping algorithm we chose has already been implemented in C, but that code contains many functions that are unnecessary for our computation, so the algorithm had to be implemented again. This algorithm was compared to Vatti's algorithm, a very widely used algorithm, and the comparison shows improved performance. [4]

CUDA Implementation

The CUDA implementation consists of one main function which runs on the host (CPU), two kernels, and a number of device-side functions. Kernels are functions which run on the device (the graphics card, containing the GPU and the device memory) and are callable from the host. Kernel launches are asynchronous: once a kernel is launched, control returns immediately to the calling code on the host, which can then choose to do something else or wait for the kernel to finish. Device-side functions (non-kernels) can only be called from the device.

The first kernel sets up the field and calculates where all the heliostats are, as well as their orientation with respect to the sun and the receiver. This allows us to determine the three-dimensional positions of the mirrors' vertices, which can then be projected into the 2D plane of the representative heliostat's mirror to determine how much of the representative mirror is shaded and/or blocked by the neighboring heliostats.

A kernel is executed by device-side threads, so when calling a kernel the caller specifies how many threads are to be started. Kernel one is run once per heliostat: the launch requests a grid of size m by n (which matches the grid of the power plant field) and specifies how many threads are to be run per cell in the grid (nine in our case, one for each heliostat in the cell). While this may not be the most efficient use of the available resources, it mirrors the logical structure of the field in question, which allows for a simpler design.

Furthermore, each cell in the grid is run as its own thread group, which shares a section of memory. This allows us to share information between the threads in a cell, which in our case means that certain calculated values of the representative heliostat can be shared with the neighboring heliostats. In order to share, the threads need to synchronize. In CUDA this means inserting a synchronization point into the code which tells each thread of the group (cell) to wait until every thread has reached that point. This may be inefficient, as it introduces a break in the parallel execution of the cell's threads (it might be more efficient to simply recalculate the values in question for every heliostat), but this remains to be tested.
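The launch geometry described above can be pictured with a serial C emulation. This is only a sketch under stated assumptions: the 3 by 3 arrangement of heliostats inside a cell, the `spacing` parameter, and the names `Pos`, `heliostat_index`, and `setup_field_emulated` are hypothetical, and the real kernel computes orientations as well as positions. The outer two loops stand in for the m by n grid of thread groups, the inner loop for the nine threads of one cell, and the local variable `shared_rep` for the memory shared within a cell.

```c
#include <assert.h>

#define THREADS_PER_CELL 9   /* one thread per heliostat in a cell (from the text) */
#define REP_THREAD 4         /* hypothetical: the centre slot of a 3 by 3 cell */

typedef struct { double x, y; } Pos;

/* Flat index of one heliostat: cell index times threads per cell, plus the
   thread's index inside the cell. */
static int heliostat_index(int cell_x, int cell_y, int grid_w, int thread)
{
    assert(thread >= 0 && thread < THREADS_PER_CELL);
    return (cell_y * grid_w + cell_x) * THREADS_PER_CELL + thread;
}

/* Serial emulation of the setup kernel for a grid_w by grid_h field.
   out must hold grid_w * grid_h * THREADS_PER_CELL entries. */
static void setup_field_emulated(int grid_w, int grid_h, double spacing, Pos *out)
{
    for (int cy = 0; cy < grid_h; ++cy) {
        for (int cx = 0; cx < grid_w; ++cx) {
            Pos shared_rep;   /* stands in for the cell's shared memory */

            /* Phase 1: on the device only thread REP_THREAD would write
               the representative heliostat's values. */
            shared_rep.x = cx * spacing;
            shared_rep.y = cy * spacing;

            /* On the device a barrier such as __syncthreads() would sit
               here; serially, finishing phase 1 first plays that role. */

            /* Phase 2: every thread reads the shared values. */
            for (int t = 0; t < THREADS_PER_CELL; ++t) {
                Pos *p = &out[heliostat_index(cx, cy, grid_w, t)];
                p->x = shared_rep.x + (t % 3 - 1) * spacing / 3.0;
                p->y = shared_rep.y + (t / 3 - 1) * spacing / 3.0;
            }
        }
    }
}
```

Recomputing the representative's values in every thread, the alternative mentioned above, would remove the barrier at the cost of repeating phase 1 nine times per cell.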

The second kernel implements the clipping functionality. This kernel processes the cells in parallel, but the heliostats within each cell are processed sequentially, because the output of the clipping function for one pairing (representative and neighboring heliostat) may be needed for the next pairing (if any). Currently the application is configured to first calculate the non-shaded area of the representative heliostats and then the non-blocked area. Calculating these two areas in parallel and then taking their intersection may be more efficient, but that is left for a later time.

The algorithm the team decided to implement uses dynamic memory allocation to build the polygons as the algorithm progresses. Since CUDA does not support dynamic memory allocation from the device side, we had to use a fixed-size array in place of the dynamic allocations. This leads to inefficient use of memory, since the array must be sized to hold the maximum number of vertices we foresee for any resulting polygon. This may be acceptable for our project, but for a production system it would be a major issue which must be addressed further. The algorithm also uses doubly linked lists to hold the polygons, and a vertex in one polygon may point to a vertex in another polygon (called its neighbor). In our implementation this pointer is replaced by a simple search function, which is not as efficient as following a pointer.

Currently both kernels, as well as the device functions, hold intermediate results in local variables during calculation. This leads to a large number of variable declarations and initializations. CUDA is sensitive to how memory is used, so this is one area that is likely to yield improvements once these temporary variables are removed. However, since we are not done debugging, they are still in the code.
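The first step of the clipping algorithm, finding where an edge of one polygon crosses an edge of the other, can be sketched in C as follows. This is a minimal sketch rather than the team's device code: the names `Point2` and `edge_intersect` are hypothetical, and degenerate cases (parallel edges, or an intersection exactly at a vertex) are simply reported as "no intersection" here, which a complete implementation would need to treat more carefully.

```c
#include <assert.h>

typedef struct { double x, y; } Point2;

/* Intersect edge p1-p2 with edge q1-q2.
   On success returns 1 and stores the relative position of the hit along
   each edge (alpha_p and alpha_q, both strictly between 0 and 1) together
   with the intersection point. Returns 0 if the edges do not properly cross. */
static int edge_intersect(Point2 p1, Point2 p2, Point2 q1, Point2 q2,
                          double *alpha_p, double *alpha_q, Point2 *hit)
{
    double dpx = p2.x - p1.x, dpy = p2.y - p1.y;   /* direction of edge p */
    double dqx = q2.x - q1.x, dqy = q2.y - q1.y;   /* direction of edge q */
    double denom = dpx * dqy - dpy * dqx;          /* 2D cross product */

    if (denom == 0.0)
        return 0;                                  /* parallel or collinear */

    double wx = q1.x - p1.x, wy = q1.y - p1.y;
    double a = (wx * dqy - wy * dqx) / denom;      /* position along p */
    double b = (wx * dpy - wy * dpx) / denom;      /* position along q */

    if (a <= 0.0 || a >= 1.0 || b <= 0.0 || b >= 1.0)
        return 0;                                  /* outside either edge */

    *alpha_p = a;
    *alpha_q = b;
    hit->x = p1.x + a * dpx;
    hit->y = p1.y + a * dpy;
    return 1;
}
```

The two alpha values are what allow each intersection point to be inserted in the correct order between the end points of its edge when several intersections fall on the same edge.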

3.4 Implementation Issues

At the beginning of the project the team encountered several problems regarding CUDA. Only one team member had access to an Nvidia GPU at home. One other team member was able to install the CUDA API and run it in emulation mode (in emulation mode the CPU simulates the GPU, allowing for easier debugging). There were also issues with making the CUDA compiler work well with Visual Studio 2008, which prompted a switch to Visual Studio 2005.

As programming on the CUDA side commenced, it became apparent that the single precision offered by older CUDA-capable hardware (which includes the devices available to the team) could not handle some of the vectors used in the application without truncation, which led to calculation errors throughout the application. Unfortunately, due to the unfinished nature of the application, we have not yet been able to determine the severity of these errors.

The clipping algorithms evaluated by the team made heavy use of dynamic memory allocation at runtime, which is not supported by CUDA. Therefore, the team had to redesign the chosen algorithm to use a fixed amount of memory based on the estimated maximum needed by the application.

4.0 Evaluation

Since the application is not complete, we are unable to determine whether converting it to use CUDA is worthwhile. The authors' guess is that it is, but we can offer no data to support this opinion. During our informal testing (for debugging purposes) the application has seemed responsive, with the runtime being under 20 seconds for every run, but so far we have only run the application to calculate shading. Including blocking in the run, as well as the effective area calculations and a total area calculation, would obviously affect the runtime, but we estimate that a total run can be performed in less than 60 seconds for a 10 by 10 grid with 9 heliostats per cell. Whether this would represent a satisfactory or worthwhile improvement to Tietronix is unknown.

It is now obvious that the group needed far better project management and communication, and the group leader takes full responsibility for the shortcomings in this area. Furthermore, the leader should have been far more proactive in ensuring that the project followed the planned timeline, and should not have accepted the excessive delays in various aspects of the project.

5.0 Conclusion

This project has shown that creating a massively multithreaded application to calculate the efficiency of a solar thermal power plant is possible. However, as we are still debugging the clipping part of the project, we can draw no formal conclusions. Using CUDA was easier than anticipated, even though the lack of dynamic memory allocation affected how we implemented the application. The team never moved on to optimizing the code for performance on the device, since the code is incomplete, but our impression is that this grid-based application will not achieve as much of an improvement as an application which calculates the effectiveness of every heliostat in a field.

5.1 Further Work

To make the CUDA technologies work in a production environment, the application developed by team 5 must be altered from a cell-oriented approach to an approach focusing on individual heliostats. While the application design would be almost identical to the one created for this project, the number of threads (heliostats) grouped together would depend on the actual hardware on which the application is to run, in order to optimize the usage of the available resources. The number of threads must be high enough to fully utilize the individual processors of the GPU, yet low enough that the kernel (CUDA method) can run on one thread processor.

The current implementation of the clipping algorithm relies on a simple fixed-size array (with insertion at a specific index) to handle the polygons. This implementation should be replaced with a doubly linked list designed to work with a fixed-size array. This array can either be sized to fit one polygon (and included for every cell), or one array can be designed to function as regular memory and hold all the vertices for all the polygons of the field. The array per cell is the easiest to implement, while the single array used as dynamic memory would probably use memory more efficiently (since not all polygons will be of the maximum size) but may not be efficient in CUDA due to how CUDA accesses device memory.
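One possible shape for the array-per-polygon variant is a circular doubly linked list whose links are array indices rather than pointers, so that no dynamic allocation is needed. The sketch below is an assumption about how this could look, not the team's implementation: `MAX_VERTS`, `Vertex`, `Polygon`, `polygon_init`, and `polygon_insert_after` are hypothetical names, and the capacity of 32 is arbitrary.

```c
#include <assert.h>

#define MAX_VERTS 32   /* hypothetical capacity, sized for the worst case */
#define NIL (-1)       /* marks "no vertex" */

typedef struct {
    double x, y;
    int prev, next;    /* indices into the pool instead of pointers */
} Vertex;

typedef struct {
    Vertex pool[MAX_VERTS];
    int count;         /* number of slots handed out so far */
    int head;          /* index of the first vertex, NIL when empty */
} Polygon;

static void polygon_init(Polygon *p)
{
    p->count = 0;
    p->head = NIL;
}

/* Insert a vertex after slot `after` (ignored when the polygon is empty).
   Returns the new vertex's index, or NIL when the pool is full. */
static int polygon_insert_after(Polygon *p, int after, double x, double y)
{
    if (p->count >= MAX_VERTS)
        return NIL;

    int idx = p->count++;
    Vertex *v = &p->pool[idx];
    v->x = x;
    v->y = y;

    if (p->head == NIL) {          /* first vertex: a circular list of one */
        v->prev = idx;
        v->next = idx;
        p->head = idx;
        return idx;
    }

    assert(after >= 0 && after < idx);
    int nxt = p->pool[after].next;
    v->prev = after;               /* splice between `after` and its successor */
    v->next = nxt;
    p->pool[after].next = idx;
    p->pool[nxt].prev = idx;
    return idx;
}
```

Inserting an intersection point between two existing vertices then touches only four indices, instead of shifting array elements as insertion at a specific index does. A link to a neighbor vertex in another polygon could be stored the same way, as an index into that polygon's pool, which would avoid the search function.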

References

1. Armstrong, P. An Algorithm for Shading and Blocking Computations of a Field of Heliostats Arranged in a Grid Layout. Available from Tietronix Software, Inc.; received February
2. Greiner, G. and Hormann, K. Efficient clipping of arbitrary polygons. ACM Trans. Graph. 17, 2 (Apr. 1998), DOI=
3. PS10 solar power tower 2.jpg. Retrieved from Wikipedia.org on April 23rd,
4. Greiner, G. and Hormann, K. Efficient Clipping of Arbitrary Polygons. Retrieved May 4th,
5. Thathireddy, K., Garre, S., Khorsand, S., Nandigam, T. Solar Thermal Plant Design and Operation Suite of Tools. UHCL Capstone Project. May 5th, Retrieved May 4th,

Appendices

A: Project Management and Team Information

Roles:

Application Design: Design efficient CUDA code for the solar thermal plant computation.
CUDA Programmer: Adapt the C code to use CUDA.
C Programmer: Implement the existing solar thermal plant computation algorithm and the chosen polygon clipping algorithm in C.
Research on clipping algorithms: Find or design a polygon clipping algorithm that is more effective. The algorithm used in our design comes from the paper "Efficient clipping of arbitrary polygons" by Gunther Greiner and Kai Hormann. It was chosen because it is relatively more efficient than the more commonly used Sutherland-Hodgman algorithm. The data structures used for the polygons are very simple.
Website Maintenance: Design and update the capstone website regularly.
Minutes and agendas: Write the minutes and agendas for all the team and mentor meetings.
Technical writing: Write the technical report.

B: Major tasks and contributions

Application Design: 50% Pranav and 50% Claus
CUDA Programmer: 100% Claus Nilsson
C Programmer: 100% Pranav Mantini
Research on clipping algorithms: 40% Arun, 20% Pranav, and 40% Sahithi
Website Maintenance: 50% Pranav, 20% Claus, 15% Sahithi, 15% Arun
Minutes and agendas: 100% Sahithi Chalasani
Technical writing: 50% Claus, 25% Pranav, 20% Sahithi, 5% Arun

C: Code comments

The code included on the accompanying disk comes as two Visual Studio 2005 projects, one for 64-bit systems and the other for 32-bit systems. The code is the same for both projects, but the project configuration requires that either the 64-bit or the 32-bit version of the CUDA API (both are available from Nvidia) be installed prior to compilation. A CUDA-enabled driver must also be installed for the Nvidia graphics processor on the system. Please see for more information.

The project code currently contains a large number of printf statements used for debugging, and therefore the project may not compile in regular debug mode. Please use EmuDebug instead, which allows output directly from kernels and device-side functions.

The code currently does not call the clipping kernel, because testing revealed a bug which made the clipping function go into an infinite loop. We currently believe the cause is a malformed polygon produced when too many neighboring heliostats shade the representative heliostat. More experiments are needed to determine the exact bug and fix it.

D: Schedule
