Overview of Project's Achievements



1 PalDMC Parallelised Data Mining Components
Final Presentation, ESRIN, 12/01/2012
Overview of Project's Achievements

2 Project Outline
Project's objectives: design and implement performance-optimised, parallelised versions of selected Data Mining algorithms for EO IIM.
- focus on efficient data handling
- focus on parallel processing
- focus on heterogeneous GP-GPU computing
Design Phase
- evaluation of parallel processing techniques
- rapid prototyping (single algorithm, multiple versions)
Implementation Phase
- implementation of selected algorithms

3 Efficient Data Processing: Preserving Data Locality
- Data should be kept close to the CPU in the storage hierarchy (caching).
- Slow data transfers should be reduced to a minimum.
- Data should not travel between parallel processing units (data locality).

4 Considered Options of Parallel Processing
- CPU SIMD (vectorised) instructions (SIMD)
- exploiting multiple CPUs and CPU cores (MIMD)
- general-purpose GPU computing (SIMD)
- distributed computing (MIMD)
SIMD (Single Instruction, Multiple Data): harder to program, good performance
MIMD (Multiple Instructions, Multiple Data): allows reuse of the existing sequential code
The right combination of these options leads to the best performance!

5 Rapid Prototyping: K-means algorithm
A variant of K-means was implemented for each considered option of parallel processing.
Why K-means?
- trivial algorithm
- resource demanding
- easily parallelisable
- one of the key KEO algorithms
- a KEO performance bottleneck
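As a reference point for the variants mentioned above, a minimal sequential K-means can be sketched as follows. This is an illustrative sketch only, not the project's code: the function names (`assign`, `update`, `kmeans`), the Euclidean distance and the handling of empty clusters are all assumptions.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(tuple(old))
        else:
            new_centroids.append(
                tuple(sum(coords) / len(members) for coords in zip(*members))
            )
    return new_centroids


def kmeans(points, centroids, iterations=50):
    """Run a fixed number of assignment/update iterations."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        centroids = update(points, labels, centroids)
    return centroids, assign(points, centroids)
```

The assignment step is independent for every point, which is what makes the algorithm easy to split across SIMD lanes, threads, GPUs or network nodes; only the centroid update needs a reduction over all points.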

6 CPU SIMD Instructions
- all modern CPUs have SIMD vector registers
- algorithms need to be re-engineered to fit SIMD processing

7 Multi-Threading: Exploiting Multiple CPUs and CPU Cores
- drawback: lack of load balancing

8 Multi-Threading: Exploiting Multiple CPUs and CPU Cores
- optimal computing model: pool of thread workers
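The worker-pool model can be sketched with Python's standard `concurrent.futures` module. The work function `process_chunk` and the chunk size are hypothetical placeholders chosen for illustration; the slides do not describe the actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(n_items, chunk_size):
    """Split range(n_items) into consecutive (start, stop) work units."""
    return [
        (start, min(start + chunk_size, n_items))
        for start in range(0, n_items, chunk_size)
    ]


def process_chunk(data, start, stop):
    """Hypothetical per-chunk work: sum of squares of one slice."""
    return sum(x * x for x in data[start:stop])


def parallel_sum_of_squares(data, n_workers=4, chunk_size=1000):
    # Many more chunks than workers: a worker that finishes early simply
    # takes the next pending chunk, so uneven chunk costs even out.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [
            pool.submit(process_chunk, data, start, stop)
            for start, stop in chunks(len(data), chunk_size)
        ]
        return sum(f.result() for f in futures)
```

Note that in CPython the global interpreter lock prevents pure-Python threads from running in parallel, so this sketch shows the scheduling model rather than an actual speed-up.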

9 General Purpose GPU Computing
- massively parallel SIMD architecture
- large number of simple and slow processors

10 General Purpose GPU Computing
- not every algorithm is well suited for the GPU
- serial code can be executed, but the performance drops significantly (generally slower than on a regular CPU)
- best performance when GPU and CPU 'collaborate'
- fast GPU processing is very sensitive to fast data fetching
- GP-GPU is based on a different paradigm than CPU calculation: existing CPU code is not easily portable without significant re-engineering
- existing libraries reduce the implementation effort

11 General Purpose GPU Computing
HW vendors: AMD (ATI) vs. NVidia
- both HW GP-GPU capable
- comparable performance
- the issue is the SW API (drivers) and support; NVidia seems to be ahead
SW API: OpenCL vs. NVidia's CUDA
- OpenCL: open standard; implementations need to mature
- CUDA: proprietary and limited to NVidia HW; stable and mature

12 Distributed Calculation
Simple RPC client/server model; first prototype using the FastRPC library.
- does not hide the networking details
- no tolerance to failures
- not scalable (works for a small number of nodes)
- not extensible (algorithm-specific interfaces)
- simple to implement
- flexible
Diagram: Master Client, Slave Servers.
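One way to picture the master/slave split for K-means is the in-process sketch below. It performs no networking and does not use FastRPC; the `Worker` class, its `partial_sums` method and `master_iteration` are hypothetical names standing in for a slave server, its remote procedure and the master client.

```python
import math


class Worker:
    """Stand-in for a slave server holding one shard of the data."""

    def __init__(self, shard):
        self.shard = shard

    def partial_sums(self, centroids):
        """Return per-cluster coordinate sums and point counts for the shard."""
        dim = len(centroids[0])
        sums = [[0.0] * dim for _ in centroids]
        counts = [0] * len(centroids)
        for p in self.shard:
            k = min(range(len(centroids)),
                    key=lambda i: math.dist(p, centroids[i]))
            counts[k] += 1
            for d in range(dim):
                sums[k][d] += p[d]
        return sums, counts


def master_iteration(workers, centroids):
    """One K-means iteration: merge the workers' partial results."""
    dim = len(centroids[0])
    total = [[0.0] * dim for _ in centroids]
    count = [0] * len(centroids)
    for w in workers:  # in a real system: one remote call per slave
        sums, counts = w.partial_sums(centroids)
        for k in range(len(centroids)):
            count[k] += counts[k]
            for d in range(dim):
                total[k][d] += sums[k][d]
    new_centroids = []
    for k, old in enumerate(centroids):
        if count[k]:
            new_centroids.append(tuple(t / count[k] for t in total[k]))
        else:
            new_centroids.append(tuple(old))  # empty cluster: keep as is
    return new_centroids
```

In this arrangement only the centroids and the small per-cluster sums would cross the network in each iteration, while the data shards stay on the nodes that hold them.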

13 HW Test Bed I
The HW test bed is a simple cluster built of 'cheap' commodity desktop PC HW. It is used for rapid prototyping and testing of the developed SW.
3 worker nodes:
- GPU: NVidia GTX 285, 2 GiB GDDR3
- CPU: AMD Ph.II 3.0 GHz
- RAM: 4 GiB, 1.33 GHz
- HD: 2x 256 GB SATA II, SW RAID0
User interface:
- CPU: AMD Ph.II
- RAM: 4 GiB, 1.66 GHz
- HD: 2x 500 GB SATA II, SW RAID0
Diagram labels: GigE service network, internet, 2x GigE data network

14 HW Test Bed II
2 networks:
- service network: standard switched 1 GigE
- data network: 2 bonded switched 1 GigE, dedicated and optimised for large data transfers
SW configuration:
- Scientific Linux 5.4 (64-bit)
- NVidia CUDA
- FastRPC
Diagram labels: GigE service network, internet, 2x GigE data network

15 K-means Prototype Benchmark
K-means Execution Time
Columns: single iteration / s; 50 iterations; network transfer / s; 50 iter. worst case total
C single thread: min 3 s min 15 s
C multi-thread: min 19 s min 31 s
SSE2 single thread: min 18 s min 30 s
SSE2 multi-thread: min 53 s min 5 s
CUDA 1xGPU: s min 7 s
CUDA 2xGPU: s s
FastRPC 1xCPU: min 1 s min 22 s
FastRPC 2xCPU: min 32 s min 51 s
FastRPC 3xCPU: min 5 s min 25 s
FastRPC 1xGPU: min min 22 s
FastRPC 2xGPU: s s
FastRPC 3xGPU: s s
50-iteration K-means runs on 1 GiB of processed data (cube of 4096x8192x8 32-bit FP numbers).
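The stated data volume can be checked with a few lines of arithmetic. The variable names below are illustrative; the only assumption is that a 32-bit floating-point number occupies 4 bytes.

```python
# Dimensions of the benchmark data cube as given on the slide.
width, height, bands = 4096, 8192, 8
bytes_per_value = 4  # 32-bit floating point

n_values = width * height * bands      # 2**12 * 2**13 * 2**3 = 2**28
n_bytes = n_values * bytes_per_value   # 2**28 * 2**2 = 2**30
gib = n_bytes / 2**30                  # size in GiB
```

The cube therefore holds 2^28 values and occupies exactly 2^30 bytes, i.e. 1 GiB.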

16 K-means Prototype Benchmark: local calculation

17 K-means Prototype Benchmark: distributed calculation

18 K-means Prototype Benchmark: distributed GPU calculation I

19 K-means Prototype Benchmark: distributed GPU calculation II

20 PalDMC Implementation Phase
Selected algorithms:
- OTB Pan-Sharpening
- OTB Road-Extraction
- MEEO Optical Data Segmentation
- K-Means Clustering (already implemented)
KEO integration (CPU variants only)

21 MEEO Segmentation: brief description
- the source code was kindly provided by MEEO Srl. (NDA signed between ISS and MEEO)
- input: labelled (after clustering) single-band image
- region-growing segmentation
- calculation of segment geometrical processing
- single-pass algorithm
Performance optimisations applied:
- minor algorithm re-arrangement
- applied multi-threading
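The MEEO code itself is not public, so the following is only a generic sketch of region growing on a labelled image, not the MEEO algorithm. It assumes 4-connectivity, uses a breadth-first flood fill rather than a single-pass scheme, and picks area and bounding box as example geometric quantities; the function name `segment` is hypothetical.

```python
from collections import deque


def segment(labels):
    """Grow 4-connected regions of equal label.

    Returns the segment-id map and a list of per-segment statistics.
    """
    rows, cols = len(labels), len(labels[0])
    seg = [[-1] * cols for _ in range(rows)]  # -1 marks unvisited pixels
    stats = []
    for r0 in range(rows):
        for c0 in range(cols):
            if seg[r0][c0] != -1:
                continue
            sid = len(stats)  # next free segment id
            lab = labels[r0][c0]
            seg[r0][c0] = sid
            queue = deque([(r0, c0)])
            area, rmin, rmax, cmin, cmax = 0, r0, r0, c0, c0
            while queue:
                r, c = queue.popleft()
                area += 1
                rmin, rmax = min(rmin, r), max(rmax, r)
                cmin, cmax = min(cmin, c), max(cmax, c)
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and seg[nr][nc] == -1 and labels[nr][nc] == lab):
                        seg[nr][nc] = sid
                        queue.append((nr, nc))
            stats.append({"label": lab, "area": area,
                          "bbox": (rmin, cmin, rmax, cmax)})
    return seg, stats
```

Each pixel is visited once, so the cost grows linearly with the image size, which is in line with the slide's observation that the algorithm is fast relative to I/O.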

22 MEEO Segmentation: Test Data Samples
image: 7662x4661, single band, 8 bits/pixel

23 MEEO Segmentation Benchmark
            cached data   non-cached data
original    7.13 s        8.98 s
modified    3.82 s        6.02 s
Conclusions:
- the algorithm is reasonably fast relative to I/O rates
- not much room is left for performance improvements
- effort spent on further performance optimisation would not bring a significant speed-up of the algorithm

24 PalDMC OTB Simple Pan-Sharpening
Prerequisite:
- the Low-Pass Filter produces an image with the spectral (in the Fourier domain) properties of the XS image
- this prerequisite is not addressed by OTB
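The role of the low-pass filter can be illustrated with a ratio-based fusion sketch. It assumes the simple scheme in which each XS band is multiplied by the panchromatic image and divided by its low-pass-filtered version, that the XS bands are already resampled to the panchromatic grid, and that a box filter is an acceptable stand-in for the low-pass filter. The names `box_filter` and `pan_sharpen` are hypothetical and this is neither the OTB nor the ISS code.

```python
def box_filter(img, radius=1):
    """Mean filter with edge clamping; a stand-in for the low-pass filter."""
    rows, cols = len(img), len(img[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc, n = 0.0, 0
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr = min(max(r + dr, 0), rows - 1)  # clamp to the image
                    cc = min(max(c + dc, 0), cols - 1)
                    acc += img[rr][cc]
                    n += 1
            out[r][c] = acc / n
    return out


def pan_sharpen(xs_bands, pan, radius=1, eps=1e-12):
    """Per pixel and band: fused = xs * pan / lowpass(pan)."""
    low = box_filter(pan, radius)
    return [
        [
            [xs[r][c] * pan[r][c] / max(low[r][c], eps)
             for c in range(len(pan[0]))]
            for r in range(len(pan))
        ]
        for xs in xs_bands
    ]
```

The ratio pan / lowpass(pan) carries only the fine spatial detail, so the result keeps the XS colours only if the filtered panchromatic image really resembles the XS image spectrally, which is the prerequisite named on the slide.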

25 PalDMC Simple Pan-Sharpening Results
- source panchromatic image
- source multi-band image
OTB demo images: QuickBird, Toulouse city

26 PalDMC Simple Pan-Sharpening Results
- OTB Simple PAN Sharpening
- Alt. Simple PAN Sharpening (Least-Square fit.)
Both images were calculated using the ISS implementation. The OTB Simple PAN Sharpening image is identical to that calculated by OTB.

27 PalDMC OTB Simple Pan-Sharpening
ISS implementation:
- sequential variant approx. 2x faster than OTB
- processed by parts (tiles)
- easy to parallelise
- small memory footprint (scalable)
OTB implementation:
- extremely large memory consumption
- larger images require more than the available RAM
- performance degradation due to memory swapping
- requested streamed processing has no visible effect

28 PalDMC Alternative Simple Pan-Sharpening
Prerequisite:
- the bands' linear combination parameters must be known
- these can be known a priori (user defined) or calculated, e.g., by Least-Square Fitting
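The fitting step can be sketched as an ordinary least-squares problem solved through the normal equations. The sketch assumes the goal is to find weights w such that the weighted sum of the XS bands approximates the panchromatic image; how the slides' method then uses the weights is not shown here. The helper names `solve` and `fit_band_weights` are hypothetical, and no handling of singular systems is included.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_band_weights(xs_bands, pan):
    """Weights w minimising the sum over pixels of (sum_b w_b*xs_b - pan)^2."""
    flat = [[v for row in band for v in row] for band in xs_bands]
    target = [v for row in pan for v in row]
    n = len(flat)
    # Normal equations: (X^T X) w = X^T pan
    a = [[sum(x * y for x, y in zip(flat[i], flat[j])) for j in range(n)]
         for i in range(n)]
    b = [sum(x * t for x, t in zip(flat[i], target)) for i in range(n)]
    return solve(a, b)
```

For a handful of bands the normal-equation matrix is tiny, so the cost is dominated by the single pass over the pixels needed to accumulate it.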

29 PalDMC OTB Road Extraction: Test Case
6000x6000x4 Spot free sample, source: SpotImage, Fr.

30 PalDMC OTB Road Extraction: Test Case
6000x6000x4 Spot free sample, source: SpotImage, Fr.

31 PalDMC OTB Road Extraction
OTB Road Extraction is a composite of OTB/ITK filters:
- spectral angle distance from a reference pixel value
- several filters highlighting linear features
- vectorisation of the linear features to paths
- paths' refinement (optional)
- paths' rasterisation
Issues:
- unreasonably high memory footprint
- extremely long computing times

32 PalDMC OTB Road Extraction: profiling
Filter                                time / s
- RoadExtractionFilter
- SpectralAngleDistanceFilter
- GenericRoadExtractionFilter
SquareRootImageFilter                 0.13
GradientFilter                        2.48
NeighborhoodScalarProductFilter       1.26
RemoveIsolatedByDirectionFilter       0.92
RemoveWrongDirectionFilter            0.14
NonMaxRemovalByDirectionFilter        0.79
VectorizationPathListFilter           4.07
FirstSimplifyPathListFilter           0.14
BreakAngularPathListFilter            0.4
FirstRemoveTortuousPathListFilter     0.05
LinkPathListFilter
SecondSimplifyPathListFilter          0.13
SecondRemoveTortuousPathListFilter    0.02
LikelihoodPathListFilter              0.06

33 PalDMC OTB Road Extraction: LinkPathFilter
OTB LinkPathFilter:
- tries to link short path fragments to larger ones
- calculates distances between all nodes of all paths
- compares nodes even if the paths are 'far away' (!)
Optimisation #1:
- use of bounding boxes to select close paths (bounding boxes extended by the distance threshold)
- reduces the number of node-node comparisons
Optimisation #2:
- paths processed by local subsets (tiles)
- reduces the number of path-path comparisons
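Optimisation #1 can be sketched as a cheap bounding-box pre-filter in front of the expensive node-to-node distance test. Paths are assumed to be lists of (x, y) nodes; the function names are hypothetical and the code is an illustration, not the ISS reimplementation. Optimisation #2 (tiling) is not shown.

```python
import math


def bbox(path, margin=0.0):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax), grown by margin."""
    xs = [p[0] for p in path]
    ys = [p[1] for p in path]
    return (min(xs) - margin, min(ys) - margin,
            max(xs) + margin, max(ys) + margin)


def boxes_overlap(a, b):
    """True if two (xmin, ymin, xmax, ymax) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def candidate_pairs(paths, threshold):
    """Index pairs (i, j), i < j, that may contain nodes within threshold."""
    extended = [bbox(p, threshold) for p in paths]
    plain = [bbox(p) for p in paths]
    return [
        (i, j)
        for i in range(len(paths))
        for j in range(i + 1, len(paths))
        if boxes_overlap(extended[i], plain[j])
    ]


def min_node_distance(p, q):
    """Smallest Euclidean distance between any node of p and any node of q."""
    return min(math.dist(a, b) for a in p for b in q)


def close_pairs(paths, threshold):
    """Run the node-node test only on pairs that passed the box test."""
    return [
        (i, j)
        for i, j in candidate_pairs(paths, threshold)
        if min_node_distance(paths[i], paths[j]) <= threshold
    ]
```

The pre-filter is conservative: any node within the threshold of path i must lie inside path i's extended box, so no linkable pair is discarded, while distant paths are rejected with a constant-time test.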

34 PalDMC OTB Road Extraction: benchmark, LinkPathFilter reimplementation
original OTB time / hour:min:sec    ISS LinkPath time / hour:min:sec    Speed-Up
01:41:32                            00:01:08
01:25:05                            00:01:06
01:58:03                            00:01:14
00:30:21                            00:00:54
00:27:52                            00:00:53
01:15:11                            00:01:06
01:02:48                            00:01:03
01:12:44                            00:01:06
Results for selected AVNIR-2 products (provided by ESA).
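The speed-up factors can be recomputed from the listed times. The sketch assumes that the n-th original OTB time and the n-th ISS time belong to the same AVNIR-2 product; the helper name `seconds` is illustrative.

```python
def seconds(hms):
    """Convert an 'hh:mm:ss' string to a number of seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return 3600 * h + 60 * m + s


# Times as listed on the slide, paired by position (assumption).
otb = ["01:41:32", "01:25:05", "01:58:03", "00:30:21",
       "00:27:52", "01:15:11", "01:02:48", "01:12:44"]
iss = ["00:01:08", "00:01:06", "00:01:14", "00:00:54",
       "00:00:53", "00:01:06", "00:01:03", "00:01:06"]

speed_up = [seconds(a) / seconds(b) for a, b in zip(otb, iss)]
```

Under that pairing assumption the factors range from roughly 31.5 to 95.7.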

35 PalDMC OTB Road Extraction: benchmark II
Road extraction: pixel-processing part of the algorithm
Columns (time / sec): single thread / CPU; multi-thread / CPU; CUDA / GPU; CUDA / GPU 2x
Results for selected AVNIR-2 products (provided by ESA).

36 The End

Parallelised Data-Mining Components

page: 1 of 11
Parallelised Data-Mining Components
Name, Signature, Date
Prepared by: Martin Pačes / ISS, 19/01/2012
Approved by:
Accepted by:
page: 2 of 11
Distribution List
Name, Company / Agency, E-mail
Sergio