In the name of Allah. Massive Data Algorithmics: An Introduction
Overview: MADALGO; SCALGO; Basic Concepts; The TerraFlow Project; STREAM; The TerraStream Project; TPIE
MADALGO- Introduction: Center for MAssive Data ALGOrithmics. A major basic research center funded by the Danish National Research Foundation. It covers all areas of the design, analysis and implementation of algorithms and data structures for processing massive data.
MADALGO- Four core research areas: I/O-efficient algorithms. Algorithms designed in a two-level external memory (or I/O) model. The memory hierarchy consists of a main memory of limited size M and an external memory (disk) of unlimited size. The goal is to minimize the number of I/O operations (I/Os), i.e. the number of times a block of B consecutive elements is read from or written to disk.
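To make the cost measure concrete, here is a minimal Python sketch of how I/Os are counted in this model. The function names are hypothetical and chosen only for illustration. Scanning N consecutive elements costs ceil(N/B) I/Os, and the standard sorting bound in this model is on the order of (N/B) log_{M/B}(N/B) I/Os, which the second function evaluates with constant factors ignored.

```python
import math


def scan_ios(n: int, b: int) -> int:
    """I/Os needed to read n consecutive elements in blocks of b elements."""
    return -(-n // b)  # integer ceiling of n / b


def sort_ios_bound(n: int, m: int, b: int) -> float:
    """Evaluate (N/B) * log_{M/B}(N/B), the I/O-model sorting bound without constants."""
    blocks = n / b
    return blocks * max(1.0, math.log(blocks, m / b))


if __name__ == "__main__":
    # Hypothetical sizes: 2^20 elements, memory of 2^10 elements, blocks of 2^5 elements.
    print(scan_ios(2**20, 2**5))
    print(sort_ios_bound(2**20, 2**10, 2**5))
```

The point of the comparison is that a scan touches each block once, while sorting needs only a logarithmic number of such passes, with a logarithm base of M/B rather than 2.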
MADALGO- Four core research areas: cache-oblivious algorithms. Algorithms designed in the I/O model but without knowledge of M and B, and then analyzed as I/O-model algorithms. Because the algorithm does not depend on M and B, the analysis holds simultaneously on all levels of any multi-level memory hierarchy.
MADALGO- Four core research areas: streaming algorithms. Only one sequential pass (or a small constant number of passes) over the data is allowed. The goal is to solve a given problem using significantly less space than the input data size, and to process each data element as fast as possible.
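As one illustration of these constraints, the sketch below uses the Misra-Gries frequent-items summary, a classic one-pass technique chosen here as an example (it is not taken from the slides). It keeps at most k - 1 counters regardless of the stream length, and any item that occurs more than n/k times in a stream of n items is guaranteed to be among the surviving keys.

```python
def misra_gries(stream, k):
    """One pass over `stream`, keeping at most k - 1 counters.

    Every item occurring more than len(stream) / k times survives as a key;
    the stored counts are lower bounds, not exact frequencies.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    data = "a b a c a d a e a".split()
    print(misra_gries(data, 2))
```

A second pass over the data can turn the candidate keys into exact counts if that is needed.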
MADALGO- Four core research areas: algorithm engineering. The design and analysis of practical algorithms; efficient implementation of these algorithms; experimentation that provides insight into their applicability and further improvements.
SCALGO: SCALable algorithmics. Founded in 2009 in Aarhus, Denmark. Mission: to bring cutting-edge massive terrain data-processing technology to market.
Terrain: the vertical and horizontal dimensions of the land surface.
LIDAR: Light Detection And Ranging. An optical remote sensing technology that measures the distance to, or other properties of, a target by illuminating the target with light, often using pulses from a laser.
Point cloud: a set of vertices in a three-dimensional coordinate system, usually defined by X, Y, and Z coordinates. Typically intended to be representative of the external surface of an object.
DEM: Digital elevation model. A digital model or 3D representation of a terrain's surface. The two most used types of DEM are the regular grid and the triangulated irregular network (TIN).
Regular grid DEM: a matrix of equally spaced points, each point having x, y and z coordinate values.
Regular grid DEM- Quadtree: a tree data structure in which each internal node has exactly four children. Most often used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions.
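The recursive subdivision can be sketched as follows. This is a minimal in-memory illustration, assuming 2D points inside a square region and a hypothetical `capacity` parameter that decides when a region is split; the dictionary layout is an illustrative choice, not a prescribed format.

```python
def build_quadtree(points, x0, y0, size, capacity=1, max_depth=16):
    """Build a quadtree over the square [x0, x0+size) x [y0, y0+size).

    A leaf stores its points; an internal node has exactly four children,
    one per quadrant (index 0: low x / low y, 1: high x / low y,
    2: low x / high y, 3: high x / high y).
    """
    if len(points) <= capacity or max_depth == 0:
        return {"leaf": True, "points": points}
    half = size / 2
    quadrants = [[], [], [], []]
    for x, y in points:
        index = (1 if x >= x0 + half else 0) + (2 if y >= y0 + half else 0)
        quadrants[index].append((x, y))
    children = [
        build_quadtree(quadrants[i], x0 + (i % 2) * half, y0 + (i // 2) * half,
                       half, capacity, max_depth - 1)
        for i in range(4)
    ]
    return {"leaf": False, "children": children}


def count_leaves(node):
    """Number of leaf regions in the tree."""
    if node["leaf"]:
        return 1
    return sum(count_leaves(child) for child in node["children"])


if __name__ == "__main__":
    tree = build_quadtree([(1, 1), (3, 1), (1, 3), (3, 3)], 0, 0, 4)
    print(count_leaves(tree))
```

The `max_depth` guard stops the recursion when several points share the same coordinates and could never be separated.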
Triangulated Irregular Network (TIN): irregularly distributed nodes and lines with three-dimensional coordinates, arranged in a network of non-overlapping triangles.
TIN- Delaunay triangulation: a triangulation of a set of points such that no point lies inside the circumcircle of any triangle. It maximizes the minimum angle over all the angles of the triangles in the triangulation, and so tends to avoid skinny triangles.
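The empty-circumcircle condition can be checked with the standard determinant predicate. The sketch below assumes the triangle vertices are given in counter-clockwise order and uses plain floating-point arithmetic, so it is not robust for nearly co-circular points; the function name is illustrative.

```python
def in_circumcircle(a, b, c, p):
    """True if p lies strictly inside the circumcircle of triangle (a, b, c).

    Assumes a, b, c are 2D points given in counter-clockwise order.
    """
    # Translate so that p is the origin, then evaluate the 3x3 determinant.
    ax, ay = a[0] - p[0], a[1] - p[1]
    bx, by = b[0] - p[0], b[1] - p[1]
    cx, cy = c[0] - p[0], c[1] - p[1]
    det = (
        (ax * ax + ay * ay) * (bx * cy - by * cx)
        - (bx * bx + by * by) * (ax * cy - ay * cx)
        + (cx * cx + cy * cy) * (ax * by - ay * bx)
    )
    return det > 0


if __name__ == "__main__":
    triangle = ((0, 0), (1, 0), (0, 1))
    print(in_circumcircle(*triangle, (0.25, 0.25)))
    print(in_circumcircle(*triangle, (2, 2)))
```

A triangulation is Delaunay exactly when this predicate is false for every triangle and every input point that is not one of its vertices.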
The TerraFlow Project: emerged from experience with terrain analysis applications that do not scale up to large datasets. A software package for computing flow routing and flow accumulation on massive grid-based terrains, based on theoretically optimal algorithms designed using external memory paradigms.
Flow direction, flow routing and flow accumulation: The flow directions of a cell are the directions in which water would flow if poured onto the terrain at that cell; water cannot go uphill. The flow routing problem is the problem of assigning flow directions to all cells in the DEM such that (1) the flow directions do not induce any cycles and (2) every cell has a flow path off the edge of the terrain. The flow accumulation of a terrain is an index that estimates the surface runoff for each cell in the terrain.
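A minimal in-memory sketch of single-direction flow routing on a small grid is shown below. It assumes the DEM is a list of rows of heights, uses the eight surrounding cells as neighbours, and picks the steepest strictly downhill neighbour, so the directions cannot form a cycle. The `OFF_EDGE` marker and the function name are hypothetical. Interior cells with no lower neighbour (sinks) are returned as `None`; satisfying condition (2) for them needs extra handling that this sketch does not attempt.

```python
import math

OFF_EDGE = "off"  # hypothetical marker: water leaves the terrain here


def steepest_descent(dem):
    """Return, for each cell, the (row, col) of its steepest lower neighbour."""
    rows, cols = len(dem), len(dem[0])
    result = [[None] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best_drop, best = 0.0, None
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    nr, nc = r + dr, c + dc
                    if (dr, dc) == (0, 0) or not (0 <= nr < rows and 0 <= nc < cols):
                        continue
                    # Height drop per unit distance (diagonals are longer).
                    drop = (dem[r][c] - dem[nr][nc]) / math.hypot(dr, dc)
                    if drop > best_drop:
                        best_drop, best = drop, (nr, nc)
            on_edge = r in (0, rows - 1) or c in (0, cols - 1)
            if best is not None:
                result[r][c] = best
            elif on_edge:
                result[r][c] = OFF_EDGE
    return result


if __name__ == "__main__":
    dem = [[9, 8, 7],
           [8, 5, 6],
           [7, 6, 1]]
    for row in steepest_descent(dem):
        print(row)
```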
STREAM- Introduction: STREAM stands for Scalable Techniques for hi-Resolution Elevation data Analysis and Modeling. Located in the CS department at Duke University and funded by the U.S. Army Research Office.
STREAM- Projects: Constructing DEM. Two methods were developed for efficiently converting LIDAR point sets to more conventional formats: grid construction, which uses a quad-tree segmentation, and TIN construction, which uses a Delaunay triangulation algorithm. Terrain Flow Modeling: improvements to existing work done as part of the TerraFlow project.
STREAM- Projects: Noise Removal. There is some level of noise in DEMs derived from LIDAR. The method computes a persistence score for topological features and uses this score to remove small topological features that are likely the result of noise.
STREAM- Projects: Hierarchical Watershed Decomposition. Partitions a terrain into a hierarchy of nested watersheds.
STREAM- Projects: Topographic Change. Detecting topographic change can quickly identify beach dunes damaged by hurricanes, monitor urban development, or measure change in forest growth.
TerraSTREAM- Introduction: a series of libraries and front-ends for these libraries. Allows the user to perform a series of computational tasks on very large digital elevation models. The data is represented either as a TIN or as a grid. A collaboration between Duke University CS researchers and researchers at MADALGO.
TerraStream- Features: DEM Construction. Computes a digital elevation model (DEM) from a point cloud. The input data is typically gathered using LIDAR. Constructs both TINs and grids.
TerraStream- Features: DEM Topological Conditioning. Simplifies digital elevation models by first identifying and then removing insignificant geographical features. Significance is measured by the feature's height, area or volume, or any combination of these. A feature is insignificant if its significance is smaller than a threshold specified by the user.
TerraStream- Features: Flow Routing. Computes flow directions for each data point in a DEM. The routing models supported are steepest-flow-descent, multiple-flow-directions and flux decomposition. Flow Accumulation: accumulates amounts of, e.g., water on a DEM along flow paths as computed by the flow routing module.
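For the single-direction case, flow accumulation can be sketched in memory as follows. The sketch assumes each cell starts with one unit of water and that `downstream[r][c]` is either the (row, col) of a strictly lower cell or `None` when water leaves the grid or stops; both the representation and the function name are illustrative assumptions. Processing cells from highest to lowest guarantees that a cell has received all upstream water before it passes its total on.

```python
def flow_accumulation(dem, downstream):
    """Accumulate one unit of water per cell along the given flow directions.

    dem: list of rows of heights.
    downstream: same shape; each entry is a (row, col) target that is
        strictly lower than the cell itself, or None.
    """
    rows, cols = len(dem), len(dem[0])
    acc = [[1.0] * cols for _ in range(rows)]
    # Visit cells in order of decreasing height.
    cells = sorted(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: dem[rc[0]][rc[1]],
        reverse=True,
    )
    for r, c in cells:
        target = downstream[r][c]
        if target is not None:
            tr, tc = target
            acc[tr][tc] += acc[r][c]
    return acc


if __name__ == "__main__":
    dem = [[3, 2, 1]]
    downstream = [[(0, 1), (0, 2), None]]
    print(flow_accumulation(dem, downstream))
```

This version holds the whole grid in main memory, so it only illustrates the definition and makes no attempt at I/O efficiency on massive terrains.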
TerraStream- Features: Flood Simulation. Flood Mask: computes a mask of the cells that are flooded if the water level were raised 'x' units. General: transforms a DEM to a new DEM in which the height of each cell is the minimum height to which the water level needs to be raised for that particular cell to flood.
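One way to illustrate both outputs is a priority-flood sweep, sketched below under explicit assumptions that are not taken from the slides: water enters from the edge of the grid, cells are 4-connected, and the whole DEM fits in memory. For each cell, `flood_levels` computes the lowest water level at which the cell is connected to the edge through cells that are all under water; thresholding that result gives a mask. The function names are hypothetical.

```python
import heapq


def flood_levels(dem):
    """Minimum water level at which each cell floods, assuming water enters at the grid edge."""
    rows, cols = len(dem), len(dem[0])
    level = [[None] * cols for _ in range(rows)]
    heap = []
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                level[r][c] = dem[r][c]
                heapq.heappush(heap, (dem[r][c], r, c))
    while heap:
        h, r, c = heapq.heappop(heap)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and level[nr][nc] is None:
                # Water reaching (nr, nc) through (r, c) must clear both cells.
                level[nr][nc] = max(h, dem[nr][nc])
                heapq.heappush(heap, (level[nr][nc], nr, nc))
    return level


def flood_mask(dem, water_level):
    """True for every cell that is flooded when the water stands at water_level."""
    return [[lv <= water_level for lv in row] for row in flood_levels(dem)]


if __name__ == "__main__":
    dem = [[5, 5, 5],
           [5, 1, 5],
           [5, 3, 5]]
    print(flood_levels(dem))
    print(flood_mask(dem, 3))
```

In the example the centre cell has height 1 but only floods at level 3, because the lowest route from the edge crosses the cell of height 3.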
TerraStream- Features: Contour Map Computation. Computes the contour map of a terrain.
TerraStream- Features: Raster Quality Assessment. Takes a raster and a point cloud and computes how far the center of each raster cell is from the closest point in the point cloud. This makes it easy to spot areas of the grid where there are no points nearby. If the point cloud is the same one used for generating the input raster, this can be used for quality control of the point cloud, the classification algorithm used and the produced raster.
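The computation can be sketched with a brute-force search, which is only practical for small inputs. The sketch assumes square cells of side `cell_size`, a raster whose lower-left corner is at `origin`, and points given as (x, y, z) tuples of which only x and y are used; these parameter names are illustrative.

```python
import math


def nearest_point_distances(rows, cols, cell_size, points, origin=(0.0, 0.0)):
    """Horizontal distance from each raster cell centre to its closest point."""
    distances = []
    for r in range(rows):
        row = []
        for c in range(cols):
            cx = origin[0] + (c + 0.5) * cell_size
            cy = origin[1] + (r + 0.5) * cell_size
            row.append(min(math.hypot(cx - p[0], cy - p[1]) for p in points))
        distances.append(row)
    return distances


if __name__ == "__main__":
    cloud = [(0.5, 0.5, 10.0)]
    print(nearest_point_distances(1, 2, 1.0, cloud))
```

Cells with large values mark regions that the point cloud does not cover well, which is where the raster heights are least reliable.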
TerraStream- Features: Watershed Hierarchy Construction: constructs a Pfafstetter labeling of the watersheds of a DEM. LS-Factor Computation: the LS-factor is an aggregate of the slope length factor (L) and the slope steepness factor (S), used to estimate the effects of slope length and steepness on erosion. Format Flexibility: reading and writing mosaic grids in many common formats.
TPIE- Introduction: TPIE is the Templated Portable I/O Environment, a tool-box providing efficient and convenient tools to ease the implementation of algorithms and data structures on very large sets of data. The algorithms and data structures that form the core of TPIE all provide efficient worst-case space, time and disk usage guarantees. On Windows, TPIE is known to work with the Microsoft Visual Studio 2008 and 2010 compilers.
TPIE- Example: Internal sorting
TPIE- Example: Reading and writing file streams
TPIE- Example: External sorting
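As a language-neutral illustration of the external sorting technique, the Python sketch below sorts a sequence of integers with a two-phase merge sort. It is not TPIE code and does not use the TPIE API. Phase one writes sorted runs of at most `run_size` items to temporary files; phase two merges all runs in one pass, reading each run sequentially. The function names and the one-integer-per-line file format are assumptions made for the example.

```python
import heapq
import os
import tempfile


def _write_run(sorted_items):
    """Write one sorted run to a temporary file, one integer per line."""
    fd, path = tempfile.mkstemp(suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        for x in sorted_items:
            f.write(f"{x}\n")
    return path


def _read_run(path):
    """Stream a run back from disk without loading it whole."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(values, run_size):
    """Yield the integers in `values` in sorted order, holding about run_size items in memory."""
    paths, run = [], []
    for v in values:
        run.append(v)
        if len(run) == run_size:
            paths.append(_write_run(sorted(run)))
            run = []
    if run:
        paths.append(_write_run(sorted(run)))
    try:
        # k-way merge: only one item per run is held in memory at a time.
        yield from heapq.merge(*(_read_run(p) for p in paths))
    finally:
        for p in paths:
            os.remove(p)


if __name__ == "__main__":
    print(list(external_sort([5, 3, 8, 1, 9, 2, 7], run_size=3)))
```

With real data the run size would be chosen close to the available main memory, so that the number of runs, and with it the number of merge passes, stays small.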
TPIE- Example: Priority queue
TPIE- I/O parameters: M and B. get_block_size() implementation.
TPIE- I/O parameters: Elements block size. Pass the block factor to the constructor.
The End Thank you for your time