TIE-22306 Data-intensive Programming Dr. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Lecturer: Timo Aaltonen (timo.aaltonen@tut.fi) Assistants: Adnan Mushtaq, MSc; Antti Luoto, MSc; Antti Kallonen
Lecturer University Lecturer; doctoral degree in Software Engineering, TUT, 2005. Work history: various positions, TUT, 1995-2010; Principal Researcher, System Software Engineering, Nokia Research Center, 2010-2012; University Lecturer, TUT
Working on the Course Lectures on Fridays. Weekly exercises beginning from week #2. Course work announced next Friday. Communication: http://www.cs.tut.fi/~dip/. Exam
Weekly Exercises Linux class TC217. In the beginning of the course: hands-on training; in the end of the course: reception for problems with the course work. Enrolment is open. Not compulsory, no credit points. Two more instances will be added
Course Work Using Hadoop tools and frameworks to solve a typical Big Data problem (in Java). Groups of three. Hardware options: your own laptop with self-installed Hadoop; your own laptop with VirtualBox 5.1 and an Ubuntu VM; a TUT virtual machine
Exam Electronic exam after the course. Tests understanding rather than exact syntax, e.g. use pseudocode to write a MapReduce program which ... General questions on Hadoop and related technologies
Today Big Data, Data Science, Hadoop, HDFS, Apache Flume
1: Big Data The world is drowning in data: clickstream data is collected by web servers; NYSE generates 1 TB of trade data every day; MTC collects 5000 attributes for each call; smart marketers collect purchasing habits. More data usually beats better algorithms
Three Vs of Big Data Volume: amount of data, e.g. transaction data stored through the years, unstructured data streaming in from social media, increasing amounts of sensor and machine-to-machine data. Velocity: speed of data in and out, e.g. streaming data from RFID and sensors. Variety: range of data types and sources, structured and unstructured
Big Data Variability: data flows can be highly inconsistent, with periodic peaks. Complexity: data comes from multiple sources; linking, matching, cleansing and transforming data across systems is a complex task
Data Science Definition: data science is the activity of extracting insights from messy data. Examples: Facebook analyzes location data to identify global migration patterns and to map the fan bases of different sports teams; a retailer might track purchases both online and in-store to enable targeted marketing
New Challenges Compute intensiveness: raw computing power. Data intensiveness: amount of data, complexity of data, speed at which data is changing
Data Storage and Analysis A hard drive from 1990: stores 1,370 MB, reads at 4.4 MB/s. A hard drive from the 2010s: stores 1 TB, reads at 100 MB/s
Scalability A scalable system grows without requiring developers to re-architect their algorithms or applications. Horizontal scaling: adding more machines. Vertical scaling: adding more resources to a single machine
Parallel Approach Reading from multiple disks in parallel: 100 drives, each holding 1/100 of the data => 1/100 of the reading time. Problem: hardware failures; solution: replication. Problem: most analysis tasks need to be able to combine data in some way; solution: MapReduce. Hadoop provides both
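The arithmetic behind the parallel-reading argument can be made concrete. A minimal sketch in plain Java (class and method names are mine; the figures come from the drive comparison above, no Hadoop involved):

```java
public class ReadTime {
    // Seconds to read a dataset sequentially from a single drive.
    static double singleDriveSeconds(double dataMb, double mbPerSecond) {
        return dataMb / mbPerSecond;
    }

    // Seconds when the data is spread evenly over n drives read in parallel.
    static double parallelSeconds(double dataMb, double mbPerSecond, int drives) {
        return singleDriveSeconds(dataMb, mbPerSecond) / drives;
    }

    public static void main(String[] args) {
        double oneTbInMb = 1_000_000.0;
        // A 2010s drive: 1 TB at 100 MB/s -> 10,000 s, close to 3 hours.
        System.out.println(singleDriveSeconds(oneTbInMb, 100));
        // 100 drives, each holding 1/100 of the data -> 100 s.
        System.out.println(parallelSeconds(oneTbInMb, 100, 100));
    }
}
```

The same calculation for the 1990 drive (1,370 MB at 4.4 MB/s) gives roughly five minutes for a full scan, which is why transfer speed, not capacity, is the bottleneck.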
2: Apache Hadoop Hadoop is a framework of tools, libraries and methodologies. Operates on large unstructured datasets. Open source (Apache License). Simple programming model. Scalable
Hadoop A scalable, fault-tolerant, distributed system for data storage and processing (open source under the Apache license). Core Hadoop has two main systems. Hadoop Distributed File System: self-healing, high-bandwidth clustered storage. MapReduce: distributed fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction
Hadoop Administrators: installation, monitoring/managing systems, tuning systems. End users: designing MapReduce applications, importing and exporting data, working with various Hadoop tools
Hadoop Developed by Doug Cutting and Michael J. Cafarella. Based on Google's MapReduce technology. Designed to handle large amounts of data and to be robust. Donated to the Apache Foundation in 2006 by Yahoo
Hadoop Design Principles Moving computation is cheaper than moving data. Hardware will fail. Hide execution details from the user. Use streaming data access. Use a simple file system coherency model. Hadoop is not: a replacement for SQL; always fast and efficient; meant for quick ad-hoc querying
Hadoop MapReduce MapReduce (MR) is the original programming model for Hadoop. Collocate data with the compute node: data access is fast since it is local (data locality). Network bandwidth is the most precious resource in the data center, so MR implementations explicitly model the network topology
Hadoop MapReduce MR operates at a high level of abstraction: the programmer thinks in terms of functions over key and value pairs. MR is a shared-nothing architecture: tasks do not depend on each other, so failed tasks can be rescheduled by the system. MR was introduced by Google, used for producing search indexes, but applicable to many other problems too
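The key/value style of thinking can be illustrated without a cluster. Below is a hedged, in-memory word-count sketch in plain Java; the class and method names are mine and this is not Hadoop's actual Mapper/Reducer API, only the shape of the computation:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    // "Map" phase: emit a (word, 1) pair for every word in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.toLowerCase().split("\\s+"))
                if (!word.isEmpty())
                    pairs.add(Map.entry(word, 1));
        return pairs;
    }

    // "Shuffle + reduce" phase: group pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs)
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("to be or not", "to be");
        System.out.println(reduce(map(input))); // {be=2, not=1, or=1, to=2}
    }
}
```

The shared-nothing property shows up here as well: each call to map works on its own lines independently, which is what lets a real MR system rerun a failed task on another node.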
Hadoop Components Hadoop Common: a set of components and interfaces for distributed file systems and general I/O. Hadoop Distributed Filesystem (HDFS). Hadoop YARN: a resource-management and scheduling platform. Hadoop MapReduce: a distributed programming model and execution environment
Hadoop Stack Transition
Hadoop Ecosystem HBase: a scalable, distributed database that supports structured data storage for large tables. Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying. Pig: a high-level data-flow language and execution framework for parallel computation. Spark: a fast and general compute engine for Hadoop data, with a wide range of applications: ETL, machine learning, stream processing, and graph analytics
Flexibility: Complex Data Processing
1. Java MapReduce: most flexibility and performance, but tedious development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): allows development in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce.
3. Crunch: a library for multi-stage MapReduce pipelines in Java (modeled after Google's FlumeJava).
4. Pig Latin: a high-level language out of Yahoo, suitable for batch data-flow workloads.
5. Hive: a SQL interpreter out of Facebook; also includes a metastore mapping files to their schemas and associated SerDes.
6. Oozie: a workflow engine that enables creating a workflow of jobs composed of any of the above.
3: Hadoop Distributed File System Hadoop comes with a distributed file system called HDFS (Hadoop Distributed File System). Based on Google's GFS (Google File System). HDFS provides redundant storage for massive amounts of data using commodity hardware. Data in HDFS is distributed across all data nodes, which is efficient for MapReduce processing
HDFS Design A file system on commodity hardware: survives even with high failure rates of the components. Supports lots of large files: file sizes of hundreds of GB or several TB. Main design principles: write once, read many times; streaming reads rather than frequent random access; high throughput is more important than low latency
HDFS Architecture HDFS operates on top of an existing file system. Files are stored as blocks (default size 128 MB, different from file system blocks). File reliability is based on block-based replication: each block of a file is typically replicated across several DataNodes (default replication factor is 3). The NameNode stores metadata, manages replication and provides access to files. No data caching (because of large datasets), but direct reading/streaming from the DataNode to the client
HDFS Architecture The NameNode stores HDFS metadata: filenames, locations of blocks, file attributes. Metadata is kept in RAM for fast lookups, so the number of files in HDFS is limited by the amount of RAM available in the NameNode. HDFS NameNode federation can alleviate the RAM issue: several NameNodes, each of which manages a portion of the file system namespace
HDFS Architecture A DataNode stores file contents as blocks. Different blocks of the same file are stored on different DataNodes. The same block is typically replicated across several DataNodes for redundancy. Each DataNode periodically sends a report of all its blocks to the NameNode, and exchanges heartbeats with the NameNode
HDFS Architecture Built-in protection against DataNode failure: if the NameNode does not receive any heartbeat from a DataNode within a certain time period, the DataNode is assumed to be lost. When a DataNode fails, block replication is actively maintained: the NameNode determines which blocks were on the lost DataNode, finds the remaining copies of those blocks and replicates them to other nodes
HDFS HDFS Federation: multiple NameNode servers, multiple namespaces. High Availability: redundant NameNodes. Heterogeneous Storage and Archival Storage: ARCHIVE, DISK, SSD, RAM_DISK
High-Availability (HA) Issues: NameNode Failure NameNode failure corresponds to losing all files on a file system (% sudo rm --dont-do-this /). For recovery, Hadoop provides two options: backup of the files that make up the persistent state of the file system, or the secondary NameNode. More advanced techniques also exist
HA Issues: the Secondary NameNode The secondary NameNode is not a mirrored NameNode; it performs memory-intensive administrative functions. The NameNode keeps metadata in memory and writes changes to an edit log. The secondary NameNode periodically combines the previous namespace image and the edit log into a new namespace image, preventing the edit log from growing too large. It keeps a copy of the merged namespace image, which can be used in the event of NameNode failure
Network Topology HDFS is aware of how close two nodes are in the network. From closest to furthest: 0: processes on the same node; 2: different nodes in the same rack; 4: nodes in different racks in the same data center; 6: nodes in different data centers
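The distance measure above can be sketched as a small function. A hedged plain-Java version (the function and the data-center/rack/node identifiers are mine, not HDFS internals; only the 0/2/4/6 values come from the slide):

```java
public class TopologyDistance {
    // Distance per the HDFS convention: 0 same node, 2 same rack,
    // 4 same data center but different rack, 6 different data centers.
    static int distance(String dc1, String rack1, String node1,
                        String dc2, String rack2, String node2) {
        if (!dc1.equals(dc2)) return 6;
        if (!rack1.equals(rack2)) return 4;
        if (!node1.equals(node2)) return 2;
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(distance("d1", "r1", "n1", "d1", "r1", "n1")); // 0
        System.out.println(distance("d1", "r1", "n1", "d1", "r1", "n2")); // 2
        System.out.println(distance("d1", "r1", "n1", "d1", "r2", "n3")); // 4
        System.out.println(distance("d1", "r1", "n1", "d2", "r1", "n1")); // 6
    }
}
```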
File Block Placement Clients always read from the closest node. Default placement strategy: first replica on the same node as the client; second replica in a different rack; third replica on a different, randomly selected node in the same rack as the second replica. Additional replicas (beyond three) are placed on random nodes
Balancing Hadoop works best when blocks are evenly spread out. There is support for DataNodes of different sizes: in the optimal case, the disk usage percentage of every DataNode is at approximately the same level. Hadoop provides a balancer daemon, which redistributes blocks and should be run when new DataNodes are added
Running Hadoop Three configurations: standalone, pseudo-distributed, fully-distributed. See https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html
Configuring HDFS The variable HADOOP_CONF_DIR defines the directory for the Hadoop configuration files. core-site.xml (note that the property name is fs.defaultFS):
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9001</value>
  </property>
</configuration>
hdfs-site.xml (note that the DataNode property is dfs.datanode.data.dir):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/nn/hadoop/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/nn/hadoop/datanode</value>
  </property>
</configuration>
Accessing Data Data can be accessed using various methods: Java API; C API; command line / POSIX (FUSE mount); command line / HDFS client (demo); HTTP; various tools
HDFS URI All HDFS (CLI) commands take path URIs as arguments. Example URI: hdfs://localhost:9000/user/hduser/log-data/file1.log. The scheme and authority are optional: /user/hduser/log-data/file1.log. Relative to the home directory: log-data/file1.log
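The parts of such a URI can be inspected with java.net.URI from the standard library. This only parses the string (scheme, authority, path); it does not contact HDFS:

```java
import java.net.URI;

public class HdfsUriParts {
    public static void main(String[] args) {
        URI uri = URI.create("hdfs://localhost:9000/user/hduser/log-data/file1.log");
        System.out.println(uri.getScheme());    // hdfs
        System.out.println(uri.getAuthority()); // localhost:9000
        System.out.println(uri.getPath());      // /user/hduser/log-data/file1.log
    }
}
```

When the scheme and authority are omitted, they default to the values configured in fs.defaultFS.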
RDBMS vs HDFS Schema-on-Write (RDBMS): the schema must be created before any data can be loaded; an explicit load operation transforms data to the DB-internal structure; new columns must be added explicitly before data for such columns can be loaded into the DB. Schema-on-Read (HDFS): data is simply copied to the file store, no transformation is needed; a SerDe (Serializer/Deserializer) is applied during read time to extract the required columns (late binding); new data can start flowing anytime and will appear retroactively once the SerDe is updated to parse it
Conclusions Pros: support for very large files; designed for streaming data; runs on commodity hardware. Cons: not designed for low-latency data access; the architecture does not support lots of small files; no support for multiple writers or arbitrary file modifications (writes always append to the end of a file)
4: Data Modeling HDFS is a schema-on-read system, which allows storing all of your raw data. Still, the following must be considered: data storage formats, multitenancy, schema design, metadata management
Data Storage Options There is no standard data storage format: Hadoop allows storing data in any format. Major considerations for data storage include: file format (e.g. plain text, SequenceFile, or more complex but more functionally rich options such as Avro and Parquet); compression (splittability); data storage system (HDFS, HBase, Hive, Impala)
File Formats: Text File Common use case: web logs and server logs, which come in many formats. Considerations: organization of the files in the filesystem; text files consume space, so use compression; overhead of conversion (the string '123' -> the number 123). Structured text data: XML and JSON present challenges to Hadoop since they are hard to split, but dedicated libraries exist
File Formats: Binary Data Hadoop can be used to process binary files, e.g. images. A container format such as SequenceFile is preferred. If the splittable unit of binary data is larger than 64 MB, you may consider putting the data in its own file, without using a container format
Hadoop File Types Hadoop-specific file formats are created specifically to work well with MapReduce: file-based data structures such as sequence files, serialization formats like Avro, and columnar formats such as RCFile and Parquet. Splittable compression: these formats support common compression formats and are also splittable. Codec-agnostic: the compression codec is stored in the header metadata of the file format -> the file can be compressed with any compression codec, without readers having to know the codec in advance
File-Based Data Structures The SequenceFile format is one of the most commonly used file-based formats in Hadoop (other formats: MapFiles, SetFiles, ArrayFiles, BloomMapFiles). It stores data as binary key-value pairs, with three formats available for records: uncompressed, record-compressed, and block-compressed
Sequence File Header metadata: compression codec, key and value class names, user-defined metadata, and a randomly generated sync marker. Often used as a container for smaller files
Compression Not only for reducing storage requirements, but also for speeding up MapReduce. Compression must be splittable, because the MapReduce framework splits data for input to multiple tasks
HDFS Schema Design Hadoop is often a data hub for the entire organization: data is shared by many departments and teams. A carefully structured and organized repository has several benefits: a standard directory structure makes it easier to share data between teams; it allows enforcing access rights and quotas; conventions regarding e.g. staging data lead to fewer errors; it enables code reuse. Also, Hadoop tools make assumptions about data placement
Recommended Locations of Files /user/<username>: data, JARs, and config files of a specific user. /etl: data in all phases of an ETL workflow, e.g. /etl/<group>/<application>/<process>/{input, processing, output, bad}. /tmp: temporary data
Recommended Locations of Files /data: datasets shared across the organization; data is written by automated ETL processes and is read-only for users; subdirectories for each data set. /app: JARs, Oozie workflow definitions, Hive HQL files, etc.; layout /app/<group>/<application>/<version>/<artifact directory>/<artifact>
Recommended Locations of Files /metadata: the metadata required by some tools
Partitioning HDFS has no indexes. Pro: fast to ingest data. Con: might lead to a full table scan (FTS), even when only a portion of the data is needed. Solution: break the data set into smaller subsets (partitions), with an HDFS subdirectory for each partition, which allows queries to read only the specific partitions they need
Partitioning: Example Assume a data set of all orders for various pharmacies. Without partitioning, checking the order history of just one physician over the past three months leads to a full table scan. With partitioning by date, e.g. medication_orders/date=20160824/{order1.csv, order2.csv}, only 90 directories must be scanned
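Generating the partition directory names for a 90-day window is straightforward with java.time. A hedged sketch (the class, method, and the chosen start date are mine; the medication_orders/date=YYYYMMDD pattern follows the example above):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

public class PartitionDirs {
    // Build one date=YYYYMMDD partition directory name per day in the window.
    static List<String> partitions(LocalDate start, int days) {
        DateTimeFormatter fmt = DateTimeFormatter.BASIC_ISO_DATE; // yyyyMMdd
        List<String> dirs = new ArrayList<>();
        for (int i = 0; i < days; i++)
            dirs.add("medication_orders/date=" + start.plusDays(i).format(fmt));
        return dirs;
    }

    public static void main(String[] args) {
        // A 90-day window ending at the slide's example date, 2016-08-24.
        List<String> last90 = partitions(LocalDate.of(2016, 5, 27), 90);
        System.out.println(last90.size());  // 90 partition directories to scan
        System.out.println(last90.get(0));  // medication_orders/date=20160527
    }
}
```

A query engine that understands this layout can prune to exactly these 90 directories instead of scanning the whole data set.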
5: Data Movement The file system client works for simple usage. Common data sources for Hadoop include: traditional data management systems, such as relational databases and mainframes; logs, machine-generated data, and other forms of event data; files being imported from existing enterprise data storage systems
Data Movement: Considerations Timeliness of data ingestion and accessibility What are the requirements around how often data needs to be ingested? How soon does data need to be available to downstream processing? Incremental updates How will new data be added? Does it need to be appended to existing data? Or overwrite existing data?
Data Movement: Considerations Data access and processing Will the data be used in processing? If so, will it be used in batch processing jobs? Or is random access to the data required? Source system and data structure Where is the data coming from? A relational database? Logs? Is it structured, semi-structured, or unstructured data?
Data Movement: Considerations Partitioning and splitting of data How should data be partitioned after ingest? Does the data need to be ingested into multiple target systems (e.g., HDFS and HBase)? Storage format What format will the data be stored in? Data transformation Does the data need to be transformed in flight?
Timeliness of Data Ingestion The time lag from when data is available for ingestion to when it's accessible in Hadoop. Classification of ingestion requirements: Macro batch: anything from 15 minutes to hours, or even a daily job. Micro batch: fired off every 2 minutes or so, but no more than every 15 minutes. Near-Real-Time Decision Support: immediately actionable by the recipient of the information; delivered in less than 2 minutes but more than 2 seconds. Near-Real-Time Event Processing: under 2 seconds, and can be as fast as the 100-millisecond range. Real Time: anything under 100 milliseconds.
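The classification above is just a set of latency thresholds, which can be written down as a function. A hedged sketch (the tier names come from the slide; the method and the exact boundary handling are mine):

```java
public class IngestTimeliness {
    // Map an ingestion lag in milliseconds to the tier named on the slide.
    static String tier(long lagMillis) {
        if (lagMillis < 100) return "Real Time";
        if (lagMillis < 2_000) return "Near-Real-Time Event Processing";
        if (lagMillis < 2 * 60_000) return "Near-Real-Time Decision Support";
        if (lagMillis <= 15 * 60_000) return "Micro batch";
        return "Macro batch";
    }

    public static void main(String[] args) {
        System.out.println(tier(50));          // Real Time
        System.out.println(tier(1_000));       // Near-Real-Time Event Processing
        System.out.println(tier(60_000));      // Near-Real-Time Decision Support
        System.out.println(tier(10 * 60_000)); // Micro batch
        System.out.println(tier(3_600_000));   // Macro batch
    }
}
```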
Incremental Updates New data is either appended to an existing data set, or it modifies existing data. HDFS works fine for append-only implementations; the downside is the inability to do random writes to files after they are created. HDFS is optimized for large files: if the requirements call for a two-minute append process that ends up producing lots of small files, a periodic process to combine the smaller files will be required to get the benefits of larger files
Original Source System and Data Original file type/structure can be any format: delimited, XML, JSON, Avro, fixed length, variable length, copybooks, ... Hadoop can accept any file format, but not all formats are optimal for particular use cases, and not all file formats work with every tool in the Hadoop ecosystem (example: variable-length files)
Compression Pro: transferring a compressed file over the network requires less I/O and network bandwidth. Con: most compression codecs applied outside of Hadoop are not splittable (e.g., gzip)
Misc RDBMS: use Sqoop. Streaming data (Twitter feeds, a Java Message Service (JMS) queue, events firing from a web application server): use Flume or Kafka. Logfiles: an anti-pattern is to read the logfiles from disk as they are written, because this is almost impossible to implement without losing data; the correct way of ingesting logfiles is to stream the logs directly to a tool like Flume or Kafka, which writes directly to Hadoop instead
Transformations Transformation: modifications on incoming data, e.g. XML or JSON is converted to delimited data. Partitioning: distributing the data into partitions or buckets, e.g. incoming stock trade data partitioned by ticker. Splitting: sending the data to more than one store or location, e.g. the data needs to land in both HDFS and HBase for different access patterns
Data Ingestion Options Simple file transfers, or tools like Flume, Sqoop, and Kafka