CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 12: Conclusion. Aidan Hogan

Similar documents
CC PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 6: Information Retrieval I. Aidan Hogan

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval I. Aidan Hogan

CSE 444: Database Internals. Lectures 26 NoSQL: Extensible Record Stores

CISC 7610 Lecture 2b The beginnings of NoSQL

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 12 Google Bigtable

Big Data Hadoop Course Content

References. What is Bigtable? Bigtable Data Model. Outline. Key Features. CSE 444: Database Internals

Big Data Development CASSANDRA NoSQL Training - Workshop. November 20 to (5 days) 9 am to 5 pm HOTEL DUBAI GRAND DUBAI

CS November 2017

Challenges for Data Driven Systems

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO 2018

CS November 2018

DIVING IN: INSIDE THE DATA CENTER

MapReduce and Friends

CS 655 Advanced Topics in Distributed Systems

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Introduction to NoSQL by William McKnight

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

Databases 2 (VU) ( / )

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

CS5412: OTHER DATA CENTER SERVICES

/ Cloud Computing. Recitation 10 March 22nd, 2016

Cassandra Design Patterns

Ghislain Fourny. Big Data 5. Wide column stores

HBase vs Neo4j. Technical overview. Name: Vladan Jovičić CR09 Advanced Scalable Data (Fall, 2017) Ecolé Normale Superiuere de Lyon

Distributed computing: index building and use

Distributed File Systems II

27/04/2015 CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 1: Introduction THE VALUE OF DATA. Aidan Hogan

BigTable. Chubby. BigTable. Chubby. Why Chubby? How to do consensus as a service

Fundamentals Large-Scale Distributed System Design. (a.k.a. Distributed Systems 1)

Bigtable: A Distributed Storage System for Structured Data by Google SUNNIE CHUNG CIS 612

Distributed computing: index building and use

MASSIVE DATA NEEDS DISTRIBUTED SYSTEMS

Big Data Analytics. Rasoul Karimi

Distributed Systems Intro and Course Overview

CS5412: DIVING IN: INSIDE THE DATA CENTER

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Embedded Technosolutions

Big Data with Hadoop Ecosystem

Database Applications (15-415)

Column-Family Databases Cassandra and HBase

Ghislain Fourny. Big Data 5. Column stores

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Distributed Data Store

CS5412: DIVING IN: INSIDE THE DATA CENTER

Introduction to Hadoop. High Availability Scaling Advantages and Challenges. Introduction to Big Data

Social Networks 2015 Lecture 10: The structure of the web and link analysis

Introduction to BigData, Hadoop:-

CA485 Ray Walshe NoSQL

Google big data techniques (2)

Dynamo: Amazon s Highly Available Key-value Store. ID2210-VT13 Slides by Tallat M. Shafaat

10. Replication. Motivation

BigTable: A Distributed Storage System for Structured Data

Time Series Storage with Apache Kudu (incubating)

Computer Science 572 Midterm Prof. Horowitz Thursday, March 8, 2012, 2:00pm 3:00pm

Extreme Computing. NoSQL.

Cassandra, MongoDB, and HBase. Cassandra, MongoDB, and HBase. I have chosen these three due to their recent

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

Scaling Up HBase. Duen Horng (Polo) Chau Assistant Professor Associate Director, MS Analytics Georgia Tech. CSE6242 / CX4242: Data & Visual Analytics

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Big Data Architect.

Sources. P. J. Sadalage, M Fowler, NoSQL Distilled, Addison Wesley

Introduction to NoSQL Databases

Big Data Analytics using Apache Hadoop and Spark with Scala

Relevant?!? Algoritmi per IR. Goal of a Search Engine. Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Web Search

Review - Relational Model Concepts

Spark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM

Hadoop. copyright 2011 Trainologic LTD

6.830 Lecture Spark 11/15/2017

Exam IST 441 Spring 2014

COSC 416 NoSQL Databases. NoSQL Databases Overview. Dr. Ramon Lawrence University of British Columbia Okanagan

Next-Generation Cloud Platform

Exam 2 Review. October 29, Paul Krzyzanowski 1

Search Engines. Information Retrieval in Practice

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

CS-580K/480K Advanced Topics in Cloud Computing. NoSQL Database

Microsoft Big Data and Hadoop

MapReduce programming model

Big Data Hadoop Stack

Comparing SQL and NOSQL databases

MI-PDB, MIE-PDB: Advanced Database Systems

Improving the MapReduce Big Data Processing Framework

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access

Distributed Systems. Fall 2017 Exam 3 Review. Paul Krzyzanowski. Rutgers University. Fall 2017

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Presented by Sunnie S Chung CIS 612

Lecture 25 Overview. Last Lecture Query optimisation/query execution strategies

CS377: Database Systems Text data and information. Li Xiong Department of Mathematics and Computer Science Emory University

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

big picture parallel db (one data center) mix of OLTP and batch analysis lots of data, high r/w rates, 1000s of cheap boxes thus many failures

Part 1: Installing MongoDB

HyperDex. A Distributed, Searchable Key-Value Store. Robert Escriva. Department of Computer Science Cornell University

Lecture 20: WSC, Datacenters. Topics: warehouse-scale computing and datacenters (Sections )

CSE-E5430 Scalable Cloud Computing Lecture 9

ΕΠΛ 602:Foundations of Internet Technologies. Cloud Computing

Migrating massive monitoring to Bigtable without downtime. Martin Parm, Infrastructure Engineer for Monitoring

Transcription:

CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2016 Lecture 12: Conclusion Aidan Hogan aidhog@gmail.com

FULL-CIRCLE

The value of data

Twitter architecture

Google architecture

Generalise concepts to

Working with large datasets

Value/danger of distribution

Frameworks For Distrib. Processing For Distrib. Storage

The Big Data Buzz-word

Distributed Systems

GFS (HDFS) / MapReduce (Hadoop)

Information Retrieval

NoSQL and Querying

EXAM

Course Marking 45% for Weekly Labs (~3% a lab!) [Easy] 20% for Small Class Project [Medium] 35% for Final Exam [Hard]

Final Exam (35%) Goal: test your understanding of concepts Coding covered by labs/project No syntax writing questions! but there will be design and syntax reading questions Max. three hours Not marking you on English You can write in Spanish Four questions (marked on best three) Old exams available on Material Docente (u-cursos) There will be overlap, but not 100% overlap

The following is not a legally abiding agreement. It is just a helpful guide for what s important. -- My laywer, 2016

Q1: Distributed Systems

Q1: Distributed Systems (Slides) Slides: Lecture 2: Distributed Systems I Lecture 3: Distributed Systems II Names per the homepage: http://aidanhogan.com/teaching/cc5212-1-2016/

Q1: Distributed Systems (Topics) Possible Topics: Advantages/disadvantages of a distributed system Five distributed system design goals Distributed architectures (P2P vs. C S, Fat/Thin, n-tier, etc.) Java RMI (high-level) Eight fallacies of distributed computing Consensus basics (fail-stop vs. Byzantine, synchronous vs. asynchronous, goals) Consensus protocols (2PC, 3PC, Paxos) CAP theorem may appear in Q4, but will not appear in Q1

GFS (HDFS) / MapReduce (Hadoop)

Q2: GFS (HDFS) / MapReduce (Hadoop) Slides: Lecture 4: DFS & MapReduce I Lecture 6: MapReduce III: Pig (Lecture 5 was whiteboard only, practicing MapReduce design. This may also be useful when preparing!) Names per the homepage: http://aidanhogan.com/teaching/cc5212-1-2016/

Q2: GFS (HDFS) / MapReduce (Hadoop) Possible Topics: Google File System (reads, writes, fault-tolerance) MapReduce (incl. design question) HDFS/Hadoop (architecture) Pig (high-level, e.g., explain what a given script does)

Information Retrieval

Q3: Information Retrieval (Slides) Slides: Lecture 7: Information Retrieval I Lecture 8: Information Retrieval II (Lecture 9 was whiteboard only, practicing PageRank. This may also be useful when preparing!) Names per the homepage: http://aidanhogan.com/teaching/cc5212-1-2016/

Q3: Information Retrieval (Topics) Possible Topics: Crawling (high-level multi-threading, (D)DoS, robots.txt, sitemap, distribution, bow-tie) Inverted indexes (data structure, normalisation, Heap s law, Zipf s law, Elias encoding, etc.) Ranking (relevance vs. importance, TF-IDF, Vector Space Model, etc.) PageRank (concept, random surfer, calculation)

Bring a Calculator! (that s not a phone!)

NoSQL and Querying

Q4: NoSQL and Querying (Slides) Slides: Lecture 10: NoSQL I Lecture 11: NoSQL II Lecture 3: Distributed Systems II (slides on CAP) Names per the homepage: http://aidanhogan.com/teaching/cc5212-1-2016/

Q4: NoSQL and Querying (Topics) Possible Topics: CAP theorem (<- note out of order) The Database Landscape Key Value stores (data model, operations, distribution, consistent hashing, replication, Dynamo, Merkle trees) Document stores (high-level) Tabular/column-families (data model, Bigtable, sorting, tablets, column families, SSTables, writes, reads, compactions, hierarchy, bloom filters) Graph databases (high-level) Cassandra (high-level)

Final Exam (35%) Goal: test your understanding of concepts Coding covered by labs/project No syntax writing questions! but there will be design and syntax reading questions Max. three hours Not marking you on English You can write in Spanish Four questions (marked on best three) Old exams available on Material Docente (u-cursos) There will be overlap, but not 100% overlap

SOME ADVERTISING

Ayudantes wanted This course next year (I take ayudantes in this course once) Diplomado version of this course End of November / Start of December Will prioritise based on course mark

Next semester La Web de Datos Web of Data / Semantic Web / Linked Data