Cray Graph Engine / Urika-GX. Dr. Andreas Findling
1 Cray Graph Engine / Urika-GX Dr. Andreas Findling
2 Uniprot / EU Open Data Portal
SPARQL query counting the total number of triples:
SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }
Copyright 2017 Cray Inc.
3 Cray Analytic Platforms
- Urika-GD: Graph Analytics; XMT2, Seastar
- Urika-XA: Hadoop, Spark; Infiniband, SSD
- Urika-GX: Hadoop, Spark, Cray Graph Engine; Aries, SSD
- Minerva: Analytics Software Stack available on XC platforms
4 Porting the Query Engine
- Data in an RDF database is unstructured
  - Communication of information across the dataset can be highly irregular and may approach all-to-all for tightly connected graphs
  - Maintaining optimal network performance for short remote references, both PUTs and GETs, is essential
- Urika-GD does this very well
  - ~100 Mrefs/s per node for single-word loads and stores
  - But all references are remote!
- Mapping to the XC/Aries architecture
  - Global address space for one-sided communication
  - Leverage the low-level DMAPP library as the communication layer
  - Non-blocking implicit GETs and PUTs (~50 Mrefs/s per node, single word)
  - Utilize synchronization features and atomic operations available with Aries
5 Selection of Coarray C++ Programming Model
- C++ template library that runs on top of Cray's Partitioned Global Address Space (PGAS) library
  - Provides the performance advantages of the low-level DMAPP communication
  - Provides easy access to Aries synchronization features and atomic operations
- Urika-GD codebase is currently C++
- Coarray provides an easy model for taking advantage of locality when available
  - Internal intermediate data structures
- Carrying forward the Basic Graph Function (BGF) extensions
  - Custom graph algorithms written in Coarray C++
6 Why is CGE on Urika-GX faster than Urika-GD?
- Urika-GD got its performance from multithreading and huge shared memory with fast random access (Seastar)
- CGE uses a shared memory model called PGAS (Partitioned Global Address Space)
  - Invented and championed by Cray Inc.
  - Depends on the Aries network and its RDMA capability
- Urika-GX nodes are more powerful
  - Multicore processors provide fewer but more powerful threads
  - 8 channels of DDR4 memory per node provide more memory bandwidth and capacity, which is critical for Graph Analytics
- Graph software has been refactored for these hardware differences, but 90% of it is the same
- Aries network is faster than Seastar
  - Bandwidth is the most important commodity
Cray Inc. Proprietary Not For Public Disclosure
7 LUBM25K: Graph Analytics Benefits from Large Memory and Fast Interconnect
Lehigh University Benchmark (LUBM): Basic Graph Patterns and Inference Test
[Chart: query time for Urika-GD and Athena 32, with speed-up per query]
- Average 300% improvement on complex queries
- Highlights graph performance on complex queries over a larger Urika-GD system
- Urika-GD system: lubm25k, 64 nodes, 24 images per node
- Urika-GX system: lubm25k, 32 nodes, 24 images per node
Copyright 2016 Cray Inc.
8 Cray Graph Engine Overview
9 Pervasive Speed Supercomputing Experience
- CGE is an in-memory Semantic Graph Database
- Implemented using HPC technology
  - PGAS and Aries network
- Based on W3C industry standards
  - RDF graph data format (a.k.a. "Triple Store")
  - SPARQL 1.1 query language
- Extended with additional high-performance graph algorithms (BGFs)
  - Community detection, S-T connectivity, betweenness centrality
- Designed to work with other Urika-GX applications to create complex workflows
10 Graph analysis workloads
- Two main workloads
  - Pattern matching
  - Whole graph analysis
- Typical systems are only good at one
- CGE excels at both
11 A Graph-pattern matching workload
- Given a pattern of interest, find all instances thereof
- Lehigh University Benchmark
12 A Graph-theoretic Workload
- What's the shortest route from A to B?
- What is the ranking of the targeted vertex? (PageRank)
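The ranking question on this slide refers to PageRank. As a minimal sketch of the idea (plain power iteration in Python, not CGE's implementation; the function name, damping factor, iteration count and toy edge list are all assumptions made for illustration):

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of directed (src, dst) edges."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by vertices without out-links is spread evenly over all vertices.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Every vertex passes its rank on in equal shares along its out-links.
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy graph: a three-vertex cycle plus one extra vertex pointing at C.
toy_edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "C")]
toy_ranks = pagerank(toy_edges)
best = max(toy_ranks, key=toy_ranks.get)
```

In this toy graph the vertex with two incoming edges ends up ranked highest, and the ranks always sum to 1.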
13 RDF Triple Store LUBM 2017
14 Lehigh University Benchmark
Ontology: Univ-Bench
- Represents the meaning of terms (vocabulary) and their interrelationships using OWL
- Entities / Classes (42), e.g. University, Department, FullProfessor, UndergraduateStudent, GraduateStudent, Student
- Relationships / Properties / Rules (32), e.g. subOrganizationOf, headOf, memberOf, takesCourse, name, telephone
Web Ontology Language (OWL)
- OWL, RDF and SPARQL standards are the building blocks of the Semantic Web
- OWL goes beyond RDF and XML; it is intended to be used when information needs to be processed
- Best-developed ontologies: Gene Ontology (GO)
15 Other Ontologies
Gene Ontology (GO), geneontology.org
- The GO defines concepts/classes used to describe gene function, and the relationships between these concepts
- Addresses the need for consistent descriptions of gene products across databases
- A platform to agree on how and why a specific term is used, and to apply it consistently
Copyright 2015 Cray Inc
16 Lehigh University Benchmark
The raw data
- Univ-Bench Artificial data generator (UBA)
- UBA generates the requested number of Universities (e.g. LUBM25K has 25,000 Universities)
- In each University, 15~20 Departments are subOrganizationOf the University
- In each Department, 7~10 FullProfessors worksFor the Department
- One of the FullProfessors is headOf the Department
- Every Student is memberOf the Department
- 10~20 ResearchGroups are subOrganizationOf the Department
- undergraduateDegreeFrom and mastersDegreeFrom connect Universities
17 Resource Description Framework
N-Triples data format
- Subject (resource), Predicate (property name), Object (property value)
- Subject: < Predicate: < Object: <
- Each of those actually represents a resource
- URI: Uniform Resource Identifier
Benchmark: LUBM25K
- 3.3 billion triples
- 1.2 billion inferred (CGE)
- 4.5 billion triples in the inferred dataset
- 626 GB in one RDF file: lubm.25k.nt
- Memory demand: 4 * (size of *.nt file) => 2504 GB ~ 10 nodes with 256 GB (rule of thumb, CGE User Guide)
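The sizing arithmetic on this slide can be written out as a small helper. This is only a sketch of the quoted rule of thumb (memory demand of about 4 times the N-Triples file size); the function name and parameter names are assumptions, and the figures are the ones given on the slide.

```python
import math


def estimate_nodes(nt_file_gb, node_memory_gb=256, expansion_factor=4):
    """Return (memory demand in GB, nodes needed) for an N-Triples file of the given size."""
    memory_gb = expansion_factor * nt_file_gb
    # Round up: a partially used node still has to be allocated.
    nodes = math.ceil(memory_gb / node_memory_gb)
    return memory_gb, nodes


# LUBM25K figures from the slide: a 626 GB file on nodes with 256 GB each.
lubm25k_memory_gb, lubm25k_nodes = estimate_nodes(626)
print(lubm25k_memory_gb, lubm25k_nodes)  # 2504 10
```

2504 GB divided by 256 GB per node is about 9.8, which is where the slide's "~ 10 nodes" comes from.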
18 LUBM Queries
- 14 queries come with the LUBM benchmark
  - Graph pattern matching queries
  - Queries testing reasoning and inference capabilities
SPARQL: the query language
- Designed to query data conforming to the RDF data model
- Recursive name: SPARQL Protocol and RDF Query Language
- Together with the RDF and OWL standards, one of the building blocks of the Semantic Web
Keywords
- Typical SPARQL query: "I want these pieces of information from the subset of data that meets these conditions"
- WHERE specifies the data to pull, formulated as a triple pattern
- SELECT picks which data to display
19 LUBM Queries
Graph Pattern: Triangle
Query 2: Print out all GraduateStudents who are memberOf a Department and have an undergraduateDegreeFrom the same University that the Department is a subOrganizationOf
SELECT ?X ?Y ?Z WHERE {
  ?X rdf:type ub:GraduateStudent .
  ?Y rdf:type ub:University .
  ?Z rdf:type ub:Department .
  ?X ub:memberOf ?Z .
  ?Z ub:subOrganizationOf ?Y .
  ?X ub:undergraduateDegreeFrom ?Y }
Query 9 has the same triangular pattern of relationships. It is the most compute-intensive query.
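To make the triangle in Query 2 concrete, the following sketch evaluates the same pattern over a handful of in-memory triples in plain Python. It illustrates the join structure only and says nothing about how CGE executes the query; the function name and the toy data (alice, bob, dept0, univ0, univ1) are invented for the example.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def query2(triples):
    """Return sorted (student, university, department) bindings for the triangle pattern."""
    types = defaultdict(set)    # class -> set of instances
    by_pred = defaultdict(set)  # predicate -> set of (subject, object) pairs
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[o].add(s)
        else:
            by_pred[p].add((s, o))

    results = []
    for x, z in by_pred["ub:memberOf"]:
        if x not in types["ub:GraduateStudent"] or z not in types["ub:Department"]:
            continue
        for z2, y in by_pred["ub:subOrganizationOf"]:
            if z2 != z or y not in types["ub:University"]:
                continue
            # Closing edge of the triangle: the degree comes from that same university.
            if (x, y) in by_pred["ub:undergraduateDegreeFrom"]:
                results.append((x, y, z))
    return sorted(results)


# Hypothetical toy data: alice closes the triangle, bob's degree is from another university.
toy_triples = [
    ("alice", RDF_TYPE, "ub:GraduateStudent"),
    ("bob", RDF_TYPE, "ub:GraduateStudent"),
    ("univ0", RDF_TYPE, "ub:University"),
    ("univ1", RDF_TYPE, "ub:University"),
    ("dept0", RDF_TYPE, "ub:Department"),
    ("dept0", "ub:subOrganizationOf", "univ0"),
    ("alice", "ub:memberOf", "dept0"),
    ("bob", "ub:memberOf", "dept0"),
    ("alice", "ub:undergraduateDegreeFrom", "univ0"),
    ("bob", "ub:undergraduateDegreeFrom", "univ1"),
]
triangle_matches = query2(toy_triples)
```

Only alice is returned: bob is a member of the same department, but his undergraduate degree is from univ1, so the third edge of the triangle is missing.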
20 LUBM Queries
The basic pattern: nodes
Query 14: Print out the names of all undergraduate students
SELECT ?X WHERE { ?X rdf:type ub:UndergraduateStudent }
- Large input, low selectivity
- No reasoning or inference
Query 6: Print out the names of all students (as defined in the Ontology)
SELECT ?X WHERE { ?X rdf:type ub:Student }
- Large input, low selectivity
- Reasoning with the rules of the Ontology is needed to find that UndergraduateStudent and GraduateStudent are Students (subClassOf relationship)
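The difference between Query 14 and Query 6 can be sketched as follows: Query 14 only needs an exact type match, while Query 6 first has to expand Student into its subclasses. The code is a simplified illustration in plain Python, assuming (as the slide states) that both student classes are linked to Student by subClassOf edges; the function names and toy instances are made up.

```python
def subclasses_of(cls, subclass_edges):
    """Return cls together with every class that reaches it through subClassOf edges."""
    found = {cls}
    frontier = [cls]
    while frontier:
        parent = frontier.pop()
        for child, declared_parent in subclass_edges:
            if declared_parent == parent and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def instances_of(cls, type_triples, subclass_edges):
    """Return sorted instances whose asserted type is cls or one of its subclasses."""
    wanted = subclasses_of(cls, subclass_edges)
    return sorted(s for s, c in type_triples if c in wanted)


# Simplified ontology fragment (assumption based on the slide's wording).
toy_subclass_edges = [
    ("ub:UndergraduateStudent", "ub:Student"),
    ("ub:GraduateStudent", "ub:Student"),
]
toy_types = [
    ("s1", "ub:UndergraduateStudent"),
    ("s2", "ub:GraduateStudent"),
    ("p1", "ub:FullProfessor"),
]

# Query 14: no inference needed, so no subclass edges are passed in.
query14_result = instances_of("ub:UndergraduateStudent", toy_types, [])
# Query 6: the subclass expansion pulls in both kinds of student.
query6_result = instances_of("ub:Student", toy_types, toy_subclass_edges)
```

Without the subclass edges, a lookup for ub:Student returns nothing, because no instance in the toy data is typed as Student directly.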
21 Pattern matching scaling
[Chart: LUBM200K scaling, strict query time (seconds) per query for several system sizes]
- Strong scaling on most queries
22 SPARK GraphX LUBM 2017
23 Spark Framework
Apache Spark
- Fast, general-purpose framework for large-scale data processing
- Potential to keep data in memory
  - Solves the problem of not being able to share data across multiple map and reduce steps
- Choice of languages: Python, Scala, R, Java
- Supports a variety of workloads with the same runtime: Batch, Streaming, Interactive SQL, Machine Learning, GraphX
24 GraphX - The Spark Graph Library
Data Model: Labeled Property Graph
- Nodes, Edges: the simplest way to think of a graph is to name all the nodes and their connections (edges)
- Properties and labels are attached to nodes and edges
- Data format for LUBM: JSON (2 files: nodes.json, edges.json)
- Why does Cray CGE do RDF? Open standard of the W3C; basis of the Semantic Web
Query Language?
- Spark/GraphX provides an abstraction for graph analysis
- No query language, only the GraphX API, currently available only in Scala
- Writing pattern matching queries requires an understanding of the underlying distributed data processing engine, Spark, and the properties of its data-parallel operations
- GraphX extends the Spark RDD abstraction by introducing a Graph class
- Resilient Distributed Property Graph => DISTRIBUTED, PARALLEL
25 Pattern matching - Spark Comparison
[Chart: LUBM25K, CGE vs. Spark GraphX performance on 128 nodes of an XC-40; strict query time (ms) per query]
- CGE is 1-2 orders of magnitude faster
CUG 2017
26 Built-in Graph Functions SNAP 2017
27 Built-in Graph Functions (BGFs)
- SPARQL is limited in its ability to express graph processing
- CGE augments SPARQL with the capability of calling library graph algorithms
- You can go from SPARQL to a graph algorithm and back to SPARQL for further refinement
Stanford Network Analysis Project (SNAP)
- US Patent Citations and two online social networks
28 Applications for Available Algorithms
- Search / neighborhood identification and extraction
  - Pattern matching / subgraph isomorphism (core functionality)
  - Cybersecurity application: context and search, data exfiltration, beaconing, attack identification
- Community detection
  - Modularity: relaxed clique
  - Cybersecurity application: botnet detection and server hierarchy mapping
- Path finding
  - Shortest path, S-T connectivity
  - Cybersecurity application: identify likely paths for information flow between nodes
- Key node / edge identification
  - Betweenness centrality
  - Cybersecurity application: find the vulnerable points in network configurations
- Anomaly identification and clustering
  - BadRank: finds likely worst actors by association with known bad actors, a la PageRank
  - Cybersecurity application: unknown-unknown identification
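As an illustration of the path-finding category above, S-T connectivity and shortest path on an unweighted graph can both be answered with a breadth-first search. The sketch below is plain Python and is not the BGF interface of CGE; the function name, the choice to treat edges as undirected, and the toy host graph are assumptions.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Return one shortest path from source to target as a list of vertices, or None."""
    neighbors = {}
    for u, v in edges:
        # Edges are treated as undirected for this sketch.
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)

    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in neighbors.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # source and target are not connected


# Hypothetical network: two routes from a to d, plus an isolated pair x-y.
toy_links = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "d"), ("x", "y")]
route = shortest_path(toy_links, "a", "d")
unreachable = shortest_path(toy_links, "a", "x")
```

A result of None answers the S-T connectivity question negatively; any returned list is a shortest route in number of hops.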
29 Whole Graph Analysis Scaling
[Chart: CGE performance, PageRank (SPARQL with BGF extension); strict query time (seconds) on 32, 64, 128, 256 and 512 nodes for the datasets cit-patents, soc-livejournal1 and com-friendster]
- Strong scaling across SNAP datasets
30 Whole Graph Analysis Scaling
[Chart: performance comparison, CGE vs. Spark GraphX, PageRank on livejournal1 at 64p, 128p and 256p; time in seconds by programming model: Spark GraphX, Python+SPARQL, SPARQL+BGF]
- CGE is an order of magnitude faster
- Iterative SPARQL approach is equivalent to Spark
31 Whole Graph Analysis Scaling
[Chart: performance comparison, CGE vs. Spark GraphX, PageRank on friendster at 64p, 128p and 256p; time in seconds by programming model: Spark GraphX, Python+SPARQL, SPARQL+BGF]
- CGE is an order of magnitude better than Spark
- Dataset characteristics affect performance
32 Cray Urika-GX Configuration
33 Urika-GX Configuration
Deep memory / storage hierarchy
- Aries Network: Cray Aries fabric with high I/O throughput and low latency
- 16/48 2-socket Intel Xeon E v4 family processor nodes cores 8-24 TB DRAM TB PCIe SSDs TB HDD local storage
- Attach to external POSIX-compliant global storage: Cray Sonexion (Lustre), GPFS, NFS
- HPC Network
- Optimized PGAS for Cray Graph Engine
- Large Memory
- Node-local PCIe SSDs
- Tiered HDFS, Optimized Shuffle Operations
- External File Systems (incl. Lustre)
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #23 04/11/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 MapReduce Example
More informationOn Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs
On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University
More informationClash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Presented by: Dishant Mittal Authors: Juwei Shi, Yunjie Qiu, Umar Firooq Minhas, Lemei Jiao, Chen Wang, Berthold Reinwald and Fatma
More informationData Platforms and Pattern Mining
Morteza Zihayat Data Platforms and Pattern Mining IBM Corporation About Myself IBM Software Group Big Data Scientist 4Platform Computing, IBM (2014 Now) PhD Candidate (2011 Now) 4Lassonde School of Engineering,
More informationBig Data Analytics using Apache Hadoop and Spark with Scala
Big Data Analytics using Apache Hadoop and Spark with Scala Training Highlights : 80% of the training is with Practical Demo (On Custom Cloudera and Ubuntu Machines) 20% Theory Portion will be important
More informationBig Data Infrastructures & Technologies
Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory
More informationAccelerating Irregular Computations with Hardware Transactional Memory and Active Messages
MACIEJ BESTA, TORSTEN HOEFLER spcl.inf.ethz.ch Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages LARGE-SCALE IRREGULAR GRAPH PROCESSING Becoming more important
More informationBIG DATA TESTING: A UNIFIED VIEW
http://core.ecu.edu/strg BIG DATA TESTING: A UNIFIED VIEW BY NAM THAI ECU, Computer Science Department, March 16, 2016 2/30 PRESENTATION CONTENT 1. Overview of Big Data A. 5 V s of Big Data B. Data generation
More informationBest Practices for Setting BIOS Parameters for Performance
White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page
More informationScalable RDF Stream Reasoning in the Cloud
Semantic Web 0 (0) 1 1 IOS Press Scalable RDF Stream Reasoning in the Cloud Ren Xiangnan a,b,*, Curé Olivier b, Naacke Hubert c and Ke Li a a Innovation Lab Atos, Bezons France E-mails: xiang-nan.ren@atos.net,
More informationMassive Online Analysis - Storm,Spark
Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R
More informationA Tutorial on Apache Spark
A Tutorial on Apache Spark A Practical Perspective By Harold Mitchell The Goal Learning Outcomes The Goal Learning Outcomes NOTE: The setup, installation, and examples assume Windows user Learn the following:
More informationCafeGPI. Single-Sided Communication for Scalable Deep Learning
CafeGPI Single-Sided Communication for Scalable Deep Learning Janis Keuper itwm.fraunhofer.de/ml Competence Center High Performance Computing Fraunhofer ITWM, Kaiserslautern, Germany Deep Neural Networks
More informationRDF Stores Performance Test on Servers with Average Specification
RDF Stores Performance Test on Servers with Average Specification Nikola Nikolić, Goran Savić, Milan Segedinac, Stevan Gostojić, Zora Konjović University of Novi Sad, Faculty of Technical Sciences, Novi
More informationMachine Learning In A Snap. Thomas Parnell Research Staff Member IBM Research - Zurich
Machine Learning In A Snap Thomas Parnell Research Staff Member IBM Research - Zurich What are GLMs? Ridge Regression Support Vector Machines Regression Generalized Linear Models Classification Lasso Regression
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More informationSEMANTIC WEB DATA MANAGEMENT. from Web 1.0 to Web 3.0
SEMANTIC WEB DATA MANAGEMENT from Web 1.0 to Web 3.0 CBD - 21/05/2009 Roberto De Virgilio MOTIVATIONS Web evolution Self-describing Data XML, DTD, XSD RDF, RDFS, OWL WEB 1.0, WEB 2.0, WEB 3.0 Web 1.0 is
More informationChapter 4: Apache Spark
Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationCan Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?
Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network- Based Compu2ng Laboratory Department of Computer
More informationOracle Big Data Connectors
Oracle Big Data Connectors Oracle Big Data Connectors is a software suite that integrates processing in Apache Hadoop distributions with operations in Oracle Database. It enables the use of Hadoop to process
More informationEmerging Technologies for HPC Storage
Emerging Technologies for HPC Storage Dr. Wolfgang Mertz CTO EMEA Unstructured Data Solutions June 2018 The very definition of HPC is expanding Blazing Fast Speed Accessibility and flexibility 2 Traditional
More informationDell In-Memory Appliance for Cloudera Enterprise
Dell In-Memory Appliance for Cloudera Enterprise Spark Technology Overview and Streaming Workload Use Cases Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/
More informationNew Developments in Spark
New Developments in Spark And Rethinking APIs for Big Data Matei Zaharia and many others What is Spark? Unified computing engine for big data apps > Batch, streaming and interactive Collection of high-level
More information