DATA FORMATS FOR DATA SCIENCE Remastered
1 Budapest BI FORUM 2016. DATA FORMATS FOR DATA SCIENCE Remastered. Valerio Maggio, Data Scientist and Researcher, Fondazione Bruno Kessler (FBK), Trento, Italy
2 WhoAmI: Post Doc @ FBK, Complex Data Analytics Unit (MPBA). Interested in Machine Learning, Text and Data Processing, with Deep divergences recently. Fellow Pythonista since 2006 (scientific Python ecosystem). PyData Italy Chair. Kidding, that's me!-)
3 DATA FORMATS FOR DATA SCIENCE. Data Processing. Q: What's the best way to process (my) data? Q+: What's the most Pythonic way to do that? Data Sharing. Q: What's the best way to share (and to present) data? A: [Interactive] Charts - Data Visualisation
4 JUPYTER NOTEBOOK FOR DATA SHARING AND DOCUMENTATION
5 #1 DATA THAT YOU CAN READ Human Readable Formats
6 DOES YOUR DATA HAVE A STRUCTURE OR NOT? DATA FORMATS THAT YOU CAN READ
7 Unstructured Data
8 More Pythonic
9 Numpy to the rescue
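The code shown on the slides above did not survive transcription. As a minimal sketch of the idea, assuming the unstructured data is a plain text file of whitespace-separated numbers, NumPy's `loadtxt` can parse it in one call. The sample content and variable names here are hypothetical.

```python
import io

import numpy as np

# Hypothetical sample: whitespace-separated numbers with a comment header.
# A real file path could be passed to loadtxt instead of this in-memory buffer.
raw_text = io.StringIO(
    "# x y z\n"
    "1.0 2.0 3.0\n"
    "4.0 5.0 6.0\n"
)

# loadtxt skips lines starting with '#' by default and returns a 2-D float array.
data = np.loadtxt(raw_text)

# Column-wise statistics come for free once the data is an array.
column_means = data.mean(axis=0)
```

Compared with splitting and converting each line by hand, the parsing loop disappears and the result is immediately usable for numerical work.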
11 CSV Structured Data
12 csv Module (in standard library)
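A minimal sketch of the standard-library `csv` module, assuming a small table with a header row. The field names and values are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Hypothetical records; every value is a string, as CSV carries no type information.
fieldnames = ["name", "city", "score"]
rows = [
    {"name": "Ada", "city": "Trento", "score": "9.5"},
    {"name": "Bob", "city": "Budapest", "score": "7.0"},
]

# Write the header and the rows. newline="" lets the csv module control line endings.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)

# Read everything back: DictReader maps each row onto the header fields.
buffer.seek(0)
records = list(csv.DictReader(buffer))

# Types must be restored explicitly after reading.
scores = [float(record["score"]) for record in records]
```

The explicit `float()` conversion at the end is the price of a human-readable format: the structure is preserved, the types are not.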
18 SPREADSHEETS: XLS(X)
19 xlsxwriter.readthedocs.io
20 Structured Data++ Analyse DBs from many angles
21 1. INFORMATION ARCHITECTURE: Normalisation (no duplicates) and fixed structure: relational databases. SQL: Structured Query Language. Many different dialects! ORM is the way!
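To make "normalisation and fixed structure" concrete, here is a minimal sketch using the standard-library `sqlite3` module with raw SQL rather than an ORM. The schema, table names, and rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# In-memory database; the schema below is an illustrative assumption.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE author (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE talk (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES author(id)
    );
    """
)

# Normalisation: the author name is stored once and referenced by id.
conn.execute("INSERT INTO author (id, name) VALUES (1, 'Ada')")
conn.executemany(
    "INSERT INTO talk (title, author_id) VALUES (?, ?)",
    [("Data Formats", 1), ("Binary Files", 1)],
)

# A join rebuilds the denormalised view on demand.
talks = conn.execute(
    "SELECT talk.title, author.name "
    "FROM talk JOIN author ON author.id = talk.author_id "
    "ORDER BY talk.id"
).fetchall()
conn.close()
```

The SQL strings above are SQLite's dialect. An ORM hides such dialect differences behind Python classes, which is the point the slide makes.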
22 SQLAlchemy
25 2. FLEXIBILITY: Your data requires a flexible (not fixed) structure, a.k.a. NoSQL (databases). JSON-based data format, e.g. MongoDB (pymongo).
26 JSON
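A minimal sketch of what a flexible structure means in practice, using the standard-library `json` module. The two documents are hypothetical and deliberately do not share the same fields.

```python
import json

# Hypothetical documents: no fixed schema, each one carries its own fields.
documents = [
    {"name": "sensor-a", "readings": [1.5, 2.0]},
    {"name": "sensor-b", "location": {"city": "Trento"}, "tags": ["indoor"]},
]

# Serialise to human-readable text and parse it back.
encoded = json.dumps(documents, indent=2)
decoded = json.loads(encoded)

# Missing fields have to be handled by the reader, since nothing enforces them.
cities = [doc.get("location", {}).get("city") for doc in decoded]
```

The flexibility moves the burden of checking structure from the database to the code that reads the data.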
27 Jupyter Notebook Data Format
28 2.5 FLEXIBILITY AND VALIDATION: Your data requires a flexible(ish) structure, but you want to validate your data: XML-based data format.
30 3. STRUCTURE AND SPEED: Normalisation (no duplicates) and fixed structure: relational databases. (Super effective) in-DB analytics: column-oriented DB.
31 BIG DATA AND COLUMNAR DBS. The Big Data world is shifting towards columnar DBs, which are better suited to OLAP (analytics) than to OLTP.
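The difference between row-oriented and column-oriented layouts can be sketched in plain Python. This is only an illustration of the data layout, not of any particular columnar database, and the records are hypothetical.

```python
# Row-oriented layout: one record per entry, convenient for OLTP-style
# operations that read or update a whole record at once.
rows = [
    {"id": 1, "city": "Trento", "amount": 10.0},
    {"id": 2, "city": "Budapest", "amount": 20.0},
    {"id": 3, "city": "Trento", "amount": 5.5},
]

# Column-oriented layout: one contiguous sequence per field.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytic (OLAP-style) aggregate needs a single field.
# The row store walks every record; the column store reads one sequence.
total_row_store = sum(row["amount"] for row in rows)
total_column_store = sum(columns["amount"])
```

On disk, the columnar layout means an aggregate over one field only has to read that field's data, which is why it suits analytics workloads.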
34 #2 DATA THAT YOU CANNOT READ Machine Readable Formats
35 unless..
36 BINARY FORMAT* Integers and floats in native and string representations. Space is not the only concern (for text). Speed matters! Python conversions with int() and float() are slow (costly atoi()/atof() C functions). * A. Scopatz, K.D. Huff - Effective Computation in Physics: Field Guide to Research with Python, O'Reilly 2015
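A minimal sketch of the space argument, comparing the string and native representations of the same floats with the standard-library `struct` module. The sample values are arbitrary, and no timing is measured here.

```python
import struct

values = [3.141592653589793, 2.718281828459045, 1.4142135623730951]

# String representation: every digit costs one byte, plus separators.
text_repr = " ".join(repr(v) for v in values)
text_bytes = text_repr.encode("ascii")

# Native representation: each double always takes 8 bytes
# ("<3d" means three little-endian 64-bit floats).
binary_bytes = struct.pack("<3d", *values)

# Reading text requires parsing every token back into a float...
restored_from_text = [float(token) for token in text_repr.split()]
# ...while the binary form is unpacked without any string parsing.
restored_from_binary = list(struct.unpack("<3d", binary_bytes))
```

Both round trips are lossless here, but the text form is more than twice as large and needs a string-to-float conversion for every value.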
37 import pickle Still, it is often desirable to have something more than a binary chunk of data in a file.
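A minimal sketch of `pickle` serialising an arbitrary Python object into an opaque binary chunk. The record is hypothetical.

```python
import pickle

# Hypothetical Python object to persist.
record = {"id": 42, "values": [1.0, 2.5, 4.0], "label": "sample"}

# dumps() produces an opaque byte string; dump() would write it to a file instead.
blob = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)

# loads() rebuilds an equal, independent object.
# Only unpickle data from sources you trust: loading can execute arbitrary code.
restored = pickle.loads(blob)
```

The blob carries no index, hierarchy, or queryable metadata, which is the limitation the slide points to.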
38 HIERARCHICAL DATA FORMAT 5 (a.k.a. HDF5). Free and open source file format specification. (+) Works great with both big and tiny datasets. (+) Storage friendly: allows for compression. (+) Dev. friendly: query DSL + multiple-language support. Python: PyTables, hdf5, h5py
40 NUMPY ARRAYS TIGHT INTEGRATION with PyTables Accessing the table
41 HIERARCHY AND GROUPS
42 DATA CHUNKING. A. Scopatz, K.D. Huff - Effective Computation in Physics: Field Guide to Research with Python, O'Reilly 2015
43 DATA CHUNKING. Small chunks are good for accessing only some of the data at a time. Large chunks are good for accessing lots of data at a time. Reading and writing chunks may happen in parallel. A. Scopatz, K.D. Huff - Effective Computation in Physics: Field Guide to Research with Python, O'Reilly 2015
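The trade-off between small and large chunks can be sketched with plain arithmetic. This is not the HDF5 API: the helper names `chunks_touched` and `rows_read` are hypothetical, and the sketch assumes a one-dimensional dataset split into fixed-size chunks that must be read whole.

```python
def chunks_touched(start, stop, chunk_size):
    """Return the indices of the fixed-size chunks overlapping rows [start, stop)."""
    if stop <= start:
        return []
    first = start // chunk_size
    last = (stop - 1) // chunk_size
    return list(range(first, last + 1))


def rows_read(start, stop, chunk_size):
    """Rows actually loaded when every touched chunk is read in full."""
    return len(chunks_touched(start, stop, chunk_size)) * chunk_size


# Reading 10 rows: a small chunk size loads far fewer unneeded rows.
small = rows_read(100, 110, chunk_size=16)
large = rows_read(100, 110, chunk_size=1024)

# Reading 4096 rows: a large chunk size needs far fewer separate reads.
reads_small_chunks = len(chunks_touched(0, 4096, chunk_size=16))
reads_large_chunks = len(chunks_touched(0, 4096, chunk_size=1024))
```

A partial read favours small chunks (16 rows loaded instead of 1024), while a full scan favours large ones (4 reads instead of 256).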
44 PARALLEL HDF5 MPI (mpi4py) integration
45 HDF5 VS MONGODB [Table: storage space (MB) and average query time per single call (sec.) for HDF5 (blosc filter), MongoDB (flat storage), and MongoDB (compact storage), by total number of documents and entries; values lost in transcription]
46 ROOT: Data Analysis Framework (and tool). Written in C++; native extension in Python (aka PyROOT). ROOT6 also ships a Jupyter kernel. Definition of a new binary data format (.root) based on the serialisation of C++ objects.
48 C++ style rootpy root_numpy rootpy.github.io/ rootpy.github.io/root_numpy/
50 Tight integration with PyROOT objects
52 MULTIDIMENSIONAL LABELED ARRAY
53 when Pandas is not enough!
54 #3 DATA IN MULTIPLE FORMATS (Big) Data Lake
55 HDFS. matthewrocklin.com/blog/work/2016/02/22/dask-distributed-part-2
56 HDFS: Hadoop Filesystem. Distributed filesystem on top of Hadoop. Data can be organised in shards and distributed among several machines (cluster config). (De facto) Big Data data format. Python: hdfs3 (native implementation of HDFS in C++). No Java along the way!
57 HDFS+CSV Opening a Single File on the HDFS
58 Wildcard opening of CSVs on the HDFS
60 Out-of-Core Processing
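The slide's own code is not in the transcription. As a minimal sketch of the out-of-core idea using only the standard library (not a distributed framework), the data can be streamed in fixed-size chunks so that only one chunk is in memory at a time. The helper `iter_chunks`, the column name, and the sample data are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows until the reader is exhausted."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


# Stand-in for a CSV file too large to load at once: values 1..10.
source = io.StringIO("value\n" + "\n".join(str(i) for i in range(1, 11)) + "\n")
reader = csv.DictReader(source)

# Keep only running aggregates; each chunk is discarded after use.
count = 0
total = 0.0
for chunk in iter_chunks(reader, chunk_size=3):
    count += len(chunk)
    total += sum(float(row["value"]) for row in chunk)

mean = total / count
```

The same pattern applies to any aggregate that can be updated incrementally. Memory use is bounded by the chunk size rather than by the size of the dataset.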
61 Complicated data require complicated formats Complicated formats require good tools OPeNDAP:
62 Thanks a lot for your kind attention! vmaggio@fbk.com +ValerioMaggio it.linkedin.com/in/valeriomaggio