Data Science and Open Source Software. Iraklis Varlamis Assistant Professor Harokopio University of Athens

Similar documents
Stages of Data Processing

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Architect.

BIG DATA COURSE CONTENT

Hadoop 2.x Core: YARN, Tez, and Spark. Hortonworks Inc All Rights Reserved

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Overview. Prerequisites. Course Outline. Course Outline :: Apache Spark Development::

The Hadoop Ecosystem. EECS 4415 Big Data Systems. Tilemachos Pechlivanoglou

Hadoop. Introduction / Overview

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Creating a Recommender System. An Elasticsearch & Apache Spark approach

Big Data with Hadoop Ecosystem

Hadoop, Yarn and Beyond

DATA SCIENCE USING SPARK: AN INTRODUCTION

CSE 444: Database Internals. Lecture 23 Spark

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

Big Data. Big Data Analyst. Big Data Engineer. Big Data Architect

Syncsort DMX-h. Simplifying Big Data Integration. Goals of the Modern Data Architecture SOLUTION SHEET

Databases and Big Data Today. CS634 Class 22

Introduction into Big Data analytics Lecture 2 Big data platforms. Janusz Szwabiński

Oracle GoldenGate for Big Data

Things Every Oracle DBA Needs to Know about the Hadoop Ecosystem. Zohar Elkayam

HDInsight > Hadoop. October 12, 2017

Bring Context To Your Machine Data With Hadoop, RDBMS & Splunk

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

R Language for the SQL Server DBA

An Introduction to Apache Spark

Ian Choy. Technology Solutions Professional

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Specialist ICT Learning

Cloud Computing 3. CSCI 4850/5850 High-Performance Computing Spring 2018

The SMACK Stack: Spark*, Mesos*, Akka, Cassandra*, Kafka* Elizabeth K. Dublin Apache Kafka Meetup, 30 August 2017.

Unifying Big Data Workloads in Apache Spark

Data Architectures in Azure for Analytics & Big Data

Webinar Series TMIP VISION

Flash Storage Complementing a Data Lake for Real-Time Insight

We are ready to serve Latest Testing Trends, Are you ready to learn?? New Batches Info

Accelerate MySQL for Demanding OLAP and OLTP Use Case with Apache Ignite December 7, 2016

Microsoft Exam

microsoft

Big Data and FrameWorks; Perspectives to Applied Machine Learning

Microsoft Azure Databricks for data engineering. Building production data pipelines with Apache Spark in the cloud

Oracle Big Data. A NA LYT ICS A ND MA NAG E MENT.

The Evolution of Big Data Platforms and Data Science

Certified Big Data Hadoop and Spark Scala Course Curriculum

Delving Deep into Hadoop Course Contents Introduction to Hadoop and Architecture

Přehled novinek v SQL Server 2016

MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS

NOSQL DATABASE SYSTEMS: DECISION GUIDANCE AND TRENDS. Big Data Technologies: NoSQL DBMS (Decision Guidance) - SoSe

SQT03 Big Data and Hadoop with Azure HDInsight Andrew Brust. Senior Director, Technical Product Marketing and Evangelism

Processing of big data with Apache Spark

Seoul Elasticsearch Community Meetup

Analysis of Big Data and other sources

Agenda. Spark Platform Spark Core Spark Extensions Using Apache Spark

Understanding the latent value in all content

CIB Session 12th NoSQL Databases Structures

Data 101 Which DB, When. Joe Yong Azure SQL Data Warehouse, Program Management Microsoft Corp.

Hadoop Development Introduction

Exam Questions

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Big Data Hadoop Developer Course Content. Big Data Hadoop Developer - The Complete Course Course Duration: 45 Hours

Introduction to NoSQL Databases

SpagoBI and Talend jointly support Big Data scenarios

CISC 7610 Lecture 4 Approaches to multimedia databases. Topics: Document databases Graph databases Metadata Column databases

Big Data Applications with Spring XD

Blended Learning Outline: Cloudera Data Analyst Training (171219a)

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

Modern ETL Tools for Cloud and Big Data. Ken Beutler, Principal Product Manager, Progress Michael Rainey, Technical Advisor, Gluent Inc.

Challenges for Data Driven Systems

Distributed Databases: SQL vs NoSQL

Accelerate MySQL for Demanding OLAP and OLTP Use Cases with Apache Ignite. Peter Zaitsev, Denis Magda Santa Clara, California April 25th, 2017

Processing Unstructured Data. Dinesh Priyankara Founder/Principal Architect dinesql Pvt Ltd.

Cloud, Big Data & Linear Algebra

Cloudline Autonomous Driving Solutions. Accelerating insights through a new generation of Data and Analytics October, 2018

Blurring the Line Between Developer and Data Scientist

Deep Learning Frameworks with Spark and GPUs

Overview. : Cloudera Data Analyst Training. Course Outline :: Cloudera Data Analyst Training::

Unit 10 Databases. Computer Concepts Unit Contents. 10 Operational and Analytical Databases. 10 Section A: Database Basics

Microsoft Perform Data Engineering on Microsoft Azure HDInsight.

Indiana Oracle Users Group January meeting January 27 th, Big Data

Innovatus Technologies

ANY Data for ANY Application Exploring IBM Data Virtualization Manager for z/os in the era of API Economy

Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)

CIS 612 Advanced Topics in Database Big Data Project Lawrence Ni, Priya Patil, James Tench

Big Data Specialized Studies

Big Data Analytics. Description:

CloudSwyft Learning-as-a-Service Course Catalog 2018 (Individual LaaS Course Catalog List)

Data 101 Which DB, When Joe Yong Sr. Program Manager Microsoft Corp.

Cloud Computing & Visualization

Apache Beam: portable and evolutive data-intensive applications

Logging Reservoir Evaluation Based on Spark. Meng-xin SONG*, Hong-ping MIAO and Yao SUN

Intro to Big Data on AWS Igor Roiter Big Data Cloud Solution Architect

CERTIFICATE IN SOFTWARE DEVELOPMENT LIFE CYCLE IN BIG DATA AND BUSINESS INTELLIGENCE (SDLC-BD & BI)

SQL Server 2017 Power your entire data estate from on-premises to cloud

Migrate from Netezza Workload Migration

Hadoop course content

Lecture 25 Overview. Last Lecture Query optimisation/query execution strategies

Data contains value and knowledge

Data Ingestion at Scale. Jeffrey Sica

Twitter data Analytics using Distributed Computing

Transcription:

Data Science and Open Source Software Iraklis Varlamis Assistant Professor Harokopio University of Athens varlamis@hua.gr

What is data science? 2

Why data science is important? More data (volume, variety,...) better analysis better decision making New data (trends, events) faster analysis new opportunities 3

The skills of a data scientist * Non-technical skills Education: Computer Science, Mathematics, Statistics Domain Knowledge Intellectual Curiosity Communication Skills Balance between technical, communication and presentation skills Technical skills Programming: R, Python Machine Learning/Data Mining:scikit-learn, KNIME, TensorFlow, Theano, PyTorch Big Data Processing: Hadoop, Spark, Flink or Storm Data management: SQL, nosql Data Visualization: D3.js, Chart.js, Tableau, ggplot or lattice in R, matplotlib,plotly or seaborn in Python *According to Matthew Mayo @ KDNuggets: https://www.kdnuggets.com/2016/05/10-must-have-skills-data-scientist.html 4

There is no one ring to rule them all Statisticians choose R: Programmers choose Python: Many statistical tests and models Better in ad hoc analysis New statistical models can be written in a few lines Comprehensive R Archive Network (12576 packages) https://cran.r-project.org/ Run Python from R https://github.com/rstudio/reticulate Lots of code in stackoverflow Better data manipulation Supports scripting website or other applications 140,784 projects in Python Package Index (PyPI) https://pypi.org/ Run R from Python https://pypi.org/project/rpy2/ 5

Crunching Big Data and Analytics Store Process Document databases (e.g. MongoDB, CouchDB): free-form JSON documents Key-value stores (e.g. Redis, Memcached): from integers or strings to JSON documents Wide column stores (e.g. Cassandra, HBase): data in columns instead of rows Graph databases (e.g. Neo4j): networks or graphs of related entities Search engines (e.g. Elasticsearch, Solr): allow full-text or faceted search, search by synonyms etc. Distributed processing and cluster-computing Hadoop Common, Hadoop Distributed File System, Hadoop YARN (job scheduling and cluster resource management), Hadoop MapReduce (parallel processing) Apache Spark With native bindings for Java, Scala, Python and R Supports SQL, streaming data, machine learning, and graph processing 6

Crunching Big Data and Analytics Analyze and Visualize Process on the fly - Stream analytics ETL Explore: Examine data sets to gain insights Orchestrate (messaging): Kafka Stream data (or in mini-batches): Flink, Storm or Spark streaming Learn and predict: Find trends, predict future activity, Predictive analytics Visualize Process and predict: Spark MLLib Store results: MongoDB, Cassandra Visualize: D3.js, Plotly 7

When it comes to data streams (e.g. IoT) Source: InfoQ.com 8

When it comes to social networks Networks are graphs, and social network analysis requires graph operations at large scale! Distributed storage and processing is the only solution. OLAP OLTP TITAN Gephi ETL as graph traversal Distributed notebook Storage Cassandra, HBase,.. Indexing Elasticsearch, Solr 9

When it comes to document streams (e.g. tweets) The ELK stack Text Mining NLP N. Tsirakis, V. Poulopoulos, P. Tsantilas, I. Varlamis. (2017). Large scale opinion mining for social, news and blog data. Journal of Systems and Software 127: 237-248 10

When it comes to Machine Learning and AI Social Media (re)fetch content Process Content Translate NER Sentiment Image/Video tagging index content Visualization Data lake archive content (ETL) Searchable Content Process Search Deliver Deliver Visualize annotated content annotate content content fetched Collect Selected data Event Streaming Store 11

Open Source Is The New Normal......In Data and Analytics, Scott Gnau, CTO at Hortonworks, @ Forbes, July 2017 And how Open Source makes money? Customization of Code Sell Value-Added Enhancements Provide Support Sell Documentation Sell Binaries Offer OSS Open Source S... 12

Open Source Services Accelerated product development by eliminating product dependencies Agile, Innovative and cost-effective development Cloud services (Amazon, Microsoft, Google, IBM, Salesforce 13

Thank you and questions 14