Big Data with rubygems.org Download Data. Aja Hammerly

1 Big Data with rubygems.org Download Data Aja Hammerly

2 Aja Hammerly

3

4 Lawyer Cat Says: Any code is copyright Google and licensed Apache

5 @thagomizer_rb Big Data

6 DATA

7 Big Data

8 Storage is Cheap

9 Intimidating

10 OMG Statistics

12 Machine Learning

13 Exploratory

14 Rubygems Download

15 Overview

16 rubygems

17
    Column Name   Type
    id            integer
    name          varchar
    created_at    datetime
    updated_at    datetime
    slug

19 126,007

20 gem_downloads

21
    Column Name   Type
    id            integer
    rubygem_id    integer
    version_id    integer
    count

22 883,848

23 dependencies

24
    Column Name       Type
    id                integer
    requirements      varchar
    rubygem_id        integer
    version_id        integer
    scope             varchar
    created_at        datetime
    updated_at        datetime
    unresolved_name

26 3,638,968

27 linksets

28
    Column Name   Type
    id            integer
    rubygem_id    integer
    home          varchar
    wiki          varchar
    docs          varchar
    mail          varchar
    code          varchar
    bugs          varchar
    created_at    datetime
    updated_at

29 125,932

30 versions

31
    Column Name   Type       Column Name                  Type
    id            integer    authors                      text
    rubygem_id    integer    description                  text
    size          integer    summary                      text
    position      integer    requirements                 text
    number        varchar    platform                     varchar
    indexed       boolean    full_name                    varchar
    prerelease    boolean    licenses                     varchar
    latest        boolean    required_ruby_version        varchar
    yanked_at     datetime   required_rubygems_version    varchar
    built_at      datetime   info_checksum                varchar
    updated_at    datetime   metadata                     hstore
    created_at    datetime   sha256

37 757,920

38 Asking Questions

39 Domain Knowledge

40 Hypothesis

41 Examples

42 The gem with the most downloads is

43 MiniTest is more popular than

44 Gems released in the last year require ruby >

45 Rails 3 is still more popular than rails

46 Fewer gems are released during

47 Largish Data

48 BigQuery

49 What

50 Why

51 How

52 I BigQuery

53 SQL

54 Fast

55 Scales

56 Complex Enough

57 Demo

58 Vocabulary

59 Dataset

60 Table

61 Import

62 Streaming

63 gcloud

64 pg

65
    require 'pg'
    require 'gcloud'
    ENV["GOOGLE_CLOUD_PROJECT"] = "rubygems-bigquery"
    ENV["GOOGLE_CLOUD_KEYFILE"] =

66
    gcloud = Gcloud.new
    bigquery = gcloud.bigquery
    bq_database = bigquery.dataset

67
    postgres = PG.connect dbname: "rubygems"

68
    bq_table = bq_database.create_table("gems") do |s|
      s.integer "id"
      s.string "name"
      s.timestamp "created_at"
      s.timestamp
    end

69
    columns = %w[id name created_at updated_at]

70
    postgres.exec("select * FROM rubygems") do |pg_table|
      pg_table.each do |row|
        hashed_row = Hash[columns.zip(row.values)]
        bq_table.insert(hashed_row)
      end
    end

74 Zip & Hash[]

75
    [key1, key2, key3, key4]
    [val1, val2, val3, val4]

76 zip

78
    [key1, key2, key3, key4]
    [val1, val2, val3, val4]
    [[key1, val1], [key2, val2], [key3, val3], [key4, val4]]

80 Hash::[]

81
    Hash[[[key1, val1], [key2, val2], [key3, val3], [key4, val4]]]

82
    { key1 => val1, key2 => val2, key3 => val3, key4 => val4 }

83 Hash[keys.zip(values)]
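The zip and Hash[] steps can be tried without a database. A minimal sketch, assuming a made-up row; the `hash_row` helper and the sample values are illustrative and not from the talk:

```ruby
# Pair each column name with the value at the same position, then build a Hash.
def hash_row(columns, values)
  Hash[columns.zip(values)] # same result as columns.zip(values).to_h
end

columns = %w[id name created_at updated_at]
# Hypothetical values, standing in for row.values from the pg result.
values = ["1", "example_gem", "2016-01-01 00:00:00", "2016-02-01 00:00:00"]

p columns.zip(values).first         # prints ["id", "1"]
p hash_row(columns, values)["name"] # prints "example_gem"
```

If the two arrays differ in length, zip follows the receiver: extra values are dropped and missing values become nil.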

84
    postgres.exec("select * FROM rubygems") do |pg_table|
      pg_table.each do |row|
        hashed_row = Hash[columns.zip(row.values)]
        bq_table.insert(hashed_row)
      end
    end

85 Batch

86 Formats

87 CSV

88 JSON

89 Avro

90 CSV

91
    require 'pg'
    require 'csv'
    require

92
    postgres = PG.connect dbname: "rubygems"
    cols = %w[id requirements created_at updated_at rubygem_id version_id

93
    query = "SELECT #{cols.join(',')} FROM dependencies"
    CSV.open(csv_path, "wb") do |csv|
      postgres.exec(query) do |pg_table|
        pg_table.each do |row|
          csv << row.values
        end
      end
    end
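The export loop above appends one array of values per row. A small sketch of the same `csv << values` step against in-memory rows, with no database needed; `rows_to_csv` and the sample rows are hypothetical:

```ruby
require 'csv'

# Serialize rows (arrays of values) to CSV text, one line per row.
def rows_to_csv(rows)
  CSV.generate do |csv|
    rows.each { |row| csv << row }
  end
end

# Hypothetical rows, standing in for row.values from the dependencies query.
rows = [
  ["1", ">= 0", "2016-01-01 00:00:00", "2016-01-01 00:00:00", "10", "100"],
  ["2", "~> 1.0, >= 1.0.2", "2016-01-02 00:00:00", "2016-01-02 00:00:00", "11", "101"]
]

puts rows_to_csv(rows)
```

Using the CSV library instead of joining values with commas matters for a value like the second requirement string: it contains a comma, so the library quotes it and the row still parses back into six fields.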

94
    storage = Gcloud.new.storage
    bucket = storage.bucket "goruco2016-bg-files"
    bucket.create_file csv_path,

95 Import

98 What Now?

99 rubygems

100 Simple

101 Rails has the most downloads.

102 Which gem has the most

103
    SELECT name, count
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    ORDER BY count DESC
    LIMIT

104
    name         count
    rake         107,076,261
    rack         100,955,906
    multi_json   100,171,080
    json         95,715,131
    bundler

105
    SELECT name, sum(count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    GROUP BY name
    ORDER BY total DESC
    LIMIT

106
    name         count
    rake         214,152,212
    rack         201,911,759
    multi_json   200,342,260
    json         191,430,173
    bundler
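The two queries differ only in SUM and GROUP BY: the first returns one row per download record, the second one total per gem name. A minimal Ruby sketch of that join and aggregation over made-up rows; the `total_downloads` helper, the hash keys, and the numbers are assumptions for illustration only:

```ruby
# Total downloads per gem name: join downloads to gems on rubygem_id,
# sum the counts per name, and sort by total, largest first.
def total_downloads(gems, downloads)
  names = gems.to_h { |g| [g[:id], g[:name]] }
  totals = Hash.new(0)
  downloads.each do |d|
    name = names[d[:rubygem_id]]
    next unless name # inner join: skip download rows without a matching gem
    totals[name] += d[:count]
  end
  totals.sort_by { |_name, total| -total }
end

# Hypothetical data, not real rubygems.org numbers.
gems = [{ id: 1, name: "rake" }, { id: 2, name: "rack" }]
downloads = [
  { rubygem_id: 1, count: 50 },
  { rubygem_id: 1, count: 30 },
  { rubygem_id: 2, count: 40 },
  { rubygem_id: 2, count: 25 },
  { rubygem_id: 9, count: 99 } # no matching gem, dropped by the join
]

p total_downloads(gems, downloads) # prints [["rake", 80], ["rack", 65]]
```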

107 How many downloads does Rails

108
    SELECT name, sum(count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    WHERE name =

109
    name    total
    rails

110 Minitest is more popular than

111
    SELECT name, sum(count) as total
    FROM [rubygems.downloads]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.downloads.rubygem_id
    GROUP BY name
    HAVING name IN ('minitest',

112
    name       total
    minitest
    rspec

113 Gems released in the last year require ruby >

114
    SELECT required_ruby_version, COUNT(*) AS total
    FROM rubygems.versions
    WHERE created_at > DATE_ADD(CURRENT_TIMESTAMP(), -1, "YEAR")
    GROUP BY required_ruby_version
    ORDER BY total

115
    name      total
    >= 0      95,857
    >=        ,069
    >=        ,624
    >= 2.0    1,648
    >=
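The query keeps versions created within the last year and counts them per required_ruby_version. A rough Ruby equivalent over made-up rows; the helper name, the sample data, the 365-day year, and the largest-first ordering are all assumptions (the ORDER BY direction is cut off in the transcript):

```ruby
# Count versions created within the last year, grouped by required_ruby_version.
# `now` is passed in so the example gives the same result whenever it runs.
def recent_requirement_counts(versions, now)
  cutoff = now - (365 * 24 * 60 * 60) # approximation: ignores leap years
  counts = Hash.new(0)
  versions.each do |v|
    counts[v[:required_ruby_version]] += 1 if v[:created_at] > cutoff
  end
  counts.sort_by { |_requirement, total| -total }
end

now = Time.utc(2016, 6, 1)
# Hypothetical rows, standing in for the versions table.
versions = [
  { required_ruby_version: ">= 0",     created_at: Time.utc(2016, 1, 10) },
  { required_ruby_version: ">= 0",     created_at: Time.utc(2015, 9, 2) },
  { required_ruby_version: ">= 2.0.0", created_at: Time.utc(2016, 3, 5) },
  { required_ruby_version: ">= 1.9.3", created_at: Time.utc(2014, 1, 1) } # older than a year, filtered out
]

p recent_requirement_counts(versions, now) # prints [[">= 0", 2], [">= 2.0.0", 1]]
```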

116 Complex

117 Rails 3 has more downloads than the other Rails major

118
    SELECT name,
           REGEXP_EXTRACT(number,r'(\d\.)') AS major,
           sum(rubygems.downloads.count) AS total
    FROM [rubygems.versions]
    JOIN rubygems.gems
      ON rubygems.gems.id = rubygems.versions.rubygem_id
    JOIN rubygems.downloads
      ON rubygems.versions.rubygem_id = rubygems.downloads.rubygem_id
    WHERE rubygems.gems.name = 'rails'
    GROUP BY name, major
    ORDER BY

120 REGEXP_EXTRACT(number,r'(\d\.)') as major
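The same pattern can be tried in Ruby to see what the capture group returns. A sketch under the assumption that BigQuery's regular expressions behave like Ruby's for this pattern; the helper names and rows are made up:

```ruby
# Ruby counterpart of REGEXP_EXTRACT(number, r'(\d\.)'):
# the first digit that is followed by a dot, with the dot included.
def major_prefix(number)
  number[/(\d\.)/, 1]
end

# Sum download counts per extracted prefix.
def downloads_by_major(rows)
  rows.each_with_object(Hash.new(0)) do |row, totals|
    totals[major_prefix(row[:number])] += row[:count]
  end
end

# Hypothetical version rows, not real download counts.
rows = [
  { number: "3.2.22", count: 10 },
  { number: "3.0.0",  count: 5 },
  { number: "4.2.6",  count: 7 }
]

p major_prefix("4.2.6")  # prints "4."
p major_prefix("10.0.1") # prints "0."
p downloads_by_major(rows)
```

The second call shows a limit of the pattern: it captures a single digit, so a two-digit major version would be grouped under its last digit.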

121
    version   downloads
    0         2,890,350,
              ,064,535,
              ,991,436,
              ,378,651,
              ,662,487,252
    5

122 version downloads 0 2, , , , ,662 5

123 Gems released in the last year require ruby >

124
    SELECT required_ruby_version, COUNT(*) AS total
    FROM rubygems.versions
    WHERE created_at > DATE_ADD(CURRENT_TIMESTAMP(), -1, "YEAR")
    GROUP BY required_ruby_version
    ORDER BY total

125
    SELECT REGEXP_EXTRACT(required_ruby_version, r'(.*?\d\.?)') AS version,
           COUNT(*) AS total
    FROM rubygems.versions
    WHERE created_at > DATE_ADD(CURRENT_TIMESTAMP(), -1, "YEAR")
    GROUP BY version
    ORDER BY total

126
    name    total
    >= 0    95,851
    >= 1    13,080
    >= 2    12,944
    ~> 2    2,040
    > 2

127 Thank You

128 @thagomizer_rb


More information

An API for Your Data. David Brennan, PhUSE An API for Your Data, David Brennan, AD08, PhUSE

An API for Your Data. David Brennan, PhUSE An API for Your Data, David Brennan, AD08, PhUSE An API for Your Data David Brennan, PhUSE An API for Your Data, David Brennan, AD08, PhUSE 1 Background Tables, tables, tables Open-source tools Web development frameworks Javascript libraries html/css

More information

Evolution of an Apache Spark Architecture for Processing Game Data

Evolution of an Apache Spark Architecture for Processing Game Data Evolution of an Apache Spark Architecture for Processing Game Data Nick Afshartous WB Analytics Platform May 17 th 2017 May 17 th, 2017 About Me nafshartous@wbgames.com WB Analytics Core Platform Lead

More information

NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE. Nicolas Buchschacher - University of Geneva - ADASS 2018

NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE. Nicolas Buchschacher - University of Geneva - ADASS 2018 NoSQL Databases An efficient way to store and query heterogeneous astronomical data in DACE DACE https://dace.unige.ch Data and Analysis Center for Exoplanets. Facility to store, exchange and analyse data

More information

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo

Microsoft. Exam Questions Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo Microsoft Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) Version:Demo NEW QUESTION 1 You have an Azure HDInsight cluster. You need to store data in a file format that

More information

Package bigqueryr. June 8, 2018

Package bigqueryr. June 8, 2018 Package bigqueryr June 8, 2018 Title Interface with Google BigQuery with Shiny Compatibility Version 0.4.0 Interface with 'Google BigQuery', see for more information.

More information

doc. RNDr. Tomáš Skopal, Ph.D. RNDr. Michal Kopecký, Ph.D.

doc. RNDr. Tomáš Skopal, Ph.D. RNDr. Michal Kopecký, Ph.D. course: Database Systems (NDBI025) SS2017/18 doc. RNDr. Tomáš Skopal, Ph.D. RNDr. Michal Kopecký, Ph.D. Department of Software Engineering, Faculty of Mathematics and Physics, Charles University in Prague

More information

Impala Intro. MingLi xunzhang

Impala Intro. MingLi xunzhang Impala Intro MingLi xunzhang Overview MPP SQL Query Engine for Hadoop Environment Designed for great performance BI Connected(ODBC/JDBC, Kerberos, LDAP, ANSI SQL) Hadoop Components HDFS, HBase, Metastore,

More information

An Introduction to BigQuery

An Introduction to BigQuery An Introduction to BigQuery (in less than 10 minutes) brought to you by The ISB Cancer Genomics Cloud This is what you should see the first time you go to the BigQuery Web UI at bigquery.cloud.google.com

More information

Apache Phoenix We put the SQL back in NoSQL

Apache Phoenix We put the SQL back in NoSQL Apache Phoenix We put the SQL back in NoSQL http://phoenix.incubator.apache.org James Taylor @JamesPlusPlus Maryann Xue @MaryannXue Eli Levine @teleturn About James o o Engineer at Salesforce.com in BigData

More information

exam. Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0

exam.   Microsoft Perform Data Engineering on Microsoft Azure HDInsight. Version 1.0 70-775.exam Number: 70-775 Passing Score: 800 Time Limit: 120 min File Version: 1.0 Microsoft 70-775 Perform Data Engineering on Microsoft Azure HDInsight Version 1.0 Exam A QUESTION 1 You use YARN to

More information

Oracle 1Z0-882 Exam. Volume: 100 Questions. Question No: 1 Consider the table structure shown by this output: Mysql> desc city:

Oracle 1Z0-882 Exam. Volume: 100 Questions. Question No: 1 Consider the table structure shown by this output: Mysql> desc city: Volume: 100 Questions Question No: 1 Consider the table structure shown by this output: Mysql> desc city: 5 rows in set (0.00 sec) You execute this statement: SELECT -,-, city. * FROM city LIMIT 1 What

More information

Typus Documentation. Release beta. Francesc Esplugas

Typus Documentation. Release beta. Francesc Esplugas Typus Documentation Release 4.0.0.beta Francesc Esplugas November 20, 2014 Contents 1 Key Features 3 2 Support 5 3 Installation 7 4 Configuration 9 4.1 Initializers................................................

More information

More MySQL ELEVEN Walkthrough examples Walkthrough 1: Bulk loading SESSION

More MySQL ELEVEN Walkthrough examples Walkthrough 1: Bulk loading SESSION SESSION ELEVEN 11.1 Walkthrough examples More MySQL This session is designed to introduce you to some more advanced features of MySQL, including loading your own database. There are a few files you need

More information

DATABASE SYSTEMS. Introduction to MySQL. Database System Course, 2016

DATABASE SYSTEMS. Introduction to MySQL. Database System Course, 2016 DATABASE SYSTEMS Introduction to MySQL Database System Course, 2016 AGENDA FOR TODAY Administration Database Architecture on the web Database history in a brief Databases today MySQL What is it How to

More information

Processing Big Data. with AZURE DATA LAKE ANALYTICS. Sean Forgatch - Senior Consultant. 6/23/ TALAVANT. All Rights Reserved.

Processing Big Data. with AZURE DATA LAKE ANALYTICS. Sean Forgatch - Senior Consultant. 6/23/ TALAVANT. All Rights Reserved. Processing Big Data with AZURE DATA LAKE ANALYTICS Sean Forgatch - Senior Consultant 6/23/2018 2018 TALAVANT. All Rights Reserved. 1 SQL Saturday Iowa 2018 6/23/2018 2018 TALAVANT. All Rights Reserved.

More information

Exam Questions

Exam Questions Exam Questions 70-775 Perform Data Engineering on Microsoft Azure HDInsight (beta) https://www.2passeasy.com/dumps/70-775/ NEW QUESTION 1 You are implementing a batch processing solution by using Azure

More information

Key Terms. Attribute join Target table Join table Spatial join

Key Terms. Attribute join Target table Join table Spatial join Key Terms Attribute join Target table Join table Spatial join Lect 10A Building Geodatabase Create a new file geodatabase Map x,y data Convert shape files to geodatabase feature classes Spatial Data Formats

More information

Hacking PostgreSQL Internals to Solve Data Access Problems

Hacking PostgreSQL Internals to Solve Data Access Problems Hacking PostgreSQL Internals to Solve Data Access Problems Sadayuki Furuhashi Treasure Data, Inc. Founder & Software Architect A little about me... > Sadayuki Furuhashi > github/twitter: @frsyuki > Treasure

More information

Structured Streaming. Big Data Analysis with Scala and Spark Heather Miller

Structured Streaming. Big Data Analysis with Scala and Spark Heather Miller Structured Streaming Big Data Analysis with Scala and Spark Heather Miller Why Structured Streaming? DStreams were nice, but in the last session, aggregation operations like a simple word count quickly

More information

Package rbraries. April 18, 2018

Package rbraries. April 18, 2018 Title Interface to the 'Libraries.io' API Package rbraries April 18, 2018 Interface to the 'Libraries.io' API (). 'Libraries.io' indexes data from 36 different package managers

More information

In this Lecture. More SQL Data Definition. Deleting Tables. Creating Tables. ALTERing Columns. Changing Tables. More SQL

In this Lecture. More SQL Data Definition. Deleting Tables. Creating Tables. ALTERing Columns. Changing Tables. More SQL In this Lecture Database Systems Lecture 6 Natasha Alechina More SQL DROP TABLE ALTER TABLE INSERT, UPDATE, and DELETE Data dictionary Sequences For more information Connolly and Begg chapters 5 and 6

More information

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet

In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet In-memory data pipeline and warehouse at scale using Spark, Spark SQL, Tachyon and Parquet Ema Iancuta iorhian@gmail.com Radu Chilom radu.chilom@gmail.com Big data analytics / machine learning 6+ years

More information

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel,

Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel, Spring 2017 EXTERNAL SORTING (CH. 13 IN THE COW BOOK) 2/7/17 CS 564: Database Management Systems; (c) Jignesh M. Patel, 2013 1 Motivation for External Sort Often have a large (size greater than the available

More information

ICS4U Project Development Example Discovery Day Project Requirements. System Description

ICS4U Project Development Example Discovery Day Project Requirements. System Description ICS4U Project Development Example Discovery Day Project Requirements System Description The discovery day system is designed to allow students to register themselves for the West Carleton Discovery Day

More information

Package bigqueryr. October 23, 2017

Package bigqueryr. October 23, 2017 Package bigqueryr October 23, 2017 Title Interface with Google BigQuery with Shiny Compatibility Version 0.3.2 Interface with 'Google BigQuery', see for more information.

More information

Python, PySpark and Riak TS. Stephen Etheridge Lead Solution Architect, EMEA

Python, PySpark and Riak TS. Stephen Etheridge Lead Solution Architect, EMEA Python, PySpark and Riak TS Stephen Etheridge Lead Solution Architect, EMEA Agenda Introduction to Riak TS The Riak Python client The Riak Spark connector and PySpark CONFIDENTIAL Basho Technologies 3

More information

Effective Rails Testing Practices

Effective Rails Testing Practices Effective Rails Testing Practices Mike Swieton atomicobject.com atomicobject.com 2007: 16,000 hours General testing strategies Integration tests View tests Controller tests Migration tests Test at a high

More information

Database Acceleration Solution Using FPGAs and Integrated Flash Storage

Database Acceleration Solution Using FPGAs and Integrated Flash Storage Database Acceleration Solution Using FPGAs and Integrated Flash Storage HK Verma, Xilinx Inc. August 2017 1 FPGA Analytics in Flash Storage System In-memory or Flash storage based DB reduce disk access

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

Learn Relational Database from Scratch. Dan Li, Ph.D. Associate Professor Computer Science Eastern Washington University

Learn Relational Database from Scratch. Dan Li, Ph.D. Associate Professor Computer Science Eastern Washington University Learn Relational Database from Scratch Dan Li, Ph.D. Associate Professor Computer Science Eastern Washington University Self-Introduction Associate professor of Computer Science at EWU Area of expertise

More information

FATWORM IMPLEMENTATION. Chenyang Wu

FATWORM IMPLEMENTATION. Chenyang Wu FATWORM IMPLEMENTATION Chenyang Wu FATWORM IMPLEMENTATION Chenyang Wu OVERVIEW Keywords A Traditional Architecture A Traditional Implementation Resources KEYWORDS KEYWORDS RDBMS, scratch Simplified SQL

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information