Bixo - Web Mining Toolkit 23 Sep Ken Krugler TransPac Software, Inc.

1 Web Mining Toolkit
Ken Krugler, TransPac Software, Inc.

My background - did a startup called Krugle. Used Nutch to do a vertical crawl of the web, looking for technical software pages. Mined pages for references to open source projects. Used that experience to create Bixo, an open source web mining toolkit built on top of Hadoop, Cascading, and Tika.

2 Web Mining 101
- Extracting & processing web data
- More than just search
- Business intelligence, competitive intelligence, events, people, companies, popularity, pricing, social graphs, Twitter feeds, Facebook friends, support forums, shopping carts

Quick intro to web mining, so we're on the same page. Most people think about the big search companies when they think about web mining. Search is clearly the biggest web mining category, and generates the most revenue. But other types of web mining have value that is high and growing. This is what Bixo focuses on.

3 Four Steps in Mining
- Collect - fetch content from web
- Parse - extract data from formats
- Analyze - tokenize, rate, classify, cluster
- Produce - an index, a report

Note - for search, this does not include serving up the search results. Why do I bring this up? To help clarify why web mining is not the same as vertical search (next slide).

4 Vertical Search
- Vertical crawl to get specific content
- Common use case for Nutch, Heritrix
- But web mining often has a different outcome
- And specialized processing of data

Most people think of vertical search when they think of specialized web mining. Lots of people have been doing this, using OSS like Nutch & Heritrix. The end result is typically a Lucene index, plus the content, inverted links, etc. Typical web mining is not the same as vertical search. It often uses a white list, versus crawling to discover links, and does more specialized processing of the data. And these differences help answer the question of (next slide)

5 Why Bixo?
- Response to needs of commercial projects
- Plug into Cascading-based workflow
- Low IT time/skill requirements
- Run well in AWS EC2 environment
- Flexible I/O support for AWS - S3, HBase
- Toolkit for building custom solutions
- Fetch white list (parse/index, data mine)
- Scrape white list (social popularity)

Does the world really need yet another web crawler? No, but it does need a web mining toolkit. Two companies agreed to sponsor work on Bixo as an open source project. On the point of running well in an EC2 environment: even though there are many web mining tasks that can be handled on a single computer, you very quickly run into issues of scale if you can't handle 100M+ pages.

6 Bixo Overview
- MIT license open source project
- In use by three companies
- Pipe model for building workflows
- Runs on top of Hadoop/Cascading

Full disclosure - Bixo makes heavy use of Cascading, which is under GPL. So if you want to sell a product based on Bixo, you need to talk to Chris Wensel. The pipe model comes from our use of Cascading to define the workflows.

7 What is Cascading?
- API for Hadoop data processing workflows
- Operations on tuples with named fields
- Workflows created from pipes
- Reduces painful low-level MR details
- Key for complex/reliable workflows

I know Chris Wensel has previously talked about Cascading here, but just to make sure we're all on the same page: a tuple is like a row in a database, with named fields that have values. Example of a tuple - the result of fetching a page, which has URL, time of fetch, content, headers, response rate, etc. Because you can build workflows out of a mix of pre-defined & custom pipes, it's a real toolkit. Chris explains it as MR is assembly, and Cascading is C. Sometimes it feels more like C++ :) A key aspect of reliable workflows is Cascading's ability to check your workflow (the DAG it builds). It finds cases where fields aren't available for operations. This solves a key problem we ran into when customizing Nutch at Krugle.
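The field check mentioned in these notes can be illustrated without Cascading itself. What follows is a minimal sketch in plain Python, not the Cascading API: each step declares which named fields it reads and which it adds, and a hypothetical `check_workflow` helper walks the steps before anything runs. The step and field names are assumptions made up for the illustration.

```python
from typing import NamedTuple


class Step(NamedTuple):
    name: str
    needs: frozenset  # named fields the step reads
    adds: frozenset   # named fields the step appends to each tuple


def check_workflow(source_fields, steps):
    """Return (step name, missing fields) pairs; an empty list means OK."""
    available = set(source_fields)
    problems = []
    for step in steps:
        missing = step.needs - available
        if missing:
            problems.append((step.name, sorted(missing)))
        available |= step.adds
    return problems


# Hypothetical three-step workflow; "author" is never produced by any step.
steps = [
    Step("fetch", frozenset({"url"}), frozenset({"content", "headers"})),
    Step("parse", frozenset({"content"}), frozenset({"text", "outlinks"})),
    Step("score", frozenset({"text", "author"}), frozenset({"score"})),
]
problems = check_workflow({"url"}, steps)
print(problems)
```

The point of checking up front is that a missing field is reported when the workflow is assembled, rather than surfacing partway through a long-running job.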

8 Architecture

This architecture looks nice and squeaky clean - and in general it is. One issue is with the fetch phase of Bixo not fitting well into the MR model. External resource constraints mean you can't treat it like a regular job. So lots of threads in a special reduce phase, with corresponding issues:
- Stack size
- Error handling

9 HUGMEE
Hadoop Users who Generate the Most Effective Emails

Let's now use a real example of using Bixo to do web mining. Imagine that the Apache Foundation decided to honor people who make significant contributions to the Hadoop community. In a typical company, determining the winner would depend on political maneuvering, bribes, and sucking up. But the Apache Foundation could decide to go for a quantitative approach for the HUGMEE award.

10 Helpful Hadoopers
- Use mailing list archives for data (collect)
- Parse mbox files and emails (parse)
- Score based on key phrases (analyze)
- End result is score/name pair (produce)

How do you figure out the most helpful Hadoopers? As we discussed previously, it's a classic web mining problem. Luckily the Hadoop mailing lists are all nicely archived as monthly mbox files. How do we score based on key phrases (next slide)?

11 Scoring Algorithm
- Very sophisticated point system
- "thanks" == 5
- "owe you a beer" == 50
- "worship the ground you walk on" ==
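The point system is easy to sketch. The following is an illustrative Python version, not the code from the talk: `PHRASE_POINTS` and `score_text` are hypothetical names, the choice to count every occurrence of a phrase is an assumption, and the third phrase is left out because its value is missing from the slide text.

```python
import re

# Points taken from the slide. The value for "worship the ground you walk on"
# is missing in the source, so that phrase is omitted rather than guessed.
PHRASE_POINTS = {
    "thanks": 5,
    "owe you a beer": 50,
}


def score_text(text):
    """Sum the points for every whole-phrase occurrence in the text."""
    lowered = text.lower()
    total = 0
    for phrase, points in PHRASE_POINTS.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        total += points * len(re.findall(pattern, lowered))
    return total


example_score = score_text("Thanks, that fixed it. I owe you a beer!")
print(example_score)
```

The word-boundary match keeps words such as "thanksgiving" from scoring as "thanks".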

12 High Level Steps
- Collect emails
  - Fetch mod_mbox generated page
  - Parse it to extract links to mbox files
  - Fetch mbox files
  - Split into separate emails
- Parse emails
  - Extract key headers (messageid, email, etc.)
  - Parse body to identify quoted text

Parsing the mod_mbox page is simple with Tika's HtmlParser. Cheated a bit when parsing emails - some users like Owen have many aliases, so I hand-generated an alias resolution table.
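The split and parse steps above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the Bixo/Tika code from the talk: `split_mbox` and `parse_message` are hypothetical helpers, messages are assumed to be plain text, and quoted text is assumed to be any line starting with ">".

```python
import email
import re


def split_mbox(mbox_text):
    """Split mbox text into raw messages at 'From ' separator lines."""
    chunks = re.split(r"(?m)^From .*\n", mbox_text)
    return [chunk for chunk in chunks if chunk.strip()]


def parse_message(raw):
    """Extract key headers and the non-quoted part of the body."""
    msg = email.message_from_string(raw)
    body = msg.get_payload()
    lines = body.splitlines() if isinstance(body, str) else []
    own_lines = [ln for ln in lines if not ln.lstrip().startswith(">")]
    return {
        "message_id": msg["Message-ID"],
        "in_reply_to": msg["In-Reply-To"],
        "from": msg["From"],
        "new_text": "\n".join(own_lines).strip(),
    }


# Made-up two-message archive for illustration.
SAMPLE = (
    "From alice@example.org Mon Sep 21 10:00:00 2009\n"
    "Message-ID: <1@example.org>\n"
    "From: Alice <alice@example.org>\n"
    "\n"
    "How do I set the number of reducers?\n"
    "\n"
    "From bob@example.org Mon Sep 21 11:00:00 2009\n"
    "Message-ID: <2@example.org>\n"
    "In-Reply-To: <1@example.org>\n"
    "From: Bob <bob@example.org>\n"
    "\n"
    "> How do I set the number of reducers?\n"
    "Use the job configuration.\n"
)
messages = [parse_message(raw) for raw in split_mbox(SAMPLE)]
print(messages[1]["new_text"])
```

Dropping quoted lines matters for the later scoring step, since a "thanks" inside quoted text belongs to an earlier message.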

13 High Level Steps
- Analyze emails
  - Find key phrases in replies (ignore signoff)
  - Score emails by phrases
  - Group & sum by message ID
  - Group & sum by address
- Produce ranked list
  - Toss addresses with no love
  - Sort by summed score

Need to ignore the "thanks" in a "thanks in advance for doing my job for me" signoff. Generate two tuples for each email:
- One with messageid/name/address
- One with reply-to messageid/score

The group/sum aspect is a classic reduce operation.
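The group-and-sum steps above can be sketched in memory. This is a plain Python illustration, not the Cascading pipes used in the talk: `rank_authors` and `simple_score` are hypothetical names, and the scorer is a stand-in that only knows the "thanks" phrase and the "thanks in advance" signoff.

```python
from collections import defaultdict


def simple_score(text):
    """Stand-in scorer: 5 points per 'thanks', ignoring the signoff."""
    cleaned = text.lower().replace("thanks in advance", "")
    return 5 * cleaned.count("thanks")


def rank_authors(emails, score_fn):
    """emails: dicts with message_id, address, in_reply_to, new_text."""
    # First tuple type: message ID -> author address.
    author_of = {e["message_id"]: e["address"] for e in emails}

    # Second tuple type: (replied-to message ID, score), summed per message.
    score_by_message = defaultdict(int)
    for e in emails:
        if e["in_reply_to"]:
            score_by_message[e["in_reply_to"]] += score_fn(e["new_text"])

    # Join on message ID, then group and sum per address.
    score_by_address = defaultdict(int)
    for message_id, score in score_by_message.items():
        address = author_of.get(message_id)
        if address is not None:
            score_by_address[address] += score

    # Toss addresses with no points and sort by summed score, highest first.
    ranked = [(a, s) for a, s in score_by_address.items() if s > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


# Made-up thread for illustration.
emails = [
    {"message_id": "m1", "address": "alice", "in_reply_to": None,
     "new_text": "How do I tune the sort buffer? Thanks in advance."},
    {"message_id": "m2", "address": "bob", "in_reply_to": "m1",
     "new_text": "Raise the buffer size."},
    {"message_id": "m3", "address": "alice", "in_reply_to": "m2",
     "new_text": "Thanks, that worked."},
    {"message_id": "m4", "address": "carol", "in_reply_to": "m2",
     "new_text": "Thanks from me too."},
]
ranking = rank_authors(emails, simple_score)
print(ranking)
```

Each dictionary pass corresponds to one grouping step; in a MapReduce setting the same sums would be computed in reducers keyed by message ID and then by address.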

14 Workflow

I think this slide is pretty self-explanatory - two Bixo fetch cycles, 6 custom Cascading operations, 6 MR jobs. OK, actually not so clear, but the key point is that only the purple is stuff that I had to actually create. Some lines are purple as well, since that workflow (DAG) is also something I defined - see next page. But only two custom operations were actually needed - parsing the mod_mbox page and calculating the score. Running took about 30 minutes - mostly politely waiting until it was OK to do another fetch. Downloaded 150MB of mbox files. 409 unique addresses with at least one positive reply.

15 Building the Flow

Most of the code needed to create the workflow for this data mining app. Lots of oatmeal code - which is good. Don't want to be writing tricky code here. Could optimize, but that would be a mistake - most web mining is programmer-constrained. So just use more servers in EC2 - cheaper & faster.

16 mod_mbox Page

Example of the top-level pages that were fetched in the first phase. They then needed to be parsed to extract links to mbox files.

17 Custom Operation
- Example of one of the two custom operations
- Parsing mod_mbox page
- Uses Tika to extract IDs
- Emits tuple with URL for each mbox ID
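The shape of this operation can be sketched outside Bixo. The version below uses Python's standard `html.parser` in place of Tika and is only an illustration: the link pattern (`YYYYMM.mbox/...`), the example URL, and the names `MboxLinkParser` and `mbox_urls` are all assumptions, not details from the talk.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class MboxLinkParser(HTMLParser):
    """Collect unique mbox IDs from anchor hrefs (assumed YYYYMM.mbox/...)."""

    MBOX_ID = re.compile(r"^(\d{6})\.mbox\b")

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = self.MBOX_ID.match(href)
        if match and match.group(1) not in self.ids:
            self.ids.append(match.group(1))


def mbox_urls(page_url, html):
    """Emit one absolute URL per mbox ID found on the page."""
    parser = MboxLinkParser()
    parser.feed(html)
    return [urljoin(page_url, mbox_id + ".mbox") for mbox_id in parser.ids]


# Made-up page fragment and URL for illustration.
PAGE = (
    '<a href="200909.mbox/thread">Thread</a>'
    '<a href="200909.mbox/date">Date</a>'
    '<a href="200908.mbox/thread">Thread</a>'
    '<a href="/help">Help</a>'
)
urls = mbox_urls("http://lists.example.org/mod_mbox/hadoop-user/", PAGE)
print(urls)
```

Several links on the page can point at the same month, so the IDs are de-duplicated before URLs are emitted.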

18 Validate

Curve looks right - exponential decay. 409 unique addresses that got some love from somebody.

19 This Hug's for Ted!

And the winner is Ted Dunning. I know - I should have colored the elephant yellow.

20 Produce

A list of the usual suspects. Coincidentally, Ted helped me derive the scoring algorithm I used... hmm.

21 Use Bixo to
- Find +/- product comments on forums
- Compare web site quality
- Track social network popularity
- Derive optimized SEO terms
- Scrape and analyze pricing data

The previous example could easily be changed to find opinion makers on forums. There are many other use cases. All involve the web mining workflow - fetch, parse, analyze, produce.

22 Summary
- Bixo is a web mining toolkit
- Built on Hadoop, Cascading, Tika
- Young project but used commercially
- Future - Mahout, monitoring, HBase, URL DB, cleanup, bug fixes, rinse, repeat

Lots to be done, of course, but moving fast.

23 Resources
- Web:
- List:
- Source:
- Bugs:

URLs to find out more about the Bixo project. Stefan Groschupf from 101tec helped with initial Bixo coding. His company provides infrastructure for the project, thus 101tec.com in the URLs above.

24 Any Questions?

Tambako the Bixo - a webcrawler toolkit Ken Krugler, Stefan Groschupf

Tambako the Bixo - a webcrawler toolkit Ken Krugler, Stefan Groschupf Tambako the Jaguar@flickr.com Bixo - a webcrawler toolkit Ken Krugler, Stefan Groschupf Jule_Berlin@flickr.com Agenda Overview Background Motivation Goals Status Differences Architecture Data life cycle

More information

Web Mining Strata 2012

Web Mining Strata 2012 1 Scale Unlimited Web Mining Strata 2012 photo by: i_pinz, flickr Copyright (c) 2012 Scale Unlimited. All Rights Reserved. Reproduction or distribution of this document in any form without prior written

More information

CIO 24/7 Podcast: Tapping into Accenture s rich content with a new search capability

CIO 24/7 Podcast: Tapping into Accenture s rich content with a new search capability CIO 24/7 Podcast: Tapping into Accenture s rich content with a new search capability CIO 24/7 Podcast: Tapping into Accenture s rich content with a new search capability Featuring Accenture managing directors

More information

Web Analysis in 4 Easy Steps. Rosaria Silipo, Bernd Wiswedel and Tobias Kötter

Web Analysis in 4 Easy Steps. Rosaria Silipo, Bernd Wiswedel and Tobias Kötter Web Analysis in 4 Easy Steps Rosaria Silipo, Bernd Wiswedel and Tobias Kötter KNIME Forum Analysis KNIME Forum Analysis Steps: 1. Get data into KNIME 2. Extract simple statistics (how many posts, response

More information

Erlang and Thrift for Web Development

Erlang and Thrift for Web Development Erlang and Thrift for Web Development Todd Lipcon (@tlipcon) Cloudera June 25, 2009 Introduction Erlang vs PHP Thrift A Case Study About Me Who s this dude who looks like he s 14? Built web sites in Perl,

More information

Image Credit: Photo by Lukas from Pexels

Image Credit: Photo by Lukas from Pexels Are you underestimating the importance of Keywords Research In SEO? If yes, then really you are making huge mistakes and missing valuable search engine traffic. Today s SEO world talks about unique content

More information

Online Video Playbook. Written by: Johnny Beirne

Online Video Playbook. Written by: Johnny Beirne Online Video Playbook Written by: Johnny Beirne Table of Contents Introduction... 1 On-camera...... 2 Animation...... 3 Animated GIFs........ 4 Screen Capture Tutorials... 5 Smart Phone Videos...... 6

More information

program self-assessment tool

program self-assessment tool 10-Point Email Assessment (Based on FulcrumTech Proprietary Email Maturity) Your Website Email program self-assessment tool This brief self-assessment tool will help you honestly assess your email program

More information

Nutch as a Web mining platform the present and the future Andrzej Białecki

Nutch as a Web mining platform the present and the future Andrzej Białecki Apache Nutch as a Web mining platform the present and the future Andrzej Białecki ab@sigram.com Intro Started using Lucene in 2003 (1.2-dev?) Created Luke the Lucene Index Toolbox Nutch, Lucene committer,

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

LizardThemes.com Free & Premium WordPress Themes. LizardThemes. User Guide. First Edition

LizardThemes.com Free & Premium WordPress Themes. LizardThemes. User Guide. First Edition LizardThemes.com Free & Premium WordPress Themes LizardThemes User Guide First Edition Online version: http://lizardthemes.com/documentation/ 2013 Contents Chapter 1 How to start... 3 Chapter 2 Theme Settings...

More information

Oleksandr Kuzomin, Bohdan Tkachenko

Oleksandr Kuzomin, Bohdan Tkachenko International Journal "Information Technologies Knowledge" Volume 9, Number 2, 2015 131 INTELLECTUAL SEARCH ENGINE OF ADEQUATE INFORMATION IN INTERNET FOR CREATING DATABASES AND KNOWLEDGE BASES Oleksandr

More information

Endless Monetization

Endless Monetization Hey Guys, So, today we want to bring you a few topics that we feel compliment's the recent traffic, niches and keyword discussions. Today, we want to talk about a few different things actually, ranging

More information

Introduction! 2. Why You NEED This Guide 2. Step One: Research! 3. What Are Your Customers Searching For? 3. Step Two: Title Tag!

Introduction! 2. Why You NEED This Guide 2. Step One: Research! 3. What Are Your Customers Searching For? 3. Step Two: Title Tag! Table of Contents Introduction! 2 Why You NEED This Guide 2 Step One: Research! 3 What Are Your Customers Searching For? 3 Step Two: Title Tag! 4 The First Thing Google Sees 4 How Do I Change It 4 Step

More information

Furl Furled Furling. Social on-line book marking for the masses. Jim Wenzloff Blog:

Furl Furled Furling. Social on-line book marking for the masses. Jim Wenzloff Blog: Furl Furled Furling Social on-line book marking for the masses. Jim Wenzloff jwenzloff@misd.net Blog: http://www.visitmyclass.com/blog/wenzloff February 7, 2005 This work is licensed under a Creative Commons

More information

Web Hosting. Important features to consider

Web Hosting. Important features to consider Web Hosting Important features to consider Amount of Storage When choosing your web hosting, one of your primary concerns will obviously be How much data can I store? For most small and medium web sites,

More information

THE 18 POINT CHECKLIST TO BUILDING THE PERFECT LANDING PAGE

THE 18 POINT CHECKLIST TO BUILDING THE PERFECT LANDING PAGE THE 18 POINT CHECKLIST TO BUILDING THE PERFECT LANDING PAGE The 18 point checklist to building the Perfect landing page Landing pages come in all shapes and sizes. They re your metaphorical shop front

More information

BUbiNG. Massive Crawling for the Masses. Paolo Boldi, Andrea Marino, Massimo Santini, Sebastiano Vigna

BUbiNG. Massive Crawling for the Masses. Paolo Boldi, Andrea Marino, Massimo Santini, Sebastiano Vigna BUbiNG Massive Crawling for the Masses Paolo Boldi, Andrea Marino, Massimo Santini, Sebastiano Vigna Dipartimento di Informatica Università degli Studi di Milano Italy Once upon a time UbiCrawler UbiCrawler

More information

SEO: SEARCH ENGINE OPTIMISATION

SEO: SEARCH ENGINE OPTIMISATION SEO: SEARCH ENGINE OPTIMISATION SEO IN 11 BASIC STEPS EXPLAINED What is all the commotion about this SEO, why is it important? I have had a professional content writer produce my content to make sure that

More information

Getting started with social media and comping

Getting started with social media and comping Getting started with social media and comping Promotors are taking a leap further into the digital age, and we are finding that more and more competitions are migrating to Facebook and Twitter. If you

More information

CS193X: Web Programming Fundamentals

CS193X: Web Programming Fundamentals CS193X: Web Programming Fundamentals Spring 2017 Victoria Kirst (vrk@stanford.edu) CS193X schedule Today - Middleware and Routes - Single-page web app - More MongoDB examples - Authentication - Victoria

More information

Large Crawls of the Web for Linguistic Purposes

Large Crawls of the Web for Linguistic Purposes Large Crawls of the Web for Linguistic Purposes SSLMIT, University of Bologna Birmingham, July 2005 Outline Introduction 1 Introduction 2 3 Basics Heritrix My ongoing crawl 4 Filtering and cleaning 5 Annotation

More information

MovieRec - CS 410 Project Report

MovieRec - CS 410 Project Report MovieRec - CS 410 Project Report Team : Pattanee Chutipongpattanakul - chutipo2 Swapnil Shah - sshah219 Abstract MovieRec is a unique movie search engine that allows users to search for any type of the

More information

How to Get Your Web Maps to the Top of Google Search

How to Get Your Web Maps to the Top of Google Search How to Get Your Web Maps to the Top of Google Search HOW TO GET YOUR WEB MAPS TO THE TOP OF GOOGLE SEARCH Chris Brown CEO & Co-founder of Mango SEO for web maps is particularly challenging because search

More information

EPISODE 23: HOW TO GET STARTED WITH MAILCHIMP

EPISODE 23: HOW TO GET STARTED WITH MAILCHIMP EPISODE 23: HOW TO GET STARTED WITH MAILCHIMP! 1 of! 26 HOW TO GET STARTED WITH MAILCHIMP Want to play a fun game? Every time you hear the phrase email list take a drink. You ll be passed out in no time.

More information

Storm Crawler. Low latency scalable web crawling on Apache Storm. Julien Nioche digitalpebble. Berlin Buzzwords 01/06/2015

Storm Crawler. Low latency scalable web crawling on Apache Storm. Julien Nioche digitalpebble. Berlin Buzzwords 01/06/2015 Storm Crawler Low latency scalable web crawling on Apache Storm Julien Nioche julien@digitalpebble.com digitalpebble Berlin Buzzwords 01/06/2015 About myself DigitalPebble Ltd, Bristol (UK) Specialised

More information

Analysis, Dekalb Roofing Company Web Site

Analysis, Dekalb Roofing Company Web Site Analysis, Dekalb Roofing Company Web Site Client: Dekalb Roofing Company Site: dekalbroofingcompanyinc.com Overall Look & Design This is a very good-looking site. It s clean, tasteful, has well-coordinated

More information

The Fat-Free Guide to Conversation Tracking

The Fat-Free Guide to Conversation Tracking The Fat-Free Guide to Conversation Tracking Using Google Reader as a (Basic) Monitoring Tool. By Ian Lurie President, Portent Interactive Portent.com Legal, Notes and Other Stuff 2009, The Written Word,

More information

COMPREHENSIVE GUIDE ON HOW TO NAIL COLD

COMPREHENSIVE GUIDE ON HOW TO NAIL COLD Reply #1 THE FIRST REPLY BOOK ON SALES Kick off your outbound sales and setup new predictable revenue stream. COMPREHENSIVE GUIDE ON HOW TO NAIL COLD EMAIL 2016 LIST OF CONTENTS Intro Part 1: Building

More information

Faster Workflows, Faster. Ken Krugler President, Scale Unlimited

Faster Workflows, Faster. Ken Krugler President, Scale Unlimited Faster Workflows, Faster Ken Krugler President, Scale Unlimited The Twitter Pitch Cascading is a solid, established workflow API Good for complex custom ETL workflows Flink is a new streaming dataflow

More information

Collective Intelligence in Action

Collective Intelligence in Action Collective Intelligence in Action SATNAM ALAG II MANNING Greenwich (74 w. long.) contents foreword xv preface xvii acknowledgments xix about this book xxi PART 1 GATHERING DATA FOR INTELLIGENCE 1 "1 Understanding

More information

Python & Web Mining. Lecture Old Dominion University. Department of Computer Science CS 495 Fall 2012

Python & Web Mining. Lecture Old Dominion University. Department of Computer Science CS 495 Fall 2012 Python & Web Mining Lecture 6 10-10-12 Old Dominion University Department of Computer Science CS 495 Fall 2012 Hany SalahEldeen Khalil hany@cs.odu.edu Scenario So what did Professor X do when he wanted

More information

Almost 80 percent of new site visits begin at search engines. A couple of years back Nielsen published a list of popular search engines.

Almost 80 percent of new site visits begin at search engines. A couple of years back Nielsen published a list of popular search engines. SEO OverView We have a problem, we want people to visit our Web site, that's the purpose after all to bring people to our website and increase traffic inorder to buy soundspirit products and learn more

More information

Lifehack #1 - Automating Twitter Growth without Being Blocked by Twitter

Lifehack #1 - Automating Twitter Growth without Being Blocked by Twitter Lifehack #1 - Automating Twitter Growth without Being Blocked by Twitter Intro 2 Disclaimer 2 Important Caveats for Twitter Automation 2 Enter Azuqua 3 Getting Ready 3 Setup and Test your Connection! 4

More information

MITOCW watch?v=r6-lqbquci0

MITOCW watch?v=r6-lqbquci0 MITOCW watch?v=r6-lqbquci0 The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To

More information

Page Title is one of the most important ranking factor. Every page on our site should have unique title preferably relevant to keyword.

Page Title is one of the most important ranking factor. Every page on our site should have unique title preferably relevant to keyword. SEO can split into two categories as On-page SEO and Off-page SEO. On-Page SEO refers to all the things that we can do ON our website to rank higher, such as page titles, meta description, keyword, content,

More information

Tika in Action JUKKA MANNING CHRIS A. MATTMANN L. ZITTING. Shelter Island

Tika in Action JUKKA MANNING CHRIS A. MATTMANN L. ZITTING. Shelter Island Tika in Action CHRIS A. MATTMANN JUKKA L. ZITTING 11 MANNING Shelter Island contents foretuord xv preface xvii acknowledgments xix about this book xxi about the authors xxv about the cover illustration

More information

Welcome to the New Era of Cloud Computing

Welcome to the New Era of Cloud Computing Welcome to the New Era of Cloud Computing Aaron Kimball The web is replacing the desktop 1 SDKs & toolkits are there What about the backend? Image: Wikipedia user Calyponte 2 Two key concepts Processing

More information

Hadoop. Introduction / Overview

Hadoop. Introduction / Overview Hadoop Introduction / Overview Preface We will use these PowerPoint slides to guide us through our topic. Expect 15 minute segments of lecture Expect 1-4 hour lab segments Expect minimal pretty pictures

More information

Session #1024: Building a Social Advocacy Campaign on a Convio Open Framework

Session #1024: Building a Social Advocacy Campaign on a Convio Open Framework Session #1024: Building a Social Advocacy Campaign on a Convio Open Framework Presented by: Steve Mook Doug Fierro Session Objective Demonstrate via working example a subset of the rich set of open platform

More information

RadiantBlue Technologies, Inc. Page 1

RadiantBlue Technologies, Inc. Page 1 vpiazza RadiantBlue Technologies, Inc. Page 1 vpiazza Enabling Government Teams to Share and Access Data in the Cloud in 2016 Michael P. Gerlek mgerlek@radiantblue.com 4 May 2016 RadiantBlue Technologies,

More information

Spam. Time: five years from now Place: England

Spam. Time: five years from now Place: England Spam Time: five years from now Place: England Oh no! said Joe Turner. When I go on the computer, all I get is spam email that nobody wants. It s all from people who are trying to sell you things. Email

More information

Paul's Online Math Notes. Online Notes / Algebra (Notes) / Systems of Equations / Augmented Matricies

Paul's Online Math Notes. Online Notes / Algebra (Notes) / Systems of Equations / Augmented Matricies 1 of 8 5/17/2011 5:58 PM Paul's Online Math Notes Home Class Notes Extras/Reviews Cheat Sheets & Tables Downloads Algebra Home Preliminaries Chapters Solving Equations and Inequalities Graphing and Functions

More information

Kindle Books InfoPath With SharePoint 2010 How-To

Kindle Books InfoPath With SharePoint 2010 How-To Kindle Books InfoPath With SharePoint 2010 How-To Real, step-by-step solutions for creating and managing data forms in SharePoint 2010 with InfoPath: fast, accurate, proven, and easy to use  A concise,

More information

XML Sitemap Splitter for Magento 2. User Guide

XML Sitemap Splitter for Magento 2. User Guide XML Sitemap Splitter for Magento 2 User Guide Table of Contents 1. XML Sitemap Splitter Configuration 1.1. Accessing the Extension Main Setting 1.2. General 1.3. Categories 1.4. Products In Stock 1.5.

More information

Links For SEO in 2018

Links For SEO in 2018 Links For SEO in 2018 Hello London! Servus in Wien About Christoph C. Cemper Links & SEO since 2003 Founder & CEO of @cemper Author of Spaghetti Code Orange Jackets and SEO 1 ARE LINKS IMPORTANT? The

More information

Corner The Local Search Engine Market Four Steps to Ensure your Business will Capitalize from Local Google Search Exposure by Eric Rosen

Corner The Local Search Engine Market Four Steps to Ensure your Business will Capitalize from Local Google Search Exposure by Eric Rosen Corner The Local Search Engine Market Four Steps to Ensure your Business will Capitalize from Local Google Search Exposure by Eric Rosen 2011 www.marketingoutthebox.com Table of Contents Introduction

More information

Search Engine Optimization Lesson 2

Search Engine Optimization Lesson 2 Search Engine Optimization Lesson 2 Getting targeted traffic The only thing you care about as a website owner is getting targeted traffic. In other words, the only people you want visiting your website

More information

Getting Started with. Lite.

Getting Started with. Lite. Getting Started with Lite www.boltiq.io Getting Started with Lite Download Download the app as either a container or Library. http://www.boltiq.io/bolt-lite/ See Examples Open the example test projects

More information

How Does a Search Engine Work? Part 1

How Does a Search Engine Work? Part 1 How Does a Search Engine Work? Part 1 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under Creative Commons Attribution-NonCommercial 3.0 What we ll examine Web crawling

More information

David Werth IDEAS Design & Grayout Aerosports Albuquerque, q NM & Indianapolis, IN

David Werth IDEAS Design & Grayout Aerosports Albuquerque, q NM & Indianapolis, IN 1 David Werth IDEAS Design & Grayout Aerosports Albuquerque, q NM & Indianapolis, IN Dave@IDEASDesigninc.com Dave@GrayOut.com Moderator: (Jacquie Warda) (Jacquie B Airshows) 2 Founder and CEO of IDEAS

More information

Un-moderated real-time news trends extraction from World Wide Web using Apache Mahout

Un-moderated real-time news trends extraction from World Wide Web using Apache Mahout Un-moderated real-time news trends extraction from World Wide Web using Apache Mahout A Project Report Presented to Professor Rakesh Ranjan San Jose State University Spring 2011 By Kalaivanan Durairaj

More information

The Challenges for Software Developers with Modern App Delivery

The Challenges for Software Developers with Modern App Delivery The Challenges for Software Developers with Modern App Delivery This blog post is by Tim Mangan, owner of TMurgent Technologies, LLP. Awarded a Microsoft MVP for Application Virtualization, and CTP by

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

5 R1 The one green in the same place so either of these could be green.

5 R1 The one green in the same place so either of these could be green. Page: 1 of 20 1 R1 Now. Maybe what we should do is write out the cases that work. We wrote out one of them really very clearly here. [R1 takes out some papers.] Right? You did the one here um where you

More information

Optimizing Apache Nutch For Domain Specific Crawling at Large Scale

Optimizing Apache Nutch For Domain Specific Crawling at Large Scale Optimizing Apache Nutch For Domain Specific Crawling at Large Scale Luis A. Lopez, Ruth Duerr, Siri Jodha Singh Khalsa luis.lopez@nsidc.org http://github.com/b-cube IEEE Big Data 2015, Santa Clara CA.

More information

Connect with Remedy: SmartIT: Social Event Manager Webinar Q&A

Connect with Remedy: SmartIT: Social Event Manager Webinar Q&A Connect with Remedy: SmartIT: Social Event Manager Webinar Q&A Q: Will Desktop/browser alerts be added to notification capabilities on SmartIT? A: In general we don't provide guidance on future capabilities.

More information

MapReduce, Hadoop and Spark. Bompotas Agorakis

MapReduce, Hadoop and Spark. Bompotas Agorakis MapReduce, Hadoop and Spark Bompotas Agorakis Big Data Processing Most of the computations are conceptually straightforward on a single machine but the volume of data is HUGE Need to use many (1.000s)

More information

How to Become an IoT Developer (and Have Fun!) Justin Mclean Class Software.

How to Become an IoT Developer (and Have Fun!) Justin Mclean Class Software. How to Become an IoT Developer (and Have Fun!) Justin Mclean Class Software Email: justin@classsoftware.com Twitter: @justinmclean Who am I? Freelance Developer - programming for 25 years Incubator PMC

More information

Oracle Mix. A Case Study. Ola Bini JRuby Core Developer ThoughtWorks Studios.

Oracle Mix. A Case Study. Ola Bini JRuby Core Developer ThoughtWorks Studios. Oracle Mix A Case Study Ola Bini JRuby Core Developer ThoughtWorks Studios ola.bini@gmail.com http://olabini.com/blog Vanity slide Vanity slide Ola Bini Vanity slide Ola Bini From Stockholm, Sweden Vanity

More information

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera,

How Apache Hadoop Complements Existing BI Systems. Dr. Amr Awadallah Founder, CTO Cloudera, How Apache Hadoop Complements Existing BI Systems Dr. Amr Awadallah Founder, CTO Cloudera, Inc. Twitter: @awadallah, @cloudera 2 The Problems with Current Data Systems BI Reports + Interactive Apps RDBMS

More information

An architect s website:!

An architect s website:! An architect s website:! Designing and building your own website - discussion notes / BANG. 1 First ask yourself 2 questions! * Is the website to get new business enquiries via online search? * Is the

More information

MapReduce Design Patterns

MapReduce Design Patterns MapReduce Design Patterns MapReduce Restrictions Any algorithm that needs to be implemented using MapReduce must be expressed in terms of a small number of rigidly defined components that must fit together

More information
