The Web: Concepts and Technology
January 15: Course Overview
CS 584: Information Retrieval. Math & Computer Science Department, Emory University
Today's Plan
- Who am I?
- What is this course about?
- Logistics
- Who are you?
Eugene Agichtein, CS 190: The Web: Concepts and Technology, Emory University, Spring 2009
Who am I: Background
- Sept 2006-: Assistant Professor in the Math & CS department; Affiliate Faculty, Linguistics; Affiliate Faculty, Web Science @ Georgia Tech
- Summer 2007: Visiting Researcher at Yahoo! Research
- 2004 to 2006: Postdoctoral Researcher at Microsoft Research (Text Mining, Search, and Navigation group, and MSN Search/Live)
- 1998-2004: Ph.D. in Computer Science from Columbia University; dissertation on extracting structured relations from web-scale document repositories
- 1994-1998: B.S. in Engineering from The Cooper Union
Research: Developing Intelligent Systems to Help People Find Information Online
- Search, browsing behavior
- User-generated content, social networks
- Human cognitive processes
Intelligent Information Access Lab
http://ir.mathcs.emory.edu/
- Information retrieval & extraction, text & data mining
- Web search user behavior, social networks, social media
Students:
- Ryan Kelly, Emory '10
- Walt Askew, Emory '09
- Abulimiti Aji, 1st year Ph.D.
- Qi Guo, 2nd year Ph.D.
- Yandong Liu, 2nd year Ph.D.
- Alvin Grissom, 2nd year MS
External collaborations:
- Emory Libraries: Selden Deemer, Arthur Murphy
- Psychology: Phil Wolff
- Neuroscience: Beth Buffalo
- School of Medicine: Ernie Garcia
- And colleagues at Yahoo! Research, Microsoft Research, Motorola, and Georgia Tech
Course Outline
- Web history and infrastructure
- Web Search and Browsing
- Applications: E-commerce, advertising
- Abuse: spam, hacking and the gray areas
- Web services
- Recommender systems
- Online social networks
- Online collaboration
- Other topics: will depend on your interest!
What is the Internet?
- The largest network of networks in the world.
- Uses TCP/IP protocols and packet switching.
- Runs on any communications substrate.
From Dr. Vinton Cerf, Co-Creator of TCP/IP
Structure of the Internet
Brief history of the Internet
- 1968 - DARPA (Defense Advanced Research Projects Agency) contracts with BBN (Bolt, Beranek & Newman) to create ARPAnet
- 1970 - First five nodes: UCLA, Stanford, UC Santa Barbara, U of Utah, and BBN
- 1974 - TCP specification by Vint Cerf
- 1983 - On January 1, the Internet with its 1000 hosts converts en masse to using TCP/IP for its messaging
Graph mining
Web Link Structure and Web Search
- Browsing can't find these pages
- Need a search engine
Bow Tie Structure (Broder et al., 2000)
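To make the "browsing can't find these pages" point concrete, here is a minimal sketch that follows links breadth-first from a start page and reports which pages are never reached. The link graph and page names (`home`, `orphan`, `island`, ...) are invented for illustration and are not data from Broder et al.; the real bow-tie analysis was done on a crawl of hundreds of millions of pages.

```python
from collections import deque

# Hypothetical link graph (an assumption for illustration): page -> pages it links to.
LINKS = {
    "home": ["news", "about"],
    "news": ["home", "archive"],
    "about": ["home"],
    "archive": [],
    "orphan": ["home"],       # links into the core, but nothing links to it
    "island": ["island2"],    # disconnected component
    "island2": ["island"],
}


def reachable_by_browsing(links, start):
    """Return every page a user can reach from `start` by following links only."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def unreachable_pages(links, start):
    """Return the pages that exist in the graph but cannot be reached by browsing."""
    all_pages = set(links) | {t for targets in links.values() for t in targets}
    return all_pages - reachable_by_browsing(links, start)


if __name__ == "__main__":
    print(sorted(reachable_by_browsing(LINKS, "home")))
    print(sorted(unreachable_pages(LINKS, "home")))
```

In this toy graph, `orphan` points into the core but has no inbound links, and the two `island` pages link only to each other. A user who starts at `home` never sees any of them, which is the gap a search engine's index is meant to fill.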
Web Search: Google, 1997 and 2000
Google Architecture
URL Server - sends lists of URLs to crawlers
Crawler - downloads web pages
Store Server - compresses & stores web pages into the repository
Indexer
  - reads the repository & uncompresses the documents
  - parses the documents
  - creates forward index
  - parses out the links
URL Resolver
  - converts relative URLs to absolute URLs and then to docids
  - generates a database of links
  - puts the anchor text into the barrels
Sorter - generates the inverted index
Searcher - answers queries
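The data flow through the Indexer, URL Resolver, Sorter, and Searcher can be sketched in a few lines. This is a toy illustration, not Google's implementation: the sample pages, the function names, and the choice of sequential integers as docids are all assumptions, and it leaves out compression, barrels, and ranking entirely.

```python
import re
from collections import defaultdict
from urllib.parse import urljoin

# Hypothetical crawled pages (an assumption): URL -> (text, [(href, anchor text)]).
PAGES = {
    "http://example.org/a/index.html": (
        "web search engines index pages",
        [("../b.html", "search tips")],
    ),
    "http://example.org/b.html": ("tips for web search", []),
}


def assign_docids(urls):
    """Give each URL a small integer docid (sorted for determinism)."""
    return {url: docid for docid, url in enumerate(sorted(urls))}


def build_forward_index(pages, docids):
    """Indexer step: docid -> list of words in document order."""
    return {
        docids[url]: re.findall(r"[a-z0-9]+", text.lower())
        for url, (text, _links) in pages.items()
    }


def resolve_links(pages, docids):
    """URL Resolver step: relative href -> absolute URL -> docid."""
    links = []
    for url, (_text, anchors) in pages.items():
        for href, anchor_text in anchors:
            target = urljoin(url, href)      # relative -> absolute
            if target in docids:             # keep only links to known pages
                links.append((docids[url], docids[target], anchor_text))
    return links


def build_inverted_index(forward_index):
    """Sorter step: invert docid -> words into word -> set of docids."""
    inverted = defaultdict(set)
    for docid, words in forward_index.items():
        for word in words:
            inverted[word].add(docid)
    return inverted


def search(inverted, query):
    """Searcher step: return docids containing every query term (no ranking)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(inverted.get(terms[0], set()))
    for term in terms[1:]:
        result &= inverted.get(term, set())
    return result


if __name__ == "__main__":
    docids = assign_docids(PAGES)
    forward = build_forward_index(PAGES, docids)
    inverted = build_inverted_index(forward)
    print(resolve_links(PAGES, docids))
    print(sorted(search(inverted, "web search")))
```

The forward index answers "which words are in this document?", while the inverted index answers "which documents contain this word?", which is the question a query needs answered quickly.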
Web Search: Google (continued), 2001 and 2007
Users learn to ignore ads!
Heat map: detect gaze position and duration using eye tracking
Box Blindness
This was the surface web
The Invisible, Deep, or Hidden Web
- Web sites or information that Google or other popular search engines are not fully indexing
- Websites specifically excluded by the search engine
Web 2.0: It's Hard to Define, But I Know it When I See it
Web Services / APIs
- Major Retailers
- Amazon APIs
- Google AdSense API
- Yahoo API
- eBay API
Emerging Tech
- Folksonomies / Content tagging
- AJAX
- RSS
Some Apps You May Know
- Flickr
- Google Maps
- Blogging & Content Syndication
- Craigslist
- Facebook, LinkedIn, Tribes, Ryze, Friendster
- Del.icio.us
- Upcoming.org
- 43Things.com
"[This is] not my mom's Internet. It's changing, and it's changing because we're looking at the share-shifting, the time people are looking at TV, reading a magazine, listening to the radio. They're not replacing each other; they're coming together." - AOL Exec, May 2005
Web 2.0: Evolution Towards a Read/Write Platform
Web 1.0 (1993-2003): pretty much HTML pages viewed through a browser
Web 2.0 (2003 and beyond): web pages, plus a lot of other content shared over the web, with more interactivity; more like an application than a page
- Mode: Read (Web 1.0); Write & Contribute (Web 2.0)
- Primary unit of content: Page (Web 1.0); Post / record (Web 2.0)
- State: static (Web 1.0); dynamic (Web 2.0)
- Viewed through: Web browser (Web 1.0); Browsers, RSS Readers, anything (Web 2.0)
- Architecture: Client Server (Web 1.0); Web Services (Web 2.0)
- Content created by: Web Coders (Web 1.0); Everyone (Web 2.0)
- Domain of: geeks (Web 1.0); mass amateurization (Web 2.0)
Recommendations
Search / Recommendations
Items: Products, web sites, blogs, news items, ...
Well-known recommender systems: Amazon and Netflix
Recommendation Types
- Editorial
- Simple aggregates: Top 10, Most Popular, Recent Uploads
- Tailored to individual users: Amazon, Netflix, ...
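The difference between a simple aggregate and a recommendation tailored to one user can be shown with a small sketch. The users, items, and the Jaccard-weighted scoring below are assumptions chosen for illustration; this is not how Amazon or Netflix actually compute recommendations.

```python
from collections import Counter

# Hypothetical data (an assumption): user -> set of items the user liked.
LIKES = {
    "ann": {"matrix", "alien", "heat"},
    "bob": {"matrix", "alien", "up"},
    "cat": {"up", "wall-e"},
    "dan": {"matrix", "heat"},
}


def most_popular(likes, n):
    """Simple aggregate: the n most-liked items, the same list for every user."""
    counts = Counter(item for items in likes.values() for item in items)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _count in ranked[:n]]


def jaccard(a, b):
    """Overlap between two sets of liked items, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_for(likes, user, n):
    """Tailored: score unseen items by the similarity of the users who liked them."""
    mine = likes[user]
    scores = Counter()
    for other, items in likes.items():
        if other == user:
            continue
        similarity = jaccard(mine, items)
        for item in items - mine:       # only items the user has not liked yet
            scores[item] += similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:n] if score > 0]


if __name__ == "__main__":
    print(most_popular(LIKES, 2))
    print(recommend_for(LIKES, "dan", 2))
    print(recommend_for(LIKES, "cat", 3))
```

`most_popular` ignores who is asking, so every visitor gets the same Top-N list. `recommend_for` gives different answers for `dan` and `cat` because it weights each item by how similar its fans are to the target user.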
The Long Tail
Source: Chris Anderson (2004)
Netflix Challenge