Python & Web Mining. Lecture Old Dominion University. Department of Computer Science CS 495 Fall 2012
1 Python & Web Mining Lecture Old Dominion University Department of Computer Science CS 495 Fall 2012 Hany SalahEldeen Khalil hany@cs.odu.edu
2 Scenario So what did Professor X do when he wanted to find other mutants to join his evil-fighting team?
3 Solution He used Cerebro
4 Scenario 2 Now think of a similar scenario You want to search for a document on the web
5 Solution A search engine is equivalent to Cerebro
6 Searching Whether it is Cerebro or a search engine, the concept is the same: you are searching a very large collection for the few items that have certain characteristics.
7 Chapter 4: Searching and Ranking
8 Full Text Search Engines It allows people to search a large set of documents for a set of words. It ranks the results according to how relevant those documents are to those words.
9 The Contract By the end of this chapter you will be able to build a search engine that indexes a set of documents and handles about 100,000 documents smoothly. It will: Crawl web pages. Index them. Search them.
10 Document Gathering The set of documents we are going to search and rank can come from one of two sources: A fixed collection (e.g. a company intranet). The web, in which case we will need to crawl it!
11 What is a Search Engine? You collect the documents. You index them in a big table that records the different words and their locations. You return a ranked list of documents corresponding to a query. A neural network learns to associate results with queries based on what people clicked in the list of results.
12 Metrics in ranking Ranking metrics are based on: Contents of the page (word frequency). Information external to the page (PageRank).
13 Building our Search Engine Phase 1: Crawling Phase 2: Indexing Phase 3: Querying
14 Phase 1: Crawling Needed when you don't have a fixed, controlled set of documents. Algorithm: 1. Put the seed URLs in the queue. 2. For each URL in the queue, download the page and extract all URLs in the page. 3. Add those URLs to the queue. 4. Go to step 2.
15 Breadth-First-Search Phase 1: Crawling Needed when you don't have a fixed, controlled set of documents. Algorithm: 1. Put the seed URLs in the queue. 2. For each URL in the queue, download the page and extract all URLs in the page. 3. Add those URLs to the queue. 4. Go to step 2.
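The four steps above can be sketched as a short breadth-first crawler. This is a minimal sketch, not the lecture's code: it targets Python 3, uses the standard library's html.parser as a stand-in for Beautiful Soup, and takes a hypothetical fetch(url) callable so it can run against an in-memory set of pages instead of the network. The names crawl, LinkExtractor, PAGES and the max_depth limit are assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seeds, fetch, max_depth=2):
    """Breadth-first crawl. fetch(url) returns HTML text, or None on failure."""
    queue = deque((url, 0) for url in seeds)   # step 1: seed URLs in the queue
    seen = set(seeds)
    order = []
    while queue:                               # step 4: repeat until queue is empty
        url, depth = queue.popleft()
        html = fetch(url)                      # step 2: download the page
        if html is None:
            continue
        order.append(url)
        if depth == max_depth:
            continue                           # do not follow links past the depth limit
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))  # step 3: add new URLs to the queue
    return order


# Tiny in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.org/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "http://example.org/b": '<a href="/d">d</a>',
    "http://example.org/c": '<a href="/a">back</a>',
    "http://example.org/d": "no links here",
}
visited = crawl(["http://example.org/a"], PAGES.get, max_depth=2)
print(visited)
```

Because the queue is first-in, first-out, every page at depth 1 is visited before any page at depth 2, which is the layer-by-layer behaviour the slide title refers to.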
16 urllib2 Essential library that downloads a page's HTML into a string when given a URL. After the call, contents holds the entire HTML page.
17 Beautiful Soup Python library for parsing HTML Very tolerant of webpages with broken HTML.
18
19 Crawling Why breadth-first search? Makes the code easier to modify later. Explores the link graph layer by layer. Avoids the risk of overflowing the stack. We are going to crawl:
20 Crawling
21 So we now have a lot of web pages that are linked. So, what's next?
22 Phase 2: Indexing What is indexing here? It's a full-text index: a list of all the different words and the documents they appear in, along with their locations in each document.
23 What will we index? Words that appear in the webpage: Ignoring the non-textual elements (tags, ids, etc.). Ignoring punctuation. Ignoring case. Splitting words on non-alphanumeric characters.
24 We will use SQLite SQLite: Easy to use and set up. Embedded relational database. Stores the entire dataset in one file (databasename.db). Uses SQL for queries. We will use the Python implementation, pysqlite.
25 Schema: Database Preparation
26 Schema Python Code
27 Schema Python Code
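A minimal sketch of how such a schema could be created from Python, using the sqlite3 module from the standard library in place of pysqlite. The table and column names (urllist, wordlist, wordlocation, link, linkwords) are assumptions chosen to match the indexing steps described on the surrounding slides, and the in-memory database is used only so the example has no side effects.

```python
import sqlite3

# Hypothetical schema: one table per kind of thing the crawler records.
SCHEMA = """
CREATE TABLE IF NOT EXISTS urllist(url TEXT);
CREATE TABLE IF NOT EXISTS wordlist(word TEXT);
CREATE TABLE IF NOT EXISTS wordlocation(urlid INTEGER, wordid INTEGER, location INTEGER);
CREATE TABLE IF NOT EXISTS link(fromid INTEGER, toid INTEGER);
CREATE TABLE IF NOT EXISTS linkwords(wordid INTEGER, linkid INTEGER);
CREATE INDEX IF NOT EXISTS wordidx ON wordlist(word);
CREATE INDEX IF NOT EXISTS urlidx ON urllist(url);
CREATE INDEX IF NOT EXISTS wordurlidx ON wordlocation(wordid);
"""


def create_index_tables(path=":memory:"):
    """Open (or create) the database file and make sure all tables exist."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    con.commit()
    return con


con = create_index_tables()
tables = sorted(
    row[0] for row in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

Passing a file name such as databasename.db instead of ":memory:" keeps the whole index in a single file, as the SQLite slide describes.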
28 Finding Words on a Page For the pages downloaded by urllib2 we need to: Extract the textual components. Remove the other things, like tags and properties. Preserve the order of the sections.
29 We use Beautiful Soup
30 Then break and clean the words Split on non-alphanumeric characters. Convert to lowercase.
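The break-and-clean step can be sketched with a single regular expression. This is an illustrative sketch; the function name separate_words is an assumption.

```python
import re


def separate_words(text):
    """Split text on runs of non-alphanumeric characters and lowercase each word."""
    return [word.lower() for word in re.split(r"[^A-Za-z0-9]+", text) if word]


print(separate_words("Python's web-mining, 2012!"))
# → ['python', 's', 'web', 'mining', '2012']
```

Note that the apostrophe and hyphen both act as separators, so "Python's" yields two tokens. That is crude but keeps the tokenizer simple.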
31 We add each word and its location to the index Maintaining its order according to appearance in the document
32 We also need to store linkage between pages
33 and avoid re-indexing URLs
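A minimal sketch of adding a page's words to the index in order of appearance while skipping URLs that were already indexed. It assumes a reduced version of a hypothetical schema (urllist, wordlist, wordlocation); the helper names get_entry_id, is_indexed and add_to_index are likewise assumptions, not the lecture's code.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE urllist(url TEXT);
CREATE TABLE wordlist(word TEXT);
CREATE TABLE wordlocation(urlid INTEGER, wordid INTEGER, location INTEGER);
""")


def get_entry_id(con, table, field, value):
    """Return the rowid of value in table, inserting it first if it is missing.

    table and field are fixed names from our own code, never user input.
    """
    row = con.execute(f"SELECT rowid FROM {table} WHERE {field} = ?", (value,)).fetchone()
    if row:
        return row[0]
    return con.execute(f"INSERT INTO {table}({field}) VALUES (?)", (value,)).lastrowid


def is_indexed(con, url):
    """A URL counts as indexed once at least one word location is stored for it."""
    row = con.execute("SELECT rowid FROM urllist WHERE url = ?", (url,)).fetchone()
    if row is None:
        return False
    hit = con.execute("SELECT 1 FROM wordlocation WHERE urlid = ? LIMIT 1", (row[0],))
    return hit.fetchone() is not None


def add_to_index(con, url, words):
    """Store every word of the page together with its position in the page."""
    if is_indexed(con, url):
        return
    urlid = get_entry_id(con, "urllist", "url", url)
    for location, word in enumerate(words):
        wordid = get_entry_id(con, "wordlist", "word", word)
        con.execute(
            "INSERT INTO wordlocation(urlid, wordid, location) VALUES (?, ?, ?)",
            (urlid, wordid, location),
        )
    con.commit()


add_to_index(con, "http://example.org/a", ["python", "web", "python"])
add_to_index(con, "http://example.org/a", ["python", "web", "python"])  # skipped
rows = con.execute(
    "SELECT urlid, wordid, location FROM wordlocation ORDER BY location"
).fetchall()
print(rows)
```

The second call is a no-op, so the table still holds exactly three rows, one per word occurrence.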
34 Re-running the Crawler It will take a long time, so download the ready database file from:
35 Congratulations! You've built your first full-text search index and crawler!
36 We can now get all the documents containing a certain word. But it only works for one word at a time.
37 Phase 3: Querying We need to design a function that takes a query string and then: Parses it into separate words. Constructs an SQL query that returns the documents containing all of the words.
38 Querying So the query for the two words with IDs 10 and 17 becomes:
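One way such a query could be assembled is a self-join of the word-location table, one alias per query word. This is a sketch under the assumption that locations live in a hypothetical table wordlocation(urlid, wordid, location); the function name build_match_query is also an assumption.

```python
def build_match_query(word_ids):
    """Build an SQL query returning (urlid, location of word 0, location of word 1, ...)
    for every URL that contains all the given word IDs."""
    if not word_ids:
        raise ValueError("at least one word ID is required")
    fields = ["w0.urlid"]
    tables = []
    clauses = []
    for i, word_id in enumerate(word_ids):
        fields.append(f"w{i}.location")
        tables.append(f"wordlocation w{i}")
        if i > 0:
            # every alias must refer to the same document
            clauses.append(f"w{i - 1}.urlid = w{i}.urlid")
        clauses.append(f"w{i}.wordid = {int(word_id)}")
    return (
        f"SELECT {', '.join(fields)} "
        f"FROM {', '.join(tables)} "
        f"WHERE {' AND '.join(clauses)}"
    )


print(build_match_query([10, 17]))
```

Each result row carries one location per query word, which is the shape the location and distance metrics later in the lecture can work from.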
39 Querying Now we have multiple-word search! Time to rank those results
40 Ranking Content-Based Ranking: Based on the contents of the documents and the query itself. Inbound-Link Ranking: Uses the link structure of sites to decide what's important.
41 Content-Based Ranking So far the resulting documents have been presented in the order in which they were crawled. With a large number of documents you will end up with a lot of irrelevant content.
42 Among the documents returned, which is the one I want? To answer this we need to give each document a score based only on: The query provided. The content of the page.
43 Content-Based Scoring Metrics Word Frequency: Number of times the query words appear in the document. Document Location: Do the query words appear early in the document? Word Distance: Words in the query should appear close to each other.
44 Scoring
45 Scoring Now for the query submitted we get a list of scores.
46 Scoring So far there are no scores
47 Problem: The need to Normalize Some scores are large but bad, some are small but good. We need to normalize: Get all scores within the same range and in the same direction. This is done by finding the maximum and assigning 1 to its element. The other scores are divided by this maximum.
48 The need to Normalize
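A minimal sketch of the normalization just described: divide every score by the maximum so the best document gets 1. The small_is_better branch is an assumption added to illustrate "same direction" for metrics where a lower raw value is better; the function and parameter names are hypothetical.

```python
def normalize_scores(scores, small_is_better=False):
    """Map raw scores to the range 0..1, where 1 is always the best document."""
    tiny = 0.00001  # guards against division by zero
    if not scores:
        return {}
    if small_is_better:
        lowest = min(scores.values())
        return {url: float(lowest) / max(tiny, s) for url, s in scores.items()}
    highest = max(scores.values()) or tiny
    return {url: float(s) / highest for url, s in scores.items()}


print(normalize_scores({1: 2, 2: 4}))                        # → {1: 0.5, 2: 1.0}
print(normalize_scores({1: 2, 2: 4}, small_is_better=True))  # → {1: 1.0, 2: 0.5}
```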
49 Recap: Content-Based Scoring Metrics Word Frequency: Number of times the query words appear in the document. Document Location: Do the query words appear early in the document? Word Distance: Words in the query should appear close to each other.
50 Word Frequency Compare a document that talks about Python all over with one that only says, in a single sentence, that pythons are cooler than snakes. We add a weight in our code to tune the effect of the Word Frequency metric.
51 Word Frequency
52 Applying Word Frequency
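A sketch of a word-frequency score with a tunable weight. It assumes each match row has the form (url_id, location_of_word_1, location_of_word_2, ...), so counting rows per URL approximates how often the query words occur together; that row format and the names frequency_score and weighted_total are assumptions.

```python
from collections import Counter


def frequency_score(rows):
    """Count match rows per URL and scale so the most frequent URL scores 1."""
    counts = Counter(row[0] for row in rows)
    if not counts:
        return {}
    top = max(counts.values())
    return {url_id: n / top for url_id, n in counts.items()}


def weighted_total(weighted_scores):
    """Combine several metrics: weighted_scores is a list of (weight, scores) pairs."""
    totals = {}
    for weight, scores in weighted_scores:
        for url_id, score in scores.items():
            totals[url_id] = totals.get(url_id, 0.0) + weight * score
    return totals


rows = [(1, 0, 4), (1, 7, 4), (2, 3, 9)]
freq = frequency_score(rows)
print(weighted_total([(1.0, freq)]))  # → {1: 1.0, 2: 0.5}
```

Raising or lowering the weight paired with a metric changes how much that metric contributes once several metrics are summed.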
53 Word Location in Document Usually, if the page is relevant to the submitted query, the query terms appear near the beginning of the document or in the title. We already stored the locations of the words, so we will just use them.
54 Word Location in Document
55 Word Location in Document Each ID can appear multiple times; the method sums the locations of all the words. The sum is compared to the best result so far for that URI, then we apply the normalizing function.
56 Word Distance Results are often more useful when the words in the query are close to each other in the document itself. We take the difference in location between each pair of consecutive words. We try to find the smallest total distance.
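The location metric can be sketched as follows, again assuming match rows of the form (url_id, location_1, location_2, ...). The add-one smoothing in the final step is an assumption of this sketch, used so that a word at position 0 does not force every score to zero; the function name is hypothetical.

```python
def location_score(rows):
    """Score each URL by the smallest sum of word locations; earlier is better."""
    best = {}
    for row in rows:
        total = sum(row[1:])
        if row[0] not in best or total < best[row[0]]:
            best[row[0]] = total          # keep the best (lowest) sum per URL
    if not best:
        return {}
    lowest = min(best.values())
    # Normalize so the URL with the earliest words scores exactly 1.
    return {url_id: (lowest + 1) / (total + 1) for url_id, total in best.items()}


rows = [(1, 0, 4), (1, 7, 4), (2, 3, 9)]
print(location_score(rows))
```

For these rows the best sums are 4 for URL 1 and 12 for URL 2, so URL 1 scores 1.0 and URL 2 scores 5/13.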
57 Word Distance
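A sketch of the distance metric under the same assumed row format, (url_id, location_1, location_2, ...). For each row it sums the gaps between consecutive query words and keeps the smallest total per URL. The add-one smoothing and the function name are assumptions.

```python
def distance_score(rows):
    """Score each URL by how close together the query words appear."""
    if not rows:
        return {}
    if len(rows[0]) <= 2:
        # Only one query word: distance is meaningless, everyone ties.
        return {row[0]: 1.0 for row in rows}
    best = {}
    for row in rows:
        dist = sum(abs(row[i] - row[i - 1]) for i in range(2, len(row)))
        if row[0] not in best or dist < best[row[0]]:
            best[row[0]] = dist
    lowest = min(best.values())
    return {url_id: (lowest + 1) / (dist + 1) for url_id, dist in best.items()}


rows = [(1, 0, 4), (1, 7, 4), (2, 3, 9)]
print(distance_score(rows))
```

Here URL 1 has gaps of 4 and 3 (best 3) and URL 2 has a gap of 6, so URL 1 scores 1.0 and URL 2 scores 4/7.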
58 Inbound Link Ranking After all, it's all about reputation. If a page has been linked to by many other pages, then it is probably a good page. The more awesome the page linking to my page, the better I get. Just like job interviews and recommendations.
59 Simple Counting Same as citations for publications: the more citations, the higher the paper ranks. Problem: the more general pages win rather than the best ones.
60 Solution: Mix it up! Mix inbound count metric with other previously discussed metrics Remember: All inbound links are equal, but some are more equal than others like inbound from cnn.com ~ Hany Orwell
61 Page Rank In a nutshell: The algorithm assigns a score that indicates how important a page is according to how important the pages linking to it are. Same as in real life, pretty girls always hang with pretty girls.
62 Page Rank Calculating the probability P that someone randomly clicking on links will arrive at a certain page Z: $PR(Z) = \text{MinimalRank} + \text{DampingFactor} \times \sum_{Y \to Z} \frac{PR(Y)}{L(Y)}$, where the sum runs over every page Y that links to Z and $L(Y)$ is the number of outbound links from Y.
63 Page Rank
64 Page Rank Start with initial values = 1. Iterate until the values stop changing, or for 20 iterations (enough). You only have to compute it offline once and it stays. Update only after adding new pages.
65 Page Rank Proof: No wonder! Since every page links to it!
66 Now get the PageRank score The bigger the collection, the more the wonder of PageRank shows, as it nearly eliminates useless pages.
67 Finally, Using Link Text Normally the anchor text of a link contains a short, precise description of the target page. Scenario: a link to my slides on cnn.com saying "awesome source to learn python" will get me an awesome score, and my page will come first if you query "python language", because: CNN.com has a high PageRank. The anchor text on CNN contains "python", which is in the query.
68 Finally, Using Link Text The PageRank of the source is added to the final score of the destination if the anchor text of the link to the destination contains one of the query words.
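The iteration described above can be sketched on a small in-memory link graph. The values 0.15 for the minimal rank (written here as 1 - damping) and 0.85 for the damping factor are common choices and are assumptions here, since the slides do not state them. The function name and the dictionary representation of links are also assumptions. This sketch updates all pages from the previous iteration's values.

```python
def pagerank(links, damping=0.85, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    inbound = {page: [] for page in pages}
    for src, targets in links.items():
        for dst in set(targets):
            inbound[dst].append(src)
    out_count = {page: len(set(links.get(page, ()))) for page in pages}

    ranks = {page: 1.0 for page in pages}          # start with initial values = 1
    for _ in range(iterations):
        new_ranks = {}
        for page in pages:
            # sum of PR(Y) / L(Y) over every page Y linking to this page
            incoming = sum(ranks[src] / out_count[src] for src in inbound[page])
            new_ranks[page] = (1 - damping) + damping * incoming
        ranks = new_ranks
    return ranks


links = {"a": ["c"], "b": ["c"], "c": ["a"]}
for page, rank in sorted(pagerank(links).items()):
    print(page, round(rank, 3))
```

Page b has no inbound links, so it settles at the minimal rank, while c, which is linked from both a and b, ends up highest.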
69 So now what? We have milked the page for properties, either in its content or in its reputation. How can we make the engine smarter? More adaptable?
70 User Input They say user input is evil. Well, not always. Online applications get continuous feedback in the form of user input, behavior, or preference.
71 User Feedback Let's say you made a query on Google and clicked on the third result:
72 User Feedback We need an algorithm that learns from the user's feedback and choices. Thus we will use Artificial Neural Networks.
73 Artificial Neural Networks So how to train your dragon?
74 Artificial Neural Networks Err, I mean: how to train your neural network? We will build a click-tracking neural network, which has neurons (nodes) and connections between them.
75 Click-Tracking Neural Networks This is called Multi-Layer Perceptron
76 Multi-Layer Perceptron Yes, Perceptron. And no, it is not a form of Decepticon.
77 Inputs turning on their Outputs Strong enough inputs will drive their outputs. If someone queries "world bank" and clicks on the World Bank URI, it strengthens that connection.
78 Question: Why bother? Why not just collect all the click logs, count the choices, and store them all? Answer: The power of a neural network is that it can make a reasonable guess for queries it has never seen before, based on their similarity to other queries. Remember: queries are unlimited.
79 Presenting and Storing a N.N
80 Connection Strengths weights
81 Connection Strengths weights
82 Neural Network Setup A neural network is usually set up in full, in advance. In this case, since we don't know all the queries, we will create it on demand, which is faster and simpler.
83 What happens when we come across a combination of query words we have never encountered before? We create a new hidden node for it. We set default strengths and activations for it and its connections.
84 Hidden Node Creation
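85 Example [Figure: network diagram. Query Words nodes wworld and wbank connect with strength 1/2 to a hidden node, which connects with strength 0.1 to the Corresponding URLs nodes uworldbank, uriver, and uearth.]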
86 Example
87 Feeding Forward We need to set how much each node should respond to a change in its input. If the total input is approaching 0 the response should be faster, but as it goes further towards the +ve or -ve side the node should respond more slowly until it stabilizes. Thus most N.N.s use sigmoid functions.
88 Hyperbolic Tangent Function It's a form of sigmoid function, and we will use it in our N.N.
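A short sketch of the hyperbolic tangent and its slope, which is what training needs later. The helper name dtanh follows the name used on the backward propagation slide; defining it in terms of the node's output y rather than its input is an assumption of this sketch.

```python
import math


def dtanh(y):
    """Slope of tanh expressed through its output: if y = tanh(x), d/dx = 1 - y*y."""
    return 1.0 - y * y


for x in (-2.0, 0.0, 2.0):
    y = math.tanh(x)
    print(x, round(y, 3), round(dtanh(y), 3))
```

The slope is largest (1.0) when the input is 0 and shrinks as the input moves towards either extreme, which is the fast-then-slow response the Feeding Forward slide asks for.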
89 OK, let's start: before feeding forward We have the query words and the corresponding URIs. We need all the hidden nodes.
90 Next step setting up the network All inputs are active, thus activation is 1 Get strengths stored in the database
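91 Feed Forward It means taking the inputs and pushing them through the network, minding the strengths. Calculate the new activation of each node: $\text{Activation}(n) = \tanh\left(\sum_{m \to n} \text{Activation}(m) \times \text{Strength}(m, n)\right)$, where the sum runs over all nodes m leading to node n.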
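The feed-forward step can be sketched with plain lists instead of database tables. The matrices wi (query word to hidden node) and wo (hidden node to URL) and the function name are assumptions. The example strengths of 0.5 and 0.1 mirror the values that appear on the Example slide.

```python
import math


def feed_forward(wi, wo):
    """wi[i][j]: strength from input word i to hidden node j.
    wo[j][k]: strength from hidden node j to output URL k.
    Returns the activations of the input, hidden and output layers."""
    n_in, n_hidden, n_out = len(wi), len(wo), len(wo[0])
    ai = [1.0] * n_in                         # all inputs are active
    ah = [
        math.tanh(sum(ai[i] * wi[i][j] for i in range(n_in)))
        for j in range(n_hidden)
    ]
    ao = [
        math.tanh(sum(ah[j] * wo[j][k] for j in range(n_hidden)))
        for k in range(n_out)
    ]
    return ai, ah, ao


wi = [[0.5], [0.5]]            # two query words, one hidden node
wo = [[0.1, 0.1, 0.1]]         # one hidden node, three URLs
_, _, outputs = feed_forward(wi, wo)
print([round(o, 3) for o in outputs])  # → [0.076, 0.076, 0.076]
```

All three URLs get the same activation because the untrained network has no reason to prefer one over another.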
92 Feed Forward Need Output Activations
93 Results We provide the Query and URIs and obtain the activations (how much each URI is active)
94 But these results are dumb! Since the network hasn't been trained yet, the results are practically useless. The network needs to LEARN!
95 Teaching the N.N Click on a certain URL giving feedback that it is the most appropriate in your opinion How the feedback is given is totally up to the application
96 Backward Learning URLs you clicked are given a 1 while the others are given a 0. Teaching the network now goes from the URIs back to the queries. So how much do we need to change the inputs to the output nodes to get the desired output activations?
97 Backward Propagation The output activations need to match the new targets, for example [1, 0, 0]. First we calculate the errors that need to be fixed, that is, the differences between the expected targets and the past activations, and from them the deltas: Delta = dtanh(past activation) × error
98 Backward Propagation
99 Backward Propagation After calculating the deltas we update the strengths throughout the network: $\text{Change} = \text{Delta} \times \text{Activation}$ and $\text{NewStrength} = \text{OldStrength} + N \times \text{Change}$, where N is a constant.
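One training step following the delta and strength-update rules above can be sketched on in-memory weight lists. The value 0.5 for the constant N (called rate here), the use of the source node's activation in the change term, and the names wi, wo, train_once are assumptions of this sketch, not values taken from the slides.

```python
import math


def dtanh(y):
    return 1.0 - y * y


def feed_forward(wi, wo):
    ai = [1.0] * len(wi)
    ah = [math.tanh(sum(ai[i] * wi[i][j] for i in range(len(wi)))) for j in range(len(wo))]
    ao = [math.tanh(sum(ah[j] * wo[j][k] for j in range(len(wo)))) for k in range(len(wo[0]))]
    return ai, ah, ao


def train_once(wi, wo, targets, rate=0.5):
    """One backward pass: updates wi and wo in place towards the targets."""
    ai, ah, ao = feed_forward(wi, wo)
    # Delta = dtanh(past activation) x error, for the output layer
    out_deltas = [dtanh(ao[k]) * (targets[k] - ao[k]) for k in range(len(ao))]
    # Errors for hidden nodes are the output deltas pushed back through wo
    hid_deltas = [
        dtanh(ah[j]) * sum(out_deltas[k] * wo[j][k] for k in range(len(ao)))
        for j in range(len(ah))
    ]
    # New Strength = Old Strength + N x (Delta x Activation)
    for j in range(len(ah)):
        for k in range(len(ao)):
            wo[j][k] += rate * out_deltas[k] * ah[j]
    for i in range(len(ai)):
        for j in range(len(ah)):
            wi[i][j] += rate * hid_deltas[j] * ai[i]


wi = [[0.5], [0.5]]
wo = [[0.1, 0.1, 0.1]]
train_once(wi, wo, targets=[1.0, 0.0, 0.0])   # the user clicked the first URL
print([round(o, 3) for o in feed_forward(wi, wo)[2]])
```

After a single step the clicked URL's activation rises well above the other two, which drop slightly below their starting value.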
100 Backward Propagation
101 Full Training Function
102 Trained Neural Network
103 The Power of a trained Neural Network Never queried before but awesome result!
104 Now add it to our search engine
105 Assignment 5 Pick one question from the end of the chapter. Implement the function and state briefly the differences. Utilize the python files associated with the class if needed. Deadline: Next week
Searching and Ranking
Searching and Ranking Michal Cap May 14, 2008 Introduction Outline Outline Search Engines 1 Crawling Crawler Creating the Index 2 Searching Querying 3 Ranking Content-based Ranking Inbound Links PageRank
More informationCS6200 Information Retreival. Crawling. June 10, 2015
CS6200 Information Retreival Crawling Crawling June 10, 2015 Crawling is one of the most important tasks of a search engine. The breadth, depth, and freshness of the search results depend crucially on
More informationA project report submitted to Indiana University
Sequential Page Rank Algorithm Indiana University, Bloomington Fall-2012 A project report submitted to Indiana University By Shubhada Karavinkoppa and Jayesh Kawli Under supervision of Prof. Judy Qiu 1
More informationKnow what you must do to become an author at Theory of Programming. Become an Author at. Theory of Programming. Vamsi Sangam
Know what you must do to become an author at Theory of Programming Become an Author at Theory of Programming Vamsi Sangam Contents What should I do?... 2 Structure of a Post... 3 Ideas List for Topics...
More informationWeb Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Web Search Instructor: Walid Magdy 14-Nov-2017 Lecture Objectives Learn about: Working with Massive data Link analysis (PageRank) Anchor text 2 1 The Web Document
More informationThe Anatomy of a Large-Scale Hypertextual Web Search Engine
The Anatomy of a Large-Scale Hypertextual Web Search Engine Article by: Larry Page and Sergey Brin Computer Networks 30(1-7):107-117, 1998 1 1. Introduction The authors: Lawrence Page, Sergey Brin started
More informationA Framework for adaptive focused web crawling and information retrieval using genetic algorithms
A Framework for adaptive focused web crawling and information retrieval using genetic algorithms Kevin Sebastian Dept of Computer Science, BITS Pilani kevseb1993@gmail.com 1 Abstract The web is undeniably
More informationSOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES
SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES Introduction to Information Retrieval CS 150 Donald J. Patterson This content based on the paper located here: http://dx.doi.org/10.1007/s10618-008-0118-x
More informationNeural Networks (pp )
Notation: Means pencil-and-paper QUIZ Means coding QUIZ Neural Networks (pp. 106-121) The first artificial neural network (ANN) was the (single-layer) perceptron, a simplified model of a biological neuron.
More informationLecture 2 Notes. Outline. Neural Networks. The Big Idea. Architecture. Instructors: Parth Shah, Riju Pahwa
Instructors: Parth Shah, Riju Pahwa Lecture 2 Notes Outline 1. Neural Networks The Big Idea Architecture SGD and Backpropagation 2. Convolutional Neural Networks Intuition Architecture 3. Recurrent Neural
More informationSupervised Learning in Neural Networks (Part 2)
Supervised Learning in Neural Networks (Part 2) Multilayer neural networks (back-propagation training algorithm) The input signals are propagated in a forward direction on a layer-bylayer basis. Learning
More informationThe Idiot s Guide to Quashing MicroServices. Hani Suleiman
The Idiot s Guide to Quashing MicroServices Hani Suleiman The Promised Land Welcome to Reality Logging HA/DR Monitoring Provisioning Security Debugging Enterprise frameworks Don t Panic WHOAMI I wrote
More informationWHAT TYPE OF NEURAL NETWORK IS IDEAL FOR PREDICTIONS OF SOLAR FLARES?
WHAT TYPE OF NEURAL NETWORK IS IDEAL FOR PREDICTIONS OF SOLAR FLARES? Initially considered for this model was a feed forward neural network. Essentially, this means connections between units do not form
More informationInformation Retrieval
Multimedia Computing: Algorithms, Systems, and Applications: Information Retrieval and Search Engine By Dr. Yu Cao Department of Computer Science The University of Massachusetts Lowell Lowell, MA 01854,
More informationTop-To-Bottom (And Beyond) On-Page Optimization Guidebook
SEOPressor Connect Presents: Top-To-Bottom (And Beyond) On-Page Optimization Guidebook Copyright 2017 SEOPressor Connect All Rights Reserved 2 If you re looking for a guideline how to optimize your SEO
More informationIntroduction To Graphs and Networks. Fall 2013 Carola Wenk
Introduction To Graphs and Networks Fall 203 Carola Wenk On the Internet, links are essentially weighted by factors such as transit time, or cost. The goal is to find the shortest path from one node to
More informationMining Web Data. Lijun Zhang
Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems
More informationInformation Retrieval Spring Web retrieval
Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic
More informationSearch Engines. Dr. Johan Hagelbäck.
Search Engines Dr. Johan Hagelbäck johan.hagelback@lnu.se http://aiguy.org Search Engines This lecture is about full-text search engines, like Google and Microsoft Bing They allow people to search a large
More informationEndless Monetization
Hey Guys, So, today we want to bring you a few topics that we feel compliment's the recent traffic, niches and keyword discussions. Today, we want to talk about a few different things actually, ranging
More informationTitle: Artificial Intelligence: an illustration of one approach.
Name : Salleh Ahshim Student ID: Title: Artificial Intelligence: an illustration of one approach. Introduction This essay will examine how different Web Crawling algorithms and heuristics that are being
More informationHow To Gain a Competitive Advantage
How To Gain a Competitive Advantage VIDEO See this video in High Definition Download this video How To Gain a Competitive Advantage - 1 Video Transcript The number one reason why people fail online is
More informationDeep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur
Deep Learning for Visual Computing Prof. Debdoot Sheet Department of Electrical Engineering Indian Institute of Technology, Kharagpur Lecture - 05 Classification with Perceptron Model So, welcome to today
More informationA PRACTICE BUILDERS white paper. 8 Ways to Improve SEO Ranking of Your Healthcare Website
A PRACTICE BUILDERS white paper 8 Ways to Improve SEO Ranking of Your Healthcare Website More than 70 percent of patients find their healthcare providers through a search engine. This means appearing high
More informationSearch Engines. Charles Severance
Search Engines Charles Severance Google Architecture Web Crawling Index Building Searching http://infolab.stanford.edu/~backrub/google.html Google Search Google I/O '08 Keynote by Marissa Mayer Usablity
More informationCMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro
CMU 15-781 Lecture 18: Deep learning and Vision: Convolutional neural networks Teacher: Gianni A. Di Caro DEEP, SHALLOW, CONNECTED, SPARSE? Fully connected multi-layer feed-forward perceptrons: More powerful
More informationNotes on Multilayer, Feedforward Neural Networks
Notes on Multilayer, Feedforward Neural Networks CS425/528: Machine Learning Fall 2012 Prepared by: Lynne E. Parker [Material in these notes was gleaned from various sources, including E. Alpaydin s book
More informationLab 3 - Development Phase 2
Lab 3 - Development Phase 2 In this lab, you will continue the development of your frontend by integrating the data generated by the backend. For the backend, you will compute and store the PageRank scores
More informationAN SEO GUIDE FOR SALONS
AN SEO GUIDE FOR SALONS AN SEO GUIDE FOR SALONS Set Up Time 2/5 The basics of SEO are quick and easy to implement. Management Time 3/5 You ll need a continued commitment to make SEO work for you. WHAT
More informationInformation Retrieval
Introduction to Information Retrieval CS3245 12 Lecture 12: Crawling and Link Analysis Information Retrieval Last Time Chapter 11 1. Probabilistic Approach to Retrieval / Basic Probability Theory 2. Probability
More informationFall 09, Homework 5
5-38 Fall 09, Homework 5 Due: Wednesday, November 8th, beginning of the class You can work in a group of up to two people. This group does not need to be the same group as for the other homeworks. You
More informationNatural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu
Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Web Crawler Finds and downloads web pages automatically provides the collection for searching Web is huge and constantly
More informationCS 6501: Deep Learning for Computer Graphics. Training Neural Networks II. Connelly Barnes
CS 6501: Deep Learning for Computer Graphics Training Neural Networks II Connelly Barnes Overview Preprocessing Initialization Vanishing/exploding gradients problem Batch normalization Dropout Additional
More informationLECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS
LECTURE NOTES Professor Anita Wasilewska NEURAL NETWORKS Neural Networks Classifier Introduction INPUT: classification data, i.e. it contains an classification (class) attribute. WE also say that the class
More information5 Choosing keywords Initially choosing keywords Frequent and rare keywords Evaluating the competition rates of search
Seo tutorial Seo tutorial Introduction to seo... 4 1. General seo information... 5 1.1 History of search engines... 5 1.2 Common search engine principles... 6 2. Internal ranking factors... 8 2.1 Web page
More informationCTI-TC Weekly Working Sessions
CTI-TC Weekly Working Sessions Meeting Date: October 18, 2016 Time: 15:00:00 UTC Purpose: Weekly CTI-TC Joint Working Session Attendees: Agenda: Jordan - Moderator Darley Christian Hunt Rich Piazza TAXII
More informationMachine Learning 13. week
Machine Learning 13. week Deep Learning Convolutional Neural Network Recurrent Neural Network 1 Why Deep Learning is so Popular? 1. Increase in the amount of data Thanks to the Internet, huge amount of
More informationCS 3640: Introduction to Networks and Their Applications
CS 3640: Introduction to Networks and Their Applications Fall 2018, Lecture 7: The Link Layer II Medium Access Control Protocols Instructor: Rishab Nithyanand Teaching Assistant: Md. Kowsar Hossain 1 You
More informationBelow is another example, taken from a REAL profile on one of the sites in my packet of someone abusing the sites.
Before I show you this month's sites, I need to go over a couple of things, so that we are all on the same page. You will be shown how to leave your link on each of the sites, but abusing the sites can
More informationSearching the Deep Web
Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index
More informationLecture 12. Lecture 12: The IO Model & External Sorting
Lecture 12 Lecture 12: The IO Model & External Sorting Announcements Announcements 1. Thank you for the great feedback (post coming soon)! 2. Educational goals: 1. Tech changes, principles change more
More informationCS6220: DATA MINING TECHNIQUES
CS6220: DATA MINING TECHNIQUES Image Data: Classification via Neural Networks Instructor: Yizhou Sun yzsun@ccs.neu.edu November 19, 2015 Methods to Learn Classification Clustering Frequent Pattern Mining
More informationCollection Building on the Web. Basic Algorithm
Collection Building on the Web CS 510 Spring 2010 1 Basic Algorithm Initialize URL queue While more If URL is not a duplicate Get document with URL [Add to database] Extract, add to queue CS 510 Spring
More informationCS47300 Web Information Search and Management
CS47300 Web Information Search and Management Search Engine Optimization Prof. Chris Clifton 31 October 2018 What is Search Engine Optimization? 90% of search engine clickthroughs are on the first page
More informationA project report submitted to Indiana University
Page Rank Algorithm Using MPI Indiana University, Bloomington Fall-2012 A project report submitted to Indiana University By Shubhada Karavinkoppa and Jayesh Kawli Under supervision of Prof. Judy Qiu 1
More informationEXTRACTION OF RELEVANT WEB PAGES USING DATA MINING
Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,
More informationCS 4349 Lecture October 18th, 2017
CS 4349 Lecture October 18th, 2017 Main topics for #lecture include #minimum_spanning_trees. Prelude Homework 6 due today. Homework 7 due Wednesday, October 25th. Homework 7 has one normal homework problem.
More informationPillar Content & Topic Clusters
Pillar Content & Topic Clusters Hi, I m Liz Murphy! A Little About Me Content strategist at IMPACT. My obsession is content that closes deals. I ve been in the inbound world for 5 years. I have aggressive
More informationWhat Are The SEO Benefits from Online Reviews and UGC?
Online Reviews: The Benefits, Best Practices and More. By: Joe Vernon on www.gravitatedesign.com Growing up I was told by my parents not to care what others thought of me but to continue being who I was.
More informationInformation Retrieval
Information Retrieval CSC 375, Fall 2016 An information retrieval system will tend not to be used whenever it is more painful and troublesome for a customer to have information than for him not to have
More informationLink Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld
Link Analysis CSE 454 Advanced Internet Systems University of Washington 1/26/12 16:36 1 Ranking Search Results TF / IDF or BM25 Tag Information Title, headers Font Size / Capitalization Anchor Text on
More informationExcel Basics Rice Digital Media Commons Guide Written for Microsoft Excel 2010 Windows Edition by Eric Miller
Excel Basics Rice Digital Media Commons Guide Written for Microsoft Excel 2010 Windows Edition by Eric Miller Table of Contents Introduction!... 1 Part 1: Entering Data!... 2 1.a: Typing!... 2 1.b: Editing
More informationCS 137 Part 4. Structures and Page Rank Algorithm
CS 137 Part 4 Structures and Page Rank Algorithm Structures Structures are a compound data type. They give us a way to group variables. They consist of named member variables and are stored together in
More informationHow To Construct A Keyword Strategy?
Introduction The moment you think about marketing these days the first thing that pops up in your mind is to go online. Why is there a heck about marketing your business online? Why is it so drastically
More informationWeb Crawling. Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India
Web Crawling Jitali Patel 1, Hardik Jethva 2 Dept. of Computer Science and Engineering, Nirma University, Ahmedabad, Gujarat, India - 382 481. Abstract- A web crawler is a relatively simple automated program
More informationCIS 192: Artificial Intelligence. Search and Constraint Satisfaction Alex Frias Nov. 30 th
CIS 192: Artificial Intelligence Search and Constraint Satisfaction Alex Frias Nov. 30 th What is AI? Designing computer programs to complete tasks that are thought to require intelligence 4 categories
More informationAuthority Scoring. What It Is, How It s Changing, and How to Use It
Authority Scoring What It Is, How It s Changing, and How to Use It For years, Domain Authority (DA) has been viewed by the SEO industry as a leading metric to predict a site s organic ranking ability.
More informationArtificial Intelligence Introduction Handwriting Recognition Kadir Eren Unal ( ), Jakob Heyder ( )
Structure: 1. Introduction 2. Problem 3. Neural network approach a. Architecture b. Phases of CNN c. Results 4. HTM approach a. Architecture b. Setup c. Results 5. Conclusion 1.) Introduction Artificial
More informationWeb Applications: Internet Search and Digital Preservation
CS 312 Internet Concepts Web Applications: Internet Search and Digital Preservation Dr. Michele Weigle Department of Computer Science Old Dominion University mweigle@cs.odu.edu http://www.cs.odu.edu/~mweigle/cs312-f11/
More informationDefinition. Quantifying Anonymity. Anonymous Communication. How can we calculate how anonymous we are? Who you are from the communicating party