Popularity of Twitter Accounts: PageRank on a Social Network
A.D-A
December 8

1 Problem Statement

Twitter is a social networking service where users create and interact with 140-character messages called tweets. The service has gained worldwide popularity since its creation in 2006, with more than 100 million users posting 340 million tweets a day. The popularity of a Twitter user's account is given by the number of followers it has; that is, the number of other accounts that follow, or subscribe to, the user's tweets. However, the number of followers alone may not be the best measure of popularity. What is to prevent a user's popularity from being inflated through accounts created for the sole purpose of following that user (aside from Twitter deleting those accounts if they are found to violate the Terms of Service)? The aim of this project is to investigate the application of PageRank to determine the popularity of a Twitter account, and to determine whether it is a more appropriate measure of an account's popularity.

2 Background Work

Google's PageRank algorithm was developed to rank websites for its search-engine results. It assigns a numerical rank to each indexed web page, evaluating a page's importance by its link structure. The idea is that if a page A has a link to page B, the owner of A is giving some measure of importance to page B. We can think of the web as a directed graph, where pages are nodes and there is an edge from page A to page B if A contains one or more links to B.

The mathematics behind the algorithm is generic, and PageRank can be applied to the social network graph of Twitter's follower relations. In this graph, instead of web pages and links, nodes are users and edges indicate follower relations. There is an edge from A to B if B is a follower of A. That is, edges follow the direction of tweet transmission from a user to their followers.
This directionality is due to our dataset and has some ramifications for the analysis of the results, which will be discussed later.
3 Theoretical Analysis

To gather testing data I initially used a Python script and the Twitter API to collect account information, storing it in a relational database. However, this quickly proved impractical: the Twitter API limits request rates to a maximum of 15 requests per 15 minutes. As a result, I would have been able to construct only a very small dataset; processing just 30,000 accounts would take nearly 21 days! So, instead of attempting to gather modern data, I used an older snapshot of the Twitter dataset. The social network graph of Twitter at that time consisted of 41,652,230 users, with 1,468,365,182 follower relations.

The dataset initially consisted of a 25GB text file, where each line was a pair of ids, i and j, indicating an edge from i to j (that is, j is a follower of i). This data was unordered, and difficult to access and use due to its size. To convert it into a usable form, I first attempted to create a relational database, with the idea that it would make queries for data analysis much easier once the PageRank of each node was calculated. However, this approach was far too slow. Even using bulk insertion, only 300,000 nodes were added after 4 hours. Furthermore, the database was very slow to access at runtime: iterating through this partial dataset took much longer than iterating over the entire text file.

Next, I tried simply streaming the text file to construct the adjacency matrix. While this was faster, it still took nearly 1.5 hours just to iterate through the edges. More importantly, the resulting matrix would be far too large to fit in memory; with 41,652,230 nodes the adjacency matrix would have 1,734,908,263,972,900 entries! Clearly a better data structure is required. I was able to use the WebGraph framework to generate a compressed graph file from the set of edges. The links structure now looks something like Figure 1.
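The two back-of-the-envelope figures in this section (the crawl time and the size of a dense adjacency matrix) can be checked with a few lines of Python. This is only a sanity check; it assumes one API request per account and, for the storage estimate, 4 bytes per matrix entry, neither of which is stated in the text.

```python
# Figures quoted in the text.
REQUESTS_PER_WINDOW = 15      # API limit: 15 requests ...
WINDOW_MINUTES = 15           # ... per 15 minutes
ACCOUNTS = 30_000             # assumption: one request per account
NODES = 41_652_230            # users in the snapshot

# Time needed to fetch every account at the rate limit.
minutes_needed = ACCOUNTS / REQUESTS_PER_WINDOW * WINDOW_MINUTES
days_needed = minutes_needed / (60 * 24)

# A dense adjacency matrix has one entry per ordered pair of nodes.
dense_entries = NODES ** 2
# Assumption: each entry is stored as a 32-bit (4-byte) value.
dense_bytes = dense_entries * 4

print(f"crawl time: {days_needed:.1f} days")
print(f"dense matrix entries: {dense_entries:,}")
print(f"dense matrix size: {dense_bytes / 1e15:.1f} PB")
```

Under these assumptions the dense matrix would need several petabytes, which is why a sparse, compressed representation is the only practical option.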
Figure 1: Links structure of graph

We represent the transition matrix by the outdegree of each node and the list of its successors. By listing nonzero entries by column, we know the value of each nonzero entry: 1/outdegree.
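As an illustration of this links structure, here is a small Python sketch that groups an edge list by source node and derives the nonzero entries of one transition-matrix column. The function names `build_links` and `transition_column` and the toy edge list are hypothetical; the project itself used WebGraph's compressed format, which this sketch does not reproduce.

```python
from collections import defaultdict


def build_links(edges):
    """Group (source, follower) pairs into source -> list of successors."""
    successors = defaultdict(list)
    for source, follower in edges:
        successors[source].append(follower)
    return successors


def transition_column(successors, node):
    """Nonzero entries of the transition-matrix column for `node`.

    Every successor receives the same weight, 1/outdegree, so the column
    is fully determined by the successor list.
    """
    out = successors.get(node, [])
    if not out:
        return {}  # dead end: the column is all zeros
    weight = 1.0 / len(out)
    return {succ: weight for succ in out}


# Toy graph: an edge (i, j) means j is a follower of i.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (2, 0)]
links = build_links(edges)
col = transition_column(links, 0)
print(col)
```

Storing only the successor lists takes space proportional to the number of edges rather than the square of the number of nodes.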
This structure is then compressed by the WebGraph framework. Using this method, the Twitter dataset was compressed into a 2.5GB file. However, the graph structure was still too large for the framework to load into memory, even when 8GB of heap space was allocated. Fortunately, WebGraph allows the graph to be loaded as a memory-mapped file, provided a list of offsets has been generated, though access is slower.

While the graph is now usable, the PageRank algorithm needs to be adjusted. The approach discussed in class will not work, as the matrix would still be too large. A more memory-efficient way of calculating PageRank is required. Taher Haveliwala describes such a method in Efficient Computation of PageRank (1999): a multi-pass (per single PageRank iteration) technique that allows computation of PageRank in memory for very large graphs, assuming stream access to the graph and to the scores from the previous iteration.

Figure 2: Memory-efficient PageRank algorithm

For this algorithm we make successive passes over the links structure, using the previous rank values (held in Source) to compute the rank values of the current iteration (held in Dest). The algorithm can stop when the difference between Source and Dest falls below some threshold (PageRank has converged), or after a set number of iterations. This implementation requires that at least the current scores are stored in memory (scores from the previous round may be stored on disk if memory cannot hold both). For the Twitter dataset, with 42,000,000 nodes and 32-bit floats, each score array takes about 168MB, so both the current and the previous iteration's scores can easily be held in memory.

However, this implementation also assumes that there are no dead ends in our graph. Dead ends leak PageRank, and we cannot easily remove them from our graph. We need to make some minor adjustments to compensate:
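A minimal Python sketch of one way to make this kind of adjustment is given here. It follows the rank-redistribution formulation in Mining of Massive Datasets [4] and is not the project's Java code. The names `pagerank`, `successors`, and `rounds` are hypothetical, and the default `c = 0.85` is only a commonly used damping value, assumed here for illustration. After each pass, S is the total rank that reached Dest, and the missing 1 - S is spread evenly over all nodes so the ranks again sum to 1.

```python
def pagerank(successors, n, c=0.85, rounds=10):
    """Streaming-style PageRank that tolerates dead ends.

    successors: dict mapping node id -> list of successor ids (ids 0..n-1).
    c:          damping factor (assumed value, not taken from the paper).
    rounds:     fixed number of iterations instead of a convergence test.
    """
    source = [1.0 / n] * n  # ranks from the previous iteration
    for _ in range(rounds):
        dest = [0.0] * n    # ranks being built in this iteration
        # One pass over the links structure: each node splits c times
        # its old rank evenly among its successors.
        for i, out in successors.items():
            if not out:
                continue    # dead end: its rank leaks out of the system
            share = c * source[i] / len(out)
            for j in out:
                dest[j] += share
        # S is what survived; 1 - S leaked through teleportation and
        # dead ends, and is redistributed evenly over all n nodes.
        s = sum(dest)
        leaked = (1.0 - s) / n
        source = [r + leaked for r in dest]
    return source


# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2, and node 2 is a dead end.
ranks = pagerank({0: [1, 2], 1: [2], 2: []}, n=3)
print(ranks, sum(ranks))
```

Only two arrays of length n are held in memory, and the links structure is read once per round, which matches the access pattern described for Source and Dest.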
Figure 3: Modified PageRank algorithm to account for dead ends

If the graph has no dead ends, the amount of PageRank leaked is 1 - c. Since we have dead ends, the amount of leaked PageRank may be larger, and we must account for it by calculating S. This ensures that the rank vector is normalized to 1.

It is important to note that the PageRank algorithm assigns higher rank to web pages that have large indegrees. This is essentially the opposite of what we want for the Twitter follower relation, where accounts with large outdegrees (many followers) should be more popular. When we apply the algorithm as-is to the Twitter dataset, we are in fact calculating a Reverse PageRank. This means that nodes with a lower score are the nodes that are more popular. There are a few ways of solving this:

1. We can reverse the direction of all of the edges before calculating PageRank. Due to issues with the dataset used for this project, this was not easily possible. The size and format led to the dataset becoming corrupted when I tried to transpose the graph.

2. We can ensure all of the ranks are normalized to 1, invert them, and then normalize again. However, since we used single-precision floats, accuracy is lost in this step and many ranks end up as 1.0 or 0.0. Using double precision would likely help, but this also doubles the memory usage and slows the running time significantly.

Alternatively, we may simply use this Reverse PageRank as an account's popularity by treating it in the same manner as golf scores: the lower the score, the more popular the account, and we order by ascending score rather than descending.

4 Experimental Design and Analysis

I chose to implement the PageRank algorithm in Java to take advantage of the WebGraph framework. First I calculated the FollowerRank of each node to use as a basis
for comparison. This calculation is simple: a node's FollowerRank is its outdegree divided by the total number of edges in the graph (the share of all follower relations that belong to the node). I then ran the PageRank algorithm described previously, using the standard teleportation factor. Rather than running to convergence, I chose to run for a set number of rounds, in this case 10, due to time constraints and the need to recover data after an unfortunate crash. Each round took approximately 10 minutes, for a total of just under 2 hours. Figures 4 and 5 show the top ten accounts by followers, and then these same accounts sorted by their respective PageRank:

Figure 4: Top Twitter accounts by FollowerRank

Figure 5: Same accounts, ranked by Reverse PageRank

The order is significantly different. Notably, Obama's account has a rather poor ranking, which is at odds with how many followers he has. The fact that this is an outdated dataset makes analysis somewhat difficult, but based on a check of current Twitter statistics for these accounts, the bottom three in the above table all follow many users. For example, Obama currently follows 626,000 users, while Oprah follows only 288. Many of the other users in our dataset with the best rankings were random people who had only a few followers but didn't follow anyone else. These accounts basically acted as sink nodes.
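The two orderings compared above can be sketched in Python as follows. The helper names `follower_rank` and `order_by_reverse_pagerank` are hypothetical, and the Reverse PageRank scores in the example are made-up values for illustration only, not results from the project.

```python
def follower_rank(successors, n):
    """FollowerRank of each node: its outdegree over the total edge count."""
    total_edges = sum(len(out) for out in successors.values())
    if total_edges == 0:
        return [0.0] * n
    return [len(successors.get(i, [])) / total_edges for i in range(n)]


def order_by_reverse_pagerank(accounts, reverse_pr):
    """Sort accounts like golf scores: the lowest Reverse PageRank first."""
    return sorted(accounts, key=lambda a: reverse_pr[a])


# Toy graph: an edge i -> j means j follows i, so outdegree = followers.
successors = {0: [1, 2, 3], 1: [2], 2: []}
fr = follower_rank(successors, n=4)

# Illustrative Reverse PageRank scores (invented for this example).
reverse_pr = [0.05, 0.20, 0.45, 0.30]
by_followers = sorted(range(4), key=lambda a: fr[a], reverse=True)
by_reverse_pr = order_by_reverse_pagerank(range(4), reverse_pr)
print(fr, by_followers, by_reverse_pr)
```

FollowerRank is sorted in descending order while Reverse PageRank is sorted in ascending order, so the two lists can be placed side by side as in Figures 4 and 5.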
A possible solution to these random accounts appearing to be popular would be to recursively remove dead-end nodes, since many of these relation chains ended at accounts without any followers. However, the compressed graph is immutable, so we cannot modify it at runtime. Furthermore, our ids are sequential, with a separate file mapping these sequential ids to Twitter ids and account handles. To save memory, the sequential ids are omitted from text output and from our rank arrays (the array index corresponds to the node id). Removing any nodes would make our data unreadable. My attempts to preprocess the graph before compression were unsuccessful, and the dataset became corrupted several times.

So we cannot actually remove the nodes, though we can try to exclude them from our calculations. I attempted to implement a variant of PageRank that performed one iteration of dead-end removal, but it had several drawbacks. First, it increased our running time, since we must now iterate through each node's successors twice: first to count the hanging nodes and adjust the PageRank contribution factor, and then to distribute the rank. Additionally, this simply left many new dead ends, as well as nodes that were completely disconnected from the rest of the graph.

5 Conclusions

Ultimately, PageRank does not seem particularly suited to evaluating a Twitter account's popularity, at least not without some modifications. It is likely that recursively removing all nodes with zero followers would improve the rankings, but we would still run into the issue of popular accounts serving as sink nodes. For example, in the current 2017 Twitter follower network, Taylor Swift has millions of followers but doesn't follow anyone. Her account would be a massive sink node, but we wouldn't want to remove it. There is a great deal of potential for future work.
Firstly, managing to remove all of the dead-end nodes and to transpose the graph would make the PageRank calculation both more intuitive and more meaningful. Furthermore, implementing a factor akin to TrustRank, which is used to discount spam and link-farm websites, may help. Accounts that have existed for longer, have more followers, or are verified could be considered more trustworthy. With such a factor, users with few followers would contribute weakly or not at all to the popularity of the accounts that they follow. I had intended to implement some form of trust rank, but this project became a major lesson in the difficulties of working with big data, and most of my time was spent on issues related to memory usage. I would also like to try this with a more modern dataset; Twitter in 2010 had 42,000,000 users, while today it has over 330,000,000. Not only would the results be more meaningful, they would also be easier to analyze, requiring fewer assumptions. However, collecting this data has proved difficult and time-consuming.
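One possible way to approach the recursive dead-end removal mentioned above without renumbering node ids is to compute a boolean mask and skip masked nodes during the rank calculation. The Python sketch below is a small in-memory illustration only: the name `dead_end_mask` is hypothetical, and the predecessor lists it builds would themselves be very large for a graph with over a billion edges.

```python
from collections import deque


def dead_end_mask(successors, n):
    """Return alive[i] == False for every node pruned as a dead end.

    A node is pruned when it has no remaining successors; pruning it may
    turn its predecessors into dead ends, so the process repeats until
    no dead ends are left. Node ids (array indices) never change.
    """
    out_count = [0] * n
    predecessors = [[] for _ in range(n)]
    for i, out in successors.items():
        out_count[i] = len(out)
        for j in out:
            predecessors[j].append(i)

    alive = [True] * n
    queue = deque(i for i in range(n) if out_count[i] == 0)
    while queue:
        node = queue.popleft()
        if not alive[node]:
            continue
        alive[node] = False
        # Each predecessor loses one live successor.
        for p in predecessors[node]:
            if alive[p]:
                out_count[p] -= 1
                if out_count[p] == 0:
                    queue.append(p)
    return alive


# Toy graph: chain 3 -> 0 -> 1 -> 2 ends in a dead end; 3 and 4 form a cycle.
alive = dead_end_mask({0: [1], 1: [2], 3: [4, 0], 4: [3]}, n=5)
print(alive)
```

Because the mask keeps the array index equal to the node id, the separate id-to-handle mapping file stays valid.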
As it is, the number of followers does seem to be a simple and effective estimate of a Twitter account's popularity, provided untrustworthy accounts could be factored out in some way.

References

[1] Boldi, P. and Vigna, S., 2004, May. The WebGraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web. ACM.
[2] Haveliwala, T., 1999. Efficient computation of PageRank. Stanford.
[3] Kwak, H., Lee, C., Park, H. and Moon, S., 2010, April. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web. ACM.
[4] Leskovec, J., Rajaraman, A. and Ullman, J.D. Mining of massive datasets. Cambridge University Press.
More informationA FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET
A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET Shervin Daneshpajouh, Mojtaba Mohammadi Nasiri¹ Computer Engineering Department, Sharif University of Technology, Tehran, Iran daneshpajouh@ce.sharif.edu,
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/6/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 High dim. data Graph data Infinite data Machine
More informationCS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp
CS 229 Final Project - Using machine learning to enhance a collaborative filtering recommendation system for Yelp Chris Guthrie Abstract In this paper I present my investigation of machine learning as
More informationDigital Marketing & Sales Training. Part 1: SEO, Local, & AdWords Express Leadgenix & AG 431
Digital Marketing & Sales Training Part 1: SEO, Local, & AdWords Express Leadgenix & AG 431 Introductions Andy Selcho AG Location Owner Dan Posner Partner Relationships Jamie Bates Director of Operations
More informationReddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011
Reddit Recommendation System Daniel Poon, Yu Wu, David (Qifan) Zhang CS229, Stanford University December 11 th, 2011 1. Introduction Reddit is one of the most popular online social news websites with millions
More informationEfficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 13, NO. 5, SEPTEMBER 2002 1225 Efficient Tuning of SVM Hyperparameters Using Radius/Margin Bound and Iterative Algorithms S. Sathiya Keerthi Abstract This paper
More informationCS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University
CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University Instead of generic popularity, can we measure popularity within a topic? E.g., computer science, health Bias the random walk When
More informationCHAPTER. The Role of PL/SQL in Contemporary Development
CHAPTER 1 The Role of PL/SQL in Contemporary Development 4 Oracle PL/SQL Performance Tuning Tips & Techniques When building systems, it is critical to ensure that the systems will perform well. For example,
More informationAdaptive methods for the computation of PageRank
Linear Algebra and its Applications 386 (24) 51 65 www.elsevier.com/locate/laa Adaptive methods for the computation of PageRank Sepandar Kamvar a,, Taher Haveliwala b,genegolub a a Scientific omputing
More informationSome Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing
Some Interesting Applications of Theory PageRank Minhashing Locality-Sensitive Hashing 1 PageRank The thing that makes Google work. Intuition: solve the recursive equation: a page is important if important
More informationTopology-Based Spam Avoidance in Large-Scale Web Crawls
Topology-Based Spam Avoidance in Large-Scale Web Crawls Clint Sparkman Joint work with Hsin-Tsang Lee and Dmitri Loguinov Internet Research Lab Department of Computer Science and Engineering Texas A&M
More informationIntroduction to Data Mining
Introduction to Data Mining Lecture #10: Link Analysis-2 Seoul National University 1 In This Lecture Pagerank: Google formulation Make the solution to converge Computing Pagerank for very large graphs
More informationCSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena
CSE 190 Lecture 16 Data Mining and Predictive Analytics Small-world phenomena Another famous study Stanley Milgram wanted to test the (already popular) hypothesis that people in social networks are separated
More informationF. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google
Web Search Engines 1 Web Search before Google Web Search Engines (WSEs) of the first generation (up to 1998) Identified relevance with topic-relateness Based on keywords inserted by web page creators (META
More informationParallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem
I J C T A, 9(41) 2016, pp. 1235-1239 International Science Press Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem Hema Dubey *, Nilay Khare *, Alind Khare **
More informationStructure of Social Networks
Structure of Social Networks Outline Structure of social networks Applications of structural analysis Social *networks* Twitter Facebook Linked-in IMs Email Real life Address books... Who Twitter #numbers
More informationText Mining on Mailing Lists: Tagging
Text Mining on Mailing Lists: Tagging Florian Haimerl Advisor: Daniel Raumer, Heiko Niedermayer Seminar Innovative Internet Technologies and Mobile Communications WS2017/2018 Chair of Network Architectures
More informationB490 Mining the Big Data. 5. Models for Big Data
B490 Mining the Big Data 5. Models for Big Data Qin Zhang 1-1 2-1 MapReduce MapReduce The MapReduce model (Dean & Ghemawat 2004) Input Output Goal Map Shuffle Reduce Standard model in industry for massive
More informationWeb Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Web Search Instructor: Walid Magdy 14-Nov-2017 Lecture Objectives Learn about: Working with Massive data Link analysis (PageRank) Anchor text 2 1 The Web Document
More informationMining Data that Changes. 17 July 2015
Mining Data that Changes 17 July 2015 Data is Not Static Data is not static New transactions, new friends, stop following somebody in Twitter, But most data mining algorithms assume static data Even a
More informationUniversity of Waterloo. Storing Directed Acyclic Graphs in Relational Databases
University of Waterloo Software Engineering Storing Directed Acyclic Graphs in Relational Databases Spotify USA Inc New York, NY, USA Prepared by Soheil Koushan Student ID: 20523416 User ID: skoushan 4A
More informationCS281 Section 3: Practical Optimization
CS281 Section 3: Practical Optimization David Duvenaud and Dougal Maclaurin Most parameter estimation problems in machine learning cannot be solved in closed form, so we often have to resort to numerical
More informationLecture Notes on Garbage Collection
Lecture Notes on Garbage Collection 15-411: Compiler Design André Platzer Lecture 20 1 Introduction In the previous lectures we have considered a programming language C0 with pointers and memory and array
More informationA Modified Algorithm to Handle Dangling Pages using Hypothetical Node
A Modified Algorithm to Handle Dangling Pages using Hypothetical Node Shipra Srivastava Student Department of Computer Science & Engineering Thapar University, Patiala, 147001 (India) Rinkle Rani Aggrawal
More informationSocial Network Analysis
Social Network Analysis Giri Iyengar Cornell University gi43@cornell.edu March 14, 2018 Giri Iyengar (Cornell Tech) Social Network Analysis March 14, 2018 1 / 24 Overview 1 Social Networks 2 HITS 3 Page
More informationP2P Applications. Reti di Elaboratori Corso di Laurea in Informatica Università degli Studi di Roma La Sapienza Canale A-L Prof.ssa Chiara Petrioli
P2P Applications Reti di Elaboratori Corso di Laurea in Informatica Università degli Studi di Roma La Sapienza Canale A-L Prof.ssa Chiara Petrioli Server-based Network Peer-to-peer networks A type of network
More information