Popularity of Twitter Accounts: PageRank on a Social Network



A.D-A
December 8

1 Problem Statement

Twitter is a social networking service where users create and interact with 140-character messages called tweets. The service has gained worldwide popularity since its creation in 2006, with more than 100 million users posting 340 million tweets a day in […].

The popularity of a Twitter user's account is usually taken to be the number of followers it has; that is, the number of other accounts that follow, or subscribe to, the user's tweets. However, the number of followers alone may not be the best measure of popularity. What is to prevent a user's popularity from being inflated through accounts created for the sole purpose of following that user (aside from Twitter deleting those accounts if they are found to violate the Terms of Service)?

The aim of this project is to investigate the application of PageRank to the popularity of a Twitter account, and to determine whether it is a more appropriate measure of popularity than the follower count.

2 Background Work

Google's PageRank algorithm was developed to rank websites for its search-engine results. It assigns a numerical rank to each indexed web page, evaluating pages' importance by their link structure. The idea is that if a page A has a link to page B, the owner of A is giving some measure of importance to page B. We can think of the web as a directed graph, where pages are nodes and there is an edge from page A to page B if A contains one or more links to B.

The mathematics behind the algorithm is generic, and PageRank can be applied to the social network graph of Twitter's follower relations. In this graph, instead of web pages and links, nodes are users and edges indicate follower relations. There is an edge from A to B if B is a follower of A. That is, edges follow the direction of tweet transmission, from a user to their followers. This directionality is due to our dataset and has some ramifications for the analysis of the results, which are discussed later.
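The edge convention above is easy to get backwards, so here is a minimal sketch of it. The toy accounts and the helper names (`successors`, `transpose`) are assumptions made for illustration and are not part of the project's code.

```python
from collections import defaultdict

# Hypothetical toy edge list in the dataset's convention:
# (i, j) means "j is a follower of i", so the edge points from i to j.
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol")]


def successors(edge_list):
    """Map each node to the list of nodes its out-edges point to."""
    out = defaultdict(list)
    for src, dst in edge_list:
        out[src].append(dst)
    return dict(out)


def transpose(edge_list):
    """Reverse every edge, giving the follower -> followed direction."""
    return [(dst, src) for src, dst in edge_list]


followers = successors(edges)             # out-edges are followers
following = successors(transpose(edges))  # out-edges are accounts followed

# In the dataset's direction, the out-degree of a node is its follower count.
follower_count = {user: len(f) for user, f in followers.items()}
```

In this direction an account with many followers has a large out-degree rather than a large in-degree, which is the point the later analysis turns on.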

3 Theoretical Analysis

To gather testing data I initially used a Python script and the Twitter API to collect account information, storing it in a relational database. However, this quickly proved impractical: the Twitter API limits request rates to a maximum of 15 requests per 15 minutes. As a result, I would have been able to construct only a very small dataset; processing just 30,000 accounts would take nearly 21 days! So, instead of attempting to gather modern data, I used a snapshot of the Twitter dataset as it appeared in […]. The social network graph of Twitter at this time consisted of 41,652,230 users, with 1,468,365,182 follower relations.

The dataset initially consisted of a 25 GB text file, where each line was a pair of ids, i and j, indicating an edge from i to j (that is, j is a follower of i). This data was unordered, and difficult to access and use due to its size. To convert it into a usable form, I first attempted to create a relational database, with the idea that it would make queries for data analysis much easier once the PageRank of each node was calculated. However, this approach also proved impractical because it was far too slow. Even using bulk insertion, only 300,000 nodes were added after 4 hours. Furthermore, it was very slow to access at runtime: iterating through this partial dataset took much longer than iterating through the entire text file.

Next, I tried simply streaming the text file to construct the adjacency matrix. While this was faster, it still took too long, needing nearly 1.5 hours just to iterate through the edges. More importantly, the resulting matrix would be far too large to fit in memory; with 41,652,230 nodes the adjacency matrix would have 1,734,908,263,972,900 entries! Clearly a better data structure is required.

I was able to use the WebGraph framework to generate a compressed graph file from the set of edges. The links structure now looks something like Figure 1.
Figure 1: Links structure of graph

We represent the transition matrix by the outdegree of each node and the list of its successors. By listing nonzero entries by column, we know the value of each nonzero entry: 1/outdegree.
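To make the outdegree-plus-successors idea concrete, the following is a minimal sketch using two flat arrays. It is not the WebGraph format (which additionally compresses the lists); the function names and the tiny integer-id graph are assumptions for illustration.

```python
from itertools import accumulate


def build_links(num_nodes, edges):
    """Return (offsets, succ) where the successors of node u are
    succ[offsets[u]:offsets[u + 1]] and its outdegree is the slice length."""
    outdeg = [0] * num_nodes
    for src, _ in edges:
        outdeg[src] += 1
    offsets = [0] + list(accumulate(outdeg))
    succ = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot for each source node
    for src, dst in edges:
        succ[cursor[src]] = dst
        cursor[src] += 1
    return offsets, succ


def transition_entry(offsets, succ, dst, src):
    """Entry in row dst, column src of the transition matrix:
    1/outdegree(src) if there is an edge src -> dst, otherwise 0."""
    lo, hi = offsets[src], offsets[src + 1]
    return 1.0 / (hi - lo) if dst in succ[lo:hi] else 0.0


# Hypothetical 3-node graph with integer ids 0..2.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
offsets, succ = build_links(3, edges)

# Storage is one integer per node (plus one) and one per edge,
# instead of num_nodes * num_nodes matrix entries.
sparse_cells = len(offsets) + len(succ)
dense_cells = 3 * 3
```

For the full dataset the same comparison is roughly 1.5 billion stored successors against the 1,734,908,263,972,900 entries of the dense matrix.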

This structure is then compressed by the WebGraph framework. Using this method, the Twitter dataset was compressed into a 2.5 GB file. However, the graph structure was still too large for the framework to load into memory, even when 8 GB of heap space was allocated. Fortunately, WebGraph allows the graph to be loaded as a memory-mapped file, provided a list of offsets has been generated, though access is slower.

While the graph is now usable, the PageRank algorithm needs to be adjusted. The approach discussed in class will not work, as the matrix would still be too large. A more memory-efficient way of calculating PageRank is required. Taher Haveliwala describes a method in Efficient Computation of PageRank (1999): a multi-pass (per single PageRank iteration) technique that allows computation of PageRank in memory for very large graphs, assuming stream access to the graph and to the scores from the previous iteration:

Figure 2: Memory-efficient PageRank algorithm

For this algorithm we make successive passes over the links structure, using the previous rank values (held in Source) to compute the current iteration's rank values (held in Dest). The algorithm can stop when the difference between Source and Dest falls below some threshold (PageRank converges), or after a set number of iterations. This implementation requires that at least the current scores are stored in memory (scores from the previous round may be stored on disk if memory cannot hold both). For the Twitter dataset, with 42,000,000 nodes, and assuming 32-bit floats, each score array takes about 168 MB, so both the current and previous iteration scores can easily be held in memory.

However, this implementation also assumes that there are no dead ends in our graph. Dead ends leak PageRank, and we cannot easily remove them from our graph. We need to make some minor adjustments to compensate:

Figure 3: Modified PageRank algorithm to account for dead ends

If the graph has no dead ends, the amount of PageRank leaked is 1 − c. Since we have dead ends, the amount of leaked PageRank may be larger, and we must account for it by calculating S. This ensures that the rank vector is normalized to 1.

It is important to note that the PageRank algorithm assigns higher rank to web pages that have large indegrees. This is essentially the opposite of what we want for the Twitter follower relation, where accounts with large outdegrees (many followers) should be more popular. When we apply the algorithm as-is to the Twitter dataset, we are in fact calculating a Reverse PageRank. This means that nodes with a lower score are the more popular ones. There are a few ways of solving this:

1. We can reverse the direction of all of the edges before calculating PageRank. Due to issues with the dataset used for this project, this was not easily possible. The size and format led to the dataset becoming corrupted when I tried to transpose the graph.
2. We can ensure all of the ranks are normalized to 1, invert them, and then normalize again. However, since we used single-precision floats, accuracy is lost in this step and many ranks end up as 1.0 or 0.0. Using double precision would likely help, but this also doubles the memory usage and slows the running time significantly.

Alternatively, we may simply use this Reverse PageRank as an account's popularity by treating it in the same manner as golf scores: the lower the score, the more popular the account, and we order by ascending score rather than descending.

4 Experimental Design and Analysis

I chose to implement the PageRank algorithm in Java to take advantage of the WebGraph framework. First I calculated the FollowerRank of each node to use as a basis for comparison. This calculation is simple: a node's FollowerRank is its outdegree divided by the total number of edges in the graph (the fraction of all follower relations that belong to the node). I then ran the PageRank algorithm described previously, using the standard teleportation factor of 0.85. Rather than running to convergence, I chose to run for a set number of rounds, in this case 10, due to time constraints and the need to recover data after an unfortunate crash. Each round took approximately 10 minutes, for a total of just under 2 hours.

Figures 4 and 5 show the top ten accounts by followers, and then these same accounts sorted by their respective PageRank:

Figure 4: Top Twitter accounts by FollowerRank

Figure 5: Same accounts, ranked by Reverse PageRank

The order is significantly different. Notably, Obama's account has a rather poor ranking, which is at odds with how many followers he has. The fact that this is an outdated dataset makes analysis somewhat difficult, but based on a check of current Twitter statistics for these accounts, the bottom three in the above table all follow many users. For example, Obama currently follows 626,000 users, while Oprah follows only 288. Many of the other users in our dataset with the best rankings were random people who had only a few followers, but didn't follow anyone else. These accounts basically acted as sink nodes.
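As a rough illustration of the procedure used in this section, the sketch below computes FollowerRank and a fixed number of PageRank rounds with the Source and Dest arrays, re-adding leaked rank so the vector sums to 1. It is written in Python rather than the project's Java, holds the toy graph as plain lists instead of a WebGraph file, and uses double-precision floats. Treating S as the total rank left in Dest after a pass, with the missing 1 − S spread evenly over all nodes, is an assumption about Figure 3, which is not reproduced here.

```python
def follower_rank(successors):
    """Outdegree of each node divided by the total number of edges."""
    total_edges = sum(len(s) for s in successors)
    return [len(s) / total_edges for s in successors]


def pagerank(successors, c=0.85, rounds=10):
    """Fixed-round PageRank that compensates for dead ends.

    successors[u] lists the nodes that u points to. Dead ends contribute
    nothing during the pass; the rank they (and teleportation) leak is
    re-added uniformly afterwards.
    """
    n = len(successors)
    source = [1.0 / n] * n
    for _ in range(rounds):
        dest = [0.0] * n
        for u, succ in enumerate(successors):
            if not succ:
                continue  # dead end: its rank leaks and is re-added below
            share = c * source[u] / len(succ)
            for v in succ:
                dest[v] += share
        s = sum(dest)            # assumed meaning of S: rank kept this pass
        leaked = (1.0 - s) / n   # equals (1 - c) / n when there are no dead ends
        source = [r + leaked for r in dest]
    return source


# Hypothetical toy graph in the dataset's direction (edges point to followers):
# node 0 has three followers, node 1 has one, nodes 2 and 3 have none.
toy = [[1, 2, 3], [2], [], []]
baseline = follower_rank(toy)
ranks = pagerank(toy, c=0.85, rounds=10)

# "Golf score" ordering: ascending Reverse PageRank, most popular first.
by_popularity = sorted(range(len(ranks)), key=lambda u: ranks[u])
```

On this toy graph the most-followed node follows nobody, so it receives only the redistributed rank and comes first in the ascending order, which mirrors the behaviour discussed above.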

A possible solution to these random accounts appearing to be popular would be to recursively remove dead-end nodes, since many of these relation chains ended at accounts without any followers. However, the compressed graph is immutable, so we cannot modify it at runtime. Furthermore, our ids are sequential, with a separate file mapping these sequential ids to Twitter ids and account handles. To save memory, the sequential ids are omitted from text output and from our rank arrays (the array index corresponds to the node id). Removing any nodes would make our data unreadable. My attempts to preprocess the graph before compression were unsuccessful, and the dataset became corrupted several times.

So we cannot actually remove the nodes, though we can try to exclude them from our calculations. I attempted to implement a variant of PageRank that performed one iteration of dead-end removal, but it had several drawbacks. First, it increased our running time, since we must now iterate through each node's successors twice: first to count the hanging nodes and adjust the PageRank contribution factor, and then to distribute the ranking. Additionally, this simply left many new dead ends, as well as nodes that were completely disconnected from the rest of the graph.

5 Conclusions

Ultimately, PageRank does not seem particularly suited to evaluating a Twitter account's popularity, at least not without some modifications. It is likely that recursively removing all nodes with zero followers would improve the rankings, but we would still run into the issue of popular accounts serving as sink nodes. For example, in the current 2017 Twitter follower network, Taylor Swift has millions of followers, but doesn't follow anyone. Her account would be a massive sink node, but we wouldn't want to remove it.

There is a great deal of potential for future work.
Firstly, managing to remove all of the dead-end nodes and to transpose the graph would make the PageRank calculation both more intuitive and more meaningful. Furthermore, implementing a factor akin to Google's trust rank for ignoring spam and link-farm websites may help. Accounts that have existed for longer, have more followers, or possibly are verified could be considered more trustworthy. By doing this, users with few followers would contribute weakly, or not at all, to the popularity of the accounts that they follow. I had intended to implement some form of trust rank, but this project became a major lesson in the difficulties of working with big data, and most of my time was spent on issues related to memory usage.

I would also like to try this with a more modern dataset: Twitter in 2010 had 42,000,000 users, while today it has over 330,000,000. Not only would the results be more meaningful, they would also be easier to analyze, since fewer assumptions would be needed. However, collecting this data has proved difficult and time-consuming.
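One way the recursive dead-end removal mentioned above might be approached without modifying the immutable graph or renumbering ids is to compute a boolean mask over the existing node ids. The sketch below is an assumption-laden illustration, not something the project implemented: the function name is hypothetical, and it builds in-memory predecessor lists, which would itself be costly at the scale of the full Twitter graph.

```python
from collections import deque


def prune_dead_ends(successors):
    """Return alive[u] flags after recursively dropping every node that has
    no live successors. Node ids are untouched, so rank arrays indexed by
    node id stay readable."""
    n = len(successors)
    predecessors = [[] for _ in range(n)]
    for u, succ in enumerate(successors):
        for v in succ:
            predecessors[v].append(u)

    live_out = [len(s) for s in successors]  # successors still alive
    alive = [True] * n
    queue = deque(u for u in range(n) if live_out[u] == 0)
    while queue:
        u = queue.popleft()
        if not alive[u]:
            continue
        alive[u] = False
        # Removing u may turn its predecessors into new dead ends.
        for p in predecessors[u]:
            if alive[p]:
                live_out[p] -= 1
                if live_out[p] == 0:
                    queue.append(p)
    return alive


# Hypothetical graph: a chain 0 -> 1 -> 2 ending in a dead end,
# plus a two-node cycle 3 <-> 4 that survives pruning.
alive = prune_dead_ends([[1], [2], [], [4], [3]])
```

A PageRank pass could then skip nodes whose flag is False and divide each node's rank among its live successors only, which is the exclusion idea described in the text.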

As it is, the number of followers does seem to be a simple and effective estimate of a Twitter account's popularity, provided untrustworthy accounts could be factored out in some way.

References

[1] Boldi, P. and Vigna, S., 2004, May. The WebGraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web. ACM.
[2] Haveliwala, T., 1999. Efficient computation of PageRank. Stanford.
[3] Kwak, H., Lee, C., Park, H. and Moon, S., 2010, April. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web. ACM.
[4] Leskovec, J., Rajaraman, A. and Ullman, J.D. Mining of massive datasets. Cambridge University Press.
