Homework 1 Yang Zhang


Part 1: Using test-sm.nt as the dataset for this part:

(1) Run python dd.py to get the degree distribution:
Degree: 3, Frequency: 2
Degree: 4, Frequency: 3
Degree: 2, Frequency: 3

(2) Run python pr-d.py to get the PageRank of each vertex:
Vertex: node:5, PageRank:
Vertex: node:8, PageRank:
Vertex: node:6, PageRank:
Vertex: node:4, PageRank:
Vertex: node:7, PageRank:
Vertex: node:3, PageRank:
Vertex: node:2, PageRank:
Vertex: node:1, PageRank:

(3) Run python tr.py to get the number of triangles:
Number of Triangles: 4

(4) Eccentricities:
The eccentricity of <node:1> is 2.
The eccentricity of <node:2> is 3.
The eccentricity of <node:3> is 3.
The eccentricity of <node:4> is 3.
The eccentricity of <node:5> is 2.
The eccentricity of <node:6> is 2.
The eccentricity of <node:7> is 3.
The eccentricity of <node:8> is 3.

Part 4: The parallel implementation is much faster than the serial one. Each step in the serial implementation takes about 0.6 to 1.6 seconds, which means it would take thousands or even millions of seconds to reach convergence.

Explanation:
(1) Each SPARQL query issued from the Python code incurs a communication delay. The serial implementation issues queries much more frequently than the parallel one, so a lot of time is wasted on SPARQL-Python communication.
(2) The time SPARQL needs to handle a dataset is not linearly proportional to its size: processing K items in one query takes much less than K times as long as processing a single item.
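To make the Part 1 quantities concrete, the following is a minimal pure-Python sketch of how a degree distribution, a triangle count, and an eccentricity can be computed. It is not the code of dd.py or tr.py. The edge list EDGES is a hypothetical toy graph (not the contents of test-sm.nt), the graph is assumed to be undirected, and all function names are illustrative.

```python
from collections import Counter, deque
from itertools import combinations

# Hypothetical undirected toy graph; NOT the contents of test-sm.nt.
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_frequency(adj):
    """Map each degree to the number of vertices having that degree."""
    return Counter(len(neighbours) for neighbours in adj.values())


def count_triangles(adj):
    """Count each triangle once by only looking at ordered triples u < v < w."""
    total = 0
    for u in adj:
        larger = sorted(n for n in adj[u] if n > u)
        for v, w in combinations(larger, 2):
            if w in adj[v]:
                total += 1
    return total


def eccentricity(adj, source):
    """Largest shortest-path distance from source, found by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return max(dist.values())


adj = build_adjacency(EDGES)
for degree, frequency in degree_frequency(adj).items():
    print(f"Degree: {degree}, Frequency: {frequency}")
print(f"Number of Triangles: {count_triangles(adj)}")
print(f"The eccentricity of <node:1> is {eccentricity(adj, 1)}.")
```

The eccentricity here only considers vertices reachable from the source, which is sufficient for a connected graph.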

For (2)(3)(5): A Random-Walk Way of Graph Path Ranking Using SPARQL

1. General Idea:
To implement the random-walk algorithm on a directed graph, each path starts from a random vertex and repeatedly picks a random connected node to extend the path until it reaches the expected length. To make this parallel in SPARQL, K paths proceed at the same time. Running this process in a While loop keeps updating the count of visits to each vertex, and hence the percentage of visits each vertex receives. We assume, as in the PageRank concept, that the more important a vertex is, the more visits it gets. The percentage therefore represents the importance of the vertex within the graph.

After a certain number of iterations, the percentages converge to equilibrium values. We define the convergence rate as the maximum change in the visit percentages of the vertices between two adjacent iterations. The algorithm finishes once the convergence rate drops below a threshold.

A path is scored by adding up the percentages of the nodes it includes. To avoid over-scoring paths that contain cycles, a node is counted only once even if it was reached multiple times. A path with a higher score is more likely to connect important nodes.

2. Development:
2.1 Framework:
Three working graphs are used besides the default graph that stores the data.

Graph Name: workinggraph0
Function: To update the statistics (count, percentage, convergence rate).
Format: ?nodeId <temp:count> ?count
        ?nodeId <temp:percentage> ?percentage
        ?nodeId <temp:difference> ?difference

Graph Name: workinggraph1
Function: To save all previous visits, in both the accomplished paths and the earlier finished steps of the ongoing paths.
Format: ?pathId <step:N> ?nodeId
        N varies from 1 to LENGTH_OF_PATH

Graph Name: workinggraph2
Function: A buffer graph temporarily saving the generated next steps of the ongoing paths. Data in workinggraph2 is then moved to workinggraph1 before the next step is executed.
Format: ?pathId <step:M> ?nodeId
        M is the ongoing step
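The general idea can be illustrated without SPARQL. The sketch below is an in-memory simulation, not the homework implementation: it walks K random paths per iteration, accumulates visit counts, turns them into percentages, and reports the convergence rate as the maximum change between two adjacent iterations. The adjacency dict GRAPH, the function names, and the parameter values (k, length, iterations, seed) are all assumptions made for illustration.

```python
import random

# Hypothetical directed graph as {node: [successors]}; not the homework dataset.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}


def walk_batch(graph, k, length, rng):
    """Generate k random paths with up to `length` nodes each."""
    nodes = sorted(graph)
    paths = [[rng.choice(nodes)] for _ in range(k)]
    for _ in range(length - 1):
        for path in paths:
            successors = graph[path[-1]]
            if successors:  # a dead end simply stops growing
                path.append(rng.choice(successors))
    return paths


def visit_percentages(counts):
    """Share of all recorded visits that each vertex received."""
    total = sum(counts.values())
    return {node: count / total for node, count in counts.items()}


def convergence_rate(previous, current):
    """Maximum change of any vertex percentage between two iterations."""
    nodes = set(previous) | set(current)
    return max(abs(current.get(n, 0.0) - previous.get(n, 0.0)) for n in nodes)


def run(graph, k=50, length=4, iterations=200, seed=0):
    """Repeat K-parallel walks and return the final percentages and rate."""
    rng = random.Random(seed)
    counts = {node: 0 for node in graph}
    previous = {node: 0.0 for node in graph}
    rate = None
    for _ in range(iterations):
        for path in walk_batch(graph, k, length, rng):
            for node in path:
                counts[node] += 1
        current = visit_percentages(counts)
        rate = convergence_rate(previous, current)
        previous = current
    return previous, rate


percentages, rate = run(GRAPH)
for node, share in sorted(percentages.items(), key=lambda item: -item[1]):
    print(f"{node}: {share:.3f}")
print(f"convergence rate after the last iteration: {rate:.6f}")
```

In this toy graph, vertex "d" has no incoming edges, so it is visited only as a starting vertex and ends up with the smallest percentage.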

2.2 Generating random starting vertices:

SELECT DISTINCT ?startNode
WHERE {
  { ?startNode ?p ?o } UNION { ?o ?p ?startNode }
  BIND(RAND() AS ?sortKey)
}
ORDER BY ?sortKey
LIMIT 1 #for serial processing
LIMIT 2 #for parallel processing

2.3 Routing Randomly
Selecting a random connection to extend a path is similar to generating random starting vertices, except that an inner projection needs to be generated for each path:

SELECT ?pathId (SAMPLE(?_nextNode) AS ?nextNode)
{
  SELECT ?pathId ?currentNode ?_nextNode
  {
    GRAPH <...> { ?pathId <step:n> ?currentNode }
    ?currentNode ?p ?_nextNode
    BIND(RAND() AS ?orderKey)
  }
  ORDER BY ?orderKey
}
GROUP BY ?pathId

2.4 Updating the statistics
After each iteration, the newly generated visits are moved from workinggraph2 to workinggraph1. The query then counts both the total number of all previous visits and the number of visits to each vertex. Dividing the latter by the former gives the percentage of visits to each vertex. To update the convergence rate, the difference between the new and the previous percentage of each node is calculated, and the maximum of these differences is the convergence rate of the current iteration.

2.5 Checking if it is convergent
A buffer list of size 10 is defined in Python to store the convergence rates of the ten most recent iterations. When a new iteration finishes, the first element of the buffer list is popped and the new convergence rate is pushed to the end. Only when all 10 convergence rates are smaller than the threshold is the random walk considered convergent, and the iterations stop. This prevents stopping prematurely because two iterations happen to be similar. The convergence threshold is set to 1 / (number of vertices × 10).

2.6 Path Ranking:
Retrieve the nodes on a path and add the percentages up as the score of this path.
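The convergence check of section 2.5 can be sketched with a fixed-size buffer. This is a minimal illustration rather than the code used in the homework: the class name ConvergenceChecker, its method update, and the threshold value in the usage lines are assumptions; only the window of 10 and the rule that all buffered rates must be below the threshold come from the text.

```python
from collections import deque


class ConvergenceChecker:
    """Keep the most recent convergence rates and test them against a threshold."""

    def __init__(self, threshold, window=10):
        self.threshold = threshold
        # A deque with maxlen drops its oldest element when a new one is appended.
        self.recent = deque(maxlen=window)

    def update(self, rate):
        """Record a new rate; return True only if the buffer is full and
        every buffered rate is below the threshold."""
        self.recent.append(rate)
        return (len(self.recent) == self.recent.maxlen
                and all(r < self.threshold for r in self.recent))


# Usage with an illustrative threshold (hypothetical value).
checker = ConvergenceChecker(threshold=1e-5)
rates = [1e-3] + [1e-6] * 10
for iteration, rate in enumerate(rates, start=1):
    if checker.update(rate):
        print(f"convergent after iteration {iteration}")
        break
```

Requiring a full buffer means a single small rate early in the run can never stop the loop, which is the safeguard section 2.5 describes.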

SELECT ?pathId (SUM(?per) AS ?score) (GROUP_CONCAT(?nodeId; SEPARATOR="->") AS ?nodes)
WHERE {
  SELECT DISTINCT ?pathId ?nodeId ?per
  WHERE {
    GRAPH <...> { ... }
    GRAPH <...> { ?nodeId <temp:percentage> ?per . }
  }
}
GROUP BY ?pathId
ORDER BY DESC(?score)

3. Implementation
3.1 Serial

While True:
    getstartnode(1);
    For step = 1 to LENGTH_OF_PATH
        getnextnode();
    Endfor
    If isconvergent():
        Break; //if convergent, jump out of the while loop
    Endif
PathRanking();

3.2 Implementation: K-Parallel
The parallel implementation is very similar to the serial one, except that it gets K start nodes and extends K paths at the same time. The value K (or 1 for the serial case) is passed as a parameter to the SPARQL queries.

While True:
    getstartnode(k);
    For step = 1 to LENGTH_OF_PATH
        getnextnodeforkpaths();
    Endfor
    If isconvergent():
        Break; //if convergent, jump out of the while loop
    Endif
PathRanking();
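The scoring rule behind the path-ranking query of section 2.6 (sum the percentages of the distinct nodes on each path, then sort by score) can be mirrored in plain Python. This is a sketch under assumptions: the function score_paths, the sample paths, and the sample percentages are hypothetical, and unlike GROUP_CONCAT in SPARQL, which does not guarantee an order, this version keeps nodes in first-visit order.

```python
def score_paths(paths, percentages):
    """Return (path_id, score, node_string) tuples sorted by descending score.

    paths:       {path_id: [node, node, ...]} in visiting order
    percentages: {node: share of visits}
    """
    ranked = []
    for path_id, nodes in paths.items():
        # dict.fromkeys removes repeats while keeping first-visit order,
        # so a node reached several times (a cycle) is counted only once.
        distinct = list(dict.fromkeys(nodes))
        score = sum(percentages.get(node, 0.0) for node in distinct)
        ranked.append((path_id, score, "->".join(distinct)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


# Hypothetical sample data for illustration only.
sample_percentages = {"a": 0.5, "b": 0.3, "c": 0.2}
sample_paths = {"p1": ["a", "b", "a"], "p2": ["b", "c"]}
for path_id, score, nodes in score_paths(sample_paths, sample_percentages):
    print(f"{path_id}: {score:.2f} ({nodes})")
```

Path p1 revisits node "a", but "a" contributes to the score only once, matching the rule stated in the general idea.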

4. Case Study:
We tested the implementation on a VMware Ubuntu virtual machine with 2 GB of memory assigned. The test graph has 10,000 vertices and 104,250 edges. When testing the serial implementation, we found that each step takes 0.6 to 1.7 seconds, which means it takes 6,000 to 17,000 seconds to generate 10,000 random paths, and even that is far from reaching convergence. The serial implementation is far too time-consuming.

We executed the algorithm in parallel with 100, 500, 1000, and 5000 threads respectively. The following figure shows the relationship between the convergence rate and the execution time; 1000-Parallel reaches convergence faster than 100- and 5000-Parallel. This suggests that a suitable number of parallel paths is around 5% to 10% of the number of vertices in the graph. Too many or too few parallel paths degrade performance.

[Figure: convergence rate versus time (s) for 100-Parallel, 500-Parallel, 1000-Parallel, and 5000-Parallel, with the convergence threshold marked.]
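The serial-time estimate in the case study is a simple multiplication, which the short sketch below reproduces. The function name serial_time_range is hypothetical, and the sketch follows the report's own arithmetic in charging one timed step per generated path.

```python
def serial_time_range(num_paths, min_seconds=0.6, max_seconds=1.7):
    """Lower and upper bound on total serial time, in seconds,
    assuming one timed step per generated path."""
    return num_paths * min_seconds, num_paths * max_seconds


low, high = serial_time_range(10_000)
print(f"{low:,.0f} to {high:,.0f} seconds")
print(f"about {low / 3600:.1f} to {high / 3600:.1f} hours")
```

Expressed in hours, the same range makes it clearer why the serial run was stopped well before convergence.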
