SEMINAR: GRAPH-BASED METHODS FOR NLP

Piers Houston
6 years ago

1 SEMINAR: GRAPH-BASED METHODS FOR NLP
Organizational matters:
- The seminar takes place entirely in May
- Seminar papers are due by July 15 (?)
- Guidance for the seminar talk and the paper is on the website
- Tucan number for registration: se

2 Schedule


4 Motivation for graph representation
- Graphs are an intuitive and natural way to encode entities (e.g. language units) as nodes and their relations (e.g. similarities) as edges (directed / undirected)
- A feature-based representation can be transformed into a graph via a similarity measure
- A graph cannot necessarily be transformed back into a feature representation (at least not a unique one). Think of e.g. points in n-dimensional space.
- Graph isomorphism

5 Graph representations
- Adjacency matrix
- Adjacency list
Additional information such as weights can easily be stored in either.
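The two representations can be sketched side by side. This is a minimal illustration on a made-up toy graph (the node ids, edges and weights are assumptions, not taken from the slides):

```python
# Hypothetical toy graph: 4 nodes, weighted undirected edges (u, v, weight).
edges = [(0, 1, 0.5), (0, 2, 1.0), (2, 3, 2.0)]
n = 4

# Adjacency matrix: n x n table, 0.0 means "no edge".
matrix = [[0.0] * n for _ in range(n)]
for u, v, w in edges:
    matrix[u][v] = w
    matrix[v][u] = w  # undirected: mirror the entry

# Adjacency list: each node maps to its neighbours and the edge weights.
adj = {u: {} for u in range(n)}
for u, v, w in edges:
    adj[u][v] = w
    adj[v][u] = w

print(matrix[2][3])    # constant-time weight lookup
print(sorted(adj[0]))  # neighbours of node 0
```

The matrix needs space for all n*n node pairs, while the list only stores existing edges, which matters for the large sparse graphs typical in NLP.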

6 Motivation for graph representation
There exist efficient algorithms that operate directly on graphs

7 P = NP? Efficient Algorithms?

8 Efficient Algorithms!
There are efficient (polynomial) algorithms for the exact solution of many problems on graphs, e.g.
- Graph traversal (DFS, shortest paths, max-capacity paths, ...)
- Optimal trees and branchings (MST, MAX-FOREST, MAX-BRANCHING, ...)
- Graph clustering (Min-Cut, Markov Clustering, Chinese Whispers, ...)
- Graph ranking (PageRank, random walks, Markov chain theory)
- Graph distances (local: paths, global: graph edit distance, ...)
- Flows on graphs (MAX-FLOW, MIN-COST FLOW, ...)
- Matching and assignment (Hungarian method, Edmonds' algorithm)
- many more

9 Efficient Algorithms!
There are efficient approximation algorithms and heuristics for the approximate solution of many graph problems, e.g.
- Subgraph problems (dense subgraphs, minors, ...)
- Optimal tour problems (TSP, PCTSP, VRP, ...)
- Steiner trees
- many more
There are simple heuristics that often yield quite good results, for example k-OPT for the Euclidean TSP.

10 Why efficiency is crucial
- Graphs are usually large-scale: in 2008, English Wikipedia had [...] articles* with [...] links between them
- Graphs are usually dense and strongly connected: the largest strongly connected component of Wikipedia has [...] articles
- Remember from the last lecture: graphs in NLP are usually scale-free and have the small-world property (high clustering coefficient)
→ Problem solutions often consider only small subgraphs (local neighborhoods), but an a priori partitioning is usually not possible (this yields small time complexity but full space complexity)
* by today there are almost 4 million articles

11 PageRank
- First-generation Google global ranking algorithm (1998)
- Measures the (query-independent) importance of a Web page based solely on the link structure
- Assign each node a numerical score between 0 and 1, its PageRank
- Rank Web pages based on their PageRank values
General idea:
- every page has a number of in-links (back links) and out-links (forward links)
- pages with more in-links are more important
- in-links from important pages are more important

12 PageRank


13 Definition of PageRank
- u: a web page, R(u): its PageRank
- F_u: set of pages u points to (forward links)
- B_u: set of pages that point to u (backward links)
- |F_u|: the number of links from u
- N: total number of pages
- d: damping factor, default d = 0.85

$$R(u) = d \sum_{v \in B_u} \frac{R(v)}{|F_v|} + \frac{1 - d}{N}$$

The equation is recursive, but it may be computed by starting with any set of ranks and iterating the computation until it converges.
Rank sink problem: a cycle of pages accumulates rank within the cycle but never distributes rank outside.
Damping is needed: a share of the rank is distributed uniformly over all pages.

14 Random Surfer Model
When PageRank is normalized to sum to 1 over all pages, R(u) can be thought of as the probability that a random surfer looks at page u.
Damping corresponds to "teleportation": with probability 1 - d, the random surfer is teleported to some other page instead of following a link.

15 Computation of PageRank
- Numeric: simulate a lot of random surfers
- The power method of eigenvector computation:
    initialize all pages with the same rank
    repeat until convergence:
      for all pages u: compute R_{t+1}(u) on the basis of R_t(v)
      t := t+1

    input: matrix size N, error tolerance ε
    output: eigenvector p
    p_0 = (1/N) · 1;
    t = 0;
    repeat
      t = t+1;
      p_t = M^T p_{t-1};
      δ = ||p_t - p_{t-1}||;
    until δ < ε;
    return p_t;
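The iteration above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the function name `pagerank`, the dictionary input format, the toy link graph, and the uniform redistribution of rank from pages without out-links are all assumptions made here.

```python
def pagerank(links, d=0.85, eps=1e-10, max_iter=1000):
    """links: dict page -> list of pages it points to (forward links)."""
    nodes = sorted(set(links) | {v for vs in links.values() for v in vs})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}  # start with the same rank everywhere
    for _ in range(max_iter):
        # Assumption: rank of pages without out-links is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new = {u: (1.0 - d) / n + d * dangling / n for u in nodes}
        for u in nodes:
            out = links.get(u, [])
            for v in out:
                new[v] += d * rank[u] / len(out)  # u passes on R(u) / |F_u|
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < eps:  # converged
            break
    return rank

# Hypothetical four-page web.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
toy_rank = pagerank(toy_links)
for page in sorted(toy_rank, key=toy_rank.get, reverse=True):
    print(page, round(toy_rank[page], 4))
```

In this toy web, page D has no in-links, so it only receives the uniform share (1 - d) / N.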

16 LexRank: Application to Multi-Document Summarization
Multi-document summarization task:
1. identify important topics of the documents to be summarized
2. identify sentences belonging to a certain topic
3. from these sentences belonging to the same topic, select the ones that best describe the topic
4. concatenate sentences from different topics and make sure they fit together
Consider sub-problem 3:
- Input: sentences that talk about more or less the same thing
- Output: scores for those sentences that reflect how well a single sentence represents that topic
Solution idea: use measures on the sentence similarity graph


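17 From Sentences to TF*IDF vectors
Example sentences:
1. This is a sentence that talks about some topic.
2. And here is another sentence that talks about something slightly different.
3. And here is yet another one of these notorious sentences.
TF: count of each word w_1..w_n in a sentence
DF(w): number of sentences that contain w

$$IDF(w) = \log\left(\frac{\text{total number of sentences}}{DF(w)}\right)$$

[Table: TF and TF*IDF values per sentence over the words w_1, w_2, w_3, ..., w_n; each row is the feature vector of one sentence, e.g. the second row is the feature vector of the second sentence]
This is the same as the vector space model for Information Retrieval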
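As a rough sketch of this step, the following computes TF*IDF vectors with the IDF definition above. The function name `tfidf_vectors` and the naive tokenizer (lowercasing, dropping full stops, splitting on whitespace) are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Return the sorted vocabulary and one TF*IDF vector per sentence."""
    # Assumption: naive tokenizer, good enough for a toy example.
    tokenized = [s.lower().replace(".", "").split() for s in sentences]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(tokenized)
    # DF(w): number of sentences containing w.
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n / df[w]) for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)  # TF: raw count of w in this sentence
        vectors.append([tf[w] * idf[w] for w in vocab])
    return vocab, vectors

example = [
    "This is a sentence that talks about some topic.",
    "And here is another sentence that talks about something slightly different.",
    "And here is yet another one of these notorious sentences.",
]
example_vocab, example_vectors = tfidf_vectors(example)
print(len(example_vocab), len(example_vectors))
```

A word that occurs in every sentence (here "is") gets IDF log(1) = 0 and therefore contributes nothing to any vector.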

18 From TF*IDF vectors to sentence similarity graph
Sentence similarity graph:
- nodes: sentences
- edges: cosine similarity between sentence feature vectors
Can apply a threshold on the similarity or use the similarity as edge weight
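A minimal sketch of the graph construction, assuming the sentence vectors are plain lists of numbers. The function names and the threshold value 0.1 are illustrative choices, not values from the slides.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity_graph(vectors, threshold=0.1):
    """Undirected weighted graph: node i is sentence i, weights are cosine values."""
    n = len(vectors)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine(vectors[i], vectors[j])
            if sim > threshold:  # drop weak edges, keep the rest as weights
                graph[i][j] = sim
                graph[j][i] = sim
    return graph

# Hypothetical 2-dimensional sentence vectors.
toy_graph = similarity_graph([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(toy_graph)
```

Setting the threshold to 0 keeps every pair with positive similarity, which corresponds to using the similarity purely as edge weight.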

19 Measures: Centroid, Degree and Centrality
Centroid
- Idea: select an "average" sentence
- compute the average point of the sentence vectors (centroid)
- for summarization, select the sentence that is most similar to the centroid
Degree Centrality
- Idea: sentences that cover most of the content have a high node degree (number of edges): since word overlap is responsible for edges, node degree measures word overlap with the overall set of sentences
- for summarization, choose the sentence with the highest degree
LexRank Centrality
- Idea: it does not suffice to be similar to many sentences: similarity to important sentences counts more
- normalize the sentence similarity adjacency matrix to make it a stochastic matrix
- run PageRank to obtain scores that are used for ranking the sentences
- for summarization, choose the sentence with the highest score
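Degree centrality and the LexRank idea can be sketched on a weighted similarity graph as follows. The function names, the fixed iteration count and the toy weights are assumptions; the normalization by each node's total edge weight is one way to obtain the stochastic matrix mentioned above.

```python
def degree_centrality(graph):
    """Number of edges per node; graph: dict node -> dict neighbour -> weight."""
    return {u: len(nbrs) for u, nbrs in graph.items()}

def lexrank(graph, d=0.85, iters=100):
    """PageRank on the similarity graph, with weights normalized per node."""
    nodes = list(graph)
    n = len(nodes)
    score = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - d) / n for u in nodes}
        for u in nodes:
            total = sum(graph[u].values())
            if total == 0:
                # Assumption: an isolated sentence spreads its score uniformly.
                for v in nodes:
                    new[v] += d * score[u] / n
                continue
            for v, w in graph[u].items():
                new[v] += d * score[u] * w / total  # stochastic normalization
        score = new
    return score

# Hypothetical similarity graph over four sentences.
sentence_graph = {
    0: {1: 0.5, 2: 0.5, 3: 0.5},
    1: {0: 0.5, 2: 0.2},
    2: {0: 0.5, 1: 0.2},
    3: {0: 0.5},
}
scores = lexrank(sentence_graph)
print(max(scores, key=scores.get), degree_centrality(sentence_graph))
```

For summarization one would pick the sentence with the highest score, here the one that is similar to all others.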

20 Evaluation of graph-based multi-document summarization
Scores:
- ROUGE metric: similar to BLEU, computed between manual summaries and system summaries
- random baseline: select any sentence from the set by chance
- lead-based: select based on the position of the sentence within the document
→ LexRank is a simple method for getting high scores. It uses the whole structure of the graph, as opposed to Centroid or Degree. This technique also works well for single-document summarization.


21 TextRank for Keyword Extraction
Keyword extraction: find the most salient keywords for a document
Keyword extraction with PageRank:
- preprocess the document: identify adjectives and nouns as targets
- build the target co-occurrence graph: connect targets co-occurring within a window of 2-10 words
- apply PageRank to get ranking scores on nodes
- select the highest-scoring keywords, possibly concatenating ADJ-NOUN-NOUN sequences if present in the text
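The pipeline can be sketched as below. Two simplifications are assumed here: part-of-speech filtering has already happened (the input is the list of adjectives and nouns in text order), and the window is measured over that filtered list rather than over the raw text. The function name and the sample word list are hypothetical.

```python
def textrank_keywords(targets, window=2, d=0.85, iters=50, top_k=3):
    """Rank candidate words; `targets` are pre-filtered adjectives/nouns in order."""
    # Undirected co-occurrence graph: link targets less than `window` positions apart.
    graph = {w: set() for w in targets}
    for i, w in enumerate(targets):
        for other in targets[i + 1:i + window]:
            if other != w:
                graph[w].add(other)
                graph[other].add(w)

    # Unweighted PageRank on the co-occurrence graph.
    n = len(graph)
    score = {w: 1.0 / n for w in graph}
    for _ in range(iters):
        new = {w: (1.0 - d) / n for w in graph}
        for w, nbrs in graph.items():
            for other in nbrs:
                new[other] += d * score[w] / len(nbrs)
        score = new
    ranked = sorted(graph, key=lambda w: score[w], reverse=True)
    return ranked[:top_k]

# Hypothetical sequence of adjectives and nouns from a document.
words = ["linear", "systems", "linear", "constraints", "linear", "equations"]
print(textrank_keywords(words, top_k=1))
```

Merging adjacent keywords into multi-word phrases (the ADJ-NOUN-NOUN step) is left out of this sketch.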

22 Keyword Extraction Evaluation
Comparison: supervised system that is trained on manually assigned keywords, using frequency and contextual features
Note that TextRank is unsupervised: no training necessary

23 Graph Clustering
- Task: find meaningful groups of nodes in a graph by cutting edges
- Intuition: connectedness within a cluster is higher than between clusters
- Many graph clustering algorithms find the number of clusters automatically


24 Clustering by Min-Cut / Max-Flow
MinCut algorithm: hierarchical top-down clustering
- compute the minimum cut: the set of edges with the smallest edge weight sum whose removal disconnects one set of nodes from another
- recursively apply to the components that got disconnected
Finding the minimum cut is equivalent to finding the maximum flow in a network
Advantage: efficient. The fastest known algorithm has a per-cut complexity of O(|E| + log^3(|V|))
Disadvantage: unbalanced cuts; when to stop?
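The min-cut / max-flow equivalence can be illustrated with a small sketch. It uses the Edmonds-Karp max-flow algorithm (BFS augmenting paths), which is a textbook method chosen here for brevity and not the fastest algorithm referred to above. The function names, the choice of source and sink, and the toy graph are assumptions.

```python
from collections import deque

def min_cut(capacity, s, t):
    """Return (cut weight, nodes on the s side) for s != t via max-flow."""
    res = {}  # residual capacities, with reverse edges initialised to 0
    for u, nbrs in capacity.items():
        for v, c in nbrs.items():
            res.setdefault(u, {})
            res.setdefault(v, {})
            res[u][v] = res[u].get(v, 0) + c
            res[v].setdefault(u, 0)
    res.setdefault(s, {})
    res.setdefault(t, {})
    flow = 0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            # No path left: nodes still reachable from s form one side of the cut.
            return flow, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(res[a][b] for a, b in path)  # bottleneck capacity
        for a, b in path:
            res[a][b] -= push
            res[b][a] += push
        flow += push

def undirected(edges):
    """Turn (u, v, weight) triples into a symmetric capacity dictionary."""
    cap = {}
    for u, v, w in edges:
        cap.setdefault(u, {})[v] = w
        cap.setdefault(v, {})[u] = w
    return cap

# Hypothetical graph: two triangles joined by one weak edge c-d.
toy_edges = [("a", "b", 3), ("b", "c", 3), ("a", "c", 3), ("c", "d", 1),
             ("d", "e", 3), ("e", "f", 3), ("d", "f", 3)]
cut_value, source_side = min_cut(undirected(toy_edges), "a", "f")
print(cut_value, sorted(source_side))
```

A hierarchical clustering would then recurse on each side of the cut until some stopping criterion is met.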


25 Markov Chain Clustering
Clustering based on random walks: MCL is the parallel simulation of all possible random walks up to a finite length on a graph G
Idea: a random walker on the graph is more likely to stay within the same cluster than to end up in a different cluster after a small number of steps
Algorithm (can show convergence to a limit T):
- Add loops: transition matrix T = column-normalize(A_G + I)
- MCL process: alternate between
    T = T^t          // expansion: raise T to the power t
    T = inflate(T)   // inflation: increase contrast within columns by raising values to the power s (s > 1) and normalizing column-wise
- Interpret T as a clustering: use the strongest connection as label
Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May

26 Expansion step: simulate the random walk
- (stochastic) adjacency matrix T: probabilities to walk from the node in the column to the node in the row in a single step
- T^2: probabilities to walk from A to B in 2 steps
[Figure: example graph shown as A_G, with loops added, as T, and as T^2]

27 Inflation step: only keep attractors
[Figure: example column, squared (x^2) and then normalized]
- Inflate the differences within a column by taking the k-th power of each value, then normalize to ensure the stochastic property
- k regulates the cluster sizes
- Clustering: the highest entry in a column vector is the cluster label
- Variants: could add small random noise to break ties
- Optimization: only keep the K largest values, or only keep values over a threshold
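Expansion and inflation can be put together in a short sketch. It uses plain nested lists instead of a matrix library, a fixed number of iterations instead of a convergence test, and none of the pruning optimizations; the function names, default parameters and the toy graph are assumptions.

```python
def normalize_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mcl(adjacency, expansion=2, inflation=2.0, iters=30):
    """Return one cluster label per node: the row of the largest column entry."""
    n = len(adjacency)
    # Add loops, then column-normalize: T = column-normalize(A_G + I).
    t = normalize_columns([[adjacency[i][j] + (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])
    for _ in range(iters):
        expanded = t
        for _ in range(expansion - 1):  # expansion: T to the given power
            expanded = matmul(expanded, t)
        # Inflation: entry-wise power, then renormalize the columns.
        t = normalize_columns([[x ** inflation for x in row] for row in expanded])
    return [max(range(n), key=lambda i: t[i][j]) for j in range(n)]

# Hypothetical graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2-3.
triangles = [[0.0] * 6 for _ in range(6)]
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    triangles[u][v] = triangles[v][u] = 1.0
print(mcl(triangles))
```

Nodes that share a label are attracted to the same node and form one cluster; a larger inflation exponent tends to produce more, smaller clusters.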

28 Chinese Whispers Graph Clustering
MCL: keep only a few strong neighbors
Chinese Whispers: only propagate the strongest label in the neighborhood
Nodes have a class and communicate it to their adjacent nodes:
    initialize:
      forall v_i in V: class(v_i) = i;
    while changes:
      forall v in V, randomized order:
        class(v) = highest ranked class in neighborhood of v;
[Figure: node A (label L1, deg=4) with neighbours B (L4, deg=2), C (L3, deg=3), D (L2, deg=1) and E (L3, deg=3)]
- A node adopts one of the majority classes in its neighbourhood
- Nodes are processed in random order for some iterations
- Node weighting schemes
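The pseudocode above translates almost directly into Python. In this sketch the class rank is the summed edge weight per class; the tie-breaking rule (keep the current class if it is among the winners, otherwise pick randomly), the iteration cap, the fixed seed and the toy graph are assumptions made here.

```python
import random
from collections import defaultdict

def chinese_whispers(graph, max_iters=50, seed=0):
    """graph: dict node -> dict neighbour -> edge weight (undirected)."""
    rng = random.Random(seed)
    label = {v: v for v in graph}  # initialize: every node is its own class
    nodes = list(graph)
    for _ in range(max_iters):
        rng.shuffle(nodes)         # randomized processing order
        changed = False
        for v in nodes:
            votes = defaultdict(float)
            for u, w in graph[v].items():
                votes[label[u]] += w  # sum edge weights per neighbouring class
            if not votes:
                continue              # isolated node keeps its class
            best = max(votes.values())
            winners = sorted(c for c, s in votes.items() if s == best)
            new = label[v] if label[v] in winners else rng.choice(winners)
            if new != label[v]:
                label[v] = new
                changed = True
        if not changed:               # "while changes" from the pseudocode
            break
    return label

def clique(names):
    return {u: {v: 1.0 for v in names if v != u} for u in names}

# Hypothetical graph: two 4-cliques joined by one weak edge a3-b0.
cw_graph = {**clique(["a0", "a1", "a2", "a3"]), **clique(["b0", "b1", "b2", "b3"])}
cw_graph["a3"]["b0"] = cw_graph["b0"]["a3"] = 0.1
cw_labels = chinese_whispers(cw_graph)
print(cw_labels)
```

The number of clusters is not specified anywhere; it emerges from the label propagation.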

29 Disambiguation using Resource Graphs

30 Disambiguation of Named Entities using Resource Graphs
- Wikipedia link graph
- (Shortest) paths are one possibility

31 Disambiguation of Named Entities using Resource Graphs
(Shortest) paths are one possibility. What else?
- maximum capacity paths (capacities needed, e.g. coherence, probabilities, ...)
- maximum flows (Attention: small-world graph! Path length must be bounded!)
- apply PageRank to weight nodes
Semantic enrichment: use the nodes on the paths / flows for enrichment to overcome the knowledge acquisition bottleneck
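The shortest-path variant can be sketched with a bounded breadth-first search: among the candidate entities for an ambiguous name, pick the one closest to an entity from the context. The link graph, the entity names and both function names below are invented for illustration and do not reflect real Wikipedia data.

```python
from collections import deque

def path_length(graph, source, target, max_len=3):
    """Hops on a shortest path from source to target, or None beyond max_len."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            return dist[u]
        if dist[u] == max_len:
            continue  # bounded search: do not expand further
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None

def disambiguate(graph, candidates, context, max_len=3):
    """Pick the candidate with the shortest path to any context entity."""
    best, best_len = None, None
    for cand in candidates:
        for ctx in context:
            length = path_length(graph, cand, ctx, max_len)
            if length is not None and (best_len is None or length < best_len):
                best, best_len = cand, length
    return best

# Hypothetical link graph (page -> linked pages).
link_graph = {
    "Jaguar (animal)": ["Felidae"],
    "Felidae": ["Rainforest"],
    "Jaguar Cars": ["Automobile"],
    "Automobile": ["Coventry"],
}
print(disambiguate(link_graph, ["Jaguar (animal)", "Jaguar Cars"], ["Coventry"]))
```

The `max_len` bound reflects the small-world warning above: without it, almost every pair of nodes is connected by some short path.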

32 Summary on Graph Methods in NLP
- Graph representation is a natural representation of entities and their relations
- We can use well-known (efficient) graph algorithms to solve specific NLP problems
- Taking the overall structure into account might improve some NLP tasks (enriching semantics)
- Graph clustering algorithms solve unsupervised NLP tasks without the need to specify the number of clusters
- We can enrich information by walks on graphs
