Extractive Text Summarization Techniques

1 Extractive Text Summarization Techniques
Tobias Elßner
Hauptseminar NLP Tools

2 Overview
- Rough classification (Gupta and Lehal, 2010):
  - Supervised vs. unsupervised
  - Extractive vs. abstractive
- Focus on unsupervised extractive approaches:
  - Keyword extraction
  - Collection of the most important sentences

3 Luhn's Method
- Described by Luhn (1958)
- Heuristic method for summarizing technical documents
- Sentences are ranked by the number of co-occurrences of significant words, usually within a window of 4-5 words
- The highest-scoring sentences of each paragraph are then selected
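The ranking step above can be sketched as follows. This is a minimal illustration, not Luhn's exact procedure: the frequency threshold `min_count`, the stopword set, and the function name `luhn_scores` are assumptions made for this example. The score used here is the commonly cited significance factor (squared number of significant words in a cluster, divided by the cluster length), where a cluster is a run of significant words separated by at most `max_gap` other words.

```python
import re
from collections import Counter

def luhn_scores(sentences, stopwords, min_count=2, max_gap=4):
    """Score each sentence with a Luhn-style significance factor.

    Assumption: a word is "significant" if it is not a stopword and
    occurs at least `min_count` times in the whole text.
    """
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    counts = Counter(w for sent in tokenized for w in sent if w not in stopwords)
    significant = {w for w, c in counts.items() if c >= min_count}

    scores = []
    for words in tokenized:
        positions = [i for i, w in enumerate(words) if w in significant]
        best = 0.0
        start = 0  # index (into positions) where the current cluster begins
        for k in range(len(positions)):
            is_last = k == len(positions) - 1
            # Close the cluster if more than max_gap words separate this
            # significant word from the next one.
            if is_last or positions[k + 1] - positions[k] - 1 > max_gap:
                n_significant = k - start + 1
                span = positions[k] - positions[start] + 1
                best = max(best, n_significant ** 2 / span)
                start = k + 1
        scores.append(best)
    return scores
```

Selecting the top-scoring sentence per paragraph then amounts to calling `luhn_scores` and taking the maximum within each paragraph's sentences.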

4 Luhn's Method: Significant Words [figure]

5 Keyword Extraction: tf.idf
As described in Erkan and Radev (2004):
- $\mathrm{tf}_t = \frac{\mathrm{count}(t)}{\sum_{t'} \mathrm{count}(t')}$: the probability of term $t$ in a document
- $\mathrm{idf}_t = \log\left(\frac{N}{n_t}\right)$: the self-information of $t$ with respect to documents
  - $N$: number of all documents
  - $n_t$: number of all documents containing term $t$
- $\mathrm{tf.idf}_t = \mathrm{tf}_t \cdot \mathrm{idf}_t$: the term frequency weighted by self-information

6 tf.idf: Example
- "the": high frequency within a document, but it probably occurs in every document, so $n_t = N$ and $\log(N / n_t) = \log(1) = 0$
- "linguistics": low frequency within a document, but it probably occurs in only a few documents, so $n_t < N$ and $\log(N / n_t) > \log(1) = 0$
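The tf.idf definitions and the "the" vs. "linguistics" contrast can be checked with a small sketch. The function name `tf_idf` and the three toy documents are assumptions made for illustration; tokenization is plain whitespace splitting.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: tf.idf} dict per tokenized document."""
    n_docs = len(documents)
    # n_t: number of documents that contain term t at least once
    doc_freq = Counter(t for doc in documents for t in set(doc))
    result = []
    for doc in documents:
        counts = Counter(doc)
        total = sum(counts.values())
        result.append({
            # tf_t * idf_t = (count(t) / sum of counts) * log(N / n_t)
            t: (c / total) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        })
    return result

# Toy corpus (hypothetical): "the" occurs in every document,
# "linguistics" in only one.
docs = [
    "the study of the language is linguistics".split(),
    "the cat sat on the mat".split(),
    "the weather is nice".split(),
]
weights = tf_idf(docs)
```

In this corpus `weights[0]["the"]` is zero because the idf factor is log(1), while `weights[0]["linguistics"]` is positive even though the word occurs only once.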

7 Keyword Extraction: TextRank
- Described by Mihalcea and Tarau (2004)
- Based on PageRank
- Uses an undirected, unweighted graph G = (V, E) of co-occurring terms in a document
- Each term is initialized with a score of 1
- Iterate the ranking algorithm until convergence, typically within a few tens of iterations
- Use the terms with the highest scores as keywords
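The graph construction step can be sketched as follows. This is a simplified illustration: the function name `cooccurrence_graph` and the `window` parameter are assumptions for this example, and any filtering of candidate terms (for instance by part of speech) is left out.

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Build an undirected, unweighted co-occurrence graph.

    Terms are vertices; two distinct terms are linked if they occur
    within `window` tokens of each other (window=2 means adjacent).
    Returns a dict mapping each term to the set of its neighbours.
    """
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != term:
                graph[term].add(other)
                graph[other].add(term)
    return dict(graph)
```

Storing neighbours as sets keeps the graph unweighted: repeated co-occurrences of the same pair add no extra edges.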

8 Ranking Algorithm
$$S(V_i) = (1 - d) + d \sum_{j \in IN(V_i)} \frac{1}{|OUT(V_j)|} S(V_j)$$
- $d \in [0, 1]$: damping factor, usually set to 0.85
- $d$ is the probability of following an edge from the current vertex; $1 - d$ is the probability of jumping to a random vertex
- $IN(V_i)$: all vertices pointing to $V_i$
- $OUT(V_j)$: all vertices $V_j$ points to
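A minimal sketch of the iteration, assuming the graph is given as a dict that maps each vertex to the set of its neighbours (so that IN and OUT coincide, as in an undirected graph). The function name `text_rank`, the tolerance `tol`, and the cap `max_iter` are assumptions for this example.

```python
def text_rank(graph, d=0.85, tol=1e-4, max_iter=100):
    """Iterate S(V_i) = (1 - d) + d * sum_j S(V_j) / |OUT(V_j)|.

    `graph` maps each vertex to the set of its neighbours; every
    neighbour must itself be a key of `graph`.
    """
    scores = {v: 1.0 for v in graph}  # each term starts with score 1
    for _ in range(max_iter):
        new_scores = {
            v: (1 - d) + d * sum(scores[u] / len(graph[u]) for u in graph[v])
            for v in graph
        }
        # Largest change of any single score in this iteration
        change = max((abs(new_scores[v] - scores[v]) for v in graph), default=0.0)
        scores = new_scores
        if change < tol:
            break
    return scores
```

For the path graph a - b - c, the middle vertex b ends up with the highest score, since it receives the full score of both neighbours while a and c each receive only half of b's.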

9 Latent Semantic Analysis
Steinberger and Jezek (2004):
- Builds on a term-sentence matrix
- Performs an SVD on the term-sentence matrix
  - Method used to reduce the dimensions of a matrix
  - Related to Principal Component Analysis (PCA)
- Idea: take the sentence(s) that cover(s) most of the variance in the data
- These sentences are associated with the highest singular values
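The idea can be illustrated with a deliberately reduced sketch that looks only at the largest singular value. It is not the scoring scheme of the cited paper, which may combine several components. To stay dependency-free, the dominant right singular vector is approximated by power iteration on A^T A instead of a full SVD; the function name `dominant_sentence_weights` and the toy matrix are assumptions for this example.

```python
import math

def dominant_sentence_weights(matrix, iterations=100):
    """Weight sentences by the dominant singular component.

    `matrix` is a term-by-sentence matrix A as a list of rows (one row
    per term, one column per sentence). Power iteration on A^T A
    approximates the largest singular value sigma_1 and its right
    singular vector v_1. Returns (sigma_1, [sigma_1 * |v_1[j]|]).
    """
    n_cols = len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols
    sigma = 0.0
    for _ in range(iterations):
        # u = A v (one entry per term)
        u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        # w = A^T u (one entry per sentence)
        w = [sum(row[j] * u_i for row, u_i in zip(matrix, u)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
        sigma = math.sqrt(norm)  # ||A^T A v|| approaches sigma_1 squared
    return sigma, [sigma * abs(x) for x in v]

# Toy matrix (hypothetical): rows are terms, columns are three sentences.
term_sentence = [
    [1, 1, 0],  # "summarization"
    [1, 1, 0],  # "text"
    [0, 1, 0],  # "graph"
    [0, 0, 1],  # "weather"
]
sigma_1, weights = dominant_sentence_weights(term_sentence)
best_sentence = weights.index(max(weights))
```

Here the second sentence gets the largest weight because it shares the most terms with the dominant topic, while the unrelated third sentence gets a weight close to zero.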

10 Singular Value Decomposition [figure]
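For reference, the standard form of the decomposition is given below. The symbols are the usual textbook notation and are an assumption here, not recovered from the slide: $A$ is the $m \times n$ term-sentence matrix ($m$ terms, $n$ sentences) and $r$ is its rank.

```latex
A = U \Sigma V^{\top}, \qquad
U \in \mathbb{R}^{m \times r}, \quad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
V^{\top} \in \mathbb{R}^{r \times n}, \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0
```

The columns of $U$ and $V$ are orthonormal. Each column of $V^{\top}$ describes one sentence in terms of the $r$ latent components, and $\sigma_i$ measures how much of the data the $i$-th component accounts for.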

11 Other Approaches
- Kullback-Leibler divergence: find a set of sentences that optimally encodes the given document
- Vertex cover: find a set of sentences that optimally covers the terms of a given document
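The Kullback-Leibler idea can be sketched with a greedy search. This is an approximation, not an optimal solver: it repeatedly adds the sentence that brings the summary's term distribution closest to the document's. The function names, the additive `smoothing` constant, and the greedy strategy itself are assumptions made for this example.

```python
import math
from collections import Counter

def kl_divergence(p_counts, q_counts, vocab, smoothing=0.01):
    """KL(P || Q) over `vocab`, with additive smoothing applied to Q only."""
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values()) + smoothing * len(vocab)
    kl = 0.0
    for term in vocab:
        p = p_counts[term] / p_total
        if p > 0:
            q = (q_counts[term] + smoothing) / q_total
            kl += p * math.log(p / q)
    return kl

def kl_greedy_summary(sentences, max_sentences=2):
    """Greedily pick sentences minimizing KL(document || summary).

    `sentences` is a list of token lists; returns the chosen indices
    in document order.
    """
    doc_counts = Counter(t for s in sentences for t in s)
    vocab = set(doc_counts)
    chosen, summary_counts = [], Counter()
    while len(chosen) < min(max_sentences, len(sentences)):
        best = min(
            (i for i in range(len(sentences)) if i not in chosen),
            key=lambda i: kl_divergence(
                doc_counts, summary_counts + Counter(sentences[i]), vocab
            ),
        )
        chosen.append(best)
        summary_counts += Counter(sentences[best])
    return sorted(chosen)
```

A greedy search is used because examining every subset of sentences grows exponentially with document length.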

12 References
- Erkan, G. and Radev, D. R. (2004). LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22.
- Gupta, V. and Lehal, G. S. (2010). A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence, 2(3).
- Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2).
- Mihalcea, R. and Tarau, P. (2004). TextRank: Bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.
- Steinberger, J. and Jezek, K. (2004). Using latent semantic analysis in text summarization and summary evaluation. Proc. ISIM, 4.

The Algorithm of Automatic Text Summarization Based on Network Representation Learning

Xinghao Song 1, Chunming Yang 1,3 (✉), Hui Zhang 2, and Xujian Zhao 1
1 School of Computer Science and Technology,