Automated Tagging for Online Q&A Forums


Rajat Sharma, Nitin Kalra, Gautam Nagpal
University of California, San Diego, La Jolla, CA 92093, USA
{ras043, nikalra,

Abstract

Hashtags created by users of online content websites such as Facebook, Twitter, Stack Exchange and Quora provide a convenient way to explore trending content, discover new content and follow content that matches our interests. With the tremendous growth of content on these websites, these tags are becoming increasingly important. Datasets from these websites can be explored to understand how the hashtags for a particular post are chosen and how they relate to the interests of a user. In this paper, we explore the Stack Overflow dataset and develop a system for content tag prediction by modelling it as a multi-label classification problem.

Keywords: Prediction, Tags, Multi-label classification, Data Mining

I. INTRODUCTION

We are living in the era of hashtags, where tagging of online content is widespread. These tags allow for easy discovery of content. Examples of such human-based content tagging include hashtags on Facebook and Twitter, and tags on Quora, Stack Exchange and blog posts. The rapid increase in content tagging in recent years calls for an understanding of the underlying process by which a post acquires its set of associated tags. Understanding this process could lead to an efficient predictive system that suggests tags for posts based on their content. Such a system could be a very useful feature for users of these online content websites. Another potential benefit is a tag correction system built on top of the tag prediction system, which could remove incorrectly assigned tags from a post. These benefits motivate us to understand what goes into choosing a set of tags for an online post and to perform predictive analysis for the same. We have built a tag prediction system for Stack Overflow questions.
We chose the Stack Overflow dataset primarily because it has a large number of posts with corresponding tags, and we found the features present in the data to be well suited for building a tag prediction system. In addition, Stack Overflow imposes a limit of five tags per question, which results in the tags being highly related to the post and hence leads to better predictive analysis. Moreover, Stack Exchange, the parent company of Stack Overflow, regularly makes this data available online, which will facilitate further analysis.

II. RELATED WORK

The task of predicting content tags has been studied by a few researchers previously. Kuo, 2011 proposed a model where content tags are predicted based on the words in the post and their relation to the tags [3]. The model, which was originally built for next-word prediction in documents, performed well when adapted to the Stack Overflow dataset by limiting the next-word prediction to tags only. This co-occurrence based model exploits the most important feature responsible for the tags, that is, the post's content itself. Stanley & Byrne, 2013 have also explored this problem and presented a model based on a variation of the declarative memory retrieval theory of the ACT-R cognitive architecture (Anderson et al., 2004) to predict associated tags for posts on Stack Overflow [5][6]. Similar work has been done by Xia & Lo, 2013, who proposed a multi-component tag prediction model that combines multi-label learning, similarity-based ranking and tag-term based ranking to predict the tags of posts on software information sites [4]. These models provide us with a good starting point for building a predictive system for Stack Overflow questions.

III. DATA SET

We use the dataset containing 10% of Stack Overflow Q&A provided on Kaggle [1]. The dataset contains 1,264,217 Stack Overflow questions, each consisting of the following fields: question-id, owner-id, creation date, score, title text, and HTML markup of the body. The corresponding Tags dataset contains the tags for each of the questions, which can be fetched using the question-id.
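The lookup of tags by question-id described above can be sketched as follows. This is a minimal illustration, not the authors' code: the column names `Id` and `Tag`, the inline sample rows and the helper `tags_by_question` are all assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical miniature of the Tags file: one (question id, tag) pair per row.
# The column names "Id" and "Tag" are assumptions made for this sketch.
TAGS_CSV = """Id,Tag
1,python
1,scikit-learn
2,javascript
2,html
2,css
"""


def tags_by_question(csv_text):
    """Group tag rows by question id so a question's tags need one lookup."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[int(row["Id"])].append(row["Tag"])
    return dict(grouped)


question_tags = tags_by_question(TAGS_CSV)
print(question_tags[2])
```

Grouping once up front avoids rescanning the tag rows for every question when the label sets are built later.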


A. Exploratory Analysis

The dataset contains 37,036 unique tags. Figure 1 shows the distribution of tag occurrences for the 1000 most frequent tags. From the plot we inferred that the 20 most frequent tags account for approximately 67% of the question data. Figure 2 shows those top 20 tags plotted with their corresponding frequency.

Fig. 1: Number of occurrences for 1000 most frequent tags

Fig. 2: Number of occurrences for 20 most frequent tags

Since working with such a huge dataset can be computationally intensive, we filtered our dataset to only contain questions tagged with the 500 most frequent tags. This led to a reduced dataset of size 486,209. Table 1 shows some of the statistics of the filtered data and Fig. 3 shows the distribution of the number of tags per question. From the figure we can infer that, after filtering, most of the questions contain either two or three tags.

TABLE I: STATISTICS OF FILTERED DATA

Number of questions                    486,209
Number of unique tags                  500
Total occurrences of tags              2,510,762
Average number of tags per question    2.4

Fig. 3: Distribution of Number of Tags per question in the filtered data

Finally, we shuffled and partitioned the filtered dataset into training (60% - 291,726), validation (15% - 72,930) and test (25% - 121,553) sets.

We then analyzed the data by pivoting on unique users and their history. Figure 4 shows the distribution of the number of questions posted by the top unique users. The analysis showed that more than 85% of the users posted fewer than three questions.

IV. PREDICTIVE TASK

The primary predictive task for this dataset is to predict the tags based on the content of a question. Earlier we inferred that most of the questions have either two or three tags, and a maximum of five tags is possible per question. Thus, the final predictive task can be defined as: given a set of questions $Q$ and a set of tags $T = \{t_1, t_2, \ldots, t_n\}$, $n = 500$, we want to assign a subset of tags $T_q \subseteq T$ to every $q \in Q$, where $|T_q| \le 5$.

A. Features used

1) Title of the question: The title of the question is one of the most important features, as users tend to add a direct reference to the tag in the title to make it clear what the query is specifically about. We do not filter out punctuation marks, as they carry crucial information such as the language name. For instance, asp.net

The formulation that we use for the TF-IDF calculation:

\[ \mathrm{tf}(t, q) = \text{number of times term } t \text{ appears in question } q \tag{1} \]

\[ \mathrm{idf}(t, Q) = \log \frac{N}{|\{q \in Q : t \in q\}|} \tag{2} \]

\[ \mathrm{tfidf}(t, q, Q) = \mathrm{tf}(t, q) \cdot \mathrm{idf}(t, Q) \tag{3} \]

where $N$ is the total number of questions in $Q$.

Before calculating the TF-IDF values, the title text is filtered using the following methods:
- Remove the stop words.
- Convert the textual content to lower case.

2) Body of the question: The body of the question is given to us in the data as an HTML blob that includes the textual content of the question as well as the code element associated with it. The code element is embedded between <code></code> tags. The following techniques were used to filter the body text:
- Remove the stop words.
- Convert the content to lower case.
- Extract and remove the code element.
- Remove the HTML tags from the data.

After this filtering we are left with the body of the question without the code element. Using this body text we calculate the TF-IDF values for the dataset.

3) Code element of the question: We obtained the code element for each question from the filtering done for the body of the question. The code usually consists of keywords that are highly related to the programming language or framework that the question is concerned with. For instance, Figure 5 shows a query about the scikit-learn library that makes no mention of it in the body. Once trained, the model can learn to relate the sklearn import statement to the scikit-learn tag. We utilize this property of the code content and compute TF-IDF scores for the code element of each question.

Fig. 4: Distribution of most number of questions posted by unique users in training dataset

Fig. 5: Significance of code as a feature

B. Methods

We predict at most five tags for each question from a pool of 500 unique tags, which means each question can be assigned multiple tags based on its content. To achieve this, we train an independent classifier for each tag.
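Training one independent classifier per tag implies that each tag gets its own binary target column. The sketch below shows that decomposition in plain Python; the function names `binarize_labels` and `column`, the three-tag vocabulary and the sample questions are hypothetical, chosen only to keep the example small.

```python
def binarize_labels(tag_lists, tag_vocabulary):
    """Turn per-question tag lists into one 0/1 target column per tag."""
    index = {tag: j for j, tag in enumerate(tag_vocabulary)}
    matrix = [[0] * len(tag_vocabulary) for _ in tag_lists]
    for i, tags in enumerate(tag_lists):
        for tag in tags:
            if tag in index:  # tags outside the vocabulary are ignored
                matrix[i][index[tag]] = 1
    return matrix


def column(matrix, j):
    """Binary targets seen by the classifier responsible for tag j."""
    return [row[j] for row in matrix]


# Hypothetical vocabulary and questions, for illustration only.
tag_vocabulary = ["python", "scikit-learn", "javascript"]
question_tags = [["python", "scikit-learn"], ["javascript"], ["python"]]

Y = binarize_labels(question_tags, tag_vocabulary)
python_targets = column(Y, 0)
print(Y)
print(python_targets)
```

Each column of `Y` is an ordinary binary classification target, which is why any binary classifier can serve as the underlying algorithm.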
scikit-learn provides a multi-label classification wrapper, OneVsRestClassifier, which can work on top of any underlying classification algorithm. For the underlying algorithm we try the following:
- Multinomial Naive Bayes
- Linear SVC
- Stochastic Gradient Descent

V. MODEL

A. Evaluation

Our prediction system is evaluated on the basis of the mean recall, mean precision and mean F1 score over all questions, i.e. the recall, precision and F1 are computed for every question, and then the mean of all these values is computed. Let $Q$ be the validation data set, consisting of $|Q|$ questions, and let $Y_i$ be the set of actual labels and $Z_i$ the set of predicted labels for the $i$-th data point. The precision, recall and F1 scores for a predictor $h$ are computed as follows:

\[ \mathrm{Precision}(h, Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{|Y_i \cap Z_i|}{|Z_i|} \tag{4} \]

\[ \mathrm{Recall}(h, Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{|Y_i \cap Z_i|}{|Y_i|} \tag{5} \]

\[ F_1(h, Q) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{2\,|Y_i \cap Z_i|}{|Z_i| + |Y_i|} \tag{6} \]
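The per-question averaging in equations (4)-(6) can be sketched in a few lines of plain Python. This is an illustrative implementation under one stated assumption: a question with an empty denominator contributes 0 to the mean, a case the paper does not discuss. The function name `example_based_scores` and the sample tag sets are hypothetical.

```python
def example_based_scores(actual, predicted):
    """Mean per-question precision, recall and F1, as in equations (4)-(6)."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("need the same non-zero number of actual and predicted sets")
    precision = recall = f1 = 0.0
    for y, z in zip(actual, predicted):
        y, z = set(y), set(z)
        overlap = len(y & z)
        # Assumption: an empty denominator contributes 0 to the mean.
        precision += overlap / len(z) if z else 0.0
        recall += overlap / len(y) if y else 0.0
        f1 += 2 * overlap / (len(y) + len(z)) if (y or z) else 0.0
    return precision / n, recall / n, f1 / n


# Hypothetical labels: two questions, three predicted tags each.
actual = [{"python", "pandas"}, {"java"}]
predicted = [{"python", "numpy", "pandas"}, {"c#", "java", ".net"}]

p, r, f = example_based_scores(actual, predicted)
print(p, r, f)
```

In this example every actual tag is recovered, so recall is perfect, while precision is pulled down by the extra predictions needed to always emit three tags.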


B. Baseline Model

To compare the models defined above, we create a baseline model with the following features:
- Bag-of-words Title: We create a frequency-based feature vector for a question by counting the number of occurrences of a word across all unique words in the title corpus.
- Bag-of-words Body: Similar to the approach we followed for the title text, we create a bag-of-words representation of the body data.

We then use the multiple classifiers approach to perform the prediction, using a Naive Bayes classifier for the individual tags. Using this model we get mean precision and mean recall values of 0.46 and 0.63 respectively.

C. Optimal number of Tags

Since a question can have any number of tags between one and five, we assign the k tags having the highest decision function value, where k is a hyper-parameter. In order to determine the optimal value of this hyper-parameter, we plot the mean F1 score, precision and recall for varying k on the validation set (Fig. 6). We observe that the best performance is obtained by setting k to 3. Hence, we predict three tags for all models from here onwards.

Fig. 6: Model performance with maximum number of predicted tags on validation set

D. Classification Task

For performing classification, we filter our dataset as specified in the Exploratory Analysis. Once we are done filtering the data, we have a set of questions and a set of tags associated with each question. We construct our label vectors by creating a list of tags assigned to each question. We construct our training set of features by segregating the title, body and code as separate string entries. To perform classification we create a FeatureUnion pipeline that consists of a union of features and a classifier. The flow is depicted as follows:

Fig. 7: Classification Training

To compute the TF-IDF scores of each of the features, we first pass them through a CountVectorizer, which computes the counts of each unigram in the textual content across the corpus, and then the TF-IDF transformer creates TF-IDF values for these unigrams. These features are then used to train a OneVsRestClassifier with an underlying linear SVM classifier. Once the training is done, we predict the labels for the test data using the classifier. To get the confidence scores for each prediction we use the decision function of the classifier. The predicted labels, actual labels and the sorted confidence scores are then used to compute the performance measures: F1 score, precision and recall. We experiment with different features and underlying classifiers for the OneVsRestClassifier. The results are presented in the next section.

VI. RESULTS

A. Classifiers Performance

We evaluate our model performance with different classification methods. Figure 8 shows the performance of our selected models on the test dataset. Compared to the baseline, the Linear SVC and SGD techniques perform much better, whereas Multinomial NB falls behind. This seems to be because the features are not independent and the dependencies among them are quite dissimilar. We observe a large improvement in performance when using the SGD method and an even better performance with the Linear SVC based model.

Fig. 8: Results of different models on test dataset

B. Feature Experiments

We performed ablation experiments on our feature set to evaluate the impact of each feature. Figure 9 shows the impact of each feature on our model. The best performing feature set is the combination of all the defined features. The title-only model, which is very close to the baseline model, performs worst, as expected: without the code and body context it provides limited information. Removing code from the feature set leads to a significant drop in performance across all three measures (F1 score, precision and recall).

Fig. 9: Results with different features using Linear SVC on test dataset

VII. FUTURE WORK

The main limitation of the approach is the inability to run the model on the entire dataset, i.e. for all 37,036 tags, because of the computational limitations of the machine. During analysis, we discovered that a large proportion of the tags were programming languages, which can prove to be a very useful feature. As an extension to this model, we can try to detect the language of the code element and use it as a feature by creating a one-hot representation of the programming language used in the question. The raw data dump from Stack Exchange has almost 20 datasets for Stack Overflow questions, which can be helpful if we were to include user preferences or history as a feature to predict the tags. Another, more complex feature would be to harvest the correlation between tags to perform classification.
For instance, questions about Javascript/CSS/HTML are highly likely to have front-end as a secondary tag.

The proposed model is able to train for multiple tags and to combine different elements of the content as independent features. For instance, we discovered that embedded code, though it does not share the same content background as the surrounding text, proved to be a powerful feature for tagging. We can exploit this capability to tag Piazza posts, which can have embedded code or equations as elements. Another application can be tagging Quora data, although, Quora being a more social platform, user history and correlation play an important role in classification.

VIII. CONCLUSION

We conclude that predicting the tags of a question based on its title, body and code section worked reasonably well. Based on our experimentation, including user-based features in tag prediction did not work well, primarily because a very small proportion of the users posted more than two questions, and hence the model was not able to learn significantly from knowing who posted the question. Also, excluding the code part from the model resulted in less accurate prediction, because the code part carries important information about the language and syntax, which in turn leads to better predictions. Removing the punctuation from the content was also not a good idea, because many of the terms that help in predictive analysis contain a punctuation mark, given that this content relates to code and technical terms. We also conclude that considering title, body and code as separate features instead of combining them leads to better prediction, primarily because the significance of a term in the question's content generally depends on whether the term appears in the title, body or code section, owing to the different TF-IDF distributions in the different sections.

REFERENCES

[1] StackSample: 10% of Stack Overflow Q&A,
[2] I. Katakis, G. Tsoumakas, I. Vlahavas, Multilabel text classification for automated tag suggestion, Proceedings of the ECML/PKDD Discovery Challenge,
[3] Kuo, D. On Word Prediction Methods (Tech. Rep. No. UCB/EECS ). EECS Department, University of California, Berkeley
[4] Xin Xia, David Lo, Xinyu Wang and Bo Zhou, Tag Recommendation in Software Information Sites, IEEE Xplore,
[5] Clayton Stanley, Michael D. Byrne, Predicting Tags for StackOverflow Posts, 2013
[6] Fu, W.; Pirolli, P. L. SNIF-ACT: a cognitive model of user navigation on the World Wide Web. Human-Computer Interaction. 2007
