Bi-monthly Report - Tianyi Luo
Work done this week:
- Write a crawler based on keywords (supports Chinese and English)
- Modify a Sina Weibo crawler (about 340M of data per day)
- Offline learning to rank module is completed and integrated into the QA system (P@1: 69% -> 77%)
- Online learning to rank module is completed and will be integrated into the next version of the QA system (it lets us utilize click data to enhance the system)
Write a crawler based on keywords
What problem do we need to solve? Given a list of keywords in Chinese or English, e.g. 新浪 (Sina), 阿里巴巴 (Alibaba), Tencent, we want to crawl the web pages returned by a search engine (e.g. Bing) when these keywords are used as queries.
Write a crawler based on keywords
The GitHub repository of this crawler (about 300 lines of Java code): https://github.com/pkuluotianyi/getcorpusbaseonkeyword
You are welcome to use it and give me some advice. Thanks! A minimal sketch of the crawling approach is shown below.
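As an illustration of the approach only (not the repository's actual code), the Java sketch below submits a keyword to Bing and fetches the pages it returns. It assumes the jsoup library is on the classpath; the CSS selector for Bing result links ("li.b_algo h2 a") and the output handling are assumptions made for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.net.URLEncoder;

public class KeywordCrawler {
    public static void main(String[] args) throws Exception {
        String keyword = "阿里巴巴";                         // works for Chinese or English keywords
        String query = URLEncoder.encode(keyword, "UTF-8");
        // Fetch the search engine result page for this keyword.
        Document serp = Jsoup.connect("https://www.bing.com/search?q=" + query)
                .userAgent("Mozilla/5.0")
                .get();
        // "li.b_algo h2 a" is an assumed selector for Bing result links.
        for (Element link : serp.select("li.b_algo h2 a")) {
            String url = link.absUrl("href");
            Document page = Jsoup.connect(url).userAgent("Mozilla/5.0").get();
            System.out.println(url + "\t" + page.title());   // page.html() would be saved as corpus
        }
    }
}
```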
Modify a Sina weibo crawler
The crawler (implemented in Python) is available at: http://pan.baidu.com/s/1mgou5yg (password: digt)
It downloads about 340M of Sina Weibo data every day.
Offline learning to rank module is completed and integrated into QA system (P@1: 69% -> 77%)
What problem do we need to solve? Given candidate sets (about 50 candidates per query) for 1,596 queries, together with all of their similarity scores, we use learning to rank technology to learn the optimal way to combine these scores and re-rank the candidates.
Candidate file format (example record):
2 申请机动车驾驶证的收费标准 422 [ 申请 ] { 机动车驾驶证, 驾驶证, 驾照 } [ 的 ] 收费 [ 标准 ] [ 是 ] [ 什么 ] 申请机动车驾驶证的收费标准? 二 三轮摩托车驾驶证收费标准: 280 元/人, 工本费 10 元 2.595365 3.3772912 422 16.69876 13.262825 14.4957075 11.881077 17.792013 14.471099 0.7856765 0.497769 13.612657 10.55533
So this problem is an information retrieval (re-ranking) problem.
Offline learning to rank module is completed and integrated into QA system (P@1: 69% -> 77%)
Implemented the offline learning to rank module and integrated it into the QA system (batch learning). Implemented in Java (about 900 lines). We utilize learning to rank technology to learn an optimized ranking of the candidates.
The data format of learning to rank is illustrated below.
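The report does not reproduce the exact file layout used by the module; as a reference point, the sketch below shows the standard SVMrank/LETOR-style input that learning-to-rank toolkits typically consume. The labels and feature values here are invented for illustration.

```
# <relevance label> qid:<query id> <feature id>:<value> ...
2 qid:422 1:16.70 2:13.26 3:14.50 4:11.88 5:0.79 6:0.50   # best candidate for this query
1 qid:422 1:13.61 2:10.56 3:12.03 4:9.20  5:0.41 6:0.30
0 qid:422 1:2.60  2:3.38  3:1.94  4:0.87  5:0.11 6:0.08
```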
Offline learning to rank module is completed and integrated into QA system (P@1: 69% -> 77%)
What features do we define? Experiment results: http://cslt.riit.tsinghua.edu.cn/mediawiki/index.php/huilan-learning-to-rank
Feature coefficients:
+0.0398  1. Score from a tf*idf ranking model, with the question templates as the document set being ranked
+0.0369  2. Score from a tf*idf ranking model, with the standard questions as the document set being ranked
+0.0542  3. Length of the question template
-0.0313  4. Length of the standard question
+0.0733  5. Segment the query into words; the ratio of the number of query words appearing in the question template to the length of the question template
+0.1661  6. Segment the query into words; the ratio of the number of query words appearing in the standard question to the length of the standard question
         Example: the query is 保障性住房; word segmentation result: 保障性 住房; standard question: 什么是保障性住房? The query words appear in the standard question 2 times and the length of the standard question is 18, so the feature value is 1/9.
+0.1751  7. Segment the query into words; the ratio of the number of query words appearing in the question template to the total number of query words
+0.0766  8. Segment the query into words; the ratio of the number of query words appearing in the standard question to the total number of query words
         Example: the query is 保障性住房; word segmentation result: 保障性 住房; standard question: 什么是保障性住房? The query words appear in the standard question 2 times and the total number of query words is 3, so the feature value is 2/3.
Offline learning to rank module is completed and integrated into QA system (P@1: 69% -> 77%)
What features do we define? (continued)
+0.0342  9. Score from a BM25 ranking model, with the question templates as the document set being ranked
+0.0240 10. Score from a BM25 ranking model, with the standard questions as the document set being ranked
+0.0400 11. Score from a DFR ranking model, with the question templates as the document set being ranked
-0.0952 12. Score from a DFR ranking model, with the standard questions as the document set being ranked
 0.0000 13. Score from an IB ranking model, with the question templates as the document set being ranked
 0.0000 14. Score from an IB ranking model, with the standard questions as the document set being ranked
-0.0057 15. Score from an LMDirichlet ranking model, with the question templates as the document set being ranked
+0.0152 16. Score from an LMDirichlet ranking model, with the standard questions as the document set being ranked
 0.0000 17. Score from an LMJelinekMercer ranking model, with the question templates as the document set being ranked
+0.0222 18. Score from an LMJelinekMercer ranking model, with the standard questions as the document set being ranked
+0.1052 19. NER agreement between the query and the question template: (1) if the query contains a named entity and the question template also does, the feature value is 1; (2) if neither contains a named entity, the value is 1; (3) if the query contains a named entity but the template does not, the value is 0; (4) if the query contains no named entity but the template does, the value is 0.
+0.1605 20. NER agreement between the query and the standard question, defined in the same way as feature 19.
-0.0181 21. Cosine similarity between the sentence embeddings of the query and the standard answer, a real number in (0,1). (The sentence embedding is built by comparing each dimension across all word vectors and taking the value with the largest absolute value as that dimension of the sentence vector.)
A sketch of how the word-overlap features (5-8) are computed is given after this list.
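To make features 5-8 concrete, here is a minimal Java sketch of the two overlap-ratio computations. It assumes the query has already been segmented into a word list (segmentation itself is not shown, and the names are chosen for illustration). Note that the report's example uses a length of 18 for the standard question, apparently a byte count; the sketch uses character count, so the exact normalization may differ.

```java
import java.util.Arrays;
import java.util.List;

public class OverlapFeatures {
    // Count how many query words occur in the target sentence (question template or standard question).
    static int overlapCount(List<String> queryWords, String target) {
        int count = 0;
        for (String w : queryWords) {
            if (target.contains(w)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> queryWords = Arrays.asList("保障性", "住房");   // segmented query
        String standardQuestion = "什么是保障性住房?";

        int overlap = overlapCount(queryWords, standardQuestion);
        // Feature 6 style: overlap normalized by the length of the standard question.
        double byQuestionLength = (double) overlap / standardQuestion.length();
        // Feature 8 style: overlap normalized by the total number of query words.
        double byQueryWordCount = (double) overlap / queryWords.size();
        System.out.println(byQuestionLength + " " + byQueryWordCount);
    }
}
```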
Offline learning to rank module is completed and integrated into QA system (P@1: 69% -> 77%)
About the word embedding feature: the word embedding feature is not important in our experiment. However, after discussing with Haifeng Wang, we believe it should be an important feature. We did not adopt the best way of using word embeddings to generate a sentence vector; a good method for generating sentence vectors is proposed in "Deep Learning for Answer Sentence Selection". A sketch of the max-absolute-value composition we currently use is given below.
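For reference, the max-absolute-value composition described for feature 21 can be sketched as follows. The word vectors and dimensionality are hypothetical, and this is the simple composition used in the report, not the better method from the cited paper.

```java
public class SentenceEmbedding {
    // Compose a sentence vector by taking, for each dimension, the word value
    // with the largest absolute value across all word vectors in the sentence.
    static double[] maxAbsCompose(double[][] wordVectors, int dim) {
        double[] sent = new double[dim];
        for (double[] wv : wordVectors) {
            for (int d = 0; d < dim; d++) {
                if (Math.abs(wv[d]) > Math.abs(sent[d])) sent[d] = wv[d];
            }
        }
        return sent;
    }

    // Cosine similarity between two sentence vectors (feature 21).
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int d = 0; d < a.length; d++) {
            dot += a[d] * b[d];
            na  += a[d] * a[d];
            nb  += b[d] * b[d];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }
}
```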
Offline learning to rank module is completed and integrated into QA system (P@1: 69% -> 77%)
Another experiment on the number of candidates. Before learning to rank, P@1 is 54.0%.

Nbest    P@1      P@5      Train time (including generating candidates)
30       63.2%    81.1%    226s
40       62.9%    83.3%    283s
50       62.8%    83.0%    339s
60       61.8%    82.3%    386s
100      57.3%    73.2%    603s

The number of candidates should be neither too large nor too small.
Online learning to rank module is completed and will be integrated into the next version of the QA system
What problem do we need to solve?
- Dynamic indexing (a user enters a question and an answer to teach the QA system).
- The user clicks the result that best matches his need, and the QA system collects these click data to enhance the system (a sketch of the click-feedback idea is shown below).
- Learning to rank deployment strategy: prepare two systems; one system conducts offline batch learning while the other serves online users. After offline learning finishes, we switch the systems.
- Implemented the online learning module in Java (about 400 lines).
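As an illustration of how collected click data could be turned into a weight update, here is a generic pairwise perceptron-style sketch. This is only one possible realization of the idea, not necessarily the algorithm used in the module; the method names and learning rate are invented for the example.

```java
import java.util.List;

public class ClickFeedbackUpdate {
    // If the user clicked a candidate that was ranked below unclicked ones,
    // nudge the ranking weights toward the clicked candidate's feature vector.
    static void update(double[] weights, double[] clickedFeatures,
                       List<double[]> skippedAboveClick, double eta) {
        for (double[] skipped : skippedAboveClick) {
            for (int d = 0; d < weights.length; d++) {
                weights[d] += eta * (clickedFeatures[d] - skipped[d]);
            }
        }
    }
}
```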
"An Online Learning to Rank Framework" (Lerot) - it was already a Solr component in 2014. I ran the code (Python) and it works well. The online learning method it uses is called dueling bandit gradient descent (DBGD): Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In ICML '09, 2009.
The logic of this kind of online learning to rank:
1. Randomly perturb the weights.
2. Use the click data to decide whether this perturbation is good or not.
The logic of the online learning to rank we want:
1. We have click data.
2. We update the weights directly from the click data.
A simplified sketch of the DBGD loop is given below.
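The Java sketch below shows one simplified DBGD step as described above; it is not Lerot's actual implementation. The current weights are perturbed in a random unit direction, the interleaved click comparison (abstracted here as a predicate) decides whether the perturbed ranker wins, and the weights move toward the perturbation only if it does. The step sizes delta and alpha are hypothetical parameters.

```java
import java.util.Random;
import java.util.function.Predicate;

public class DuelingBanditGD {
    // One simplified DBGD step over a linear ranking model with weight vector w.
    static double[] step(double[] w, double delta, double alpha,
                         Predicate<double[]> perturbedWins, Random rng) {
        int n = w.length;
        double[] u = new double[n];
        double norm = 0;
        for (int d = 0; d < n; d++) { u[d] = rng.nextGaussian(); norm += u[d] * u[d]; }
        norm = Math.sqrt(norm);

        // Candidate weights: current weights plus a random unit-direction perturbation.
        double[] wPrime = new double[n];
        for (int d = 0; d < n; d++) wPrime[d] = w[d] + delta * u[d] / norm;

        // The predicate stands for the interleaved comparison decided by user clicks.
        if (perturbedWins.test(wPrime)) {
            for (int d = 0; d < n; d++) w[d] += alpha * u[d] / norm;
        }
        return w;
    }
}
```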
Want to do next
- Conduct experiments on a new online learning to rank method and prepare to write an ACL 2015 paper.
- Conduct experiments on an online deep learning to rank method inspired by "Text Understanding from Scratch".
Thank You!