Prof. Ahmet Süerdem Istanbul Bilgi University London School of Economics

3 Media Intelligence. Business intelligence (BI) uses data mining techniques and tools to transform raw data into meaningful information for business analysis. Media intelligence (MI) serves the same purpose but applies text mining techniques to unstructured, user-generated textual data from sources such as online newspapers, social media sites, blogs, comment fields, and wikis.

4 Media Monitoring. The activity of monitoring the visibility of issues and topics in print, online, and broadcast media. It can be conducted for business, political, or scientific purposes. The services that media monitoring companies provide typically include the systematic recording of radio and television broadcasts, the collection of press clippings from print publications, and the collection of data from online information sources.

6 Web crawler. Systematically browses the Internet for the purpose of Web indexing. Web crawlers can copy all the pages they visit for later processing by a search engine, which indexes the downloaded pages so that users can search them much more quickly. A Web crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in each page and adds them to the list of URLs to visit. Some common crawlers: Heritrix; Nutch; PHP-Crawler.
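The seed-and-frontier loop described on this slide can be sketched in a few lines. This is a minimal illustration, not how Heritrix or Nutch work internally: the `crawl` function, the `LinkExtractor` class, and the in-memory `FAKE_WEB` dictionary (used here in place of real HTTP requests) are all hypothetical names introduced for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    `fetch` is a caller-supplied function that maps a URL to its HTML
    text, or returns None when the page is unavailable.
    """
    frontier = deque(seeds)   # URLs still to visit
    seen = set(seeds)         # URLs already queued, to avoid revisits
    pages = {}                # URL -> downloaded HTML
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical three-page "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.org/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.org/a": '<a href="/b">B</a>',
    "http://example.org/b": '<a href="/">home</a>',
}

visited = crawl(["http://example.org/"], FAKE_WEB.get)
```

In a real crawler, `fetch` would issue HTTP requests, and the stored pages would be handed to an indexer or written to an archive.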

7 Issues in crawling. Selection: which pages to download. Re-visit: when to check for changes to the pages. Politeness: how to avoid overloading Web sites. Parallelization: how to coordinate distributed Web crawlers.
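One common way to handle the politeness issue is to honour a site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` module; the robots.txt content, the `example-bot` agent name, and the `allowed` helper are assumptions made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it
# from the target site before fetching any other page.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="example-bot"):
    """Return True if the robots.txt rules permit `agent` to fetch `url`."""
    return rules.can_fetch(agent, url)


# Seconds the site asks crawlers to wait between requests (None if unset).
delay = rules.crawl_delay("example-bot")
```

A crawler would call `allowed` before each request and sleep for `delay` seconds between requests to the same host.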

8 Scraping. Web scraping focuses more on transforming the unstructured HTML data in the WARC (Web ARChive) files into structured data that can be stored and analyzed in a central local database or spreadsheet.

9 Scraping Techniques. Human copy-and-paste. Regular expression matching: tagging by detecting regular patterns. HTML parsers: scrape according to the HTML structure; they need constant updating because HTML structures change. Apache Nutch provides both Web crawling and HTML parsing. Web-scraping software: _term=m260&utm_content=v1&utm_campaign=homega . Semantic annotation recognition: the pages being scraped may include metadata or semantic markup and annotations, which can be used to locate specific data. XPath cleaning.
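Two of the techniques listed on this slide, HTML parsing and regular expression matching, can be combined to turn a page into a structured record. The following is only a sketch using the Python standard library; the sample page, the `ArticleParser` class, and the field names are invented for the illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical article page used as input.
SAMPLE_HTML = """
<html><body>
<h1 class="title">New telescope launched</h1>
<span class="date">Published 2015-03-14</span>
<p>The observatory begins operations.</p>
</body></html>
"""


class ArticleParser(HTMLParser):
    """Collects the text found inside <h1> and <p> elements."""

    def __init__(self):
        super().__init__()
        self._current = None
        self.fields = {"h1": "", "p": ""}

    def handle_starttag(self, tag, attrs):
        if tag in self.fields:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] += data


def scrape(html):
    """Return one structured record (a dict) for an article page."""
    parser = ArticleParser()
    parser.feed(html)
    # Regular expression matching: detect an ISO-style date pattern.
    date = re.search(r"\b\d{4}-\d{2}-\d{2}\b", html)
    return {
        "headline": parser.fields["h1"].strip(),
        "body": parser.fields["p"].strip(),
        "date": date.group(0) if date else None,
    }


record = scrape(SAMPLE_HTML)
```

The parser depends on the page using `<h1>` and `<p>` in this way, which is why such scrapers need updating whenever the HTML structure changes.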

10 Full text database (Digital archive). Contains the complete text of blogs, magazines, newspapers, or other kinds of textual documents. k/search/flap.do?flapid=home&random= ; Yahoo news; MongoDB.

11 Information retrieval. Full text indexing and searching capability. Terminology extraction: finding the relevant terms for a given corpus. Thesaurus: classifies the articles with a keyword-based method, in which keywords related to S&T are associated with score values.
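Full text indexing is typically built on an inverted index, which maps each term to the documents containing it. The sketch below is a minimal illustration of that idea, not the system used in the project; the function names and the three sample documents are assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical mini-corpus of headlines.
DOCS = {
    1: "Stem cell research advances",
    2: "Football cup final",
    3: "Research funding for stem cells",
}

index = build_index(DOCS)
hits = search(index, "stem research")
```

Note that this sketch does no stemming, so "cell" and "cells" are indexed as different terms.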

12 Basic lexicon

13 Relevance: Scientific activities. Cutting-edge technologies related to research, aerospace technology, astronomy; discussions on policies and impact of ST&I; life sciences, medicine and health policy; scientific content explained or disclosed; environment, environmental policies, international treaties, alternative energy, etc.; humanities and social sciences, including those that give voice to researchers from these areas.

14 Relevance: Keyword approach (Italy). Step 1: a gold standard set of 1000 manually selected articles, coded on six dichotomous variables, each capturing a relevant dimension: Scientist; Scientific institution; Scientific journal; Scientific discipline (social and statistical sciences excluded); General reference to scientific research activity; General reference to a scientific discovery or artefact. Minimum relevance: an article gets at least two points (two YES answers); the maximum is 6 points. Four different human coders, double-checking two times; articles on which at least two coders agreed were retained in the gold standard. Step 2: weighting each article by applying the thesaurus with a minimum score of 20, to obtain the measure of salience = % of relevant articles in the total sample.
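The two steps above can be sketched as follows. The slide does not say how the thesaurus score is computed, so this example assumes it is the sum of keyword weights over all keyword occurrences in an article; the thesaurus entries, their weights, and the sample articles are invented for the illustration.

```python
import re

# Hypothetical thesaurus: keyword -> score value (assumed weights).
THESAURUS = {"research": 10, "scientist": 10, "laboratory": 8, "genome": 12}
MIN_SCORE = 20        # minimum thesaurus score stated on the slide
MIN_YES_ANSWERS = 2   # minimum relevance in the manual coding step


def gold_standard_relevant(codes):
    """Step 1: `codes` maps the six dimensions to True/False answers."""
    return sum(codes.values()) >= MIN_YES_ANSWERS


def thesaurus_score(text):
    """Step 2 (assumed scoring): sum the weights of keyword occurrences."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(THESAURUS.get(word, 0) for word in words)


def salience(articles):
    """Percentage of articles whose score reaches the minimum."""
    relevant = sum(1 for text in articles if thesaurus_score(text) >= MIN_SCORE)
    return 100.0 * relevant / len(articles)


coded_article = {
    "scientist": True,
    "institution": False,
    "journal": False,
    "discipline": True,
    "research_activity": False,
    "discovery": False,
}

ARTICLES = [
    "Scientist publishes genome research",   # score 32
    "Local team wins the match",             # score 0
    "New laboratory opens",                  # score 8
    "Research on research methods",          # score 20
]

share = salience(ARTICLES)
```

With these sample articles, two of the four reach the minimum score, so the salience is 50%.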

15 Relevance: Vector approach (Spain). Support Vector Machine (SVM). Training set: self-training from a small sample of 999 manually classified articles (classified as science, technology, and their intrinsic and extrinsic features). Articles with the highest scores were added to the training set. This new set was used to re-classify all the articles, and the process was repeated until the results were deemed reliable. Active learning: manual controls (classifying random samples). Our final training sets had between 800 and 1000 articles for each category. We carried out a k-fold cross-validation (k=5). The mean rate of correct classifications across all categories was 89.21%.
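The k-fold cross-validation step can be sketched in plain Python. This illustrates the evaluation procedure only: a trivial majority-class classifier stands in for the SVM, and the `fit`/`predict` functions and the toy labels are assumptions, not the Spanish team's setup.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def cross_validate(X, y, fit, predict, k=5):
    """Return the mean accuracy over k folds."""
    accuracies = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


# Stand-in classifier (NOT an SVM): always predicts the majority label.
def fit(features, labels):
    return max(set(labels), key=labels.count)


def predict(model, x):
    return model


# Toy data: ten articles, 1 = science-related, 0 = not.
X = ["article %d" % i for i in range(10)]
y = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]

mean_accuracy = cross_validate(X, y, fit, predict, k=5)
```

In practice the data would be shuffled before splitting, and `fit` and `predict` would wrap the SVM's training and prediction calls.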

16 Indicator generation. Mass: the absolute number of S&T articles published in the examined vehicle in a given period, M = N_selected. Frequency: the relative quantity of S&T articles in the total of articles published in the vehicle (%), f = M / N_Tot. Density: the relative space of S&T articles (% of words in S&T articles over total words in the vehicle), d = W_selected / W_Tot. Deepening: the relative weight of S&T articles compared with the vehicle's average article, A = d / f.
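The four indicators can be computed directly from a list of articles. Substituting the definitions gives A = d / f = (W_selected / M) / (W_Tot / N_Tot), that is, the average length of an S&T article divided by the average length of any article in the vehicle. The sketch below keeps f and d as fractions rather than percentages, which leaves A unchanged; the function name and the sample word counts are invented for the illustration.

```python
def indicators(articles):
    """Compute mass, frequency, density and deepening for one vehicle.

    `articles` is a list of (word_count, is_st) pairs, where `is_st`
    is True when the article was selected as an S&T article.
    """
    n_tot = len(articles)
    w_tot = sum(words for words, _ in articles)
    mass = sum(1 for _, is_st in articles if is_st)            # M = N_selected
    w_selected = sum(words for words, is_st in articles if is_st)
    frequency = mass / n_tot                                   # f = M / N_Tot
    density = w_selected / w_tot                               # d = W_selected / W_Tot
    deepening = density / frequency                            # A = d / f
    return {
        "mass": mass,
        "frequency": frequency,
        "density": density,
        "deepening": deepening,
    }


# Hypothetical vehicle: one 800-word S&T article and three 400-word others.
SAMPLE = [(800, True), (400, False), (400, False), (400, False)]

result = indicators(SAMPLE)
```

Here the S&T article is 800 words long against a vehicle average of 500 words, so the deepening is 1.6; a value above 1 means S&T articles are longer than the average article. The function assumes at least one S&T article, since f would otherwise be zero.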

17 Search and queries. Works like a Web query: searches and retrieves relevant documents and exports them (for example, to an Excel file) for further analysis in qualitative and quantitative text analysis software.

18 KW strategy: strengths and weaknesses. Weakest: UK, totally bottom-up (only for one week); Italy, totally top-down, ad hoc selection. Stronger: Germany, which combines both. Strongest: Spain, more systematic; separates KW selection for disciplines and for themes and issues. Suggestion: use a multidimensional coding frame to select a gold standard for human annotation, then use it as a training set for machine learning. Italy's 6 dimensions for relevance testing (scientist; institution; journal; discipline; scientific research activity; scientific discovery or artefact) can serve as an example.

19 Countries involved in the Science in the Media Monitoring research. Automated analysis: Brazil, Italy, Spain, Turkey. Not automated: UK, Germany, India.

20 Example
