Web scraping. Donato Summa. 3 WP1 face to face meeting September 2017 Thessaloniki (EL)


1 Web scraping Donato Summa

2 Summary Web scraping : Specific vs Generic Web scraping phases Web scraping tools Istat Web scraping chain


4 Web scraping: Specific vs Generic
We can distinguish two different kinds of web scraping:
- specific web scraping, when both the structure and the content of the websites to be scraped are perfectly known, and crawlers just have to replicate the behaviour of a human being visiting the website and collecting the information of interest. Typical area of application: data collection for consumer price indices
- generic web scraping, when no a priori knowledge of the content is available, and the whole website is scraped and subsequently processed in order to infer the information of interest

5 Specific Web scraping We are interested in collecting very specific pieces of information in specific HTML structures (e.g. tables) in specific webpages of a specific website with a known structure.


6 Generic Web scraping We are interested in collecting the whole content of a website. The address is the only available information. Then you have to deal with the scraped unstructured content.


8 Web scraping phases Excluding the analysis part of the job, we can distinguish 4 phases: crawling, scraping, indexing and searching. The same approach was taken by CBS (using, of course, their own software tools).


9 Web scraping phases
Crawling: a Web crawler (also called Web spider, ant or robot) is a software program that systematically browses the Web starting from an Internet address (or a set of Internet addresses) and some pre-defined conditions (e.g., how many links to navigate, the depth, types of files to ignore, etc.).
Scraping: a scraper takes Web resources (documents, images, etc.) and extracts data from them, with the aim of storing the data for subsequent processing.
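The pre-defined conditions mentioned above (depth, number of pages, file types to ignore) can be illustrated with a minimal Python sketch. It is not how any specific crawler is implemented: the `LINKS` dictionary is a hypothetical in-memory stand-in for the Web, the function and parameter names are invented for illustration, and the use of -1 for "no page limit" is an assumption.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
LINKS = {
    "http://a.example/": ["http://a.example/about", "http://a.example/products"],
    "http://a.example/about": ["http://a.example/team"],
    "http://a.example/products": [
        "http://a.example/products/1",
        "http://a.example/brochure.pdf",
    ],
    "http://a.example/team": [],
    "http://a.example/products/1": [],
}


def crawl(seed, max_depth, max_pages=-1, skip_suffixes=(".pdf", ".jpg")):
    """Breadth-first visit starting from seed.

    max_depth limits how many links away from the seed we go;
    max_pages = -1 is assumed here to mean "no limit".
    """
    visited = []
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        if max_pages != -1 and len(visited) >= max_pages:
            break
        visited.append(url)  # a real crawler would fetch the page here
        if depth == max_depth:
            continue  # do not follow links beyond the allowed depth
        for link in LINKS.get(url, []):
            if link in seen or link.endswith(skip_suffixes):
                continue  # already queued, or a file type to ignore
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

With `max_depth=1` only the seed and the pages it links to directly are visited; raising the depth or setting `max_pages` changes how much of the site is covered.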


10 Web scraping phases Indexing / Searching: searching operations on a huge amount of data can be very slow, so it is necessary to index contents. Analysers tokenize text by performing any number of operations on it, which could include: extracting words, discarding punctuation, removing accents from characters, lowercasing (also called normalizing), removing common words, reducing words to a root form (stemming), or changing words into the basic form (lemmatization). The whole process is also called tokenization, and the chunks of text pulled from a stream of text are called tokens.
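As a rough illustration of what an analyser does, the following Python sketch chains a few of the operations listed above: accent removal, lowercasing, word extraction, removal of common words and a reduction to a root form. The stop word list and the `naive_stem` rule are deliberately crude, hypothetical stand-ins; a search platform would use a proper stemmer or lemmatizer.

```python
import re
import unicodedata

# Illustrative list of common words to discard (an assumption, not a real list).
STOP_WORDS = {"the", "of", "and", "a", "in"}


def strip_accents(text):
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def naive_stem(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyse(text):
    """Turn raw text into a list of normalized tokens."""
    text = strip_accents(text).lower()
    words = re.findall(r"[a-z0-9]+", text)  # extract words, discard punctuation
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


tokens = analyse("Selling cafés and hotels in the city!")
```

Here `tokens` contains the four tokens `sell`, `cafe`, `hotel` and `city`: the accent, the punctuation and the common words are gone, and the plural and "-ing" endings have been stripped.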


12 Web scraping tools
There are lots of Web scraping tools available on the Web (both free and commercial). We tested 3 of them in order to select the best solution for our needs, in particular:
- Apache Nutch + Apache Solr
- HTTrack + OS filesystem
- a solution based on JSOUP + custom storage
None of them fully satisfied our expectations; the main issue was the difficulty or the lack of customization.
We decided to build our own scraping platform by:
- using what we considered valuable (Apache Solr)
- wrapping and customizing already available SW libraries (RootJuice)
- developing some programs from scratch (see next slide)

13 Istat Web scraping tools
Use cases: URL Retrieval use case, Firm websites scraping use case
Tools: UrlSearcher *, UrlScorer **, UrlMatchTableGenerator **, RootJuice **, SolrTSVImporter **, Apache Solr **, FirmsDocTermMatrixGenerator
* already used by BG - SI - PL
** already used by BG
These tools are freely available at:

14 Istat Web scraping tools
All the programs are:
- free
- open source (you can adapt them to your needs)
- easy to understand (simple structure)
- easy to use (you can test all of them within a day)
- fully portable (written in Java)
It can be a good starting point to give them a try before:
- testing other solutions
- writing your own programs


16 Istat Web scraping chain [Diagram; labels: List of URLs, RootJuice, Scraped content, T/D Matrix generator, Final results, Learners]

17 Istat Web scraping chain: Step 1

18 Step 1 RootJuice (crawling/scraping)
It takes as input 3 files:
- a seed file containing the list of the URLs to be scraped (seed.txt)
- a list of web domains to avoid, i.e. directory domains (domainstofilterout.txt)
- a configuration file (rootjuiceconf.properties)

domainstofilterout.txt
yellowpages.com
domaintoavoid2
....
domaintoavoidn

rootjuiceconf.properties
# proxy configuration
PROXY_HOST = proxy.istat.it
PROXY_PORT = 3128
# technical parameters of the scraper
RESUMABLE_CRAWLING = false
NUM_OF_CRAWLERS = 10
MAX_DEPTH_OF_CRAWLING = 2
MAX_PAGES_TO_FETCH = -1
MAX_PAGES_PER_SEED =
# paths
CRAWL_STORAGE_FOLDER = specific path
CSV_FILE_PATH = specific path
LOG_FILE_PATH = specific path
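To show how a KEY = VALUE configuration file of this kind can be read, here is a minimal Python sketch. It is only an illustration of the file format: RootJuice itself is a Java program and reads its configuration in its own way, `parse_properties` is a hypothetical helper, and interpreting -1 as "no limit" is an assumption that the slides do not state.

```python
def parse_properties(text):
    """Parse simple 'KEY = VALUE' lines, skipping blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()  # a missing value becomes ""
    return settings


# A few of the parameters from the slide, used as sample input.
SAMPLE = """\
# technical parameters of the scraper
RESUMABLE_CRAWLING = false
NUM_OF_CRAWLERS = 10
MAX_DEPTH_OF_CRAWLING = 2
MAX_PAGES_TO_FETCH = -1
"""

conf = parse_properties(SAMPLE)
max_depth = int(conf["MAX_DEPTH_OF_CRAWLING"])
# Assumption: -1 is read as "no limit on the number of pages".
unlimited_pages = int(conf["MAX_PAGES_TO_FETCH"]) == -1
```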

19 Step 1 RootJuice (crawling/scraping)
- for each row of the seed file, if the URL is not in the list of the domains to avoid, the program tries to acquire the related HTML pages
- from each acquired HTML page, the program extracts just the textual content of the fields we are interested in and writes a line in a CSV file
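The check against the list of domains to avoid could look like the Python sketch below. How RootJuice actually matches URLs against that list is not described in the slides, so the matching rule (same host or a subdomain of a listed domain, ignoring a leading "www.") and all function names are assumptions made for illustration.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased host name of a URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_filtered(url, domains_to_avoid):
    """True if the URL belongs to a listed domain or to one of its subdomains."""
    host = host_of(url)
    return any(host == d or host.endswith("." + d) for d in domains_to_avoid)


def usable_seeds(seed_urls, domains_to_avoid):
    """Keep only the seed URLs that are not in the domains to avoid."""
    return [u for u in seed_urls if not is_filtered(u, domains_to_avoid)]


# Hypothetical seed URLs; yellowpages.com comes from the slide's example list.
seeds = [
    "http://www.firm-a.example/",
    "https://www.yellowpages.com/firm-b",
    "http://shop.firm-c.example/home",
]
kept = usable_seeds(seeds, {"yellowpages.com"})
```

Only the directory URL is dropped; the two firm websites remain in `kept` and would be passed on to the crawler.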

20 Step 1 RootJuice (crawling/scraping) The structure of each row of the produced CSV is this: id + TAB + url + TAB + imgsrc + TAB + imgalt + TAB + links + TAB + ahref + TAB + aalt + TAB + inputvalue + TAB + inputname + TAB + metatagdescription + TAB + metatagkeywords + TAB + firmid + TAB + sitoazienda + TAB + link_position + TAB + title + TAB + text_of_the_pagebody
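A row with this structure can be written and read back with Python's standard `csv` module using TAB as the delimiter. The field names below are taken from the slide; the helper functions and the sample values are hypothetical, and the sketch says nothing about how RootJuice itself escapes TABs or line breaks inside the scraped text.

```python
import csv
import io

# Field order as listed on the slide.
FIELDS = [
    "id", "url", "imgsrc", "imgalt", "links", "ahref", "aalt",
    "inputvalue", "inputname", "metatagdescription", "metatagkeywords",
    "firmid", "sitoazienda", "link_position", "title", "text_of_the_pagebody",
]


def write_rows(rows, handle):
    """Write dictionaries as TAB-separated lines; missing fields become empty."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for row in rows:
        writer.writerow([row.get(name, "") for name in FIELDS])


def read_rows(handle):
    """Read TAB-separated lines back into dictionaries keyed by field name."""
    reader = csv.reader(handle, delimiter="\t")
    return [dict(zip(FIELDS, values)) for values in reader]


# Round trip through an in-memory buffer with one made-up page.
buffer = io.StringIO()
write_rows(
    [{"id": "1", "url": "http://firm-a.example/", "firmid": "42",
      "title": "Home", "text_of_the_pagebody": "Welcome to firm A"}],
    buffer,
)
buffer.seek(0)
rows = read_rows(buffer)
```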

21 Istat Web scraping chain: Step 2

22 Step 2 Load scraped data into Solr
Now that we have the scraped textual content of the HTML pages, we need to index and persist it for further processing and searching. For this purpose we use Apache Solr, an open source enterprise search platform (and a NoSQL DB) built on top of Apache Lucene. It can be used for storing and searching any type of data. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document handling. Because it provides distributed search and index replication, Solr is highly scalable and therefore suitable for use in Big Data contexts.

23 Step 2 Load scraped data into Solr
It is possible to load documents into Solr in different ways; we wrote an ad hoc program that uses an API for Java called SolrJ.
SolrTSVImporter takes as input 2 files:
- a configuration file (solrtsvimporterconf.properties)
- the CSV file containing the scraped content produced by RootJuice (solrinput.csv)

solrinput.csv
id + TAB + url + TAB + imgsrc + TAB + imgalt + TAB + links + TAB + ahref + TAB + aalt + TAB + inputvalue + TAB + inputname + TAB + metatagdescription + TAB + metatagkeywords + TAB + firmid + TAB + sitoazienda + TAB + link_position + TAB + title + TAB + text_of_the_pagebody
row 1 with data
row 2 with data
row 3 with data
row N with data

solrtsvimporterconf.properties
# proxy configuration
PROXY_HOST = proxy.istat.it
PROXY_PORT = 3128
# Solr server configuration
SOLR_SERVER_URL = specify the url
SOLR_SERVER_QUEUE_SIZE = 100
SOLR_SERVER_THREAD_COUNT = 5
# paths
LOG_FILE_PATH = specific path
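SolrTSVImporter loads the rows through SolrJ. As one of the "different ways" of loading documents, the Python sketch below instead prepares a request for Solr's JSON update handler, which accepts a JSON array of documents posted to the `/update` path of a core. The core URL, the helper name and the sample document are hypothetical, the field names must match the schema of the target core, and the request is only built here, not sent.

```python
import json
import urllib.request


def build_update_request(solr_core_url, docs):
    """Prepare (but do not send) a POST of documents to a Solr core."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        solr_core_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core URL and document; real field names depend on the schema.
request = build_update_request(
    "http://localhost:8983/solr/firms/",
    [{"id": "1", "url": "http://firm-a.example/", "firmid": "42", "title": "Home"}],
)
# Sending would be: urllib.request.urlopen(request), given a running Solr server.
```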

24 Step 2 Load scraped data into Solr

25 Istat Web scraping chain: Step 3

26 Step 3 FirmsDocTermMatrixGenerator
It takes as input a configuration file:

# ============================================
# technical parameters of the program
# ============================================
# MAX_RESULTS = max num of documents per firm retrievable from storage platform
MAX_RESULTS =
FIRST_LANG = ITA
SECOND_LANG = ENG
# ============================================
# paths
# ============================================
SOLR_INDEX_DIRECTORY_PATH = specific/path/on/my/computer
MATRIX_FILE_FOLDER = specific/path/on/my/computer
GO_WORDS_FILE_PATH = specific/path/on/my/computer
STOP_WORDS_FILE_PATH = specific/path/on/my/computer
LOG_FILE_PATH = specific/path/on/my/computer
TREE_TAGGER_EXE_FILEPATH = specific/path/on/my/computer
FIRST_LANG_PAR_FILE_PATH = specific/path/on/my/computer
SECOND_LANG_PAR_FILE_PATH = specific/path/on/my/computer

27 Step 3 FirmsDocTermMatrixGenerator
The output will be a matrix having:
- in the first column, all the relevant stemmed terms found in all the documents
- in the first row, all the firm ids contained in the storage platform
- in each cell, the number of occurrences of the specific term in all the documents referring to the specific firm

T/D Matrix   firmid 1   firmid 2   firmid 3   firmid 4   ...   firmid N
term 1
term 2
term 3
...
term N
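The layout of this matrix can be reproduced in a few lines of Python. This is a minimal sketch of the counting only, not the FirmsDocTermMatrixGenerator code: the function name is hypothetical, and the input is assumed to be already reduced to lists of stemmed terms per document.

```python
from collections import Counter


def build_matrix(docs_by_firm):
    """Build a term/document matrix aggregated per firm.

    docs_by_firm maps a firm id to a list of documents, each document
    being a list of (already stemmed) terms.
    Returns (terms, firms, rows) with rows[i][j] = occurrences of
    terms[i] in all the documents of firms[j].
    """
    counts = {
        firm: Counter(term for doc in docs for term in doc)
        for firm, docs in docs_by_firm.items()
    }
    firms = sorted(counts)
    terms = sorted({term for c in counts.values() for term in c})
    rows = [[counts[firm][term] for firm in firms] for term in terms]
    return terms, firms, rows


# Two made-up firms with two scraped pages each.
terms, firms, rows = build_matrix({
    "firm1": [["shop", "cart"], ["shop"]],
    "firm2": [["contact"], ["cart"]],
})
```

In this example "shop" occurs twice across the pages of firm1 and never for firm2, so its row is `[2, 0]`; a `Counter` returns 0 for terms a firm never uses, which fills the empty cells.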

28 Step 3 FirmsDocTermMatrixGenerator
The words are obtained in this way:
- all the words present in Solr are retrieved
- all the words having fewer than 3 or more than 25 characters are discarded
- all the words not recognized as "first language" words or "second language" words are discarded
- the "first language" words are lemmatized with TreeTagger and stemmed with SnowballStemmer
- the "second language" words are lemmatized with TreeTagger and stemmed with SnowballStemmer
- the words contained in a "go word list" are added to the word list
- the words contained in a "stop word list" are removed from the word list
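The filtering part of this procedure can be sketched in Python as follows. The lemmatization and stemming steps (TreeTagger, SnowballStemmer) are deliberately left out, and language recognition is replaced by a simple lookup in two hypothetical vocabularies, so this is only an illustration of the order of the filters, under those assumptions.

```python
def select_words(words, first_lang_vocab, second_lang_vocab, go_words, stop_words):
    """Apply the length, language, go-word and stop-word filters in order."""
    kept = set()
    for word in words:
        if not 3 <= len(word) <= 25:
            continue  # fewer than 3 or more than 25 characters
        if word not in first_lang_vocab and word not in second_lang_vocab:
            continue  # not recognized in either language (assumed: vocabulary lookup)
        kept.add(word)
    kept |= set(go_words)    # go words are always added
    kept -= set(stop_words)  # stop words are always removed
    return sorted(kept)


# Made-up vocabularies and word lists for illustration.
selected = select_words(
    words=["shop", "it", "carrello", "xyzzy", "the", "and"],
    first_lang_vocab={"carrello", "negozio"},
    second_lang_vocab={"shop", "the", "and"},
    go_words=["b2b"],
    stop_words=["the", "and"],
)
```

Note that a go word such as "b2b" enters the list even though it would fail the language check, while stop words are removed last, so a word present in both lists ends up excluded.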

29 Istat Web scraping chain: Step 4

30 Thank you for your attention!
