Building a Faceted Browser in CouchDB Using Views on Views and Erlang Metaprogramming


Building a Faceted Browser in CouchDB Using Views on Views and Erlang Metaprogramming
WFLP-2011, Odense, July 19 2011
claus.zinn@uni-tuebingen.de
The NaLiDa Project (Nachhaltigkeit Linguistischer Daten)
http://www.sfs.uni-tuebingen.de/nalida/

Outline
- infrastructure (in Linguistics)
- views on views

State of affairs in the Humanities (and elsewhere)
- no systematic management of the underlying research data
- increasing pressure from funding agencies to document and publish research data
- eScience infrastructure is needed to
  - support reproduction of results over identical data sets
  - increase scientific quality and fight fraud in science
  - help avoid unintended duplication of research work

NaLiDa Project
- contributes to infrastructure for language resources (corpora, lexica, ...) and software tools (part-of-speech taggers, parsers, ...)
- supports the scientific community with infrastructure building, metadata management and storage
- assists institutions in systematically describing and exposing their research with metadata in the form of XML-based documents
- increases access to and visibility of resources

Data Aggregation and Exposure
[Figure: metadata providers (XML A, XML B, XML C) are harvested via OAI-PMH at regular intervals into a document storage; new providers may join.]

Metadata Descriptions in Linguistics
- can be very detailed, with large variety in the usage of metadata field descriptors and their structural organisation
- most of the information is of little use for most users
- some pieces of information matter for most users

Increasing Popularity of Faceted Browsing
- well suited for naive users to explore large data sets with a small but informative set of facets
- customers can identify products along many dimensions
- facets, their value ranges, and the number of corresponding items show the structure and content of the search space
- many users learn the main criteria for navigation

Facet Selection
- governed by the search for a common denominator across collections
- will yield a rather small set of (semantically similar) metadata fields
- main facets: organisation, language, resource type, modality
- conditional facets such as lifecycle status and tool type (if resource type is tool)

Facetification
- facets: F_1, ..., F_n with value ranges {f_11, ..., f_1n}, ..., {f_n1, ..., f_nm}
- a document must be indexed by at least one facet-value pair
- a document can be described by more than one value f_ij for F_i
- example: metadata for a multimodal corpus with F_i = modality and values f_ij = gesture, sign language, and spoken language

Computations
[Figure: facets drawn as rings whose segments are facet values; the "Languages" ring shows segments such as German, English, French, Dutch, Sign Language, British Sign Language, Swedish Sign Language, German Sign Language, Georgian, Hungarian, Italian, Latin, Russian.]
- Once a facet-value pair f_ik is selected, the document set of f_ik must be intersected with each of the value subsets of every other facet F_j, with 1 ≤ j ≤ n and j ≠ i.
- In terms of the figure: the document set of ring segment f_ik must be intersected with the document sets of all segments of all rings other than F_i.
- When users select facet F_i with value f_ik and facet F_j with value f_jl:
  - first build the intersection of the two corresponding document collections
  - then intersect the (non-empty) result with all ring segments of all rings other than F_i and F_j
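The intersection step can be illustrated with a small sketch. It is not the implementation from the talk (which uses CouchDB views, shown later); the in-memory `index`, the document ids, and the function names `intersect` and `facetCounts` are assumptions made for illustration only.

```javascript
// Hypothetical in-memory index: facet -> value -> set of document ids.
const index = {
  country: { Germany: new Set(["d1", "d2", "d3"]), Netherlands: new Set(["d4"]) },
  language: { German: new Set(["d1", "d2"]), Dutch: new Set(["d3", "d4"]) },
  modality: { speech: new Set(["d1", "d4"]), gesture: new Set(["d2"]) },
};

// Intersection of two sets of document ids.
function intersect(a, b) {
  return new Set([...a].filter((id) => b.has(id)));
}

// selection maps already chosen facets to their chosen value (at least one entry).
// Returns the size of the selected document set and, for every other facet,
// the number of selected documents per value (empty intersections are dropped).
function facetCounts(index, selection) {
  const docs = Object.entries(selection)
    .map(([facet, value]) => index[facet][value] || new Set())
    .reduce((acc, set) => intersect(acc, set));
  const counts = {};
  for (const [facet, values] of Object.entries(index)) {
    if (facet in selection) continue; // skip the rings already selected
    for (const [value, set] of Object.entries(values)) {
      const n = intersect(docs, set).size;
      if (n > 0) {
        if (!counts[facet]) counts[facet] = {};
        counts[facet][value] = n;
      }
    }
  }
  return { total: docs.size, counts };
}

console.log(facetCounts(index, { country: "Germany" }));
console.log(facetCounts(index, { country: "Germany", language: "German" }));
```

Computing these intersections at query time is what the approach in the talk avoids: the views described later precompute them.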

Requirements
- cope with metadata heterogeneity, given that documents will adhere to different schemas, each defining its own structured set of descriptors and values
- preserve the original format of all metadata descriptions, and consider storing primary data in addition to the metadata describing it
- handle regular additions to the document storage with only incremental updates for document access
- provide effective and user-friendly access to all documents
- use a REST-based approach to make the data storage web-accessible for reading and writing

CouchDB
- schema-less database design permits the inclusion of arbitrarily structured documents into the database
- the original metadata format can be preserved, and primary data can also be associated with the metadata describing it
- map-reduce framework promises incrementality and scalability
- features a REST-based interface for document uploading, downloading and querying
- also hosts the GUI, and provides a Lucene port

Views
- correspond to hardwired DB queries; also stored in CouchDB
- once a query is executed, its result is also stored
- defined in terms of map & reduce
- written in Erlang, JavaScript, and other languages

Motivation
- process lots of data to produce other data
- using many CPUs
- supporting automatic parallelization & distribution, fault tolerance, I/O scheduling, status and monitoring

Programming Model: Map
- processes input documents (key-value pairs)
- produces a set/table of intermediate pairs:

    map(in_key, in_value) -> list(out_key, intermed_value)

- must be referentially transparent: given a document, the function will always emit the same key-value pairs
- the document indexing process is incremental and can run in parallel
- can be written in JavaScript and Erlang (& other ports)

Programming Model: Reduce
- combines all values for a particular key
- produces a set of merged output values (usually just one):

    reduce(out_key, list(intermed_value)) -> list(out_value)

- a map function can be complemented by a reduce function
- takes as input the table of emitted values with identical keys, as generated by the map function, and aggregates them, e.g., summing up the values associated with the same key:

    function(key, values) { return sum(values); }

- must be referentially transparent, commutative and associative
- must be callable with the output of the map process, but also with intermediate values computed by a prior reduce (rereduce)
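To make the rereduce requirement concrete, here is a minimal in-memory sketch of a map-reduce run. It only mimics the shape of CouchDB's JavaScript view functions (`emit(key, value)` and `reduce(keys, values, rereduce)`); the driver `runView`, the chunk size, and the sample documents are assumptions for illustration and do not reflect how CouchDB schedules the work internally.

```javascript
// Map: emit one (organisation, 1) pair per document that names an organisation.
function mapOrganisation(doc, emit) {
  if (doc.organisation) emit(doc.organisation, 1);
}

// Reduce: summing is commutative and associative, so the same function can
// handle both map output (rereduce = false) and partial sums (rereduce = true).
function reduceSum(keys, values, rereduce) {
  return values.reduce((a, b) => a + b, 0);
}

// Hypothetical driver: groups map output by key, reduces it in small chunks,
// then rereduces the partial results.
function runView(docs, mapFn, reduceFn, chunkSize = 2) {
  const grouped = new Map();
  for (const doc of docs) {
    mapFn(doc, (key, value) => {
      if (!grouped.has(key)) grouped.set(key, []);
      grouped.get(key).push(value);
    });
  }
  const result = {};
  for (const [key, values] of grouped) {
    const partials = [];
    for (let i = 0; i < values.length; i += chunkSize) {
      partials.push(reduceFn([key], values.slice(i, i + chunkSize), false));
    }
    result[key] = reduceFn(null, partials, true);
  }
  return result;
}

const sampleDocs = [
  { organisation: "Org A" },
  { organisation: "Org A" },
  { organisation: "Org A" },
  { organisation: "Org B" },
  { title: "no organisation given" },
];
console.log(runView(sampleDocs, mapOrganisation, reduceSum));
```

A reduce function that is not associative (for example, one that averages its inputs) would give different answers depending on how the values are chunked, which is why the constraint matters.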

Framework
[Figure: documents are passed to map processes, which produce values for key-1, key-2, key-3; the intermediate values are aggregated by key and passed to reduce processes, which produce the final values for key-1, key-2, key-3.]

Stages
1. ingestion: OAI-PMH-harvested documents are validated against their schema, converted from XML to JSON, supplied with unique id, timestamp, source, and schema information, and added to the DB with the original XML as attachment
2. indexing: to attack data heterogeneity at schema level
3. curation: to address variability in facet values
4. faceted search indexing: to precompute all possible queries
5. presentation: to give users navigation access to datasets

Document Indexing with map-reduce
- document indexing tackles data heterogeneity, given that documents may adhere to different schemas

Map Example (template)

    function(doc) {
      switch (doc.schema) {
        case "<reference_to_schema_a>":
          if (<tree_has_node>) {
            emit(<path_to_node_val>, 1);
          }
          break;  // always leave the case, even if the node is missing
        case "<reference_to_schema_b>":
          [...]
        [...]
      }
    }

Map to index organisations (fragment)

    function(doc) {
      switch (doc.schema) {
        case "http://catalog.clarin.eu/ds/componentregistry/rest/registry/[...]1694580/xsd":
          // guard every step of the path before reading the leaf value
          if (doc.cmd &&
              doc.cmd.components &&
              doc.cmd.components.textcorpusprofile &&
              doc.cmd.components.textcorpusprofile.generalinfo &&
              doc.cmd.components.textcorpusprofile.generalinfo.legalowner &&
              doc.cmd.components.textcorpusprofile.generalinfo.legalowner.$t) {
            emit(doc.cmd.components.textcorpusprofile.generalinfo.legalowner.$t, 1);
          }
          break;
        case "http://theharvestingday.eu/schemas/clarin_bamdes-1.1.xsd":
          if (doc.lexicalresource &&
              doc.lexicalresource.organization &&
              doc.lexicalresource.organization.$t) {
            emit(doc.lexicalresource.organization.$t, 1);
          }
          break;
        ...
      }
    }

Map Result (organisations)
[Figure: map result for the organisations view.]

Reduce Result (organisations)
[Figure: reduce result for the organisations view.]
Note: need for data curation.

Document Indexing with map-reduce
- initially, views were coded manually and adapted after each schema change
- but this is tedious and prone to error
- now: automatic generation of views from a declarative facet specification, using JavaScript (string concatenation)

Facet specification

    { "facet" : "modality",
      "pathinfos" : [
        { "schema": "http://catalog.clarin.eu/...:cr1:p_129094580/...",
          "path" : "doc.cmd.components.textcorpusprofile..." },
        { "schema": "http://catalog.clarin.eu/...:cr1:p_129094579/...",
          "path" : "doc.cmd.components.lexicalresourceprofile..." },
        ...
      ] }
    { "facet" : "language", "pathinfos" : [ ... ] }
    [...]
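The generation step might look roughly like the following sketch, which turns a facet specification into the source text of a map function by string concatenation. The generator from the talk is not shown on the slides, so `generateMapSource`, `guardFor`, `compileMap`, and the placeholder schema names and paths are all assumptions.

```javascript
// Hypothetical facet specification; schema names and paths are placeholders.
const organisationSpec = {
  facet: "organisation",
  pathinfos: [
    { schema: "schema-a", path: "doc.a.owner.$t" },
    { schema: "schema-b", path: "doc.b.organization.$t" },
  ],
};

// "doc.a.b" -> "doc.a && doc.a.b": guard every prefix of the path.
// Assumes the path starts with "doc" and has at least one further segment.
function guardFor(path) {
  const parts = path.split(".");
  return parts
    .slice(1)
    .map((_, i) => parts.slice(0, i + 2).join("."))
    .join(" && ");
}

// Build the source code of a map function with one case per schema.
function generateMapSource(spec) {
  let src = "function(doc) {\n  switch (doc.schema) {\n";
  for (const { schema, path } of spec.pathinfos) {
    src += "    case " + JSON.stringify(schema) + ":\n";
    src += "      if (" + guardFor(path) + ") { emit(" + path + ", 1); }\n";
    src += "      break;\n";
  }
  src += "  }\n}";
  return src;
}

// Local test helper: inside CouchDB, emit is provided by the query server.
function compileMap(source, emit) {
  return new Function("emit", "return " + source + ";")(emit);
}

const emitted = [];
const mapFn = compileMap(generateMapSource(organisationSpec), (k, v) => emitted.push([k, v]));
mapFn({ schema: "schema-b", b: { organization: { $t: "Some Institute" } } });
mapFn({ schema: "schema-a" }); // path missing: nothing is emitted
console.log(generateMapSource(organisationSpec));
console.log(emitted);
```

In a setup like the one described, the generated source string would be stored as the `map` field of a view in a design document rather than compiled locally.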

Data Curation
- each map function gives a view of the document space in terms of the facet it represents
- analysis shows large variability for many facet values, e.g., organisations appearing under different names
- devised curation tables that map given names to preferred names
- data curation is performed on the indices (for faceted search) rather than on the original documents

Conversion of Views to Documents
- faceted search is to be defined in terms of the document indexing established in the first map-reduce cycle
- but CouchDB's map-reduce framework is defined in terms of documents
- thus, it is not possible to define views on views, at least not directly

Views on Views
- re-use the result of document indexing by converting the resulting views into documents
- the conversion takes care of data curation
- the conversion is written in JavaScript, implementing a hash table of hash tables
  - the outer hash table gives access to the facets (e.g., language)
  - the inner hash table gives access to all the values a chosen facet can take, e.g., associating the key German with all documents carrying this piece of information
- the new index (of type docindex) is stored in an extra DB, which also holds all views that implement faceted search
- one index file for each document collection
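A minimal sketch of such a conversion, assuming the rows of each indexing view are available as `{ key, id }` objects (the shape of CouchDB map-view rows) and that a curation table maps given names to preferred names. The function name `buildDocIndex`, the `collection` field, and the sample names are hypothetical; only the `doctype` value `docindex` comes from the slides.

```javascript
// viewRowsByFacet: facet name -> rows of the corresponding indexing view.
// curation: facet name -> table mapping given names to preferred names.
function buildDocIndex(collection, viewRowsByFacet, curation) {
  const index = { doctype: "docindex", collection }; // outer hash table
  for (const [facet, rows] of Object.entries(viewRowsByFacet)) {
    const table = curation[facet] || {};
    const inner = {}; // inner hash table: curated value -> document ids
    for (const { key, id } of rows) {
      const preferred = Object.prototype.hasOwnProperty.call(table, key) ? table[key] : key;
      if (!inner[preferred]) inner[preferred] = [];
      if (!inner[preferred].includes(id)) inner[preferred].push(id);
    }
    // Keep each id list sorted and duplicate-free so that it can be treated
    // as an ordered set by later set intersections.
    for (const value of Object.keys(inner)) inner[value].sort();
    index[facet] = inner;
  }
  return index;
}

// Hypothetical view rows and curation table.
const rowsByFacet = {
  organisation: [
    { key: "Institute X (short name)", id: "d2" },
    { key: "Institute X", id: "d1" },
    { key: "Institute X (short name)", id: "d2" },
  ],
  language: [{ key: "German", id: "d1" }, { key: "German", id: "d2" }],
};
const curationTables = {
  organisation: { "Institute X (short name)": "Institute X" },
};

console.log(JSON.stringify(buildDocIndex("collection-1", rowsByFacet, curationTables), null, 2));
```

Because curation happens while the index document is built, the original metadata documents stay untouched, in line with the Data Curation slide.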

Document index for one collection
[Figure: document index for one collection.]

Map View for Country

    fun ({Doc}) ->
        case proplists:get_value(<<"doctype">>, Doc) of
            <<"docindex">> ->
                {CountryHash} = proplists:get_value(<<"country">>, Doc, {[]}),
                {LanguageHash} = proplists:get_value(<<"language">>, Doc, {[]}),
                <other hashes>
                lists:foreach(fun (CountryItem) ->
                    DocSet = proplists:get_value(CountryItem, CountryHash),
                    DocSetSize = ordsets:size(DocSet),
                    if
                        DocSetSize > 0 ->
                            %% all documents for this country
                            Emit(CountryItem, {[{<<"facet">>, <<"_total_">>},
                                                {<<"value">>, <<"_total_">>},
                                                {<<"docs">>, DocSet}]}),
                            %% counts for each language within this country
                            lists:foreach(fun (LanguageItem) ->
                                Intersection = ordsets:intersection(
                                    proplists:get_value(LanguageItem, LanguageHash),
                                    proplists:get_value(CountryItem, CountryHash)),
                                case Intersection == [] of
                                    false ->
                                        Emit(CountryItem, {[{<<"facet">>, <<"language">>},
                                                            {<<"value">>, LanguageItem},
                                                            {<<"docs">>, ordsets:size(Intersection)}]});
                                    _ -> ok
                                end
                            end, proplists:get_keys(LanguageHash)),
                            <other intersections for other facets [...]>
                        true -> ok
                    end
                end, proplists:get_keys(CountryHash));
            _ -> ok
        end
    end.

Result for Country View (fragment)
[Figure: fragment of the result for the country view.]

Reduce Function (common to FB views)

    fun (Key, Values) ->
        %% merge one emitted entry into the dictionary
        AddToDict = fun (CurrentEntry, Dict) ->
            {[{<<"facet">>, Facet}, {<<"value">>, Value}, {<<"docs">>, Documents}]} = CurrentEntry,
            DictKey = {Facet, Value},
            case Facet of
                %% totals carry document lists, which are appended
                <<"_total_">> -> dict:append_list(DictKey, Documents, Dict);
                %% all other entries carry counts, which are summed
                _ -> dict:update(DictKey, fun (Old) -> Old + Documents end, Documents, Dict)
            end
        end,
        %% turn the dictionary back into a list of facet/value/docs entries
        DictToList = fun (Dict) ->
            lists:map(fun (Entry) ->
                {{Facet, Value}, Docs} = Entry,
                {struct, [{<<"facet">>, Facet}, {<<"value">>, Value}, {<<"docs">>, Docs}]}
            end, dict:to_list(Dict))
        end,
        DictToList(lists:foldl(fun (Value, Dict) -> AddToDict(Value, Dict) end, dict:new(), Values))
    end.

Coding of Views
- initially, views were coded manually in JavaScript
- but poor performance in view computation on large index files led to the use of Erlang instead, which resulted in a significant performance boost
- writing views by hand is tedious and prone to error
- have written Erlang code that generates the code definitions for Erlang views automatically
- Erlang meta-code is based on the concatenation of Erlang code strings

Facet specification

    -define( FACETS, ["country", "language", "modality", "organisation", "resourceclass"] ).
    -define( COND_FACETS, [ { "resourceclass", "corpus", ["genre"] },
                            { "resourceclass", "Tool", ["tooltype", "applicationtype", "inputtype",
                                                        "outputtype", "lifecyclestatus"] } ] ).

Coding of Views (cont'd)
- the specification leads to the generation of 121 views, with each view having between 5000 and 12000 bytes of Erlang code
- not all possible combinations of set intersections are necessary: the document sets resulting from first selecting facet F_1 and then facet F_2 are identical to those obtained when F_2 is selected first and then F_1
- realized the computation of all necessary intersections using Erlang combinators

Use of Erlang Combinators

    comb_4(L) ->
        case length(L) < 4 of
            true -> "supply lists with length >= 4";
            _ -> [ {A,B,C,D,Z} || A <- L,
                                  B <- L--[A], A < B,
                                  C <- L--[A,B], B < C,
                                  D <- L--[A,B,C], C < D,
                                  Z <- [L--[A,B,C,D]] ]
        end.
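The order-independence argument means that only unordered facet selections need a view. A small JavaScript sketch of that enumeration follows; it is an illustration, not a port of the Erlang combinators, and the function name `combinations` is an assumption. The facet names are the five unconditional facets reported in the experimental setting.

```javascript
// All k-element subsets of items, in the order the items are listed.
function combinations(items, k) {
  if (k === 0) return [[]];
  if (items.length < k) return [];
  const [head, ...tail] = items;
  const withHead = combinations(tail, k - 1).map((c) => [head, ...c]);
  return withHead.concat(combinations(tail, k));
}

const facets = ["country", "language", "modality", "organisation", "genre"];

for (let k = 1; k <= facets.length; k++) {
  console.log(k, combinations(facets, k).length);
}
console.log(combinations(facets, 2).map((c) => c.join(" : ")));
```

For five facets this yields 5, 10, 10, 5 and 1 selections for one to five facets, 31 in total, which is consistent with the ten 2-level views, ten 3-level views, five 4-level views and 31 map-reduce pairs reported in the talk.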

GUI
[Figure: graphical user interface of the faceted browser.]

Queries = View request

    /mpi_mgt/_design/country/_view/country?key="Germany"&reduce=true

View result

    {"rows":[
      {"key":"Germany",
       "value":[
         {"facet":"modality","value":"unspecified","docs":140},
         {"facet":"modality","value":"speech/gestures","docs":230},
         {"facet":"language","value":"German Sign Language","docs":433},
         {"facet":"genre","value":"secondary document","docs":3},
         {"facet":"genre","value":"movie","docs":458},
         {"facet":"_total_","value":"_total_",
          "docs":["oai:www.mpi.nl:mpi100", "oai:www.mpi.nl:mpi1002978"...]}
         [...]
       ]}]}
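CouchDB expects the `key` query parameter to be a JSON value, so a string key is sent with its double quotes, URL-encoded. The following sketch builds such a request URL; the helper name `viewUrl` and the base URL `http://localhost:5984` (CouchDB's default port on a local machine) are assumptions, while the database, design document and view names are taken from the request above.

```javascript
// Build the URL of a CouchDB view request for a single key.
function viewUrl(base, db, design, view, key, reduce = true) {
  const query =
    "key=" + encodeURIComponent(JSON.stringify(key)) +
    "&reduce=" + (reduce ? "true" : "false");
  return `${base}/${db}/_design/${design}/_view/${view}?${query}`;
}

// Pick the entries of one facet out of a reduced view result
// shaped like the "View result" shown above.
function entriesForFacet(result, facet) {
  const row = result.rows[0];
  return row ? row.value.filter((entry) => entry.facet === facet) : [];
}

console.log(viewUrl("http://localhost:5984", "mpi_mgt", "country", "country", "Germany"));
```

The GUI could issue one such request per user selection and render the returned facet-value counts directly, since all intersections were computed when the view was built.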

Views for Document Indexing
- views for document indexing are automatically generated from the facet specification using JavaScript
- the resulting map and reduce functions are in JavaScript too, CouchDB's default view language
- computation of the view organisation takes approximately 25 minutes on 86k documents (one-time payoff)
- no effort has been made yet to increase the speed of view computation
- small changes in the document database will have only a small impact on view recomputation at the document indexing level

Views for Faceted Search
- computation of faceted search views is computationally expensive
- JavaScript is too slow
- Erlang is much faster (better in memory and processor usage)

Setting
- each Erlang view is stored in a separate design document
- map-reduce computation executed on a 24-core, 96 GB machine
- harvested and ingested approximately 86,000 metadata documents on language resources
- five unconditional facets: language (371), country (67), organisation (39), modality (32), and genre (50)
- many different facet values: modality = speech (59463); language = Dutch (18345); country = Germany (16178); organisation = Max Planck Institute for Psycholinguistics (16568); genre = Discourse (33676)
- 31 different map-reduce pairs

Computation
- generation of the views language, country, organisation, modality, and genre takes altogether less than one minute (using 5 CPUs)
- the ten 2-level views (users selected two facets, e.g., country : genre, country : language, ...) were computed in less than 1 minute (using 10 CPUs)
- computation of the ten 3-level views, where users selected three facets: < 7.5 minutes
- computation of the five 4-level views: more than 2 hours

Future work for optimisation
- currently, there is one indexing document for each of the metadata providers
- an update from one data provider only requires a limited view recomputation
- but some data providers supply tens of thousands of documents
- optimise index documents for faceted search
  - reflect additions by a new index document, so that incremental updates are indeed limited to document additions
  - reflect modifications and deletions by introducing MODIFY and DELETE lists that a revised map-reduce combination would need to consider

Related Work: Flamenco
- toolkit with web-based interface to give faceted access to large data collections
- given import format:
  - the file facets.tsv, listing all facets
  - the file attrs.tsv, listing all attributes of a given item
  - the file items.tsv, listing each collection item (following the definition in attrs.tsv) with a unique id
  - for each entry facet in facets.tsv:
    - the file facet_term lists all terms for the given facet with unique facet term ids
    - the file facet_map associates unique facet term ids with item ids
- data files are ingested into the Flamenco relational database (MySQL)
- Flamenco generates the faceted browser's default/customizable GUI
- the user's selection of facet terms is translated into corresponding MySQL queries to compute all necessary set intersections
- results of executing MySQL queries are cached to avoid re-computation

Flamenco used in VLO
- faceted search access to language resources using Flamenco with the same dataset; see http://www.clarin.eu/vlo
- used Perl to translate 80,000+ XML-based metadata files into Flamenco's indexing data format (incl. curation)
- ingested the data into the Flamenco database and adapted the GUI
- script to generate all queries to warm up the cache

Comparison
- the data preparation required for Flamenco roughly corresponds to our CouchDB-based document indexing phase (simple views)
- in our approach, data curation only happens when the views of the indexing phase are converted into the indexing documents
- the MySQL queries fired by Flamenco correspond to the views computed in terms of the indexing documents

Advantages of CouchDB
- also stores the original metadata documents (with varying schemata), and thus also serves as permanent storage
- conditional facets
  - contribute to usability, guiding users' navigation
  - need only be computed in subsets whose documents are indexed against the terms the conditional facet depends on
- index generation accommodates incremental updates on the metadata sets, supporting regular harvesting without recomputing all indices/views
- in Flamenco, any change in the data set requires overwriting all contexts/caches
- the facet specification offers a more declarative view
  - index generation is taken to a higher level; easy to experiment with different facet configurations
  - but once the facet specification is changed, index generation starts from scratch

- CouchDB with its native language Erlang is well suited for the development of industrial-strength applications
- CouchDB's REST-based interface offers a lean alternative to established software (Java-based Apache Tomcat webserver)
- Erlang's main limitation is the lack of a full macro package allowing users to write programs that write other programs
  - a Common Lisp-like defmacro would have made life easier
  - currently, no strong support for a Lisp (or Haskell) port to index and query documents in CouchDB
- CouchDB's main limitation when used with Erlang is the lack of documentation and example code available

- general approach to aggregate heterogeneously structured documents and to make them accessible via faceted (and full-text) search
- works as long as the documents' relevant content can be given in JSON (CouchDB's native format)
- for the given context, facet specification was straightforward
- desirable to detect good facet candidates automatically
  - the Castanet algorithm requires the definition of target terms to best reflect the topics present in a given collection
  - it combines target terms with hypernymy (IS-A) information from WordNet both to build facet hierarchies and to assign documents to the facets

Questions