Building a Faceted Browser in CouchDB Using Views on Views and Erlang Metaprogramming

WFLP-2011, Odense, July 19, 2011
claus.zinn@uni-tuebingen.de
The NaLiDa Project: Nachhaltigkeit Linguistischer Daten (Sustainability of Linguistic Data)
http://www.sfs.uni-tuebingen.de/nalida/

Infrastructure (in Linguistics)

State of affairs in the Humanities (and elsewhere)
- no systematic management of the underlying research data
- increasing pressure from funding agencies to document and publish research data
- eScience infrastructure is needed to
  - support the reproduction of results over identical data sets
  - increase scientific quality and fight fraud in science
  - help avoid unintended duplication of research work

NaLiDa Project
- contributes to infrastructure for language resources (corpora, lexica, ...) and software tools (part-of-speech taggers, parsers, ...)
- supports the scientific community with infrastructure building, metadata management and storage
- assists institutions in systematically describing and exposing their research with metadata, in terms of XML-based documents
- increases access to and visibility of resources

Data Aggregation and Exposure
[Figure: metadata providers (XML A, XML B, XML C) are harvested via OAI-PMH into a document storage]
- harvesting at regular intervals
- new providers may join

Metadata Descriptions in Linguistics
- can be very detailed, with large variety in the usage of metadata field descriptors and their structural organisation
- most of the information is of little use for most users
- some pieces of information matter for most users

Increasing Popularity of Faceted Browsing
- well suited for naive users exploring large data sets with a small but informative set of facets
- customers can identify products along many dimensions
- facets, their value ranges, and the number of corresponding items show the structure and content of the search space
- many users learn the main criteria for navigation

Facet Selection
- governed by the search for a common denominator across collections
- will yield a rather small set of (semantically similar) metadata fields
- main facets: organisation, language, resource type, modality
- conditional facets such as lifecycle status and tool type if the resource type is tool

Facetification
- facets: $F_1, \dots, F_n$ with value ranges $\{f_{11}, \dots, f_{1n}\}, \dots, \{f_{n1}, \dots, f_{nm}\}$
- a document must be indexed by at least one facet-value pair
- a document can be described by more than one value $f_{ij}$ for $F_i$
  - e.g., metadata for a multimodal corpus with $F_i$ = modality and values $f_{ij}$ = gesture, sign language, and spoken language

Computations
[Figure: facets drawn as concentric rings; the "Languages" ring has segments such as German, English, French, Dutch, Sign Language, British Sign Language, Swedish Sign Language, German Sign Language, Georgian, Hungarian, Italian, Latin, Russian]
- Once a facet-value pair $f_{ik}$ is selected, the corresponding document set must be intersected with each of the value subsets of every other facet $F_j$, with $1 \le j \le n$, $j \ne i$: the document set of ring segment $f_{ik}$ must be intersected with the document sets of all segments of all rings other than $F_i$.
- When users select facet $F_i$ with value $f_{ik}$ and facet $F_j$ with value $f_{jl}$:
  - first, build the intersection between the two corresponding document collections
  - then, intersect the (non-empty) result with all ring segments of all rings other than $F_i$ and $F_j$
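The intersection step can be illustrated with a minimal JavaScript sketch. The index layout (facet, then value, then list of document ids), the document ids, and the function names are assumptions made for this example; they are not taken from the talk.

```javascript
// Hypothetical facet index: facet -> value -> array of document ids.
const facetIndex = {
  language: { German: ["d1", "d2", "d3"], Dutch: ["d4"] },
  modality: { speech: ["d1", "d4"], gesture: ["d2", "d3"] },
};

// Intersect two arrays of document ids.
function intersect(a, b) {
  const inB = new Set(b);
  return a.filter((id) => inB.has(id));
}

// After selecting (facet, value), count the documents that remain in every
// segment of every other ring; empty intersections are dropped.
function countsAfterSelection(index, facet, value) {
  const selected = index[facet][value] || [];
  const counts = {};
  for (const [otherFacet, segments] of Object.entries(index)) {
    if (otherFacet === facet) continue; // skip the ring of the selected facet
    for (const [otherValue, docs] of Object.entries(segments)) {
      const n = intersect(selected, docs).length;
      if (n > 0) counts[otherFacet + "=" + otherValue] = n;
    }
  }
  return counts;
}

const germanCounts = countsAfterSelection(facetIndex, "language", "German");
console.log(germanCounts);
```

Selecting a second facet value would repeat the same step with the intersection of both selected document sets as the starting set.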

Requirements
- cope with metadata heterogeneity, given that documents will adhere to different schemas, each defining its own structured set of descriptors and values
- preserve the original format of all metadata descriptions, and consider storing primary data in addition to the metadata describing it
- handle regular additions to the document storage with only incremental updates for document access
- provide effective and user-friendly access to all documents
- use a REST-based approach to make the data storage web-accessible for reading and writing

CouchDB
- schema-less database design permits the inclusion of arbitrarily structured documents into the database
- original metadata format can be preserved, and primary data can also be associated with the metadata describing it
- map-reduce framework promises incrementality and scalability
- features a REST-based interface for document uploading, downloading and querying
- also hosts a GUI, and provides a Lucene port

Views
- correspond to hardwired DB queries; also stored in CouchDB
- once a query is executed, its result is also stored
- defined in terms of map & reduce
- written in Erlang, JavaScript, and other languages

Motivation
- process lots of data to produce other data
- using many CPUs
- supporting automatic parallelization & distribution, fault tolerance, I/O scheduling, status and monitoring

Programming Model: Map
- processes input documents (key-value pairs)
- produces a set/table of intermediate pairs
    map(in_key, in_value) -> list(out_key, intermed_value)
- must be referentially transparent: given a document, the function will always emit the same key-value pairs
- the document indexing process is incremental and can run in parallel
- can be written in JavaScript and Erlang (& other ports)
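To make the map contract concrete, here is a small stand-alone sketch that mimics what a view server does with a map function. It is not CouchDB code: in CouchDB, `emit` is a global provided by the view server, whereas this sketch passes it as a second argument. The helper `runMap` and the sample documents are hypothetical.

```javascript
// Tiny stand-in for a view server: run a map function over all documents
// and collect the emitted rows, sorted by key.
function runMap(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    const emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  return rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// A referentially transparent map function: its output depends only on doc.
function mapLanguage(doc, emit) {
  if (doc.language) emit(doc.language, 1);
}

// Hypothetical sample documents.
const sampleDocs = [
  { _id: "a", language: "German" },
  { _id: "b", language: "Dutch" },
  { _id: "c" }, // no language field: nothing is emitted
];

const languageRows = runMap(mapLanguage, sampleDocs);
console.log(languageRows);
```

Because each document is processed independently, adding a document only requires running the map function on that one document, which is what makes incremental and parallel indexing possible.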

Programming Model: Reduce
- combines all values for a particular key
- produces a set of merged output values (usually just one)
    reduce(out_key, list(intermed_value)) -> list(out_value)
- a map function can be complemented by a reduce function
- takes as input the table of emitted values with identical keys, as generated by the map function, and aggregates them, e.g., summing up the values associated with the same key:
    function(key, values) { return sum(values); }
- must be referentially transparent, commutative and associative
- must be callable with the output of the map process, but also with intermediate values computed by a prior reduce (rereduce)
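The rereduce requirement can be checked with a short sketch. CouchDB's JavaScript reduce functions receive three arguments (keys, values, rereduce); the `sum` helper is defined locally here so the example runs outside CouchDB, and `null` is passed for the keys to keep the example short.

```javascript
// Local stand-in for the sum helper available to CouchDB reduce functions.
const sum = (xs) => xs.reduce((a, b) => a + b, 0);

// Same shape as a CouchDB JavaScript reduce: (keys, values, rereduce).
// First pass: values are the 1s emitted by map.
// Rereduce:   values are partial sums returned by earlier reduce calls.
// Summation is commutative and associative, so one body serves both cases.
function countDocs(keys, values, rereduce) {
  return sum(values);
}

const partialA = countDocs(null, [1, 1, 1], false);
const partialB = countDocs(null, [1, 1], false);
const viaRereduce = countDocs(null, [partialA, partialB], true);
const direct = countDocs(null, [1, 1, 1, 1, 1], false);
console.log(viaRereduce, direct);
```

A reduce that returned `values.length` instead would give the right answer on the first pass but the wrong one on rereduce, which is why the two call modes have to be considered together.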

Framework
[Figure: documents are processed by map functions into tables of (key, values) pairs; the values are aggregated per key (key 1, key 2, key 3) into intermediate values; one reduce per key yields the final values for key 1, key 2 and key 3]

Stages
1. ingestion: OAI-PMH-harvested documents are validated against their schema, converted from XML to JSON, supplied with unique id, timestamp, source, and schema information, and added to the DB with the original XML as attachment
2. indexing: to attack data heterogeneity at schema level
3. curation: to address variability in facet values
4. faceted search indexing: to precompute all possible queries
5. presentation: to give users navigation access to datasets

Document Indexing with map-reduce
- document indexing tackles data heterogeneity, given that documents may adhere to different schemas

Map Example (template)
    function(doc) {
      switch (doc.schema) {
        case "<reference_to_schema_a>":
          if (<tree_has_node>) {
            emit(<path_to_node_val>, 1);
          }
          break;  // outside the if, so a failed check cannot fall through
        case "<reference_to_schema_b>":
          [...]
        [...]
      }
    }

Map to index organisations (fragment)
    function(doc) {
      switch (doc.schema) {
        case "http://catalog.clarin.eu/ds/componentregistry/rest/registry/[...]1694580/xsd":
          // Guard every step of the path before emitting the legal owner.
          if (doc.cmd &&
              doc.cmd.components &&
              doc.cmd.components.textcorpusprofile &&
              doc.cmd.components.textcorpusprofile.generalinfo &&
              doc.cmd.components.textcorpusprofile.generalinfo.legalowner &&
              doc.cmd.components.textcorpusprofile.generalinfo.legalowner.$t) {
            emit(doc.cmd.components.textcorpusprofile.generalinfo.legalowner.$t, 1);
          }
          break;
        case "http://theharvestingday.eu/schemas/clarin_bamdes-1.1.xsd":
          if (doc.lexicalresource &&
              doc.lexicalresource.organization &&
              doc.lexicalresource.organization.$t) {
            emit(doc.lexicalresource.organization.$t, 1);
          }
          break;
        ...
      }
    }

Map Result (organisations)

Reduce Result (organisations)
Note: need for data curation

Document Indexing with map-reduce
- initially, views were coded manually and adapted after each schema change, but this is tedious and prone to error
- now, views are generated automatically from a declarative facet specification, using JavaScript (string concatenation)

Facet specification
    { "facet" : "modality",
      "pathinfos" : [
        { "schema": "http://catalog.clarin.eu/...:cr1:p_129094580/...",
          "path" : "doc.cmd.components.textcorpusprofile..." },
        { "schema": "http://catalog.clarin.eu/...:cr1:p_129094579/...",
          "path" : "doc.cmd.components.lexicalresourceprofile..." },
        ...
      ] }
    { "facet" : "language",
      "pathinfos" : [ ... ] }
    [...]
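The generation step might look roughly like the following sketch, which turns a facet specification of the shape shown above into the source text of a map function by string concatenation. The schema names, paths, and the helper names `guardFor` and `generateMapSource` are hypothetical; the actual generator used in the project is not shown in the talk.

```javascript
// Hypothetical facet specification; schema names and paths are made up.
const organisationSpec = {
  facet: "organisation",
  pathinfos: [
    { schema: "schema-a", path: "doc.a.owner.$t" },
    { schema: "schema-b", path: "doc.b.organization.$t" },
  ],
};

// "doc.a.owner.$t" -> "doc.a && doc.a.owner && doc.a.owner.$t"
function guardFor(path) {
  const parts = path.split(".");
  const guards = [];
  for (let i = 2; i <= parts.length; i++) {
    guards.push(parts.slice(0, i).join("."));
  }
  return guards.join(" && ");
}

// Build the source code of a map function, one switch case per schema.
function generateMapSource(spec) {
  let src = "function (doc) {\n  switch (doc.schema) {\n";
  for (const { schema, path } of spec.pathinfos) {
    src += "    case " + JSON.stringify(schema) + ":\n";
    src += "      if (" + guardFor(path) + ") { emit(" + path + ", 1); }\n";
    src += "      break;\n";
  }
  src += "  }\n}";
  return src;
}

const mapSource = generateMapSource(organisationSpec);
console.log(mapSource);

// Try the generated code with a local emit (CouchDB would supply its own).
const emitted = [];
const generatedMap = new Function("emit", "return " + mapSource)(
  (key, value) => emitted.push([key, value])
);
generatedMap({ schema: "schema-a", a: { owner: { $t: "MPI" } } });
generatedMap({ schema: "schema-b", b: {} }); // path missing: nothing emitted
console.log(emitted);
```

Generating the guards from the path keeps the long chains of null checks out of hand-written code, which is where most of the manual errors would otherwise creep in.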

Data Curation
- each map function gives a view of the document space in terms of the facet it represents
- analysis shows large variability for many facet values, e.g., organisations with different names
- devised curation tables that map given names to preferred names
- data curation is performed on the indices (for faceted search) rather than on the original documents

Conversion of Views to Documents
- faceted search is to be defined in terms of the document indexing established in the first map-reduce cycle
- but CouchDB's map-reduce framework is defined in terms of documents
- thus, it is not possible to define views on views, at least not directly
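A curation table can be as simple as a lookup from observed spellings to a preferred name. The sketch below applies such a table to reduced index rows and merges their counts. The variant spellings and counts are invented for illustration; only the preferred name appears elsewhere in the talk.

```javascript
// Hypothetical curation table: observed spelling -> preferred name.
const curationTable = new Map([
  ["MPI Nijmegen", "Max Planck Institute for Psycholinguistics"],
  ["MPI for Psycholinguistics", "Max Planck Institute for Psycholinguistics"],
]);

// Map a given name to its preferred name; unknown names pass through.
function curate(name) {
  const trimmed = name.trim();
  return curationTable.has(trimmed) ? curationTable.get(trimmed) : trimmed;
}

// Merge the counts of all rows whose keys curate to the same preferred name.
function curateCounts(rows) {
  const counts = {};
  for (const { key, value } of rows) {
    const preferred = curate(key);
    counts[preferred] = (counts[preferred] || 0) + value;
  }
  return counts;
}

// Hypothetical reduce output for the organisation facet.
const reducedRows = [
  { key: "MPI Nijmegen", value: 3 },
  { key: "Max Planck Institute for Psycholinguistics", value: 10 },
  { key: "MPI for Psycholinguistics ", value: 2 },
];

const curatedCounts = curateCounts(reducedRows);
console.log(curatedCounts);
```

Applying the table to the index rather than to the stored documents leaves the harvested metadata untouched, in line with the requirement to preserve the original format.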

Views on Views
- re-use the result of document indexing by converting the resulting views into documents
- the conversion takes care of data curation
- the conversion is written in JavaScript, implementing a hash table of hash tables
  - the outer hash table gives access to the facets (e.g., language)
  - the inner hash table gives access to all the values a chosen facet can take (e.g., associating key German with all documents carrying this piece of information)
- the new index (of type docindex) is stored in an extra DB, which also holds all views that implement faceted search
- one index file for each document collection
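As an illustration of the hash-table-of-hash-tables layout, the following sketch converts view rows into one index document. The row shape, the sample ids, the curation entries and the function name `buildDocIndex` are assumptions for this example; only the `doctype` value `docindex` is taken from the talk.

```javascript
// Hypothetical rows per facet, as an indexing view might return them.
const viewRows = {
  language: [
    { id: "d1", key: "German" },
    { id: "d2", key: "german" },
    { id: "d3", key: "Dutch" },
  ],
  country: [
    { id: "d1", key: "Germany" },
    { id: "d3", key: "Netherlands" },
  ],
};

// Hypothetical curation table applied during the conversion.
const preferredNames = new Map([["german", "German"]]);

// Outer table: facet -> inner table; inner table: value -> sorted doc ids.
function buildDocIndex(rowsByFacet, preferred) {
  const index = { doctype: "docindex" };
  for (const [facet, rows] of Object.entries(rowsByFacet)) {
    const inner = {};
    for (const { id, key } of rows) {
      const value = preferred.has(key) ? preferred.get(key) : key;
      if (!inner[value]) inner[value] = [];
      inner[value].push(id);
    }
    // Keep each document set sorted and duplicate-free, like an ordered set.
    for (const value of Object.keys(inner)) {
      inner[value] = [...new Set(inner[value])].sort();
    }
    index[facet] = inner;
  }
  return index;
}

const docIndex = buildDocIndex(viewRows, preferredNames);
console.log(JSON.stringify(docIndex, null, 2));
```

Stored as an ordinary document, such an index can itself be the input of further map functions, which is how views on views become possible indirectly.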

Document index for one collection

Map View for Country
    fun ({Doc}) ->
      case proplists:get_value(<<"doctype">>, Doc) of
        <<"docindex">> ->
          {CountryHash}  = proplists:get_value(<<"country">>, Doc, {[]}),
          {LanguageHash} = proplists:get_value(<<"language">>, Doc, {[]}),
          <other hashes>
          lists:foreach(fun (CountryItem) ->
              DocSet = proplists:get_value(CountryItem, CountryHash),
              DocSetSize = ordsets:size(DocSet),
              if DocSetSize > 0 ->
                  %% all documents for this country
                  Emit(CountryItem, {[{<<"facet">>, <<"_total_">>},
                                      {<<"value">>, <<"_total_">>},
                                      {<<"docs">>, DocSet}]}),
                  %% number of documents per language within this country
                  lists:foreach(fun (LanguageItem) ->
                      Intersection = ordsets:intersection(
                          proplists:get_value(LanguageItem, LanguageHash),
                          proplists:get_value(CountryItem, CountryHash)),
                      case Intersection == [] of
                        false ->
                          Emit(CountryItem, {[{<<"facet">>, <<"language">>},
                                              {<<"value">>, LanguageItem},
                                              {<<"docs">>, ordsets:size(Intersection)}]});
                        _ -> ok
                      end
                    end, proplists:get_keys(LanguageHash)),
                  <other intersections for other facets [...]>;
                 true -> ok
              end
            end, proplists:get_keys(CountryHash));
        _ -> ok
      end
    end.

Result for Country View (fragment)

Reduce Function (common to FB views)
    fun (Key, Values) ->
      %% Merge one emitted entry into the dictionary, keyed by {Facet, Value}.
      AddToDict = fun (CurrentEntry, Dict) ->
          {[{<<"facet">>, Facet}, {<<"value">>, Value}, {<<"docs">>, Documents}]} = CurrentEntry,
          DictKey = {Facet, Value},
          case Facet of
            <<"_total_">> -> dict:append_list(DictKey, Documents, Dict);
            _ -> dict:update(DictKey, fun (Old) -> Old + Documents end, Documents, Dict)
          end
        end,
      %% Turn the dictionary back into a list of facet/value/docs entries.
      DictToList = fun (Dict) ->
          lists:map(fun (Entry) ->
              {{Facet, Value}, Docs} = Entry,
              {struct, [{<<"facet">>, Facet}, {<<"value">>, Value}, {<<"docs">>, Docs}]}
            end, dict:to_list(Dict))
        end,
      DictToList(lists:foldl(fun (Value, Dict) -> AddToDict(Value, Dict) end, dict:new(), Values))
    end.
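For readers less familiar with Erlang, the aggregation performed by this reduce function can be transliterated into JavaScript roughly as follows. This is a sketch of the grouping logic only: it ignores the rereduce case and CouchDB's calling convention, and the function name and sample entries are hypothetical.

```javascript
// Group entries by (facet, value): "_total_" entries concatenate their
// document id lists, all other entries add up their document counts.
function reduceFacetEntries(values) {
  const byKey = new Map();
  for (const { facet, value, docs } of values) {
    const isTotal = facet === "_total_";
    const dictKey = JSON.stringify([facet, value]);
    if (!byKey.has(dictKey)) {
      byKey.set(dictKey, { facet, value, docs: isTotal ? [] : 0 });
    }
    const entry = byKey.get(dictKey);
    entry.docs = isTotal ? entry.docs.concat(docs) : entry.docs + docs;
  }
  return [...byKey.values()];
}

// Hypothetical entries emitted for one country key by two index documents.
const emittedEntries = [
  { facet: "_total_", value: "_total_", docs: ["d1"] },
  { facet: "language", value: "German", docs: 2 },
  { facet: "_total_", value: "_total_", docs: ["d2"] },
  { facet: "language", value: "German", docs: 1 },
];

const merged = reduceFacetEntries(emittedEntries);
console.log(merged);
```

The two kinds of entries are treated differently because the total carries the document ids needed for presenting results, while the per-facet entries only need counts to label the remaining ring segments.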

Coding of Views
- initially, views were coded manually in JavaScript
- but poor performance in view computation on large index files led to the use of Erlang instead, which resulted in a significant performance boost
- writing views by hand is tedious and prone to error
- have written Erlang code that generates the code definitions for Erlang views automatically
- Erlang meta-code is based on the concatenation of Erlang code strings

Facet specification
    -define( FACETS, ["country", "language", "modality", "organisation", "resourceclass"] ).
    -define( COND_FACETS, [ { "resourceclass", "corpus", ["genre"] },
                            { "resourceclass", "Tool", ["tooltype", "applicationtype", "inputtype",
                                                        "outputtype", "lifecyclestatus"] } ] ).

Coding of Views (cont'd)
- the specification leads to the generation of 121 views, with each view having between 5000 and 12000 bytes of Erlang code
- not all possible combinations of set intersections are necessary: the document sets resulting from first selecting facet $F_1$ and then facet $F_2$ are identical to those obtained when $F_2$ is selected first and then $F_1$
- realized the computation of all necessary intersections using Erlang combinators

Use of Erlang Combinators
    comb_4(L) ->
        case length(L) < 4 of
            true -> "supply lists with length >= 4";
            _ -> [ {A,B,C,D,Z} || A <- L,
                                  B <- L--[A],     A < B,
                                  C <- L--[A,B],   B < C,
                                  D <- L--[A,B,C], C < D,
                                  Z <- [L--[A,B,C,D]] ]
        end.
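The effect of ignoring selection order can be seen by enumerating unordered facet combinations. The sketch below is a generic k-combination function in JavaScript, not a port of the Erlang generator; it uses the five facet names from the FACETS definition.

```javascript
// All k-element combinations of items, in the order the items are given.
function combinations(items, k) {
  if (k === 0) return [[]];
  if (items.length < k) return [];
  const [first, ...rest] = items;
  const withFirst = combinations(rest, k - 1).map((c) => [first, ...c]);
  return withFirst.concat(combinations(rest, k));
}

const facets = ["country", "language", "modality", "organisation", "resourceclass"];

// Number of unordered selections per level (1 to 5 facets selected).
const perLevel = [1, 2, 3, 4, 5].map((k) => combinations(facets, k).length);
console.log(perLevel);

// The two-facet selections, e.g. ["country", "language"].
console.log(combinations(facets, 2));
```

For five facets this gives 5, 10, 10, 5 and 1 combinations for levels one to five. Treating selection order as irrelevant therefore reduces, for example, the 20 ordered two-facet selections to 10.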

GUI

Queries = Views

View request
    /mpi_mgt/_design/country/_view/country?key="Germany"&reduce=true

View result
    {"rows":[
      {"key":"Germany","value":[
        {"facet":"modality","value":"unspecified","docs":140},
        {"facet":"modality","value":"speech/gestures","docs":230},
        {"facet":"language","value":"German Sign Language","docs":433},
        {"facet":"genre","value":"secondary document","docs":3},
        {"facet":"genre","value":"movie","docs":458},
        {"facet":"_total_","value":"_total_",
         "docs":["oai:www.mpi.nl:mpi100", "oai:www.mpi.nl:mpi1002978"...]}
        [...]
      ]}]}
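When such a request is built programmatically, the key has to be passed as a JSON value, so a string key keeps its double quotes and is then URL-encoded. A minimal sketch follows; the host and port (CouchDB's default port 5984 on localhost) and the helper name `viewUrl` are assumptions for this example.

```javascript
// Build the URL of a view request; the key is JSON-encoded, then URL-encoded.
function viewUrl(base, db, designDoc, view, key, reduce = true) {
  const params = new URLSearchParams({
    key: JSON.stringify(key),
    reduce: String(reduce),
  });
  return base + "/" + db + "/_design/" + designDoc + "/_view/" + view + "?" + params;
}

const countryUrl = viewUrl("http://localhost:5984", "mpi_mgt", "country", "country", "Germany");
console.log(countryUrl);
```

Each facet selection in the GUI then maps to exactly one such GET request against a precomputed view, with no query evaluation at request time.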

Document Indexing
- views for document indexing are automatically generated from the facet specification using JavaScript
- the resulting map and reduce functions are in JavaScript too, CouchDB's default view language
- computation of the view organisation takes approximately 25 minutes on 86k documents
- one-time payoff
- no effort has been made yet to increase the speed of view computation
- small changes in the document database will have only a small impact on view recomputation at the document indexing level

Computation of Faceted Search Views
- computationally expensive
- JavaScript too slow
- Erlang much faster (better in memory and processor usage)

Setting
- each Erlang view is stored in a separate design document
- map-reduce computation executed on a 24-core, 96 GB machine
- harvested and ingested approximately 86,000 metadata documents on language resources
- five unconditional facets: language (371), country (67), organisation (39), modality (32), and genre (50)
- many different facet values, e.g., modality = speech (59463); language = Dutch (18345); country = Germany (16178); organisation = Max Planck Institute for Psycholinguistics (16568); genre = Discourse (33676)
- 31 different map-reduce pairs

Computation
- generation of the views language, country, organisation, modality, and genre takes altogether less than one minute (using 5 CPUs)
- the ten 2-level views (users selected two facets, e.g., country : genre, country : language, ...) were computed in less than 1 minute (using 10 CPUs)
- computation of the ten 3-level views (users selected three facets): less than 7.5 minutes
- computation of the five 4-level views: more than 2 hours

Future work for optimisation
- currently, there is one indexing document for each of the metadata providers
  - an update from one data provider only requires a limited view recomputation
  - but some data providers provide tens of thousands of documents
- optimise index documents for faceted search
  - reflect additions by a new index document, so that incremental updates are indeed limited to document additions
  - reflect modifications and deletions by introducing MODIFY and DELETE lists that a revised map-reduce combination would need to consider

Related Work: Flamenco
- toolkit with web-based interface to give faceted access to large data collections
- given import format:
  - the file facets.tsv, listing all facets
  - the file attrs.tsv, listing all attributes of a given item
  - the file items.tsv, listing each collection item (following the definition in attrs.tsv) with a unique id
  - for each facet entry in facets.tsv:
    - file facet_term: lists all terms for the given facet, with unique facet term ids
    - file facet_map: associates unique facet term ids with item ids
- data files are ingested into Flamenco's relational database (MySQL)
- Flamenco generates the faceted browser's default/customizable GUI
- the user's selection of facet terms is translated into corresponding MySQL queries to compute all necessary set intersections
- results of executing MySQL queries are cached to avoid re-computation

Flamenco used in VLO
- faceted search access to language resources using Flamenco, with the same dataset
- see http://www.clarin.eu/vlo
- used Perl to translate 80,000+ XML-based metadata files into Flamenco's indexing data format (incl. curation)
- ingested the data into the Flamenco database and adapted the GUI
- script to generate all queries to warm up the cache

Comparison
- the data preparation required for Flamenco roughly corresponds to our CouchDB-based document indexing phase (simple views)
- data curation only happens when the views of the indexing phase are converted into the indexing documents
- the MySQL queries fired by Flamenco correspond to the views computed in terms of the indexing documents

Advantages of CouchDB
- also stores the original metadata documents (with varying schemata), and thus also serves as permanent storage
- conditional facets contribute to usability
  - they guide users' navigation
  - they need only be computed in subsets whose documents are indexed against the terms the conditional facet depends on
- index generation accommodates incremental updates on the metadata sets
  - supports regular harvesting without recomputing all indices/views
  - in Flamenco, any change in the data set requires overwriting all contexts/caches
- facet specification offers a more declarative view
  - index generation is taken to a higher level; it is easy to experiment with different facet configurations
  - but once the facet specification is changed, index generation starts from scratch

- CouchDB, with its native language Erlang, is well suited for the development of industrial-strength applications
- CouchDB's REST-based interface offers a lean alternative to established software (Java-based Apache Tomcat webserver)
- Erlang's main limitation is the lack of a full macro package allowing users to write programs that write other programs
  - a Common Lisp-like defmacro would have made life easier
  - currently, no strong support for a Lisp (or Haskell) port to index and query documents in CouchDB
- CouchDB's main limitation when used with Erlang is the lack of available documentation and example code

- general approach to aggregate heterogeneously structured documents and to make them accessible via faceted (and full-text) search
- works as long as the documents' relevant content can be given in JSON (CouchDB's native format)
- for the given context, facet specification was straightforward
- desirable to detect good facet candidates automatically
  - the Castanet algorithm requires the definition of target terms that best reflect the topics present in a given collection
  - it combines target terms with hypernymy (IS-A) information from WordNet, both to build facet hierarchies and to assign documents to the facets

Questions