+ Databases and Information Retrieval Integration
  TIETS42, Autumn 2016
  Kostas Stefanidis, kostas.stefanidis@uta.fi
  http://www.uta.fi/sis/tie/dbir/index.html
  http://people.uta.fi/~kostas.stefanidis/dbir16/dbir16-main.html
+ DB & IR Integration
  Databases and information retrieval are two areas that have developed separately:
    They have focused on different application areas
    They have emphasized different methodologies
+ DB & IR Integration
  In databases:
    We pose queries over data with a particular schema, we use an algebra, and we care about the accuracy of the query results
  In information retrieval:
    We focus on queries expressed as keywords and applied to free-text documents, and we care about how to rank the query results, based on statistics and probabilities
+ DB & IR Integration
  Nowadays, many applications require the concurrent management of structured and unstructured data, so integrating these two worlds has become necessary
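+ Adding Ranking to DB OR Adding Semantics to IR
  [Figure: 2x2 grid of search type against data type]
    Unstructured search (keywords) over structured data (records): [keyword search on databases]
    Unstructured search (keywords) over unstructured data (documents): IR systems, search engines
    Structured search (SQL, XQuery) over structured data (records): database systems
    Structured search (SQL, XQuery) over unstructured data (documents): [querying entities]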
+ DB & IR Differences
  Databases:
    Structured data
    Structured querying
    Soundness & completeness
    The user is expected to be aware of the underlying structure of the data or a query language
  IR:
    Unstructured data
    Unstructured querying
    High precision & recall
    No expectations of the user
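To make the IR quality measures named on this slide concrete, here is a minimal sketch of how precision and recall can be computed for one query. The document identifiers and the function name `precision_recall` are hypothetical, chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 documents truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)  # 2 of 4 retrieved are relevant; 2 of 3 relevant were found
```

A sound and complete database query would, in these terms, have precision 1 and recall 1 by construction, which is why the DB side does not need the measures.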
+ Why DB & IR Integration?
  DB and IR have evolved as separate communities
  Their focus is on very different application areas, e.g.:
    (DB) accounting and reservation systems
    (IR) library and patent information
  As a result, they have different methodological paradigms:
    (DB) precise querying over schematized data, based on logic and algebra
    (IR) keyword search and ranking over text and uncertain data, based on statistics and probability theory
+ Why DB & IR Integration?
  TODAY: many applications require managing both structured and unstructured data
  This raises the question of how to integrate the DB and IR worlds at both the foundational and the software-system level
  In the next slides: tenets, from different viewpoints, on why DB & IR integration is desirable
+ too-many-answers
  Example: searches over travel portals or product catalogs
  The too-many-answers problem
    What if we tighten the query conditions? This may produce too few or even no results
    Note also: interactive reformulation and browsing are time-consuming and may irritate customers/users
  For large result sets: ranking! It can be based on:
    Data and/or workload statistics
    User profiles
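The idea of ranking a large result set instead of tightening the query can be sketched as a simple top-k selection. The hotel records, the scoring function and its weights are all assumptions made up for this example; a real system would derive the score from data or workload statistics or from a user profile.

```python
import heapq

# Hypothetical result set of a travel-portal query.
hotels = [
    {"name": "A", "price": 80, "rating": 4.1},
    {"name": "B", "price": 120, "rating": 4.8},
    {"name": "C", "price": 60, "rating": 3.2},
    {"name": "D", "price": 200, "rating": 4.9},
    {"name": "E", "price": 95, "rating": 4.5},
]


def score(hotel, max_price=200.0, w_price=0.5, w_rating=0.5):
    """Assumed scoring function: weighted sum of cheapness and quality, both in [0, 1]."""
    cheapness = 1.0 - hotel["price"] / max_price
    quality = hotel["rating"] / 5.0
    return w_price * cheapness + w_rating * quality


def top_k(items, k):
    """Return the k highest-scoring items, best first, without sorting everything."""
    return heapq.nlargest(k, items, key=score)


for hotel in top_k(hotels, 3):
    print(hotel["name"], round(score(hotel), 4))
```

The query itself stays broad (all hotels qualify); only the presentation changes, so the user never ends up with an empty answer.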
+ text-matching
  Because of misspellings, spelling variants, etc., there is a need for adding text-matching functionality to DB systems
  Need for approximate matching
    E.g., record linkage for matching entities: reconcile "Hector Garcia-Molina" and "Garcia-Molina, H."
  Intuitively, approximate matching by similarity measures requires ranking!
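A minimal sketch of the record-linkage example above, assuming a very simple approach: normalize each name to "surname initial" and compare the normalized strings with the similarity ratio from Python's standard `difflib` module. The normalization rule and the helper names are assumptions for illustration, not a method taken from the course.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Reduce a person name to 'surname initial', lower-cased."""
    name = name.strip()
    if "," in name:  # form "Garcia-Molina, H."
        surname, given = [part.strip() for part in name.split(",", 1)]
    else:  # form "Hector Garcia-Molina"
        parts = name.split()
        given, surname = " ".join(parts[:-1]), parts[-1]
    return f"{surname} {given[:1]}".lower()


def similarity(a, b):
    """Similarity in [0, 1] between two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def rank_matches(query, candidates):
    """Order candidate names by decreasing similarity to the query name."""
    return sorted(candidates, key=lambda c: similarity(query, c), reverse=True)


candidates = ["Jeffrey Ullman", "Garcia-Molina, H.", "Hector Garcia-Molnia"]
for name in rank_matches("Hector Garcia-Molina", candidates):
    print(name, round(similarity("Hector Garcia-Molina", name), 2))
```

Because the similarity is a degree rather than a yes/no answer, the natural output is a ranked list of candidates, which is the point the slide makes.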
+ heterogeneity
  Typically, applications access multiple databases
    Often with a run-time choice of the data sources
    No unified global schema
  Even if the sources contain structured, exact data records and have an explicit schema, the application has to cope with the heterogeneity of the underlying schema names, XML tags, or RDF properties
  Queries need to be schema-agnostic or tolerant to schema relaxation
+ information-extraction
  Textual information (natural-language sentences) contains named entities and relationships between them
  Information-extraction techniques (pattern matching, statistical learning) are used for locating the entities
  The result: potentially large knowledge bases whose facts carry some uncertainty
  Querying the extracted facts: need for ranking!
+ information-extraction
  Querying the extracted facts:
    Use keywords rather than sophisticated expressions in SQL or XQuery
  If the extracted data are organized in graph structures:
    Determine when keyword occurrences are interconnected in a meaningful way
    Efficiently compute answers in ranked order
  (new, or not so new, research problems)
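The two graph-related points above can be illustrated with a deliberately simple sketch: for a two-keyword query, find paths that connect a node matching the first keyword to a node matching the second, and rank the answers by path length. The tiny data graph, the exact-token matching and the "shorter connection ranks higher" rule are all assumptions for illustration; actual keyword-search systems use richer answer models and scoring.

```python
from collections import deque

# Hypothetical data graph: node id -> text, plus undirected edges.
nodes = {
    "a1": "author Smith",
    "a2": "author Jones",
    "p1": "paper keyword search",
    "p2": "paper skyline queries",
    "c1": "conference VLDB",
}
edges = [("a1", "p1"), ("a2", "p2"), ("p1", "c1"), ("p2", "c1")]


def build_adjacency(edge_list):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def shortest_path(adj, source, targets):
    """Breadth-first search from source to the nearest node in targets."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u in targets:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in sorted(adj.get(u, ())):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None  # no connection between the keyword occurrences


def keyword_answers(nodes, edges, kw1, kw2):
    """Return connecting paths for the query (kw1, kw2), shortest first."""
    adj = build_adjacency(edges)
    match1 = [n for n, text in nodes.items() if kw1.lower() in text.lower().split()]
    match2 = {n for n, text in nodes.items() if kw2.lower() in text.lower().split()}
    paths = (shortest_path(adj, s, match2) for s in match1)
    return sorted((p for p in paths if p), key=len)


print(keyword_answers(nodes, edges, "author", "skyline"))
```

Here the author directly linked to the skyline paper is returned before the author who is connected to it only through a shared conference.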
+ structured IR
  Structured IR: go beyond keyword search by understanding attributes, XML tags and metadata
    Digital libraries, enterprise intranets, e-science portals, and business-oriented Web sites
  Example: the faceted search paradigm
    Access information organized according to multiple dimensions (ranking in multiple ways)
    Allow users to explore a collection of information by applying multiple filters
    Used by Internet merchant sites for product search, result refinement, and interactive exploration
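A minimal sketch of the faceted search paradigm described above: the user applies filters on some dimensions, and the system reports, for the remaining dimensions, how many results each value would yield. The product catalog and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical product catalog of an Internet merchant site.
products = [
    {"name": "Trail Runner", "brand": "Acme", "category": "shoes", "color": "black"},
    {"name": "City Sneaker", "brand": "Acme", "category": "shoes", "color": "white"},
    {"name": "Rain Jacket", "brand": "Borealis", "category": "jackets", "color": "black"},
    {"name": "Wind Jacket", "brand": "Acme", "category": "jackets", "color": "blue"},
]


def apply_filters(items, filters):
    """Keep the items whose attributes equal every selected facet value."""
    return [it for it in items if all(it.get(f) == v for f, v in filters.items())]


def facet_counts(items, facets):
    """For each facet, count how many items carry each value."""
    return {f: Counter(it[f] for it in items if f in it) for f in facets}


selected = apply_filters(products, {"brand": "Acme"})
counts = facet_counts(selected, ["category", "color"])
print([p["name"] for p in selected])
print(counts)
```

The counts are what lets a user refine the result interactively: each value shown next to a facet tells how many items remain if that filter is added.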
+ search-result personalization
  Cater to the user's information needs
    Better search precision/recall, higher user satisfaction
  Exploit:
    User preferences
    Profiling: the user's long-term history of queries, clicks and data usage
    Contextual profiling: the user's short-term behavior in the context of the current task
  Personalization is already used in Web, news and blog search
  Enormous potential for individualizing
+ Different Views of the Coin
  About the need for structure:
    DB emphasizes relaxation of structure
    IR emphasizes adding structure to information
    (The Web community takes a mix of structured and unstructured data for granted)
  About the need for named entities:
    DB emphasizes approximate matching and ranking
    IR emphasizes adding relationships between entities
+ DB & IR Integration
  Learning outcomes
  After completing the course, students are expected to:
    know the basic concepts and techniques for the integration of databases and information retrieval
    be able to handle contemporary research issues and problems on the topic
    be able to perform a comparative assessment of existing works
+ DB & IR Integration
  24 Oct to 16 Dec (8 weeks)
  Two parts:
    1st part (4 weeks): all lectures will be given by the instructor
    2nd part (4 weeks): lectures (in the form of assignments) will be mostly given by the students
+ DB & IR Integration
  1st part (4 weeks): all lectures will be given by the instructor
    Introduction to big data and the need for data exploration, to the techniques that will be presented in the lectures, and to the structure/organization of the course
    For this part, algorithmic exercises on, or extensions of, the presented approaches will be given to the students on a weekly basis (each student works individually)
+ DB & IR Integration
  1st part (4 weeks): all lectures will be given by the instructor
    Top-k and skyline queries: rank aggregation, top-k algorithms, skylines
    Keyword-based search: schema-based & graph-based approaches in databases
    Preferential search: preference representation and composition, preferential query processing
    Recommender systems: collaborative filtering, content-based recommendations
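As a first taste of the skyline queries listed above, here is a minimal sketch of the skyline (Pareto-optimal set) of a collection of points, using a naive quadratic comparison. The hotel data and the convention that smaller values are better in every dimension are assumptions for this example; the lectures cover more efficient algorithms.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every dimension and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Return the points not dominated by any other point (naive O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to the beach in km); smaller is better.
hotels = [(50, 8), (80, 3), (120, 1), (90, 4), (60, 9), (120, 2)]
print(skyline(hotels))
```

Unlike a top-k query, a skyline query needs no scoring function: it keeps every hotel for which no other hotel is both cheaper and closer.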
+ DB & IR Integration
  2nd part (4 weeks): lectures (in the form of assignments) will be mostly given by the students
    Students will form groups (at most 4 students per group: TBD)
    Each group will be assigned a project
    Each project will be associated with two research papers
    Each week, each group will give a short presentation
+ DB & IR Integration
  2nd part (4 weeks): lectures (in the form of assignments) will be mostly given by the students
  Each week, each group will give a short presentation (~10-15 mins)
    1st week: briefly describe the topic and the solutions of the papers of the project
    2nd week: describe the main disadvantages/drawbacks of the solutions given by the original authors
    3rd week: present ideas from other related papers published after the papers of the project (search for newer papers related to the project)
    4th week: extend the ideas of the project (students' contributions)
+ DB & IR Integration
  2nd part (4 weeks): lectures (in the form of assignments) will be mostly given by the students
  In addition: one assignment from my side each week, related to one of the projects
+ DB & IR Integration
  Grades
  The final grade will be determined:
    30% by the assignments of the first part
    20% by the assignments of the second part
    50% by the presentations of the project
+ Course Projects
  Project 1: Top-k join tuples
  Project 2: Preference integration in databases
  Project 3: Personalized keyword search
  Project 4: Contextual recommendations
  Project 5: Recommend packages
  Project 6: Recommendations for groups
  Project 7: Diversity in recommender systems
  Project 8: Efficient diverse search
  Project 9: Frameworks based on different definitions of diversity
  Project 10: Tags for search
  Project 11: Interactive data exploration
+ Where, When
  When: Monday, Thursday, Friday, 10.00-12.00 (24 Oct 2016 to 16 Dec 2016)
  Where: Pinni B0016
  Instructor: Kostas Stefanidis
  E-mail: kostas.stefanidis@uta.fi
  Course web page:
    http://www.uta.fi/sis/tie/dbir/index.html
    http://people.uta.fi/~kostas.stefanidis/dbir16/dbir16-main.html