Database Roma Tre

Database Group @ Roma Tre http://www.dia.uniroma3.

DIPARTIMENTO DI INFORMATICA E AUTOMAZIONE
July 2012

The Database Group is one of the six research groups of the Department of Computer Science and Automation (DIA, Dipartimento di Informatica e Automazione) of Università Roma Tre. It is based in Via della Vasca Navale 79-81, 00146, Rome.

The Department currently has 21 permanent faculty members, 3 temporary ones, 10 postdocs or other collaborators, and about 25 PhD students. All the faculty members belong to the School of Engineering and are responsible for all the core courses of the undergraduate and master's programs in Computer Engineering and in Automation Engineering, which have, overall (as of April 2012), 1330 students (1033 undergraduates and 307 at the master's level). They also teach service courses in the other engineering programs. The faculty members also form the Computer Engineering section of the Doctoral School in Engineering.

The School of Engineering currently has 135 faculty members and 4100 students (cumulative undergraduate and master's). Overall, the university has eight schools, 920 faculty members and (undergraduate and master's) students.

The Database Group is currently composed of five faculty members and eight other full-time members:

Faculty
Paolo Atzeni (Professore)
Luca Cabibbo (Professore Associato)
Valter Crescenzi (Ricercatore)
Paolo Merialdo (Professore Associato)
Riccardo Torlone (Professore)

Postdocs, Ph.D. Students, Collaborators
Roberto De Virgilio (Assegnista di ricerca)
Mirko Bronzi (Assegnista di ricerca)
Francesca Bugiotti (Assegnista di ricerca)
Daniele Toti (Assegnista di ricerca)
Celine Badr (PhD student)
Antonio Maccioni (PhD student)
Disheng Qiu (PhD student)
Luca Rossi (PhD student)

Recent former members
Paolo Papotti (now with Qatar Computing Research Institute)
Lorenzo Blanco (now with Google, UK)
Pierluigi Del Nostro (now with CRMPA, Roma)
Stefano Paolozzi (now with CRMPA, Roma)
Paolo Cappellari (now with Collective[i], New York, USA)
Giorgio Gianforme (now with Almawave s.r.l., Roma)
Luigi Bellomarini (now with Banca d'Italia)
Fabrizio Celli (now with FAO)

The group is devoted to the development of new principles, methods and tools for the organization and management of information in the form of databases. The focus is on the new requirements generated by the growth of the Internet and of the World Wide Web, where, in most settings, several sources of information are available. The sources can be heterogeneous (they need not be databases, but can also be Web sites or files), and it is important to offer users integrated and personalized views over them. The overall approach is to tackle problems of practical significance, providing both general solutions (with a theoretical background where relevant) and concrete tools that demonstrate the approach.

Past topics of interest include:
Management of Web sites and applications by means of a database approach (Araneus)
Wrappers for extraction of data from Web sites (RoadRunner)
Updating object-oriented databases
Management of data warehouses
Database theory

Recent major topics of interest are briefly described below.

MIDST (Model-Independent Data and Schema Translation)

The MIDST project has the goal of developing tools for the translation of database schemas and instances from one model to another. The approach is based on the notion of a "metamodel": data models are described with reference to a small set of metaconstructs, and translations are specified on metaconstructs as well, so that they are reusable. The project started in its present form in 2003, as a follow-up of a previous project.

In MIDST, new techniques have been proposed for database translations from one model to another, for example from object-oriented to SQL or from SQL to XML schema descriptions. The approach leverages a predefined, but large and extensible, set of models: given a source schema S expressed in a source model, and a target model TM, it generates a schema S' expressed in TM that is "equivalent" to S. A wide family of models is handled by using a metamodel in which models can be succinctly and precisely described. The approach expresses the translation as Datalog rules and exposes the source and target of the translation in a generic relational dictionary. This makes the translation transparent, easy to customize and model-independent. The proposal includes automatic generation of translations, on the basis of a formal system that supports reasoning on signatures of modules and elementary translations.

The original version of the approach generates offline translations, in the sense that schemas and databases are imported into the tool, translated and then exported to the target system. A second version was later produced, with a run-time approach, where the translation of data is performed by views whose definitions are generated by the tool, again with the metamodel approach. In this case, only schemas are imported into the tool.

As a side topic, the same model-independent approach has been applied to other model management operators (merge and diff), leading to the proposal of the MISM (Model Independent Schema Management) platform.

Publications, including the most recent ones:

P. Atzeni, L. Bellomarini, F. Bugiotti, F. Celli, G. Gianforme: A runtime approach to model-generic translation of schema and data. Information Systems 37(3), May 2012.

P. Atzeni, G. Gianforme, P. Cappellari: Data model descriptions and translation signatures in a multi-model framework. Annals of Mathematics and Artificial Intelligence 63(3-4), 2011.

P. Atzeni, L. Bellomarini, F. Bugiotti, G. Gianforme: A runtime approach to model-independent schema and data translation. EDBT 2009.

P. Atzeni, L. Bellomarini, F. Bugiotti, G. Gianforme: MISM: A Platform for Model-Independent Solutions to Model Management Problems. J. Data Semantics 14, 2009.

P. Atzeni, P. Cappellari, R. Torlone, P. A. Bernstein, G. Gianforme: Model-independent schema translation. VLDB J. 17(6), 2008.
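As a rough illustration of the metamodel idea described in the MIDST section, the following Python sketch stores a schema as rows of a small "dictionary" and translates it by applying reusable rules written against metaconstructs rather than against a specific model. Everything named here is an assumption made for the example: the metaconstruct names (Abstract, Lexical, Aggregation), the Construct record, the generated key naming and both rules are hypothetical. MIDST expresses its translations as Datalog rules over a relational dictionary; the Python functions below only imitate that style.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Construct:
    """One row of a toy dictionary: an instance of a metaconstruct."""
    kind: str                    # metaconstruct, e.g. "Abstract" or "Lexical"
    name: str                    # name of the schema element
    owner: Optional[str] = None  # element this one belongs to, if any


Rule = Callable[[List[Construct]], List[Construct]]


def abstract_to_aggregation(schema: List[Construct]) -> List[Construct]:
    # Datalog flavour: Aggregation(n) :- Abstract(n).
    # Each class also gets a generated key column (the "_id" naming is
    # an assumption of this sketch).
    out: List[Construct] = []
    for c in schema:
        if c.kind == "Abstract":
            out.append(Construct("Aggregation", c.name))
            out.append(Construct("Lexical", c.name + "_id", c.name))
    return out


def copy_lexical(schema: List[Construct]) -> List[Construct]:
    # Datalog flavour: Lexical(n, o) :- Lexical(n, o), Abstract(o).
    abstracts = {c.name for c in schema if c.kind == "Abstract"}
    return [c for c in schema if c.kind == "Lexical" and c.owner in abstracts]


def translate(schema: List[Construct], rules: List[Rule]) -> List[Construct]:
    """Apply every rule to the source schema and collect the results."""
    target: List[Construct] = []
    for rule in rules:
        target.extend(rule(schema))
    return target


# A one-class object-oriented schema, translated to a relational-style one.
source = [
    Construct("Abstract", "Employee"),
    Construct("Lexical", "salary", "Employee"),
]
target = translate(source, [abstract_to_aggregation, copy_lexical])
for construct in target:
    print(construct)
```

Because the rules mention only metaconstructs, the same two functions would apply unchanged to any source model whose schemas are described with Abstract and Lexical rows, which is the kind of reuse the metamodel approach aims for.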

SOS (Save Our Systems)

This began as a follow-up of MIDST, but it has now become an independent project. It considers the interoperability of systems in the so-called NoSQL family. A first result is the SOS platform, which allows uniform access to different systems in the family.

Paolo Atzeni, Francesca Bugiotti, Luca Rossi: Uniform Access to Non-relational Database Systems: The SOS Platform. CAiSE 2012, LNCS 7328.

Paolo Atzeni, Francesca Bugiotti, Luca Rossi: SOS (Save Our Systems): A uniform programming interface for non-relational systems. EDBT 2012, ACM.

PRAISED: Automatic Abbreviations Discovery and Resolution

A methodology for discovering and resolving abbreviations of protein names in the full-text versions of scientific articles was proposed and implemented in the PRAISED framework, with the ultimate purpose of building a publicly available abbreviation repository. Three processing steps lie at the core of the framework: (i) an abbreviation identification phase, carried out via domain-independent metrics, whose purpose is to identify all possible abbreviations within a scientific text; (ii) an abbreviation resolution phase, which takes into account a number of syntactic and semantic criteria in order to match an abbreviation with its potential explanation; and (iii) a dictionary-based protein name identification phase, which selects only those abbreviations that belong to the protein science domain. A more general approach, not tied to a specific domain, is currently being developed.

Daniele Toti, Paolo Atzeni, Fabio Polticelli: Automatic Protein Abbreviations Discovery and Resolution from Full-Text Scientific Papers: The PRAISED Framework. Bio-Algorithms and Medical-Systems, Vol. 8, Issue 1 (Mar 2012).

Paolo Atzeni, Fabio Polticelli, Daniele Toti: Experimentation of an automatic resolution method for protein abbreviations in full-text papers. ACM BCB 2011: 2nd ACM Conference on Bioinformatics, Computational Biology and Biomedicine.

Temporal Content Management

A follow-up of previous work on Web data management.

Paolo Atzeni, Stefano Paolozzi, Pierluigi Del Nostro: Temporal Content Management and Web Sites Modeling: Putting Them Together. T. Large-Scale Data- and Knowledge-Centered Systems 5, 2012.

Nyaya: Reasoning over Large Semantic Datasets

Nyaya is a system for the management of Semantic-Web data which couples a general-purpose and extensible storage mechanism with efficient ontology reasoning and querying capabilities. Nyaya processes large Semantic-Web datasets, expressed in multiple formalisms, by transforming them into a collection of Semantic Data Kiosks. The native meta-data of each kiosk is uniformly exposed using the Datalog± language, a powerful rule-based modelling language for ontological databases. The kiosks form a Semantic Data Market, where the data in each kiosk can be uniformly accessed using conjunctive queries and where users can specify their own constraints over the data. Nyaya is easily extensible, robust to updates of both data and meta-data in the kiosks, and readily adapts to different logical organizations of the persistent storage. The approach has been evaluated on well-known benchmarks and compared to state-of-the-art research prototypes and commercial systems.
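To make concrete what accessing kiosk data "using conjunctive queries" means, here is a minimal Python sketch of conjunctive query evaluation over a handful of facts. It is not Nyaya's implementation and it leaves out ontology reasoning and Datalog± entirely; the fact layout, the "?" convention for variables, and the predicate names (type, advises) are assumptions chosen for the example.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Fact = Tuple[str, ...]  # (predicate, arg1, arg2, ...)
Atom = Tuple[str, ...]  # same shape; arguments starting with "?" are variables


def is_var(term: str) -> bool:
    return term.startswith("?")


def match(atom: Atom, fact: Fact,
          binding: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Try to extend `binding` so that `atom` equals `fact`; None on failure."""
    if len(atom) != len(fact) or atom[0] != fact[0]:
        return None
    new = dict(binding)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            # Bind the variable, or check it against its earlier binding.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def answer(query: List[Atom], facts: Iterable[Fact],
           head: List[str]) -> List[Tuple[str, ...]]:
    """Evaluate a conjunctive query by joining its atoms left to right."""
    fact_list = list(facts)
    bindings: List[Dict[str, str]] = [{}]
    for atom in query:
        bindings = [b2 for b in bindings for f in fact_list
                    if (b2 := match(atom, f, b)) is not None]
    # Project each surviving binding onto the head variables.
    return sorted({tuple(b[v] for v in head) for b in bindings})


facts = [
    ("type", "alice", "Professor"),
    ("type", "bob", "Student"),
    ("type", "carol", "Student"),
    ("advises", "alice", "bob"),
    ("advises", "alice", "carol"),
    ("advises", "bob", "carol"),
]
# Which professors advise which students?
query = [("advises", "?p", "?s"),
         ("type", "?p", "Professor"),
         ("type", "?s", "Student")]
result = answer(query, facts, ["?p", "?s"])
print(result)
```

The nested-loop join is deliberately naive; a system working on large datasets would rely on indexes and a query optimizer instead.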

Roberto De Virgilio, Giorgio Orsi, Letizia Tanca, Riccardo Torlone: NYAYA: a System Supporting the Uniform Management of Large Sets of Semantic Data. ICDE.

Database preferences

User preferences are a fundamental ingredient of personalized database applications, in particular those in which the user context plays a key role. Given a set of preferences defined in different contexts, we have studied the problem of deriving the preferences that hold in one of them, that is, how preferences propagate through contexts. For the sake of generality, the approach relies on an abstract context model, which only requires that the contexts form a poset. We have formalized the basic properties of the propagation process and have introduced an algebraic model for preference propagation that relies on two well-known operators for combining preferences: Pareto and Prioritized composition. We have also studied three alternative propagation methods and precisely characterized them in terms of the fairness and specificity properties. To our knowledge, these are the first results providing a theoretical foundation for the management of contextual preferences in database systems.

Paolo Ciaccia, Riccardo Torlone: Modeling the Propagation of User Preferences. ER 2011 (Best paper award).

Query relaxation

Traditional information search, in which queries are posed against a known and rigid schema over a structured database, is shifting towards a Web scenario in which exposed schemas are vague or absent. Query answering therefore cannot be precise, but needs to be relaxed in order to match user requests with accessible data. In this framework, we have proposed a logical model and an abstract query language as a foundation for querying data sets with vague schemas. Our approach takes advantage of the availability of taxonomies, that is, simple classifications of terms arranged in a hierarchical structure. The model is a natural extension of the relational model in which data domains are organized in hierarchies, according to different levels of generalization. The query language is a conservative extension of relational algebra in which special operators allow the specification of relaxed queries over vaguely structured information. We have also studied equivalence and rewriting properties of the query language that can be used for query optimization.

Davide Martinenghi, Riccardo Torlone: Querying Databases with Taxonomies. ER 2010.

GAIA: Generic Mappings for Data Exchange

We have addressed the novel problem of schema exchange, which naturally extends the data exchange process to collections of similar schemas: while the data exchange process operates over specific source and target schemas, the goal of schema exchange is the definition of generic transformations of data under structurally similar schemas. To this aim, we have introduced the notion of schema template, which is used to represent a class of different database schemas sharing the same structure. Then, given a mapping between the components of a source and a target template, the goal is the translation of any database whose schema conforms to the source template into a format conforming to the target template. This framework can be used to support several activities involved in the management of heterogeneous data sources: (i) the definition, once and for all, of generic transformations that work for different but similar schemas, such as the denormalization of a pair of relational tables based on a foreign key between them; (ii) the reuse of a data exchange setting, since a mapping between templates can be derived from a mapping between schemas for later use in similar scenarios; and (iii) the specification of model translations, that is, translations of schemas and data from one data model to another (e.g., from relational to XML), a problem largely studied in recent years.
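The denormalization example mentioned in point (i) can be sketched in a few lines of Python. The sketch below is only an illustration of a transformation defined once and applied to structurally similar schemas; it is not GAIA's template or mapping language. The function name denormalize, the row representation as dictionaries, and the sample tables are all hypothetical.

```python
from typing import Dict, List

Row = Dict[str, object]


def denormalize(child: List[Row], parent: List[Row],
                fk: str, pk: str) -> List[Row]:
    """Merge each child row with the parent row its foreign key points to.

    The transformation only assumes the shared structure "two tables linked
    by a foreign key"; table and column names are parameters. A parent
    attribute with the same name as a child attribute would overwrite it,
    a case this sketch does not handle.
    """
    index = {p[pk]: p for p in parent}
    result: List[Row] = []
    for row in child:
        merged = dict(row)
        ref = index.get(row[fk])
        if ref is not None:
            # Copy the parent attributes, skipping the key already held in fk.
            merged.update({k: v for k, v in ref.items() if k != pk})
        result.append(merged)
    return result


# Two different schemas that share the same structure.
employees = [{"emp": "Ann", "dept_id": 1}]
departments = [{"id": 1, "dept_name": "R&D"}]
orders = [{"order_no": 7, "cust": "c1"}]
customers = [{"code": "c1", "city": "Rome"}]

flat_employees = denormalize(employees, departments, fk="dept_id", pk="id")
flat_orders = denormalize(orders, customers, fk="cust", pk="code")
print(flat_employees)
print(flat_orders)
```

The same function serves both pairs of tables because it is written against the structure they share, which is the role a schema template plays in the framework described above.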

Paolo Papotti, Riccardo Torlone: Schema exchange: Generic mappings for transforming data and metadata. Data Knowl. Eng. 68(7), 2009.

Paolo Papotti, Riccardo Torlone: Automatic Generation of Model Translations. CAiSE 2007.

RFID data management

Radio Frequency Identification (RFID) technology plays a key role in supply chains, and a challenging problem is the effective and efficient management of the enormous volume of data generated by such systems. In this scenario, we have studied the problem of storing and querying large amounts of RFID data. Our approach relies on a compression technique that saves a significant amount of space, based on a notion of aggregates over RFID data and a logical representation of these aggregates. We have proposed an indexing technique for aggregates of RFID data that guarantees the efficient execution of an important class of queries. Finally, we have defined the architecture of a tool implementing our approach and demonstrated, with a number of experiments run with this tool, the feasibility and effectiveness of the underlying techniques.

Roberto De Virgilio, Pierpaolo Sugamiele, Riccardo Torlone: Incremental aggregation of RFID data. IDEAS 2009.

Flint: extraction and integration of Web data

A large and increasing number of Web sites publish structured data about recognizable concepts (such as stock quotes, movies, restaurants, etc.). The opportunity to build applications that rely on the huge amount of data available from these sites has been discussed for more than a decade, but in practice only a small fraction of such information is currently used. The main reason is that extracting, curating and integrating Web data is an expensive task, which often requires human intervention.
The Flint project aims at developing automatic and domain-independent tools that support the main tasks needed to benefit from Web data: discovering data-intensive Web sites containing information about entities of interest, extracting and integrating the published data, and performing a probabilistic analysis to characterize the imprecision of the data and the accuracy of the sources.

Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti: Automatically building probabilistic databases from the web. WWW (Companion Volume) 2011.

Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti: Wrapper Generation for Overlapping Web Sources. ACM Web Intelligence 2011.

Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti: Probabilistic Models to Reconcile Complex Data from Inaccurate Data Sources. CAiSE 2010.

Paolo Papotti, Valter Crescenzi, Paolo Merialdo, Mirko Bronzi, Lorenzo Blanco: Redundancy-Driven Web Data Extraction and Integration. WebDB 2010.

Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti: Flint: Google-basing the Web. EDBT 2008.

Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo, Paolo Papotti: Supporting the automatic construction of entity aware search engines. ACM WIDM 2008.

Lorenzo Blanco, Valter Crescenzi, Paolo Merialdo: Structure and Semantics of Data-Intensive Web Pages: An Experimental Study on their Relationships. J. UCS 14(11), 2008.

Schema mapping

A mapping system is a tool supporting the following scenario: given a source schema, a target schema, and a mapping between these two schemas, expressed as a set of attribute correspondences, generate an executable transformation (i.e., a set of queries) to compute target instances from source instances. In the context of relational mapping systems, we have proposed an extension of the well-known Clio framework that takes into account mappings between relational schemas with keys, foreign keys and nullable attributes. Specifically, we extended the two main components of a mapping system (a schema mapping generation algorithm and a query generation algorithm) to deal with such integrity constraints. As a further contribution, we have introduced referenced-attribute correspondences, which make it possible to specify more precise mappings than traditional attribute correspondences, while retaining a simple and intuitive semantics.

Luca Cabibbo: On keys, foreign keys and nullable attributes in relational mapping systems. Proceedings of the 12th International Conference on Extending Database Technology, EDBT 2009.

Textbooks for the database field and other general publications

Members of the group maintain a set of textbooks and course material for teaching databases in universities. This work is carried out together with colleagues at Politecnico di Milano, and has led to the most popular pair of textbooks in the field in Italy, covering both introductory and advanced courses in database methods and technology. A new edition of each of the books is produced every two or three years, with attention to new developments in the technology and to new teaching techniques. Additional written material is produced from time to time.

Publications, including the most recent ones:

P. Atzeni, S. Ceri, S. Paraboschi, R. Torlone: Basi di dati: modelli e linguaggi di interrogazione. McGraw-Hill Italia, third edition.

P. Atzeni, S. Ceri, P. Fraternali, S. Paraboschi, R. Torlone: Basi di dati: Architetture e linee di evoluzione. McGraw-Hill Italia, second edition.

P. Atzeni, S. Ceri, S. Paraboschi, R. Torlone: Database Systems - Concepts, Languages and Architectures. McGraw-Hill Book Company, 1999.

Brief CVs for the permanent members of the group

Paolo Atzeni is a database professor at Università Roma Tre. He received his Dr. Ing. degree in Electrical Engineering from Università di Roma "La Sapienza". Before joining Università Roma Tre, he was with IASI-CNR in Rome, then a faculty member at Università di Napoli and later a professor at Università di Roma "La Sapienza". He also had visiting appointments at the University of Toronto, at Università dell'Aquila, at Microsoft Research and at the National University of Singapore. He has worked on various topics in the database field, including relational database theory, conceptual models and design tools, deductive databases, databases and the Web, model management, and cooperation of database systems. He is the leader of the database group at Roma Tre, which includes six faculty members and various postdocs and students. They collaborate with various groups in Italy and abroad, on topics that include data models, data warehouses, and data in the Web world. He is currently the vice-president of the VLDB Endowment and a member of the Executive Board of the EDBT Association, of which he is also a past President.

Luca Cabibbo is an associate professor at the School of Engineering of Università Roma Tre. He has been with Università Roma Tre since 1997, previously as a research associate. He graduated with honors in Electrical Engineering in 1992 from Università di Roma "La Sapienza". In 1996 he received his PhD, also from Università di Roma "La Sapienza", under the supervision of Paolo Atzeni, with a thesis on querying and updating complex-object databases. His main research interests are in the area of databases and information systems and include: models and languages for object-oriented databases; cooperative database systems; methods and tools for data warehousing and multi-dimensional analysis; models and tools for object-relational mapping (that is, for the transparent management of object persistence by means of relational databases).
On these topics, he has published several papers in important international database journals, including ACM Transactions on Database Systems and Information and Computation, as well as in the proceedings of important international database conferences (ACM PODS, IEEE-ICDE, ICDT, EDBT).

Valter Crescenzi is an Assistant Professor at Università degli Studi Roma Tre. He received his Computer Engineering degree (Laurea in Ingegneria Informatica) from Università degli Studi Roma Tre. In 2001 he received his PhD from Università degli Studi di Roma "La Sapienza", under the supervision of Prof. Paolo Atzeni. During his PhD program he also spent six months at the University of California, San Diego (UCSD), working with Prof. Bertram Ludaescher. His research interests include information extraction and data management techniques for Web data. He has published his research results in important journals of the field, including Journal of the ACM, IEEE Transactions on Knowledge and Data Engineering, and Journal of Applied Artificial Intelligence, and in the refereed proceedings of the major conferences (VLDB).

Paolo Merialdo is an Associate Professor at Università degli Studi Roma Tre. He received his Electronic Engineering degree (Laurea in Ingegneria Elettronica) from Università degli Studi di Genova. In 1998 he received his PhD from Università degli Studi di Roma "La Sapienza", under the supervision of Prof. Paolo Atzeni. During his PhD program he also spent six months at the University of Toronto, working with Prof. Alberto Mendelzon. His research interests include information extraction and data management techniques for Web data. He has published his research results in important journals of the field, including ACM Transactions on Internet Technology, IEEE Transactions on Knowledge and Data Engineering, IEEE Internet Computing, and Journal of Applied Artificial Intelligence, and in the refereed proceedings of the major conferences (ACM-SIGMOD, VLDB, EDBT). He has been a program committee member for many international conferences. He served as Associate Director for ACM SIGMOD-RECORD. He is a co-founder of InnovAction Lab, an entrepreneurship program for master's students.

Riccardo Torlone is a professor in the area of Information Systems at Università Roma Tre. He received his Dr. Ing. degree in Electrical Engineering from Università di Roma "La Sapienza". Before joining Università Roma Tre, he was a member of the research staff at IASI-CNR in Rome, where he still holds a research appointment. He also had a visiting research appointment at the University of California, Los Angeles. His research has considered various topics in the database field, including the following: relational database theory, active and deductive databases, CASE tools for database design, models and languages for object-oriented databases, data warehouses and OLAP systems, Web-based information systems, data and metadata exchange, adaptive information systems and personalization. He has published his research results in the major journals of the field, including ACM Transactions on Database Systems, VLDB Journal, Information Systems, SIAM Journal on Computing, IEEE Transactions on Knowledge and Data Engineering, and Distributed and Parallel Databases, and in the refereed proceedings of all the major conferences (ACM-SIGMOD, VLDB, EDBT, ACM-PODS, IEEE-ICDE, ICDT, ER, CIKM). He has co-authored the most widely adopted book on databases in Italy, also published in an international edition and in several versions. He has also authored two other books.
