A Functorial Query Language
Ryan Wisnesky (ryan@cs.harvard.edu)
Harvard University
DCP 2014

History
In the early 1990s, Rosebrugh et al. noticed that finitely presented categories could be thought of as database schemas.
A finitely presented category (a schema) is a directed multigraph with path equations.
An instance I on a schema C is a functor C → Set. That is, for each node X, a set of IDs I_X, and for each edge X → Y, a function I_X → I_Y.

Example Schema & Instance
Nodes: Employee, Department
Edges: manager : Employee → Employee, worksin : Employee → Department, secretary : Department → Employee
Equations:
Employee.manager.worksIn = Employee.worksIn
Department.secretary.worksIn = Department

Employee
ID   manager  worksin
101  103      q10
102  102      x02
103  103      q10

Department
ID   secretary
q10  102
x02  101

History, continued
A schema mapping F : C → D takes nodes(C) → nodes(D) and edges(C) → paths(D) in a way that respects C's path equations. That is, a mapping is a functor.
Given a mapping F : C → D, there are three adjoint data migration operations:
Δ_F : D-Inst → C-Inst (similar to projection)
Σ_F : C-Inst → D-Inst (similar to disjoint union)
Π_F : C-Inst → D-Inst (similar to cartesian product)
We call this the functorial data model.

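Example Migrations
Mapping F: from the schema with nodes A and B (no edges) to the schema with the single node C, with F(A) = C and F(B) = C.

Instance on {A, B}: A = {a1, a2, a3}, B = {b1, b2}
Instance on {C}: C = {c1, c2, c3}

Δ_F, where Δ_F(A) = C and Δ_F(B) = C:
A = {c1, c2, c3}, B = {c1, c2, c3}

Σ_F, where Σ_F(C) = A + B:
C = {(a1,a), (a2,a), (a3,a), (b1,b), (b2,b)}

Π_F, where Π_F(C) = A × B:
C = {(a1,b1), (a1,b2), (a2,b1), (a2,b2), (a3,b1), (a3,b2)}
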
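The example above can be reproduced with a short Python sketch. It covers only the special case shown on the slide, where both schemas are discrete (nodes, no edges), so a mapping is just a function on nodes. The function names `delta`, `sigma` and `pi`, and the choice to tag Σ results with the source node name (the slide tags with `a` and `b`), are illustrative assumptions and not part of FQL.

```python
from itertools import product

# Node mapping F for the slide's example: both A and B are sent to C.
F = {"A": "C", "B": "C"}


def delta(F, target_inst):
    """Delta_F: each source node n receives a copy of the set at F(n)."""
    return {n: set(target_inst[F[n]]) for n in F}


def sigma(F, target_nodes, source_inst):
    """Sigma_F on discrete schemas: disjoint union over the nodes mapped to t.

    Elements are tagged with their source node name to keep the union disjoint.
    """
    out = {t: set() for t in target_nodes}
    for n, t in F.items():
        out[t] |= {(x, n) for x in source_inst[n]}
    return out


def pi(F, target_nodes, source_inst):
    """Pi_F on discrete schemas: cartesian product over the nodes mapped to t."""
    out = {}
    for t in target_nodes:
        fibre = sorted(n for n in F if F[n] == t)
        out[t] = set(product(*(sorted(source_inst[n]) for n in fibre)))
    return out


ab_instance = {"A": {"a1", "a2", "a3"}, "B": {"b1", "b2"}}
c_instance = {"C": {"c1", "c2", "c3"}}

delta_result = delta(F, c_instance)
sigma_result = sigma(F, ["C"], ab_instance)
pi_result = pi(F, ["C"], ab_instance)
```

With schemas that have edges and path equations, Σ and Π are considerably more involved than this sketch suggests; only Δ remains a simple precomposition.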
Advantages
Schemas and mappings form a bi-cartesian closed category (BCCC), so we can take products of schemas, co-products, exponentials, etc.
For each schema T, the T-instances and their homomorphisms (which are natural transformations) form a topos (a BCCC with a sub-object classifier).
Data integrity constraints are built into schemas, and path equations are a natural and expressive class of constraint.

Disadvantages
Instances must be considered up to isomorphism, not equality, so every constant is a meaningless ID.
There is no obvious query language, nor is it obvious how to implement the three data migration operators using, for example, SQL.
So Rosebrugh et al. moved on to a different categorical data model, that of sketches.
We pick up where they left off and address these challenges.

Historical Aside
There have been many other uses of category theory for information management.
Wong, Tannen, Buneman, and others used category theory to develop the nested relational calculus, but their work is not related to ours.
Alagic and Bernstein defined a notion of a good data model using category theory; the functorial data model is good by their definition.
The Δ, Σ, Π data migration operations appear in a different guise in categorical logic and type theory. Key phrase: quantification is adjoint to substitution.

Contributions
I have been working with David Spivak to extend the functorial data model, and to build practical tools based on it. The second half of this talk will be a demo.
Key results:
A way to store concrete data (attributes) in instances.
A functorial query language, FQL.
An implementation of FQL in SQL, and vice versa.
Project webpage: wisnesky.net/fql.html

Attributes
We associate to each node in a schema a set of attribute names and domains (strings, integers, etc.). An instance contains additional columns for attributes.
The category theory required to describe schemas and instances with attributes is verbose, but straightforward.
The key challenge is making sure the useful properties of the functorial data model continue to hold. Example: isomorphisms of instances preserve attributes.

Employees with Attributes
Nodes: Employee, Department
Edges: manager : Employee → Employee, worksin : Employee → Department, secretary : Department → Employee
Attributes: first and last on Employee, name on Department
Equations:
Employee.manager.worksIn = Employee.worksIn
Department.secretary.worksIn = Department

Employee
ID   manager  worksin  first  last
101  103      q10      Al     Akin
102  102      x02      Bob    Bo
103  103      q10      Carl   Cork

Department
ID   secretary  name
q10  102        CS
x02  101        Math

FQL
Schemas with attributes still form a BCCC. We can define schemas and mappings using categorical abstract machine language or, equivalently, the simply typed λ-calculus (STLC).
Instances with attributes still form a topos. We can define instances and homomorphisms using higher-order logic (HOL) (= STLC + equality at all types).
Some minor details about finite vs infinite domains apply.
Migrations of the form Σ_F ∘ Π_G ∘ Δ_H are closed under composition, provided F is a discrete op-fibration. An FQL query is a migration of the above form.

FQL Syntax

    schema S = {
      nodes Employee, Department;
      attributes
        name : Department -> string,
        first : Employee -> string,
        last : Employee -> string;
      arrows
        manager : Employee -> Employee,
        worksin : Employee -> Department,
        secretary : Department -> Employee;
      equations
        Employee.manager.worksIn = Employee.worksIn,
        Department.secretary.worksIn = Department
    }

    instance I = {
      nodes
        Employee -> { 101, 102, 103 },
        Department -> { q10, x02 };
      attributes
        first -> { (101, Alan), (102, Camille), (103, Andrey) },
        last -> { (101, Turing), (102, Jordan), (103, Markov) },
        name -> { (q10, AppliedMath), (x02, PureMath) };
      arrows
        manager -> { (101, 103), (102, 102), (103, 103) },
        worksin -> { (101, q10), (102, x02), (103, q10) },
        secretary -> { (q10, 101), (x02, 102) };
    } : S

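To make the functor condition concrete, the instance I above can be checked against the schema S in a few lines of Python. This is a minimal sketch, not how the FQL implementation works: the dictionary layout and the function names `follow` and `is_instance` are assumptions made for illustration, and attributes are left out.

```python
# Arrows of schema S with their source and target nodes.
ARROWS = {
    "manager": ("Employee", "Employee"),
    "worksin": ("Employee", "Department"),
    "secretary": ("Department", "Employee"),
}

# Path equations as (start node, left path, right path); [] is the empty path.
EQUATIONS = [
    ("Employee", ["manager", "worksin"], ["worksin"]),
    ("Department", ["secretary", "worksin"], []),
]

# Instance I: one set of IDs per node, one function (dict) per arrow.
NODES = {"Employee": {101, 102, 103}, "Department": {"q10", "x02"}}
FUNCS = {
    "manager": {101: 103, 102: 102, 103: 103},
    "worksin": {101: "q10", 102: "x02", 103: "q10"},
    "secretary": {"q10": 101, "x02": 102},
}


def follow(funcs, path, x):
    """Apply the arrows of a path to the ID x, left to right."""
    for arrow in path:
        x = funcs[arrow][x]
    return x


def is_instance(arrows, equations, nodes, funcs):
    """Check that funcs are total functions and satisfy every path equation."""
    for arrow, (src, tgt) in arrows.items():
        f = funcs[arrow]
        if set(f) != nodes[src] or not set(f.values()) <= nodes[tgt]:
            return False
    return all(
        follow(funcs, left, x) == follow(funcs, right, x)
        for start, left, right in equations
        for x in nodes[start]
    )


ok = is_instance(ARROWS, EQUATIONS, NODES, FUNCS)

# Swapping the secretaries breaks Department.secretary.worksIn = Department.
bad_funcs = dict(FUNCS, secretary={"q10": 102, "x02": 101})
broken = is_instance(ARROWS, EQUATIONS, NODES, bad_funcs)
```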
FQL Syntax, continued

    //From products example

    //products of schemas
    schema S = { }
    schema T = { }
    schema A = (S * T)
    mapping p1 = fst S T
    mapping p2 = snd S T
    mapping p = (p1 * p2) //is id

    //products of instances
    instance I = { } : S
    instance J = { } : S
    instance A = (I * J)
    transform K = A.fst
    transform L = A.snd
    transform M = A.(K * L) //is id

    //From co-products example

    //co-products of schemas
    schema S = { }
    schema T = { }
    schema A = (S + T)
    mapping p1 = inl S T
    mapping p2 = inr S T
    mapping p = (p1 + p2) //is id

    //co-products of instances
    instance I = { } : S
    instance J = { } : S
    instance A = (I + J)
    transform K = A.inl
    transform L = A.inr
    transform M = A.(K + L) //is id

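Products and co-products of instances on the same schema are computed node by node. The Python sketch below illustrates that idea under stated assumptions: attributes are ignored, the toy schema with one arrow `f : X -> Y` is hypothetical, and the names `product_instance`, `coproduct_instance` and `fst` are illustrative and do not describe how FQL itself implements `I * J` or `I + J`.

```python
def product_instance(schema, I, J):
    """Pointwise product: pairs of IDs at each node, arrows act componentwise."""
    nodes = {
        n: {(x, y) for x in I["nodes"][n] for y in J["nodes"][n]}
        for n in I["nodes"]
    }
    arrows = {
        a: {(x, y): (I["arrows"][a][x], J["arrows"][a][y]) for (x, y) in nodes[src]}
        for a, (src, _tgt) in schema.items()
    }
    return {"nodes": nodes, "arrows": arrows}


def coproduct_instance(schema, I, J):
    """Pointwise disjoint union: IDs are tagged "inl" or "inr"."""
    nodes = {
        n: {("inl", x) for x in I["nodes"][n]} | {("inr", y) for y in J["nodes"][n]}
        for n in I["nodes"]
    }
    arrows = {}
    for a in schema:
        f = {("inl", x): ("inl", v) for x, v in I["arrows"][a].items()}
        f.update({("inr", y): ("inr", v) for y, v in J["arrows"][a].items()})
        arrows[a] = f
    return {"nodes": nodes, "arrows": arrows}


def fst(P):
    """First projection out of a product instance, one function per node."""
    return {n: {p: p[0] for p in elems} for n, elems in P["nodes"].items()}


# Hypothetical toy schema with a single arrow f : X -> Y.
SCHEMA = {"f": ("X", "Y")}
I = {"nodes": {"X": {1, 2}, "Y": {"a"}}, "arrows": {"f": {1: "a", 2: "a"}}}
J = {"nodes": {"X": {3}, "Y": {"b", "c"}}, "arrows": {"f": {3: "b"}}}

P = product_instance(SCHEMA, I, J)
Q = coproduct_instance(SCHEMA, I, J)
```

The projection `fst(P)` commutes with the arrow `f`, which is what makes it a homomorphism (natural transformation) of instances.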
FQL / SQL
Let SPCU denote the select-project-product-union relational algebra.
Let guidgen denote the operation taking n-ary tables to (n+1)-ary tables by adding a new column with globally unique IDs.
Every FQL query can be implemented using SPCU+guidgen.
Every SPCU query under bag semantics can be implemented using FQL.
FQL can be extended with an operation, relationalize, such that every SPCU query under set semantics can be implemented using FQL+relationalize.

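The guidgen operation is simple enough to sketch directly. The Python version below is an assumption-laden illustration: tables are modeled as lists of tuples (so duplicate rows survive, as under bag semantics), and "globally unique" is approximated by a single process-wide counter, which is not how a real database would generate IDs.

```python
import itertools

# One shared counter, so IDs never repeat across calls within this process.
_fresh_ids = itertools.count(1)


def guidgen(table):
    """Turn an n-ary table into an (n+1)-ary table by appending a fresh ID."""
    return [row + (f"g{next(_fresh_ids)}",) for row in table]


# Two identical rows become distinguishable once they carry fresh IDs.
first = guidgen([("x", 1), ("x", 1)])
second = guidgen([("y", 2)])
```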
FQL to SQL
Δ migrations are compositions of tables, implementable with SPC.
Σ migrations are unions of compositions of tables, implementable with SPCU.
Π migrations are implementable with SPC+guidgen, but are much more complex to describe than Δ and Σ. Their construction requires comma categories and implementing diagram limits using joins.
Products and co-products are implementable with SPCU.

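As an illustration of "compositions of tables", suppose a mapping sends some edge to the path manager.worksin. The Δ table for that edge is then the relational composition of the two edge tables, which is a join followed by a projection. The sketch below uses Python's built-in `sqlite3` module purely for convenience; the binary `(src, tgt)` table layout and the helper `compose` are assumptions, and this is not the SQL that FQL actually emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE manager (src INTEGER, tgt INTEGER);
    CREATE TABLE worksin (src INTEGER, tgt TEXT);
    INSERT INTO manager VALUES (101, 103), (102, 102), (103, 103);
    INSERT INTO worksin VALUES (101, 'q10'), (102, 'x02'), (103, 'q10');
    """
)


def compose(conn, first, second):
    """Compose two edge tables: follow `first`, then `second`.

    Table names are interpolated into the SQL text, so only pass trusted,
    hard-coded identifiers.
    """
    sql = (
        f"SELECT a.src, b.tgt FROM {first} AS a "
        f"JOIN {second} AS b ON a.tgt = b.src"
    )
    return sorted(conn.execute(sql).fetchall())


# Table for the path manager.worksin, built with select, project and product only.
manager_then_worksin = compose(conn, "manager", "worksin")
worksin_rows = sorted(conn.execute("SELECT src, tgt FROM worksin").fetchall())
```

On this data the composed table coincides with `worksin`, which is exactly the path equation Employee.manager.worksIn = Employee.worksIn.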
SQL to FQL
Consider a relational schema with two relations: R(c_1, ..., c_n) and R'(c'_1, ..., c'_n').
It is encoded in FQL with an active domain and an attribute: R and R' become nodes, each column becomes an arrow from its relation's node to a shared node adom, and adom carries an attribute att.
Using this encoding, Δ implements projection, Π implements selection and product, and Σ implements union.

FQL IDE Demo
Download fql.jar from wisnesky.net/fql.html
Run by double-clicking or with java -jar fql.jar
Requires Java 7.
Internally, FQL emits SQL and runs it using the H2 SQL engine (h2database.com).
The FQL IDE does allow additional operations that can't be implemented in SQL.
The FQL IDE can execute against external instances using JDBC.
FQL is case insensitive.

Demo screenshots (images not reproduced):
Home Screen
Employees
Delta - Mapping
Delta - Projection
Sigma - Mapping
Sigma - Union
Pi - Mapping
Pi - Product

Other FQL IDE Features
Translates from SPCU (in SQL syntax) to FQL.
Category of elements view displays an instance as a graph where every ID is a node.
Observables view shows all the different attributes associated with an ID.
Generates FQL from attribute correspondences.
Supports enumerated (finite) types.
Compiles FQL to embedded dependencies, an alternative relational language.

Conclusion
We are excited about the functorial data model as an alternative basis for studying problems in information management. It has a number of useful properties that the relational model does not:
Its schemas are naturally based on graphs and built-in constraints.
It is naturally ID and bag based, and can be extended to work with sets.
It can implement a number of information integration scenarios that relational tools like Clio and Rondo cannot (see my thesis).
It can implement SPCU via an encoding.
As the FQL IDE demonstrates, functorial data migration is more than just generalized abstract nonsense.
Send questions/comments to ryan@cs.harvard.edu.