Storing data in databases
The webinar will begin at 3pm.
You now have a menu in the top right corner of your screen. The red button with a white arrow allows you to expand and contract the webinar menu, in which you can write questions/comments.
We won't have time to answer questions while we are presenting, but will answer them at the end.
You will be on mute throughout; we can't hear you.
Storing data in databases Webinar 25 October 2016 Peter Smyth UK Data Service
Can you hear us?
If not:
Check your volume, and that your speaker/headset is plugged in.
Your invitation also included a phone number; you can call that to listen in.
  o UK +44 (0) 330 221 9914
  o US +1 (914) 614-3429
We are recording this webinar, so you can always listen to it later.
Overview of this webinar
Definition of a database
Why Excel isn't always good enough
Different database types and availability
Relational databases
  o A bit of history
  o Data organisation
  o Limitations
  o Query examples
Document databases
  o MongoDB
  o Query examples
Graph database demo
Definition of a database
"A structured set of data held in a computer, especially one that is accessible in various ways." (Oxford University Press)
Structured = ordered? Or arranged? The definition says nothing about the details of the structuring.
Accessible = searchable; you are able to query the contents to see what is there.
Not a database! - Why not?
What about Excel?
Worksheets are tabular in nature and so very structured.
You can join sheets together using the VLOOKUP function.
There is a set of database-type functions (DSUM, DCOUNT, etc.).
You can write queries to filter the rows.
Excel restrictions
Sheets have a limit of about 1 million rows (2^20 = 1,048,576).
VLOOKUP can only return a single column.
The database functions can only return a single value.
Setting up queries is quite complex.
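The row limit above is simple arithmetic, sketched here in Python. The helper name fits_in_one_sheet is hypothetical and only illustrates checking a row count against the limit before deciding whether a worksheet is enough.

```python
# Excel worksheets hold at most 2**20 rows.
EXCEL_MAX_ROWS = 2 ** 20


def fits_in_one_sheet(n_rows: int) -> bool:
    """Return True if n_rows rows fit within a single worksheet."""
    return n_rows <= EXCEL_MAX_ROWS


print(EXCEL_MAX_ROWS)                # 1048576
print(fits_in_one_sheet(500_000))    # True
print(fits_in_one_sheet(5_000_000))  # False
```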
Why use a desktop database?
Size of data
Convenience of a desktop system
Flexibility in collecting and persisting data
Flexibility in querying and analysis
Growing and shrinking data
[Figure: data sets placed on a size scale marked 1 KB, 1 MB, 1 GB and 10+ GB, running from Desktop Application to Big Data Environment. Items shown: sent tweet; data from tweet; tweets; all tweets from user; all tweets from user and friends; smart meter data by month and geography; smart meter by month; smart meter by day; all smart meter data.]
Growing and shrinking data
[Figure: data sets placed on a size scale marked 1 KB, 1 GB, 5 GB, 25 GB and 25+ GB, with three environments labelled Desktop Application, Desktop Database and Big Data Environment. Items shown: sent tweet; data from tweet; tweets; all tweets from user; all tweets from user and friends; smart meter data by month and geography; smart meter by month; smart meter by day; all smart meter data.]
Types of databases
There are many different types of databases.
For the end user there are probably four main types:
Relational databases (MySQL, MS SQL, SQLite, Postgres, ...)
Document databases (MongoDB, CouchDB, ...)
Graph databases (Neo4j, Titan, ...)
Wide column stores (Cassandra, HBase, ...)
Types of databases
Relational databases predominate by a long way. Data is held in tables with defined relationships between the tables.
Document databases and wide column stores use storage architectures designed to overcome some of the scalability problems of relational databases. Since Big Data sources have become available, these have been gaining in popularity.
Graph databases are designed to optimise a specific type of query, where you are more interested in the relationships between different items than in the actual attributes of the items. They are often used with social networks.
Types of databases
The link below provides a table of the different database systems available and their relative use. Both commercial and free database systems are included.
http://db-engines.com/en/ranking
Types of Databases (Table) Freely available options
The Relational Model
Why do we have it?
What is it good for?
What are the pros and cons?
What do we mean by relational?
The Relational Model - History
The term "relational database" was first used by E. F. Codd in 1970, in the paper "A Relational Model of Data for Large Shared Data Banks".
Although not necessarily the primary driver, it should be noted that at the time computer storage was very expensive.
The relational model can be very efficient when storing data. Typically, data items are stored only once.
The Relational Model - History
Storage prices fell from about $193K per GB in 1980 to about $0.03 per GB in 2014.
http://www.mkomo.com/cost-per-gigabyte-update
The Relational Model - How it works
If I wanted to record the details of a house and the people who lived there, I could create a table like this:
Table: HouseHold_All
Columns: HouseHold_Id, Address, PostCode, Person_id, FirstName, LastName, DOB, Sex, Age, No_of_Rooms, No_of_Occupants, Type, Construction
I would need a single record for each person at that address.
The Relational Model - How it works
And populate it with data, like this:
HouseHold_Id | Address | PostCode | Person_id | FirstName | LastName | DOB | Sex | Age | No_of_Rooms | No_of_Occupants | Type | Construction
1 | Some street, Some Town | AA1 2BB | 1 | Alfie | Smith | 17/09/1963 | M | 60 | 8 | 5 | Semi | Brick
1 | Some street, Some Town | AA1 2BB | 2 | Jane | Smith | 05/02/1970 | F | 60 | 8 | 5 | Semi | Brick
1 | Some street, Some Town | AA1 2BB | 3 | John | Smith | 03/01/2001 | M | 60 | 8 | 5 | Semi | Brick
1 | Some street, Some Town | AA1 2BB | 4 | Jack | Smith | 10/10/2005 | M | 60 | 8 | 5 | Semi | Brick
1 | Some street, Some Town | AA1 2BB | 5 | Jenny | Smith | 07/05/2009 | F | 60 | 8 | 5 | Semi | Brick
These records all relate to the same household, but the data about the house itself is repeated for each person in the house.
The Relational Model - How it works
It makes more sense to use multiple tables and split the data between them. This eliminates the need to duplicate data.
The arrows represent relationships between the tables.
If I only wanted details about a person, I wouldn't need to refer to the other tables.
The Relational Model - How it works
All of the Occupant information is kept in a single table.
Details of the Property are only recorded once, in the three smaller tables.
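A minimal sketch of the split, using SQLite from Python. It is a simplification: the slides divide the property details across three smaller tables, whereas this example assumes a single household table, and the table and column names here are hypothetical. It shows the house being stored once, the occupants being queried on their own, and a join bringing the two back together.

```python
import sqlite3

# In-memory database; the table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE household (
    household_id INTEGER PRIMARY KEY,
    address      TEXT NOT NULL,
    postcode     TEXT NOT NULL,
    no_of_rooms  INTEGER,
    type         TEXT,
    construction TEXT
);
CREATE TABLE occupant (
    person_id    INTEGER PRIMARY KEY,
    household_id INTEGER NOT NULL REFERENCES household(household_id),
    first_name   TEXT,
    last_name    TEXT,
    dob          TEXT,
    sex          TEXT
);
""")

# The house is recorded once...
conn.execute(
    "INSERT INTO household VALUES (1, 'Some street, Some Town', 'AA1 2BB', 8, 'Semi', 'Brick')"
)
# ...and each occupant points to it through household_id.
occupants = [
    (1, 1, "Alfie", "Smith", "1963-09-17", "M"),
    (2, 1, "Jane", "Smith", "1970-02-05", "F"),
    (3, 1, "John", "Smith", "2001-01-03", "M"),
    (4, 1, "Jack", "Smith", "2005-10-10", "M"),
    (5, 1, "Jenny", "Smith", "2009-05-07", "F"),
]
conn.executemany("INSERT INTO occupant VALUES (?, ?, ?, ?, ?, ?)", occupants)

# Details about people only: no need to touch the household table.
names = [row[0] for row in conn.execute(
    "SELECT first_name FROM occupant ORDER BY person_id")]

# When both are needed, a join follows the relationship between the tables.
joined = conn.execute("""
    SELECT o.first_name, h.postcode
    FROM occupant AS o
    JOIN household AS h ON h.household_id = o.household_id
    ORDER BY o.person_id
""").fetchall()

household_rows = conn.execute("SELECT COUNT(*) FROM household").fetchone()[0]
print(names)
print(joined)
print(household_rows)
```

The number of occupants no longer needs its own column; it can be derived by counting the occupant rows for a household.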
The Relational Model - Advantages
Data is only stored once (across multiple tables if necessary).
Efficient for well-known and structured data.
Well-defined and well-understood query language (SQL), with variants available for all relational databases.
Schema on write allows comprehensive data checking before loading, making for cleaner data.
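Schema on write can be sketched with SQLite from Python: the rules are declared when the table is created, and rows that break them are rejected at load time. The table, its columns and the particular checks below are hypothetical examples, not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is declared up front; these constraints are illustrative assumptions.
conn.execute("""
CREATE TABLE occupant (
    person_id  INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    sex        TEXT NOT NULL CHECK (sex IN ('M', 'F')),
    dob        TEXT NOT NULL
)
""")

candidates = [
    (1, "Alfie", "M", "1963-09-17"),  # valid
    (2, "Jane", "X", "1970-02-05"),   # breaks the CHECK on sex
    (3, None, "M", "2001-01-03"),     # breaks NOT NULL on first_name
]

loaded, rejected = [], []
for row in candidates:
    try:
        conn.execute("INSERT INTO occupant VALUES (?, ?, ?, ?)", row)
        loaded.append(row[0])
    except sqlite3.IntegrityError as exc:
        # The database refuses the row, so bad data never reaches the table.
        rejected.append((row[0], str(exc)))

print(loaded)
for person_id, reason in rejected:
    print(person_id, reason)
```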
The Relational Model - Disadvantages
The need for multiple tables increases loading times.
Uses vertical scaling.
  o Not really relevant for desktop databases
Schema on write cannot deal with unstructured data efficiently, if at all.
Document Databases
Why do we have it?
What is it good for?
What are the pros and cons?
What is meant by a document?
Document Database
A document does not mean a PDF or Word document.
A document is semi-structured data.
It is structured in that every data item in the document has a name associated with it.
It is "semi-" in that different documents in the same collection of documents don't have to have the same set of names.
JSON - Example semi-structured data
The most popular format for semi-structured data is JSON.
Most data that can be downloaded from a web-based API will be in JSON format (or at least offer JSON as a choice of format).
JSON - Example semi-structured data
The following is a simple example of JSON-formatted data:
    {
        "Name" : "Manchester",
        "PostCode" : "M13 9PL",
        "Established" : 1824
    }
It is split over several lines just to aid reading. Everything between the { and } represents a single record, or document.
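As a small illustration, the record above can be read with Python's standard json module. The text is written with the double quotation marks that JSON requires around names and string values; the variable names are arbitrary.

```python
import json

# The example record, split over several lines purely for readability.
text = """
{
    "Name": "Manchester",
    "PostCode": "M13 9PL",
    "Established": 1824
}
"""

record = json.loads(text)      # one document becomes one Python dict
print(record["Name"])          # Manchester
print(record["Established"])   # 1824 (parsed as a number, not a string)

# The same document on a single line carries exactly the same data.
one_line = json.dumps(record)
print(one_line)
```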
Document Databases
The semi-structured nature means that it is difficult to store the data in tables.
Not all fields need to be in each document.
Fields don't need to be in the same order.
    { 'id' : 1234, 'Name' : 'Peter', 'Tel' : '012345678' }
    { 'Name' : 'John', 'id' : 3523, 'Email' : ['John@abc.com', 'j.smith@xyz.com'], 'mob' : '012345678' }
It is even more difficult to create a schema for the data in advance.
Instead, data is stored as-is and a schema is created when the data is read: schema on read.
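Schema on read can be sketched in plain Python, without any database: the two documents are kept exactly as they arrived, and the structure is imposed only at the point of reading. The function read_contact and the output field names are hypothetical, and the phone numbers are held as strings here (an assumption, so that the leading zero survives).

```python
# Two documents from the same collection, stored as-is.
documents = [
    {"id": 1234, "Name": "Peter", "Tel": "012345678"},
    {"Name": "John", "id": 3523,
     "Email": ["John@abc.com", "j.smith@xyz.com"], "mob": "012345678"},
]

# No schema was declared in advance, so the set of field names
# has to be discovered from the data itself.
all_fields = sorted({field for doc in documents for field in doc})


def read_contact(doc: dict) -> dict:
    """Apply a schema at read time, tolerating missing or renamed fields."""
    return {
        "id": doc["id"],
        "name": doc["Name"],
        "phone": doc.get("Tel") or doc.get("mob"),  # either field may hold it
        "emails": doc.get("Email", []),             # absent means no emails
    }


contacts = [read_contact(doc) for doc in documents]
print(all_fields)
for contact in contacts:
    print(contact)
```

The cost of this flexibility is that every reader has to decide how to handle the variations, which is part of why querying semi-structured data is more complex.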
Document Databases - NoSQL
Non-relational databases like MongoDB typically do not use SQL to query the data.
When you install MongoDB you are provided with a simple shell interface from which you can query the database.
Use of the shell to query requires a knowledge of JavaScript.
As an alternative, both Python and R have packages which interface to MongoDB and allow querying of the database using native Python-like or R-like constructs.
The semi-structured nature of the data adds to the complexity of querying.
A Graph Database - Neo4j
The default installation of Neo4j provides a simple default Movies database. It also comes with tutorials to help get you started.
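To give a feel for the relationship-centred querying that graph databases are designed for, here is a sketch in plain Python. It does not use Neo4j or its query language; the people, films and the co_actors function are all made up for illustration. The question asked is about connections ("who has appeared in a film with this person?") rather than about the attributes of any one item.

```python
from collections import defaultdict

# Hypothetical "acted in" relationships between people and films.
acted_in = [
    ("Alice", "Film A"),
    ("Bob", "Film A"),
    ("Bob", "Film B"),
    ("Carol", "Film B"),
]

# Index the relationships in both directions.
cast = defaultdict(set)   # film   -> people in it
films = defaultdict(set)  # person -> films they appear in
for person, film in acted_in:
    cast[film].add(person)
    films[person].add(film)


def co_actors(person: str) -> set:
    """People who share at least one film with the given person."""
    return {other for film in films[person] for other in cast[film]} - {person}


print(sorted(co_actors("Bob")))    # ['Alice', 'Carol']
print(sorted(co_actors("Alice")))  # ['Bob']
```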
Summary
The size of your data may be enough to make you decide on using a desktop database.
But it may not be the only consideration:
  o How are you collecting the data over time?
  o What is the structure of the data?
  o How do you intend to use the data?
  o Can you clean and structure the data as you collect it?
  o Do you need to keep all of the raw data just in case?
Questions
Peter Smyth
Peter.smyth@manchester.ac.uk
ukdataservice.ac.uk/help/
Subscribe to the UK Data Service news list at https://www.jiscmail.ac.uk/cgibin/webadmin?a0=ukdataservice
Follow us on Twitter https://twitter.com/ukdataservice or Facebook https://www.facebook.com/ukdataservice