Data, Information, and Databases
BDIS 6.1
BSAD 141, Dave Novak

Topics Covered
  Information types: transactional vs. analytical
  Five characteristics of information quality
  Database versus a DBMS
  RDBMS: advantages and terminology
  Multi-user issues

The Need for High-Quality Information
  Data are everywhere
  Which data are important?

  Recall the difference between data and information
  Which data should the organization store?
  Which data need to be further manipulated?
  Which data are required to make different types of decisions?
  How does the organization convert raw data into the information that is needed?

  Organizations need to obtain and analyze the many different levels, formats, and granularities of organizational information to make decisions

  Decisions are only as good as the quality of the data and information used to make them
  Garbage in, garbage out
  Using poor-quality data doesn't help
Data Quality Problems

Example of Poor Quality Data

Characteristics of High Quality Data
  1) Accurate
  2) Complete
  3) Consistent
  4) Unique
  5) Timely

1) Accurate
  Are the data (is the information) correct, precise, and exact?
  For example:
    Are the data factual?
    Are the data error-free?
    Have the data been verified?
    Correct spelling
    Precise numbers

2) Complete
  Are the data whole (complete), and do they have all the necessary parts?
  For example:
    Are there missing values or pieces of data?
    Full street address
    Area code along with phone number
    Empty fields
    Full names

3) Consistent
  Are the data in agreement with themselves and with known facts?
  For example:
    Does summary information agree with detailed information?
    Can you reconcile the data?
    Do mathematical manipulations yield correct results?
    Are data manipulations performed consistently for the entire data set?

4) Unique
  Are the data unique (one of a kind), or are there redundant, repetitious, or unnecessary data stored in the same database?
  For example:
    Are there duplicate records for the same event?
    Are there different versions of the same file or event (which is the latest or most accurate?)
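Two of the characteristics above, Complete and Unique, are easy to check mechanically. The following is a minimal sketch, not a prescribed method: the customer records, the field names, and the assumed phone format (area code, prefix, and number separated by dashes) are all hypothetical and chosen only for illustration.

```python
# Hypothetical customer records; the field names are illustrative assumptions.
records = [
    {"id": 1, "name": "Ann Lee", "phone": "802-555-0101", "zip": "05401"},
    {"id": 2, "name": "Bob Roy", "phone": "555-0102", "zip": "05401"},      # missing area code
    {"id": 3, "name": "", "phone": "802-555-0103", "zip": "05405"},         # empty name field
    {"id": 1, "name": "Ann Lee", "phone": "802-555-0101", "zip": "05401"},  # duplicate of id 1
]


def incomplete(record):
    """Complete: every field is filled in and the phone includes an area code."""
    has_empty_field = any(value == "" for value in record.values())
    # Assumed format for this sketch: three dash-separated parts.
    missing_area_code = len(record["phone"].split("-")) != 3
    return has_empty_field or missing_area_code


def duplicate_ids(rows):
    """Unique: report ids that appear on more than one record."""
    seen, dupes = set(), set()
    for row in rows:
        if row["id"] in seen:
            dupes.add(row["id"])
        seen.add(row["id"])
    return sorted(dupes)


incomplete_ids = [r["id"] for r in records if incomplete(r)]
print("Incomplete records:", incomplete_ids)
print("Duplicate ids:", duplicate_ids(records))
```

Accuracy and consistency are harder to automate because they usually require comparing the data against an outside source or against summary totals.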
5) Timely
  Are the data current with respect to decision-making needs?
  Timeliness depends on the situation
  Real-time information
    Immediate, up-to-date information
  Real-time system
    Provides real-time information in response to requests
  Real-time is a relative description that depends on the use or need

Examples of how data can be of poor quality
  Customers intentionally enter inaccurate information to protect their privacy or because they are irritated
  Different data entry standards and formats are used
  Operators enter abbreviated or erroneous information by accident or to save time
  Third-party and external information contains inconsistencies, inaccuracies, and errors

What is a Database?
  Database: a collection of information organized in a way that provides efficient retrieval
  There are electronic and physical databases (paper/print)
  A database can be a very simple collection of data, such as names arranged alphabetically in an address book

  Self-describing collection of integrated records
  Includes metadata about the fields/attributes
  Governs acceptable data formats for consistency
  Hierarchy of data elements
    Columns/Fields
    Rows/Records
    Tables/Relations
  A location to store and retrieve well-structured and well-governed data

What is a Database Management System (DBMS)?
  Database management system (DBMS): a set of computer programs / software that allows users to store, modify, query, and retrieve data in an organized, systematic, and controlled manner

  A database (the physical collection of data) is typically not portable across different DBMS
  Like application software, different DBMS are generally designed to work with specific system software and specific database schema
Database Management System (DBMS)
  What is a database schema?
    The way in which the objects in the database are logically grouped / organized
    What are the tables and how are they linked?
    What are the different user views?
    What types of procedures and queries are stored?

  A database is typically something inside the DBMS, although in the case of an MS Excel workbook the database is a standalone object

Single File Data Management
  MS Excel is a database, but it is not a DBMS!
    There is NO DB management component: each worksheet is a single large two-dimensional matrix
  A DBMS is software that is used to manage the database and provides a set of tools used to manipulate and query data
  A database is simply an organized collection of data that can be accessed

Why go beyond a Spreadsheet?
  Need to store multiple themes of data
  Spreadsheets lack structure and are prone to error
  To reduce redundantly stored data
  Optimized query/reporting
  Databases ENFORCE consistency of the data
  Spreadsheets are clumsy and time-consuming to update, append, or expand
  Multiple user access

Why Redundancy and Duplication of Data are Important to Avoid
  Update, insertion, and deletion anomalies
    Poorly normalized tables require duplicate entries: when you change a value in one record, how do we ensure that every duplicated copy of that value is changed too?
    If an employee leaves or if you stop selling a specific product, should your system permit those records to be deleted?
    Would you have this level of control over a spreadsheet?
  Redundancy is great for backups but terribly inefficient for data structures
  Redundant data:
    Increase manual time required for development and data entry
    Increase required disk space
    Decrease processing speed and response time
    Lead to data anomalies and inconsistencies

Types of Database Architectures
  Hierarchical Model
    Parent/child, tree-like structure
    Parents can have many children, but children have only one parent
  Network Model
    Permits children to have many parents
    Offers more direct relationships between entities
    Mostly replaced by the relational model
  Object Model
    Ideal when demand for massive amounts of information about single items is frequent (high energy physics, molecular biology, spatial databases, telecommunications, ...)
  Relational Model
    Most common, and what we will study in this class
    By far the most dominant enterprise data structure
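The update anomaly raised under redundancy above can be made concrete with a small sketch. It uses Python's built-in sqlite3 module purely as a convenient stand-in for an RDBMS; the tables (ProductFlat, Supplier, Product) and their contents are hypothetical. The first table repeats a supplier's phone number on every product row, and the second design stores it once in its own table, as the relational model encourages.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Poorly normalized: the supplier phone is repeated on every product row.
conn.execute("CREATE TABLE ProductFlat (product TEXT, supplier TEXT, supplier_phone TEXT)")
conn.executemany(
    "INSERT INTO ProductFlat VALUES (?, ?, ?)",
    [("Hammer", "Acme", "555-0100"), ("Wrench", "Acme", "555-0100")],
)
# Update anomaly: only one of the duplicated values gets changed.
conn.execute("UPDATE ProductFlat SET supplier_phone = '555-0199' WHERE product = 'Hammer'")
flat_phones = conn.execute(
    "SELECT DISTINCT supplier_phone FROM ProductFlat "
    "WHERE supplier = 'Acme' ORDER BY supplier_phone"
).fetchall()
print("Flat table:", flat_phones)  # the same supplier now has two phone numbers

# Normalized: the phone is stored exactly once, in a Supplier table.
conn.execute("CREATE TABLE Supplier (supplier TEXT PRIMARY KEY, supplier_phone TEXT)")
conn.execute(
    "CREATE TABLE Product (product TEXT PRIMARY KEY, "
    "supplier TEXT REFERENCES Supplier(supplier))"
)
conn.execute("INSERT INTO Supplier VALUES ('Acme', '555-0100')")
conn.executemany("INSERT INTO Product VALUES (?, ?)", [("Hammer", "Acme"), ("Wrench", "Acme")])
conn.execute("UPDATE Supplier SET supplier_phone = '555-0199' WHERE supplier = 'Acme'")
normalized_phones = conn.execute(
    "SELECT DISTINCT s.supplier_phone "
    "FROM Product AS p JOIN Supplier AS s ON s.supplier = p.supplier"
).fetchall()
print("Normalized tables:", normalized_phones)  # one phone number, changed in one place
```

In the normalized design a single update reaches every product, because the value exists in only one row.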
Database Management System (DBMS)
  NoSQL database technologies
    RDBMS are not well suited to handle unstructured data
    NoSQL technologies offer increased flexibility and scalability
    NoSQL technologies are designed with big data needs, as opposed to transaction processing needs, in mind

RDBMS
  The most popular and common DBMS is the relational DBMS (RDBMS)
  A standard language and user interface in the RDBMS is the Structured Query Language (SQL)
    A programming language used to create, modify, and retrieve information from a database
    Different databases use different (proprietary) variations of SQL

  RDBMS are still best for most business needs
    Oracle: Oracle Database and MySQL
    IBM: DB2 and Informix
    Microsoft: SQL Server
    SAP: Sybase Enterprise and Sybase IQ
    Teradata
  https://www.wired.com/insights/2013/09/thefuture-of-enterprise-data-rdbms-will-be-there/

  Data are organized as a set of formal tables
  Data can be accessed and combined in different ways without altering the data within the tables
  RDBMS can be easily extended / scaled: new data and new categories of data can be added without changing existing data

RDBMS Terminology
  Data model
    A picture of logical structures that details the relationships among data elements
  Data dictionary
    Compiles all of the metadata about the elements in the data model
  Metadata
    Formal description of data structures (like tables and fields) and any constraints on the table or on values within the table
    Data about the containers of data
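Metadata, "data about the containers of data," can be inspected directly in most RDBMS. The sketch below uses Python's built-in sqlite3 module as a stand-in; the Student table and its columns are hypothetical, and the PRAGMA table_info statement is specific to SQLite (other products expose their data dictionary in other ways).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Student ("
    "student_id INTEGER PRIMARY KEY, name TEXT NOT NULL, major TEXT)"
)

# SQLite-specific: each returned row describes one column of the table as
# (position, column name, declared type, NOT NULL flag, default value, primary-key flag).
columns = conn.execute("PRAGMA table_info(Student)").fetchall()
for position, name, col_type, notnull, default, pk in columns:
    print(name, col_type, "NOT NULL" if notnull else "", "PRIMARY KEY" if pk else "")
```

Nothing printed here is a student record; it is all description of the table itself, which is what a data dictionary compiles.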
Entity Sets (Tables)
  Relational table or entity set
    Each table consists of columns (fields/attributes) and rows (records/entities)
    The table has a name that describes the group of related entities within the table
    For example, a table labeled Student would contain a group of student entities

Entity / Record / Row
  A person, place, thing, transaction, or event about which data are being collected and stored
  The individual rows in a table contain entities
  Each row is also referred to as a record
  Example?

Attributes / Field / Column
  The data elements that describe the characteristics of a specific entity
  The columns in each table contain the attributes
  Example?

What is a Relationship?
  When designing a relational DB, data are grouped into tables
  Each table contains all related data elements
  For example, we would store data related to the customer (name, address, phone, etc.) and data related to the customer's particular order (order ID, date, shipping method, etc.) in different tables (Customer and Order)

  All information specific to a customer would go into a Customer table
  All information specific to the orders would go into an Order table
  We would then create a relationship between the tables that allows us to match a particular customer with a particular order

  A relationship in an RDBMS is an association between the entities within the different tables
  There are THREE (3) types of relationships:
    One-to-One (1:1)
    One-to-Many (1:M)
    Many-to-Many (M:M)
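The Customer and Order example above can be sketched as two tables with a one-to-many (1:M) relationship: one customer, many orders. This is only an illustration using Python's built-in sqlite3 module; the column names and sample rows are assumptions, not part of the original slides. The shared customer_id column is what lets the DBMS match a customer with that customer's orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT,
        phone       TEXT
    );
    -- ORDER is a reserved word in SQL, so the table name is quoted.
    CREATE TABLE "Order" (
        order_id        INTEGER PRIMARY KEY,
        order_date      TEXT,
        shipping_method TEXT,
        customer_id     INTEGER REFERENCES Customer(customer_id)
    );
    INSERT INTO Customer VALUES (1, 'Ann Lee', '802-555-0101');
    INSERT INTO Customer VALUES (2, 'Bob Roy', '802-555-0102');
    INSERT INTO "Order" VALUES (100, '2024-01-05', 'Ground', 1);
    INSERT INTO "Order" VALUES (101, '2024-01-09', 'Air', 1);
    INSERT INTO "Order" VALUES (102, '2024-01-10', 'Ground', 2);
""")

# Match each order with the customer who placed it.
rows = conn.execute("""
    SELECT c.name, o.order_id
    FROM Customer AS c
    JOIN "Order" AS o ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()
print(rows)
```

Customer details are stored once, however many orders that customer places, and the join recombines the two tables only when a query needs both.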
Creating Relationships Through Keys
  KEYS are used to create relationships between the entities in different tables in the RDB
  Primary key
    A field (or group of fields) that uniquely identifies a given entity in a table
  Foreign key
    A primary key of one table that appears as an attribute in another table and acts to provide a logical relationship between the two tables

  For our purposes:
    Every table in an RDBMS MUST have a primary key
    The foreign key is not required in every table and will only appear on the many side of the relationship

Advantages of RDBMS
  RDBMS advantages from a business perspective include
  1) Flexibility
  2) Scalability and performance
  3) Improved information integrity (quality)
     Reduced information redundancy
  4) Information security

1) Flexibility
  Handle changes quickly and easily
  Provide users with different views of the data
    Arranging data items in different ways depending on the specific user need
    Showing a particular user only some of the available fields while not showing them other fields

1) Flexibility: Schema
  Different database schema can be owned by or associated with different users
  The schema is a user-personalized set of tables, views, and indexes

2) Scalability and Performance
  A DBMS must expand to meet increased demand while maintaining acceptable performance levels
  Scalability
    Refers to how well a system can adapt to increased demands
  Performance
    Measures how quickly a system performs a certain process or transaction
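The key rules and the idea of user views described above can be sketched together. The example below is an illustration only, using Python's built-in sqlite3 module; the Department and Employee tables, the EmployeeDirectory view, and the helper try_insert are hypothetical names. Note that SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, which is a SQLite-specific detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: turn on foreign key enforcement
conn.executescript("""
    CREATE TABLE Department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT
    );
    CREATE TABLE Employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT,
        salary  INTEGER,
        dept_id INTEGER REFERENCES Department(dept_id)  -- foreign key on the "many" side
    );
    INSERT INTO Department VALUES (10, 'Sales');
    INSERT INTO Employee VALUES (1, 'Ann Lee', 50000, 10);
    -- A view that shows some fields and hides the salary field.
    CREATE VIEW EmployeeDirectory AS
        SELECT e.name AS employee, d.name AS department
        FROM Employee AS e JOIN Department AS d ON d.dept_id = e.dept_id;
""")


def try_insert(sql):
    """Run one INSERT and report whether the DBMS accepted or rejected it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError as error:
        return f"rejected: {error}"


# Same primary key as an existing employee: violates uniqueness.
duplicate_key = try_insert("INSERT INTO Employee VALUES (1, 'Bob Roy', 40000, 10)")
# Department 99 does not exist: the foreign key has nothing to point to.
missing_parent = try_insert("INSERT INTO Employee VALUES (2, 'Bob Roy', 40000, 99)")
# New primary key and an existing department: allowed.
valid_row = try_insert("INSERT INTO Employee VALUES (2, 'Bob Roy', 40000, 10)")

directory = conn.execute("SELECT * FROM EmployeeDirectory ORDER BY employee").fetchall()
print(duplicate_key)
print(missing_parent)
print(valid_row)
print(directory)
```

The view returns names and departments but no salaries, which is one way a DBMS can show a particular user only some of the available fields.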
3) Information Integrity
  Information integrity: a measure of information quality
    Know that data have not been entered incorrectly or altered in an unauthorized manner
  Integrity constraints: rules that help ensure the quality of information
    We will discuss entity integrity and referential integrity (there is also domain integrity)

3) Information Integrity: Controlling Redundancy
  Redundant data are OK if they serve a specific purpose, such as being used as a backup directly linked to the source
    Backup systems promote fault tolerance
  Unintentional redundancy is not good
    Wasted storage
    Difficult to modify
    Possible inconsistencies

4) Information Security
  Information is an organizational asset and must be protected
  RDBMS offer several security features
  Access level
    Determines the level of access each individual user has
    Who can access the DBMS
  Access control
    Determines the types of things each group can do
    Types of access, such as power to create, modify, delete, and/or read
    Which types of SQL statements can be executed

Multiuser Issues
  DBMS serve many different users with different needs
  Many users may require concurrent access to the same data
  Must preserve the integrity of the data and the performance of the system

  Enterprise DBMS
  Problem: if multiple users (say tens or even hundreds of users) access the same data concurrently, how does the DBMS allow one user to change data without that change being immediately overwritten by another user's change?
  This is typically referred to as the lost-update problem
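The lost-update problem can be illustrated without a database at all. The sketch below is a deliberately simplified simulation in plain Python: the inventory record and the two "users" are hypothetical, and the steps run one after another rather than truly concurrently. What matters is the order of reads and writes.

```python
# A shared record that two users update without any coordination.
inventory = {"widgets_in_stock": 10}

# Both users read the same starting value.
user_a_read = inventory["widgets_in_stock"]
user_b_read = inventory["widgets_in_stock"]

# User A sells 3 widgets and writes the result back.
inventory["widgets_in_stock"] = user_a_read - 3

# User B sells 2 widgets, but computes from the stale value read earlier,
# so User A's change is silently overwritten.
inventory["widgets_in_stock"] = user_b_read - 2

expected = 10 - 3 - 2
actual = inventory["widgets_in_stock"]
print("expected:", expected, "actual:", actual)
```

Five widgets were sold, yet the record shows only User B's two: User A's update has been lost.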
Multiuser Issues
  Concurrent access is managed through the use of transactions and locks
  Transaction: a single indivisible action that affects some data
    Once a transaction is committed, it is permanent and its changes are visible to all users
    If a transaction is not committed, its changes are rolled back (reversed)
  Lock: literally locks the data so that no other changes can be made to the data while a transaction is in process

Summary
  Five characteristics of quality information
  Define database, DBMS, RDBMS, and supporting components and terminology
  Advantages of RDBMS
  What is SQL?
  Describe the lost-update problem and how it is addressed
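The commit and rollback behavior described under Multiuser Issues can be sketched with Python's built-in sqlite3 module. The Account table, its balances, and the transfer function are hypothetical. The sketch relies on the sqlite3 connection's context manager (with conn:), which commits when the block succeeds and rolls back when it raises. Locking is handled internally by the DBMS and is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO Account VALUES (?, ?)", [("A", 100), ("B", 50)])
conn.commit()


def transfer(amount):
    """Move money from A to B as one indivisible transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute("UPDATE Account SET balance = balance + ? WHERE name = 'B'", (amount,))
            conn.execute("UPDATE Account SET balance = balance - ? WHERE name = 'A'", (amount,))
        return "committed"
    except sqlite3.IntegrityError:
        return "rolled back"


def balances():
    """Return the current balances as a dictionary keyed by account name."""
    return dict(conn.execute("SELECT name, balance FROM Account ORDER BY name"))


first = transfer(30)
after_first = balances()
second = transfer(500)  # would make A's balance negative, so the CHECK constraint fails
after_second = balances()

print(first, after_first)
print(second, after_second)
```

In the failed transfer, the credit to B had already been applied when the debit to A was rejected. Rolling back the transaction reverses that credit too, so the accounts never show half of a transfer.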