BIG DATA
Introduction

Big Data - Some Words
- Connectivity
- Social media
- Share information
- Interactivity
- People
- Business
- Data
- Data mining
- Text mining
- Business Intelligence
What is Big Data
Big Data means different things to people with different backgrounds and interests.
Traditionally, "Big Data" = massive volumes of data. For example: CERN data volumes, NASA, Google, ...
Where does Big Data come from? All over! Web logs, RFID, GPS systems, sensor networks, social networks, text documents on the Internet, Internet search, index cards, call detail records, astronomy, atmospheric science, biology, genomics, nuclear physics, biochemical experiments, records, medical research, military surveillance, archives, multimedia, etc.

What is Big Data
Records of every step of modern life on social networks, and the sharing of information between people and businesses, have changed the general culture of humanity and created an environment conducive to a wave of innovation like never before.
The recording of all activities, data, and behaviours is creating a new way for people and companies to interact.
The amount of data that YOU generate is amazing and rich in information.
The move from analog to digital ushered in the digital age, an era in which people equipped with smartphones and sensors began uploading troves of searchable digital content.
While data used to stack up in a fairly linear fashion, digital content is now created by consumers and is multiplying at rates previously unheard of.
The volume of data generated doubles every 2 years; soon it will double every 18 months.
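To make the doubling claim concrete, here is a minimal sketch of what a fixed doubling time implies over a few years. The function name `growth_factor` and the six-year horizon are illustrative assumptions, not figures from the slides.

```python
def growth_factor(years, doubling_time_years):
    """Factor by which a quantity grows if it doubles every doubling_time_years."""
    return 2 ** (years / doubling_time_years)


# Compare the two doubling times mentioned above over the same six-year span.
factor_doubling_24_months = growth_factor(6, 2.0)  # three doublings
factor_doubling_18_months = growth_factor(6, 1.5)  # four doublings

print(factor_doubling_24_months)  # 8.0
print(factor_doubling_18_months)  # 16.0
```

Shortening the doubling time from 24 to 18 months adds one extra doubling over six years, so the same starting volume ends up twice as large.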
[Figure: Big Data - Digital Data Volume]
[Figure: Big Data - Bytes Chart]
Big Data - Bytes Size
- 0.5 ZB: all Internet data up to 2009
- 1 ZB: about 75 billion 16 GB iPad Airs which, if stacked, would reach 1.5 times the distance between the Earth and the Moon
- 42 ZB: all words ever spoken by humanity throughout history, if they could be digitized

Big Data - Data Creation
Data creation is not slowing down:
- Large Hadron Collider (the world's largest and most powerful particle accelerator): 1 PB/sec
- Boeing jet: 20 TB/hr
- Facebook: 500 TB/day
- YouTube: 1 TB every 4 min
- The proposed Square Kilometer Array telescope (the world's biggest proposed telescope): 1 EB/day
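The data creation rates above use different units and time bases, which makes them hard to compare at a glance. The sketch below converts them to a common TB/day figure. It assumes decimal (SI) prefixes and, purely for comparison, that every source produces data continuously for 24 hours, which is an assumption and not something the slides state.

```python
# Decimal (SI) prefixes, expressed in terabytes.
TB_PER_PB = 1_000
TB_PER_EB = 1_000_000

SECONDS_PER_DAY = 24 * 60 * 60
HOURS_PER_DAY = 24
MINUTES_PER_DAY = 24 * 60

# Assumption: each source runs non-stop for a full day.
rates_tb_per_day = {
    "Hadron Collider (1 PB/sec)": 1 * TB_PER_PB * SECONDS_PER_DAY,
    "Boeing jet (20 TB/hr)": 20 * HOURS_PER_DAY,
    "Facebook (500 TB/day)": 500,
    "YouTube (1 TB/4 min)": MINUTES_PER_DAY // 4,
    "Square Kilometer Array (1 EB/day)": 1 * TB_PER_EB,
}

# Print the sources from slowest to fastest.
for source, tb in sorted(rates_tb_per_day.items(), key=lambda kv: kv[1]):
    print(f"{source:35s} {tb:>12,} TB/day")
```

Under these assumptions the collider figure corresponds to 86.4 EB/day, several orders of magnitude above the other sources.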
Big Data - Numbers
Facebook
- Worldwide, there were over 2.01 billion monthly active Facebook users in June 2017, a 17 percent increase year over year.
- There were 1.15 billion mobile daily active users in December 2016, an increase of 23 percent year over year.
- On average, the Like and Share buttons are viewed across almost 10 million websites daily.
- Five new profiles are created every second.
- There are 83 million fake profiles.
- Photo uploads total 300 million per day.
- Every 60 seconds on Facebook: 510,000 comments are posted, 293,000 statuses are updated, and 136,000 photos are uploaded.
- One in five page views in the United States occurs on Facebook.
- 16 million local business pages had been created as of May 2013, a 100 percent increase from 8 million in June 2012.
https://zephoria.com/top-15-valuable-facebook-statistics/

Big Data - Numbers
Google estimates that every two days about 5 exabytes of information are generated; this is as much as humanity generated throughout its history up to 2003.
Twitter
- Total number of monthly active Twitter users: 328 million
- Total number of tweets sent per day: 500 million
Walmart
- The world's biggest retailer, with over 20,000 stores in 28 countries, is in the process of building the world's biggest private cloud, to process 2.5 petabytes of data every hour.
Big Data - Numbers
Emails
- The estimated number of email users worldwide is 3.7 billion, and the number of emails sent per day (in 2017) is around 269 billion.
- First email system: 1971
- The average office worker receives 121 emails a day.
- Percentage of email that is spam: 49.7%

Big Data - Definition
There are several definitions of Big Data from leading authors in the market.
The McKinsey Global Institute defines Big Data as "the intense use of online social networks, mobile devices for Internet connection, transactions and digital content, as well as the increasing use of cloud computing, which has generated untold amounts of data. The term 'Big Data' refers to this data set whose growth is exponential and whose dimension is beyond the capabilities of the typical tools to capture, manage and analyze data."
Gartner defines Big Data as "the term adopted by the market to describe problems in managing and processing extreme information that exceed the capacity of traditional information technologies over one or several dimensions. Big Data is focused primarily on extremely large dataset volume issues generated from technological practices such as social media, operating technologies, Internet access, and distributed information sources. Big Data is essentially a practice that introduces new business opportunities."
Big Data - 3 V's or more
Big Data is characterized by the three V's:
- Volume
- Variety
- Velocity
Besides these dimensions, there are other V's used by some influential authors:
- Veracity (IBM)
- Variability (SAS)
- Value

Big Data - Volume
Volume is the most common trait of Big Data: it represents the increase in the amount of data we have.
Many factors have contributed to the exponential increase in data volume, such as transaction-based data stored through the years, text data constantly streaming from social media, increasing amounts of sensor data being collected, automatically generated RFID and GPS data, and so on.
In the past, excessive data volume created storage issues, both technical and financial. Today, advanced technologies coupled with decreasing storage costs have eased those issues.
[Figure: Big Data - Volume]

Big Data - Variety
Data today comes in all types of formats: databases, XML files, text files, images, videos, sensor captures, emails, ...
85% of all organizations' data is in some sort of unstructured or semi-structured format (a format that is not suitable for traditional database schemas).
[Figure: Big Data - Variety]

Big Data - Velocity
Velocity means how fast data is being produced and how fast the data must be processed.
Reacting quickly enough to deal with velocity is a challenge for most organizations, especially in time-sensitive environments.
[Figure: Big Data - Velocity]

Other V's
- Veracity: conformity to facts, that is, accuracy, quality, truthfulness, or trustworthiness.
- Variability: inconsistency of the data flow, linked to events or periodic peaks.
- Value: by analyzing large and feature-rich data, organizations can gain greater business value. Big data means big analytics. Big analytics means greater insight and better decisions, something that every organization needs.
[Figure: Big Data - Veracity]
[Figure: Big Data - Value]
The worst place to park in New York City using big data
https://www.youtube.com/watch?v=lz_kidxbzga

Structured Data
The term structured data generally refers to data that has a defined length and format.
Examples of structured data include numbers, dates, and groups of words and numbers called strings (for example, a customer's name, address, and so on).
Structured data is the data that you're probably used to dealing with. It's usually stored in a database. You can query it using a language like Structured Query Language (SQL).
Traditional sources include Customer Relationship Management (CRM) data, operational Enterprise Resource Planning (ERP) data, and financial data.
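As a small illustration of structured data being stored in a database and queried with SQL, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` table, its columns, and the sample rows are hypothetical and only serve the example.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: every column has a defined type, so each row has a fixed shape.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "city TEXT NOT NULL, signup_date TEXT NOT NULL)"
)

# Made-up sample rows.
rows = [
    (1, "Ana Silva", "Lisbon", "2017-01-15"),
    (2, "John Smith", "New York", "2017-03-02"),
    (3, "Maria Costa", "Lisbon", "2017-06-20"),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", rows)

# Because the format is known in advance, a declarative query can filter and sort it.
lisbon_customers = conn.execute(
    "SELECT name, signup_date FROM customers WHERE city = ? ORDER BY signup_date",
    ("Lisbon",),
).fetchall()

print(lisbon_customers)
conn.close()
```

The query only works because each record follows the same predefined format, which is the property that distinguishes structured data.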
Sources of structured data
Computer- or machine-generated: Machine-generated data generally refers to data that is created by a machine without human intervention.
Human-generated: This is data that humans, in interaction with computers, supply.
- Sensor data: Examples include radio frequency ID (RFID) tags, smart meters, medical devices, and Global Positioning System (GPS) data. For example, RFID is rapidly becoming a popular technology. It uses tiny computer chips to track items at a distance.
- Web log data: When servers, applications, networks, and so on operate, they capture all kinds of data about their activity. This can amount to huge volumes of data that can be useful, for example, to deal with service-level agreements or to predict security breaches.
- Point-of-sale data: When the cashier swipes the bar code of any product that you are purchasing, all the data associated with that product is generated. Just think of all the products across all the people who purchase them, and you can understand how big this data set can be.

Sources of structured data
- Financial data: Lots of financial systems are now programmatic; they are operated based on predefined rules that automate processes. Stock-trading data is a good example of this. It contains structured data such as the company symbol and dollar value. Some of this data is machine generated, and some is human generated.
- Input data: This is any piece of data that a human might input into a computer, such as name, age, income, non-free-form survey responses, and so on. This data can be useful to understand basic customer behavior.
- Click-stream data: Data is generated every time you click a link on a website. This data can be analyzed to determine customer behavior and buying patterns.
- Gaming-related data: Every move you make in a game can be recorded. This can be useful in understanding how end users move through a gaming portfolio.
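To show how web log data can be treated as structured records, here is a minimal sketch that parses lines in the widely used Common Log Format with a regular expression and counts responses by status code. The sample log lines, the pattern, and the helper name `parse_line` are assumptions made for illustration; real server logs may use a different layout.

```python
import re
from collections import Counter

# Invented sample lines in Common Log Format.
LOG_LINES = [
    '203.0.113.5 - - [10/Oct/2017:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '203.0.113.9 - - [10/Oct/2017:13:55:40 +0000] "GET /cart HTTP/1.1" 500 512',
    '203.0.113.5 - - [10/Oct/2017:13:56:01 +0000] "POST /checkout HTTP/1.1" 200 1024',
]

# One named group per field of interest.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


records = [r for r in map(parse_line, LOG_LINES) if r is not None]

# Simple aggregates of the kind that could feed service-level reporting.
status_counts = Counter(r["status"] for r in records)
server_errors = [r["path"] for r in records if r["status"] >= 500]

print(status_counts)
print(server_errors)
```

Once each line is a record with typed fields, counting errors per path or per host becomes a simple aggregation.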
Unstructured Data
Unstructured data is data that does not follow a specified format.
If 15% of the data available to enterprises is structured data, the other 85% is unstructured.
Unstructured data is really most of the data that you will encounter. Until recently, however, the technology didn't really support doing much with it except storing it or analyzing it manually.
Unstructured data is everywhere. In fact, most individuals and organizations conduct their lives around unstructured data.
Just as with structured data, unstructured data is either machine generated or human generated.

Sources of unstructured data
- Satellite images: This includes weather data or the data that the government captures in its satellite surveillance imagery. Just think about Google Earth, and you get the picture (pun intended).
- Scientific data: This includes seismic imagery, atmospheric data, and high-energy physics.
- Photographs and video: This includes security, surveillance, and traffic video.
- Radar or sonar data: This includes vehicular, meteorological, and oceanographic seismic profiles.
- Text internal to your company: Think of all the text within documents, logs, survey results, and e-mails. Enterprise information actually represents a large percentage of the text information in the world today.
- Social media data: This data is generated from social media platforms such as YouTube, Facebook, Twitter, LinkedIn, and Flickr.
- Mobile data: This includes data such as text messages and location information.
- Website content: This comes from any site delivering unstructured content, like YouTube, Flickr, or Instagram.
[Figure: Structured vs Unstructured Data]

Blockchain
https://www.youtube.com/watch?v=pl8olkkwrpc