Big Data The New Era of Data


Ruchita H. Bajaj, Prof. P. L. Ramteke
CS-IT Department, Amravati University, India

Abstract- Every day, we create 2.5 quintillion bytes of data - so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. This data is Big Data. As data scientists, we live in interesting times. Data has been the No. 1 fastest-growing phenomenon on the Internet for the last decade. Big Data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process the data within a tolerable elapsed time. Big Data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.

Keywords- Big Data, petabytes, capture, curate, terabytes.

I. INTRODUCTION

As time goes by, the amount of data increases. But what does it really include? You can say that it includes your internet activities, such as social media posts and photos, but this type of data alone is not enough to create Big Data. What device did you use when you sent a tweet? When did you send it? Where were you? What was the operating system of your device? Which version was it? What other apps were installed? Did you save it as a draft and send it later, or did you send it immediately? All this information, which we call unnecessary, is recorded and stored somewhere, and together it creates Big Data.

Fig 1: We are living in the world of DATA

First of all, it is about collecting information. Whether it is big or not, information has always been the most valuable thing in human history. Companies can use Big Data to analyse consumer behaviour. Scientists can use it to discover new facts. Governments can use it for - well, you can guess. Some people say Big Data is another representation of Big Brother. Who holds this information and how can it be analysed? These are technical questions. I want to talk about why it is related to education. Today's educational technology is based on the internet. Without an internet connection, most of the things we talk about under #edtech become useless. This situation leads us to the fact that students' school life will be recorded in Big Data. Let's include their after-school time by taking their personal devices into consideration, and now you have a full profile of each and every student in Big Data. Here comes a cliché: when students grow up they will be "in the system". Actually, the same thing is true for us too, but fortunately we were not born into this technology. It is not that Big Data is pure evil. If all data about your health could be recorded and analysed, wouldn't it be useful when you try to recover from an illness?

Fig 2: A Decade of Digital Universe Growth, Storage in Exabytes

Big Data is defined as a large amount of data which requires new technologies and architectures to make it possible to extract value from it through capture and analysis. Big Data has emerged because we are living in a society which makes increasing use of data-intensive technologies. Due to the large size of the data, it becomes very difficult to perform effective analysis using existing traditional techniques.
Since Big Data is a recent and emerging technology that can bring huge benefits to business organizations, it is necessary to understand the various challenges and issues involved in adopting it. The Big Data concept refers to datasets that continue to grow so much that they become difficult to manage using existing database management concepts and tools. Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis [1] and visualization.

Big Data is certainly not a measurement, but we should understand how much data is considered "Big". In the table below, the terabyte range is taken as the starting point of what is referred to as Big Data. [2]

TABLE I: DEFINITIONS AND ESTIMATIONS

Definitions:
- Gigabyte: 1024 megabytes
- Terabyte: 1024 gigabytes
- Petabyte: 1024 terabytes
- Exabyte: 1024 petabytes

Estimations:
- 4.7 Gigabytes: a single DVD
- 1 Terabyte: about two years' worth of non-stop MP3s (assuming one megabyte per minute of music)
- 10 Terabytes: the printed collection of the U.S. Library of Congress
- 1 Petabyte: the amount of data stored on a stack of CDs about 2 miles high, or 13 years of HD-TV video
- 20 Petabytes: the storage capacity of all hard disk drives created in ...
- 1 Exabyte: one billion gigabytes
- 5 Exabytes: all words ever spoken by mankind

II. ARCHITECTURE

In this section, we take a closer look at the overall architecture for Big Data.

A. Traditional Information Architecture Capabilities

To understand the high-level architecture aspects of Big Data, let's first review a well-formed logical information architecture for structured data. In the illustration, you see two data sources that use integration (ELT/ETL/Change Data Capture) techniques to transfer data into a DBMS data warehouse or operational data store, which then offers a wide variety of analytical capabilities to reveal the data. Some of these analytic capabilities include: dashboards, reporting, EPM/BI applications, summary and statistical query, semantic interpretations for textual data, and visualization tools for high-density data. In addition, some organizations have applied oversight and standardization across projects, and perhaps have matured the information architecture capability by managing it at the enterprise level.

Fig 3: Traditional Information Architecture Capabilities

The key information architecture principles include treating data as an asset through a value, cost, and risk lens, and ensuring the timeliness, quality, and accuracy of data. The EA oversight responsibility is to establish and maintain a balanced governance approach, including using a center of excellence for standards management and training.

B. Adding Big Data Capabilities

The defining processing capabilities of a big data architecture are to meet the volume, velocity, variety, and value requirements. Unique distributed (multi-node) parallel processing architectures have been created to parse these large data sets. There are differing technology strategies for real-time and batch processing requirements. For real-time processing, key-value data stores such as NoSQL allow for high-performance, index-based retrieval. For batch processing, a technique known as MapReduce filters data according to a specific data discovery strategy. After the filtered data is discovered, it can be analyzed directly, loaded into other unstructured databases, sent to mobile devices, or merged into the traditional data warehousing environment and correlated to structured data.

Fig 4: Big Data Information Architecture Capabilities (from an Oracle white paper)

In addition to new unstructured data realms, there are two key differences for big data. First, due to the size of the data sets, we don't move the raw data directly to a data warehouse. However, after MapReduce processing we may integrate the reduction result into the data warehouse environment so that we can leverage conventional BI reporting, statistical, semantic, and correlation capabilities.
It is ideal to have analytic capabilities that combine a conventional BI platform with big data visualization and query capabilities. Second, to facilitate analysis in the Hadoop environment, sandbox environments can be created. For many use cases, big data needs to capture data that is continuously changing and unpredictable, and to analyze that data a new architecture is needed. In retail, a good example is capturing real-time foot traffic with the intent of delivering in-store promotions. To track the effectiveness of floor displays and promotions, customer movement and behavior must be interactively explored with visualization or query tools. In other use cases, the analysis cannot be complete until you correlate it with other enterprise data - structured data. In the example of consumer sentiment analysis, capturing a positive or negative social media comment has some value, but associating it with your most or least profitable customer makes it far more valuable. So the needed capability with Big Data BI is context and understanding. Using powerful statistical and semantic tools allows you to find the needle in the haystack and helps you predict the future. In summary, the Big Data architecture challenge is to meet the rapid-use and rapid-interpretation requirements while at the same time correlating the data with other enterprise data.
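To make the batch path just described concrete, below is a minimal sketch of the MapReduce idea in plain Python; no Hadoop cluster is assumed, and the social-media records and keyword filter are hypothetical stand-ins for a real data discovery strategy. Only the small reduction result produced at the end is what would be integrated into the data warehouse.

from collections import defaultdict

# Hypothetical raw records: (user_id, free-text post) pairs standing in
# for an unstructured social-media feed.
RAW_POSTS = [
    ("u1", "love the new store layout"),
    ("u2", "checkout queue was terrible today"),
    ("u1", "great in-store promotion on coffee"),
]

def map_phase(records, keywords):
    # Map step: emit (keyword, 1) for every keyword occurrence in a post.
    for user_id, text in records:
        for word in text.lower().split():
            if word in keywords:
                yield word, 1

def reduce_phase(pairs):
    # Reduce step: sum the counts emitted for each keyword.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

if __name__ == "__main__":
    keywords = {"store", "promotion", "queue"}
    reduction_result = reduce_phase(map_phase(RAW_POSTS, keywords))
    # This small reduction result is what moves into the warehouse for
    # conventional BI reporting; the raw posts never do.
    print(reduction_result)

In a real deployment a framework such as Hadoop would distribute the same map and reduce functions across many nodes; only the reduced output would be merged into the warehouse and correlated with structured data.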

What is important is that the key information architecture principles are the same, but the tactics for applying these principles differ. For example, how do we look at big data as an asset? We all agree there is value hiding within massive high-density data sets. But how do we evaluate one set of big data against another? How do we prioritize? The key is to think in terms of the end goal. Focus on the business values and understand how critical they are in supporting business decisions, as well as the potential risks of not knowing the hidden patterns. Another example of applying architecture principles differently is data governance. The quality and accuracy requirements of big data can vary tremendously. Using strict data precision rules on user sentiment data might filter out too much useful information, whereas data standards and common definitions are still going to be critical for fraud detection scenarios. To reiterate, it is important to leverage your core information architecture principles and practices, but apply them in a way that is relevant to big data. In addition, the EA responsibility remains the same for big data: to optimize success, centralize training, and establish standards.

C. An Integrated Information Architecture

One of the obstacles observed in enterprise Hadoop adoption is the lack of integration with the existing BI ecosystem. At present, the traditional BI and big data ecosystems are separate, causing integrated data analysis headaches. As a result, they are not ready for use by the typical business user or executive. Early adopters of big data have often written custom code to move the processed results of big data back into a database, or developed custom solutions to report and analyze on them. These options might not be feasible or economical for enterprise IT. First of all, this creates a proliferation of one-off code and differing standards. Architecture impacts IT economics: Big Data done independently runs the risk of redundant investments. In addition, most businesses simply do not have the staff and skill level for such custom development work. A better option is to incorporate the Big Data results into the existing data warehousing platform. The power of information lies in our ability to make associations and correlations. What we need is the ability to bring different data sources and processing requirements together for timely and valuable analysis. Below is Oracle's holistic capability map that bridges the traditional information architecture and the big data architecture: as various data are captured, they can be stored and processed in a traditional DBMS, simple files, or distributed-clustered systems such as NoSQL and the Hadoop Distributed File System (HDFS). Architecturally, the critical component that breaks down the divide is the integration layer in the middle.

Fig 5: Oracle Integrated Information Architecture Capabilities

This integration layer needs to extend across all of the data types and domains, and bridge the gap between the traditional and new data acquisition and processing frameworks. The data integration capability needs to cover the entire spectrum of velocity and frequency. It needs to handle extreme and ever-growing volume requirements. And it needs to bridge the variety of data structures.
You need to look for technologies that allow you to integrate Hadoop/MapReduce with your data warehouse and transactional data stores in a bi-directional manner. The next layer is where you load the reduction results from Big Data processing into your data warehouse for further analysis. You also need the ability to access your structured data, such as customer profile information, while you process your big data to look for patterns such as fraudulent activity. The Big Data processing output will be loaded into the traditional ODS, data warehouse, and data marts for further analysis, just like the transaction data. The additional component in this layer is the Complex Event Processing engine, used to analyse stream data in real time. The Business Intelligence layer will be equipped with advanced analytics, in-database statistical analysis, and advanced visualization, on top of the traditional components such as reports, dashboards, and queries. Governance, security, and operational management also cover the entire spectrum of the data and information landscape at the enterprise level. With this architecture, the business users do not see a divide. They don't even need to be made aware that there is a difference between traditional transaction data and Big Data. The data and analysis flow is seamless as they navigate through various data and information sets, test hypotheses, analyse patterns, and make informed decisions. [3]
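As a minimal illustration of that integration layer, the following sketch loads a hypothetical Big Data reduction result into a relational store (SQLite is used here purely for portability) and correlates it with structured customer-profile data, mirroring the bi-directional bridge described above. The table names, columns, and figures are invented for the example.

import sqlite3

# Hypothetical reduction result from big data processing:
# per-customer counts of negative social-media mentions.
REDUCTION_RESULT = [("c1", 4), ("c2", 0), ("c3", 7)]

conn = sqlite3.connect(":memory:")

# Structured data already in the warehouse: customer profiles.
conn.execute("CREATE TABLE customer_profile (customer_id TEXT, annual_profit REAL)")
conn.executemany("INSERT INTO customer_profile VALUES (?, ?)",
                 [("c1", 12000.0), ("c2", 300.0), ("c3", 45000.0)])

# Integration step: land the big data output next to the warehouse data.
conn.execute("CREATE TABLE sentiment_summary (customer_id TEXT, negative_mentions INTEGER)")
conn.executemany("INSERT INTO sentiment_summary VALUES (?, ?)", REDUCTION_RESULT)

# Correlate unstructured-data output with structured profile data:
# which of the most profitable customers attract the most negative comments?
rows = conn.execute("""
    SELECT p.customer_id, p.annual_profit, s.negative_mentions
    FROM customer_profile AS p
    JOIN sentiment_summary AS s USING (customer_id)
    ORDER BY p.annual_profit DESC
""").fetchall()

for customer_id, profit, mentions in rows:
    print(customer_id, profit, mentions)

A Complex Event Processing engine or BI tool would sit on top of the same joined view; the point of the sketch is only that once the reduction result lands beside the structured data, ordinary SQL is enough to correlate the two.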

Fig 6: An example of Big Data Architecture [4]

III. BIG DATA ANALYSIS

The analysis of Big Data involves multiple distinct phases, as shown in the figure below, each of which introduces challenges. Many people unfortunately focus just on the analysis/modelling phase: while that phase is crucial, it is of little use without the other phases of the data analysis pipeline. Even in the analysis phase, which has received much attention, there are poorly understood complexities in the context of multi-tenanted clusters where several users' programs run concurrently. Many significant challenges extend beyond the analysis phase. For example, Big Data has to be managed in context, which may be noisy, heterogeneous and not include an upfront model. Doing so raises the need to track provenance and to handle uncertainty and error: topics that are crucial to success, and yet rarely mentioned in the same breath as Big Data. Similarly, the questions put to the data analysis pipeline will typically not all be laid out in advance. We may need to figure out good questions based on the data. Doing this will require smarter systems and also better support for user interaction with the analysis pipeline. In fact, we currently have a major bottleneck in the number of people empowered to ask questions of the data and analyze it [NYT2012]. We can drastically increase this number by supporting many levels of engagement with the data, not all requiring deep database expertise. Solutions to problems such as this will not come from incremental improvements to business as usual, such as industry may make on its own. Rather, they require us to fundamentally rethink how we manage data analysis. Fortunately, existing computational techniques can be applied, either as is or with some extensions, to at least some aspects of the Big Data problem. For example, relational databases rely on the notion of logical data independence: users can think about what they want to compute, while the system (with skilled engineers designing those systems) determines how to compute it efficiently. Similarly, the SQL standard and the relational data model provide a uniform, powerful language to express many query needs and, in principle, allow customers to choose between vendors, increasing competition. The challenge ahead of us is to combine these healthy features of prior systems as we devise novel solutions to the many new challenges of Big Data. In this paper, we consider each of the boxes in the figure above, and discuss both what has already been done and what challenges remain as we seek to exploit Big Data. We begin by considering the five stages in the pipeline, then move on to the five cross-cutting challenges, and end with a discussion of the architecture of the overall system that combines all these functions.

Fig 7: Phases in the Processing Pipeline

A. Data Acquisition and Recording

Big Data does not arise out of a vacuum: it is recorded from some data-generating source. For example, consider our ability to sense and observe the world around us, from the heart rate of an elderly citizen and the presence of toxins in the air we breathe, to the planned Square Kilometre Array telescope, which will produce up to 1 million terabytes of raw data per day. Similarly, scientific experiments and simulations can easily produce petabytes of data today. Much of this data is of no interest, and it can be filtered and compressed by orders of magnitude. One challenge is to define these filters in such a way that they do not discard useful information.
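A minimal sketch of such an on-the-fly filter appears below: readings are reduced as they stream in, and only values that deviate strongly from the running statistics are kept. The sensor stream and threshold are hypothetical, and the prose example that follows highlights exactly the risk such a filter runs: an apparent fault may in fact be an artifact that deserves attention.

def stream_filter(readings, threshold=3.0):
    # Keep only readings that deviate strongly from what has been seen so far.
    # The running mean and variance are updated incrementally (Welford's
    # method), so nothing is stored first and reduced afterward.
    count, mean, m2 = 0, 0.0, 0.0
    for value in readings:
        if count > 2:
            std = (m2 / (count - 1)) ** 0.5
            if std > 0 and abs(value - mean) > threshold * std:
                yield value  # pass the unusual reading downstream
        # Update the running statistics with the new reading.
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)

# Hypothetical traffic-sensor stream: mostly steady, with one large spike.
sensor_stream = [42, 41, 43, 42, 40, 44, 43, 120, 42, 41]
print(list(stream_filter(sensor_stream)))   # only the spike survives the reduction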
For example, suppose one sensor reading differs substantially from the rest: it is likely to be due to the sensor being faulty, but how can we be sure that it is not an artifact that deserves attention? In addition, the data collected by these sensors is most often spatially and temporally correlated (e.g., traffic sensors on the same road segment). We need research into the science of data reduction that can intelligently process this raw data to a size that its users can handle while not missing the needle in the haystack. Furthermore, we require on-line analysis techniques that can process such streaming data on the fly, since we cannot afford to store first and reduce afterward. The second big challenge is to automatically generate the right metadata to describe what data is recorded and how it is recorded and measured. For example, in scientific experiments, considerable detail regarding specific experimental conditions and procedures may be required to be able to interpret the results correctly, and it is important that such metadata be recorded with the observational data.

Metadata acquisition systems can minimize the human burden of recording metadata. Another important issue here is data provenance. Recording information about the data at its birth is not useful unless this information can be interpreted and carried along through the data analysis pipeline. For example, a processing error at one step can render subsequent analysis useless; with suitable provenance, we can easily identify all subsequent processing that depends on this step. Thus we need research both into generating suitable metadata and into data systems that carry the provenance of data and its metadata through data analysis pipelines.

B. Information Extraction and Cleaning

Frequently, the information collected will not be in a format ready for analysis. For example, consider the collection of electronic health records in a hospital, comprising transcribed dictations from several physicians, structured data from sensors and measurements (possibly with some associated uncertainty), and image data such as x-rays. We cannot leave the data in this form and still effectively analyze it. Rather, we require an information extraction process that pulls out the required information from the underlying sources and expresses it in a structured form suitable for analysis. Doing this correctly and completely is a continuing technical challenge. Note that this data also includes images and will in the future include video; such extraction is often highly application dependent (e.g., what you want to pull out of an MRI is very different from what you would pull out of a picture of the stars, or a surveillance photo). In addition, due to the ubiquity of surveillance cameras and the popularity of GPS-enabled mobile phones, cameras, and other portable devices, rich and high-fidelity location and trajectory (i.e., movement in space) data can also be extracted. We are used to thinking of Big Data as always telling us the truth, but this is actually far from reality. For example, patients may choose to hide risky behavior and caregivers may sometimes mis-diagnose a condition; patients may also inaccurately recall the name of a drug, or even that they ever took it, leading to missing information in (the history portion of) their medical record. Existing work on data cleaning assumes well-recognized constraints on valid data or well-understood error models; for many emerging Big Data domains these do not exist.

C. Data Integration, Aggregation, and Representation

Given the heterogeneity of the flood of data, it is not enough merely to record it and throw it into a repository. Consider, for example, data from a range of scientific experiments. If we just have a bunch of data sets in a repository, it is unlikely anyone will ever be able to find, let alone reuse, any of this data. With adequate metadata there is some hope, but even so, challenges will remain due to differences in experimental details and in data record structure. Data analysis is considerably more challenging than simply locating, identifying, understanding, and citing data. For effective large-scale analysis, all of this has to happen in a completely automated manner. This requires differences in data structure and semantics to be expressed in forms that are computer understandable, and then robotically resolvable. There is a strong body of work in data integration that can provide some of the answers.
However, considerable additional work is required to achieve automated error-free difference resolution. Even for simpler analyses that depend on only one data set, there remains an important question of suitable database design. Usually, there will be many alternative ways in which to store the same information. Certain designs will have advantages over others for certain purposes, and possibly drawbacks for other purposes. Witness, for instance, the tremendous variety in the structure of bioinformatics databases with information regarding substantially similar entities, such as genes. Database design is today an art, and is carefully executed in the enterprise context by highly paid professionals. We must enable other professionals, such as domain scientists, to create effective database designs, either by devising tools to assist them in the design process or by forgoing the design process completely and developing techniques so that databases can be used effectively in the absence of intelligent database design.

D. Query Processing, Data Modeling, and Analysis

Methods for querying and mining Big Data are fundamentally different from traditional statistical analysis on small samples. Big Data is often noisy, dynamic, heterogeneous, inter-related and untrustworthy. Nevertheless, even noisy Big Data can be more valuable than tiny samples, because general statistics obtained from frequent patterns and correlation analysis usually overpower individual fluctuations and often disclose more reliable hidden patterns and knowledge. Further, interconnected Big Data forms large heterogeneous information networks, with which information redundancy can be explored to compensate for missing data, to cross-check conflicting cases, to validate trustworthy relationships, to disclose inherent clusters, and to uncover hidden relationships and models. Mining requires integrated, cleaned, trustworthy, and efficiently accessible data; declarative query and mining interfaces; scalable mining algorithms; and big-data computing environments. At the same time, data mining itself can also be used to help improve the quality and trustworthiness of the data, understand its semantics, and provide intelligent querying functions. As noted previously, real-life medical records have errors, are heterogeneous, and are frequently distributed across multiple systems. The value of Big Data analysis in health care, to take just one example application domain, can only be realized if it can be applied robustly under these difficult conditions. On the flip side, knowledge developed from data can help in correcting errors and removing ambiguity. For example, a physician may write "DVT" as the diagnosis for a patient. This abbreviation is commonly used for both deep vein thrombosis and diverticulitis, two very different medical conditions. A knowledge base constructed from related data can use associated symptoms or medications to determine which of the two the physician meant.

Big Data is also enabling the next generation of interactive data analysis with real-time answers. In the future, queries over Big Data will be automatically generated for content creation on websites, to populate hot-lists or recommendations, and to provide an ad hoc analysis of the value of a data set, to decide whether to store it or to discard it. Scaling complex query processing techniques to terabytes while enabling interactive response times is a major open research problem today. A problem with current Big Data analysis is the lack of coordination between database systems, which host the data and provide SQL querying, and analytics packages that perform various forms of non-SQL processing, such as data mining and statistical analyses. Today's analysts are impeded by a tedious process of exporting data from the database, performing a non-SQL process, and bringing the data back. This is an obstacle to carrying over the interactive elegance of the first generation of SQL-driven OLAP systems into the data-mining style of analysis that is in increasing demand. A tight coupling between declarative query languages and the functions of such packages will benefit both the expressiveness and the performance of the analysis.

E. Interpretation

Having the ability to analyze Big Data is of limited value if users cannot understand the analysis. Ultimately, a decision-maker, provided with the result of an analysis, has to interpret these results. This interpretation cannot happen in a vacuum. Usually, it involves examining all the assumptions made and retracing the analysis. Furthermore, as we saw above, there are many possible sources of error: computer systems can have bugs, models almost always have assumptions, and results can be based on erroneous data. For all of these reasons, no responsible user will cede authority to the computer system. Rather, she will try to understand, and verify, the results produced by the computer. The computer system must make it easy for her to do so. This is particularly a challenge with Big Data due to its complexity. There are often crucial assumptions behind the data recorded. Analytical pipelines can involve multiple steps, again with assumptions built in. The recent mortgage-related shock to the financial system dramatically underscored the need for such decision-maker diligence: rather than accept the stated solvency of a financial institution at face value, a decision-maker has to examine critically the many assumptions at multiple stages of analysis. In short, it is rarely enough to provide just the results. Rather, one must provide supplementary information that explains how each result was derived, and based upon precisely what inputs. Such supplementary information is called the provenance of the (result) data. By studying how best to capture, store, and query provenance, in conjunction with techniques to capture adequate metadata, we can create an infrastructure to provide users with the ability both to interpret the analytical results obtained and to repeat the analysis with different assumptions, parameters, or data sets. Furthermore, with a few clicks the user should be able to drill down into each piece of data that she sees and understand its provenance, which is a key feature for understanding the data [5]. That is, users need to be able to see not just the results, but also understand why they are seeing those results.
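A minimal sketch of the kind of provenance capture argued for here: each pipeline step wraps its output with a record of the operation, parameters, and inputs that produced it, so a user can later drill down from a result to how it was derived. The step names, parameters, and data are illustrative only.

from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Provenance:
    # What produced a value: the step, its parameters, and its inputs.
    step: str
    params: dict
    inputs: List["TrackedValue"] = field(default_factory=list)

@dataclass
class TrackedValue:
    value: Any
    provenance: Provenance

def apply_step(step_name, func, inputs, **params):
    # Run a pipeline step and attach provenance to its result.
    raw = func([t.value for t in inputs], **params)
    return TrackedValue(raw, Provenance(step_name, params, list(inputs)))

def explain(tracked, depth=0):
    # Drill down: print how a result was derived, step by step.
    pad = "  " * depth
    print(pad + tracked.provenance.step + " params=" + str(tracked.provenance.params))
    for parent in tracked.provenance.inputs:
        explain(parent, depth + 1)

def drop_outliers(inputs, limit):
    # Single-input cleaning step: keep readings below the limit.
    return [x for x in inputs[0] if x < limit]

def mean(inputs):
    values = inputs[0]
    return sum(values) / len(values)

source = TrackedValue([42, 41, 120, 43], Provenance("ingest", {"sensor": "road-7"}))
cleaned = apply_step("drop_outliers", drop_outliers, [source], limit=100)
result = apply_step("mean", mean, [cleaned])
print(result.value)   # the analytical result
explain(result)       # ...and the derivation behind it, down to the raw ingest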
However, raw provenance, particularly regarding the phases in the analytics pipeline, is likely to be too technical for many users to grasp completely. One alternative is to enable the users to play with the steps in the analysis: make small changes to the pipeline, for example, or modify values for some parameters. The users can then view the results of these incremental changes. By these means, users can develop an intuitive feeling for the analysis and also verify that it performs as expected in corner cases. Accomplishing this requires the system to provide convenient facilities for the user to specify analyses.

IV. DIMENSIONS

The convergence of these four dimensions helps both to define and to distinguish big data:

A. Volume

The amount of data. Perhaps the characteristic most associated with big data, volume refers to the mass quantities of data that organizations are trying to harness to improve decision-making across the enterprise. Data volumes continue to increase at an unprecedented rate. However, what constitutes truly high volume varies by industry and even geography, and is often smaller than the petabytes and zettabytes frequently referenced. Just over half of respondents consider datasets between one terabyte and one petabyte to be big data, while another 30 percent simply didn't know how big "big" is for their organization. Still, all can agree that whatever is considered high volume today will be even higher tomorrow.

B. Variety

Different types of data and data sources. Variety is about managing the complexity of multiple data types, including structured, semi-structured and unstructured data. Organizations need to integrate and analyse data from a complex array of both traditional and non-traditional information sources, from within and outside the enterprise. With the explosion of sensors, smart devices and social collaboration technologies, data is being generated in countless forms, including text, web data, tweets, sensor data, audio, video, click streams, log files and more.

C. Velocity

Data in motion. The speed at which data is created, processed and analyzed continues to accelerate. Contributing to higher velocity is the real-time nature of data creation, as well as the need to incorporate streaming data into business processes and decision making. Velocity impacts latency - the lag time between when data is created or captured and when it is accessible. Today, data is continually being generated at a pace that is impossible for traditional systems to capture, store and analyse. For time-sensitive processes such as real-time fraud detection or multi-channel instant marketing, certain types of data must be analysed in real time to be of value to the business.

D. Veracity

Data uncertainty. Veracity refers to the level of reliability associated with certain types of data. Striving for high data quality is an important big data requirement and challenge, but even the best data cleansing methods cannot remove the inherent unpredictability of some data, like the weather, the economy, or a customer's actual future buying decisions. The need to acknowledge and plan for uncertainty is a dimension of big data that has been introduced as executives seek to better understand the uncertain world around them (see sidebar, "Veracity, the fourth V").[6]

Fig 8: Big Data in Dimensions

V. BIG DATA CHARACTERISTICS

A. Data Volume

The "Big" word in Big Data itself defines the volume. At present the existing data is in petabytes (10^15 bytes) and is expected to increase to zettabytes (10^21 bytes) in the near future. Data volume measures the amount of data available to an organization, which does not necessarily have to own all of it as long as it can access it. [7]

B. Data Velocity

Velocity in Big Data is a concept which deals with the speed of the data coming from various sources. This characteristic is not limited to the speed of incoming data but also covers the speed at which the data flows and is aggregated.

C. Data Variety

Data variety is a measure of the richness of the data representation: text, images, video, audio, etc. The data being produced is not of a single category, as it includes not only traditional data but also semi-structured data from various resources like web pages, web log files, social media sites, e-mail, and documents.

D. Data Value

Data value measures the usefulness of data in making decisions. Data science is exploratory and useful in getting to know the data, but analytic science encompasses the predictive power of big data. Users can run queries against the stored data, deduce important results from the filtered data obtained, and rank them according to the dimensions they require. These reports help people find the business trends according to which they can change their strategies.

Fig 9: Era of Big Data

E. Complexity

Complexity measures the degree of interconnectedness (possibly very large) and interdependence in big data structures, such that a small change (or combination of small changes) in one or a few elements can ripple across or cascade through the system and substantially affect its behavior, or produce no change at all (Katal, Wazid, & Goudar, 2013).

VI. CHALLENGES IN BIG DATA

The challenges in Big Data are the real implementation hurdles which require immediate attention. Any implementation that does not handle these challenges may lead to the failure of the technology implementation and some unpleasant results.

A. Privacy and Security

This is the most important challenge with Big Data; it is sensitive and has conceptual, technical as well as legal significance. The personal information of a person (e.g. in the database of a merchant or social networking website), when combined with external large data sets, leads to the inference of new facts about that person, and it is possible that these kinds of facts are secretive and the person might not want the data owner, or any other person, to know about them. Information regarding people is collected and used in order to add value to the business of the organization. This is done by creating insights into their lives of which they are unaware.
Another important consequence would be social stratification, where a data-literate person would take advantage of Big Data predictive analysis, while on the other hand the underprivileged would be easily identified and treated worse. Big Data used by law enforcement will increase the chances of certain tagged people suffering adverse consequences, without the ability to fight back or even the knowledge that they are being discriminated against.

B. Data Access and Sharing of Information

If the data in a company's information systems is to be used to make accurate decisions in time, it becomes necessary that it be available in an accurate, complete and timely manner. This makes the data management and governance process a bit complex, adding the necessity to make data open and available to government agencies in a standardized manner, with standardized APIs, metadata and formats, thus leading to better decision making, business intelligence and productivity improvements.

Expecting companies to share data between themselves is awkward because of the need to get an edge in business: sharing data about their clients and operations threatens the culture of secrecy and competitiveness.

C. Analytical Challenges

The main challenging questions are: What if the data volume gets so large and varied that it is not known how to deal with it? Does all the data need to be stored? Does all the data need to be analyzed? How do we find out which data points are really important? How can the data be used to best advantage? Big Data brings with it some huge analytical challenges. The type of analysis to be done on this huge amount of data, which can be unstructured, semi-structured or structured, requires a large number of advanced skills. Moreover, the type of analysis needed depends highly on the results to be obtained, i.e. decision making. This can be done using one of two techniques: either incorporate massive data volumes in the analysis, or determine upfront which Big Data is relevant.

D. Human Resources and Manpower

Since Big Data is in its youth as an emerging technology, it needs to attract organizations and young people with diverse new skill sets. These skills should not be limited to technical ones but should also extend to research, analytical, interpretive and creative ones. These skills need to be developed in individuals, which requires training programs to be held by organizations. Moreover, universities need to introduce curricula on Big Data to produce skilled employees with this expertise.

E. Technical Challenges

1) Fault Tolerance: With the arrival of new technologies like cloud computing and Big Data, the intent is that whenever a failure occurs, the damage done should be within an acceptable threshold rather than requiring the whole task to begin again from scratch. Fault-tolerant computing is extremely hard, involving intricate algorithms. It is simply not possible to devise absolutely foolproof, 100% reliable fault-tolerant machines or software. Thus the main task is to reduce the probability of failure to an acceptable level. Unfortunately, the more we strive to reduce this probability, the higher the cost. Two methods which seem to increase fault tolerance in Big Data are: first, divide the whole computation into tasks and assign these tasks to different nodes for computation; second, assign one node the work of observing that these nodes are working properly, and restart a particular task if something goes wrong. But sometimes it is quite possible that the whole computation cannot be divided into such independent tasks: some tasks may be recursive in nature, where the output of the previous computation is the input to the next. Restarting the whole computation then becomes a cumbersome process. This can be avoided by applying checkpoints, which record the state of the system at certain intervals of time. In case of any failure, the computation can restart from the last checkpoint maintained. (A minimal sketch of this idea appears at the end of this section.)

2) Scalability: The scalability issue of Big Data has led towards cloud computing, which now aggregates multiple disparate workloads with varying performance goals into very large clusters.
This requires a high level of sharing of resources, which is expensive and also brings with it various challenges, such as how to run and execute various jobs so that the goal of each workload can be met cost-effectively. It also requires dealing with system failures in an efficient manner, which occur more frequently when operating on large clusters. These factors combined raise the concern of how to express programs, even complex machine learning tasks. There has also been a huge shift in the technologies being used: hard disk drives (HDDs) are being replaced by solid-state drives and phase-change memory, which do not show the same performance difference between sequential and random data transfer. Thus, what kinds of storage devices to use is again a big question for data storage.

3) Quality of Data: Collecting a huge amount of data and storing it comes at a cost. More data, if used for decision making or for predictive analysis in business, will definitely lead to better results. Business leaders will always want more and more data storage, whereas IT leaders will consider all the technical aspects before storing all the data. Big Data basically focuses on quality data storage rather than having very large amounts of irrelevant data, so that better results and conclusions can be drawn. This further leads to questions such as how to ensure which data is relevant, how much data is enough for decision making, and whether the stored data is accurate enough to draw conclusions from (Katal, Wazid, & Goudar, 2013).

4) Heterogeneous Data: Unstructured data represents almost every kind of data being produced, from social media interactions, to recorded meetings, to PDF documents, fax transfers, e-mails and more. Working with unstructured data is cumbersome and of course costly too. Converting all this unstructured data into structured form is also not feasible. Structured data is always organized in a highly mechanized and manageable way and integrates well with databases, whereas unstructured data is completely raw and unorganized.
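Returning to the checkpointing idea from VI.E.1, the sketch below saves the state of a long, step-dependent computation to disk at intervals, so that a restart resumes from the last checkpoint rather than from scratch. The file name, interval, and workload are all illustrative.

import json
import os

CHECKPOINT_FILE = "checkpoint.json"   # illustrative location
CHECKPOINT_EVERY = 1000               # save state every N iterations

def load_checkpoint():
    # Resume from the last saved state, if any.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {"i": 0, "acc": 0}

def save_checkpoint(state):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(state, f)

def run(total_iterations=10_000):
    # Each step depends on the previous one, so without checkpoints a
    # failure would force the whole computation to start again from zero.
    state = load_checkpoint()
    for i in range(state["i"], total_iterations):
        state["acc"] = (state["acc"] + i) % 1_000_003
        state["i"] = i + 1
        if state["i"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(state)
    save_checkpoint(state)
    return state["acc"]

if __name__ == "__main__":
    print(run())

If the process is killed partway through and started again, load_checkpoint picks up from the last saved iteration instead of iteration zero.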

VII. ISSUES IN BIG DATA

The issues in Big Data are some of the conceptual points that should be understood by an organization in order to implement the technology effectively. Big Data issues should not be confused with problems, but they are important to know and crucial to handle.

A. Issues Related to the Characteristics

Data Volume: As data volume increases, the value of different data records will decrease in proportion to age, type, richness, and quantity, among other factors. Existing social networking sites themselves produce data on the order of terabytes every day, and this amount of data is definitely difficult to handle using existing traditional systems. [7]

Data Velocity: Our traditional systems are not capable of performing analytics on data which is constantly in motion. E-commerce has rapidly increased the speed and richness of data used for different business transactions (for example, web-site clicks). Data velocity management is much more than a bandwidth issue.

Data Variety: All this data is totally different, consisting of raw, structured, semi-structured and even unstructured data, which is difficult to handle with existing traditional analytic systems. From an analytic perspective, variety is probably the biggest obstacle to effectively using large volumes of data. Incompatible data formats, non-aligned data structures, and inconsistent data semantics represent significant challenges that can lead to analytic sprawl.

Fig 10: Explosion in size of Data (Hewlett-Packard Development Company, 2012)

Data Value: The data stored by different organizations is used by them for data analytics. This produces a kind of gap between the business leaders and the IT professionals: the main concern of business leaders is simply adding value to their business and gaining more and more profit, unlike the IT leaders, who have to be concerned with the technicalities of storage and processing.

Data Complexity: One current difficulty of big data is that working with it using relational databases and desktop statistics/visualization packages requires massively parallel software running on tens, hundreds, or even thousands of servers. It is quite an undertaking to link, match, cleanse and transform data across systems coming from various sources. It is also necessary to connect and correlate relationships, hierarchies and multiple data linkages, or data can quickly spiral out of control (Katal, Wazid, & Goudar, 2013).[8]

B. Storage and Transport Issues

The quantity of data has exploded each time we have invented a new storage medium. What is different about the most recent data explosion, mainly due to social media, is that there has been no new storage medium. Moreover, data is now being created by everyone and everything (from mobile devices to supercomputers), not just, as heretofore, by professionals such as scientists, journalists, writers, etc. Current disk technology limits are about 4 terabytes (10^12 bytes) per disk, so 1 Exabyte (10^18 bytes) would require 25,000 disks. Even if an Exabyte of data could be processed on a single computer system, the system would be unable to directly attach the requisite number of disks. Access to that data would overwhelm current communication networks. Assuming that a 1 gigabyte per second network has an effective sustainable transfer rate of 80%, the sustainable bandwidth is about 100 megabytes per second. Thus, transferring an Exabyte would take about 2800 hours, if we assume that a sustained transfer could be maintained.
It would take longer to transmit the data from a collection or storage point to a processing point than to actually process it. To handle this issue, the data should be processed in place, transmitting only the resulting information. In other words, bring the code to the data, unlike the traditional method of bringing the data to the code (Kaisler, Armour, Espinosa, & Money, 2013).

C. Data Management Issues

Data management will perhaps be the most difficult problem to address with big data. Resolving issues of access, utilization, updating, governance, and reference (in publications) has proven to be a major stumbling block. The sources of the data are varied - by size, by format, and by method of collection. Individuals contribute digital data in mediums comfortable to them, like documents, drawings, pictures, sound and video recordings, models, software behaviors, user interface designs, etc., with or without adequate metadata describing what, when, where, who, why and how it was collected, and its provenance. Unlike the collection of data by manual methods, where rigorous protocols are often followed in order to ensure accuracy and validity, digital data collection is much more relaxed. Given the volume, it is impractical to validate every data item, so new approaches to data qualification and validation are needed. The richness of digital data representation prohibits a personalized methodology for data collection. To summarize, there is no perfect big data management solution yet. This represents an important gap in the research literature on big data that needs to be filled.

D. Processing Issues

Assume that an Exabyte of data needs to be processed in its entirety. For simplicity, assume the data is chunked into blocks of 8 words, so 1 Exabyte = 1K petabytes. [9]

Assuming a processor expends 100 instructions on one block at 5 gigahertz, the time required for end-to-end processing of one block would be 20 nanoseconds. To process 1K petabytes would require a total end-to-end processing time of roughly 635 years. Thus, effective processing of an Exabyte of data will require extensive parallel processing and new analytics algorithms in order to provide timely and actionable information (Kaisler, Armour, Espinosa, & Money, 2013).

VIII. SECURITY: A BIG QUESTION FOR BIG DATA

Big security for big data. We are children of the information generation. No longer tied to large mainframe computers, we now access information via applications, mobile devices, and laptops to make decisions based on real-time data. It is because information is so pervasive that businesses want to capture this data and analyze it for intelligence.

A. Data Explosion

The multitude of devices, users, and generated traffic all combine to create a proliferation of data that is being created with incredible volume, velocity, and variety. As a result, organizations need a way to protect, utilize, and gain real-time insight from big data. This intelligence is not only valuable to businesses and consumers, but also to hackers. Robust information marketplaces have arisen for hackers to sell credit card information, account usernames, passwords, national secrets (WikiLeaks), as well as intellectual property. How does anyone keep secrets anymore? How does anyone keep secrets protected from hackers? In the past, when the network infrastructure was straightforward and perimeters used to exist, controlling access to data was much simpler. If your secrets rested within the company network, all you had to do to keep the data safe was to make sure you had a strong firewall in place. However, as data became available through the Internet, mobile devices, and the cloud, having a firewall was not enough. Companies tried to solve each security problem in a piecemeal manner, tacking on more security devices like patching holes in a wall. But because these products did not interoperate, you could not coordinate a defense against hackers. In order to meet the current security problems faced by organizations, a paradigm shift needs to occur. Businesses need the ability to secure data, collect it, and aggregate it into an intelligent format, so that real-time alerting and reporting can take place. The first step is to establish complete visibility, so that your data, and who accesses it, can be monitored. Next, you need to understand the context, so that you can focus on the valued assets which are critical to your business. Finally, utilize the intelligence gathered so that you can harden your attack surface and stop attacks before the data is exfiltrated. So, how do we get started?

B. Data Collection

Your first job is to aggregate all the information from every device into one place. This means collecting information from cloud, virtual, and physical appliances: network devices, applications, servers, databases, desktops, and security devices. With Software-as-a-Service (SaaS) applications deployed in the cloud, it is important to collect logs from those applications as well, since data stored in the cloud can contain information spanning from human resource management to customer information. Collecting this information gives you visibility into who is accessing your company's information, what information they are accessing, and when this access is occurring.
The goal is to capture usage patterns and look for signs of malicious behavior. Typically, data theft proceeds in five stages. First, hackers research their target in order to find a way to enter the network. After infiltrating the network, they may install an agent to lie dormant and gather information until they discover where the payload is hosted and how to acquire it. Once the target is captured, the next step is to exfiltrate the information out of the network. Most advanced attacks progress through these five stages, and having this understanding helps you look for clues on whether an attack is taking place in your environment, and how to stop the attacker from reaching their target. The key to determining what logs to collect is to focus on records where an actor is accessing information or systems.

C. Data Integration

Once the machine data is collected, it needs to be parsed to derive intelligence from cryptic log messages. Automation and rule-based processing are needed, because having a person review logs manually would make the problem of finding an attacker quite difficult, since the security analyst would need to manually separate attacks from logs of normal behaviour. The solution is to normalize machine logs so that queries can pull context-aware information from log data. For example, HP ArcSight connectors normalize and categorize log data into over 400 meta fields. Logs that have been normalized become more useful because you no longer need an expert on a particular device to interpret the log. By enriching logs with metadata, you can turn strings of text into information that can be indexed and searched.

D. Data Analytics

Normalized logs are indexed and categorized to make it easy for a correlation engine to process them and identify patterns based on heuristics and security rules. It is here that the art of combining logs from multiple sources and correlating events together helps to create real-time alerts. This preprocessing also speeds up correlation and makes event logs vendor-agnostic, which gives analysts the ability to build reports and filters with simple English queries. In real time versus after the fact: catching a hacker and being able to stop them as the attack is taking place is more useful to a company than using forensics to piece together an attack that already took place. [10] However, in order to have that as part of your arsenal, we have to resolve four problems: How do you insert data faster into your data store? How do you store all this data? How do you quickly process events? How do you return results faster?
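A minimal sketch of the normalization and rule-based correlation steps described above: raw, vendor-specific log lines are parsed into a handful of common meta fields, and a simple rule flags repeated failed logins followed by a success for the same user. The log formats, field names, and rule threshold are invented for the illustration and are far simpler than a product such as HP ArcSight.

import re
from collections import defaultdict

# Hypothetical raw logs from two different devices, in two different formats.
RAW_LOGS = [
    "2024-05-01T10:00:01 fw01 LOGIN_FAIL user=alice src=10.0.0.5",
    "2024-05-01T10:00:03 fw01 LOGIN_FAIL user=alice src=10.0.0.5",
    "2024-05-01T10:00:05 fw01 LOGIN_FAIL user=alice src=10.0.0.5",
    "May  1 10:00:09 vpn02: accepted login for alice from 10.0.0.5",
]

FW_PATTERN = re.compile(
    r"^(?P<ts>\S+) (?P<device>\S+) (?P<event>LOGIN_\w+) user=(?P<user>\S+) src=(?P<src>\S+)$")
VPN_PATTERN = re.compile(
    r"^(?P<ts>\w+ +\d+ [\d:]+) (?P<device>\S+): accepted login for (?P<user>\S+) from (?P<src>\S+)$")

def normalize(line):
    # Map a raw line onto common meta fields, regardless of vendor format.
    m = FW_PATTERN.match(line)
    if m:
        d = m.groupdict()
        d["outcome"] = "failure" if d.pop("event") == "LOGIN_FAIL" else "success"
        return d
    m = VPN_PATTERN.match(line)
    if m:
        d = m.groupdict()
        d["outcome"] = "success"
        return d
    return None  # unparsed lines would be kept aside for later review

def correlate(events, fail_threshold=3):
    # Rule: several failures followed by a success for the same user.
    failures = defaultdict(int)
    alerts = []
    for e in events:
        if e["outcome"] == "failure":
            failures[e["user"]] += 1
        elif failures[e["user"]] >= fail_threshold:
            alerts.append("possible credential attack: " + e["user"] + " from " + e["src"])
    return alerts

events = [e for e in (normalize(line) for line in RAW_LOGS) if e]
print(correlate(events))

Once the lines are reduced to common fields such as user, src, and outcome, the same rule works no matter which device produced the log, which is the point of normalization.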


More information

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM

CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED DATA PLATFORM CONSOLIDATING RISK MANAGEMENT AND REGULATORY COMPLIANCE APPLICATIONS USING A UNIFIED PLATFORM Executive Summary Financial institutions have implemented and continue to implement many disparate applications

More information

Big Data Integration BIG DATA 9/15/2017. Business Performance

Big Data Integration BIG DATA 9/15/2017. Business Performance BIG DATA Business Performance Big Data Integration Big data is often about doing things that weren t widely possible because the technology was not advanced enough or the cost of doing so was prohibitive.

More information

Exploiting and Gaining New Insights for Big Data Analysis

Exploiting and Gaining New Insights for Big Data Analysis Exploiting and Gaining New Insights for Big Data Analysis K.Vishnu Vandana Assistant Professor, Dept. of CSE Science, Kurnool, Andhra Pradesh. S. Yunus Basha Assistant Professor, Dept.of CSE Sciences,

More information

IJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor:2.114

IJESRT. Scientific Journal Impact Factor: (ISRA), Impact Factor:2.114 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY Review of Major Challenges in Data Analysis of Big- Data Mr. Rahul R. Papalkar *, Mr. Pravin R. Nerkar, Mr Umesh V. Nikam Information

More information

Full file at

Full file at Chapter 2 Data Warehousing True-False Questions 1. A real-time, enterprise-level data warehouse combined with a strategy for its use in decision support can leverage data to provide massive financial benefits

More information

UNLEASHING THE VALUE OF THE TERADATA UNIFIED DATA ARCHITECTURE WITH ALTERYX

UNLEASHING THE VALUE OF THE TERADATA UNIFIED DATA ARCHITECTURE WITH ALTERYX UNLEASHING THE VALUE OF THE TERADATA UNIFIED DATA ARCHITECTURE WITH ALTERYX 1 Successful companies know that analytics are key to winning customer loyalty, optimizing business processes and beating their

More information

The Emerging Data Lake IT Strategy

The Emerging Data Lake IT Strategy The Emerging Data Lake IT Strategy An Evolving Approach for Dealing with Big Data & Changing Environments bit.ly/datalake SPEAKERS: Thomas Kelly, Practice Director Cognizant Technology Solutions Sean Martin,

More information

Shine a Light on Dark Data with Vertica Flex Tables

Shine a Light on Dark Data with Vertica Flex Tables White Paper Analytics and Big Data Shine a Light on Dark Data with Vertica Flex Tables Hidden within the dark recesses of your enterprise lurks dark data, information that exists but is forgotten, unused,

More information

SIEM: Five Requirements that Solve the Bigger Business Issues

SIEM: Five Requirements that Solve the Bigger Business Issues SIEM: Five Requirements that Solve the Bigger Business Issues After more than a decade functioning in production environments, security information and event management (SIEM) solutions are now considered

More information

Using the Network to Optimize a Virtualized Data Center

Using the Network to Optimize a Virtualized Data Center Using the Network to Optimize a Virtualized Data Center Contents Section I: Introduction The Rise of Virtual Computing. 1 Section II: The Role of the Network. 3 Section III: Network Requirements of the

More information

IT1105 Information Systems and Technology. BIT 1 ST YEAR SEMESTER 1 University of Colombo School of Computing. Student Manual

IT1105 Information Systems and Technology. BIT 1 ST YEAR SEMESTER 1 University of Colombo School of Computing. Student Manual IT1105 Information Systems and Technology BIT 1 ST YEAR SEMESTER 1 University of Colombo School of Computing Student Manual Lesson 3: Organizing Data and Information (6 Hrs) Instructional Objectives Students

More information

New Approach to Unstructured Data

New Approach to Unstructured Data Innovations in All-Flash Storage Deliver a New Approach to Unstructured Data Table of Contents Developing a new approach to unstructured data...2 Designing a new storage architecture...2 Understanding

More information

Improving the ROI of Your Data Warehouse

Improving the ROI of Your Data Warehouse Improving the ROI of Your Data Warehouse Many organizations are struggling with a straightforward but challenging problem: their data warehouse can t affordably house all of their data and simultaneously

More information

BI Moves Operational - The Case for High-Performance Aggregation Infrastructure

BI Moves Operational - The Case for High-Performance Aggregation Infrastructure WHITE PAPER BI Moves Operational - The Case for High-Performance Aggregation Infrastructure MARCH 2005 This white paper will detail the requirements for operational business intelligence, and will show

More information

Continuous Processing versus Oracle RAC: An Analyst s Review

Continuous Processing versus Oracle RAC: An Analyst s Review Continuous Processing versus Oracle RAC: An Analyst s Review EXECUTIVE SUMMARY By Dan Kusnetzky, Distinguished Analyst Most organizations have become so totally reliant on information technology solutions

More information

SIEM Solutions from McAfee

SIEM Solutions from McAfee SIEM Solutions from McAfee Monitor. Prioritize. Investigate. Respond. Today s security information and event management (SIEM) solutions need to be able to identify and defend against attacks within an

More information

ELTMaestro for Spark: Data integration on clusters

ELTMaestro for Spark: Data integration on clusters Introduction Spark represents an important milestone in the effort to make computing on clusters practical and generally available. Hadoop / MapReduce, introduced the early 2000s, allows clusters to be

More information

Best practices in IT security co-management

Best practices in IT security co-management Best practices in IT security co-management How to leverage a meaningful security partnership to advance business goals Whitepaper Make Security Possible Table of Contents The rise of co-management...3

More information

Finding a needle in Haystack: Facebook's photo storage

Finding a needle in Haystack: Facebook's photo storage Finding a needle in Haystack: Facebook's photo storage The paper is written at facebook and describes a object storage system called Haystack. Since facebook processes a lot of photos (20 petabytes total,

More information

CLEARING THE PATH: PREVENTING THE BLOCKS TO CYBERSECURITY IN BUSINESS

CLEARING THE PATH: PREVENTING THE BLOCKS TO CYBERSECURITY IN BUSINESS CLEARING THE PATH: PREVENTING THE BLOCKS TO CYBERSECURITY IN BUSINESS Introduction The world of cybersecurity is changing. As all aspects of our lives become increasingly connected, businesses have made

More information

When, Where & Why to Use NoSQL?

When, Where & Why to Use NoSQL? When, Where & Why to Use NoSQL? 1 Big data is becoming a big challenge for enterprises. Many organizations have built environments for transactional data with Relational Database Management Systems (RDBMS),

More information

Big Data - Some Words BIG DATA 8/31/2017. Introduction

Big Data - Some Words BIG DATA 8/31/2017. Introduction BIG DATA Introduction Big Data - Some Words Connectivity Social Medias Share information Interactivity People Business Data Data mining Text mining Business Intelligence 1 What is Big Data Big Data means

More information

THE FUTURE OF BUSINESS DEPENDS ON SOFTWARE DEFINED STORAGE (SDS)

THE FUTURE OF BUSINESS DEPENDS ON SOFTWARE DEFINED STORAGE (SDS) THE FUTURE OF BUSINESS DEPENDS ON SOFTWARE DEFINED STORAGE (SDS) How SSDs can fit into and accelerate an SDS strategy SPONSORED BY TABLE OF CONTENTS Introduction 3 An Overview of SDS 4 Achieving the Goals

More information

The Hadoop Paradigm & the Need for Dataset Management

The Hadoop Paradigm & the Need for Dataset Management The Hadoop Paradigm & the Need for Dataset Management 1. Hadoop Adoption Hadoop is being adopted rapidly by many different types of enterprises and government entities and it is an extraordinarily complex

More information

Hierarchy of knowledge BIG DATA 9/7/2017. Architecture

Hierarchy of knowledge BIG DATA 9/7/2017. Architecture BIG DATA Architecture Hierarchy of knowledge Data: Element (fact, figure, etc.) which is basic information that can be to be based on decisions, reasoning, research and which is treated by the human or

More information

The data quality trends report

The data quality trends report Report The 2015 email data quality trends report How organizations today are managing and using email Table of contents: Summary...1 Research methodology...1 Key findings...2 Email collection and database

More information

The Seven Steps to Implement DataOps

The Seven Steps to Implement DataOps The Seven Steps to Implement Ops ABSTRACT analytics teams challenged by inflexibility and poor quality have found that Ops can address these and many other obstacles. Ops includes tools and process improvements

More information

APPLYING THE POWER OF AI TO YOUR VIDEO PRODUCTION STORAGE

APPLYING THE POWER OF AI TO YOUR VIDEO PRODUCTION STORAGE APPLYING THE POWER OF AI TO YOUR VIDEO PRODUCTION STORAGE FINDING WHAT YOU NEED IN YOUR IN-HOUSE VIDEO STORAGE SECTION 1 You need ways to generate metadata for stored videos without time-consuming manual

More information

Accelerate your SAS analytics to take the gold

Accelerate your SAS analytics to take the gold Accelerate your SAS analytics to take the gold A White Paper by Fuzzy Logix Whatever the nature of your business s analytics environment we are sure you are under increasing pressure to deliver more: more

More information

Data Warehousing. Ritham Vashisht, Sukhdeep Kaur and Shobti Saini

Data Warehousing. Ritham Vashisht, Sukhdeep Kaur and Shobti Saini Advance in Electronic and Electric Engineering. ISSN 2231-1297, Volume 3, Number 6 (2013), pp. 669-674 Research India Publications http://www.ripublication.com/aeee.htm Data Warehousing Ritham Vashisht,

More information

ACCELERATE YOUR ANALYTICS GAME WITH ORACLE SOLUTIONS ON PURE STORAGE

ACCELERATE YOUR ANALYTICS GAME WITH ORACLE SOLUTIONS ON PURE STORAGE ACCELERATE YOUR ANALYTICS GAME WITH ORACLE SOLUTIONS ON PURE STORAGE An innovative storage solution from Pure Storage can help you get the most business value from all of your data THE SINGLE MOST IMPORTANT

More information

MAPR DATA GOVERNANCE WITHOUT COMPROMISE

MAPR DATA GOVERNANCE WITHOUT COMPROMISE MAPR TECHNOLOGIES, INC. WHITE PAPER JANUARY 2018 MAPR DATA GOVERNANCE TABLE OF CONTENTS EXECUTIVE SUMMARY 3 BACKGROUND 4 MAPR DATA GOVERNANCE 5 CONCLUSION 7 EXECUTIVE SUMMARY The MapR DataOps Governance

More information

HOW WELL DO YOU KNOW YOUR IT NETWORK? BRIEFING DOCUMENT

HOW WELL DO YOU KNOW YOUR IT NETWORK? BRIEFING DOCUMENT HOW WELL DO YOU KNOW YOUR IT NETWORK? BRIEFING DOCUMENT ARE YOU REALLY READY TO EXECUTE A GLOBAL IOT STRATEGY? Increased demand driven by long-term trends of the Internet of Things, WLAN, connected LED

More information

Overview of Web Mining Techniques and its Application towards Web

Overview of Web Mining Techniques and its Application towards Web Overview of Web Mining Techniques and its Application towards Web *Prof.Pooja Mehta Abstract The World Wide Web (WWW) acts as an interactive and popular way to transfer information. Due to the enormous

More information

Investing in a Better Storage Environment:

Investing in a Better Storage Environment: Investing in a Better Storage Environment: Best Practices for the Public Sector Investing in a Better Storage Environment 2 EXECUTIVE SUMMARY The public sector faces numerous and known challenges that

More information

The Future of Business Depends on Software Defined Storage (SDS) How SSDs can fit into and accelerate an SDS strategy

The Future of Business Depends on Software Defined Storage (SDS) How SSDs can fit into and accelerate an SDS strategy The Future of Business Depends on Software Defined Storage (SDS) Table of contents Introduction 2 An Overview of SDS 3 Achieving the Goals of SDS Hinges on Smart Hardware Decisions 5 Assessing the Role

More information

Massive Data Analysis

Massive Data Analysis Professor, Department of Electrical and Computer Engineering Tennessee Technological University February 25, 2015 Big Data This talk is based on the report [1]. The growth of big data is changing that

More information

The #1 Key to Removing the Chaos. in Modern Analytical Environments

The #1 Key to Removing the Chaos. in Modern Analytical Environments October/2018 Advanced Data Lineage: The #1 Key to Removing the Chaos in Modern Analytical Environments Claudia Imhoff, Ph.D. Sponsored By: Table of Contents Executive Summary... 1 Data Lineage Introduction...

More information

Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value

Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value KNOWLEDGENT INSIGHTS volume 1 no. 5 October 7, 2011 Xcelerated Business Insights (xbi): Going beyond business intelligence to drive information value Today s growing commercial, operational and regulatory

More information

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining. About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts

More information

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data

Oracle Big Data SQL. Release 3.2. Rich SQL Processing on All Data Oracle Big Data SQL Release 3.2 The unprecedented explosion in data that can be made useful to enterprises from the Internet of Things, to the social streams of global customer bases has created a tremendous

More information

Evolution of the ICT Field: New Requirements to Specialists

Evolution of the ICT Field: New Requirements to Specialists Lumen 2/2017 ARTICLE Evolution of the ICT Field: New Requirements to Specialists Vladimir Ryabov, PhD, Principal Lecturer, School of Business and Culture, Lapland UAS Tuomo Lindholm, MBA, Senior Lecturer,

More information

Massive Scalability With InterSystems IRIS Data Platform

Massive Scalability With InterSystems IRIS Data Platform Massive Scalability With InterSystems IRIS Data Platform Introduction Faced with the enormous and ever-growing amounts of data being generated in the world today, software architects need to pay special

More information

78% of CIOs can t guarantee application performance 1. What s holding your apps back?

78% of CIOs can t guarantee application performance 1. What s holding your apps back? INSIGHTS PAPER What s holding your apps back? 5 culprits behind sluggish network performance and what you can do to stop them Whether they re your customers or employees, today s end-users demand applications

More information

TDWI Data Modeling. Data Analysis and Design for BI and Data Warehousing Systems

TDWI Data Modeling. Data Analysis and Design for BI and Data Warehousing Systems Data Analysis and Design for BI and Data Warehousing Systems Previews of TDWI course books offer an opportunity to see the quality of our material and help you to select the courses that best fit your

More information

Chapter 6 VIDEO CASES

Chapter 6 VIDEO CASES Chapter 6 Foundations of Business Intelligence: Databases and Information Management VIDEO CASES Case 1a: City of Dubuque Uses Cloud Computing and Sensors to Build a Smarter, Sustainable City Case 1b:

More information

QLIKVIEW ARCHITECTURAL OVERVIEW

QLIKVIEW ARCHITECTURAL OVERVIEW QLIKVIEW ARCHITECTURAL OVERVIEW A QlikView Technology White Paper Published: October, 2010 qlikview.com Table of Contents Making Sense of the QlikView Platform 3 Most BI Software Is Built on Old Technology

More information

IBM Software IBM InfoSphere Information Server for Data Quality

IBM Software IBM InfoSphere Information Server for Data Quality IBM InfoSphere Information Server for Data Quality A component index Table of contents 3 6 9 9 InfoSphere QualityStage 10 InfoSphere Information Analyzer 12 InfoSphere Discovery 13 14 2 Do you have confidence

More information

10 Steps to Building an Architecture for Space Surveillance Projects. Eric A. Barnhart, M.S.

10 Steps to Building an Architecture for Space Surveillance Projects. Eric A. Barnhart, M.S. 10 Steps to Building an Architecture for Space Surveillance Projects Eric A. Barnhart, M.S. Eric.Barnhart@harris.com Howard D. Gans, Ph.D. Howard.Gans@harris.com Harris Corporation, Space and Intelligence

More information

Networking for a dynamic infrastructure: getting it right.

Networking for a dynamic infrastructure: getting it right. IBM Global Technology Services Networking for a dynamic infrastructure: getting it right. A guide for realizing the full potential of virtualization June 2009 Executive summary June 2009 Networking for

More information

Solving the Enterprise Data Dilemma

Solving the Enterprise Data Dilemma Solving the Enterprise Data Dilemma Harmonizing Data Management and Data Governance to Accelerate Actionable Insights Learn More at erwin.com Is Our Company Realizing Value from Our Data? If your business

More information

An Overview of Smart Sustainable Cities and the Role of Information and Communication Technologies (ICTs)

An Overview of Smart Sustainable Cities and the Role of Information and Communication Technologies (ICTs) An Overview of Smart Sustainable Cities and the Role of Information and Communication Technologies (ICTs) Sekhar KONDEPUDI Ph.D. Vice Chair FG-SSC & Coordinator Working Group 1 ICT role and roadmap for

More information

WHAT CIOs NEED TO KNOW TO CAPITALIZE ON HYBRID CLOUD

WHAT CIOs NEED TO KNOW TO CAPITALIZE ON HYBRID CLOUD WHAT CIOs NEED TO KNOW TO CAPITALIZE ON HYBRID CLOUD 2 A CONVERSATION WITH DAVID GOULDEN Hybrid clouds are rapidly coming of age as the platforms for managing the extended computing environments of innovative

More information

Partner Presentation Faster and Smarter Data Warehouses with Oracle OLAP 11g

Partner Presentation Faster and Smarter Data Warehouses with Oracle OLAP 11g Partner Presentation Faster and Smarter Data Warehouses with Oracle OLAP 11g Vlamis Software Solutions, Inc. Founded in 1992 in Kansas City, Missouri Oracle Partner and reseller since 1995 Specializes

More information

Traditional Security Solutions Have Reached Their Limit

Traditional Security Solutions Have Reached Their Limit Traditional Security Solutions Have Reached Their Limit CHALLENGE #1 They are reactive They force you to deal only with symptoms, rather than root causes. CHALLENGE #2 256 DAYS TO IDENTIFY A BREACH TRADITIONAL

More information

Modernizing Healthcare IT for the Data-driven Cognitive Era Storage and Software-Defined Infrastructure

Modernizing Healthcare IT for the Data-driven Cognitive Era Storage and Software-Defined Infrastructure Modernizing Healthcare IT for the Data-driven Cognitive Era Storage and Software-Defined Infrastructure An IDC InfoBrief, Sponsored by IBM April 2018 Executive Summary Today s healthcare organizations

More information

Tips for Effective Patch Management. A Wanstor Guide

Tips for Effective Patch Management. A Wanstor Guide Tips for Effective Patch Management A Wanstor Guide 1 Contents + INTRODUCTION + UNDERSTAND YOUR NETWORK + ASSESS THE PATCH STATUS + TRY USING A SINGLE SOURCE FOR PATCHES + MAKE SURE YOU CAN ROLL BACK +

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

Big Data The end of Data Warehousing?

Big Data The end of Data Warehousing? Big Data The end of Data Warehousing? Hermann Bär Oracle USA Redwood Shores, CA Schlüsselworte Big data, data warehousing, advanced analytics, Hadoop, unstructured data Introduction If there was an Unwort

More information

Why the Threat of Downtime Should Be Keeping You Up at Night

Why the Threat of Downtime Should Be Keeping You Up at Night Why the Threat of Downtime Should Be Keeping You Up at Night White Paper 2 Your Plan B Just Isn t Good Enough. Learn Why and What to Do About It. Server downtime is an issue that many organizations struggle

More information

Transforming Security from Defense in Depth to Comprehensive Security Assurance

Transforming Security from Defense in Depth to Comprehensive Security Assurance Transforming Security from Defense in Depth to Comprehensive Security Assurance February 28, 2016 Revision #3 Table of Contents Introduction... 3 The problem: defense in depth is not working... 3 The new

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Spring 2016 A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt16 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

SOLUTION BRIEF RSA ARCHER IT & SECURITY RISK MANAGEMENT

SOLUTION BRIEF RSA ARCHER IT & SECURITY RISK MANAGEMENT RSA ARCHER IT & SECURITY RISK MANAGEMENT INTRODUCTION Organizations battle growing security challenges by building layer upon layer of defenses: firewalls, antivirus, intrusion prevention systems, intrusion

More information

ETL is No Longer King, Long Live SDD

ETL is No Longer King, Long Live SDD ETL is No Longer King, Long Live SDD How to Close the Loop from Discovery to Information () to Insights (Analytics) to Outcomes (Business Processes) A presentation by Brian McCalley of DXC Technology,

More information

OUTSMART ADVANCED CYBER ATTACKS WITH AN INTELLIGENCE-DRIVEN SECURITY OPERATIONS CENTER

OUTSMART ADVANCED CYBER ATTACKS WITH AN INTELLIGENCE-DRIVEN SECURITY OPERATIONS CENTER OUTSMART ADVANCED CYBER ATTACKS WITH AN INTELLIGENCE-DRIVEN SECURITY OPERATIONS CENTER HOW TO ADDRESS GARTNER S FIVE CHARACTERISTICS OF AN INTELLIGENCE-DRIVEN SECURITY OPERATIONS CENTER 1 POWERING ACTIONABLE

More information

Chapter 3. Foundations of Business Intelligence: Databases and Information Management

Chapter 3. Foundations of Business Intelligence: Databases and Information Management Chapter 3 Foundations of Business Intelligence: Databases and Information Management THE DATA HIERARCHY TRADITIONAL FILE PROCESSING Organizing Data in a Traditional File Environment Problems with the traditional

More information

The 7 Habits of Highly Effective API and Service Management

The 7 Habits of Highly Effective API and Service Management 7 Habits of Highly Effective API and Service Management: Introduction The 7 Habits of Highly Effective API and Service Management... A New Enterprise challenge has emerged. With the number of APIs growing

More information

Healthcare IT Optimization: 6 Mistakes to Avoid Along the Way

Healthcare IT Optimization: 6 Mistakes to Avoid Along the Way Healthcare IT Optimization: 6 Mistakes to Avoid Along the Way Healthcare IT: Transforming for tomorrow s needs Healthcare organizations face a sea of change in what will soon be required of them. Great

More information

The security challenge in a mobile world

The security challenge in a mobile world The security challenge in a mobile world Contents Executive summary 2 Executive summary 3 Controlling devices and data from the cloud 4 Managing mobile devices - Overview - How it works with MDM - Scenario

More information

Evolution For Enterprises In A Cloud World

Evolution For Enterprises In A Cloud World Evolution For Enterprises In A Cloud World Foreword Cloud is no longer an unseen, futuristic technology that proves unattainable for enterprises. Rather, it s become the norm; a necessity for realizing

More information

data-based banking customer analytics

data-based banking customer analytics icare: A framework for big data-based banking customer analytics Authors: N.Sun, J.G. Morris, J. Xu, X.Zhu, M. Xie Presented By: Hardik Sahi Overview 1. 2. 3. 4. 5. 6. Why Big Data? Traditional versus

More information

Informatica Enterprise Information Catalog

Informatica Enterprise Information Catalog Data Sheet Informatica Enterprise Information Catalog Benefits Automatically catalog and classify all types of data across the enterprise using an AI-powered catalog Identify domains and entities with

More information

Using Threat Analytics to Protect Privileged Access and Prevent Breaches

Using Threat Analytics to Protect Privileged Access and Prevent Breaches Using Threat Analytics to Protect Privileged Access and Prevent Breaches Under Attack Protecting privileged access and preventing breaches remains an urgent concern for companies of all sizes. Attackers

More information

Fundamental Shift: A LOOK INSIDE THE RISING ROLE OF IT IN PHYSICAL ACCESS CONTROL

Fundamental Shift: A LOOK INSIDE THE RISING ROLE OF IT IN PHYSICAL ACCESS CONTROL Fundamental Shift: A LOOK INSIDE THE RISING ROLE OF IT IN PHYSICAL ACCESS CONTROL Shifting budgets and responsibilities require IT and physical security teams to consider fundamental change in day-to-day

More information

12 Minute Guide to Archival Search

12 Minute Guide to  Archival Search X1 Technologies, Inc. 130 W. Union Street Pasadena, CA 91103 phone: 626.585.6900 fax: 626.535.2701 www.x1.com June 2008 Foreword Too many whitepapers spend too much time building up to the meat of the

More information

The Business Value of Metadata for Data Governance: The Challenge of Integrating Packaged Applications

The Business Value of Metadata for Data Governance: The Challenge of Integrating Packaged Applications The Business Value of Metadata for Data Governance: The Challenge of Integrating Packaged Applications By Donna Burbank Managing Director, Global Data Strategy, Ltd www.globaldatastrategy.com Sponsored

More information