Agile Data Science

About the Tutorial

Agile is a software development methodology that helps build software incrementally, using short iterations of one to four weeks, so that development stays aligned with changing business needs. Agile data science combines the agile methodology with data science. In this tutorial, we have used appropriate examples to help you understand agile development and data science in a general and quick way.

Audience

This tutorial has been prepared for developers and project managers to help them understand the basics of agile principles and their implementation. After completing this tutorial, you will find yourself at a moderate level of expertise, from where you can advance further with the implementation of data science and agile methodology.

Prerequisites

It is important to have basic knowledge of data science modules and software development concepts such as software requirements, coding and testing.

Copyright & Disclaimer

Copyright 2018 by Tutorials Point (I) Pvt. Ltd. All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing or republishing any contents or part of the contents of this e-book in any manner without written consent of the publisher. We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents, including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at contact@tutorialspoint.com

Table of Contents

About the Tutorial
Audience
Prerequisites
Copyright & Disclaimer
Table of Contents
1. Agile Data Science - Introduction
2. Agile Data Science - Methodology Concepts
3. Agile Data Science - Data Science Process
4. Agile Data Science - Agile Tools and Installation
5. Agile Data Science - Data Processing in Agile
6. Agile Data Science - SQL versus NoSQL
7. Agile Data Science - NoSQL and Dataflow Programming
8. Agile Data Science - Collecting and Displaying Records
9. Agile Data Science - Data Visualization
10. Agile Data Science - Data Enrichment
11. Agile Data Science - Working with Reports
12. Agile Data Science - Role of Predictions
13. Agile Data Science - Extracting Features with PySpark
14. Agile Data Science - Building a Regression Model
15. Agile Data Science - Deploying a Predictive System
16. Agile Data Science - SparkML
17. Agile Data Science - Fixing Prediction Problem
18. Agile Data Science - Improving Prediction Performance
19. Agile Data Science - Creating a Better Scene with Agile and Data Science
20. Agile Data Science - Implementation of Agile

1. Agile Data Science - Introduction

Agile data science is an approach that applies data science with the agile methodology for web application development. It focuses on the output of the data science process being suitable for effecting change in an organization. Data science includes building applications that describe the research process with analysis, interactive visualization and, now, applied machine learning as well.

The major goal of agile data science is to document and guide exploratory data analysis to discover and follow the critical path to a compelling product.

Agile data science is organized around the following set of principles:

Continuous Iteration
This process involves continuous iteration in the creation of tables, charts, reports and predictions. Building predictive models requires many iterations of feature engineering, with the extraction and production of insight.

Intermediate Output
This is the track list of outputs generated. It is even said that failed experiments also have output. Tracking the output of every iteration helps create better output in the next iteration.

Prototype Experiments
Prototype experiments involve assigning tasks and generating output as per the experiments. In a given task, we must iterate to achieve insight, and these iterations can best be described as experiments.

Integration of Data
The software development life cycle includes different phases, with data essential for customers, developers and the business. The integration of data paves the way for better prospects and outputs.

Pyramid Data Value

The data value pyramid describes the layers needed for agile data science development. It starts with a collection of records based on the requirements and the plumbing of individual records. Charts are created after cleaning and aggregating the data; the aggregated data can then be used for data visualization. Reports are generated with the proper structure, metadata and tags of the data. The second layer of the pyramid from the top covers predictive analysis. The prediction layer is where more value is created, and good predictions depend on focused feature engineering. The topmost layer involves actions, where the value of data is driven effectively. The best illustration of this implementation is artificial intelligence.

2. Agile Data Science - Methodology Concepts

In this chapter, we will focus on the concepts of the software development life cycle called agile. The agile software development methodology helps in building software through incremental sessions in short iterations of one to four weeks, so that development stays aligned with changing business requirements.

There are 12 principles that describe the agile methodology in detail:

Satisfaction of Customers
The highest priority is given to customers, focusing on their requirements through early and continuous delivery of valuable software.

Welcoming New Changes
Changes are acceptable during software development. Agile processes are designed to work in favor of the customer's competitive advantage.

Delivery
Working software is delivered to clients within a span of one to four weeks.

Collaboration
Business analysts, quality analysts and developers must work together during the entire life cycle of the project.

Motivation
Projects should be built around motivated individuals. An environment is provided to support the individual team members.

Personal Conversation
Face-to-face conversation is the most efficient and effective method of conveying information to and within a development team.

Measuring Progress
Working software is the key measure of progress in project and software development.

Maintaining Constant Pace
The agile process focuses on sustainable development. The business, the developers and the users should be able to maintain a constant pace with the project.

Monitoring
It is mandatory to maintain regular attention to technical excellence and good design to enhance agility.

Simplicity
The agile process keeps everything simple, using simple measures of the work that is not completed.

Self-organized Teams
An agile team should be self-organized and independent; the best architectures, requirements and designs emerge from self-organized teams.

Review the Work
It is important to review the work at regular intervals so that the team can reflect on how the work is progressing. Reviewing the modules on a timely basis improves performance.

Daily Stand-up

A daily stand-up is the daily status meeting among the team members. It provides updates related to the software development and is used to address obstacles in project development. The daily stand-up is a mandatory practice, no matter how the agile team is established and regardless of its office location.

The features of a daily stand-up are as follows:

The duration of the daily stand-up meeting should be roughly 15 minutes. It should not extend for a longer duration.
The stand-up should include discussions on status updates.
Participants of this meeting usually stand, with the intention of ending the meeting quickly.

User Story

A story is usually a requirement, formulated in a few sentences of simple language, and it should be completed within an iteration. A user story should include the following characteristics:

All the related code should have associated check-ins.
Unit test cases for the specified iteration.
All acceptance test cases should be defined.
Acceptance from the product owner while defining the story.

What is Scrum?

Scrum can be considered a subset of the agile methodology. It is a lightweight process that includes the following features:

It is a process framework comprising a set of practices that need to be followed in a consistent order. The best illustration of Scrum is following iterations or sprints.

It is a lightweight process, meaning the process is kept as small as possible to maximize the productive output in the given duration.

The Scrum process is known for its distinguishing practices in comparison with other methodologies of the traditional agile approach. It is divided into the following three categories:

Roles
Artifacts
Time Boxes

Roles define the team members and their roles throughout the process. The Scrum team consists of the following three roles:

Scrum Master
Product Owner
Team

The Scrum artifacts provide key information that each member should be aware of. The information includes details of the product, activities planned, and activities completed. The artifacts defined in the Scrum framework are as follows:

Product backlog
Sprint backlog
Burn-down chart
Increment

Time boxes are the user stories planned for each iteration. These user stories help in describing the product features that form part of the Scrum artifacts. The product backlog is a list of user stories. These user stories are prioritized and taken to the user meetings to decide which ones should be taken up.

Why Scrum Master?

The Scrum Master interacts with every member of the team. Let us now see the interaction of the Scrum Master with other teams and resources.

Product Owner
The Scrum Master interacts with the product owner in the following ways:

Finding techniques for the effective management of the product backlog of user stories.
Helping the team understand the need for clear and concise product backlog items.
Product planning in a specific environment.
Ensuring that the product owner knows how to increase the value of the product.
Facilitating Scrum events as and when required.

Scrum Team
The Scrum Master interacts with the team in several ways:

Coaching the organization in its Scrum adoption.
Planning Scrum implementations for the specific organization.
Helping employees and stakeholders understand the requirements and phases of product development.
Working with the Scrum Masters of other teams to increase the effectiveness of the application of Scrum for the specified team.

Organization
The Scrum Master interacts with the organization in several ways. A few are mentioned below:

Coaching the Scrum team in self-organization and cross-functionality.
Coaching the organization and teams in areas where Scrum is not yet fully adopted or accepted.

Benefits of Scrum

Scrum helps customers, team members and stakeholders collaborate. It includes a time-boxed approach and continuous feedback from the product owner, ensuring that the product is in working condition. Scrum provides benefits to the different roles in a project.

Customer
Sprints or iterations are kept to a short duration, and user stories are designed by priority and taken up at sprint planning. This ensures that customer requirements are fulfilled with every sprint delivery. If not, the requirements are noted, planned and taken up in a later sprint.

Organization
With the help of Scrum and Scrum Masters, the organization can focus on the efforts required to develop user stories, thus reducing work overload and avoiding rework, if any. This also helps in maintaining the increased efficiency of the development team and customer satisfaction. This approach also helps in increasing the potential of the market.

Product Managers
The main responsibility of product managers is to ensure that the quality of the product is maintained. With the help of Scrum Masters, it becomes easy to facilitate work, gather quick responses and absorb changes, if any. Product managers also verify, in every sprint, that the designed product is aligned with the customer requirements.

Development Team
With the time-boxed nature and short duration of sprints, the development team becomes enthusiastic about seeing the work reflected and delivered properly. The working product increments after every iteration, or rather, after every sprint. The user stories designed for each sprint reflect customer priority, adding more value to the iteration.

Conclusion

Scrum is an efficient framework within which you can develop software through teamwork. It is completely designed on agile principles. The Scrum Master is there to help and cooperate with the Scrum team in every possible way. He or she acts like a personal trainer who helps you stick to the designed plan and perform all the activities as per the plan. The authority of the Scrum Master should never extend beyond the process. He or she should be capable of managing every situation.

3. Agile Data Science - Data Science Process

In this chapter, we will understand the data science process and the terminologies required to understand it.

Data science is the blend of data interfaces, algorithm development and technology used to solve analytically complex problems. Data science is an interdisciplinary field encompassing scientific methods, processes and systems, with machine learning, math and statistics knowledge alongside traditional research. It also includes a combination of hacking skills with substantive expertise. Data science draws principles from mathematics, statistics, information science, computer science, data mining and predictive analysis.

The different roles that form part of the data science team are mentioned below:

Customers
Customers are the people who use the product. Their interest determines the success of the project, and their feedback is very valuable in data science.

Business Development
This team signs up early customers, either firsthand or through the creation of landing pages and promotions. The business development team delivers the value of the product.

Product Managers
Product managers take on the responsibility of creating the best product, one that is valuable in the market.

Interaction Designers
They design interactions around data models so that users find appropriate value.

Data Scientists
Data scientists explore and transform the data in new ways to create and publish new features. These scientists also combine data from diverse sources to create new value. They play an important role in creating visualizations with researchers, engineers and web developers.

Researchers
As the name specifies, researchers are involved in research activities. They solve complicated problems that data scientists cannot. These problems demand intense focus and time from the machine learning and statistics modules.

Adapting to Change
All the team members of data science are required to adapt to new changes and work on the basis of requirements. Several changes should be made when adopting the agile methodology with data science; they are mentioned as follows:

Choosing generalists over specialists.
Preferring small teams over large teams.
Using high-level tools and platforms.
Continuous and iterative sharing of intermediate work.

Note: In the agile data science team, a small team of generalists uses high-level tools that are scalable and refines data through iterations into increasingly higher states of value.

Consider the following examples related to the work of data science team members:

Designers deliver CSS.
Web developers build entire applications and understand the user experience and interface design.
Data scientists work on both research and building web services, including web applications.
Researchers work in the code base and show results that explain intermediate findings.
Product managers try to identify and understand the flaws in all the related areas.

4. Agile Data Science - Agile Tools and Installation

In this chapter, we will learn about the different agile tools and their installation. The development stack of the agile methodology includes the following set of components:

Events
An event is an occurrence that happens or is logged along with its features and timestamps. An event can come in many forms, such as servers, sensors, financial transactions or actions that our users take in our application. Throughout this tutorial, we will use JSON files to facilitate data exchange among different tools and languages.

Collectors
Collectors are event aggregators. They collect events in a systematic manner, storing and aggregating bulky data and queuing it for action by real-time workers.

Distributed Document Store
A distributed document store uses multiple nodes to store documents in a specific format. We will focus on MongoDB in this tutorial.

Web Application Server
A web application server serves data as JSON to the client for visualization, with minimal overhead. It means the web application server helps test and deploy the projects created with the agile methodology.

Modern Browser
A modern browser or application presents the data as an interactive tool for our users.

Local Environmental Setup

For managing data sets, we will focus on the Anaconda framework of Python, which includes tools for managing Excel, CSV and many other file formats. The dashboard of the Anaconda framework, once installed, is shown below. It is also called the Anaconda Navigator.

The navigator includes the Jupyter framework, which is a notebook system that helps manage datasets. Once you launch the framework, it is hosted in the browser.
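Since events flow through this stack as JSON, the following is a minimal sketch of logging event records as newline-delimited JSON and reading them back with the standard library; the file name and event fields are invented for illustration.

import json
from datetime import datetime, timezone

# Hypothetical events: each record carries its features and a timestamp.
events = [
    {"type": "page_view", "user": "u42",
     "ts": datetime.now(timezone.utc).isoformat()},
    {"type": "purchase", "user": "u42", "amount": 19.99,
     "ts": datetime.now(timezone.utc).isoformat()},
]

# Append events as newline-delimited JSON, one document per line.
with open("events.jsonl", "a") as f:
    for event in events:
        f.write(json.dumps(event) + "\n")

# Read the events back, as a collector or real-time worker would.
with open("events.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record["type"], record["ts"])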

5. Agile Data Science - Data Processing in Agile

In this chapter, we will focus on the difference between structured, semi-structured and unstructured data.

Structured Data
Structured data concerns data stored in SQL format, in tables with rows and columns. It includes a relational key that maps to pre-designed fields. Although structured data is used on a large scale, it represents only 5 to 10 percent of all informatics data.

Semi-structured Data
Semi-structured data includes data that does not reside in a relational database but that has some organizational properties that make it easier to analyse. With some processing, it can be stored in a relational database. Examples of semi-structured data are CSV files and XML and JSON documents. NoSQL databases are considered semi-structured.

Unstructured Data
Unstructured data represents about 80 percent of all data. It often includes text and multimedia content. The best examples of unstructured data include audio files, presentations and web pages. Examples of machine-generated unstructured data are satellite images, scientific data, photographs and video, and radar and sonar data.

The pyramid structure of data types specifically focuses on the amount of data and the ratio in which it is scattered. Quasi-structured data appears as a type between unstructured and semi-structured data. In this tutorial, we will focus on semi-structured data, which is beneficial for the agile methodology and data science research.

Semi-structured data does not have a formal data model, but it has an apparent, self-describing pattern and structure that is developed through its analysis.
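As a small illustration of that self-describing pattern, the sketch below parses an invented JSON record whose field names carry the structure themselves:

import json

# An invented semi-structured record: field names describe the structure.
doc = '''
{
  "name": "Agile Data Science",
  "tags": ["agile", "data science"],
  "author": {"first": "Jane", "last": "Doe"}
}
'''

record = json.loads(doc)
print(record["name"])            # top-level field
print(record["author"]["last"])  # nested field discovered from the data itself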

6. Agile Data Science - SQL versus NoSQL

The complete focus of this tutorial is to follow the agile methodology with fewer steps and more useful tools. To understand this, it is important to know the difference between SQL and NoSQL databases.

Most users are aware of SQL databases and have a good knowledge of MySQL, Oracle or other SQL databases. Over the last several years, NoSQL databases have been widely adopted to solve various business problems and project requirements.

The following comparison shows the difference between SQL and NoSQL databases:

SQL: SQL databases are mainly called Relational Database Management Systems (RDBMS).
NoSQL: NoSQL databases are also called document-oriented databases. They are non-relational and distributed.

SQL: SQL-based databases include tables with rows and columns. A collection of tables and other schema structures is called a database.
NoSQL: NoSQL databases include documents as the major structure, and a set of documents is called a collection.

SQL: SQL databases include a predefined schema.
NoSQL: NoSQL databases have dynamic data and include unstructured data.

SQL: SQL databases are vertically scalable.
NoSQL: NoSQL databases are horizontally scalable.

SQL: SQL databases are a good fit for complex query environments.
NoSQL: NoSQL databases do not have standard interfaces for complex query development.

SQL: SQL databases are not feasible for hierarchical data storage.
NoSQL: NoSQL databases fit better for hierarchical data storage.

SQL: SQL databases are the best fit for heavy transactions in the specified applications.
NoSQL: NoSQL databases are still not considered comparable under high load for complex transactional applications.

SQL: SQL databases provide excellent support from their vendors.
NoSQL: NoSQL databases still rely on community support. Only a few experts are available for setting up and deploying large-scale NoSQL deployments.

SQL: SQL databases focus on the ACID properties: Atomicity, Consistency, Isolation and Durability.
NoSQL: NoSQL databases focus on the CAP properties: Consistency, Availability and Partition tolerance.

SQL: SQL databases can be classified as open source or closed source, based on their vendors.
NoSQL: NoSQL databases are classified based on the storage type. NoSQL databases are open source by default.

Why NoSQL for Agile?

The above comparison shows that NoSQL document databases completely support agile development. They are schema-less and do not focus entirely on data modelling. Instead, NoSQL defers to the applications and services, so developers get a better idea of how data can be modeled. NoSQL defines the data model as the application model.


MongoDB Installation

Throughout this tutorial, we will focus on examples using MongoDB, as it is among the most widely used NoSQL document databases.


7. Agile Data Science - NoSQL and Dataflow Programming

There are times when the data is unavailable in relational format and we need to keep it transactional with the help of NoSQL databases. In this chapter, we will focus on the dataflow of NoSQL. We will also learn how it operates with a combination of agile and data science.

One of the major reasons to use NoSQL with agile is to increase speed in the face of market competition. The following reasons show how NoSQL is the best fit for the agile software methodology:

Fewer Barriers
Changing a model that is currently mid-stream has real costs, even in the case of agile development. With NoSQL, the users work with aggregate data instead of wasting time normalizing data. The main point is to get something done and working, with the goal of perfecting the data model along the way.

Increased Scalability
Whenever an organization creates a product, it lays more focus on scalability. NoSQL is always known for its scalability, and it works better when designed with horizontal scalability.

Ability to Leverage Data
NoSQL has a schema-less data model that allows the user to readily use volumes of data with several parameters of variability and velocity. When considering a choice of technology, you should always consider the one that leverages the data at a greater scale.

Dataflow of NoSQL

Let us consider the following example, which shows how a data model is focused on creating an RDBMS schema. Following are the different requirements of the schema:

User identification should be listed.
Every user should have at least one mandatory skill.
The details of every user's experience should be maintained properly.

The user table is normalized into three separate tables:

Users
User skills
User experience

The complexity increases while querying the database, and the time consumption that comes with increased normalization is not good for the agile methodology. The same schema can be designed with a NoSQL database, as sketched below. NoSQL maintains the structure in JSON format, which is lightweight in structure. With JSON, applications can store objects with nested data as single documents.
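A minimal sketch of such a denormalized document, assuming a local MongoDB instance and the pymongo driver, is shown below; all field names and values are invented for illustration.

from pymongo import MongoClient

# Connect to a locally running MongoDB instance (assumes mongod is started).
client = MongoClient("mongodb://localhost:27017/")
db = client["mydb"]

# One self-contained document replaces the three normalized tables:
# skills and experience are nested inside the user itself.
user = {
    "user_id": 1,
    "name": "Jane Doe",                      # invented example values
    "skills": ["python", "mongodb"],
    "experience": [
        {"company": "Acme", "years": 3},
        {"company": "Globex", "years": 2},
    ],
}

db.users.insert_one(user)

# Fetching the user returns skills and experience in a single query.
print(db.users.find_one({"user_id": 1}))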

8. Agile Data Science - Collecting and Displaying Records

In this chapter, we will focus on the JSON structure, which forms part of the agile methodology. MongoDB is a widely used NoSQL data store and operates easily for collecting and displaying records.

Step 1
This step involves establishing a connection with MongoDB for creating the collection and the specified data model. All you need to execute is the mongod command to start the server and the mongo command to connect to the specified terminal.

Step 2
Create a new database for creating records in JSON format. For now, we are creating a dummy database named mydb.

>use mydb
switched to db mydb
>db
mydb
>show dbs
local   0.78125GB
test    0.23012GB
>db.user.insert({"name":"Agile Data Science"})
>show dbs
local   0.78125GB
mydb    0.23012GB
test    0.23012GB

Step 3
Creating a collection is mandatory to get the list of records. This feature is beneficial for data science research and outputs.

>use test
switched to db test
>db.createCollection("mycollection")
{ "ok" : 1 }
>show collections
mycollection
system.indexes
>db.createCollection("mycol", { capped : true, autoIndexId : true, size : 6142800, max : 10000 })
{ "ok" : 1 }
>db.agiledatascience.insert({"name" : "demoname"})
>show collections
agiledatascience
mycol
mycollection
system.indexes
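The same records can be collected and displayed from Python rather than the mongo shell. The following is a minimal sketch with the pymongo driver, assuming mongod is running locally and the mydb database from Step 2 exists:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["mydb"]

# Display every record collected in the user collection.
for doc in db.user.find():
    print(doc)

# Display only matching records, hiding the internal _id field.
for doc in db.user.find({"name": "Agile Data Science"}, {"_id": 0}):
    print(doc)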

9. Agile Data Science - Data Visualization

Data visualization plays a very important role in data science. We can consider data visualization as a module of data science. Data science includes more than building predictive models; it includes the explanation of models and using them to understand data and make decisions. Data visualization is an integral part of presenting data in the most convincing way.

From the data science point of view, data visualization is a highlighting feature that shows changes and trends.

Consider the following guidelines for effective data visualization:

Position data along a common scale.
Bars are more effective than circles and squares.
Proper color should be used for scatter plots.
Use a pie chart to show proportions.
Sunburst visualization is more effective for hierarchical plots.

Agile needs a simple scripting language for data visualization, and in collaboration with data science, Python is the suggested language for data visualization.

Example 1
The following example demonstrates the data visualization of GDP calculated in specific years. Matplotlib is the best library for data visualization in Python.
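With the Anaconda distribution used in this tutorial, matplotlib typically comes preinstalled; on a plain Python setup it can be installed from the command line:

pip install matplotlib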

Consider the following code to understand this:

import matplotlib.pyplot as plt

years = [1950, 1960, 1970, 1980, 1990, 2000, 2010]
gdp = [300.2, 543.3, 1075.9, 2862.5, 5979.6, 10289.7, 14958.3]

# create a line chart, years on x-axis, gdp on y-axis
plt.plot(years, gdp, color='green', marker='o', linestyle='solid')

# add a title
plt.title("Nominal GDP")

# add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()

Output
The above code generates the following output:

There are many ways to customize the charts with axis labels, line styles and point markers. Let us focus on the next example, which demonstrates better data visualization. These results can be used for better output.

Example 2

import datetime
import random
import matplotlib.pyplot as plt

# make up some data
x = [datetime.datetime.now() + datetime.timedelta(hours=i) for i in range(12)]
y = [i + random.gauss(0, 1) for i, _ in enumerate(x)]

# plot
plt.plot(x, y)

# beautify the x-labels
plt.gcf().autofmt_xdate()
plt.show()

Output
The above code generates the following output:

10. Agile Data Science - Data Enrichment

Data enrichment refers to a range of processes used to enhance, refine and improve raw data. It refers to useful data transformation (raw data to useful information). The process of data enrichment focuses on making data a valuable asset for a modern business or enterprise.

The most common data enrichment process includes the correction of spelling mistakes or typographical errors in a database through the use of specific decision algorithms. Data enrichment tools add useful information to simple data tables.

Consider the following code for the spell correction of words:

import re
from collections import Counter

def words(text):
    return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    "Probability of `word`."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for `word`."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate possible spelling corrections for `word`."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

print(correction('speling'))
print(correction('korrectud'))

In this program, we match against big.txt, which contains correctly spelled words. Words are matched against the words included in the text file, and the appropriate results are printed accordingly.

Output
The above code will generate the following output:

spelling
corrected

11. Agile Data Science - Working with Reports

In this chapter, we will learn about report creation, which is an important module of the agile methodology. Agile sprints turn the chart pages created by visualization into full-blown reports. With reports, charts become interactive, static pages become dynamic, and the data becomes networked. The characteristics of the reports stage of the data value pyramid are shown below:

We will lay more stress on creating a CSV file, which can be used as a report for data science analysis and for drawing conclusions. Although agile focuses on less documentation, generating reports to record the progress of product development is always considered.

import csv

def csv_writer(data, path):
    """
    Write data to a CSV file path
    """
    with open(path, "w", newline='') as csv_file:
        writer = csv.writer(csv_file, delimiter=',')
        for line in data:
            writer.writerow(line)

if __name__ == "__main__":

37 data = ["first_name,last_name,city".split(","), "Tyrese,Hirthe,Strackeport".split(","), "Jules,Dicki,Lake Nickolasville".split(","), "Dedric,Medhurst,Stiedemannberg".split(",") ] path = "output.csv" csv_writer(data, path) The above code will help you generate the csv file as shown below: Let us consider the following benefits of csv (comma- separated values) reports: It is human friendly and easy to edit manually. It is simple to implement and parse CSV can be processed in all applications. It is smaller and faster to handle. CSV follows a standard format. It provides straightforward schema for data scientists. 33

12. Agile Data Science - Role of Predictions

In this chapter, we will learn about the role of predictions in agile data science. Interactive reports expose different aspects of data. Predictions form the fourth layer of the agile sprint. When making predictions, we always refer to past data and use it as the basis of inference for future iterations. In this complete process, we transition data from the batch processing of historical data to real-time data about the future.

The role of predictions includes the following:

Predictions help in forecasting. Some forecasts are based on statistical inference; some predictions are based on the opinions of pundits.
Statistical inference is involved in predictions of all kinds.
Sometimes forecasts are accurate, while sometimes they are inaccurate.

Predictive Analytics

Predictive analytics includes a variety of statistical techniques from predictive modeling, machine learning and data mining, which analyze current and historical facts to make predictions about future and unknown events.

Predictive analytics requires training data. Training data includes independent and dependent features. Dependent features are the values a user is trying to predict. Independent features are the features used to predict the dependent features.

The study of features is called feature engineering; this is crucial to making predictions. Data visualization and exploratory data analysis are parts of feature engineering; these form the core of agile data science.

Making Predictions

There are two ways of making predictions in agile data science:

Regression
Classification

Building a regression or a classification model completely depends on the business requirements and their analysis. Prediction of a continuous variable leads to a regression model, and prediction of categorical variables leads to a classification model.

Regression
Regression considers examples that comprise features and thereby produces a numeric output.

Classification
Classification takes the input and produces a categorical classification.

Note: The example dataset that defines the input to statistical prediction and that enables the machine to learn is called training data.
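As a minimal sketch of the two styles, assuming scikit-learn is available, the toy arrays below are invented purely to contrast a numeric output with a categorical one:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Invented training data: one independent feature X, two dependent targets.
X = np.array([[1], [2], [3], [4], [5]])
y_numeric = np.array([1.9, 4.1, 6.0, 8.2, 9.9])  # continuous -> regression
y_label = np.array([0, 0, 0, 1, 1])              # categorical -> classification

# Regression produces a numeric output for a new example.
reg = LinearRegression().fit(X, y_numeric)
print(reg.predict([[6]]))  # roughly 12, since y grows by about 2 per unit of X

# Classification produces a category for a new example.
clf = LogisticRegression().fit(X, y_label)
print(clf.predict([[6]]))  # class 1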

13. Agile Data Science - Extracting Features with PySpark

In this chapter, we will learn about the application of extracting features with PySpark in agile data science.

Overview of Spark

Apache Spark can be defined as a fast, real-time processing framework. It does computations to analyze data in real time. Apache Spark was introduced as a real-time stream processing system, and it can also take care of batch processing. Apache Spark supports interactive queries and iterative algorithms.

Spark is written in the Scala programming language. PySpark can be considered a combination of Python with Spark. PySpark offers the PySpark shell, which links the Python API to the Spark core and initializes the Spark context. Most data scientists use PySpark for tracking features, as discussed in the previous chapter.

In this example, we will focus on the transformations used to build a dataset called counts and save it to a particular file.

text_file = sc.textFile("hdfs://...")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Using PySpark, a user can work with RDDs in the Python programming language. PySpark's built-in libraries, which cover the basics of data-driven documents and components, help in this.
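The word count above demonstrates RDD transformations. For feature extraction proper, a hedged sketch using the Tokenizer and HashingTF transformers from pyspark.ml.feature could look like the following; the two sample documents are invented and a SparkSession is assumed:

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF

spark = SparkSession.builder.appName("features").getOrCreate()

# Invented example data: two short documents, each with an id.
df = spark.createDataFrame(
    [(0, "agile data science"), (1, "agile methodology with spark")],
    ["id", "text"])

# Split each document into words, then hash the words into feature vectors.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words = tokenizer.transform(df)

hashing_tf = HashingTF(inputCol="words", outputCol="features", numFeatures=32)
features = hashing_tf.transform(words)

features.select("id", "features").show(truncate=False)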

14. Agile Data Science - Building a Regression Model

Logistic regression refers to the machine learning algorithm used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable consisting of data coded as 1 or 0 (the Boolean values of true and false).

In this chapter, we will focus on developing a regression model in Python using a continuous variable. The example for the linear regression model will focus on data exploration from a CSV file. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit.

import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt

plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import seaborn as sns

sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)

data = pd.read_csv('bank.csv', header=0)
data = data.dropna()
print(data.shape)
print(list(data.columns))

Follow these steps to implement the above code in Anaconda Navigator with the Jupyter Notebook:

Step 1: Launch the Jupyter Notebook with the Anaconda Navigator.

Step 2: Upload the CSV file to get the output of the regression model in a systematic manner.

Step 3: Create a new file and execute the above-mentioned code to get the desired output.

15. Agile Data Science - Deploying a Predictive System

In this example, we will learn how to create and deploy a predictive model that helps in the prediction of house prices, using a Python script. The important frameworks used for the deployment of the predictive system include Anaconda and the Jupyter Notebook.

Follow these steps to deploy a predictive system:

Step 1: Implement the following code to convert values from CSV files to associated values.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import mpl_toolkits

%matplotlib inline

data = pd.read_csv("kc_house_data.csv")
data.head()

The above code generates the following output:

Step 2: Execute the describe function to get the data types included in the attributes of the CSV file.

data.describe()

Step 3: We can drop the associated values based on the deployment of the predictive model that we created.

train1 = data.drop(['id', 'price'], axis=1)
train1.head()

Step 4: You can visualize the data as per the records. The data can be used for data science analysis and the output of white papers.

data.floors.value_counts().plot(kind='bar')
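The steps above stop at data exploration. As a hedged sketch of the prediction step itself, assuming scikit-learn and the same kc_house_data.csv, a linear regression on a few numeric columns could look like this; the chosen feature columns are an assumption about the dataset:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("kc_house_data.csv")

# Assumed numeric feature columns from the King County housing data.
features = ["bedrooms", "bathrooms", "sqft_living", "floors"]
X = data[features]
y = data["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LinearRegression()
model.fit(X_train, y_train)

# The R^2 score on held-out data gives a first check before deployment.
print(model.score(X_test, y_test))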


16. Agile Data Science - SparkML

Spark's machine learning library, called Spark ML or MLlib, consists of common learning algorithms, including classification, regression, clustering and collaborative filtering.

Why learn Spark ML for Agile?

Spark is becoming the de facto platform for building machine learning algorithms and applications. Developers work on Spark to implement machine learning algorithms in a scalable and concise manner within the Spark framework. We will learn the concepts of machine learning, its utilities and its algorithms with this framework. Agile always opts for a framework that delivers short and quick results.

ML Algorithms

ML algorithms include common learning algorithms such as classification, regression, clustering and collaborative filtering.

Features
It includes feature extraction, transformation, dimension reduction and selection.

Pipelines
Pipelines provide tools for constructing, evaluating and tuning machine-learning pipelines.

Popular Algorithms
Following are a few popular algorithms:

Basic Statistics
Regression
Classification
Recommendation System
Clustering
Dimensionality Reduction
Feature Extraction
Optimization

Recommendation System
A recommendation system is a subclass of information filtering system that seeks to predict the rating or preference that a user would give to an item. Recommendation systems include various filtering approaches, which are used as follows:

Collaborative Filtering
It includes building a model based on past behavior as well as similar decisions made by other users. This specific filtering model is used to predict items that a user is interested in.

Content-based Filtering
It includes filtering the discrete characteristics of an item in order to recommend and add new items with similar properties.

In our subsequent chapters, we will focus on the use of a recommendation system for solving a specific problem and improving prediction performance from the agile methodology point of view.
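As a hedged sketch of collaborative filtering with Spark ML, the ALS estimator from pyspark.ml.recommendation can be trained on (user, item, rating) triples; the tiny ratings set below is invented and a SparkSession is assumed:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommender").getOrCreate()

# Invented (user, item, rating) triples standing in for real behavior data.
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 5.0), (1, 2, 3.0), (2, 1, 4.0)],
    ["userId", "itemId", "rating"])

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=5, coldStartStrategy="drop")
model = als.fit(ratings)

# Recommend the top 2 items for every user.
model.recommendForAllUsers(2).show(truncate=False)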

17. Agile Data Science - Fixing Prediction Problem

In this chapter, we will focus on fixing a prediction problem with the help of a specific scenario.

Consider that a company wants to automate the evaluation of loan eligibility based on customer details provided through an online application form. The details include the name of the customer, gender, marital status, loan amount and other mandatory details. The details are recorded in a CSV file as shown below:

Execute the following code to evaluate the prediction problem:

import pandas as pd
from sklearn import ensemble
import numpy as np
from scipy.stats import mode
from sklearn import preprocessing, model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# loading the dataset
data = pd.read_csv('train.csv', index_col='Loan_ID')

def num_missing(x):
    return sum(x.isnull())

# imputing the missing values from the data
data['Gender'].fillna(mode(list(data['Gender'])).mode[0], inplace=True)
data['Married'].fillna(mode(list(data['Married'])).mode[0], inplace=True)
data['Self_Employed'].fillna(mode(list(data['Self_Employed'])).mode[0], inplace=True)
# print(data.apply(num_missing, axis=0))

# imputing the mean for the missing value
data['LoanAmount'].fillna(data['LoanAmount'].mean(), inplace=True)
mapping = {'0': 0, '1': 1, '2': 2, '3+': 3}
data = data.replace({'Dependents': mapping})
data['Dependents'].fillna(data['Dependents'].mean(), inplace=True)
data['Loan_Amount_Term'].fillna(method='ffill', inplace=True)
data['Credit_History'].fillna(method='ffill', inplace=True)
print(data.apply(num_missing, axis=0))

# converting the categorical data to numbers using the label encoder
var_mod = ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
le = LabelEncoder()
for i in var_mod:
    le.fit(list(data[i].values))
    data[i] = le.transform(list(data[i]))

# train test split
x = ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area', 'LoanAmount',
     'Loan_Amount_Term', 'Credit_History', 'Dependents']
y = ['Loan_Status']
print(data[x])
X_train, X_test, y_train, y_test = model_selection.train_test_split(data[x], data[y],
                                                                    test_size=0.2)

# Random forest classifier
# clf = ensemble.RandomForestClassifier(n_estimators=100, criterion='gini',
#                                       max_depth=3, max_features='auto', n_jobs=-1)
clf = ensemble.RandomForestClassifier(n_estimators=200, max_features=3,
                                      min_samples_split=5, oob_score=True,
                                      n_jobs=-1, criterion='entropy')
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)

Output
The above code generates the following output.

18. Agile Data Science - Improving Prediction Performance

In this chapter, we will focus on building a model that helps in the prediction of student performance, with a number of attributes included in it. The focus is to display the failure results of students in an examination.

Process
The target value of assessment is G3. This value can be binned and further classified as failure or success: if the G3 value is greater than or equal to 10, the student passes the examination.

Example
Consider the following example, wherein code is executed to predict the performance of students:

import numpy as np
import pandas as pd

""" Read data file as DataFrame """
df = pd.read_csv("student-mat.csv", sep=";")

""" Import ML helpers """
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC  # Support Vector Machine Classifier model

""" Split Data into Training and Testing Sets """
def split_data(X, Y):
    return train_test_split(X, Y, test_size=0.2, random_state=17)

""" Confusion Matrix """
def confuse(y_true, y_pred):
    cm = confusion_matrix(y_true=y_true, y_pred=y_pred)
    # print("\nConfusion Matrix: \n", cm)

    fpr(cm)
    ffr(cm)

""" False Pass Rate """
def fpr(confusion_matrix):
    fp = confusion_matrix[0][1]
    tf = confusion_matrix[0][0]
    rate = float(fp) / (fp + tf)
    print("False Pass Rate: ", rate)

""" False Fail Rate """
def ffr(confusion_matrix):
    ff = confusion_matrix[1][0]
    tp = confusion_matrix[1][1]
    rate = float(ff) / (ff + tp)
    print("False Fail Rate: ", rate)
    return rate

""" Train Model and Print Score """
def train_and_score(X, y):
    X_train, X_test, y_train, y_test = split_data(X, y)
    clf = Pipeline([
        ('reduce_dim', SelectKBest(chi2, k=2)),
        ('train', LinearSVC(C=100))
    ])
    scores = cross_val_score(clf, X_train, y_train, cv=5, n_jobs=2)
    print("Mean Model Accuracy:", np.array(scores).mean())
    clf.fit(X_train, y_train)
    confuse(y_test, clf.predict(X_test))
    print()

""" Main Program """

def main():
    print("\nStudent Performance Prediction")

    # For each feature, encode to categorical values
    class_le = LabelEncoder()
    for column in df[["school", "sex", "address", "famsize", "Pstatus", "Mjob",
                      "Fjob", "reason", "guardian", "schoolsup", "famsup", "paid",
                      "activities", "nursery", "higher", "internet",
                      "romantic"]].columns:
        df[column] = class_le.fit_transform(df[column].values)

    # Encode G1, G2, G3 as pass or fail binary values
    for i, row in df.iterrows():
        if row["G1"] >= 10:
            df["G1"][i] = 1
        else:
            df["G1"][i] = 0

        if row["G2"] >= 10:
            df["G2"][i] = 1
        else:
            df["G2"][i] = 0

        if row["G3"] >= 10:
            df["G3"][i] = 1
        else:
            df["G3"][i] = 0

    # Target values are G3
    y = df.pop("G3")

    # Feature set is remaining features
    X = df

    print("\n\nModel Accuracy Knowing G1 & G2 Scores")
    print("=====================================")
    train_and_score(X, y)

    # Remove grade report 2
    X.drop(["G2"], axis=1, inplace=True)
    print("\n\nModel Accuracy Knowing Only G1 Score")
    print("=====================================")
    train_and_score(X, y)

    # Remove grade report 1
    X.drop(["G1"], axis=1, inplace=True)
    print("\n\nModel Accuracy Without Knowing Scores")
    print("=====================================")
    train_and_score(X, y)

main()

Output
The above code generates the output as shown below. The prediction is then treated with reference to only one variable. With reference to one variable, the student performance prediction is as shown below:

19. Agile Data Science - Creating a Better Scene with Agile and Data Science

The agile methodology helps organizations adapt to change, compete in the market and build high-quality products. It is observed that organizations mature with the agile methodology, with increasing change in requirements from clients. Compiling and synchronizing data with the agile teams of an organization is significant for rolling up data across the required portfolio.

Build a Better Plan

Standardized agile performance solely depends on the plan. An ordered data schema empowers the productivity, quality and responsiveness of the organization's progress. The level of data consistency is maintained with historical and real-time scenarios.

Consider the following diagram to understand the data science experiment cycle:

Data science involves the analysis of requirements, followed by the creation of algorithms based on them. Once the algorithms are designed along with the environmental setup, a user can create experiments and collect data for better analysis. This ideology computes the last sprint of agile, which is called actions.

Actions involve all the mandatory tasks for the last sprint, or level, of the agile methodology. The track of data science phases (with respect to the life cycle) can be maintained with story cards as action items.

Predictive Analysis and Big Data

The future of planning completely lies in the customization of data reports with the data collected from analysis. It will also include manipulation with big data analysis. With the help of big data, discrete pieces of information can be analyzed effectively by slicing and dicing the metrics of the organization. Analysis is always considered a better solution.

20. Agile Data Science - Implementation of Agile

There are various methodologies used in the agile development process. These methodologies can be used for the data science research process as well. The flowchart given below shows the different methodologies:

Scrum
In software development terms, Scrum means managing work with a small team and managing a specific project to reveal the strengths and weaknesses of the project.

Crystal Methodologies
Crystal methodologies include innovative techniques for product management and execution. With this method, teams can go about similar tasks in different ways. The Crystal family is one of the easiest methodologies to apply.

Dynamic Software Development Method
This delivery framework is primarily used to implement the current knowledge system in software methodology.

Feature Driven Development
The focus of this development life cycle is on the features involved in the project. It works best for domain object modeling, and for code and feature development with ownership.

Lean Software Development
This method aims at increasing the speed of software development at low cost, and it focuses the team on delivering specific value to the customer.

Extreme Programming
Extreme programming is a unique software development methodology that focuses on improving software quality. It is effective when the customer is not sure about the functionality of a project.

Agile methodologies are taking root in the data science stream, and agile is considered an important software methodology. With agile, self-organizing, cross-functional teams can work together in an effective manner. As mentioned, there are six main categories of agile development, and each one of them can be streamed with data science as per the requirements. Data science involves an iterative process for statistical insights. Agile helps in breaking down the data science modules and in processing iterations and sprints in an effective manner.

The process of agile data science is an amazing way of understanding how and why the data science module is implemented. It solves problems in a creative manner.


More information

Testing in the Agile World

Testing in the Agile World Testing in the Agile World John Fodeh Solution Architect, Global Testing Practice 2008 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. Outline

More information

Deploying, Managing and Reusing R Models in an Enterprise Environment

Deploying, Managing and Reusing R Models in an Enterprise Environment Deploying, Managing and Reusing R Models in an Enterprise Environment Making Data Science Accessible to a Wider Audience Lou Bajuk-Yorgan, Sr. Director, Product Management Streaming and Advanced Analytics

More information

Data Analyst Nanodegree Syllabus

Data Analyst Nanodegree Syllabus Data Analyst Nanodegree Syllabus Discover Insights from Data with Python, R, SQL, and Tableau Before You Start Prerequisites : In order to succeed in this program, we recommend having experience working

More information

SAP Lumira is known as a visual intelligence tool that is used to visualize data and create stories to provide graphical details of the data.

SAP Lumira is known as a visual intelligence tool that is used to visualize data and create stories to provide graphical details of the data. About the Tutorial SAP Lumira is known as a visual intelligence tool that is used to visualize data and create stories to provide graphical details of the data. Data is entered in Lumira as dataset and

More information

DATA MINING AND WAREHOUSING

DATA MINING AND WAREHOUSING DATA MINING AND WAREHOUSING Qno Question Answer 1 Define data warehouse? Data warehouse is a subject oriented, integrated, time-variant, and nonvolatile collection of data that supports management's decision-making

More information

Before you start proceeding with this tutorial, we are assuming that you are already aware about the basics of Web development.

Before you start proceeding with this tutorial, we are assuming that you are already aware about the basics of Web development. About the Tutorial This tutorial will give you an idea of how to get started with SharePoint development. Microsoft SharePoint is a browser-based collaboration, document management platform and content

More information

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight ESG Lab Review InterSystems Data Platform: A Unified, Efficient Data Platform for Fast Business Insight Date: April 218 Author: Kerry Dolan, Senior IT Validation Analyst Abstract Enterprise Strategy Group

More information

The Seven Steps to Implement DataOps

The Seven Steps to Implement DataOps The Seven Steps to Implement Ops ABSTRACT analytics teams challenged by inflexibility and poor quality have found that Ops can address these and many other obstacles. Ops includes tools and process improvements

More information

Adopting Agile Practices

Adopting Agile Practices Adopting Agile Practices Ian Charlton Managing Consultant ReleasePoint Software Testing Solutions ANZTB SIGIST (Perth) 30 November 2010 Tonight s Agenda What is Agile? Why is Agile Important to Testers?

More information

Progress DataDirect For Business Intelligence And Analytics Vendors

Progress DataDirect For Business Intelligence And Analytics Vendors Progress DataDirect For Business Intelligence And Analytics Vendors DATA SHEET FEATURES: Direction connection to a variety of SaaS and on-premises data sources via Progress DataDirect Hybrid Data Pipeline

More information

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer DBMS

About the Tutorial. Audience. Prerequisites. Copyright & Disclaimer DBMS About the Tutorial Database Management System or DBMS in short refers to the technology of storing and retrieving users data with utmost efficiency along with appropriate security measures. DBMS allows

More information

Talend Open Studio for MDM Web User Interface. User Guide 5.6.2

Talend Open Studio for MDM Web User Interface. User Guide 5.6.2 Talend Open Studio for MDM Web User Interface User Guide 5.6.2 Talend Open Studio for MDM Web User Interface Adapted for v5.6.2. Supersedes previous releases. Publication date: May 12, 2015 Copyleft This

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

The main website for Henrico County, henrico.us, received a complete visual and structural

The main website for Henrico County, henrico.us, received a complete visual and structural Page 1 1. Program Overview The main website for Henrico County, henrico.us, received a complete visual and structural overhaul, which was completed in May of 2016. The goal of the project was to update

More information

Big Data Insights Using Analytics

Big Data Insights Using Analytics Governors State University OPUS Open Portal to University Scholarship All Capstone Projects Student Capstone Projects Fall 2015 Big Data Insights Using Analytics Naga Krishna Reddy Muppidi Governors State

More information

SAFe Atlassian Style (Updated version with SAFe 4.5) Whitepapers & Handouts

SAFe Atlassian Style (Updated version with SAFe 4.5) Whitepapers & Handouts SAFe Atlassian Style (Updated version with SAFe 4.5) Whitepapers & Handouts Exported on 09/12/2017 1 Table of Contents 1 Table of Contents...2 2 Abstract...4 3 Who uses SAFe and Why?...5 4 Understanding

More information

Introduction to Machine Learning. Useful tools: Python, NumPy, scikit-learn

Introduction to Machine Learning. Useful tools: Python, NumPy, scikit-learn Introduction to Machine Learning Useful tools: Python, NumPy, scikit-learn Antonio Sutera and Jean-Michel Begon September 29, 2016 2 / 37 How to install Python? Download and use the Anaconda python distribution

More information

Shine a Light on Dark Data with Vertica Flex Tables

Shine a Light on Dark Data with Vertica Flex Tables White Paper Analytics and Big Data Shine a Light on Dark Data with Vertica Flex Tables Hidden within the dark recesses of your enterprise lurks dark data, information that exists but is forgotten, unused,

More information

About Intellipaat. About the Course. Why Take This Course?

About Intellipaat. About the Course. Why Take This Course? About Intellipaat Intellipaat is a fast growing professional training provider that is offering training in over 150 most sought-after tools and technologies. We have a learner base of 700,000 in over

More information

Certified Business Analysis Professional (CBAP )

Certified Business Analysis Professional (CBAP ) Certified Business Analysis Professional (CBAP ) 3 Days Classroom Training PHILIPPINES :: MALAYSIA :: VIETNAM :: SINGAPORE :: INDIA Content Certified Business Analysis Professional - (CBAP ) Introduction

More information

Q1) Describe business intelligence system development phases? (6 marks)

Q1) Describe business intelligence system development phases? (6 marks) BUISINESS ANALYTICS AND INTELLIGENCE SOLVED QUESTIONS Q1) Describe business intelligence system development phases? (6 marks) The 4 phases of BI system development are as follow: Analysis phase Design

More information

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda

1 Dulcian, Inc., 2001 All rights reserved. Oracle9i Data Warehouse Review. Agenda Agenda Oracle9i Warehouse Review Dulcian, Inc. Oracle9i Server OLAP Server Analytical SQL Mining ETL Infrastructure 9i Warehouse Builder Oracle 9i Server Overview E-Business Intelligence Platform 9i Server:

More information

Integration With the Business Modeler

Integration With the Business Modeler Decision Framework, J. Duggan Research Note 11 September 2003 Evaluating OOA&D Functionality Criteria Looking at nine criteria will help you evaluate the functionality of object-oriented analysis and design

More information

Learning Objectives for Data Concept and Visualization

Learning Objectives for Data Concept and Visualization Learning Objectives for Data Concept and Visualization Assignment 1: Data Quality Concept and Impact of Data Quality Summarize concepts of data quality. Understand and describe the impact of data on actuarial

More information

2 The IBM Data Governance Unified Process

2 The IBM Data Governance Unified Process 2 The IBM Data Governance Unified Process The benefits of a commitment to a comprehensive enterprise Data Governance initiative are many and varied, and so are the challenges to achieving strong Data Governance.

More information

(Complete Package) We are ready to serve Latest Testing Trends, Are you ready to learn? New Batches Info

(Complete Package) We are ready to serve Latest Testing Trends, Are you ready to learn? New Batches Info (Complete Package) WEB APP TESTING DB TESTING We are ready to serve Latest Testing Trends, Are you ready to learn? New Batches Info START DATE : TIMINGS : DURATION : TYPE OF BATCH : FEE : FACULTY NAME

More information

Hands-On Lab. Agile Planning and Portfolio Management with Team Foundation Server Lab version: Last updated: 11/25/2013

Hands-On Lab. Agile Planning and Portfolio Management with Team Foundation Server Lab version: Last updated: 11/25/2013 Hands-On Lab Agile Planning and Portfolio Management with Team Foundation Server 2013 Lab version: 12.0.21005.1 Last updated: 11/25/2013 CONTENTS OVERVIEW... 3 EXERCISE 1: AGILE PROJECT MANAGEMENT... 4

More information

BEST BIG DATA CERTIFICATIONS

BEST BIG DATA CERTIFICATIONS VALIANCE INSIGHTS BIG DATA BEST BIG DATA CERTIFICATIONS email : info@valiancesolutions.com website : www.valiancesolutions.com VALIANCE SOLUTIONS Analytics: Optimizing Certificate Engineer Engineering

More information

*ANSWERS * **********************************

*ANSWERS * ********************************** CS/183/17/SS07 UNIVERSITY OF SURREY BSc Programmes in Computing Level 1 Examination CS183: Systems Analysis and Design Time allowed: 2 hours Spring Semester 2007 Answer ALL questions in Section A and TWO

More information

Big Data Specialized Studies

Big Data Specialized Studies Information Technologies Programs Big Data Specialized Studies Accelerate Your Career extension.uci.edu/bigdata Offered in partnership with University of California, Irvine Extension s professional certificate

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Hybrid Data Platform

Hybrid Data Platform UniConnect-Powered Data Aggregation Across Enterprise Data Warehouses and Big Data Storage Platforms A Percipient Technology White Paper Author: Ai Meun Lim Chief Product Officer Updated Aug 2017 2017,

More information

Data Virtualization Implementation Methodology and Best Practices

Data Virtualization Implementation Methodology and Best Practices White Paper Data Virtualization Implementation Methodology and Best Practices INTRODUCTION Cisco s proven Data Virtualization Implementation Methodology and Best Practices is compiled from our successful

More information

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Big Data Technology Ecosystem Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara Agenda End-to-End Data Delivery Platform Ecosystem of Data Technologies Mapping an End-to-End Solution Case

More information

FACETs. Technical Report 05/19/2010

FACETs. Technical Report 05/19/2010 F3 FACETs Technical Report 05/19/2010 PROJECT OVERVIEW... 4 BASIC REQUIREMENTS... 4 CONSTRAINTS... 5 DEVELOPMENT PROCESS... 5 PLANNED/ACTUAL SCHEDULE... 6 SYSTEM DESIGN... 6 PRODUCT AND PROCESS METRICS...

More information

You should have a basic understanding of Relational concepts and basic SQL. It will be good if you have worked with any other RDBMS product.

You should have a basic understanding of Relational concepts and basic SQL. It will be good if you have worked with any other RDBMS product. About the Tutorial is a popular Relational Database Management System (RDBMS) suitable for large data warehousing applications. It is capable of handling large volumes of data and is highly scalable. This

More information

Cisco Digital Media System: Simply Compelling Communications

Cisco Digital Media System: Simply Compelling Communications Cisco Digital Media System: Simply Compelling Communications Executive Summary The Cisco Digital Media System enables organizations to use high-quality digital media to easily connect customers, employees,

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

Introduction to BEST Viewpoints

Introduction to BEST Viewpoints Introduction to BEST Viewpoints This is not all but just one of the documentation files included in BEST Viewpoints. Introduction BEST Viewpoints is a user friendly data manipulation and analysis application

More information

An Introduction to Big Data Formats

An Introduction to Big Data Formats Introduction to Big Data Formats 1 An Introduction to Big Data Formats Understanding Avro, Parquet, and ORC WHITE PAPER Introduction to Big Data Formats 2 TABLE OF TABLE OF CONTENTS CONTENTS INTRODUCTION

More information

CS639: Data Management for Data Science. Lecture 1: Intro to Data Science and Course Overview. Theodoros Rekatsinas

CS639: Data Management for Data Science. Lecture 1: Intro to Data Science and Course Overview. Theodoros Rekatsinas CS639: Data Management for Data Science Lecture 1: Intro to Data Science and Course Overview Theodoros Rekatsinas 1 2 Big science is data driven. 3 Increasingly many companies see themselves as data driven.

More information

This is a small tutorial where we will cover all the basic steps needed to start with Balsamiq Mockups.

This is a small tutorial where we will cover all the basic steps needed to start with Balsamiq Mockups. About the Tutorial Balsamiq Mockups is an effective tool for presenting the software requirements in the form of wireframes. This helps the software development team to visualize how the software project

More information

SAFe AGILE TRAINING COURSES

SAFe AGILE TRAINING COURSES SAFe AGILE TRAINING COURSES INDEX INTRODUCTION COURSE Implementing SAfe Leading SAFe SAFe for Teams SAFe Scrum Master CERTIFICATION SAFe Program Consultant SAFe Agilist SAFe Practitioner SAFe Scrum Master

More information

Provide Real-Time Data To Financial Applications

Provide Real-Time Data To Financial Applications Provide Real-Time Data To Financial Applications DATA SHEET Introduction Companies typically build numerous internal applications and complex APIs for enterprise data access. These APIs are often engineered

More information

Python for Data Analysis. Prof.Sushila Aghav-Palwe Assistant Professor MIT

Python for Data Analysis. Prof.Sushila Aghav-Palwe Assistant Professor MIT Python for Data Analysis Prof.Sushila Aghav-Palwe Assistant Professor MIT Four steps to apply data analytics: 1. Define your Objective What are you trying to achieve? What could the result look like? 2.

More information

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems

WKU-MIS-B10 Data Management: Warehousing, Analyzing, Mining, and Visualization. Management Information Systems Management Information Systems Management Information Systems B10. Data Management: Warehousing, Analyzing, Mining, and Visualization Code: 166137-01+02 Course: Management Information Systems Period: Spring

More information

Business Club. Decision Trees

Business Club. Decision Trees Business Club Decision Trees Business Club Analytics Team December 2017 Index 1. Motivation- A Case Study 2. The Trees a. What is a decision tree b. Representation 3. Regression v/s Classification 4. Building

More information

Final Paper/Best Practice/Tutorial Advantages OF BDD Testing

Final Paper/Best Practice/Tutorial Advantages OF BDD Testing Final Paper/Best Practice/Tutorial Advantages OF BDD Testing Preeti Khandokar Test Manager Datamatics Global Solutions Ltd Table of Contents Table of Contents... 2 Abstract... 3 Introduction... 3 Solution:...

More information

Data Management Glossary

Data Management Glossary Data Management Glossary A Access path: The route through a system by which data is found, accessed and retrieved Agile methodology: An approach to software development which takes incremental, iterative

More information

Qlik Sense Enterprise architecture and scalability

Qlik Sense Enterprise architecture and scalability White Paper Qlik Sense Enterprise architecture and scalability June, 2017 qlik.com Platform Qlik Sense is an analytics platform powered by an associative, in-memory analytics engine. Based on users selections,

More information

Story Refinement How to write and refine your stories so that your team can reach DONE by the end of your sprint!

Story Refinement How to write and refine your stories so that your team can reach DONE by the end of your sprint! + Story Refinement How to write and refine your stories so that your team can reach DONE by the end of your sprint! Tonya McCaulley Director of Training ROME Agile + About Your Speaker Tonya McCaulley

More information

10 Steps to Building an Architecture for Space Surveillance Projects. Eric A. Barnhart, M.S.

10 Steps to Building an Architecture for Space Surveillance Projects. Eric A. Barnhart, M.S. 10 Steps to Building an Architecture for Space Surveillance Projects Eric A. Barnhart, M.S. Eric.Barnhart@harris.com Howard D. Gans, Ph.D. Howard.Gans@harris.com Harris Corporation, Space and Intelligence

More information

Six Core Data Wrangling Activities. An introductory guide to data wrangling with Trifacta

Six Core Data Wrangling Activities. An introductory guide to data wrangling with Trifacta Six Core Data Wrangling Activities An introductory guide to data wrangling with Trifacta Today s Data Driven Culture Are you inundated with data? Today, most organizations are collecting as much data in

More information

2015 Ed-Fi Alliance Summit Austin Texas, October 12-14, It all adds up Ed-Fi Alliance

2015 Ed-Fi Alliance Summit Austin Texas, October 12-14, It all adds up Ed-Fi Alliance 2015 Ed-Fi Alliance Summit Austin Texas, October 12-14, 2015 It all adds up. Sustainability and Ed-Fi Implementations 2 Session Overview Introduction (5 mins) Define the problem (10 min) Share In-Flight

More information

Building Self-Service BI Solutions with Power Query. Written By: Devin

Building Self-Service BI Solutions with Power Query. Written By: Devin Building Self-Service BI Solutions with Power Query Written By: Devin Knight DKnight@PragmaticWorks.com @Knight_Devin CONTENTS PAGE 3 PAGE 4 PAGE 5 PAGE 6 PAGE 7 PAGE 8 PAGE 9 PAGE 11 PAGE 17 PAGE 20 PAGE

More information

Sample Exam. Advanced Test Automation - Engineer

Sample Exam. Advanced Test Automation - Engineer Sample Exam Advanced Test Automation - Engineer Questions ASTQB Created - 2018 American Software Testing Qualifications Board Copyright Notice This document may be copied in its entirety, or extracts made,

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Xyleme Studio Data Sheet

Xyleme Studio Data Sheet XYLEME STUDIO DATA SHEET Xyleme Studio Data Sheet Rapid Single-Source Content Development Xyleme allows you to streamline and scale your content strategy while dramatically reducing the time to market

More information

The Kanban Applied Guide

The Kanban Applied Guide The Kanban Applied Guide Official Guide to Applying Kanban as a Process Framework May 2018 2018 Kanban Mentor P a g e 1 Table of Contents Purpose of the Kanban Applied Guide... 3 Kanban Applied Principles...

More information

SAS (Statistical Analysis Software/System)

SAS (Statistical Analysis Software/System) SAS (Statistical Analysis Software/System) SAS Adv. Analytics or Predictive Modelling:- Class Room: Training Fee & Duration : 30K & 3 Months Online Training Fee & Duration : 33K & 3 Months Learning SAS:

More information

Stages of Data Processing

Stages of Data Processing Data processing can be understood as the conversion of raw data into a meaningful and desired form. Basically, producing information that can be understood by the end user. So then, the question arises,

More information

Data Analysis and Data Science

Data Analysis and Data Science Data Analysis and Data Science CPS352: Database Systems Simon Miner Gordon College Last Revised: 4/29/15 Agenda Check-in Online Analytical Processing Data Science Homework 8 Check-in Online Analytical

More information

Rapid growth of massive datasets

Rapid growth of massive datasets Overview Rapid growth of massive datasets E.g., Online activity, Science, Sensor networks Data Distributed Clusters are Pervasive Data Distributed Computing Mature Methods for Common Problems e.g., classification,

More information

Lily 2.4 What s New Product Release Notes

Lily 2.4 What s New Product Release Notes Lily 2.4 What s New Product Release Notes WHAT S NEW IN LILY 2.4 2 Table of Contents Table of Contents... 2 Purpose and Overview of this Document... 3 Product Overview... 4 General... 5 Prerequisites...

More information

Pre-Requisites: CS2510. NU Core Designations: AD

Pre-Requisites: CS2510. NU Core Designations: AD DS4100: Data Collection, Integration and Analysis Teaches how to collect data from multiple sources and integrate them into consistent data sets. Explains how to use semi-automated and automated classification

More information

THE SCRUM FRAMEWORK 1

THE SCRUM FRAMEWORK 1 THE SCRUM FRAMEWORK 1 ROLES (1) Product Owner Represents the interests of all the stakeholders ROI objectives Prioritizes the product backlog Team Crossfunctional Self-managing Self-organizing 2 ROLES

More information

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch

Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch Nick Pentreath Nov / 14 / 16 Building a Scalable Recommender System with Apache Spark, Apache Kafka and Elasticsearch About @MLnick Principal Engineer, IBM Apache Spark PMC Focused on machine learning

More information

Video AI Alerts An Artificial Intelligence-Based Approach to Anomaly Detection and Root Cause Analysis for OTT Video Publishers

Video AI Alerts An Artificial Intelligence-Based Approach to Anomaly Detection and Root Cause Analysis for OTT Video Publishers Video AI Alerts An Artificial Intelligence-Based Approach to Anomaly Detection and Root Cause Analysis for OTT Video Publishers Live and on-demand programming delivered by over-the-top (OTT) will soon

More information

Enable IoT Solutions using Azure

Enable IoT Solutions using Azure Internet Of Things A WHITE PAPER SERIES Enable IoT Solutions using Azure 1 2 TABLE OF CONTENTS EXECUTIVE SUMMARY INTERNET OF THINGS GATEWAY EVENT INGESTION EVENT PERSISTENCE EVENT ACTIONS 3 SYNTEL S IoT

More information

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks

Asanka Padmakumara. ETL 2.0: Data Engineering with Azure Databricks Asanka Padmakumara ETL 2.0: Data Engineering with Azure Databricks Who am I? Asanka Padmakumara Business Intelligence Consultant, More than 8 years in BI and Data Warehousing A regular speaker in data

More information

BIS Database Management Systems.

BIS Database Management Systems. BIS 512 - Database Management Systems http://www.mis.boun.edu.tr/durahim/ Ahmet Onur Durahim Learning Objectives Database systems concepts Designing and implementing a database application Life of a Query

More information

Applied Machine Learning

Applied Machine Learning Applied Machine Learning Lab 3 Working with Text Data Overview In this lab, you will use R or Python to work with text data. Specifically, you will use code to clean text, remove stop words, and apply

More information