MATH36032 Problem Solving by Computer Data Science
[Figure: number of jobs listed on jobsite, Jan 2016 to Jan 2017. Vertical axis: No. of jobs (0 to 10000). Legend: MATLAB, Data, Data Science. Source: http://www.jobsite.co.uk/]
What is Data Science? (from Wikipedia)
An interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics, ...
What is Data Science?
- Math & Stat knowledge: calculus, statistics/probability, linear algebra
- Hacking skills: number/text manipulation and (vectorized) manipulation, algorithmic thinking, ...
- Substantive expertise/domain knowledge: knowledge related to specific facts
Data Science/Big Data: why now?
- We are generating more data than ever before.
- Technology for data collection and storage is improving.
Applications: Social media and search engines
- Personalised webpages
- How does Amazon know which items to recommend?
Applications: Retailers
- Personalised promotion offers
- Who should get what kind of offer?
Applications: Credit Card Fraud Detection
- Do these transactions look normal?
- This is my credit card statement. The transactions were made within two hours of my losing the card in Montreal in the summer of 2015.
More applications
- Insurance/Actuarial Science: how much should you charge your customers?
- Weather/climate forecasting: long-term prediction
- Finance: better prediction of stock prices
Big data leaked and generated
- Are the 2.6 terabytes of the Panama Papers big data?
- How about the 1 billion accounts leaked from Yahoo's database?
Data generated daily:
- More than 10 terabytes for most national meteorological centres
- More than 500 terabytes of data processed by Facebook
- More than 20 petabytes (2.0 × 10^16 bytes) handled by Google
Three V's of Big Data
- Volume: large quantity of data, big size of datasets
- Variety: many different types and forms of data, e.g. transactional data from ATMs, social media sites, emails, demographic data, tracking data from cell phones, etc.
- Velocity: data that is coming in at a very fast pace
These require many new software tools and technologies. Support for big data sets has been extended to all major technologies in MATLAB (mapreduce, datastore and other toolboxes).
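As an illustration of the datastore idea mentioned above, the sketch below computes a mean over a file by reading it one chunk at a time rather than loading it all into memory. It assumes the sample file airlinesmall.csv that ships with MATLAB is on the path and that it has a column named ArrDelay with missing values written as NA; the variable names total, count and meanDelay are hypothetical.

```matlab
% Create a datastore for a (potentially large) CSV file.
% Assumption: the MATLAB sample file airlinesmall.csv is on the path.
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA');

% Only read the column we need (assumed to be called ArrDelay).
ds.SelectedVariableNames = {'ArrDelay'};

total = 0;   % running sum of arrival delays
count = 0;   % number of non-missing values seen so far

% Process the file chunk by chunk.
while hasdata(ds)
    chunk = read(ds);            % a table holding the next block of rows
    x = chunk.ArrDelay;
    x = x(~isnan(x));            % drop missing values
    total = total + sum(x);
    count = count + numel(x);
end

meanDelay = total / count;
fprintf('Mean arrival delay: %.2f minutes\n', meanDelay);
```

The same pattern (keep only small running summaries, never the whole dataset) is what makes it possible to handle data that does not fit in memory.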
Big Data Landscape
The first data science application? Kepler's three laws of planetary motion
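To see why Kepler's work can be viewed as data analysis, the sketch below recovers the third law (the square of the orbital period is proportional to the cube of the semi-major axis) from a small table of planetary data by fitting a straight line on a log-log scale. The numbers are approximate textbook values and are not taken from the slides; the variable names are hypothetical.

```matlab
% Approximate orbital data for Mercury, Venus, Earth, Mars, Jupiter, Saturn.
% a: semi-major axis in astronomical units, T: orbital period in years.
a = [0.387 0.723 1.000 1.524 5.203 9.537];
T = [0.241 0.615 1.000 1.881 11.862 29.457];

% If T = C * a^k, then log(T) = k*log(a) + log(C), a straight line.
% Fit that line by least squares; p(1) estimates the exponent k.
p = polyfit(log(a), log(T), 1);

fprintf('Estimated exponent k = %.3f\n', p(1));
fprintf('Kepler''s third law corresponds to k = 1.5\n');
```

A fitted exponent close to 1.5 is the third law in the form T^2 ∝ a^3.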
Make a TV show using data?
TV made by Amazon and Netflix using big data
- Alpha House was not as successful as expected.
- Netflix also ran open competitions for the best algorithms to predict user ratings for films (now discontinued for privacy and other reasons).
Google Flu Trends
Google made a big splash in the news in 2008, and again five years later.
The dark side of data science
Data, if used in the right way, can greatly facilitate our lives (as in the concept of smart cities), but...
- Privacy: how are the data collected from you being used?
- Biased data: polls before Brexit or the US presidential election in 2016
- Biased interpretation: data show that people who shop at Waitrose have a longer life span than those who shop at Aldi and Asda.
Data Science Workflow (in business)
Data Science using MATLAB
- MATLAB is not the best tool for data science. Certain tasks, like text processing, are better done with Python or R.
- More tools and data types have been introduced in MATLAB over the past few years, mainly to cope with the growing needs of data science.
MATLAB Data Types
We have already seen these (type whos in the command window):
- Numerical types: double (double-precision floating-point number), uint8 (images), int32, int64, ...
- Symbolic: defined by syms
- Logical: true, false
- Characters and strings: A = 'string'
- Function handles: @ as in integral(@(x) sin(x),...)
New data types introduced recently: structures (struct), cell arrays (cell), time series (timeseries, from 2006b), tables (table, from 2013b), categorical arrays (categorical, from 2013b), dates and times (datetime, from 2014b), ...
Reference: http://uk.mathworks.com/help/matlab/data-types_data-types.html
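A minimal sketch of how some of the data types listed above are created; the values and variable names are made up for illustration. Running whos afterwards lists each variable together with its class.

```matlab
% Structure: named fields that may hold values of different types.
s = struct('name', 'Alice', 'score', 82);

% Cell array: a container whose cells may hold different types and sizes.
c = {42, 'text', [1 2 3]};

% Categorical array: values drawn from a finite set of categories.
g = categorical({'low', 'high', 'low', 'medium'});

% Date and time.
d = datetime(2017, 1, 15);

% Table: column-oriented data with named variables.
T = table([1; 2; 3], {'a'; 'b'; 'c'}, 'VariableNames', {'id', 'label'});

% Function handle, as used with integral.
f = @(x) sin(x);
area = integral(f, 0, pi);   % integral of sin(x) over [0, pi]

% List the variables in the workspace with their sizes and classes.
whos
```

Unlike numeric arrays, structures, cell arrays and tables can mix types within one variable, which is why they are convenient for real-world datasets.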
The plan for the rest of the semester
In Week 8 (Friday):
- Review (and introduce) a few new data structures (mainly character strings)
- Read (and write) different data formats (csv, excel, image, ...)
Specific topics:
- Random simulation (week 9)
- Regression and classification (week 10)
- Dimension reduction/low rank approximation (week 10)
- Google PageRank (week 11)
- Other related topics (if time permits, week 11)