Project on Data Analytics CIS 660 Sunnie S Chung


2 Person Group Project: 20%
Presentation of a project with related research papers: 10%

You can choose one of the following projects or create your own. You can change the details of the project you choose from the list as needed.

For the projects on social network sites, you will need to get account approval from Twitter, Yelp, Facebook, or LinkedIn and register your project (app) as a developer before you can download data from those sites. Check the Developer/App/Tool options on each site for this process. Give the class project site as the App URL when you register.

Alternatively, you can choose any set of web sites, system log files, or other data that you can obtain to process for your project. Some of the available data sets are listed below.

Those who want to work with NoSQL systems on Hadoop may use any Hadoop-related apps/tools to create projects (see the CIS612 Project List for guides). More instructions for downloading and installing them will be given on request. However, this option is not recommended for those who have no experience with Hadoop or NoSQL systems. Please take CIS612 for that.

Submit a 1-2 page proposal on the project your group chooses. The proposal should specify your data, major tasks, and the data analytics systems/tools you will use, and should include a timeline. It is due by the Phase 1 deadline.

Each group (2 persons) will give a 20-minute presentation on its project and the related research paper it chose (including the tasks and tools used for the project) during the last class sessions. Presentations will be scheduled after the midterm. Groups presenting in the first session will get 5-10% extra credit (this is not applicable in any summer semester).

Project Specification

Phase 1: Planning
Plan your project by researching data sets and data mining algorithms/tools to create your data mining project. Submit a 1-2 page proposal per group.

Phase 2: Data Cleaning/Preprocessing/Transformation
Obtain your data and preprocess it. Create a data mining project with your data set using a data mining system or tools of your choice. For this project, you can use any data mining tools or any open source implementations of the data mining techniques covered in class, with any data set of your choice given below or any data that you obtain from the suggested links.

Phase 3: Implement/Apply Data Mining, Validate Your Results, and Presentation
Implement or run data mining algorithms to get results. Validate your results using the cross-validation tool available in your chosen system. Visualize your results and prepare your presentation.

See the deadline for each phase on the class webpage.
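As a rough illustration of the cross-validation step in Phase 3, the sketch below splits a labeled data set into k folds and averages the held-out accuracy. It is only a sketch under stated assumptions: the function names are hypothetical, and the "model" is a trivial majority-label predictor standing in for whatever classifier your chosen system provides. In practice you would use the cross-validation facility built into your tool.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_label(labels):
    """Hypothetical stand-in model: predict the most frequent training label."""
    return max(sorted(set(labels)), key=labels.count)


def cross_validate(labels, k=5, seed=0):
    """Return the mean held-out accuracy of the majority-label model over k folds."""
    accuracies = []
    for fold in k_fold_indices(len(labels), k, seed):
        held_out = set(fold)
        # Train on everything outside the current fold.
        train = [y for i, y in enumerate(labels) if i not in held_out]
        prediction = majority_label(train)
        # Score on the held-out fold only.
        correct = sum(1 for i in fold if labels[i] == prediction)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)


# Illustrative labels only, not real project data.
labels = ["spam"] * 8 + ["ham"] * 2
print(round(cross_validate(labels, k=5), 3))
```

Each record is held out exactly once, so the averaged score reflects performance on data the model did not see during training.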

Project List

You can create your own data mining project, or you can choose one from the suggested project list below and the papers in the suggested research topics and paper list here. You can also choose one of the topics and papers from the conference sites below or the related resource sites listed here. You can change the details of the project as you wish.

Examples of Selected Current Research Topics in Big Data Analytics/Data Mining

1. Text Mining of Social Network Data: Twitter, Yelp, Facebook, LinkedIn, and more
- Sentiment Analysis of Product Reviews
- Social Network Data Analysis

One of the most common data analytics tasks is mining text data, which is unstructured or semi-structured. Common examples of such data are message logs from social media sites and system-generated log files. One way to mine such data is to transform the unstructured or semi-structured logging format into structured files for processing. You can also create a database or collections from the transformed files and query them for data mining. Such structured files could be tables in an RDBMS, key-value stores (in JSON format), CSV (Comma Separated Values), TSV (Tab Separated Values), or a document collection for common NoSQL systems like MongoDB, Hive, or Cassandra in HDFS. You can use HBase or Pig as well.

There are useful open source tools such as Tweepy, FacePager, and Flume, among others. They can be used to download a stream of data from the Twitter/Facebook site to your system or to any HDFS system.

Once you have transformed your text data into a structured file, you can apply any data mining tools or algorithms to the transformed data for classification (Decision Tree, PEBLS, Neural Network, SVM) or clustering.

Text data you can download from:
- Twitter
- Yelp
- Facebook
- LinkedIn

You can download any web pages or any data sets (see the Resource List below). One available data set, used for the examples in this section, contains the text of 219 State of the Union addresses of U.S. Presidents between 1790 and 2006.
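To make the transformation described above concrete, here is a minimal sketch that turns semi-structured message records (one JSON object per line) into a structured CSV file. The field names id, user, and text and the two sample records are assumptions made up for illustration; real records downloaded from a site's API will have different and usually nested fields.

```python
import csv
import io
import json

# Hypothetical field names chosen for illustration only.
FIELDS = ["id", "user", "text"]


def json_lines_to_csv(json_lines, fields=FIELDS):
    """Convert text holding one JSON object per line into CSV text with a header."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for line in json_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the chosen fields; missing keys become empty cells.
        writer.writerow({name: record.get(name, "") for name in fields})
    return out.getvalue()


# Made-up sample records, not real downloaded data.
SAMPLE = "\n".join([
    '{"id": 1, "user": "alice", "text": "great phone, love it"}',
    '{"id": 2, "user": "bob", "text": "battery died", "lang": "en"}',
])

csv_text = json_lines_to_csv(SAMPLE)
print(csv_text)
```

The resulting CSV can be loaded into a table in an RDBMS or fed directly to a data mining tool. Fields that are not listed (such as lang in the second record) are dropped.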

See the project guide below for more detail, an example of a text mining project with R:
- Twitter Data Analysis
- Any public Facebook sites: New York Times, Washington Post, Boston Tribune

Facebook data transformation into one of the following platforms:
- Tables in an RDBMS (MS SQL Server or any database server with Java/JDBC)
- Key-value stores (JSON file format), CSV (Comma Separated Values), TSV (Tab Separated Values)

Twitter message data transformation into one of the following platforms:
- Tables in an RDBMS (MS SQL Server or any database server with Java/JDBC)
- Key-value stores in JSON file format, CSV (Comma Separated Values), TSV (Tab Separated Values)

Related paper: The Unified Logging Infrastructure for Data Analytics at Twitter. George Lee, Jimmy Lin, Chuang Liu, Andrew Lorek, and Dmitriy Ryaboy, Twitter, Inc.

Webpage or Document Processing for Text Analysis
- Document Clustering, Phrase Search
- Generating Word2Vec vectors for each word in Wikipedia or a webpage collection, and generating Paragraph2vec vectors for each document, to do similarity search for document (webpage) clustering or sentiment analysis

See the Natural Language Processing in Unstructured Text Mining section of the Class Lecture Notes for the details, tutorial sites, papers, and data sets.

Text Mining (Sentiment Analysis) with SVM using the Yelp Review Data Set or movie reviews from the Rotten Tomatoes site
Implement the sentiment analysis described in the papers below. (See me for more guidance on this.)

Review Data Sources for Sentiment Analysis
- Amazon Product Review Data:
- Movie Review Data:
- Yelp Data Set:

Question Answering System
- Question Answering Data on Amazon Product Reviews

Papers: some related papers to start with
- User-Level Sentiment Analysis Incorporating Social Networks in Twitter (Yahoo)
- Good research project on sentiment analysis sites:
- Topic sentiment analysis in twitter: a graph-based hashtag sentiment classification approach

Text Mining Using MS Analysis Service with Association Rule Mining
See the detailed guides in the Project Section on the class webpage or the following links.

This section examines two particularly interesting data flow transformations that facilitate text mining: Term Extraction and Term Lookup. SQL Server Data Mining supports the TEXT data type, but that data type is not enough to perform meaningful text analysis. From the algorithm's perspective, columns having the TEXT data type are treated just like discrete columns that have the LONG data type: as a collection of various distinct states, without any way to directly access the content of a text value.

To perform text mining with SQL Server Data Mining, you must first bring the text into a form that the algorithms can consume. The solution included in the product is to represent each piece of text as a collection of words and phrases, and to perform data mining based on the occurrence of certain key words and phrases inside a document (and possibly some frequency-related scores). A document is therefore modeled very much like a shopping basket that contains (or does not contain) certain items, which happen to be key words and phrases.

After each document is represented as a collection of key phrases, you can perform data mining using one of the following model types:
- Classification models that use the nested table of key words and phrases as input to predict the class of a document
- Clustering models that find similar documents based on common occurrences
- Association models that detect cross-correlations between key words and phrases

The workflow is:
1. Build a dictionary of key words and phrases over a collection of representative documents. This task is usually accomplished using the Term Extraction transformation.
2. Based on the dictionary, extract the list of significant key words and phrases for each document to be analyzed. This task is usually accomplished using the Term Lookup transformation.
3. Train mining models on top of the transformed data.

NOTE: More data sources for text mining:
- State of the Union
- Any electronic books available on the web
- About 500 webpages on the Wikipedia site
- Fortune 500 Company
- Any newspaper or magazine site

Instead of using the MS Data Tool, you can build the Document Frequency and Inverted Index described in the Lecture Notes on Information Retrieval to compute Term Frequency and Document Frequency for cosine similarity. The lecture notes show how cosine similarity is adopted as vector space scoring for document ranking.
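The vector space scoring mentioned above can be sketched as follows. This is a minimal sketch, not the lecture notes' exact formulation: it assumes one common weighting variant, (1 + log10 tf) x log10(N / df), whitespace tokenization, and three made-up one-line documents. The lecture notes may use a different variant, so check them before relying on these weights.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_tfidf(docs):
    """Return (one tf-idf weight dict per document, idf dict)."""
    tokenized = [tokenize(d) for d in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # document frequency: count each term once per doc
    idf = {t: math.log10(len(docs) / n) for t, n in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (1 + math.log10(c)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, document index) pairs sorted from best to worst match."""
    vectors, idf = build_tfidf(docs)
    q_tf = Counter(tokenize(query))
    q_vec = {t: (1 + math.log10(c)) * idf.get(t, 0.0) for t, c in q_tf.items()}
    return sorted(((cosine(q_vec, vec), i) for i, vec in enumerate(vectors)),
                  reverse=True)


# Made-up documents for illustration only.
DOCS = [
    "the union is strong",
    "the economy is growing",
    "strong economy strong union",
]
for score, index in rank("growing", DOCS):
    print(index, round(score, 3))
```

Dividing by the vector lengths is the cosine normalization step: it keeps long documents from scoring higher just because they contain more terms.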
The part that was not done in Lab 2 (I didn't ask for it in Lab 2) is building the weight matrix by calculating weighted scores based on tf-idf, as shown in the lecture notes. Then you can calculate the cosine similarity between the documents and the keywords using the tf-idf weight scores you built. At the end of the lecture notes, there are variations of the scoring matrix to optimize, including cosine normalization. You can use any electronic books on the web or more than 500 webpages on the web.

2. Fraud Detection or Intrusion Detection using Data Mining

Intrusion Detection
- Process system logging files to build a database to query.
- Transform log files from any system into a CSV file or a table, then apply any data mining techniques for anomaly detection with classification (e.g., SVM), clustering (K-Means), etc.

Two data sets are available on request:
- NASA web server log file (old data set from 1990). See the detailed example project guide to get the NASA HTTP Access Logs.
- Wireless network log file (new data set from 2015)

For the data set and papers, see the Anomaly Detection section in the Class Lecture Notes on the class webpage.

Related paper: Networks-Empirical-Evaluation-of-Threats.pdf

3. Recommendation System
- Item-to-Item Collaborative Filtering in Recommendation System
- Implement data transformation (binarization of basket item sets) to apply the data mining algorithm SVM.

Data Source:
Related papers from Amazon Recommendation System:
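For the Recommendation System item, the binarization of basket item sets can be sketched as below. This is an illustrative sketch only: the baskets are made up, the function names are hypothetical, and cosine similarity between item columns is used here as one common item-to-item measure, not necessarily the one in the Amazon papers. The 0/1 matrix is the kind of fixed-width input that a classifier such as SVM expects.

```python
import math


def binarize(baskets):
    """Return (sorted item list, 0/1 matrix with one row per basket)."""
    items = sorted({item for basket in baskets for item in basket})
    matrix = [[1 if item in basket else 0 for item in items] for basket in baskets]
    return items, matrix


def item_cosine(matrix, a, b):
    """Cosine similarity between item columns a and b of a 0/1 matrix."""
    col_a = [row[a] for row in matrix]
    col_b = [row[b] for row in matrix]
    dot = sum(x * y for x, y in zip(col_a, col_b))
    # For 0/1 columns the squared length equals the number of ones.
    norm = math.sqrt(sum(col_a)) * math.sqrt(sum(col_b))
    return dot / norm if norm else 0.0


# Made-up baskets for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
]
items, matrix = binarize(baskets)
print(items)
for row in matrix:
    print(row)
print(round(item_cosine(matrix, items.index("bread"), items.index("butter")), 3))
```

Items that are often bought together end up with a high column similarity, which is the basis for recommending one item to buyers of the other.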

4. IBM Research Project: Building Data Analytic Artificial Intelligence
- IBM Watson DeepQA Project

5. Crime Forecasting Using Clustering Techniques
- NIJ (National Institute of Justice) Crime Forecasting Challenges and data set

6. Image Data Analytics
- Deep Learning for Image Recognition
- Face Recognition Research

See the Image Recognition section at the end of the Class Lecture Notes for the details, tutorial sites, papers, and data sets.

Image Data Processing Tutorial Sites:
Data Source:

Related Research Papers:
- ImageNet Classification with Deep Convolutional Neural Networks
- Going Deeper with Convolutions

Data Source: IMDB, Instagram
Web Scraping with XPath in Python: E.md tics.py

Other Related References:
Image Data Sources:

Suggested Data Sources

The suggested public social media sites and known data collection sites for data analytics are listed below with related industry research papers. You can deploy your big data infrastructure on the cloud.

Data transformation into one of the following HDFS-based NoSQL systems, or into both an HDFS platform and an RDBMS:
1-1) XML, key-value stores, or JSON files in a document collection for MongoDB or Cassandra; or CSV (Comma Separated Values) or TSV (Tab Separated Values) in Hive, Pig Latin, or VoltDB in HDFS
1-2) Big Table in HBase in HDFS
1-3) RDD in Spark in HDFS to use pipelining
1-4) Tables in an RDBMS (MS SQL Server with Data Integration Service/Data Analysis Service using LINQ, or any RDBMS database server)

1. LinkedIn
Related papers to read:
- Avatara: OLAP for Webscale Analytics Products. Lili Wu, Roshan Sumbaly, Chris Riccomini, Gordon Koo, Hyung Jin Kim, Jay Kreps, and Sam Shah, LinkedIn
- The Big Data Ecosystem at LinkedIn. Roshan Sumbaly, Jay Kreps, and Sam Shah, LinkedIn

2. Any well-known newspaper or magazine sites on Facebook
Related papers to read:
- Petabyte Scale Databases and Storage Systems Deployed at Facebook. Dhruba Borthakur
- Data Warehousing and Analytics Infrastructure at Facebook, in SIGMOD 2010, by Ashish Thusoo (Facebook), et al.

3. Twitter message data transformation
Related papers to read (more will be given):
- The Unified Logging Infrastructure for Data Analytics at Twitter. George Lee, Jimmy Lin, Chuang Liu, Andrew Lorek, and Dmitriy Ryaboy, Twitter, Inc.
- Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture. Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, and Jimmy Lin, Twitter, Inc.

Yelp Data Challenge: Business data set

6. Transform log files from any system into one of the platforms listed above
Related papers to read: will be given

7. Webpage or Document Processing for Text Analysis
- Document Clustering, Phrase Search

See the Natural Language Processing in Unstructured Text Mining section of the Class Lecture Notes for the details, tutorial sites, papers, and data sets.

Alternatively, download all the webpages in one domain from any well-known public news site of your choice and extract only the text body using an XPath library in any web browser or language. Or you can download preprocessed Wikipedia texts (in XML) here

Other data sources to download:
- Arxiv research paper repository
- State of the Union data used for the examples in this section. This download contains the text of 219 State of the Union addresses of U.S. Presidents between 1790 and 2006.
- IMDB Movie Review Collection for Sentiment Analysis:
- WordNet

You can build the Document Frequency and Inverted Index described in the Lecture Notes on Information Retrieval to compute any IR-related metric in an algorithm or to apply an Association Rule Mining algorithm. The lecture notes show how cosine similarity is adopted as vector space scoring for document ranking. One example is building the weight matrix by calculating weighted scores based on tf-idf, as shown in the lecture notes. Then you can calculate the cosine similarity between the documents and the keywords using the tf-idf weight scores you built. At the end of the lecture notes, there are variations of the scoring matrix to optimize. Cosine normalization is one of them.

8. Transform any electronic books or online documents for text processing analysis
Any electronic book online. See item 7, Webpage or Document Processing, above for the processing steps.

9. Text Mining with the Data Sources in Item 7 for Association Rule Mining Using MS Analysis Service
See the detailed guides in the Project Section on the class webpage or the following links. The Term Extraction and Term Lookup workflow for SQL Server Data Mining is the same as described earlier under "Text Mining Using MS Analysis Service with Association Rule Mining": build a dictionary of key words and phrases with Term Extraction, extract the key words and phrases of each document with Term Lookup, and then train classification, clustering, or association models on the transformed data.

Data source for text mining: you can use any electronic books on the web or more than 500 webpages on the web.

10. Image Data Analytics
- Deep Learning for Image Recognition

See the Deep Learning for Image Recognition section at the end of the Class Lecture Notes for the details, tutorial sites, papers, and data sets.
Data Sets: ImageNet

11. Building Social Network Graph into a store
Facebook Friends Social Network (Graph API) data transformation
Related papers to read: will be given

12. Implement any data mining metric you learned in class with a Cube and Dimensions using Microsoft DW
Create Dimensions with a set of attributes and define a measure in terms of similarity, distance, or correlation between any two records in the vTargetMail data set for clustering.

13. Minority Class Detection with Decision Tree with adapted measure and weight
You can implement your own metric, as specified in the paper below, that can be used in a Decision Tree algorithm, and test it with the Adventure Data Set.
- A Robust Decision Tree Algorithm for Imbalanced Data Sets

14. Information Retrieval for finding the most related documents with keywords, using any set of webpages or Wikipedia webpages

15. Any Data Mining Project using Data Warehouse/OLAP with MDX and DMX
See the DW Tutorial and the MDX, DMX Tutorial in the Lab3 section for this.

16. Building Social Network Graph into a store
Facebook Friends Social Network (Graph API) data transformation into one of the following platforms:
- Tables in an RDBMS (MS SQL Server or any database server with Java/JDBC)
- Key-value stores in JSON file format, CSV (Comma Separated Values), TSV (Tab Separated Values)
Process the JSON file into a table or CSV files with a user id column and edge columns, then apply data mining queries to it.

17. Any GIS data mining

18. Any papers on one of the following topics:
- Stream data mining using Spark
- Sequential pattern mining, sequence classification and clustering
- Time-series analysis, regression and trend analysis
- Biological sequence analysis and biological data mining
- Graph pattern mining, graph classification and clustering
- Social network analysis
- Information network analysis
- Spatial, spatiotemporal and moving object data mining
- Multimedia data mining
- Mining computer systems and sensor networks
- Mining software programs
- Statistical data mining methods

Other Useful Data Sources:
Other Related Sites:

Useful Resources
R and Weka each provide a collection of machine learning and data mining algorithms for data mining tasks. In Weka, the algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization.
R Programming:

SQL Server Analysis Services (SSAS) Data Tools:
You can use R in SQL Server 2016 or in the standalone R Server.
R Hadoop System:
Weka:

Good Conference Sites to Search:
- KDD
- Top data mining research conferences: KDD, IEEE ICDE, IEEE ICDM, CIKM, and SIAM SDM
- ACM SIGMOD:
- VLDB (IEEE):
- ICDE (IEEE)

Cyber Security:
- Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition (Mahmood Sharif, Carnegie Mellon University, at ACM CCS 2016)
- AmpPot: Monitoring and Defending Against Amplification DDoS Attacks

- A Privacy Protection Technique for Publishing Data Mining Models and Research Data
- Privacy-Preserving Data Mining through Knowledge Model Sharing
- IMR based Anonymization for Privacy Preservation in Data Mining
- Hiding a Needle in a Haystack: Privacy Preserving Apriori Algorithm in MapReduce Framework

Artificial Intelligence and Machine Learning:
- Deep Face Recognition by Omkar M Parkhi

Some Research Resources (will be updated)

Major Conference Proceedings that will be used
1. DM conferences: ACM SIGKDD (KDD), ICDM (IEEE, Int. Conf. Data Mining), SDM (SIAM Data Mining), PKDD (Principles KDD)/ECML, PAKDD (Pacific-Asia)
2. DB conferences: ACM SIGMOD, VLDB, ICDE
3. ML conferences: NIPS, ICML
4. IR conferences: SIGIR, CIKM
5. Web conferences: WWW, WSDM
6. Other related conferences and journals
7. IEEE TKDE, ACM TKDD, DMKD, ML

Recommended Reference Books
1. C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2007.
2. S. Chakrabarti, Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data, Morgan Kaufmann.
3. T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag.
4. B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer.
5. D. Easley and J. Kleinberg, Networks, Crowds, and Markets: Reasoning About a Highly Connected World, Cambridge Univ. Press.
6. M. Newman, Networks: An Introduction, Oxford Univ. Press, 2010.
