Applications of HPCC Systems at Clemson University. Amy Apon, PhD Linh Ngo, PhD Michael Payne Big Data Systems Laboratory Clemson University

Applications of HPCC Systems at Clemson University Amy Apon, PhD Linh Ngo, PhD Michael Payne Big Data Systems Laboratory Clemson University

Clemson Strengths and Opportunities People Facilities PhD-level faculty & research staff Talented students Significant industry collaborators Palmetto Top 5 in US Academic Supercomputers ~2000 nodes, 20K cores, 600 GPUs 100Gb Internet connectivity

Big Data Systems Lab Overview Big Data Systems Lab Vision Perform World Class Research on the Systems and Enabling Information Technology for Advanced Data Analytics Big Data Systems Lab Research Areas Systems and Architectures Tools and Operations Data Analytics and Applications

Effect of High Performance Computing on Academic Research Productivity Motivation: There is a lot of pressure on federal funding We propose efficiency as a measure from which to gain insights on return on investment We show that locallyavailable HPC has a positive effect on the ability of a university to do research

Text mining of news reports and social media for business intelligence Motivation: Government and business need information about public sentiment. Research: We develop and apply methods to analyze large amounts of textual data to enable inquiry of social and business problems.

Shared Computing Resources among Researchers Shared Execution Environment Temporary Local Storage User Privileges Only

Linh Ngo, PhD HPCC Systems in a Shared Research Computing Environment

Shared Computing Resources among Researchers Shared Execution Environment Temporary Local Storage User Privileges Only How to provision and configure an HPCC cluster dynamically for research purposes? Step 1: Configure, install, and deploy HPCC as a non-root user Step 2: Dynamically provision HPCC cluster in a shared research environment

Installation and Configuration of Dependencies / /usr binutils ICU XALAN APR / /home /lib64 /lib $USER /opt /local_scratch /parallel_scratch

Resolving Non-default Installation Path Conflicts / etc init.d Administrative privileges / home Non-administrative privileges $USER opt var HPCCSystems HPCCSystems lib log HPCCSystems HPCCSystems configmgr mydafilesrv mydafilesrv local_scratch hpcc $USER hpcc parallel_scratch $USER HPCCSystems lib log lock pid

Non-root Deployment Remove/relax root-level settings: i.e.: is_root Reduce default configuration settings for resource requirements: depended on resource allocation requests

Dynamic Provisioning user.palmetto.clemson.edu 2 PBS_NODEFILE 3 environment.xml 4 1 mydafilesrv mydafile myeclc mythor 5 myroxie Deploy to /local_scratch or /parallel_scratch?

Michael Payne Using HPCC Systems to Manage Academic Data LexisNexis Summer 2014 Internship

Using HPCC Systems to Manage Academic Data Research in Scholarly Data requires academic data from many different sources, which store data under various formats Aggregating these sources into a useful and cohesive structure requires a data-intensive approach to preprocessing, integration and analysis HPCC Systems is a platform to streamline this process

Categories of Scholarly Data Research Higher Education Institutions Funding Support High Performance Computing Capability

Scholarly Data Description Enrollment (multi-sheet Excel) Financial (multi-sheet Excel) Faculty (multi-sheet Excel) Detailed Award (XML) Federal Funding (tabdelimited) Expenditures (tab-delimited) Institution Patent (Excel) Degrees Conferred (Excel) Funding Support Research Higher Education Institutions Detailed list of articles by discipline with abstracts, references, and disambiguated authors. (XML) High Performance Computing Capability Institutional Information with Carnegie s Research Classifications (multi-sheet Excel) Institutions from States with EPSCoR status (tab-delimited) NIH Award Data (CSV/Excel) Top500 Supercomputer affiliated with academic institutions (XML)

Examples of Scholarly Data Links PI/Author name Institution name Name similarity Address Name similarity Match name with WoS s Organization-Enhanced name Institution name/email Acknowledgment attributes (2008 on ward, automatic)

Ongoing Work Porting data analytic processes to ECL Applying Machine Learning techniques for article abstract classification

Summer 2014 Internship - Logistic Regression for Dense Matrices LexisNexis Internship Machine Learning Manager Timothy Humphrey Mentor Arjuna Chala

Logistic Regression Prediction using continuous and discrete values No distributional assumptions on the predictors May not be normally distributed or linearly related Relationship between the discrete variable and the predictor is non-linear

Parallel Block Basic Linear Algebra Subprograms (PB-BLAS) Matrices can be partitioned Schemes must be compatible There are multiple choices! X = 4 x 4 4 x 1 4 x 1 X 3 x 1 = 2 x 3 2 x 1 1 x 1 2 x 1

Machine Learning in ECL 35 Logistic Runtimes Hard Coded Mapping Full Higgs Dataset 11,000,000 x 28 Time in Minutes 30 25 20 15 10 5 0 Higgs 1,000 Higgs 10,000 Higgs 100,000 PB-BLAS Non PB-BLAS

Machine Learning in ECL 25 Logistic Runtimes Auto Mapping Full Elsevier Dataset 100,000 x 3,291 Time in Minutes 20 15 10 5 0 Elsevier 100 PB-BLAS Non PB-BLAS

Machine Learning in ECL 450 Logistic Runtimes Auto Mapping Full Elsevier Dataset 100,000 x 3,291 Time in Minutes 400 350 300 250 200 150 100 50 0 Elsevier 1,000 PB-BLAS Non PB-BLAS

Project Summary Logistic Regression code and supporting functions have been documented and merged to ECL-ML GitHub repository Auto block vector mapping function for any user that wants to use PB-BLAS Ready to use element wise multiplication in PB-BLAS Updated debugging statements that a clear understanding of errors Test functions for both block vector mapping function Sample code for using logistic regression Currently working on K-means implementation that utilizes PB-BLAS

Linh Ngo, PhD Alex Herzog, PhD Michael Payne Amy Apon, PhD {lngo, aherzog, mpayne3, aapon}@clemson.edu Big Data Systems Laboratory Clemson University