Applications of HPCC Systems at Clemson University
Amy Apon, PhD; Linh Ngo, PhD; Michael Payne
Big Data Systems Laboratory, Clemson University


Clemson Strengths and Opportunities
People:
- PhD-level faculty & research staff
- Talented students
- Significant industry collaborators
Facilities:
- Palmetto: Top 5 in US Academic Supercomputers
- ~2000 nodes, 20K cores, 600 GPUs
- 100Gb Internet connectivity

Big Data Systems Lab Overview
Big Data Systems Lab Vision:
- Perform World Class Research on the Systems and Enabling Information Technology for Advanced Data Analytics
Big Data Systems Lab Research Areas:
- Systems and Architectures
- Tools and Operations
- Data Analytics and Applications

Effect of High Performance Computing on Academic Research Productivity
- Motivation: There is a lot of pressure on federal funding
- We propose efficiency as a measure from which to gain insights on return on investment
- We show that locally available HPC has a positive effect on the ability of a university to do research

Text mining of news reports and social media for business intelligence
- Motivation: Government and business need information about public sentiment.
- Research: We develop and apply methods to analyze large amounts of textual data to enable inquiry of social and business problems.


HPCC Systems in a Shared Research Computing Environment
Linh Ngo, PhD

Shared Computing Resources among Researchers
- Shared Execution Environment
- Temporary Local Storage
- User Privileges Only
How to provision and configure an HPCC cluster dynamically for research purposes?
- Step 1: Configure, install, and deploy HPCC as a non-root user
- Step 2: Dynamically provision HPCC cluster in a shared research environment

Installation and Configuration of Dependencies
[Diagram: directory tree. Dependencies shown: binutils, ICU, XALAN, APR. Paths shown: /, /usr, /lib64, /lib, /home, $USER, /opt, /local_scratch, /parallel_scratch.]

Resolving Non-default Installation Path Conflicts
[Diagram: directory trees contrasting locations that require administrative privileges with locations available under non-administrative privileges. Labels on the administrative side: /, etc, init.d. Labels on the non-administrative side: /, home, $USER, opt, var, HPCCSystems, lib, log, configmgr, mydafilesrv, local_scratch, parallel_scratch, hpcc, lock, pid.]
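The relocation idea on this slide can be pictured as a mapping from root-owned default directories to user-writable ones. The following is a minimal sketch only: the default paths in `DEFAULT_DIRS`, the helper name `relocate`, and the choice of which directories go under the home directory versus scratch space are all assumptions made for illustration, not the layout actually used on Palmetto.

```python
from pathlib import PurePosixPath

# Hypothetical root-owned defaults, for illustration only.
DEFAULT_DIRS = {
    "install": "/opt/HPCCSystems",
    "lib": "/var/lib/HPCCSystems",
    "log": "/var/log/HPCCSystems",
    "lock": "/var/lock/HPCCSystems",
    "pid": "/var/run/HPCCSystems",
}


def relocate(default_dirs, home, scratch):
    """Map each default directory to a user-writable location.

    Assumption: the installed software lives under the user's home
    directory, while runtime directories (lib, log, lock, pid) live
    under a scratch directory.
    """
    relocated = {}
    for name, path in default_dirs.items():
        base = home if name == "install" else scratch
        relative = PurePosixPath(path).relative_to("/")
        relocated[name] = str(PurePosixPath(base) / relative)
    return relocated


# Hypothetical user name and scratch path.
user_dirs = relocate(DEFAULT_DIRS, "/home/alice", "/local_scratch/hpcc/alice")
```

Keeping the mapping in one table makes it easy to switch the runtime directories between local and parallel scratch space by changing a single argument.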

Non-root Deployment
- Remove/relax root-level settings, e.g., is_root
- Reduce default configuration settings for resource requirements, depending on resource allocation requests

Dynamic Provisioning
[Diagram: numbered steps 1 to 5 linking user.palmetto.clemson.edu, PBS_NODEFILE, environment.xml, and the HPCC components mydafilesrv, mydafile, myeclc, mythor, and myroxie.]
Open question: Deploy to /local_scratch or /parallel_scratch?
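As a rough illustration of the provisioning step, a PBS job exposes its allocated hosts through the file named by the PBS_NODEFILE environment variable, typically with one line per allocated slot, so hostnames can repeat. The sketch below reads such a file and proposes a role assignment. The function names and the "first node is the master" policy are assumptions for illustration; the slides do not say how roles were assigned or how environment.xml was generated.

```python
import os
import tempfile


def read_nodes(nodefile_path):
    """Return the unique hostnames in a PBS node file, in first-seen order."""
    nodes = []
    with open(nodefile_path) as handle:
        for line in handle:
            host = line.strip()
            if host and host not in nodes:
                nodes.append(host)
    return nodes


def assign_roles(nodes):
    """Hypothetical policy: first node is the master, the rest are workers.

    With a single node, that node acts as both master and worker.
    """
    if not nodes:
        raise ValueError("no nodes allocated")
    return {"master": nodes[0], "workers": nodes[1:] or [nodes[0]]}


# Stand-in node file with made-up hostnames. Inside a real PBS job one
# would pass os.environ["PBS_NODEFILE"] to read_nodes instead.
with tempfile.NamedTemporaryFile("w", suffix=".nodes", delete=False) as tmp:
    tmp.write("node001\nnode001\nnode002\nnode003\nnode003\n")
demo_path = tmp.name

nodes = read_nodes(demo_path)
roles = assign_roles(nodes)
os.remove(demo_path)
```

The resulting role dictionary is the kind of input a configuration generator would need before writing a cluster description for the allocated nodes.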

Using HPCC Systems to Manage Academic Data
Michael Payne
LexisNexis Summer 2014 Internship

Using HPCC Systems to Manage Academic Data
- Research in Scholarly Data requires academic data from many different sources, which store data in various formats
- Aggregating these sources into a useful and cohesive structure requires a data-intensive approach to preprocessing, integration, and analysis
- HPCC Systems is a platform to streamline this process

Categories of Scholarly Data
- Research
- Higher Education Institutions
- Funding Support
- High Performance Computing Capability

Scholarly Data Description
Categories shown: Higher Education Institutions, Funding Support, Research, High Performance Computing Capability
Data sets and formats shown:
- Enrollment (multi-sheet Excel)
- Financial (multi-sheet Excel)
- Faculty (multi-sheet Excel)
- Detailed Award (XML)
- Federal Funding (tab-delimited)
- Expenditures (tab-delimited)
- Institution Patent (Excel)
- Degrees Conferred (Excel)
- Detailed list of articles by discipline with abstracts, references, and disambiguated authors (XML)
- Institutional Information with Carnegie's Research Classifications (multi-sheet Excel)
- Institutions from States with EPSCoR status (tab-delimited)
- NIH Award Data (CSV/Excel)
- Top500 Supercomputers affiliated with academic institutions (XML)
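To make the integration problem concrete, the sketch below reads a tab-delimited source and an XML source into the same list-of-dictionaries shape using only the Python standard library. The field names (`institution`, `amount`), the `award` record tag, and the sample values are made up for illustration; the real schemas of the data sets listed on this slide are not given here.

```python
import csv
import io
import xml.etree.ElementTree as ET


def read_tab_delimited(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text), delimiter="\t")]


def read_xml_records(text, record_tag):
    """Turn each <record_tag> element into a dict of child tag -> text."""
    root = ET.fromstring(text)
    return [
        {child.tag: (child.text or "").strip() for child in record}
        for record in root.iter(record_tag)
    ]


# Made-up sample inputs with a shared, hypothetical schema.
tsv_text = "institution\tamount\nClemson University\t100\n"
xml_text = (
    "<awards><award>"
    "<institution>Clemson University</institution><amount>250</amount>"
    "</award></awards>"
)

records = read_tab_delimited(tsv_text) + read_xml_records(xml_text, "award")
```

Once every source yields records of the same shape, later steps such as linking and aggregation can ignore the original file format.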

Examples of Scholarly Data Links
[Diagram: links between scholarly data sets. Labels, in order of appearance:]
- PI/Author name
- Institution name
- Name similarity
- Address
- Name similarity
- Match name with WoS's Organization-Enhanced name
- Institution name/email
- Acknowledgment attributes (2008 onward, automatic)
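Linking by name similarity can be sketched with the standard library's `difflib.SequenceMatcher`. This is an illustrative stand-in: the slides do not say which similarity measure was used, and the helper names, the 0.75 threshold, and the sample institution list are assumptions.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def best_match(name, candidates, threshold=0.75):
    """Return (candidate, score) for the most similar candidate.

    Returns None when there are no candidates or when the best score
    falls below the (assumed) threshold.
    """
    if not candidates:
        return None
    target = normalize(name)
    scored = [
        (SequenceMatcher(None, target, normalize(cand)).ratio(), cand)
        for cand in candidates
    ]
    score, cand = max(scored)
    return (cand, score) if score >= threshold else None


# Made-up reference list for the demo.
institutions = ["Clemson University", "University of Arkansas", "Duke University"]
match = best_match("Clemson Univ.", institutions)
no_match = best_match("Massachusetts Institute of Technology", institutions)
```

A threshold that is too low produces false links between different institutions, so in practice it would be tuned against a hand-checked sample.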

Ongoing Work
- Porting data analytic processes to ECL
- Applying Machine Learning techniques for article abstract classification

Summer 2014 Internship - Logistic Regression for Dense Matrices
LexisNexis Internship
- Machine Learning Manager: Timothy Humphrey
- Mentor: Arjuna Chala

Logistic Regression
- Prediction using continuous and discrete values
- No distributional assumptions on the predictors
- May not be normally distributed or linearly related
- Relationship between the discrete variable and the predictor is non-linear
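For readers who want to see the model in miniature, here is a sketch of logistic regression fitted by plain batch gradient descent in pure Python. It is not the ECL-ML or PB-BLAS implementation from the internship, whose algorithm the slides do not describe; the learning rate, epoch count, function names, and toy data are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    """Probability of class 1; w[0] is the bias, w[1:] the coefficients."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.1, epochs=5000):
    """Minimize mean log-loss with batch gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi  # gradient of log-loss w.r.t. z
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Toy data: one continuous predictor, binary outcome.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w = fit_logistic(X, y)
```

The predicted probability is a sigmoid of a linear combination of the predictors, which is why the relationship between the outcome and each predictor is non-linear even though the model is linear in its coefficients.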

Parallel Block Basic Linear Algebra Subprograms (PB-BLAS)
- Matrices can be partitioned
- Schemes must be compatible
- There are multiple choices!
[Figure: (4 x 4) X (4 x 1) = (4 x 1), followed by a block partition with blocks labeled 2 x 3, 3 x 1, 2 x 1, 1 x 1, and 2 x 1.]
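The compatibility requirement can be seen in a small blocked matrix-vector product: the column partition of the matrix must equal the partition of the vector, while the row partition is free, which is one reason there are multiple valid choices. The sketch below is plain Python, not PB-BLAS itself; the function names and example values are assumptions, and the block sizes are chosen to echo the sizes that appear on the slide.

```python
def split(seq, sizes):
    """Cut seq into consecutive chunks with the given sizes."""
    if sum(sizes) != len(seq):
        raise ValueError("partition sizes must cover the sequence exactly")
    chunks, start = [], 0
    for size in sizes:
        chunks.append(seq[start:start + size])
        start += size
    return chunks


def block_matvec(A, x, row_sizes, col_sizes):
    """Compute A times x one block at a time.

    col_sizes partitions both the columns of A and the entries of x,
    which is the compatibility requirement; row_sizes is independent.
    """
    x_blocks = split(x, col_sizes)
    result = []
    for rows in split(A, row_sizes):
        acc = [0.0] * len(rows)
        col_start = 0
        for x_block in x_blocks:
            width = len(x_block)
            for i, row in enumerate(rows):
                a_block_row = row[col_start:col_start + width]
                acc[i] += sum(a * b for a, b in zip(a_block_row, x_block))
            col_start += width
        result.extend(acc)
    return result


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
x = [1, 0, 2, 1]

# Blocks of A are 2 x 3 and 2 x 1; blocks of x are 3 x 1 and 1 x 1.
y_blocked = block_matvec(A, x, row_sizes=[2, 2], col_sizes=[3, 1])
# A single 4 x 4 block gives the same answer.
y_whole = block_matvec(A, x, row_sizes=[4], col_sizes=[4])
```

Because each block product is independent until the partial sums are added, the block computations are the natural unit of work to distribute across nodes.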

Machine Learning in ECL: Logistic Runtimes, Hard Coded Mapping
Full Higgs Dataset: 11,000,000 x 28
[Bar chart: time in minutes (axis 0 to 35) for Higgs 1,000, Higgs 10,000, and Higgs 100,000; series: PB-BLAS and Non PB-BLAS.]

Machine Learning in ECL: Logistic Runtimes, Auto Mapping
Full Elsevier Dataset: 100,000 x 3,291
[Bar chart: time in minutes (axis 0 to 25) for Elsevier 100; series: PB-BLAS and Non PB-BLAS.]

Machine Learning in ECL: Logistic Runtimes, Auto Mapping
Full Elsevier Dataset: 100,000 x 3,291
[Bar chart: time in minutes (axis 0 to 450) for Elsevier 1,000; series: PB-BLAS and Non PB-BLAS.]

Project Summary
- Logistic Regression code and supporting functions have been documented and merged to the ECL-ML GitHub repository
- Auto block vector mapping function for any user who wants to use PB-BLAS
- Ready-to-use element-wise multiplication in PB-BLAS
- Updated debugging statements that give a clear understanding of errors
- Test functions for both block vector mapping function
- Sample code for using logistic regression
- Currently working on a K-means implementation that utilizes PB-BLAS

Linh Ngo, PhD; Alex Herzog, PhD; Michael Payne; Amy Apon, PhD
{lngo, aherzog, mpayne3, aapon}@clemson.edu
Big Data Systems Laboratory, Clemson University