The Social Network of Java Classes Power-law for Dummies

Diego Puppin (ISTI-CNR)
Institute for Information Sciences and Technology, Pisa, Italy
LabDay, March 24, 2006

Outline
1. Today's Menu
2. Appetizer: Choice of Mixed Topics
3. First Courses
   Social Networks and Power Law
   Java Software Engineering
4. Second Course: Salad of Good Ideas and Boiled Theories
   From Here to Ranking Software
   Experiments
5. ... the bill, please!

Today's Menu: Why in English?
- Anybody speaking good French or German?
- Are these slides going to be online somewhere, someday?
- I need to get ready for SAC 2006 :-)

Quick Check-pointing
- Document partitioning using the query logs
- Peer-to-peer discovery systems
- Scheduling on SUN machines
- Pretty sweet languages

...and then...
Hold on! Awesome findings on power laws and Java!

Appetizer: Choice of Mixed Topics

Document-partitioning using the query logs
- WHY: Started as a problem related to the scalability of the discovery system.
- WHAT FOR: General applicability to Web search engines; big research potential.
- WHAT: We identified a novel approach to partitioning documents using query logs, so as to minimize the number of queried partitions.
- To be presented at INFOSCALE 2006; raised the interest of Prof. Ricardo Baeza-Yates and of Dr. Fabrizio Silvestri... I almost forgot.

Peer-to-peer discovery systems
- WHAT FOR: In the context of XtreemOS.
- WHAT: Distributed solutions for an information service.
- We have one internship (tirocinio) going on (thanks, Hanien).
- Part of the syllabus of the class "Topic in Grid Computing" at the Indian Institute of Science, Bangalore: http://www.serc.iisc.ernet.in/ vss/courses/gc2005/

Scheduling on SUN machines
- WHY: My MS thesis at MIT does not want to die.
- WHAT FOR: Maybe we can show it off to SUN.
- WHAT: So far, we have one student internship (tirocinio) and one student thesis, plus Marco Pasquali.

Sweet sausages
I meant... languages.
- WHY: Different ways of doing components.
- WHAT: I studied Aspect Programming and Design-by-Contract. Along the way, Python and Ruby (pretty sweet).
- WHAT FOR: When all you have got is a hammer, everything looks like a nail.

If you want a full serving of these appetizers... just ask!

First Courses

Social Networks and Power Law: Social Networks
- Behavior emerging from linked entities
- Ranging from biology, to computer science, to economics
- First studies in the late 1930s
- Then, studies on random graphs by Erdős
- Recently, a good summary by Barabási

Emerging Behavior
- Rich Web pages getting richer
- Stocks rising or falling quickly
- Ants building nests
- Traffic jams
When counting links, visitors, or references in a natural network, large events are rare, but small ones are quite common.

Zipf, Pareto, Power Law
- Zipf's law: the size $y$ of an occurrence of an event is inversely proportional to its rank $r$: $y \sim r^{-b}$, with $b$ close to unity.
- Pareto: the number of events larger than $x$ is an inverse power of $x$: $P[X > x] \sim x^{-k}$.
- Power law: the number of events equal to $x$: $P[X = x] \sim x^{-(k+1)} = x^{-a}$.
- Also called a scale-free distribution, because the distribution doesn't change with scale.
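The exponents on this slide can be checked numerically. The sketch below is only an illustration, not part of the original talk: the helper `loglog_slope`, the constant 1000, and the values chosen for `b` and `k` are assumptions. It fits a straight line in log-log space, first to an ideal Zipf rank/size curve, then to the density obtained by differentiating a Pareto tail, which should have exponent k + 1.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    num = sum((a - mean_x) * (c - mean_y) for a, c in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

# Zipf: size y ~ r^(-b). With ideal data the fitted slope is -b.
b = 1.0  # assumed value, "close to unity"
ranks = list(range(1, 101))
sizes = [1000.0 * r ** -b for r in ranks]
zipf_slope = loglog_slope(ranks, sizes)

# Pareto: P[X > x] ~ x^(-k). The density is minus the derivative of the tail,
# so it decays as x^(-(k+1)); here it is approximated by a central difference.
k = 1.5  # assumed value
h = 1e-4
xs = [1.0 + 0.5 * i for i in range(1, 200)]
density = [((x - h) ** -k - (x + h) ** -k) / (2 * h) for x in xs]
density_slope = loglog_slope(xs, density)

print(f"Zipf slope:    {zipf_slope:.3f} (expected {-b})")
print(f"Density slope: {density_slope:.3f} (expected {-(k + 1)})")
```

On real data the fit would be noisy, and a plain least-squares fit on a log-log plot is only a rough estimate of the exponent.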

...may I have some more?
- Read Barabási's Linked (in our library).
- Seminal paper by Faloutsos (x3) (1999).

Java Software Engineering: Power-Law from Engineering Choices
- From "Power Law Distribution in Class Relationships": interesting power laws were found in the design of the Java Development Kit.
- Coherent SW engineering choices (modularity, orthogonality, interface contracts) lead to a scale-free network with power-law distributions of: number of sub-classes, number of fields, number of methods, number of constructors.
- Also in "Scale-free Networks from Optimal Design".

Power-Law from Social Behavior
We want to study the behavior that naturally emerges from the social network, at the level of static references.

Scale Free Geometry in OO Programs
- The authors take a snapshot of the memory at run time.
- They count the number of references (links) among objects.
- The numbers of inlinks and outlinks follow a power-law distribution.
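Counting inlinks and outlinks as described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the edge list, the node names, and the function `degree_histograms` are hypothetical, and the paper's actual tooling for reading a memory snapshot is not shown in the slides.

```python
from collections import Counter

def degree_histograms(edges):
    """Given (source, target) reference pairs, return two histograms:
    how many nodes have each in-degree, and how many have each out-degree."""
    in_degree = Counter()
    out_degree = Counter()
    nodes = set()
    for src, dst in edges:
        out_degree[src] += 1
        in_degree[dst] += 1
        nodes.update((src, dst))
    # Nodes that never appear as a target (or source) have degree 0.
    in_hist = Counter(in_degree[n] for n in nodes)
    out_hist = Counter(out_degree[n] for n in nodes)
    return in_hist, out_hist

# Hypothetical object graph: three objects point at one "hub" object.
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "d")]
in_hist, out_hist = degree_histograms(edges)
print(sorted(in_hist.items()))   # (in-degree, number of nodes) pairs
print(sorted(out_hist.items()))  # (out-degree, number of nodes) pairs
```

Plotting these histograms on log-log axes is the usual way to eyeball whether the degrees look power-law distributed.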

Scale Free Geometry in OO Programs (continued)
- No typical size of an object
- Few objects are very important
- Not CLASS, but object
- Interesting, but not useful for developers
- Dynamic analysis of code: interesting results from dynamic analysis (profiling, optimization) [GGMS03, vdab03], BUT dynamic data can be commercial secrets

Good for ACM
"Clustering Object-Oriented Software Systems using Spectral Graph Partitioning", by Spiros Xanthos, second place at the 2005 ACM Student Research Competition.

Second Course: Salad of Good Ideas and Boiled Theories (Ranking)

From Here to Ranking Software: How to Rank Software
- Similar to Google PageRank
- Static analysis of code links: INTERFACES ONLY
- Every time a class is used, there is a rank boost
- No source code is needed
- No runtime information is needed
- Only public interfaces

Why only interfaces?
- Commercial components will hide their source code
- They will also hide runtime profile information
- Interfaces must be public in order to use a component
- We suggest basing the ranking on this
- A composed application should also make its composition structure public: use/provide ports

Why should composition be public?
- To improve trust
- To become more popular
- To support a standard
- Open Source...
Compare with digital libraries: bibliographic references vs. full text.

Class Graph: Static vs Dynamic
- Abstract static information (program interface): Class Rank
- Executable information (program code): Comp. Rank
- Dynamic information (process): Scale Free

A Model for Ranking
Assumptions:
- Avoid ontologies: an ontology for the whole Grid?
- Heavy emphasis on LINKS: positive experience on the Web
- No use of source code and dynamic data: hidden in commercial apps
- Use only public interfaces
- Maybe semantic info can be used: sub/super-types...

Experiments: Initial Experiments
- Java classes, simple composition model
- Strong documentation (Javadoc)
- Unique ID (package names)
- Class links are very easy to see
- We collected 49,000 classes
- We parsed them and built a social graph

Social Network in Java Classes
- The Java programmer acts as a component user: s/he chooses the most general and useful classes
- Power-law behavior: Web pages, blogs, social networks, etc.; see "On Power-Law Relationships of the Internet Topology", by Faloutsos et al.
- Component usage resembles Web linking!!!
- Component search resembles Web search

Class Rank
To determine the rank of a class $C$, we iterate the following formula:

$$\mathrm{rank}_C = \lambda + (1 - \lambda) \sum_{i \in \mathrm{inlinks}_C} \frac{\mathrm{rank}_i}{\#\mathrm{outlinks}_i}$$

where $\mathrm{inlinks}_C$ is the set of classes that use $C$ (with a link into $C$), $\#\mathrm{outlinks}_i$ is the number of classes used by $i$ (number of links out of $i$), and $\lambda$ is a small factor, usually around 0.15.
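The iteration in the Class Rank formula can be sketched as follows. This is a minimal illustration, not the GRIDLE implementation: the function name `class_rank`, the starting rank of 1.0, the fixed iteration count, and the tiny usage graph (classes `A`, `B`, `C` using `String`) are all assumptions made for the example.

```python
def class_rank(uses, lam=0.15, iterations=50):
    """Iterate rank_C = lam + (1 - lam) * sum(rank_i / #outlinks_i)
    over the classes i that use C.

    `uses` maps each class name to the list of classes it references.
    """
    classes = set(uses)
    for targets in uses.values():
        classes.update(targets)

    # Number of distinct classes used by each class (links out of it).
    out_count = {c: len(set(uses.get(c, ()))) for c in classes}

    # Reverse the graph: for each class, the set of classes that use it.
    inlinks = {c: set() for c in classes}
    for src, targets in uses.items():
        for dst in targets:
            inlinks[dst].add(src)

    rank = {c: 1.0 for c in classes}  # assumed starting value
    for _ in range(iterations):
        # Every new rank is computed from the previous iteration's ranks.
        rank = {
            c: lam + (1 - lam) * sum(rank[i] / out_count[i] for i in inlinks[c])
            for c in classes
        }
    return rank

# Hypothetical usage graph: every class uses String, and A also uses B.
uses = {"A": ["String", "B"], "B": ["String"], "C": ["String"]}
ranks = class_rank(uses)
for name, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.4f}")
```

In this toy graph the widely used class ends up on top, while classes that nobody uses keep the base rank λ.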

Top-ranking classes
- String, Object, Class, Exception
- #7: Apache MessageResources
- #11: Tomcat CharChunk
- #14: DBXML Value
- #73: JXTA ID

GRIDLE 0.1
- Ranking using two metrics: TF.IDF (term frequency times inverse document frequency) and GRIDLE Rank
- Bells and whistles: snippets, links and reverse links
- http://gridle.isti.cnr.it
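For readers unfamiliar with TF.IDF, here is a small sketch of one common variant, weight = tf × log(N / df). The slides do not say which variant GRIDLE uses or how it is combined with GRIDLE Rank, so the weighting formula, the function `tf_idf`, and the sample "documentation" strings are all assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with
    weight = term frequency * log(number of documents / document frequency)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical class documentation snippets.
docs = [
    "string buffer for string building",
    "network socket stream",
    "string tokenizer stream",
]
weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    top_term = max(w, key=w.get)
    print(f"{doc!r}: top term {top_term!r}")
```

Terms that occur in few documents get a higher weight than terms that occur everywhere, which is what makes the score useful for matching a query against class documentation.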

A search engine for SW components
GRIDLE: Google-like Ranking, Indexing and Discovery service for a Link-based Eco-system of software components

... the bill, please!

Conclusions
- Usage of Java classes follows the pattern of links among Web pages
- We can rank classes according to this finding
- ...towards a search engine for software

...going to SAC 2006!
Please, give me feedback!!! Thanks for coming!!

Acknowledgments
- MIUR CNR Strategic Project L 499/97-2000 (5%)
- NextGrid
- CoreGRID
- Università degli Studi di Pisa
- ISTI-CNR