Performance Evaluation: Java Collections Framework
TECHNICAL WHITEPAPER
Author: Kapil Viren Ahuja
Date: October 17, 2008
Table of Contents

1 Introduction
  1.1 Scope of this document
  1.2 Intended audience
2 Evaluation Approach
  2.1 Comparison parameters
  2.2 Comparison scenarios
  2.3 Environment
  2.4 Execution and sampling
3 Measurements
  3.1 Insertion of unique elements (Long)
  3.2 Comparison of unique elements (Element)
  3.3 Comparison of non-unique elements (Element)
  3.4 Iteration over elements (Long)
  3.5 Iteration over elements (Element)
Appendices
  A Data Structure for custom class
  B List of Tables
  C Change Log
1 INTRODUCTION

Managing a list or collection of objects is a very common scenario, and managing such a collection effectively, so that it provides optimum performance, is an equally common need. The Java programming language offers many built-in data types for representing and modeling collections of objects. Some of the commonly used types are:

java.util.ArrayList
java.util.HashSet
java.util.TreeMap

Each of these data types behaves differently under different scenarios, and making the right choice is necessary when writing algorithms that must demonstrate the highest levels of performance. For many developers and architects it is not an easy choice. This document provides details of a comparison done across various data types supported by the Java Collections Framework, and studies their performance under different circumstances.

1.1 Scope of this document

This document provides performance data for various data types in the Java Collections Framework. It does not provide details of the Collections Framework itself, nor does it explain the data: performance is a factor of how each collection data type is implemented in Java and hence is subject to change from one implementation of the Java Virtual Machine specification to another. Developers interested in the reasons behind the performance figures are encouraged to read the Java documentation on the Sun Microsystems website.

This document does not contain any recommendations. It only covers performance results in a specific environment; how you interpret and use these results is entirely up to you.

1.2 Intended audience

All Java developers who are using, or intend to use, the Java Collections Framework while developing an application and want to decide which collection data type to use in a given scenario.
2 EVALUATION APPROACH

To benchmark the performance we have to establish some common rules that can be consistently applied to the various scenarios. These are listed below:

1. Comparison parameters
2. Comparison scenarios
3. Environment
4. Execution and sampling

2.1 Comparison parameters

For the success of any benchmark, it is critical that the various parameters are identified upfront; this enables a consistent comparison. We selected four different parameters for our comparison. These are explained below:

Collection size

The very first parameter used in the benchmarking process is the size of the collection itself, i.e. the number of elements it contains. Performance was benchmarked for sizes from 10,000 to 1,000,000 elements in multiples of 10. We did not consider sizes below 10,000, as results for the different data types were similar, and we did not go beyond 1,000,000 elements (500,000 for the custom Element type) because we were running into Java heap size issues.

Collection type

The second parameter used in the benchmarking process is the data type from the Java Collections Framework. These are listed below:

1. ArrayList
2. LinkedList
3. HashSet
4. TreeSet
5. Vector
6. HashMap
7. TreeMap
8. LinkedHashMap
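Besides performance, the eight data types above also differ in observable behavior such as iteration order. The short sketch below (our own illustration, not part of the benchmark) shows two of these differences: a TreeSet keeps elements sorted, while a LinkedHashMap preserves insertion order.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class OrderingDemo {
    public static void main(String[] args) {
        List<Long> data = Arrays.asList(3L, 1L, 2L);

        // TreeSet sorts its elements regardless of insertion order
        Set<Long> treeSet = new TreeSet<Long>(data);
        System.out.println(treeSet); // prints [1, 2, 3]

        // LinkedHashMap iterates its keys in insertion order
        Map<Long, String> linkedHashMap = new LinkedHashMap<Long, String>();
        for (Long d : data) {
            linkedHashMap.put(d, "v" + d);
        }
        System.out.println(linkedHashMap.keySet()); // prints [3, 1, 2]
    }
}
```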
Developers interested in understanding the Java Collections Framework are encouraged to read the following links: Wikipedia, IBM.

Data type of the elements stored

Another parameter used in the benchmarking process is the data type of the elements stored in the collection. Both in-built and user-defined data types were used, to provide coverage across different kinds of elements. These are listed below:

1. In-built data type: java.lang.Long
2. User-defined data type: a custom class called "Element". Instances of this class were created with random data during the exercise. Its structure is defined in Appendix A, Data Structure for custom class.

Sample size

The fourth and last parameter used in the benchmarking process is the sample size. It is a very common practice to repeat a process several times and collect data points. This ensures that the consistency of the behavior has been tested, and a correlation can easily be drawn on the data set. We performed 10 iterations for every scenario.

2.2 Comparison scenarios

To benchmark the performance, we identified a few very commonly used scenarios. These are explained below:

Insertion

One of the most basic requirements of a collection is to insert one or more elements. This scenario deals with the common use case of inserting elements into a collection. We evaluated two aspects of the scenario:

1. In the first scenario, we inserted unique elements into a collection. We used the value returned by hashCode() to identify the uniqueness of an element. This was tested for elements of data type Long, as two distinct objects return different values.
2. In the second scenario, we inserted non-unique elements into a collection. To create non-unique elements, the hashCode() method of the Element class was overridden to always return the same value. This case was tried only on the Set and Map data types, because only these two types filter out non-unique elements.

Iteration

Another very common use case is to iterate over a collection. We observed that in most cases, iterating over a collection is a more frequent scenario than insertion or deletion of elements.

2.3 Environment

Results of any performance benchmark depend on the environment in which the data is collected. For the purposes of this benchmark, the system specifications are listed below:

Hardware specifications

Parameter        Value
Processor        Intel Core 2 Duo CPU T8100
Number of CPUs   2
CPU speed        Both cores @ 2.10 GHz
RAM              3070 MB

Table 1: Hardware specifications

Software specifications

Parameter          Value
Operating System   Windows Vista Home Premium
Java runtime       Java 2 Runtime Environment, Standard Edition (build 1.5.0_08-b03)
IDE                Eclipse 3.3.2 (build M20080221-1800)

Table 2: Software specifications
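The paper does not reproduce its timing harness, but one iteration of the insertion scenario could be timed along these lines. This is a minimal sketch under our own assumptions (class and method names are illustrative, not from the benchmark):

```java
import java.util.ArrayList;
import java.util.List;

public class InsertionBenchmark {

    // Times one iteration of the insertion scenario: add `size` Long
    // elements to the given collection. Returns elapsed microseconds.
    static long timeInsertion(List<Long> target, int size) {
        long start = System.nanoTime();
        for (long i = 0; i < size; i++) {
            target.add(Long.valueOf(i));
        }
        return (System.nanoTime() - start) / 1000;
    }

    public static void main(String[] args) {
        int iterations = 10; // the sample size used in the paper
        int size = 10000;    // the smallest collection size measured
        for (int run = 0; run < iterations; run++) {
            long micros = timeInsertion(new ArrayList<Long>(), size);
            System.out.println("iteration " + run + ": " + micros + " µs");
        }
    }
}
```

The same loop shape applies to the other List implementations; Set and Map variants would use add()/put() on their respective interfaces.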
2.4 Execution and sampling

During the benchmarking exercise, all the scenarios were run as per the parameters agreed upon. Two approaches were available to us for recording samples. These are listed below:

Iterations for scenarios

In this approach, we considered one complete scenario as one sample. We then iterated over the same scenario 10 times and collected samples. For example, we inserted 10 records in an ArrayList and recorded one sample: the time taken to insert all 10 records in the collection. We repeated the process 10 times.

Elements in a collection

In this approach, we considered one element as one sample. For example, we inserted 10 records in an ArrayList and, for each element, recorded a sample: the time taken to insert that element into the collection. At the end of the use case, we had 10 samples.

For the purposes of this evaluation, we opted for the former approach, because in most common cases a user is interested in the performance of one complete operation. The latter approach is less useful because it does not provide diversified samples, so measuring the predictability of the collection is not feasible. In addition, collecting samples for complete iterations ensures that any variation while adding elements to the collection is captured.

Interpretation of results

The benchmark was prepared using various mathematical parameters. These are listed below:

S. No   Parameter            Description                                               Symbol   Unit
1       Iterations           Total number of times a specific scenario was performed   n        N.A.
2       Minimum time         Minimum time taken to complete an iteration               min      µs
3       Maximum time         Maximum time taken to complete an iteration               max      µs
4       Total time           Total time taken to complete all iterations               T        µs
5       Mean time            Average time taken to complete an iteration               M        µs
6       Standard deviation   Standard deviation of the operation from the mean         σ        µs
7       Outside 1 sigma      Number of samples outside m±σ                             m±σ      N.A.
8       Outside 2 sigma      Number of samples outside m±2σ                            m±2σ     N.A.
9       Outside 3 sigma      Number of samples outside m±3σ                            m±3σ     N.A.

Table 3: Mathematical parameters for benchmark

To compare collections for a given scenario, the following two factors should be looked at collectively:

1. Mean time: the average time consumed to perform one iteration of a scenario, calculated as the average of the times taken for all iterations.
2. Samples outside sigma: standard deviation determines the stability of a distribution. If few samples fall outside the lower and upper control limits formed by the mean and the standard deviation, the distribution is considered more stable.

When comparing two or more data types, we should look for the data type that is fastest in a given scenario. However, if the faster data type is less stable, we cannot predict the same performance every time; in a real scenario there is a higher probability of the data type running slower or faster than expected. A more stable data type that is a little slower in execution may therefore be the better option.
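The mean, standard deviation, and outside-sigma counts used in the tables that follow can be computed from the raw iteration times as sketched below. The paper does not show its own computation; this is our reconstruction of the standard formulas, with an illustrative sample set.

```java
public class SampleStats {

    // Mean of the recorded iteration times (parameter M in Table 3)
    static double mean(double[] samples) {
        double sum = 0;
        for (double s : samples) sum += s;
        return sum / samples.length;
    }

    // Population standard deviation (parameter σ in Table 3)
    static double stdDev(double[] samples) {
        double m = mean(samples);
        double sq = 0;
        for (double s : samples) sq += (s - m) * (s - m);
        return Math.sqrt(sq / samples.length);
    }

    // Number of samples falling outside the control limits mean ± k·σ
    static int outside(double[] samples, int k) {
        double m = mean(samples);
        double sd = stdDev(samples);
        int count = 0;
        for (double s : samples) {
            if (s < m - k * sd || s > m + k * sd) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Illustrative data: ten iteration times with one outlier
        double[] times = {1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 5.0};
        // For this sample set only the 5.0 outlier falls outside m±σ
        System.out.printf("mean=%.2f sd=%.2f outside1sigma=%d%n",
                mean(times), stdDev(times), outside(times, 1));
    }
}
```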
3 MEASUREMENTS

3.1 Insertion of unique elements (Long)

Comparison for data size of 10000 elements (Long)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
ArrayList       0.8       2             1              1
LinkedList      0.8       12            1              0
HashSet         1.28      1             1              1
TreeSet         6.8       2             1              1
Vector          1.2       1             1              1
HashMap         1.48      5             1              1
TreeMap         6.36      5             1              1
LinkedHashMap   1.88      1             1              1

Table 4: Results for insertion of 10000 unique elements (Long)

Comparison for data size of 100000 elements (Long)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
ArrayList       25.6      4             1              0
LinkedList      54.96     1             1              7
HashSet         63.12     2             1              1
TreeSet         126.08    2             2              2
Vector          20.04     5             3              0
HashMap         42.68     4             3              0
TreeMap         99.12     7             1              0
LinkedHashMap   59.12     8             2              0

Table 5: Results for insertion of 100000 unique elements (Long)

Comparison for data size of 1000000 elements (Long)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
ArrayList       165.44    1             1              1
LinkedList      378.52    1             1              1
HashSet         675.24    3             1              1
TreeSet         1253.16   6             2              0
Vector          231       1             1              1
HashMap         711.52    1             1              1
TreeMap         1215.25   10            1              0
LinkedHashMap   863.48    2             1              1

Table 6: Results for insertion of 1000000 unique elements (Long)

3.2 Comparison of unique elements (Element)

Comparison for data size of 10000 elements (Element)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
ArrayList       3.12      4             1              1
LinkedList      1.28      1             1              1
HashSet         1.28      2             2              2
Vector          1.28      2             2              2
HashMap         1.24      1             1              1
LinkedHashMap   1.92      2             2              1

Table 7: Results for insertion of 10000 unique elements (Element)

Comparison for data size of 100000 elements (Element)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
ArrayList       51.76     1             1              1
LinkedList      45.52     7             0              0
HashSet         52.44     3             2              1
Vector          20.6      4             2              1
HashMap         1.28      2             2              2
LinkedHashMap   78.04     9             1              0

Table 8: Results for insertion of 100000 unique elements (Element)

Comparison for data size of 500000 elements (Element)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
ArrayList       188.48    1             1              1
LinkedList      298.44    2             1              1
HashSet         444.2     1             1              1
Vector          169.08    11            0              0
HashMap         437.4     1             1              1
LinkedHashMap   530.44    1             1              1

Table 9: Results for insertion of 500000 unique elements (Element)

3.3 Comparison of non-unique elements (Element)

Comparison for data size of 10000 elements (Element)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
HashSet         1218.08   12            0              0
HashMap         1233.08   4             2              0
LinkedHashMap   1234.96   9             0              0

Table 10: Results for insertion of 10000 non-unique elements (Element)

Notice that the time taken to insert 10000 non-unique elements is significantly higher than for unique elements. This clearly shows that such cases should be avoided unless necessary. Because of this observation, we did not carry out any further benchmarking of this scenario.

3.4 Iteration over elements (Long)

Comparison for data size of 10000 elements (Long)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
ArrayList       0.64      1             1              1
LinkedList      0         0             0              0
HashSet         0.64      1             1              1
TreeSet         0         0             0              0
Vector          0         0             0              0
HashMap         1.84      3             3              0
TreeMap         5.64      9             0              0
LinkedHashMap   2.56      4             4              0

Table 11: Results for iteration of 10000 elements (Long)

Comparison for data size of 100000 elements (Long)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
ArrayList       7.52      1             1              0
LinkedList      8         1             1              1
HashSet         17        1             1              1
TreeSet         9.96      11            1              1
Vector          4.4       7             0              0
HashMap         28.76     11            1              0
TreeMap         97.96     6             1              0
LinkedHashMap   71.12     10            0              0

Table 12: Results for iteration of 100000 elements (Long)

Comparison for data size of 1000000 elements (Long)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
ArrayList       33.8      2             2              1
LinkedList      33.08     8             1              1
HashSet         53.64     2             2              0
TreeSet         35.12     5             3              0
Vector          29.4      7             2              0
HashMap         683.16    1             1              1
TreeMap         1219.92   4             1              1
LinkedHashMap   844.96    1             1              1

Table 13: Results for iteration of 1000000 elements (Long)

3.5 Iteration over elements (Element)

Comparison for data size of 10000 elements (Element)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
ArrayList       2.48      4             4              0
LinkedList      1.28      1             1              1
HashSet         1.28      1             1              1
Vector          1.28      2             2              2
HashMap         1.28      1             1              1
LinkedHashMap   1.88      3             3              0

Table 14: Results for iteration of 10000 elements (Element)

Comparison for data size of 100000 elements (Element)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
ArrayList       51.16     1             1              1
LinkedList      54.92     19            0              0
HashSet         54.32     3             3              1
Vector          19.26     4             1              1
HashMap         51.84     8             3              0
LinkedHashMap   80.96     7             0              0

Table 15: Results for iteration of 100000 elements (Element)

Comparison for data size of 1000000 elements (Element)

Data type       M (µs)    Outside m±σ   Outside m±2σ   Outside m±3σ
ArrayList       187.28    1             1              1
LinkedList      325.12    1             1              1
HashSet         470.44    5             1              1
Vector          177.24    15            0              0
HashMap         448.4     5             1              1
LinkedHashMap   548.44    1             1              1

Table 16: Results for iteration of 1000000 elements (Element)
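The iteration scenario for the Map types could be timed along the lines below. This is an illustrative sketch under our own assumptions, not the paper's code; the checksum exists only to touch every element and keep the loop from being optimized away.

```java
import java.util.HashMap;
import java.util.Map;

public class IterationBenchmark {

    // Times one full pass over a map's entries, in microseconds.
    static long timeIteration(Map<Long, Long> map) {
        long checksum = 0;
        long start = System.nanoTime();
        for (Map.Entry<Long, Long> entry : map.entrySet()) {
            checksum += entry.getValue().longValue(); // touch every element
        }
        long elapsed = (System.nanoTime() - start) / 1000;
        // Use the checksum so the JIT cannot eliminate the loop as dead code
        if (checksum == Long.MIN_VALUE) throw new IllegalStateException();
        return elapsed;
    }

    public static void main(String[] args) {
        Map<Long, Long> map = new HashMap<Long, Long>();
        for (long i = 0; i < 10000; i++) {
            map.put(Long.valueOf(i), Long.valueOf(i));
        }
        System.out.println("one pass: " + timeIteration(map) + " µs");
    }
}
```

For the Collection types (ArrayList, LinkedList, HashSet, etc.) the same pattern applies with a plain for-each loop over the collection itself.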
Appendix A

A DATA STRUCTURE FOR CUSTOM CLASS

package com.kapil.spikes.collections;

public class Element {

    private Long identifier;

    public Element(Long identifier) {
        this.identifier = identifier;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((identifier == null) ? 0 : identifier.hashCode());
        return result;

        // Returning a constant value of 1 will make all objects equal
        // return 1;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) {
            return true;
        }
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final Element other = (Element) obj;
        if (identifier == null) {
            if (other.identifier != null) {
                return false;
            }
        } else if (!identifier.equals(other.identifier)) {
            return false;
        }
        return true;
    }
}
Appendix B

B LIST OF TABLES

Table 1: Hardware specifications
Table 2: Software specifications
Table 3: Mathematical parameters for benchmark
Table 4: Results for insertion of 10000 unique elements (Long)
Table 5: Results for insertion of 100000 unique elements (Long)
Table 6: Results for insertion of 1000000 unique elements (Long)
Table 7: Results for insertion of 10000 unique elements (Element)
Table 8: Results for insertion of 100000 unique elements (Element)
Table 9: Results for insertion of 500000 unique elements (Element)
Table 10: Results for insertion of 10000 non-unique elements (Element)
Table 11: Results for iteration of 10000 elements (Long)
Table 12: Results for iteration of 100000 elements (Long)
Table 13: Results for iteration of 1000000 elements (Long)
Table 14: Results for iteration of 10000 elements (Element)
Table 15: Results for iteration of 100000 elements (Element)
Table 16: Results for iteration of 1000000 elements (Element)
Appendix C

C CHANGE LOG

ID   Description                   User                Date
1    First draft of the benchmark  Kapil Viren Ahuja   2008-10-08
2    Published                     Kapil Viren Ahuja   2008-10-21