Parallel Implementation of the NIST Statistical Test Suite


Alin Suciu, Iszabela Nagy, Kinga Marton, Ioana Pinca
Computer Science Department, Technical University of Cluj-Napoca, Cluj-Napoca, Romania
Alin.Suciu@cs.utcluj.ro, Iszabela.Nagy@yahoo.com, Kinga.Marton@cs.utcluj.ro, Pinca.Ioana@gmail.com

Abstract: Randomness test suites constitute an essential component in the process of assessing random number generators with a view to determining their suitability for a specific application. Evaluating the randomness quality of the sequences produced by a given generator is not an easy task, considering that no finite set of statistical tests can assure perfect randomness; instead, each test attempts to rule out sequences that show deviations from perfect randomness by means of certain statistical properties. This is why several batteries of statistical tests are applied, to increase the confidence in the selected generator. Therefore, in the present context of constantly increasing volumes of random data that need to be tested, special importance has to be given to the performance of the statistical test suites. Our work follows this direction: this paper presents the results of improving the well-known NIST Statistical Test Suite (STS) by introducing parallelism and a paradigm shift towards byte processing, delivering a design that is better suited to today's multicore architectures. Experimental results show a very significant speedup, of up to 103 times compared to the original version.

Keywords: statistical testing, random and pseudorandom number generators, NIST STS, parallel implementation

I.
INTRODUCTION

The existence of random and pseudorandom number sequences is of paramount importance in various domains such as cryptography, computational mathematics, simulation and modeling, constituting an essential input for applications requiring unpredictability, irreproducibility, uniform distribution or other specific properties of random number sequences. However, selecting an appropriate randomness generator is not an easy task, as the desired quality of randomness may and does differ from one application domain to another. Hence, the selection of a suitable generator has to rely on a thorough analysis of the randomness source and the randomness properties of the output sequences. Statistical test suites play a very important role in this analysis, assessing the randomness quality of sequences produced by random number generators. In computing the degree of suitability, statistical tests rely on measuring certain properties of random number sequences in terms of probability and, based on the likely outcome of the tests when applied to perfect random sequences, can highlight possible deviations from randomness. Yet no finite set of statistical tests can be considered complete, and consequently none can ensure perfect randomness. In order to increase the confidence in the selected generator, different statistical tests have to be applied to the generated random sequences. This highlights the need for constructing efficient statistical tests that can detect certain kinds of weaknesses in randomness generators. There are several well-known batteries of statistical tests for random number generators, such as the Diehard test suite [4] developed by Marsaglia, John Walker's ENT [5] and TestU01 [6] designed by L'Ecuyer and Simard.
The most popular is the NIST statistical test suite [1], developed by the National Institute of Standards and Technology (NIST) as the result of a comprehensive theoretical and experimental analysis, and it may be considered the state-of-the-art in randomness testing [8] for cryptographic applications. The test suite became a standard stage in assessing the output of random number generators shortly after its publication. The NIST battery of tests is based on statistical hypothesis testing and contains a total of 16 statistical tests specially designed to assess the randomness required for cryptographic applications (out of which two tests are currently disregarded because of problems found by NIST and other researchers [2]). A hypothesis test is a procedure for determining whether a given assertion is true; in this case, the computed P-values determine whether or not the tested sequence is random from the perspective of the selected randomness statistic. Applying several batteries of statistical tests to large sequences of random numbers is a very time-consuming operation; therefore, in order to satisfy the increasing demand for large volumes of random data, the development of high-throughput randomness generators must be combined with high-performance statistical test suites that take advantage of the new generation of multicore architectures.

This work was supported by the CNMP funded CryptoRand project, nr. 11-020/2007 and PRODOC POSDRU/6/1.5/S/5 ID 7676. 978-1-4244-8230-6/10/$26.00 ©2010 IEEE

Unfortunately, the implementations of the most popular batteries of statistical tests are not focused on efficiency and high performance, do not benefit from the processing power offered by today's multicore processors, and tend to become bottlenecks in the processing of the large volumes of data generated by various random number generators. Hence there is a stringent need for highly efficient statistical tests, and our efforts to improve and parallelize the NIST test suite intend to fill this need. The paper is structured in 5 sections. Section 2 presents the original NIST test suite, briefly describing the included statistical tests. Section 3 points out the main drawbacks of the original implementation and introduces the optimization steps, followed by experimental results in Section 4, demonstrating the effectiveness of our approach. Section 5 presents final conclusions and further work.

II. THE NIST TEST SUITE

After carrying out a comprehensive investigation of known statistical tests for random and pseudorandom generators, the US National Institute of Standards and Technology (NIST) published its results and recommendations for practical usage in [1]. The 16 proposed statistical tests form the NIST Statistical Test Suite, which shortly became the most popular battery of statistical tests for assessing the quality of sequences produced by a given random number generator. The test suite initially consisted of 16 statistical tests, but recently the Discrete Fourier Transform (Spectral) test and the Lempel-Ziv Compression test were disregarded due to problems identified by NIST and other researchers [2]. Out of the remaining 14 tests, we found 9 tests that are suitable for parallel implementation.
The tests are implemented sequentially in the C programming language and can be applied to input data (usually files containing binary random sequences), the results indicating the degree to which properties of the input data deviate from the properties expected of perfect random sequences, expressed by means of P-values. In the following we give a short description of the statistical tests that we parallelized, preceded by a short presentation of the way statistical tests work.

Each statistical test has a relevant randomness statistic and is formulated to test a null hypothesis (H0). The null hypothesis under test, in the case of the NIST tests, is that the sequence being tested is random, and the alternative hypothesis (Ha) is that the tested sequence is not random. Mathematical methods determine a reference distribution of the selected statistic under the null hypothesis, and a critical value is selected. Each test derives a decision based on the comparison between the critical value and the test statistic value computed on the sequence being tested; according to this decision it accepts (test statistic value < critical value) or rejects (test statistic value > critical value) the null hypothesis, concluding on whether the tested generator is or is not producing random numbers [1].

A. The Frequency (Monobit) Test

The frequency test determines whether zero and one bits appear in the tested sequence with approximately the same probability. This simple test can reveal the most obvious deviations from randomness, hence further tests depend on its result.

B. Frequency Test within a Block

The frequency test within a block is a generalization of the Frequency (Monobit) test, having the purpose of determining the frequency of zeros and ones within M-bit blocks and thus revealing whether zeros and ones are uniformly distributed throughout the tested sequence.

C.
Runs Test

In order to determine whether transitions between zeros and ones in the sequence appear as often as expected of a random sequence, the runs test counts the total number of runs of various lengths. A run consists of an uninterrupted sequence of identical bits.

D. Longest Run of Ones in a Block Test

In the case of the longest run of ones in a block test, the sequence is processed in M-bit blocks, with the aim of determining whether the length of the longest run of ones in a block is consistent with the length expected of a random sequence.

E. Non-overlapping Template Matching Test

The purpose of this test is to detect generators that produce too many occurrences of a given non-periodic pattern, by searching for occurrences of a given m-bit non-periodic pattern.

F. Overlapping Template Matching Test

The overlapping template matching test is similar to the non-overlapping template matching test, but it extends the search criteria to overlapping patterns.

G. Linear Complexity Test

The purpose of this test is to determine the linear complexity of the LFSR (Linear Feedback Shift Register) that could generate the tested sequence. If the complexity is not sufficiently high, the sequence is non-random.

H. Serial Test

In order to verify the uniformity of templates, the test counts the occurrences of every possible m-bit overlapping pattern in the sequence. A high level of uniformity (patterns occur with the same probability) indicates that the sequence is close to random.

I. Approximate Entropy Test

The purpose of the approximate entropy test is to compare the frequency of overlapping patterns of two consecutive lengths, m and m+1, against the expected frequency in a true random sequence.

III. THE OPTIMIZATION STEPS

There are several limitations we encountered using the original NIST test suite, especially when applied to large volumes of random number sequences, which caused a significant setback in performance and motivated our efforts to eliminate the drawbacks and provide an important performance boost. In the following we point out the main limitations and give an insight into the optimization steps performed in order to reach a high-performance implementation of the test suite.

A. Paradigm shift towards byte processing mode

The original NIST implementation stores and processes the input sequence, after it is read from the input file, as an array of bits, a very time-consuming transformation process. Our design, on the other hand, introduces byte processing mode, where bit-level operations are performed with the help of special precomputed lookup tables.

B. Allowing the processing of very large files

The original version of the test suite shows a severe limit on the maximum file size that can be processed in one execution. This drawback is due to the storage method of the tested sequence (array of bits) and the limitation in the variable definition. The maximum limit is theoretically set to 256 MB (the maximum value of an int type variable represented on 32 bits), but as practice has shown us, this value generally reduces to 198 MB in the original NIST implementation. In contrast, our implementation allows the processing of files of sizes up to 1 GB and eliminates the need to provide the input file size in number of bits (necessary in the original version).

C. Parallelization of the algorithms using the OpenMP API

While the first two optimization steps alone generated an average speedup of 13.45 times, as we have shown in [9], the third optimization step aims at further increasing the speedup by introducing parallelism (using the OpenMP API), which is in fact the focus of this paper.
As the spread of multicore processor systems increases, sequential implementations are no longer suitable solutions and become a bottleneck in the utilization of the available processing power. Shared memory multiprocessing (SMP) systems, such as the above-mentioned multicore systems, support program annotation standards, namely the insertion of compiler directives into serial programs for code parallelization. This method is the most common and cost-effective for SMP systems. Our parallel implementation of the NIST statistical test suite, ParNIST, is designed to use OpenMP [7], a well-known directive-based API for parallel programming on SMPs that extends programming languages such as C, C++ and Fortran using a fork-join model of parallel execution. After analyzing each test, we reached the conclusion that data parallelism is more appropriate for some of the algorithms used by the statistical tests, while other tests are more suitable for internal algorithm parallelization. Therefore our design implies both data parallelization, where the tested sequence is split into equal-sized chunks on which the algorithm is applied in parallel, and internal parallelization of the statistical tests, which is performed by dividing the workload into equal chunks assigned to each thread. As the chunks are executed in parallel, shared resources are concurrently updated, and in order to avoid write conflicts and a wrong final result, synchronization methods are needed, such as private and shared variables, OpenMP's reduction constructs and critical regions.

IV. EXPERIMENTAL RESULTS

In order to benchmark the enhanced and parallel version of the NIST test suite, we comparatively tested the original NIST STS version 1.8 and the improved version with input files of different sizes, measuring the processing time without taking into account the time needed for reading the input files from disk.
The benchmark was performed on a system with an Intel Xeon E5405 processor at 2 GHz and 4 GB of RAM, running Windows 2008 Server, 64-bit edition. Each test was run 5 times for each of the 10 various-size input files in order to obtain an average execution time, and thus a more accurate result. Our implementation was parallelized on 8 threads, in order to use the entire processing power of the system. The comparative results for each of the improved tests are presented in the following, where the improved version is called ParNIST.

A. Frequency (Monobit) Test

In the original version of the Monobit test, the proportion of zeros and ones in the entire sequence is determined by computing the sum of all adjusted (-1, +1) digits in the sequence. Based on the fact that the total count of ones is complementary to the total count of zeros, our design computes just the number of ones, using lookup tables for byte processing mode. The introduced data parallelization is based on splitting the sequence into chunks according to the number of available threads; each thread computes the number of ones in its assigned chunk. Fig. 1 presents the comparative results for the Frequency (Monobit) test, where the parallel version shows a speedup of approximately 50 times over the original NIST implementation. The input files were selected with sizes up to 198 MB, as the original NIST version ceased to work on larger files.

Figure 1. Comparison of execution time for the Frequency (Monobit) Test.

B. Frequency Test within a Block

In the case of this statistical test, alongside the introduction of byte processing mode for determining the proportion of zeros and ones inside M-bit blocks (M a multiple of 8 for byte processing mode), similar to the global frequency test above, the improved version also allows the use of larger block sizes, instead of the restricted block size of the original version caused by a limitation in the variable definition. The parallelization works by assigning to each thread the processing of an equal number of blocks for which the proportion of ones has to be determined. The obtained results are then used in computing a sum to which reduction is applied. The parallel version of the frequency test within a block shows a significant decrease in execution time, of approximately 54 times compared to the original NIST version. Fig. 2 shows the results obtained for input files of up to 198 MB.

Figure 2. Comparison of execution time for the Frequency Test within a Block.

C. Runs Test

The method applied in order to determine the number of runs in the input sequence is to count the passes from zero to one and vice versa. The original version performs this task by comparing every two adjacent bits, whereas the enhanced version uses byte processing mode and data parallelization, resulting in a performance level approximately 35 times higher than the original version, as shown in Fig. 3. Tests were run on input files of up to 43 MB in size, as the original NIST test ceases to work on larger files.

Figure 3. Comparison of execution time for the Runs Test.

D. Longest Run of Ones in a Block Test

The test for the longest run of ones tabulates the frequencies of the longest run of ones in each block into categories and then selects the longest run. Our parallel implementation divides the data into chunks, and the obtained frequency values for each chunk are then synchronized through a reduction mechanism in order to obtain the longest run of ones. The experimental results presented in Fig. 4 demonstrate an average speedup of approximately 50 times compared to the original version.

Figure 4. Comparison of execution time for the Longest Run of Ones in a Block Test.

E. Non-overlapping Template Matching Test

For determining the number of occurrences of a pre-defined m-bit target substring in the sequence, the original version uses an m-bit window; in the case of non-overlapping matching, when the pattern is found the window slides m bits, otherwise only one bit, whereas when overlapping templates are considered, the window slides only one bit regardless of the result of the matching. Our version uses an 8-bit window where the template is compared with the bytes of the input sequence, which successively form a value of short int type; in non-overlapping template matching this value gets shifted left by 1 byte if the pattern is found and by 1 bit otherwise, while the overlapping template matching test always shifts this value by just one bit. Each time enough space is available, a new byte is added to the variable. Fig. 5 presents the comparative results for the non-overlapping template matching test, with an average speedup of about 18. The original NIST version ceases to work on files larger than 198 MB.

Figure 5. Comparison of execution time for the Non-overlapping Template Matching Test.

F. Overlapping Template Matching Test

While the parallelization of the non-overlapping template matching test implies splitting the number of templates among the available threads, in the case of the overlapping template matching test the parallelization is performed by assigning each thread an equal number of blocks for which the frequency of the pattern has to be computed. Fig. 6 shows the comparative results for the overlapping template matching test, with an average speedup of about 103. The original NIST version ceases to work on files larger than 198 MB.

Figure 6. Comparison of execution time for the Overlapping Template Matching Test.

G. Linear Complexity Test

The test computes the length of a linear feedback shift register (LFSR) using the Berlekamp-Massey algorithm, by splitting the sequence into blocks and computing the LFSR's length for each block. For this statistical test, the paradigm shift to byte processing cannot be applied, since the Berlekamp-Massey algorithm works only at bit level. However, in the parallel version each thread is assigned an equal number of blocks for which the length of the LFSR has to be computed. The decrease in execution time of about 30 times compared to the original NIST version, due solely to the parallelization of this test, is visible in Fig. 7. The input files were selected with sizes up to 198 MB, as the original NIST version ceased to work on larger files.

Figure 7. Comparison of execution time for the Linear Complexity Test.

H. Serial Test

Determining the frequency of each and every overlapping m-bit pattern across the entire sequence implies the transformation of every m-bit block into its corresponding decimal value. Since this test implies processing the sequence for different sizes of the m-bit pattern, different algorithms were used for each size of the pattern. Therefore, the parallelization actually consists in assigning to each thread the processing of a different code section. Together with the paradigm shift towards byte processing, the results for the parallel version of this statistical test show a significant performance boost, of about 64 times, depicted in Fig. 8. The maximal input file size allowed by the original NIST version is 198 MB.

Figure 8. Comparison of execution time for the Serial Test.

I. Approximate Entropy Test

This test is rather similar to the Serial Test in that the frequency of each and every overlapping m-bit pattern across the entire sequence has to be determined, in order to compare the frequency of overlapping patterns of two consecutive lengths, m and m+1, against the expected frequency. The parallelization introduced is based on assigning to each thread the processing of a different code section.

Figure 9. Comparison of execution time for the Approximate Entropy Test.

The improvements achieved by our implementation show a significant performance boost, of about 86 times, depicted in Fig. 9. The maximal input file size allowed by the original NIST version is 198 MB.

V. CONCLUSIONS AND FURTHER WORK

We have shown how the need for efficiently implemented, high-performance batteries of statistical tests can be satisfied by improving existing, well-known statistical test suites to overcome their performance limitations. The purpose of this paper is to present the improvements applied to the well-known NIST statistical test suite for assessing the quality of random and pseudorandom number generators, and to emphasize the efficiency of the improved, parallel version by displaying benchmark results that demonstrate a significant performance improvement. After introducing the important role randomness tests play in assessing the quality of random number generators, we stressed the need for highly efficient test suites and presented the optimization steps applied to the NIST battery of statistical tests in order to overcome its performance limitations and make it better suited for execution on today's multicore architectures. The applied optimization steps include the paradigm shift towards byte processing mode, the possibility to process large volumes of input data (up to 1 GB) and, the most significant improvement stage, the parallelization.
The result is a high-performance parallel version, which we called ParNIST and which was subjected to a thorough benchmarking process; the results show a significant performance improvement, the execution time being reduced by up to 103 times, and on average by approximately 54 times. Table I presents the average speedup compared to the original version for each test improved by ParNIST.

TABLE I. BENCHMARK RESULTS

Test No.  Test Name                               Speed-up
1         Frequency (Monobit) Test                      50
2         Frequency Test within a Block                 54
3         Runs Test                                     35
4         Longest Run of Ones in a Block Test           50
7         Non-overlapping Template Matching Test        18
8         Overlapping Template Matching Test           103
10        Linear Complexity Test                        30
11        Serial Test                                   64
12        Approximate Entropy Test                      86
          Average Speedup                               54

We are currently working on a parallel implementation that will target a Grid architecture; taking advantage of the benefits of the Grid, we believe that even greater speedups can be achieved.

REFERENCES

[1] A. Rukhin et al., "A statistical test suite for random and pseudorandom number generators for cryptographic applications", NIST Special Publication 800-22 (with revisions dated April 2010).
[2] S. Kim, K. Umeno, A. Hasegawa, "Corrections of the NIST Statistical Test Suite for Randomness", Cryptology ePrint Archive, Report 2004/018, 2004.
[3] D. E. Knuth, The Art of Computer Programming, 3rd ed., vol. 2: Seminumerical Algorithms, Addison-Wesley, 1998.
[4] G. Marsaglia, DIEHARD: A battery of tests of randomness, http://www.stat.fsu.edu/pub/diehard/, 1996.
[5] J. Walker, ENT - A pseudorandom number sequence test program, Fourmilab, http://www.fourmilab.ch, 1998.
[6] P. L'Ecuyer, R. Simard, "TestU01: A C library for empirical testing of random number generators", ACM Trans. Math. Softw., vol. 33, no. 4, article 22, 2007.
[7] B. Chapman, G. Jost, R. Van Der Pas, Using OpenMP: Portable Shared Memory Parallel Programming, MIT Press, 2007.
[8] B. Ryabko, A. Fionov, Basics of Contemporary Cryptography for IT Practitioners, World Scientific Publishing Company, 2005.
[9] A. Suciu, K. Marton, I. Nagy, I. Pinca, "Byte-oriented Efficient Implementation of the NIST Statistical Test Suite", Proc. of the 2010 IEEE International Conference on Automation, Quality and Testing, Robotics, Cluj-Napoca, Romania, May 28-30, 2010.