Overview of the intervals package


Richard Bourgon
06 June 2009

Contents

1 Introduction
2 Interpretation of objects
  2.1 As a subset of Z or R
  2.2 As a set of meaningful, possibly overlapping intervals
3 Floating point and intervals over R
4 Notes on implementation
  4.1 Endpoint representation
  4.2 Efficiency
  4.3 Checking validity
5 Session information

1 Introduction

The intervals package defines two S4 classes which represent collections of intervals over either the integers (Z) or the real number line (R). An instance of either class consists of a two-column matrix of endpoints, plus additional slots describing endpoint closure and whether the intervals are to be thought of as being over Z or R.

> library( intervals )
> x <- Intervals( matrix( 1:6, ncol = 2 ) )
> x
[1, 4]
[2, 5]
[3, 6]
> x[2,2] <- NA
> x[3,1] <- 6
> x

[1, 4]
[2, NA]
[6, 6]

Objects of class Intervals represent collections of intervals with common endpoint closure, e.g., all left-closed, right-open. More control over endpoints is permitted with the Intervals_full class. (Both classes are derived from Intervals_virtual, which is not intended for use by package users.)

> y <- as( x, "Intervals_full" )
> y <- c( y, Intervals_full( c(2,3,5,7) ) )
> closed(y)[2:3,1] <- FALSE
> closed(y)[4,2] <- FALSE
> rownames(y) <- letters[1:5]
> y
Object of class Intervals_full
5 intervals over R:
a [1, 4]
b (2, NA]
c (6, 6]
d [2, 5)
e [3, 7]

The size method gives the measure (counting measure over Z, or Lebesgue measure over R) of each interval represented in an object. The empty method identifies intervals that are in fact empty sets, which over R is not the same thing as having size 0. For example, [6, 6] has size 0 but contains the point 6, whereas (6, 6] is empty. (Valid objects must have each right endpoint no less than the corresponding left endpoint. When one or both endpoints are open, however, valid intervals may be empty.)

> size(x)
[1] 3 NA 0
> empty(x)
[1] FALSE NA FALSE
> empty(y)
[1] FALSE NA TRUE FALSE FALSE

2 Interpretation of objects

An Intervals or Intervals_full object can be thought of in two different modes, each of which is useful in certain contexts:

1. As a (non-unique) representation of a subset of Z or R.
2. As a collection of (possibly overlapping) intervals, each of which has a meaningful identity.

Figure 1: The Intervals_full object y, plotted with plot(y). The second row contains an NA endpoint, and is not shown in the plot. The empty interval is plotted as a hollow point. By default, vertical placement avoids overlaps but is compact. (Plot not reproduced here; rows a, c, d and e are drawn against a horizontal axis running from 1 to 7.)

2.1 As a subset of Z or R

The intervals package provides a number of basic tools for working in the first mode, where an object represents a subset of Z or R but the rows of the endpoint matrix do not have any external identity. Basic tools include reduce, which returns a sorted minimal representation equivalent to the original (dropping any intervals with NA endpoints), as well as interval_union, interval_complement, and interval_intersection.

> reduce( y )
Object of class Intervals_full
[1, 7]
> interval_intersection( x, x + 2 )
2 intervals over R:
[3, 4]
[6, 6]
> interval_complement( x )

(-Inf, 1)
(4, 6)
(6, Inf)

Note that combining x and its complement in order to compute a union requires mixing endpoint closure types; coercion to Intervals_full is automatic.

> interval_union( x, interval_complement( x ) )
Object of class Intervals_full
(-Inf, Inf)

The distance_to_nearest function treats its to argument in the first mode, as just a subset of Z or R; it treats its from argument, however, in the second mode, returning one distance for every row of the from object. In the example below, we also look at performance for large data sets (less than one second on a 2 GHz Intel Core 2 Duo Macintosh, although the time shown below will likely differ). A histogram of d is given in Figure 2.

> B <- 100000
> left <- runif( B, 0, 1e8 )
> right <- left + rexp( B, rate = 1/10 )
> v <- Intervals( cbind( left, right ) )
> head( v )
6 intervals over R:
[89346565.6787157, 89346578.7312074]
[8734528.32456678, 8734539.59384101]
[70845473.1618986, 70845475.8958628]
[16631298.0465591, 16631303.3358477]
[44932203.3207864, 44932218.7735821]
[79755224.4737744, 79755245.8314965]
> mean( size( v ) )
[1] 10.04129
> dim( reduce( v ) )
[1] 99013 2
> system.time( d <- distance_to_nearest( sample( 1e8, B ), v ) )
   user  system elapsed
  0.232   0.052   0.284

2.2 As a set of meaningful, possibly overlapping intervals

In some applications, each row of an object's endpoint matrix has a meaningful identity, and particular points from Z or R may be found in more than one row. To support this mode, objects may be given row names, which are propagated

through calculations when appropriate. The c methods simply stack objects (like rbind), preserving row names and retaining redundancy, if any.

Figure 2: Histogram of distances from a random set of points to the nearest point in v (plot title "Distance to nearest interval"; horizontal axis d, vertical axis Frequency). There is also a distance_to_nearest method for comparing two sets of intervals.

The interval_overlap method works in this mode. In the next example we use it to identify rows of v which are at least partially redundant, i.e., which intersect at least one other row of v. All rows overlap themselves, so we look for rows that overlap at least two rows:

> rownames(v) <- sprintf( "%06i", 1:nrow(v) )
> io <- interval_overlap( v, v )
> head( io, n = 3 )
$`000001`
[1] 1

$`000002`
[1] 2

$`000003`
[1] 3

> n <- sapply( io, length )
> sum( n > 1 )
[1] 1955
> k <- which.max( n )
> io[ k ]

$`002723`
[1] 2723 32908 78038

> v[ k, ]
002723 [51659394.800663, 51659424.9491307]
> v[ io[[ k ]], ]
002723 [51659394.800663, 51659424.9491307]
032908 [51659405.1616266, 51659414.5838721]
078038 [51659405.4410234, 51659419.3632795]

The which_nearest method also respects row identity, for both its to and from arguments. In addition to computing the distance from each from interval to the nearest point in to, it also returns the row index of the to interval (or intervals, in case of ties) located at the indicated distance.

Another function which operates in this mode is clusters, which takes a set of points or intervals and identifies maximal groups which cluster together, i.e., which are separated from one another by no more than a user-specified threshold. The following code is taken from the clusters documentation:

> B <- 100
> left <- runif( B, 0, 1e4 )
> right <- left + rexp( B, rate = 1/10 )
> y <- reduce( Intervals( cbind( left, right ) ) )
> w <- 100
> c2 <- clusters( y, w )
> c2[1:3]
[[1]]
2 intervals over R:
[368.363803718239, 372.503521507606]
[423.708616290241, 426.031748544782]

[[2]]
[686.433725059032, 695.915765305561]
[730.20518058911, 764.996966218281]
[849.269276950508, 863.005137098533]

[[3]]
2 intervals over R:
[1059.95753780007, 1068.73573756543]
[1070.96883933991, 1079.80115717235]
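The grouping idea behind clusters can be sketched in base R, without the package. The helper cluster_by_gap below is hypothetical and is not the package's implementation. It assumes closed, non-overlapping intervals given as a two-column numeric matrix, and it starts a new group whenever the gap to the previous interval exceeds the threshold w. Details such as the treatment of groups with a single member may differ from what clusters returns.

```r
# Hypothetical helper (not part of the intervals package): group
# non-overlapping closed intervals whose successive gaps are at most w.
cluster_by_gap <- function(m, w) {
  if (nrow(m) == 0) return(list())
  m <- m[order(m[, 1]), , drop = FALSE]   # sort rows by left endpoint
  # Gap between each left endpoint and the previous right endpoint.
  gaps <- m[-1, 1] - m[-nrow(m), 2]
  # A new group starts wherever the gap exceeds the threshold.
  group <- cumsum(c(TRUE, gaps > w))
  unname(lapply(split(seq_len(nrow(m)), group),
                function(i) m[i, , drop = FALSE]))
}

m <- matrix(c( 1,  3,
               5,  6,
              20, 25,
              26, 30), ncol = 2, byrow = TRUE)
groups <- cluster_by_gap(m, w = 4)
length(groups)   # 2: rows 1-2 form one group, rows 3-4 another
```

Because the rows are sorted first, a single comparison between neighbors is enough to decide group membership.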

3 Floating point and intervals over R

When type == "R", interval endpoints are not truly in R, but rather in the subset which can be represented by floating point arithmetic. (For the moment, this is also true when type == "Z". See Section 4.1.) This limits the endpoint values which can be represented; more importantly, if computations are performed on interval endpoints, it means that floating point error can affect whether or not endpoints coincide, whether intervals which meet at or near endpoints overlap one another, etc.

In spite of this potentially serious limitation, it is still often convenient to work with intervals with non-integer endpoints, including data where adjacent intervals exactly meet at a non-integer endpoint. To address this, the intervals package takes the following approach:

• Floating point representations of interval endpoints are assumed to be exactly equal (in the sense of == in R or C++) if and only if the user intends the real values corresponding to these representations to be exactly equal.
• For cases where floating point error and approximate equality are a concern, tools are provided to permit distinguishing between ambiguous and unambiguous intersection, union, etc.

In the next example, y1 does not literally overlap y2[2,], although R's all.equal function asserts that the gap between them is smaller than the default tolerance for equivalence up to floating point precision.

> delta <- .Machine[[ "double.eps" ]]^0.5
> y1 <- Intervals( c( .5, 1 - delta / 2 ) )
> y2 <- Intervals( c( .25, 1, .75, 2 ) )
> y1
[0.5, 0.999999992549419]
> y2
2 intervals over R:
[0.25, 0.75]
[1, 2]
> all.equal( y1[1,2], y2[2,1] )
[1] TRUE
> interval_intersection( y1, y2 )
[0.5, 0.75]
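The difference between literal and approximate equality in the example above can be reproduced with base R alone. In this minimal sketch, the names a and b are illustrative stand-ins for the right endpoint of y1 and the left endpoint of y2[2,]; no package functions are used.

```r
# Same tolerance as in the example; this is also all.equal's default
# tolerance for numeric comparisons.
delta <- sqrt(.Machine$double.eps)

a <- 1 - delta / 2   # stands in for the right endpoint of y1
b <- 1               # stands in for the left endpoint of y2[2, ]

a == b                    # FALSE: the two representations differ
isTRUE(all.equal(a, b))   # TRUE: relative difference is within tolerance
b - a                     # a small but strictly positive gap
```

Under the package's convention, only the first comparison matters for set operations, which is why the literal intersection contains no point near 1.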

The expand and contract methods, used with type = "relative", permit consideration of the maximal and minimal interval sets which are consistent with the nominal endpoints from the point of view of endpoint relative difference. The contract method, for example, contracts each interval in a collection so that the relative difference between original and contracted endpoints is equal to the tolerance delta. Thus, if a relative difference less than or equal to delta is our criterion for approximate floating point equality, the contracted object has endpoints which are approximately equal to those of the original even though the contracted object is a proper subset of the original. The expand method is similar, but generates a proper superset.

> contract( y1, delta, "relative" )
[0.500000007450581, 0.999999977648258]

We compute two separate intersections which bound the nominal intersection:

> inner <- interval_intersection(
+     contract( y1, delta, "relative" ),
+     contract( y2, delta, "relative" )
+ )
> inner
[0.500000007450581, 0.749999988824129]
> outer <- interval_intersection(
+     expand( y1, delta, "relative" ),
+     expand( y2, delta, "relative" )
+ )
> outer
2 intervals over R:
[0.499999992549419, 0.750000011175871]
[0.999999985098839, 1.00000000745058]

Finally, we identify points which may or may not be in the intersection, depending on whether we make a conservative, literal, or anti-conservative interpretation of the nominal endpoints.

> interval_difference( outer, inner )
Object of class Intervals_full
[0.499999992549419, 0.500000007450581)
(0.749999988824129, 0.750000011175871]
[0.999999985098839, 1.00000000745058]

The expand and contract methods have other uses as well. Here, we eliminate gaps of size 2 or smaller:

> x <- Intervals( c(1,10,100,8,50,200), type = "Z" )
> x
3 intervals over Z:
[1, 8]
[10, 50]
[100, 200]
> w <- 2
> close_intervals( contract( reduce( expand(x, w/2) ), w/2 ) )
2 intervals over Z:
[1, 50]
[100, 200]

4 Notes on implementation

4.1 Endpoint representation

For the moment, interval endpoints are always stored using R's numeric data type. Although this is wasteful from a memory and speed point of view, we do it for two reasons. First, use of R's Inf and -Inf (not possible with the integer type) is very convenient when computing complements. Second, the range of integers which can be represented using the numeric data type is considerably greater:

> .Machine$integer.max
[1] 2147483647
> numeric_max <- with( .Machine, double.base^double.digits )
> options( digits = ceiling( log10( numeric_max ) ) )
> numeric_max
[1] 9007199254740992

4.2 Efficiency

All computations are accomplished by treating intervals as pairs of tagged endpoints, sorting these endpoints (along with their tags), and then making a single pass through the results. Computational complexity for set operations is therefore $O(n \log n)$, where input object $i$ contains $n_i$ rows and $n = \sum_i n_i$. The same sorting approach is also used for interval_overlap, although if every interval in a query object of $m$ rows overlaps every interval in a target object of $n$ rows, generating output alone must of necessity be $O(mn)$.

Sorted endpoint vectors are not retained in memory. If one wishes to query a particular object over and over, repeated sorting would be inefficient; in practice so far, however, such repeated querying has not been needed.
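The sort-then-scan strategy of Section 4.2 can be sketched in base R. The function reduce_closed below is a hypothetical illustration, not the package's C++ code. It assumes closed intervals with no NA endpoints, supplied as a two-column numeric matrix. Each endpoint is tagged as an opening (+1) or a closing (-1), the endpoints are sorted, and one cumulative pass tracks how many intervals are currently open.

```r
# Hypothetical sketch (not the package's implementation): merge closed
# intervals by sorting tagged endpoints and scanning once.
reduce_closed <- function(m) {
  n   <- nrow(m)
  pos <- c(m[, 1], m[, 2])
  tag <- rep(c(1, -1), each = n)   # +1 = left endpoint, -1 = right endpoint
  # Sort by position; at ties, openings precede closings so that closed
  # intervals sharing an endpoint are merged.
  o   <- order(pos, -tag)
  pos <- pos[o]
  tag <- tag[o]
  depth  <- cumsum(tag)                    # intervals open after each endpoint
  starts <- pos[tag == 1 & depth == 1]     # depth rises from 0 to 1
  ends   <- pos[tag == -1 & depth == 0]    # depth falls back to 0
  cbind(starts, ends, deparse.level = 0)
}

m <- rbind(c(1, 4), c(2, 5), c(3, 7), c(10, 12), c(12, 15), c(20, 21))
r <- reduce_closed(m)
r   # rows: [1, 7], [10, 15], [20, 21]
```

The call to order dominates the cost at O(n log n), while the scan itself is linear in the number of endpoints. Handling open endpoints would require additional tie-breaking rules that this sketch omits.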

4.3 Checking validity

The code behind which_nearest and reduce (key methods in the intervals package, which may be directly called by the user and are also used internally in numerous locations) is written in C++ for efficiency. The compiled code makes a number of assumptions about the SEXP objects passed in as arguments, but does not explicitly check these assumptions. Nonetheless, when the R wrappers for the compiled code are applied to valid objects from the Intervals or Intervals_full classes, all assumptions will always be met.

This design decision was taken so that the requirements for individual objects and their contents could be gathered together in a single, natural location: the classes' validity functions. The intervals package provides replacement methods (e.g., type and closed) which implement error checking and preserve object validity. R's implementation of S4 classes, however, leaves object data slots exposed to the user. As a consequence, a user can directly manipulate the data slots of a valid Intervals or Intervals_full object in a way that invalidates the object, but does not generate any warning or error.

To prevent invalid objects from being passed to compiled code, and potentially generating segmentation faults or other problems, all wrapper code in this package includes a check_valid argument. This argument is set to TRUE by default, so that validObject is called on relevant objects before handing them off to the compiled code. For efficiency, users may choose to override this extra check if they are certain they have not manually assigned inappropriate content to objects' data slots.

5 Session information

• R version 3.2.2 (2015-08-14), x86_64-pc-linux-gnu
• Base packages: base, datasets, grDevices, graphics, methods, stats, utils
• Other packages: intervals 0.15.1
• Loaded via a namespace (and not attached): tools 3.2.2