Contents. 2.5 DBSCAN Pair Auto-Correlation Saving your results References... 18

Similar documents
DS504/CS586: Big Data Analytics Big Data Clustering II

DS504/CS586: Big Data Analytics Big Data Clustering II

Density-based clustering algorithms DBSCAN and SNN

University of Florida CISE department Gator Engineering. Clustering Part 4

Clustering Part 4 DBSCAN

DBSCAN. Presented by: Garrett Poppe

Distance-based Methods: Drawbacks

Clustering CS 550: Machine Learning

Insight: Measurement Tool. User Guide

COMP 465: Data Mining Still More on Clustering

Data Mining 4. Cluster Analysis

Multivariate Capability Analysis

1. ABOUT INSTALLATION COMPATIBILITY SURESIM WORKFLOWS a. Workflow b. Workflow SURESIM TUTORIAL...

Clustering Documentation

A New Online Clustering Approach for Data in Arbitrary Shaped Clusters

Nature Methods: doi: /nmeth Supplementary Figure 1

EN1610 Image Understanding Lab # 3: Edges

Cognalysis TM Reserving System User Manual

ECE 176 Digital Image Processing Handout #14 Pamela Cosman 4/29/05 TEXTURE ANALYSIS

Clustering Algorithms for Data Stream

Data Analysis and Solver Plugins for KSpread USER S MANUAL. Tomasz Maliszewski

Cluster Analysis (b) Lijun Zhang

Lecture-17: Clustering with K-Means (Contd: DT + Random Forest)

EE795: Computer Vision and Intelligent Systems

Unsupervised Learning : Clustering

Geostatistics 2D GMS 7.0 TUTORIALS. 1 Introduction. 1.1 Contents

ECG782: Multidimensional Digital Signal Processing

Supplementary Material

Cluster Analysis of Super Resolution Fluorescence Images Ki Woong Sung

Physics MRI Research Centre UNIFIT VERSION User s Guide

Fast and accurate automated cell boundary determination for fluorescence microscopy

ArcView QuickStart Guide. Contents. The ArcView Screen. Elements of an ArcView Project. Creating an ArcView Project. Adding Themes to Views

Texture Analysis. Selim Aksoy Department of Computer Engineering Bilkent University

COMPARISON OF DENSITY-BASED CLUSTERING ALGORITHMS

Analysis and Extensions of Popular Clustering Algorithms

The Projected Dip-means Clustering Algorithm

6. Object Identification L AK S H M O U. E D U

Points Lines Connected points X-Y Scatter. X-Y Matrix Star Plot Histogram Box Plot. Bar Group Bar Stacked H-Bar Grouped H-Bar Stacked

CHAPTER 7. PAPER 3: EFFICIENT HIERARCHICAL CLUSTERING OF LARGE DATA SETS USING P-TREES

Clustering Algorithm (DBSCAN) VISHAL BHARTI Computer Science Dept. GC, CUNY

STATS PAD USER MANUAL

Digital Image Processing. Prof. P.K. Biswas. Department of Electronics & Electrical Communication Engineering

2D Image Processing INFORMATIK. Kaiserlautern University. DFKI Deutsches Forschungszentrum für Künstliche Intelligenz

Lab 12: Sampling and Interpolation

Operators-Based on Second Derivative double derivative Laplacian operator Laplacian Operator Laplacian Of Gaussian (LOG) Operator LOG

Solve a Maze via Search

Automatic Partiicle Tracking Software USE ER MANUAL Update: May 2015

CSE 5243 INTRO. TO DATA MINING

ECG782: Multidimensional Digital Signal Processing

Lecture #5. Point transformations (cont.) Histogram transformations. Intro to neighborhoods and spatial filtering

Density Based Clustering using Modified PSO based Neighbor Selection

Algorithm User Guide:

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/09/2018)

Excel 2010 with XLSTAT

Implemented by Valsamis Douskos Laboratoty of Photogrammetry, Dept. of Surveying, National Tehnical University of Athens

Particle localization and tracking GUI: TrackingGUI_rp.m

Automated AFM Image Processing User Manual

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

6-1 THE STANDARD NORMAL DISTRIBUTION

Cell based GIS. Introduction to rasters

Data Mining Algorithms

1. Introduction Installation and requirements... 2

SynapsesSA. User Reference Manual. Version 1.0. Luis Rodriguez-Lujan Laura Antón-Sánchez Pedro Larrañaga Concha Bielza

TexRAD Research Version Client User Guide Version 3.9

Schedule for Rest of Semester

Working with Unlabeled Data Clustering Analysis. Hsiao-Lung Chan Dept Electrical Engineering Chang Gung University, Taiwan

Contrast Optimization A new way to optimize performance Kenneth Moore, Technical Fellow

Clustering Lecture 4: Density-based Methods

Morphological Image Processing

Notes. Reminder: HW2 Due Today by 11:59PM. Review session on Thursday. Midterm next Tuesday (10/10/2017)

4. TROUBLESHOOTING PREVIOUS VERSIONS RUN LOLITRACK ALWAYS AS ADMIN WIBU SOFTWARE PROTECTION... 30

CPSC 340: Machine Learning and Data Mining. Density-Based Clustering Fall 2016

OPTICS-OF: Identifying Local Outliers

Data and Function Plotting with MATLAB (Linux-10)

The Ball-Pivoting Algorithm for Surface Reconstruction

Anno accademico 2006/2007. Davide Migliore

SYDE Winter 2011 Introduction to Pattern Recognition. Clustering

ENVI Tutorial: Introduction to ENVI

Chapter 2 Basic Structure of High-Dimensional Spaces

How to Measure Wedge. Purpose. Introduction. Tools Needed

Introduction. Computer Vision & Digital Image Processing. Preview. Basic Concepts from Set Theory

FlowJo Software Lecture Outline:

Texture. Texture is a description of the spatial arrangement of color or intensities in an image or a selected region of an image.

Lecture 6: Multimedia Information Retrieval Dr. Jian Zhang

Histograms. h(r k ) = n k. p(r k )= n k /NM. Histogram: number of times intensity level rk appears in the image

Unsupervised learning on Color Images

CSE 252B: Computer Vision II

Algorithm User Guide:

Coarse-to-fine image registration

Region-based Segmentation

Determination of Optimal Epsilon (Eps) Value on DBSCAN Algorithm to Clustering Data on Peatland Hotspots in Sumatra

button in the lower-left corner of the panel if you have further questions throughout this tutorial.

Image Processing

CS443: Digital Imaging and Multimedia Binary Image Analysis. Spring 2008 Ahmed Elgammal Dept. of Computer Science Rutgers University

ENVI Classic Tutorial: Introduction to ENVI Classic 2

Tutorial: Using Tina Vision s Quantitative Pattern Recognition Tool.

Contents. CRITERION Vantage 3 Analysis Training Manual. Introduction 1. Basic Functionality of CRITERION Analysis 5. Charts and Reports 17

Modify Panel. Flatten Tab

University of Florida CISE department Gator Engineering. Clustering Part 5

CHAOS Chaos Chaos Iterate

Transcription:

Contents 1 Getting Started... 2 1.1 Launching the program:... 2 1.2 Loading and viewing your data... 2 1.3 Setting parameters... 5 1.4 Cropping points... 6 1.5 Selecting regions of interest... 7 2 Analysis... 8 2.1 Pairwise distance distribution... 8 2.2 Cumulative distance distribution... 9 2.3 Getis-based Clustering Analysis... 11 2.4 Spatial Statistics... 13 2.5 DBSCAN... 14 2.6 Pair Auto-Correlation... 16 3 Saving your results... 17 4 References... 18

1 Getting Started 1.1 Launching the program: Download the file SuperCluster.m into a directory on your MATLAB path. Launch the program either by typing SuperCluster at the command line or by opening the file in the MATLAB editor and hitting the run button. Required MATLAB toolboxes and classes: Image Processing Toolbox, Statistics Toolbox, DERIVEST Suite (used in autocorrelation curve fitting) 1.2 Loading and viewing your data Upon launching the program, you will be presented with the main GUI window:

First, select the source of your data either dynamically generated simulated data or data loaded from file. To load saved data from file, first enter the conversion factor between pixels (or whatever your data units might be) and nanometers. If your data is already in units of nanometers, leave this box empty: Next, load your data using the Browse button. Acceptable file types and formats are: -.mat files containing SR_demo objects -.mat files containing x-y coordinates of super resolution data -.mat files containing previous SuperCluster results files -.csv files containing x-y coordinates of super resolution data -.txt files containing comma-separated pairs of super resolution coordinates, with no headers, either in 2 rows or in 2 columns

If you choose to simulate data, enter the desired simulation parameters and hit the Simulate button:

When you ve selected your file or simulated data, the localizations will be plotted in the main GUI figure window: 1.3 Setting parameters At this point, you will want to set your parameters. 1 2 3 4 5 6 7 8 1. Minimum distance threshold: This is the minimum point-to-point distance that you wish to consider in your analyses. Point-to-point distances below this threshold will be considered noise and disregarded.

2. Maximum distance threshold: This is the maximum point-to-point distance that you wish to consider in your analyses. 3. Ripley cutoff: This is the largest radius r to be used in calculating the Ripley s K statistic (and related values). 4. Subregion size: For portions of the clustering analysis, including the Getis-based cluster analysis and the pair auto-correlation analysis, the initial ROIs are broken down into a number of equal-sized subregions (you can think of these subregions as bins for super-resolution localizations, or pixels) which are used to create a slightly lower resolution map of the ROI for area-based analyses (this helps to ameliorate the effects of multiple localizations). 5. Box size: This is the length of one side of a square ROI. Set this value if you wish to create an ROI of a specific size. If you plan to draw your ROI, you can leave this as the default. 6. Epsilon: This is the search radius used in the DBSCAN clustering algorithm. 7. Min. Count per Cluster: This is the minimum number of SR localizations to be considered a cluster. Used in both DBSCAN and the Getis-based clustering. 8. Combine PC Results: Combine the pair auto-correlation results for all ROIs. This will result in a single curve fit for the data set. Leaving this box unchecked will result in an individual curve fit for each ROI. ** NOTE: All units are in nanometers unless otherwise specified ** 1.4 Cropping points If your data set has an anomaly that you wish to remove to prevent artifacts in your analyses, you can do this using the Crop Points button. To remove a region from the data, first click the Crop Points button. You will see the cursor change into a plus sign. Next, (single left) click points around the area you wish to remove: (you will see a dotted line indicating the selected region)

When you reach your last point, double click the mouse to close the region and remove the points: You may repeat this operation as many times as you wish by repeating these steps. Please note that you may not crop points that are already inside an ROI. 1.5 Selecting regions of interest There are two ways to specify a region of interest (ROI): automatic and manual. For automatic specification: enter the desired box size (see section 1.3), then click the Select Regions button. Left click to place an ROI box in the data viewing pane (the clicked point will be the center point of the ROI). To place additional ROIs, repeat the previous steps. For manual specification: click the Select Regions button. Right click the mouse anywhere in the data viewing pane and then right click and drag to draw an ROI box. The size of this ROI will be retained and you can place additional ROIs of the same size by now following the steps for automatic ROI specification.

Note: You may use anywhere between 1 and 12 ROIs in a single data set, however using more than 4 ROIs will cause increased processing time, some garbled display text, and difficult to read plots. If you do not care about the display and only want the values, you may use up to 12 ROIs. 2 Analysis Now that you have selected your regions, it is time to move on to analyzing your data. The following are descriptions of the currently available analyses, with appropriate references provided at the end of this document. 2.1 Pairwise distance distribution Plot Distances calculates the pairwise distance between all localizations in the ROI, except those below or above the minimum and maximum distance thresholds (respectively) and displays the probability density function of the distances. The results are displayed side-by-side with the expected results of an equal number of randomly distributed points (drawn from random uniform distribution) in an ROI of the same size.

The numerical output (see section 3) of this function is a list of all the pairwise distances (thresholds are NOT applied), a list of the pairwise distances between random points for comparison, the probability histogram in the form of [bins, counts] (with thresholds applied). 2.2 Cumulative distance distribution The Cumulative Dist. function calculates and displays the cumulative density function of the pairwise distances (thresholds applied), as well as the comparison to random data. It also performs a comparison between the data using the 2 sample Chi-Squared test (this only applies if you are interested in differences between ROIs in your data set). If more than 2 ROIs are being analyzed, this test is performed in a pairwise fashion. The output (see section 3) of this function is the cumulative histogram of the pairwise distance in the form of [bins, counts], the empirical CDF, and the results of the Chi-Squared test (names of the regions being compared, a value indicating the acceptance or rejection of the null hypothesis (1 = reject, 0 = accept) and the p-value.

2.3 Getis-based Clustering Analysis This portion of the analysis involves breaking the ROI into equal-area subregions (see section 1.3). These subregions are then assigned a value ( ) based on the number of super-resolution localizations they each contain (a histogram image). Histogram image of ROI 1 (from above). The Getis statistical analysis has been used to analyze geographical clustering data, and is a method that indicates the local amount of clustering between subregions (pixels) of a given size in an image (1,2). The calculation is carried out based on the following equation: () = (), where is the pairwise distance between subregions and ; is the number of subregions in the ROI; and is a binary weight matrix where () = 1 if the pairwise distance,, between subregions and is less than the cutoff distance D c and () = 0 if is greater than D c. As you can see, this results in a ratio between the sum of the values of subregions within a distance of the subregion and the sum of the values of all subregions, which is a measure of local clustering. By evaluating max ( ()) for = [0, ), we can identify length scales over which the degree of local clustering is increasing (indicating continuous domain structure) and where it is not (indicating discontinuities in domain structure). This analysis method finds clusters based on local maxima of the Getis G statistic and provides domain visualization, as well as a histogram of the equivalent diameters, the convexity, compactness and distance to nearest neighboring cluster.

The output variables (see section 3) are: a cell array containing the values of () for = [0, ) at each pixel in the ROI, a 1 x n array of cluster aspect ratios, a 1 x n array of cluster equivalent diameters, a 1 x n array of the cluster density/whole cell density, a maximum projection

of pixel-by-pixel G values, a 1 x n array of nearest cluster distances, a cell array containing the alpha-hull of each cluster, a 1 x n array of the area of each cluster (nm^2), a cell array containing all of the SR localizations contained in each cluster, and the mean equivalent diameter. 2.4 Spatial Statistics The spatial statistics module calculates both the Hopkins statistic and Ripley s K function, with methods adapted from previous STMC work 3 (http://stmc.health.unm.edu/tools-anddata/index.html). The Hopkins statistic tests for spatial randomness by comparing nearest-neighbor distances from simulated randomly distributed points to their nearest real neighbors, and the nearest-neighbor distances within the set of real points. The Hopkins statistic is given by: = + where is the set of nearest-neighbor distances between real points and is the set of distances between the simulated random points and their nearest real neighbors. If the real points are randomly distributed, the value of will be approximately ½, while clustered points will result in an value close to 1. Points that are uniformly distributed will result in an that is approaching 0. Ripley s K function (and related statistics) are based on the detection of deviation from spatial homogeneity. Ripley s K function is given by: () = () where () is the expected number of points within a distance of the th point, is the point density (approximated by the number of points divided by the area of the region), and is the total number of points. The expected value of this function is. The function can be normalized to give a linear expected value: () = () and further normalized so that the expected value is 0: () = ()

The Spatial Statistics function calculates and plots, (), (), and (). The output (see section 3) includes an array containing the values of, and a 1x4 cell array containing the values of (), the distance step-size, (), and (). 2.5 DBSCAN Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996 4.

DBSCAN's definition of a cluster is based on the notion of density reachability. A point q is directly density-reachable from a point p -neighborhood) and if p is surrounded by sufficiently many points such that one may consider p and q to be part of a cluster. q is called density-reachable (note the distinction from "directly density-reachable") from p if there is a sequence p 1,,p n of points with p 1 = p and p n = q where each p {i+1} is directly density-reachable from p i. Note that the relation of density-reachable is not symmetric. q might lie on the edge of a cluster, having insufficiently many neighbors to count as dense itself. This would halt the process of finding a path that stops with the first non-dense point. By contrast, starting the process with p would lead to q (though the process would halt there, q being the first non-dense point). Due to this asymmetry, the notion of density-connected is introduced: two points p and q are densityconnected if there is a point o such that both p and q are density-reachable from o. Densityconnectedness is symmetric. A cluster, which is a subset of the points of the database, satisfies two properties: 1. All points within the cluster are mutually density-connected. 2. If a point is density-connected to any point of the cluster, it is part of the cluster as well. Output variables (see section 3): a cell array containing the number of SR localizations per DBSCAN cluster, a cell array containing the convex hull of each DBSCAN cluster, a 1 x n array of cluster equivalent diameters, the mean cluster equivalent diameter, and a cell array containing the area of each DBSCAN cluster (based on convex hull).

2.6 Pair Auto-Correlation This analysis is a method developed by Sarah Veatch and colleagues 5. Pair correlation functions quantify organization in heterogeneous systems and are easily applied to super-resolution localization data. The pair auto-correlation function, g(r), that reports the increased probability of finding a second localized signal a distance r away from a given localized signal, is efficiently calculated using Fast Fourier Transforms, and can account for complex boundary shapes without additional assumptions. The pair autocorrelation code was written by Sarah Veatch and the complimentary fitting code was written by Keith Lidke. The pair auto-correlation analysis can be performed on each ROI individually, or on the combined ROIs (shown below). Output variables (see section 3): a 1x4 cell array containing: the ROI image; a matrix with the radius values in column 1, the angularly averaged autocorrelation function in column 2, and the errors of the angularly averaged autocorrelation function in column 3; the average density rho; and the 2D autocorrelation function; a matrix containing the estimated fitting parameters (objects per domain, sigma for 2D Gaussian domain size, observations per object, sigma for Gaussian localization precision) with one row for each ROI or one row for the combined ROIs; and the fit model generated from these parameters.

3 Saving your results The plots and output variables generated by this program are not saved automatically. You must use the Save Variables button to do this. Upon clicking the Save Variables button, you will be prompted to specify the save location and name of the results file. The results file is a.mat file containing two variables: points, which is the x- y coordinates of the original data, and regs, which is a 1xn struct with the following fields: points - the x-y coordinates of the points in that ROI ROI - rectangular ROI that you selected area - area of the ROI in nm^2 dists - pairwise distances between all points in the ROI label - (internal program use, ignore) name - (internal program use, ignore) randoms - a set of randomly generated points that would fit in the ROI, used for comparison between the real data and a similar set of random points (uniform random) randists - pairwise distance between all of the random points disthist - histogram of pairwise distances, excluding values above or below the designated cutoffs cumhist - cumulative histogram of same data as in disthist cdf - empirical cdf of pairwise distances chidats - result of chi^2 test between cdfs (this is only applicable if you care about a statistical comparison between regions on the same cell) G - a cell array containing the values of G_i (d) for d=[0,cutoff) at each pixel in the ROI maxg - the maximum projection of pixel-by-pixel G values clusts - cell array containing all of the SR localizations contained in each cluster ahull - cell array containing the alpha-hull of each cluster, both inner and outer Clarea - a 2 x n array of the area of each cluster (nm^2), based on alpha-hull, both inner and outer compact - a 2 x n array of cluster compactness measurements, both inner and outer density - a 2 x n array of the cluster density/whole cell density values, both inner and outer neardist - a 2 x n array of nearest cluster distances, both inner and outer eqdiameter - a 2 x n array of cluster equivalent diameters, both inner and outer estmndiam 1 x 2 array, the mean equivalent diameter, both inner and outer hop - the results of the Hopkins test rip - a 1x4 cell array containing the results of the Ripley test, in this order: (), the distance step-size, (), and (). DBcounts - cell array containing the number of SR localizations per DBSCAN cluster DBhull - cell array containing the convex hull of each DBSCAN cluster DBdiam - a 1 x n array of cluster equivalent diameters DBmnDiam - the mean cluster equivalent diameter DBarea - cell array containing the area of each DBSCAN cluster (based on convex hull)

paircorr - a 1x4 cell array containing: the ROI image; a matrix with the radius values in column 1, the angularly averaged autocorrelation function in column 2, and the errors of the angularly averaged autocorrelation function in column 3; the average density rho; and the 2D autocorrelation function fitresults - a matrix containing the estimated fitting parameters (objects per domain, sigma for 2D Gaussian domain size, observations per object, sigma for Gaussian localization precision) with one row for each ROI or one row for the combined ROIs model - the fit model generated from these parameters params - the list of parameters used in generating these results. Additionally, all plots will be automatically saved to the same directory. The plots are saved as.fig files, which facilitates format changes and allows the plots to be exported in a number of other formats by the user. The exception to this is the Hopkins plot and DBSCAN plot which, due to an asyet unsolved bug, must be saved as a.png file (sorry for the inconvenience). 4 References

1. Getis, A., and J. K. Ord. 1992. The analysis of spatial association by use of distance statistics. Geographical Analysis 24:189-206. 2. Getis, A., and J. Aldstadt. 2004. Constructing the spatial weights matrix using a local statistic. Geographical Analysis 36:90-104. 3. Zhang, J., Leiderman, K., Pfeiffer, J., Wilson, B., Oliver, J., and Steinberg, S. 2006. Characterizing the Topography and Interactions of Membrane Receptors and Signaling Molecules from Spatial Patterns Obtained using Nanometer-scale Electron-dense Probes and Electron Microscopy, Micron, 37: 14-34. 4. Ester, Martin, et al. "A density-based algorithm for discovering clusters in large spatial databases with noise." KDD. Vol. 96. 1996. 5. Veatch, Sarah L., et al. "Correlation functions quantify super-resolution images and estimate apparent clustering due to over-counting." PloS one 7.2 (2012): e31457.