Bi 1x Spring 2014: Plotting and linear regression

In this tutorial, we will learn some basics of how to plot experimental data. We will also learn how to perform linear regressions to get parameter estimates. In doing so, we will also get an introduction to NumPy's random number generation module, numpy.random, which we will use later on in the course. I also note that for the purposes of this simple tutorial, we are not going to consider error bars on experimental measurements nor error estimates on computed regression parameters.

1 Generating fake data

For the purposes of this tutorial, we will generate some fake data to use for plotting. I chose to do this instead of giving you data because I want to also introduce you to random number generation in Python. We will create a module to generate fake data. Store it in a file fake_data.py.

```python
"""
Module to generate random data.
"""
import numpy.random as np_rand


def generate_fake_data(x, f, args=(), noise_factor=0.1):
    """
    Generates fake data that follows a function f(x, *args)
    with random noise.

    The data are generated at the points x.  The amplitude of
    the noise from the function scales like

        noise_factor * mean(abs(f(x))).

    args is a tuple of other parameters that are passed into f.
    """
    # Generate base curve
    base_curve = f(x, *args)

    # Generate random noise (sample random numbers on interval [-1, 1))
    noise = 2.0 * np_rand.rand(len(x)) - 1.0

    # Add noise
    y = base_curve + noise_factor * abs(base_curve).mean() * noise

    return y


def linear_function(x, m, b):
    """
    Returns m * x + b.
    """
    return m * x + b
```

The function numpy.random.rand generates uniformly distributed random numbers on the interval [0, 1). The argument to the function says how many numbers to generate. For example, rand(100) returns an np.ndarray of 100 random numbers between zero and one. To get uniformly distributed random numbers on the interval [-1, 1), we linearly transform the numbers, as in the line computing noise in generate_fake_data above.

For today, we will generate fake data that falls along a line. We therefore also include a simple linear function in our fake_data module.

Note that in the generate_fake_data function, we have used a function call f(x, *args). In Python, we can take a tuple and pass it as separate arguments into a function by using the * operator. We can test it out.

```python
In [1]: import fake_data

In [2]: x = 1.0

In [3]: m, b = 4.0, 5.0

In [4]: args = (4.0, 5.0)

In [5]: fake_data.linear_function(x, m, b)

In [6]: fake_data.linear_function(x, *args)

In [7]: fake_data.linear_function(x, args)
```

The last function call will give an error, because without the *, args is just a single argument passed into the function.

To generate the fake data for use within our Python window in Canopy, we simply use these functions.

```python
In [8]: import numpy as np

In [9]: x = np.linspace(0.0, 10.0, 20)  # 20 evenly spaced pts from 0 to 10

In [10]: y = fake_data.generate_fake_data(x, fake_data.linear_function, args=args)

In [11]: x, y
```

We now have our fake data, and we can begin plotting.

2 Plotting experimental data

2.1 Plotting data points

The plt.plot function is the main utility for plotting data. You have seen the plt.fill_between function in the image processing tutorial, which was useful for viewing histograms, but that is in a way a fancy plotting function. You have also used skimage.io.imshow repeatedly, which is plotting data, as we have discussed. plt.plot is the workhorse of plotting. So, let's start by naively just plotting our data.
```python
In [12]: import matplotlib.pyplot as plt

In [13]: plt.plot(x, y)

In [14]: plt.draw()

In [15]: plt.show()
```

First off, note that the function calls to plt.draw and plt.show are often unnecessary when operating in the Canopy Python window. They are necessary, however, to pull up windows with plots when you are running scripts.

Now, when we look at our plot, we see that the default is to connect points with straight lines. This is useful when plotting theoretical curves. We sample the curves at dense points (e.g., x = np.linspace(0, 1, 200)), and then plot the function as a line. However, for experimental data, do not plot your data as lines unless it is very highly sampled, like in an electrocardiogram. Plot your data as individual points. To do this, we can make use of plt.plot's many keyword arguments.

```python
In [16]: plt.clf()  # This clears the figure window

In [17]: plt.plot(x, y, marker='o', linestyle='none')

In [18]: plt.draw(); plt.show()
```

Now, we have a series of dots. There are many keyword arguments that give you lots of control over how the data are presented.

```python
In [19]: plt.plot?
```

Finally, we most often plot our data as black dots. A shortcut to get this kind of plot is

```python
In [20]: plt.clf()

In [21]: plt.plot(x, y, 'ko')

In [22]: plt.draw(); plt.show()
```

2.2 Labeling axes

Now that we have a plot to work with, we can label our axes. Always label your axes. As a reminder, always label your axes. I would say it a third time, but that would be obnoxious. Let us pretend for a moment that our x-axis is time in units of years and the y-axis is the average height of trees in my yard. Then, we would label our axes as

```python
In [23]: plt.xlabel('time [years]', fontsize=18)

In [24]: plt.ylabel('average height [feet]', fontsize=18)

In [25]: plt.draw(); plt.show()
```

Note that I have used the fontsize keyword argument to control the font size. Always make your fonts large enough to be easily legible.
You can play around with the font size to make it look right.
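If you find yourself passing fontsize= to every labeling call, matplotlib's rcParams dictionary lets you set a default once. Here is a minimal sketch; the value 16 is an arbitrary choice for illustration, not a recommendation from this tutorial.

```python
import matplotlib.pyplot as plt

# Set a default font size for all text (axis labels, tick labels, legends)
plt.rcParams['font.size'] = 16

# Subsequent labeling calls pick up the default automatically,
# so no fontsize keyword argument is needed
plt.xlabel('time [years]')
plt.ylabel('average height [feet]')
```

Keyword arguments like fontsize=18 still override the default on a per-call basis, so the two approaches can be mixed.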
2.3 Legends

If you have multiple plots, it is often useful to have a legend. Just for demonstration purposes, we'll make a legend for our single plot.

```python
In [26]: plt.legend(('tree height',), loc='upper left', numpoints=1, fontsize=16)

In [27]: plt.draw(); plt.show()
```

Notice that the first argument is a tuple containing the labels for your curves. The ordering of the tuple corresponds to the order in which the curves were put in the figure using plt.plot. For multiple curves, it would have more than one entry. I also like to use the numpoints keyword argument to include only one marker in the legend. The default is to include two, which I think is ugly. Note that the text may seem off-center in the legend. This is usually corrected when you save the figure (see below).

2.4 Saving your plot

You can save your figure. It is best to save it as vector graphics, such as PDF or SVG. PDF is usually preferred.

```python
In [28]: plt.tight_layout()

In [29]: plt.draw(); plt.show()

In [30]: plt.savefig('tree_height.pdf')
```

The plt.tight_layout function is convenient for making sure all axis labels, etc., will appear properly in your saved figure.

3 Performing a linear regression

To perform a linear regression, we try to find the values of m and b for the line y = mx + b that best describe the data. To do so, we minimize the sum of the squares of the residuals. A residual is the difference between the line you are fitting and the data point itself at a given point x. For example, let's say experimental data point i, (x_i, y_i), should fall on the line y = mx + b. The residual for point i is

    r_i ≡ y_i − y = y_i − (m x_i + b).    (1)

To get an idea of what the residuals are, we can draw a line through our data and plot the residuals in red.

```python
In [31]: y_theor = 4.0 * x + 5.0

In [32]: plt.plot(x, y_theor, linestyle='-', color='gray')

In [33]: for i in xrange(len(x)):
    ...:     plt.plot((x[i], x[i]), (y[i], y_theor[i]), 'r-')

In [34]: plt.draw(); plt.show()
```
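The quantity that least squares minimizes, the sum of the squared residuals, can also be computed directly. This is a quick sketch using small made-up arrays, not the tutorial's fake data; the function name sum_sq_residuals is mine, not from the fake_data module.

```python
import numpy as np

def sum_sq_residuals(m, b, x, y):
    """Sum of squared residuals for the line y = m*x + b."""
    r = y - (m * x + b)  # one residual per data point
    return np.sum(r**2)

# Noise-free example: points exactly on y = 2x + 1 give zero residual
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
print(sum_sq_residuals(2.0, 1.0, x, y))  # → 0.0

# An intercept that is off by 1 gives a residual of 1 at each of the
# four points, so the sum of squares is 4
print(sum_sq_residuals(2.0, 0.0, x, y))  # → 4.0
```

The best-fit (m, b) is whatever pair makes this quantity as small as possible.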
We can see from the plot that if we minimize the sum of the squares of the residuals, we can get the line that is closest to all the points taken together. (Note: this is a very deep topic, and we are only scratching the very surface.) So, the linear regression problem can be stated as

    m, b = arg min_{m,b} Σ_i [y_i − (m x_i + b)]²,    (2)

where the term in brackets is the residual for data point i. This optimization problem is called a least squares problem. For the case where the function we are fitting our data with is a polynomial (such as a line), the problem has a unique solution that can be found by matrix operations. The details can be found in most introductory linear algebra textbooks.

We will instead use the curve_fit function in the scipy.optimize module to perform the curve fit. I chose to do this instead of working through the linear algebra because this function can also be used to perform nonlinear regression. It is well worth reading the doc string for this function.

```python
In [35]: from scipy.optimize import curve_fit

In [36]: curve_fit?
```

The first argument is the function we are fitting the data to. It must be of the form f(x, *args), which is conveniently what we already specified. The keyword argument p0 is the guess at the best parameters as an np.ndarray. For fitting a linear function, it is not important to specify this, but it is a very good idea to do so, as it can be very important for nonlinear regression (which we will do later in Bi 1x). The function curve_fit will assume that p0 is all ones otherwise.

curve_fit returns two np.ndarrays. The first contains the best-fit parameters, given in the order in which they are input into the fit function f. The second is the covariance matrix, the diagonal of which is supposed to give the variances of the fit parameters. Warning: in curve_fit's implementation, the variances will not be correctly reported unless you include error bars with your data, which we will not be doing in Bi 1x.
Therefore, you should ignore the covariance matrix returned by curve_fit for the purposes of Bi 1x. Note also that if you want more control over your curve fitting routines and want to do more sophisticated error analysis, you can directly use scipy.optimize.leastsq, which is what curve_fit uses under the hood.

Without further ado, let's fit our data with a line.

```python
In [37]: popt, pcov = curve_fit(fake_data.linear_function, x, y, p0=(4.0, 5.0))

In [38]: m, b = popt

In [39]: m, b
```

We can make a plot of our data with the curve fit.

```python
In [39]: x_theor = np.linspace(x[0], x[-1], 200)

In [40]: y_theor = fake_data.linear_function(x_theor, *popt)

In [41]: plt.clf()

In [42]: plt.plot(x, y, 'ko')

In [43]: plt.plot(x_theor, y_theor, linestyle='-', color='gray')

In [44]: plt.xlabel('time [years]', fontsize=18)

In [45]: plt.ylabel('average height [feet]', fontsize=18)

In [46]: plt.draw(); plt.show()
```

Now that you know how to do a linear regression, I'd like you to think about this: if you know that your data had to pass through zero, i.e., that you only had to fit the slope and not the slope plus intercept, how would you do it?
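As a sanity check on a line fit from curve_fit, the same least squares problem can be solved in closed form with np.polyfit (degree 1), and the two results should agree to numerical precision. The sketch below is self-contained and uses its own synthetic data and a copy of linear_function, so the seed and arrays are illustrative choices, not values from this tutorial's session.

```python
import numpy as np
from scipy.optimize import curve_fit

def linear_function(x, m, b):
    """Same form as the function in the fake_data module."""
    return m * x + b

# Synthetic noisy data scattered around y = 4x + 5
x = np.linspace(0.0, 10.0, 20)
np.random.seed(0)  # make the noise reproducible for this sketch
y = 4.0 * x + 5.0 + np.random.rand(20) - 0.5

# Fit with curve_fit (iterative least squares)
popt, _ = curve_fit(linear_function, x, y, p0=(1.0, 1.0))

# Fit with np.polyfit (closed-form linear least squares, degree 1)
m_poly, b_poly = np.polyfit(x, y, 1)

# Both solve the same minimization, so the answers agree closely
print(popt)
print(m_poly, b_poly)
```

Agreement between an iterative fit and a closed-form solution is a quick way to convince yourself that curve_fit converged.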