Maximum Likelihood Estimation 5.0


Maximum Likelihood Estimation 5.0
for GAUSS™ Mathematical and Statistical System
Aptech Systems, Inc.

Information in this document is subject to change without notice and does not represent a commitment on the part of Aptech Systems, Inc. The software described in this document is furnished under a license agreement or nondisclosure agreement. The software may be used or copied only in accordance with the terms of the agreement. The purchaser may make one copy of the software for backup purposes. No part of this manual may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, for any purpose other than the purchaser's personal use without the written permission of Aptech Systems, Inc.

© Copyright by Aptech Systems, Inc., Black Diamond, WA. All Rights Reserved.

GAUSS, GAUSS Engine and GAUSS Light are trademarks of Aptech Systems, Inc. Other trademarks are the property of their respective owners.

Part Number:
Version 5.0
Documentation Revision: 2173, June 12, 2012

Contents

1 Installation
    1.1 UNIX/Linux/Mac
        Download
        CD
    1.2 Windows
        Download
        CD
        64-Bit Windows
    1.3 Difference Between the UNIX and Windows Versions

2 Getting Started

3 Maximum Likelihood Estimation
    3.1 The Log-likelihood Function
    3.2 Algorithm
        Derivatives
        The Secant Algorithms
        Convergence
        Berndt, Hall, Hall, and Hausman's (BHHH) Method
        Polak-Ribiere-type Conjugate Gradient (PRCG)
        Line Search Methods
        Random Search
        Weighted Maximum Likelihood
        Active and Inactive Parameters
        Example
    3.3 Managing Optimization
        Scaling
        Condition
        Starting Point
        Diagnosis
    3.4 Gradients
        Analytical Gradient
        User-Supplied Numerical Gradient
        Algorithmic Derivatives
        Analytical Hessian
        User-Supplied Numerical Hessian
        Switching Algorithms Automatically
    3.5 FASTMAX Fast Execution
        Undefined Function Evaluation
    3.6 Inference
        Wald Inference
        Profile Likelihood Inference
        Profile Trace Plots
        Bootstrap
        Pseudo-Random Number Generators
        Bayesian Inference
    Run-Time Switches
    Calling MAXLIK Recursively
    Using MAXLIK Directly
    Error Handling
        Return Codes
        Error Trapping
    References

4 Maximum Likelihood Reference
    FASTMAX, FASTBayes, FASTBoot, FASTPflClimits, FASTProfile, MAXLIK, MAXBayes, MAXBoot, MAXBlimits, MAXCLPrt, MAXDensity, MAXHist, MAXProfile, MAXPflClimits, MAXPrt, MAXSet, MAXTlimits

5 Event Count and Duration Regression
    README Files
    Setup
    About the COUNT Procedures
    Inputs
    Outputs
    Global Control Variables
    Statistical Inference
    Problems with Convergence
    Annotated Bibliography

6 Count Reference
    CountCLPrt, CountPrt, CountSet, Expgam, Expon, Hurdlep, Negbin, Pareto, Poisson, Supreme, Supreme2

Index

1 Installation

1.1 UNIX/Linux/Mac

If you are unfamiliar with UNIX/Linux/Mac, see your system administrator or system documentation for information on the system commands referred to below.

Download

1. Copy the .tar.gz or .zip file to /tmp.

2. If the file has a .tar.gz extension, unzip it using gunzip. Otherwise skip to step 3.

       gunzip app_appname_vernum.revnum_unix.tar.gz

3. cd to your GAUSS or GAUSS Engine installation directory. We are assuming /usr/local/gauss in this case.

       cd /usr/local/gauss

4. Use tar or unzip, depending on the file name extension, to extract the file.

       tar xvf /tmp/app_appname_vernum.revnum_unix.tar

   or

       unzip /tmp/app_appname_vernum.revnum_unix.zip

CD

1. Insert the Apps CD into your machine's CD-ROM drive.

2. Open a terminal window.

3. cd to your current GAUSS or GAUSS Engine installation directory. We are assuming /usr/local/gauss in this case.

       cd /usr/local/gauss

4. Use tar or unzip, depending on the file name extensions, to extract the files found on the CD. For example:

       tar xvf /cdrom/apps/app_appname_vernum.revnum_unix.tar

   or

       unzip /cdrom/apps/app_appname_vernum.revnum_unix.zip

   However, note that the paths may be different on your machine.

1.2 Windows

Download

Unzip the .zip file into your GAUSS or GAUSS Engine installation directory.

CD

1. Insert the Apps CD into your machine's CD-ROM drive.

2. Unzip the .zip files found on the CD to your GAUSS or GAUSS Engine installation directory.

64-Bit Windows

If you have both the 64-bit version of GAUSS and the 32-bit Companion Edition installed on your machine, you need to install any GAUSS applications you own in both GAUSS installation directories.

1.3 Difference Between the UNIX and Windows Versions

If the functions can be controlled during execution by entering keystrokes from the keyboard, it may be necessary to press ENTER after the keystroke in the UNIX version.


2 Getting Started


3 Maximum Likelihood Estimation

3.1 The Log-likelihood Function

Maximum Likelihood is a set of procedures for the estimation of the parameters of models via the maximum likelihood method with general constraints on the parameters, along with an additional set of procedures for statistical inference. Maximum Likelihood solves the general maximum likelihood problem

    L = \sum_{i=1}^{N} w_i \log P(Y_i; \theta)

where N is the number of observations, P(Y_i; θ) is the probability of Y_i given θ, a vector of parameters, and w_i is the weight of the i-th observation.
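As an illustration of the per-observation term this sum is built from, the following is a minimal sketch of a procedure returning the vector of log P(Y_i; θ) for a simple normal model; the procedure name lpr_normal and the assumption that the dependent variable is in column 1 of the data are invented for this example, and the full calling conventions are described in the sections below.

proc lpr_normal(b,z);
    local mu, s2;
    mu = b[1];              /* mean */
    s2 = b[2];              /* variance */
    if s2 <= 0;
        retp(error(0));     /* log-likelihood undefined for nonpositive variance */
    endif;
    /* vector of per-observation log-likelihoods, log P(Y_i; theta) */
    retp(-0.5*ln(2*pi*s2) - (z[.,1]-mu).*(z[.,1]-mu)/(2*s2));
endp;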

The Maximum Likelihood procedure Maxlik finds values for the parameters in θ such that L is maximized. In fact, Maxlik minimizes −L. It is important to note, however, that the user must specify the log-probability to be maximized. Maxlik transforms the function into the form to be minimized.

Maxlik has been designed to make the specification of the function and the handling of the data convenient. The user supplies a procedure that computes log P(Y_i; θ), i.e., the log-likelihood, given the parameters in θ, for either an individual observation or a set of observations (i.e., it must return either the log-likelihood for an individual observation or a vector of log-likelihoods for a matrix of observations; see the discussion of the global variable __row below). Maxlik uses this procedure to construct the function to be minimized.

3.2 Algorithm

Maximum Likelihood finds values for the parameters using an iterative method. In this method the parameters are updated in a series of iterations beginning with starting values that you provide. Let θ_t be the current parameter values. Then the succeeding values are

    \theta_{t+1} = \theta_t + \rho\delta

where δ is a k × 1 direction vector and ρ a scalar step length.

Direction

Define

    \Sigma(\theta) = \frac{\partial^2 L}{\partial\theta\,\partial\theta'}

    \Psi(\theta) = \frac{\partial L}{\partial\theta}

The direction, δ, is the solution to

    \Sigma(\theta_t)\,\delta = \Psi(\theta_t)

This solution requires that Σ be positive definite.

Line Search

The line search finds a value of ρ that minimizes or decreases L(θ_t + ρδ).

Derivatives

The minimization requires the calculation of the Hessian, Σ, and the gradient, Ψ. Maxlik computes these numerically if procedures to compute them are not supplied. If you provide a proc for computing Ψ, the first derivative of L, Maxlik uses it in computing Σ, the second derivative of L, i.e., Σ is computed as the Jacobian of the gradient. This improves the computational precision of the Hessian by about four places. The accuracy of the gradient is also improved, and thus the iterations converge in fewer iterations. Moreover, the convergence takes less time because of a decrease in function calls: the numerical gradient requires k function calls while an analytical gradient reduces that to one.

The Secant Algorithms

The Hessian may be very expensive to compute at every iteration, and poor start values may produce an ill-conditioned Hessian. For these reasons alternative algorithms are
provided in Maxlik for updating the Hessian rather than computing it directly at each iteration. These algorithms, as well as the step length methods, may be modified during the execution of Maxlik.

Beginning with an initial estimate of the Hessian, or a conformable identity matrix, an update is calculated. The update at each iteration adds more information to the estimate of the Hessian, improving its ability to project the direction of the descent. Thus after several iterations the secant algorithm should do nearly as well as Newton iteration with much less computation.

There are two basic types of secant methods, the BFGS (Broyden, Fletcher, Goldfarb, and Shanno) and the DFP (Davidon, Fletcher, and Powell). They are both rank two updates, that is, they are analogous to adding two rows of new data to a previously computed moment matrix. The Cholesky factorization of the estimate of the Hessian is updated using the functions cholup and choldn.

In addition, Maxlik includes a scoring method, BHHH (Berndt, Hall, Hall, and Hausman). This method computes the gradient of the likelihood by observation, i.e., the Jacobian, and estimates Σ as the cross-product of this Jacobian.

Secant Methods (BFGS and DFP)

BFGS is the method of Broyden, Fletcher, Goldfarb, and Shanno, and DFP is the method of Davidon, Fletcher, and Powell. These methods are complementary (Luenberger 1984, page 268). BFGS and DFP are like the NEWTON method in that they use both first and second derivative information. However, in DFP and BFGS the Hessian is approximated, reducing considerably the computational requirements. Because they do not explicitly calculate the second derivatives they are sometimes called quasi-Newton methods. While they take more iterations than the NEWTON method, the use of an approximation produces a gain because it can be expected to converge in less overall time (unless analytical second derivatives are available, in which case it might be a toss-up).

The secant methods are commonly implemented as updates of the inverse of the Hessian. This is not the best method numerically for the BFGS algorithm (Gill and Murray, 1972).
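For reference, the standard rank-two BFGS update of the Hessian approximation is the textbook form below; Maxlik itself works with the Cholesky factors rather than with this matrix directly, as described next:

    H_{t+1} = H_t + \frac{y_t y_t'}{y_t' s_t} - \frac{H_t s_t s_t' H_t}{s_t' H_t s_t},
    \qquad s_t = \theta_{t+1} - \theta_t, \quad y_t = \Psi(\theta_{t+1}) - \Psi(\theta_t)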

This version of Maxlik, following Gill and Murray (1972), updates the Cholesky factorization of the Hessian instead, using the functions cholup and choldn for BFGS. The new direction is then computed using cholsol, a Cholesky solve, as applied to the updated Cholesky factorization of the Hessian and the gradient.

Convergence

Convergence is declared when the relative gradient is less than _max_gradtol. The relative gradient is a scaled gradient and is used for determining convergence in order to reduce the effects of scale. It is defined as the absolute value of the gradient times the absolute value of the parameter vector divided by the larger of one and the absolute value of the function. By default, _max_gradtol = 1e-5.

Berndt, Hall, Hall, and Hausman's (BHHH) Method

BHHH is a method proposed by Berndt, Hall, Hall and Hausman (1974) for the maximization of log-likelihood functions. It is a scoring method that uses the cross-product of the matrix of first derivatives to estimate the Hessian matrix. This calculation can be time-consuming, especially for large data sets, since a gradient matrix exactly the same size as the data set must be computed. For that reason BHHH cannot be considered a preferred choice for an optimization algorithm.

Polak-Ribiere-type Conjugate Gradient (PRCG)

The conjugate gradient method is an improvement on the steepest descent method without the increase in memory and computational requirements of the secant methods. Only the gradient is stored, and the calculation of the new direction is different:

    d_{t+1} = -g_{t+1} + \beta_t d_t

where t indicates the t-th iteration, d is the direction, and g is the gradient. The conjugate gradient method used in Maxlik is a variation called the Polak-Ribiere method, where

    \beta_t = \frac{(g_{t+1} - g_t)'\, g_{t+1}}{g_t'\, g_t}

The Newton and secant methods require storage on the order of the Hessian in memory, i.e., 8k² bytes of memory, where k is the number of parameters. For a very large problem this can be prohibitive. For example, 200 parameters will require 320 kilobytes of memory, and this doesn't count the copies of the Hessian that may be generated by the program. For large problems, then, the PRCG and STEEP methods may be the only alternative. As described above, STEEP can be very inefficient in the region of the minimum, and therefore the PRCG is the method of choice in these cases.

Line Search Methods

Given a direction vector d, the updated estimate of the parameters is computed

    \theta_{t+1} = \theta_t + \rho\delta

where ρ is a constant, usually called the step length, that increases the descent of the function given the direction. Maxlik includes a variety of methods for computing ρ. The value of the function to be minimized as a function of ρ is

    m(\rho) = L(\theta_t + \rho\delta)

Given θ and d, this is a function of the single variable ρ. Line search methods attempt to find a value for ρ that decreases m. STEPBT is a polynomial fitting method; BRENT and HALF are iterative search methods. A fourth method called ONE forces a step length of 1. The default line search method is STEPBT. If this, or any selected method, fails, then BRENT is tried. If BRENT fails, then HALF is tried. If all of the line search methods fail, then a random search is tried (provided _max_randradius is greater than zero).

STEPBT

STEPBT is an implementation of a similarly named algorithm described in Dennis and Schnabel (1983). It first attempts to fit a quadratic function to m(θ_t + ρδ) and computes a ρ that minimizes the quadratic. If that fails, it attempts to fit a cubic function. The cubic function more accurately portrays the function, which is not likely to be very quadratic, but it is more costly to compute. STEPBT is the default line search method because it generally produces the best results for the least cost in computational resources.

BRENT

This method is a variation on the golden section method due to Brent (1972). In this method, the function is evaluated at a sequence of test values for ρ. These test values are determined by extrapolation and interpolation using the constant (√5 − 1)/2 = 0.6180. This constant is the inverse of the so-called golden ratio ((√5 + 1)/2 = 1.6180), and is why the method is called a golden section method. This method is generally more efficient than STEPBT but requires significantly more function evaluations.

HALF

This method first computes m(x + d), i.e., sets ρ = 1. If m(x + d) < m(x), then the step length is set to 1. If not, then it tries m(x + .5d). The attempted step length is halved each time the function fails to decrease, and the method exits with the current value when it does decrease. This method usually requires the fewest function evaluations (it often requires only one), but it is the least efficient in that it is not very likely to find the step length that decreases m the most.
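The logic of the HALF method can be sketched as a stand-alone GAUSS procedure; this is for illustration only and is not Maxlik's internal routine, and the procedure name halfstep and its arguments (a pointer to the function being minimized, the current parameters x, and the direction d) are invented for the example.

/* Illustrative step-halving line search: returns a step length rho
   such that fct(x + rho*d) < fct(x), or the last rho tried. */
proc halfstep(&fct, x, d);
    local fct:proc;
    local f0, rho, maxtries, i;
    f0 = fct(x);
    rho = 1;                        /* start with a full step */
    maxtries = 20;
    i = 1;
    do while i <= maxtries;
        if fct(x + rho*d) < f0;     /* accept the first decrease found */
            retp(rho);
        endif;
        rho = rho/2;                /* otherwise halve the step and retry */
        i = i + 1;
    endo;
    retp(rho);
endp;

A call such as rho = halfstep(&m, x, d); would then supply the step length used in the update θ_{t+1} = θ_t + ρδ.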

BHHHStep

This is a variation on the golden search method. A sequence of step lengths is computed, interpolating or extrapolating using a golden ratio, and the method exits when the function decreases by an amount determined by _max_interp.

Random Search

If the line search fails, i.e., no ρ is found such that m(θ_t + ρδ) < m(θ_t), then a search is attempted for a random direction that decreases the function. The radius of the random search is fixed by the global variable _max_randradius (default = .01), times a measure of the magnitude of the gradient. Maxlik makes _max_maxtry attempts to find a direction that decreases the function, and if all of them fail, the direction with the smallest value for m is selected.

The function should never increase, but this assumes a well-defined problem. In practice, many functions are not so well-defined, and it often is the case that convergence is more likely achieved by a direction that puts the function somewhere else on the hyper-surface, even if it is at a higher point on the surface. Another reason for permitting an increase in the function here is that halting the minimization altogether is the only alternative if it is not at the minimum, and so one might as well retreat to another starting point. If the function repeatedly increases, then you would do well to consider improving either the specification of the problem or the starting point.

Weighted Maximum Likelihood

Weights are specified by setting the GAUSS global __weight to a weighting vector, or by assigning it the name of a column in the GAUSS data set being used in the estimation. Thus if a data matrix is being analyzed, __weight must be assigned to a vector. Maxlik assumes that the weights sum to the number of observations, i.e., that the weights
are frequencies. This will be an issue only with statistical inference. Otherwise, any multiple of the weights will produce the same results.

Active and Inactive Parameters

The Maxlik global _max_active may be used to fix parameters to their start values. This allows estimation of different models without having to modify the function procedure. _max_active must be set to a vector of the same length as the vector of start values. Elements of _max_active set to zero will be fixed to their starting values, while nonzero elements will be estimated.

This feature may also be used for model testing. Twice _max_numobs times the difference between the function values (the second return argument in the call to Maxlik) is chi-squared distributed with degrees of freedom equal to the number of fixed parameters in _max_active.

Example

This example estimates coefficients for a tobit model:

library maxlik;
#include maxlik.ext;

maxset;

proc lpr(x,z);
    local t,s,m,u;
    s = x[4];
    if s <= 1e-4;
        retp(error(0));
    endif;
    m = z[.,2:4]*x[1:3,.];
    u = z[.,1] ./= 0;
    t = z[.,1]-m;
    retp(u.*(-(t.*t)./(2*s)-.5*ln(2*s*pi)) + (1-u).*(ln(cdfnc(m/sqrt(s)))));
endp;

x0 = { 1, 1, 1, 1 };
title = "tobit example";

{ x,f,g,cov,ret } = maxlik("tobit",0,&lpr,x0);
call maxprt(x,f,g,cov,ret);

The output is:

===========================================================================
                               tobit example
===========================================================================

MAXLIK Version
===========================================================================
Data Set: tobit

return code = 0
normal convergence

Mean log-likelihood
Number of cases                      100

Covariance matrix of the parameters computed by the following method:
Inverse of computed Hessian

Parameters    Estimates    Std. err.    Est./s.e.    Prob.    Gradient
P
P
P
P

Correlation matrix of the parameters
Number of iterations        17
Minutes to convergence

3.3 Managing Optimization

The critical elements in optimization are scaling, the starting point, and the condition of the model. When the data are scaled, the starting point is reasonably close to the solution, and the data and model go together well, the iterations converge quickly and without difficulty.

For best results, therefore, you want to prepare the problem so that the model is well-specified, the data are scaled, and a good starting point is available. The tradeoff among algorithms and step length methods is between speed and demands on the starting point and condition of the model. The less demanding methods are generally time consuming and computationally intensive, whereas the quicker methods (either in terms of time or number of iterations to convergence) are more sensitive to conditioning and the quality of the starting point.

Scaling

For best performance, the diagonal elements of the Hessian matrix should be roughly equal. If some diagonal elements contain numbers that are very large and/or very small with respect to the others, Maxlik has difficulty converging. How to scale the diagonal elements of the Hessian may not be obvious, but it may suffice to ensure that the constants (or data) used in the model are about the same magnitude.
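As a simple illustration of putting the data on a common scale, the sketch below divides each column of a data matrix by its mean absolute value before estimation; the data set name is invented for the example, and the estimated coefficients would of course have to be interpreted on (or transformed back from) the rescaled units.

/* Rescale each column of the data so its typical magnitude is near 1 */
z = loadd("mydata");            /* hypothetical GAUSS data set */
scale = meanc(abs(z));          /* typical magnitude of each column */
zs = z ./ scale';               /* each column now has mean absolute value 1 */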

Condition

The specification of the model can be measured by the condition of the Hessian. The solution of the problem is found by searching for parameter values for which the gradient is zero. If, however, the Jacobian of the gradient (i.e., the Hessian) is very small for a particular parameter, then Maxlik has difficulty determining the optimal values since a large region of the function appears virtually flat to Maxlik. When the Hessian has very small elements, the inverse of the Hessian has very large elements and the search direction gets buried in the large numbers.

Poor condition can be caused by bad scaling. It can also be caused by a poor specification of the model or by bad data. Bad models and bad data are two sides of the same coin. If the problem is highly nonlinear, it is important that data be available to describe the features of the curve described by each of the parameters. For example, one of the parameters of the Weibull function describes the shape of the curve as it approaches the upper asymptote. If data are not available on that portion of the curve, then that parameter is poorly estimated. The gradient of the function with respect to that parameter is very flat, the elements of the Hessian associated with that parameter are very small, and the inverse of the Hessian contains very large numbers. In this case it is necessary to respecify the model in a way that excludes that parameter.

Computer Arithmetic

Computer arithmetic is fundamentally flawed by the fact that the computer number is finite (see Higham, 1996, for a general discussion). The standard double precision number in PCs carries about 16 significant decimal places. A simple operation can destroy nearly all of those places. The most destructive operations on a computer are addition and subtraction. Numbers are stored in a computer in the form of an abscissa and an exponent. There are about 16 decimal places of precision on most computers. The problem occurs when adding numbers that are of very different size. Before adding, the numbers must be transformed so that the exponents are the same. For example, consider adding a number of the order of e-07 to a number of the order of e+00.

When the exponents are made equal, the smaller number is shifted to the right and its trailing digits fall off the end of the 16-place abscissa; in this example eight places are lost in the smaller number. If the difference in exponents were 16, all of the places in the smaller number would be lost. This problem is due to the finiteness of the computer number, not to the implementation of the operators. It is an inherent problem in all computers, and the only solution, adding more bits to the computer number, is only temporary, because sooner or later a problem will arise where that quantity of bits won't be enough.

The first lesson to be learned from this is to avoid operations combining very small numbers with relatively large numbers. And for very small numbers, 1 can be a large number, as the example shows.

The standard method for evaluating the precision lost in computing a matrix inverse is the ratio of the largest to the smallest eigenvalue of the matrix. This quantity is sometimes called the condition number. The log of the condition number to the base 10 is approximately the number of decimal places lost in computing the inverse. A condition number greater than 1e16 therefore indicates that all of the 16 decimal places available in the standard double precision floating point number are lost.

The BFGS optimization method in Maxlik has been successful primarily because its method of generating an approximation to the Hessian encourages better conditioning. The implementation of the NEWTON method involves a numerical calculation of the Hessian. A numerical Hessian, like all numerical derivatives, is computed by first computing a difference, the most destructive operation as we've seen, and then compounding that by dividing the difference by a very small quantity. In general, when using double precision with 16 places of accuracy, about four places are lost in calculating a first derivative and another four with the second derivative. The numerical Hessian therefore begins with a loss of eight places of precision. If there are any problems computing the function itself, or if the model itself contains any problems of condition, there may be nothing left at all.
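The decimal-places-lost rule of thumb is easy to check in GAUSS; the sketch below assumes a symmetric Hessian already stored in a matrix h (for example, one retrieved through _max_diagnostic as described in the Diagnosis section below) and is illustrative only.

/* Condition number of the Hessian and approximate digits lost in inverting it */
ev = eigh(h);                            /* eigenvalues of the symmetric matrix h */
condnum = maxc(abs(ev))/minc(abs(ev));   /* ratio of largest to smallest eigenvalue */
digitslost = log(condnum);               /* log() is base-10 in GAUSS */
print "condition number " condnum;
print "approximate decimal places lost " digitslost;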

The BFGS method avoids many of the problems in computing a numerical Hessian. It produces an approximation by building information slowly with each iteration. Initially the Hessian is set to the identity matrix, the matrix with the best condition but the least information. Information is increased at each iteration with a method that guarantees a positive definite result. This provides for more stable, though slower, progress towards convergence.

The implementation has been designed to minimize the damage to the precision of the optimization problem. The BFGS method avoids a direct calculation of the numerical Hessian, and uses sophisticated techniques for calculating the direction that preserve as much precision as possible. However, all of this can be defeated by a poorly scaled problem or a poorly specified model.

When the objective function being optimized is a log-likelihood, the inverse of the Hessian is an estimate of the covariance matrix of the sampling distribution of the parameters. The condition of the Hessian is related to (i) the scaling of the parameters, and (ii) the degree to which there are linear dependencies in the sampling distribution of the parameters.

Scaling

Scaling is under the direct control of the investigator and should never be an issue in the optimization. It might not always be obvious how to do it, though. In estimation problems, scaling of the parameters is usually implemented by scaling the data. In regression models this is simple to accomplish, but in more complicated models it might be more difficult to do. It might be necessary to experiment with different scalings to get it right.

The goal is to optimize the condition of the Hessian. The definition of the condition number implies that we endeavor to minimize the ratio of the largest to the smallest eigenvalue of the Hessian. A rule of thumb for this is to scale the Hessian so that the diagonal elements are all about the same magnitude.

If the scaling of the Hessian proves too difficult, an alternative method is to scale the parameters directly in the procedure computing the log-likelihood. Multiply or divide the parameter values being passed to the procedure by fixed constants before their use in the calculation of the log-likelihood. Experiment with different values until the diagonal
elements of the Hessian are all about the same magnitude.

Linear Dependencies or Nearly Linear Dependencies in the Sampling Distribution

This is the most common difficulty in estimation and arises because of a discrepancy between the data and the model. If the data do not contain sufficient information to identify a parameter or set of parameters, a linear dependency is generated. A simple example occurs with a regressor that cannot be distinguished from the constant because its variation is too small. When this happens, the sampling distribution of these two parameters becomes highly collinear. This collinearity will produce an eigenvalue approaching zero in the Hessian, increasing the number of places lost in the calculation of the inverse of the Hessian and degrading the optimization.

In the real world, the data we have available will frequently fail to contain the information we need to estimate all of the parameters of our models. This means that it is a constant struggle to achieve a well-conditioned estimation. When the condition deteriorates to the point that the optimization fails, or the statistical inference fails through a failure to invert the Hessian, either more data must be found or the model must be re-specified. Re-specification means either the direct reduction of the parameter space, that is, a parameter is deleted from the model, or some sort of restriction is applied to the parameters.

Diagnosing the Linear Dependency

At times it may be very difficult to determine the cause of the ill-conditioning. If the Hessian being computed at convergence for the covariance matrix of the parameters fails to invert, try the following: first generate the pivoted QR factorization of the Hessian,

{ R,E } = qre(h);

The linearly dependent columns of H are pivoted to the end of the R matrix. E contains the new order of the columns of H after pivoting. The number of linearly dependent columns is found by looking at the number of nearly zero elements at the end of the diagonal of R. We can compute a coefficient matrix of the linear relationship of the dependent columns on the remaining columns by computing

    R_{11}^{-1} R_{12}

where R_{11} is that portion of the R matrix associated with the independent columns and R_{12} is the portion relating the independent columns to the dependent ones. Rather than use the inverse function in GAUSS, we use a special solve function that takes advantage of the triangular shape of R_{11}. Suppose that the last two elements of the diagonal of R are nearly zero; then

r0 = rows(r);
r1 = rows(r) - 1;
r2 = rows(r) - 2;
B = utrisol(r[1:r2,r1:r0], r[1:r2,1:r2]);

B describes the linear dependencies among the columns of H and can be used to diagnose the ill-conditioning in the Hessian.

Starting Point

When the model is not particularly well-defined, the starting point can be critical. When the optimization doesn't seem to be working, try different starting points. A closed form solution may exist for a simpler problem with the same parameters. For example, ordinary least squares estimates may be used for nonlinear least squares problems or for nonlinear regressions like probit or logit. There are no general methods for computing start values, and it may be necessary to attempt the estimation from a variety of starting points.
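For instance, in a model with a linear index such as the tobit example above, ordinary least squares coefficients can serve as rough start values; the sketch below is illustrative only and assumes the same data layout as the earlier examples (dependent variable in column 1 of z, regressors in columns 2 through 4).

/* OLS coefficients and residual variance as rough start values */
y = z[.,1];
x = z[.,2:4];
b_ols = invpd(x'x)*(x'y);       /* closed-form least squares solution */
e = y - x*b_ols;
s0 = e'e/rows(e);               /* residual variance as a start value for the variance parameter */
x0 = b_ols | s0;                /* stacked start vector in the order expected by lpr */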

Diagnosis

When the optimization is not proceeding well, it is sometimes useful to examine the function, the gradient Ψ, the direction δ, the Hessian Σ, the parameters θ_t, or the step length ρ during the iterations. The current values of these matrices can be printed out or stored in the global _max_diagnostic by setting _max_diagnostic to a nonzero value. Setting it to 1 causes Maxlik to print them to the screen or output file, 2 causes Maxlik to store them in _max_diagnostic, and 3 does both.

When you have selected _max_diagnostic = 2 or 3, Maxlik inserts the matrices into _max_diagnostic using the vput command. The matrices are extracted using the vread command. For example,

_max_diagnostic = 2;
call MAXPrt(maxlik("tobit",0,&lpr,x0));
h = vread(_max_diagnostic,"hessian");
d = vread(_max_diagnostic,"direct");

The following table contains the strings to be used to retrieve the various matrices in the vread command:

    θ      params
    δ      direct
    Σ      hessian
    Ψ      gradient
    ρ      step

When nested calls to Maxlik are made, i.e., when the procedure for computing the log-likelihood itself calls its own version of Maxlik, _max_diagnostic returns the matrices of the outer call to Maxlik only.
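Continuing the example, the remaining stored quantities can be retrieved the same way; the lines below are simply an illustrative use of vread with the strings in the table.

p = vread(_max_diagnostic,"params");      /* current parameter values */
g = vread(_max_diagnostic,"gradient");    /* current gradient */
s = vread(_max_diagnostic,"step");        /* current step length */
print "Hessian diagonal " diag(h)';       /* widely differing magnitudes suggest a scaling problem */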

3.4 Gradients

Analytical Gradient

To increase accuracy and reduce time, you may supply a procedure for computing the gradient, Ψ(θ) = ∂L/∂θ, analytically. This procedure has two input arguments, a K × 1 vector of parameters and an N_i × L submatrix of the input data set. The number of rows of the data set passed in the argument to the call of this procedure may be less than the total number of observations when the data are stored in a GAUSS data set and there was not enough space to store the data set in RAM in its entirety. In that case subsets of the data set are passed to the procedure in sequence.

The gradient procedure must be written to return a gradient (or more accurately, a Jacobian) with as many rows as the input submatrix of the data set. Thus the gradient procedure returns an N_i × K matrix of gradients of the N_i observations with respect to the K parameters. The Maxlik global _max_gradproc is then set to the pointer to that procedure. For example,

library maxlik;
#include maxlik.ext;

maxset;

proc lpsn(b,z);    /* Function - Poisson Regression */
    local m;
    m = z[.,2:4]*b;
    retp(z[.,1].*m-exp(m));
endp;

proc lgd(b,z);     /* Gradient */
    retp((z[.,1]-exp(z[.,2:4]*b)).*z[.,2:4]);
endp;

x0 = { .5, .5, .5 };

_max_gradproc = &lgd;
_max_gradchecktol = 1e-3;

{ x,f0,g,h,retcode } = MAXLIK("psn",0,&lpsn,x0);
call MAXPrt(x,f0,g,h,retcode);

In practice, unfortunately, much of the time spent on writing the gradient procedure is devoted to debugging. To help in this debugging process, Maxlik can be instructed to compute the numerical gradient along with your prospective analytical gradient for comparison purposes. In the example above this is accomplished by setting _max_gradchecktol to 1e-3.

User-Supplied Numerical Gradient

You may substitute your own numerical gradient procedure for the one used by Maxlik by default. This is done by setting the Maxlik global _max_usernumgrad to a pointer to the procedure. Maxlik includes some numerical gradient functions in gradient.src which can be invoked using this global. One of these procedures, gradre, computes numerical gradients using the Richardson Extrapolation method. To use this method set

_max_usernumgrad = &gradre;

Algorithmic Derivatives

Algorithmic Derivatives is a program that can be used to generate a GAUSS procedure to compute derivatives of the log-likelihood function. If you have Algorithmic Derivatives, be sure to read its manual for details on doing this.

First, copy the procedure computing the log-likelihood to a separate file. Second, from the command line enter

ad file_name d_file_name

where file_name is the name of the file containing the input function procedure, and d_file_name is the name of the file containing the output derivative procedure. If the input function procedure is named lpr, the output derivative procedure has the name d_1_lpr, where the added _1_ indicates that the derivative is with respect to the first of the two arguments.

For example, put the following function into a file called lpr.fct:

proc lpr(x,z);
    local s,m,u;
    s = x[4];
    m = z[.,2:4]*x[1:3,.];
    u = z[.,1] ./= 0;
    retp(u.*lnpdfmvn(z[.,1]-m,s) + (1-u).*(lncdfnc(m/sqrt(s))));
endp;

Then enter the following at the GAUSS command line:

library ad;
ad lpr.fct d_lpr.fct;

If successful, the following is printed to the screen

java -jar d:\gauss6.0\src\gaussad.jar lpr.fct d_lpr.fct

and the derivative procedure is written to a file named d_lpr.fct:

/* Version:1.0 - May 15, 2004 */
/* Generated from:lpr.fct */
/* Taking derivative with respect to argument 1 */

Proc(1)=d_1_lpr(x, z);
    Clearg _AD_fnValue;
    Local s, m, u;
    s = x[(4)] ;
    Local _AD_t1;
    _AD_t1 = x[(1):(3),.] ;
    m = z[.,(2):(4)] * _AD_t1;
    u = z[.,(1)] ./= 0;
    _AD_fnValue = (u.* lnpdfmvn( z[.,(1)] - m, s)) + ((1 - u).* lncdfnc(m / sqrt(s)));
    /* retp(_AD_fnValue); */
    /* endp; */
    struct _ADS_optimum _AD_d__AD_t1,_AD_d_x,_AD_d_s,_AD_d_m,_AD_d__AD_fnValue;
    /* _AD_d__AD_t1 = 0; _AD_d_s = 0; _AD_d_m = 0; */
    _AD_d__AD_fnValue = _ADP_d_x_dx(_AD_fnValue);
    _AD_d_s = _ADP_DtimesD(_AD_d__AD_fnValue, _ADP_DplusD(_ADP_DtimesD(_ADP_d_xplusy_dx(u.* lnpdfmvn( z[.,(1)] - m, s), (1 - u).* lncdfnc(m / sqrt(s))), _ADP_DtimesD(_ADP_d_ydotx_dx(u, lnpdfmvn( z[.,(1)] - m, s)), _ADP_DtimesD(_ADP_internal(d_2_lnpdfmvn( z[.,(1)] - m, s)), _ADP_d_x_dx(s)))), _ADP_DtimesD(_ADP_d_yplusx_dx(u.* lnpdfmvn( z[.,(1)] - m, s), (1 - u).* lncdfnc(m / sqrt(s))), _ADP_DtimesD(_ADP_d_ydotx_dx(1 - u, lncdfnc(m / sqrt(s))), _ADP_DtimesD(_ADP_d_lncdfnc(m / sqrt(s)), _ADP_DtimesD(_ADP_d_ydivx_dx(m, sqrt(s)), _ADP_DtimesD(_ADP_d_sqrt(s), _ADP_d_x_dx(s))))))));
    _AD_d_m = _ADP_DtimesD(_AD_d__AD_fnValue, _ADP_DplusD(_ADP_DtimesD(_ADP_d_xplusy_dx(u.* lnpdfmvn( z[.,(1)] - m, s), (1 - u).* lncdfnc(m / sqrt(s))), _ADP_DtimesD(_ADP_d_ydotx_dx(u, lnpdfmvn( z[.,(1)] - m, s)), _ADP_DtimesD(_ADP_internal(d_1_lnpdfmvn( z[.,(1)] - m, s)), _ADP_DtimesD(_ADP_d_yminusx_dx( z[.,(1)], m), _ADP_d_x_dx(m))))), _ADP_DtimesD(_ADP_d_yplusx_dx(u.* lnpdfmvn( z[.,(1)] - m, s), (1 - u).* lncdfnc(m / sqrt(s))), _ADP_DtimesD(_ADP_d_ydotx_dx(1 - u, lncdfnc(m / sqrt(s) )), _ADP_DtimesD(_ADP_d_lncdfnc(m / sqrt(s)), _ADP_DtimesD(_ADP_d_xdivy_dx(m, sqrt(s)), _ADP_d_x_dx(m)))))));
    /* u = z[.,(1)] ./= 0; */
    _AD_d__AD_t1 = _ADP_DtimesD(_AD_d_m, _ADP_DtimesD(_ADP_d_yx_dx( z[.,(2):(4)], _AD_t1), _ADP_d_x_dx(_AD_t1)));
    Local _AD_sr_x, _AD_sc_x;
    _AD_sr_x = _ADP_seqaMatrixRows(x);
    _AD_sc_x = _ADP_seqaMatrixCols(x);
    _AD_d_x = _ADP_DtimesD(_AD_d__AD_t1, _ADP_d_x2Idx_dx(x, _AD_sr_x[(1):(3)], _AD_sc_x[0] ));
    Local _AD_s_x;
    _AD_s_x = _ADP_seqaMatrix(x);
    _AD_d_x = _ADP_DplusD(_ADP_DtimesD(_AD_d_s, _ADP_d_xIdx_dx(x, _AD_s_x[(4)] )), _AD_d_x);
    retp(_ADP_external(_AD_d_x));
endp;

If there's a syntax error in the input function procedure, the following is written to the screen

java -jar d:\gauss6.0\src\gaussad.jar lpr.fct d_lpr.fct
Command java -jar d:\gauss6.0\src\gaussad.jar lpr.fct d_lpr.fct exited

with the exit status 1 indicating that an error has occurred. The output file then contains the reason for the error:

/* Version:1.0 - May 15, 2004 */
/* Generated from:lpr.fct */
/* Taking derivative with respect to argument 1 */

proc lpr(x,z);
    local s,m,u;
    s = x[4];
    m = z[.,2:4]*x[1:3,.];
    u = z[.,1] ./= 0;
    retp(u.*lnpdfmvn(z[.,1]-m,s) + (1-u).*(lncdfnc(m/sqrt(s)));

Error: lpr.fct:12:63: expecting ), found ;

Finally, set the global _max_gradproc equal to a pointer to the generated derivative procedure, for example,

library maxlik,ad;
#include ad.sdf

x0 = { 1, 1, 1, 1 };
title = "tobit example";

_max_bounds = { , , , .1 10 };

_max_gradproc = &d_1_lpr;

Maxlik("tobit",0,&lpr,x0);

Speeding Up the Algorithmic Derivative

A slightly faster derivative procedure can be generated by modifying the log-likelihood proc to return a scalar sum of the log-likelihoods in the input file in the call to AD. It is important to note that this derivative function based on a scalar return cannot be used for computing the QML covariance matrix of the parameters. Thus if you want both a derivative procedure based on a scalar return and QML standard errors, you will need to provide both types of gradient procedures.

To accomplish this, first copy both versions of the log-likelihood procedure into separate files and run AD on both of them with different output files. Then copy both of these derivative procedures to the command file. Note: the log-likelihood procedure that returns a vector of log-likelihoods should remain in the command file, i.e., don't use the version of the log-likelihood that returns a scalar in the command file.

For example, enlarging on the example in the previous section, put the following into a separate file:

36 Maxlik 5.0 for GAUSS proc lpr2(x,z); local s,m,u,logl; s = x[4]; m = z[.,2:4]*x[1:3,.]; u = z[.,1]./= 0; logl = u.*lnpdfmvn(z[.,1]-m,s) + (1-u).*(lncdfnc(m/sqrt(s))); retp(sumc(logl)); endp; Then enter on the command line ad lpr2.src d_lpr2.src and copy the contents of d lpr2.src into the command file. Our comand file now contains two derivative procedures, one based on a scalar result and another on a vector result. The one in the previous section d_1_lpr is our vector result derivative, and the from run above, d_1_lpr2 is our scalar result derivative. We want to use d_1_lpr2 for the iterations because it will be faster (it is computing a 1 K vector gradient), and for the QML covariance matrix of the parameters we will use d_1_lpr which returns a N K matrix of derivatives as required for the QML covariance matrix. Our command file will be library maxlik,ad; #include ad.sdf x0 = { 1, 1, 1, 1 }; title = "tobit example"; _max_bounds = { , , 3-24

_max_bounds = { , , , .1 10 };

_max_qmlproc = &d_1_lpr;
_max_gradproc = &d_1_lpr2;

Maxlik("tobit",0,&lpr,x0);

in addition to the two derivative procedures.

Analytical Hessian

You may provide a procedure for computing the Hessian, Σ(θ) = ∂²L/∂θ∂θ′. This procedure has two arguments, the K × 1 vector of parameters and an N_i × L submatrix of the input data set (where N_i may be less than N), and returns a K × K symmetric matrix of second derivatives of the objective function with respect to the parameters. The pointer to this procedure is stored in the global variable _max_hessproc.

In practice, unfortunately, much of the time spent on writing the Hessian procedure is devoted to debugging. To help in this debugging process, Maxlik can be instructed to compute the numerical Hessian along with your prospective analytical Hessian for comparison purposes. To accomplish this, _max_gradchecktol is set to a small nonzero value.

library maxlik;
#include maxlik.ext;

proc lnlk(b,z);
    local dev,s2;
    dev = z[.,1] - b[1] * exp(-b[2]*z[.,2]);
    s2 = dev'dev/rows(dev);
    retp(-0.5*(dev.*dev/s2 + ln(2*pi*s2)));
endp;

proc grdlk(b,z);
    local d,s2,dev,r;
    d = exp(-b[2]*z[.,2]);
    dev = z[.,1] - b[1]*d;
    s2 = dev'dev/rows(dev);
    r = dev.*d/s2;
    /* retp(r ~ (-b[1]*z[.,2].*r));    correct gradient */
    retp(r ~ (z[.,2].*r));             /* incorrect gradient */
endp;

proc hslk(b,z);
    local d,s2,dev,r,hss;
    d = exp(-b[2]*z[.,2]);
    dev = z[.,1] - b[1]*d;
    s2 = dev'dev/rows(dev);
    if s2 <= 0;
        retp(error(0));
    endif;
    r = z[.,2].*d.*(b[1].*d - dev)/s2;
    hss = -d.*d/s2 ~ r ~ -b[1].*z[.,2].*r;
    retp(xpnd(sumc(hss)));
endp;

maxset;

_max_hessproc = &hslk;
_max_gradproc = &grdlk;
_max_gradchecktol = 1e-3;

startv = { 2, 1 };

{ x,f0,g,cov,retcode } = MAXLIK("nlls",0,&lnlk,startv);
call MAXPrt(x,f0,g,cov,retcode);

The gradient is incorrectly computed, and Maxlik responds with an error message. It is clear that the error is in the calculation of the gradient for the second parameter.

analytical and numerical gradients differ

   numerical      analytical

========================================================================
            analytical Hessian and analytical gradient
========================================================================

MAXLIK Version
========================================================================
Data Set: nlls

return code = 7
function cannot be evaluated at initial parameter values

Mean log-likelihood
Number of cases                      150

The covariance of the parameters failed to invert

Parameters    Estimates    Gradient
P
P

Number of iterations
Minutes to convergence

User-Supplied Numerical Hessian

You may substitute your own numerical Hessian procedure for the one used by Maxlik by default. This is done by setting the Maxlik global _max_userhess to a pointer to the procedure. This procedure has three input arguments, a pointer to the log-likelihood function, a K × 1 vector of parameters, and an N_i × K matrix containing the data. It must return a K × K matrix which is the estimated Hessian evaluated at the parameter vector.
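As an illustration of the required form, here is a minimal central-difference numerical Hessian with the calling convention just described; it is a sketch only (fixed step size, no refinements), and the procedure name numhess is invented for the example.

/* Illustrative user-supplied numerical Hessian: central differences on the
   sum of the per-observation log-likelihoods returned by the passed procedure. */
proc numhess(&fct, b, z);
    local fct:proc;
    local k, h, hmat, i, j, ei, ej, fpp, fpm, fmp, fmm;
    k = rows(b);
    h = 1e-4;                       /* fixed step size; a sketch only */
    hmat = zeros(k,k);
    i = 1;
    do while i <= k;
        j = 1;
        do while j <= i;
            ei = zeros(k,1);
            ei[i] = h;
            ej = zeros(k,1);
            ej[j] = h;
            fpp = sumc(fct(b+ei+ej, z));
            fpm = sumc(fct(b+ei-ej, z));
            fmp = sumc(fct(b-ei+ej, z));
            fmm = sumc(fct(b-ei-ej, z));
            hmat[i,j] = (fpp - fpm - fmp + fmm)/(4*h*h);
            hmat[j,i] = hmat[i,j];
            j = j + 1;
        endo;
        i = i + 1;
    endo;
    retp(hmat);
endp;

_max_userhess = &numhess;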

Switching Algorithms Automatically

The global variable _max_switch can be used to switch algorithms automatically during the iterations. If _max_switch has one column, the algorithm is switched once during the iterations, and if it has two columns it is switched back and forth. The conditions for the switching are determined by the elements of _max_switch in the second through fourth rows. If these rows are not supplied, default values are entered.

The first row contains the algorithm numbers to switch to, or, if there are two columns, to switch to and from. The algorithm switches if the log-likelihood function improves by less than the quantity in the second row, or if the number of iterations exceeds the quantity in the third row, or if the line search changes by less than the quantity in the fourth row. If only the first row is specified in the command file, that is, if only the algorithm numbers are entered, the second, third and fourth rows are set by default to .001, 10, and .001, respectively.

3.5 FASTMAX Fast Execution

Depending on the type of problem, FASTMAX, the fast version of Maxlik, can be called with speed-ups from 10 percent to 500 percent over the regular version of Maxlik. This is achieved at the expense of losing some features; in particular, it won't print any iteration information to the screen, the globals cannot be modified on the fly, and it can't print or store diagnostic information. Moreover, the dataset must be entirely storable in RAM.

The gain in time depends on the type of problem. The greatest speedup occurs with problems that are function call intensive. The speedup will be less if gradients and/or Hessians are provided. The least speedup occurs for problems where convergence is quick, and the most where convergence is slow. Thus FASTMAX will least affect a bootstrap or profile likelihood estimation for models that converge quickly, and most affect those that don't. FASTMAX is most useful for problems that will be repeated in some way such as in a Monte
Carlo study or a bootstrap. The initial runs would use Maxlik, where monitoring the progress is most important, and subsequent runs would use FASTMAX.

FASTMAX has the same arguments and returns as Maxlik, and thus to call it you may change the name Maxlik in your command file to FASTMAX. FASTMAX does require that the dataset be storable in memory in its entirety, however, and if that isn't possible FASTMAX will fail. In a similar way, for the fast versions of MAXBOOT, MAXPROFILE, and MAXBAYES, change the calls to FASTBOOT, FASTPROFILE, and FASTBAYES, respectively. No changes in input or output arguments are necessary.

Undefined Function Evaluation

On occasion the log-likelihood function will evaluate to an undefined value; for example, the log-likelihood procedure may attempt to take the log of a negative quantity for one or more observations. If you have written your procedure to return a scalar missing value when this happens, Maxlik will succeed in recovering in most cases. That is, depending on circumstances it will find another set of parameter values or use a different line search method.

If you are using FASTMAX, you can try a different strategy. Write your procedure to enter a missing value in the log-likelihood vector for each observation for which the calculation is undefined. FASTMAX will compute gradients and function values by list-wise deletion. In other words, it will compute the function and gradient from the available observations.

3.6 Inference

Maxlik includes four classes of methods for analyzing the distributions of the estimated parameters: Wald inference, profile likelihood inference, the bootstrap, and Bayesian inference.
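The first of these, Wald inference, works directly from the covariance matrix returned by Maxlik. The lines below are a generic illustration, not one of the manual's worked examples, of forming approximate 95 percent Wald limits from the returns of the earlier tobit call; they assume that cov is the parameter covariance matrix as printed by maxprt.

/* Approximate 95 percent Wald confidence limits from the Maxlik returns */
{ x,f,g,cov,ret } = maxlik("tobit",0,&lpr,x0);
se = sqrt(diag(cov));            /* standard errors, assuming cov is the parameter covariance matrix */
lo = x - 1.96*se;                /* lower limits */
hi = x + 1.96*se;                /* upper limits */
print x~se~lo~hi;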


Box-Cox Transformation for Simple Linear Regression Chapter 192 Box-Cox Transformation for Simple Linear Regression Introduction This procedure finds the appropriate Box-Cox power transformation (1964) for a dataset containing a pair of variables that are

More information

06: Logistic Regression

06: Logistic Regression 06_Logistic_Regression 06: Logistic Regression Previous Next Index Classification Where y is a discrete value Develop the logistic regression algorithm to determine what class a new input should fall into

More information

CS281 Section 3: Practical Optimization

CS281 Section 3: Practical Optimization CS281 Section 3: Practical Optimization David Duvenaud and Dougal Maclaurin Most parameter estimation problems in machine learning cannot be solved in closed form, so we often have to resort to numerical

More information

Title. Description. stata.com

Title. Description. stata.com Title stata.com optimize( ) Function optimization Description Syntax Remarks and examples Conformability Diagnostics References Also see Description These functions find parameter vector or scalar p such

More information

10/11/2013. Chapter 3. Objectives. Objectives (continued) Introduction. Attributes of Algorithms. Introduction. The Efficiency of Algorithms

10/11/2013. Chapter 3. Objectives. Objectives (continued) Introduction. Attributes of Algorithms. Introduction. The Efficiency of Algorithms Chapter 3 The Efficiency of Algorithms Objectives INVITATION TO Computer Science 1 After studying this chapter, students will be able to: Describe algorithm attributes and why they are important Explain

More information

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA

DOWNLOAD PDF BIG IDEAS MATH VERTICAL SHRINK OF A PARABOLA Chapter 1 : BioMath: Transformation of Graphs Use the results in part (a) to identify the vertex of the parabola. c. Find a vertical line on your graph paper so that when you fold the paper, the left portion

More information

Roundoff Errors and Computer Arithmetic

Roundoff Errors and Computer Arithmetic Jim Lambers Math 105A Summer Session I 2003-04 Lecture 2 Notes These notes correspond to Section 1.2 in the text. Roundoff Errors and Computer Arithmetic In computing the solution to any mathematical problem,

More information

Reals 1. Floating-point numbers and their properties. Pitfalls of numeric computation. Horner's method. Bisection. Newton's method.

Reals 1. Floating-point numbers and their properties. Pitfalls of numeric computation. Horner's method. Bisection. Newton's method. Reals 1 13 Reals Floating-point numbers and their properties. Pitfalls of numeric computation. Horner's method. Bisection. Newton's method. 13.1 Floating-point numbers Real numbers, those declared to be

More information

An interesting related problem is Buffon s Needle which was first proposed in the mid-1700 s.

An interesting related problem is Buffon s Needle which was first proposed in the mid-1700 s. Using Monte Carlo to Estimate π using Buffon s Needle Problem An interesting related problem is Buffon s Needle which was first proposed in the mid-1700 s. Here s the problem (in a simplified form). Suppose

More information

AM205: lecture 2. 1 These have been shifted to MD 323 for the rest of the semester.

AM205: lecture 2. 1 These have been shifted to MD 323 for the rest of the semester. AM205: lecture 2 Luna and Gary will hold a Python tutorial on Wednesday in 60 Oxford Street, Room 330 Assignment 1 will be posted this week Chris will hold office hours on Thursday (1:30pm 3:30pm, Pierce

More information

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms

Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms Iterative Algorithms I: Elementary Iterative Methods and the Conjugate Gradient Algorithms By:- Nitin Kamra Indian Institute of Technology, Delhi Advisor:- Prof. Ulrich Reude 1. Introduction to Linear

More information

Description Syntax Remarks and examples Conformability Diagnostics References Also see

Description Syntax Remarks and examples Conformability Diagnostics References Also see Title stata.com solvenl( ) Solve systems of nonlinear equations Description Syntax Remarks and examples Conformability Diagnostics References Also see Description The solvenl() suite of functions finds

More information

Tree-GP: A Scalable Bayesian Global Numerical Optimization algorithm

Tree-GP: A Scalable Bayesian Global Numerical Optimization algorithm Utrecht University Department of Information and Computing Sciences Tree-GP: A Scalable Bayesian Global Numerical Optimization algorithm February 2015 Author Gerben van Veenendaal ICA-3470792 Supervisor

More information

Chapter 1. Math review. 1.1 Some sets

Chapter 1. Math review. 1.1 Some sets Chapter 1 Math review This book assumes that you understood precalculus when you took it. So you used to know how to do things like factoring polynomials, solving high school geometry problems, using trigonometric

More information

Generalized Additive Model

Generalized Additive Model Generalized Additive Model by Huimin Liu Department of Mathematics and Statistics University of Minnesota Duluth, Duluth, MN 55812 December 2008 Table of Contents Abstract... 2 Chapter 1 Introduction 1.1

More information

E04DGF NAG Fortran Library Routine Document

E04DGF NAG Fortran Library Routine Document E04 Minimizing or Maximizing a Function E04DGF NAG Fortran Library Routine Document Note. Before using this routine, please read the Users Note for your implementation to check the interpretation of bold

More information

Cost Functions in Machine Learning

Cost Functions in Machine Learning Cost Functions in Machine Learning Kevin Swingler Motivation Given some data that reflects measurements from the environment We want to build a model that reflects certain statistics about that data Something

More information

Contents. I Basics 1. Copyright by SIAM. Unauthorized reproduction of this article is prohibited.

Contents. I Basics 1. Copyright by SIAM. Unauthorized reproduction of this article is prohibited. page v Preface xiii I Basics 1 1 Optimization Models 3 1.1 Introduction... 3 1.2 Optimization: An Informal Introduction... 4 1.3 Linear Equations... 7 1.4 Linear Optimization... 10 Exercises... 12 1.5

More information

Optimization and least squares. Prof. Noah Snavely CS1114

Optimization and least squares. Prof. Noah Snavely CS1114 Optimization and least squares Prof. Noah Snavely CS1114 http://cs1114.cs.cornell.edu Administrivia A5 Part 1 due tomorrow by 5pm (please sign up for a demo slot) Part 2 will be due in two weeks (4/17)

More information

CS 6210 Fall 2016 Bei Wang. Review Lecture What have we learnt in Scientific Computing?

CS 6210 Fall 2016 Bei Wang. Review Lecture What have we learnt in Scientific Computing? CS 6210 Fall 2016 Bei Wang Review Lecture What have we learnt in Scientific Computing? Let s recall the scientific computing pipeline observed phenomenon mathematical model discretization solution algorithm

More information

Recent advances in Metamodel of Optimal Prognosis. Lectures. Thomas Most & Johannes Will

Recent advances in Metamodel of Optimal Prognosis. Lectures. Thomas Most & Johannes Will Lectures Recent advances in Metamodel of Optimal Prognosis Thomas Most & Johannes Will presented at the Weimar Optimization and Stochastic Days 2010 Source: www.dynardo.de/en/library Recent advances in

More information

Lecture 6 - Multivariate numerical optimization

Lecture 6 - Multivariate numerical optimization Lecture 6 - Multivariate numerical optimization Björn Andersson (w/ Jianxin Wei) Department of Statistics, Uppsala University February 13, 2014 1 / 36 Table of Contents 1 Plotting functions of two variables

More information

Curriculum Map: Mathematics

Curriculum Map: Mathematics Curriculum Map: Mathematics Course: Honors Advanced Precalculus and Trigonometry Grade(s): 11-12 Unit 1: Functions and Their Graphs This chapter will develop a more complete, thorough understanding of

More information

A Study on the Optimization Methods for Optomechanical Alignment

A Study on the Optimization Methods for Optomechanical Alignment A Study on the Optimization Methods for Optomechanical Alignment Ming-Ta Yu a, Tsung-Yin Lin b *, Yi-You Li a, and Pei-Feng Shu a a Dept. of Mech. Eng., National Chiao Tung University, Hsinchu 300, Taiwan,

More information

Lecture 1 Contracts. 1 A Mysterious Program : Principles of Imperative Computation (Spring 2018) Frank Pfenning

Lecture 1 Contracts. 1 A Mysterious Program : Principles of Imperative Computation (Spring 2018) Frank Pfenning Lecture 1 Contracts 15-122: Principles of Imperative Computation (Spring 2018) Frank Pfenning In these notes we review contracts, which we use to collectively denote function contracts, loop invariants,

More information

GAUSS TM 10. Quick Start Guide

GAUSS TM 10. Quick Start Guide GAUSS TM 10 Quick Start Guide Information in this document is subject to change without notice and does not represent a commitment on the part of Aptech Systems, Inc. The software described in this document

More information

Introduction to MATLAB

Introduction to MATLAB Introduction to MATLAB Introduction MATLAB is an interactive package for numerical analysis, matrix computation, control system design, and linear system analysis and design available on most CAEN platforms

More information

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value.

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value. Calibration OVERVIEW... 2 INTRODUCTION... 2 CALIBRATION... 3 ANOTHER REASON FOR CALIBRATION... 4 CHECKING THE CALIBRATION OF A REGRESSION... 5 CALIBRATION IN SIMPLE REGRESSION (DISPLAY.JMP)... 5 TESTING

More information

A projected Hessian matrix for full waveform inversion Yong Ma and Dave Hale, Center for Wave Phenomena, Colorado School of Mines

A projected Hessian matrix for full waveform inversion Yong Ma and Dave Hale, Center for Wave Phenomena, Colorado School of Mines A projected Hessian matrix for full waveform inversion Yong Ma and Dave Hale, Center for Wave Phenomena, Colorado School of Mines SUMMARY A Hessian matrix in full waveform inversion (FWI) is difficult

More information

1. NUMBER SYSTEMS USED IN COMPUTING: THE BINARY NUMBER SYSTEM

1. NUMBER SYSTEMS USED IN COMPUTING: THE BINARY NUMBER SYSTEM 1. NUMBER SYSTEMS USED IN COMPUTING: THE BINARY NUMBER SYSTEM 1.1 Introduction Given that digital logic and memory devices are based on two electrical states (on and off), it is natural to use a number

More information

Repetition Through Recursion

Repetition Through Recursion Fundamentals of Computer Science I (CS151.02 2007S) Repetition Through Recursion Summary: In many algorithms, you want to do things again and again and again. For example, you might want to do something

More information

Parallel Implementations of Gaussian Elimination

Parallel Implementations of Gaussian Elimination s of Western Michigan University vasilije.perovic@wmich.edu January 27, 2012 CS 6260: in Parallel Linear systems of equations General form of a linear system of equations is given by a 11 x 1 + + a 1n

More information

ECE 204 Numerical Methods for Computer Engineers MIDTERM EXAMINATION /4:30-6:00

ECE 204 Numerical Methods for Computer Engineers MIDTERM EXAMINATION /4:30-6:00 ECE 4 Numerical Methods for Computer Engineers ECE 4 Numerical Methods for Computer Engineers MIDTERM EXAMINATION --7/4:-6: The eamination is out of marks. Instructions: No aides. Write your name and student

More information

Floating-Point Numbers in Digital Computers

Floating-Point Numbers in Digital Computers POLYTECHNIC UNIVERSITY Department of Computer and Information Science Floating-Point Numbers in Digital Computers K. Ming Leung Abstract: We explain how floating-point numbers are represented and stored

More information

Chapter 5. Repetition. Contents. Introduction. Three Types of Program Control. Two Types of Repetition. Three Syntax Structures for Looping in C++

Chapter 5. Repetition. Contents. Introduction. Three Types of Program Control. Two Types of Repetition. Three Syntax Structures for Looping in C++ Repetition Contents 1 Repetition 1.1 Introduction 1.2 Three Types of Program Control Chapter 5 Introduction 1.3 Two Types of Repetition 1.4 Three Structures for Looping in C++ 1.5 The while Control Structure

More information

Computing Basics. 1 Sources of Error LECTURE NOTES ECO 613/614 FALL 2007 KAREN A. KOPECKY

Computing Basics. 1 Sources of Error LECTURE NOTES ECO 613/614 FALL 2007 KAREN A. KOPECKY LECTURE NOTES ECO 613/614 FALL 2007 KAREN A. KOPECKY Computing Basics 1 Sources of Error Numerical solutions to problems differ from their analytical counterparts. Why? The reason for the difference is

More information

Logic, Words, and Integers

Logic, Words, and Integers Computer Science 52 Logic, Words, and Integers 1 Words and Data The basic unit of information in a computer is the bit; it is simply a quantity that takes one of two values, 0 or 1. A sequence of k bits

More information

Applying Machine Learning to Real Problems: Why is it Difficult? How Research Can Help?

Applying Machine Learning to Real Problems: Why is it Difficult? How Research Can Help? Applying Machine Learning to Real Problems: Why is it Difficult? How Research Can Help? Olivier Bousquet, Google, Zürich, obousquet@google.com June 4th, 2007 Outline 1 Introduction 2 Features 3 Minimax

More information

5.12 EXERCISES Exercises 263

5.12 EXERCISES Exercises 263 5.12 Exercises 263 5.12 EXERCISES 5.1. If it s defined, the OPENMP macro is a decimal int. Write a program that prints its value. What is the significance of the value? 5.2. Download omp trap 1.c from

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 20: Sparse Linear Systems; Direct Methods vs. Iterative Methods Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 26

More information

CS 450 Numerical Analysis. Chapter 7: Interpolation

CS 450 Numerical Analysis. Chapter 7: Interpolation Lecture slides based on the textbook Scientific Computing: An Introductory Survey by Michael T. Heath, copyright c 2018 by the Society for Industrial and Applied Mathematics. http://www.siam.org/books/cl80

More information

Box-Cox Transformation

Box-Cox Transformation Chapter 190 Box-Cox Transformation Introduction This procedure finds the appropriate Box-Cox power transformation (1964) for a single batch of data. It is used to modify the distributional shape of a set

More information

Chapter 2. Data Representation in Computer Systems

Chapter 2. Data Representation in Computer Systems Chapter 2 Data Representation in Computer Systems Chapter 2 Objectives Understand the fundamentals of numerical data representation and manipulation in digital computers. Master the skill of converting

More information

Lab copy. Do not remove! Mathematics 152 Spring 1999 Notes on the course calculator. 1. The calculator VC. The web page

Lab copy. Do not remove! Mathematics 152 Spring 1999 Notes on the course calculator. 1. The calculator VC. The web page Mathematics 152 Spring 1999 Notes on the course calculator 1. The calculator VC The web page http://gamba.math.ubc.ca/coursedoc/math152/docs/ca.html contains a generic version of the calculator VC and

More information

Leaning Graphical Model Structures using L1-Regularization Paths (addendum)

Leaning Graphical Model Structures using L1-Regularization Paths (addendum) Leaning Graphical Model Structures using -Regularization Paths (addendum) Mark Schmidt and Kevin Murphy Computer Science Dept. University of British Columbia {schmidtm,murphyk}@cs.ubc.ca 1 Introduction

More information

Introduction to Programming in C Department of Computer Science and Engineering. Lecture No. #16 Loops: Matrix Using Nested for Loop

Introduction to Programming in C Department of Computer Science and Engineering. Lecture No. #16 Loops: Matrix Using Nested for Loop Introduction to Programming in C Department of Computer Science and Engineering Lecture No. #16 Loops: Matrix Using Nested for Loop In this section, we will use the, for loop to code of the matrix problem.

More information

General Instructions. Questions

General Instructions. Questions CS246: Mining Massive Data Sets Winter 2018 Problem Set 2 Due 11:59pm February 8, 2018 Only one late period is allowed for this homework (11:59pm 2/13). General Instructions Submission instructions: These

More information

(Sparse) Linear Solvers

(Sparse) Linear Solvers (Sparse) Linear Solvers Ax = B Why? Many geometry processing applications boil down to: solve one or more linear systems Parameterization Editing Reconstruction Fairing Morphing 1 Don t you just invert

More information

Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects

Statistics & Analysis. A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects A Comparison of PDLREG and GAM Procedures in Measuring Dynamic Effects Patralekha Bhattacharya Thinkalytics The PDLREG procedure in SAS is used to fit a finite distributed lagged model to time series data

More information

An Improved Measurement Placement Algorithm for Network Observability

An Improved Measurement Placement Algorithm for Network Observability IEEE TRANSACTIONS ON POWER SYSTEMS, VOL. 16, NO. 4, NOVEMBER 2001 819 An Improved Measurement Placement Algorithm for Network Observability Bei Gou and Ali Abur, Senior Member, IEEE Abstract This paper

More information

Accelerating the Hessian-free Gauss-Newton Full-waveform Inversion via Preconditioned Conjugate Gradient Method

Accelerating the Hessian-free Gauss-Newton Full-waveform Inversion via Preconditioned Conjugate Gradient Method Accelerating the Hessian-free Gauss-Newton Full-waveform Inversion via Preconditioned Conjugate Gradient Method Wenyong Pan 1, Kris Innanen 1 and Wenyuan Liao 2 1. CREWES Project, Department of Geoscience,

More information

Lecture 1 Contracts : Principles of Imperative Computation (Fall 2018) Frank Pfenning

Lecture 1 Contracts : Principles of Imperative Computation (Fall 2018) Frank Pfenning Lecture 1 Contracts 15-122: Principles of Imperative Computation (Fall 2018) Frank Pfenning In these notes we review contracts, which we use to collectively denote function contracts, loop invariants,

More information

Course Number 432/433 Title Algebra II (A & B) H Grade # of Days 120

Course Number 432/433 Title Algebra II (A & B) H Grade # of Days 120 Whitman-Hanson Regional High School provides all students with a high- quality education in order to develop reflective, concerned citizens and contributing members of the global community. Course Number

More information

Maths for Signals and Systems Linear Algebra in Engineering. Some problems by Gilbert Strang

Maths for Signals and Systems Linear Algebra in Engineering. Some problems by Gilbert Strang Maths for Signals and Systems Linear Algebra in Engineering Some problems by Gilbert Strang Problems. Consider u, v, w to be non-zero vectors in R 7. These vectors span a vector space. What are the possible

More information

Errors in Computation

Errors in Computation Theory of Errors Content Errors in computation Absolute Error Relative Error Roundoff Errors Truncation Errors Floating Point Numbers Normalized Floating Point Numbers Roundoff Error in Floating Point

More information

Floating-Point Numbers in Digital Computers

Floating-Point Numbers in Digital Computers POLYTECHNIC UNIVERSITY Department of Computer and Information Science Floating-Point Numbers in Digital Computers K. Ming Leung Abstract: We explain how floating-point numbers are represented and stored

More information

2 Computation with Floating-Point Numbers

2 Computation with Floating-Point Numbers 2 Computation with Floating-Point Numbers 2.1 Floating-Point Representation The notion of real numbers in mathematics is convenient for hand computations and formula manipulations. However, real numbers

More information

WE consider the gate-sizing problem, that is, the problem

WE consider the gate-sizing problem, that is, the problem 2760 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS, VOL 55, NO 9, OCTOBER 2008 An Efficient Method for Large-Scale Gate Sizing Siddharth Joshi and Stephen Boyd, Fellow, IEEE Abstract We consider

More information