Notes on Regression Methods for Linear Models
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

A general problem in data analysis is fitting a linear model of the
form

    y = a + b*x

to a set of N [x,y] data points. The statistical problem is first to
determine the parameters (intercept a, slope b) which provide the
'best' fit of the data to the model, and then to estimate the
confidence intervals for [a,b]. The 'best' fit is determined by
least-squares, i.e., choosing [a,b] to minimize an appropriate merit
function chi built from squared deviations of the data from the model.
This leads to [a,b] having confidence intervals given by the Student's
t distribution.

In the usual case (Case A), x is considered the independent variable
with no measurement error and y the dependent variable with a random
error given by N(0, stdy), where N is the normal distribution function
with zero mean and standard deviation stdy. In this case, the merit
function chi is defined as the sum of the squared residuals

    chi = sum( r(n)^2 ) = sum( (y(n)-a-b*x(n))^2 )

where r(n) is the residual, or vertical distance between y(n) and the
'best' fit line, for the nth point, and the sum is over [1:N].
Differentiating chi with respect to a and b and setting the resulting
derivatives to zero yields two normal equations which are linear in a
and b, thus allowing an exact solution. LINREG(x,y) finds this
solution and returns the 95% confidence intervals for [a,b].

In the case when both x and y are considered random variables (Case
B), i.e., when x also contains random measurement error characterized
by N(0, stdx), the merit function chi is defined as the sum of the
squared perpendicular distances between the data points and the 'best'
fit line, so that

    chi = sum( (y(n)-a-b*x(n))^2 / (stdy^2 + b^2*stdx^2) )

where the denominator can be considered to convert the vertical
distance into the perpendicular distance. In this case, stdx and stdy
are assumed known a priori.
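The Case A normal equations have a familiar closed-form solution. The
following is a minimal plain-Python sketch of that solution (an
illustration, not the LINREG routine itself; the Student-t critical
value t_crit is hard-coded here for N = 10 points, i.e. 8 degrees of
freedom, and must be changed for other sample sizes):

```python
from math import sqrt

def linreg(x, y, t_crit=2.306):
    """OLS fit of y = a + b*x with 95% confidence half-widths.
    t_crit is the two-sided 95% Student-t point for N-2 dof
    (2.306 is the value for 8 dof, i.e. N = 10)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx                       # slope from the normal equations
    a = ybar - b * xbar                 # intercept
    # residual variance, with N-2 degrees of freedom
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    se_b = sqrt(s2 / sxx)               # standard error of the slope
    se_a = sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))
    return (a, t_crit * se_a), (b, t_crit * se_b)
```

For a noiseless line such as y = 1 + 2*x the fit recovers a = 1 and
b = 2 with zero-width intervals; with noisy data the half-widths give
the 95% confidence intervals described above.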
Since b appears in the denominator of chi, differentiating chi with
respect to b leads to a nonlinear normal equation for b (the normal
equation for a is unchanged from Case A). For the special case
considered here, namely that the uncertainty in x (respectively y) is
independent of the value of x (respectively y), the nonlinear normal
equation for b reduces to a quadratic which can be solved 'exactly'
using FZERO and an initial guess b0. LINREG1(x,y,stdx,stdy,b0)
determines the best [a,b] and returns their 95% confidence intervals
given the input [stdx,stdy]. In the limit as the uncertainty in (say)
x gets very small (stdx << stdy), the results of LINREG1 and LINREG
converge.

SUMMARY:
  Case A: y is a random variable.        Use LINREG(x,y).
  Case B: x and y are random variables.  Use LINREG1(x,y,stdx,stdy,b0).
          LINREG1 uses QUADRAT.

For a general discussion of both cases, see sections 15.1-15.3 of
Press et al. (1992), Numerical Recipes: The Art of Scientific
Computing, 2nd edition.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
ver. 1: 12/1/96 (RB)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
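Carrying out the Case B differentiation explicitly (a sketch of the
algebra, not the LINREG1/QUADRAT code): with u = x - xbar,
v = y - ybar, Suu = sum(u^2), Svv = sum(v^2), Suv = sum(u*v), and
lam = stdy^2/stdx^2, the nonlinear normal equation for b reduces to
the quadratic

    Suv*b^2 + (lam*Suu - Svv)*b - lam*Suv = 0.

The plain-Python sketch below takes the '+' branch of the quadratic
formula, which is the relevant root when Suv > 0 (an assumption of
this sketch; confidence intervals are omitted here):

```python
from math import sqrt

def linreg1(x, y, stdx, stdy):
    """Slope/intercept when both x and y carry constant, known
    measurement errors (Case B). Assumes Suv > 0."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    suu = sum((xi - xbar) ** 2 for xi in x)
    svv = sum((yi - ybar) ** 2 for yi in y)
    suv = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    lam = (stdy / stdx) ** 2            # error-variance ratio
    d = svv - lam * suu
    # '+' root of  Suv*b^2 + (lam*Suu - Svv)*b - lam*Suv = 0
    b = (d + sqrt(d * d + 4.0 * lam * suv * suv)) / (2.0 * suv)
    a = ybar - b * xbar                 # same normal equation as Case A
    return a, b
```

As stdx -> 0 (lam -> inf) this root tends to Suv/Suu, the Case A
slope, consistent with the convergence of LINREG1 and LINREG noted
above.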

