


Notes on Regression Methods for Linear Models
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
A general problem in data analysis is fitting a linear model of the
form
y = a + b*x
to a set of N [x,y] data points. The statistical problem is to first
determine the parameters (intercept a, slope b) which provide the
'best' fit of the data to the model, and then estimate the confidence
intervals for [a,b]. The 'best' fit is determined by least-squares,
i.e., chosing [a,b] to minimize an appropriate merit function chi which
involves squares of [x,y]. This leads to [a,b] having confidence
intervals given by the student-t distribution.
In the usual case (Case A), x is considered the independent variable
with no measurement error and y the dependent variable with a random
error given by N(0, stdy), where N is the normal distrubution function
with zero mean and standard deviation stdy. In this case, the merit
function chi is defined as the sum of the squared residuals
chi = sum( r(n)^2 ) = sum( y(n)-a-b*x(n))^2 )
where r(n) is the residual or vertical distance between y(n) and the
'best' fit line for the nth point, and the sum is over [1:N].
Differentiation of chi with respect with a and b and setting the
resulting equations to zero yield two normal equations which are linear
in a and b, thus allowing an exact solution. LINREG(x,y) finds this
solution and returns the 95% confidence intervals for [a,b].
In the case when both x and y are considered random variables (Case B),
i.e., when x also contains random measurement error characterized by
N(0, stdx), then the merit function chi is defined as the sum of the
squared perpendicular distances between the data points and the 'best'
fit line, so that
chi = sum( (y(n)-a-b*x(n))^2/(stdy^2+b^2*stdx^2) )
where the denominator can be considered to convert the vertical
distance into the perpendicular distance. In this case, stdx and stdy
are assumed known apriori. Since b appears in the denominator of chi,
differentiation of chi with respect to b leads to a nonlinear normal
equation for b (the normal equation for a is unchanged from case A).
For the special case considered here, namely that the uncertainities in
x(y) are independent of the value of x(y), then the nonlinear normal
equation for b reduces to a quadratic which can be solved 'exactly'
using FZERO and an initial guess. LINREG1(x,y,stdx,stdy,b0) determines
the best [a,b] and returns their 95% confidence intervals given the
input [stdx,stdy]. In the limit as the uncertainity in (say) x gets
very small (stdx<<stdy), the results of LINREG1 and LINREG converge.
SUMMARY:
Case A: y is random variable. Use LINREG(x,y).
Case B: x and y are random variables. Use LINREG1(x,y,stdx,stdy,b0).
LINREG1 uses QUADRAT.
For general discussion of both cases, see sections 15.1-15.3,
Press et al. (1992), Numerical Recipes: The Art of Scientific
Computing, Edition 2.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
ver. 1: 12/1/96 (RB)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

0001 % Notes on Regression Methods for Linear Models 0002 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0003 % A general problem in data analysis is fitting a linear model of the 0004 % form 0005 % 0006 % y = a + b*x 0007 % 0008 % to a set of N [x,y] data points. The statistical problem is to first 0009 % determine the parameters (intercept a, slope b) which provide the 0010 % 'best' fit of the data to the model, and then estimate the confidence 0011 % intervals for [a,b]. The 'best' fit is determined by least-squares, 0012 % i.e., chosing [a,b] to minimize an appropriate merit function chi which 0013 % involves squares of [x,y]. This leads to [a,b] having confidence 0014 % intervals given by the student-t distribution. 0015 % 0016 % In the usual case (Case A), x is considered the independent variable 0017 % with no measurement error and y the dependent variable with a random 0018 % error given by N(0, stdy), where N is the normal distrubution function 0019 % with zero mean and standard deviation stdy. In this case, the merit 0020 % function chi is defined as the sum of the squared residuals 0021 % 0022 % chi = sum( r(n)^2 ) = sum( y(n)-a-b*x(n))^2 ) 0023 % 0024 % where r(n) is the residual or vertical distance between y(n) and the 0025 % 'best' fit line for the nth point, and the sum is over [1:N]. 0026 % Differentiation of chi with respect with a and b and setting the 0027 % resulting equations to zero yield two normal equations which are linear 0028 % in a and b, thus allowing an exact solution. LINREG(x,y) finds this 0029 % solution and returns the 95% confidence intervals for [a,b]. 0030 % 0031 % In the case when both x and y are considered random variables (Case B), 0032 % i.e., when x also contains random measurement error characterized by 0033 % N(0, stdx), then the merit function chi is defined as the sum of the 0034 % squared perpendicular distances between the data points and the 'best' 0035 % fit line, so that 0036 % 0037 % chi = sum( (y(n)-a-b*x(n))^2/(stdy^2+b^2*stdx^2) ) 0038 % 0039 % where the denominator can be considered to convert the vertical 0040 % distance into the perpendicular distance. In this case, stdx and stdy 0041 % are assumed known apriori. Since b appears in the denominator of chi, 0042 % differentiation of chi with respect to b leads to a nonlinear normal 0043 % equation for b (the normal equation for a is unchanged from case A). 0044 % For the special case considered here, namely that the uncertainities in 0045 % x(y) are independent of the value of x(y), then the nonlinear normal 0046 % equation for b reduces to a quadratic which can be solved 'exactly' 0047 % using FZERO and an initial guess. LINREG1(x,y,stdx,stdy,b0) determines 0048 % the best [a,b] and returns their 95% confidence intervals given the 0049 % input [stdx,stdy]. In the limit as the uncertainity in (say) x gets 0050 % very small (stdx<<stdy), the results of LINREG1 and LINREG converge. 0051 % 0052 % SUMMARY: 0053 % Case A: y is random variable. Use LINREG(x,y). 0054 % Case B: x and y are random variables. Use LINREG1(x,y,stdx,stdy,b0). 0055 % LINREG1 uses QUADRAT. 0056 % 0057 % For general discussion of both cases, see sections 15.1-15.3, 0058 % Press et al. (1992), Numerical Recipes: The Art of Scientific 0059 % Computing, Edition 2. 0060 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0061 % ver. 1: 12/1/96 (RB) 0062 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%