acgc.stats.bivariate_lines
Specialized methods of bivariate line fitting
- Standard major axis (SMA) also called reduced major axis (RMA)
- York regression, for data with errors in x and y
- Theil-Sen, non-parametric slope estimation (use numba to accelerate the function in this module)
1# -*- coding: utf-8 -*- 2"""Specialized methods of bivariate line fitting 3 4* Standard major axis (SMA) also called reduced major axis (RMA) 5* York regression, for data with errors in x and y 6* Theil-Sen, non-parametric slope estimation (use numba to accelerate the function in this module) 7""" 8# from collections import namedtuple 9import numpy as np 10import scipy.stats as stats 11from sklearn.covariance import MinCovDet 12import statsmodels.formula.api as smf 13import statsmodels.robust.norms as norms 14import warnings 15#from numba import jit 16 17__all__ = [ 18 "sma", 19 "smafit", 20 "sen", 21 "sen_slope", 22 "york" 23] 24# Aliases 25def sen_slope(*args,**kwargs): 26 '''Alias for `sen`''' 27 return sen(*args,**kwargs) 28def smafit(*args,**kwargs): 29 '''Alias for `sma`''' 30 return sma(*args,**kwargs) 31 32def sma(X,Y,W=None, 33 data=None, 34 cl=0.95, 35 intercept=True, 36 robust=False,robust_method='FastMCD'): 37 '''Standard Major-Axis (SMA) line fitting 38 39 Calculate standard major axis, aka reduced major axis, fit to 40 data X and Y. The main advantage of this over ordinary least squares is 41 that the best fit of Y to X will be the same as the best fit of X to Y. 42 43 The fit equations and confidence intervals are implemented following 44 Warton et al. (2006). Robust fits use the FastMCD covariance estimate 45 from Rousseeuw and Van Driessen (1999). While there are many alternative 46 robust covariance estimators (e.g. other papers by D.I. Warton using M-estimators), 47 the FastMCD algorithm is default in Matlab. When the standard error or 48 uncertainty of each point is known, then weighted SMA may be preferrable to 49 robust SMA. The conventional choice of weights for each point i is 50 W_i = 1 / ( var(X_i) + var(Y_i) ), where var() is the variance 51 (squared standard error). 52 53 References 54 Warton, D. I., Wright, I. J., Falster, D. S. and Westoby, M.: 55 Bivariate line-fitting methods for allometry, Biol. Rev., 81(02), 259, 56 doi:10.1017/S1464793106007007, 2006. 57 Rousseeuw, P. J. and Van Driessen, K.: A Fast Algorithm for the Minimum 58 Covariance Determinant Estimator, Technometrics, 41(3), 1999. 59 60 Parameters 61 ---------- 62 X, Y : array_like or str 63 Input values, Must have same length. 64 W : array_like or str, optional 65 array of weights for each X-Y point, typically W_i = 1/(var(X_i)+var(Y_i)) 66 data : dict_like, optional 67 data structure containing variables. Used when X, Y, or W are str. 68 cl : float (default = 0.95) 69 Desired confidence level for output. 70 intercept : bool, default=True 71 Specify if the fitted model should include a non-zero intercept. 72 The model will be forced through the origin (0,0) if intercept=False. 73 robust : bool, default=False 74 Use statistical methods that are robust to the presence of outliers 75 robust_method: {'FastMCD' (default), 'Huber', 'Biweight'} 76 Method for calculating robust variance and covariance. Options: 77 - 'MCD' or 'FastMCD' for Fast MCD 78 - 'Huber' for Huber's T: reduce, not eliminate, influence of outliers 79 - 'Biweight' for Tukey's Biweight: reduces then eliminates influence of outliers 80 81 82 Returns 83 ------- 84 fitresult : dict 85 Contains the following keys: 86 - slope (float) 87 Slope or Gradient of Y vs. X 88 - intercept (float) 89 Y intercept. 90 - slope_ste (float) 91 Standard error of slope estimate 92 - intercept_ste (float) 93 standard error of intercept estimate 94 - slope_interval ([float, float]) 95 confidence interval for gradient at confidence level cl 96 - intercept_interval ([float, float]) 97 confidence interval for intercept at confidence level cl 98 - df_model (float) 99 degrees of freedom for model 100 - df_resid (float) 101 degrees of freedom for residuals 102 - params ([float,float]) 103 array of fitted parameters 104 ''' 105 106 def str2var( v, data ): 107 '''Extract variable named v from Dataframe named data''' 108 try: 109 return data[v] 110 except Exception as exc: 111 raise ValueError( 'Argument data must be provided with a key named '+v ) from exc 112 113 # If variables are provided as strings, get values from the data structure 114 if isinstance( X, str ): 115 X = str2var( X, data ) 116 if isinstance( Y, str ): 117 Y = str2var( Y, data ) 118 if isinstance( W, str ): 119 W = str2var( W, data ) 120 121 # Make sure arrays have the same length 122 assert ( len(X) == len(Y) ), 'Arrays X and Y must have the same length' 123 if W is None: 124 W = np.zeros_like(X) + 1 125 else: 126 assert ( len(W) == len(X) ), 'Array W must have the same length as X and Y' 127 128 # Make sure cl is within the range 0-1 129 assert (cl < 1), 'cl must be less than 1' 130 assert (cl > 0), 'cl must be greater than 0' 131 132 # Drop any NaN elements of X, Y, or W 133 # Infinite values are allowed but will make the result undefined 134 # idx = ~np.logical_or( np.isnan(X0), np.isnan(Y0) ) 135 idx = ~np.isnan(X) * ~np.isnan(Y) * ~np.isnan(W) 136 137 X0 = X[idx] 138 Y0 = Y[idx] 139 W0 = W[idx] 140 141 # Number of observations 142 N = len(X0) 143 144 # Degrees of freedom for the model 145 if intercept: 146 dfmod = 2 147 else: 148 dfmod = 1 149 150 151 # Choose whether to use methods robust to outliers 152 if robust: 153 154 # Choose the robust method 155 if ((robust_method.lower() =='mcd') or (robust_method.lower() == 'fastmcd') ): 156 # FAST MCD 157 158 if not intercept: 159 # intercept=False could possibly be supported by calculating 160 # using mcd.support_ as weights in an explicit variance/covariance calculation 161 raise NotImplementedError('FastMCD method only supports SMA with intercept') 162 163 # Fit robust model of mean and covariance 164 mcd = MinCovDet().fit( np.array([X0,Y0]).T ) 165 166 # Robust mean 167 Xmean = mcd.location_[0] 168 Ymean = mcd.location_[1] 169 170 # Robust variance of X, Y 171 Vx = mcd.covariance_[0,0] 172 Vy = mcd.covariance_[1,1] 173 174 # Robust covariance 175 Vxy = mcd.covariance_[0,1] 176 177 # Number of observations used in mean and covariance estimate 178 # excludes observations marked as outliers 179 N = mcd.support_.sum() 180 181 elif ((robust_method.lower() =='biweight') or (robust_method.lower() == 'huber') ): 182 183 # Tukey's Biweight and Huber's T 184 if robust_method.lower()=='biweight': 185 norm = norms.TukeyBiweight() 186 else: 187 norm = norms.HuberT() 188 189 # Get weights for downweighting outliers 190 # Fitting a linear model the easiest way to get these 191 # Options include "TukeyBiweight" (totally removes large deviates) 192 # "HuberT" (linear, not squared weighting of large deviates) 193 rweights = smf.rlm('y~x+1',{'x':X0,'y':Y0},M=norm).fit().weights 194 195 # Sum of weight and weights squared, for convienience 196 rsum = np.sum( rweights ) 197 rsum2 = np.sum( rweights**2 ) 198 199 # Mean 200 Xmean = np.sum( X0 * rweights ) / rsum 201 Ymean = np.sum( Y0 * rweights ) / rsum 202 203 # Force intercept through zero, if requested 204 if not intercept: 205 Xmean = 0 206 Ymean = 0 207 208 # Variance & Covariance 209 Vx = np.sum( (X0-Xmean)**2 * rweights**2 ) / rsum2 210 Vy = np.sum( (Y0-Ymean)**2 * rweights**2 ) / rsum2 211 Vxy = np.sum( (X0-Xmean) * (Y0-Ymean) * rweights**2 ) / rsum2 212 213 # Effective number of observations 214 N = rsum 215 216 else: 217 218 raise NotImplementedError("sma hasn't implemented robust_method={:%s}".\ 219 format(robust_method)) 220 else: 221 222 if intercept: 223 224 wsum = np.sum(W) 225 226 # Average values 227 Xmean = np.sum(X0 * W0) / wsum 228 Ymean = np.sum(Y0 * W0) / wsum 229 230 # Covariance matrix 231 cov = np.cov( X0, Y0, ddof=1, aweights=W0**2 ) 232 233 # Variance 234 Vx = cov[0,0] 235 Vy = cov[1,1] 236 237 # Covariance 238 Vxy = cov[0,1] 239 240 else: 241 242 # Force the line to pass through origin by setting means to zero 243 Xmean = 0 244 Ymean = 0 245 246 wsum = np.sum(W0) 247 248 # Sum of squares in place of variance and covariance 249 Vx = np.sum( X0**2 * W0 ) / wsum 250 Vy = np.sum( Y0**2 * W0 ) / wsum 251 Vxy= np.sum( X0*Y0 * W0 ) / wsum 252 253 # Standard deviation 254 Sx = np.sqrt( Vx ) 255 Sy = np.sqrt( Vy ) 256 257 # Correlation coefficient (equivalent to np.corrcoef()[1,0] for non-robust cases) 258 R = Vxy / np.sqrt( Vx * Vy ) 259 260 ############# 261 # SLOPE 262 263 Slope = np.sign(R) * Sy / Sx 264 265 # Standard error of slope estimate 266 ste_slope = np.sqrt( 1/(N-dfmod) * Sy**2 / Sx**2 * (1-R**2) ) 267 268 # Confidence interval for Slope 269 B = (1-R**2)/(N-dfmod) * stats.f.isf(1-cl, 1, N-dfmod) 270 ci_grad = Slope * ( np.sqrt( B+1 ) + np.sqrt(B)*np.array([-1,+1]) ) 271 272 ############# 273 # INTERCEPT 274 275 if intercept: 276 Intercept = Ymean - Slope * Xmean 277 278 # Standard deviation of residuals 279 # New Method: Formula from smatr R package (Warton) 280 # This formula avoids large residuals of outliers when using robust=True 281 Sr = np.sqrt((Vy - 2 * Slope * Vxy + Slope**2 * Vx ) * (N-1) / (N-dfmod) ) 282 283 # OLD METHOD 284 # Standard deviation of residuals 285 #resid = Y0 - (Intercept + Slope * X0 ) 286 # Population standard deviation of the residuals 287 #Sr = np.std( resid, ddof=0 ) 288 289 # Standard error of the intercept estimate 290 ste_int = np.sqrt( Sr**2/N + Xmean**2 * ste_slope**2 ) 291 292 # Confidence interval for Intercept 293 tcrit = stats.t.isf((1-cl)/2,N-dfmod) 294 ci_int = Intercept + ste_int * np.array([-tcrit,tcrit]) 295 296 else: 297 298 # Set Intercept quantities to zero 299 Intercept = 0 300 ste_int = 0 301 ci_int = np.array([0,0]) 302 303 result = dict( slope = Slope, 304 intercept = Intercept, 305 slope_ste = ste_slope, 306 intercept_ste = ste_int, 307 slope_interval = ci_grad, 308 intercept_interval = ci_int, 309 df_model = dfmod, 310 df_resid = N-dfmod, 311 params = np.array([Slope,Intercept]), 312 nobs = N, 313 fittedvalues = Intercept + Slope * X0, 314 resid = Intercept + Slope * X0 - Y0 ) 315 316 # return Slope, Intercept, ste_slope, ste_int, ci_grad, ci_int 317 return result 318 319def york( x, y, err_x=1, err_y=1, rerr_xy=0 ): 320 '''York regression accounting for error in x and y 321 Follows the notation and algorithm of York et al. (2004) Section III 322 323 Parameters 324 ---------- 325 x, y : ndarray 326 dependent (x) and independent (y) variables for fitting 327 err_x, err_y : ndarray (default=1) 328 standard deviation of errors/uncertainty in x and y 329 rerr_xy : float (default=0) 330 correlation coefficient for errors in x and y, 331 default to rerr_xy=0 meaning that the errors in x are unrelated to errors in y 332 err_x, err_y, and rerr_xy can be constants or arrays of the same length as x and y 333 334 Returns 335 ------- 336 fitresult : dict 337 Contains the following keys: 338 - slope (float) 339 Slope or Gradient of Y vs. X 340 - intercept (float) 341 Y intercept. 342 - slope_ste (float) 343 Standard error of slope estimate 344 - intercept_ste (float) 345 standard error of intercept estimate 346 - slope_interval ([float, float]) 347 confidence interval for gradient at confidence level cl 348 - intercept_interval ([float, float]) 349 confidence interval for intercept at confidence level cl 350 - df_model (float) 351 degrees of freedom for model 352 - df_resid (float) 353 degrees of freedom for residuals 354 - params ([float,float]) 355 array of fitted parameters 356 ''' 357 358 # relative error tolerance required for convergence 359 rtol = 1e-15 360 361 # Initial guess for slope, from ordinary least squares 362 result = stats.linregress( x, y ) 363 b = result[0] 364 365 # Weights for x and y 366 wx = 1 / err_x**2 367 wy = 1 / err_y**2 368 369 # Combined weights 370 alpha = np.sqrt( wx * wy ) 371 372 # Iterate until solution converges, but not more 50 times 373 maxiter=50 374 for i in range(1,maxiter): 375 376 # Weight for point i 377 W = wx * wy / ( wx + b**2 * wy - 2 * b * rerr_xy * alpha ) 378 Wsum = np.sum( W ) 379 380 # Weighted means 381 Xbar = np.sum( W * x ) / Wsum 382 Ybar = np.sum( W * y ) / Wsum 383 384 # Deviation from weighted means 385 U = x - Xbar 386 V = y - Ybar 387 388 # parameter needed for slope 389 beta = W * ( U / wy + b*V / wx - (b*U + V) * rerr_xy / alpha ) 390 391 # Update slope estimate 392 bnew = np.sum( W * beta * V ) / np.sum( W * beta * U ) 393 394 # Break from loop if new value is very close to old value 395 if np.abs( (bnew-b)/b ) < rtol: 396 break 397 else: 398 b = bnew 399 400 if i==maxiter: 401 raise ValueError( f'York regression failed to converge in {maxiter:d} iterations' ) 402 403 # Intercept 404 a = Ybar - b * Xbar 405 406 # least-squares adjusted points, expectation values of X and Y 407 xa = Xbar + beta 408 ya = Ybar + b*beta 409 410 # Mean of adjusted points 411 xabar = np.sum( W * xa ) / Wsum 412 yabar = np.sum( W * ya ) / Wsum 413 414 # Devaiation of adjusted points from their means 415 u = xa - xabar 416 v = ya - yabar 417 418 # Variance of slope and intercept estimates 419 varb = 1 / np.sum( W * u**2 ) 420 vara = 1 / Wsum + xabar**2 * varb 421 422 # Standard error of slope and intercept 423 siga = np.sqrt( vara ) 424 sigb = np.sqrt( varb ) 425 426 # Define a named tuple type that will contain the results 427 # result = namedtuple( 'result', 'slope intercept sigs sigi params sigma' ) 428 429 # Return results as a named tuple, User can access as a regular tuple too 430 # return result( b, a, sigb, siga, [b,a], [sigb, siga] ) 431 432 dfmod = 2 433 N = np.sum( ~np.isnan(x) * ~np.isnan(y) ) 434 435 result = dict( slope = b, 436 intercept = a, 437 slope_ste = sigb, 438 intercept_ste = siga, 439 slope_interval = [None,None], 440 intercept_interval = [None,None], 441 df_model = dfmod, 442 df_resid = N-dfmod, 443 params = np.array([b,a]), 444 nobs = N, 445 fittedvalues = a + b * x, 446 resid = a + b * x - y ) 447 448 return result 449 450#@jit(nopython=True) 451def sen( x, y ): 452 '''Estimate linear trend using the Thiel-Sen method 453 454 This non-parametric method finds the median slope among all 455 combinations of time points. 456 scipy.stats.theilslopes provides the same slope estimate, with 457 confidence intervals. However, this function is faster for 458 large datasets due to Numba 459 460 Parameters 461 ---------- 462 x : array_like (N,) 463 independent variable 464 y : array_like (N,) 465 dependent variable 466 467 Returns 468 ------- 469 sen : float 470 the median slope 471 slopes : array (N*N,) 472 all slope estimates from all combinations of x and y 473 ''' 474 475 with warnings.catch_warnings(): 476 warnings.simplefilter('always', DeprecationWarning) 477 warnings.warn(f'Sen function is slow unless numba.jit is used. Use scipy.stats.theilslopes instead.', 478 DeprecationWarning, stacklevel=2) 479 480 if len( x ) != len( y ): 481 print('Inputs x and y must have same dimension') 482 return np.nan 483 484 # Find number of time points 485 n = len( x ) 486 487 # Array to hold all slope estimates 488 slopes = np.zeros( np.ceil( n * ( n-1 ) / 2 ).astype('int') ) 489 slopes[:] = np.nan 490 491 count = 0 492 493 for i in range(n): 494 for j in range(i+1, n): 495 496 # Slope between elements i and j 497 slopeij = ( y[j] - y[i] ) / ( x[j] - x[i] ) 498 499 slopes[count] = slopeij 500 501 count += 1 502 503 # Thiel-Sen estimate is the median slope, neglecting NaN 504 sen = np.nanmedian( slopes ) 505 506 return sen, slopes
33def sma(X,Y,W=None, 34 data=None, 35 cl=0.95, 36 intercept=True, 37 robust=False,robust_method='FastMCD'): 38 '''Standard Major-Axis (SMA) line fitting 39 40 Calculate standard major axis, aka reduced major axis, fit to 41 data X and Y. The main advantage of this over ordinary least squares is 42 that the best fit of Y to X will be the same as the best fit of X to Y. 43 44 The fit equations and confidence intervals are implemented following 45 Warton et al. (2006). Robust fits use the FastMCD covariance estimate 46 from Rousseeuw and Van Driessen (1999). While there are many alternative 47 robust covariance estimators (e.g. other papers by D.I. Warton using M-estimators), 48 the FastMCD algorithm is default in Matlab. When the standard error or 49 uncertainty of each point is known, then weighted SMA may be preferrable to 50 robust SMA. The conventional choice of weights for each point i is 51 W_i = 1 / ( var(X_i) + var(Y_i) ), where var() is the variance 52 (squared standard error). 53 54 References 55 Warton, D. I., Wright, I. J., Falster, D. S. and Westoby, M.: 56 Bivariate line-fitting methods for allometry, Biol. Rev., 81(02), 259, 57 doi:10.1017/S1464793106007007, 2006. 58 Rousseeuw, P. J. and Van Driessen, K.: A Fast Algorithm for the Minimum 59 Covariance Determinant Estimator, Technometrics, 41(3), 1999. 60 61 Parameters 62 ---------- 63 X, Y : array_like or str 64 Input values, Must have same length. 65 W : array_like or str, optional 66 array of weights for each X-Y point, typically W_i = 1/(var(X_i)+var(Y_i)) 67 data : dict_like, optional 68 data structure containing variables. Used when X, Y, or W are str. 69 cl : float (default = 0.95) 70 Desired confidence level for output. 71 intercept : bool, default=True 72 Specify if the fitted model should include a non-zero intercept. 73 The model will be forced through the origin (0,0) if intercept=False. 74 robust : bool, default=False 75 Use statistical methods that are robust to the presence of outliers 76 robust_method: {'FastMCD' (default), 'Huber', 'Biweight'} 77 Method for calculating robust variance and covariance. Options: 78 - 'MCD' or 'FastMCD' for Fast MCD 79 - 'Huber' for Huber's T: reduce, not eliminate, influence of outliers 80 - 'Biweight' for Tukey's Biweight: reduces then eliminates influence of outliers 81 82 83 Returns 84 ------- 85 fitresult : dict 86 Contains the following keys: 87 - slope (float) 88 Slope or Gradient of Y vs. X 89 - intercept (float) 90 Y intercept. 91 - slope_ste (float) 92 Standard error of slope estimate 93 - intercept_ste (float) 94 standard error of intercept estimate 95 - slope_interval ([float, float]) 96 confidence interval for gradient at confidence level cl 97 - intercept_interval ([float, float]) 98 confidence interval for intercept at confidence level cl 99 - df_model (float) 100 degrees of freedom for model 101 - df_resid (float) 102 degrees of freedom for residuals 103 - params ([float,float]) 104 array of fitted parameters 105 ''' 106 107 def str2var( v, data ): 108 '''Extract variable named v from Dataframe named data''' 109 try: 110 return data[v] 111 except Exception as exc: 112 raise ValueError( 'Argument data must be provided with a key named '+v ) from exc 113 114 # If variables are provided as strings, get values from the data structure 115 if isinstance( X, str ): 116 X = str2var( X, data ) 117 if isinstance( Y, str ): 118 Y = str2var( Y, data ) 119 if isinstance( W, str ): 120 W = str2var( W, data ) 121 122 # Make sure arrays have the same length 123 assert ( len(X) == len(Y) ), 'Arrays X and Y must have the same length' 124 if W is None: 125 W = np.zeros_like(X) + 1 126 else: 127 assert ( len(W) == len(X) ), 'Array W must have the same length as X and Y' 128 129 # Make sure cl is within the range 0-1 130 assert (cl < 1), 'cl must be less than 1' 131 assert (cl > 0), 'cl must be greater than 0' 132 133 # Drop any NaN elements of X, Y, or W 134 # Infinite values are allowed but will make the result undefined 135 # idx = ~np.logical_or( np.isnan(X0), np.isnan(Y0) ) 136 idx = ~np.isnan(X) * ~np.isnan(Y) * ~np.isnan(W) 137 138 X0 = X[idx] 139 Y0 = Y[idx] 140 W0 = W[idx] 141 142 # Number of observations 143 N = len(X0) 144 145 # Degrees of freedom for the model 146 if intercept: 147 dfmod = 2 148 else: 149 dfmod = 1 150 151 152 # Choose whether to use methods robust to outliers 153 if robust: 154 155 # Choose the robust method 156 if ((robust_method.lower() =='mcd') or (robust_method.lower() == 'fastmcd') ): 157 # FAST MCD 158 159 if not intercept: 160 # intercept=False could possibly be supported by calculating 161 # using mcd.support_ as weights in an explicit variance/covariance calculation 162 raise NotImplementedError('FastMCD method only supports SMA with intercept') 163 164 # Fit robust model of mean and covariance 165 mcd = MinCovDet().fit( np.array([X0,Y0]).T ) 166 167 # Robust mean 168 Xmean = mcd.location_[0] 169 Ymean = mcd.location_[1] 170 171 # Robust variance of X, Y 172 Vx = mcd.covariance_[0,0] 173 Vy = mcd.covariance_[1,1] 174 175 # Robust covariance 176 Vxy = mcd.covariance_[0,1] 177 178 # Number of observations used in mean and covariance estimate 179 # excludes observations marked as outliers 180 N = mcd.support_.sum() 181 182 elif ((robust_method.lower() =='biweight') or (robust_method.lower() == 'huber') ): 183 184 # Tukey's Biweight and Huber's T 185 if robust_method.lower()=='biweight': 186 norm = norms.TukeyBiweight() 187 else: 188 norm = norms.HuberT() 189 190 # Get weights for downweighting outliers 191 # Fitting a linear model the easiest way to get these 192 # Options include "TukeyBiweight" (totally removes large deviates) 193 # "HuberT" (linear, not squared weighting of large deviates) 194 rweights = smf.rlm('y~x+1',{'x':X0,'y':Y0},M=norm).fit().weights 195 196 # Sum of weight and weights squared, for convienience 197 rsum = np.sum( rweights ) 198 rsum2 = np.sum( rweights**2 ) 199 200 # Mean 201 Xmean = np.sum( X0 * rweights ) / rsum 202 Ymean = np.sum( Y0 * rweights ) / rsum 203 204 # Force intercept through zero, if requested 205 if not intercept: 206 Xmean = 0 207 Ymean = 0 208 209 # Variance & Covariance 210 Vx = np.sum( (X0-Xmean)**2 * rweights**2 ) / rsum2 211 Vy = np.sum( (Y0-Ymean)**2 * rweights**2 ) / rsum2 212 Vxy = np.sum( (X0-Xmean) * (Y0-Ymean) * rweights**2 ) / rsum2 213 214 # Effective number of observations 215 N = rsum 216 217 else: 218 219 raise NotImplementedError("sma hasn't implemented robust_method={:%s}".\ 220 format(robust_method)) 221 else: 222 223 if intercept: 224 225 wsum = np.sum(W) 226 227 # Average values 228 Xmean = np.sum(X0 * W0) / wsum 229 Ymean = np.sum(Y0 * W0) / wsum 230 231 # Covariance matrix 232 cov = np.cov( X0, Y0, ddof=1, aweights=W0**2 ) 233 234 # Variance 235 Vx = cov[0,0] 236 Vy = cov[1,1] 237 238 # Covariance 239 Vxy = cov[0,1] 240 241 else: 242 243 # Force the line to pass through origin by setting means to zero 244 Xmean = 0 245 Ymean = 0 246 247 wsum = np.sum(W0) 248 249 # Sum of squares in place of variance and covariance 250 Vx = np.sum( X0**2 * W0 ) / wsum 251 Vy = np.sum( Y0**2 * W0 ) / wsum 252 Vxy= np.sum( X0*Y0 * W0 ) / wsum 253 254 # Standard deviation 255 Sx = np.sqrt( Vx ) 256 Sy = np.sqrt( Vy ) 257 258 # Correlation coefficient (equivalent to np.corrcoef()[1,0] for non-robust cases) 259 R = Vxy / np.sqrt( Vx * Vy ) 260 261 ############# 262 # SLOPE 263 264 Slope = np.sign(R) * Sy / Sx 265 266 # Standard error of slope estimate 267 ste_slope = np.sqrt( 1/(N-dfmod) * Sy**2 / Sx**2 * (1-R**2) ) 268 269 # Confidence interval for Slope 270 B = (1-R**2)/(N-dfmod) * stats.f.isf(1-cl, 1, N-dfmod) 271 ci_grad = Slope * ( np.sqrt( B+1 ) + np.sqrt(B)*np.array([-1,+1]) ) 272 273 ############# 274 # INTERCEPT 275 276 if intercept: 277 Intercept = Ymean - Slope * Xmean 278 279 # Standard deviation of residuals 280 # New Method: Formula from smatr R package (Warton) 281 # This formula avoids large residuals of outliers when using robust=True 282 Sr = np.sqrt((Vy - 2 * Slope * Vxy + Slope**2 * Vx ) * (N-1) / (N-dfmod) ) 283 284 # OLD METHOD 285 # Standard deviation of residuals 286 #resid = Y0 - (Intercept + Slope * X0 ) 287 # Population standard deviation of the residuals 288 #Sr = np.std( resid, ddof=0 ) 289 290 # Standard error of the intercept estimate 291 ste_int = np.sqrt( Sr**2/N + Xmean**2 * ste_slope**2 ) 292 293 # Confidence interval for Intercept 294 tcrit = stats.t.isf((1-cl)/2,N-dfmod) 295 ci_int = Intercept + ste_int * np.array([-tcrit,tcrit]) 296 297 else: 298 299 # Set Intercept quantities to zero 300 Intercept = 0 301 ste_int = 0 302 ci_int = np.array([0,0]) 303 304 result = dict( slope = Slope, 305 intercept = Intercept, 306 slope_ste = ste_slope, 307 intercept_ste = ste_int, 308 slope_interval = ci_grad, 309 intercept_interval = ci_int, 310 df_model = dfmod, 311 df_resid = N-dfmod, 312 params = np.array([Slope,Intercept]), 313 nobs = N, 314 fittedvalues = Intercept + Slope * X0, 315 resid = Intercept + Slope * X0 - Y0 ) 316 317 # return Slope, Intercept, ste_slope, ste_int, ci_grad, ci_int 318 return result
Standard Major-Axis (SMA) line fitting
Calculate standard major axis, aka reduced major axis, fit to data X and Y. The main advantage of this over ordinary least squares is that the best fit of Y to X will be the same as the best fit of X to Y.
The fit equations and confidence intervals are implemented following Warton et al. (2006). Robust fits use the FastMCD covariance estimate from Rousseeuw and Van Driessen (1999). While there are many alternative robust covariance estimators (e.g. other papers by D.I. Warton using M-estimators), the FastMCD algorithm is default in Matlab. When the standard error or uncertainty of each point is known, then weighted SMA may be preferrable to robust SMA. The conventional choice of weights for each point i is W_i = 1 / ( var(X_i) + var(Y_i) ), where var() is the variance (squared standard error).
References Warton, D. I., Wright, I. J., Falster, D. S. and Westoby, M.: Bivariate line-fitting methods for allometry, Biol. Rev., 81(02), 259, doi:10.1017/S1464793106007007, 2006. Rousseeuw, P. J. and Van Driessen, K.: A Fast Algorithm for the Minimum Covariance Determinant Estimator, Technometrics, 41(3), 1999.
Parameters
- X, Y (array_like or str): Input values, Must have same length.
- W (array_like or str, optional): array of weights for each X-Y point, typically W_i = 1/(var(X_i)+var(Y_i))
- data (dict_like, optional): data structure containing variables. Used when X, Y, or W are str.
- cl (float (default = 0.95)): Desired confidence level for output.
- intercept (bool, default=True): Specify if the fitted model should include a non-zero intercept. The model will be forced through the origin (0,0) if intercept=False.
- robust (bool, default=False): Use statistical methods that are robust to the presence of outliers
- robust_method ({'FastMCD' (default), 'Huber', 'Biweight'}):
Method for calculating robust variance and covariance. Options:
- 'MCD' or 'FastMCD' for Fast MCD
- 'Huber' for Huber's T: reduce, not eliminate, influence of outliers
- 'Biweight' for Tukey's Biweight: reduces then eliminates influence of outliers
Returns
- fitresult (dict):
Contains the following keys:
- slope (float) Slope or Gradient of Y vs. X
- intercept (float) Y intercept.
- slope_ste (float) Standard error of slope estimate
- intercept_ste (float) standard error of intercept estimate
- slope_interval ([float, float]) confidence interval for gradient at confidence level cl
- intercept_interval ([float, float]) confidence interval for intercept at confidence level cl
- df_model (float) degrees of freedom for model
- df_resid (float) degrees of freedom for residuals
- params ([float,float]) array of fitted parameters
Alias for sma
452def sen( x, y ): 453 '''Estimate linear trend using the Thiel-Sen method 454 455 This non-parametric method finds the median slope among all 456 combinations of time points. 457 scipy.stats.theilslopes provides the same slope estimate, with 458 confidence intervals. However, this function is faster for 459 large datasets due to Numba 460 461 Parameters 462 ---------- 463 x : array_like (N,) 464 independent variable 465 y : array_like (N,) 466 dependent variable 467 468 Returns 469 ------- 470 sen : float 471 the median slope 472 slopes : array (N*N,) 473 all slope estimates from all combinations of x and y 474 ''' 475 476 with warnings.catch_warnings(): 477 warnings.simplefilter('always', DeprecationWarning) 478 warnings.warn(f'Sen function is slow unless numba.jit is used. Use scipy.stats.theilslopes instead.', 479 DeprecationWarning, stacklevel=2) 480 481 if len( x ) != len( y ): 482 print('Inputs x and y must have same dimension') 483 return np.nan 484 485 # Find number of time points 486 n = len( x ) 487 488 # Array to hold all slope estimates 489 slopes = np.zeros( np.ceil( n * ( n-1 ) / 2 ).astype('int') ) 490 slopes[:] = np.nan 491 492 count = 0 493 494 for i in range(n): 495 for j in range(i+1, n): 496 497 # Slope between elements i and j 498 slopeij = ( y[j] - y[i] ) / ( x[j] - x[i] ) 499 500 slopes[count] = slopeij 501 502 count += 1 503 504 # Thiel-Sen estimate is the median slope, neglecting NaN 505 sen = np.nanmedian( slopes ) 506 507 return sen, slopes
Estimate linear trend using the Thiel-Sen method
This non-parametric method finds the median slope among all
combinations of time points.
scipy.stats.theilslopes provides the same slope estimate, with
confidence intervals. However, this function is faster for
large datasets due to Numba
Parameters
- x (array_like (N,)): independent variable
- y (array_like (N,)): dependent variable
Returns
- sen (float): the median slope
- slopes (array (N*N,)): all slope estimates from all combinations of x and y
Alias for sen
320def york( x, y, err_x=1, err_y=1, rerr_xy=0 ): 321 '''York regression accounting for error in x and y 322 Follows the notation and algorithm of York et al. (2004) Section III 323 324 Parameters 325 ---------- 326 x, y : ndarray 327 dependent (x) and independent (y) variables for fitting 328 err_x, err_y : ndarray (default=1) 329 standard deviation of errors/uncertainty in x and y 330 rerr_xy : float (default=0) 331 correlation coefficient for errors in x and y, 332 default to rerr_xy=0 meaning that the errors in x are unrelated to errors in y 333 err_x, err_y, and rerr_xy can be constants or arrays of the same length as x and y 334 335 Returns 336 ------- 337 fitresult : dict 338 Contains the following keys: 339 - slope (float) 340 Slope or Gradient of Y vs. X 341 - intercept (float) 342 Y intercept. 343 - slope_ste (float) 344 Standard error of slope estimate 345 - intercept_ste (float) 346 standard error of intercept estimate 347 - slope_interval ([float, float]) 348 confidence interval for gradient at confidence level cl 349 - intercept_interval ([float, float]) 350 confidence interval for intercept at confidence level cl 351 - df_model (float) 352 degrees of freedom for model 353 - df_resid (float) 354 degrees of freedom for residuals 355 - params ([float,float]) 356 array of fitted parameters 357 ''' 358 359 # relative error tolerance required for convergence 360 rtol = 1e-15 361 362 # Initial guess for slope, from ordinary least squares 363 result = stats.linregress( x, y ) 364 b = result[0] 365 366 # Weights for x and y 367 wx = 1 / err_x**2 368 wy = 1 / err_y**2 369 370 # Combined weights 371 alpha = np.sqrt( wx * wy ) 372 373 # Iterate until solution converges, but not more 50 times 374 maxiter=50 375 for i in range(1,maxiter): 376 377 # Weight for point i 378 W = wx * wy / ( wx + b**2 * wy - 2 * b * rerr_xy * alpha ) 379 Wsum = np.sum( W ) 380 381 # Weighted means 382 Xbar = np.sum( W * x ) / Wsum 383 Ybar = np.sum( W * y ) / Wsum 384 385 # Deviation from weighted means 386 U = x - Xbar 387 V = y - Ybar 388 389 # parameter needed for slope 390 beta = W * ( U / wy + b*V / wx - (b*U + V) * rerr_xy / alpha ) 391 392 # Update slope estimate 393 bnew = np.sum( W * beta * V ) / np.sum( W * beta * U ) 394 395 # Break from loop if new value is very close to old value 396 if np.abs( (bnew-b)/b ) < rtol: 397 break 398 else: 399 b = bnew 400 401 if i==maxiter: 402 raise ValueError( f'York regression failed to converge in {maxiter:d} iterations' ) 403 404 # Intercept 405 a = Ybar - b * Xbar 406 407 # least-squares adjusted points, expectation values of X and Y 408 xa = Xbar + beta 409 ya = Ybar + b*beta 410 411 # Mean of adjusted points 412 xabar = np.sum( W * xa ) / Wsum 413 yabar = np.sum( W * ya ) / Wsum 414 415 # Devaiation of adjusted points from their means 416 u = xa - xabar 417 v = ya - yabar 418 419 # Variance of slope and intercept estimates 420 varb = 1 / np.sum( W * u**2 ) 421 vara = 1 / Wsum + xabar**2 * varb 422 423 # Standard error of slope and intercept 424 siga = np.sqrt( vara ) 425 sigb = np.sqrt( varb ) 426 427 # Define a named tuple type that will contain the results 428 # result = namedtuple( 'result', 'slope intercept sigs sigi params sigma' ) 429 430 # Return results as a named tuple, User can access as a regular tuple too 431 # return result( b, a, sigb, siga, [b,a], [sigb, siga] ) 432 433 dfmod = 2 434 N = np.sum( ~np.isnan(x) * ~np.isnan(y) ) 435 436 result = dict( slope = b, 437 intercept = a, 438 slope_ste = sigb, 439 intercept_ste = siga, 440 slope_interval = [None,None], 441 intercept_interval = [None,None], 442 df_model = dfmod, 443 df_resid = N-dfmod, 444 params = np.array([b,a]), 445 nobs = N, 446 fittedvalues = a + b * x, 447 resid = a + b * x - y ) 448 449 return result
York regression accounting for error in x and y Follows the notation and algorithm of York et al. (2004) Section III
Parameters
- x, y (ndarray): dependent (x) and independent (y) variables for fitting
- err_x, err_y (ndarray (default=1)): standard deviation of errors/uncertainty in x and y
- rerr_xy (float (default=0)): correlation coefficient for errors in x and y, default to rerr_xy=0 meaning that the errors in x are unrelated to errors in y err_x, err_y, and rerr_xy can be constants or arrays of the same length as x and y
Returns
- fitresult (dict):
Contains the following keys:
- slope (float) Slope or Gradient of Y vs. X
- intercept (float) Y intercept.
- slope_ste (float) Standard error of slope estimate
- intercept_ste (float) standard error of intercept estimate
- slope_interval ([float, float]) confidence interval for gradient at confidence level cl
- intercept_interval ([float, float]) confidence interval for intercept at confidence level cl
- df_model (float) degrees of freedom for model
- df_resid (float) degrees of freedom for residuals
- params ([float,float]) array of fitted parameters