acgc.stats.bivariate_lines
Specialized methods of bivariate line fitting
- Standard major axis (SMA) also called reduced major axis (RMA)
- York regression, for data with errors in x and y
- Theil-Sen, non-parametric slope estimation (use numba to accelerate the function in this module)
1# -*- coding: utf-8 -*- 2"""Specialized methods of bivariate line fitting 3 4* Standard major axis (SMA) also called reduced major axis (RMA) 5* York regression, for data with errors in x and y 6* Theil-Sen, non-parametric slope estimation (use numba to accelerate the function in this module) 7""" 8# from collections import namedtuple 9import warnings 10import numpy as np 11import scipy.stats as stats 12from sklearn.covariance import MinCovDet 13import statsmodels.formula.api as smf 14import statsmodels.robust.norms as norms 15from scipy.stats import theilslopes 16#from numba import jit 17 18__all__ = [ 19 "bivariate_line_equation", 20 "sma", 21 "smafit", 22 "sen", 23 "sen_slope", 24 "sen_numba", 25 "york" 26] 27# Aliases 28def sen_slope(*args,**kwargs): 29 '''Alias for `sen`''' 30 return sen(*args,**kwargs) 31def smafit(*args,**kwargs): 32 '''Alias for `sma`''' 33 return sma(*args,**kwargs) 34 35def bivariate_line_equation(fitresult, 36 floatformat='{:.3f}', 37 ystring='include' ): 38 '''Write equation for the fitted line 39 40 Parameters 41 ---------- 42 fitresult : dict 43 results of the line fit 44 floatformat : str 45 format string for the numerical values (default='{:.3f}') 46 ystring : {'include' (default), 'separate', 'none'} 47 specifies whether "y =" should be included in result, a separate item in tuple, or none 48 49 Returns 50 ------- 51 fitline_string : str 52 equation for the the fitted line, in the form "y = a x + b" or "y = a x" 53 ''' 54 55 # Left-hand side 56 lhs = "y_"+fitresult['method'] 57 58 # Right-hand side 59 if fitresult['fitintercept']: 60 rhs = f'{floatformat:s} x + {floatformat:s}'.\ 61 format( fitresult['slope'], fitresult['intercept'] ) 62 else: 63 rhs = f'{floatformat:s} x'.\ 64 format( fitresult['slope'] ) 65 66 # Combine right and left-hand sides 67 if ystring=='include': 68 equation = f'{lhs:s} = {rhs:s}' 69 elif ystring=='separate': 70 equation = (lhs,rhs) 71 elif ystring=='none': 72 equation = rhs 73 else: 74 raise ValueError('Unrecognized value of ystring: '+ystring) 75 76 return equation 77 78def sma(X,Y,W=None, 79 data=None, 80 alpha=0.95, 81 intercept=True, 82 robust=False,robust_method='FastMCD'): 83 '''Standard Major-Axis (SMA) line fitting 84 85 Calculate standard major axis, aka reduced major axis, fit to 86 data X and Y. The main advantage of this over ordinary least squares is 87 that the best fit of Y to X will be the same as the best fit of X to Y. 88 89 The fit equations and confidence intervals are implemented following 90 Warton et al. (2006). Robust fits use the FastMCD covariance estimate 91 from Rousseeuw and Van Driessen (1999). While there are many alternative 92 robust covariance estimators (e.g. other papers by D.I. Warton using M-estimators), 93 the FastMCD algorithm is default in Matlab. When the standard error or 94 uncertainty of each point is known, then weighted SMA may be preferrable to 95 robust SMA. The conventional choice of weights for each point i is 96 W_i = 1 / ( var(X_i) + var(Y_i) ), where var() is the variance 97 (squared standard error). 98 99 References 100 Warton, D. I., Wright, I. J., Falster, D. S. and Westoby, M.: 101 Bivariate line-fitting methods for allometry, Biol. Rev., 81(02), 259, 102 doi:10.1017/S1464793106007007, 2006. 103 Rousseeuw, P. J. and Van Driessen, K.: A Fast Algorithm for the Minimum 104 Covariance Determinant Estimator, Technometrics, 41(3), 1999. 105 106 Parameters 107 ---------- 108 X, Y : array_like or str 109 Input values, Must have same length. 110 W : array_like or str, optional 111 array of weights for each X-Y point, typically W_i = 1/(var(X_i)+var(Y_i)) 112 data : dict_like, optional 113 data structure containing variables. Used when X, Y, or W are str. 114 alpha : float (default = 0.95) 115 Desired confidence level [0,1] for output. 116 intercept : bool, default=True 117 Specify if the fitted model should include a non-zero intercept. 118 The model will be forced through the origin (0,0) if intercept=False. 119 robust : bool, default=False 120 Use statistical methods that are robust to the presence of outliers 121 robust_method: {'FastMCD' (default), 'Huber', 'Biweight'} 122 Method for calculating robust variance and covariance. Options: 123 - 'MCD' or 'FastMCD' for Fast MCD 124 - 'Huber' for Huber's T: reduce, not eliminate, influence of outliers 125 - 'Biweight' for Tukey's Biweight: reduces then eliminates influence of outliers 126 127 128 Returns 129 ------- 130 fitresult : dict 131 Contains the following keys: 132 - slope (float) 133 Slope or Gradient of Y vs. X 134 - intercept (float) 135 Y intercept. 136 - slope_ste (float) 137 Standard error of slope estimate 138 - intercept_ste (float) 139 standard error of intercept estimate 140 - slope_interval ([float, float]) 141 confidence interval for gradient at confidence level alpha 142 - intercept_interval ([float, float]) 143 confidence interval for intercept at confidence level alpha 144 - alpha (float) 145 confidence level [0,1] for slope and intercept intervals 146 - df_model (float) 147 degrees of freedom for model 148 - df_resid (float) 149 degrees of freedom for residuals 150 - params ([float,float]) 151 array of fitted parameters 152 - fittedvalues (ndarray) 153 array of fitted values 154 - resid (ndarray) 155 array of residual values 156 ''' 157 158 def str2var( v, data ): 159 '''Extract variable named v from Dataframe named data''' 160 try: 161 return data[v] 162 except Exception as exc: 163 raise ValueError( 'Argument data must be provided with a key named '+v ) from exc 164 165 # If variables are provided as strings, get values from the data structure 166 if isinstance( X, str ): 167 X = str2var( X, data ) 168 if isinstance( Y, str ): 169 Y = str2var( Y, data ) 170 if isinstance( W, str ): 171 W = str2var( W, data ) 172 173 # Make sure arrays have the same length 174 assert ( len(X) == len(Y) ), 'Arrays X and Y must have the same length' 175 if W is None: 176 W = np.zeros_like(X) + 1 177 else: 178 assert ( len(W) == len(X) ), 'Array W must have the same length as X and Y' 179 180 # Make sure alpha is within the range 0-1 181 assert (alpha < 1), 'alpha must be less than 1' 182 assert (alpha > 0), 'alpha must be greater than 0' 183 184 # Drop any NaN elements of X, Y, or W 185 # Infinite values are allowed but will make the result undefined 186 # idx = ~np.logical_or( np.isnan(X0), np.isnan(Y0) ) 187 idx = ~np.isnan(X) * ~np.isnan(Y) * ~np.isnan(W) 188 189 X0 = X[idx] 190 Y0 = Y[idx] 191 W0 = W[idx] 192 193 # Number of observations 194 N = len(X0) 195 196 # Degrees of freedom for the model 197 if intercept: 198 dfmod = 2 199 else: 200 dfmod = 1 201 202 203 # Choose whether to use methods robust to outliers 204 if robust: 205 206 # Choose the robust method 207 if ((robust_method.lower() =='mcd') or (robust_method.lower() == 'fastmcd') ): 208 # FAST MCD 209 210 if not intercept: 211 # intercept=False could possibly be supported by calculating 212 # using mcd.support_ as weights in an explicit variance/covariance calculation 213 raise NotImplementedError('FastMCD method only supports SMA with intercept') 214 215 # Fit robust model of mean and covariance 216 mcd = MinCovDet().fit( np.array([X0,Y0]).T ) 217 218 # Robust mean 219 Xmean = mcd.location_[0] 220 Ymean = mcd.location_[1] 221 222 # Robust variance of X, Y 223 Vx = mcd.covariance_[0,0] 224 Vy = mcd.covariance_[1,1] 225 226 # Robust covariance 227 Vxy = mcd.covariance_[0,1] 228 229 # Number of observations used in mean and covariance estimate 230 # excludes observations marked as outliers 231 N = mcd.support_.sum() 232 233 elif ((robust_method.lower() =='biweight') or (robust_method.lower() == 'huber') ): 234 235 # Tukey's Biweight and Huber's T 236 if robust_method.lower()=='biweight': 237 norm = norms.TukeyBiweight() 238 else: 239 norm = norms.HuberT() 240 241 # Get weights for downweighting outliers 242 # Fitting a linear model the easiest way to get these 243 # Options include "TukeyBiweight" (totally removes large deviates) 244 # "HuberT" (linear, not squared weighting of large deviates) 245 rweights = smf.rlm('y~x+1',{'x':X0,'y':Y0},M=norm).fit().weights 246 247 # Sum of weight and weights squared, for convienience 248 rsum = np.sum( rweights ) 249 rsum2 = np.sum( rweights**2 ) 250 251 # Mean 252 Xmean = np.sum( X0 * rweights ) / rsum 253 Ymean = np.sum( Y0 * rweights ) / rsum 254 255 # Force intercept through zero, if requested 256 if not intercept: 257 Xmean = 0 258 Ymean = 0 259 260 # Variance & Covariance 261 Vx = np.sum( (X0-Xmean)**2 * rweights**2 ) / rsum2 262 Vy = np.sum( (Y0-Ymean)**2 * rweights**2 ) / rsum2 263 Vxy = np.sum( (X0-Xmean) * (Y0-Ymean) * rweights**2 ) / rsum2 264 265 # Effective number of observations 266 N = rsum 267 268 else: 269 270 raise NotImplementedError("sma hasn't implemented robust_method={:%s}".\ 271 format(robust_method)) 272 else: 273 274 if intercept: 275 276 wsum = np.sum(W) 277 278 # Average values 279 Xmean = np.sum(X0 * W0) / wsum 280 Ymean = np.sum(Y0 * W0) / wsum 281 282 # Covariance matrix 283 cov = np.cov( X0, Y0, ddof=1, aweights=W0**2 ) 284 285 # Variance 286 Vx = cov[0,0] 287 Vy = cov[1,1] 288 289 # Covariance 290 Vxy = cov[0,1] 291 292 else: 293 294 # Force the line to pass through origin by setting means to zero 295 Xmean = 0 296 Ymean = 0 297 298 wsum = np.sum(W0) 299 300 # Sum of squares in place of variance and covariance 301 Vx = np.sum( X0**2 * W0 ) / wsum 302 Vy = np.sum( Y0**2 * W0 ) / wsum 303 Vxy= np.sum( X0*Y0 * W0 ) / wsum 304 305 # Standard deviation 306 Sx = np.sqrt( Vx ) 307 Sy = np.sqrt( Vy ) 308 309 # Correlation coefficient (equivalent to np.corrcoef()[1,0] for non-robust cases) 310 R = Vxy / np.sqrt( Vx * Vy ) 311 312 ############# 313 # SLOPE 314 315 Slope = np.sign(R) * Sy / Sx 316 317 # Standard error of slope estimate 318 ste_slope = np.sqrt( 1/(N-dfmod) * Sy**2 / Sx**2 * (1-R**2) ) 319 320 # Confidence interval for Slope 321 B = (1-R**2)/(N-dfmod) * stats.f.isf(1-alpha, 1, N-dfmod) 322 ci_grad = Slope * ( np.sqrt( B+1 ) + np.sqrt(B)*np.array([-1,+1]) ) 323 324 ############# 325 # INTERCEPT 326 327 if intercept: 328 Intercept = Ymean - Slope * Xmean 329 330 # Standard deviation of residuals 331 # New Method: Formula from smatr R package (Warton) 332 # This formula avoids large residuals of outliers when using robust=True 333 Sr = np.sqrt((Vy - 2 * Slope * Vxy + Slope**2 * Vx ) * (N-1) / (N-dfmod) ) 334 335 # OLD METHOD 336 # Standard deviation of residuals 337 #resid = Y0 - (Intercept + Slope * X0 ) 338 # Population standard deviation of the residuals 339 #Sr = np.std( resid, ddof=0 ) 340 341 # Standard error of the intercept estimate 342 ste_int = np.sqrt( Sr**2/N + Xmean**2 * ste_slope**2 ) 343 344 # Confidence interval for Intercept 345 tcrit = stats.t.isf((1-alpha)/2,N-dfmod) 346 ci_int = Intercept + ste_int * np.array([-tcrit,tcrit]) 347 348 else: 349 350 # Set Intercept quantities to zero 351 Intercept = 0 352 ste_int = 0 353 ci_int = np.array([0,0]) 354 355 result = dict( slope = Slope, 356 intercept = Intercept, 357 slope_ste = ste_slope, 358 intercept_ste = ste_int, 359 slope_interval = ci_grad, 360 intercept_interval = ci_int, 361 alpha = alpha, 362 df_model = dfmod, 363 df_resid = N-dfmod, 364 params = np.array([Slope,Intercept]), 365 nobs = N, 366 fittedvalues = Intercept + Slope * X0, 367 resid = Intercept + Slope * X0 - Y0 ) 368 369 # return Slope, Intercept, ste_slope, ste_int, ci_grad, ci_int 370 return result 371 372def york( x, y, err_x=1, err_y=1, rerr_xy=0 ): 373 '''York regression accounting for error in x and y 374 Follows the notation and algorithm of York et al. (2004) Section III 375 376 Parameters 377 ---------- 378 x, y : ndarray 379 dependent (x) and independent (y) variables for fitting 380 err_x, err_y : ndarray (default=1) 381 standard deviation of errors/uncertainty in x and y 382 rerr_xy : float (default=0) 383 correlation coefficient for errors in x and y, 384 default to rerr_xy=0 meaning that the errors in x are unrelated to errors in y 385 err_x, err_y, and rerr_xy can be constants or arrays of the same length as x and y 386 387 Returns 388 ------- 389 fitresult : dict 390 Contains the following keys: 391 - slope (float) 392 Slope or Gradient of Y vs. X 393 - intercept (float) 394 Y intercept. 395 - slope_ste (float) 396 Standard error of slope estimate 397 - intercept_ste (float) 398 standard error of intercept estimate 399 - slope_interval ([float, float]) 400 confidence interval for gradient at confidence level alpha 401 - intercept_interval ([float, float]) 402 confidence interval for intercept at confidence level alpha 403 - alpha (float) 404 confidence level [0,1] for slope and intercept intervals 405 - df_model (float) 406 degrees of freedom for model 407 - df_resid (float) 408 degrees of freedom for residuals 409 - params ([float,float]) 410 array of fitted parameters 411 - fittedvalues (ndarray) 412 array of fitted values 413 - resid (ndarray) 414 array of residual values 415 ''' 416 417 # relative error tolerance required for convergence 418 rtol = 1e-15 419 420 # Initial guess for slope, from ordinary least squares 421 result = stats.linregress( x, y ) 422 b = result[0] 423 424 # Weights for x and y 425 wx = 1 / err_x**2 426 wy = 1 / err_y**2 427 428 # Combined weights 429 alpha = np.sqrt( wx * wy ) 430 431 # Iterate until solution converges, but not more 50 times 432 maxiter=50 433 for i in range(1,maxiter): 434 435 # Weight for point i 436 W = wx * wy / ( wx + b**2 * wy - 2 * b * rerr_xy * alpha ) 437 Wsum = np.sum( W ) 438 439 # Weighted means 440 Xbar = np.sum( W * x ) / Wsum 441 Ybar = np.sum( W * y ) / Wsum 442 443 # Deviation from weighted means 444 U = x - Xbar 445 V = y - Ybar 446 447 # parameter needed for slope 448 beta = W * ( U / wy + b*V / wx - (b*U + V) * rerr_xy / alpha ) 449 450 # Update slope estimate 451 bnew = np.sum( W * beta * V ) / np.sum( W * beta * U ) 452 453 # Break from loop if new value is very close to old value 454 if np.abs( (bnew-b)/b ) < rtol: 455 break 456 else: 457 b = bnew 458 459 if i==maxiter: 460 raise ValueError( f'York regression failed to converge in {maxiter:d} iterations' ) 461 462 # Intercept 463 a = Ybar - b * Xbar 464 465 # least-squares adjusted points, expectation values of X and Y 466 xa = Xbar + beta 467 ya = Ybar + b*beta 468 469 # Mean of adjusted points 470 xabar = np.sum( W * xa ) / Wsum 471 yabar = np.sum( W * ya ) / Wsum 472 473 # Devaiation of adjusted points from their means 474 u = xa - xabar 475 v = ya - yabar 476 477 # Variance of slope and intercept estimates 478 varb = 1 / np.sum( W * u**2 ) 479 vara = 1 / Wsum + xabar**2 * varb 480 481 # Standard error of slope and intercept 482 siga = np.sqrt( vara ) 483 sigb = np.sqrt( varb ) 484 485 # Define a named tuple type that will contain the results 486 # result = namedtuple( 'result', 'slope intercept sigs sigi params sigma' ) 487 488 # Return results as a named tuple, User can access as a regular tuple too 489 # return result( b, a, sigb, siga, [b,a], [sigb, siga] ) 490 491 dfmod = 2 492 N = np.sum( ~np.isnan(x) * ~np.isnan(y) ) 493 494 result = dict( slope = b, 495 intercept = a, 496 slope_ste = sigb, 497 intercept_ste = siga, 498 slope_interval = [None,None], 499 intercept_interval = [None,None], 500 alpha = alpha, 501 df_model = dfmod, 502 df_resid = N-dfmod, 503 params = np.array([b,a]), 504 nobs = N, 505 fittedvalues = a + b * x, 506 resid = a + b * x - y ) 507 508 return result 509 510def sen( x, y, alpha=0.95, method='separate' ): 511 ''''Theil-Sen slope estimate 512 513 This function wraps `scipy.stats.theilslopes` and provides 514 results in the same dict format as the other line fitting methods 515 in this module 516 517 Parameters 518 ---------- 519 x, y : ndarray 520 dependent (x) and independent (y) variables for fitting 521 alpha : float (default = 0.95) 522 Desired confidence level [0,1] for output. 523 method : {'separate' (default), 'joint'} 524 Method for estimating intercept. 525 - 'separate' uses np.median(y) - slope * np.median(x) 526 - 'joint' uses np.median( y - slope * x ) 527 528 Returns 529 ------- 530 fitresult : dict 531 Contains the following keys: 532 - slope (float) 533 Slope or Gradient of Y vs. X 534 - intercept (float) 535 Y intercept. 536 - slope_ste (float) 537 Standard error of slope estimate 538 - intercept_ste (float) 539 standard error of intercept estimate 540 - slope_interval ([float, float]) 541 confidence interval for gradient at confidence level alpha 542 - intercept_interval ([float, float]) 543 confidence interval for intercept at confidence level alpha 544 - alpha (float) 545 confidence level [0,1] for slope and intercept intervals 546 - df_model (float) 547 degrees of freedom for model 548 - df_resid (float) 549 degrees of freedom for residuals 550 - params ([float,float]) 551 array of fitted parameters 552 - fittedvalues (ndarray) 553 array of fitted values 554 - resid (ndarray) 555 array of residual values 556 ''' 557 558 slope, intercept, low_slope, high_slope = theilslopes(y,x,alpha,method) 559 560 dfmod = 2 561 N = np.sum( ~np.isnan(x) * ~np.isnan(y) ) 562 563 result = dict( slope = slope, 564 intercept = intercept, 565 slope_ste = None, 566 intercept_ste = None, 567 slope_interval = [low_slope,high_slope], 568 intercept_interval = [None,None], 569 alpha = alpha, 570 df_model = dfmod, 571 df_resid = N-dfmod, 572 params = np.array([slope,intercept]), 573 nobs = N, 574 fittedvalues = intercept + slope * x, 575 resid = intercept + slope * x - y ) 576 577 return result 578 579#@jit(nopython=True) 580def sen_numba( x, y ): 581 '''Estimate linear trend using the Thiel-Sen method 582 583 This non-parametric method finds the median slope among all 584 combinations of time points. 585 scipy.stats.theilslopes provides the same slope estimate, with 586 confidence intervals. However, this function is faster for 587 large datasets due to Numba 588 589 Parameters 590 ---------- 591 x : array_like (N,) 592 independent variable 593 y : array_like (N,) 594 dependent variable 595 596 Returns 597 ------- 598 sen : float 599 the median slope 600 slopes : array (N*N,) 601 all slope estimates from all combinations of x and y 602 ''' 603 604 with warnings.catch_warnings(): 605 warnings.simplefilter('always', DeprecationWarning) 606 warnings.warn(f'Sen function is slow unless numba.jit is used. Use scipy.stats.theilslopes instead.', 607 DeprecationWarning, stacklevel=2) 608 609 if len( x ) != len( y ): 610 print('Inputs x and y must have same dimension') 611 return np.nan 612 613 # Find number of time points 614 n = len( x ) 615 616 # Array to hold all slope estimates 617 slopes = np.zeros( np.ceil( n * ( n-1 ) / 2 ).astype('int') ) 618 slopes[:] = np.nan 619 620 count = 0 621 622 for i in range(n): 623 for j in range(i+1, n): 624 625 # Slope between elements i and j 626 slopeij = ( y[j] - y[i] ) / ( x[j] - x[i] ) 627 628 slopes[count] = slopeij 629 630 count += 1 631 632 # Thiel-Sen estimate is the median slope, neglecting NaN 633 sen = np.nanmedian( slopes ) 634 635 return sen, slopes
36def bivariate_line_equation(fitresult, 37 floatformat='{:.3f}', 38 ystring='include' ): 39 '''Write equation for the fitted line 40 41 Parameters 42 ---------- 43 fitresult : dict 44 results of the line fit 45 floatformat : str 46 format string for the numerical values (default='{:.3f}') 47 ystring : {'include' (default), 'separate', 'none'} 48 specifies whether "y =" should be included in result, a separate item in tuple, or none 49 50 Returns 51 ------- 52 fitline_string : str 53 equation for the the fitted line, in the form "y = a x + b" or "y = a x" 54 ''' 55 56 # Left-hand side 57 lhs = "y_"+fitresult['method'] 58 59 # Right-hand side 60 if fitresult['fitintercept']: 61 rhs = f'{floatformat:s} x + {floatformat:s}'.\ 62 format( fitresult['slope'], fitresult['intercept'] ) 63 else: 64 rhs = f'{floatformat:s} x'.\ 65 format( fitresult['slope'] ) 66 67 # Combine right and left-hand sides 68 if ystring=='include': 69 equation = f'{lhs:s} = {rhs:s}' 70 elif ystring=='separate': 71 equation = (lhs,rhs) 72 elif ystring=='none': 73 equation = rhs 74 else: 75 raise ValueError('Unrecognized value of ystring: '+ystring) 76 77 return equation
Write equation for the fitted line
Parameters
- fitresult (dict): results of the line fit
- floatformat (str): format string for the numerical values (default='{:.3f}')
- ystring ({'include' (default), 'separate', 'none'}): specifies whether "y =" should be included in result, a separate item in tuple, or none
Returns
- fitline_string (str): equation for the the fitted line, in the form "y = a x + b" or "y = a x"
79def sma(X,Y,W=None, 80 data=None, 81 alpha=0.95, 82 intercept=True, 83 robust=False,robust_method='FastMCD'): 84 '''Standard Major-Axis (SMA) line fitting 85 86 Calculate standard major axis, aka reduced major axis, fit to 87 data X and Y. The main advantage of this over ordinary least squares is 88 that the best fit of Y to X will be the same as the best fit of X to Y. 89 90 The fit equations and confidence intervals are implemented following 91 Warton et al. (2006). Robust fits use the FastMCD covariance estimate 92 from Rousseeuw and Van Driessen (1999). While there are many alternative 93 robust covariance estimators (e.g. other papers by D.I. Warton using M-estimators), 94 the FastMCD algorithm is default in Matlab. When the standard error or 95 uncertainty of each point is known, then weighted SMA may be preferrable to 96 robust SMA. The conventional choice of weights for each point i is 97 W_i = 1 / ( var(X_i) + var(Y_i) ), where var() is the variance 98 (squared standard error). 99 100 References 101 Warton, D. I., Wright, I. J., Falster, D. S. and Westoby, M.: 102 Bivariate line-fitting methods for allometry, Biol. Rev., 81(02), 259, 103 doi:10.1017/S1464793106007007, 2006. 104 Rousseeuw, P. J. and Van Driessen, K.: A Fast Algorithm for the Minimum 105 Covariance Determinant Estimator, Technometrics, 41(3), 1999. 106 107 Parameters 108 ---------- 109 X, Y : array_like or str 110 Input values, Must have same length. 111 W : array_like or str, optional 112 array of weights for each X-Y point, typically W_i = 1/(var(X_i)+var(Y_i)) 113 data : dict_like, optional 114 data structure containing variables. Used when X, Y, or W are str. 115 alpha : float (default = 0.95) 116 Desired confidence level [0,1] for output. 117 intercept : bool, default=True 118 Specify if the fitted model should include a non-zero intercept. 119 The model will be forced through the origin (0,0) if intercept=False. 120 robust : bool, default=False 121 Use statistical methods that are robust to the presence of outliers 122 robust_method: {'FastMCD' (default), 'Huber', 'Biweight'} 123 Method for calculating robust variance and covariance. Options: 124 - 'MCD' or 'FastMCD' for Fast MCD 125 - 'Huber' for Huber's T: reduce, not eliminate, influence of outliers 126 - 'Biweight' for Tukey's Biweight: reduces then eliminates influence of outliers 127 128 129 Returns 130 ------- 131 fitresult : dict 132 Contains the following keys: 133 - slope (float) 134 Slope or Gradient of Y vs. X 135 - intercept (float) 136 Y intercept. 137 - slope_ste (float) 138 Standard error of slope estimate 139 - intercept_ste (float) 140 standard error of intercept estimate 141 - slope_interval ([float, float]) 142 confidence interval for gradient at confidence level alpha 143 - intercept_interval ([float, float]) 144 confidence interval for intercept at confidence level alpha 145 - alpha (float) 146 confidence level [0,1] for slope and intercept intervals 147 - df_model (float) 148 degrees of freedom for model 149 - df_resid (float) 150 degrees of freedom for residuals 151 - params ([float,float]) 152 array of fitted parameters 153 - fittedvalues (ndarray) 154 array of fitted values 155 - resid (ndarray) 156 array of residual values 157 ''' 158 159 def str2var( v, data ): 160 '''Extract variable named v from Dataframe named data''' 161 try: 162 return data[v] 163 except Exception as exc: 164 raise ValueError( 'Argument data must be provided with a key named '+v ) from exc 165 166 # If variables are provided as strings, get values from the data structure 167 if isinstance( X, str ): 168 X = str2var( X, data ) 169 if isinstance( Y, str ): 170 Y = str2var( Y, data ) 171 if isinstance( W, str ): 172 W = str2var( W, data ) 173 174 # Make sure arrays have the same length 175 assert ( len(X) == len(Y) ), 'Arrays X and Y must have the same length' 176 if W is None: 177 W = np.zeros_like(X) + 1 178 else: 179 assert ( len(W) == len(X) ), 'Array W must have the same length as X and Y' 180 181 # Make sure alpha is within the range 0-1 182 assert (alpha < 1), 'alpha must be less than 1' 183 assert (alpha > 0), 'alpha must be greater than 0' 184 185 # Drop any NaN elements of X, Y, or W 186 # Infinite values are allowed but will make the result undefined 187 # idx = ~np.logical_or( np.isnan(X0), np.isnan(Y0) ) 188 idx = ~np.isnan(X) * ~np.isnan(Y) * ~np.isnan(W) 189 190 X0 = X[idx] 191 Y0 = Y[idx] 192 W0 = W[idx] 193 194 # Number of observations 195 N = len(X0) 196 197 # Degrees of freedom for the model 198 if intercept: 199 dfmod = 2 200 else: 201 dfmod = 1 202 203 204 # Choose whether to use methods robust to outliers 205 if robust: 206 207 # Choose the robust method 208 if ((robust_method.lower() =='mcd') or (robust_method.lower() == 'fastmcd') ): 209 # FAST MCD 210 211 if not intercept: 212 # intercept=False could possibly be supported by calculating 213 # using mcd.support_ as weights in an explicit variance/covariance calculation 214 raise NotImplementedError('FastMCD method only supports SMA with intercept') 215 216 # Fit robust model of mean and covariance 217 mcd = MinCovDet().fit( np.array([X0,Y0]).T ) 218 219 # Robust mean 220 Xmean = mcd.location_[0] 221 Ymean = mcd.location_[1] 222 223 # Robust variance of X, Y 224 Vx = mcd.covariance_[0,0] 225 Vy = mcd.covariance_[1,1] 226 227 # Robust covariance 228 Vxy = mcd.covariance_[0,1] 229 230 # Number of observations used in mean and covariance estimate 231 # excludes observations marked as outliers 232 N = mcd.support_.sum() 233 234 elif ((robust_method.lower() =='biweight') or (robust_method.lower() == 'huber') ): 235 236 # Tukey's Biweight and Huber's T 237 if robust_method.lower()=='biweight': 238 norm = norms.TukeyBiweight() 239 else: 240 norm = norms.HuberT() 241 242 # Get weights for downweighting outliers 243 # Fitting a linear model the easiest way to get these 244 # Options include "TukeyBiweight" (totally removes large deviates) 245 # "HuberT" (linear, not squared weighting of large deviates) 246 rweights = smf.rlm('y~x+1',{'x':X0,'y':Y0},M=norm).fit().weights 247 248 # Sum of weight and weights squared, for convienience 249 rsum = np.sum( rweights ) 250 rsum2 = np.sum( rweights**2 ) 251 252 # Mean 253 Xmean = np.sum( X0 * rweights ) / rsum 254 Ymean = np.sum( Y0 * rweights ) / rsum 255 256 # Force intercept through zero, if requested 257 if not intercept: 258 Xmean = 0 259 Ymean = 0 260 261 # Variance & Covariance 262 Vx = np.sum( (X0-Xmean)**2 * rweights**2 ) / rsum2 263 Vy = np.sum( (Y0-Ymean)**2 * rweights**2 ) / rsum2 264 Vxy = np.sum( (X0-Xmean) * (Y0-Ymean) * rweights**2 ) / rsum2 265 266 # Effective number of observations 267 N = rsum 268 269 else: 270 271 raise NotImplementedError("sma hasn't implemented robust_method={:%s}".\ 272 format(robust_method)) 273 else: 274 275 if intercept: 276 277 wsum = np.sum(W) 278 279 # Average values 280 Xmean = np.sum(X0 * W0) / wsum 281 Ymean = np.sum(Y0 * W0) / wsum 282 283 # Covariance matrix 284 cov = np.cov( X0, Y0, ddof=1, aweights=W0**2 ) 285 286 # Variance 287 Vx = cov[0,0] 288 Vy = cov[1,1] 289 290 # Covariance 291 Vxy = cov[0,1] 292 293 else: 294 295 # Force the line to pass through origin by setting means to zero 296 Xmean = 0 297 Ymean = 0 298 299 wsum = np.sum(W0) 300 301 # Sum of squares in place of variance and covariance 302 Vx = np.sum( X0**2 * W0 ) / wsum 303 Vy = np.sum( Y0**2 * W0 ) / wsum 304 Vxy= np.sum( X0*Y0 * W0 ) / wsum 305 306 # Standard deviation 307 Sx = np.sqrt( Vx ) 308 Sy = np.sqrt( Vy ) 309 310 # Correlation coefficient (equivalent to np.corrcoef()[1,0] for non-robust cases) 311 R = Vxy / np.sqrt( Vx * Vy ) 312 313 ############# 314 # SLOPE 315 316 Slope = np.sign(R) * Sy / Sx 317 318 # Standard error of slope estimate 319 ste_slope = np.sqrt( 1/(N-dfmod) * Sy**2 / Sx**2 * (1-R**2) ) 320 321 # Confidence interval for Slope 322 B = (1-R**2)/(N-dfmod) * stats.f.isf(1-alpha, 1, N-dfmod) 323 ci_grad = Slope * ( np.sqrt( B+1 ) + np.sqrt(B)*np.array([-1,+1]) ) 324 325 ############# 326 # INTERCEPT 327 328 if intercept: 329 Intercept = Ymean - Slope * Xmean 330 331 # Standard deviation of residuals 332 # New Method: Formula from smatr R package (Warton) 333 # This formula avoids large residuals of outliers when using robust=True 334 Sr = np.sqrt((Vy - 2 * Slope * Vxy + Slope**2 * Vx ) * (N-1) / (N-dfmod) ) 335 336 # OLD METHOD 337 # Standard deviation of residuals 338 #resid = Y0 - (Intercept + Slope * X0 ) 339 # Population standard deviation of the residuals 340 #Sr = np.std( resid, ddof=0 ) 341 342 # Standard error of the intercept estimate 343 ste_int = np.sqrt( Sr**2/N + Xmean**2 * ste_slope**2 ) 344 345 # Confidence interval for Intercept 346 tcrit = stats.t.isf((1-alpha)/2,N-dfmod) 347 ci_int = Intercept + ste_int * np.array([-tcrit,tcrit]) 348 349 else: 350 351 # Set Intercept quantities to zero 352 Intercept = 0 353 ste_int = 0 354 ci_int = np.array([0,0]) 355 356 result = dict( slope = Slope, 357 intercept = Intercept, 358 slope_ste = ste_slope, 359 intercept_ste = ste_int, 360 slope_interval = ci_grad, 361 intercept_interval = ci_int, 362 alpha = alpha, 363 df_model = dfmod, 364 df_resid = N-dfmod, 365 params = np.array([Slope,Intercept]), 366 nobs = N, 367 fittedvalues = Intercept + Slope * X0, 368 resid = Intercept + Slope * X0 - Y0 ) 369 370 # return Slope, Intercept, ste_slope, ste_int, ci_grad, ci_int 371 return result
Standard Major-Axis (SMA) line fitting
Calculate standard major axis, aka reduced major axis, fit to data X and Y. The main advantage of this over ordinary least squares is that the best fit of Y to X will be the same as the best fit of X to Y.
The fit equations and confidence intervals are implemented following Warton et al. (2006). Robust fits use the FastMCD covariance estimate from Rousseeuw and Van Driessen (1999). While there are many alternative robust covariance estimators (e.g. other papers by D.I. Warton using M-estimators), the FastMCD algorithm is default in Matlab. When the standard error or uncertainty of each point is known, then weighted SMA may be preferrable to robust SMA. The conventional choice of weights for each point i is W_i = 1 / ( var(X_i) + var(Y_i) ), where var() is the variance (squared standard error).
References Warton, D. I., Wright, I. J., Falster, D. S. and Westoby, M.: Bivariate line-fitting methods for allometry, Biol. Rev., 81(02), 259, doi:10.1017/S1464793106007007, 2006. Rousseeuw, P. J. and Van Driessen, K.: A Fast Algorithm for the Minimum Covariance Determinant Estimator, Technometrics, 41(3), 1999.
Parameters
- X, Y (array_like or str): Input values, Must have same length.
- W (array_like or str, optional): array of weights for each X-Y point, typically W_i = 1/(var(X_i)+var(Y_i))
- data (dict_like, optional): data structure containing variables. Used when X, Y, or W are str.
- alpha (float (default = 0.95)): Desired confidence level [0,1] for output.
- intercept (bool, default=True): Specify if the fitted model should include a non-zero intercept. The model will be forced through the origin (0,0) if intercept=False.
- robust (bool, default=False): Use statistical methods that are robust to the presence of outliers
- robust_method ({'FastMCD' (default), 'Huber', 'Biweight'}):
Method for calculating robust variance and covariance. Options:
- 'MCD' or 'FastMCD' for Fast MCD
- 'Huber' for Huber's T: reduce, not eliminate, influence of outliers
- 'Biweight' for Tukey's Biweight: reduces then eliminates influence of outliers
Returns
- fitresult (dict):
Contains the following keys:
- slope (float) Slope or Gradient of Y vs. X
- intercept (float) Y intercept.
- slope_ste (float) Standard error of slope estimate
- intercept_ste (float) standard error of intercept estimate
- slope_interval ([float, float]) confidence interval for gradient at confidence level alpha
- intercept_interval ([float, float]) confidence interval for intercept at confidence level alpha
- alpha (float) confidence level [0,1] for slope and intercept intervals
- df_model (float) degrees of freedom for model
- df_resid (float) degrees of freedom for residuals
- params ([float,float]) array of fitted parameters
- fittedvalues (ndarray) array of fitted values
- resid (ndarray) array of residual values
Alias for sma
511def sen( x, y, alpha=0.95, method='separate' ): 512 ''''Theil-Sen slope estimate 513 514 This function wraps `scipy.stats.theilslopes` and provides 515 results in the same dict format as the other line fitting methods 516 in this module 517 518 Parameters 519 ---------- 520 x, y : ndarray 521 dependent (x) and independent (y) variables for fitting 522 alpha : float (default = 0.95) 523 Desired confidence level [0,1] for output. 524 method : {'separate' (default), 'joint'} 525 Method for estimating intercept. 526 - 'separate' uses np.median(y) - slope * np.median(x) 527 - 'joint' uses np.median( y - slope * x ) 528 529 Returns 530 ------- 531 fitresult : dict 532 Contains the following keys: 533 - slope (float) 534 Slope or Gradient of Y vs. X 535 - intercept (float) 536 Y intercept. 537 - slope_ste (float) 538 Standard error of slope estimate 539 - intercept_ste (float) 540 standard error of intercept estimate 541 - slope_interval ([float, float]) 542 confidence interval for gradient at confidence level alpha 543 - intercept_interval ([float, float]) 544 confidence interval for intercept at confidence level alpha 545 - alpha (float) 546 confidence level [0,1] for slope and intercept intervals 547 - df_model (float) 548 degrees of freedom for model 549 - df_resid (float) 550 degrees of freedom for residuals 551 - params ([float,float]) 552 array of fitted parameters 553 - fittedvalues (ndarray) 554 array of fitted values 555 - resid (ndarray) 556 array of residual values 557 ''' 558 559 slope, intercept, low_slope, high_slope = theilslopes(y,x,alpha,method) 560 561 dfmod = 2 562 N = np.sum( ~np.isnan(x) * ~np.isnan(y) ) 563 564 result = dict( slope = slope, 565 intercept = intercept, 566 slope_ste = None, 567 intercept_ste = None, 568 slope_interval = [low_slope,high_slope], 569 intercept_interval = [None,None], 570 alpha = alpha, 571 df_model = dfmod, 572 df_resid = N-dfmod, 573 params = np.array([slope,intercept]), 574 nobs = N, 575 fittedvalues = intercept + slope * x, 576 resid = intercept + slope * x - y ) 577 578 return result
'Theil-Sen slope estimate
This function wraps scipy.stats.theilslopes
and provides
results in the same dict format as the other line fitting methods
in this module
Parameters
- x, y (ndarray): dependent (x) and independent (y) variables for fitting
- alpha (float (default = 0.95)): Desired confidence level [0,1] for output.
- method ({'separate' (default), 'joint'}):
Method for estimating intercept.
- 'separate' uses np.median(y) - slope * np.median(x)
- 'joint' uses np.median( y - slope * x )
Returns
- fitresult (dict):
Contains the following keys:
- slope (float) Slope or Gradient of Y vs. X
- intercept (float) Y intercept.
- slope_ste (float) Standard error of slope estimate
- intercept_ste (float) standard error of intercept estimate
- slope_interval ([float, float]) confidence interval for gradient at confidence level alpha
- intercept_interval ([float, float]) confidence interval for intercept at confidence level alpha
- alpha (float) confidence level [0,1] for slope and intercept intervals
- df_model (float) degrees of freedom for model
- df_resid (float) degrees of freedom for residuals
- params ([float,float]) array of fitted parameters
- fittedvalues (ndarray) array of fitted values
- resid (ndarray) array of residual values
Alias for sen
581def sen_numba( x, y ): 582 '''Estimate linear trend using the Thiel-Sen method 583 584 This non-parametric method finds the median slope among all 585 combinations of time points. 586 scipy.stats.theilslopes provides the same slope estimate, with 587 confidence intervals. However, this function is faster for 588 large datasets due to Numba 589 590 Parameters 591 ---------- 592 x : array_like (N,) 593 independent variable 594 y : array_like (N,) 595 dependent variable 596 597 Returns 598 ------- 599 sen : float 600 the median slope 601 slopes : array (N*N,) 602 all slope estimates from all combinations of x and y 603 ''' 604 605 with warnings.catch_warnings(): 606 warnings.simplefilter('always', DeprecationWarning) 607 warnings.warn(f'Sen function is slow unless numba.jit is used. Use scipy.stats.theilslopes instead.', 608 DeprecationWarning, stacklevel=2) 609 610 if len( x ) != len( y ): 611 print('Inputs x and y must have same dimension') 612 return np.nan 613 614 # Find number of time points 615 n = len( x ) 616 617 # Array to hold all slope estimates 618 slopes = np.zeros( np.ceil( n * ( n-1 ) / 2 ).astype('int') ) 619 slopes[:] = np.nan 620 621 count = 0 622 623 for i in range(n): 624 for j in range(i+1, n): 625 626 # Slope between elements i and j 627 slopeij = ( y[j] - y[i] ) / ( x[j] - x[i] ) 628 629 slopes[count] = slopeij 630 631 count += 1 632 633 # Thiel-Sen estimate is the median slope, neglecting NaN 634 sen = np.nanmedian( slopes ) 635 636 return sen, slopes
Estimate linear trend using the Thiel-Sen method
This non-parametric method finds the median slope among all
combinations of time points.
scipy.stats.theilslopes provides the same slope estimate, with
confidence intervals. However, this function is faster for
large datasets due to Numba
Parameters
- x (array_like (N,)): independent variable
- y (array_like (N,)): dependent variable
Returns
- sen (float): the median slope
- slopes (array (N*N,)): all slope estimates from all combinations of x and y
373def york( x, y, err_x=1, err_y=1, rerr_xy=0 ): 374 '''York regression accounting for error in x and y 375 Follows the notation and algorithm of York et al. (2004) Section III 376 377 Parameters 378 ---------- 379 x, y : ndarray 380 dependent (x) and independent (y) variables for fitting 381 err_x, err_y : ndarray (default=1) 382 standard deviation of errors/uncertainty in x and y 383 rerr_xy : float (default=0) 384 correlation coefficient for errors in x and y, 385 default to rerr_xy=0 meaning that the errors in x are unrelated to errors in y 386 err_x, err_y, and rerr_xy can be constants or arrays of the same length as x and y 387 388 Returns 389 ------- 390 fitresult : dict 391 Contains the following keys: 392 - slope (float) 393 Slope or Gradient of Y vs. X 394 - intercept (float) 395 Y intercept. 396 - slope_ste (float) 397 Standard error of slope estimate 398 - intercept_ste (float) 399 standard error of intercept estimate 400 - slope_interval ([float, float]) 401 confidence interval for gradient at confidence level alpha 402 - intercept_interval ([float, float]) 403 confidence interval for intercept at confidence level alpha 404 - alpha (float) 405 confidence level [0,1] for slope and intercept intervals 406 - df_model (float) 407 degrees of freedom for model 408 - df_resid (float) 409 degrees of freedom for residuals 410 - params ([float,float]) 411 array of fitted parameters 412 - fittedvalues (ndarray) 413 array of fitted values 414 - resid (ndarray) 415 array of residual values 416 ''' 417 418 # relative error tolerance required for convergence 419 rtol = 1e-15 420 421 # Initial guess for slope, from ordinary least squares 422 result = stats.linregress( x, y ) 423 b = result[0] 424 425 # Weights for x and y 426 wx = 1 / err_x**2 427 wy = 1 / err_y**2 428 429 # Combined weights 430 alpha = np.sqrt( wx * wy ) 431 432 # Iterate until solution converges, but not more 50 times 433 maxiter=50 434 for i in range(1,maxiter): 435 436 # Weight for point i 437 W = wx * wy / ( wx + b**2 * wy - 2 * b * rerr_xy * alpha ) 438 Wsum = np.sum( W ) 439 440 # Weighted means 441 Xbar = np.sum( W * x ) / Wsum 442 Ybar = np.sum( W * y ) / Wsum 443 444 # Deviation from weighted means 445 U = x - Xbar 446 V = y - Ybar 447 448 # parameter needed for slope 449 beta = W * ( U / wy + b*V / wx - (b*U + V) * rerr_xy / alpha ) 450 451 # Update slope estimate 452 bnew = np.sum( W * beta * V ) / np.sum( W * beta * U ) 453 454 # Break from loop if new value is very close to old value 455 if np.abs( (bnew-b)/b ) < rtol: 456 break 457 else: 458 b = bnew 459 460 if i==maxiter: 461 raise ValueError( f'York regression failed to converge in {maxiter:d} iterations' ) 462 463 # Intercept 464 a = Ybar - b * Xbar 465 466 # least-squares adjusted points, expectation values of X and Y 467 xa = Xbar + beta 468 ya = Ybar + b*beta 469 470 # Mean of adjusted points 471 xabar = np.sum( W * xa ) / Wsum 472 yabar = np.sum( W * ya ) / Wsum 473 474 # Devaiation of adjusted points from their means 475 u = xa - xabar 476 v = ya - yabar 477 478 # Variance of slope and intercept estimates 479 varb = 1 / np.sum( W * u**2 ) 480 vara = 1 / Wsum + xabar**2 * varb 481 482 # Standard error of slope and intercept 483 siga = np.sqrt( vara ) 484 sigb = np.sqrt( varb ) 485 486 # Define a named tuple type that will contain the results 487 # result = namedtuple( 'result', 'slope intercept sigs sigi params sigma' ) 488 489 # Return results as a named tuple, User can access as a regular tuple too 490 # return result( b, a, sigb, siga, [b,a], [sigb, siga] ) 491 492 dfmod = 2 493 N = np.sum( ~np.isnan(x) * ~np.isnan(y) ) 494 495 result = dict( slope = b, 496 intercept = a, 497 slope_ste = sigb, 498 intercept_ste = siga, 499 slope_interval = [None,None], 500 intercept_interval = [None,None], 501 alpha = alpha, 502 df_model = dfmod, 503 df_resid = N-dfmod, 504 params = np.array([b,a]), 505 nobs = N, 506 fittedvalues = a + b * x, 507 resid = a + b * x - y ) 508 509 return result
York regression accounting for error in x and y Follows the notation and algorithm of York et al. (2004) Section III
Parameters
- x, y (ndarray): dependent (x) and independent (y) variables for fitting
- err_x, err_y (ndarray (default=1)): standard deviation of errors/uncertainty in x and y
- rerr_xy (float (default=0)): correlation coefficient for errors in x and y, default to rerr_xy=0 meaning that the errors in x are unrelated to errors in y err_x, err_y, and rerr_xy can be constants or arrays of the same length as x and y
Returns
- fitresult (dict):
Contains the following keys:
- slope (float) Slope or Gradient of Y vs. X
- intercept (float) Y intercept.
- slope_ste (float) Standard error of slope estimate
- intercept_ste (float) standard error of intercept estimate
- slope_interval ([float, float]) confidence interval for gradient at confidence level alpha
- intercept_interval ([float, float]) confidence interval for intercept at confidence level alpha
- alpha (float) confidence level [0,1] for slope and intercept intervals
- df_model (float) degrees of freedom for model
- df_resid (float) degrees of freedom for residuals
- params ([float,float]) array of fitted parameters
- fittedvalues (ndarray) array of fitted values
- resid (ndarray) array of residual values