acgc.stats.bivariate_lines

Specialized methods of bivariate line fitting

  • Standard major axis (SMA) also called reduced major axis (RMA)
  • York regression, for data with errors in x and y
  • Theil-Sen, non-parametric slope estimation (use numba to accelerate the function in this module)
  1# -*- coding: utf-8 -*-
  2"""Specialized methods of bivariate line fitting
  3
  4* Standard major axis (SMA) also called reduced major axis (RMA)
  5* York regression, for data with errors in x and y
  6* Theil-Sen, non-parametric slope estimation (use numba to accelerate the function in this module)
  7"""
  8# from collections import namedtuple
  9import warnings
 10import numpy as np
 11import scipy.stats as stats
 12from sklearn.covariance import MinCovDet
 13import statsmodels.formula.api as smf
 14import statsmodels.robust.norms as norms
 15from scipy.stats import theilslopes
 16#from numba import jit
 17
 18__all__ = [
 19    "bivariate_line_equation",
 20    "sma",
 21    "smafit",
 22    "sen",
 23    "sen_slope",
 24    "sen_numba",
 25    "york"
 26]
 27# Aliases
 28def sen_slope(*args,**kwargs):
 29    '''Alias for `sen`'''
 30    return sen(*args,**kwargs)
 31def smafit(*args,**kwargs):
 32    '''Alias for `sma`'''
 33    return sma(*args,**kwargs)
 34
 35def bivariate_line_equation(fitresult,
 36                    floatformat='{:.3f}',
 37                    ystring='include' ):
 38    '''Write equation for the fitted line
 39    
 40    Parameters
 41    ----------
 42    fitresult : dict
 43        results of the line fit
 44    floatformat : str
 45        format string for the numerical values (default='{:.3f}')
 46    ystring : {'include' (default), 'separate', 'none'}
 47        specifies whether "y =" should be included in result, a separate item in tuple, or none
 48    
 49    Returns
 50    -------
 51    fitline_string : str
 52        equation for the the fitted line, in the form "y = a x + b" or "y = a x"
 53    '''
 54
 55    # Left-hand side
 56    lhs = "y_"+fitresult['method']
 57
 58    # Right-hand side
 59    if fitresult['fitintercept']:
 60        rhs = f'{floatformat:s} x + {floatformat:s}'.\
 61                format( fitresult['slope'], fitresult['intercept'] )
 62    else:
 63        rhs = f'{floatformat:s} x'.\
 64                format( fitresult['slope'] )
 65
 66    # Combine right and left-hand sides
 67    if ystring=='include':
 68        equation = f'{lhs:s} = {rhs:s}'
 69    elif ystring=='separate':
 70        equation = (lhs,rhs)
 71    elif ystring=='none':
 72        equation = rhs
 73    else:
 74        raise ValueError('Unrecognized value of ystring: '+ystring)
 75
 76    return equation
 77
 78def sma(X,Y,W=None,
 79           data=None,
 80           alpha=0.95,
 81           intercept=True,
 82           robust=False,robust_method='FastMCD'):
 83    '''Standard Major-Axis (SMA) line fitting
 84    
 85    Calculate standard major axis, aka reduced major axis, fit to 
 86    data X and Y. The main advantage of this over ordinary least squares is 
 87    that the best fit of Y to X will be the same as the best fit of X to Y.
 88    
 89    The fit equations and confidence intervals are implemented following 
 90    Warton et al. (2006). Robust fits use the FastMCD covariance estimate 
 91    from Rousseeuw and Van Driessen (1999). While there are many alternative 
 92    robust covariance estimators (e.g. other papers by D.I. Warton using M-estimators), 
 93    the FastMCD algorithm is default in Matlab. When the standard error or 
 94    uncertainty of each point is known, then weighted SMA may be preferrable to 
 95    robust SMA. The conventional choice of weights for each point i is 
 96    W_i = 1 / ( var(X_i) + var(Y_i) ), where var() is the variance 
 97    (squared standard error).
 98    
 99    References 
100    Warton, D. I., Wright, I. J., Falster, D. S. and Westoby, M.: 
101        Bivariate line-fitting methods for allometry, Biol. Rev., 81(02), 259, 
102        doi:10.1017/S1464793106007007, 2006.
103    Rousseeuw, P. J. and Van Driessen, K.: A Fast Algorithm for the Minimum 
104        Covariance Determinant Estimator, Technometrics, 41(3), 1999.
105
106    Parameters
107    ----------
108    X, Y : array_like or str
109        Input values, Must have same length.
110    W    : array_like or str, optional
111        array of weights for each X-Y point, typically W_i = 1/(var(X_i)+var(Y_i)) 
112    data : dict_like, optional
113        data structure containing variables. Used when X, Y, or W are str.
114    alpha : float (default = 0.95)
115        Desired confidence level [0,1] for output. 
116    intercept : bool, default=True
117        Specify if the fitted model should include a non-zero intercept.
118        The model will be forced through the origin (0,0) if intercept=False.
119    robust : bool, default=False
120        Use statistical methods that are robust to the presence of outliers
121    robust_method: {'FastMCD' (default), 'Huber', 'Biweight'}
122        Method for calculating robust variance and covariance. Options:
123        - 'MCD' or 'FastMCD' for Fast MCD
124        - 'Huber' for Huber's T: reduce, not eliminate, influence of outliers
125        - 'Biweight' for Tukey's Biweight: reduces then eliminates influence of outliers
126
127        
128    Returns
129    -------
130    fitresult : dict 
131        Contains the following keys:
132        - slope (float)
133            Slope or Gradient of Y vs. X
134        - intercept (float)
135            Y intercept.
136        - slope_ste (float)
137            Standard error of slope estimate
138        - intercept_ste (float)
139            standard error of intercept estimate
140        - slope_interval ([float, float])
141            confidence interval for gradient at confidence level alpha
142        - intercept_interval ([float, float])
143            confidence interval for intercept at confidence level alpha
144        - alpha (float)
145            confidence level [0,1] for slope and intercept intervals
146        - df_model (float)
147            degrees of freedom for model
148        - df_resid (float)
149            degrees of freedom for residuals
150        - params ([float,float])
151            array of fitted parameters
152        - fittedvalues (ndarray)
153            array of fitted values
154        - resid (ndarray)
155            array of residual values
156    '''
157
158    def str2var( v, data ):
159        '''Extract variable named v from Dataframe named data'''
160        try:
161            return data[v]
162        except Exception as exc:
163            raise ValueError( 'Argument data must be provided with a key named '+v ) from exc
164
165    # If variables are provided as strings, get values from the data structure
166    if isinstance( X, str ):
167        X = str2var( X, data )
168    if isinstance( Y, str ):
169        Y = str2var( Y, data )
170    if isinstance( W, str ):
171        W = str2var( W, data )
172
173    # Make sure arrays have the same length
174    assert ( len(X) == len(Y) ), 'Arrays X and Y must have the same length'
175    if W is None:
176        W = np.zeros_like(X) + 1
177    else:
178        assert ( len(W) == len(X) ), 'Array W must have the same length as X and Y'
179
180    # Make sure alpha is within the range 0-1
181    assert (alpha < 1), 'alpha must be less than 1'
182    assert (alpha > 0), 'alpha must be greater than 0'
183
184    # Drop any NaN elements of X, Y, or W
185    # Infinite values are allowed but will make the result undefined
186    # idx = ~np.logical_or( np.isnan(X0), np.isnan(Y0) )
187    idx = ~np.isnan(X) * ~np.isnan(Y) * ~np.isnan(W)
188
189    X0 = X[idx]
190    Y0 = Y[idx]
191    W0 = W[idx]
192
193    # Number of observations
194    N = len(X0)
195
196    # Degrees of freedom for the model
197    if intercept:
198        dfmod = 2
199    else:
200        dfmod = 1
201
202
203    # Choose whether to use methods robust to outliers
204    if robust:
205
206        # Choose the robust method
207        if ((robust_method.lower() =='mcd') or (robust_method.lower() == 'fastmcd') ):
208            # FAST MCD
209
210            if not intercept:
211                # intercept=False could possibly be supported by calculating
212                # using mcd.support_ as weights in an explicit variance/covariance calculation
213                raise NotImplementedError('FastMCD method only supports SMA with intercept')
214
215            # Fit robust model of mean and covariance
216            mcd = MinCovDet().fit( np.array([X0,Y0]).T )
217
218            # Robust mean
219            Xmean = mcd.location_[0]
220            Ymean = mcd.location_[1]
221
222            # Robust variance of X, Y
223            Vx    = mcd.covariance_[0,0]
224            Vy    = mcd.covariance_[1,1]
225
226            # Robust covariance
227            Vxy   = mcd.covariance_[0,1]
228
229            # Number of observations used in mean and covariance estimate
230            # excludes observations marked as outliers
231            N = mcd.support_.sum()
232
233        elif ((robust_method.lower() =='biweight') or (robust_method.lower() == 'huber') ):
234
235            # Tukey's Biweight and Huber's T
236            if robust_method.lower()=='biweight':
237                norm = norms.TukeyBiweight()
238            else:
239                norm = norms.HuberT()
240
241            # Get weights for downweighting outliers
242            # Fitting a linear model the easiest way to get these
243            # Options include "TukeyBiweight" (totally removes large deviates)
244            # "HuberT" (linear, not squared weighting of large deviates)
245            rweights = smf.rlm('y~x+1',{'x':X0,'y':Y0},M=norm).fit().weights
246
247            # Sum of weight and weights squared, for convienience
248            rsum  = np.sum( rweights )
249            rsum2 = np.sum( rweights**2 )
250
251            # Mean
252            Xmean = np.sum( X0 * rweights ) / rsum
253            Ymean = np.sum( Y0 * rweights ) / rsum
254
255            # Force intercept through zero, if requested
256            if not intercept:
257                Xmean = 0
258                Ymean = 0
259
260            # Variance & Covariance
261            Vx    = np.sum( (X0-Xmean)**2 * rweights**2 ) / rsum2
262            Vy    = np.sum( (Y0-Ymean)**2 * rweights**2 ) / rsum2
263            Vxy   = np.sum( (X0-Xmean) * (Y0-Ymean) * rweights**2 ) / rsum2
264
265            # Effective number of observations
266            N = rsum
267
268        else:
269
270            raise NotImplementedError("sma hasn't implemented robust_method={:%s}".\
271                                      format(robust_method))
272    else:
273
274        if intercept:
275
276            wsum = np.sum(W)
277
278            # Average values
279            Xmean = np.sum(X0 * W0) / wsum
280            Ymean = np.sum(Y0 * W0) / wsum
281
282            # Covariance matrix
283            cov = np.cov( X0, Y0, ddof=1, aweights=W0**2 )
284
285            # Variance
286            Vx = cov[0,0]
287            Vy = cov[1,1]
288
289            # Covariance
290            Vxy = cov[0,1]
291
292        else:
293
294            # Force the line to pass through origin by setting means to zero
295            Xmean = 0
296            Ymean = 0
297
298            wsum = np.sum(W0)
299
300            # Sum of squares in place of variance and covariance
301            Vx = np.sum( X0**2 * W0 ) / wsum
302            Vy = np.sum( Y0**2 * W0 ) / wsum
303            Vxy= np.sum( X0*Y0 * W0 ) / wsum
304
305    # Standard deviation
306    Sx = np.sqrt( Vx )
307    Sy = np.sqrt( Vy )
308
309    # Correlation coefficient (equivalent to np.corrcoef()[1,0] for non-robust cases)
310    R = Vxy / np.sqrt( Vx * Vy )
311
312    #############
313    # SLOPE
314
315    Slope  = np.sign(R) * Sy / Sx
316
317    # Standard error of slope estimate
318    ste_slope = np.sqrt( 1/(N-dfmod) * Sy**2 / Sx**2 * (1-R**2) )
319
320    # Confidence interval for Slope
321    B = (1-R**2)/(N-dfmod) * stats.f.isf(1-alpha, 1, N-dfmod)
322    ci_grad = Slope * ( np.sqrt( B+1 ) + np.sqrt(B)*np.array([-1,+1]) )
323
324    #############
325    # INTERCEPT
326
327    if intercept:
328        Intercept = Ymean - Slope * Xmean
329
330        # Standard deviation of residuals
331        # New Method: Formula from smatr R package (Warton)
332        # This formula avoids large residuals of outliers when using robust=True
333        Sr = np.sqrt((Vy - 2 * Slope * Vxy + Slope**2 *  Vx ) * (N-1) / (N-dfmod) )
334
335        # OLD METHOD
336        # Standard deviation of residuals
337        #resid = Y0 - (Intercept + Slope * X0 )    
338        # Population standard deviation of the residuals
339        #Sr = np.std( resid, ddof=0 )      
340
341        # Standard error of the intercept estimate
342        ste_int = np.sqrt( Sr**2/N + Xmean**2 * ste_slope**2  )
343
344        # Confidence interval for Intercept
345        tcrit = stats.t.isf((1-alpha)/2,N-dfmod)
346        ci_int = Intercept + ste_int * np.array([-tcrit,tcrit])
347
348    else:
349
350        # Set Intercept quantities to zero
351        Intercept = 0
352        ste_int   = 0
353        ci_int    = np.array([0,0])
354
355    result = dict( slope            = Slope,
356                   intercept        = Intercept,
357                   slope_ste        = ste_slope,
358                   intercept_ste    = ste_int,
359                   slope_interval   = ci_grad,
360                   intercept_interval = ci_int,
361                   alpha            = alpha,
362                   df_model         = dfmod,
363                   df_resid         = N-dfmod,
364                   params           = np.array([Slope,Intercept]),
365                   nobs             = N,
366                   fittedvalues     = Intercept + Slope * X0,
367                   resid            = Intercept + Slope * X0 - Y0 )
368
369    # return Slope, Intercept, ste_slope, ste_int, ci_grad, ci_int
370    return result
371
372def york( x, y, err_x=1, err_y=1, rerr_xy=0 ):
373    '''York regression accounting for error in x and y
374    Follows the notation and algorithm of York et al. (2004) Section III
375    
376    Parameters
377    ----------
378    x, y : ndarray
379        dependent (x) and independent (y) variables for fitting
380    err_x, err_y : ndarray (default=1)
381        standard deviation of errors/uncertainty in x and y
382    rerr_xy : float (default=0)
383        correlation coefficient for errors in x and y, 
384        default to rerr_xy=0 meaning that the errors in x are unrelated to errors in y
385        err_x, err_y, and rerr_xy can be constants or arrays of the same length as x and y
386    
387    Returns
388    -------
389    fitresult : dict 
390        Contains the following keys:
391        - slope (float)
392            Slope or Gradient of Y vs. X
393        - intercept (float)
394            Y intercept.
395        - slope_ste (float)
396            Standard error of slope estimate
397        - intercept_ste (float)
398            standard error of intercept estimate
399        - slope_interval ([float, float])
400            confidence interval for gradient at confidence level alpha
401        - intercept_interval ([float, float])
402            confidence interval for intercept at confidence level alpha
403        - alpha (float)
404            confidence level [0,1] for slope and intercept intervals
405        - df_model (float)
406            degrees of freedom for model
407        - df_resid (float)
408            degrees of freedom for residuals
409        - params ([float,float])
410            array of fitted parameters
411        - fittedvalues (ndarray)
412            array of fitted values
413        - resid (ndarray)
414            array of residual values
415    '''
416
417    # relative error tolerance required for convergence
418    rtol = 1e-15
419
420    # Initial guess for slope, from ordinary least squares
421    result = stats.linregress( x, y )
422    b = result[0]
423
424    # Weights for x and y
425    wx = 1 / err_x**2
426    wy = 1 / err_y**2
427
428    # Combined weights
429    alpha = np.sqrt( wx * wy )
430
431    # Iterate until solution converges, but not more 50 times
432    maxiter=50
433    for i in range(1,maxiter):
434
435        # Weight for point i
436        W = wx * wy / ( wx + b**2 * wy - 2 * b * rerr_xy * alpha )
437        Wsum = np.sum( W )
438
439        # Weighted means
440        Xbar = np.sum( W * x ) / Wsum
441        Ybar = np.sum( W * y ) / Wsum
442
443        # Deviation from weighted means
444        U = x - Xbar
445        V = y - Ybar
446
447        # parameter needed for slope
448        beta = W * ( U / wy + b*V / wx - (b*U + V) * rerr_xy / alpha )
449
450        # Update slope estimate
451        bnew = np.sum( W * beta * V ) / np.sum( W * beta * U )
452
453        # Break from loop if new value is very close to old value
454        if np.abs( (bnew-b)/b ) < rtol:
455            break
456        else:
457            b = bnew
458
459    if i==maxiter:
460        raise ValueError( f'York regression failed to converge in {maxiter:d} iterations' )
461
462    # Intercept
463    a = Ybar - b * Xbar
464
465    # least-squares adjusted points, expectation values of X and Y
466    xa = Xbar + beta
467    ya = Ybar + b*beta
468
469    # Mean of adjusted points
470    xabar = np.sum( W * xa ) / Wsum
471    yabar = np.sum( W * ya ) / Wsum
472
473    # Devaiation of adjusted points from their means
474    u = xa - xabar
475    v = ya - yabar
476
477    # Variance of slope and intercept estimates
478    varb = 1 / np.sum( W * u**2 )
479    vara = 1 / Wsum + xabar**2 * varb
480
481    # Standard error of slope and intercept
482    siga = np.sqrt( vara )
483    sigb = np.sqrt( varb )
484
485    # Define a named tuple type that will contain the results
486    # result = namedtuple( 'result', 'slope intercept sigs sigi params sigma' )
487
488    # Return results as a named tuple, User can access as a regular tuple too
489    # return result( b, a, sigb, siga, [b,a], [sigb, siga] )
490
491    dfmod = 2
492    N = np.sum( ~np.isnan(x) * ~np.isnan(y) )
493
494    result = dict( slope         = b,
495                intercept        = a,
496                slope_ste        = sigb,
497                intercept_ste    = siga,
498                slope_interval   = [None,None],
499                intercept_interval = [None,None],
500                alpha            = alpha,
501                df_model         = dfmod,
502                df_resid         = N-dfmod,
503                params           = np.array([b,a]),
504                nobs             = N,
505                fittedvalues     = a + b * x,
506                resid            = a + b * x - y )
507
508    return result
509
510def sen( x, y, alpha=0.95, method='separate' ):
511    ''''Theil-Sen slope estimate
512    
513    This function wraps `scipy.stats.theilslopes` and provides
514    results in the same dict format as the other line fitting methods 
515    in this module
516    
517    Parameters
518    ----------
519    x, y : ndarray
520        dependent (x) and independent (y) variables for fitting
521    alpha : float (default = 0.95)
522        Desired confidence level [0,1] for output. 
523    method : {'separate' (default), 'joint'}
524        Method for estimating intercept. 
525        - 'separate' uses np.median(y) - slope * np.median(x)
526        - 'joint' uses np.median( y - slope * x )
527            
528    Returns
529    -------
530    fitresult : dict 
531        Contains the following keys:
532        - slope (float)
533            Slope or Gradient of Y vs. X
534        - intercept (float)
535            Y intercept.
536        - slope_ste (float)
537            Standard error of slope estimate
538        - intercept_ste (float)
539            standard error of intercept estimate
540        - slope_interval ([float, float])
541            confidence interval for gradient at confidence level alpha
542        - intercept_interval ([float, float])
543            confidence interval for intercept at confidence level alpha
544        - alpha (float)
545            confidence level [0,1] for slope and intercept intervals
546        - df_model (float)
547            degrees of freedom for model
548        - df_resid (float)
549            degrees of freedom for residuals
550        - params ([float,float])
551            array of fitted parameters
552        - fittedvalues (ndarray)
553            array of fitted values
554        - resid (ndarray)
555            array of residual values
556    '''
557
558    slope, intercept, low_slope, high_slope = theilslopes(y,x,alpha,method)
559
560    dfmod = 2
561    N = np.sum( ~np.isnan(x) * ~np.isnan(y) )
562
563    result = dict( slope         = slope,
564                intercept        = intercept,
565                slope_ste        = None,
566                intercept_ste    = None,
567                slope_interval   = [low_slope,high_slope],
568                intercept_interval = [None,None],
569                alpha            = alpha,
570                df_model         = dfmod,
571                df_resid         = N-dfmod,
572                params           = np.array([slope,intercept]),
573                nobs             = N,
574                fittedvalues     = intercept + slope * x,
575                resid            = intercept + slope * x - y )
576
577    return result
578
579#@jit(nopython=True)
580def sen_numba( x, y ):
581    '''Estimate linear trend using the Thiel-Sen method
582    
583    This non-parametric method finds the median slope among all
584    combinations of time points. 
585    scipy.stats.theilslopes provides the same slope estimate, with  
586    confidence intervals. However, this function is faster for 
587    large datasets due to Numba 
588    
589    Parameters
590    ----------
591    x : array_like (N,)
592        independent variable
593    y : array_like (N,)
594        dependent variable
595    
596    Returns
597    -------
598    sen : float
599        the median slope
600    slopes : array (N*N,)
601        all slope estimates from all combinations of x and y
602    '''
603
604    with warnings.catch_warnings():
605        warnings.simplefilter('always', DeprecationWarning)
606        warnings.warn(f'Sen function is slow unless numba.jit is used. Use scipy.stats.theilslopes instead.',
607                    DeprecationWarning, stacklevel=2)
608        
609    if len( x ) != len( y ):
610        print('Inputs x and y must have same dimension')
611        return np.nan
612
613    # Find number of time points
614    n = len( x )
615
616    # Array to hold all slope estimates
617    slopes = np.zeros(  np.ceil( n * ( n-1 ) / 2 ).astype('int') )
618    slopes[:] = np.nan
619
620    count = 0
621
622    for i in range(n):
623        for j in range(i+1, n):
624
625            # Slope between elements i and j
626            slopeij = ( y[j] - y[i] ) / ( x[j] - x[i] )
627
628            slopes[count] = slopeij
629
630            count += 1
631
632    # Thiel-Sen estimate is the median slope, neglecting NaN
633    sen = np.nanmedian( slopes )
634
635    return sen, slopes
def bivariate_line_equation(fitresult, floatformat='{:.3f}', ystring='include'):
36def bivariate_line_equation(fitresult,
37                    floatformat='{:.3f}',
38                    ystring='include' ):
39    '''Write equation for the fitted line
40    
41    Parameters
42    ----------
43    fitresult : dict
44        results of the line fit
45    floatformat : str
46        format string for the numerical values (default='{:.3f}')
47    ystring : {'include' (default), 'separate', 'none'}
48        specifies whether "y =" should be included in result, a separate item in tuple, or none
49    
50    Returns
51    -------
52    fitline_string : str
53        equation for the the fitted line, in the form "y = a x + b" or "y = a x"
54    '''
55
56    # Left-hand side
57    lhs = "y_"+fitresult['method']
58
59    # Right-hand side
60    if fitresult['fitintercept']:
61        rhs = f'{floatformat:s} x + {floatformat:s}'.\
62                format( fitresult['slope'], fitresult['intercept'] )
63    else:
64        rhs = f'{floatformat:s} x'.\
65                format( fitresult['slope'] )
66
67    # Combine right and left-hand sides
68    if ystring=='include':
69        equation = f'{lhs:s} = {rhs:s}'
70    elif ystring=='separate':
71        equation = (lhs,rhs)
72    elif ystring=='none':
73        equation = rhs
74    else:
75        raise ValueError('Unrecognized value of ystring: '+ystring)
76
77    return equation

Write equation for the fitted line

Parameters
  • fitresult (dict): results of the line fit
  • floatformat (str): format string for the numerical values (default='{:.3f}')
  • ystring ({'include' (default), 'separate', 'none'}): specifies whether "y =" should be included in result, a separate item in tuple, or none
Returns
  • fitline_string (str): equation for the the fitted line, in the form "y = a x + b" or "y = a x"
def sma( X, Y, W=None, data=None, alpha=0.95, intercept=True, robust=False, robust_method='FastMCD'):
 79def sma(X,Y,W=None,
 80           data=None,
 81           alpha=0.95,
 82           intercept=True,
 83           robust=False,robust_method='FastMCD'):
 84    '''Standard Major-Axis (SMA) line fitting
 85    
 86    Calculate standard major axis, aka reduced major axis, fit to 
 87    data X and Y. The main advantage of this over ordinary least squares is 
 88    that the best fit of Y to X will be the same as the best fit of X to Y.
 89    
 90    The fit equations and confidence intervals are implemented following 
 91    Warton et al. (2006). Robust fits use the FastMCD covariance estimate 
 92    from Rousseeuw and Van Driessen (1999). While there are many alternative 
 93    robust covariance estimators (e.g. other papers by D.I. Warton using M-estimators), 
 94    the FastMCD algorithm is default in Matlab. When the standard error or 
 95    uncertainty of each point is known, then weighted SMA may be preferrable to 
 96    robust SMA. The conventional choice of weights for each point i is 
 97    W_i = 1 / ( var(X_i) + var(Y_i) ), where var() is the variance 
 98    (squared standard error).
 99    
100    References 
101    Warton, D. I., Wright, I. J., Falster, D. S. and Westoby, M.: 
102        Bivariate line-fitting methods for allometry, Biol. Rev., 81(02), 259, 
103        doi:10.1017/S1464793106007007, 2006.
104    Rousseeuw, P. J. and Van Driessen, K.: A Fast Algorithm for the Minimum 
105        Covariance Determinant Estimator, Technometrics, 41(3), 1999.
106
107    Parameters
108    ----------
109    X, Y : array_like or str
110        Input values, Must have same length.
111    W    : array_like or str, optional
112        array of weights for each X-Y point, typically W_i = 1/(var(X_i)+var(Y_i)) 
113    data : dict_like, optional
114        data structure containing variables. Used when X, Y, or W are str.
115    alpha : float (default = 0.95)
116        Desired confidence level [0,1] for output. 
117    intercept : bool, default=True
118        Specify if the fitted model should include a non-zero intercept.
119        The model will be forced through the origin (0,0) if intercept=False.
120    robust : bool, default=False
121        Use statistical methods that are robust to the presence of outliers
122    robust_method: {'FastMCD' (default), 'Huber', 'Biweight'}
123        Method for calculating robust variance and covariance. Options:
124        - 'MCD' or 'FastMCD' for Fast MCD
125        - 'Huber' for Huber's T: reduce, not eliminate, influence of outliers
126        - 'Biweight' for Tukey's Biweight: reduces then eliminates influence of outliers
127
128        
129    Returns
130    -------
131    fitresult : dict 
132        Contains the following keys:
133        - slope (float)
134            Slope or Gradient of Y vs. X
135        - intercept (float)
136            Y intercept.
137        - slope_ste (float)
138            Standard error of slope estimate
139        - intercept_ste (float)
140            standard error of intercept estimate
141        - slope_interval ([float, float])
142            confidence interval for gradient at confidence level alpha
143        - intercept_interval ([float, float])
144            confidence interval for intercept at confidence level alpha
145        - alpha (float)
146            confidence level [0,1] for slope and intercept intervals
147        - df_model (float)
148            degrees of freedom for model
149        - df_resid (float)
150            degrees of freedom for residuals
151        - params ([float,float])
152            array of fitted parameters
153        - fittedvalues (ndarray)
154            array of fitted values
155        - resid (ndarray)
156            array of residual values
157    '''
158
159    def str2var( v, data ):
160        '''Extract variable named v from Dataframe named data'''
161        try:
162            return data[v]
163        except Exception as exc:
164            raise ValueError( 'Argument data must be provided with a key named '+v ) from exc
165
166    # If variables are provided as strings, get values from the data structure
167    if isinstance( X, str ):
168        X = str2var( X, data )
169    if isinstance( Y, str ):
170        Y = str2var( Y, data )
171    if isinstance( W, str ):
172        W = str2var( W, data )
173
174    # Make sure arrays have the same length
175    assert ( len(X) == len(Y) ), 'Arrays X and Y must have the same length'
176    if W is None:
177        W = np.zeros_like(X) + 1
178    else:
179        assert ( len(W) == len(X) ), 'Array W must have the same length as X and Y'
180
181    # Make sure alpha is within the range 0-1
182    assert (alpha < 1), 'alpha must be less than 1'
183    assert (alpha > 0), 'alpha must be greater than 0'
184
185    # Drop any NaN elements of X, Y, or W
186    # Infinite values are allowed but will make the result undefined
187    # idx = ~np.logical_or( np.isnan(X0), np.isnan(Y0) )
188    idx = ~np.isnan(X) * ~np.isnan(Y) * ~np.isnan(W)
189
190    X0 = X[idx]
191    Y0 = Y[idx]
192    W0 = W[idx]
193
194    # Number of observations
195    N = len(X0)
196
197    # Degrees of freedom for the model
198    if intercept:
199        dfmod = 2
200    else:
201        dfmod = 1
202
203
204    # Choose whether to use methods robust to outliers
205    if robust:
206
207        # Choose the robust method
208        if ((robust_method.lower() =='mcd') or (robust_method.lower() == 'fastmcd') ):
209            # FAST MCD
210
211            if not intercept:
212                # intercept=False could possibly be supported by calculating
213                # using mcd.support_ as weights in an explicit variance/covariance calculation
214                raise NotImplementedError('FastMCD method only supports SMA with intercept')
215
216            # Fit robust model of mean and covariance
217            mcd = MinCovDet().fit( np.array([X0,Y0]).T )
218
219            # Robust mean
220            Xmean = mcd.location_[0]
221            Ymean = mcd.location_[1]
222
223            # Robust variance of X, Y
224            Vx    = mcd.covariance_[0,0]
225            Vy    = mcd.covariance_[1,1]
226
227            # Robust covariance
228            Vxy   = mcd.covariance_[0,1]
229
230            # Number of observations used in mean and covariance estimate
231            # excludes observations marked as outliers
232            N = mcd.support_.sum()
233
234        elif ((robust_method.lower() =='biweight') or (robust_method.lower() == 'huber') ):
235
236            # Tukey's Biweight and Huber's T
237            if robust_method.lower()=='biweight':
238                norm = norms.TukeyBiweight()
239            else:
240                norm = norms.HuberT()
241
242            # Get weights for downweighting outliers
243            # Fitting a linear model the easiest way to get these
244            # Options include "TukeyBiweight" (totally removes large deviates)
245            # "HuberT" (linear, not squared weighting of large deviates)
246            rweights = smf.rlm('y~x+1',{'x':X0,'y':Y0},M=norm).fit().weights
247
248            # Sum of weight and weights squared, for convienience
249            rsum  = np.sum( rweights )
250            rsum2 = np.sum( rweights**2 )
251
252            # Mean
253            Xmean = np.sum( X0 * rweights ) / rsum
254            Ymean = np.sum( Y0 * rweights ) / rsum
255
256            # Force intercept through zero, if requested
257            if not intercept:
258                Xmean = 0
259                Ymean = 0
260
261            # Variance & Covariance
262            Vx    = np.sum( (X0-Xmean)**2 * rweights**2 ) / rsum2
263            Vy    = np.sum( (Y0-Ymean)**2 * rweights**2 ) / rsum2
264            Vxy   = np.sum( (X0-Xmean) * (Y0-Ymean) * rweights**2 ) / rsum2
265
266            # Effective number of observations
267            N = rsum
268
269        else:
270
271            raise NotImplementedError("sma hasn't implemented robust_method={:%s}".\
272                                      format(robust_method))
273    else:
274
275        if intercept:
276
277            wsum = np.sum(W)
278
279            # Average values
280            Xmean = np.sum(X0 * W0) / wsum
281            Ymean = np.sum(Y0 * W0) / wsum
282
283            # Covariance matrix
284            cov = np.cov( X0, Y0, ddof=1, aweights=W0**2 )
285
286            # Variance
287            Vx = cov[0,0]
288            Vy = cov[1,1]
289
290            # Covariance
291            Vxy = cov[0,1]
292
293        else:
294
295            # Force the line to pass through origin by setting means to zero
296            Xmean = 0
297            Ymean = 0
298
299            wsum = np.sum(W0)
300
301            # Sum of squares in place of variance and covariance
302            Vx = np.sum( X0**2 * W0 ) / wsum
303            Vy = np.sum( Y0**2 * W0 ) / wsum
304            Vxy= np.sum( X0*Y0 * W0 ) / wsum
305
306    # Standard deviation
307    Sx = np.sqrt( Vx )
308    Sy = np.sqrt( Vy )
309
310    # Correlation coefficient (equivalent to np.corrcoef()[1,0] for non-robust cases)
311    R = Vxy / np.sqrt( Vx * Vy )
312
313    #############
314    # SLOPE
315
316    Slope  = np.sign(R) * Sy / Sx
317
318    # Standard error of slope estimate
319    ste_slope = np.sqrt( 1/(N-dfmod) * Sy**2 / Sx**2 * (1-R**2) )
320
321    # Confidence interval for Slope
322    B = (1-R**2)/(N-dfmod) * stats.f.isf(1-alpha, 1, N-dfmod)
323    ci_grad = Slope * ( np.sqrt( B+1 ) + np.sqrt(B)*np.array([-1,+1]) )
324
325    #############
326    # INTERCEPT
327
328    if intercept:
329        Intercept = Ymean - Slope * Xmean
330
331        # Standard deviation of residuals
332        # New Method: Formula from smatr R package (Warton)
333        # This formula avoids large residuals of outliers when using robust=True
334        Sr = np.sqrt((Vy - 2 * Slope * Vxy + Slope**2 *  Vx ) * (N-1) / (N-dfmod) )
335
336        # OLD METHOD
337        # Standard deviation of residuals
338        #resid = Y0 - (Intercept + Slope * X0 )    
339        # Population standard deviation of the residuals
340        #Sr = np.std( resid, ddof=0 )      
341
342        # Standard error of the intercept estimate
343        ste_int = np.sqrt( Sr**2/N + Xmean**2 * ste_slope**2  )
344
345        # Confidence interval for Intercept
346        tcrit = stats.t.isf((1-alpha)/2,N-dfmod)
347        ci_int = Intercept + ste_int * np.array([-tcrit,tcrit])
348
349    else:
350
351        # Set Intercept quantities to zero
352        Intercept = 0
353        ste_int   = 0
354        ci_int    = np.array([0,0])
355
356    result = dict( slope            = Slope,
357                   intercept        = Intercept,
358                   slope_ste        = ste_slope,
359                   intercept_ste    = ste_int,
360                   slope_interval   = ci_grad,
361                   intercept_interval = ci_int,
362                   alpha            = alpha,
363                   df_model         = dfmod,
364                   df_resid         = N-dfmod,
365                   params           = np.array([Slope,Intercept]),
366                   nobs             = N,
367                   fittedvalues     = Intercept + Slope * X0,
368                   resid            = Intercept + Slope * X0 - Y0 )
369
370    # return Slope, Intercept, ste_slope, ste_int, ci_grad, ci_int
371    return result

Standard Major-Axis (SMA) line fitting

Calculate standard major axis, aka reduced major axis, fit to data X and Y. The main advantage of this over ordinary least squares is that the best fit of Y to X will be the same as the best fit of X to Y.

The fit equations and confidence intervals are implemented following Warton et al. (2006). Robust fits use the FastMCD covariance estimate from Rousseeuw and Van Driessen (1999). While there are many alternative robust covariance estimators (e.g. other papers by D.I. Warton using M-estimators), the FastMCD algorithm is default in Matlab. When the standard error or uncertainty of each point is known, then weighted SMA may be preferrable to robust SMA. The conventional choice of weights for each point i is W_i = 1 / ( var(X_i) + var(Y_i) ), where var() is the variance (squared standard error).

References Warton, D. I., Wright, I. J., Falster, D. S. and Westoby, M.: Bivariate line-fitting methods for allometry, Biol. Rev., 81(02), 259, doi:10.1017/S1464793106007007, 2006. Rousseeuw, P. J. and Van Driessen, K.: A Fast Algorithm for the Minimum Covariance Determinant Estimator, Technometrics, 41(3), 1999.

Parameters
  • X, Y (array_like or str): Input values, Must have same length.
  • W (array_like or str, optional): array of weights for each X-Y point, typically W_i = 1/(var(X_i)+var(Y_i))
  • data (dict_like, optional): data structure containing variables. Used when X, Y, or W are str.
  • alpha (float (default = 0.95)): Desired confidence level [0,1] for output.
  • intercept (bool, default=True): Specify if the fitted model should include a non-zero intercept. The model will be forced through the origin (0,0) if intercept=False.
  • robust (bool, default=False): Use statistical methods that are robust to the presence of outliers
  • robust_method ({'FastMCD' (default), 'Huber', 'Biweight'}): Method for calculating robust variance and covariance. Options:
    • 'MCD' or 'FastMCD' for Fast MCD
    • 'Huber' for Huber's T: reduce, not eliminate, influence of outliers
    • 'Biweight' for Tukey's Biweight: reduces then eliminates influence of outliers
Returns
  • fitresult (dict): Contains the following keys:
    • slope (float) Slope or Gradient of Y vs. X
    • intercept (float) Y intercept.
    • slope_ste (float) Standard error of slope estimate
    • intercept_ste (float) standard error of intercept estimate
    • slope_interval ([float, float]) confidence interval for gradient at confidence level alpha
    • intercept_interval ([float, float]) confidence interval for intercept at confidence level alpha
    • alpha (float) confidence level [0,1] for slope and intercept intervals
    • df_model (float) degrees of freedom for model
    • df_resid (float) degrees of freedom for residuals
    • params ([float,float]) array of fitted parameters
    • fittedvalues (ndarray) array of fitted values
    • resid (ndarray) array of residual values
def smafit(*args, **kwargs):
32def smafit(*args,**kwargs):
33    '''Alias for `sma`'''
34    return sma(*args,**kwargs)

Alias for sma

def sen(x, y, alpha=0.95, method='separate'):
511def sen( x, y, alpha=0.95, method='separate' ):
512    ''''Theil-Sen slope estimate
513    
514    This function wraps `scipy.stats.theilslopes` and provides
515    results in the same dict format as the other line fitting methods 
516    in this module
517    
518    Parameters
519    ----------
520    x, y : ndarray
521        dependent (x) and independent (y) variables for fitting
522    alpha : float (default = 0.95)
523        Desired confidence level [0,1] for output. 
524    method : {'separate' (default), 'joint'}
525        Method for estimating intercept. 
526        - 'separate' uses np.median(y) - slope * np.median(x)
527        - 'joint' uses np.median( y - slope * x )
528            
529    Returns
530    -------
531    fitresult : dict 
532        Contains the following keys:
533        - slope (float)
534            Slope or Gradient of Y vs. X
535        - intercept (float)
536            Y intercept.
537        - slope_ste (float)
538            Standard error of slope estimate
539        - intercept_ste (float)
540            standard error of intercept estimate
541        - slope_interval ([float, float])
542            confidence interval for gradient at confidence level alpha
543        - intercept_interval ([float, float])
544            confidence interval for intercept at confidence level alpha
545        - alpha (float)
546            confidence level [0,1] for slope and intercept intervals
547        - df_model (float)
548            degrees of freedom for model
549        - df_resid (float)
550            degrees of freedom for residuals
551        - params ([float,float])
552            array of fitted parameters
553        - fittedvalues (ndarray)
554            array of fitted values
555        - resid (ndarray)
556            array of residual values
557    '''
558
559    slope, intercept, low_slope, high_slope = theilslopes(y,x,alpha,method)
560
561    dfmod = 2
562    N = np.sum( ~np.isnan(x) * ~np.isnan(y) )
563
564    result = dict( slope         = slope,
565                intercept        = intercept,
566                slope_ste        = None,
567                intercept_ste    = None,
568                slope_interval   = [low_slope,high_slope],
569                intercept_interval = [None,None],
570                alpha            = alpha,
571                df_model         = dfmod,
572                df_resid         = N-dfmod,
573                params           = np.array([slope,intercept]),
574                nobs             = N,
575                fittedvalues     = intercept + slope * x,
576                resid            = intercept + slope * x - y )
577
578    return result

'Theil-Sen slope estimate

This function wraps scipy.stats.theilslopes and provides results in the same dict format as the other line fitting methods in this module

Parameters
  • x, y (ndarray): dependent (x) and independent (y) variables for fitting
  • alpha (float (default = 0.95)): Desired confidence level [0,1] for output.
  • method ({'separate' (default), 'joint'}): Method for estimating intercept.
    • 'separate' uses np.median(y) - slope * np.median(x)
    • 'joint' uses np.median( y - slope * x )
Returns
  • fitresult (dict): Contains the following keys:
    • slope (float) Slope or Gradient of Y vs. X
    • intercept (float) Y intercept.
    • slope_ste (float) Standard error of slope estimate
    • intercept_ste (float) standard error of intercept estimate
    • slope_interval ([float, float]) confidence interval for gradient at confidence level alpha
    • intercept_interval ([float, float]) confidence interval for intercept at confidence level alpha
    • alpha (float) confidence level [0,1] for slope and intercept intervals
    • df_model (float) degrees of freedom for model
    • df_resid (float) degrees of freedom for residuals
    • params ([float,float]) array of fitted parameters
    • fittedvalues (ndarray) array of fitted values
    • resid (ndarray) array of residual values
def sen_slope(*args, **kwargs):
29def sen_slope(*args,**kwargs):
30    '''Alias for `sen`'''
31    return sen(*args,**kwargs)

Alias for sen

def sen_numba(x, y):
581def sen_numba( x, y ):
582    '''Estimate linear trend using the Thiel-Sen method
583    
584    This non-parametric method finds the median slope among all
585    combinations of time points. 
586    scipy.stats.theilslopes provides the same slope estimate, with  
587    confidence intervals. However, this function is faster for 
588    large datasets due to Numba 
589    
590    Parameters
591    ----------
592    x : array_like (N,)
593        independent variable
594    y : array_like (N,)
595        dependent variable
596    
597    Returns
598    -------
599    sen : float
600        the median slope
601    slopes : array (N*N,)
602        all slope estimates from all combinations of x and y
603    '''
604
605    with warnings.catch_warnings():
606        warnings.simplefilter('always', DeprecationWarning)
607        warnings.warn(f'Sen function is slow unless numba.jit is used. Use scipy.stats.theilslopes instead.',
608                    DeprecationWarning, stacklevel=2)
609        
610    if len( x ) != len( y ):
611        print('Inputs x and y must have same dimension')
612        return np.nan
613
614    # Find number of time points
615    n = len( x )
616
617    # Array to hold all slope estimates
618    slopes = np.zeros(  np.ceil( n * ( n-1 ) / 2 ).astype('int') )
619    slopes[:] = np.nan
620
621    count = 0
622
623    for i in range(n):
624        for j in range(i+1, n):
625
626            # Slope between elements i and j
627            slopeij = ( y[j] - y[i] ) / ( x[j] - x[i] )
628
629            slopes[count] = slopeij
630
631            count += 1
632
633    # Thiel-Sen estimate is the median slope, neglecting NaN
634    sen = np.nanmedian( slopes )
635
636    return sen, slopes

Estimate linear trend using the Thiel-Sen method

This non-parametric method finds the median slope among all combinations of time points. scipy.stats.theilslopes provides the same slope estimate, with
confidence intervals. However, this function is faster for large datasets due to Numba

Parameters
  • x (array_like (N,)): independent variable
  • y (array_like (N,)): dependent variable
Returns
  • sen (float): the median slope
  • slopes (array (N*N,)): all slope estimates from all combinations of x and y
def york(x, y, err_x=1, err_y=1, rerr_xy=0):
373def york( x, y, err_x=1, err_y=1, rerr_xy=0 ):
374    '''York regression accounting for error in x and y
375    Follows the notation and algorithm of York et al. (2004) Section III
376    
377    Parameters
378    ----------
379    x, y : ndarray
380        dependent (x) and independent (y) variables for fitting
381    err_x, err_y : ndarray (default=1)
382        standard deviation of errors/uncertainty in x and y
383    rerr_xy : float (default=0)
384        correlation coefficient for errors in x and y, 
385        default to rerr_xy=0 meaning that the errors in x are unrelated to errors in y
386        err_x, err_y, and rerr_xy can be constants or arrays of the same length as x and y
387    
388    Returns
389    -------
390    fitresult : dict 
391        Contains the following keys:
392        - slope (float)
393            Slope or Gradient of Y vs. X
394        - intercept (float)
395            Y intercept.
396        - slope_ste (float)
397            Standard error of slope estimate
398        - intercept_ste (float)
399            standard error of intercept estimate
400        - slope_interval ([float, float])
401            confidence interval for gradient at confidence level alpha
402        - intercept_interval ([float, float])
403            confidence interval for intercept at confidence level alpha
404        - alpha (float)
405            confidence level [0,1] for slope and intercept intervals
406        - df_model (float)
407            degrees of freedom for model
408        - df_resid (float)
409            degrees of freedom for residuals
410        - params ([float,float])
411            array of fitted parameters
412        - fittedvalues (ndarray)
413            array of fitted values
414        - resid (ndarray)
415            array of residual values
416    '''
417
418    # relative error tolerance required for convergence
419    rtol = 1e-15
420
421    # Initial guess for slope, from ordinary least squares
422    result = stats.linregress( x, y )
423    b = result[0]
424
425    # Weights for x and y
426    wx = 1 / err_x**2
427    wy = 1 / err_y**2
428
429    # Combined weights
430    alpha = np.sqrt( wx * wy )
431
432    # Iterate until solution converges, but not more 50 times
433    maxiter=50
434    for i in range(1,maxiter):
435
436        # Weight for point i
437        W = wx * wy / ( wx + b**2 * wy - 2 * b * rerr_xy * alpha )
438        Wsum = np.sum( W )
439
440        # Weighted means
441        Xbar = np.sum( W * x ) / Wsum
442        Ybar = np.sum( W * y ) / Wsum
443
444        # Deviation from weighted means
445        U = x - Xbar
446        V = y - Ybar
447
448        # parameter needed for slope
449        beta = W * ( U / wy + b*V / wx - (b*U + V) * rerr_xy / alpha )
450
451        # Update slope estimate
452        bnew = np.sum( W * beta * V ) / np.sum( W * beta * U )
453
454        # Break from loop if new value is very close to old value
455        if np.abs( (bnew-b)/b ) < rtol:
456            break
457        else:
458            b = bnew
459
460    if i==maxiter:
461        raise ValueError( f'York regression failed to converge in {maxiter:d} iterations' )
462
463    # Intercept
464    a = Ybar - b * Xbar
465
466    # least-squares adjusted points, expectation values of X and Y
467    xa = Xbar + beta
468    ya = Ybar + b*beta
469
470    # Mean of adjusted points
471    xabar = np.sum( W * xa ) / Wsum
472    yabar = np.sum( W * ya ) / Wsum
473
474    # Devaiation of adjusted points from their means
475    u = xa - xabar
476    v = ya - yabar
477
478    # Variance of slope and intercept estimates
479    varb = 1 / np.sum( W * u**2 )
480    vara = 1 / Wsum + xabar**2 * varb
481
482    # Standard error of slope and intercept
483    siga = np.sqrt( vara )
484    sigb = np.sqrt( varb )
485
486    # Define a named tuple type that will contain the results
487    # result = namedtuple( 'result', 'slope intercept sigs sigi params sigma' )
488
489    # Return results as a named tuple, User can access as a regular tuple too
490    # return result( b, a, sigb, siga, [b,a], [sigb, siga] )
491
492    dfmod = 2
493    N = np.sum( ~np.isnan(x) * ~np.isnan(y) )
494
495    result = dict( slope         = b,
496                intercept        = a,
497                slope_ste        = sigb,
498                intercept_ste    = siga,
499                slope_interval   = [None,None],
500                intercept_interval = [None,None],
501                alpha            = alpha,
502                df_model         = dfmod,
503                df_resid         = N-dfmod,
504                params           = np.array([b,a]),
505                nobs             = N,
506                fittedvalues     = a + b * x,
507                resid            = a + b * x - y )
508
509    return result

York regression accounting for error in x and y Follows the notation and algorithm of York et al. (2004) Section III

Parameters
  • x, y (ndarray): dependent (x) and independent (y) variables for fitting
  • err_x, err_y (ndarray (default=1)): standard deviation of errors/uncertainty in x and y
  • rerr_xy (float (default=0)): correlation coefficient for errors in x and y, default to rerr_xy=0 meaning that the errors in x are unrelated to errors in y err_x, err_y, and rerr_xy can be constants or arrays of the same length as x and y
Returns
  • fitresult (dict): Contains the following keys:
    • slope (float) Slope or Gradient of Y vs. X
    • intercept (float) Y intercept.
    • slope_ste (float) Standard error of slope estimate
    • intercept_ste (float) standard error of intercept estimate
    • slope_interval ([float, float]) confidence interval for gradient at confidence level alpha
    • intercept_interval ([float, float]) confidence interval for intercept at confidence level alpha
    • alpha (float) confidence level [0,1] for slope and intercept intervals
    • df_model (float) degrees of freedom for model
    • df_resid (float) degrees of freedom for residuals
    • params ([float,float]) array of fitted parameters
    • fittedvalues (ndarray) array of fitted values
    • resid (ndarray) array of residual values