1"""Principal Component Analysis
3Author: josef-pktd
4Modified by Kevin Sheppard
5"""
7import numpy as np
8import pandas as pd
10from statsmodels.tools.sm_exceptions import (ValueWarning,
11 EstimationWarning)
14def _norm(x):
15 return np.sqrt(np.sum(x * x))
18class PCA(object):
19 """
20 Principal Component Analysis
22 Parameters
23 ----------
24 data : array_like
25 Variables in columns, observations in rows.
26 ncomp : int, optional
27 Number of components to return. If None, returns the as many as the
28 smaller of the number of rows or columns in data.
29 standardize : bool, optional
30 Flag indicating to use standardized data with mean 0 and unit
31 variance. standardized being True implies demean. Using standardized
32 data is equivalent to computing principal components from the
33 correlation matrix of data.
34 demean : bool, optional
35 Flag indicating whether to demean data before computing principal
36 components. demean is ignored if standardize is True. Demeaning data
37 but not standardizing is equivalent to computing principal components
38 from the covariance matrix of data.
39 normalize : bool , optional
40 Indicates whether to normalize the factors to have unit inner product.
41 If False, the loadings will have unit inner product.
42 gls : bool, optional
43 Flag indicating to implement a two-step GLS estimator where
44 in the first step principal components are used to estimate residuals,
45 and then the inverse residual variance is used as a set of weights to
46 estimate the final principal components. Setting gls to True requires
47 ncomp to be less then the min of the number of rows or columns.
48 weights : ndarray, optional
49 Series weights to use after transforming data according to standardize
50 or demean when computing the principal components.
51 method : str, optional
52 Sets the linear algebra routine used to compute eigenvectors
53 'svd' uses a singular value decomposition (default).
54 'eig' uses an eigenvalue decomposition of a quadratic form
55 'nipals' uses the NIPALS algorithm and can be faster than SVD when
56 ncomp is small and nvars is large. See notes about additional changes
57 when using NIPALS.
58 missing : str
59 Method for missing data. Choices are:
60 'drop-row' - drop rows with missing values.
61 'drop-col' - drop columns with missing values.
62 'drop-min' - drop either rows or columns, choosing by data retention.
63 'fill-em' - use EM algorithm to fill missing value. ncomp should be
64 set to the number of factors required.
65 tol : float, optional
66 Tolerance to use when checking for convergence when using NIPALS.
67 max_iter : int, optional
68 Maximum iterations when using NIPALS.
69 tol_em : float
70 Tolerance to use when checking for convergence of the EM algorithm.
71 max_em_iter : int
72 Maximum iterations for the EM algorithm.
74 Attributes
75 ----------
76 factors : array or DataFrame
77 nobs by ncomp array of of principal components (scores)
78 scores : array or DataFrame
79 nobs by ncomp array of of principal components - identical to factors
80 loadings : array or DataFrame
81 ncomp by nvar array of principal component loadings for constructing
82 the factors
83 coeff : array or DataFrame
84 nvar by ncomp array of principal component loadings for constructing
85 the projections
86 projection : array or DataFrame
87 nobs by var array containing the projection of the data onto the ncomp
88 estimated factors
89 rsquare : array or Series
90 ncomp array where the element in the ith position is the R-square
91 of including the fist i principal components. Note: values are
92 calculated on the transformed data, not the original data
93 ic : array or DataFrame
94 ncomp by 3 array containing the Bai and Ng (2003) Information
95 criteria. Each column is a different criteria, and each row
96 represents the number of included factors.
97 eigenvals : array or Series
98 nvar array of eigenvalues
99 eigenvecs : array or DataFrame
100 nvar by nvar array of eigenvectors
101 weights : ndarray
102 nvar array of weights used to compute the principal components,
103 normalized to unit length
104 transformed_data : ndarray
105 Standardized, demeaned and weighted data used to compute
106 principal components and related quantities
107 cols : ndarray
108 Array of indices indicating columns used in the PCA
109 rows : ndarray
110 Array of indices indicating rows used in the PCA
112 Notes
113 -----
114 The default options perform principal component analysis on the
115 demeaned, unit variance version of data. Setting standardize to False
116 will instead only demean, and setting both standardized and
117 demean to False will not alter the data.
119 Once the data have been transformed, the following relationships hold when
120 the number of components (ncomp) is the same as tne minimum of the number
121 of observation or the number of variables.
123 .. math:
125 X' X = V \\Lambda V'
127 .. math:
129 F = X V
131 .. math:
133 X = F V'
135 where X is the `data`, F is the array of principal components (`factors`
136 or `scores`), and V is the array of eigenvectors (`loadings`) and V' is
137 the array of factor coefficients (`coeff`).
139 When weights are provided, the principal components are computed from the
140 modified data
142 .. math:
144 \\Omega^{-\\frac{1}{2}} X
146 where :math:`\\Omega` is a diagonal matrix composed of the weights. For
147 example, when using the GLS version of PCA, the elements of :math:`\\Omega`
148 will be the inverse of the variances of the residuals from
150 .. math:
152 X - F V'
154 where the number of factors is less than the rank of X
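
    These relationships can be checked numerically; a small sketch using
    random data and normalize=False, so that the factors are exactly
    :math:`XV`:

    >>> import numpy as np
    >>> from statsmodels.multivariate.pca import PCA
    >>> x = np.random.standard_normal((50, 5))
    >>> pc = PCA(x, normalize=False)
    >>> np.allclose(pc.factors, pc.transformed_data.dot(pc.eigenvecs))
    True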

    References
    ----------
    .. [*] J. Bai and S. Ng, "Determining the number of factors in
       approximate factor models," Econometrica, vol. 70, number 1,
       pp. 191-221, 2002

    Examples
    --------
    Basic PCA using the correlation matrix of the data

    >>> import numpy as np
    >>> from statsmodels.multivariate.pca import PCA
    >>> x = np.random.randn(100)[:, None]
    >>> x = x + np.random.randn(100, 100)
    >>> pc = PCA(x)

    Note that the principal components are computed using a SVD and so the
    correlation matrix is never constructed, unless method='eig'.

    PCA using the covariance matrix of the data

    >>> pc = PCA(x, standardize=False)

    Limiting the number of factors returned to 1 computed using NIPALS

    >>> pc = PCA(x, ncomp=1, method='nipals')
    >>> pc.factors.shape
    (100, 1)
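
    As further sketches, the GLS option refits after weighting each series by
    its inverse residual variance (ncomp must be less than the number of
    columns), and 'fill-em' fills missing values with the EM algorithm; the
    NaN below is inserted purely for illustration:

    >>> pc = PCA(x, ncomp=5, gls=True)
    >>> x_miss = x.copy()
    >>> x_miss[0, 0] = np.nan
    >>> pc = PCA(x_miss, ncomp=5, missing='fill-em')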
183 """
185 def __init__(self, data, ncomp=None, standardize=True, demean=True,
186 normalize=True, gls=False, weights=None, method='svd',
187 missing=None, tol=5e-8, max_iter=1000, tol_em=5e-8,
188 max_em_iter=100):
189 self._index = None
190 self._columns = []
191 if isinstance(data, pd.DataFrame):
192 self._index = data.index
193 self._columns = data.columns
195 self.data = np.asarray(data)
196 # Store inputs
197 self._gls = gls
198 self._normalize = normalize
199 self._tol = tol
200 if not 0 < self._tol < 1:
201 raise ValueError('tol must be strictly between 0 and 1')
202 self._max_iter = max_iter
203 self._max_em_iter = max_em_iter
204 self._tol_em = tol_em
206 # Prepare data
207 self._standardize = standardize
208 self._demean = demean
210 self._nobs, self._nvar = self.data.shape
211 if weights is None:
212 weights = np.ones(self._nvar)
213 else:
214 weights = np.array(weights).flatten()
215 if weights.shape[0] != self._nvar:
216 raise ValueError('weights should have nvar elements')
217 weights = weights / np.sqrt((weights ** 2.0).mean())
218 self.weights = weights
220 # Check ncomp against maximum
221 min_dim = min(self._nobs, self._nvar)
222 self._ncomp = min_dim if ncomp is None else ncomp
223 if self._ncomp > min_dim:
224 import warnings
226 warn = 'The requested number of components is more than can be ' \
227 'computed from data. The maximum number of components is ' \
228 'the minimum of the number of observations or variables'
229 warnings.warn(warn, ValueWarning)
230 self._ncomp = min_dim
232 self._method = method
233 # Workaround to avoid instance methods in __dict__
234 if self._method not in ('eig', 'svd', 'nipals'):
235 raise ValueError('method {0} is not known.'.format(method))
237 self.rows = np.arange(self._nobs)
238 self.cols = np.arange(self._nvar)
239 # Handle missing
240 self._missing = missing
241 self._adjusted_data = self.data
242 if missing is not None:
243 self._adjust_missing()
244 # Update size
245 self._nobs, self._nvar = self._adjusted_data.shape
246 if self._ncomp == np.min(self.data.shape):
247 self._ncomp = np.min(self._adjusted_data.shape)
248 elif self._ncomp > np.min(self._adjusted_data.shape):
249 raise ValueError('When adjusting for missing values, user '
250 'provided ncomp must be no larger than the '
251 'smallest dimension of the '
252 'missing-value-adjusted data size.')
254 # Attributes and internal values
255 self._tss = 0.0
256 self._ess = None
257 self.transformed_data = None
258 self._mu = None
259 self._sigma = None
260 self._ess_indiv = None
261 self._tss_indiv = None
262 self.scores = self.factors = None
263 self.loadings = None
264 self.coeff = None
265 self.eigenvals = None
266 self.eigenvecs = None
267 self.projection = None
268 self.rsquare = None
269 self.ic = None
271 # Prepare data
272 self.transformed_data = self._prepare_data()
273 # Perform the PCA
274 self._pca()
275 if gls:
276 self._compute_gls_weights()
277 self.transformed_data = self._prepare_data()
278 self._pca()
280 # Final calculations
281 self._compute_rsquare_and_ic()
282 if self._index is not None:
283 self._to_pandas()
    def _adjust_missing(self):
        """
        Implements alternatives for handling missing values
        """

        def keep_col(x):
            index = np.logical_not(np.any(np.isnan(x), 0))
            return x[:, index], index

        def keep_row(x):
            index = np.logical_not(np.any(np.isnan(x), 1))
            return x[index, :], index

        if self._missing == 'drop-col':
            self._adjusted_data, index = keep_col(self.data)
            self.cols = np.where(index)[0]
            self.weights = self.weights[index]
        elif self._missing == 'drop-row':
            self._adjusted_data, index = keep_row(self.data)
            self.rows = np.where(index)[0]
        elif self._missing == 'drop-min':
            drop_col, drop_col_index = keep_col(self.data)
            drop_col_size = drop_col.size

            drop_row, drop_row_index = keep_row(self.data)
            drop_row_size = drop_row.size

            if drop_row_size > drop_col_size:
                self._adjusted_data = drop_row
                self.rows = np.where(drop_row_index)[0]
            else:
                self._adjusted_data = drop_col
                self.weights = self.weights[drop_col_index]
                self.cols = np.where(drop_col_index)[0]
        elif self._missing == 'fill-em':
            self._adjusted_data = self._fill_missing_em()
        else:
            raise ValueError('missing method is not known.')

        if self._index is not None:
            self._columns = self._columns[self.cols]
            self._index = self._index[self.rows]

        # Check adjusted data size
        if self._adjusted_data.size == 0:
            raise ValueError('Removal of missing values has eliminated '
                             'all data.')
    def _compute_gls_weights(self):
        """
        Computes GLS weights based on percentage of data fit
        """
        errors = self.transformed_data - np.asarray(self.projection)
        if self._ncomp == self._nvar:
            raise ValueError('gls can only be used when ncomp < nvar '
                             'so that residuals have non-zero variance')
        var = (errors ** 2.0).mean(0)
        weights = 1.0 / var
        weights = weights / np.sqrt((weights ** 2.0).mean())
        nvar = self._nvar
        # Effective share of series, computed from the inverse Herfindahl
        # index of the normalized weights
        eff_series_perc = (1.0 / sum((weights / weights.sum()) ** 2.0)) / nvar
        if eff_series_perc < 0.1:
            eff_series = int(np.round(eff_series_perc * nvar))
            import warnings

            warn = 'Many series are being down weighted by GLS. Of the ' \
                   '{original} series, the GLS estimates are based on only ' \
                   '{effective} (effective) ' \
                   'series.'.format(original=nvar, effective=eff_series)
            warnings.warn(warn, EstimationWarning)

        self.weights = weights
    def _pca(self):
        """
        Main PCA routine
        """
        self._compute_eig()
        self._compute_pca_from_eig()
        self.projection = self.project()

    def __repr__(self):
        string = self.__str__()
        string = string[:-1]
        string += ', id: ' + hex(id(self)) + ')'
        return string
    def __str__(self):
        string = 'Principal Component Analysis('
        string += 'nobs: ' + str(self._nobs) + ', '
        string += 'nvar: ' + str(self._nvar) + ', '
        if self._standardize:
            kind = 'Standardize (Correlation)'
        elif self._demean:
            kind = 'Demean (Covariance)'
        else:
            kind = 'None'
        string += 'transformation: ' + kind + ', '
        if self._gls:
            string += 'GLS, '
        string += 'normalization: ' + str(self._normalize) + ', '
        string += 'number of components: ' + str(self._ncomp) + ', '
        if self._method == 'eig':
            method = 'Eigenvalue'
        elif self._method == 'nipals':
            method = 'NIPALS'
        else:
            method = 'SVD'
        string += 'method: ' + method
        string += ')'
        return string
    def _prepare_data(self):
        """
        Standardize or demean data.
        """
        adj_data = self._adjusted_data
        if np.all(np.isnan(adj_data)):
            # Return an all-nan array matching the number of variables
            return np.full(adj_data.shape[1], np.nan)

        self._mu = np.nanmean(adj_data, axis=0)
        self._sigma = np.sqrt(np.nanmean((adj_data - self._mu) ** 2.0,
                                         axis=0))
        if self._standardize:
            data = (adj_data - self._mu) / self._sigma
        elif self._demean:
            data = (adj_data - self._mu)
        else:
            data = adj_data
        return data / np.sqrt(self.weights)
    def _compute_eig(self):
        """
        Wrapper for actual eigenvalue method

        This is a workaround to avoid instance methods in __dict__
        """
        if self._method == 'eig':
            return self._compute_using_eig()
        elif self._method == 'svd':
            return self._compute_using_svd()
        else:  # self._method == 'nipals'
            return self._compute_using_nipals()
    def _compute_using_svd(self):
        """SVD method to compute eigenvalues and eigenvecs"""
        x = self.transformed_data
        u, s, v = np.linalg.svd(x)
        self.eigenvals = s ** 2.0
        self.eigenvecs = v.T

    def _compute_using_eig(self):
        """
        Eigenvalue decomposition method to compute eigenvalues and
        eigenvectors
        """
        x = self.transformed_data
        self.eigenvals, self.eigenvecs = np.linalg.eigh(x.T.dot(x))
    def _compute_using_nipals(self):
        """
        NIPALS implementation to compute small number of eigenvalues
        and eigenvectors
        """
        x = self.transformed_data
        if self._ncomp > 1:
            x = x + 0.0  # Copy, since x is deflated in place below

        tol, max_iter, ncomp = self._tol, self._max_iter, self._ncomp
        vals = np.zeros(self._ncomp)
        vecs = np.zeros((self._nvar, self._ncomp))
        for i in range(ncomp):
            # Initialize the factor with the highest-variance column
            max_var_ind = np.argmax(x.var(0))
            factor = x[:, [max_var_ind]]
            _iter = 0
            diff = 1.0
            # Alternate between updating the loading vector and the factor
            # until the factor stops changing
            while diff > tol and _iter < max_iter:
                vec = x.T.dot(factor) / (factor.T.dot(factor))
                vec = vec / np.sqrt(vec.T.dot(vec))
                factor_last = factor
                factor = x.dot(vec) / (vec.T.dot(vec))
                diff = _norm(factor - factor_last) / _norm(factor)
                _iter += 1
            vals[i] = (factor ** 2).sum()
            vecs[:, [i]] = vec
            if ncomp > 1:
                # Deflate x before extracting the next component
                x -= factor.dot(vec.T)

        self.eigenvals = vals
        self.eigenvecs = vecs
    def _fill_missing_em(self):
        """
        EM algorithm to fill missing values
        """
        non_missing = np.logical_not(np.isnan(self.data))

        # If nothing missing, return without altering the data
        if np.all(non_missing):
            return self.data

        # 1. Standardize the data as needed
        data = self.transformed_data = np.asarray(self._prepare_data())

        ncomp = self._ncomp

        # 2. Check that every row and column has enough non-missing values
        row_non_missing = np.sum(non_missing, 1)
        col_non_missing = np.sum(non_missing, 0)
        if np.any(row_non_missing < ncomp) or np.any(col_non_missing < ncomp):
            raise ValueError('Implementation requires that all columns and '
                             'all rows have at least ncomp non-missing '
                             'values')
        # 3. Get mask
        mask = np.isnan(data)

        # 4. Compute mean
        mu = np.nanmean(data, 0)

        # 5. Replace missing values with the column means
        projection = np.ones((self._nobs, 1)) * mu
        projection_masked = projection[mask]
        data[mask] = projection_masked

        # 6. Iterate: compute eigenvalues, project, and refill until the
        #    filled values converge
        diff = 1.0
        _iter = 0
        while diff > self._tol_em and _iter < self._max_em_iter:
            last_projection_masked = projection_masked
            # Set transformed data to compute eigenvalues
            self.transformed_data = data
            # Call correct eig function here
            self._compute_eig()
            # Call function to compute factors and projection
            self._compute_pca_from_eig()
            projection = np.asarray(self.project(transform=False,
                                                 unweight=False))
            projection_masked = projection[mask]
            data[mask] = projection_masked
            delta = last_projection_masked - projection_masked
            diff = _norm(delta) / _norm(projection_masked)
            _iter += 1
        # Copy the adjusted data to avoid overwriting the original when
        # replacing the missing values
        data = self._adjusted_data + 0.0
        projection = np.asarray(self.project())
        data[mask] = projection[mask]

        return data
    def _compute_pca_from_eig(self):
        """
        Compute relevant statistics after eigenvalues have been computed
        """
        # Ensure sorted largest to smallest
        vals, vecs = self.eigenvals, self.eigenvecs
        indices = np.argsort(vals)
        indices = indices[::-1]
        vals = vals[indices]
        vecs = vecs[:, indices]
        if (vals <= 0).any():
            # Discard and warn
            num_good = vals.shape[0] - (vals <= 0).sum()
            if num_good < self._ncomp:
                import warnings

                warnings.warn('Only {num:d} eigenvalues are positive. '
                              'This is the maximum number of components '
                              'that can be extracted.'.format(num=num_good),
                              EstimationWarning)

                self._ncomp = num_good
                vals[num_good:] = np.finfo(np.float64).tiny
        # Use ncomp for the remaining calculations
        vals = vals[:self._ncomp]
        vecs = vecs[:, :self._ncomp]
        self.eigenvals, self.eigenvecs = vals, vecs
        # Select correct number of components to return
        self.scores = self.factors = self.transformed_data.dot(vecs)
        self.loadings = vecs
        self.coeff = vecs.T
        if self._normalize:
            # Rescale so the factors have unit inner product; the
            # coefficients absorb the square roots of the eigenvalues
            self.coeff = (self.coeff.T * np.sqrt(vals)).T
            self.factors /= np.sqrt(vals)
            self.scores = self.factors
    def _compute_rsquare_and_ic(self):
        """
        Final statistics to compute
        """
        # TSS and related calculations
        # TODO: This needs careful testing, with and without weights,
        #       gls, standardized and demean
        weights = self.weights
        ss_data = self.transformed_data * np.sqrt(weights)
        self._tss_indiv = np.sum(ss_data ** 2, 0)
        self._tss = np.sum(self._tss_indiv)
        self._ess = np.zeros(self._ncomp + 1)
        self._ess_indiv = np.zeros((self._ncomp + 1, self._nvar))
        for i in range(self._ncomp + 1):
            # Projection in the same space as transformed_data
            projection = self.project(ncomp=i, transform=False,
                                      unweight=False)
            indiv_rss = (projection ** 2).sum(axis=0)
            rss = indiv_rss.sum()
            self._ess[i] = self._tss - rss
            self._ess_indiv[i, :] = self._tss_indiv - indiv_rss
        self.rsquare = 1.0 - self._ess / self._tss
        # Information Criteria
        ess = self._ess
        invalid = ess <= 0  # Prevent log issues of 0
        if invalid.any():
            last_obs = (np.where(invalid)[0]).min()
            ess = ess[:last_obs]

        log_ess = np.log(ess)
        r = np.arange(ess.shape[0])

        nobs, nvar = self._nobs, self._nvar
        sum_to_prod = (nobs + nvar) / (nobs * nvar)
        min_dim = min(nobs, nvar)
        # Per-factor penalties for the Bai and Ng (2002) IC_p1, IC_p2 and
        # IC_p3 criteria
        penalties = np.array([sum_to_prod * np.log(1.0 / sum_to_prod),
                              sum_to_prod * np.log(min_dim),
                              np.log(min_dim) / min_dim])
        penalties = penalties[:, None]
        ic = log_ess + r * penalties
        self.ic = ic.T
    def project(self, ncomp=None, transform=True, unweight=True):
        """
        Project series onto a specific number of factors.

        Parameters
        ----------
        ncomp : int, optional
            Number of components to use. If omitted, all components
            initially computed are used.
        transform : bool, optional
            Flag indicating whether to return the projection in the original
            space of the data (True, default) or in the space of the
            standardized/demeaned data.
        unweight : bool, optional
            Flag indicating whether to undo the effects of the estimation
            weights.

        Returns
        -------
        array_like
            The nobs by nvar array of the projection onto ncomp factors.

        Notes
        -----
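        The projection is the product of the first ncomp factors and the
        first ncomp rows of coeff. When unweight is True the estimation
        weights are removed, and when transform is True the projection is
        additionally rescaled by sigma (if standardized) and re-centered by
        mu (if demeaned).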
626 """
627 # Projection needs to be scaled/shifted based on inputs
628 ncomp = self._ncomp if ncomp is None else ncomp
629 if ncomp > self._ncomp:
630 raise ValueError('ncomp must be smaller than the number of '
631 'components computed.')
632 factors = np.asarray(self.factors)
633 coeff = np.asarray(self.coeff)
635 projection = factors[:, :ncomp].dot(coeff[:ncomp, :])
636 if transform or unweight:
637 projection *= np.sqrt(self.weights)
638 if transform:
639 # Remove the weights, which do not depend on transformation
640 if self._standardize:
641 projection *= self._sigma
642 if self._standardize or self._demean:
643 projection += self._mu
644 if self._index is not None:
645 projection = pd.DataFrame(projection,
646 columns=self._columns,
647 index=self._index)
648 return projection
    def _to_pandas(self):
        """
        Convert all attributes to pandas DataFrames or Series
        """
        index = self._index
        # Principal Components
        num_zeros = np.ceil(np.log10(self._ncomp))
        comp_str = 'comp_{0:0' + str(int(num_zeros)) + 'd}'
        cols = [comp_str.format(i) for i in range(self._ncomp)]
        df = pd.DataFrame(self.factors, columns=cols, index=index)
        self.scores = self.factors = df
        # Projections
        df = pd.DataFrame(self.projection,
                          columns=self._columns,
                          index=index)
        self.projection = df
        # Weights
        df = pd.DataFrame(self.coeff, index=cols,
                          columns=self._columns)
        self.coeff = df
        # Loadings
        df = pd.DataFrame(self.loadings,
                          index=self._columns, columns=cols)
        self.loadings = df
        # eigenvals
        self.eigenvals = pd.Series(self.eigenvals)
        self.eigenvals.name = 'eigenvals'
        # eigenvecs
        vec_str = comp_str.replace('comp', 'eigenvec')
        cols = [vec_str.format(i) for i in range(self.eigenvecs.shape[1])]
        self.eigenvecs = pd.DataFrame(self.eigenvecs, columns=cols)
        # R2
        self.rsquare = pd.Series(self.rsquare)
        self.rsquare.index.name = 'ncomp'
        self.rsquare.name = 'rsquare'
        # IC
        self.ic = pd.DataFrame(self.ic, columns=['IC_p1', 'IC_p2', 'IC_p3'])
        self.ic.index.name = 'ncomp'
    def plot_scree(self, ncomp=None, log_scale=True,
                   cumulative=False, ax=None):
        """
        Plot of the ordered eigenvalues

        Parameters
        ----------
        ncomp : int, optional
            Number of components to include in the plot. If None, includes
            the same number as the number of components computed
        log_scale : bool, optional
            Flag indicating whether to use a log scale for the y-axis
        cumulative : bool, optional
            Flag indicating whether to plot the eigenvalues or cumulative
            eigenvalues
        ax : AxesSubplot, optional
            An axes on which to draw the graph. If omitted, a new figure
            is created

        Returns
        -------
        matplotlib.figure.Figure
            The handle to the figure.
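
        Examples
        --------
        A minimal sketch, assuming ``pc`` is a fitted PCA instance:

        >>> fig = pc.plot_scree(ncomp=10)  # doctest: +SKIP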
712 """
713 import statsmodels.graphics.utils as gutils
715 fig, ax = gutils.create_mpl_ax(ax)
717 ncomp = self._ncomp if ncomp is None else ncomp
718 vals = np.asarray(self.eigenvals)
719 vals = vals[:self._ncomp]
720 if cumulative:
721 vals = np.cumsum(vals)
723 if log_scale:
724 ax.set_yscale('log')
725 ax.plot(np.arange(ncomp), vals[: ncomp], 'bo')
726 ax.autoscale(tight=True)
727 xlim = np.array(ax.get_xlim())
728 sp = xlim[1] - xlim[0]
729 xlim += 0.02 * np.array([-sp, sp])
730 ax.set_xlim(xlim)
732 ylim = np.array(ax.get_ylim())
733 scale = 0.02
734 if log_scale:
735 sp = np.log(ylim[1] / ylim[0])
736 ylim = np.exp(np.array([np.log(ylim[0]) - scale * sp,
737 np.log(ylim[1]) + scale * sp]))
738 else:
739 sp = ylim[1] - ylim[0]
740 ylim += scale * np.array([-sp, sp])
741 ax.set_ylim(ylim)
742 ax.set_title('Scree Plot')
743 ax.set_ylabel('Eigenvalue')
744 ax.set_xlabel('Component Number')
745 fig.tight_layout()
747 return fig
    def plot_rsquare(self, ncomp=None, ax=None):
        """
        Box plots of the individual series R-square against the number of PCs.

        Parameters
        ----------
        ncomp : int, optional
            Number of components to include in the plot. If None, will
            plot the minimum of 10 or the number of computed components.
        ax : AxesSubplot, optional
            An axes on which to draw the graph. If omitted, a new figure
            is created.

        Returns
        -------
        matplotlib.figure.Figure
            The handle to the figure.
        """
        import statsmodels.graphics.utils as gutils

        fig, ax = gutils.create_mpl_ax(ax)

        ncomp = 10 if ncomp is None else ncomp
        ncomp = min(ncomp, self._ncomp)
        # R2s in rows, series in columns
        r2s = 1.0 - self._ess_indiv / self._tss_indiv
        r2s = r2s[1:]
        r2s = r2s[:ncomp]
        ax.boxplot(r2s.T)
        ax.set_title('Individual Input $R^2$')
        ax.set_ylabel('$R^2$')
        ax.set_xlabel('Number of Included Principal Components')

        return fig


def pca(data, ncomp=None, standardize=True, demean=True, normalize=True,
        gls=False, weights=None, method='svd'):
    """
    Perform Principal Component Analysis (PCA).

    Parameters
    ----------
    data : ndarray
        Variables in columns, observations in rows.
    ncomp : int, optional
        Number of components to return. If None, returns as many as the
        smaller of the number of rows or columns of data.
    standardize : bool, optional
        Flag indicating to use standardized data with mean 0 and unit
        variance. standardize being True implies demean.
    demean : bool, optional
        Flag indicating whether to demean data before computing principal
        components. demean is ignored if standardize is True.
    normalize : bool, optional
        Indicates whether to normalize the factors to have unit inner
        product. If False, the loadings will have unit inner product.
    gls : bool, optional
        Flag indicating to implement a two-step GLS estimator where
        in the first step principal components are used to estimate residuals,
        and then the inverse residual variance is used as a set of weights to
        estimate the final principal components
    weights : ndarray, optional
        Series weights to use after transforming data according to standardize
        or demean when computing the principal components.
    method : str, optional
        Determines the linear algebra routine used. 'svd', the default,
        uses a singular value decomposition. 'eig' uses an eigenvalue
        decomposition.

    Returns
    -------
    factors : {ndarray, DataFrame}
        Array (nobs, ncomp) of principal components (also known as scores).
    loadings : {ndarray, DataFrame}
        Array (nvar, ncomp) of principal component loadings for constructing
        the factors.
    projection : {ndarray, DataFrame}
        Array (nobs, nvar) containing the projection of the data onto the
        ncomp estimated factors.
    rsquare : {ndarray, Series}
        Array (ncomp,) where the element in the ith position is the R-square
        of including the first i principal components. The values are
        calculated on the transformed data, not the original data.
    ic : {ndarray, DataFrame}
        Array (ncomp, 3) containing the Bai and Ng (2002) Information
        criteria. Each column is a different criterion, and each row
        represents the number of included factors.
    eigenvals : {ndarray, Series}
        Array of eigenvalues (nvar,).
    eigenvecs : {ndarray, DataFrame}
        Array of eigenvectors (nvar, nvar).

    Notes
    -----
    This is a simple function wrapper around the PCA class. See PCA for
    more information and additional methods.
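
    Examples
    --------
    A minimal sketch of the wrapper, using random data for illustration:

    >>> import numpy as np
    >>> from statsmodels.multivariate.pca import pca
    >>> x = np.random.randn(100, 10)
    >>> factors, loadings, proj, rsquare, ic, evals, evecs = pca(x, ncomp=2)
    >>> factors.shape
    (100, 2)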
846 """
847 pc = PCA(data, ncomp=ncomp, standardize=standardize, demean=demean,
848 normalize=normalize, gls=gls, weights=weights, method=method)
850 return (pc.factors, pc.loadings, pc.projection, pc.rsquare, pc.ic,
851 pc.eigenvals, pc.eigenvecs)