Correspondence analysis is an exploratory technique applied to the analysis of contingency tables. The module implements correspondence analysis for two-way frequency cross-tabulation tables.
The module contains one class, CA, which wraps all the mathematical functions, and a function input for loading a contingency table from a file. The class is constructed by passing a contingency table as a parameter to the constructor. The contingency table is encoded as a Python nested list ("list-of-lists") or as one of the numpy types matrix and array. The class also includes a method input(filename) that reads the contingency table from a file in which each row of the table is represented by a line of comma-separated numbers. The different ways of passing the contingency table to correspondence analysis are illustrated in the following snippet:
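As a minimal sketch of the three encodings, written independently of the module, the lines below build the same two rows as a nested list and as a numpy array, and parse them from comma-separated text. The helper read_table is hypothetical: it only mirrors the file format described above and is not the module's own input function.

```python
import io
import numpy as np

def read_table(lines):
    """Hypothetical reader: one table row per line of comma-separated numbers."""
    rows = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            rows.append([float(x) for x in line.split(",")])
    return rows

as_lists = [[4, 2, 3, 2], [4, 3, 7, 4]]   # Python list-of-lists
as_array = np.array(as_lists)             # numpy array built from the same rows

# Any iterable of text lines works here; an open file would be used the same way.
from_text = read_table(io.StringIO("4,2,3,2\n4,3,7,4\n"))

# All three encodings describe the same table.
same = np.array_equal(as_array, np.array(from_text))
```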
The attributes provide access to the contingency table and various matrices created in the analysis process.
Two methods take an optional argument dim and return only the dimensions given in dim. If dim is omitted, the first two dimensions are returned.
One method decomposes inertia according to the argument axis. If axis equals 0, inertia is decomposed across rows. If axis equals 1, inertia is decomposed across columns. This parameter is optional and defaults to 0.
One method returns an array whose elements are the inertias of the axes. If percentage = 1, the percentages of inertia of each axis are returned.
One method returns an array whose elements are the contributions of points to the inertia of an axis. The argument rowColumn defines whether the calculation is performed for row points (the default) or column points. The values are expressed as percentages if percentage = 1.
One further method takes an argument axis.
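The contribution of points to the inertia of an axis can be sketched with numpy as follows. This is not the module's code: the function name point_contributions is hypothetical, its arguments rowColumn, axis and percentage only imitate the parameters described above, and the computation is the standard one based on the singular value decomposition of the standardized residuals. The data is the smoking table from the example in this text.

```python
import numpy as np

def point_contributions(table, rowColumn=0, axis=0, percentage=0):
    """Hypothetical helper: contribution of each point to the inertia of one axis."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                          # correspondence matrix
    r, c = P.sum(axis=1), P.sum(axis=0)      # row and column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    vectors = U if rowColumn == 0 else Vt.T  # row points by default, else column points
    # mass * squared principal coordinate simplifies to (singular vector * singular value)^2
    contrib = (vectors[:, axis] * sv[axis]) ** 2
    if percentage == 1:
        contrib = 100 * contrib / contrib.sum()
    return contrib

# Heavy count for Senior Employees taken as 4, consistent with the row total of 51.
smoking = [[4, 2, 3, 2],
           [4, 3, 7, 4],
           [25, 10, 12, 4],
           [18, 24, 33, 13],
           [10, 6, 7, 2]]

row_pct = point_contributions(smoking, rowColumn=0, axis=0, percentage=1)
col_abs = point_contributions(smoking, rowColumn=1, axis=0)
```

The absolute contributions of all points to one axis add up to the inertia of that axis, so the percentages always sum to 100.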
The data table below shows the smoking habits of different employees in a company.
| Staff Group | Smoking category: (1) None | (2) Light | (3) Medium | (4) Heavy | Row Totals |
|---|---|---|---|---|---|
| (1) Senior Managers | 4 | 2 | 3 | 2 | 11 |
| (2) Junior Managers | 4 | 3 | 7 | 4 | 18 |
| (3) Senior Employees | 25 | 10 | 12 | 4 | 51 |
| (4) Junior Employees | 18 | 24 | 33 | 13 | 88 |
| (5) Secretaries | 10 | 6 | 7 | 2 | 25 |
| Column Totals | 61 | 45 | 62 | 25 | 193 |
The 4 column values in each row of the table can be viewed as coordinates in a 4-dimensional space, and the (Euclidean) distances between the 5 row points in that space can be computed. These distances summarize all the information about the similarities between the rows of the table. Correspondence analysis can be used to find a lower-dimensional space in which the row points are positioned so that all, or almost all, of the information about the differences between the rows is retained. For this table, nearly all of the information about the similarities between the rows (the types of employees) can be presented in a simple 2-dimensional graph. While this may not appear particularly useful for a small table like the one above, the presentation and interpretation of very large tables (e.g., differential preference for 10 consumer items among 100 groups of respondents in a consumer survey) can benefit greatly from the simplification that correspondence analysis achieves (e.g., representing the 10 consumer items in a 2-dimensional space). The same analysis can be performed on the columns of the table.
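The reduction to two dimensions can be sketched in plain numpy. This is the standard computation through a singular value decomposition of the standardized residuals, not the module's own code, and all variable names are illustrative. The Heavy count for Senior Employees is taken as 4, the value consistent with the row total of 51 and the column total of 25.

```python
import numpy as np

# Smoking table: rows are staff groups, columns are smoking categories.
N = np.array([[4, 2, 3, 2],
              [4, 3, 7, 4],
              [25, 10, 12, 4],
              [18, 24, 33, 13],
              [10, 6, 7, 2]], dtype=float)

P = N / N.sum()        # correspondence matrix (relative frequencies)
r = P.sum(axis=1)      # row masses
c = P.sum(axis=0)      # column masses

# Standardized residuals from the independence model r c^T
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

# Principal row coordinates; keep only the first two dimensions
row_coords = (U * sv) / np.sqrt(r)[:, None]
row_2d = row_coords[:, :2]

# Share of the total inertia retained by the two-dimensional display
explained = (sv[:2] ** 2).sum() / (sv ** 2).sum()
print(row_2d)
print(explained)
```

For this table, explained comes out above 0.99, which is why a two-dimensional display loses very little information.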
The following lines load the modules and data needed for the analysis. The analysis is started in the last line.
After the analysis finishes, the results can be inspected:
Points that are close to each other in the two-dimensional display are similar with regard to the pattern of relative frequencies across the columns, i.e. they have similar row profiles. In the plot, the Senior Employees and Secretaries are relatively close together along the first and most important axis. The row profiles show the same thing: these two groups of employees have very similar patterns of relative frequencies across the categories of smoking intensity.
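The similarity of row profiles can be checked directly, without the module. The sketch below computes each row profile and then the chi-square distance that correspondence analysis uses between profiles. The names are illustrative, and the Heavy count for Senior Employees is taken as 4 so that the row sums to 51.

```python
groups = ["Senior managers", "Junior managers", "Senior employees",
          "Junior employees", "Secretaries"]
table = [[4, 2, 3, 2],
         [4, 3, 7, 4],
         [25, 10, 12, 4],
         [18, 24, 33, 13],
         [10, 6, 7, 2]]

total = sum(sum(row) for row in table)
# Column masses: column totals divided by the grand total
col_mass = [sum(row[j] for row in table) / total for j in range(4)]

# Row profile: each count divided by its row total
profiles = [[x / sum(row) for x in row] for row in table]

def chi2_distance(p, q):
    """Chi-square distance between two profiles, weighted by the column masses."""
    return sum((a - b) ** 2 / m for a, b, m in zip(p, q, col_mass)) ** 0.5

ref = groups.index("Senior employees")
distances = {g: chi2_distance(profiles[ref], profiles[i])
             for i, g in enumerate(groups) if i != ref}
nearest = min(distances, key=distances.get)
print(nearest)
```

Of the four other groups, Secretaries have the profile closest to that of Senior Employees.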
Lines 17-19 print out the singular values, eigenvalues, and percentages of inertia explained. These values are important for deciding how many axes are needed to represent the data. The dimensions are "extracted" to maximize the distances between the row or column points, and successive dimensions "explain" less and less of the overall inertia.
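How these three quantities relate can be sketched with numpy alone. The eigenvalues (principal inertias) are the squared singular values, and the percentages are their shares of the total inertia. This is an illustration under the usual definitions, not the module's listing, and the variable names are assumptions.

```python
import numpy as np

# Smoking table; Heavy count for Senior Employees taken as 4 (row total 51).
N = np.array([[4, 2, 3, 2],
              [4, 3, 7, 4],
              [25, 10, 12, 4],
              [18, 24, 33, 13],
              [10, 6, 7, 2]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# A table has at most min(rows, columns) - 1 non-trivial dimensions.
k = min(N.shape) - 1
sv = np.linalg.svd(S, compute_uv=False)[:k]   # singular values, largest first
eigenvalues = sv ** 2                          # inertia of each axis
percentages = 100 * eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(percentages)

for i in range(k):
    print(i + 1, sv[i], eigenvalues[i], percentages[i], cumulative[i])
```

The cumulative column is the one to read when deciding how many axes to keep: the first axis whose cumulative percentage is judged high enough marks the cut.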
Lines 21-22 print out the principal row coordinates with respect to the first two axes, and lines 24-25 show the decomposition of inertia.
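One reading of the decomposition of inertia is sketched below: the total inertia is split into one share per row or one share per column. The function decompose_inertia is hypothetical. Its axis argument follows the convention described in this text (0 for rows, 1 for columns), which is an assumption about the intended behaviour and may differ from what the module actually returns.

```python
import numpy as np

def decompose_inertia(table, axis=0):
    """Hypothetical helper: inertia of each row (axis=0) or each column (axis=1)."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    expected = np.outer(r, c)                 # independence model
    cell = (P - expected) ** 2 / expected     # chi-square contribution of each cell
    # numpy sums *over* the given axis, so per-row totals need axis=1
    return cell.sum(axis=1) if axis == 0 else cell.sum(axis=0)

# Smoking table; Heavy count for Senior Employees taken as 4 (row total 51).
N = [[4, 2, 3, 2],
     [4, 3, 7, 4],
     [25, 10, 12, 4],
     [18, 24, 33, 13],
     [10, 6, 7, 2]]

by_rows = decompose_inertia(N, axis=0)
by_cols = decompose_inertia(N, axis=1)
print(by_rows, by_cols)
```

Both decompositions add up to the same total inertia. For this table the largest shares belong to the Senior Employees row and the None column.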
The last two statements plot a scree diagram and a biplot. A scree diagram is a plot of the amount of inertia accounted for by successive dimensions, i.e. the percentage of inertia plotted against the components in order of magnitude from largest to smallest. It is usually used to identify the components that contribute the most inertia: one looks for a change in slope in the diagram, selects the components before it, and discards the remaining ones, which seem to be mere debris at the bottom of the slope. A biplot is a plot of row and column points in a two-dimensional space.
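The numbers behind a scree diagram can be sketched without any plotting library. The lines below are a text-only stand-in for the plot, not the module's plotting method: they print one bar per axis, with the bar length proportional to the percentage of inertia. The names and the scale of one '#' per 2% are arbitrary choices for illustration.

```python
import numpy as np

# Smoking table; Heavy count for Senior Employees taken as 4 (row total 51).
N = np.array([[4, 2, 3, 2],
              [4, 3, 7, 4],
              [25, 10, 12, 4],
              [18, 24, 33, 13],
              [10, 6, 7, 2]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
sv = np.linalg.svd(S, compute_uv=False)[: min(N.shape) - 1]
pct = 100 * sv ** 2 / (sv ** 2).sum()   # percentage of inertia per axis

lines = []
for k, p in enumerate(pct, start=1):
    bar = "#" * int(round(float(p) / 2))   # one '#' per 2% of inertia
    lines.append(f"axis {k}: {p:6.2f}% {bar}")
print("\n".join(lines))
```

The sharp drop after the first bar is the change in slope that a scree diagram is meant to reveal.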