Module orngLR is a set of wrappers around classes LogisticLearner and
LogisticClassifier, that are implemented in core Orange. This module expanses
use of logistic regression to discrete attributes, it helps handling various
anomalies in attributes, such as constant variables and singularities, that
makes fitting logistic regression almost impossible. It also implements a
function for constructing a stepwise logistic regression, which is a good
technique to prevent overfitting, and is a good feature subset selection
technique as well.
Functions
- LogRegLearner([examples=None, weightID=0, removeSingular=0, fitter = None, stepwiseLR = 0, addCrit=0.2, deleteCrit=0.3, numAttr=-1])
-
Returns a LogisticClassifier if examples are given. If
examples
are
not specified, an instance of object LogisticLearner with its parameters
appropriately initialized is returned. Parameter weightID
defines the ID of the weight meta attribute. Set parameter removeSingular
to 1,if you want automatic removal of disturbing attributes, such as constants and
singularities. Examples can contain discrete and continuous attributes.
Parameter fitter
is used to alternate fitting algorithm. Currently a Newton-Raphson fitting algorithm is used, however you can change it to something else. You can find bayesianFitter
in orngLR to test it out. The last three parameters addCrit, deleteCrit, numAttr
are used to set parameters for stepwise attribute selection (see next method). If you wish to use stepwise within LogRegLearner, stpewiseLR
must be set as 1.
-
- stepWiseFSS([examples=None, addCrit=0.2, deleteCrit=0.3, numAttr=-1])
-
If
examples
are specified, stepwise logistic regression implemented in stepWiseFSS_class
is performed and a list of chosen attributes is returned. If examples
are not specified an instance of stepWiseFSS_class
with all parameters set is returned. Parameters addCrit, deleteCrit
and numAttr
are explained in the description of stepWiseFSS_class
.
- bestNAtt([examples, N, addCrit=0.2, deleteCrit=0.3])
-
Returns "best"
N
attributes selected with stepwise logistic regression. Parameter examples
is required.
- printOUT([classifier])
-
Formatted print to console of all major attributes in logistic regression classifier. Parameter
classifier
is a logistic regression classifier.
Classes
- stepWiseFSS_class([examples=None, addCrit=0.2, deleteCrit=0.3, numAttr=-1])
- Performs stepwise logistic regression and returns a list of "most" informative attributes. Each step of this algorithm is composed of two parts. First is backward elimination, where each already chosen attribute is tested for significant contribution to overall model.
If worst among all tested attributes has higher significance that is specified in
deleteCrit
, this attribute is removed from the model. The second step is forward selection, which is similar to backward elimination. It loops through all attributes that are not in model and tests whether they contribute to common model with significance lower that addCrit
.
Algorithm stops when no attribute in model is to be removed and no attribute out of the model is to be added. By setting numAttr
larger than -1, algorithm will stop its execution when number of attributes in model will exceed that number.
Significances are assesed via the likelihood ration chi-square test. Normal F test is not appropriate, because errors are assumed to follow a binomial distribution.
Examples
First example shows a very simple induction of a logistic regression classifier.
import orange, orngLR
data = orange.ExampleTable("titanic")
lr = orngLR.LogRegLearner(data)
# compute classification accuracy
correct = 0.0
for ex in data:
if lr(ex) == ex.getclass():
correct += 1
print "Classification accuracy:", correct/len(data)
orngLR.printOUT(lr)
Result:
Classification accuracy: 0.778282598819
class attribute = survived
class values =
Attribute beta st. error wald Z P
Intercept 0.38 0.14 2.73 0.01
status=second 1.02 0.20 5.13 0.00
status=third 1.78 0.17 10.24 0.00
status=crew 0.86 0.16 5.39 0.00
age=child -1.06 0.25 -4.30 0.00
sex=female -2.42 0.14 -17.04 0.00
Next examples shows how to handle singularities in data sets.
import orange, orngLR
data = orange.ExampleTable("adult_sample.tab")
lr = orngLR.LogRegLearner(data, removeSingular = 1)
for ex in data[:5]:
print ex.getclass(), lr(ex)
orngLR.printOUT(lr)
Result:
removing education=Preschool
removing education-num
removing workclass=Never-worked
removing native-country=Honduras
removing native-country=Thailand
removing native-country=Ecuador
removing native-country=Portugal
removing native-country=France
removing native-country=Yugoslavia
removing native-country=Trinadad&Tobago
removing native-country=Hong
removing native-country=Hungary
removing native-country=Holand-Netherlands
<=50K <=50K
<=50K <=50K
<=50K <=50K
>50K >50K
<=50K >50K
class attribute = y
class values = <>50K, <=50K>
Attribute beta st. error wald Z P
Intercept 1.39 0.82 1.71 0.09
age -0.04 0.01 -3.60 0.00
workclass=Self-emp-not-inc -0.10 0.37 -0.27 0.79
workclass=Self-emp-inc 0.49 0.50 0.97 0.33
workclass=Federal-gov -1.38 0.52 -2.69 0.01
workclass=Local-gov -0.12 0.41 -0.29 0.77
workclass=State-gov 0.30 0.47 0.63 0.53
workclass=Without-pay 2.50 2.55 0.98 0.33
fnlwgt -0.00 0.00 -0.66 0.51
education=Some-college 1.12 0.32 3.44 0.00
education=11th 1.51 0.75 2.02 0.04
education=HS-grad 1.16 0.31 3.76 0.00
education=Prof-school 0.03 1.18 0.03 0.98
education=Assoc-acdm 0.42 0.48 0.88 0.38
education=Assoc-voc 2.05 0.58 3.55 0.00
education=9th 2.67 0.99 2.71 0.01
education=7th-8th 2.04 0.77 2.64 0.01
education=12th 3.82 1.11 3.44 0.00
education=Masters -0.14 0.41 -0.34 0.73
education=1st-4th 1.05 1.52 0.69 0.49
education=10th 3.00 0.91 3.29 0.00
education=Doctorate 0.03 0.80 0.04 0.97
education=5th-6th 0.33 1.28 0.26 0.80
marital-status=Divorced 3.71 1.09 3.39 0.00
marital-status=Never-married 3.36 1.03 3.28 0.00
marital-status=Separated 3.21 1.22 2.64 0.01
marital-status=Widowed 3.66 1.20 3.04 0.00
marital-status=Married-spouse-absent 5.28 1.72 3.07 0.00
marital-status=Married-AF-spouse 4.93 5.06 0.97 0.33
occupation=Craft-repair 0.67 0.51 1.30 0.19
occupation=Other-service 1.97 0.64 3.06 0.00
occupation=Sales 0.43 0.53 0.81 0.42
occupation=Exec-managerial 0.45 0.51 0.89 0.37
occupation=Prof-specialty 0.54 0.52 1.03 0.30
occupation=Handlers-cleaners 1.71 0.74 2.33 0.02
occupation=Machine-op-inspct 1.15 0.62 1.84 0.07
occupation=Adm-clerical 0.67 0.53 1.27 0.20
occupation=Farming-fishing 1.40 0.78 1.80 0.07
occupation=Transport-moving 0.92 0.57 1.62 0.10
occupation=Priv-house-serv 2.38 1.81 1.32 0.19
occupation=Protective-serv 0.47 0.76 0.61 0.54
occupation=Armed-Forces 1.89 6.36 0.30 0.77
relationship=Own-child -0.18 1.05 -0.17 0.87
relationship=Husband 1.21 0.51 2.37 0.02
relationship=Not-in-family -0.31 1.12 -0.28 0.78
relationship=Other-relative -1.00 1.23 -0.81 0.42
relationship=Unmarried -0.47 1.17 -0.40 0.69
race=Asian-Pac-Islander -0.66 0.90 -0.74 0.46
race=Amer-Indian-Eskimo 1.65 1.91 0.86 0.39
race=Other 2.67 1.53 1.75 0.08
race=Black 0.48 0.38 1.26 0.21
sex=Male -0.18 0.37 -0.49 0.62
capital-gain -0.00 0.00 -6.74 0.00
capital-loss -0.00 0.00 -2.96 0.00
hours-per-week -0.04 0.01 -4.37 0.00
native-country=Cuba -1.04 5.24 -0.20 0.84
native-country=Jamaica -4.48 2.25 -1.99 0.05
native-country=India 1.03 1.42 0.73 0.47
native-country=Mexico 0.77 0.95 0.81 0.42
native-country=South 1.36 5.84 0.23 0.82
native-country=Puerto-Rico 0.52 5.00 0.10 0.92
native-country=England 1.50 2.40 0.63 0.53
native-country=Canada -0.68 1.41 -0.48 0.63
native-country=Germany -0.61 0.91 -0.67 0.50
native-country=Iran 3.31 3.46 0.96 0.34
native-country=Philippines 1.56 1.98 0.79 0.43
native-country=Italy -2.10 1.40 -1.50 0.13
native-country=Poland 0.84 2.60 0.32 0.75
native-country=Columbia -1.93 1.78 -1.08 0.28
native-country=Cambodia 1.19 5.52 0.21 0.83
native-country=Laos 3.40 3.34 1.02 0.31
native-country=Taiwan -0.14 1.81 -0.08 0.94
native-country=Haiti -3.22 2.07 -1.55 0.12
native-country=Dominican-Republic -3.90 7.85 -0.50 0.62
native-country=El-Salvador 1.23 2.87 0.43 0.67
native-country=Guatemala -1.17 5.41 -0.22 0.83
native-country=China 0.95 1.69 0.56 0.58
native-country=Japan -2.31 3.31 -0.70 0.48
native-country=Peru -0.40 4.39 -0.09 0.93
native-country=Outlying-US(Guam-USVI-etc) 0.95 4.78 0.20 0.84
native-country=Scotland 1.88 3.15 0.60 0.55
native-country=Greece -0.12 5.00 -0.02 0.98
native-country=Nicaragua 1.26 2.80 0.45 0.65
native-country=Vietnam 2.33 2.53 0.92 0.36
native-country=Ireland -1.00 1.45 -0.69 0.49
In case we set removeSingular
to 0, inducing logistic regression classifier would return an error:
Traceback (most recent call last):
File "C:\Python23\lib\site-packages\Pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
exec codeObject in __main__.__dict__
File "C:\Python23\Lib\site-packages\logreg1.py", line 4, in ?
lr = Ndomain.LogisticLearner(data, removeSingular = 1)
File "C:\Python23\Lib\site-packages\Ndomain.py", line 49, in LogisticLearner
return lr(examples)
File "C:\Python23\Lib\site-packages\Ndomain.py", line 68, in __call__
lr = orange.LogisticLearner(nexamples, showSingularity = self.showSingularity)
KernelException: 'orange.LogisticFitterMinimization': singularity in education=Preschool
We can see that attribute education=Preschool
is causeing singularity. The issue of this is that we should remove Preschool
manually or leave it to function LogRegLearner
to remove it automatically. In the last example it is shown, how the use of stepwise logistic regression can help us in achieving better classification.
import orange, orngFSS, orngTest, orngStat, orngLR
def StepWiseFSS_Filter(examples = None, **kwds):
"""
check function StepWiseFSS()
"""
filter = apply(StepWiseFSS_Filter_class, (), kwds)
if examples:
return filter(examples)
else:
return filter
class StepWiseFSS_Filter_class:
def __init__(self, addCrit=0.2, deleteCrit=0.3, numAttr = -1):
self.addCrit = addCrit
self.deleteCrit = deleteCrit
self.numAttr = numAttr
def __call__(self, examples):
attr = orngLR.StepWiseFSS(examples, addCrit=self.addCrit, deleteCrit = self.deleteCrit, numAttr = self.numAttr)
return examples.select(orange.Domain(attr, examples.domain.classVar))
data = orange.ExampleTable("d:\\data\\ionosphere.tab")
lr = orngLR.LogRegLearner(removeSingular=1)
learners = (orngLR.LogRegLearner(name='logistic', removeSingular=1),
orngFSS.FilteredLearner(lr, filter=StepWiseFSS_Filter(addCrit=0.05, deleteCrit=0.9), name='filtered'))
results = orngTest.crossValidation(learners, data, storeClassifiers=1)
# output the results
print "Learner CA"
for i in range(len(learners)):
print "%-12s %5.3f" % (learners[i].name, orngStat.CA(results)[i])
# find out which attributes were retained by filtering
print "\nNumber of times attributes were used in cross-validation:"
attsUsed = {}
for i in range(10):
for a in results.classifiers[i][1].atts():
if a.name in attsUsed.keys(): attsUsed[a.name] += 1
else: attsUsed[a.name] = 1
for k in attsUsed.keys():
print "%2d x %s" % (attsUsed[k], k)
Result:
Learner CA
logistic 0.835
filtered 0.846
Number of times attributes were used in cross-validation:
1 x a20
1 x a21
10 x a22
7 x a23
5 x a24
2 x a25
10 x a26
10 x a27
3 x a29
3 x a17
1 x a16
4 x a12
2 x a32
7 x a15
10 x a14
10 x a31
8 x a30
10 x a11
1 x a10
1 x a13
10 x a34
1 x a18
10 x a3
10 x a5
5 x a4
3 x a7
7 x a6
7 x a9
10 x a8
References
David W. Hosmer, Stanley Lemeshow. Applied Logistic Regression - 2nd ed. Wiley, New York, 2000