C# .NET Algorithm for Variable Selection Based on the
Mallow’s Cp Criterion
Jessie Chen, MEng.
Massachusetts Institute of Technology, Cambridge, MA
jic@mit.edu
Abstract: Variable selection techniques are important
in statistical modeling because they seek to reduce the chance of overfitting
the data while minimizing the effects of omission bias. The linear, or
ordinary least squares (OLS), regression model is particularly useful in
variable selection because of its association with certain optimality
criteria. One of these is Mallows’ Cp criterion, which evaluates the fit of a
regression model by the squared distance between its predictions and the true
values. The first part of this project implements an algorithm in C# .NET for
variable selection using Mallows’ Cp criterion and tests the viability of a
greedy version of the algorithm for reducing computational cost. The second
part aims to verify the results of the algorithm through logistic regression.
The results supported the use of a greedy algorithm, and the logistic
regression models confirmed the Mallows’ Cp results. However, further study of
the details of the Mallows’ Cp algorithm, a calibrated logistic regression
modeling process, and perhaps the incorporation of techniques such as
cross-validation may be useful before drawing final conclusions about the
reliability of the implemented algorithm.
Keywords: variable selection; overfitting; omission bias; linear
least squares regression; Mallows’ Cp; logistic regression; C-Index
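For reference, a commonly used form of the Cp statistic for a candidate model is sketched below. The notation is an assumption made here for illustration and may differ from the notation used in the paper: n is the number of observations, p is the number of parameters in the candidate model (including the intercept), SSE_p is its residual sum of squares, and \hat{\sigma}^2 is the error variance estimated from the full model.

```latex
C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} - n + 2p,
\qquad
\mathrm{SSE}_p = \sum_{i=1}^{n} \left( y_i - \hat{y}_i^{(p)} \right)^2
```

Under this form, a candidate model with little bias has a Cp value close to p, so subsets with a small Cp near p are usually preferred.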
Paper (pdf)
Appendices:
A: C# Code Implementation
B: Pima Indian Dataset
C: CpTable.txt
D: CpTableAll.txt
E: Logistic Regression Results