Description
Multivariable logistic regression finds the combination of model coefficients which maximises the log-likelihood of the data. However, while mathematically convenient, the log-likelihood may not be the most meaningful objective function from an end user’s perspective. One reason for this is that the cost of a ‘misclassification’ under the log-likelihood grows without bound as the fitted value approaches zero (for positive cases) or one (for negative cases), effectively imposing a severe penalty on ‘outliers’.
Because of this, we are investigating binary classifier methodologies that are more robust to outliers than the standard logistic regression workhorse. One option is the mean squared error of the difference between the response and the fitted value, which imposes a far lower cost on outliers than the log-likelihood. Another appealing avenue of research is to consider objective functions which closely reflect common measures of model quality such as the Area under the Receiver Operating Characteristic (ROC) curve, the ‘AUC’.
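To make the contrast concrete, the following minimal Python sketch compares the per-observation cost of a positive case under the two objectives as the fitted value approaches zero. The function names are illustrative only and are not part of any project code.

```python
import math

def log_loss_cost(y, p):
    """Negative log-likelihood contribution of one observation (y in {0, 1}, 0 < p < 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error_cost(y, p):
    """Squared difference between the response and the fitted value."""
    return (y - p) ** 2

# A positive case (y = 1) whose fitted value drifts towards zero:
# the log-likelihood cost is unbounded, the squared error never exceeds 1.
for p in (0.5, 0.1, 0.01, 0.001):
    print(p, round(log_loss_cost(1, p), 3), round(squared_error_cost(1, p), 3))
```

The squared error of a single observation is bounded above by 1, whereas the log-likelihood cost has no upper bound.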
However, direct use of the AUC as an objective function for a binary classifier comes with barriers to implementation. The AUC is a rank-based function: it is piecewise constant in the model coefficients and changes only when the ordering of the fitted values changes, so it contains discontinuities and its gradient is zero almost everywhere. This makes gradient-descent-based methods for its optimisation challenging, especially with high-dimensional data. Further, the calculation of the AUC is of complexity O(n log n), which limits its applicability to large (and high-dimensional) datasets.
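Both barriers can be seen in a small Python sketch that computes the AUC from ranks via the Mann-Whitney U statistic. This is one standard way to compute the AUC, shown here as an illustration under that assumption; the function name is hypothetical.

```python
def auc_by_ranks(y, scores):
    """AUC via the Mann-Whitney U statistic, using midranks for tied scores."""
    n = len(y)
    # Sorting dominates the cost: O(n log n).
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied scores.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    n_pos = sum(y)
    n_neg = n - n_pos
    rank_sum_pos = sum(r for r, label in zip(ranks, y) if label == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

y = [0, 0, 1, 1]
print(auc_by_ranks(y, [0.1, 0.4, 0.35, 0.8]))  # one positive ranked below one negative
print(auc_by_ranks(y, [0.1, 0.4, 0.36, 0.8]))  # small change, same ordering, same AUC
print(auc_by_ranks(y, [0.1, 0.4, 0.41, 0.8]))  # ordering changes, AUC jumps
```

Moving the third score from 0.35 to 0.36 leaves the AUC unchanged, while moving it past 0.4 makes the AUC jump: the function gives no gradient signal between rank changes.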
This exciting and highly novel project will initially focus on the development of rank-based objective functions which emulate the AUC for binary classifiers, overcoming the above barriers. Comparison will be made with logistic regression, as well as its robust extensions, including Firth regression and LASSO regression. Extensions will also be made to the partial AUC, and to emulation of the concordance index for time-to-event regression.
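As an illustration only of what ‘emulating’ the AUC can mean, the Python sketch below replaces the pairwise indicator (positive score greater than negative score) with a sigmoid of the score difference, a commonly used smooth surrogate. This is an assumed example, not the method the project will develop; the function names and the smoothing parameter tau are hypothetical. The naive double loop costs O(n_pos × n_neg), so it addresses the discontinuity barrier but not the complexity one.

```python
import math

def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def smooth_auc(y, scores, tau=0.1):
    """Sigmoid-smoothed AUC: mean of sigmoid((s_pos - s_neg) / tau) over all pairs.

    tau is an assumed smoothing parameter; as tau -> 0 the value approaches
    the exact AUC (when no positive and negative scores are tied).
    """
    pos = [s for s, label in zip(scores, y) if label == 1]
    neg = [s for s, label in zip(scores, y) if label == 0]
    total = 0.0
    for sp in pos:
        for sn in neg:
            total += _sigmoid((sp - sn) / tau)
    return total / (len(pos) * len(neg))

y = [0, 0, 1, 1]
print(smooth_auc(y, [0.1, 0.4, 0.35, 0.8], tau=0.1))
print(smooth_auc(y, [0.1, 0.4, 0.36, 0.8], tau=0.1))    # responds to a small change in score
print(smooth_auc(y, [0.1, 0.4, 0.35, 0.8], tau=0.001))  # close to the exact AUC of 0.75
```

Unlike the exact AUC, the surrogate changes smoothly when a score moves without changing the ranking, which is what makes gradient-based optimisation possible.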
This project is ideally suited to a PhD candidate with a strong interest in the development of new statistical methodologies, a passion for statistical innovation, and a solid background in statistical programming, especially in R.
Essential criteria:
Minimum entry requirements can be found here: https://www.monash.edu/admissions/entry-requirements/minimum
Keywords
Statistics, biostatistics, binary classification
School
School of Public Health and Preventive Medicine
Available options
PhD/Doctorate
Time commitment
Full-time
Part-time
Top-up scholarship funding available
No
Physical location
Alfred Centre, School of Public Health & Preventive Medicine
Research webpage
Co-supervisors
Dr Alan Herschtal