COMBSS: Scalable Best Subset Selection for Generalised Linear Models

An interactive demonstration — StatFest 2026

Author

Sarat Moka (UNSW Sydney)

Published

May 22, 2026

Abstract

Best subset selection — identifying the optimal \(k\) predictors from \(p\) candidates — is fundamental for building interpretable and parsimonious statistical models, but the underlying combinatorial problem is NP-hard.

COMBSS (Continuous Optimisation for Best Subset Selection) overcomes this barrier by reformulating the discrete problem as a continuous optimisation, making it scalable to high-dimensional settings where \(p\) far exceeds \(n\).

This presentation walks through the framework for linear, logistic, and multinomial regression, illustrates its performance on simulated data and two real biomedical applications — cancer gene-expression classification (\(p = 2{,}308\) genes) and a rice GWAS (\(p \approx 158{,}000\) SNPs) — and demonstrates the open-source combss (R and Python) packages with practical examples.
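To make the combinatorial cost mentioned in the abstract concrete, here is a minimal sketch of exhaustive best subset selection for linear regression on a small simulated problem. It is a brute-force baseline for illustration only, not the COMBSS algorithm and not the API of the combss packages. The function name `best_subset_exhaustive` and the simulated data are assumptions introduced for this example.

```python
from itertools import combinations
from math import comb

import numpy as np


def best_subset_exhaustive(X, y, k):
    """Brute force: fit least squares on every size-k column subset
    and return the subset with the smallest residual sum of squares.
    (Hypothetical helper for illustration; not part of combss.)"""
    best_rss, best_subset = np.inf, None
    for subset in combinations(range(X.shape[1]), k):
        Xs = X[:, list(subset)]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        rss = float(np.sum((y - Xs @ beta) ** 2))
        if rss < best_rss:
            best_rss, best_subset = rss, subset
    return best_subset, best_rss


# Small simulated problem: 3 truly active predictors out of p = 12.
rng = np.random.default_rng(0)
n, p, k = 100, 12, 3
X = rng.standard_normal((n, p))
true_support = (1, 4, 7)
coefs = np.array([3.0, -2.0, 1.5])
y = X[:, list(true_support)] @ coefs + 0.1 * rng.standard_normal(n)

subset, rss = best_subset_exhaustive(X, y, k)
print("selected subset:", subset)
print("models fitted:", comb(p, k))  # C(12, 3) = 220 least-squares fits
```

The search fits one model per subset, so the work grows as \(\binom{p}{k}\). That is manageable at \(p = 12\) but out of reach at the dimensions quoted above, such as \(p = 2{,}308\), which is the gap a continuous reformulation aims to close.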

Research collaborators

Name | Affiliation
Sarat Moka (presenter) | UNSW Sydney
Zdravko Botev | UNSW Sydney
Benoit Liquet | Université de Pau et des Pays de l’Adour & Macquarie University
Anant Mathur | UNSW Sydney
Samuel Muller | Macquarie University
Houying Zhu | Macquarie University

How this demonstration is structured

Each section below has its own page; navigate via the top menu or click through in order.

Section | Pages
Motivation: what a sparsity-constrained GLM is, and why MIO and lasso are not the last word | Sparse GLM · MIO · Lasso · COMBSS (our method)
Methodology: Boolean relaxation and the homotopy Frank-Wolfe algorithm | Relaxation · Homotopy Frank-Wolfe
Demos: five runnable demos in R and Python | Linear sim · HD logistic · Khan SRBCT · Rice GWAS · Comparisons
Use it yourself | Install · References

Packages