| Age | Weight | Height | Neck | Chest | Abdomen | Hip | Thigh | Knee | Ankle | Biceps | Forearm | Wrist | Bodyfat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23 | 70.1 | 172.1 | 36.2 | 93.1 | 85.2 | 94.5 | 59.0 | 37.3 | 21.9 | 32.0 | 27.4 | 17.1 | 12.3 |
| 22 | 78.8 | 183.5 | 38.5 | 93.6 | 83.0 | 98.7 | 58.7 | 37.3 | 23.4 | 30.5 | 28.9 | 18.2 | 6.1 |
| 22 | 70.0 | 168.3 | 34.0 | 95.8 | 87.9 | 99.2 | 59.6 | 38.9 | 24.0 | 28.8 | 25.2 | 16.6 | 25.3 |
| 26 | 83.5 | 183.5 | 37.4 | 101.8 | 86.4 | 101.2 | 60.1 | 37.3 | 22.8 | 32.4 | 29.4 | 18.2 | 10.4 |
| 24 | 80.4 | 180.3 | 39.0 | 97.3 | 100.0 | 101.9 | 63.2 | 42.2 | 24.0 | 32.2 | 27.7 | 17.7 | 28.7 |
Sparse-constrained GLM
The problem we want to solve
A familiar starting point: bodyfat
Suppose we want to predict body-fat percentage from physical measurements. The classical dataset (Penrose et al. 1985) records 14 measurements on 252 men:
Which handful of these measurements is enough to predict body fat well? That is best subset selection.
Which \(k\) columns of \(\boldsymbol{X}\) best predict \(\boldsymbol{y}\)?
Generalising to GLMs
The same question arises whenever the response is well modelled by a generalised linear model (GLM):
| Response type | Family | Examples |
|---|---|---|
| Continuous | Gaussian | bodyfat, gene-expression intensity |
| Binary | Bernoulli (logistic) | disease yes/no, GWAS case/control |
| Multinomial | softmax (multinomial logistic) | cancer subtypes |
| Count | Poisson | event counts |
In every case, fitting the model amounts to maximising a log-likelihood:
\[ \ell(\beta_0, \boldsymbol{\beta}) \;=\; \sum_{i=1}^n \log p(y_i \mid x_i; \beta_0, \boldsymbol{\beta}). \]
For Gaussian, this reduces to least squares; for Bernoulli, to logistic regression; and so on. The unifying framework lets one method handle them all — provided it can deal with the sparsity constraint we are about to add.
The optimisation we are trying to solve
For a prescribed model size \(k\), find the \(k\) predictors that best fit the data:
\[ \boxed{\; \begin{aligned} &\min_{\beta_0 \in \mathbb{R},\; \boldsymbol{\beta} \in \mathbb{R}^p}\;\; -\tfrac{1}{n}\,\ell(\beta_0, \boldsymbol{\beta}) \\ &\text{subject to} \quad \|\boldsymbol{\beta}\|_0 := \sum_{j = 1}^p I(\beta_j \neq 0) = k. \end{aligned} \;} \]
Two things to notice:
- The objective is the negative GLM log-likelihood.
- The constraint is the \(\ell_0\)-norm cardinality: count the non-zero entries, keep at most \(k\).
The user supplies \(k\). Output: which \(k\) features, and their fitted coefficients.
This is the problem COMBSS focuses on.
Why it is hard
For a fixed model size \(k\), the number of subsets to enumerate is \(\binom{p}{k}\). Across the three settings featured later in this talk:
| Setting | p | k | Subsets of size k |
|---|---|---|---|
| Bodyfat | 13 | 5 | 1,287 |
| Khan SRBCT | 2,308 | 12 | ~5 × 1031 |
| Rice GWAS | 158,210 | 10 | ~3 × 1045 |
Enumeration is fine at \(p = 13\), but even with the right \(k\) given to us, the number of \(k\)-subsets explodes to roughly \(5 \times 10^{31}\) at \(p = 2308\), and to \(3 \times 10^{45}\) at \(p \approx 1.6 \times 10^5\). The combinatorial blowup makes exhaustive search hopeless within the lifetime of the universe.
In fact the problem is NP-hard (Natarajan 1995). No polynomial-time algorithm is expected for the worst case unless P = NP.
Where this matters
The setting \(p \gg n\) with a sparse truth is the bread and butter of modern applied statistics:
- Genomics — GWAS with \(10^5\) SNPs, gene-expression panels with \(10^3\) to \(10^4\) probes.
- Survey data — patient registries with hundreds of recorded variables and a few hundred outcomes.
- Sensor data — many channels, few labelled trials.
- Financial markets — predicting asset returns or default risk from hundreds of candidate factors (macroeconomic indicators, fundamentals, sentiment, technical signals); a small interpretable factor model is preferred over a dense black-box.
Every one of these calls for a sparsity-constrained GLM, not just a regularised one.
The plan from here
How do existing methods cope with the NP-hard constraint? Two main strategies:
- MIO (next page) — Mixed-Integer Optimisation: solve the discrete problem directly. Exact, but does not scale much beyond a few hundred predictors.
- Lasso (page after) — relax \(\|\boldsymbol{\beta}\|_0\) to the convex \(\|\boldsymbol{\beta}\|_1\). Fast and popular, but biased and indexed by \(\lambda\), not \(k\).
COMBSS sits between the two: continuous like the lasso, but on the support indicator rather than on \(\boldsymbol{\beta}\) itself — and explicitly \(k\)-indexed.