Sparse-constrained GLM

The problem we want to solve

A familiar starting point: bodyfat

Suppose we want to predict body-fat percentage from physical measurements. The classical dataset (Penrose et al. 1985) records 14 measurements on 252 men:

Bodyfat data — first 5 rows. Response Bodyfat (%) in last column.
Age Weight Height Neck Chest Abdomen Hip Thigh Knee Ankle Biceps Forearm Wrist Bodyfat
23 70.1 172.1 36.2 93.1 85.2 94.5 59.0 37.3 21.9 32.0 27.4 17.1 12.3
22 78.8 183.5 38.5 93.6 83.0 98.7 58.7 37.3 23.4 30.5 28.9 18.2 6.1
22 70.0 168.3 34.0 95.8 87.9 99.2 59.6 38.9 24.0 28.8 25.2 16.6 25.3
26 83.5 183.5 37.4 101.8 86.4 101.2 60.1 37.3 22.8 32.4 29.4 18.2 10.4
24 80.4 180.3 39.0 97.3 100.0 101.9 63.2 42.2 24.0 32.2 27.7 17.7 28.7

Which handful of these measurements is enough to predict body fat well? That is best subset selection.

y response X p candidate predictors (here, p = 13); k = 3 selected (teal)

Which \(k\) columns of \(\boldsymbol{X}\) best predict \(\boldsymbol{y}\)?

Generalising to GLMs

The same question arises whenever the response is well modelled by a generalised linear model (GLM):

Response type Family Examples
Continuous Gaussian bodyfat, gene-expression intensity
Binary Bernoulli (logistic) disease yes/no, GWAS case/control
Multinomial softmax (multinomial logistic) cancer subtypes
Count Poisson event counts

In every case, fitting the model amounts to maximising a log-likelihood:

\[ \ell(\beta_0, \boldsymbol{\beta}) \;=\; \sum_{i=1}^n \log p(y_i \mid x_i; \beta_0, \boldsymbol{\beta}). \]

For Gaussian, this reduces to least squares; for Bernoulli, to logistic regression; and so on. The unifying framework lets one method handle them all — provided it can deal with the sparsity constraint we are about to add.

The optimisation we are trying to solve

For a prescribed model size \(k\), find the \(k\) predictors that best fit the data:

\[ \boxed{\; \begin{aligned} &\min_{\beta_0 \in \mathbb{R},\; \boldsymbol{\beta} \in \mathbb{R}^p}\;\; -\tfrac{1}{n}\,\ell(\beta_0, \boldsymbol{\beta}) \\ &\text{subject to} \quad \|\boldsymbol{\beta}\|_0 := \sum_{j = 1}^p I(\beta_j \neq 0) = k. \end{aligned} \;} \]

Two things to notice:

  • The objective is the negative GLM log-likelihood.
  • The constraint is the \(\ell_0\)-norm cardinality: count the non-zero entries, keep at most \(k\).

The user supplies \(k\). Output: which \(k\) features, and their fitted coefficients.

This is the problem COMBSS focuses on.

Why it is hard

For a fixed model size \(k\), the number of subsets to enumerate is \(\binom{p}{k}\). Across the three settings featured later in this talk:

Setting p k Subsets of size k
Bodyfat 13 5 1,287
Khan SRBCT 2,308 12 ~5 × 1031
Rice GWAS 158,210 10 ~3 × 1045

Enumeration is fine at \(p = 13\), but even with the right \(k\) given to us, the number of \(k\)-subsets explodes to roughly \(5 \times 10^{31}\) at \(p = 2308\), and to \(3 \times 10^{45}\) at \(p \approx 1.6 \times 10^5\). The combinatorial blowup makes exhaustive search hopeless within the lifetime of the universe.

In fact the problem is NP-hard (Natarajan 1995). No polynomial-time algorithm is expected for the worst case unless P = NP.

Where this matters

The setting \(p \gg n\) with a sparse truth is the bread and butter of modern applied statistics:

  • Genomics — GWAS with \(10^5\) SNPs, gene-expression panels with \(10^3\) to \(10^4\) probes.
  • Survey data — patient registries with hundreds of recorded variables and a few hundred outcomes.
  • Sensor data — many channels, few labelled trials.
  • Financial markets — predicting asset returns or default risk from hundreds of candidate factors (macroeconomic indicators, fundamentals, sentiment, technical signals); a small interpretable factor model is preferred over a dense black-box.

Every one of these calls for a sparsity-constrained GLM, not just a regularised one.

The plan from here

How do existing methods cope with the NP-hard constraint? Two main strategies:

  • MIO (next page) — Mixed-Integer Optimisation: solve the discrete problem directly. Exact, but does not scale much beyond a few hundred predictors.
  • Lasso (page after) — relax \(\|\boldsymbol{\beta}\|_0\) to the convex \(\|\boldsymbol{\beta}\|_1\). Fast and popular, but biased and indexed by \(\lambda\), not \(k\).

COMBSS sits between the two: continuous like the lasso, but on the support indicator rather than on \(\boldsymbol{\beta}\) itself — and explicitly \(k\)-indexed.