Sparse-constrained GLM

The problem we want to solve

A familiar starting point: bodyfat

Suppose we want to predict body-fat percentage from physical measurements. The classical dataset (Penrose et al. 1985) records 14 variables for 252 men, namely 13 candidate predictors plus the body-fat response:

Bodyfat data — first 5 rows. Response **Bodyfat** (%) in last column.
Age	Weight	Height	Neck	Chest	Abdomen	Hip	Thigh	Knee	Ankle	Biceps	Forearm	Wrist	Bodyfat
23	70.1	172.1	36.2	93.1	85.2	94.5	59.0	37.3	21.9	32.0	27.4	17.1	12.3
22	78.8	183.5	38.5	93.6	83.0	98.7	58.7	37.3	23.4	30.5	28.9	18.2	6.1
22	70.0	168.3	34.0	95.8	87.9	99.2	59.6	38.9	24.0	28.8	25.2	16.6	25.3
26	83.5	183.5	37.4	101.8	86.4	101.2	60.1	37.3	22.8	32.4	29.4	18.2	10.4
24	80.4	180.3	39.0	97.3	100.0	101.9	63.2	42.2	24.0	32.2	27.7	17.7	28.7

Which handful of these measurements is enough to predict body fat well? That is best subset selection.

Which $k$ columns of $\boldsymbol{X}$ best predict $\boldsymbol{y}$?

Generalising to GLMs

The same question arises whenever the response is well modelled by a generalised linear model (GLM):

Response type	Family	Examples
Continuous	Gaussian	bodyfat, gene-expression intensity
Binary	Bernoulli (logistic)	disease yes/no, GWAS case/control
Multinomial	softmax (multinomial logistic)	cancer subtypes
Count	Poisson	event counts

In every case, fitting the model amounts to maximising a log-likelihood:

\[ \ell(\beta_0, \boldsymbol{\beta}) \;=\; \sum_{i=1}^n \log p(y_i \mid x_i; \beta_0, \boldsymbol{\beta}). \]

For Gaussian, this reduces to least squares; for Bernoulli, to logistic regression; and so on. The unifying framework lets one method handle them all — provided it can deal with the sparsity constraint we are about to add.
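
To make the shared objective concrete, here is a minimal sketch of the averaged negative log-likelihood for three of the families in the table. The function name `neg_loglik` and the `family` strings are illustrative choices, not part of any package; the Gaussian case assumes unit variance, and additive constants that do not depend on the parameters are dropped.

```python
import numpy as np

def neg_loglik(family, X, y, beta0, beta):
    """Averaged negative log-likelihood, -(1/n) * l(beta0, beta), up to constants."""
    eta = beta0 + X @ beta  # linear predictor, one value per observation
    if family == "gaussian":
        # Unit variance assumed: the objective is half the mean squared error
        return 0.5 * np.mean((y - eta) ** 2)
    if family == "bernoulli":
        # y in {0, 1}, logit link; log(1 + exp(eta)) computed stably
        return np.mean(np.logaddexp(0.0, eta) - y * eta)
    if family == "poisson":
        # Log link; the log(y!) term is dropped as it is parameter-free
        return np.mean(np.exp(eta) - y * eta)
    raise ValueError(f"unsupported family: {family}")
```

In the Gaussian branch the objective is exactly half the mean squared error, which is why maximising the likelihood and minimising least squares select the same coefficients.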

The optimisation we are trying to solve

For a prescribed model size $k$, find the $k$ predictors that best fit the data:

\[ \boxed{\; \begin{aligned} &\min_{\beta_0 \in \mathbb{R},\; \boldsymbol{\beta} \in \mathbb{R}^p}\;\; -\tfrac{1}{n}\,\ell(\beta_0, \boldsymbol{\beta}) \\ &\text{subject to} \quad \|\boldsymbol{\beta}\|_0 := \sum_{j = 1}^p I(\beta_j \neq 0) = k. \end{aligned} \;} \]

Two things to notice:

The objective is the negative GLM log-likelihood.
The constraint is the $\ell_0$ cardinality (a count rather than a true norm): count the non-zero entries of $\boldsymbol{\beta}$ and require exactly $k$ of them.

The user supplies $k$. Output: which $k$ features, and their fitted coefficients.

This is the problem COMBSS focuses on.
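
For very small $p$, the boxed problem can be solved by brute force, which gives a useful reference point. The sketch below does this for the Gaussian case only, by fitting least squares on every size-$k$ subset. It is an exhaustive baseline on synthetic data, not the COMBSS algorithm, and the function name `best_subset_gaussian` is a hypothetical helper introduced for illustration.

```python
from itertools import combinations
import numpy as np

def best_subset_gaussian(X, y, k):
    """Exhaustive best-subset search for least squares with an intercept."""
    n, p = X.shape
    best_rss, best_support, best_coef = np.inf, None, None
    for support in combinations(range(p), k):
        # Design matrix: intercept column plus the k chosen predictors
        A = np.column_stack([np.ones(n), X[:, list(support)]])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = float(np.sum((y - A @ coef) ** 2))
        if rss < best_rss:
            best_rss, best_support, best_coef = rss, support, coef
    return best_support, best_coef, best_rss

# Synthetic example: only columns 1 and 4 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 1.0 + 2.0 * X[:, 1] - 3.0 * X[:, 4] + 0.1 * rng.normal(size=60)
support, coef, rss = best_subset_gaussian(X, y, k=2)
print(support, np.round(coef, 2))
```

The loop visits $\binom{p}{k}$ subsets, so this is only practical at the scale of the bodyfat example.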

Why it is hard

For a fixed model size $k$, the number of subsets to enumerate is $\binom{p}{k}$. Across the three settings featured later in this talk:

Setting	p	k	Subsets of size k
Bodyfat	13	5	1,287
Khan SRBCT	2,308	12	~5 × 10³¹
Rice GWAS	158,210	10	~3 × 10⁴⁵

Enumeration is fine at $p = 13$, but even with the right $k$ given to us, the number of $k$-subsets explodes to roughly $5 \times 10^{31}$ at $p = 2308$, and to $3 \times 10^{45}$ at $p \approx 1.6 \times 10^5$. The combinatorial blowup makes exhaustive search hopeless within the lifetime of the universe.

In fact the problem is NP-hard (Natarajan 1995): unless P = NP, no algorithm can solve every instance in polynomial time.
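
The subset counts in the table above can be reproduced with exact integer arithmetic. The sketch below also converts each count into a rough enumeration time; the rate of one billion subsets per second is an assumption made purely for illustration, not a measured figure.

```python
from math import comb

# (p, k) pairs taken from the table above
settings = {"Bodyfat": (13, 5), "Khan SRBCT": (2308, 12), "Rice GWAS": (158210, 10)}

# Assumption for illustration only: one billion subsets evaluated per second
SUBSETS_PER_SECOND = 1e9
SECONDS_PER_YEAR = 365.25 * 24 * 3600

counts = {name: comb(p, k) for name, (p, k) in settings.items()}
for name, count in counts.items():
    years = count / SUBSETS_PER_SECOND / SECONDS_PER_YEAR
    print(f"{name}: {count:.2e} subsets, about {years:.1e} years to enumerate")
```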

Where this matters

The setting $p \gg n$ with a sparse truth is the bread and butter of modern applied statistics:

Genomics — GWAS with $10^5$ SNPs, gene-expression panels with $10^3$ to $10^4$ probes.
Survey data — patient registries with hundreds of recorded variables and a few hundred outcomes.
Sensor data — many channels, few labelled trials.
Financial markets — predicting asset returns or default risk from hundreds of candidate factors (macroeconomic indicators, fundamentals, sentiment, technical signals); a small interpretable factor model is preferred over a dense black-box.

Every one of these calls for a sparsity-constrained GLM, not just a regularised one.

The plan from here

How do existing methods cope with this NP-hard problem? Two main strategies:

MIO (next page) — Mixed-Integer Optimisation: solve the discrete problem directly. Exact, but does not scale much beyond a few hundred predictors.
Lasso (page after) — relax $\|\boldsymbol{\beta}\|_0$ to the convex $\|\boldsymbol{\beta}\|_1$. Fast and popular, but biased and indexed by $\lambda$, not $k$.
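
For reference, the lasso relaxation in its usual penalised form, written with the same objective as the boxed problem above. The tuning parameter $\lambda \ge 0$ takes the place of $k$, so the size of the selected model is controlled only indirectly:

\[ \min_{\beta_0 \in \mathbb{R},\; \boldsymbol{\beta} \in \mathbb{R}^p}\;\; -\tfrac{1}{n}\,\ell(\beta_0, \boldsymbol{\beta}) \;+\; \lambda\,\|\boldsymbol{\beta}\|_1, \qquad \|\boldsymbol{\beta}\|_1 := \sum_{j=1}^p |\beta_j|. \]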

COMBSS sits between the two: continuous like the lasso, but on the support indicator rather than on $\boldsymbol{\beta}$ itself — and explicitly $k$-indexed.

