COMBSS (our method)

Where it sits between lasso and MIO

Where COMBSS fits

Three properties we want a method to have:

Continuous / smooth optimisation — tractable, gradient-based, scalable.
Targets the $\ell_0$-constrained problem — faithful to the original sparse selection task.
Returns exactly $k$ features — gives the user direct control over model size.

Lasso sits only in the smooth optimisation circle — convex and scalable, but doesn’t target $\ell_0$ and doesn’t directly control $k$.
MIO sits in targets $\ell_0$ and returns $k$ — exact but not a smooth optimisation, and not scalable.
COMBSS sits in all three.

The bridge: side by side

	Lasso	MIO	COMBSS
Variable type	$\boldsymbol{\beta} \in \mathbb{R}^p$	$\boldsymbol{s} \in \{0,1\}^p$	$\boldsymbol{t} \in [0,1]^p$
Sparsity by	$\ell_1$ penalty	hard $\ell_0$ + Big-$M$	continuous relaxation on the support indicator
Selection size	indexed by $\lambda$	$k$ via constraint	$k$ via polytope
Scales to $p \gg n$	yes	barely	yes
Biased coefficients	yes	no	no

In one sentence

COMBSS relaxes the discrete support indicator $\boldsymbol{s} \in \{0,1\}^p$ to a continuous $\boldsymbol{t} \in [0,1]^p$ on the cardinality polytope, fits a ridge GLM in the relaxed space, and runs a Frank-Wolfe homotopy that drives $\boldsymbol{t}$ to a vertex of the polytope — recovering an integer subset of size exactly $k$.

Two head-to-head questions

The simplest way to see what COMBSS buys you is to put it next to lasso on the same dataset and ask two questions.

Setup: $n = 300$ ($200$ train + $100$ validation), $p = 30$, $\sigma = 0.5$, true coefficients $\boldsymbol{\beta} = (3,\,2,\,1.5,\,1,\,0.5,\,0,\,\ldots,\,0)$, predictors (i.e., entries of the dataset) i.i.d. standard normal.

1. At a prescribed $k = 5$, how biased are the coefficients?

Coefficients at $k=5$: True (grey), Lasso at the path-$\lambda$ that gives 5 features and minimises training MSE (coral), and COMBSS (teal, OLS-refit on its $k=5$ subset). Squared $\ell_2$ error to the truth across all 30 dimensions: **lasso 0.075, COMBSS 0.009** — ~8x smaller.

COMBSS’s recovered coefficients match the truth to within sampling noise. Lasso’s are all visibly shrunken, even though both methods are looking at exactly five features.

2. For the same validation loss, how many features does each method need?

**Left:** validation MSE as a function of the number of selected features (full range). **Right:** zoom into the region $k \ge 4$ where the two methods plateau. The dashed line marks lasso’s *best-ever* validation MSE. COMBSS at $k = 5$ already sits **below** that line — lasso cannot match COMBSS’s $k=5$ performance at any model size.

At $k = 5$, COMBSS reaches a validation MSE that lasso cannot match at any model size — even with 22 features, lasso’s MSE stays above COMBSS’s value at $k = 5$.

Preview of the rest of the talk

Methodology — how the relaxation works, the Frank-Wolfe step, the curvature homotopy.
Demos — five runnable demos showing the method in action.

← Previous: Lasso

Next: Boolean relaxation →

--- title: "COMBSS (our method)" subtitle: "Where it sits between lasso and MIO" --- ## Where COMBSS fits Three properties we want a method to have: 1. **Continuous / smooth optimisation** — tractable, gradient-based, scalable. 2. **Targets the $\ell_0$-constrained problem** — faithful to the original sparse selection task. 3. **Returns exactly $k$ features** — gives the user direct control over model size. ```{=html} <div class="venn-wrap"> <svg viewBox="-40 -30 620 470" xmlns="http://www.w3.org/2000/svg" role="img" aria-label="Venn diagram showing COMBSS at the centre of three properties."> <circle cx="270" cy="120" r="125" fill="#2a9d8f" fill-opacity="0.22" stroke="#1f7a70" stroke-width="1.6"/> <circle cx="190" cy="230" r="125" fill="#e76f51" fill-opacity="0.22" stroke="#b04a31" stroke-width="1.6"/> <circle cx="350" cy="230" r="125" fill="#e9c46a" fill-opacity="0.28" stroke="#9c7d22" stroke-width="1.6"/> <text x="270" y="-12" font-size="18" font-weight="700" fill="#1f7a70" text-anchor="middle">Returns exactly k features</text> <text x="140" y="378" font-size="18" font-weight="700" fill="#b04a31" text-anchor="middle">Continuous /</text> <text x="140" y="400" font-size="18" font-weight="700" fill="#b04a31" text-anchor="middle">smooth optimisation</text> <text x="400" y="378" font-size="18" font-weight="700" fill="#9c7d22" text-anchor="middle">Targets ℓ₀-</text> <text x="400" y="400" font-size="18" font-weight="700" fill="#9c7d22" text-anchor="middle">constrained problem</text> <text x="130" y="290" font-size="19" font-style="italic" fill="#0b3a35" text-anchor="middle">Lasso</text> <text x="340" y="148" font-size="19" font-style="italic" fill="#0b3a35" text-anchor="middle">MIO</text> <text x="270" y="227" font-size="24" font-weight="700" fill="#0b3a35" text-anchor="middle">COMBSS</text> </svg> </div> ``` - **Lasso** sits only in the *smooth optimisation* circle — convex and scalable, but doesn't target $\ell_0$ and doesn't directly control $k$. - **MIO** sits in *targets $\ell_0$* and *returns $k$* — exact but not a smooth optimisation, and not scalable. - **COMBSS** sits in all three. ## The bridge: side by side | | **Lasso** | **MIO** | **COMBSS** | |---|---|---|---| | Variable type | $\boldsymbol{\beta} \in \mathbb{R}^p$ | $\boldsymbol{s} \in \{0,1\}^p$ | $\boldsymbol{t} \in [0,1]^p$ | | Sparsity by | $\ell_1$ penalty | hard $\ell_0$ + Big-$M$ | continuous relaxation on the support indicator | | Selection size | indexed by $\lambda$ | $k$ via constraint | $k$ via polytope | | Scales to $p \gg n$ | yes | barely | yes | | Biased coefficients | yes | no | no | ## In one sentence COMBSS relaxes the discrete support indicator $\boldsymbol{s} \in \{0,1\}^p$ to a continuous $\boldsymbol{t} \in [0,1]^p$ on the cardinality polytope, fits a ridge GLM in the relaxed space, and runs a Frank-Wolfe homotopy that drives $\boldsymbol{t}$ to a vertex of the polytope — recovering an integer subset of size exactly $k$. ## Two head-to-head questions The simplest way to see what COMBSS buys you is to put it next to lasso on the same dataset and ask two questions. ::: {.callout-note appearance="minimal"} Setup: $n = 300$ ($200$ train + $100$ validation), $p = 30$, $\sigma = 0.5$, true coefficients $\boldsymbol{\beta} = (3,\,2,\,1.5,\,1,\,0.5,\,0,\,\ldots,\,0)$, predictors (i.e., entries of the dataset) i.i.d. standard normal.  ::: ```{r} #| label: combss-comparison-setup #| echo: false #| message: false library(glmnet) library(combss) set.seed(2026) n <- 300; p <- 30 beta_true <- c(3, 2, 1.5, 1, 0.5, rep(0, p - 5)) X <- matrix(rnorm(n * p), n, p) y <- as.numeric(X %*% beta_true + rnorm(n) * 0.5) itr <- 1:200; iva <- 201:300 # Lasso path on training data fit_l <- glmnet(X[itr, ], y[itr], alpha = 1) nz <- colSums(coef(fit_l)[-1, , drop = FALSE] != 0) # COMBSS fit on training, validated on iva fit_c <- combss(X[itr, ], y[itr], x_val = X[iva, ], y_val = y[iva], family = "gaussian", q = 10) ``` ### 1. At a prescribed $k = 5$, how biased are the coefficients? ```{r} #| label: combss-bias #| echo: false #| fig-width: 7 #| fig-height: 3.5 #| fig-cap: "Coefficients at $k=5$: True (grey), Lasso at the path-$\\lambda$ that gives 5 features and minimises training MSE (coral), and COMBSS (teal, OLS-refit on its $k=5$ subset). Squared $\\ell_2$ error to the truth across all 30 dimensions: **lasso 0.075, COMBSS 0.009** — ~8x smaller." # Lasso at the lambda giving k=5 with smallest training MSE idx5 <- which(nz == 5) trn_mse_at5 <- sapply(idx5, function(i) { pred <- predict(fit_l, newx = X[itr, ], s = fit_l$lambda[i]) mean((y[itr] - pred)^2) }) lam5 <- fit_l$lambda[idx5[which.min(trn_mse_at5)]] b_lasso5 <- as.numeric(coef(fit_l, s = lam5))[-1] # COMBSS at k=5 with OLS refit sel5 <- fit_c$subset_list[[5]] ols <- lm(y[itr] ~ X[itr, sel5]) b_combss5 <- rep(0, p) b_combss5[sel5] <- coef(ols)[-1] ix <- 1:5 mat <- t(cbind(true = beta_true[ix], lasso = b_lasso5[ix], combss = b_combss5[ix])) colnames(mat) <- paste0("x", ix) par(mar = c(4, 4, 1, 1)) barplot(mat, beside = TRUE, col = c("grey70", "#e76f51", "#2a9d8f"), ylab = "coefficient", ylim = c(0, 3.4), border = NA) abline(h = 0, col = "grey40") legend("topright", legend = c("true", "lasso (k=5)", "COMBSS (k=5)"), fill = c("grey70", "#e76f51", "#2a9d8f"), border = NA, bty = "n", cex = 0.95) ``` COMBSS's recovered coefficients match the truth to within sampling noise. Lasso's are all visibly shrunken, even though both methods are looking at exactly five features. ### 2. For the same validation loss, how many features does each method need? ```{r} #| label: combss-mse-vs-k #| echo: false #| fig-width: 11 #| fig-height: 4 #| fig-cap: "**Left:** validation MSE as a function of the number of selected features (full range). **Right:** zoom into the region $k \\ge 4$ where the two methods plateau. The dashed line marks lasso's *best-ever* validation MSE. COMBSS at $k = 5$ already sits **below** that line — lasso cannot match COMBSS's $k=5$ performance at any model size." # Validation MSE per unique k along the lasso path ks_lasso <- sort(unique(nz)) val_mse_lasso <- sapply(ks_lasso, function(k) { i_k <- which(nz == k) trn <- sapply(i_k, function(i) { pred <- predict(fit_l, newx = X[itr, ], s = fit_l$lambda[i]) mean((y[itr] - pred)^2) }) pred_val <- predict(fit_l, newx = X[iva, ], s = fit_l$lambda[i_k[which.min(trn)]]) mean((y[iva] - pred_val)^2) }) # COMBSS val MSE per k from fit_c$mse_path ks_combss <- seq_along(fit_c$mse_path) val_mse_combss <- fit_c$mse_path # Two-panel plot: full view + zoomed view par(mfrow = c(1, 2), mar = c(4, 4, 1, 1)) # Panel 1: full view plot(ks_lasso, val_mse_lasso, type = "b", pch = 19, col = "#e76f51", xlab = "number of selected features (k)", ylab = "validation MSE", ylim = range(c(val_mse_lasso, val_mse_combss)), xlim = c(1, max(c(ks_lasso, ks_combss)))) lines(ks_combss, val_mse_combss, type = "b", pch = 17, col = "#2a9d8f") abline(h = min(val_mse_lasso), lty = 2, col = "grey40") legend("topright", legend = c("lasso", "COMBSS", "lasso's best MSE"), col = c("#e76f51", "#2a9d8f", "grey40"), pch = c(19, 17, NA), lty = c(1, 1, 2), bty = "n", cex = 0.95) mtext("(a) full range", side = 3, line = -1.2, cex = 0.95, adj = 0.5) # Panel 2: zoomed in to k >= 4 where the plateau lives keep_l <- ks_lasso >= 4 keep_c <- ks_combss >= 4 y_min <- min(val_mse_lasso[keep_l], val_mse_combss[keep_c]) - 0.01 y_max <- 0.45 plot(ks_lasso[keep_l], val_mse_lasso[keep_l], type = "b", pch = 19, col = "#e76f51", xlab = "number of selected features (k)", ylab = "validation MSE", ylim = c(y_min, y_max), xlim = c(4, max(ks_lasso))) lines(ks_combss[keep_c], val_mse_combss[keep_c], type = "b", pch = 17, col = "#2a9d8f") abline(h = min(val_mse_lasso), lty = 2, col = "grey40") mtext("(b) zoom: k >= 4", side = 3, line = -1.2, cex = 0.95, adj = 0.5) par(mfrow = c(1, 1)) ``` At $k = 5$, COMBSS reaches a validation MSE that lasso cannot match at **any** model size — even with 22 features, lasso's MSE stays above COMBSS's value at $k = 5$. ## Preview of the rest of the talk - **Methodology** — how the relaxation works, the Frank-Wolfe step, the curvature homotopy. - **Demos** — five runnable demos showing the method in action. ::: {.page-nav} [← Previous: Lasso](03-lasso.qmd) [Next: Boolean relaxation →](../methodology/01-relaxation.qmd) :::

Where COMBSS fits

The bridge: side by side

In one sentence

Two head-to-head questions

1. At a prescribed \(k = 5\), how biased are the coefficients?

2. For the same validation loss, how many features does each method need?

Preview of the rest of the talk