Lasso: convex relaxation
The standard \(\ell_1\) surrogate
From hard \(\ell_0\) to soft \(\ell_1\)
The sparse-constrained GLM problem from the previous page is
\[ \begin{aligned} &\min_{\beta_0,\,\boldsymbol{\beta}}\;\; -\tfrac{1}{n}\,\ell(\beta_0, \boldsymbol{\beta}) \\ &\text{subject to} \quad \|\boldsymbol{\beta}\|_0 = k. \end{aligned} \]
The combinatorial pain comes entirely from the \(\ell_0\)-norm in the constraint. Lasso replaces it with the \(\ell_1\)-norm:
\[ \boxed{\; \begin{aligned} &\min_{\beta_0,\,\boldsymbol{\beta}}\;\; -\tfrac{1}{n}\,\ell(\beta_0, \boldsymbol{\beta}) \\ &\text{subject to} \quad \|\boldsymbol{\beta}\|_1 \le \tau. \end{aligned} \;} \]
Four things changed:
- The constraint set is now convex (the \(\ell_1\)-ball), and the objective is convex.
- The budget \(\tau \ge 0\) is real-valued, not an integer count. The smaller \(\tau\), the more shrunken (and therefore sparser) the solution.
- The constraint also flips from equality (\(\|\boldsymbol{\beta}\|_0 = k\)) to inequality (\(\|\boldsymbol{\beta}\|_1 \le \tau\)). Lasso does not try to hit a specific number of features — it bounds the total coefficient magnitude.
- The whole problem is a convex optimisation — every local optimum is global, and there are mature solvers (i.e., glmnet).
The penalised form
Because the constrained lasso is convex, Lagrangian duality lets us rewrite it as an unconstrained penalised problem:
\[ \min_{\beta_0,\,\boldsymbol{\beta}}\; -\tfrac{1}{n}\,\ell(\beta_0, \boldsymbol{\beta}) \;+\; \lambda \|\boldsymbol{\beta}\|_1. \]
The link is exact: for every budget \(\tau\) in the constrained problem, there exists a \(\lambda^\star(\tau) \ge 0\) such that the penalised problem with \(\lambda = \lambda^\star(\tau)\) has the same minimiser as the constrained form. The two are different parameterisations of the same solution path.
In practice, almost everyone solves the penalised form — it is unconstrained, smooth in \(\lambda\), and admits efficient algorithms (coordinate descent, ADMM, ISTA/FISTA). glmnet returns the whole regularisation path in one call.
What lasso gives up
The relaxation is fast and well understood, but it pays two prices that no choice of \(\lambda\) can dodge simultaneously:
- You can’t control \(k\) directly. The user picks \(\lambda\) (or equivalently the \(\ell_1\) budget \(\tau\)); neither maps directly to the number of active features \(k\). Cross-validation picks a \(\lambda\) that optimises prediction, not selection size.
- Either overselection or shrinkage.
- A small \(\lambda\) (e.g.
lambda.min) gives near-OLS coefficients but typically grabs many spurious features. - A larger \(\lambda\) (e.g.
lambda.1se) cuts the false positives but shrinks every coefficient, including the truly active ones, toward zero.
- A small \(\lambda\) (e.g.
Concrete numbers on a simple simulation appear on the next page — they are the case for COMBSS rather than the case against lasso, so we save them for there.