Khan SRBCT cancer classification

\(p = 2308\) genes, 4 cancer types, \(n = 83\) — and the answer is twelve genes

R code is executed; outputs below are real. Click the Python tab for the equivalent script.

The data

Khan et al. (2001) profiled gene expression for small round blue-cell tumours, a family of four childhood cancers that look nearly identical under a microscope:

  • BL — Burkitt lymphoma
  • EWS — Ewing’s sarcoma
  • NB — neuroblastoma
  • RMS — rhabdomyosarcoma

Each tumour is described by 2308 cDNA microarray probes. The training set has 63 samples; the test set has 20. The classical version of the question is: can you train a classifier that distinguishes the four tumour types using only a handful of genes?

The CSVs in data/ are formatted with the class label in the first column (y) and the 2308 probes in columns X1X2308.

Load the data

train <- read.csv("data/Khan_train.csv")
test  <- read.csv("data/Khan_test.csv")

x_train <- as.matrix(train[, -1]); y_train <- train$y
x_test  <- as.matrix(test[, -1]);  y_test  <- test$y

dim(x_train); table(y_train)
[1]   63 2308
y_train
 1  2  3  4 
 8 23 12 20 
import pandas as pd

train = pd.read_csv("data/Khan_train.csv")
test  = pd.read_csv("data/Khan_test.csv")

X_train = train.drop(columns="y").to_numpy()
y_train = train["y"].to_numpy()
X_test  = test.drop(columns="y").to_numpy()
y_test  = test["y"].to_numpy()

print(X_train.shape)
print(pd.Series(y_train).value_counts())

Fit COMBSS

Run COMBSS for \(k = 1, \ldots, 20\) with multinomial logistic loss and the test set as validation:

library(combss)
fit_khan <- combss(x_train, y_train,
                   x_val = x_test, y_val = y_test,
                   family = "multinomial", q = 20)
fit_khan
COMBSS fit
  family:    multinomial
  n, p:      63, 2308
  q:         20
  lam_ridge: 0
  best k:    12
  accuracy:  1
  subset:    187, 246, 509, 545, 910, 1074, 1319, 1389, 1645, 1954, 1955, 2050
from combss import multinomial

fit_khan = multinomial.model()
fit_khan.fit(X_train, y_train, X_val=X_test, y_val=y_test, q=20, verbose=False)
print(fit_khan.subset)

How many genes does COMBSS pick?

fit_khan$best_k
[1] 12
fit_khan$subset_list[[ fit_khan$best_k ]]
 [1]  187  246  509  545  910 1074 1319 1389 1645 1954 1955 2050
print(len(fit_khan.subset), "genes selected at the best k")
print(fit_khan.subset)

Test-set accuracy at each \(k\)

plot(seq_along(fit_khan$accuracy_path), fit_khan$accuracy_path, type = "b", pch = 19,
     xlab = "k (number of selected genes)", ylab = "Test accuracy")
abline(v = fit_khan$best_k, lty = 2)

Accuracy as a function of the number of selected genes.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

acc_path = []
for s in fit_khan.subset_list:
    clf = LogisticRegression(max_iter=2000).fit(X_train[:, s], y_train)
    acc_path.append((clf.predict(X_test[:, s]) == y_test).mean())
best_k = int(np.argmax(acc_path) + 1)

plt.plot(range(1, len(acc_path) + 1), acc_path, "o-")
plt.axvline(best_k, ls="--")
plt.xlabel("k (number of selected genes)")
plt.ylabel("Test accuracy")
plt.show()

The headline

Twelve genes are enough to separate the four tumour types — perfect classification on the held-out 20 samples, drawn from 2308 candidates. The R and Python implementations agree on the same twelve genes.

A lasso baseline on the same training data selects an order of magnitude more genes for comparable accuracy (see Comparisons).

Reference

Khan, J. et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine 7, 673–679.