What is Classification?

Presented by Mike Smith
mike@michaeltsmith.org.uk

What is classification?

You have some training data. Each item consists of some values, and is of a known class (i.e. is labelled).

You want to classify new data whose class (label) you don't know.

Because the data is labelled, this is an example of supervised learning.

Acute Malnutrition Screening

Need to assess who needs treatment.

In-patient care provided to those with Severe Acute Malnutrition (SAM).

How to decide who has SAM?


Acute Malnutrition Screening

Most common methods:

  • Mid-upper arm circumference (MUAC)
  • Weight-for-height z-score (WHZ)


Acute Malnutrition Screening

Which is best?

Researchers looked at outcomes of children presenting with different MUACs and z-scores.

Aim: Classify, based on MUAC, whether a child will recover without intervention.

e.g. Laillou, Arnaud, et al. "Optimal screening of children with acute malnutrition requires a change in current WHO guidelines as MUAC and WHZ identify different patient groups." PLoS ONE 9(7) (2014): e101159.


MUAC (simulated) Dataset


SIMULATED data from 29 children who recovered without treatment and 19 who didn't.
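
As a rough sketch of how data like this could be simulated in NumPy: the group sizes (29 and 19) are from the slide, but the MUAC means and spreads below are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# MUAC in mm: assume (hypothetically) that children who recovered tend to
# have larger arm circumferences than those who did not.
muac_recovered = rng.normal(loc=130, scale=8, size=29)      # recovered without treatment
muac_not_recovered = rng.normal(loc=112, scale=8, size=19)  # did not recover

muac = np.concatenate([muac_recovered, muac_not_recovered])
recovered = np.concatenate([np.ones(29, dtype=bool), np.zeros(19, dtype=bool)])
```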

MUAC (simulated) Dataset


Where should we put the threshold to decide who to treat?

MUAC (simulated) Dataset


Is this a better threshold?

MUAC (simulated) Dataset


What about in the middle?

MUAC (simulated) Dataset

We can keep plotting points on this graph (the true positive rate against the false positive rate for each threshold) to generate a curve. This is called the "Receiver operating characteristic" (ROC) curve.

We sometimes use the area under this curve (AUC) to compare classifiers.

Cost Matrix: What is the cost of a False Positive compared to the cost of a False Negative?
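
As a sketch of how the curve is traced out: sweep the MUAC threshold and record the true and false positive rates at each value (this assumes the simulated `muac` and `recovered` arrays from the earlier sketch, and treats "positive" as "flagged for treatment", i.e. did not recover).

```python
import numpy as np

def roc_points(muac, recovered):
    """True/false positive rates for every possible MUAC threshold."""
    thresholds = np.append(np.sort(np.unique(muac)), np.inf)
    needs_treatment = ~recovered                # the "positive" class: did not recover
    tpr, fpr = [], []
    for t in thresholds:
        flagged = muac < t                      # flag children below the threshold for treatment
        tp = np.sum(flagged & needs_treatment)
        fp = np.sum(flagged & ~needs_treatment)
        tpr.append(tp / np.sum(needs_treatment))    # sensitivity
        fpr.append(fp / np.sum(~needs_treatment))   # 1 - specificity
    return np.array(fpr), np.array(tpr)

fpr, tpr = roc_points(muac, recovered)
auc = np.trapz(tpr, fpr)                        # area under the ROC curve
print(f"AUC = {auc:.2f}")
```

Weighting false positives differently from false negatives (the cost matrix) would simply change which point on this curve we pick as our operating threshold.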

More data?

Use z-score instead?

We might be able to do better with our screening if we use the z-score data instead.

More data?

We might be able to do better with our screening if we combine the MUAC data with the z-score data.

Which class do people think the child marked '?' belongs in?

Nearest Neighbour

We could assign the class of the child's "nearest neighbour".

k-Nearest Neighbour

We could look at the classes of the three nearest neighbours (k = 3) and pick the most common class.
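
A minimal k-nearest-neighbour sketch in NumPy (k = 3, as above). The `X_train`, `y_train`, and `x_new` arrays are assumptions, e.g. a MUAC and z-score column per child plus a label.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new from its k nearest training points."""
    distances = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to each child
    nearest = np.argsort(distances)[:k]                   # indices of the k closest children
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]                      # most common class among them

# e.g. knn_predict(X_train, y_train, np.array([118.0, -2.5]), k=3)
```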

Linear Boundary

Could we draw a straight line to try to classify the children?

Other classifiers

There are many other types of classifier... (some are better when there are more dimensions, when inputs are correlated, or when there is more or less noise...)

Leave-one-out cross-validation

If we imagine our data is more random (like this), could we still classify the children?

Leave-one-out cross-validation

We could draw a complicated decision boundary (like this).

All training points are classified correctly, but...

...what about a new child?

Leave-one-out cross-validation

To test our classifier we need to divide the data up into TRAINING data and TEST data.

Here we do this by leaving out one item at a time.

Leave-one-out cross-validation

Here we've left out one of the children from the training data. We can now TEST to see if the classifier got it correct. In this case it didn't.

Leave-one-out cross-validation

Leave-one-out cross-validation allows us to say how well our classifier will generalise to unknown data.
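
A sketch of leave-one-out cross-validation, assuming feature and label arrays `X` and `y` and a classifier like the `knn_predict` sketch above.

```python
import numpy as np

def leave_one_out_accuracy(X, y, predict, **kwargs):
    """Fraction of items classified correctly when each is held out in turn."""
    correct = 0
    for i in range(len(y)):
        X_train = np.delete(X, i, axis=0)   # every child except child i
        y_train = np.delete(y, i)
        if predict(X_train, y_train, X[i], **kwargs) == y[i]:
            correct += 1
    return correct / len(y)

# e.g. leave_one_out_accuracy(X, y, knn_predict, k=3)
```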

Warning note: When leaving 'one' out, you need to be careful to think about other correlated data. For example if each child had a twin, you might want to leave both twins out for the cross-validation.

Over-fitting

This decision boundary is an example of over-fitting.

Our 'model' is overly complicated and describes the noise, rather than the underlying structure.

Validation

There might be many parameters we can adjust in our model (something you'll see shortly). For example the number of neighbours, or the curviness of our decision boundary.

We could keep fiddling with these parameters (this could even be done automatically) until we get a good result on our training set...

...does that seem 'honest'?

Validation

Once you've finished adjusting the model (using the training and test sets), evaluate it on a validation dataset that you haven't yet used for anything else.
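
As a sketch, one way to do this is to shuffle the data once and set three groups of indices aside up front; the split fractions below are arbitrary illustrative choices, and `y` is an assumed label array.

```python
import numpy as np

rng = np.random.default_rng(1)
idx = rng.permutation(len(y))            # shuffle the children once

n_test = len(y) // 5                     # arbitrary 20% test split
n_val = len(y) // 5                      # arbitrary 20% validation split

test_idx = idx[:n_test]                  # used to test while adjusting the model
val_idx = idx[n_test:n_test + n_val]     # touched only once, at the very end
train_idx = idx[n_test + n_val:]         # used to fit the model
```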

Logistic Regression

A quick mention of how classification can often be considered a regression problem...

Logistic Regression

Earlier we tried to find a decision boundary to separate out these two classes.

Logistic Regression

But if we treat the y-axis as a number, could we instead fit a line?

Logistic Regression

Logistic Regression does exactly that, and fits this function to the data:

$Logistic(z) = \frac{1}{1 + e^{-z}}$

Logistic Regression

For classification we just apply a threshold to the output of that function.

We replace $z$ with $\mathbf{w}^\top\mathbf{x}$. This might look familiar from Neil's lecture on regression.

$Logistic(\mathbf{w}^\top\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top\mathbf{x}}}$

We basically have a couple of parameters in $\mathbf{w}$ which we adjust to make the curve fit our data as well as possible.
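
As a sketch, we could fit $\mathbf{w}$ (an intercept and a slope) by gradient descent on the negative log-likelihood. The `muac` and `recovered` arrays are assumed from the earlier simulation sketch, and the learning rate and step count are arbitrary illustrative choices.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Standardise MUAC so a plain gradient-descent step size behaves sensibly.
x = (muac - muac.mean()) / muac.std()
X = np.column_stack([np.ones_like(x), x])   # intercept column plus MUAC
y = recovered.astype(float)

w = np.zeros(2)
for _ in range(2000):
    p = logistic(X @ w)                 # predicted probability of recovery
    gradient = X.T @ (p - y) / len(y)   # gradient of the negative log-likelihood
    w -= 0.1 * gradient                 # small step downhill

predictions = logistic(X @ w) > 0.5     # classify by thresholding at 0.5
```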

Questions

Terminology

True Positive Rate = Sensitivity

False Positive = Type I error

True Negative Rate = Specificity

False Negative = Type II error
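
As a sketch, these terms come straight out of the counts in a confusion matrix; the boolean arrays below (`predicted_positive`, `actually_positive`) are assumed, one entry per child.

```python
import numpy as np

tp = np.sum(predicted_positive & actually_positive)    # true positives
fp = np.sum(predicted_positive & ~actually_positive)   # false positives (Type I errors)
fn = np.sum(~predicted_positive & actually_positive)   # false negatives (Type II errors)
tn = np.sum(~predicted_positive & ~actually_positive)  # true negatives

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
```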