Presented by Mike Smith

mike@michaeltsmith.org.uk

You have some training data. Each item consists of some values, and is of a **known class** (i.e. is labelled).

You want to classify new data whose class (label) you don't know.

Because the data is labelled, this is an example of **supervised learning**.

Need to assess who needs treatment.

In-patient care provided to those with Severe Acute Malnutrition (SAM).

How to decide who has SAM?

Most common methods:

- Mid-upper arm circumference (MUAC)
- Weight-for-height z-score (WHZ)

Which is best?

Researchers looked at outcomes of children presenting with different MUACs and z-scores.

Aim: Classify, based on MUAC, whether a child will recover without intervention.

e.g. Laillou, Arnaud, et al. "Optimal screening of children with acute malnutrition requires a change in current WHO guidelines as MUAC and WHZ identify different patient groups." PLoS ONE 9.7 (2014): e101159.

**SIMULATED** data from 29 children who recovered without treatment and 19 who didn't

Where should we put the threshold to decide who to treat?

Is this a better threshold?

What about in the middle?
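To make the effect of moving the threshold concrete, here is a minimal sketch. The MUAC readings are hypothetical (they are not the simulated data shown on the slides), and it assumes that "positive" means "needs treatment" and that we treat any child whose MUAC is below the threshold.

```python
# Hypothetical MUAC readings in mm (NOT the simulated data from the slides).
recovered = [118, 121, 125, 127, 130, 133, 136, 140]   # recovered without treatment
not_recovered = [102, 106, 110, 113, 117, 120, 124]    # did not recover without treatment

def confusion_counts(threshold, recovered, not_recovered):
    """Treat a child if MUAC < threshold; 'positive' means 'needs treatment'."""
    tp = sum(m < threshold for m in not_recovered)  # needed treatment and got it
    fn = len(not_recovered) - tp                    # needed treatment but were missed
    fp = sum(m < threshold for m in recovered)      # treated but would have recovered anyway
    tn = len(recovered) - fp                        # correctly left untreated
    return tp, fp, tn, fn

for threshold in (110, 119, 130):
    tp, fp, tn, fn = confusion_counts(threshold, recovered, not_recovered)
    print(f"threshold={threshold} mm: TP={tp} FP={fp} TN={tn} FN={fn}")
```

A low threshold misses children who need treatment; a high threshold treats many who would have recovered anyway.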

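Each threshold gives one point on this graph: its false positive rate against its true positive rate. We can keep plotting points for more thresholds to generate a curve. This is called the "Receiver operating characteristic" (ROC) curve.

We sometimes use this to compare classifiers, by comparing the area under the curve (AUC).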
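A rough sketch of how the curve and its area could be computed by sweeping the threshold. The data are hypothetical, the function names `roc_points` and `area_under_curve` are made up for this illustration, and the area uses the trapezium rule.

```python
# Hypothetical MUAC readings in mm (NOT the simulated data from the slides).
recovered = [118, 121, 125, 127, 130, 133, 136, 140]   # negatives: no treatment needed
not_recovered = [102, 106, 110, 113, 117, 120, 124]    # positives: needed treatment

def roc_points(negatives, positives):
    """(false positive rate, true positive rate) for the rule 'treat if MUAC <= v'."""
    points = [(0.0, 0.0)]  # threshold below every reading: nobody is treated
    for v in sorted(set(negatives) | set(positives)):
        fpr = sum(n <= v for n in negatives) / len(negatives)
        tpr = sum(p <= v for p in positives) / len(positives)
        points.append((fpr, tpr))
    return points

def area_under_curve(points):
    """Trapezium rule over consecutive ROC points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

points = roc_points(recovered, not_recovered)
auc = area_under_curve(points)
print(f"AUC = {auc:.3f}")
```

An area of 1 would mean some threshold separates the two groups perfectly; an area of 0.5 is what random guessing achieves.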

**Cost Matrix**: What is the cost of a **False Positive** compared to the cost of a **False Negative?**
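One way to use a cost matrix is to pick the threshold with the lowest total cost. The sketch below assumes hypothetical MUAC data and made-up relative costs (a missed child costing five times as much as an unnecessary treatment, and the reverse); real costs would have to come from the clinical context.

```python
# Hypothetical MUAC readings in mm (NOT the simulated data from the slides).
recovered = [118, 121, 125, 127, 130, 133, 136, 140]
not_recovered = [102, 106, 110, 113, 117, 120, 124]

def total_cost(threshold, cost_fp, cost_fn):
    """Cost of the rule 'treat if MUAC < threshold' on the data above."""
    fp = sum(m < threshold for m in recovered)        # treated unnecessarily
    fn = sum(m >= threshold for m in not_recovered)   # needed treatment, missed
    return cost_fp * fp + cost_fn * fn

candidates = range(100, 145)  # candidate thresholds in mm

# Assumed costs: missing a child is 5x worse than an unnecessary treatment.
cautious = min(candidates, key=lambda t: total_cost(t, cost_fp=1, cost_fn=5))
# Assumed costs the other way round (e.g. treatment places are very scarce).
frugal = min(candidates, key=lambda t: total_cost(t, cost_fp=5, cost_fn=1))

print("threshold when false negatives are costly:", cautious)
print("threshold when false positives are costly:", frugal)
```

The more expensive a false negative is relative to a false positive, the higher the chosen threshold, so more children are treated.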

We might be able to do better with our screening if we use the z-score data.

We might be able to do better still if we combine the MUAC data with the z-score data.

Which class do people think the child marked "?" belongs in?

We could assign the class of the child's "nearest neighbour".

We could look at the class of the three nearest neighbours and pick the most common class.
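A minimal sketch of the nearest-neighbour idea in two dimensions (MUAC and z-score). The training points are hypothetical, and it assumes plain Euclidean distance on the raw values.

```python
import math
from collections import Counter

# Hypothetical training points: ((MUAC in mm, weight-for-height z-score), outcome)
training = [
    ((135, -0.5), "recovered"),
    ((130, -1.0), "recovered"),
    ((128, -1.8), "recovered"),
    ((112, -3.2), "not recovered"),
    ((108, -2.9), "not recovered"),
    ((117, -3.5), "not recovered"),
]

def knn_classify(point, training, k=3):
    """Return the most common class among the k nearest training points."""
    by_distance = sorted(training, key=lambda item: math.dist(point, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_classify((110, -3.0), training, k=1))
print(knn_classify((132, -0.8), training, k=3))
```

Because MUAC is in millimetres and the z-score is a small number, the raw distance here is dominated by MUAC. In practice the inputs are usually rescaled first so that each contributes comparably.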

We could try drawing a straight line to classify the children.

Many other types of classifier... (some are good when there are more dimensions, or when inputs are correlated, or when there's more or less noise...)

Imagine our data is more random (like this). Could we still classify the children?

We could draw a complicated decision boundary (like this).

All **training points** are classified correctly, but...

...what about a new child?

To test our classifier we need to divide the data up into **TRAINING** data and **TEST** data.

We normally do this by leaving out one item at a time (leave-one-out cross-validation).

Here we've left out one of the children from the training data. We can now TEST to see if the classifier got it correct. In this case it didn't.

Leave-one-out cross-validation allows us to say how well our classifier will **generalise** to unknown data.
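A sketch of leave-one-out cross-validation, assuming a one-dimensional k-nearest-neighbour classifier on hypothetical MUAC data. Each child in turn is removed, the classifier uses the rest, and we count how often the left-out child is classified correctly.

```python
from collections import Counter

# Hypothetical (MUAC in mm, outcome) data, NOT the simulated data from the slides.
data = [(m, "recovered") for m in (118, 121, 125, 127, 130, 133, 136, 140)] + \
       [(m, "not recovered") for m in (102, 106, 110, 113, 117, 120, 124)]

def knn_classify(x, training, k):
    """Most common class among the k training items closest to x."""
    nearest = sorted(training, key=lambda item: abs(item[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def leave_one_out_accuracy(data, k):
    correct = 0
    for i, (x, label) in enumerate(data):
        training = data[:i] + data[i + 1:]          # everything except item i
        correct += knn_classify(x, training, k) == label
    return correct / len(data)

for k in (1, 3, 5):
    print(f"k={k}: leave-one-out accuracy = {leave_one_out_accuracy(data, k):.2f}")
```

With k=1, the classifier would score 100% if tested on its own training points (each point is its own nearest neighbour), which is why the left-out score is the more honest figure.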

Warning note: When leaving 'one' out, you need to be careful to think about other correlated data. For example if each child had a twin, you might want to leave both twins out for the cross-validation.
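The twin example can be sketched as follows. The data are hypothetical: six pairs of twins with very similar MUAC and the same outcome, each tagged with a made-up family identifier. The sketch compares leaving out one child with leaving out the whole family, using a 1-nearest-neighbour classifier.

```python
# Hypothetical (MUAC in mm, outcome, family id) data: each family is a pair of twins.
data = [
    (105, "not recovered", "A"), (106, "not recovered", "A"),
    (109, "not recovered", "E"), (110, "not recovered", "E"),
    (117, "recovered", "C"),     (118, "recovered", "C"),
    (122, "not recovered", "D"), (123, "not recovered", "D"),
    (129, "recovered", "B"),     (130, "recovered", "B"),
    (134, "recovered", "F"),     (135, "recovered", "F"),
]

def nearest_label(x, training):
    """Class of the single nearest training item (1-nearest-neighbour)."""
    return min(training, key=lambda item: abs(item[0] - x))[1]

def accuracy(data, leave_out_family):
    correct = 0
    for i, (muac, label, family) in enumerate(data):
        if leave_out_family:
            # Remove the child AND their twin from the training data.
            training = [(m, l) for m, l, f in data if f != family]
        else:
            # Remove only this child; the twin stays in the training data.
            training = [(m, l) for j, (m, l, f) in enumerate(data) if j != i]
        correct += nearest_label(muac, training) == label
    return correct / len(data)

print("leave one child out: ", accuracy(data, leave_out_family=False))
print("leave one family out:", accuracy(data, leave_out_family=True))
```

Leaving out one child at a time looks perfect here only because each child's twin is still in the training data; leaving out the family gives a lower, more realistic estimate.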

This decision boundary is an example of over-fitting.

Our 'model' is overly complicated and describes the noise, rather than the underlying structure.

There might be many parameters we can adjust in our model (something you'll see shortly). For example, the number of neighbours, or the curviness of our decision boundary.

We could keep fiddling with these parameters (this could even be done automatically) until we get a good result when we test the classifier...

...does that seem 'honest'?

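Once you've finished adjusting the model (using the training and test sets), evaluate it on a **validation** dataset that you haven't used for anything else. (Many texts swap these two names: they tune on a "validation" set and keep a final "test" set.)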
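A sketch of that workflow, using this talk's naming (tune on the test set, report on the validation set). The data are randomly generated for illustration, with assumed group means and spread, and the only parameter being adjusted is the number of neighbours k.

```python
import random
from collections import Counter

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical simulated MUAC readings in mm; the means and spread are assumptions.
data = ([(random.gauss(128, 7), "recovered") for _ in range(60)]
        + [(random.gauss(114, 7), "not recovered") for _ in range(60)])
random.shuffle(data)
training, test, validation = data[:60], data[60:90], data[90:]

def knn_classify(x, training, k):
    """Most common class among the k training items closest to x."""
    nearest = sorted(training, key=lambda item: abs(item[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(items, training, k):
    hits = sum(knn_classify(x, training, k) == label for x, label in items)
    return hits / len(items)

# Adjust the parameter using the training and test sets only.
candidate_ks = (1, 3, 5, 7, 9)
best_k = max(candidate_ks, key=lambda k: accuracy(test, training, k))
test_accuracy = accuracy(test, training, best_k)

# Report performance on data that played no part in choosing k.
validation_accuracy = accuracy(validation, training, best_k)
print(f"best k = {best_k}")
print(f"test accuracy (optimistic) = {test_accuracy:.2f}")
print(f"validation accuracy (honest estimate) = {validation_accuracy:.2f}")
```

The test accuracy was used to pick k, so it tends to flatter the model; the validation accuracy is the one to report.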

A quick mention of how classification can often be considered a regression problem...

Earlier we tried to find a decision boundary to separate out these two classes.

But if we treat the y-axis as a number, could we instead fit a line?

Logistic regression does essentially that, but instead of a straight line it fits this S-shaped function to the data:

$Logistic(z) = \frac{1}{1 + e^{-z}}$

For classification we just apply a threshold to the output of that function (for example, 0.5).

We replace $z$ with $\mathbf{w}^\top\mathbf{x}$. This might look familiar from Neil's lecture on regression.

$Logistic(\mathbf{w}^\top\mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top\mathbf{x}}}$

We have a small number of parameters in $\mathbf{w}$, which we adjust to make the curve fit our data as well as possible.
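A minimal sketch of adjusting the parameters, under some assumptions that are not from the slides: "fit as well as possible" is taken to mean minimising the average log-loss, the fitting is done by plain gradient descent, the data are hypothetical, and MUAC is rescaled to keep the steps stable. The input vector is (1, scaled MUAC), so the first weight acts as an intercept.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: MUAC in mm, label 1 = recovered without treatment, 0 = did not.
muac = [102, 106, 110, 113, 117, 120, 124, 118, 121, 125, 127, 130, 133, 136, 140]
label = [0] * 7 + [1] * 8

def features(m):
    """x = (1, scaled MUAC); the scaling constants are arbitrary choices."""
    return (1.0, (m - 120) / 10)

xs = [features(m) for m in muac]

w = [0.0, 0.0]   # the parameters we adjust
rate = 0.1       # step size (an assumption)
for _ in range(5000):
    grad = [0.0, 0.0]
    for x, y in zip(xs, label):
        p = logistic(w[0] * x[0] + w[1] * x[1])
        for j in range(2):
            grad[j] += (p - y) * x[j]       # gradient of the log-loss
    w = [wj - rate * gj / len(xs) for wj, gj in zip(w, grad)]

def predict(m):
    """Estimated probability that a child with MUAC m recovers without treatment."""
    x = features(m)
    return logistic(w[0] * x[0] + w[1] * x[1])

for m in (105, 120, 135):
    decision = "recovers" if predict(m) > 0.5 else "needs treatment"
    print(f"MUAC {m} mm: P(recover) = {predict(m):.2f} -> {decision}")
```

The last loop applies a threshold of 0.5 to the fitted curve to turn the probability into a class.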

True Positive Rate = Sensitivity

False Positive = Type I error

True Negative Rate = Specificity

False Negative = Type II error
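Sensitivity and specificity are rates computed from the four counts, so a short sketch may help tie the terms together. The counts below are hypothetical, and the function name `rates` is made up for this illustration.

```python
def rates(tp, fp, tn, fn):
    """Standard rates derived from the four confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),           # true positive rate
        "specificity": tn / (tn + fp),           # true negative rate
        "false_positive_rate": fp / (fp + tn),   # share of negatives hit by a Type I error
        "false_negative_rate": fn / (fn + tp),   # share of positives hit by a Type II error
    }

# Hypothetical counts: 7 children needed treatment, 8 did not.
result = rates(tp=5, fp=1, tn=7, fn=2)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Note that the false positive rate is 1 minus the specificity, and the false negative rate is 1 minus the sensitivity.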