Presented by Mike Smith
RA at the University of Sheffield
m.t.smith@sheffield.ac.uk
You have some training data. Each item consists of some values, and is of a known class (i.e. is labelled).
You want to classify new data whose class (label) you don't know.
Because the data is labelled, this is an example of supervised learning.
Need to assess who needs treatment.
In-patient care provided to those with Severe Acute Malnutrition (SAM).
How to decide who has SAM?
Most common methods:
Which is best?
Researchers looked at outcomes of children presenting with different MUACs and z-scores.
Aim: Classify, based on MUAC, whether a child will recover without intervention.
e.g. Laillou, Arnaud, et al. "Optimal screening of children with acute malnutrition requires a change in current WHO guidelines as MUAC and WHZ identify different patient groups." PloS one 9.7 (2014): e101159.
SIMULATED data from 29 children who recovered without treatment and 19 who didn't
Where should we put the threshold to decide who to treat?
Is this a better threshold?
What about in the middle?
We can keep plotting points on this graph, to generate a curve. This is called the "Receiver operating characteristic" (ROC) curve.
We sometimes use this to compare classifiers, e.g. via the area under the curve (AUC).
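As a rough sketch of how such a curve is produced in practice: scikit-learn's `roc_curve` sweeps the threshold for us. The MUAC values and outcomes below are invented for illustration; they are not the simulated data from the study.

```python
# A minimal sketch of an ROC curve for a single-feature (MUAC) threshold classifier.
# The measurements and labels are made up purely for illustration.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

muac = np.array([11.0, 11.5, 12.0, 12.5, 13.0, 13.5, 14.0, 14.5])  # cm, illustrative
recovered = np.array([0, 0, 0, 1, 0, 1, 1, 1])  # 1 = recovered without treatment

# Higher MUAC -> more likely to recover, so MUAC itself can act as the "score";
# roc_curve tries every possible threshold and records the trade-off at each one.
fpr, tpr, thresholds = roc_curve(recovered, muac)
print("AUC:", roc_auc_score(recovered, muac))

plt.plot(fpr, tpr)
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.show()
```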
Cost Matrix: What is the cost of a False Positive compared to the cost of a False Negative?
We might be able to do better with our screening if we use the z-score data.
We might be able to do better still if we combine the MUAC data with the z-score data.
Which class do people think the '?' child belongs in?
We could assign the class of the child's "nearest neighbour".
We could look at the class of the three nearest neighbours and pick the most common class.
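A minimal sketch of this three-nearest-neighbour idea, assuming we represent each child by (MUAC, z-score); the numbers here are invented for illustration.

```python
# A minimal sketch of the 3-nearest-neighbour classifier using scikit-learn.
# The MUAC / z-score values and labels are invented, not real data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[11.2, -3.1], [11.8, -2.8], [12.5, -2.0],
                    [13.0, -1.5], [13.6, -1.0], [14.1, -0.5]])  # [MUAC (cm), z-score]
y_train = np.array([0, 0, 0, 1, 1, 1])  # 0 = needed treatment, 1 = recovered

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

new_child = np.array([[12.8, -1.8]])  # the "?" child
print(knn.predict(new_child))         # majority class of its 3 nearest neighbours
```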
We could draw a straight line to try to classify the children.
Many other types of classifier... (some are good when there are more dimensions, or when inputs are correlated, or when there's more or less noise...)
Imagine our data is more random (like this). Could we still classify the children?
We could draw a complicated decision boundary (like this).
All training points are classified correctly, but...
...what about a new child?
To test our classifier we need to divide the data up into TRAINING data and TEST data.
We normally do this by leaving out one item at a time.
Here we've left out one of the children from the training data. We can now TEST to see if the classifier got it correct. In this case it didn't.
Leave-one-out cross-validation allows us to say how well our classifier will generalise to unknown data.
Warning note: When leaving 'one' out, you need to be careful to think about other correlated data. For example if each child had a twin, you might want to leave both twins out for the cross-validation.
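A minimal sketch of leave-one-out cross-validation with scikit-learn, again with invented data.

```python
# Each child is left out in turn, the classifier is trained on the rest,
# and then tested on the held-out child. Data values are invented.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[11.2, -3.1], [11.8, -2.8], [12.5, -2.0],
              [13.0, -1.5], [13.6, -1.0], [14.1, -0.5]])  # [MUAC (cm), z-score]
y = np.array([0, 0, 0, 1, 1, 1])

scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=LeaveOneOut())
print("Estimated accuracy on unseen children:", scores.mean())
```

For the twins warning, a group-aware splitter (e.g. scikit-learn's LeaveOneGroupOut, with one group id per family) keeps correlated items out of the training fold together.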
This decision boundary is an example of over-fitting.
Our 'model' is overly complicated and describes the noise, rather than the underlying structure.
There might be many parameters we can adjust in our model (something you'll see shortly). For example the number of neighbours, or the curviness of our decision boundary.
We could keep fiddling with these parameters (this could even be done automatically) until we get a good result on our training set...
...does that seem 'honest'?
Once you've finished adjusting the model (using the training and test sets), evaluate it on a validation dataset that you haven't yet used for anything else.
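One way this three-way split might look in code, following the slide's naming (train + test used while tuning, validation held back until the very end). The data here is random dummy data purely to make the sketch runnable.

```python
# A minimal sketch of a train / test / validation split.
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data just so the sketch runs: 20 items, 2 features each.
X = np.random.rand(20, 2)
y = np.random.randint(0, 2, size=20)

# Hold back a final 'validation' set that nothing touches until tuning is finished.
X_rest, X_val, y_rest, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# The remaining data is split into the training and test sets used while tuning.
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25,
                                                    random_state=0)

# ...fit and tune on (X_train, y_train), checking against (X_test, y_test)...
# Only at the very end, report performance once on (X_val, y_val).
```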
Typically, as an ML expert, you'll be given a set of data and will need to decide which parts of the data will be useful to answer a problem...
This usually requires looking at the data and discussion with the domain experts.
Often depends on the quality and biases in the collected data
Example problem: Count the number of sickle cells
What features of the cells would be useful?
i.e. which features are "invariant".
| Circumference (pixels) | Average pixel value | Area (pixels${}^2$) | Sickle? (manually labelled) |
|---|---|---|---|
| 45 | 72 | 201 | Yes |
| 62 | 51 | 340 | No |
| 30 | 36 | 85 | Yes |
| 55 | 65 | 126 | Yes |
| 50 | 62 | 125 | No |
These could be fed into one of the classifiers discussed.
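For example, the table's values could go straight into the 3-nearest-neighbour classifier from earlier; this is just an illustrative sketch, and the 'new cell' measurements at the end are made up.

```python
# Feeding the hand-made features from the table into a 3-nearest-neighbour classifier.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Columns: circumference (pixels), average pixel value, area (pixels^2)
X = np.array([[45, 72, 201],
              [62, 51, 340],
              [30, 36,  85],
              [55, 65, 126],
              [50, 62, 125]])
y = np.array([1, 0, 1, 1, 0])  # 1 = sickle, 0 = not (from the manual labels)

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[48, 60, 150]]))  # a hypothetical new cell
```

In practice you would probably rescale the columns first (e.g. with StandardScaler), since nearest-neighbour distances are otherwise dominated by the largest-valued feature (here, the area).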
The example was of classifying images, something the CNN community have worked on considerably. Does the CNN learn the features itself? [is this data efficient?]
Many (most?) problems still benefit from feature engineering. Maybe one could localise the cells first and then run the DNN on each one, rather than on the whole slide?
Can include expert knowledge (e.g. we might want to hide features that might differ in future tests - data shift)
"I have a binary classification problem and one class is present with 60:1 ratio in my training set. I used the logistic regression and the result seems to just ignores one class."
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
Don't just report 'accuracy'! Also, area under the ROC curve isn't ideal if the classes are imbalanced. Use the F-score? Or, even better, a confusion matrix.
Some classifiers need balanced training data. Could resample to even up class sizes (or use additional synthetic data). Some classifiers let you penalise one type of mistake more.
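A minimal sketch of the penalisation tactic, using scikit-learn's class_weight option for logistic regression. The data is random and only mimics the roughly 60:1 imbalance from the question above.

```python
# Penalising mistakes on the rare class more heavily via class_weight="balanced".
# The data is randomly generated purely to illustrate the imbalance.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1220, 2))
y = np.zeros(1220, dtype=int)
y[:20] = 1        # minority class, roughly 60:1
X[:20] += 2.0     # shift the minority class so there is something to learn

clf = LogisticRegression(class_weight="balanced").fit(X, y)
print((clf.predict(X) == 1).sum(), "items predicted as the rare class")
```

Resampling to even up the class sizes (or generating synthetic minority examples, e.g. SMOTE from the imbalanced-learn package) is the other common tactic.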
The F-score describes a classifier's accuracy by combining both precision and recall.
Example: 100 cells were tested.
25 were really sickle. The classifier found 20 of these.
It also identified 30 of the non-sickle as sickle.
What is the Precision (detected true positives/all detected positives)?
What is the Recall (detected true positives/all real positives)?
Image from Wikipedia.
What is the Precision (detected true positives/all detected positives)?
20/50 = 0.4
What is the Recall (detected true positives/all real positives)?
20/25 = 0.8
$\text{F-score} = (\frac{\text{recall}^{-1} + \text{precision}^{-1}}{2})^{-1}$
$[(2.5 + 1.25)/2]^{-1} = 0.53$
Issue: weights precision and recall equally.
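The worked example above, recomputed in a few lines of Python; the counts are taken straight from the slide.

```python
# 25 truly sickle cells, 20 of them found, plus 30 false positives.
tp, fn, fp = 20, 5, 30

precision = tp / (tp + fp)                    # 20 / 50 = 0.4
recall    = tp / (tp + fn)                    # 20 / 25 = 0.8
f_score   = 2 / (1 / precision + 1 / recall)  # harmonic mean ~= 0.53
print(precision, recall, round(f_score, 2))
```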
We take an image and add some carefully crafted noise.
Dog becomes ostrich
Note: Noise image scaled by 10x
Szegedy, et al. 2014 (original paper describing AEs)
Note: These are examples of 'test-time' attacks (we're altering the test image not the training data)
Fischer, 2017
based on http://karpathy.github.io/2015/03/30/breaking-convnets/
The noise must be carefully crafted: e.g. Fawzi (2015) shows the difference between random noise and an adversarial example.
Maximally perturb the feature (pixel) with greatest gradient (wrt class) [Papernot, 2016].
They find they need to modify only about 4% of pixels in MNIST images to produce AEs.
Attacks specific to an L-norm. Carlini (2017) gives a good overview; basically the last two attacks just use different norms.
Fast Gradient Sign Method. Changes all pixels simultaneously, but each by only $\pm\epsilon$.
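A minimal sketch of FGSM in PyTorch, assuming you already have a trained `model` and a correctly classified `(image, label)` batch; the epsilon value and the $[0, 1]$ pixel range are illustrative choices, not part of the original method's specification.

```python
# Fast Gradient Sign Method sketch: one gradient step of size epsilon per pixel.
import torch
import torch.nn.functional as F

def fgsm(model, image, label, epsilon=0.01):
    # Perturb every pixel simultaneously by +/- epsilon, in the direction
    # (the sign of the loss gradient) that most increases the loss for the true label.
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0, 1).detach()  # keep pixel values in a valid range
```

Here epsilon controls the trade-off between how visible the noise is and how reliably the attack works.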
Black Box. Papernot 2017, train a DNN on the classifications of a black-box classifier. AEs produced on their new emulation worked against the target (e.g. Amazon and Google's image recognition systems).
Blackbox+Genetic Algorithm. Su et al., 2017: single pixels. I'm not clear why we can't just check each pixel?
Adversarial samples created on the row's ML technique were then tested against the column's technique.
Papernot 2016
Moosavi-Dezfooli, 2016
Alexey, 2016
Evtimov, 2017 (published?)
Originally people suggested 'overfitting', but it may be more to do with the linear nature of the classifier.
Goodfellow, 2014
Various ideas: