This blog post is based on a conference talk I gave at the PyTexas and North Bay Python conferences. The blog post is a little more detailed, but if you prefer watching video to reading text you can watch the talk on YouTube.
Machine learning seems to be everywhere these days, but a lot of the information about what it is and how it works can be somewhat opaque. On one end of the spectrum there’s the “just run this code” approach, which is great if you’re learning a new library for a familiar task, but can seem a bit like magic when it’s demonstrating something you’ve not done before. On the other end of the spectrum is the mathematical explanation. Mathematical notation is a useful tool if you know it, but if not it can hide some simple ideas behind unfamiliar language.
This post will try to walk the line between those extremes. We’ll explore the ideas behind machine learning without much math or much code, so we can pick out some of the landmarks that will help us navigate this brave new world.
In order to understand how this thing works, first we need to know what it’s aiming for.
Think about how we typically write software. As much as we like to use fancy abstractions like objects and functions to organise our code and make it easier to maintain, we’re ultimately writing a specific, linear sequence of steps that the computer should perform to turn some input into some output. If our program gets some input that we didn’t explicitly write instructions for, it’ll break. If we’ve done our job well, it’ll break in a predictable and consistent way—like a Web application responding with a 404 error when it gets a request for an unknown URL—but it’ll still break.
The killer feature of machine learning is generalisation: the ability to adapt to new kinds of input that we didn’t explicitly consider when we were building the system.
I recently wrote about using machine learning to understand ingredients in recipes. There’s no way I could consider every possible format for recipe ingredients—the rules of the English language are too complex and too vague—so I used machine learning to build a general system, instead of hand-coding a specific system.
The same applies to other machine learning applications. Think about recognising faces: it might be possible to build a specific system that could recognise my face, but we need a system that can generalise if it’s going to be useful for any face.
We’re aiming for generalisation, but most of the software we write consists of very specific instructions. This seems like a hard problem to solve, but there’s a good chance you’ve solved it before in another context.
Remember your high school science teacher telling you that you should pay attention because this will all come in useful one day? Today is that day.
Most high school science experiments aim to build a generalised system to make predictions about the world, and they build those systems by following specific instructions. Let’s walk through a high school science experiment, and draw parallels to machine learning.
The first stage of any experiment is to define our goal. What are we trying to understand about the world?
We’re going to look at a simple physics experiment to determine the relationship between the height a tennis ball is dropped from, and the height of the ball’s first bounce. Once we understand the relationship, we should be able to predict how high a tennis ball will bounce before it’s dropped.
A machine learning project starts the same way: we believe there’s a relationship between some input value or values, and some output value or values, but we don’t understand what it is. We want to build software that can make reasonable predictions about the output values based on input values.
Now that we understand what we’re looking for, we need to take some empirical observations of what happens in the real world when we bounce a tennis ball. For some experiments we might get lucky, and find a dataset some other scientist has collected that we can work from. Other times, we’re going to have to go out and collect our own data—we’re going to have to bounce a tennis ball a whole bunch of times.
Again, machine learning is similar: we’ll collect some data that’s relevant to our problem. Sometimes it’ll be there in the database of our Web application, or a public dataset on the Internet. Other times, we’ll have to collect it ourselves.
Once we’ve collected our data, we can take a look at it. Here’s a scatter plot of the drop heights and bounce heights we’ve observed for our tennis ball.
It looks like the data falls in a straight line, so we can draw a trend line on our chart to describe the pattern that we see.
While this trend line doesn’t look like much, it’s actually a powerful mathematical model. We took measurements at 1 metre increments, but our line is continuous—it fills in the gaps between our observations. In other words, our line can generalise to drop heights that we didn’t explicitly include when we were building the model.
There are an infinite number of straight lines we could draw, each of which is defined by two parameters:
- A fixed point, which we typically define as the point where it crosses the vertical axis. This is often referred to as the intercept, because it’s the point where the line intercepts the axis.
- The angle of the line, which we typically define using the gradient. The gradient is how far up the line goes each time it goes across by 1 unit; in our example, that’s how much the bounce height increases each time the drop height is increased by 1 metre.
Here’s our chart again, with controls to vary the intercept and gradient:
Once we decide on values of these two parameters, we have everything we need to make predictions using our line.
I’ve chosen to implement this using a StraightLine class that can represent any straight line, and an instance of that class that represents our specific straight line. This highlights an important difference between a flexible model that could work in a large number of situations, and the parameters we give that model to make it work in a specific situation.
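As a rough illustration of that split between model and parameters, here is a minimal sketch of what such a class could look like. It is not the original code from the talk: the `predict` method, the `example_line` variable, and the parameter values are all assumptions made up for this example.

```python
class StraightLine:
    """A flexible model: any straight line, defined by two parameters."""

    def __init__(self, gradient, intercept):
        self.gradient = gradient
        self.intercept = intercept

    def predict(self, drop_height):
        # Start at the intercept, then go up by `gradient` for each unit across.
        return self.gradient * drop_height + self.intercept


# A specific line. These parameter values are hypothetical, not measured.
example_line = StraightLine(gradient=0.5, intercept=0.0)

print(example_line.predict(3.0))  # 1.5
```

The class knows nothing about tennis balls; only the two numbers passed to it tie it to a particular situation.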
Our ideal trend line will fall as close as possible to the observations that we made—we want our mathematical model to agree with what we’ve observed in the real world. We can dismiss some lines just by looking at them, but once they start to get close to the observations it’s hard to pick out exactly the right one.
We can find out how close our trend line is to our observations by measuring the distances between the predictions made by the line and the observations we made. To make it easy to compare different lines to each other, we can then make a single error score from these measurements by squaring them (to make sure they’re all positive numbers), and then taking the average.
The differences between the observations and model predictions are shown on this version of the chart. Notice how the error changes as you change the parameters of the line.
If we find the gradient and intercept that give us the minimum possible error, we’ll know our line is a good fit for our data.
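The error score described above can be sketched in a few lines of Python. The observations here are invented for illustration rather than real measurements, and the function name `mean_squared_error` is just a descriptive choice.

```python
def mean_squared_error(gradient, intercept, observations):
    """Average of the squared gaps between the line and the observations."""
    squared_errors = [
        (gradient * drop + intercept - bounce) ** 2
        for drop, bounce in observations
    ]
    return sum(squared_errors) / len(squared_errors)


# Made-up (drop height, bounce height) pairs in metres, for illustration only.
observations = [(1.0, 0.5), (2.0, 1.1), (3.0, 1.4), (4.0, 2.0)]

close_fit = mean_squared_error(0.5, 0.0, observations)
poor_fit = mean_squared_error(1.0, 0.0, observations)

# The line that passes nearer the points gets the smaller error score.
print(close_fit < poor_fit)  # True
```

Squaring also means that one large miss counts for more than several small ones, which is worth remembering when a dataset contains outliers.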
The learning part of machine learning often refers to finding the best parameters we can for a flexible mathematical model, so that it fits a set of observations we’ve made of what’s happened in the past, known as our training data.
For our simple straight line model, there are only two parameters to find, and it’s easy to visualise the results. We don’t really need a computer’s help to find reasonable parameters for the model, but it provides a clear example of how a computer might be used to fit a model to some data.
A modern neural network model might have millions or even billions of parameters, and is capable of modelling much more complex relationships than this one. Even with a more complex model, the basic process remains the same.
A typical process looks like this:
- Start with random parameter values.
- Calculate the error: the difference between the training data and the model’s predictions using those parameter values.
- Make a small change to the parameter values, so that the error goes down.
- Repeat many times, until the error has stopped decreasing, has reached some target value, or we’ve performed a fixed number of repetitions.
There are various standard algorithms that can efficiently find parameters that give the minimum error. One technique is gradient descent, which uses calculus to work out how quickly the error is changing at any given point, and uses that to decide if each parameter should increase or decrease, and by how much. Machine learning libraries will provide implementations of various optimisation algorithms, so while it’s important to understand the concept of optimisation when you’re building machine learning systems, don’t be put off by mentions of calculus if that isn’t your strength.
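To make the loop concrete, here is a small sketch of gradient descent for the straight line model, written in plain Python. The training data, the learning rate, the number of steps, and the function names are all assumptions chosen for this example; a real project would normally lean on a library’s optimiser instead.

```python
import random

# Made-up (drop height, bounce height) pairs in metres, for illustration only.
training_data = [(1.0, 0.5), (2.0, 1.1), (3.0, 1.4), (4.0, 2.0), (5.0, 2.6)]


def mean_squared_error(gradient, intercept, data):
    return sum((gradient * x + intercept - y) ** 2 for x, y in data) / len(data)


def fit_line(data, learning_rate=0.05, steps=2000, seed=0):
    rng = random.Random(seed)
    # Start with random parameter values.
    gradient = rng.uniform(-1.0, 1.0)
    intercept = rng.uniform(-1.0, 1.0)
    n = len(data)
    for _ in range(steps):
        # Compare the current predictions with the training data.
        errors = [gradient * x + intercept - y for x, y in data]
        # Calculus tells us how the error changes as each parameter changes.
        slope_for_gradient = sum(2 * e * x for e, (x, _y) in zip(errors, data)) / n
        slope_for_intercept = sum(2 * e for e in errors) / n
        # Make a small change to each parameter so that the error goes down.
        gradient -= learning_rate * slope_for_gradient
        intercept -= learning_rate * slope_for_intercept
    # This sketch simply stops after a fixed number of repetitions.
    return gradient, intercept


gradient, intercept = fit_line(training_data)
print(gradient, intercept)
print(mean_squared_error(gradient, intercept, training_data))
```

The learning rate controls how big each small change is. If it is too large the error can grow instead of shrinking, and if it is too small the loop needs many more repetitions to settle.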
The goal of our experiment was to understand the relationship between the drop height and the bounce height well enough to make predictions about drop heights we’ve never observed. So far, we know we have a mathematical model that fits well with the examples we used to build the model. The true test is whether or not the model will generalise to other examples.
Fortunately, we’ve already figured out how to determine if the model agrees with a set of observations. To decide if it generalises well, we can compare the model’s predictions to a different set of observations that we didn’t use to train the model.
If the model generalises well, its predictions should agree reasonably closely with these new observations.
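In code, that check could look something like the sketch below. The parameter values stand in for a line that has already been fitted, and both datasets are invented for illustration; the point is only that the same error calculation is run on observations the model has not seen.

```python
def mean_squared_error(gradient, intercept, observations):
    return sum(
        (gradient * drop + intercept - bounce) ** 2
        for drop, bounce in observations
    ) / len(observations)


# Hypothetical parameters, standing in for a line fitted to the training data.
gradient, intercept = 0.5, 0.0

# Made-up observations in metres. The test drops use heights that
# do not appear in the training data.
training_data = [(1.0, 0.5), (2.0, 1.1), (3.0, 1.4), (4.0, 2.0)]
test_data = [(1.5, 0.8), (2.5, 1.2), (3.5, 1.8)]

training_error = mean_squared_error(gradient, intercept, training_data)
test_error = mean_squared_error(gradient, intercept, test_data)

# Similar, small errors on both sets suggest the line generalises.
print(training_error, test_error)
```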
If the model doesn’t generalise, we’re usually facing one of two problems.
Overfitting means that our model has agreed so closely with the examples we used to build it that it’s not capable of generalising to new examples. When this happens, it’s often useful to re-train the model using more examples.
Underfitting means that our model doesn’t agree closely enough with the examples we used to build it, and so it’s not able to generalise either. For example, if we tried to train a straight line model to fit points on a curved line, it wouldn’t be able to get very close. When this happens, using a more complex model can help.
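The curved line example can be sketched as follows. The points are made up so that they lie exactly on the curve y = x squared. To keep the code short, the line is fitted with the standard closed-form least squares formula rather than an iterative search, which is a substitution made for this example.

```python
# Made-up points that lie exactly on the curve y = x squared.
curved_data = [(0.0, 0.0), (1.0, 1.0), (2.0, 4.0), (3.0, 9.0), (4.0, 16.0)]


def best_fit_line(data):
    """Closed-form least squares fit of a straight line."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in data)
    variance = sum((x - mean_x) ** 2 for x, _ in data)
    gradient = covariance / variance
    return gradient, mean_y - gradient * mean_x


def mean_squared_error(predict, data):
    # `predict` is any function that maps an input value to a prediction.
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


gradient, intercept = best_fit_line(curved_data)
line_error = mean_squared_error(lambda x: gradient * x + intercept, curved_data)
curve_error = mean_squared_error(lambda x: x ** 2, curved_data)

# Even the best possible straight line leaves a noticeable error on its
# own training data, while a curved model can match these points exactly.
print(line_error, curve_error)
```

The error that matters here is the one on the training data itself: when the best parameters still leave a large gap, no amount of extra optimisation will help, and the model’s shape is the thing to change.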
For this simple example, where we have one input value and one output value, we can look at the data on a scatter plot and make an educated guess at how we should model it. We can see that the points lie in a straight line, so we use a straight line as our model.
Most real world situations are too complex to visualise in a simple diagram, so it might not be clear which model to use.
You might have to try out several models, and see which performs best.
Once we have a model that generalises well enough for our purposes, we can use it to make predictions.
I like that we use the word prediction for the output of our model. It reminds me that the output is an educated guess, and not a 100% accurate decision. How accurate the model needs to be depends on the problem we’re using it to solve—a model predicting the right move to make in a game probably has more room for error than a model predicting if a self-driving car should apply the brakes.
Now that we’ve seen the structure of a basic machine learning system, how do we get from here to implementing our own machine learning systems?
If you want to develop your own machine learning systems, there are three resources I’d recommend to get started:
- The fast.ai online course for a hands-on introduction to building practical machine learning systems in Python.
- The book Fundamentals of Machine Learning for Predictive Analytics for a wider overview of other types of machine learning, and more mathematical background.
- Andrew Ng’s Coursera course for a thorough (but math-heavy) introduction to the low-level details.
This post has only covered supervised learning, which refers to algorithms that learn from examples where we have both the input and the desired output. This is often referred to as labelled data, because the input values are labelled with the expected output. While this is a popular and powerful technique, there are others that work differently.
Other techniques you may want to explore include:
- Unsupervised learning, which refers to algorithms that learn from examples where we know the inputs but not the outputs. This kind of algorithm is useful for finding structure in data; for example, we can use unsupervised learning to find clusters of values.
- Reinforcement learning, which refers to algorithms that learn from trial and error to maximise some reward. The latest versions of AlphaGo use reinforcement learning to learn how to play games without needing any examples of how humans play.