Machine learning has been a hot topic lately due to substantial advances in classification, image segmentation, and many other tasks where the best-performing methods are now often based on deep neural networks. Machine learning can be applied even when a mathematical description of the model is not at hand or is too complicated. This talk gives a brief introduction to machine learning with a focus on artificial neural networks.
What is machine learning?
Machine learning describes the automated process of learning a predictive model from data. Work on machine learning already started in the 1950s, though research slowed down in the 1970s. Due to increasing computing power, it has seen a revival lately, with many breakthroughs reported in recent years. Machine learning methods are very flexible and can be applied to many different kinds of data. This generality is often achieved by using nonlinear models with many free parameters. The trade-off is a high computational cost for training the model and a large amount of required training data. In contrast, physical models consist of a clear, well-thought-out mathematical model with the smallest number of parameters, which can be determined from a few observations. Thus, machine learning methods are advantageous when a clear mathematical description of the problem is not at hand (or too complex) but a huge amount of training data is available, e.g. for ill-posed inverse problems, image classification, or spam filtering.
One differentiates between supervised and unsupervised learning. In supervised learning, training data is available together with labels. During training, the model learns the mapping from the data to the labels. The trained model is then supposed to predict the correct labels for previously unseen input data; in other words, it should generalize to unknown data in a sensible way. The labels to predict can be continuous, in which case the problem is known as regression, or discrete, which is called classification (binary labels being the simplest case). A machine learning task is said to be unsupervised if only unlabeled data is given. Here, the task is to learn some kind of structure in the data that can be used to make predictions. Important examples of unsupervised learning tasks are dimensionality reduction and clustering. A real-life example of unsupervised learning would be to use the customer data of an online shop to suggest further items a customer might be interested in.
Linear Regression: Linear regression learns a linear relationship between an input vector $X$ and output variables $Y$, given by the equation $$Y_j = \sum_{i} A_{ij} X_i + B_j.$$ Here $Y$ is the output data, $X$ is the input data vector, and $A$ and $B$ are the parameters to be estimated by the learning process. Linear regression is a widely used statistical tool for analysing data and predicting output variables for new data. The model is usually fitted by the method of least squares, which yields a closed-form expression for the parameter estimates.
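As a minimal sketch of how such a fit might look in practice (the toy data and variable names below are illustrative, not from the talk), the bias $B$ can be absorbed into the design matrix and both parameters estimated with NumPy's least-squares solver:

```python
import numpy as np

# Hypothetical toy data: 100 samples of a 3-dimensional input X
# and a 2-dimensional output Y, generated from known parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
A_true = np.array([[1.0, -2.0], [0.5, 0.0], [3.0, 1.0]])
B_true = np.array([0.2, -0.7])
Y = X @ A_true + B_true + 0.01 * rng.normal(size=(100, 2))

# Append a column of ones so the bias B is estimated together with A.
X1 = np.hstack([X, np.ones((X.shape[0], 1))])

# Closed-form least-squares estimate of the stacked parameters [A; B].
params, *_ = np.linalg.lstsq(X1, Y, rcond=None)
A_hat, B_hat = params[:-1], params[-1]

# Predict outputs for new input data with the fitted model.
Y_pred = X1 @ params
```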
Artificial Neural Networks: Artificial neural networks have been shown to solve many machine learning problems quite well. They consist of multiple layers of so-called neurons. These neurons multiply the input $X_i$ with weights $W_{ij}$, just as in linear models. Each neuron can also have a so-called bias $B_j$ that is added to the result. To make the model nonlinear, an activation function $\sigma$ is then applied to the output of the artificial neuron. The output of a layer of an artificial neural network can thus be described by the following equation:
$$ Y_j = \sigma\Big(\sum_i W_{ij} X_i + B_j\Big). $$
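A minimal NumPy sketch of this forward pass, assuming a small two-layer network with a tanh activation (the shapes and random initialization below are purely illustrative):

```python
import numpy as np

def dense_layer(x, W, b, sigma=np.tanh):
    """One dense layer: y_j = sigma(sum_i W_ij x_i + b_j).

    x: input vector of length n_in; W: (n_in, n_out) weight matrix;
    b: bias vector of length n_out; sigma: activation function
    (tanh here is just one common choice).
    """
    return sigma(x @ W + b)

# Illustrative shapes only: 4 inputs, a hidden layer of 8 neurons,
# 2 outputs.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)), np.zeros(2)

x = rng.normal(size=4)
hidden = dense_layer(x, W1, b1)       # hidden layer output
output = dense_layer(hidden, W2, b2)  # network output
```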
When several such so-called dense layers are stacked on top of each other, the layers that do not produce the output directly are called hidden layers. Figure 1 shows a graphical representation of a three-layer neural network with one hidden layer. Modern network architectures use many such hidden layers to extract nonlinear features from the data (deep neural networks). Neural networks are trained to minimize a loss function, which is usually defined as the least-squares error between the correct output and the calculated output. For training, some variant of stochastic gradient descent is used, for which the gradient of the loss function $E$ is needed. To calculate the gradient, one first computes the output of all neurons and the prediction error of the network. Once the input to all neurons and the error are known, one can calculate the derivative of the loss function with respect to the weights of each layer using the backpropagation algorithm. Finally, the parameters are updated using some variant of a gradient descent scheme:
$$ w_j \leftarrow w_j - \alpha \frac{\partial E}{\partial w_j}$$
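To make the procedure concrete, here is a minimal NumPy sketch of backpropagation and gradient descent for a small two-layer network with a least-squares loss; the toy data, shapes, and learning rate are illustrative assumptions, not a prescription from the talk:

```python
import numpy as np

# Least-squares loss E = 0.5/N * sum ||Y_pred - Y||^2, tanh hidden layer,
# linear output layer, plain (full-batch) gradient descent.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))                # toy training inputs
Y = np.sin(X[:, :2])                         # toy training targets
W1, b1 = 0.1 * rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = 0.1 * rng.normal(size=(8, 2)), np.zeros(2)
alpha = 0.05                                 # learning rate

for step in range(1000):
    # Forward pass: store intermediate outputs for backpropagation.
    H = np.tanh(X @ W1 + b1)                 # hidden layer output
    Y_pred = H @ W2 + b2                     # network output

    # Backward pass: apply the chain rule layer by layer.
    dY = (Y_pred - Y) / len(X)               # dE/dY_pred
    dW2, db2 = H.T @ dY, dY.sum(axis=0)
    dH = dY @ W2.T * (1.0 - H**2)            # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ dH, dH.sum(axis=0)

    # Gradient descent update: w <- w - alpha * dE/dw.
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
```

In stochastic gradient descent, the same update would be computed on small random subsets (mini-batches) of the training data instead of the full set.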
A disadvantage of deep neural networks is that the weights often do not have any meaningful interpretation, and it is therefore often not clear how exactly the network generalizes to unknown data.
Convolutional Neural Networks
Standard dense layers are very costly to train on images as input data, due to the amount of preprocessing needed to achieve translational invariance, which is often desired, and the huge number of parameters in every layer. Convolutional neural networks solve this problem by learning the kernels of image convolutions. Every convolutional layer thus produces a map of learned features, which is usually downsampled and given as input to the next layer. Convolutional neural networks are translation invariant since the learned kernels are applied at every position in the input image. Since convolutional layers fit nicely into the backpropagation scheme, the training process works just like for standard neural networks. Machine learning approaches based on convolutional neural networks have been shown to outperform most other approaches for image classification and segmentation tasks.
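As an illustration of the core operation, here is a naive single-kernel sketch in NumPy (not an efficient or batched implementation; the image size, kernel size, and ReLU activation are illustrative choices):

```python
import numpy as np

def conv2d(image, kernel, sigma=lambda z: np.maximum(z, 0.0)):
    """Valid 2D convolution of a single-channel image with one learned
    kernel, followed by an activation (ReLU here as one common choice).
    A convolutional layer holds many such kernels, each producing one
    feature map. As in most deep learning frameworks, this is actually
    cross-correlation; since the kernel is learned, the flip is
    immaterial."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The same kernel weights are applied at every position,
            # which is what gives the layer its translation invariance.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return sigma(out)

# Illustrative use: one 3x3 kernel over a 28x28 image; downsampling
# (e.g. 2x2 pooling) would typically follow before the next layer.
rng = np.random.default_rng(3)
feature_map = conv2d(rng.normal(size=(28, 28)), rng.normal(size=(3, 3)))
```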