The hype around machine learning shows no sign of slowing down. But while companies set out to hire more data scientists and machine learning engineers, many of us are still left wondering what it is exactly.
In this two-part blog series, machine learning engineer Maartens Lourens sets out to introduce machine learning concepts and solutions from the point of view of an engineer.
In Part One we used classical machine learning techniques with Scikit Learn to perform Log Classification. In the current post, we will implement a solution for the problem using neural networks. More specifically, to give us a broader perspective, we will implement the solution in two different popular neural network frameworks, namely Keras and Pytorch.
What are neural networks, and how are they different from classical machine learning techniques?
Artificial neural networks were inspired by biological neurons. The first neural network, called a perceptron, was invented by Frank Rosenblatt in the late 1950s. It was a simple linear binary classifier with a vector for input resulting in a binary yes or no as output.
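As a rough illustration of what such a classifier computes, here is a minimal perceptron sketch in plain Python. The weights, bias and inputs are made-up values chosen so that the perceptron behaves like a logical AND; they are assumptions for the example and not taken from Rosenblatt's work.

```python
def perceptron(inputs, weights, bias):
    """Return 1 ("yes") if the weighted sum of the inputs exceeds zero, else 0 ("no")."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if weighted_sum > 0 else 0


# Hypothetical weights and bias that make the perceptron act like a logical AND.
and_weights = [1.0, 1.0]
and_bias = -1.5

# Evaluate all four combinations of two binary inputs.
and_outputs = [perceptron([a, b], and_weights, and_bias) for a in (0, 1) for b in (0, 1)]
print(and_outputs)
```

Only the input (1, 1) pushes the weighted sum above zero, so it is the only case that yields a "yes".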
The perceptron was followed by a multilayer architecture, the multilayer perceptron. MLPs were capable of more complex processing, and have grown in complexity and sophistication to become the deep learning neural architectures we have today. The evolution of neural networks was not straightforward, with progress often set back during so-called AI winters.
We can think of a multilayer neural network as a function. Each layer of nodes stores a vector of values. The value of each node in the next layer is computed as a weighted sum of the values from the previous layer, using a matrix of adjustable weights and an additional bias, which is then passed through an activation function.
These calculations require matrix operations, and since GPUs are optimised for matrix calculations, GPUs have become very popular in neural network machine learning.
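To make the "network as a function" view concrete, the sketch below computes one layer from the previous one in plain Python: a weighted sum per node plus a bias, passed through an activation function (ReLU here). The layer sizes, weights and biases are arbitrary illustrative values. Real frameworks do the same thing with optimised matrix operations.

```python
def relu(value):
    """ReLU activation: negative values become zero."""
    return max(0.0, value)


def dense_layer(previous, weights, biases, activation=relu):
    """Compute the next layer's values.

    `weights` holds one row per node in the next layer, and each row
    contains one weight per node in the previous layer.
    """
    return [
        activation(sum(w * x for w, x in zip(row, previous)) + b)
        for row, b in zip(weights, biases)
    ]


previous_layer = [1.0, 2.0, 3.0]      # three nodes in the previous layer
weights = [[0.5, -1.0, 0.5],          # weights feeding the first node
           [1.0, 1.0, -1.0]]          # weights feeding the second node
biases = [0.5, -1.0]

next_layer = dense_layer(previous_layer, weights, biases)
print(next_layer)
```

The second node ends up with a negative weighted sum, which ReLU clamps to zero.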
For our purpose in this post we will focus on a simple artificial neural network (ANN) architecture with its activation functions. For a deep dive I can recommend 3Blue1Brown’s excellent series of videos or Michael Nielsen’s online book “Neural Networks and Deep Learning”.
Keras and Pytorch
In Part One we saw how Scikit Learn made it easy for us to process and prepare our data. We then trained a variety of classical algorithms on our data and compared the accuracy of the models.
Neural network frameworks require a bit more legwork. Each framework exposes parts of the neural network architecture, which gives a lot of flexibility. This flexibility is an advantage, but it also places more of a burden on the developer.
Note that we don’t have to use dedicated neural network frameworks to build a neural network. We could use a numeric library like Numpy to build one from scratch. Nevertheless, we’d be missing out on many of the conveniences provided by dedicated libraries, such as tensors, dedicated classes and functions for specific architectures, and optimisations for GPU processing.
Keras and Pytorch are two such dedicated machine learning libraries. Keras, developed by François Chollet, is the more high-level of the two. Chollet conceived of Keras as an interface to more low-level libraries like Theano or Tensorflow. It provides an intuitive set of abstractions as building blocks, which has made it quite popular.
Pytorch, which is primarily developed by researchers at Facebook AI Research, is more directly comparable to one of Keras’ backends (e.g. Tensorflow). Pytorch provides automatic differentiation via a module called autograd, as well as a more dedicated neural network module. Its flexibility makes it attractive for machine learning research and prototyping. In terms of coding style it is considered to be quite Pythonic, more so than, for example, Tensorflow.
We will implement our solution in both Keras and Pytorch to provide two distinct flavours of neural networks.
Before we continue, let’s remind ourselves of the problem we are trying to solve, which in Part One we called Log Classification.
When we look at data logs such as the system logs and wifi logs below, it’s easy to spot differences between them. For example, the date formats are different, and the wifi logs have angle brackets around the process names.
Jul 3 03:42:41 Maartenss-MacBook-Pro parsecd: BUG in libdispatch client: dispatch_mig_server: mach_msg() failed (ipc/send) msg too small - 0x10000008
Jul 3 03:42:41 Maartenss-MacBook-Pro systemstats: assertion failed: 17E199: systemstats + 689866 [D9E75C38-62FE-3D77-9BE3-5F6D38EF0767]: 0x5
Jul 3 03:42:41 Maartenss-MacBook-Pro systemstats: assertion failed: 17E199: systemstats + 914800 [D9E75C38-62FE-3D77-9BE3-5F6D38EF0767]: 0x40
Jul 3 03:42:41 Maartenss-MacBook-Pro com.apple.xpc.launchd (com.apple.WebKit.Networking.1E7707D2-AE49-4AE8-C73C-1A9DE74D21DE): Service exited with abnormal code: 1
Tue Jul 3 03:42:41.149 <kernel> Creating all peerManager reporters
Tue Jul 3 03:42:41.184 <airportd> _initLocaleManager: Started locale manager
Tue Jul 3 03:42:41.194 <airportd> airportdProcessDLILEvent: en0 attached (down)
Tue Jul 3 03:42:41.219 <kernel> wl0: setAWDL_PEER_TRAFFIC_REGISTRATION: active 0, roam_off: 0, err 0 roam_start_set 0 forced_roam_set 0
Tue Jul 3 03:42:41.269 <kernel> AirPort_Brcm43xx::syncPowerState: WWEN[disabled]
Ignoring the date format, suppose we received the following log line. Would we still be able to say with certainty what type of log line it is?
--- last message repeated 1 time ---
We could do a search of course, but what if something more subtle changes, for example:
--- last message repeated 19 times ---
It becomes difficult to cater to all the exceptions, especially if there are potentially hundreds or thousands of exceptions and variations. This is a task that’s cumbersome for a human to code into logic, but potentially easy for a computer to learn using a guided form of machine learning called supervised machine learning.
Our task, therefore, is to let the computer learn this on its own. We’ve been calling this log classification, which is a subset of a more generic approach called document classification.
Let’s take a look at the solution from a neural network point of view, and how this differs from our approach in Part One. We can assume that our artificial neural network will have an input layer, one (or more) hidden layer(s), and an output layer.
Figure 1: Neural Network Architecture
We also need to consider the nodes in each layer. How do we know how many we are going to need?
To help us we should go back and look at our data. Recall that data preprocessing was an important part of our training process.
Figure 2: Training a Model
Data preparation included a chained set of operations starting with a count vectoriser, followed by a tf-idf transformer. The count vectoriser tokenises the data (our logs), identifies a unique vocabulary, and then counts how often each token in that vocabulary occurs in each log line. If we think of the vocabulary items as a set of data points per row, where a data point is zero if that token is absent from the log line, or a count (one or more) when it is present, we can see that only a small fraction of items will be non-zero. The resulting matrix is therefore called a sparse matrix, a property that can be leveraged to optimise further operations.
The tf-idf algorithm (short for term frequency, inverse document frequency) then weighs each token by how often it occurs in a log line relative to how common it is across all the log lines, so that tokens appearing almost everywhere count for less.
The resulting matrix will have dimensions such that each row will be the length of the vocabulary. So we can conclude that this is the right number of nodes for our input layer.
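The two preparation steps can be sketched by hand in plain Python. This uses the textbook weighting tf × log(N / df), which is an assumption for illustration: Scikit Learn and Keras apply their own smoothing and normalisation, so their numbers will differ. The three log lines are invented.

```python
import math
from collections import Counter

# Three made-up log lines standing in for the real data.
logs = [
    "kernel wifi link up",
    "kernel wifi link down",
    "install package done",
]

# Count vectoriser step: tokenise, build the vocabulary, count tokens per line.
tokenised = [line.split() for line in logs]
vocabulary = sorted({token for tokens in tokenised for token in tokens})
counts = [[Counter(tokens)[term] for term in vocabulary] for tokens in tokenised]

# tf-idf step: down-weight terms that appear in many log lines.
num_docs = len(logs)
doc_freq = [sum(1 for row in counts if row[i] > 0) for i in range(len(vocabulary))]
tfidf = [
    [count * math.log(num_docs / doc_freq[i]) for i, count in enumerate(row)]
    for row in counts
]

print(vocabulary)
print(len(tfidf), "rows x", len(tfidf[0]), "columns")
```

Every row has one entry per vocabulary item, which is why the vocabulary size becomes the size of the input layer. Most entries are zero, which is the sparsity mentioned above.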
At the other end is a fixed number of categories that we can predict, namely the log file types. We will therefore use the number of log file types as the number of nodes for the output layer.
That only leaves the hidden layer and its nodes. We can start out with a single hidden layer. The number of nodes in this layer isn’t fixed in the same way as in the input and output layers, as we can adjust it with reruns to see what works best. We will select 512 nodes to start with.
We will also need to tune a set of parameters called the hyperparameters. These are simply the parameters that need to be set before training starts, as opposed to parameters that emerge as properties of the model during the training process.
We’ve already looked at some of them: the size of the input layer, hidden layer, and output layer. However we also need to choose the functions we will be using, and some additional parameters like batch size, number of epochs, dropout and learning rate.
The neural network processes rely on a number of functions. The loss function represents the cost (or penalty) for inaccurate predictions. Related to it is the optimiser, the algorithm that minimises the loss function. The idea is that by minimising the loss function we are improving the accuracy of the model, and the optimiser helps us to find that minimum. Finally, for each hidden and output layer we need an activation function. The activation function defines the value of the output node given its input.
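To give a feel for how an output activation and a loss function fit together, here is a small plain-Python sketch of softmax followed by cross entropy for a single sample. The raw scores are hypothetical, and the functions are simplified illustrations, not the implementations used by Keras or Pytorch.

```python
import math


def softmax(scores):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probabilities, true_index):
    """Loss for one sample: small when the true class gets a high probability."""
    return -math.log(probabilities[true_index])


scores = [2.0, 0.5, -1.0]            # hypothetical raw outputs for three log types
probabilities = softmax(scores)

loss_if_right = cross_entropy(probabilities, 0)   # true class is the top-scoring one
loss_if_wrong = cross_entropy(probabilities, 2)   # true class is the lowest-scoring one
print(probabilities, loss_if_right, loss_if_wrong)
```

The penalty is much larger when the true class received a low probability, and that is the signal the optimiser uses to adjust the weights.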
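There are naturally many choices available for each of these. The following represents a reasonable set of choices given our use case:
Loss function: categorical cross entropy
Optimiser: Adam
Activation functions: ReLU (hidden layer) and softmax (output layer)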
Batch size, Epochs, Dropout and Learning Rate
We will need to tell our algorithm how much data to look at during each training iteration. The size of that chunk of data is called the batch size. Once training has iterated through all the data, batch by batch, it has reached its first full epoch. If the batch size is the same as the full data size, then there will be just one iteration per epoch. However, the use of smaller chunks or batches is recommended and is usually called mini batching. A batch size of 32 is a good default.
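The relationship between data size, batch size and iterations per epoch can be sketched in a few lines of Python. The 100 samples are a stand-in for real training rows, and the helper name iterate_batches is made up for this example.

```python
import math


def iterate_batches(data, batch_size):
    """Yield consecutive slices of `data`, each at most `batch_size` items long."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


samples = list(range(100))           # stand-in for 100 training rows
batch_size = 32

batches = list(iterate_batches(samples, batch_size))
iterations_per_epoch = math.ceil(len(samples) / batch_size)

print(iterations_per_epoch)
print([len(batch) for batch in batches])
```

Note that the final batch is smaller when the data size is not a multiple of the batch size. A loop that rounds the number of batches down instead would simply skip those leftover rows.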
Dropout is a regularisation technique that helps to reduce overfitting, thereby improving the model’s chances of generalising. Dropout amounts to randomly ignoring a fraction of the nodes or units during each training pass.
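A hand-rolled sketch of the idea, using the common "inverted dropout" variant in which the surviving values are scaled up so the expected total stays the same. The activations, the seed and the helper name are illustrative assumptions. In practice the framework's dropout layer does this for us, and only during training.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; scale the survivors to compensate."""
    keep_probability = 1.0 - rate
    return [
        value / keep_probability if rng.random() < keep_probability else 0.0
        for value in values
    ]


rng = random.Random(0)               # fixed seed so the sketch is repeatable
activations = [1.0] * 1000           # hypothetical hidden-layer outputs
dropped = dropout(activations, rate=0.3, rng=rng)

fraction_zeroed = dropped.count(0.0) / len(dropped)
print(round(fraction_zeroed, 2))
```

With a rate of 0.3, roughly 30% of the values are zeroed on any given pass, and a different random subset is dropped each time.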
Learning rate describes the size of the steps training takes towards the minimum of the cost function, using a specific optimiser. The caveat is that if the learning rate is too high training will overshoot the minimum, so a lower learning rate is more likely to converge, although it will take longer.
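The trade-off shows up even on a toy problem. The sketch below runs plain gradient descent on f(x) = x², which has its minimum at x = 0, with three different learning rates. The function, starting point and rates are chosen purely for illustration and say nothing about good values for a real network or for the Adam optimiser.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimise f(x) = x**2, whose gradient is 2*x and whose minimum is at x = 0."""
    x = start
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x


small_rate = gradient_descent(learning_rate=0.01)   # heads towards 0, but slowly
good_rate = gradient_descent(learning_rate=0.3)     # gets very close to 0
large_rate = gradient_descent(learning_rate=1.1)    # overshoots further on every step

print(small_rate, good_rate, large_rate)
```

After the same number of steps the small rate is still far from the minimum, the moderate rate has effectively reached it, and the large rate has moved away from it.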
It is worth noting that the best choice of hyperparameters is not always obvious. Only after training and observing the results with different combinations of values is it possible to make an informed decision. If you have the time and resources available, an automated approach to hyperparameter tuning can help.
We will go with the following values, which provide a good balance between performance and convergence.
Batch size: 32
Learning rate: 0.0005
Some of the functionality was already covered in Part One. For example, our read_data functions don’t change.
As before, we will combine the count vectoriser and tf-idf transformer. Keras has a tokeniser that comes with its own built-in tf-idf transformer. For Pytorch we use Scikit Learn’s TfidfVectorizer, which is an all-in-one count vectoriser and tf-idf transformer.
We also use a LabelBinarizer to create our set of target labels (the file types) to supervise the neural network.
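What the encoder produces can be pictured with a small hand-rolled sketch: one column per class, with a 1 in the column of the row's own class. This mimics what LabelBinarizer returns when there are three or more classes. The helper name and the sample labels are made up for the example.

```python
def one_hot_encode(labels):
    """Return the sorted class names and one row of 0/1 flags per label."""
    classes = sorted(set(labels))
    rows = [[1 if label == cls else 0 for cls in classes] for label in labels]
    return classes, rows


labels = ["wifi.log", "system.log", "install.log", "wifi.log"]
classes, y = one_hot_encode(labels)

print(classes)
print(y)
```

The number of columns equals the number of log file types, which is why it doubles as the size of the output layer.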
    # Keras
    from keras.preprocessing.text import Tokenizer
    from sklearn.preprocessing import LabelBinarizer

    def prepare_data(text, labels):
        # Tokenise the log lines and build a tf-idf matrix
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(text)
        X = tokenizer.texts_to_matrix(text, mode='tfidf')
        # One-hot encode the file types (LabelBinarizer, matching the import)
        encoder = LabelBinarizer()
        encoder.fit(labels)
        y = encoder.transform(labels)
        return X, y
    # Pytorch
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import LabelBinarizer

    def prepare_data(text, labels):
        # Count vectoriser and tf-idf transformer in one step
        tfidf_transformer = TfidfVectorizer()
        X = tfidf_transformer.fit_transform(text).toarray()
        # One-hot encode the file types
        encoder = LabelBinarizer()
        encoder.fit(labels)
        y = encoder.transform(labels)
        return X, y
As we did previously, we’ll use the train_test_split utility to divide the data into training and testing data.
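For readers who have not seen Part One, the idea behind such a split can be sketched in plain Python: shuffle the rows and their labels together, then hold back a fraction for testing. The helper name, the 20% test fraction and the seed are assumptions for this example. Scikit Learn's utility offers more options and is what the notebooks use.

```python
import random


def simple_train_test_split(rows, labels, test_fraction, seed):
    """Shuffle rows and labels together, then hold back a fraction as test data."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    test_size = int(len(rows) * test_fraction)
    test_idx, train_idx = indices[:test_size], indices[test_size:]
    return (
        [rows[i] for i in train_idx], [rows[i] for i in test_idx],
        [labels[i] for i in train_idx], [labels[i] for i in test_idx],
    )


rows = [[i] for i in range(10)]              # ten made-up feature rows
labels = [i % 2 for i in range(10)]          # a made-up label per row

X_train, X_test, y_train, y_test = simple_train_test_split(rows, labels, 0.2, seed=42)
print(len(X_train), len(X_test))
```

The important property is that each row keeps its own label after shuffling, and that no row ends up in both sets.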
The Neural Network
The neural network is at the heart of our process. We’ve already discussed its design; now we need to build it. As we see below, Keras makes it really easy for us to compose the neural network, whereas in Pytorch we need to subclass the base class nn.Module.
The result is that the neural network object we instantiate has different properties depending on whether we’re working in Keras or Pytorch. The Pytorch version is a bit more work, but also more flexible. We’ll see this more clearly in the training section.
    # Keras
    from keras.models import Sequential
    from keras.layers import Activation, Dense, Dropout

    def build_nn(input_size, hidden_size, num_classes, dropout):
        nn = Sequential()
        # Hidden layer
        nn.add(Dense(hidden_size, input_shape=(input_size,)))
        nn.add(Activation('relu'))
        nn.add(Dropout(dropout))
        # Output layer
        nn.add(Dense(num_classes))
        nn.add(Activation('softmax'))
        nn.summary()
        return nn

    input_size = X_train.shape[1]  # this is the vocab size
    hidden_size = 512
    num_classes = y_train.shape[1]  # one column per log file type
    dropout = 0.3
    network = build_nn(input_size, hidden_size, num_classes, dropout)
    # Pytorch
    import torch.nn as nn

    class NN(nn.Module):
        def __init__(self, input_size, hidden_size, num_classes, dropout):
            super(NN, self).__init__()
            self.main = nn.Sequential(
                nn.Linear(input_size, hidden_size),
                nn.ReLU(),
                nn.Dropout(dropout),
                nn.Linear(hidden_size, num_classes),
            )

        def forward(self, input):
            return self.main(input)

    input_size = X_train.shape[1]  # this is the vocab size
    hidden_size = 512
    num_classes = y_train.shape[1]  # one column per log file type
    dropout = 0.3
    network = NN(input_size, hidden_size, num_classes, dropout)
Note that we didn’t specify the softmax activation in Pytorch’s output layer. The reason is that it is already included in Pytorch’s cross entropy loss function (nn.CrossEntropyLoss), which applies a log-softmax to the network’s outputs before computing the loss during training.
Differences are more pronounced in the training phase. Keras has built-in methods for putting it all together (compile) and performing the training (fit). With Pytorch we need to do some of the legwork ourselves.
For example, we need to consider batching, which in Keras is automatically part of the fit method. For Pytorch we create our own helper function. Luckily that’s pretty straightforward.
    # Keras
    from keras.optimizers import Adam

    def train(X_train, y_train, criterion, optimiser, batch_size, num_epochs):
        network.compile(loss=criterion, optimizer=optimiser, metrics=['accuracy'])
        # Batching and a 10% validation split are handled by fit
        history = network.fit(X_train, y_train,
                              batch_size=batch_size,
                              epochs=num_epochs,
                              verbose=1,
                              validation_split=0.1)
        return network

    num_epochs = 5
    batch_size = 50
    learning_rate = 0.0005
    criterion = 'categorical_crossentropy'
    optimiser = Adam(learning_rate=learning_rate)  # older Keras releases name this argument lr
    model = train(X_train, y_train, criterion, optimiser, batch_size, num_epochs)
    # Pytorch
    import numpy as np
    import torch
    from tqdm.notebook import tqdm

    def train(X_train, y_train, num_epochs, batch_size):
        for epoch in tqdm(range(num_epochs)):
            total_batches = int(len(X_train) / batch_size)
            # Loop over all batches
            for i in tqdm(range(total_batches)):
                X_batch, y_batch = get_batch(X_train, y_train, i, batch_size)
                data = torch.FloatTensor(X_batch)
                labels = torch.LongTensor(y_batch)
                # CrossEntropyLoss expects class indices, not one-hot rows
                labels = torch.max(labels, 1)[1]
                optimiser.zero_grad()
                outputs = network(data)
                loss = criterion(outputs, labels)
                loss.backward()
                optimiser.step()
            print('Epoch [%d/%d], Loss: %.4f' % (epoch + 1, num_epochs, loss.item()))

    def get_batch(X_train, y_train, i, batch_size):
        # Slice out the i-th batch of rows and labels
        data = X_train[(i * batch_size):((i * batch_size) + batch_size)]
        labels = y_train[(i * batch_size):((i * batch_size) + batch_size)]
        return np.array(data), np.array(labels)

    num_epochs = 5
    batch_size = 50
    learning_rate = 0.0005
    criterion = nn.CrossEntropyLoss()
    optimiser = torch.optim.Adam(network.parameters(), lr=learning_rate)
    train(X_train, y_train, num_epochs, batch_size)
As with our Scikit Learn version in Part One we will use the test data to make predictions that we can use to measure the performance of our models.
Figure 3: Making predictions
Pytorch requires a bit more work as we cast our test data to a tensor, but otherwise the two processes are equivalent.
    # Keras
    import numpy as np

    file_types = np.unique(log_collection['type'])
    predictions = model.predict(np.array(X_test))
    # The highest-scoring output node gives the predicted file type
    predicted_labels = [file_types[np.argmax(p)] for p in predictions]
    actual_labels = [file_types[np.argmax(y)] for y in y_test]
    # Pytorch
    network.eval()  # switch off dropout while predicting
    # Cast the test data to a float tensor
    test_inputs = torch.from_numpy(X_test).float()
    predicted = network(test_inputs)
    predicted_classes = [np.argmax(p) for p in predicted.detach().numpy()]
    file_types = np.unique(log_collection['type'])
    predicted_labels = [file_types[p] for p in predicted_classes]
    actual_labels = [file_types[np.argmax(y)] for y in y_test]
As with Scikit Learn we are interested in observing and comparing the accuracy, precision, recall, and F1-Score. We will use the same utility functions we used in Part One.
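As a reminder of how these metrics are derived, the sketch below computes precision, recall and F1 for one class from a confusion matrix in plain Python. It is not the utility code from Part One. The helper name is made up, and the 2×2 matrix is the install.log / system.log block of the first performance report further down, with rows taken as the actual classes and columns as the predictions.

```python
def precision_recall_f1(matrix, class_index):
    """Metrics for one class; rows of `matrix` are actual classes, columns are predictions."""
    true_positives = matrix[class_index][class_index]
    predicted_total = sum(row[class_index] for row in matrix)   # column sum
    actual_total = sum(matrix[class_index])                      # row sum
    precision = true_positives / predicted_total
    recall = true_positives / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# install.log and system.log rows and columns from the first report below.
matrix = [
    [3588, 19],
    [207, 2700],
]

precision, recall, f1 = precision_recall_f1(matrix, 0)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Rounded to two decimals, these values reproduce the install.log row of that report: the 207 system.log lines mistaken for install.log lower its precision, while the 19 install.log lines that were missed barely dent its recall.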
It can take a while to train a model on a large dataset. If it takes too long it’s worth looking into GPUs. They can provide significant speed-ups (sometimes as much as 20-30x). While out of scope for this blog post, Keras should pick up on available GPUs automatically (assuming Tensorflow was installed with CUDA). Pytorch requires a few additional steps.
The validation loss between Keras and Pytorch showed a clear difference during training. Since the actual performance metrics turned out very similar this is likely down to differences in the way each performs validation.
The metrics suggest that both models perform more or less equally well.
Performance Report

    [[5854    0    0    0    0    0    0]
     [   0  302    0    0    0    0    0]
     [   0    0   46    0    0    0    0]
     [   0    0    0 3588   19    0    0]
     [   0    0    0  207 2700    0    0]
     [   0    0    0    0    0  926    1]
     [   0    0    0    0    0    1 3947]]

                                        precision    recall  f1-score   support

    corecaptured.log                         1.00      1.00      1.00      5854
    fsck_apfs.log                            1.00      1.00      1.00       302
    fsck_hfs.log                             1.00      1.00      1.00        46
    install.log                              0.95      0.99      0.97      3607
    system.log                               0.99      0.93      0.96      2907
    wifi-11-07-2018__13:38:02.923.log        1.00      1.00      1.00       927
    wifi.log                                 1.00      1.00      1.00      3948

    micro avg                                0.99      0.99      0.99     17591
    macro avg                                0.99      0.99      0.99     17591
    weighted avg                             0.99      0.99      0.99     17591

    Accuracy: 0.99
Performance Report

    [[5872    0    0    0    0    0    0]
     [   0  327    0    0    0    0    0]
     [   0    0   46    0    0    0    0]
     [   0    0    0 3493   29    0    0]
     [   0    0    0  203 2712    0    0]
     [   0    0    0    0    0  926    1]
     [   0    0    0    0    0    4 3981]]

                                        precision    recall  f1-score   support

    corecaptured.log                         1.00      1.00      1.00      5872
    fsck_apfs.log                            1.00      1.00      1.00       327
    fsck_hfs.log                             1.00      1.00      1.00        46
    install.log                              0.95      0.99      0.97      3522
    system.log                               0.99      0.93      0.96      2915
    wifi-11-07-2018__13:38:02.923.log        1.00      1.00      1.00       927
    wifi.log                                 1.00      1.00      1.00      3985

    micro avg                                0.99      0.99      0.99     17594
    macro avg                                0.99      0.99      0.99     17594
    weighted avg                             0.99      0.99      0.99     17594

    Accuracy: 0.99
Neural networks are quite different from the classical machine learning methods we looked at in Part One. Not only do we have to consider the neural network architecture, there are also functions and other hyperparameters to configure.
We saw that it is possible to implement a successful solution to a supervised machine learning problem like log classification using two different frameworks, Keras and Pytorch.
The code for both the Keras and Pytorch versions is available as Jupyter notebooks in the github repo.
We hope this two-part series of blog posts has whetted your appetite for machine learning and look forward to sharing more in future.