The hype around machine learning shows no sign of slowing down. But while companies set out to hire more data scientists and machine learning engineers, many of us are still left wondering what it is exactly.
In this two-part blog series, machine learning engineer Maartens Lourens introduces machine learning concepts and solutions from an engineer’s point of view.
In some of my earlier writing, I set out to demystify machine learning: how it works, and how you can implement a solution to a classification problem. I used an example that I hoped most engineers, especially DevOps Engineers, would find easy to relate to: it used logs as the data context and stuck to basic programming functions and data structures where possible. I also used a highly composable machine learning framework in the form of Scikit Learn.
In the present series of blog posts, I want to go a step further. In Part 1 I would like to present a more idiomatic implementation of the log classification problem. This will be achieved in large part by introducing the rich and flexible capabilities of Pandas and Numpy for data management and manipulation. Then in Part 2 I will compare the updated Scikit Learn implementation with solutions in two Deep Learning frameworks, in particular Keras (with a Tensorflow backend) and Pytorch.
By doing so I hope to provide some of the basic building blocks used in modern Deep Learning and Machine Learning development. This should place us in a good position to explore more sophisticated problems in the future.
Suppose that we are presented with a line from a log file. What we want to do is predict what type of log it belongs to. In this case, the type of log is simply the name of the log, e.g. system.log, httpd.log, etc.
We accomplish this by training on a set of existing logs to produce a model. For example, on my laptop there is a set of logs in the /var/log directory. By using a machine learning algorithm I can fit the data and so build the model. This flow can be visualised as follows.
Figure 1: Training a Model
Once we have a model we can achieve our goal by presenting new logs to the model and making predictions.
Figure 2: Making Predictions
Improving the Original
An earlier version used a couple of coding practices that might seem familiar to most DevOps Engineers, but would seem strange to a Python data scientist:
- Bash scripting to collect the logs
- Python dicts and lists to manage the data
It also had a couple of gaps that would make a data scientist raise an eyebrow:
- No notebook
- Limited metrics
To be more idiomatic in our data science development we will let Python collect the logs, use Pandas and Numpy to manage the data, and leverage the power of Jupyter Notebook.
Let’s take a quick look at each of these.
Collect the Logs
This is straightforward to do in Python with the help of the os, glob, and shutil libraries. Let’s define a function.
    import glob
    import os
    import shutil


    def copy_data(src_file_path, dst_file_path):
        # Create the destination directory on first use
        if not os.path.exists(dst_file_path):
            os.mkdir(dst_file_path)
        for logfile in glob.glob(src_file_path + "/*.log"):
            # Only copy logs whose size exceeds 10000 bytes
            if os.stat(logfile).st_size > 10000:
                logfile_name = logfile.split('/')[-1]
                shutil.copyfile(logfile, dst_file_path + "/" + logfile_name)
We can now call our function quite easily.
    source_data_dir = "/var/log"
    data_dir = "data"
    copy_data(source_data_dir, data_dir)
Our function creates a local data directory, and then copies logs from /var/log when they are larger than 10,000 bytes (to give us enough data of each type to learn from).
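To see the size filter in action without touching the real /var/log, here is a minimal sketch that runs the same idea against throwaway files. The function name copy_large_logs, its min_bytes parameter, and the file names are illustrative assumptions, not part of the original code; it also uses pathlib rather than os and glob.

```python
import shutil
import tempfile
from pathlib import Path


def copy_large_logs(src_dir, dst_dir, min_bytes=10000):
    """Copy *.log files larger than min_bytes from src_dir into dst_dir.

    Hypothetical variant of copy_data; returns the names of the copied files.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for logfile in sorted(Path(src_dir).glob("*.log")):
        if logfile.stat().st_size > min_bytes:
            shutil.copyfile(logfile, dst / logfile.name)
            copied.append(logfile.name)
    return copied


# Exercise the function on temporary files: one above the threshold, one below.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "big.log").write_text("x" * 20000)
    (src / "small.log").write_text("x" * 10)
    copied_names = copy_large_logs(src, Path(tmp) / "data")

print(copied_names)
```

Only the larger file should be copied, which mirrors how small logs are left out of the training data.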
Pandas and Numpy
We mention Pandas and Numpy practically in the same breath because it is common nowadays to use Pandas, and it relies on Numpy under the hood. However, Numpy is also used in its own right and both are foundational to most Python data science toolkits.
Numpy offers efficient data structures (more efficient than native Python) with many convenient numerical and data manipulation operator methods. Pandas treats the data as tabular data via structured data frames, which adds further conveniences and rich semantics. These semantics come in handy when, for example, we want to deal with NaN (not a number) values or to reshape the data dimensions. In our example we will note a couple of cases like this.
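As a small illustration of those two conveniences, the sketch below reshapes a Numpy array and then handles a missing value in a Pandas data frame. The values and column contents are made up for the example.

```python
import numpy as np
import pandas as pd

# Numpy: vectorised operations and reshaping without explicit loops.
values = np.arange(6)              # six values, 0 through 5
grid = values.reshape(2, 3)        # the same data as 2 rows by 3 columns
row_totals = grid.sum(axis=1)      # one total per row

# Pandas: a labelled table in which a missing entry shows up as NaN.
frame = pd.DataFrame({
    "data": ["line one", None, "line three"],
    "type": ["system.log", "system.log", "wifi.log"],
})
missing = int(frame["data"].isna().sum())   # count of missing values
cleaned = frame.dropna()                    # drop rows with a missing value

print(row_totals, missing, len(cleaned))
```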
Let’s define a function that reads our logs into Pandas data frames.
    import glob

    import pandas as pd


    def read_data(logfile_path):
        log_collection = pd.DataFrame()
        logs = pd.DataFrame()
        logfiles = glob.glob(logfile_path + "/*.log")  # Get list of log files
        for logfile in logfiles:
            # sep="\n" puts each whole line into one column; this worked in the
            # pandas release used for this post, but recent releases may reject it
            logs = pd.read_csv(logfile, sep="\n", header=None, names=['data'])
            logs['type'] = logfile.split('/')[-1]
            # Add log file data and type to log collection
            log_collection = pd.concat([log_collection, logs])
        # Remove empty lines
        log_collection = log_collection.dropna()
        return log_collection
As there are separate data sources (the individual log files), we use one data frame (logs) to read each log file in sequence, and another (log_collection) to keep the whole collection. This separation also makes it easy to add the log type to logs as data in a new column.
We read in the data using Pandas’ read_csv method, but we need to configure some of its arguments. Firstly, Pandas’ default mode expects the first line to contain column headers, and since our logs don’t provide any, we make this explicit: header=None. Secondly, since there is no header information, we add a descriptive column name as a convenience: names=['data']. Finally, we keep the whole line in a single data column, because we have no requirement to parse the log structure into separate columns (e.g. date, process, message, etc.). The sep="\n" setting lets Pandas put the whole line into that one column.
Pandas adds another convenience. The log files have a number of empty lines, and since they appear as missing values in the data frame, we can use the dropna() method to get rid of them.
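Newer Pandas releases may not accept a newline as the read_csv separator. As a hedged alternative, the sketch below builds the same two-column frame by reading lines with plain Python. The function name read_lines and the demo file are assumptions for illustration, not the approach used in the repo. Note that blank lines arrive here as empty strings rather than missing values, so they are filtered explicitly instead of with dropna().

```python
import glob
import os
import tempfile

import pandas as pd


def read_lines(logfile_path):
    """Read every *.log file in logfile_path into one data frame.

    Hypothetical alternative to read_data that avoids read_csv.
    """
    frames = []
    for logfile in sorted(glob.glob(os.path.join(logfile_path, "*.log"))):
        with open(logfile, errors="replace") as handle:
            lines = [line.rstrip("\n") for line in handle]
        frame = pd.DataFrame({"data": lines})
        frame["type"] = os.path.basename(logfile)  # label = log file name
        frames.append(frame)
    if not frames:
        return pd.DataFrame(columns=["data", "type"])
    collection = pd.concat(frames, ignore_index=True)
    # Keep only lines that contain something other than whitespace.
    return collection[collection["data"].str.strip() != ""]


# Demo on a temporary file containing one blank line.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "demo.log"), "w") as handle:
        handle.write("first line\n\nsecond line\n")
    demo = read_lines(tmp)

print(demo)
```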
The Jupyter project is an interactive computing platform that aims to “support interactive data science and scientific computing across all programming languages”. It grew out of IPython’s interactive shells, and its browser-based notebooks have since become a mainstay for data scientists.
Among other features a notebook offers interactive, out-of-sequence operations as well as visual output (such as graphs), which fit well with data scientists’ needs. This flexibility is not without its drawbacks, and Joel Grus in particular has taken notebooks to task for encouraging bad habits when it comes to engineering practices. On the whole, however, notebooks are very useful during the exploratory phase of model development.
Jupyter Notebook isn’t the only interactive Python environment; Spyder, for example, is another popular one. However, Jupyter’s notebook approach has become increasingly popular with users, and its look and feel has been replicated in the machine learning offerings of cloud providers like Google, AWS, and Azure. A Jupyter notebook is therefore what the repo provides.
In the original implementation our only measurement was accuracy. Accuracy is a useful indicator, but it can be misleading. To understand why, we need to take a look at how accuracy is calculated. To help us, we’ll take a quick look at four important concepts: true positives, false positives, true negatives, and false negatives.
True and False Positives, True and False Negatives
True Positive: Reality: A wolf threatens; Shepherd: “Wolf!”
True Negative: Reality: No wolf threatens; Shepherd: quiet.
False Positive: Reality: No wolf threatens; Shepherd: “Wolf!”
False Negative: Reality: A wolf threatens; Shepherd: quiet.
In the story, the boy who cried ‘wolf’ all the time initially committed a false positive because there was no actual wolf that threatened. When finally the wolf appeared and threatened the sheep, he cried “Wolf!” (but nobody came to help). This time it was a true positive.
These four logical categories have different consequences that affect our assessment of the performance of the model.
Accuracy, Recall, Precision, F1-Score
Once measurements for the above four categories are known (often presented in a confusion matrix), we can use them to calculate more useful metrics with the following formulae:
    Accuracy  = (TP + TN) / (TP + TN + FP + FN)
    Recall    = TP / (TP + FN)
    Precision = TP / (TP + FP)
    F1 Score  = 2 * ((Precision * Recall) / (Precision + Recall))
The problem with accuracy is that it can hide false negatives and false positives. When true negatives dominate, a number of false positives or false negatives that is small relative to the total still leaves the accuracy looking really good. However, if either is high compared to the true positives, the model is getting most of its positive calls wrong, and that can have a big impact on decision-making.
This is where precision and recall come in. Low precision means the predictions tend to flag up positive cases incorrectly, whereas low recall means the predictions do not flag up many of the actual positive cases. F1 Score provides a kind of balance between precision and recall. Together they help us to gauge the true performance of our model.
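The formulae translate directly into code. The sketch below applies them to a set of hypothetical counts (10 true positives, 970 true negatives, 5 false positives, 15 false negatives) chosen purely to illustrate how a high accuracy can sit alongside a weak recall; the function name is an assumption.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return accuracy, recall, precision and F1 from the four counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * ((precision * recall) / (precision + recall))
    return accuracy, recall, precision, f1


# Hypothetical counts: true negatives dominate, positives are rare.
accuracy, recall, precision, f1 = classification_metrics(tp=10, tn=970, fp=5, fn=15)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"precision={precision:.2f} f1={f1:.2f}")
```

With these counts the accuracy is 0.98 while the recall is only 0.40, because the 970 true negatives swamp everything else in the accuracy calculation.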
Scikit Learn provides a classification report via its metrics module. It calculates the precision, recall, and F1 score for each classification category (the file types) automatically. Combined with the confusion matrix, which displays the true positives, false positives, and false negatives, we gain a powerful set of metrics.
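As a minimal sketch of those two calls, the example below feeds a handful of made-up true and predicted labels to confusion_matrix and classification_report from sklearn.metrics. The labels are illustrative only; in the real notebook they would come from the test split and the model’s predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels: one system.log line is misclassified as wifi.log.
y_true = ["system.log", "system.log", "wifi.log", "wifi.log", "install.log"]
y_pred = ["system.log", "wifi.log", "wifi.log", "wifi.log", "install.log"]
labels = ["install.log", "system.log", "wifi.log"]

# Rows are the true classes, columns the predicted classes.
matrix = confusion_matrix(y_true, y_pred, labels=labels)
report = classification_report(y_true, y_pred, labels=labels)

print(matrix)
print(report)
```

Passing labels explicitly fixes the row and column order, which makes the matrix easier to line up with the report.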
For example, in the following metrics for the Ridge Classifier algorithm we can see that, although the overall scores are high, the precision and recall for one of the wifi logs are really low. This is probably not surprising, given that all the wifi logs are likely very similar and mainly distinguished by date. However, it shows us how we can analyse our results and make sense of the model’s performance.
    [[2536    0    0    0    0    0    0    0    0    0]
     [   0  131    0    0    0    0    0    0    0    0]
     [   0    0 1739    1    0    0    0    0    0    1]
     [   0    0  126 1190    0    0    0    3    0    0]
     [   0    0    0    0 1091    0    0    0    0    0]
     [   0    0    0    0    1    9  105    0    0    1]
     [   0    0    0    0    1   24  923    0    0    1]
     [   0    0    0    0    1    0    0  918    0    1]
     [   0    0    0    0    0    0    0    0  761    1]
     [   0    0    0    0    2    0    0    0    1 1719]]

                                       precision  recall  f1-score  support
    corecaptured.log                        1.00    1.00      1.00     2536
    fsck_apfs.log                           1.00    1.00      1.00      131
    install.log                             0.93    1.00      0.96     1741
    system.log                              1.00    0.90      0.95     1319
    wifi-08-23-2018__12:54:38.121.log       1.00    1.00      1.00     1091
    wifi-08-24-2018__09:09:14.458.log       0.27    0.08      0.12      116
    wifi-08-24-2018__12:47:32.191.log       0.90    0.97      0.93      949
    wifi-08-28-2018__14:27:47.184.log       1.00    1.00      1.00      920
    wifi-09-03-2018__12:45:21.309.log       1.00    1.00      1.00      762
    wifi.log                                1.00    1.00      1.00     1722
    avg / total                             0.97    0.98      0.97    11287

    Accuracy: 0.98
In particular, there are only 9 true positives for wifi-08-24-2018__09:09:14.458.log, whereas there are 24 false positives (reading down its column) and 107 false negatives (reading across its row). This drags down the precision (0.27) and recall (0.08) in a big way. To put that in perspective, with over 11,000 true negatives for this class, its accuracy is still above 0.98! So clearly, accuracy by itself can be very misleading. This is called the accuracy paradox.
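The per-class counts can be read off the confusion matrix programmatically. The sketch below takes the matrix exactly as printed in the output above and derives the counts for the sixth class (index 5, the wifi-08-24-2018__09:09:14.458.log row); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix as printed above: rows are true classes, columns predictions.
matrix = np.array([
    [2536, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 131, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1739, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 126, 1190, 0, 0, 0, 3, 0, 0],
    [0, 0, 0, 0, 1091, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 9, 105, 0, 0, 1],
    [0, 0, 0, 0, 1, 24, 923, 0, 0, 1],
    [0, 0, 0, 0, 1, 0, 0, 918, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 761, 1],
    [0, 0, 0, 0, 2, 0, 0, 0, 1, 1719],
])

index = 5  # row/column of wifi-08-24-2018__09:09:14.458.log
tp = int(matrix[index, index])             # diagonal entry
fp = int(matrix[:, index].sum()) - tp      # rest of the column
fn = int(matrix[index, :].sum()) - tp      # rest of the row
tn = int(matrix.sum()) - tp - fp - fn      # everything else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
class_accuracy = (tp + tn) / matrix.sum()

print(tp, fp, fn, tn)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={class_accuracy:.3f}")
```

The large true-negative count is what keeps this class’s accuracy high even though most of its positive predictions and most of its actual lines are misclassified.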
A Word on Data Preprocessing
We saw in the previous sections that Pandas makes data management really easy. It allowed us to perform an important data cleaning task in dropping empty lines, and also let us feature engineer the labels. As additional preprocessing we perform vectorisation and add tf-idf weighting factors on the fly during the training pipeline stage.
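    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import Pipeline

    # algorithm is the classifier under test; X_train and y_train are the training split
    model = Pipeline([('vect', CountVectorizer()),
                      ('tfidf', TfidfTransformer()),
                      ('clf', algorithm)])
    model.fit(X_train, y_train)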
Preprocessing is key to nearly all machine learning problems, and the quality of this task often distinguishes a good model from a lesser one. Because it is so important we should now add it as one of the components in our training flow.
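To make the pipeline concrete, here is a self-contained sketch that trains it on four made-up log lines and predicts the type of two new ones. The toy lines and labels are assumptions for illustration, and RidgeClassifier stands in for the algorithm placeholder; the real notebook trains on the collected logs instead.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline

# Made-up training data: log lines (X) and the log they came from (y).
X_train = [
    "kernel: wifi link up",
    "kernel: wifi link down",
    "httpd: GET /index.html 200",
    "httpd: POST /login 302",
]
y_train = ["wifi.log", "wifi.log", "httpd.log", "httpd.log"]

# Vectorise the text, apply tf-idf weighting, then fit the classifier.
model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RidgeClassifier()),
])
model.fit(X_train, y_train)

# The same preprocessing is applied automatically at prediction time.
predictions = model.predict(["httpd: GET /about.html 200", "kernel: wifi link up"])
print(list(predictions))
```

Because the preprocessing steps live inside the pipeline, new lines go through exactly the same vectorisation and weighting as the training data.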
Figure 3: Training a Model
With all these changes in place we can see that the code is lean, yet flexible. The adopted code practices are also more in keeping with the way data scientists actually work.
In Part Two we will take a look at a couple of neural network implementations of the same problem, which will set the stage for when we need to tackle more sophisticated problems later on.