
The Philosophy of Building a Machine Learning Pipeline

Introduction

With this post we open a series of articles on methodologies for building a machine learning pipeline. You will not find deep dives into specific details here. Our goal is to understand, at a high level, what is good and what is bad.

Our story is not about machine learning itself, but about the infrastructure you must build around your model in order to achieve a specific goal and satisfy a business need. It is not about delicate and sophisticated architectures, beautiful in themselves. We will talk about a hammer that can be used to drive nails.

What makes a solution successful?

You will often see that a simple baseline model, applied sensibly, gives roughly 95% of the accuracy of a solution heavily tuned for a specific domain. Below are the approaches that provide consistent improvement. So, to the point: what makes a particular solution effective?
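To make the idea of a baseline concrete, here is a minimal sketch (the dataset and models are placeholders, not a recommendation): a trivial majority-class predictor and a plain logistic regression, the two reference points any heavily tuned solution should be compared against.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

# Two reference points that any heavily tuned solution has to beat.
baselines = {
    "majority class": DummyClassifier(strategy="most_frequent"),
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}
for name, model in baselines.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name:20s} mean accuracy: {score:.3f}")
```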

Knowledge of multiple state-of-the-art algorithms?

Focusing on tooling instead of problem solving can itself create a problem: we can end up hammering nails with a microscope. And is it necessary? Everything new carries a share of uncertainty with it, and with that, danger.

Whether it is a new model architecture, a new framework, or something else. Time-tested tools (the hammer) are another matter! There is still some danger here, since machine learning is a relatively young industry, but it is better than nothing.

Knowing how to apply 3-4 standard algorithms?

Again, no. Here the situation is the opposite: if we never look beyond what we already know, we will never see a possible path to where we need to go. ML is often about atypical solutions that go beyond the ordinary.

The right metric?

Undoubtedly yes! The metric is how we make our goals concrete and formal. This step can be called fundamental, since the chosen target sets the direction of all further development, and it is good if that direction is at least somewhat aligned with our "real" goal.
For example, our task is “World Peace”.


How do we know that we have achieved it? What is the metric here? Shall we measure the average level of endorphins in each person's blood?

Then we need to implement our plan. To get started, let's look at the "state-of-the-art" model: "Eat chili. Hold it on your tongue for a while, and you will not only feel better but also get a rush of endorphins."
The model and the metric are clear. Therefore, everyone should put chili peppers under their tongues, and "World Peace" will follow. Alas, I don't think it will work.
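A less whimsical version of the same mismatch is easy to reproduce; here is a minimal sketch with synthetic data (all numbers are made up for illustration): on an imbalanced problem, a "model" that looks excellent by accuracy can completely fail the goal we actually care about.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic imbalanced labels: only 1% of cases are the ones we actually care about.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "model" that simply predicts the majority class every time.
y_pred = np.zeros_like(y_true)

print("accuracy:", accuracy_score(y_true, y_pred))  # ~0.99 -- looks great
print("recall:  ", recall_score(y_true, y_pred))    # 0.0  -- misses every real case
```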

Amount of data?

Is it always necessary? There is an opinion that beyond a certain threshold, the achievable quality with the current architecture approaches a plateau; that is, for each additional effort invested in data collection, we get less and less improvement.
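One way to see whether more data is still paying off is to plot a learning curve: train on growing subsets and watch the validation score flatten. A minimal sketch with scikit-learn (the dataset and model are placeholders):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)  # placeholder dataset

# Validation score as a function of training-set size.
sizes, _, val_scores = learning_curve(
    LogisticRegression(max_iter=2000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)
for n, score in zip(sizes, val_scores.mean(axis=1)):
    print(f"{n:5d} samples -> validation accuracy {score:.3f}")
```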

Validating the correctness and purity of the original data?

Undoubtedly yes! In computer science there is even an expression for it: GIGO (garbage in, garbage out). But without at least some working solution already in hand, it can be difficult for a person to assess the quality of a dataset.


For example, suppose there are millions of photos and roughly every thousandth one is of poor quality. Filtering them out by hand would be titanic and pointless work. The simplest solution is to build at least some kind of model and look at the examples on which it makes the grossest mistakes (a sketch follows below). Examining such cases, we come to one of two conclusions:
The model works poorly on some of the data.
Some of the data is simply bad.
Both kinds of knowledge are very valuable.
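Here is a minimal sketch of that idea (the dataset and model are placeholders): fit a quick model, compute a per-example loss out of fold, and hand-inspect the examples where it is largest.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)  # placeholder: stands in for the photo dataset

# Out-of-fold predicted probabilities from a quick baseline model.
proba = cross_val_predict(
    LogisticRegression(max_iter=2000), X, y, cv=5, method="predict_proba"
)

# Per-example loss: how little probability the model assigns to the true label.
per_example_loss = -np.log(proba[np.arange(len(y)), y] + 1e-12)

# The highest-loss examples are the ones worth looking at by hand:
# either the model is weak there, or the data point itself is bad.
worst = np.argsort(per_example_loss)[-20:][::-1]
print(worst)
```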

Understanding the specifics of the data and of the domain being modeled?

Probably yes. Keep in mind that machine learning often replaces traditional approaches (often very simple ones) that existed before and worked well. Pricing automation is an example: before machine learning, it was done either by specially trained people acting on intuition and regulations, or by heuristics. In effect, there were already simple models that "understood" the specifics of the data and were applied to solve a specific problem, and, most importantly, they did solve it.
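For illustration only, such a pre-ML "model" for pricing might be nothing more than a few rules encoding domain knowledge; the numbers and rules below are entirely made up, but any learned replacement has to at least match this kind of baseline.

```python
from typing import Optional

def heuristic_price(cost: float, competitor_price: Optional[float], stock: int) -> float:
    """Hypothetical rule-based pricing 'model' encoding intuition and regulations."""
    price = cost * 1.25                              # standard markup
    if competitor_price is not None:
        price = min(price, competitor_price * 0.99)  # slightly undercut the competition
    if stock < 10:
        price *= 1.10                                # scarce items can bear a higher price
    return round(max(price, cost * 1.05), 2)         # never drop below a minimal margin

print(heuristic_price(cost=100.0, competitor_price=130.0, stock=5))  # -> 137.5
```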

Tuning model hyperparameters?

This is the last thing to do (remember the GIGO principle).
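When the time for it does come, a minimal sketch of one standard approach (dataset, model, and grid are placeholders) is a cross-validated grid search:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # placeholder dataset

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# A small, cheap grid: this step comes after data quality and metrics are sorted out.
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```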

Validation of results?

This item is undoubtedly a plus! Without an understanding of your model's quality, as a rule there is no result; or rather, there is one, it just isn't what you expect.
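A minimal sketch of the most basic form of this, a held-out test set plus cross-validation on the training part (data and model are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Hold out a test set that is never touched during development.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)

# Cross-validation on the training data guides development...
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("cv accuracy:", round(cv_scores.mean(), 3))

# ...and the untouched test set gives a final, honest estimate.
model.fit(X_train, y_train)
print("test accuracy:", round(model.score(X_test, y_test), 3))
```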

Writing unit tests tailored for machine learning?

This point is no longer about machine learning, but about building a production-grade solution. It may not give you a better model, but it can give you more confidence that the unpleasant moment when your service suddenly stops working will not come.
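A minimal sketch of what such tests might look like (the `build_model` helper is a hypothetical stand-in for your own pipeline code): they check properties of the pipeline rather than exact numbers, and are runnable with pytest.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def build_model():
    # Hypothetical stand-in for the model factory in your own pipeline code.
    return LogisticRegression(C=10.0, max_iter=1000)

def _tiny_dataset(seed: int):
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(20, 5))
    y = (X[:, 0] > 0).astype(int)
    return X, y

def test_model_can_memorize_a_tiny_dataset():
    # If the pipeline cannot fit 20 easy points, something is wired up wrong.
    X, y = _tiny_dataset(0)
    model = build_model().fit(X, y)
    assert model.score(X, y) >= 0.9

def test_prediction_shape_and_range():
    X, y = _tiny_dataset(1)
    model = build_model().fit(X, y)
    proba = model.predict_proba(X)
    assert proba.shape == (20, 2)
    assert np.all((proba >= 0.0) & (proba <= 1.0))
```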

How quickly can you get an understanding of the quality of the built pipeline? 

This is also not about machine learning, but about iterative development. You will not be able to develop normally if your model takes weeks to train and you are not sure it will do any good. You definitely need a debug mode.
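A minimal sketch of such a debug mode (the flag, the sizes, and the stand-in pipeline steps are all arbitrary): one switch shrinks the data and the training budget so the whole pipeline runs end to end in minutes instead of weeks.

```python
import argparse
import numpy as np
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-in for the real data-loading step.
def load_data(limit=None):
    n = limit or 100_000
    X = np.random.rand(n, 20)
    y = (X[:, 0] > 0.5).astype(int)
    return X, y

def run(debug: bool) -> None:
    # One switch shrinks both the data and the training budget.
    X, y = load_data(limit=1_000 if debug else None)
    model = SGDClassifier(max_iter=2 if debug else 1_000).fit(X, y)
    print("train accuracy:", round(model.score(X, y), 3))

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--debug", action="store_true",
                        help="run on a small sample with a tiny training budget")
    run(parser.parse_args().debug)
```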

I do not claim this is the absolute truth, but based on my experience, my answer to the question of how to build an ML pipeline most effectively is this: create an end-to-end system as soon as possible, then quickly find and eliminate the bottlenecks in the resulting scheme.

These bottlenecks may well turn out to be exactly the points discussed above.
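As a minimal sketch of what "end to end as soon as possible" might look like (every step is a deliberately crude placeholder, to be replaced one by one as bottlenecks are found):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def load_data():
    # Placeholder: swap in the real data source later.
    return make_classification(n_samples=1_000, n_features=10, random_state=0)

def preprocess(X, y):
    # Placeholder: cleaning and feature engineering go here.
    return X, y

def train(X, y):
    # Deliberately crude baseline; replace once the pipeline works end to end.
    return DummyClassifier(strategy="most_frequent").fit(X, y)

def evaluate(model, X, y):
    return accuracy_score(y, model.predict(X))

if __name__ == "__main__":
    X, y = preprocess(*load_data())
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = train(X_train, y_train)
    print("baseline accuracy:", round(evaluate(model, X_test, y_test), 3))
```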

Conclusion

So, summing up everything described above: to build a successful machine learning pipeline you need all of it at once, and then some!

Dataflickr Team

