*Editor’s note: This is a series of blog posts on the topic of “Demystifying the creation of intelligent machines: How does one create AI?” You are now reading part 3. For the list of all, see here: 1, 2, 3, 4, 5, 6, 7.*

As I discussed in my last posts, I have been working with colleagues at DXC to build an artificially intelligent fan, one that can monitor its operations, report issues and even, sometimes, fix problems itself.

As I mentioned in my first post, the process of creating AI requires knowledge about various machine learning methods and a deep understanding of how they work and what their advantages and disadvantages are. In this post, I will go into detail about three fundamental theorems that guided our work – and should direct yours as well.

**No free lunch here**

Neural nets do not do well with just any kind of data. Usually, one cannot simply feed raw signals into a generic neural net and expect good results. The data needs to be adjusted for efficient processing. This means that, when preparing the data, you need to understand how neural networks work – what they like and what they dislike. Then you can reorganize the data in a way the network is “happy” with.
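For example, one standard adjustment is feature standardization: rescaling each input to zero mean and unit variance so that no signal numerically dominates the others during training. A minimal sketch in Python (the sensor readings are made up for illustration):

```python
def standardize(values):
    """Scale a list of numbers to zero mean and unit variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5 or 1.0  # guard against constant features
    return [(v - mean) / std for v in values]

# Raw readings on an arbitrary scale (invented values):
raw = [230.0, 231.5, 229.0, 232.5, 230.5]
scaled = standardize(raw)
print(scaled)  # now centered on 0, spread of 1
```

Real pipelines add more steps (filtering, windowing, feature extraction), but the principle is the same: reshape the data into the form the network handles well.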

You may be wondering – why can’t we have a machine learning algorithm that works well with any kind of data? Well, for good reason: It can’t be done.

It’s not that we haven’t discovered such an algorithm yet. Rather, there is a fundamental theorem in machine learning that tells us this silver bullet cannot exist. No matter how much you try, you will not find it.

The mathematical theorem proving this is the so-called “**no-free-lunch theorem**.” It tells us that if a learning algorithm works well with one kind of data, it will work poorly with other kinds. This means that, unless we are extremely lucky, we must either adjust the data to the algorithm or adjust the algorithm to the data to achieve high performance.

There is one more implication of the no-free-lunch theorem: a tradeoff between specialization and generalization. You can have very efficient learning, but it will be specialized for just one type of problem. Or you can have an algorithm that works across a range of problems, but it will necessarily be less efficient.

An example of the general extreme is a genetic algorithm, which can find a solution to almost anything – but the calculation takes forever. An example of the specialized extreme is a general linear model, which works well only with linear relationships between normally distributed and independently sampled variables – but its optimal fit can be calculated in the blink of an eye.
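To make the specialized extreme concrete: an ordinary least-squares line has a closed-form optimal fit, with no iterative search at all. A small sketch with invented data points:

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); intercept follows directly.
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # exactly linear: y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)  # → 2.0 1.0
```

For data that really are linear, this two-formula fit is hard to beat; for anything else, its strong assumptions become a liability.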

Generally speaking, when building a high-performing AI, you are unlikely to find an algorithm sitting on a shelf, ready to be applied. Rather, you will have to craft an algorithm for your specific data. And this is where data science comes into play.

The beauty of neural nets, and the reason they are so popular, is that they offer the possibility to engineer an architecture specific to your problem. Neural nets are general when seen as a tool for creating architectures, but they become reasonably specialized when you settle on a specific architecture.

**Inductive bias and overfitting**

To understand why and how general network principles can be turned into a specialized learning system – and hence an effective learner – it is necessary to consider another theorem: inductive bias.

Every machine learning algorithm has a bias toward what it “sees” in the data. In a way, a machine learning algorithm projects its own knowledge onto data. Consider, for example, a simple algorithm that fits a sinusoidal function to a time series. The algorithm is likely to impose a sinusoidal-like form on even random data.
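This projection is easy to demonstrate. The toy sketch below (illustrative only, not our fan pipeline) fits A·sin(ωt + φ) to pure Gaussian noise by grid search and still reports a nonzero-amplitude sinusoid: the algorithm's bias imposed on data that contains no oscillation at all.

```python
import math
import random

def fit_sine(ts, ys):
    """Grid-search the best A*sin(w*t + p) fit by squared error."""
    best, best_err = None, float("inf")
    for w in [0.5 + 0.1 * i for i in range(30)]:          # candidate frequencies
        for p in [2 * math.pi * k / 16 for k in range(16)]:  # candidate phases
            s = [math.sin(w * t + p) for t in ts]
            denom = sum(v * v for v in s)
            # Closed-form best amplitude for this fixed frequency and phase.
            a = sum(v * y for v, y in zip(s, ys)) / denom if denom else 0.0
            err = sum((y - a * v) ** 2 for v, y in zip(s, ys))
            if err < best_err:
                best, best_err = (a, w, p), err
    return best

random.seed(0)
ts = [i * 0.1 for i in range(100)]
noise = [random.gauss(0.0, 1.0) for _ in ts]  # no sinusoid in here at all
a, w, p = fit_sine(ts, noise)
print(a)  # the fit still "finds" a nonzero-amplitude sinusoid
```

The algorithm cannot report “there is no sinusoid here”; its hypothesis space does not contain that answer.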

This inductive bias determines the type of data an algorithm will work with effectively and the type it will struggle with. Neural nets are not free of inductive biases; they have their own biases, determined by their mathematics and architecture. By changing the architecture of a neural network, we change its inductive biases, to a degree. This is why neural nets are such a versatile tool.

Combine this with the no-free-lunch theorem and we see the only way to create an effective learner: change its inductive biases until they suit what we care about, namely, our data.

Inductive biases also relate to the problem of overfitting. In machine learning, overfitting occurs when a model performs well on training data but poorly on test data.

Overfitting happens when a model's inductive biases are wrong for the data. If the equations of the model truly reflect the process that generated the data (for example, a linear model applied to data generated by a linear process), then a model fitted to training data will also perform well on test data. In a way, such a model – in its very architecture – contains knowledge about the data. It can learn very fast; with only a few data points, it can begin generating accurate predictions.
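A classic illustration of a wrong inductive bias: force a degree-5 polynomial through six points generated by a roughly linear process (values made up). The training error is exactly zero, yet predictions away from the training points go badly wrong.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the degree-(n-1) polynomial through all training points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if i != j:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Training data from a linear process y = 2x, plus a little made-up noise:
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.3, 1.7, 4.4, 5.6, 8.5, 9.7]

# The interpolating polynomial reproduces every training point exactly...
print(lagrange_predict(train_x, train_y, 3.0))  # → 5.6, zero training error
# ...but its wrong inductive bias (degree-5 polynomial, not a line)
# throws predictions far off just beyond the training range,
# where the true value would be near 2 * 5.5 = 11.0:
print(lagrange_predict(train_x, train_y, 5.5))
```

A plain line fitted to the same points – the correct bias for this process – would predict about 11 at x = 5.5 without breaking a sweat.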

But in reality, it rarely works that way. We train neural networks on data generated by real life, not by processes whose equations we already know. What options do we have, then? How do we create sufficiently effective models of real-world phenomena?

One option is to create the most specialized models possible, directly describing the world with carefully chosen equations. These models require extensive human work – and may take decades to reach satisfactory performance. Examples are equations that describe natural laws, such as E = mc².

The other extreme is to start with very broad models and let the computer do the fitting job. Various kinds of genetic algorithms offer such tools. In many cases, they take ages to evolve successfully because of the immense amount of computation they require.
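For a sense of what such broad search looks like, here is a minimal genetic algorithm sketch (illustrative only, not the variants we actually used). Even this toy run spends pop_size × generations = 10,000 fitness evaluations on a trivial objective:

```python
import random

def evolve(fitness, pop_size=50, genes=8, generations=200):
    """A minimal genetic algorithm: elitism, one-point crossover, point mutation."""
    pop = [[random.uniform(-10, 10) for _ in range(genes)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness)
        parents = scored[: pop_size // 2]          # keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, genes)       # one-point crossover
            child = a[:cut] + b[cut:]
            i = random.randrange(genes)            # mutate one gene
            child[i] += random.gauss(0.0, 0.5)
            children.append(child)
        pop = parents + children
    return min(pop, key=fitness)

random.seed(1)
# Toy objective: minimize the sum of squares (optimum is all zeros).
best = evolve(lambda g: sum(x * x for x in g))
print(best)  # close to the all-zeros optimum, after 10,000 evaluations
```

The same search loop works for almost any objective you can score – that is its generality – but every problem pays the full price in fitness evaluations.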

What we usually end up doing is creating a balanced solution: something that requires a reasonable amount of human work and not too much computation. A degree of effort is first invested in human understanding of the problem (days or weeks, not years or decades), and then computation completes the process.

**The good model theorem**

Before I return to explaining how we built AI for our fan, let me discuss one more theorem – the oldest in this story.

Created during the golden years of cybernetics by Ross Ashby and his student Roger Conant, the theorem is popularly called the “good regulator theorem.” It tells us that for an agent to successfully regulate or control its surrounding world toward its own goals (i.e., to be a well-performing AI), the agent has to be a good model of that world.

That means, internally, the agent has to have some sort of accurate representation and “what-if simulation” of its environment. Being a good model is the only possible way to successfully interact.

The good regulator theorem is closely related to everything discussed above. For example, overfitting can be understood as the model's math and/or architecture being off, constituting a poor model of the data. The theorem also tells us why a silver-bullet machine learning algorithm is impossible: any learning algorithm must itself be a good model of the data, and if it learns one type of data effectively, it will necessarily be a poor model – and a poor student – of other types of data.

The good regulator theorem also tells us when an inductive bias will be beneficial or detrimental for modeling certain data: it depends on whether the equations defining the bias constitute a good or a poor model of that data.

To sum it up, here is the relationship between the three theorems: the no-free-lunch theorem tells us that no universal learner exists; inductive bias determines which data a given learner models well; and the good regulator theorem explains why – to learn effectively, an algorithm must be a good model of its data.

Now that we’ve covered these important theorems, stay tuned to see how we used this knowledge to reach good performance with our AI.
