Capacity of a model
If we consider a supervised model as a set of parameterized functions, we can define representational capacity as the intrinsic ability of such a generic function to map a relatively large number of data distributions. To understand this concept, let's consider a function f(x) that admits infinitely many derivatives, and rewrite it as a Taylor expansion:
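A standard form of the expansion, written here around a generic point x_0 (the choice of expansion point is an assumption, since the original formula is not reproduced in this text), is:

$$
f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(x_0)}{k!}\,(x - x_0)^k
$$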
We can decide to keep only the first n terms, so as to obtain an n-degree polynomial function. Consider a simple bi-dimensional scenario with six functions (starting from a linear one); we can observe the different behavior with a small set of data points:
Different behavior produced by six polynomial separating curves
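As a rough illustration of the idea behind the plot, the following sketch fits separating curves of increasing polynomial degree with scikit-learn; the dataset, the circular class boundary, and all parameter values are made up for this example and are not the ones used in the original figure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical 2D dataset (the points in the original figure are not available)
rng = np.random.RandomState(1000)
X = rng.uniform(-3.0, 3.0, size=(50, 2))

# Labels that no straight line can separate (a roughly circular boundary)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 2.5).astype(int)

# Compare separating curves of increasing polynomial degree
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree=degree),
                          LogisticRegression(max_iter=1000))
    model.fit(X, y)
    print(f"Degree {degree}: training accuracy = {model.score(X, y):.2f}")
```

With this setup, the linear model (degree 1) cannot fully separate the two classes, while the higher-degree curves can bend around the boundary, which is exactly the increase in representational capacity discussed above.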
The ability to rapidly change the curvature is proportional to the degree. If we choose a linear classifier, we can only modify its slope (the example is always in a bi-dimensional space) and the intercept. Instead, if we pick a higher-degree function, we have more possibilities to bend the curvature where necessary. If we consider n=1 and n=2 in the plot (on the top-right, they are the first and the second functions), we can see that, with n=1, we can include the dot corresponding to x=11, but this choice has a negative impact on the dot at x=5.
Only a parameterized non-linear function can solve this problem efficiently, because this simple problem requires a representational capacity higher than the one provided by linear classifiers. Another classical example is the XOR function. For a long time, several researchers opposed perceptrons (linear neural networks) because they weren't able to classify a dataset generated by the XOR function. Fortunately, the introduction of multilayer perceptrons, with non-linear activation functions, allowed us to overcome this problem, and many others whose complexity is beyond the possibilities of any classic machine learning model.
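A minimal sketch of this point, using scikit-learn (the hidden layer size, activation, solver, and random seed are arbitrary choices for the example, not values prescribed by the text): a single perceptron cannot separate the XOR classes, while a small multilayer perceptron can.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

# The four XOR points and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# A linear model cannot separate the two classes (accuracy stays below 1.00)
linear_model = Perceptron(max_iter=1000, tol=1e-3, random_state=1000)
linear_model.fit(X, y)
print(f"Perceptron accuracy: {linear_model.score(X, y):.2f}")

# A multilayer perceptron with a non-linear hidden layer typically reaches 1.00
mlp = MLPClassifier(hidden_layer_sizes=(4,), activation='tanh',
                    solver='lbfgs', max_iter=2000, random_state=1000)
mlp.fit(X, y)
print(f"MLP accuracy: {mlp.score(X, y):.2f}")
```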