Kolmogorov-Arnold Networks (KANs) are a recent advancement in neural network design. Based on the Kolmogorov-Arnold representation theorem, they have the potential to be a viable alternative to Multilayer Perceptrons (MLPs). Unlike MLPs, which have fixed activation functions at each node, KANs use learnable activation functions on edges, replacing linear weights with univariate functions parameterized as splines.
A research team from the Massachusetts Institute of Technology, California Institute of Technology, Northeastern University, and The NSF Institute for Artificial Intelligence and Fundamental Interactions presented Kolmogorov-Arnold Networks (KANs) as a promising replacement for MLPs in a recent paper titled “KAN: Kolmogorov-Arnold Networks.”
According to the Kolmogorov-Arnold representation theorem, any multivariate continuous function f on a bounded domain can be written as:

$$f(x_1, \ldots, x_n) = \sum_{q=1}^{2n+1} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)$$

Here:

$\phi_{q,p} : [0, 1] \to \mathbb{R}$ and $\Phi_q : \mathbb{R} \to \mathbb{R}$ are continuous univariate functions.
In other words, any multivariate continuous function can be expressed as a composition of univariate functions and addition. This might make you think machine learning could become easier: instead of learning one high-dimensional function, we would only need to learn a collection of one-dimensional ones. However, the univariate functions in this representation can be non-smooth, so the theorem was long regarded as theoretically elegant but useless in practice. The researchers behind KANs realized its potential by generalizing the representation beyond the original two-layer, width-(2n+1) form to networks of arbitrary width and depth, and by observing that most functions encountered in science and everyday machine learning are smooth enough for the decomposition to work well.
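For intuition, consider a small hand-worked example (an illustration of the idea, not taken from the paper): multiplication of two positive inputs can be built from univariate functions and addition alone,

$$x_1 x_2 = \exp\!\left( \log x_1 + \log x_2 \right),$$

where the inner univariate functions are $\phi_p(x_p) = \log x_p$ and the outer one is $\Phi(z) = e^{z}$.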
Feedforward neural networks are the simplest form of artificial neural networks: information flows in one direction, from input to output, and the architecture contains no cycles or loops. Multilayer Perceptrons (MLPs) are a type of feedforward neural network in which the input is mapped to the output through one or more hidden layers.
MLPs are based on the universal approximation theorem, which states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of R^n, provided the activation function is non-polynomial. This allows neural networks, especially those with hidden layers, to represent a wide range of complex functions. MLPs are therefore built with multiple hidden layers to capture the intricate patterns in data, and they use fixed activation functions at each node.
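As a point of reference, here is a minimal MLP sketch in PyTorch (an illustration, not code from the paper or from pykan): the linear weight matrices are learnable, while the ReLU activations applied at the hidden nodes are fixed functions.

import torch
import torch.nn as nn

# A minimal MLP: learnable linear weights, fixed ReLU activations on nodes
mlp = nn.Sequential(
    nn.Linear(2, 5),   # learnable weights W1 and bias b1
    nn.ReLU(),         # fixed activation applied at each hidden node
    nn.Linear(5, 1),   # learnable weights W2 and bias b2
)

x = torch.rand(8, 2)   # batch of 8 two-dimensional inputs
print(mlp(x).shape)    # torch.Size([8, 1])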
However, MLPs have a few drawbacks. In transformers, for example, MLP blocks consume almost all of the model's non-embedding parameters. They are also less interpretable without additional analysis tools. This is where KANs come into the picture.
A Kolmogorov-Arnold Network is a neural network with learnable activation functions. Unlike MLPs, which have fixed activation functions on their nodes, KANs place learnable activation functions on their edges: every linear weight is replaced by a learnable univariate function parameterized as a spline, and the nodes simply sum their incoming signals.
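To make the idea concrete, here is a conceptual sketch (a simplified illustration, not pykan's actual B-spline implementation): each edge carries its own learnable univariate function, parameterized here as a learnable combination of fixed Gaussian basis functions, and the node simply sums its incoming edges.

import torch
import torch.nn as nn

class LearnableEdge(nn.Module):
    # A toy learnable univariate function phi(x) for a single edge.
    # Stand-in for the spline parameterization used by KANs: phi is a
    # learnable linear combination of fixed basis functions, so the
    # "weight" on the edge is itself a function of its input.
    def __init__(self, num_basis=6):
        super().__init__()
        self.register_buffer("centers", torch.linspace(0, 1, num_basis))
        self.coeffs = nn.Parameter(torch.zeros(num_basis))  # learnable coefficients

    def forward(self, x):
        # x: (batch,) -> phi(x): (batch,)
        basis = torch.exp(-((x.unsqueeze(-1) - self.centers) ** 2) / 0.02)
        return basis @ self.coeffs

# A single KAN-style node: the sum of learnable univariate functions of each input
edges = nn.ModuleList([LearnableEdge(), LearnableEdge()])
x = torch.rand(8, 2)
node_output = edges[0](x[:, 0]) + edges[1](x[:, 1])
print(node_output.shape)  # torch.Size([8])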
Here are the advantages of KANs:
- Learnable activation functions on edges, which make the network more flexible and adaptable than an MLP with fixed node activations.
- Better interpretability: the learned univariate functions can be plotted and inspected directly.
- Better accuracy in many settings where interpretability and accuracy of the results are the main objectives.
Let's implement a KAN with the help of a simple example. We will create a custom dataset from the function f(x, y) = exp(cos(pi*x) + y^2). This function takes two inputs, computes the cosine of pi*x, adds the square of y, and then takes the exponential of the result.
To get started, install the pykan library from its GitHub repository (the exact Python and PyTorch version requirements are listed in the repository's README):
!pip install git+https://github.com/KindXiaoming/pykan.git
import torch
import numpy as np
## create a dataset
def create_dataset(f, n_var=2, n_samples=1000, split_ratio=0.8):
    # Generate random input data in [0, 1)^n_var
    X = torch.rand(n_samples, n_var)
    # Compute the target values
    y = f(X)
    # Split into training and test sets
    split_idx = int(n_samples * split_ratio)
    train_input, test_input = X[:split_idx], X[split_idx:]
    train_label, test_label = y[:split_idx], y[split_idx:]
    return {
        'train_input': train_input,
        'train_label': train_label,
        'test_input': test_input,
        'test_label': test_label
    }
# Define the new function f(x, y) = exp(cos(pi*x) + y^2)
f = lambda x: torch.exp(torch.cos(torch.pi*x[:, [0]]) + x[:, [1]]**2)
dataset = create_dataset(f, n_var=2)
print(dataset['train_input'].shape, dataset['train_label'].shape)
##output: torch.Size([800, 2]) torch.Size([800, 1])
from kan import *
# create a KAN: 2D inputs, 1D output, and 5 hidden neurons.
# cubic spline (k=3), 5 grid intervals (grid=5).
model = KAN(width=[2,5,1], grid=5, k=3, seed=0)
# plot KAN at initialization (run a forward pass first so the activations used by plot are populated)
model(dataset['train_input']);
model.plot(beta=100)
## train the model
model.train(dataset, opt="LBFGS", steps=20, lamb=0.01, lamb_entropy=10.)
## output: train loss: 7.23e-02 | test loss: 8.59e-02
## output: | reg: 3.16e+01 : 100%|██| 20/20 [00:11<00:00, 1.69it/s]
model.plot()
## prune the network to remove insignificant edges and neurons
model.prune()
model.plot(mask=True)
model = model.prune()
model(dataset['train_input'])
model.plot()
## retrain the pruned model
model.train(dataset, opt="LBFGS", steps=100)
model.plot()
| MLP | KAN |
| --- | --- |
| Fixed activation functions on nodes | Learnable activation functions on edges |
| Linear weights | Parametrized splines |
| Less interpretable | More interpretable |
| Less flexible and adaptable | Highly flexible and adaptable |
| Faster training time | Slower training time |
| Based on the Universal Approximation Theorem | Based on the Kolmogorov-Arnold Representation Theorem |
The invention of KANs marks a step forward for deep learning techniques. By providing better interpretability and accuracy than MLPs, they can be the better choice when interpretability and accuracy of the results are the main objectives. For tasks where training speed is essential, MLPs remain the more practical solution. Research on these networks is ongoing, but for now, KANs represent an exciting alternative to MLPs.
Q1. Who developed Kolmogorov-Arnold Networks?
A. Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y. Hou, and Max Tegmark are the researchers involved in the development of KANs.
Q2. What is the difference between fixed and learnable activation functions?
A. Fixed activation functions are mathematical functions applied to the outputs of neurons in a neural network. They remain constant throughout training and are not updated or adjusted based on the network's learning. Examples: Sigmoid, tanh, ReLU.
Learnable activation functions, in contrast, are adaptive and are modified during training. Instead of being predefined, they are updated through backpropagation, allowing the network to learn the most suitable activation functions.
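As a small illustration (not from the article's code): in PyTorch, nn.ReLU is a fixed activation with no trainable parameters, while nn.PReLU carries a learnable negative slope that is updated by backpropagation.

import torch
import torch.nn as nn

fixed = nn.ReLU()        # fixed activation: no trainable parameters
learnable = nn.PReLU()   # learnable activation: the negative slope is a parameter

print(list(fixed.parameters()))      # []
print(list(learnable.parameters()))  # [Parameter containing: tensor([0.25], ...)]

x = torch.tensor([-1.0, 2.0])
print(fixed(x))       # tensor([0., 2.])
print(learnable(x))   # tensor([-0.2500, 2.0000], grad_fn=...)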
Q3. What are the limitations of KANs?
A. One limitation of KANs is their slower training time, which stems from their more complex architecture. Because they replace linear weights with spline-based functions, they require additional computations to learn and optimize during training.
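As a rough back-of-the-envelope comparison (my own arithmetic, following the paper's parameter-count scaling rather than exact pykan bookkeeping): an MLP layer with n_in inputs and n_out outputs has about n_in * n_out weights, while a KAN layer stores on the order of (grid + k) spline coefficients per edge.

## illustrative parameter-count comparison for a single layer
n_in, n_out = 2, 5   # layer sizes from the example above
grid, k = 5, 3       # grid intervals and spline order from the example above

mlp_params = n_in * n_out                 # linear weights (ignoring biases)
kan_params = n_in * n_out * (grid + k)    # roughly (grid + k) coefficients per edge

print(mlp_params)  # 10
print(kan_params)  # 80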
Q4. Should I use KANs or MLPs?
A. If your task demands higher accuracy and interpretability and training time isn't a constraint, you can proceed with KANs. If training time is critical, MLPs are the more practical option.
Q5. What is the LBFGS optimizer?
A. LBFGS stands for "Limited-memory Broyden–Fletcher–Goldfarb–Shanno". It is a popular quasi-Newton algorithm for parameter estimation in machine learning and numerical optimization.
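For reference, here is a minimal sketch of how LBFGS is used directly in PyTorch (an independent illustration; when you pass opt="LBFGS" to pykan, the library handles the optimizer for you): unlike first-order optimizers, LBFGS expects a closure that re-evaluates the loss.

import torch

# Toy objective: find w that minimizes (w - 3)^2
w = torch.zeros(1, requires_grad=True)
optimizer = torch.optim.LBFGS([w], lr=0.1)

def closure():
    # LBFGS may evaluate the loss several times per step,
    # so the loss computation lives inside a closure.
    optimizer.zero_grad()
    loss = ((w - 3.0) ** 2).sum()
    loss.backward()
    return loss

for _ in range(20):
    optimizer.step(closure)

print(w.item())  # close to 3.0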