The first attempt to create and study artificial neural networks is considered to be the work of McCulloch and Pitts in 1943, in which they formulated the basic principles for constructing artificial neurons and neural networks. A big breakthrough in this field was made by Rosenblatt in 1962, who created a model of a one-layer neural network called the perceptron. It was used for a variety of tasks such as weather forecasting, electrocardiogram analysis and artificial vision. Later, Minsky and Papert proved in a strictly mathematical way that this kind of neural network is not able to solve some very simple tasks such as ‘XOR’. The rigor of the proofs became the reason for a standstill in the development of neural networks. Only a few persistent scientists, such as Kohonen and Anderson, continued the research in this direction. As became clear later, Minsky was too pessimistic in his conclusions about the future of neural networks, because the tasks he described as unsolvable are solved today by standard neural network models.
At present neural networks are used in various fields, from time series analysis to robot control and the diagnosis of rare diseases. Their universality is one of their most valuable features. It has also been shown that neural networks can perform well in cases where traditional statistical models fail to give a result.
There is not yet a commonly accepted definition of ‘neural networks’ (NN) in the literature. Terminological precision requires the use of the full name, ‘artificial neural networks’ (ANN), but as biological neural networks are not going to be examined here, only the short name will be used. The difficulty in defining the concept of neural networks originates from the interdisciplinary character of the subject. Besides, there is a great variety of algorithms united under the name ‘neural networks’. The ways of implementing these algorithms are also conceptually different: software and hardware. Last but not least, the various fields of application of neural networks mean that experts from different fields attach different meanings to this concept. Nevertheless, without claiming absolute precision, the following working definition of neural networks can be given.
The common name of several groups of algorithms that have the ability for self-training through examples, extracting the relationships in the data.
In order to understand the nature of neural networks, it is appropriate to begin with a description of the biological analogue that served as the basis for their creation. The human nervous system is composed of elements called neurons. Their number is about $10^{11}$. They have many unique features, but the most important is the ability to receive, process and transmit electrochemical signals along the neural paths, thus forming the communication system of the brain. The number of connections between the neurons is also huge: $10^{15}$.
Biological Neuron

The figure shows the structure of two biological neurons. Dendrites extend from the body of the cell and connect it to other neurons. The points of junction are called synapses. The received input signals travel to the body of the neuron. There they are summed: some of the signals tend to excite the neuron, while others tend to prevent its excitation. If the total excitation of the neuron exceeds a given minimum, it sends a signal to other neurons through the axon. This scheme is, of course, a simplification with exceptions and more complex cases, but the majority of artificial neurons model exactly these simple properties of biological neurons.
Artificial neural networks are composed of simple elements, neurons, connected with each other through a huge number of connections. The artificial neuron imitates the properties of the real one. A number of input signals arrive at its input, each of which is an output signal from another neuron. Each input signal is multiplied by a corresponding weight (analogous to the strength of the synapse), the products are added together, and in this way the so-called activation level of the neuron is determined, which in turn determines its subsequent behavior.
The simplest model of an artificial neuron was proposed by McCulloch and Pitts. The scheme of such a neuron is represented in the next figure.
Artificial Neuron
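The input signals $x_1, \dots, x_n$ are multiplied by the corresponding weights $w_1, \dots, w_n$ and are summed in the core. After that the sum is compared with some minimal value (threshold) $\theta$. The output signal $y$ is determined in the following way.

$$y = f(S)$$

Here $S$ represents the activation level of the neuron:

$$S = \sum_{i=1}^{n} w_i x_i - \theta$$

The function $f$ is called the activation function. In the model of McCulloch and Pitts it is discrete:

$$f(S) = \begin{cases} 1, & S \ge 0 \\ 0, & S < 0 \end{cases}$$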
A positive weight has an excitatory effect on the synapse, and a negative one has an inhibitory effect. If the weight is zero, this indicates that there is no connection between the neurons.
Because a discrete function of this type has its disadvantages, the following functions are most commonly used as activation functions, ‘squashing’ the output signal into fixed bounds (here $S$ denotes the activation level of the neuron):
Logistic function, varying in the range between 0 and 1:

$$f(S) = \frac{1}{1 + e^{-S}}$$

Hyperbolic tangent, varying in the range between -1 and 1:

$$f(S) = \tanh(S) = \frac{e^{S} - e^{-S}}{e^{S} + e^{-S}}$$
The set of artificial neurons, arranged in layers, together with the connections between them, forms the architecture of the artificial neural network. The layers can be input, hidden (one or more) and output.
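The artificial neuron and the activation functions described above can be sketched in a few lines of Python. This is only a minimal illustration: the function names, and the example in which the weights and the threshold are picked by hand so that the neuron computes a logical AND, are assumptions made for this sketch and do not come from the original model description.

```python
import math

def step(s):
    """Discrete (threshold) activation: 1 if s >= 0, otherwise 0."""
    return 1 if s >= 0 else 0

def logistic(s):
    """Logistic function, values strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-s))

def neuron(inputs, weights, threshold, activation=step):
    """Weighted sum of the inputs minus the threshold, passed through the activation."""
    s = sum(w * x for w, x in zip(weights, inputs)) - threshold
    return activation(s)

# Hypothetical example: hand-picked weights and threshold that make
# the neuron behave like a logical AND of two binary inputs.
and_outputs = [neuron([a, b], [1.0, 1.0], 1.5) for a in (0, 1) for b in (0, 1)]

# The same neuron with the smooth activations; math.tanh is the hyperbolic tangent.
smooth_output = neuron([1, 1], [1.0, 1.0], 1.5, activation=logistic)
tanh_output = neuron([1, 1], [1.0, 1.0], 1.5, activation=math.tanh)
```

With the step function the neuron fires only for the input (1, 1). With the logistic function or the hyperbolic tangent the same activation level gives a graded output instead of a hard 0 or 1.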
The great variety of algorithms united under the name ‘neural networks’ also leads to a variety of classifications of their types. The most significant classification is by the way in which they learn. Learning is a process in which the set of data is passed to the network repeatedly, in steps called epochs. There are two types of learning: ‘Supervised Learning’ and ‘Unsupervised Learning’.
In the case of ‘Supervised Learning’ a set of input variables together with the corresponding output variable is passed to the network.
In the case of ‘Unsupervised Learning’ the network does not need an output variable. The process follows the properties of the input data and groups similar objects into clusters.
At present a great variety of neural network architectures exists: multilayer perceptron (MLP), probabilistic neural networks (PNN), radial basis functions (RBF), the associative network of Hopfield, the Elman model, self-organizing Kohonen maps (SOM), also called self-organizing feature maps (SOFM), etc.
Multilayer Perceptron
The figure shows an example scheme of a neural network with one hidden layer. This architecture is probably the most widely used. There can also be more than one hidden layer, but in practice networks with more than two to three hidden layers are not applied. The reason is that ‘theoretically, a multilayer perceptron with two hidden layers is enough to model an arbitrary task (in its exact formulation this result is known as the Kolmogorov theorem)’.
Multilayer Perceptron with one hidden layer

This type of network is called a Multilayer Perceptron (MLP). It was proposed by Rumelhart and McClelland and is discussed in detail in all works on neural networks. Each element of the input layer is connected through a defined weight with each element of the next layer (in this case the hidden layer). In a similar way each element of the hidden layer is connected with each element of the output layer.
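The way a signal passes through such a fully connected network can be sketched as follows. This is an illustrative sketch only: the layer sizes, the random weights and the names `layer` and `forward` are assumptions, and the bias of each neuron plays the role of a threshold taken with the opposite sign.

```python
import math
import random

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

def layer(inputs, weights, biases):
    """Every neuron of the layer receives all inputs: logistic(weighted sum + bias)."""
    return [logistic(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
            for neuron_weights, b in zip(weights, biases)]

def forward(x, hidden_w, hidden_b, output_w, output_b):
    """Pass the input through the hidden layer and then through the output layer."""
    hidden = layer(x, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

# Assumed sizes for illustration: 3 inputs, 4 hidden neurons, 2 outputs.
random.seed(0)
n_inputs, n_hidden, n_outputs = 3, 4, 2
hidden_w = [[random.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
hidden_b = [0.0] * n_hidden
output_w = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_outputs)]
output_b = [0.0] * n_outputs

y = forward([0.5, -1.0, 2.0], hidden_w, hidden_b, output_w, output_b)
```

Before training the weights are random, so the two output values carry no meaning yet; the sketch only shows how the layers are wired together.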
Before the network can be used, it should be trained. The aim is to find values of the weights and thresholds that minimize the aggregate error of the network. This is done by passing the whole set of real data through the network and comparing the real values with the forecasted ones. All such differences are added together and the resulting value is the error. Most often this is the mean squared error, in which case the following target function is minimized:

$$E = \frac{1}{N} \sum_{j=1}^{N} (y_j - d_j)^2$$

where:
$y_j$ is the value of the output layer neuron $j$, calculated by the network;
$d_j$ is the real value of this neuron on the basis of the input data;
$N$ is the number of compared values.
The set of data is passed to the network for a defined number of epochs. At each following epoch the network reduces its error, until it reaches some predetermined criterion, for example the size of the error, the rate at which the error decreases, the number of epochs, etc.
Unlike the linear methods, in which the minimum of the function is found analytically, with neural networks this is impossible. The minimum is searched for by iterative algorithms that ‘travel over’ the so-called error surface, which is a multidimensional surface with a complex relief. During this iterative process, however, there is a risk that the minimum found by the algorithm is only local and not global, which means that the best solution has not been found. This is the price paid for the nonlinear modeling capabilities of neural networks. In order to decrease the probability of ending up in a local minimum, the network is trained many times and the results are compared.
Different algorithms are used for training the networks. The most popular of them is back propagation. There are also numerous other algorithms, such as the gradient method, the Levenberg-Marquardt algorithm, etc.
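The training process described above can be illustrated with a small sketch. A network with one hidden layer is trained on the ‘XOR’ task by back propagation, and the training is repeated from several random starting points so that the results can be compared. The network size, learning rate, number of epochs and all names in the code are assumptions chosen for illustration, and the constant factor of the squared-error derivative is absorbed into the learning rate.

```python
import math
import random

def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))

XOR_DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

def predict(net, x):
    """Return the hidden layer values and the single output value."""
    hidden = [logistic(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w_hidden"], net["b_hidden"])]
    out = logistic(sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"])
    return hidden, out

def mean_squared_error(net, data):
    return sum((predict(net, x)[1] - d) ** 2 for x, d in data) / len(data)

def train(data, n_hidden=4, epochs=3000, rate=0.5, seed=0):
    rng = random.Random(seed)
    n_inputs = len(data[0][0])
    net = {
        "w_hidden": [[rng.uniform(-1, 1) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "w_out": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b_out": rng.uniform(-1, 1),
    }
    initial_error = mean_squared_error(net, data)
    for _ in range(epochs):
        for x, d in data:
            hidden, y = predict(net, x)
            # Error signal of the output neuron; y * (1 - y) is the logistic derivative.
            delta_out = (y - d) * y * (1 - y)
            # Propagate the error signal back to the hidden neurons.
            delta_hidden = [delta_out * net["w_out"][j] * hidden[j] * (1 - hidden[j])
                            for j in range(n_hidden)]
            for j in range(n_hidden):
                net["w_out"][j] -= rate * delta_out * hidden[j]
                net["b_hidden"][j] -= rate * delta_hidden[j]
                for i in range(n_inputs):
                    net["w_hidden"][j][i] -= rate * delta_hidden[j] * x[i]
            net["b_out"] -= rate * delta_out
    return net, initial_error, mean_squared_error(net, data)

# Train from several random starting points and keep the run with the lowest error.
runs = [train(XOR_DATA, seed=s) for s in range(3)]
best_net, best_initial_error, best_final_error = min(runs, key=lambda run: run[2])
```

Comparing the final errors of the individual runs shows whether some of them ended in a worse (local) minimum than others.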
One of the biggest difficulties when training neural networks is choosing the parameters in such a way that the network is able to generalize later on. That is, the network should be able to process new data, unknown to it, and produce a correct result. If this requirement is not taken into account during training, the network learns to fit the training data exactly but loses the ability to generalize. This problem is known as over-learning (overfitting).
An example of such a problem can also be given from regression analysis. It is known that when the data is approximated by a polynomial function, the higher its degree, the higher the coefficient of determination. This, however, does not necessarily mean that the polynomial of higher degree should be preferred.
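The regression example can be checked numerically with a short sketch. The data here are synthetic (a straight line plus random noise), and the helper names and the choice of degrees are assumptions made for illustration. The polynomials are fitted by least squares through the normal equations, which is adequate for the low degrees used here.

```python
import random

def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients c0..c_degree from the normal equations."""
    k = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(ata, aty)

def r_squared(xs, ys, coeffs):
    """Coefficient of determination of the fitted polynomial on the given data."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
                 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

rng = random.Random(0)
xs = [i / 10 - 1.0 for i in range(21)]             # 21 points in [-1, 1]
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]   # linear trend plus noise
scores = {d: r_squared(xs, ys, fit_polynomial(xs, ys, d)) for d in (1, 2, 3, 5)}
```

The values in `scores` do not decrease as the degree grows, although the data were generated from a straight line. The extra terms only fit the noise.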
In order to avoid over-learning of the network, the so-called cross-validation is done. A part of the input data is used as a kind of ‘independent control’ of the results. This dataset is called the ‘validation dataset’, and the dataset used during the training is called the ‘training dataset’. At the beginning of the training the errors for the training and the validation datasets are roughly the same. If, in the process of training, the error on the validation dataset decreases together with the error on the training one, this shows that the network needs more training. If the error on the validation dataset stops decreasing or even starts increasing, the training should be terminated, because the network has started to over-learn.
In practice, searching for the model and the most suitable settings takes a lot of experimentation. This has an unpleasant consequence: the validation dataset starts playing a role in making the choice. It becomes a part of the training process and can no longer be used for ‘independent control’. That is why it is necessary to select another subset of the data, the so-called ‘test’ dataset. It is used only once, at the end of the training, in order to check the results.
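The splitting of the data and the stopping rule described above can be sketched as follows. In this sketch the subset used for independent control during training is called `validation` and the subset used once at the end is called `test`. The shares of the split, the `patience` parameter and the list of error values are hypothetical and serve only as an illustration.

```python
import random

def split_dataset(samples, train_share=0.6, validation_share=0.2, seed=0):
    """Shuffle the samples and cut them into training, validation and test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_share)
    n_validation = int(len(shuffled) * validation_share)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_validation],
            shuffled[n_train + n_validation:])

def early_stopping_epoch(validation_errors, patience=3):
    """Return the epoch with the lowest validation error, ending the scan once
    the error has not improved for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

train_part, validation_part, test_part = split_dataset(list(range(10)))

# Hypothetical validation errors per epoch: they fall, then start to rise.
errors = [0.90, 0.70, 0.50, 0.45, 0.47, 0.50, 0.55, 0.40]
stop_at = early_stopping_epoch(errors, patience=3)
```

In this example the scan ends before the last value is reached, so the weights from epoch 3 (counting from 0) would be kept. The test part is not touched until the model and its settings are final.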
Self Organizing Kohonen Maps
The idea for Kohonen networks (Self Organizing Maps, SOM, or Self Organizing Feature Maps, SOFM) also originated by analogy with some features of the human brain. The cerebral cortex is a big flat sheet (with an area of about 0.5 m², which is strongly folded in order to fit in the human skull) with topological features. For example, the section responsible for the wrist is situated near the section responsible for the movement of the whole hand. In this way the image of the human body is constantly being mapped onto the two-dimensional surface of the cerebral cortex.
Kohonen maps are among the most popular kinds of neural networks. They are intended to identify clusters of similar data and to determine their proximity as well. They work on the principle of ‘unsupervised learning’, carrying out a process of clustering. Only input data is sent to the network, and no output information is given to it in advance.
The algorithm used in Kohonen maps is a variation of the clustering of multidimensional vectors. With the help of this algorithm, a mapping from a higher-dimensional input space (determined by the number of indicators) to a lower-dimensional one (usually two-dimensional, but one-dimensional is also possible) is achieved, preserving the topological similarity. This means that all vectors that are adjacent on the topological map are also adjacent in the input space. It should be noted that the opposite is not always true.
The Kohonen network is trained by the method of successive approximations (Kaski). Each neuron on the topological map is an $n$-dimensional vector $m_i$, where $n$ is the size of the input space (the number of indicators). The number of neurons on the topological map determines the level of detail of the results of the algorithm. Their initial positions are chosen randomly.
Kohonen Map

Beginning with these randomly situated cluster centers, the algorithm gradually improves their positions in such a way as to capture the clustering of the input data (the objects in the input space are represented as dots). As a result of the iterative training procedure, the map organizes itself in such a way that the elements that correspond to the centers and are situated near one another in the input space are also situated close together on the topological map (the output layer).
The algorithm is known as ‘the winner takes all’ and consists of the following:
1. The neuron-winner is chosen (the one situated closest to the input example-object). In practice, training the Kohonen map is a correction of the positions of the vectors-neurons on the topological map. At every step of the training (the term used with neural networks is epoch) one of the vectors is chosen randomly from the input data, and then the neuron vector on the topological map that is nearest to it is looked for. In this way the neuron-winner, which most resembles the input vector, is chosen. Resemblance here means the distance between the vectors (usually the Euclidean distance). The formula is:

$$\|x - m_c\| = \min_i \|x - m_i\|$$

where:
$x$ is the input vector;
$m_i$ is the vector of weights of neuron $i$ (of the output layer);
$m_c$ is the vector of weights of the neuron-winner.
2. The neuron-winner is corrected in such a way that it resembles the input example more closely (for this purpose the weighted sum of the previous center of the neuron and the input example is calculated). In doing so, the vector describing the neuron-winner and the vectors describing its neighbors on the topological map move in the direction of the input vector.
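The following formula is used in this process of correcting the weights:

$$m_i(t+1) = m_i(t) + h_{ci}(t)\,[x(t) - m_i(t)]$$

where $m_i(t)$ is the vector of weights of neuron $i$, $c$ is the index of the neuron-winner and $t$ is the number of the epoch (discrete time). The vector $x(t)$ is randomly chosen from the input set of vectors at epoch $t$. The function $h_{ci}(t)$ is called the neighborhood function of the neuron. It is a non-increasing function of time and of the distance between the neuron-winner and its neighbors.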
When training the Kohonen network, the concept of ‘neighborhood’ is used. The neighborhood is the set of neurons surrounding the neuron-winner. Its size decreases with time, and at the end it becomes equal to zero, that is, it consists only of the neuron-winner itself. As a result of this procedure, whole sections of the network, and not only single neurons, are attracted to the input examples (input objects). In this way observations that are similar to one another activate groups of neurons situated close together on the topological map. The process is repeated over and over again for the chosen number of epochs.
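The two steps of the algorithm (choosing the neuron-winner and correcting it together with its neighbors) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the map is a small rectangular grid stored as a dictionary keyed by (row, column), the neighborhood function is a Gaussian, and the learning rate and radius shrink linearly with time. None of these choices is prescribed by the description above; they are one common way to satisfy it.

```python
import math
import random

def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def find_winner(weights, x):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda pos: distance(weights[pos], x))

def update(weights, x, winner, rate, radius):
    """Move the winner and its neighbors on the map toward the input vector x."""
    for pos, w in weights.items():
        grid_dist_sq = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
        # Assumed Gaussian neighborhood: largest at the winner, falling with map distance.
        h = rate * math.exp(-grid_dist_sq / (2.0 * radius ** 2))
        weights[pos] = [wi + h * (xi - wi) for wi, xi in zip(w, x)]

def train_som(data, rows=4, cols=4, epochs=500, rate0=0.5, radius0=2.0, seed=0):
    rng = random.Random(seed)
    dim = len(data[0])
    # Initial positions of the neurons are chosen randomly.
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for t in range(epochs):
        decay = 1.0 - t / epochs            # shrinks linearly from 1 toward 0
        x = rng.choice(data)                # random input vector for this epoch
        winner = find_winner(weights, x)
        # The radius is kept slightly above zero to avoid division by zero.
        update(weights, x, winner, rate0 * decay, max(radius0 * decay, 0.01))
    return weights

# Synthetic two-dimensional data with two clusters (an assumption for the demo).
rng = random.Random(1)
data = ([[rng.gauss(0.2, 0.05), rng.gauss(0.2, 0.05)] for _ in range(50)] +
        [[rng.gauss(0.8, 0.05), rng.gauss(0.8, 0.05)] for _ in range(50)])
som = train_som(data)
```

After training, `find_winner(som, x)` gives the map position that an observation `x` activates, so similar observations can be compared by how close their winners lie on the grid.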
After the Kohonen network is trained, the so-called ‘Unified Distance Matrix’ (U-Matrix) is used for the recognition of clusters. For this purpose the distance (usually Euclidean) from each neuron to its neighbors on the topological map is calculated. This distance determines the color in which the neuron is represented on the map. Small distances indicate that neighboring neurons are similar, and large ones indicate differences. The coloring is done by analogy with altitude maps: the small values are colored in green, and the high ones in brown. In this way the clusters on the map should form green areas, surrounded by beige-brown-red areas that mark the boundaries of the clusters. Another option is black and white coloring. In this case white corresponds to the small distances and black to the large ones, so the clusters are colored in white and the boundaries in black.
Unified Distance Matrix
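The calculation behind such a map can be sketched as follows. The sketch assumes that the trained map is stored as a dictionary keyed by (row, column), that only the direct neighbors (up, down, left, right) are considered, and that the distances to them are averaged. These are illustrative choices, and the tiny map in the example is hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def u_matrix(weights, rows, cols):
    """Average distance from each neuron to its direct neighbors on the map."""
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = [(r + dr, c + dc)
                         for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                         if 0 <= r + dr < rows and 0 <= c + dc < cols]
            if neighbors:
                result[r][c] = sum(euclidean(weights[(r, c)], weights[n])
                                   for n in neighbors) / len(neighbors)
    return result

# Hypothetical 1 x 3 map: two similar neurons and one that lies far from them.
toy_map = {(0, 0): [0.0, 0.0], (0, 1): [0.1, 0.0], (0, 2): [1.0, 0.0]}
u = u_matrix(toy_map, rows=1, cols=3)
```

The first value is small because the first two neurons are similar. The larger values at the second and third positions mark the boundary between them and the distant third neuron.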

It is also possible to produce maps of the variables used for describing the input vectors. In this way it can be identified in which region of the map the corresponding variable has low values and in which region it has high ones. This makes it possible to create ‘portraits’ of the clusters, that is, to compose their description. The resulting set of maps represents a kind of ‘atlas’, describing the position of the variables and clusters in the set of data.
Author: Research Associate 1st degree Alexander Tzvetkov, PhD