What is a probabilistic neural network?
Probabilistic Neural Network.
The probabilistic neural network was developed by Donald Specht. His network architecture was first presented in two papers, "Probabilistic Neural Networks for Classification, Mapping or Associative Memory" and "Probabilistic Neural Networks,"
released in 1988 and 1990, respectively. This network provides a general solution to pattern classification problems by following an approach developed in statistics, known as Bayesian classification.
Bayesian decision theory, which builds on Bayes' theorem (published in 1763) and was formalized in the 1950s, takes into account the relative likelihood of events and uses a priori information to improve prediction. The network paradigm also uses Parzen estimators, which were developed to construct the probability density functions required by Bayes theory.
The probabilistic neural network uses a supervised training set to develop distribution functions within a pattern layer. These functions, in the recall mode, are used to estimate the likelihood of an input feature vector being part of a learned category, or class.
The learned patterns can also be combined, or weighted, with the a priori probability, also called the relative frequency, of each category to determine the most likely class for a given input vector. If the relative frequency of the categories is unknown, then all categories can be assumed to be equally
likely and the determination of category is solely based on the closeness of the input feature vector to the distribution function of a class. An example of a probabilistic neural network is shown in Figure 5.2.3. This network has three layers.
The network contains an input layer which has as many elements as there are separable parameters needed to describe the objects to be classified. It has a pattern layer, which organizes the training set such that each input vector is represented by an individual processing element.
And finally, the network contains an output layer, called the summation layer, which has as many processing elements as there are classes to be recognized. Each element in this layer combines the outputs of the processing elements within the pattern layer that relate to the same class and prepares
that category for output. Sometimes a fourth layer is added to normalize the input vector, if the inputs are not already normalized before they enter the network. As with the counter-propagation network, the input vector must be normalized to provide proper object separation in the pattern layer.
1- Figure 5.2.3. A Probabilistic Neural Network Example.
As mentioned earlier, the pattern layer represents a neural implementation of a version of a Bayes classifier, where the class dependent probability density functions are approximated using a Parzen estimator. This approach provides an optimum pattern classifier in terms of minimizing the expected risk of wrongly classifying an object.
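As a rough sketch of the two ideas in this paragraph, the class-dependent density estimate and the resulting decision rule can be written as follows. The Gaussian kernel, the smoothing width $\sigma$, and the symbols $n_k$, $d$, $\mathbf{x}_{k,i}$ and $h_k$ are assumptions introduced here for illustration; the text does not fix a particular kernel or notation.

```latex
% Parzen estimate of the density of class k (Gaussian kernel assumed)
\[
\hat{f}_k(\mathbf{x}) =
  \frac{1}{n_k\,(2\pi)^{d/2}\,\sigma^{d}}
  \sum_{i=1}^{n_k}
  \exp\!\left(-\frac{\lVert \mathbf{x}-\mathbf{x}_{k,i}\rVert^{2}}{2\sigma^{2}}\right)
\]

% Bayes decision rule with a priori probabilities h_k
\[
\hat{k}(\mathbf{x}) = \arg\max_{k}\; h_k\,\hat{f}_k(\mathbf{x})
\]
```

Here $n_k$ is the number of training vectors in class $k$, $d$ is the dimension of the input vector, $\mathbf{x}_{k,i}$ is the $i$-th training vector of class $k$, and $h_k$ is the a priori probability of class $k$. With equal priors, the rule reduces to choosing the class with the largest estimated density.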
With the estimator, the approach gets closer to the true underlying class density functions as the number of training samples increases, so long as the training set is an adequate representation of the class distinctions. In the pattern layer, there is a processing element for each input vector in the training set. Normally, there is an equal number of processing elements for each output class.
Otherwise, one or more classes may be skewed incorrectly and the network will generate poor results. Each processing element in the pattern layer is trained once. An element is trained to generate a high output value when an input vector matches the training vector.
The training function may include a global smoothing factor to better generalize classification results. In any case, the training vectors do not have to be in any special order in the training set, since the category of a particular vector is specified by the desired output of the input. The learning function simply selects the first untrained processing element in the correct output class and modifies its weights to match the training vector.
The pattern layer operates competitively, where only the highest match to an input vector wins and generates an output. In this way, only one classification category is generated for any given input vector.
If the input does not relate well to any patterns programmed into the pattern layer, no output is generated. The Parzen estimation can be added to the pattern layer to fine-tune the classification of objects. This is done by adding the frequency of occurrence for each training pattern built into a processing element.
Basically, the probability distribution of occurrence for each example in a class is multiplied into its respective training node. In this way, a more accurate expectation of an object is added to the features which make it recognizable as a class member. Training of the probabilistic neural network is much simpler than with back-propagation.
However, the pattern layer can become very large if the distinctions between categories are varied and, at the same time, quite similar in particular areas. There are many proponents for this type of network, since the groundwork for optimization is founded in well-known, classical mathematics.
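The recall path described above (one pattern-layer element per training vector, one summation element per class, optional weighting by a priori probability) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the Gaussian kernel, the smoothing factor `sigma`, and the function name `pnn_classify` are all hypothetical choices made here. The sketch also follows the summation form, in which every pattern element contributes to its class total, rather than a strict winner-take-all pattern layer.

```python
import numpy as np

def pnn_classify(x, train_X, train_y, sigma=0.5, priors=None):
    """Classify one input vector x; returns (predicted_class, per-class scores).

    Assumptions (not from the source text): Gaussian kernel, a single
    global smoothing factor sigma, and class scores formed by averaging.
    """
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    x = np.asarray(x, dtype=float)
    classes = np.unique(train_y)

    # Pattern layer: one element per training vector. Each element
    # responds most strongly when x matches its stored vector.
    sq_dist = np.sum((train_X - x) ** 2, axis=1)
    activations = np.exp(-sq_dist / (2.0 * sigma ** 2))

    # Summation layer: one element per class, averaging the pattern
    # elements that belong to that class.
    densities = np.array([activations[train_y == c].mean() for c in classes])

    # Weight by a priori probability; assume equally likely classes
    # when the relative frequencies are unknown.
    if priors is None:
        priors = np.full(len(classes), 1.0 / len(classes))
    scores = densities * np.asarray(priors, dtype=float)

    return classes[np.argmax(scores)], scores
```

"Training" in this sketch amounts to storing `train_X` and `train_y`, which reflects why training is so much simpler than with back-propagation. The cost is that the pattern layer grows with the size of the training set.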
3- Networks for Data Association
The previous class of networks, classification, is related to networks for data association. In data association, classification is still done. For example, a character reader will classify each of its scanned inputs. However, an additional element exists for most applications. That element is the fact that some data is simply erroneous.
Credit card applications might have been rendered unreadable by water stains. The scanner might have lost its light source. The card itself might have been filled out by a five-year-old. Networks for data association recognize these occurrences as simply bad data and they recognize that this bad data can span all classifications.
4- Hopfield Network.
John Hopfield first presented his cross-bar associative network in 1982 at the National Academy of Sciences. In honor of Hopfield's success and his championing of neural networks in general, this network paradigm is usually referred to as a Hopfield Network.
The network can be conceptualized in terms of its energy and the physics of dynamic systems. A processing element in the Hopfield layer will change state only if the overall "energy" of the state space is reduced.
In other words, the state of a processing element will vary depending on whether the change will reduce the overall "frustration level" of the network. Primary applications for this sort of network have included
associative, or content-addressable, memories and a whole set of optimization problems, such as the combinatoric best route for a traveling salesman.
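The "energy" referred to above is commonly written in the following form. The notation is an assumption introduced here for illustration: $s_i$ is the state of processing element $i$, $w_{ij}$ is the connection weight between elements $i$ and $j$, the weights are taken to be symmetric with no self-connections, and thresholds are taken to be zero.

```latex
% Energy of a network state
\[
E = -\frac{1}{2}\sum_{i}\sum_{j \neq i} w_{ij}\, s_i\, s_j
\]

% Change in energy when element i alone changes state by \Delta s_i
\[
\Delta E_i = -\,\Delta s_i \sum_{j \neq i} w_{ij}\, s_j
\]
```

Under these assumptions, an element that switches on only when its weighted input sum is non-negative, and off only when it is negative, always produces $\Delta E_i \le 0$, which is why the state settles toward a minimum of $E$.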
Figure 5.3.1 outlines a basic Hopfield network. The original network had each processing element operate in a binary format. This is where the elements compute the weighted sum of the inputs and quantize the output to a zero or one.
These restrictions were later relaxed, in that the paradigm can use a sigmoid based transfer function for finer class distinction. Hopfield himself showed that the resulting network is equivalent to the original network designed in 1982.
5- Figure 5.3.1. A Hopfield Network Example.
The Hopfield network uses three layers: an input buffer, a Hopfield layer, and an output layer. Each layer has the same number of processing elements. The inputs of the Hopfield layer are connected to the outputs of the corresponding processing elements in the input buffer layer through variable connection weights.
The outputs of the Hopfield layer are connected back to the inputs of every other processing element except itself. They are also connected to the corresponding elements in the output layer. In normal recall operation, the network applies the data from the input layer through the learned connection weights to the Hopfield layer.
The Hopfield layer oscillates until some fixed number of cycles have been completed, and the current state of that layer is passed on to the output layer. This state matches a pattern already programmed into the network.
The learning of a Hopfield network requires that a training pattern be impressed on both the input and output layers simultaneously. The recursive nature of the Hopfield layer provides a means of adjusting all of the
connection weights. The learning rule is the Hopfield Law, where connections are increased when both the input and output of a Hopfield element are the same and the connection weights are decreased if the output does not match the input.
Obviously, any non-binary implementation of the network must have a threshold mechanism in the transfer function, or matching input-output pairs could be too rare to train the network properly. The Hopfield network has two major limitations when used as a content-addressable memory. First, the number of patterns that can be stored and accurately recalled is severely limited.
If too many patterns are stored, the network may converge to a novel spurious pattern different from all programmed patterns. Or, it may not converge at all. The storage capacity limit for the network is approximately fifteen percent of the number of processing elements in the Hopfield layer.
The second limitation of the paradigm is that the Hopfield layer may become unstable if the patterns it stores are too similar to one another. Here an example pattern is considered unstable if it is applied at time zero and the network converges to some other pattern from the training set. This problem can be minimized by modifying the pattern set so that the patterns are more nearly orthogonal to one another.
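The learning rule and recall behavior described in this section can be illustrated with a small Python sketch. Several choices here are assumptions rather than anything stated in the text: states are coded as +1/-1 instead of 0/1, the weights are built in one step with an outer-product (Hebbian) rule, which strengthens a connection when two elements agree and weakens it when they disagree, and the function names are hypothetical. The input and output buffer layers are omitted; only the Hopfield layer is modeled.

```python
import numpy as np

def train_hopfield(patterns):
    """Build a symmetric weight matrix from +1/-1 patterns (one per row)."""
    P = np.asarray(patterns, dtype=float)
    W = P.T @ P                 # agreement raises a weight, disagreement lowers it
    np.fill_diagonal(W, 0.0)    # no element feeds back onto itself
    return W

def energy(W, state):
    """Energy of a state; recall should never increase this value."""
    s = np.asarray(state, dtype=float)
    return -0.5 * s @ W @ s

def recall(W, state, sweeps=10):
    """Update elements one at a time until nothing changes or sweeps run out."""
    s = np.asarray(state, dtype=float).copy()
    for _ in range(sweeps):
        changed = False
        for i in range(len(s)):
            new_value = 1.0 if W[i] @ s >= 0 else -1.0
            if new_value != s[i]:
                s[i] = new_value
                changed = True
        if not changed:         # settled into a stable state
            break
    return s
```

A typical use is to store two mutually orthogonal patterns, flip one bit of the first, and pass the corrupted vector to `recall`. Choosing orthogonal patterns mirrors the advice above for avoiding unstable recall.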
6- Boltzmann Machine.
The Boltzmann machine is similar in function and operation to the Hopfield network with the addition of using a simulated annealing technique when determining the original pattern. The Boltzmann machine incorporates the concept of simulated annealing to search the pattern layer's state space for a global minimum.
Ackley, Hinton, and Sejnowski developed the Boltzmann learning rule in 1985. Like the Hopfield network, the Boltzmann machine has an associated state space energy based upon the connection weights in the pattern layer.
The process of learning a training set full of patterns involves the minimization of this state space energy. Because of this, the machine will gravitate to an improved set of values for the connection weights as data iterates through the system.
The Boltzmann machine requires a simulated annealing schedule, which is added to the learning process of the network. Just as in physical annealing, temperatures start at higher values and decrease over time. The increased temperature adds an increased noise factor into each processing element in the pattern layer.
Typically, the final temperature is zero. If the network fails to settle properly, adding more iterations at lower temperatures may help to get to an optimum solution. A Boltzmann machine learning at high temperature behaves much like a random model and at low temperatures it behaves like a deterministic
model. Because of the random component in annealed learning, a processing element can sometimes assume a new state value that increases rather than decreases the overall energy of the system. This mimics physical annealing and is helpful in escaping local minima and moving toward a global minimum.
As with the Hopfield network, once a set of patterns has been learned, a partial pattern can be presented to the network and it will complete the missing information. The limitation on the number of classes, being less than fifteen percent of the total processing elements in the pattern layer, still applies.
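The effect of the annealing schedule on a single processing element can be sketched as follows. This illustrates only the noisy state update during settling, not the Boltzmann learning rule for the weights. The +1/-1 state coding, the logistic switching probability, and the names `anneal`, `temperatures` and `sweeps_per_temp` are assumptions made for this example.

```python
import numpy as np

def anneal(W, state, temperatures, sweeps_per_temp=5, seed=0):
    """Settle a +1/-1 state vector under a decreasing temperature schedule.

    W is assumed symmetric with a zero diagonal. A temperature of zero
    falls back to the deterministic, Hopfield-style update.
    """
    rng = np.random.default_rng(seed)
    s = np.asarray(state, dtype=float).copy()
    for T in temperatures:
        for _ in range(sweeps_per_temp):
            for i in rng.permutation(len(s)):
                h = W[i] @ s    # weighted input to element i
                if T > 0:
                    # High T: probability near 0.5, so the state is mostly noise.
                    # Low T: probability near 0 or 1, so the update is almost
                    # deterministic. Energy-raising moves stay possible at T > 0.
                    p_on = 1.0 / (1.0 + np.exp(-2.0 * h / T))
                    s[i] = 1.0 if rng.random() < p_on else -1.0
                else:
                    s[i] = 1.0 if h >= 0 else -1.0
    return s
```

A schedule such as `[4.0, 2.0, 1.0, 0.5, 0.0]` starts noisy and ends at zero temperature, matching the description above. If the state fails to settle, more sweeps at the lower temperatures can be added through `sweeps_per_temp` or by extending the schedule.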
7- Hamming Network.
The Hamming network is an extension of the Hopfield network in that it adds a maximum likelihood classifier to the front end. This network was developed by Richard Lippmann in the mid-1980s. The Hamming network implements a classifier based upon least error for binary input vectors, where the error is defined by the Hamming distance.
The Hamming distance is defined as the number of bits which differ between two corresponding, fixed-length input vectors. One input vector is the noiseless example pattern, the other is a pattern corrupted by real-world events. In this network architecture, the output categories are defined by a noiseless, pattern-filled training set.
In the recall mode, any incoming input vector is then assigned to the category for which the distance between the example input vector and the current input vector is minimum. The Hamming network has three layers. There is an example network shown in Figure 5.3.2.
The network uses an input layer with as many nodes as there are separate binary features. It has a category layer, which is the Hopfield layer, with as many nodes as there are categories, or classes. This differs significantly from the formal Hopfield architecture, which has as many nodes in the middle layer as there are input nodes.
And finally, there is an output layer which matches the number of nodes in the category layer. The network is a simple feedforward architecture with the input layer fully connected to the category layer. Each processing element in the category layer is connected back to every other element in that same layer, and also has a direct connection to the corresponding output processing element. The output from the category layer to the output layer is done through competition.
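The recall behavior of the Hamming network can be sketched in Python as a matching step followed by a competition step. The competition shown here is a lateral-inhibition loop in which each category element is suppressed by the others until only one remains active. That mechanism, the inhibition strength `epsilon`, and the function names are assumptions made for illustration, since the text says only that the output is determined through competition.

```python
import numpy as np

def hamming_distance(a, b):
    """Number of positions at which two equal-length binary vectors differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def hamming_classify(x, exemplars, epsilon=None, max_iter=1000):
    """Return (winning_category_index, matching_scores) for binary input x."""
    exemplars = np.asarray(exemplars)
    n_bits = exemplars.shape[1]

    # Category layer input: number of bits that agree with each noiseless
    # exemplar, so a smaller Hamming distance gives a larger score.
    scores = np.array(
        [n_bits - hamming_distance(x, e) for e in exemplars], dtype=float
    )

    # Competition: each element inhibits every other element in the layer.
    m = len(exemplars)
    if epsilon is None:
        epsilon = 1.0 / (2.0 * m)   # assumed value, kept below 1/m
    y = scores.copy()
    for _ in range(max_iter):
        if np.count_nonzero(y > 0) <= 1:
            break                   # at most one element is still active
        y = np.maximum(0.0, y - epsilon * (y.sum() - y))

    # Exact ties are not handled specially; argmax picks the first index.
    return int(np.argmax(y)), scores
```

For example, with exemplars `[0,0,0,0,0,0,0,0]`, `[1,1,1,1,0,0,0,0]` and `[1,1,1,1,1,1,1,1]`, the input `[1,1,1,0,0,0,0,0]` lies at Hamming distances 3, 1 and 5 respectively, so the second category has the highest matching score and survives the competition.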