A high-level neural network or functional link network



Introduction

To keep the weights within the compact region where the directed random search works best, an upper bound is required on each weight's magnitude. Yet, by setting this bound reasonably high, the network is still allowed to seek what is not exactly known: the true global optimum.


The second key parameter to this learning rule is the initial variance of the random distribution of the weights. Most commercial packages supply a vendor-recommended value for this initial variance parameter.


Yet the precise setting of this number is not critical, since the self-adjusting feature of the directed random search has proven robust over a wide range of initial variances.

The delta bar delta rule uses past error values to anticipate future ones; this knowledge of probable errors enables the system to take intelligent steps in adjusting the weights. However, the process is complicated by empirical evidence that each weight may have a quite different effect on the overall error.


Jacobs then suggested the common sense notion that back-propagation learning rules should account for these variations in the effect on the overall error. In other words, every connection weight of a network should have its own learning rate. 


The claim is that the step size appropriate for one connection weight may not be appropriate for all weights in that layer. Further, these learning rates should be allowed to vary over time. 


By assigning a learning rate to each connection and permitting it to change continuously over time, more degrees of freedom are introduced to reduce the time to convergence. The rules that implement this algorithm are straightforward and easy to apply.


Each connection weight has its own learning rate, and these rates are varied based on the current error information found with standard back-propagation. If the local error keeps the same sign for several consecutive time steps, the learning rate for that connection is increased linearly. Incrementing linearly prevents the learning rates from becoming too large too fast.


When the local error changes sign frequently, the learning rate is decreased geometrically. Decrementing geometrically ensures that the connection learning rates always remain positive. Further, they can be decreased more rapidly in regions where the change in error is large.
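As a concrete illustration, the sketch below implements a common formulation of this rule, Jacobs' delta bar delta, in which the current gradient is compared with an exponentially decayed average of past gradients (the "bar delta"). The parameter names kappa, phi, and theta, and their values, are illustrative assumptions rather than prescribed settings.

```python
import numpy as np

def delta_bar_delta_update(w, grad, lr, bar_delta,
                           kappa=0.01,   # additive (linear) increase
                           phi=0.1,      # multiplicative (geometric) decrease
                           theta=0.7):   # decay for the running gradient average
    """One step in which every weight carries its own learning rate `lr`."""
    same_sign = grad * bar_delta > 0           # gradient keeps its sign
    sign_flip = grad * bar_delta < 0           # gradient oscillates

    lr = np.where(same_sign, lr + kappa, lr)       # increase linearly
    lr = np.where(sign_flip, lr * (1 - phi), lr)   # decrease geometrically

    w = w - lr * grad                              # per-weight gradient step
    bar_delta = (1 - theta) * grad + theta * bar_delta
    return w, lr, bar_delta

# Example call on a three-weight toy problem.
w, lr, bar = delta_bar_delta_update(np.zeros(3), np.array([0.2, -0.1, 0.4]),
                                    np.full(3, 0.1), np.zeros(3))
```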


By permitting different learning rates for each connection weight in a network, a steepest descent search (in the direction of the negative gradient) is no longer being performed. Instead, each connection weight is updated on the basis of the partial derivative of the error with respect to that weight.


It is also based on an estimate of the "curvature of the error surface" in the vicinity of the current weight value. Additionally, the weight changes satisfy the locality constraint; that is, they require information only from the processing elements to which they are connected.


1- Extended Delta Bar Delta. 

Ali Minai and Ron Williams developed the extended delta bar delta algorithm as a natural outgrowth of Jacobs' work. They enhance delta bar delta by applying an exponential decay to the learning-rate increase, adding the momentum component back in, and placing a cap on the learning rate and the momentum coefficient.


As discussed in the section on back-propagation, momentum is a factor used to smooth the learning process. It is a term, proportional to the previous weight change, that is added to the standard weight change. In this way, good general trends are reinforced and oscillations are dampened.
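For reference, a minimal sketch of such a momentum-smoothed update is shown below; the learning rate and momentum values are illustrative, not recommended settings.

```python
import numpy as np

def momentum_step(w, grad, prev_delta, lr=0.1, momentum=0.9):
    """Standard weight change plus a term proportional to the previous change."""
    delta = -lr * grad + momentum * prev_delta   # reinforce the previous trend
    return w + delta, delta

# Repeated identical gradients: the momentum term builds the step up steadily.
w, prev_delta = np.zeros(3), np.zeros(3)
for _ in range(5):
    w, prev_delta = momentum_step(w, np.array([1.0, -1.0, 0.5]), prev_delta)
```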


2- Learning and Momentum Rates.

The learning rate and the momentum rate for each weight have separate constants controlling their increase and decrease. Once again, the sign of the current error indicates whether an increase or decrease is appropriate. The adjustment for decrease is identical in form to that of delta bar delta.


However, the learning-rate and momentum-rate increases are modified to be exponentially decreasing functions of the magnitude of the weighted gradient components. Thus, greater increases are applied in areas of small slope or curvature than in areas of high curvature.


This is a partial solution to the jump problem of delta bar delta. To go a step further in preventing wild jumps and oscillations in the weights, ceilings are placed on the individual connection learning rates and momentum rates.
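A sketch of this adjustment for a single rate is given below. It assumes the same sign test and bar-delta average as delta bar delta; the constants kappa, gamma, phi, and the ceiling are assumed values for illustration. In extended delta bar delta, the same form of update is applied twice per weight, once with the learning-rate constants and once with the momentum-rate constants.

```python
import numpy as np

def edbd_rate_update(rate, grad, bar_delta,
                     kappa=0.1, gamma=1.0, phi=0.1, ceiling=2.0):
    """Increase decays exponentially with |bar_delta|; decrease stays geometric."""
    if grad * bar_delta > 0:
        # A small average gradient (flat region) yields a larger increase.
        rate += kappa * np.exp(-gamma * abs(bar_delta))
    elif grad * bar_delta < 0:
        rate *= (1 - phi)          # same geometric decrease as delta bar delta
    return min(rate, ceiling)      # the cap prevents wild jumps

lr = 0.1
lr = edbd_rate_update(lr, grad=0.2, bar_delta=0.15)   # sign agreement: lr rises
```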


Finally, a memory with a recovery feature is built into the algorithm: after each epoch's presentation of the training data, the accumulated error is evaluated. If this error is less than the previous minimum error, the weights are saved in memory as the current best.


A tolerance parameter controls the recovery phase. Specifically, if the current error exceeds the minimum previous error, as modified by the tolerance parameter, then all connection weight values revert stochastically to the stored best set of weights in memory. Furthermore, the learning and momentum rates are decreased to begin the recovery process.
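The epoch-level bookkeeping might look like the sketch below; the tolerance, noise scale, and shrink factor are illustrative assumptions, not specified values.

```python
import numpy as np

rng = np.random.default_rng(0)

def end_of_epoch(weights, error, best_weights, best_error,
                 lr, momentum, tolerance=1.05, shrink=0.5):
    """Save the best weights; revert stochastically when error drifts too high."""
    if error < best_error:                          # new best: remember it
        best_weights, best_error = weights.copy(), error
    elif error > tolerance * best_error:            # too far above the best
        noise = rng.normal(0.0, 1e-3, size=weights.shape)
        weights = best_weights + noise              # revert stochastically
        lr, momentum = lr * shrink, momentum * shrink   # begin recovery
    return weights, best_weights, best_error, lr, momentum
```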


3- Directed Random Search. 

The previous architectures were all based on learning rules, or paradigms, grounded in calculus; they use a gradient-descent technique to adjust each of the weights. The directed random search, however, uses a standard feedforward recall structure that is not based on back-propagation. Instead, it adjusts the weights randomly. To provide some order to this process, a direction component is added to the random step, which ensures that the weights tend toward a previously successful search direction. All processing elements are influenced individually.


This random search paradigm has several important features. Basically, it is fast and easy to use if the problem is well understood and relatively small. The reason that the problem has to be well understood is that the best results occur when the initial weights, the first guesses, are within close proximity to the best weights. 


It is fast because the algorithm cycles through its training much more quickly than calculus-based techniques (i.e., the delta rule and its variations), since no error terms are computed for the intermediate processing elements; only the output error is calculated.


This learning rule is easy to use because only two key parameters are associated with it. But the problem must map to a small network, because when the number of connections becomes high, the training process becomes long and cumbersome.


There are four key components to a random search network. They are the random step, the reversal step, a directed component, and a self-adjusting variance.


4- Random Step: 

A random value is added to each weight. Then the entire training set is run through the network, producing a "prediction error." If this new total training-set error is less than the previous best prediction error, the current weight values (which include the random step) become the new set of "best" weights, and the current prediction error is saved as the new best prediction error.
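A self-contained sketch of this step on a toy problem follows; the quadratic prediction_error function simply stands in for running the full training set through the network.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -2.0, 0.5])      # toy "true" weights

def prediction_error(w):
    return float(np.sum((w - target) ** 2))

best_w = np.zeros(3)
best_err = prediction_error(best_w)
variance = 0.1

step = rng.normal(0.0, np.sqrt(variance), size=best_w.shape)
trial_w = best_w + step                      # random value added to each weight
trial_err = prediction_error(trial_w)
if trial_err < best_err:                     # improvement: keep it as the best
    best_w, best_err = trial_w, trial_err
```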


5- Reversal Step: 

If the random step's results are worse than the previous best, then the same random value is subtracted from the original weight value. This produces a set of weights that is in the opposite direction to the previous random step. 


If the total "prediction error" of the reversal step is less than the previous best error, its weight values are stored as the best weights, and the current prediction error is saved as the new best prediction error. If both the forward and reverse steps fail, a completely new set of random values is added to the best weights and the process begins again.
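Continuing the toy sketch above, the forward and reversal steps can be tried in turn:

```python
import numpy as np

rng = np.random.default_rng(1)
target = np.array([1.0, -2.0, 0.5])

def prediction_error(w):
    return float(np.sum((w - target) ** 2))

best_w = np.zeros(3)
best_err = prediction_error(best_w)

step = rng.normal(0.0, 0.3, size=best_w.shape)
for candidate in (best_w + step, best_w - step):   # forward, then reversal
    err = prediction_error(candidate)
    if err < best_err:
        best_w, best_err = candidate, err
        break
# If both candidates fail, fall through and draw a fresh random step next time.
```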


6- Directed Component: 

To aid convergence, a set of directed components is created based on the outcomes of the forward and reversal steps. These directed components reflect the history of success or failure of the previous random steps.


The directed components, which are initialized to zero, are added to the random components at each step of the procedure. They provide a "common sense, let's go this way" element to the search. It has been found that the addition of these directed components significantly improves convergence.
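The article does not spell out the exact update, but one plausible rule is to accumulate a decayed history of the steps that succeeded, as in this hypothetical sketch; the decay and gain constants are assumptions.

```python
import numpy as np

def update_direction(direction, step, outcome, decay=0.4, gain=0.6):
    """outcome: +1 if the forward step won, -1 if the reversal won, 0 if both failed."""
    if outcome == 0:
        return decay * direction                 # fade the memory after failures
    return decay * direction + gain * outcome * step

direction = np.zeros(3)                          # initialized to zero
step = np.array([0.1, -0.2, 0.05])
trial_step = step + direction                    # direction added to the random step
direction = update_direction(direction, step, outcome=+1)
```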


7- Self-adjusting Variance: 

An initial variance parameter is specified to control the initial size (or length) of the random steps added to the weights. An adaptive mechanism then changes the variance parameter based on the current relative rate of success or failure.


The learning rule assumes that the current step size for the weights is appropriate if it records several consecutive successes, and it then expands the variance to try even larger steps. Conversely, if it detects several consecutive failures, it contracts the variance to reduce the step size.
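One way to realize this mechanism is with success and failure counters, as in the sketch below; the threshold of five in a row and the expand/contract factors are illustrative choices.

```python
def adjust_variance(variance, successes, failures, success,
                    threshold=5, expand=2.0, contract=0.5):
    """Widen the search after repeated successes, narrow it after repeated failures."""
    if success:
        successes, failures = successes + 1, 0
        if successes >= threshold:       # step size is working: try larger steps
            variance, successes = variance * expand, 0
    else:
        failures, successes = failures + 1, 0
        if failures >= threshold:        # repeated misses: shrink the steps
            variance, failures = variance * contract, 0
    return variance, successes, failures

variance, succ, fail = 0.1, 0, 0
variance, succ, fail = adjust_variance(variance, succ, fail, success=True)
```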


Final Remarks.

For small to moderately sized networks, a directed random search produces good solutions in a reasonable amount of time. The training is automatic, requiring little, if any, user interaction. 


The number of connection weights imposes a practical limit on the size of a problem that this learning algorithm can effectively solve. If a network has more than 200 connection weights, a directed random search can require a relatively long training time and still end up yielding an acceptable solution.
