-
- Comments for the neural net expert on
- the atree release 2.0 adaptive logic network simulator.
-
-
-
- You will probably have a lot of experience with backpropagation-type
- networks, so the thrust of the following remarks is to encourage you
- to see adaptive logic networks (ALNs) as a modification of that
- approach that could offer improved performance in some of your
- problems. Certainly ALNs can help in problems that involve demanding
- real-time requirements or in situations where the available computing
- power is not adequate for a backpropagation network.
-
- First of all, you may not think of an adaptive logic network as a
- neural network at all, but it is! In fact, an ALN is a special case
- of the familiar multilayer perceptron (MLP) feedforward network.
- There are just a few changes, all of them simplifications. Briefly,
- the nodes (or elements) have (during training, at least) two input
- leads, the input signals x1, x2 are boolean (0 or 1), each connection
- weight is determined by a single bit of information, and the
- "squashing function" is a threshold. Specifically, if b1 and b2 are
- boolean, then the element outputs a boolean value which is 1 if and
- only if
-
- (b1 + 1) * x1 + (b2 + 1) * x2 >= 2.
-
- The four combinations of b1 and b2 (00, 11, 10, 01) generate the four
- boolean functions of two variables: AND, OR, LEFT, and RIGHT, where
- LEFT(x1, x2) = x1 and RIGHT(x1, x2) = x2. For example, with b1 = b2 = 0,
- 1 * x1 + 1 * x2 >= 2 if and only if both x1 AND x2 are 1.
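-
- As a small illustration -- this is not code from the atree package --
- the node function and its four settings can be written out in a few
- lines of C:
-
-     #include <stdio.h>
-
-     /* ALN node: with stored bits b1, b2 and boolean inputs x1, x2,
-        the output is 1 iff (b1 + 1)*x1 + (b2 + 1)*x2 >= 2.
-        (b1,b2) = (0,0): AND, (1,1): OR, (1,0): LEFT, (0,1): RIGHT. */
-     static int node(int b1, int b2, int x1, int x2)
-     {
-         return (b1 + 1) * x1 + (b2 + 1) * x2 >= 2;
-     }
-
-     int main(void)
-     {
-         for (int x1 = 0; x1 <= 1; x1++)
-             for (int x2 = 0; x2 <= 1; x2++)
-                 printf("x1=%d x2=%d: AND=%d OR=%d LEFT=%d RIGHT=%d\n",
-                        x1, x2, node(0, 0, x1, x2), node(1, 1, x1, x2),
-                        node(1, 0, x1, x2), node(0, 1, x1, x2));
-         return 0;
-     }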
-
- A tree of such elements is connected to some boolean input variables
- and their complements. Inputs enter the tree at the leaves and the
- output is at the root node. The input variables are components of a
- vector representing a pattern, such as a handwritten character.
- Training on a set of input vectors, furnished with the desired boolean
- outputs, results in an assignment of functions to nodes that allows
- the tree to approximate the training data. It then has built-in
- capacity to generalize, i.e. to give appropriate responses to other
- possible input vectors not presented during training, as will be
- explained below.
-
- You might ask: "Why bother with a special case when we can use the
- whole power of the MLP and the backpropagation training algorithm?"
- Here, I think, are some convincing answers:
-
- 1) Speed of the hardware implementation of ALNs
-
- There are many ways to achieve pattern recognition capability, ranging
- from classical approaches like the K-nearest-neighbors method, to
- newer methods using neural networks. These all have their place, and
- certainly no one wants to try to argue with proofs showing near
- optimality of classical methods in some cases, but ALNs can often
- compute results far faster than all other approaches based on digital
- logic; and sometimes speed is the critical factor, for example in
- robotics or control of electronic circuits.
-
- Training an ALN results in a combinational logic circuit. In fact,
- after adaptation, one groups the subtrees of two-input ANDs together
- to get ANDs of higher fanin, and similarly for the ORs. The result is
- logically equivalent to a tree of NANDs, which is directly
- translatable into easily built hardware. Pattern recognition
- functions can then be computed in a few tens of nanoseconds in
- general, which is several orders of magnitude faster than other
- approaches allow. Dedicated ALNs can be realized in field
- programmable gate arrays quite cheaply. (This step is not yet done in
- the atree release 2.0 software. There are various types of target
- chips one could use. Discussions are underway for one kind of FPGA.)
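-
- A hedged sketch of the grouping step, in C: the node structure and
- field names here are hypothetical, not the atree 2.0 internals, but
- they show how runs of adjacent two-input ANDs (or ORs) can be
- collected into a single gate of higher fanin.
-
-     typedef enum { AND, OR, LEFT, RIGHT, LEAF } Op;
-
-     typedef struct Node {
-         Op op;
-         struct Node *child[2];   /* unused when op == LEAF */
-     } Node;
-
-     /* A LEFT or RIGHT node just forwards one child; skip through them. */
-     static Node *skip_pass_through(Node *n)
-     {
-         while (n->op == LEFT || n->op == RIGHT)
-             n = n->child[n->op == LEFT ? 0 : 1];
-         return n;
-     }
-
-     /* Gather the inputs of a maximal block of nodes sharing `op', so
-        that AND(AND(a,b), AND(c,d)) is treated as one 4-input AND. */
-     static void collect_fanin(Node *n, Op op, Node **fanin, int *count)
-     {
-         n = skip_pass_through(n);
-         if (n->op == op) {
-             collect_fanin(n->child[0], op, fanin, count);
-             collect_fanin(n->child[1], op, fanin, count);
-         } else {
-             fanin[(*count)++] = n;   /* a leaf or a gate of the other kind */
-         }
-     }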
-
- The adaptation algorithms are also designed to be fast: what
- corresponds to backprop, a credit assignment computation, is also
- combinational, so it is possible to build systems which will adapt a
- tree at a rate of, say, 25 million patterns per second. Adaptation to
- a pattern would take the same time that one special processor for
- backprop neural networks requires for just one multiply-add step of
- the many it has to do! (Unfortunately, at that speed, you wouldn't
- have time to go for coffee while the system converged; you probably
- wouldn't even make it to your office door! So you would have to try a
- lot harder problems.)
-
- If there is any hedging above, it is in insisting upon digital logic.
- Analog circuits could possibly be faster. Whether they would give
- reproducible results is another issue. As long as we are in the realm
- of digital logic, we have the theorem that any boolean function can be
- realized in two levels of logic (if the complements of the inputs are
- available) -- conjunctive normal form does that. For the problems we
- have tried, ALNs produce trees which are about five or ten times
- deeper than that, and hence are five or ten times slower
- than the absolute limit of speed for ANY digital system!
-
- Of course, any ALN circuit could be turned into a two layer one, but
- usually only at astronomical cost. Hence it is plausible that ALNs
- are not only a good choice for very high speed pattern recognition --
- they are the only way to come close to the ultimate in speed at
- reasonable cost and with reproducible results.
-
- 2) Speed of software simulation
-
- If a simulation computes a 0 input to an AND gate, then no other
- inputs need to be evaluated to know the output of the AND. If the
- other input is not computed, we have a "lazy" evaluation of the AND.
- Laziness, also called parsimony, speeds up the simulation very
- significantly -- 8.9 times for a 10-layer binary tree and 157 times
- for a 20-layer tree.
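-
- In C, this laziness is just the short-circuit evaluation of && and ||
- applied to the tree. A minimal sketch (again with a hypothetical Node
- type, not the atree internals):
-
-     typedef enum { AND, OR, LEFT, RIGHT, LEAF } Op;
-
-     typedef struct Node {
-         Op op;
-         struct Node *child[2];
-         int leaf_value;          /* used only when op == LEAF */
-     } Node;
-
-     static int eval(const Node *n)
-     {
-         switch (n->op) {
-         case LEAF:  return n->leaf_value;
-         case LEFT:  return eval(n->child[0]);   /* right subtree skipped */
-         case RIGHT: return eval(n->child[1]);   /* left subtree skipped  */
-         case AND:   /* stops after the left child if it returns 0 */
-             return eval(n->child[0]) && eval(n->child[1]);
-         case OR:    /* stops after the left child if it returns 1 */
-             return eval(n->child[0]) || eval(n->child[1]);
-         }
-         return 0;   /* not reached */
-     }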
-
- On the other hand, backprop can do little to avoid computing
- everything, since the sigmoid function, which is increasing in its
- entire domain from minus infinity to plus infinity, never cuts off
- small values, at least in theory.
-
- The speedup due to parsimony makes the atree software very reasonable
- to run even on personal computers. Since atree does not need floating
- point operations, no math co-processor is required.
-
- When special hardware is available, but is not large enough to
- evaluate a large logical expression in one go, parsimony is still
- important. Say there are parts of a computation which can be done
- completely in parallel in the hardware, but these parts have to be
- computed iteratively. Lazy evaluation will in general allow one to
- omit those iterations whose outputs are not needed, based on those
- already computed.
-
- Laziness or parsimony is an advantage of ALNs which assures them an
- increasingly important role in neurocomputing -- even if the majority
- of neural net research today is based on other, non-parsimonious
- paradigms.
-
- 3) Representational power
-
- A sufficiently large ALN tree with appropriate node functions and
- connections can realize any boolean function. This means that one
- can, in principle, generate any mapping from any set representable in
- a digital computer to any other such set. The complements of the
- input variables have to be included as inputs to the binary tree
- described above, since otherwise the ALN could only produce increasing
- functions. (A boolean function is increasing if, for any input
- vector, changing a 0 to a 1 in it will never change the function value
- from 1 to 0.)
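-
- A small C check of this definition may help; the function and table
- layout here are just for illustration. AND and OR pass the test, XOR
- fails it, which is one way to see why the complemented inputs are
- needed at the leaves.
-
-     #include <stdio.h>
-
-     /* table[v] is the output for input vector v (bit i of v = input i).
-        The function is increasing if raising any 0 bit to 1 never turns
-        the output from 1 into 0. */
-     static int is_increasing(const int *table, int n)
-     {
-         for (int v = 0; v < (1 << n); v++)
-             for (int i = 0; i < n; i++)
-                 if (!(v & (1 << i)) && table[v] == 1 && table[v | (1 << i)] == 0)
-                     return 0;
-         return 1;
-     }
-
-     int main(void)
-     {
-         int and2[4] = {0, 0, 0, 1}, or2[4] = {0, 1, 1, 1}, xor2[4] = {0, 1, 1, 0};
-         printf("AND %d  OR %d  XOR %d\n", is_increasing(and2, 2),
-                is_increasing(or2, 2), is_increasing(xor2, 2));   /* 1 1 0 */
-         return 0;
-     }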
-
- There are theorems that deal with approximation of arbitrary
- continuous functions using the (infinitely often differentiable)
- functions produced, in theory, by neural networks using sigmoids --
- but what if the function you need has a discontinuity? Can the
- discontinuity be generated without impairing extrapolation? One may
- be better off if one can generate discontinuities, and discontinuities
- in derivatives too.
-
- 4) Generalization or extrapolation
-
- This is the real strong point of ALNs. Pattern classification
- problems generally require an output which is the same for patterns
- which are in some way similar. Similarity could be based on all sorts
- of criteria, but generally it can be defined by saying two individuals
- have many features in common. Of course, two neighboring rows of a
- checkerboard have a lot in common even if they have no corresponding
- squares of the same color! But if we agree that having features in
- common is a good criterion for being in the same class in pattern
- recognition, then for good generalization one doesn't want the output
- to change when one or a few features, among many, are altered.
-
- If a feature is represented by a boolean variable, then the above
- says that similarity can be measured by Hamming distance between
- boolean vectors. In order to use ALNs, we generally have to code other
- representations of individuals into boolean vectors in such a way that
- similarity for pattern classification purposes is reflected in Hamming
- distance. For example, the atree package incorporates a way of
- encoding real values into boolean vectors that, at least locally,
- somewhat preserves the metric on an interval of the real line. Unary
- numbers preserve the metric on positive integers perfectly, but this
- is quite an inefficient use of bits, so we have implemented a random
- walk technique.
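-
- For concreteness, here is the simple unary ("thermometer") coding the
- text contrasts with, as a C sketch; the actual random walk coding in
- atree is different and is not reproduced here.
-
-     #include <stdio.h>
-
-     /* Quantize x on [lo, hi] into `levels' steps; level k is coded as
-        k ones followed by zeros, so nearby values differ in few bits. */
-     static void unary_encode(double x, double lo, double hi,
-                              int levels, int *bits /* levels - 1 long */)
-     {
-         int k = (int)((x - lo) / (hi - lo) * (levels - 1) + 0.5);
-         if (k < 0) k = 0;
-         if (k > levels - 1) k = levels - 1;
-         for (int i = 0; i < levels - 1; i++)
-             bits[i] = (i < k);
-     }
-
-     int main(void)
-     {
-         int code[7];
-         for (double x = 0.0; x <= 1.0; x += 0.25) {
-             unary_encode(x, 0.0, 1.0, 8, code);
-             printf("%4.2f -> ", x);
-             for (int i = 0; i < 7; i++)
-                 printf("%d", code[i]);
-             printf("\n");
-         }
-         return 0;
-     }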
-
- Binary trees with node functions AND, OR, LEFT, and RIGHT have been
- shown to have the property that the functions they realize tend to
- have the same output value when the inputs are perturbed. This
- depends on averaging over all functions on the tree, weighted by the
- number of ways each can be produced by setting the two bits at each
- node which determine the node's function. It is also necessary to
- average over all vectors and all perturbations of a given number of
- bits.
-
- The plausible reasoning behind this is that when an input signal to a
- gate changes, it may or may not change the output. A change to an
- input to an AND will not change its output if the other input to the
- AND is 0. Averaged over the node functions and input values, the
- probability of a change is 0.5 when one input changes. So for a
- 10-layer tree, the probability of a single bit change at a leaf
- causing a change of the tree output is 0.5^10 = 1/1024.
-
- In a tree with multiple connections to the bits of an input vector and
- their complements, the arguments are a bit more elaborate. Consider a
- balanced binary tree. Suppose some bits are perturbed at random in the
- input vector, with p being the probability of an individual bit being
- changed. This could correspond to "salt and pepper" noise in a
- scanned character. We ask: what is the probability of change in the
- next layer? It turns out to be p - p^2/4. The probability of
- perturbation decreases as the number of layers increases. More
- importantly, the limit as the number of layers goes to infinity is
- zero. For 14 layers, the probability of change of the output when the
- input is perturbed is about 0.2 no matter how much the input is
- perturbed! The sequence of p values starting at 1.0 is: 1.00, 0.75,
- 0.61, 0.52, 0.45, 0.40, 0.36, 0.33, 0.30, 0.28, 0.26, 0.24, 0.23,
- 0.21, 0.20. (If
- you think the probability should never go below 0.5, you haven't
- allowed for the case of a function which is almost a constant. As you
- train an ALN, it may start as almost a constant function. You force
- it to become sensitive in certain ways, but it retains its general
- tendency towards insensitivity.)
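-
- The quoted sequence is easy to reproduce; a few lines of C iterating
- the recurrence p <- p - p^2/4 from p = 1.0 print the fifteen values
- given above and show the slow drift towards zero:
-
-     #include <stdio.h>
-
-     int main(void)
-     {
-         double p = 1.0;
-         for (int layer = 0; layer <= 14; layer++) {
-             printf("after %2d layers: p = %.2f\n", layer, p);
-             p -= p * p / 4.0;   /* probability of change one layer up */
-         }
-         return 0;
-     }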
-
- The above implies that binary tree functions based on the given node
- functions, AND, OR, LEFT, and RIGHT, are very insensitive to changes
- in their inputs. If we find a function that satisfies the training
- data, even by random search through the space of all possibilities,
- then it will tend to do a good job of extrapolation. So it is not the
- training procedure per se that is important here. Insensitivity
- replaces the concept of continuity in backpropagation networks. It is
- what we really need in a system intended to solve pattern recognition
- problems.
-
- I hope the above arguments will be enough to convince you to try the
- atree release 2.0 software. The lf language should make it easy. I
- suggest you take files for your training and test sets in some
- backpropagation problem and substitute them for the sets in the
- multiply.lf program to get yourprogram.lf. Pick a tree of a few
- thousand nodes, adjust the maximum and minimum value allowed for each
- variable, the numbers of quantization levels for the different
- variables, the number of bits used to represent each quantity, and the
- stepsize between representations of successive quantization levels in
- the random walk. Then type "lf yourprogram.lf". I hope you will be
- pleasantly surprised, and that you will then make ALNs a permanent
- part of your neural net toolkit!
-
- Good luck.
-
- Bill Armstrong
-