Click to see animated view
My B.A.Sc. thesis research, conducted under the supervision of Professor Raymond Kwong at the University of Toronto during my 4th year (1996-1997), focused on using reinforcement learning to guide a computer to "learn" a "swing-up" control law for a rotational inverted pendulum. The computer did not actually interface with a real rotational inverted pendulum, but rather with a simulation of the dynamics of an existing control system testbed at the University of Toronto. The source code for this software (which compiles under Visual C++ 6.0) has now been downloaded by over 300 researchers (or merely curious people) worldwide. Recent examples of follow-on work I'm aware of are listed below.
Reinforcement learning involves updating an action policy (usually represented as a parameterized function over the state space of the problem) using a scalar reward signal. This signal gives a hint as to how desirable the current state of the problem is. For instance, when playing board games like chess, backgammon, checkers, Monopoly, etc., all states except the winning ones could be assigned a reward of zero, while the winning positions could be assigned a reward of one. In their simplest form, reinforcement learning algorithms try to estimate the discounted expected reward for taking a given action from a given state by exploring the state space using an "informed random walk". More traditional methods such as dynamic programming can be used within this framework and could in principle provide optimal solutions. However, as with many other problems in computer science, exact solutions often require impractical amounts of computational effort. Reinforcement learning was developed by Sutton and Barto, among others, as a heuristic method motivated by behavioral psychology. From these early beginnings, the discipline has grown to be one of the most fruitful areas of research in artificial intelligence.
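To make "discounted reward" concrete, here is a minimal Python sketch. It is not taken from the thesis software; the function name `discounted_return` and the discount factor value are assumptions chosen for illustration. It scores one trajectory under the board-game reward scheme described above: zero everywhere except a reward of one on the winning state.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards along one trajectory, each discounted by gamma per step.

    rewards[0] is the reward received first; gamma is an assumed
    discount factor in [0, 1].
    """
    total = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical game: two non-winning moves, then a win worth 1.
episode = [0.0, 0.0, 1.0]
print(discounted_return(episode, gamma=0.9))  # roughly 0.81
```

A learning algorithm estimates the expectation of this quantity over many such trajectories. The discounting is what makes a win reached sooner score higher than the same win reached later.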
There are numerous ways of improving on the basic approach used in my simulator, the SARSA algorithm (explained in Richard S. Sutton and Andrew G. Barto's excellent book, Reinforcement Learning: An Introduction). For instance, in my implementation I used a simple "learning deadband" to stop parameter updates when performance is within a "good enough" region (the deadband). I was motivated to try this after auditing an introductory course on adaptive control theory, ECE 1649. However, perhaps the improvement most relevant to the real or apparent learning rate is the incorporation of a learned "internal model" of the exterior world.
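For readers unfamiliar with SARSA, the following Python sketch shows a single tabular update with a deadband added. It is an illustration under stated assumptions, not the simulator's code. The text above does not say how "performance" is measured, so this sketch assumes the deadband is applied to the magnitude of the temporal-difference error, by analogy with freezing adaptation on small errors in adaptive control. The names `sarsa_update`, `alpha`, `gamma` and `deadband`, and the state and action labels, are all hypothetical.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next,
                 alpha=0.1, gamma=0.95, deadband=0.0):
    """Apply one tabular SARSA step to the table q and return the TD error.

    q maps (state, action) pairs to value estimates. The update is skipped
    when |TD error| <= deadband (an assumed form of the "learning deadband").
    """
    td_error = r + gamma * q[(s_next, a_next)] - q[(s, a)]
    if abs(td_error) > deadband:
        q[(s, a)] += alpha * td_error
    return td_error


# Hypothetical discrete states and actions, for illustration only.
q = defaultdict(float)
sarsa_update(q, "near_down", "push_left", 0.0, "swinging", "push_right")
sarsa_update(q, "swinging", "push_right", 1.0, "near_up", "hold",
             deadband=0.05)
print(dict(q))
```

With `deadband=0.0` this reduces to the ordinary SARSA rule. A larger deadband leaves the table untouched once its predictions are already close, which trades some final accuracy for less parameter drift.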
Experimental results: Unfortunately, during the learning phase (which takes on the order of several hours in real time, but only tens of minutes in simulation) the controller I developed explores regions of the state space that would likely damage the real testbed before it learns the desired control law. This made the technique somewhat impractical for direct application to the actual system in this case. Nonetheless, it is an interesting display of the power of the general reinforcement learning approach. Here is a link to the simulator executable and some configuration files, and below are some screen captures of my stunning OpenGL graphics to whet your appetite. Also available: source code... I eventually hope to put up some documentation for the source, plus some graphs showing the experimental results obtained using this software as is. The softcopy of my thesis was unfortunately lost several years ago, but a hardcopy should still exist with the Division of Engineering Science (assuming they don't throw them out!).
This is the "virtual pendulum" in the "down" position.
This is the "virtual pendulum" in the "up" position.
This is the "virtual pendulum" somewhere between "up" and "down".
The objective here is for the controller to learn how to transfer the state from one equilibrium manifold to another (i.e. from the down position to the up position) and to keep it there. That is, to learn a globally stabilizing (about the "up" position) control law. This is a nonlinear control task for which you won't find solutions in the typical set of undergraduate system control courses... at least none I've heard of. Strictly speaking, in control-theoretic terms, I cannot say for certain that the controller generated by my software actually achieves this goal either, but it certainly does something that looks like it! If you haven't already, take a look at the animated demo.
The executable runs under Windows NT 4.0 or Windows 98. As you
can see above, it has an OpenGL simulation displaying the current state
of the rotational inverted pendulum (click on the button between the "?"
and "FB" buttons to activate it). To start, use File->Open...
to open one of the *.tblsys files included in the distribution. These
are configuration files that tell the controller how to view the state
space and also load up control parameters already learned. The files
are in order of increasing "experience" so that Global1.tblsys is "quite
novice", whereas global9.tblsys is "very learned".
Now if you drag the main window
divider to the left you will see that there are two other sub-panels to
it. The upper one is a plot of the time it takes the controller to
swing from "down" to "up" (or more accurately, some small region surrounding
"up") and if the thing is truly "learning" this quantity should get smaller
with each trial (in a statistical sense, naturally). The lower one can be used to display various signals within
the system, somewhat like an oscilloscope (the "Signals" menu controls
this feature). To start the simulation use the buttons that look
like tape deck controls for play, fast-forward and stop. They operate
pretty much as anyone who's used a VHS would expect ;-)
Citations
András Lörincz, Imre Pólik, István Szita. Event-Learning and Robust Policy Heuristics.
Technical Report NIPG-ELU-14-05-2001, Department of Information Systems,
Eötvös Loránd University, Budapest, Hungary, 2001.