Action Schema Neural Networks: Generalized Policies for Stochastic Planning Problems in the Wargaming Domain

Stochastic shortest path problems have been of interest to the automated planning community for many years. Traditionally, policies for these problems have been found with the aid of admissible heuristics, such as LM-Cut, which estimate the best action to take in the current state, and thereby the highest probability of reaching a goal state, by solving a delete relaxation of the problem. Though successful, these heuristics face scalability problems as the state spaces of these stochastic problems grow. \cite{toyer2018action} addressed this problem by using deep neural networks to learn a generalized policy that scales to arbitrarily large problems with only a linear growth in evaluation time. These networks are coined Action Schema Networks (ASNets): given a current state, they output an appropriate action to take. We present a case study of this technique by applying it to a fighter-jet wargaming domain. We have designed a PPDDL domain and grounding files for a wide set of scenarios in which red and blue 4th- and 5th-generation fighters engage in battle, and the ASNets must decide which attack method to use in the current scenario state to maximize the probability of reaching a goal state. We present the results of 5 trial experiments and discuss the degree of success we had in training the ASNets, offer intuition about the results, and suggest directions for future work.
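To make the "state in, action out" idea concrete, here is a minimal, purely illustrative sketch of the final action-selection step of such a policy. The feature layout, weights, and action names are invented for illustration; the real ASNet architecture stacks alternating action and proposition layers whose weights are shared across all ground actions of the same schema, which is what lets one trained network generalize across problem sizes.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over action scores."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def choose_action(state_features, action_weights, action_names):
    """Score each ground action from the current state's feature vector
    and return the action with the highest probability."""
    scores = np.array([w @ state_features for w in action_weights])
    probs = softmax(scores)
    return action_names[int(np.argmax(probs))], probs

# Toy usage with made-up propositions for a red-vs-blue engagement state.
state_features = np.array([1.0, 0.0, 1.0])    # e.g. (blue-alive, red-damaged, in-range)
action_weights = [np.array([0.2, 0.5, 1.0]),  # hypothetical weights for "fire-missile"
                  np.array([0.4, 0.1, -0.3])] # hypothetical weights for "evade"
print(choose_action(state_features, action_weights, ["fire-missile", "evade"]))
```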

NOTE: Months after this work, I discovered an error in my understanding of the framework I was using at the time. The LM-Cut heuristic results were encoded as a set of feature vectors that were concatenated to the standard input feature vectors in order to provide some directional focus to the search space. However, the heuristic that was actually used to compute the truth tables was the optimal solution (which was calculated exhaustively). All results still hold; however, the comparison between the LM-Cut heuristic results and the trained NN results is a little malformed. In reality, the cases where the NN did better than LM-Cut show that the NN learned a policy closer to the true optimal policy than the policy induced by the LM-Cut heuristic (because of the heuristic's delete-relaxed nature).
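For clarity, the role of the heuristic in the input is just feature concatenation, as described above. The sketch below is a hypothetical illustration of that step only; the exact encoding and feature names used by the framework differ.

```python
import numpy as np

def build_input_features(base_features, heuristic_features):
    """Append heuristic-derived features (here, stand-ins for the LM-Cut
    encoding) to the standard input features before feeding the network."""
    return np.concatenate([base_features, heuristic_features])

base = np.array([1.0, 0.0, 1.0])   # hypothetical propositional state features
lm_cut = np.array([0.0, 1.0])      # hypothetical LM-Cut-derived features
print(build_input_features(base, lm_cut))  # -> [1. 0. 1. 0. 1.]
```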

The wargame PPDDL domain, PPDDL grounding files, grounding-generator scripts, trial experiment files, the experiment results, and Python scripts to collate the results of the trials are available on GitHub here under an MIT License.