Action Schema Neural Networks: Generalized Policies for Stochastic Planning Problems in the Wargaming Domain

Stochastic shortest-path problems have been of interest to the automated planning community for many years. Traditionally, policies for these problems have been found with the help of admissible heuristics, such as LM-Cut, which work on a delete relaxation of the problem to approximate the actions in the current state that give the highest probability of reaching a goal state. Though successful, these heuristics face scalability problems as the state spaces of the stochastic problems grow. \cite{toyer2018action} addressed this by using deep neural networks to learn a successful policy that could scale to Nth-order problems with time requirements that grow only linearly. These networks are called Action Schema Networks (ASNets) because, given the current state, they provide an appropriate action to take. We present a case study of this technique by applying it to the fighter jet wargaming domain. We have designed a PPDDL domain and grounding files for a wide set of scenarios in which red and blue fourth- and fifth-generation fighters engage in battle, and the ASNets must decide which attack method to use, given the current scenario state, to increase the probability of reaching a goal state. We present the results of five trial experiments and discuss the degree of success we had in training the ASNets, our intuition about the results, and suggestions for future work.

NOTE: Months after completing this work, I found an error in my understanding of the framework I was using at the time. The LM-Cut heuristic results were encoded as a set of feature vectors that were concatenated to the standard input feature vectors to give some directional focus to the search space. However, the heuristic that was used to compute the truth tables was the optimal solution (which was calculated exhaustively). All results still hold; however, the comparison between the LM-Cut heuristic results and the trained NN results is somewhat malformed. In reality, the cases where the NN did better than LM-Cut show that the NN learned a policy that was closer to the true optimal policy than the LM-Cut heuristic policy, which is limited by its "delete-relaxed" nature.
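As a rough illustration of the concatenation step described in the note, the sketch below appends heuristic-derived features to a base input vector. The function name, the variable names, and the feature values are hypothetical and are not taken from the original framework; the real encoding of the LM-Cut results may differ.

```python
from typing import List


def augment_with_heuristic(state_features: List[float],
                           heuristic_features: List[float]) -> List[float]:
    """Return the base input features followed by the heuristic features.

    Hypothetical helper: it only shows the concatenation idea, not the
    framework's actual feature layout.
    """
    return list(state_features) + list(heuristic_features)


# Hypothetical values: three base features for the current state and
# two features derived from a heuristic's recommendation.
state_features = [1.0, 0.0, 1.0]
heuristic_features = [0.0, 1.0]

network_input = augment_with_heuristic(state_features, heuristic_features)
```

The network then sees a single, longer input vector, so the heuristic information acts as extra guidance rather than as a separate input path.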
