Towards a method for computing effective intrusion prevention policies using reinforcement learning

December 20, 2021

Introduction

In my research, the goal is to develop methods for computing effective intrusion prevention policies using reinforcement learning. Consider the following use case. The operator of an IT-infrastructure (see Fig. 1), which I call the defender, takes measures to protect it against a possible attacker while, at the same time, providing a service to a client population. The infrastructure includes a public gateway through which the clients access the service and which also is open to a possible attacker. The attacker seeks to intrude on the infrastructure and compromise a specific set of components. Conversely, the defender aims at preventing intrusions from the attacker and maintaining service to its clients. What are effective policies for the defender to achieve this end? How to obtain such policies? Traditionally, these questions have been answered by domain experts. The goal of my research, however, is to develop methods to automatically compute effective intrusion prevention policies. Thus reducing the burden on domain experts. There are several possible approaches for automating intrusion prevention policies. In my research, I focus on reinforcement learning. This constitutes a research problem in the engineering sciences and lies in the intersection of computer networking, reinforcement learning, decision theory, game theory, optimal control, and network security.

The IT infrastructure and the actors in the intrusion prevention use case.

The remainder of this essay is structured as follows. I first describe the research problem in more detail (Section 1.1 and Section 1.2) and cover the necessary background on reinforcement learning (Section 1.3). Then, I describe a methodological problem that arises when deciding on a research method to solve the problem (Section 1.4). Subsequently, I outline different approaches to the methodological problem (Section 2) and discuss which approach I consider to be most suitable (Section 3). Finally, I provide my conclusions (Section 4).

Intrusion Prevention Policies and Target Infrastructure

An intrusion prevention policy is a set of rules that prescribes, for each possible scenario, a recommended set of actions for the defender to prevent network intrusions. Since the actions that the defender can take depends on the IT infrastructure, an intrusion prevention policy is specific to a given IT infrastructure. The IT-infrastructure of a given policy is referred to as the target infrastructure for that policy.

Consider the target infrastructure in Fig. 1. An intrusion prevention policy for this infrastructure can, for example, prescribe that the defender should update the configuration of the gateway’s firewall in response to a suspicious event. Indeed there is an abundance of possible actions that the defender can take, where some actions are more effective than others. For example, a trivial intrusion prevention policy is to block all incoming network traffic to the IT infrastructure. This policy prevents any intrusion but incurs a high cost since it also denies the infrastructure’s service to legitimate clients. An effective policy should prevent intrusions at the minimal cost.

Reinforcement Learning

Reinforcement learning is a method for learning control policies through interaction with an environment (see Fig. \ref{fig:sutton5}). The environment used for learning is referred to as the “training environment” and the environment used for evaluation of the learned policies is referred to as the “evaluation environment”. (Often, the environment for training and evaluation are the same.) To learn policies for a given training environment, different reinforcement learning algorithms can be used. A reinforcement learning algorithm trains a policy by running a sequence of training episodes. In each training episode, the reinforcement learning algorithm takes actions in the training environment according to the current policy. At the end of the episode, the actions are evaluated. Then, based on the evaluation, the policy is updated using the reinforcement learning algorithm. Finally, after training the policy, it is evaluated in the evaluation environment by running a sequence of evaluation episodes.

This process of training and evaluating policies constitute a quantitative research method to experiment with different reinforcement learning methods. The quality of the experiments and the learned policies depend on three main factors: (1) the mathematical formulation of the problem; (2) the design of the training environment and the evaluation environment (the two environments are dependent); and (3) the choice of reinforcement learning algorithm.

$A reinforcement learning algorithm learns a policy by repeating the following cycle: (1) measuring the state $s_t$ of the environment; (2) taking action $a_t$ in the environment according to policy $\pi$; (3) evaluating the outcome of the action (the reward $r_t$ and next state $s_{t+1}$); and (4), based on the outcome of the action, updating the policy $\pi$ to obtain a new improved policy $\pi^{\prime}$ ($\pi \rightarrow \pi^{\prime}$); a fixed-length sequence of such cycles is referred to as a \textit{training episode}.$

The Methodological Problem - Designing Environments to use for Training and Evaluation of Intrusion Prevention Policies using Reinforcement Learning

To achieve the scientific goal described in the previous section, a suitable research method needs to be chosen. In the context of my scientific goal, this corresponds to (1) designing the experiments to conduct; (2) selecting the reinforcement learning approaches to investigate; and (3), deciding on training and evaluation environments for conducting the experiments. In this essay, I focus on the latter problem. Namely, I compare different methodological approaches for selecting a training environment and an evaluation environment which can be used to experimentally test different reinforcement learning approaches to achieve the scientific goal. Specifically, I discuss the following methodological question:

To find effective intrusion prevention policies for an IT infrastructure using reinforcement learning, which environments are most suitable for training and evaluation of policies? That is, how should empirical evidence be collected in order to evaluate different ways of finding intrusion prevention policies through reinforcement learning?

The problem of selecting a training environment and an evaluation environment corresponds to the methodological problem of designing environments that can be used to obtain empirical evidence in support of different reinforcement learning approaches. That is, the purpose of the environments is to evaluate different research hypotheses regarding the scientific goal. Ideally, the choice of environments should support inductive inferences with a high degree of both internal and external validity [5]. Specifically, it is desirable that the training and evaluation environments allow to demonstrate causal relationships between the choice of reinforcement learning approach and the evaluation performance of the learned policies. Further, it is desirable that the evaluation results are general and allow to make inductive statements about the targeted IT infrastructure in real-world conditions.

There are several possible answers to the methodological question described above and the best approach depends on the scientific goal. For example, one approach is to train and evaluate policies directly in the targeted IT infrastructure and another approach is to use a simulation environment for policy-training and then use the target infrastructure only to evaluate the trained policies. When comparing different approaches I focus on the following five properties.

Scalability and flexibility of the training and evaluation environment. It should be possible to run many episodes in the training environment in a short amount of time. This is a necessary requirement of most reinforcement learning algorithms. For example, training an effective policy often requires several million training episodes. Further, it is desirable that the training environment is configurable. That is, it should be possible to train intrusion prevention policies for different types of IT-infrastructures.
Similarity between the training environment and the evaluation environment. In order for the reinforcement learning algorithm to achieve good results in the evaluation it is necessary that the evaluation environment resembles the training environment. For example, if a simulation of the target infrastructure is used to train policies that are later to be evaluated in the target infrastructure, the simulation should capture the key elements of the target infrastructure.
Similarity between the evaluation environment and the target infrastructure. For the evaluation results to be meaningful with respect to the scientific goal, the evaluation environment should resemble the targeted IT infrastructure as much as possible. If the evaluation environment does not resemble the target infrastructure, the results provide no scientific knowledge with respect to the scientific goal.
Low cost of training and evaluation. It is desirable that the training environment and the evaluation environment allow training and evaluation of policies at low costs. That is, training and evaluation environments with low costs are preferred over more costly environments if the gain in scientific knowledge of the latter is negligible compared to the increased costs.
Ethical and legal feasibility. The choice of training and evaluation environment should be ethically and legally justified. Certain choices of training and evaluation environments may be either ethically questionable or forbidden by law. For example, security research that is conducted on a company’s IT infrastructure may be confidential and thus not be of gain to the research community. Moreover, as the evaluation of security policies involves real cyber attacks, an ethical dilemma arises when evaluating security policies directly on a company’s IT infrastructure. On the one hand the evaluation can help to improve the company’s security, but on the other hand it could cause both financial losses and data breaches of the company.

In brief, the first two desirable properties in the above list ensure a high degree of internal validity of the choice of training and evaluation environment. The third desirable property ensures a high degree of external validity and the two final properties relate to the practicality of the approach.

The remainder of this essay is structured as follows. I first describe different approaches to the methodological problem (designing training and evaluation environments) and their respective trade-offs (Section 2). Then, I discuss which approach I consider to be most justified given the scientific goal (Section 3). Lastly, I provide my conclusions (Section 4).

Methodological Approaches and Trade-offs

To this date, two main approaches to the methodological problem described in the previous section can be found in the literature- the simulation approach [6,2,12,14,10,15,13,4,9] and the emulation approach [7,8,3]-where the simulation approach is the most popular. I also take the freedom to add another approach to this list which is yet to be explored- the target approach. The three approaches are described below.

The simulation approach. In this approach, policies are trained and evaluated in a simulation of the targeted IT infrastructure. This approach involves building a simulation system that simulates high-level functionality of the target infrastructure. The simulation approach has the benefit that it is scalable and general. Further, it allows a wide range of scenarios and configurations to be simulated at high speed and low cost, which provides an ideal environment for training policies through reinforcement learning (high internal validity). Also, the engineering effort to build the simulation system is relatively small. Moreover, using simulations to evaluate the trained policies does not bring about any immediate legal or ethical concerns. The limitation of the simulation approach is that the simulation is abstract, which limits the conclusions that can be made from the evaluation results obtained through the simulations. That is, the external validity of the simulation approach is low. Specifically, the results obtained from evaluating policies in the simulation does not necessarily indicate the policies’ performance when executed in the targeted infrastructure. Thus, the scientific knowledge (with respect to the research goal) that is gained from evaluations based solely on simulations is limited.
The emulation approach. In this approach, policies are trained and evaluated in a high-fidelity emulation of the target infrastructure. Specifically, the emulation approach involves building an emulation system where key functional components of the target infrastructure are replicated. The emulation system may for example run the same software as the target infrastructure but use a virtual infrastructure to emulate the hardware of the target infrastructure. In contrast to the simulation approach, the results obtained from evaluating policies using the emulation approach can be directly related to the scientific goal since the emulation is guaranteed to be similar to the target infrastructure, i.e. it has a higher degree of external validity than the simulation approach. Also, the emulation approach does not bring about any immediate legal or ethical concerns. A drawback of the emulation approach, however, is that the emulation system requires a large engineering effort to build. Moreover, since the emulation is similar to the target infrastructure, it can be time-consuming to run training episodes in the emulation. In addition, while the training environment and evaluation environment used in the emulation approach are configurable, in comparison to a simulation environment, it is more costly to configure the emulation environment. This makes the training environment used by the emulation approach less practical than the training environment used by the simulation approach.
The target approach. In this approach, policies are trained and evaluated in the targeted IT infrastructure. This approach has the benefit that it requires no engineering effort to build (assuming that the target infrastructure already exists). Further, the evaluation environment used in this approach has the highest possible degree of external validity. Drawbacks of the target approach are as follows. The time that it takes to run a training episode in the target infrastructure is substantial, making several reinforcement learning algorithms impractical to use when following the target approach. Moreover, the training environment used in the target approach is not configurable and is limited to a single intrusion prevention scenario. That is, the internal validity of the target approach is low. Further, using the target infrastructure as evaluation environment is costly as it degrades the performance of the target infrastructure. Finally, using the target infrastructure to evaluate security policies can be ethically questionable and may require special legal permits. For example, if the target infrastructure is a production infrastructure of an organization, an ethical dilemma arises when evaluating security policies directly on the target infrastructure. On the one hand the evaluation can improve the organization’s security but on the other hand it can also cause loss of data or financial value.

In summary, while the simulation approach has the highest degree of internal validity and provides the most flexible and scalable training environment, it provides an evaluation environment that is the farthest away from the target infrastructure, i.e it has the lowest degree of external validity. On the opposite side of the spectrum is the target approach, which has the highest degree of external validity and provides an evaluation environment that is identical with the target infrastructure but a training environment that is not scalable and is limited to a single intrusion prevention scenario. It also suffers from time-consuming and costly training of policies as well as ethical concerns. That is, the target approach has the lowest degree of internal validity and is the least practical approach. Between these two extremes lies the emulation approach, which uses a training environment that is more configurable and scalable than that of the target approach but less scalable compared to the simulation approach. Furthermore, the emulation approach uses an evaluation environment that is closer to the target infrastructure than that of the simulation approach but not as close as the target approach. Hence, depending on which properties are prioritized, each approach can be justified (see Fig. 3). In the following section I discuss which method I consider to be most justified given my scientific goal.

Trade-offs among the simulation approach, the emulation approach, and the target approach; a higher value is preferred on each scale except the "Cost" scale.

Discussion

Neither approach discussed in the previous section achieves all of the properties that are desired (Fig. 3), namely a scalable and configurable training environment (high internal validity) and a realistic evaluation environment (high external validity) that has low costs and is ethically justifiable (high practicality).

If the scientific goal only was to study which reinforcement learning algorithm is most effective, the evaluation environment would be less important and thus the simulation approach would be the most justified as it has the highest degree of internal validity. On the other hand, putting any ethical concerns aside, and assuming that the scientific goal is only to evaluate policies and not to train policies, then the target approach would be most justified as it has the most realistic evaluation environment, i.e. it has the highest degree of external validity. However, for my scientific goal- developing methods that use reinforcement learning to find effective intrusion prevention policies for a target infrastructure\textemdash I argue that the most justified approach is an approach that combines the simulation approach and the target approach (assuming that the target approach is ethically justifiable and legally possible). Specifically, the environment that is most suitable for training policies is a simulation environment and the environment that is most suitable for evaluating policies is the target infrastructure. The combination of these two approaches provide an approach that has a high degree of both internal and external validity. As opposed to the three approaches discussed in the previous section, this approach uses a training environment that is different from the evaluation environment. To ensure that the training environment resembles the evaluation environment (which is required by property (2) in Section \ref{sec:desirable_properties}), the training environment should be instantiated with data collected from the evaluation environment.

Although the approach described above satisfies all the desirable properties, it is impractical for the following reason. In practice, it is typically not the case that the target infrastructure is available for evaluation and even if it is available, it is likely too costly to use it for running evaluations and it may also be ethically questionable. For example, in my research, the target infrastructure corresponds to the production infrastructure of an organization. To use such an environment for evaluation would require the organization to accept a degradation in performance of their production infrastructure for the purpose of the research, which is generally not feasible.

Therefore, the approach that I consider to be the most justified given the scientific goal, while still being practical, is an approach that combines the emulation approach and the simulation approach (see Fig. 4). Specifically, the approach includes building an emulation system where key functional components of the target infrastructure are replicated. In this system, evaluations of intrusion prevention policies are run. These runs produce system metrics and logs that are used to estimate empirical distributions of infrastructure metrics, which are needed to create a simulation that captures the target infrastructure’s dynamics. Furthermore, the approach includes developing a simulation system where simulations are executed and policies are incrementally learned using reinforcement learning. Finally, the policies are extracted and evaluated in the emulation system. If the evaluation yields good results, the learned policies can be implemented in the target infrastructure, without any performance degradation and without any immediate ethical issues. In short, the emulation system is used to provide the statistics needed to simulate the target infrastructure and to evaluate policies, whereas the simulation system is used to learn policies. Hence, the proposed approach involves building two artifacts-a simulation system and an emulation system.

The approach for training and evaluating intrusion prevention policies that I argue is the most justified and practical given the scientific goal.

The different parts of the proposed approach serve the following purposes. The emulation system allows the evaluation of policies without degrading functionality of the target infrastructure. Further, as the emulated infrastructure is a high-fidelity replication of the target infrastructure, it provides an evaluation environment that resembles the target infrastructure and ensures a high degree of external validity. Next, the purpose of the simulation system is to simulate training episodes and learn policies through reinforcement learning, which would not scale in the emulation system due to the time constraints of running commands in the emulation. Hence, the simulation system ensures the practicality of the approach and improves the internal validity. Moreover, to make the simulations similar to the emulation and the target infrastructure (see property (2) in Section 2), the simulation is instantiated based on measurements and statistics obtained from the emulation system.

I argue that this approach is ethically justifiable and strikes a reasonable trade-off between internal validity and external validity. Specifically, the approach achieves a suitable trade-off between scalability of policy-training and a realistic environment for policy-evaluation. The scalability is provided by the simulation system, which allows to simulate a large number of training episodes in a short amount of time, and thus enables to learn policies through reinforcement learning, whereas the realistic evaluation environment is provided by the emulation system. Furthermore, to be able to find policies that perform well in the emulation it is important that there is a close connection between the simulation and the emulation. This connection is achieved by instantiating the simulation using measurements and statistics obtained from the emulation system.

There are two drawbacks of the proposed approach. First, it requires building an emulation system and a simulation system, which entails a significant engineering effort. Second, although the emulation system is a close replica of the target infrastructure, it is not exact, and thus the evaluation results obtained from the emulation may differ slightly from the results that would have been obtained in the target infrastructure. This means that although the evaluation results provide strong empirical evidence on the quality of the learned policies, the results may not reflect the performance of the learned policies in the target infrastructure to a 100%, i.e. the external validity is not optimal. However, considering the drawbacks of the other approaches that have been discussed in this essay, and the fact that it is impractical to run evaluations directly in the target infrastructure, I consider this approach to be the most practical and suitable given the scientific goal.

Conclusion

In this essay, I have discussed a methodological problem related to the scientific goal of developing methods for automatic computation of effective intrusion prevention policies for a target infrastructure using reinforcement learning. Specifically, I have outlined different approaches to achieve the scientific goal and discussed their respective pros and cons. I have focused on the methodological problem of choosing a training environment that can be used to learn intrusion prevention policies through reinforcement learning and an evaluation environment that can be used to evaluate the learned policies. In comparing different designs of these environments, several properties which relate to internal and external validity, as well as ethics and practicality, has been considered. When comparing different approaches with respect to the research goal, I arrived at the conclusion that the most suitable environment for training policies is a simulation environment that is instantiated based on the data from the evaluation environment and that the most suitable environment for evaluating policies is the targeted IT infrastructure. However, for practical and ethical reasons it is not feasible to use the targeted infrastructure for evaluation purposes. Therefore, I argue that the most justified approach is to use a simulation environment to train policies and an emulation environment to evaluate policies and to collect data that is used to instantiate the simulation.

References

[1] William Blum. Gamifying machine learning for stronger security and ai models, 2019. https://www.microsoft.com/security/blog/2021/04/08/ gamifying-machine-learning-for-stronger-security-and-ai-models/.
[2] Richard Elderman, Leon J. J. Pater, Albert S. Thie, Madalina M. Drugan, and Marco Wiering. Adversarial reinforcement learning in a cyber security simulation. In ICAART, 2017.
[3] Jon Gabirondo-López, Jon Egaña, Jose Miguel-Alonso, and Raul Orduna Urrutia. Towards autonomous defense of sdn networks using muzero based intelligent agents. IEEE Access, 9:107184–107199, 2021.
[4] Rohit Gangupantulu, Tyler Cody, Paul Park, Abdul Rahman, Logan Eisenbeiser, Dan Radke, and Ryan Clark. Using cyber terrain in reinforcement learning for penetration testing. 2021. https://arxiv.org/abs/2108.07124.
[5] Till Grüne-Yanoff. Experiments models and methodology, 2021.
[6] Kim Hammar and Rolf Stadler. Finding effective security strategies through reinforcement learning and Self-Play. In International Conference on Network and Service Management (CNSM 2020), Izmir, Turkey, 2020.
[7] Kim Hammar and Rolf Stadler. Intrusion prevention through optimal stopping, 2021.
[8] Kim Hammar and Rolf Stadler. Learning intrusion prevention policies through optimal stop- ping. In International Conference on Network and Service Management (CNSM 2021), Izmir, Turkey, 2021. https://arxiv.org/pdf/2106.07160.pdf.
[9] Zhisheng Hu, Minghui Zhu, and Peng Liu. Adaptive cyber defense against multi-stage attacks using learning-based pomdp. ACM Trans. Priv. Secur., 24(1), November 2020.
[10] Mehmet Necip Kurt, Oyetunji Ogundijo, Chong Li, and Xiaodong Wang. Online cyber-attack detection in smart grid: A reinforcement learning approach. IEEE Transactions on Smart Grid, 10(5):5174–5185, 2019.
[11] Ahmad Ridley. Machine learning for autonomous cyber defense, 2018. The Next Wave, Vol 22, No.1 2018.
[12] Jonathon Schwartz, Hanna Kurniawati, and Edwin El-Mahassni. Pomdp + information-decay: Incorporating defender’s behaviour in autonomous penetration testing. Proceedings of the International Conference on Automated Planning and Scheduling, 30(1):235–243, Jun. 2020.
[13] Khuong Tran, Ashlesha Akella, Maxwell Standen, Junae Kim, David Bowman, Toby Richer, and Chin-Teng Lin. Deep hierarchical reinforcement agents for automated penetration testing, 2021. https://arxiv.org/abs/2109.06449.
[14] Fabio Massimo Zennaro and Laszlo Erdodi. Modeling penetration testing with reinforcement learning using capture-the-flag challenges and tabular q-learning. CoRR, 2020. https:// arxiv.org/abs/2005.12632.
[15] Minghui Zhu, Zhisheng Hu, and Peng Liu. Reinforcement learning algorithms for adaptive cyber defense against heartbleed. In Proceedings of the First ACM Workshop on Moving Target Defense, MTD ’14, page 51–58, New York, NY, USA, 2014. Association for Computing Machinery.