Evaluating the Impact of Data Views on Anomaly Detection Performance in Software Logs

Seeing the Same Data from Multiple Perspectives

Abstract

As our world has become increasingly digital and the number of tasks performed by software has grown, so too has the volume of software logs and the importance of cybersecurity. Anomaly detection on software logs is crucial for securing systems and identifying
the causes of past attacks. Extensive research has focused on developing effective log
parsers and anomaly detection methods. However, the process between obtaining parsed
logs and feeding them to the anomaly detection methods has remained unchanged for years.
Typically, logs are ordered by their generation time and split into fixed-length timeframes,
which are then analyzed for anomalies. This approach can group unrelated logs together
while splitting related logs across different timeframes, potentially missing critical context
for anomaly detection.
This research explores the concept of views, which are different perspectives on the input
data. For example, logs can be grouped by the computer that generated them, so that computers are compared with one another instead of arbitrary timeframes. The study demonstrates
that the performance of anomaly detection methods can vary significantly depending on the
views used.
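The difference between the conventional timeframe split and a view can be illustrated with a minimal sketch. It is not the thesis implementation: the record fields `timestamp`, `computer`, and `key`, the function names, and the sample logs are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def time_window_view(logs, window_seconds):
    """Group logs into fixed-length timeframes, keyed by window index."""
    groups = defaultdict(list)
    for log in sorted(logs, key=lambda record: record["timestamp"]):
        groups[int(log["timestamp"] // window_seconds)].append(log)
    return dict(groups)


def field_view(logs, field):
    """Group logs by the value of one field, e.g. the generating computer."""
    groups = defaultdict(list)
    for log in logs:
        if field in log:  # assumption: records lacking the field are skipped
            groups[log[field]].append(log)
    return dict(groups)


# Hypothetical parsed logs, invented for this example.
logs = [
    {"timestamp": 5, "computer": "pc-a", "key": "login"},
    {"timestamp": 58, "computer": "pc-b", "key": "login"},
    {"timestamp": 61, "computer": "pc-a", "key": "file_open"},
    {"timestamp": 119, "computer": "pc-b", "key": "logout"},
]

by_time = time_window_view(logs, 60)
by_computer = field_view(logs, "computer")
```

In this toy input the 60-second split places the login on `pc-a` in the same group as an unrelated login on `pc-b` and separates it from the later `file_open` on `pc-a`, whereas the computer view keeps each machine's events together.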
To validate this, two different anomaly detection methods were applied to two different
datasets, showing that the effect is not specific to any particular dataset or detection method.
The research discusses three methods for creating these views and proposes a heuristic for
suggesting the most useful view: prefer views that contain many distinct log keys and log
values. Additionally, partial views, such as grouping logs by user, are examined. Although
these views omit some logs, they prove valuable for detecting anomalous behavior:
grouping all logs of a user together and comparing them with those of other users helped find the
anomalous user in a dataset that contains more than 1400 users in total.
The study also investigates the cause of anomalous behavior in Windows event logs by analyzing the scores of individual logs to identify words present in anomalous logs but absent
in normal ones. While this approach has some false positives, it effectively highlighted the
timing and consequences of an attack.
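The idea of finding words that occur in anomalous logs but not in normal ones can be sketched as a set difference. This is an illustration only, under several assumptions not stated in the abstract: each log already has a per-log anomaly score, a fixed `threshold` separates anomalous from normal logs, and words are obtained by lowercasing and splitting on whitespace. The function name and the sample messages are hypothetical.

```python
def anomalous_only_words(scored_logs, threshold):
    """Return words seen in logs scoring at or above threshold but in no other log."""
    normal_words, anomalous_words = set(), set()
    for message, score in scored_logs:
        # assumption: a simple fixed threshold decides which logs count as anomalous
        target = anomalous_words if score >= threshold else normal_words
        target.update(message.lower().split())
    return anomalous_words - normal_words


# Hypothetical (message, score) pairs, invented for this example.
scored_logs = [
    ("User alice logged on", 0.1),
    ("User bob logged on", 0.2),
    ("User bob cleared audit log", 0.9),
]

suspicious = anomalous_only_words(scored_logs, threshold=0.5)
```

Words shared with normal logs, such as the user name here, are filtered out, which leaves only the terms specific to the high-scoring log. A rare but harmless word would survive the same filter, which is one way false positives can arise.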
Finally, the research explores various ways to combine scores from different views and the
effects of each. Combining all views of the EVTX (Windows event log) data provides a reliable middle ground, useful when
the most effective view is unknown. Taking only the highest score across all views was found
to produce many false positives and should be avoided. However, using the x highest scores
across all views, as long as x is greater than one, does not differ significantly from taking the
average score.
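The combination strategies named above can be written down as a short sketch. The function names and the sample scores are hypothetical, and the sketch assumes that "using the highest x scores" means averaging them, which the abstract does not state explicitly.

```python
def combine_mean(scores):
    """Average the scores that all views assign to the same item."""
    return sum(scores) / len(scores)


def combine_max(scores):
    """Take only the single highest score across views."""
    return max(scores)


def combine_top_x(scores, x):
    """Average the x highest scores (assumed interpretation of 'highest x scores')."""
    top = sorted(scores, reverse=True)[:x]
    return sum(top) / len(top)


# Hypothetical scores that four views assign to one item.
view_scores = [0.25, 0.5, 0.75, 1.0]

mean_score = combine_mean(view_scores)
max_score = combine_max(view_scores)
top2_score = combine_top_x(view_scores, 2)
```

With x equal to one, `combine_top_x` reduces to `combine_max`, so a single view that scores an item highly decides the outcome on its own. Larger values of x require agreement between several views before the combined score becomes high.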

Files

Master_Thesis_Otte_van_Dam.pdf
(pdf | 2.15 Mb)
- Embargo expired on 27-07-2024
Unknown license