Machine Learning for Cybersecurity

The amount of data collected by computer systems on a daily basis is staggering, and making sense of it all is often a daunting task. Machine learning algorithms are often used to identify patterns within these data sets; product recommendations on online shopping websites are just one use case. Machine learning algorithms can also classify data into categories, as in the detection of fraudulent transactions. Despite many successful applications of machine learning across a broad spectrum of fields, the cybersecurity domain has resisted similar progress. The reason for this “application gap” between machine learning and cybersecurity lies in several significant challenges within the cybersecurity domain. The most difficult include: (1) difficulty in obtaining labeled examples of abnormal/normal or malicious/benign events; and (2) difficulty in obtaining domain experts to (a) generate at least a few labeled examples and (b) evaluate and validate the outputs of machine learning systems.

Development of a machine learning model requires a labeled data set: data samples that contain both abnormal and normal events, each marked as such. Difficulties in collecting labeled cybersecurity data can be overcome by specialized techniques, such as semi-supervised learning (where a few labeled data samples are used to label a larger collection of similar unlabeled samples). Another is the “Presumed Negative” approach, which can be applied if the events of interest are very infrequent (<1% of the population). In this approach, all randomly collected samples in a large data set are treated as “Presumed Negatives”; because the events of interest are so rare, only a small fraction of these presumed labels will be wrong.
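The “Presumed Negative” approach can be illustrated with a minimal Python sketch. The function names, the string placeholders for samples, and the 1% prevalence figure used in the usage note are assumptions made for illustration only; they are not taken from any particular system.

```python
import random


def build_presumed_negative_set(known_positives, unlabeled_pool, n_negatives, seed=0):
    """Combine known positives (label 1) with a random draw from the
    unlabeled pool, every member of which is presumed negative (label 0)."""
    rng = random.Random(seed)
    # Random sampling matters: the presumed labels are only trustworthy
    # if the draw reflects the low base rate of the whole population.
    presumed_negatives = rng.sample(unlabeled_pool, n_negatives)
    samples = list(known_positives) + presumed_negatives
    labels = [1] * len(known_positives) + [0] * len(presumed_negatives)
    return samples, labels


def expected_label_noise(n_negatives, prevalence):
    """Expected number of true positives hidden among the presumed negatives."""
    return n_negatives * prevalence
```

As a usage example, drawing 1,000 presumed negatives from a population in which at most 1% of events are malicious gives `expected_label_noise(1000, 0.01)`, or about 10 mislabeled samples. Whether that level of label noise is tolerable depends on the model being trained.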

The significant benefits of a successful machine learning cybersecurity system should convince cybersecurity customers to invest in developing one, to gain malware coverage superior to that of traditional signature-based systems. Machine learning systems can complement traditional signature-based methods by offering more malware coverage and better predictability, particularly the potential for discovery of new malware, with the tradeoff of more uncertainty and an increased likelihood of false positives. Signature-based methods have difficulty detecting new mutations of malware, such as new packing algorithms applied to old malicious code. A machine learning cybersecurity system, by contrast, can use features such as entropy and locality-preserving hashing functions to generate malware detection models that can still detect new mutations of old malware.
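As one illustration of the entropy feature mentioned above, the following sketch computes the Shannon entropy of a byte string in bits per byte. The function name `byte_entropy` is hypothetical, and a real detection system would likely compute this per file section and combine it with many other features.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (range 0 to 8)."""
    if not data:
        return 0.0
    counts = Counter(data)  # frequency of each byte value
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Packed, compressed, or encrypted content tends to score close to the maximum of 8 bits per byte, while plain text and ordinary code usually score lower. A feature of this kind does not depend on the exact byte sequence, which is one reason it can survive a change of packing algorithm that would defeat a specific signature.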

Because they construct more general and compact models of malware, machine learning systems are easier to maintain than the complex data structures required for specific signature-based systems. Compared to large collections of specific signatures constructed reactively in response to yesterday’s malware, machine learning systems are inherently more predictive and adaptable to tomorrow’s cyber threats.

The potential benefit of increased malware coverage, especially discovery of new malware, offered by a machine learning system is complementary to the capabilities of traditional signature-based systems, and worth the risk of increased uncertainty (false positives) in the detection of new malware.


Authored by: Ed Purcell and Rick Havrilla