Black Hat Europe: How Machine Learning Offers a New Approach to Uncover IOCs

Indicators of compromise (IOCs) are important forensic artifacts which, as the name suggests, are used in incident response and threat research to discover whether a system has been compromised. They come in various forms: for example, unusual outbound network traffic, a file with a known-malicious MD5 hash in a temporary directory, or even log-in irregularities. One class of IOCs that has so far resisted detection by traditional methods relates to the use of external content in web-based attacks.

At Black Hat Europe earlier today, Trend Micro senior security researcher Marco Balduzzi explained how a new machine learning approach can deliver strong results in the early detection of such threats.

Overcoming limitations
When attackers compromise a web application to launch an attack, they often rely on external content. This could take the form of popular JavaScript libraries like jQuery; beautifiers to improve the look and feel of the page; or even scripts that implement reusable functions. These components are not inherently malicious, and it is precisely because they look so innocuous that web scanners and other traditional tools find it very difficult to recognise them as IOCs.

But the fact that their presence can precisely pinpoint a compromised webpage makes them potentially great IOCs. So, Balduzzi and his team set about building a high-interaction honeypot featuring five vulnerable web apps and 100 domains. However, they had little success at first, as candidate IOCs appeared benign judging by their content alone – especially as attackers often try to prevent page inspection. So, the team decided to extend the analysis to include the context in which each candidate appears.
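
To make the idea of a candidate IOC concrete, here is a minimal Python sketch of how external script references could be pulled out of a captured page. It is an illustration only, not the team's tooling: the names `ScriptSrcParser` and `external_scripts` are hypothetical, and it assumes a candidate is simply any script loaded from a host other than the one serving the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ScriptSrcParser(HTMLParser):
    """Collects the src attribute of every <script> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def external_scripts(page_url, html):
    """Return absolute URLs of scripts served from a host other than the page's."""
    parser = ScriptSrcParser()
    parser.feed(html)
    page_host = urlparse(page_url).hostname
    found = []
    for src in parser.sources:
        # Resolve relative and protocol-relative references against the page URL.
        absolute = urljoin(page_url, src)
        if urlparse(absolute).hostname != page_host:
            found.append(absolute)
    return found
```

Run against a page that loads one local script and one from another host, the function returns only the second. On its own that list says nothing about maliciousness, which is why context is needed.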

Using machine learning technology based on the Weka framework, the team analysed several validators, including:

  • Page similarity: attackers often reuse the same template
  • Anomalous origin: common scripts are often reused but then hosted elsewhere, e.g. on compromised sites in Russia
  • Maliciousness: reputation of the parent web pages
  • Component popularity: very popular resources like Facebook SDKs tend to be benign
  • Security forums: discussions in these can help inform research
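
As a rough illustration of how some of these validators might become numeric features for a classifier, the Python sketch below encodes three of them. Everything in it is an assumption made for the example: the `EXPECTED_HOSTS` table, the function names, and the choice to represent reputation as scores between 0 and 1 are hypothetical and do not come from the talk.

```python
from urllib.parse import urlparse

# Hypothetical reference data: file names of well-known libraries and the
# hosts we would normally expect to serve them. Illustration only.
EXPECTED_HOSTS = {
    "jquery.min.js": {"code.jquery.com", "ajax.googleapis.com"},
}


def anomalous_origin(script_url):
    """True if a well-known file name is served from an unexpected host."""
    parsed = urlparse(script_url)
    filename = parsed.path.rsplit("/", 1)[-1]
    expected = EXPECTED_HOSTS.get(filename)
    return expected is not None and parsed.hostname not in expected


def feature_vector(script_url, parent_reputations, referencing_sites):
    """Build a feature dictionary for one candidate script.

    parent_reputations: scores in [0, 1] for the pages that load the script,
        where 1 means known malicious (an assumed encoding).
    referencing_sites: number of distinct sites seen loading the script.
    """
    if parent_reputations:
        maliciousness = sum(parent_reputations) / len(parent_reputations)
    else:
        maliciousness = 0.0
    return {
        "anomalous_origin": 1.0 if anomalous_origin(script_url) else 0.0,
        "parent_maliciousness": maliciousness,
        "popularity": float(referencing_sites),
    }
```

A copy of jQuery served from an unfamiliar host, loaded by a handful of pages with poor reputations, would score high on the first two features and low on popularity, which is the kind of combination the validators are meant to surface.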

Going live
After feeding in months’ worth of training data, and using unsupervised learning and clustering for exploratory analysis, the team ran a four-month live experiment. It generated 303 unique IOC candidates and automatically validated 96 as genuine. An astonishing 90% were previously unknown or misclassified, and only 6% of the compromised parent pages were detected by VirusTotal, Balduzzi explained.
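
The clustering algorithm used was not specified here, so the following Python sketch swaps in a deliberately simple stand-in: greedy "leader" clustering over Jaccard similarity of page tokens. It is meant only to show how pages built from the same attacker template could end up grouped together. The function names and the 0.5 threshold are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_clusters(pages, threshold=0.5):
    """Group pages whose tokens resemble the first page of an existing cluster.

    pages: dict mapping a page identifier to its set of tokens.
    Returns a list of clusters, each a list of page identifiers.
    """
    clusters = []  # each entry: (tokens of the cluster leader, member ids)
    for page_id, tokens in pages.items():
        for leader_tokens, members in clusters:
            if jaccard(tokens, leader_tokens) >= threshold:
                members.append(page_id)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append((tokens, [page_id]))
    return [members for _, members in clusters]
```

Two defacement pages that share most of their wording would land in one cluster, while an unrelated spam page would form its own. The result depends on the order in which pages are seen, which is acceptable for exploration but is one reason a production system would likely use a more robust method.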

Some 10% were even hosted on Google Drive/Code, with one allowed to stay online for over a year. They’ve been linked to dozens of web defacements, drive-by attacks, phishing attacks, adware campaigns and more.
