Indicators of compromise (IOCs) are important forensic artifacts which, as the name suggests, are used in incident response and threat research to discover whether a system has been compromised. They come in various forms: for example, unusual outbound network traffic, a file with a known-malicious MD5 hash in a temporary directory, or log-in irregularities. One class of IOCs that has so far resisted detection by traditional methods relates to the use of external content in web-based attacks.
At Black Hat Europe earlier today, Trend Micro senior security researcher Marco Balduzzi explained how a new machine learning approach can reap fantastic results for early detection of such threats.
But the fact that the presence of such external resources can be used to precisely pinpoint a compromised webpage makes them potentially great IOCs. So, Balduzzi and his team set about building a high-interaction honeypot featuring five vulnerable web apps and 100 domains. However, they didn’t have much success at first, as candidate IOCs appeared benign when judged by their content alone, especially as attackers often try to prevent page inspection. So, the team decided to extend the analysis to include the context in which each candidate appears.
Using machine learning technology based on the Weka Framework, the team assessed candidates against several validators, including:
- Page similarity: attackers often reuse the same template
- Anomalous origin: common scripts are often reused but then hosted elsewhere, eg compromised sites in Russia
- Maliciousness: reputation of the parent web pages
- Component popularity: very popular resources like Facebook SDKs tend to be benign
- Security forums: discussions in these can help inform research
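To make the idea of contextual validators more concrete, here is a minimal Python sketch of how a few of the signals above might be turned into numbers and combined. It is an illustration only and not the researchers' implementation, which was built on Weka. The `Candidate` fields, the `USUAL_HOSTS` table, the `popular_at` threshold and the equal weighting are all assumptions made for this example, and the security-forums signal is left out because it does not reduce to a simple calculation.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


@dataclass
class Candidate:
    """Hypothetical record for one external resource seen on compromised pages."""
    url: str             # the external resource (candidate IOC)
    parent_pages: list   # one token set per page that embeds the resource
    parent_flagged: int  # how many of those parent pages have a bad reputation
    embed_count: int     # how many sites embed this exact URL overall


# Assumed lookup of where a well-known script is normally hosted.
USUAL_HOSTS = {"jquery.min.js": {"code.jquery.com", "ajax.googleapis.com"}}


def page_similarity(c: Candidate) -> float:
    """Mean pairwise similarity of parent pages (high when a template is reused)."""
    pages = c.parent_pages
    pairs = [(i, j) for i in range(len(pages)) for j in range(i + 1, len(pages))]
    if not pairs:
        return 0.0
    return sum(jaccard(pages[i], pages[j]) for i, j in pairs) / len(pairs)


def anomalous_origin(c: Candidate) -> float:
    """1.0 when a common script name is served from an unusual host."""
    parsed = urlparse(c.url)
    name = parsed.path.rsplit("/", 1)[-1]
    usual = USUAL_HOSTS.get(name)
    return 1.0 if usual is not None and parsed.hostname not in usual else 0.0


def maliciousness(c: Candidate) -> float:
    """Fraction of parent pages with a bad reputation."""
    return c.parent_flagged / len(c.parent_pages) if c.parent_pages else 0.0


def unpopularity(c: Candidate, popular_at: int = 1000) -> float:
    """Close to 1.0 for rarely embedded resources, 0.0 for very popular ones."""
    return max(0.0, 1.0 - c.embed_count / popular_at)


def score(c: Candidate) -> float:
    """Unweighted mean of the four signals (equal weights are an assumption)."""
    signals = [page_similarity(c), anomalous_origin(c), maliciousness(c), unpopularity(c)]
    return sum(signals) / len(signals)
```

In this sketch, a copy of a common script served from an unfamiliar host, embedded by a handful of near-identical pages with poor reputations, scores close to 1, while the same script served from its usual host and embedded widely scores close to 0.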
After feeding in months’ worth of training data, and using unsupervised learning and clustering for exploratory analysis, the team ran a four-month live experiment. It generated 303 unique IOC candidates and automatically validated 96 as genuine. An astonishing 90% were previously unknown or misclassified and only 6% of the compromised parent pages were detected by VirusTotal, Balduzzi explained.
Some 10% were even hosted on Google Drive/Code, with one allowed to stay online for over a year. The IOCs have been linked to dozens of web defacements, drive-by attacks, phishing attacks, adware campaigns and more.