Malware detection has been widely studied by analysing either file dropping relationships or characteristics of the file distribution network. This paper, for the first time, studies a global heterogeneous malware delivery graph fusing file dropping relationship and the topology of the file distribution network. The integration offers a unique ability of structuring the end-to-end distribution relationship. However, it brings large heterogeneous graphs to analysis. In our study, an average daily generated graph has more than 4 million edges and 2.7 million nodes that differ in type, such as IPs, URLs, and files. We propose a novel Bayesian label propagation model to unify the multi-source information, including content-agnostic features of different node types and topological information of the heterogeneous network. Our approach does not need to examine the source codes nor inspect the dynamic behaviours of a binary. Instead, it estimates the maliciousness of a given file through a semi-supervised label propagation procedure, which has a linear time complexity w.r.t. the number of nodes and edges. The evaluation on 567 million real-world download events validates that our proposed approach efficiently detects malware with a high accuracy. © 2016 Copyright held by the owner/author(s).
|Original language||English (US)|
|Title of host publication||Proceedings of the 25th ACM International on Conference on Information and Knowledge Management - CIKM '16|
|Publisher||Association for Computing Machinery (ACM)|
|Number of pages||6|
|State||Published - Oct 26 2016|