Data Science for Cyber-Security
25-27th September 2017, Imperial College London, U.K.

Invited speakers

  • Ian Levy, National Cyber Security Centre
  • Mike Fisk, Los Alamos National Laboratory
    Title: Data-Driven Decision Making for Cybersecurity [slides]
    Abstract: Businesses and government face a daunting option space of cybersecurity improvements that could be made. However, resources are limited and improvements have direct costs, opportunity costs, and can impede other business objectives. Thus, deciding how much to invest and in which improvements is a key challenge for decision makers. The paucity of data-driven decision tools today means that decisions are often based solely on regulatory requirements, product marketing claims, and peer benchmarking. Given the role of adversary decision making and adaptation in cybersecurity, we pose a new way to evaluate options based on the ratio of the costs incurred by the defender and by the attacker. By ignoring this metric, a defender can bankrupt themselves without actually thwarting an attacker. Second, we demonstrate the use of data-driven metrics to compare courses of action and to measure improvement. Examples include the use of network traffic to identify high-value assets, measurements of security operations performance, and measures of a distributed system's intrusion tolerance.
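As a toy illustration of the defender/attacker cost-ratio idea in the abstract above (the controls, costs, and threshold below are all invented numbers, not figures from the talk), candidate defences can be ranked by the cost they incur per unit of cost they impose on the attacker:

```python
def cost_ratio(defender_cost, attacker_cost):
    """Ratio of cost incurred by the defender to cost imposed on the attacker.
    Lower is better for the defender; a ratio well above 1 means the defence
    costs more to operate than it costs the attacker to circumvent."""
    if attacker_cost <= 0:
        return float("inf")  # the control imposes no cost on the attacker
    return defender_cost / attacker_cost

# Compare two hypothetical controls (all figures invented).
options = {
    "perimeter appliance": cost_ratio(defender_cost=500_000, attacker_cost=1_000),
    "2FA rollout":         cost_ratio(defender_cost=50_000,  attacker_cost=200_000),
}
best = min(options, key=options.get)
print(best)  # the option imposing the most attacker cost per defender dollar
```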
  • Mark Briers, Alan Turing Institute
    Title: Towards a unified Bayesian model for Cyber security [slides]
    Abstract: With the realisation that Cyber attacks present a significant risk to an organisation's reputation, efficiency, and profitability, there has been an increase in the instrumentation of networks; from collecting netflow data at routers, to host-based agents collecting detailed process information. To spot the potential threats within a Cyber environment, a large community of researchers has produced many exciting innovations aligned with such data. Much of this research has been focused on "data driven" techniques, and does not often fuse data from multiple sources. Moreover, incorporation of threat actors' behaviours and motivations (as specified by Cyber security experts) is often non-existent. In this talk, I will present an initial unified Bayesian model for Cyber security, which allows explicit incorporation of expert knowledge, and provides a natural probabilistic framework for the fusion of multiple data sources.
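As a toy worked example of the kind of fusion the abstract describes (the prior, the two data sources, and the likelihood ratios are all invented for illustration, not taken from the talk), expert knowledge enters as a prior and each data source contributes a likelihood ratio via Bayes' rule in odds form:

```python
prior = 0.01  # invented expert prior probability that a host is compromised

# Invented likelihood ratios P(obs | compromised) / P(obs | clean),
# one per data source being fused.
likelihood_ratios = {
    "netflow_beaconing": 20.0,  # periodic outbound connections seen in netflow
    "host_new_service":  5.0,   # unexpected process reported by a host agent
}

# Bayes' rule in odds form: posterior odds = prior odds * product of ratios.
odds = prior / (1 - prior)
for lr in likelihood_ratios.values():
    odds *= lr
posterior = odds / (1 + odds)  # fused probability of compromise
```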
  • David Marchette, Naval Surface Warfare Center, Dahlgren Division
    Title: Computational Statistics and Mathematics for Cyber Security [slides]
    Abstract: Computer and network security relies on many different tools, such as secure programming practices, firewalls, virus scanners, and various algorithms to detect attacks and malicious software. The latter require the analysis of complex and varied data such as packet streams, emails, potentially malicious binary executables and user activity on a computer and on the network. Modern data analytics has a number of tools to analyze these data streams and to design detection algorithms. This paper discusses several such tools that come from the computational statistics literature and from pure mathematics: nonparametric probability density estimation; graph-based manifold learning; topological data analysis. These ideas are illustrated on a problem of malware classification.
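A minimal sketch of the first tool named above, nonparametric (kernel) density estimation, applied in the spirit of the malware-classification illustration. The feature values and bandwidth are invented; a real system would fit one density per class over features extracted from executables and assign a sample to the class with the higher estimated density:

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a one-dimensional Gaussian kernel density estimator."""
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2)
                          for xi in data)
    return density

# Invented feature values (e.g. byte entropy of executables) per class.
benign  = [4.1, 4.3, 4.0, 4.5, 4.2]
malware = [7.2, 7.5, 7.1, 7.8, 7.4]

f_benign  = gaussian_kde(benign,  bandwidth=0.3)
f_malware = gaussian_kde(malware, bandwidth=0.3)

x = 7.3  # new sample: classify by comparing the estimated densities
label = "malware" if f_malware(x) > f_benign(x) else "benign"
```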
  • George Cybenko, Dartmouth College
    Title: Large-scale Analog Measurements and Analysis for Cyber Security [slides]
    Abstract: All digital logic and data in a system are subject to compromise by an attacker. There is growing interest in using analog side-channel measurements and their analyses to monitor correct program execution. We describe an effort involving measuring electromagnetic emanations from a commodity processor and using those measurements to monitor execution in order to alert on deviations from the logically constrained paths determined by the program structure. These measurements are made at GHz rates and require large-scale modeling and real-time analysis to be effective.
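A highly simplified sketch of that monitoring idea (the traces, the two path templates, and the threshold are all invented toy values, nothing like real GHz-rate data): compare a measured trace against templates for the program's legal execution paths and flag any trace that matches none of them:

```python
def rmse(a, b):
    """Root-mean-square error between two equal-length traces."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

# Invented emission templates for the two legal paths through a branch.
templates = {
    "path_A": [0.1, 0.9, 0.4, 0.4],
    "path_B": [0.1, 0.2, 0.8, 0.3],
}
THRESHOLD = 0.15  # invented: maximum RMSE for a trace to count as legal

def check(trace):
    """True if the trace is close to some legal-path template."""
    return min(rmse(trace, t) for t in templates.values()) <= THRESHOLD

ok  = check([0.1, 0.85, 0.45, 0.4])  # close to path_A: no alert
bad = check([0.9, 0.9, 0.9, 0.9])    # deviates from every legal path: alert
```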

  • Harsha Kalutarage, Queen's University Belfast
    Title: Feature Trade-off Analysis for Reconnaissance Detection [slides]
    Abstract: Early warning systems aim to alert on attack attempts at their nascent stages. Although there can be many overlaps between a typical intrusion detection system (IDS) and such a system, a particular emphasis for early warnings is to establish hypotheses and predictions, and to generate alerts on not yet understood (unclassified) situations based on preliminary indications. In contrast, a typical IDS attempts to detect attacks using known indications of attack patterns. Design and implementation of such early warning systems involve numerous challenges, such as a generic set of indicators, intelligence gathering, uncertainty reasoning and information fusion. In this talk, we employ machine learning techniques to produce early warnings of attack attempts on computer networks. Modern complex network attacks have multiple stages such as reconnaissance, elevation of privileges, internal horizontal spread, exfiltration of data and access persistence over time. Actions in each of these stages create evidence in systems. However, due to a number of factors such as scale, architecture and non-productive traffic, it is difficult to detect them using typical IDSs. Through curve fitting and regression, machine learning can impose structure on unstructured data and label it, so that one can compare normal versus abnormal patterns in order to produce early warnings of ongoing malicious activities. The talk begins with an understanding of the behaviours of intruders; a review of the related literature is followed by the proposed methodology. It also includes a carefully deployed empirical analysis with a number of attack scenarios on computer networks. Finally, the talk concludes with a discussion of results, research challenges and suggestions for moving this line of research forward.
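A hedged sketch of the curve-fitting idea for reconnaissance detection (the per-source counts and the slope threshold below are invented, and the talk's actual method is not claimed to be this simple): fit a linear trend to per-source counts of failed connection attempts and raise an early warning for sources whose activity is growing abnormally fast:

```python
def slope(ys):
    """Least-squares slope of ys against time steps 0..n-1."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

# Invented failed-connection counts per hour for two sources.
counts = {
    "10.0.0.5":  [2, 1, 3, 2, 2, 3],     # stable background noise
    "10.0.0.99": [1, 4, 9, 20, 45, 90],  # accelerating scan-like pattern
}
THRESHOLD = 5.0  # invented slope threshold for issuing an early warning
warnings = [src for src, ys in counts.items() if slope(ys) > THRESHOLD]
```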
  • Niall Adams, Imperial College London
    Title: On Constructing Cyber-Analytics [slides]
    Abstract: Enterprise network defense is providing great opportunities for the development and deployment of statistical and machine learning methods. Such methods are intended to complement existing defenses, such as firewalls, virus scanners, and intrusion detection systems, which are predominantly signature-based. The role of data analysis methods is to provide enhanced situation awareness, by providing monitoring and alerting mechanisms to detect departures from “normal” behavior. In developing analytics in this context, a variety of challenging problems need to be addressed, including the volume and velocity of the data, high levels of heterogeneity, temporal variation, and more. We review aspects of the problem and characteristics of the various data sources. At present, the vision of jointly modelling various data sources at different levels of network abstraction appears out of reach due to data volume and timeliness concerns. Instead, we describe a set of novel, and often simple, analytics that operate within different levels of the abstraction hierarchy.
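As a minimal sketch of the kind of simple streaming analytic the abstract alludes to (the smoothing factor, alert multiplier, and traffic counts are invented parameters, not the talk's analytics), an exponentially weighted moving average (EWMA) can track a "normal" level and alert on departures from it:

```python
def ewma_monitor(stream, alpha=0.3, k=3.0):
    """Yield (value, alert) pairs; alert when a value exceeds k * EWMA."""
    mean = None
    for x in stream:
        if mean is None:
            mean = float(x)  # initialise the baseline from the first value
            yield x, False
            continue
        alert = x > k * mean
        # Update the baseline only with non-alerting values, so a burst
        # does not immediately become the new "normal".
        if not alert:
            mean = alpha * x + (1 - alpha) * mean
        yield x, alert

traffic = [10, 12, 11, 9, 13, 95, 10, 11]  # invented per-second event counts
alerts = [x for x, a in ewma_monitor(traffic) if a]
```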
  • Christoforos Anagnostopoulos, Mentat
    Title: Poorly-Supervised Learning: how to engineer labels for machine learning in cybersecurity [slides]
    Abstract: Lack of labelled data is a well-known challenge in cybersecurity research. Although confidentiality is often used to explain this scarcity, our experience shows that even within the closed walls of large enterprises with significant in-house expertise it is not obvious how to turn threat intelligence and incident information into a labelled dataset suitable for machine learning. This is due to a perfect storm of heavily imbalanced yet massive datasets, scarcity of time for performing manual labelling, and domain-specific idiosyncrasies as to label semantics that drive a wedge between expert know-how and statistical modelling. We hereby try to unpack these incompatibilities, and despite having little to show by way of solutions, we make some progress towards identifying a core set of challenges that are unlikely to be addressed by innovations in processes or tooling, but rather seem to require novel statistical methodology.
  • Melissa Turcotte, Los Alamos National Laboratory
    Title: A Unified Host and Network Data Set [slides]
    Abstract: The lack of data sets derived from operational enterprise networks continues to be a critical deficiency in the cyber security research community. Unfortunately, releasing viable data sets to the larger community is challenging for a number of reasons, primarily the difficulty of balancing security and privacy concerns against the fidelity and utility of the data. This chapter discusses the importance of cyber security research data sets and introduces a large data set derived from the operational network environment at Los Alamos National Laboratory. The hope is that this data set and associated discussion will act as a catalyst for both new research in cyber security as well as motivation for other organizations to release similar data sets to the community.
  • Blake Anderson, Cisco
    Title: Towards Generalizable Network Threat Detection [slides]
    Abstract: Network traffic analysis presents a number of key challenges including concept drift, noisy ground truth, and noisy features introduced by endpoint and network-level artifacts. We first provide an overview of these problems and how they manifest themselves in our datasets. Specifically, we demonstrate the changing popularity of different protocols and services, pitfalls in assuming ground truth for sandbox data, and how different operating systems and collection points affect commonly used data features. Given this enhanced understanding of the network domain, we show how to develop machine learning solutions that can address each of these challenges to provide generalizable threat detection. While many application layer protocols will be discussed, our focus will be on identifying malware in TLS encrypted traffic.
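As a hedged sketch of flow-feature-based classification of encrypted traffic in the spirit of the final abstract (the three features, the labelled flows, and the nearest-centroid rule are all invented for illustration; the talk's actual models and TLS metadata features are not reproduced here), one can label a new flow by its distance to per-class feature centroids:

```python
import math

def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented labelled flows: [ciphersuites_offered, mean_packet_size, entropy]
benign_flows  = [[15, 900, 6.0], [14, 1100, 6.2], [16, 950, 5.9]]
malware_flows = [[40, 300, 7.8], [38, 280, 7.9], [42, 320, 7.7]]

c_benign, c_malware = centroid(benign_flows), centroid(malware_flows)

def classify(flow):
    """Nearest-centroid label for a feature vector extracted from one flow."""
    return "malware" if dist(flow, c_malware) < dist(flow, c_benign) else "benign"

label = classify([39, 310, 7.8])  # a new flow resembling the malware class
```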