Researchers are building automated tools that can help analysts identify suspicious behavior within a network.
To detect and classify different types of cyber network attacks, researchers are leveraging easy-to-collect network packets and using machine learning to characterize the anomalous behavior found within them. Photo: James Caruso and Henry Palumbo

Early techniques for cyberattacks, such as guessing passwords manually, have evolved and multiplied throughout the years. In 2019, 350,000 new malware programs were registered every day in the United States. Detecting and reacting to these attacks are difficult tasks for analysts, because each type of attack interacts with a network in a different way.

"Cyber operators across the country are drowning in a sea of false positives," says Vijay Gadepally, a senior staff member in the Lincoln Laboratory Supercomputing Center (LLSC). "Alerts are going off, but they're not necessarily helping operators determine what is going on and how to take appropriate action." For example, existing tools may indicate that something suspicious is going on in the network but fail to characterize what is happening. 

To manage this tricky cyberattack landscape, analysts need automated tools that can accurately detect and classify threats. Gadepally is working with Emily Do, a former graduate student in the MIT Department of Electrical Engineering and Computer Science, to apply the power of supercomputing to this very problem. The team is using machine learning to characterize anomalous behavior within a cyber network.

A network packet is a unit of data that is sent from a source to a destination. A series of packets that is sent between a single source and destination is called a flow. "Our hypothesis is that if there is a network anomaly caused by a cyberattack, there will be a change in the way flows occur between sources and destinations," Gadepally says.
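To make the distinction concrete, here is a minimal sketch in Python (with made-up field names and addresses, not the team's actual pipeline) of how individual packet records can be grouped into flows keyed by their source and destination:

```python
from collections import defaultdict

# Each packet record is a (timestamp, src_ip, dst_ip, size_bytes) tuple.
# These fields are illustrative; real captures carry many more attributes.
packets = [
    (0.01, "10.0.0.1", "192.168.1.5", 1500),
    (0.02, "10.0.0.1", "192.168.1.5", 1500),
    (0.03, "10.0.0.7", "192.168.1.5", 60),
    (0.05, "10.0.0.1", "192.168.1.5", 400),
]

# A flow is the series of packets exchanged between one source and one
# destination, so we aggregate packets under a (src, dst) key.
flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
for ts, src, dst, size in packets:
    flows[(src, dst)]["packets"] += 1
    flows[(src, dst)]["bytes"] += size

for (src, dst), stats in flows.items():
    print(f"{src} -> {dst}: {stats['packets']} packets, {stats['bytes']} bytes")
```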

To test this hypothesis, the team first gathered data from the MAWI Working Group Traffic Archive — an open-source collection of continually updated raw internet traffic data — and then converted these data into network flows. This aspect of the project necessitated a huge amount of computing power, which is why the LLSC is involved.

"These datasets are massive — a 15-minute window of packet data corresponds to nearly 20 gigabytes of data," Gadepally says. "While we end up converting everything to the flow format, which is much smaller, this conversion is a computationally heavy task."

After converting the network packets into flow data, the team bucketed the data into 10-second time windows. For each of these windows, they then looked at the entropy (a measure of the variation, or randomness, in a distribution) within features such as IP addresses. The idea is that a change in entropy would be a good indicator of anomalous network activity.
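A rough sketch of this windowing-and-entropy step might look like the following; the flow record layout and the use of Shannon entropy here are illustrative assumptions, not the project's exact implementation:

```python
import math
from collections import Counter, defaultdict

def shannon_entropy(values):
    """Shannon entropy (in bits) of the distribution of the given values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Flow records as (timestamp_seconds, source_ip) pairs -- an illustrative
# layout, not the project's actual flow format.
flow_records = [
    (1.2, "10.0.0.1"), (3.8, "10.0.0.2"), (7.5, "10.0.0.1"),
    (11.0, "10.0.0.9"), (14.3, "10.0.0.4"), (19.9, "10.0.0.6"),
]

WINDOW = 10  # seconds

# Bucket flows into 10-second windows, then compute the entropy of the
# source-IP feature inside each window.
windows = defaultdict(list)
for ts, src_ip in flow_records:
    windows[int(ts // WINDOW)].append(src_ip)

for w in sorted(windows):
    print(f"window {w}: source-IP entropy = {shannon_entropy(windows[w]):.3f} bits")
```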

As an example, consider a distributed denial-of-service attack. In one form of this attack, an attacker disrupts a victim network by bombarding it with traffic from multiple compromised systems. An analyst observing this attack would expect to see a significant increase in the entropy of source IP addresses, because the number of distinct source addresses jumps sharply within the time window in which the attack occurs.
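As a toy numerical illustration of that intuition (using synthetic addresses, not the study's data), compare the source-IP entropy of a quiet window with that of a window flooded by traffic from many distinct compromised hosts:

```python
import math
import random
from collections import Counter

def shannon_entropy(values):
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

random.seed(0)

# Normal window: traffic arrives from a handful of familiar hosts.
normal_window = [f"10.0.0.{random.randint(1, 5)}" for _ in range(1000)]

# Attack window: the same background traffic plus a flood from many
# distinct compromised source addresses (a synthetic stand-in for a DDoS).
attack_sources = [f"172.16.{random.randint(0, 255)}.{random.randint(1, 254)}"
                  for _ in range(1000)]
attack_window = normal_window + attack_sources

print(f"normal window source-IP entropy: {shannon_entropy(normal_window):.2f} bits")
print(f"attack window source-IP entropy: {shannon_entropy(attack_window):.2f} bits")
```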

The last step of the process was to feed the entropy results into a neural network, which allows the algorithm to accurately classify the type of network attack. In the end, the research team found that their system could detect and identify incoming attacks that affected as little as 5% of the total traffic flow within a network.

"This work quantifies the sensitivity of our method of detection," Do says. "The lower the number, the more sensitive the detection. [This] means that if your network has a usual total traffic of 10,000 packets per second, the method can detect and classify an attack at 500 packets per second or more with high accuracy."

So far, the team has been using synthetic data to simulate network attacks for training and testing their algorithm. Now that they have successfully demonstrated the utility of their method, their next step is to use real data to test the system further. The end goal is to build an effective, working system that they can make publicly available for all cyber analysts who wish to use it. 

This work is supported by the MIT-Air Force AI Innovation Accelerator.