VPN/Non-VPN Network Application Traffic Dataset (VNAT)

This dataset is a collection of labeled PCAP files, both encrypted and unencrypted, across 10 applications. It was created to assist the development of machine learning tools that allow operators to see the traffic categories of both encrypted and unencrypted traffic flows. In particular, features derived from packet timing and size information (both inside and outside the VPN) can be used to predict the application category that generated the traffic.

The dataset comprises 36.1 GB of data, 33,711 connections, and approximately 272 hours of packet capture from five traffic categories, as shown in the table below:

Traffic Category	Applications	Filename Keywords
Streaming	Vimeo, Netflix, YouTube	vimeo, netflix, youtube
VoIP	Zoiper	voip
Chat	Skype	skype-chat
Command & Control (C2)	SSH, RDP	ssh, rdp
File Transfer	SFTP, RSYNC, SCP	sftp, rsync, scp

To produce the dataset, a virtual subnetwork was created for each traffic category. Each subnetwork contains a client, a client DNS server, a VPN client, and a VPN server. The Skype subnetwork contains an additional client to allow for bidirectional chat. The video streaming and web browsing subnetworks were connected to the Internet to enable access to Firefox, Chrome, YouTube, Netflix, and Vimeo. VPN traffic was captured between the VPN client and the VPN server. Separately, non-VPN traffic was captured between the VPN client and the application layer.

Netflix, YouTube, Zoiper, and Vimeo network traffic was generated manually. The File Transfer network traffic, by contrast, was generated with the assistance of randomized scripts. The Chat category was created by playing back chat messages available at https://github.com/freeCodeCamp/gitter-history. For the C2 category, the RDP traffic was generated manually, whereas the SSH traffic was created with randomized scripts that executed shell commands. All traffic was captured using tcpdump and written in the libpcap-compatible PCAP format.

The figure above depicts the configuration and setup of our Skype chat, video streaming (e.g., YouTube), and other application traffic collection points.
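Because the captures are stored in the libpcap-compatible PCAP format, packet timestamps and sizes can be read with very little tooling. The following is a minimal sketch, assuming the files use the classic PCAP layout (a 24-byte global header followed by 16-byte record headers) rather than pcapng. The function name read_pcap_records is hypothetical and is not part of the dataset release.

```python
import struct

# Magic numbers of the classic libpcap file format (not pcapng).
_MAGIC_MICRO = 0xA1B2C3D4  # timestamps stored as seconds + microseconds
_MAGIC_NANO = 0xA1B23C4D   # timestamps stored as seconds + nanoseconds


def read_pcap_records(stream):
    """Yield (timestamp_seconds, original_length) for each packet record."""
    header = stream.read(24)
    if len(header) < 24:
        raise ValueError("file too short for a PCAP global header")
    # The magic number reveals both byte order and timestamp resolution.
    for endian in ("<", ">"):
        magic = struct.unpack(endian + "I", header[:4])[0]
        if magic in (_MAGIC_MICRO, _MAGIC_NANO):
            break
    else:
        raise ValueError("not a classic PCAP file (possibly pcapng)")
    divisor = 1e9 if magic == _MAGIC_NANO else 1e6

    while True:
        record = stream.read(16)
        if len(record) < 16:
            break  # end of file, or a truncated trailing record
        ts_sec, ts_frac, incl_len, orig_len = struct.unpack(
            endian + "IIII", record
        )
        stream.read(incl_len)  # skip the captured packet bytes
        yield ts_sec + ts_frac / divisor, orig_len
```

To use it, open an extracted capture in binary mode and iterate, for example: for ts, size in read_pcap_records(open(path, "rb")), where path points to one of the .pcap files. Established libraries such as scapy or dpkt are a better choice when packet headers need to be decoded as well.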

Download Instructions

The PCAP dataset is provided as a single .zip archive that contains all the raw PCAP files. We also provide two .h5 files that can be loaded directly with the Python pandas package. "VNAT_Dataframe_release_<#>.h5" contains the connection, packet timestamps, packet sizes, and packet directions from the PCAPs, already extracted into a pandas DataFrame. "VNAT_Feature_Dataframe_release_<#>.h5" contains machine learning feature data extracted using the wavelet-based and TLS-based methods described in the paper linked at the bottom of this page. Users can infer labels (VPN, non-VPN, apps, categories, etc.) from the file names provided. To use the PCAP data, download the .zip file and extract it. To use the extracted data, download one of the .h5 files and load it with the following Python code (ensure that the pip packages "tables" (PyTables) and "pandas" are installed):

import pandas as pd
# Replace <#> with the release number of the downloaded file.
df = pd.read_hdf("VNAT_Dataframe_release_<#>.h5")

This was tested on Python 3.8 using pandas v1.4.3 and likely works for Python >= 3.8 and pandas >= 1.4.
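As an illustration of how packet timestamps and sizes can be turned into simple per-connection features, here is a minimal sketch. It is not the wavelet-based or TLS-based feature extraction described in the paper. It takes plain sequences rather than DataFrame columns because the column names of the .h5 files are not listed on this page. The function name summarize_flow and the returned keys are hypothetical, and timestamps are assumed to be in seconds and sorted in ascending order.

```python
from statistics import mean, pstdev


def summarize_flow(timestamps, sizes):
    """Return simple timing and size statistics for one connection."""
    if len(timestamps) != len(sizes):
        raise ValueError("timestamps and sizes must have the same length")
    if len(timestamps) == 0:
        raise ValueError("need at least one packet")
    # Inter-arrival times: gaps between consecutive packet timestamps.
    gaps = [
        later - earlier
        for earlier, later in zip(timestamps, timestamps[1:])
    ]
    return {
        "packet_count": len(sizes),
        "duration": timestamps[-1] - timestamps[0],
        "total_bytes": sum(sizes),
        "mean_size": mean(sizes),
        "std_size": pstdev(sizes),
        # A single-packet flow has no gaps, so report zero for both.
        "mean_gap": mean(gaps) if gaps else 0.0,
        "std_gap": pstdev(gaps) if gaps else 0.0,
    }
```

Statistics of this kind are only a starting point. Comparing them between the VPN and non-VPN captures of the same application is one way to check which ones survive encapsulation.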

The data was captured using tcpdump on an isolated subnetwork where only network traffic from the desired application was present. Since all captured applications encrypt their packet payloads, no obfuscation of the payload was required. Since the packets were captured on an isolated subnet created solely for this purpose, no obfuscation of packet header data was required either. After the PCAP data was captured, files were labeled according to the application run during the capture, using the following format:

Capture Type	File Naming Format
VPN	vpn_<filename keyword>_capture<#>.pcap
NON-VPN	nonvpn_<filename keyword>_capture<#>.pcap
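The naming format above, combined with the filename keywords in the traffic category table, is enough to recover labels programmatically. The following is a minimal sketch, assuming that <#> is a decimal number and that keywords are matched case-insensitively. The names parse_capture_name and CATEGORY_BY_KEYWORD are hypothetical.

```python
import re

# Keyword-to-category mapping taken from the traffic category table.
CATEGORY_BY_KEYWORD = {
    "vimeo": "Streaming",
    "netflix": "Streaming",
    "youtube": "Streaming",
    "voip": "VoIP",
    "skype-chat": "Chat",
    "ssh": "Command & Control (C2)",
    "rdp": "Command & Control (C2)",
    "sftp": "File Transfer",
    "rsync": "File Transfer",
    "scp": "File Transfer",
}

# <capture type>_<filename keyword>_capture<#>.pcap
_NAME_RE = re.compile(
    r"^(vpn|nonvpn)_(.+)_capture(\d+)\.pcap$", re.IGNORECASE
)


def parse_capture_name(filename):
    """Split a capture file name into its label components."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    capture_type, keyword, number = match.groups()
    keyword = keyword.lower()
    return {
        "is_vpn": capture_type.lower() == "vpn",
        "keyword": keyword,
        # None when the keyword is not in the table above.
        "category": CATEGORY_BY_KEYWORD.get(keyword),
        "capture": int(number),
    }
```

A name that does not fit the pattern raises ValueError, and a keyword missing from the table yields a category of None, so any file that deviates from the documented format is surfaced rather than silently mislabeled.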

More information about the dataset can be found in our related publication, available at https://ieeexplore.ieee.org/abstract/document/10044382.