CYBERSECURITY DATASETS AND OPEN SOURCE PROJECTS

1. DATASETS:

1. 12. Security of DNS over HTTPS (DoH) (CIRA-CIC-DoHBrw-2020)
2020
This research work proposes a systematic approach to generating a typical dataset for analyzing, testing, and evaluating DoH traffic in covert channels and tunnels. The main objective of the project is to deploy DoH within an application, capture benign as well as malicious DoH traffic, and use a two-layered approach to detect and characterize DoH traffic with a time-series classifier. The final dataset was produced by implementing the DoH protocol within an application, using five different browsers and tools and four servers to capture Benign-DoH, Malicious-DoH, and non-DoH traffic. Layer 1 of the two-layered approach distinguishes DoH traffic from non-DoH traffic, and layer 2 distinguishes Benign-DoH from Malicious-DoH traffic. The browsers and tools used to capture traffic are Google Chrome, Mozilla Firefox, dns2tcp, DNSCat2, and Iodine, while the servers used to respond to DoH requests are AdGuard, Cloudflare, Google DNS, and Quad9. We developed a traffic analyzer, DoHLyzer, to extract features from the captured traffic.
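
The two-layered decision described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `classify_flow`, the toy threshold classifiers, and the two-element feature vector are all hypothetical stand-ins, and the real features come from DoHLyzer.

```python
from typing import Callable, Sequence

# A flow is represented here as a plain feature vector. The layout is a
# placeholder; the real feature set is not reproduced in this sketch.
Flow = Sequence[float]
Classifier = Callable[[Flow], bool]


def classify_flow(flow: Flow, is_doh: Classifier, is_malicious: Classifier) -> str:
    """Apply the two layers in order and return one of three labels."""
    # Layer 1: separate DoH traffic from all other (non-DoH) traffic.
    if not is_doh(flow):
        return "non-DoH"
    # Layer 2: only flows accepted by layer 1 are characterized further.
    return "Malicious-DoH" if is_malicious(flow) else "Benign-DoH"


# Hypothetical stand-ins for trained models: simple thresholds on
# made-up features (flow[0] and flow[1] have no real-world meaning).
def toy_layer1(flow: Flow) -> bool:
    return flow[0] >= 0.5


def toy_layer2(flow: Flow) -> bool:
    return flow[1] > 500.0


labels = [
    classify_flow([0.0, 900.0], toy_layer1, toy_layer2),
    classify_flow([1.0, 100.0], toy_layer1, toy_layer2),
    classify_flow([1.0, 900.0], toy_layer1, toy_layer2),
]
```

In practice each layer would be a trained model; keeping the layers as separate callables means layer 2 never sees traffic that layer 1 rejected.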

The full research paper outlining the details of the dataset and its underlying principles:
- Mohammadreza MontazeriShatoori, Logan Davidson, Gurdip Kaur and Arash Habibi Lashkari, "Detection of DoH Tunnels using Time-series Classification of Encrypted Traffic", The 5th Cyber Science and Technology Congress (2020) (CyberSciTech 2020), Vancouver, Canada, August 2020

For more information and to download this dataset, visit this page.

1. 11. Darknet Traffic (CIC-Darknet2020)
2020
This research work proposes a novel technique to detect and characterize VPN and Tor applications together as a realistic representative of darknet traffic by amalgamating our two public datasets, ISCXTor2016 and ISCXVPN2016, to create a complete darknet dataset covering Tor and VPN traffic, respectively.

The full research paper outlining the details of the dataset and its underlying principles:
- Arash Habibi Lashkari, Gurdip Kaur, and Abir Rahali, “DIDarknet: A Contemporary Approach to Detect and Characterize the Darknet Traffic using Deep Image Learning”, 10th International Conference on Communication and Network Security, Tokyo, Japan, November 2020

For more information and to download this dataset, visit this page.

1. 10. Android Malware Static Analysis (CCCS-CIC-AndMal-2020)
2020
This research work proposes a new, comprehensive, and very large Android malware dataset named CCCS-CIC-AndMal-2020. The dataset includes 200K benign and 200K malware samples, totalling 400K Android apps, with 14 prominent malware categories and 191 eminent malware families. To generate a representative dataset, we collaborated with CCCS to capture 200K Android malware apps, which are labeled and characterized by family. The 200K benign Android apps were collected from the AndroZoo dataset to balance the dataset. The 14 malware categories are adware, backdoor, file infector, no category, Potentially Unwanted Apps (PUA), ransomware, riskware, scareware, trojan, trojan-banker, trojan-dropper, trojan-sms, trojan-spy, and zero-day. A complete taxonomy of all the malware families of the captured malware apps is created by dividing them into eight categories: sensitive data collection, media, hardware, actions/activities, internet connection, C&C, antivirus, and storage & settings.

The full research paper outlining the details of the dataset and its underlying principles:
- Abir Rahali, Arash Habibi Lashkari, Gurdip Kaur, Laya Taheri, Francois Gagnon, and Frédéric Massicotte, “DIDroid: Android Malware Classification and Characterization Using Deep Image Learning”, 10th International Conference on Communication and Network Security, Tokyo, Japan, November 2020

For more information and to download this dataset, visit this page.

1. 9. Investigation of the Android Malware (CICInvesAndMal2019)
2020
We make the second part of the CICAndMal2017 dataset publicly available. It includes permissions and intents as static features, and API calls and all generated log files as dynamic features, captured in three steps (during installation, before restarting the phone, and after restarting the phone). In this part, we improve our malware category and family classification performance by around 30% by combining the previous dynamic features (80 network-flow features extracted with CICFlowmeter-V3.0) with 2-gram sequential relations of API calls. In addition, we examine these features in the presented two-layer malware analysis framework. We also provide other captured features such as battery states, log states, packages, and process logs.
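
As a rough illustration of what 2-gram sequential relations of API calls look like, the sketch below counts consecutive pairs in an ordered call trace. The function name `api_call_bigrams` and the example trace are assumptions made for illustration; the paper's actual feature encoding may differ.

```python
from collections import Counter
from typing import List


def api_call_bigrams(calls: List[str]) -> Counter:
    """Count consecutive pairs (2-grams) in an ordered API call sequence."""
    # zip(calls, calls[1:]) pairs each call with the one that follows it.
    return Counter(zip(calls, calls[1:]))


# Illustrative trace only; not taken from the dataset.
trace = [
    "getDeviceId",
    "openConnection",
    "sendTextMessage",
    "openConnection",
    "sendTextMessage",
]
bigrams = api_call_bigrams(trace)
```

Each distinct pair can then serve as one feature, with its count as the value, alongside the network-flow features.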

The full research paper outlining the details of the dataset and its underlying principles:
- Laya Taheri, Andi Fitriah Abdulkadir, Arash Habibi Lashkari, "Extensible Android Malware Detection and Family Classification Using Network-Flows and API-Calls", The IEEE (53rd) International Carnahan Conference on Security Technology, India, 2019

For more information and to download this dataset, visit this page.

1. 8. Distributed Denial of Service (CICDDoS2019)
2019
The final dataset includes 12 DDoS attacks (NTP, DNS, LDAP, MSSQL, NetBIOS, SNMP, SSDP, UDP, UDP-Lag, WebDDoS, SYN, and TFTP) on the training day and 7 attacks (PortScan, NetBIOS, LDAP, MSSQL, UDP, UDP-Lag, and SYN) on the testing day. The attack side uses a third-party infrastructure, and the victim organization has 4 machines and 1 server. The dataset includes the captured network traffic along with 80 features extracted from it using CICFlowmeter-V3.0.
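
When planning an evaluation, it can help to see which testing-day attacks also occur on the training day. The short sketch below does this with set arithmetic on the attack names listed above; the variable names are chosen here for illustration.

```python
# Attack names exactly as listed in the dataset description.
TRAINING_DAY = {
    "NTP", "DNS", "LDAP", "MSSQL", "NetBIOS", "SNMP",
    "SSDP", "UDP", "UDP-Lag", "WebDDoS", "SYN", "TFTP",
}
TESTING_DAY = {"PortScan", "NetBIOS", "LDAP", "MSSQL", "UDP", "UDP-Lag", "SYN"}

# Attacks a model sees on both days.
seen_in_both = TRAINING_DAY & TESTING_DAY
# Attacks that appear only on the testing day (unseen during training).
only_in_testing = TESTING_DAY - TRAINING_DAY
# Attacks that appear only on the training day.
only_in_training = TRAINING_DAY - TESTING_DAY

print(sorted(only_in_testing))
```

By this listing, PortScan is the only testing-day attack with no training-day counterpart.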

The full research paper outlining the details of the dataset and its underlying principles:
- Iman Sharafaldin, Arash Habibi Lashkari, Saqib Hakak, and Ali A. Ghorbani, "Developing Realistic Distributed Denial of Service (DDoS) Attack Dataset and Taxonomy", IEEE 53rd International Carnahan Conference on Security Technology, Chennai, India, 2019

For more information and to download this dataset, visit this page.

1. 7. Intrusion Detection and Prevention Dataset (CSE-CIC-IDS 2018)
2018
The final dataset includes seven different attack scenarios: Brute-force, Heartbleed, Botnet, DoS, DDoS, Web attacks, and infiltration of the network from inside. The attacking infrastructure includes 50 machines, and the victim organization has 5 departments with 420 machines and 30 servers. The dataset includes the captured network traffic and system logs of each machine, along with 80 features extracted from the captured traffic using CICFlowmeter-V3.0.

The full research papers outlining the details of the dataset and its underlying principles:

- Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani, "Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization", 4th International Conference on Information Systems Security and Privacy (ICISSP), Portugal, January 2018.

- Gurdip Kaur, Arash Habibi Lashkari and Abir Rahali, "Intrusion Traffic Detection and Characterization using Deep Image Learning", The 5th Cyber Science and Technology Congress (2020) (CyberSciTech 2020), Vancouver, Canada, August 2020.

For more information and to download this dataset, contact AWS.


1. 6. Intrusion Detection and Prevention Dataset (CICIDS 2017)
2017
Intrusion Detection Systems (IDSs) and Intrusion Prevention Systems (IPSs) are the most important defense tools against sophisticated and ever-growing network attacks. Due to the lack of reliable test and validation datasets, anomaly-based intrusion detection approaches suffer from a lack of consistent and accurate performance evaluations. Our evaluation of the eleven datasets available since 1998 shows that most are out of date and unreliable. Some of these datasets lack traffic diversity and volume, some do not cover the variety of known attacks, and others anonymize packet payload data and so cannot reflect current trends. Some also lack a feature set and metadata. The CICIDS2017 dataset contains benign traffic and the most up-to-date common attacks, and resembles true real-world data (PCAPs). It also includes the results of network traffic analysis using CICFlowmeter-V3.0, with flows labeled based on the time stamp, source and destination IPs, source and destination ports, protocols, and attack (CSV files).
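
A first step with labeled flow CSVs of this kind is usually to check the label distribution. The sketch below does so with the standard library only. The column names, the sample rows, and the function `count_labels` are assumptions made for illustration; the real files may use different headers, so adjust `label_column` accordingly.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only. The column names are assumptions and
# the addresses come from reserved documentation ranges.
SAMPLE_CSV = """timestamp,src_ip,dst_ip,src_port,dst_port,protocol,label
2000-01-01 00:00:01,192.0.2.10,198.51.100.5,50000,80,6,BENIGN
2000-01-01 00:00:02,203.0.113.7,198.51.100.5,50001,80,6,ATTACK
2000-01-01 00:00:03,192.0.2.11,198.51.100.9,50002,53,17,BENIGN
"""


def count_labels(csv_text: str, label_column: str = "label") -> Counter:
    """Return how many flows carry each label."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)


label_counts = count_labels(SAMPLE_CSV)
```

For a file on disk, the same reader can wrap an open file handle instead of `io.StringIO`.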

The full research papers outlining the details of the dataset and its underlying principles:

- Iman Sharafaldin, Arash Habibi Lashkari, and Ali A. Ghorbani, "Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization", 4th International Conference on Information Systems Security and Privacy (ICISSP), Portugal, January 2018.

- Gurdip Kaur, Arash Habibi Lashkari and Abir Rahali, "Intrusion Traffic Detection and Characterization using Deep Image Learning", The 5th Cyber Science and Technology Congress (2020) (CyberSciTech 2020), Vancouver, Canada, August 2020.

For more information and to download the dataset, visit this page.


1. 5. Android Malware Dataset (CICAndMal2017)
2017
We propose our new Android malware dataset, named CICAndMal2017. In this approach, we run both our malware and benign applications on real smartphones to avoid the runtime behavior modification of advanced malware samples that are able to detect an emulator environment. We collected more than 10,854 samples (4,354 malware and 6,500 benign) from several sources. The more than six thousand benign apps were collected from the Google Play market and were published in 2015, 2016, and 2017. In this dataset, we installed 5,000 of the collected samples (426 malware and 5,065 benign) on real devices. Our malware samples in the CICAndMal2017 dataset are classified into four categories: Adware, Ransomware, Scareware, and SMS Malware. Our samples come from 42 unique malware families.

The full research paper outlining the details of the dataset and its underlying principles:
- Arash Habibi Lashkari, Andi Fitriah A.Kadir, Laya Taheri, and Ali A. Ghorbani, “Toward Developing a Systematic Approach to Generate Benchmark Android Malware Datasets and Classification”, In the proceedings of the 52nd IEEE International Carnahan Conference on Security Technology (ICCST), Montreal, Quebec, Canada, 2018.

For more information and to download the dataset, visit this page.


1. 4. Android Adware and General Malware Dataset (AAGM)
2017
Sophisticated and advanced Android malware is able to identify the presence of the emulator used by the malware analyst and, in response, alter its behavior to evade detection. To overcome this issue, we installed the Android applications on a real device and captured its network traffic. The AAGM dataset is captured by installing the Android apps on real smartphones in a semi-automated manner. The dataset is generated from 1900 applications in the following three categories:
- Android Adware (250 apps): Airpush, Dowgin, Kemoge, Mobidash, Shuanet
- General Android Malware (150 apps): AVpass, FakeAV, FakeFlash/FakePlayer, GGtracker, Penetho
- Benign (1500 apps): 2015 and 2016 GooglePlay market (top free popular and top free new).

The full research paper outlining the details of the dataset and its underlying principles:
- Arash Habibi Lashkari, Andi Fitriah A.Kadir, Hugo Gonzalez, Kenneth Fon Mbah and Ali A. Ghorbani, “Towards a Network-Based Framework for Android Malware Detection and Characterization”, In the proceedings of the 15th International Conference on Privacy, Security and Trust, PST, Calgary, Canada, 2017.

For more information and to download the dataset, visit this page.


1. 3. URL dataset (ISCX-URL-2016)
2016
The Web has long been a major platform for online criminal activities, and URLs are the main vehicle in this domain. To counter this issue, the security community has focused its efforts on developing techniques, mostly for blacklisting malicious URLs. While successful in protecting users from known malicious domains, this approach solves only part of the problem. New malicious URLs that spring up all over the web in masses commonly get a head start in this race. Besides that, Alexa-ranked, trusted websites may convey compromised fraudulent URLs, called defacement URLs. We explore a lightweight approach to the detection and categorization of malicious URLs according to their attack type, and show that lexical analysis is effective and efficient for proactive detection of these URLs. We also study the effect of obfuscation techniques on malicious URLs to determine which type of obfuscation technique is targeted at a specific type of malicious URL. We study five main types of URLs: Benign, Spam, Phishing, Malware, and Defacement.
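
Lexical analysis works on the URL string alone, without fetching the page. The sketch below extracts a handful of simple lexical features with Python's standard `urllib.parse` module. The feature names and the choice of features are illustrative assumptions, not the feature set used in the paper.

```python
from urllib.parse import parse_qs, urlparse


def lexical_features(url: str) -> dict:
    """Compute a few string-level features of a URL (illustrative set only)."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_length": len(parsed.path),
        "dots_in_host": host.count("."),
        "digit_count": sum(ch.isdigit() for ch in url),
        "query_param_count": len(parse_qs(parsed.query)),
    }


features = lexical_features("http://example.com/a/b?x=1&y=2")
```

Feature vectors like this one could then be fed to any classifier to separate the five URL types.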

The full research paper outlining the details of the dataset and its underlying principles:
- Mohammad Saiful Islam Mamun, Mohammad Ahmad Rathore, Arash Habibi Lashkari, Natalia Stakhanova and Ali A. Ghorbani, "Detecting Malicious URLs Using Lexical Analysis", Network and System Security, Springer International Publishing, pp. 467-482, 2016.

For more information and to download the dataset, visit this page.


1. 2. Tor-nonTor Network Traffic dataset
2016
To ensure the quantity and diversity of this dataset at CIC, we defined a set of tasks to generate a representative dataset of real-world traffic. We created three users for the browser traffic collection and two users for the communication parts such as chat, mail, FTP, and P2P. For the non-Tor traffic we used previous benign traffic from the VPN project, and for the Tor traffic we used 8 traffic categories: Browsing, Email, Chat, Audio-Streaming, Video-Streaming, FTP, VoIP, and P2P. The traffic was captured using Wireshark and tcpdump, generating a total of 22GB of data. To facilitate the labeling process, as explained in the related published paper, we captured the outgoing traffic at the workstation and the gateway simultaneously, collecting a set of pairs of .pcap files: one regular traffic pcap (workstation) and one Tor traffic pcap (gateway). Later, we labelled the captured traffic in two steps. First, we processed the .pcap files captured at the workstation: we extracted the flows and confirmed that the majority of traffic flows were generated by application X (Skype, ftps, etc.), the object of the traffic capture. Then, we labelled all flows from the Tor .pcap file as X. ISCXFlowMeter was written in Java to read the pcap files and create the csv file based on selected features. The dataset consists of labeled network traffic, including full packets in pcap format and csv files (flows generated by CICFlowMeter), which are publicly available to researchers.
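
The two-step labeling procedure described above can be sketched roughly as follows. The function names, the strict-majority check, and the dictionary representation of a flow are assumptions made for illustration; the authors' actual tooling is not shown in this document.

```python
from collections import Counter
from typing import Dict, List


def majority_application(workstation_apps: List[str]) -> str:
    """Step 1: confirm which application generated most workstation flows."""
    if not workstation_apps:
        raise ValueError("no workstation flows to inspect")
    app, count = Counter(workstation_apps).most_common(1)[0]
    # Require a strict majority (an assumption; the text says "majority").
    if 2 * count <= len(workstation_apps):
        raise ValueError("no application accounts for the majority of flows")
    return app


def label_tor_flows(tor_flows: List[Dict], app: str) -> List[Dict]:
    """Step 2: label every flow from the paired Tor pcap with that application."""
    return [{**flow, "label": app} for flow in tor_flows]


# Hypothetical capture pair: application names per workstation flow,
# and opaque flow records extracted from the gateway (Tor) pcap.
workstation = ["Skype", "Skype", "Skype", "dns"]
tor_flows = [{"flow_id": 1}, {"flow_id": 2}]
labeled = label_tor_flows(tor_flows, majority_application(workstation))
```

The workstation capture is needed because the application cannot be read off the encrypted traffic seen at the gateway.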

The full research paper outlining the details of the dataset and its underlying principles:
- Arash Habibi Lashkari, Gerard Draper-Gil, Mohammad Saiful Islam Mamun and Ali A. Ghorbani, "Characterization of Tor Traffic Using Time Based Features", In the proceedings of the 3rd International Conference on Information System Security and Privacy, SCITEPRESS, Porto, Portugal, 2017.

For more information and to download the dataset, visit this page.


1. 1. VPN-nonVPN Network Traffic dataset
2015
To generate a representative dataset of real-world traffic at ISCX, we defined a set of tasks, ensuring that our dataset is rich enough in diversity and quantity. We created accounts for users Alice and Bob in order to use services like Skype, Facebook, etc. Below we provide the complete list of the different types of traffic and applications considered in our dataset for each traffic type (VoIP, P2P, etc.). We captured a regular session and a session over VPN, so we have a total of 14 traffic categories: VoIP, VPN-VoIP, P2P, VPN-P2P, etc. We also give a detailed description of the different types of traffic generated: Browsing, Email, Chat, Audio-Streaming, Video-Streaming, FTP, VoIP, and P2P. The traffic was captured using Wireshark and tcpdump, generating a total of 28GB of data. For the VPN, we used an external VPN service provider and connected to it using OpenVPN (UDP mode). To generate SFTP and FTPS traffic, we also used an external service provider, with Filezilla as the client. To facilitate the labeling process, all unnecessary services and applications were closed while capturing the traffic. (The only application executed was the objective of the capture, e.g., a Skype voice call or an SFTP file transfer.) We used a filter to capture only the packets whose source or destination IP is the address of the local client (Alice or Bob).
CICFlowMeter (formerly known as ISCXFlowMeter) was written in Java to read the pcap files and create the csv file based on selected features. The dataset consists of labeled network traffic, including full packets in pcap format and csv files (flows generated by CICFlowMeter), which are publicly available to researchers.
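
The capture filter mentioned above keeps only packets that involve the local client. The exact filter expression the authors used is not given here; with tcpdump, a `host` filter primitive matches packets whose source or destination is the given address. The sketch below expresses the same rule in Python over hypothetical packet records (the `src`/`dst` keys and the addresses are made up for illustration).

```python
from typing import Dict, Iterable, List

Packet = Dict[str, str]


def keep_client_packets(packets: Iterable[Packet], client_ip: str) -> List[Packet]:
    """Keep packets whose source or destination IP is the local client."""
    return [p for p in packets if client_ip in (p["src"], p["dst"])]


# Hypothetical packets; addresses come from reserved documentation ranges.
CLIENT = "192.0.2.10"
packets = [
    {"src": "192.0.2.10", "dst": "198.51.100.5"},   # sent by the client
    {"src": "198.51.100.5", "dst": "192.0.2.10"},   # received by the client
    {"src": "203.0.113.7", "dst": "198.51.100.5"},  # unrelated traffic
]
client_packets = keep_client_packets(packets, CLIENT)
```

Filtering at capture time, together with closing all other applications, is what lets each pcap be labeled with a single application.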

The full research paper outlining the details of the dataset and its underlying principles:
- Gerard Draper-Gil, Arash Habibi Lashkari, Mohammad Mamun, Ali A. Ghorbani, "Characterization of Encrypted and VPN Traffic Using Time-Related Features", In Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), pages 407-414, Rome, Italy, 2016.

For more information and to download the dataset, visit this page.



2. Open Source Projects:

2. 2. DNS over HTTPS (DoH) Analyzer (DoHLyzer)
2020
A set of tools to capture HTTPS traffic, extract statistical and time-series features from it, and analyze those features, with a focus on detecting and characterizing DoH (DNS-over-HTTPS) traffic.

For more information and to download the source code, visit this page.

2. 1. Network Traffic Analyzer (CICFlowMeter formerly known as ISCXFlowMeter)
2015
CICFlowmeter-V3.0 (formerly known as ISCXFlowMeter) is an open source project written in Java that reads pcap files and creates a csv file based on more than 80 network traffic features.
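
To give a feel for what flow-based feature extraction involves, the sketch below groups packets into flows by their 5-tuple and computes a few time-based features. It is written in Python rather than Java and is not CICFlowMeter's code: it treats each direction as a separate flow, ignores flow timeouts, and the tuple layout and feature names are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# (timestamp_seconds, src_ip, src_port, dst_ip, dst_port, protocol, length_bytes)
PacketRecord = Tuple[float, str, int, str, int, int, int]


def flow_features(packets: List[PacketRecord]) -> List[Dict]:
    """Group packets by 5-tuple and compute per-flow summary features."""
    flows = defaultdict(list)
    for ts, src, sport, dst, dport, proto, length in packets:
        flows[(src, sport, dst, dport, proto)].append((ts, length))

    rows = []
    for key, pkts in flows.items():
        pkts.sort()  # order packets by timestamp
        times = [t for t, _ in pkts]
        # Inter-arrival times between consecutive packets of the flow.
        gaps = [b - a for a, b in zip(times, times[1:])]
        rows.append({
            "flow_id": key,
            "packet_count": len(pkts),
            "total_bytes": sum(size for _, size in pkts),
            "duration": times[-1] - times[0],
            "mean_iat": mean(gaps) if gaps else 0.0,
        })
    return rows


# Made-up packets: three in one flow, one in another.
sample = [
    (0.0, "192.0.2.10", 50000, "198.51.100.5", 443, 6, 100),
    (0.5, "192.0.2.10", 50000, "198.51.100.5", 443, 6, 200),
    (1.5, "192.0.2.10", 50000, "198.51.100.5", 443, 6, 300),
    (0.2, "192.0.2.11", 50001, "198.51.100.9", 53, 17, 60),
]
rows = flow_features(sample)
```

Each resulting row corresponds to one line of a flow CSV; the real tool emits many more columns per flow.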

For more information and to download the source code, visit this page.
