Cybersecurity Datasets:
10. Android Malware Static Analysis (CCCS-CIC-AndMal2020)

A collaboration project with the Canadian Centre for Cyber Security (CCCS)
Android malware is one of the major threats on the internet and the root cause of various security problems. The Android malware industry is becoming increasingly disruptive, with almost 12,000 new Android malware instances every day. Detecting Android malware on smartphones is therefore an essential goal for the cyber community.

Android malware is one of the most serious threats on the internet and has seen an unprecedented upsurge in recent years. It remains an open challenge for cybersecurity experts. Many machine learning techniques are available to identify and classify Android malware, and deep learning has recently emerged as a prominent classification method for such samples.

This research work proposes a new, comprehensive, large-scale Android malware dataset named CCCS-CIC-AndMal-2020. The dataset includes 200K benign and 200K malware samples, totalling 400K Android apps, with 14 prominent malware categories and 191 eminent malware families.

Introduction
To generate a representative dataset, we collaborated with CCCS to capture 200K Android malware apps, which are labeled and characterized into their corresponding families. Benign Android apps (200K) are collected from the Androzoo dataset to balance the dataset. We collected 14 malware categories: adware, backdoor, file infector, no category, Potentially Unwanted Apps (PUA), ransomware, riskware, scareware, trojan, trojan-banker, trojan-dropper, trojan-sms, trojan-spy and zero-day.

A complete taxonomy of all the malware families of the captured malware apps is created by dividing them into eight categories: sensitive data collection, media, hardware, actions/activities, internet connection, C&C, antivirus, and storage & settings. The taxonomy is presented in the research paper mentioned under License (Section 5).

Capturing data and final dataset
CCCS helped us capture real-world Android malware apps for analysis. We used VirusTotal to determine the malware family of each sample and labeled the dataset by following a consensus of 70% of anti-virus engines, to make the labels more reliable. We searched for similar malware samples so that samples with similar characteristics are categorized together in the dataset. Table 1 presents the details of the 14 Android malware categories along with the number of respective families and samples in the dataset.
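
The 70% consensus rule for labeling can be sketched as follows. This is only an illustration under stated assumptions: the source does not say whether the 70% is measured over all engines or only over those that flag the sample, and the function name `consensus_family` and the example verdicts are hypothetical. Parsing of actual VirusTotal reports is left out.

```python
from collections import Counter
from typing import Optional


def consensus_family(av_labels: list[str], threshold: float = 0.70) -> Optional[str]:
    """Return the family named by at least `threshold` of the engines, else None.

    Assumption: the share is computed over every engine that returned a label.
    """
    if not av_labels:
        return None
    # Count votes per family, ignoring case differences between engines.
    family, votes = Counter(label.lower() for label in av_labels).most_common(1)[0]
    return family if votes / len(av_labels) >= threshold else None


# Hypothetical verdicts from ten engines for a single sample.
verdicts = ["smsreg"] * 8 + ["smspay", "skymobi"]
label = consensus_family(verdicts)
print(label)
```

A sample whose engines disagree too much gets no family label under this rule, which is one plausible way such samples could end up outside the named families.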

Table 1: Dataset details

Category - Number of families - Number of samples
Adware - 48 - 47,210
Backdoor - 11 - 1,538
File Infector - 5 - 669
No Category - _ - 2,296
PUA - 8 - 2,051
Ransomware - 8 - 6,202
Riskware - 21 - 97,349
Scareware - 3 - 1,556
Trojan - 45 - 13,559
Trojan-Banker - 11 - 887
Trojan-Dropper - 9 - 2,302
Trojan-SMS - 11 - 3,125
Trojan-Spy - 11 - 3,540
Zero-day - _ - 13,340
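
Because the categories in Table 1 differ widely in size, it can help to quantify the class imbalance before training a classifier. The sketch below uses only the sample counts from Table 1; the variable names are illustrative.

```python
# Sample counts per category, copied from Table 1.
samples = {
    "Adware": 47210,
    "Backdoor": 1538,
    "File Infector": 669,
    "No Category": 2296,
    "PUA": 2051,
    "Ransomware": 6202,
    "Riskware": 97349,
    "Scareware": 1556,
    "Trojan": 13559,
    "Trojan-Banker": 887,
    "Trojan-Dropper": 2302,
    "Trojan-SMS": 3125,
    "Trojan-Spy": 3540,
    "Zero-day": 13340,
}

total = sum(samples.values())
# Share of each category among all listed malware samples.
shares = {category: count / total for category, count in samples.items()}

largest = max(samples, key=samples.get)
smallest = min(samples, key=samples.get)
# Ratio between the most and least populated categories.
imbalance_ratio = samples[largest] / samples[smallest]

for category, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{category:15s} {samples[category]:7d} {share:6.1%}")
print(f"largest={largest}, smallest={smallest}, ratio={imbalance_ratio:.1f}")
```

Shares like these can feed class weights or stratified sampling, so that small categories such as File Infector are not drowned out by Riskware.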

The families of each malware category in Table 1, along with the number of captured samples, are presented below:

Adware
Sr. No. - Family - Number of captured samples
1 - dowgin - 2679
2 - adflex - 418
3 - admogo - 79
4 - adviator - 77
5 - adwo - 188
6 - airpush - 2242
7 - appad - 92
8 - appsgeyser - 60
9 - baiduprotect - 984
10 - batmobi - 458
11 - dianjin - 45
12 - dianle - 19
13 - domob - 103
14 - ewind - 1047
15 - feiwo - 108
16 - fictus - 349
17 - ganlet - 28
18 - adend - 301
19 - gmobi - 17
20 - hiddenad - 61
21 - hummingbad - 28
22 - igexin - 82
23 - inmobi - 330
24 - inoco - 5649
25 - kalfere - 113
26 - kuguo - 1015
27 - leadbolt - 233
28 - mobclick - 41
29 - mobidash - 1033
30 - mobisec - 117
31 - mulad - 171
32 - oimobi - 913
33 - shedun - 19036
34 - sprovider - 227
35 - viser - 31
36 - wooboo - 16
37 - xynyin - 44
38 - zdtad - 5694
39 - frupi - 43
40 - kyhub - 28
41 - stopsms - 26
42 - loki - 46
43 - kyview - 127
44 - pandaad - 50
45 - plague - 14
46 - accutrack - 7
47 - adcolony - 17
48 - gexin - 3

Backdoor
Sr. No. - Family - Number of captured samples
1 - kapuser - 15
2 - kmin - 24
3 - fobus - 171
4 - mobby - 119
5 - hiddad - 664
6 - moavt - 166
7 - androrat - 129
8 - dendroid - 48
9 - levida - 51
10 - pyls - 24
11 - droidkungfu - 50

File Infector
Sr. No. - Family - Number of captured samples
1 - commplat - 77
2 - leech - 99
3 - tachi - 45
4 - gudex - 14
5 - aqplay - 407

PUA
Sr. No. - Family - Number of captured samples
1 - apptrack - 92
2 - cauly - 27
3 - secapk - 1004
4 - umpay - 67
5 - wiyun - 11
6 - youmi - 529
7 - utchi - 139
8 - scamapp - 99

Ransomware
Sr. No. - Family - Number of captured samples
1 - masnu - 35
2 - congur - 252
3 - fusob - 67
4 - jisut - 820
5 - koler - 79
6 - lockscreen - 356
7 - slocker - 998
8 - smsspy - 3319

Riskware
Sr. No. - Family - Number of captured samples
1 - skymobi - 10229
2 - anydown - 57
3 - badpac - 45
4 - deng - 58
5 - dnotua - 36
6 - jiagu - 721
7 - metasploit - 28
8 - mobilepay - 1197
9 - remotecode - 36
10 - revmob - 806
11 - secneo - 27
12 - smspay - 28512
13 - smsreg - 50073
14 - talkw - 49
15 - tencentprotect - 144
16 - tordow - 7
17 - triada - 493
18 - wapron - 93
19 - nqshield - 46
20 - kingroot - 24
21 - wificrack - 15

Scareware
Sr. No. - Family - Number of captured samples
1 - avpass - 126
2 - mobwin - 23
3 - fakeapp - 1332

Trojan
Sr. No. - Family - Number of captured samples
1 - Autosms - 239
2 - coinge - 16
3 - droiddreamlight - 15
4 - gluper - 680
5 - hiddenapp - 157
6 - iconosys - 33
7 - lotoor - 661
8 - mobtes - 343
9 - mseg - 148
10 - qysly - 94
11 - rootnik - 474
12 - syringe - 99
13 - wkload - 143
14 - zbot - 85
15 - hyspu - 112
16 - basebridge - 63
17 - boogr - 218
18 - lovetrap - 48
19 - oveead - 30
20 - rusms - 27
21 - systemmonitor - 61
22 - uupay - 27
23 - wintertiger - 24
24 - typstu - 28
25 - blouns - 652
26 - autoins - 479
27 - cnsms - 3413
28 - gappusin - 766
29 - gedma - 11
30 - ginmaster - 130
31 - hypay - 360
32 - mytrackp - 1054
33 - subspod - 11
34 - walkfree - 15
35 - xinyinhe - 59
36 - drosel - 59
37 - uapush - 11
38 - uten - 9
39 - smsagent - 1166
40 - styricka - 833
41 - autoinst - 12
42 - noicondl - 33
43 - obtes - 5
44 - droiddream - 3
45 - hiddenap - 3

Trojan-Banker
Sr. No. - Family - Number of captured samples
1 - asacub - 260
2 - fakebank - 17
3 - faketoken - 52
4 - marcher - 87
5 - minimob - 56
6 - guerrilla - 256
7 - bankbot - 4
8 - gugi - 8
9 - svpeng - 68
10 - wroba - 9
11 - zitmo - 40

Trojan-Dropper
Sr. No. - Family - Number of captured samples
1 - locker - 1296
2 - rooter - 51
3 - xiny - 31
4 - boqx - 106
5 - hqwar - 118
6 - ramnit - 84
7 - ztorg - 500
8 - gorpo - 16

Trojan-SMS
Sr. No. - Family - Number of captured samples
1 - opfake - 368
2 - hipposms - 20
3 - podec - 13
4 - feejar - 56
5 - smsdel - 40
6 - plankton - 186
7 - jsmshider - 21
8 - smsbot - 42
9 - boxer - 87
10 - fakeinst - 2148
11 - vietsms - 13

Trojan-Spy
Sr. No. - Family - Number of captured samples
1 - spynote - 21
2 - kasandra - 29
3 - spyagent - 48
4 - spyoo - 13
5 - tekwon - 19
6 - sandr - 208
7 - qqspy - 27
8 - smforw - 1873
9 - smsthief - 1058
10 - smszombie - 52
11 - spydealer - 1

For benign Android apps, we used the Androzoo dataset, which currently contains more than eight million unique Android apps and is still growing. Androzoo collects apps from different sources, including the official Android market (Google Play), Anzhi, AppChina, 1mobile, and the Genome project dataset. It maintains a weekly updated list with detailed information about all the apps, and provides an HTTP API for downloading the unaltered APKs.

Static analysis
AndroidManifest.xml contains a lot of features that can be used for static analysis. The main extracted features include:

- Activities: An android activity is one screen of the android app's user interface
- Broadcast receivers and providers
- Metadata: Additional information stored in the manifest that can be accessed throughout the entire project
- Permissions requested by the application: They protect the privacy of the user and are needed to access sensitive user data (such as contacts and SMS)
- System features (such as camera and internet)
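
The features listed above can be read from a manifest with a few lines of code. The sketch below is a minimal illustration, not the extraction pipeline used for the dataset. It assumes the manifest has already been decoded to plain-text XML (inside an APK it is stored in a binary format), and the function name `extract_manifest_features` and the sample manifest are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace that Android uses for attributes such as android:name.
ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME = f"{{{ANDROID_NS}}}name"


def extract_manifest_features(manifest_xml: str) -> dict:
    """Collect component, permission and intent names from a decoded manifest."""
    root = ET.fromstring(manifest_xml)

    def names(tag: str) -> list:
        # Every element with this tag anywhere in the manifest, deduplicated.
        return sorted({el.get(NAME) for el in root.iter(tag) if el.get(NAME)})

    return {
        "package": root.get("package"),
        "activities": names("activity"),
        "services": names("service"),
        "receivers": names("receiver"),
        "providers": names("provider"),
        "permissions": names("uses-permission"),
        "system_features": names("uses-feature"),
        "intent_actions": names("action"),
        "intent_categories": names("category"),
        "meta_data": names("meta-data"),
    }


# Hypothetical, heavily shortened manifest for illustration only.
SAMPLE = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.fb.iwidget">
  <uses-permission android:name="android.permission.INTERNET"/>
  <uses-feature android:name="android.hardware.camera"/>
  <application>
    <activity android:name="com.fb.iwidget.MainActivity">
      <intent-filter>
        <action android:name="android.intent.action.MAIN"/>
        <category android:name="android.intent.category.LAUNCHER"/>
      </intent-filter>
    </activity>
    <service android:name="com.fb.iwidget.MainService"/>
    <receiver android:name="com.fb.iwidget.ActionReceiver"/>
    <meta-data android:name="android.appwidget.provider"/>
  </application>
</manifest>"""

features = extract_manifest_features(SAMPLE)
for key, value in features.items():
    print(key, value)
```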

Table 2 presents examples of static features extracted from the captured dataset.

Table 2: List of static features
Feature - Values
Package Name - "com.fb.iwidget"
Activities - "com.fb.iwidget.OverlayActivity", "org.acra.CrashReportDialog", "com.batch.android.BatchActionActivity", "com.fb.iwidget.MainActivity", "com.fb.iwidget.PreferencesActivity", "com.fb.iwidget.PickerActivity", "com.fb.iwidget.IntroActivity"
Services - "com.batch.android.BatchActionService", "com.fb.iwidget.MainService", "com.fb.iwidget.SnapAccessService"
Receivers/Providers - "com.fb.iwidget.ExpandWidgetProvider", "com.fb.iwidget.ActionReceiver"
Intents Actions - "android.accessibilityservice.AccessibilityService", "android.appwidget.action.APPWIDGET_UPDATE", "android.intent.action.BOOT_COMPLETED", "android.intent.action.CREATE_SHORTCUT", "android.intent.action.MAIN", "android.intent.action.MY_PACKAGE_REPLACED", "android.intent.action.USER_PRESENT", "android.intent.action.VIEW", "com.fb.iwidget.action.SHOULD_REVIVE"
Intents Categories - "android.intent.category.BROWSABLE", "android.intent.category.DEFAULT", "android.intent.category.LAUNCHER"
Permissions - "android.permission.ACCESS_NETWORK_STATE", "android.permission.CALL_PHONE", "android.permission.INTERNET", "android.permission.RECEIVE_BOOT_COMPLETED", "android.permission.SYSTEM_ALERT_WINDOW", "com.android.vending.BILLING", "android.permission.BIND_ACCESSIBILITY_SERVICE"
Meta-Data - "android.accessibilityservice", "android.appwidget.provider"
#Icons - 331
#Pictures - 0
#Videos - 0
#Audio files - 0
Size of the App - 4.2M

Dynamic analysis
For understanding the behavioral changes of these malware categories and families, six categories of features are extracted after executing the malware in an emulated environment. The main extracted features include:

- Memory: Memory features define activities performed by malware by utilizing memory.
- API: Application Programming Interface (API) features delineate the communication between two applications.
- Network: Network features describe the data transmitted and received between other devices in the network. They indicate foreground and background network usage.
- Battery: Battery features describe the access to battery wakelock and services by malware.
- Logcat: Logcat features count the log messages written for the functions performed by malware.
- Process: Process features count the interaction of malware with the total number of processes.

Table 3 presents the complete list of dynamic features extracted from dynamic execution of malware.

Table 3: List of dynamic features
Category - Feature
Memory - Memory_PssTotal, Memory_PssClean, Memory_SharedDirty, Memory_PrivateDirty, Memory_SharedClean, Memory_PrivateClean, Memory_SwapPssDirty, Memory_HeapSize, Memory_HeapAlloc, Memory_HeapFree, Memory_Views, Memory_ViewRootImpl, Memory_AppContexts, Memory_Activities, Memory_Assets, Memory_AssetManagers, Memory_LocalBinders, Memory_ProxyBinders, Memory_ParcelMemory, Memory_ParcelCount, Memory_DeathRecipients, Memory_OpenSSLSockets, Memory_WebViews
API - API_Process_android.os.Process_start, API_Process_android.app.ActivityManager_killBackgroundProcesses, API_Process_android.os.Process_killProcess, API_Command_java.lang.Runtime_exec, API_Command_java.lang.ProcessBuilder_start, API_JavaNativeInterface_java.lang.Runtime_loadLibrary, API_JavaNativeInterface_java.lang.Runtime_load, API_WebView_android.webkit.WebView_loadUrl, API_WebView_android.webkit.WebView_loadData, API_WebView_android.webkit.WebView_loadDataWithBaseURL, API_WebView_android.webkit.WebView_addJavascriptInterface, API_WebView_android.webkit.WebView_evaluateJavascript, API_WebView_android.webkit.WebView_postUrl, API_WebView_android.webkit.WebView_postWebMessage, API_WebView_android.webkit.WebView_savePassword, API_WebView_android.webkit.WebView_setHttpAuthUsernamePassword, API_WebView_android.webkit.WebView_getHttpAuthUsernamePassword, API_WebView_android.webkit.WebView_setWebContentsDebuggingEnabled, API_FileIO_libcore.io.IoBridge_open, API_FileIO_android.content.ContextWrapper_openFileInput, API_FileIO_android.content.ContextWrapper_openFileOutput, API_FileIO_android.content.ContextWrapper_deleteFile, API_Database_android.content.ContextWrapper_openOrCreateDatabase, API_Database_android.content.ContextWrapper_databaseList, API_Database_android.content.ContextWrapper_deleteDatabase, API_Database_android.database.sqlite.SQLiteDatabase_execSQL, API_Database_android.database.sqlite.SQLiteDatabase_deleteDatabase, API_Database_android.database.sqlite.SQLiteDatabase_getPath, API_Database_android.database.sqlite.SQLiteDatabase_insert, API_Database_android.database.sqlite.SQLiteDatabase_insertOrThrow, API_Database_android.database.sqlite.SQLiteDatabase_insertWithOnConflict, API_Database_android.database.sqlite.SQLiteDatabase_openDatabase, API_Database_android.database.sqlite.SQLiteDatabase_openOrCreateDatabase, API_Database_android.database.sqlite.SQLiteDatabase_query, API_Database_android.database.sqlite.SQLiteDatabase_queryWithFactory, 
API_Database_android.database.sqlite.SQLiteDatabase_rawQuery, API_Database_android.database.sqlite.SQLiteDatabase_rawQueryWithFactory, API_Database_android.database.sqlite.SQLiteDatabase_update, API_Database_android.database.sqlite.SQLiteDatabase_updateWithOnConflict, API_Database_android.database.sqlite.SQLiteDatabase_compileStatement, API_Database_android.database.sqlite.SQLiteDatabase_create, API_IPC_android.content.ContextWrapper_sendBroadcast, API_IPC_android.content.ContextWrapper_sendStickyBroadcast, API_IPC_android.content.ContextWrapper_startActivity, API_IPC_android.content.ContextWrapper_startService, API_IPC_android.content.ContextWrapper_stopService, API_IPC_android.content.ContextWrapper_registerReceiver, API_Binder_android.app.ContextImpl_registerReceiver, API_Binder_android.app.ActivityThread_handleReceiver, API_Binder_android.app.Activity_startActivity, API_Crypto_javax.crypto.spec.SecretKeySpec_$init, API_Crypto_javax.crypto.Cipher_doFinal, API_Crypto-Hash_java.security.MessageDigest_digest, API_Crypto-Hash_java.security.MessageDigest_update, API_DeviceInfo_android.telephony.TelephonyManager_getDeviceId, API_DeviceInfo_android.telephony.TelephonyManager_getSubscriberId, API_DeviceInfo_android.telephony.TelephonyManager_getLine1Number, API_DeviceInfo_android.telephony.TelephonyManager_getNetworkOperator, API_DeviceInfo_android.telephony.TelephonyManager_getNetworkOperatorName, API_DeviceInfo_android.telephony.TelephonyManager_getSimOperatorName, API_DeviceInfo_android.net.wifi.WifiInfo_getMacAddress, API_DeviceInfo_android.net.wifi.WifiInfo_getBSSID, API_DeviceInfo_android.net.wifi.WifiInfo_getIpAddress, API_DeviceInfo_android.net.wifi.WifiInfo_getNetworkId, API_DeviceInfo_android.telephony.TelephonyManager_getSimCountryIso, API_DeviceInfo_android.telephony.TelephonyManager_getSimSerialNumber, API_DeviceInfo_android.telephony.TelephonyManager_getNetworkCountryIso, API_DeviceInfo_android.telephony.TelephonyManager_getDeviceSoftwareVersion, 
API_DeviceInfo_android.os.Debug_isDebuggerConnected, API_DeviceInfo_android.content.pm.PackageManager_getInstallerPackageName, API_DeviceInfo_android.content.pm.PackageManager_getInstalledApplications, API_DeviceInfo_android.content.pm.PackageManager_getInstalledModules, API_DeviceInfo_android.content.pm.PackageManager_getInstalledPackages, API_Network_java.net.URL_openConnection, API_Network_org.apache.http.impl.client.AbstractHttpClient_execute, API_Network_com.android.okhttp.internal.huc.HttpURLConnectionImpl_getInputStream, API_Network_com.android.okhttp.internal.http.HttpURLConnectionImpl_getInputStream, API_DexClassLoader_dalvik.system.BaseDexClassLoader_findResource, API_DexClassLoader_dalvik.system.BaseDexClassLoader_findResources, API_DexClassLoader_dalvik.system.BaseDexClassLoader_findLibrary, API_DexClassLoader_dalvik.system.DexFile_loadDex, API_DexClassLoader_dalvik.system.DexFile_loadClass, API_DexClassLoader_dalvik.system.DexClassLoader_$init, API_Base64_android.util.Base64_decode, API_Base64_android.util.Base64_encode, API_Base64_android.util.Base64_encodeToString, API_SystemManager_android.app.ApplicationPackageManager_setComponentEnabledSetting, API_SystemManager_android.app.NotificationManager_notify, API_SystemManager_android.telephony.TelephonyManager_listen, API_SystemManager_android.content.BroadcastReceiver_abortBroadcast, API_SMS_android.telephony.SmsManager_sendTextMessage, API_SMS_android.telephony.SmsManager_sendMultipartTextMessage, API_DeviceData_android.content.ContentResolver_query, API_DeviceData_android.content.ContentResolver_registerContentObserver, API_DeviceData_android.content.ContentResolver_insert, API_DeviceData_android.content.ContentResolver_delete, API_DeviceData_android.accounts.AccountManager_getAccountsByType, API_DeviceData_android.accounts.AccountManager_getAccounts, API_DeviceData_android.location.Location_getLatitude, API_DeviceData_android.location.Location_getLongitude, 
API_DeviceData_android.media.AudioRecord_startRecording, API_DeviceData_android.media.MediaRecorder_start, API_DeviceData_android.os.SystemProperties_get, API_DeviceData_android.app.ApplicationPackageManager_getInstalledPackages, API__sessions
Network - Network_TotalReceivedBytes, Network_TotalReceivedPackets, Network_TotalTransmittedBytes, Network_TotalTransmittedPackets
Battery - Battery_wakelock, Battery_service
Logcat - Logcat_verbose, Logcat_debug, Logcat_info, Logcat_warning, Logcat_error, Logcat_total
Process - Process_total
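
Since every feature name in Table 3 begins with its category, the six feature groups can be recovered from the names alone. The sketch below assumes the feature names are available as a list of column names; the helper `group_by_category` and the short example list are illustrative and cover only a few of the features.

```python
from collections import defaultdict


def group_by_category(columns):
    """Group feature names by the text before the first underscore."""
    groups = defaultdict(list)
    for column in columns:
        category, _, _ = column.partition("_")
        groups[category].append(column)
    return dict(groups)


# A small subset of the feature names from Table 3.
columns = [
    "Memory_PssTotal",
    "Memory_HeapSize",
    "API_Command_java.lang.Runtime_exec",
    "API__sessions",
    "Network_TotalReceivedBytes",
    "Battery_wakelock",
    "Logcat_error",
    "Process_total",
]

groups = group_by_category(columns)
for category, names in groups.items():
    print(category, len(names))
```

Grouping this way makes it easy to train or evaluate on one feature category at a time, for example memory features only.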

License
You may redistribute, republish, and mirror the CCCS-CIC-AndMal-2020 dataset in any form. However, any use or redistribution of the data must include a citation to the CCCS-CIC-AndMal-2020 dataset and the following papers:

- David Sean Keyes, Beiqi Li, Gurdip Kaur, Arash Habibi Lashkari, Francois Gagnon, Frederic Massicotte, "EntropLyzer: Android Malware Classification and Characterization Using Entropy Analysis of Dynamic Characteristics", Reconciling Data Analytics, Automation, Privacy, and Security: A Big Data Challenge (RDAAPS), IEEE, Canada, ON, McMaster University, 2021

- Abir Rahali, Arash Habibi Lashkari, Gurdip Kaur, Laya Taheri, Francois Gagnon, and Frédéric Massicotte, "DIDroid: Android Malware Classification and Characterization Using Deep Image Learning", 10th International Conference on Communication and Network Security (ICCNS2020), Pages 70–82, Tokyo, Japan, November 2020

Acknowledgements
We thank the Mitacs Globalink Program for providing the Research Internship (GRI) opportunity and Harrison McCain Young Scholar Foundation funds from University of New Brunswick (UNB) for supporting this project. We also thank CCCS for sharing the malware samples of this dataset with us.
