Smart contracts are autonomous programs deployed on blockchain platforms such as Ethereum, enabling automated execution of digital agreements without intermediaries. However, insecure implementations in these immutable contracts have led to major system failures, financial thefts, and severe service disruptions. Common vulnerabilities, such as reentrancy, gas limit issues, and unprotected calls, can compromise contract behavior and open the door to exploitation.
The BCCC-SCsVul-2024 dataset introduces a large-scale, multi-label vulnerability benchmark specifically designed to support vulnerability detection, classification, and behavioral profiling of Solidity-based smart contracts. The dataset includes 111,897 contracts, each characterized by 193 structured features extracted using the advanced static analyzer SCsVulLyzer V2.0.
These features span bytecode, opcode, AST, ABI, and source-level semantics, allowing for fine-grained modeling of secure and insecure behaviors. Unlike prior versions, this dataset introduces more vulnerability classes and source diversity, enabling research in multi-label classification and deep explainability.
Four statistical measures—average, minimum, maximum, and standard deviation—are computed across all continuous features to highlight structural and semantic variabilities within the dataset.
Figure 1: Average
Figure 2: Minimum
Figure 3: Maximum
Figure 4: Standard deviation
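The four summary statistics can be reproduced for any numeric feature column with the Python standard library. The following is a minimal sketch on a small made-up feature table: the column names are hypothetical, and since the text does not say whether a population or sample standard deviation was used, the sketch assumes the population form.

```python
import statistics

def summarize(rows, columns):
    """Return {column: {"avg", "min", "max", "std"}} for numeric columns."""
    summary = {}
    for col in columns:
        values = [float(row[col]) for row in rows]
        summary[col] = {
            "avg": statistics.fmean(values),
            "min": min(values),
            "max": max(values),
            # Population standard deviation (an assumption, not stated in the source).
            "std": statistics.pstdev(values),
        }
    return summary

# Hypothetical rows; the real feature names may differ.
rows = [
    {"bytecode_entropy": 4.0, "loop_count": 0},
    {"bytecode_entropy": 5.0, "loop_count": 2},
    {"bytecode_entropy": 6.0, "loop_count": 4},
]
stats = summarize(rows, ["bytecode_entropy", "loop_count"])
```

Replacing `statistics.pstdev` with `statistics.stdev` gives the sample standard deviation instead.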
This research introduces a new dataset by collecting Solidity source code from a range of diverse and reputable sources, including SmartBugs, ESCs, Slither-audited smart contracts, and the SmartScan dataset. These repositories were selected based on their inclusion of high-quality and compromised smart contracts exhibiting various vulnerability types.
To ensure uniqueness and eliminate redundant entries, each contract is hashed with SHA-256 during preprocessing so that duplicates can be identified and removed. Each contract is then statically analyzed and converted into a numerical feature vector using SCsVulLyzer V2.0.
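Hash-based deduplication of this kind can be sketched as follows. The source does not say whether contracts are normalized (for example, whitespace or comments stripped) before hashing, so this sketch assumes the raw source text is hashed as-is; the function names are illustrative only.

```python
import hashlib

def contract_digest(source: str) -> str:
    """SHA-256 hex digest of the contract source, encoded as UTF-8."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()

def deduplicate(sources):
    """Keep the first occurrence of each distinct digest, preserving order."""
    seen = set()
    unique = []
    for src in sources:
        digest = contract_digest(src)
        if digest not in seen:
            seen.add(digest)
            unique.append(src)
    return unique

contracts = [
    "contract A { }",
    "contract B { }",
    "contract A { }",  # byte-identical duplicate of the first entry
]
unique_contracts = deduplicate(contracts)
```

Under this assumption, two contracts that differ only in formatting would hash differently and both be kept.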
The dataset follows a naturally imbalanced distribution that reflects real-world smart contract deployments. Table 1 summarizes the label-wise distribution of the dataset:
Table 1: BCCC-SCsVul-2024 dataset overview (the * indicates secure examples with no known vulnerabilities)
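A label-wise distribution like this one can be computed directly from per-contract label sets. The sketch below uses invented label names and a four-contract toy sample, not the dataset's actual labels or counts. Because a contract may carry several labels, the per-label shares can sum to more than 1.

```python
from collections import Counter

def label_distribution(label_sets):
    """Map each label to (count, share of contracts carrying that label)."""
    counts = Counter()
    for labels in label_sets:
        counts.update(labels)
    total = len(label_sets)
    return {label: (n, n / total) for label, n in counts.items()}

# Hypothetical labels for four contracts.
label_sets = [
    {"Reentrancy"},
    {"Reentrancy", "GasException"},
    {"Secure"},
    {"Secure"},
]
dist = label_distribution(label_sets)
```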
Each smart contract in this dataset is processed using SCsVulLyzer V2.0 to extract 193 normalized features, grouped into categories based on the source of information and analytical level:
• Bytecode-Level Features: Measure entropy and character-level distribution from compiled bytecode, capturing patterns indicative of code complexity or obfuscation.
• Opcode-Level Features: Frequency counts for EVM instructions such as CALL, REVERT, PUSH, JUMP, and SELFDESTRUCT, highlighting behavioral traits of contract logic.
• AST-Level Features: Include node types, exported symbols, child node counts, and source identifiers, enabling structural reasoning and program comprehension.
• ABI-Level Features: Length-based metrics on input/output definitions, constants, types, and mutability settings to reveal potential misuse in function interfaces.
• Solidity Source Features: Counts of loops, conditionals, external calls, public function declarations, line-type distributions (code, blank, comment, duplicate), and specific Solidity keywords (delegatecall, send, create2, etc.).
• Contract Metadata: Includes contract names, inheritance relationships, and function/event statistics.
These features enable training of both shallow and deep machine learning models for multi-class and multi-label vulnerability detection.
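As an illustration of the bytecode-level category, the sketch below computes Shannon entropy and a character frequency distribution over a hex-encoded bytecode string. It is a generic implementation of these measures, not the SCsVulLyzer V2.0 code; in particular, whether the tool works on hex characters or raw bytes is an assumption here.

```python
import math
from collections import Counter

def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; 0.0 for empty input."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def char_distribution(text: str) -> dict:
    """Relative frequency of each character in the input."""
    total = len(text)
    return {ch: n / total for ch, n in Counter(text).items()}

# Common opening bytes of Solidity-compiled bytecode (PUSH1 0x80 PUSH1 0x40 MSTORE).
sample = "6080604052"
entropy = shannon_entropy(sample)
distribution = char_distribution(sample)
```

Higher entropy means the characters are spread more evenly, which is one reason entropy is often used as a rough signal of complexity or obfuscation.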
Table 2: BCCC-SCsVul-2024 dataset overview (the * indicates secure examples with no known vulnerabilities)


Feature Extraction Methodology
The SCsVulLyzer V2.0 engine performs static inspection of each contract across multiple layers:
• Bytecode and Opcode: Generated via Solidity compilation and disassembled to extract entropy and operation profiles.
• AST and ABI: Extracted from compiler output (solc) to examine contract logic, interface design, and structure.
• Source Code: Parsed for control structures, loops, modifiers, code quality metrics, and dangerous pattern indicators.
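To make the opcode layer concrete, the following hand-rolled sketch walks hex-encoded EVM bytecode and counts instructions. It is not the SCsVulLyzer V2.0 disassembler: the opcode table is deliberately partial, all PUSH1 to PUSH32 variants are grouped under a single "PUSH" key as a simplification, and any trailing metadata the compiler appends is not treated specially.

```python
from collections import Counter

# Partial opcode table; enough for illustration, not the full EVM instruction set.
OPCODES = {
    0x00: "STOP", 0x52: "MSTORE", 0x56: "JUMP", 0x57: "JUMPI",
    0xF1: "CALL", 0xF4: "DELEGATECALL", 0xFD: "REVERT", 0xFF: "SELFDESTRUCT",
}

def opcode_counts(bytecode_hex: str) -> Counter:
    """Count opcodes in hex bytecode, skipping PUSH immediates."""
    if bytecode_hex.startswith("0x"):
        bytecode_hex = bytecode_hex[2:]
    code = bytes.fromhex(bytecode_hex)
    counts = Counter()
    i = 0
    while i < len(code):
        op = code[i]
        if 0x60 <= op <= 0x7F:      # PUSH1 .. PUSH32
            counts["PUSH"] += 1
            i += op - 0x5F          # skip the 1..32 immediate data bytes
        else:
            counts[OPCODES.get(op, "OTHER")] += 1
        i += 1                      # advance past the opcode itself
    return counts

# PUSH1 0x80, PUSH1 0x40, MSTORE, REVERT
counts = opcode_counts("0x6080604052fd")
# PUSH1 0xff: the 0xff byte is data here, not a SELFDESTRUCT instruction.
push_only = opcode_counts("60ff")
```

Skipping the immediates matters: a naive byte count would misread pushed data such as `0xff` as instructions and inflate counts for opcodes like SELFDESTRUCT.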
Features are normalized, and missing values are replaced with zero or null-safe defaults to maintain consistency across the dataset. Contracts with no detectable features in a given category are still included with valid placeholders to avoid structural misalignment in machine learning pipelines.
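The placeholder and normalization steps might look like the sketch below. The source does not name the normalization scheme or the order of the two steps, so min-max scaling applied after zero-filling is an assumption made for illustration, and the column names are hypothetical.

```python
def fill_missing(rows, columns, default=0.0):
    """Return rows in which every column exists; absent or None values become `default`."""
    return [
        {col: (row.get(col) if row.get(col) is not None else default) for col in columns}
        for row in rows
    ]

def min_max_normalize(rows, columns):
    """Scale each column to [0, 1]; constant columns map to 0.0."""
    result = [dict(row) for row in rows]
    for col in columns:
        values = [row[col] for row in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for row in result:
            row[col] = (row[col] - lo) / span if span else 0.0
    return result

columns = ["call_count", "abi_input_length"]  # hypothetical feature names
raw = [
    {"call_count": 2, "abi_input_length": 10},
    {"call_count": None},                      # no ABI features detected
    {"call_count": 4, "abi_input_length": 30},
]
filled = fill_missing(raw, columns)
normalized = min_max_normalize(filled, columns)
```

Every row ends up with the same set of keys, which is what keeps feature vectors aligned when they are stacked into a matrix.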
The BCCC-SCsVul-2024 dataset supports a wide range of research and engineering applications in the domain of smart contract security, including:
• Supervised classification of contracts based on vulnerability types
• Profiling contracts for behavioral or structural similarity
• Multi-label detection and correlation analysis among vulnerability types
• Feature selection and explainability experiments (e.g., SHAP, LIME)
• Detection model benchmarking using traditional classifiers and neural networks
• Evaluating static analysis and symbolic execution tools
• Evolutionary algorithm and genetic programming experiments for vulnerability search
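For the multi-label correlation use case, a simple starting point is to count how often pairs of vulnerability labels appear on the same contract. The sketch below assumes each contract's labels are available as a set; the label names are invented for illustration and are not the dataset's actual class names.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(label_sets):
    """Count how many contracts carry each unordered pair of labels."""
    pairs = Counter()
    for labels in label_sets:
        # Sorting gives each pair a canonical (a, b) order.
        for a, b in combinations(sorted(labels), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical label sets for three contracts.
label_sets = [
    {"Reentrancy", "GasException"},
    {"Reentrancy", "GasException", "CallToUnknown"},
    {"Secure"},
]
pairs = cooccurrence(label_sets)
```

Pair counts like these can then be normalized by per-label frequency to obtain a correlation-style measure.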
You may redistribute, republish, and mirror the BCCC-SCsVul-2024 dataset in any form. However, any use or redistribution of data must include a citation to the BCCC-SCsVul-2024 dataset and the following paper:
- Sepideh Hajihosseinkhani, Arash Habibi Lashkari, Ali Mizani, “Unveiling Smart Contracts Vulnerabilities: Toward Profiling Smart Contracts Vulnerabilities using Enhanced Genetic Algorithm and Generating Benchmark Dataset”, Blockchain: Research and Applications, Volume 5, December 2024, Article ID: 100253.
You can download this dataset from here.