Benchmark Datasets for Aggregation & Anomaly-Based Elastic Security Detections #186

Summary

To benchmark aggregation-based detections, threshold rules, and statistical/time-series anomaly detection in Elastic Security, we should evaluate against established open datasets across multiple domains:

  • 🌐 Network traffic
  • 🖥️ Host / system logs
  • 👤 User behavior (UEBA)
  • ☁️ Cloud audit logs
  • 📈 Time-series anomaly benchmarks

This issue proposes publicly available datasets suitable for:

  • 📊 Aggregation-based detection testing (threshold rules)
  • 📈 Statistical baselining and deviation detection
  • 📉 Time-series anomaly detection benchmarking
  • 🧪 Precision / recall evaluation using labeled data

🌐 Network Datasets


1️⃣ CSE-CIC-IDS2018

Type: Network traffic + labeled attack flows
Best For: Threshold detection, aggregation benchmarks, brute-force detection, exfiltration testing

🔗 https://www.unb.ca/cic/datasets/ids-2018.html


2️⃣ UNSW-NB15

Type: Network intrusion dataset

🔗 https://research.unsw.edu.au/projects/unsw-nb15-dataset


3️⃣ CTU-13 Botnet Dataset

Type: Botnet traffic captures

🔗 https://www.stratosphereips.org/datasets-ctu13


4️⃣ UGR'16 Dataset

Type: ISP-scale NetFlow dataset

🔗 https://nesg.ugr.es/nesg-ugr16/


☁️ Cloud / Audit Log Datasets

Cloud datasets are particularly useful for benchmarking:

  • API call frequency thresholds
  • Privilege escalation detection
  • Rare IAM activity
  • Geographic login anomalies
  • Cross-account access detection
  • Aggregation-based misuse detection

5️⃣ AWS Open Data Registry (CloudTrail & Related Logs)

Type: Public AWS datasets including CloudTrail-style audit logs

🔗 https://registry.opendata.aws/

Why Use It

  • Real-world cloud activity logs
  • API call records with timestamps and principals
  • Suitable for (see the query sketch below):
    • terms aggregation on userIdentity
    • API call count thresholds
    • Rare service usage detection
    • Geographic anomaly detection
    • Privilege escalation analysis
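
A minimal sketch of the first two bullets above, assuming the logs are already ECS-normalized and indexed under a hypothetical pattern logs-cloudtrail-*; the endpoint, index name, and threshold are all placeholders:

```python
from elasticsearch import Elasticsearch

# Hypothetical per-hour threshold for the "API call count" rule.
API_CALL_THRESHOLD = 1000

# terms aggregation keyed on the principal; min_doc_count drops buckets
# below the threshold, so only heavy hitters come back.
query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
    "aggs": {
        "calls_per_principal": {
            "terms": {
                "field": "user.name",
                "size": 100,
                "min_doc_count": API_CALL_THRESHOLD,
            }
        }
    },
}

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
# body= is the 7.x-style call; newer clients accept the same keys as kwargs.
resp = es.search(index="logs-cloudtrail-*", body=query)
for bucket in resp["aggregations"]["calls_per_principal"]["buckets"]:
    print(f"{bucket['key']}: {bucket['doc_count']} calls in the last hour")
```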

6️⃣ Rhino Security Labs – CloudGoat (Cloud Attack Scenarios)

Type: Open-source cloud attack simulation environment

🔗 https://github.com/RhinoSecurityLabs/cloudgoat

Why Use It

  • Simulated AWS attack scenarios
  • Generates realistic CloudTrail logs
  • Good for:
    • IAM privilege escalation detection
    • Misconfigured policy detection
    • Cross-account access anomaly detection
    • Threshold rule testing in cloud environments

7️⃣ Azure AD / Microsoft Audit Log Samples

Type: Publicly available Azure AD / M365 audit log samples

Example reference:
🔗 https://learn.microsoft.com/en-us/azure/active-directory/reports-monitoring/

Why Use It

  • Authentication logs
  • Role assignment logs
  • API access logs
  • Useful for (see the aggregation sketch below):
    • Failed login aggregation rules
    • Rare role assignment detection
    • Impossible travel detection
    • Privilege grant spike detection
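
As a sketch, the failed-login bullet can be exercised with a filtered date_histogram plus a per-user terms sub-aggregation. Field names assume ECS-normalized sign-in logs; the index pattern logs-azuread-* is a placeholder:

```python
# Hourly failed-login counts per account over the last day, useful for
# tuning threshold rules before turning them on.
failed_logins = {
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {"term": {"event.category": "authentication"}},
                {"term": {"event.outcome": "failure"}},
                {"range": {"@timestamp": {"gte": "now-24h"}}},
            ]
        }
    },
    "aggs": {
        "per_hour": {
            "date_histogram": {"field": "@timestamp", "fixed_interval": "1h"},
            "aggs": {"per_user": {"terms": {"field": "user.name", "size": 50}}},
        }
    },
}
```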

🖥️ Host / User Behavior Datasets


8️⃣ CERT Insider Threat Dataset

Type: Insider threat simulation

🔗 https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=508099


9️⃣ LANL Authentication Dataset

Type: Enterprise authentication logs

🔗 https://csr.lanl.gov/data/auth/


📈 Time-Series / Anomaly Detection Benchmarks


🔟 Numenta Anomaly Benchmark (NAB)

Type: Labeled real and synthetic time series with marked anomaly windows

🔗 https://github.com/numenta/NAB
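
A loading sketch, assuming a local clone of the repo; the CSV layout (timestamp,value columns) and labels/combined_windows.json follow NAB's documented structure, and the series name below is just one example file:

```python
import csv
import json
from datetime import datetime

SERIES = "realAWSCloudwatch/ec2_cpu_utilization_825cc2.csv"  # example series

# Each NAB data file is a two-column CSV: timestamp, value.
with open(f"NAB/data/{SERIES}") as f:
    rows = [(datetime.fromisoformat(r["timestamp"]), float(r["value"]))
            for r in csv.DictReader(f)]

# combined_windows.json maps each series path to its anomaly windows.
with open("NAB/labels/combined_windows.json") as f:
    windows = json.load(f)[SERIES]  # list of [start, end] timestamp pairs

def in_anomaly_window(ts: datetime) -> bool:
    """True if a timestamp falls inside any labeled anomaly window."""
    return any(datetime.fromisoformat(s) <= ts <= datetime.fromisoformat(e)
               for s, e in windows)
```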


Proposed Elastic Benchmark Plan

Step 1: Normalize to ECS

Map each dataset's fields to the Elastic Common Schema (ECS); a mapping sketch follows the list:

  • @timestamp
  • source.ip
  • destination.ip
  • user.name
  • event.action
  • event.category
  • cloud.account.id
  • cloud.provider
  • network.bytes
  • process.name
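
For CloudTrail-style input, the mapping could look like the sketch below. The left-hand keys are the ECS fields above; the right-hand paths follow CloudTrail's published JSON field names, and anything dataset-specific should be treated as a template:

```python
# Sketch: one raw CloudTrail record -> flat dict of ECS fields.
def cloudtrail_to_ecs(raw: dict) -> dict:
    identity = raw.get("userIdentity") or {}
    return {
        "@timestamp": raw.get("eventTime"),
        "source.ip": raw.get("sourceIPAddress"),
        "user.name": identity.get("userName") or identity.get("arn"),
        "event.action": raw.get("eventName"),
        "event.category": "api",  # assumption: coarse bucket for audit APIs
        "cloud.account.id": raw.get("recipientAccountId"),
        "cloud.provider": "aws",
    }
```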

Step 2: Detection Categories

Aggregation-Based Rules

  • API call count > threshold (CloudTrail)
  • Failed login count > threshold
  • Outbound byte sum > threshold (sketched below)
  • Rare IAM role usage
  • Rare process parent-child relationships
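
The byte-sum bullet needs a metric aggregation rather than a document count. A sketch using a sum sub-aggregation plus a bucket_selector pipeline aggregation to keep only sources over a placeholder 100 MB threshold:

```python
BYTES_THRESHOLD = 100 * 1024 * 1024  # placeholder: 100 MB per hour

exfil_query = {
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
    "aggs": {
        "per_source": {
            "terms": {"field": "source.ip", "size": 1000},
            "aggs": {
                "total_bytes": {"sum": {"field": "network.bytes"}},
                # Pipeline agg: discard buckets below the threshold.
                "over_threshold": {
                    "bucket_selector": {
                        "buckets_path": {"bytes": "total_bytes"},
                        "script": f"params.bytes > {BYTES_THRESHOLD}",
                    }
                },
            },
        }
    },
}
```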

Statistical / Time-Series Detection

  • Volume spikes vs baseline (see the z-score sketch below)
  • Cardinality deviation
  • Hourly/weekly seasonal anomalies
  • Population analysis (user vs peer group)
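
As a reference point against Elastic ML, a plain trailing-window z-score over hourly counts is enough to exercise the volume-spike case; counts_by_hour would come from a date_histogram like the ones above, and the window and z values are illustrative:

```python
from statistics import mean, stdev

def flag_spikes(counts_by_hour, window=168, z=3.0):
    """Yield (hour_index, count) pairs that sit more than z standard
    deviations above the trailing `window`-hour mean (168 h = one week,
    which also absorbs weekly seasonality)."""
    for i in range(window, len(counts_by_hour)):
        baseline = counts_by_hour[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and counts_by_hour[i] > mu + z * sigma:
            yield i, counts_by_hour[i]
```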

Step 3: Evaluation Metrics

  • Precision
  • Recall
  • False positive rate
  • Detection latency
  • Threshold sensitivity sweep
  • Anomaly score distribution
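
A sketch of the scoring side, given aligned per-event ground-truth labels and alert flags; a threshold sensitivity sweep is then just this function evaluated across a range of rule thresholds:

```python
def evaluate(labels, alerts):
    """Confusion-matrix metrics for aligned boolean label/alert sequences."""
    tp = sum(l and a for l, a in zip(labels, alerts))
    fp = sum(a and not l for l, a in zip(labels, alerts))
    fn = sum(l and not a for l, a in zip(labels, alerts))
    tn = sum(not l and not a for l, a in zip(labels, alerts))
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }
```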

Goal

Create a reproducible benchmark framework for evaluating:

  • Elasticsearch aggregation performance
  • Elastic Security threshold rules
  • EQL-based detection logic
  • Elastic ML anomaly detection
  • Cloud-specific detection engineering

Please comment if additional cloud providers (GCP, OCI, etc.) or SaaS audit logs should be included.
