Dataset: GridStratLLM: Agent Framework for Coordinated Cyberattacks on the Smart Grid with Large Language Models

Kellerer, Nicolai 1; Hagenmeyer, Veit 1
1 Institut für Automation und angewandte Informatik (IAI), Karlsruher Institut für Technologie (KIT)

Abstract (English):

A new cybersecurity threat emerges: Recent Large Language Models (LLMs) with advanced reasoning and tool calling enable even attackers lacking expert knowledge to coordinate large-scale attacks on Smart Grids (SG).
These LLMs can orchestrate multiple malware instances, select appropriate signals and deltas, and execute data-modification attacks on the S7 and Modbus protocols.
Thereby, the automatically generated attack progresses towards the targeted unsafe state and evades detection by the Intrusion Detection System (IDS).
To assess this emerging threat, we introduce GridStratLLM, a novel agent framework for coordinated attacks on industrial networks.
Furthermore, we evaluate attack plans generated by four frontier Large Language Models using the open-source Network Security Monitor (NSM) Zeek and a commercial NSM.
Finally, we contribute a dataset recorded in a Hardware-in-the-Loop (HIL) testbed to support the training of IDS solutions against these attacks.
The dataset spans 24 hours and 11 minutes and contains 436 attacks, 212 of which are coordinated.


Affiliated institution(s) at KIT: Institut für Automation und angewandte Informatik (IAI)
Publication type: Research data
Publication date: 19.05.2026
Creation date: 15.02.2026 - 03.03.2026
Identifier: DOI 10.35097/bx5337kcykte438h
KITopen-ID: 1000193145
HGF program: 46.23.02 (POF IV, LK 01) Engineering Security for Energy Systems
Embargo period: The research data will be freely accessible from 22.06.2026.
License: Creative Commons Attribution 4.0 International
Keywords: Attack Plan, LLM, Smart Grid, Modbus, S7, Data Modification
Readme

GridStratLLM Dataset

This dataset contains coordinated cyberattacks generated using the GridStratLLM agent framework against a hardware-in-the-loop testbed of a distributed generation environment.
It comprises one normal-operation dataset and five attack datasets, each generated with a different LLM.
Every dataset captures network traffic, process data from SCADA, log messages, and metadata from the attack scripts.

Paper: https://doi.org/10.1145/3765611.3815147
GridStratLLM source code: https://github.com/nbke/GridStratLLM

Each dataset directory contains:

  • attack_session_llm.parquet: LLM prompts, plans, chain-of-thought reasoning, token usage
  • attack_worker.parquet: Network interface info (MAC, IP, interface name)
  • packet_metadata.parquet: Packet metadata (timestamps, addresses, ports, protocol)
  • packets.pcap: Raw packet capture
  • attack_datamod_history.parquet: Packet modification log with delta values
  • attack_exec_steps.parquet: Attack execution timeline per worker
  • process_data.parquet: WinCC SCADA data
  • logs.parquet: Logs from PLC 1512 and PLC 1516

See network.json for a list of network devices. modbus.json contains a mapping of Modbus registers to signal names.
s7_connections.json contains all signals transmitted via S7.
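
As a quick sanity check after downloading, the per-directory file list above can be verified with a few lines of Python. This is a minimal sketch using only the standard library; the function name missing_files and the idea of checking every dataset directory against the same list are assumptions for illustration, not part of the published tooling.

```python
from pathlib import Path

# File names copied from the list above; the helper itself is illustrative.
EXPECTED_FILES = (
    "attack_session_llm.parquet",
    "attack_worker.parquet",
    "packet_metadata.parquet",
    "packets.pcap",
    "attack_datamod_history.parquet",
    "attack_exec_steps.parquet",
    "process_data.parquet",
    "logs.parquet",
)


def missing_files(dataset_dir):
    """Return the expected file names that are absent from dataset_dir."""
    root = Path(dataset_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

An empty list means the directory is complete. The check only looks at file names and does not validate the contents of the Parquet or PCAP files.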

Parquet Files

attack_session_llm.parquet

The structure of the plan column is explained in appendix E of the paper.
coordinated is true if an attack session uses multiple attack workers.
The column all_messages may be NULL due to a data capture issue.
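
The rule behind the coordinated flag can be restated as a small check: a session counts as coordinated when more than one distinct worker takes part in it. The sketch below assumes the input is a sequence of (session id, worker id) pairs; both names are hypothetical stand-ins, since the exact column names are not listed here.

```python
from collections import defaultdict


def coordinated_sessions(session_worker_pairs):
    """Map each session id to True if more than one distinct worker took part.

    session_worker_pairs is an iterable of (session_id, worker_id) pairs.
    Both identifiers are hypothetical; adapt them to the actual columns.
    """
    workers = defaultdict(set)
    for session_id, worker_id in session_worker_pairs:
        workers[session_id].add(worker_id)
    return {sid: len(ids) > 1 for sid, ids in workers.items()}
```

Comparing the result with the stored coordinated column may be a useful consistency check, provided the session and worker identifiers can be recovered from the Parquet files.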

packet_metadata.parquet

The entries in packet_metadata.parquet are in the same order as packets.pcap.
If the UUID of a packet (id in packet_metadata.parquet) is contained in the packet_id column in attack_datamod_history.parquet, then the packet originates from an attack script.
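
The labeling rule above amounts to a set-membership test. The following sketch assumes the two id columns have already been read into Python sequences of the same type (for example bytes); the function name label_attack_packets is illustrative only.

```python
def label_attack_packets(packet_ids, modified_packet_ids):
    """Return one boolean per packet: True if it originates from an attack script.

    packet_ids: values of the id column of packet_metadata.parquet, in file order.
    modified_packet_ids: values of the packet_id column of
        attack_datamod_history.parquet (duplicates are harmless).
    """
    modified = set(modified_packet_ids)  # set lookup keeps the pass linear
    return [pid in modified for pid in packet_ids]
```

Because the metadata rows are in the same order as packets.pcap, the label at position i also applies to the i-th packet of the capture.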

SCADA process data: process_data.parquet

PV:

  • Control Signals: on_off
  • Monitor Signals: temp_air, poa_direct, wind_speed, poa_diffuse, cell_temperature, inverter_ac_power, inverter_dc_power

Wind:

  • Control Signals: blade_rotation, rotation_speed
  • Monitor Signals: power, height, pressure, wind_speed_a, wind_speed_b, temperature_a, temperature_b

Battery:

  • Control Signals: on_off, target_power
  • Monitor Signals: current, voltage, temperature, state_of_charge, actual_charge_power

Log messages: logs.parquet

Log messages are in German and only sent when the signal value changes. Example:

Wertänderung "SysLogDaten".Inverter_ac_power Altwert: 239,0 aktueller Wert: 20,0 CPU:SECCPU16
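
In the example, "Wertänderung" means value change, "Altwert" the old value, and "aktueller Wert" the current value; numbers use a decimal comma. A minimal parsing sketch for messages of this shape follows. It is derived from the single example above, so other message variants (non-numeric values, different wording) are not covered, and the field names in the returned dictionary are chosen for illustration.

```python
import re

# Pattern modeled on the one example message; other variants may exist.
LOG_PATTERN = re.compile(
    r'Wertänderung\s+"(?P<block>[^"]+)"\.(?P<signal>\S+)\s+'
    r'Altwert:\s*(?P<old>-?\d+(?:,\d+)?)\s+'
    r'aktueller Wert:\s*(?P<new>-?\d+(?:,\d+)?)\s+'
    r'CPU:(?P<cpu>\S+)'
)


def parse_value_change(message):
    """Parse a value-change log message; return None if it does not match."""
    match = LOG_PATTERN.search(message)
    if match is None:
        return None
    return {
        "block": match["block"],
        "signal": match["signal"],
        # German decimal comma -> Python float
        "old_value": float(match["old"].replace(",", ".")),
        "new_value": float(match["new"].replace(",", ".")),
        "cpu": match["cpu"],
    }
```

Applied to the example message, the sketch yields the signal Inverter_ac_power with an old value of 239.0 and a new value of 20.0.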

DuckDB File: merged_datasets.duckdb

The combined merged_datasets.duckdb file contains all data from the Parquet files plus raw packet data.
Differences from the Parquet files:

  • process_data is split into four tables: wind_process_data, pv_process_data, battery_process_data, demand_process_data (one column per signal instead of JSON values).
  • attack_session_llm is renamed to attack_session.
  • attack_exec_steps is renamed to exec_steps and setup_duration is stored as an interval instead of a bigint.
  • packet_metadata is renamed to packets. The packets from the PCAP files are stored in the raw_packet BLOB column.
    The l2_flow_id, l3_flow_id, and l4_flow_id columns, which are always null in the Parquet files, are omitted.
  • id columns use uuid type instead of blob.
  • Categorical columns (model_name, transport, state, kind, etc.) use DuckDB enum types instead of string.
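
When working with the Parquet files rather than the DuckDB file, the blob ids can be converted to UUIDs for easier comparison and printing. The sketch below assumes each blob holds the 16 raw UUID bytes in standard big-endian order; that layout is an assumption and should be verified against the DuckDB file.

```python
import uuid


def blob_to_uuid(blob):
    """Interpret a 16-byte blob as a UUID (big-endian byte order assumed)."""
    if len(blob) != 16:
        raise ValueError(f"expected 16 bytes, got {len(blob)}")
    return uuid.UUID(bytes=bytes(blob))
```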

Column name prefix for process data tables:

  • C_: Control signals (commands sent to power plants)
  • M_: Monitor signals (measured values sent to SCADA)
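
The prefix convention makes it straightforward to separate commands from measurements once the column names of a process data table are known. A minimal sketch, assuming columns are named as prefix plus signal name (for example C_on_off or M_temp_air); columns without either prefix, such as a hypothetical timestamp column, are collected separately.

```python
def split_signal_columns(columns):
    """Group column names into control (C_), monitor (M_), and other columns."""
    groups = {"control": [], "monitor": [], "other": []}
    for name in columns:
        if name.startswith("C_"):
            groups["control"].append(name[2:])  # strip the prefix
        elif name.startswith("M_"):
            groups["monitor"].append(name[2:])
        else:
            groups["other"].append(name)
    return groups
```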

Funding

This research is supported in part by funding from the topic Engineering Secure Systems of the Helmholtz Association (HGF) and by KASTEL Security Research Labs (structure 46.23.02).

Type of research data: Dataset