
The April 2025 Blackout and What Real-Time Data Could Have Done

I studied energy engineering before I ever touched a line of code. I spent years thinking about power grids, frequency stability, and what happens when large amounts of generation disappear from the network in seconds. Then I spent the next decade building data infrastructure for some of the largest technology companies in the world. On April 28, 2025, those two worlds collided in a way that was genuinely hard to watch.

At 12:33 CEST, the Iberian grid lost fifteen gigawatts of generation capacity in five seconds. Roughly 60% of Spain's load was gone in the time it takes to read this sentence, and Spain and Portugal went dark for ten hours. Traffic lights stopped, data centers dropped off, and trains froze mid-route. Everything that modern life treats as a given stopped working simultaneously, with nobody knowing when it would come back.

I had been working on an article for Ververica about exactly this type of scenario for weeks before it happened. Watching the news that Monday was an uncomfortable experience.

How the grid actually works

Power grids were mostly designed in an era where almost all energy came from thermal sources: gas, coal, nuclear. Thermal generation is dispatchable, meaning you can increase or decrease output on demand, roughly the way you press a gas pedal. If demand rises, you feed more fuel into the system and output rises within seconds.

The fundamental constraint of any electrical grid is that supply and demand must match at every instant. Not approximately. Exactly. One additional person turning on a kettle means that somewhere on the network, generation must increase by roughly 1 kilowatt at that same moment. Grid operators achieve this by monitoring **grid frequency**: when more power is consumed than generated, frequency drops below the design value of 50 Hz in Europe. That drop is the signal to bring more generation online.

The tolerance here is tighter than most people realize. A frequency deviation of more than 0.5 Hz triggers automatic disconnections. Power electronics, industrial motors, and control systems are designed to shut down protectively at frequencies outside their design range, because running them otherwise risks permanent damage. This is why a frequency excursion that starts locally can cascade across a continent within seconds if it isn't contained.

What renewables changed

Renewable energy is not dispatchable. Solar and wind output cannot be increased on demand. Cloud cover can reduce a solar plant from 100% to 37% output in under a minute. This means the rest of the grid must constantly compensate for intermittency that cannot be fully predicted in the short term, even as forecast models have gotten very good at projecting output over hours and days.

To manage this, grids maintain reserve capacity: thermal plants that run at partial load, or not at all, specifically to be ready to ramp up if renewables drop or a line fails. Building and operating that reserve capacity is expensive, and most countries provide regulatory incentives to make it economically viable. When either primary production or reserves are insufficient, the result is a blackout.

Grids with high renewable penetration are fundamentally more volatile than the systems they replaced. The physical inertia of large spinning generators used to provide a natural buffer that slowed frequency drops enough for operators to respond. Inverter-based renewable generation does not provide that inertia. The same amount of lost generation causes a faster, steeper frequency drop than it would have twenty years ago.
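
The effect of inertia on the speed of a frequency drop can be estimated with the textbook swing-equation approximation, in which the initial rate of change of frequency (RoCoF) after losing generation ΔP is f0 · ΔP / (2 · H · S). The sketch below is illustrative only. The inertia constants, the 30 GVA system size, and the 1 GW loss are assumed values chosen to show the trend, not figures for the Iberian grid, and the constant-RoCoF estimate ignores governor response and load damping.

```python
def initial_rocof_hz_per_s(lost_mw, system_mva, inertia_h_s, f0_hz=50.0):
    """Initial rate of change of frequency (Hz/s) after a sudden generation loss.

    Swing-equation approximation: df/dt = -f0 * dP / (2 * H * S), where H is
    the aggregate inertia constant in seconds and S is the rating of the
    synchronized machines in MVA.
    """
    if system_mva <= 0 or inertia_h_s <= 0:
        raise ValueError("system_mva and inertia_h_s must be positive")
    return -f0_hz * lost_mw / (2.0 * inertia_h_s * system_mva)


def seconds_to_deviation(rocof_hz_per_s, deviation_hz=0.5):
    """Time to reach a given deviation if the initial RoCoF stayed constant."""
    if rocof_hz_per_s == 0:
        return float("inf")
    return deviation_hz / abs(rocof_hz_per_s)


if __name__ == "__main__":
    # Assumed scenarios: same 1 GW loss, same 30 GVA system, different inertia.
    scenarios = [("high inertia (H = 6 s)", 6.0), ("low inertia (H = 2 s)", 2.0)]
    for label, h in scenarios:
        rocof = initial_rocof_hz_per_s(
            lost_mw=1000.0, system_mva=30000.0, inertia_h_s=h
        )
        t = seconds_to_deviation(rocof)
        print(f"{label}: {rocof:.3f} Hz/s, 0.5 Hz deviation after {t:.1f} s")
```

Under these assumptions, cutting the inertia constant to a third triples the initial RoCoF and leaves a third of the time before the 0.5 Hz disconnection threshold is reached.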

What happened on April 28

The Spanish Transmission System Operator (REE) confirmed the following: a sudden 15-gigawatt drop in generation capacity in southeastern Spain caused grid frequency to fall below acceptable limits. Generating units connected via inverters then disconnected automatically to protect themselves from the abnormal frequency, which reduced generation further, which caused more units to disconnect. This adverse cascade took Spain and Portugal from normal operation to a full blackout in a matter of seconds.

The root cause is still under investigation. What is already clear is that the cascade was too fast for human intervention. No operator could have assessed the situation and issued corrective commands in the time available. The question is whether automated systems could have contained it, and if so, what those systems would need to look like.

Why milliseconds are the unit that matters

The **protective relays** and circuit breakers fitted to European grids operate in the 10 to 100 millisecond range. A human blink takes approximately 250 milliseconds. The devices that actually protect the grid finish their work in less than half that time, and often in a small fraction of it, so the entire detection-and-response cycle completes faster than you can close your eyes.

The problem is the monitoring layer above them. **SCADA systems**, which provide operators with visibility into generators and transmission equipment, poll their sensors at intervals of 1 to 2 seconds. By the time a SCADA display shows an anomaly, that anomaly is already 1 to 2 seconds old. Add the time for an operator to assess the situation and issue a command, and the round-trip from fault to response can exceed 5 seconds. In a fast cascade, that is far too late.

The gap is not a failure of operator skill. It is a fundamental architectural mismatch: the physical protection layer operates at millisecond scale, while the monitoring and decision layer operates at second scale. Closing that gap requires a different approach to data.
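
The size of that mismatch follows directly from the figures quoted above. The sketch below is plain arithmetic on the relay and SCADA ranges; the constant and function names are illustrative.

```python
RELAY_OPERATE_MS = (10, 100)    # protective relay operating range from the text
SCADA_POLL_MS = (1000, 2000)    # SCADA polling interval from the text


def speed_gap(monitor_ms, protection_ms):
    """Return (smallest, largest) ratio of monitoring interval to protection time."""
    smallest = min(monitor_ms) / max(protection_ms)
    largest = max(monitor_ms) / min(protection_ms)
    return smallest, largest


def worst_case_staleness_ms(poll_interval_ms):
    """An event just after a poll stays invisible until the next poll."""
    return poll_interval_ms


if __name__ == "__main__":
    low, high = speed_gap(SCADA_POLL_MS, RELAY_OPERATE_MS)
    print(f"monitoring is {low:.0f}x to {high:.0f}x slower than protection")
    for poll in SCADA_POLL_MS:
        print(f"poll every {poll} ms: data up to {worst_case_staleness_ms(poll)} ms old")
```

With these ranges, the monitoring layer is between 10 and 200 times slower than the protection layer it is meant to supervise.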

What streaming data changes

**Phasor Measurement Units (PMUs)** are specialized devices that sample grid conditions 30 to 60 times per second, one to two orders of magnitude faster than SCADA polling at 1 to 2 second intervals. They capture the instantaneous magnitude and phase angle of voltage and current waveforms, time-synchronized across the network via GPS. A PMU-equipped grid produces a continuous, high-resolution picture of what is happening at every instrumented node, updated many times per second.
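
To make "magnitude and phase angle" concrete, here is a minimal sketch of how a phasor can be estimated from one cycle of waveform samples using a single-cycle discrete Fourier transform. It is a simplification under stated assumptions: the samples span exactly one cycle at nominal frequency, and the 32 samples per cycle and the `synth_cycle` helper are hypothetical choices for illustration. A real PMU does considerably more, including filtering, off-nominal frequency handling, and GPS timestamping.

```python
import cmath
import math


def estimate_phasor(samples):
    """Estimate (RMS magnitude, phase angle in degrees) from one cycle of samples.

    Single-cycle DFT at the fundamental. Assumes the samples cover exactly one
    cycle at nominal frequency, with the angle referenced to a cosine that
    peaks at the first sample.
    """
    n = len(samples)
    if n < 4:
        raise ValueError("need at least 4 samples per cycle")
    acc = sum(x * cmath.exp(-2j * math.pi * k / n) for k, x in enumerate(samples))
    # Scaling by sqrt(2)/n converts the DFT bin into an RMS phasor.
    phasor = (math.sqrt(2) / n) * acc
    return abs(phasor), math.degrees(cmath.phase(phasor))


def synth_cycle(rms, angle_deg, n=32):
    """Hypothetical test helper: one cycle of an ideal cosine waveform."""
    peak = rms * math.sqrt(2)
    phi = math.radians(angle_deg)
    return [peak * math.cos(2 * math.pi * k / n + phi) for k in range(n)]


if __name__ == "__main__":
    magnitude, angle = estimate_phasor(synth_cycle(rms=230.0, angle_deg=30.0))
    print(f"magnitude = {magnitude:.2f} V RMS, angle = {angle:.2f} degrees")
```

Comparing the angles reported by two PMUs at the same GPS-aligned instant is what makes the measurement useful across a network: the angle difference between nodes is only meaningful if both were referenced to the same clock.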

The value of PMU data depends entirely on how fast you can process it. Storing it in a batch system and analyzing it every few minutes is useful for post-event forensics. Processing it in a streaming pipeline as it arrives is what enables real-time detection and response.

A streaming architecture built for this purpose works in four stages. It ingests PMU phasors via Apache Kafka at sub-100 millisecond intervals. It applies **Complex Event Processing (CEP)** rules to detect patterns such as a voltage sag combined with an angle jump within a defined time window. It runs online anomaly detection models alongside the rules to catch fault signatures that rules alone would miss. Finally, it triggers automated isolation commands over IEC 61850 GOOSE messaging within 5 to 10 milliseconds of detection. The entire cycle from fault to automated response can be completed in under 50 milliseconds: well inside the protective relay operating range, and about five times faster than a human blink.
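
The rule "voltage sag combined with angle jump within a defined time window" can be sketched in a few lines of plain Python. This is not the API of any particular stream processor; in a real deployment the pattern would typically be expressed in the stream processor's own CEP library. The class names, the per-unit sag threshold, the 10-degree jump, and the 100 millisecond window are all hypothetical values chosen for illustration, and readings are assumed to arrive in timestamp order from a single PMU.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class PhasorReading:
    timestamp_ms: int
    voltage_pu: float   # voltage magnitude in per-unit (1.0 = nominal)
    angle_deg: float    # phase angle in degrees


def wrapped_diff_deg(a, b):
    """Signed difference a - b, wrapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


class SagWithAngleJumpRule:
    """Fires when a voltage sag and an angle jump both occur inside the window.

    All thresholds are hypothetical defaults, not values from any grid code.
    """

    def __init__(self, sag_below_pu=0.9, jump_deg=10.0, window_ms=100):
        self.sag_below_pu = sag_below_pu
        self.jump_deg = jump_deg
        self.window_ms = window_ms
        self._window = deque()

    def process(self, reading):
        """Add one reading and return True if the rule currently matches."""
        self._window.append(reading)
        # Evict readings that have fallen out of the time window.
        while reading.timestamp_ms - self._window[0].timestamp_ms > self.window_ms:
            self._window.popleft()
        readings = list(self._window)
        sag = any(r.voltage_pu < self.sag_below_pu for r in readings)
        jump = any(
            abs(wrapped_diff_deg(cur.angle_deg, prev.angle_deg)) >= self.jump_deg
            for prev, cur in zip(readings, readings[1:])
        )
        return sag and jump


if __name__ == "__main__":
    rule = SagWithAngleJumpRule()
    stream = [
        PhasorReading(0, 1.00, 10.0),
        PhasorReading(20, 1.00, 10.5),
        PhasorReading(40, 0.85, 25.0),   # sag and a 14.5 degree jump
        PhasorReading(60, 0.84, 25.5),
        PhasorReading(300, 1.00, 25.6),  # earlier readings have expired
    ]
    for r in stream:
        if rule.process(r):
            print(f"t={r.timestamp_ms} ms: sag + angle jump detected")
```

Keeping the window as a small in-memory deque is what makes the check cheap enough to run on every reading; the state per PMU is bounded by the window length rather than by the history of the stream.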

What automated response looks like

Detection is only half the problem. Once a fault is detected, the system must isolate the affected segment, reconfigure the network topology to route power around it, and if necessary, execute a black-start sequence to restore generation after a full blackout. Each of these steps has a different latency target and a different set of dependencies.

Fault isolation happens at the substation level, via trip commands broadcast over the GOOSE protocol. Because GOOSE messages are multicast and pre-configured, they require no central polling and routinely achieve 5 to 10 millisecond latencies. Grid topology reconfiguration follows within 100 to 200 milliseconds, as an Advanced Distribution Management System computes alternate feeder paths and pushes switch-closing commands. Black start, if required, is a slower and more carefully sequenced process: dedicated generating units with self-start capability bring up the grid incrementally, stabilizing voltage and frequency at each step before energizing additional transmission segments.
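
The latency targets above can be laid out as a simple budget. The sketch below assumes the reconfiguration window is counted from the end of isolation and that the stages run back to back, which is one reading of the figures rather than a measured timeline. The `Stage` type and stage labels are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    min_ms: float
    max_ms: float


# Latency ranges quoted in the text. Black start is omitted because it is a
# sequenced procedure without a millisecond target.
RESPONSE_STAGES = [
    Stage("fault isolation (GOOSE trip)", 5, 10),
    Stage("topology reconfiguration (ADMS)", 100, 200),
]


def cumulative_timeline(stages):
    """Return (name, earliest_done_ms, latest_done_ms) for back-to-back stages."""
    timeline = []
    earliest = 0.0
    latest = 0.0
    for stage in stages:
        earliest += stage.min_ms
        latest += stage.max_ms
        timeline.append((stage.name, earliest, latest))
    return timeline


if __name__ == "__main__":
    for name, earliest, latest in cumulative_timeline(RESPONSE_STAGES):
        print(f"{name}: done between {earliest:.0f} and {latest:.0f} ms")
```

Read this way, isolation and rerouting together complete between 105 and 210 milliseconds after detection, still well below the time a polled monitoring layer needs just to notice the fault.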

The April 2025 cascade suggests that the isolation step worked as designed at individual nodes, but the system-level response was not fast enough to prevent the cascade from propagating. Whether better streaming infrastructure would have made a difference depends on details that the investigation will eventually clarify. What the event demonstrates clearly is that grids with high renewable penetration need monitoring and response architectures that match the speed of their physical protection layer.

A note on why I find this personally interesting

I wrote a longer technical version of this for Ververica's blog, focused on the architecture of a real-time streaming solution. This piece is the other side of it: the energy engineering background that makes me care about the problem in the first place.

Grid stability and real-time data processing are not topics that usually show up in the same person's background. I got lucky that way. The April 2025 blackout is a case study in what happens when the data infrastructure lags the physical system it is supposed to monitor. Closing that gap is a real engineering problem, and it is solvable. The technology exists. What is missing in most cases is the architecture to use it.
