The Missing Manual: Safety Engineering for Organizations Using LLMs

7 min read · Apr 16, 2025
Every engineer studies failure. Shouldn’t LLM integrators do the same?

It’s now been exactly 70 years since the term “artificial intelligence” was coined. AI first joined the technical vocabulary in 1955, primarily as a marketing term to distinguish the technology from cybernetics.

While countless applications of AI exist, LLMs offer some interesting advances — if they’re designed to amplify what’s best about human intelligence.

However, inflated expectations around LLMs often lead developers to force their systems to perform in ways that create short-term excitement — and unstable long-term outcomes.

As BlockScience’s Dr. Michael Zargham¹ noted in a recent presentation, civil engineers work on projects with a shared understanding of goals and of limits in the environment. They solve for those constraints while also understanding the situations in which the system will break. Consider this sign:

Would you drive a 24-ton truck over this bridge?

When civil engineers build bridges, they understand specific limits. They know how temperature, wind, and load interact over time, and how that affects the amount of weight the structure can safely carry. They post this knowledge publicly on the bridge, so drivers of cars and trucks know whether they can travel across it safely. It’s also why truck drivers are required to weigh in at specific stops along their route, so they can adjust travel plans accordingly.

Knowing Your AI’s Weight Limit: Engineering Principles for LLM Integration

Structural engineering projects are built to withstand far more weight than they’re rated for. If the rating is exceeded, there’s a buffer so that the failure is not catastrophic. That buffer also accounts for weakened materials, subcontractor shortcuts, and human error (both in construction and in sign reading). In short, failure modes are anticipated and designed for.

Every civil engineering student knows the classic bridge failure example of “Galloping Gertie”. Their education starts with an understanding of what happens when you design without failure in mind, and how people can get hurt.

On November 7, 1940, the bridge spanning the Tacoma Narrows strait of Puget Sound collapsed after moving dramatically up and down in high winds. The bridge earned the name “Galloping Gertie” because of the motion it made before collapsing.

Unfortunately, few software engineers are schooled in this approach, which is partly why OpenAI’s safety warnings almost seem like an afterthought:

No “weight limit”. That’s the message.

In that presentation on organizations and automation, my colleague Dr. Michael Zargham asked:

“Would you run an organization on an LLM without knowing its ‘weight limit’?”

This parking garage clearance bar gently warns drivers of the height limit before a vehicle gets in too deep.

Many AI safety challenges have parallels in engineering fields like aerospace, robotics, and industrial systems. Engineers in these domains have developed sophisticated approaches (sketched in code after this list), such as:

  • Constraint design: Limiting action spaces to prevent harmful outcomes
  • Monitoring: Creating ways to visualize system operations
  • Kill switches: Implementing emergency protocols when systems operate outside safe parameters
  • Resource management: Preventing runaway processes from consuming excessive resources
  • Graceful degradation: Shifting the system to an analog or human-based process once it reaches agreed-upon suboptimal outcomes.
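
To make the analogy concrete, here is a minimal, hypothetical sketch (in Python) of how those approaches might map onto an LLM integration. The names used here (GuardedLLM, call_model, human_fallback) are invented for illustration and do not refer to any existing library; a real deployment would need far more rigor.

```python
import time

# Hypothetical sketch only: GuardedLLM is not a real library. It shows how
# constraint design, monitoring, kill switches, resource management, and
# graceful degradation might wrap an LLM call inside an organization's system.


class KillSwitchTripped(Exception):
    """Raised when the system operates outside agreed-upon safe parameters."""


class GuardedLLM:
    def __init__(self, call_model, allowed_actions, max_calls_per_minute=30,
                 min_confidence=0.7, human_fallback=None):
        self.call_model = call_model                 # callable: prompt -> (text, confidence)
        self.allowed_actions = set(allowed_actions)  # constraint design: explicit action space
        self.max_calls_per_minute = max_calls_per_minute  # resource management
        self.min_confidence = min_confidence
        self.human_fallback = human_fallback         # graceful degradation path
        self.call_log = []                           # monitoring: timestamps and confidences

    def ask(self, prompt, action):
        # Constraint design: refuse anything outside the allowed action space.
        if action not in self.allowed_actions:
            raise KillSwitchTripped(f"Action '{action}' is outside the allowed action space")

        # Resource management: prevent runaway loops from consuming the call budget.
        now = time.time()
        recent = [t for t, _ in self.call_log if now - t < 60]
        if len(recent) >= self.max_calls_per_minute:
            raise KillSwitchTripped("Per-minute call budget exceeded; pausing automation")

        text, confidence = self.call_model(prompt)
        self.call_log.append((now, confidence))      # monitoring: keep a visible trail

        # Graceful degradation: below the agreed confidence, hand off to a human.
        if confidence < self.min_confidence and self.human_fallback is not None:
            return self.human_fallback(prompt, text, confidence)
        return text
```

The point is not this particular wrapper, but that each approach on the list becomes an explicit, testable decision rather than an afterthought.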

Design AI Systems With an Understanding of Safety

There are many kinds of automated systems in the world.

An escalator is an automated system. It’s also dangerous when it breaks. Loose-fitting clothing can get caught in the grooves of the steps. Dogs’ paws can get hurt in them. A well-designed escalator incorporates these risks into how it’s run, including safety buttons so that people can easily stop the system in an emergency.

It’s also become a social norm that people learn how to use an escalator with each other. Parents teach kids how to get on and off safely. Public safety videos and dog enthusiasts warn people to carry their dogs up escalators, or to take an elevator instead.

Many companies are rushing to add AI to their products but haven’t configured those features to fail safely. Google’s AI, powered by Gemini, for instance, has been under fire “for giving dangerous and wrong answers”.

Part of the problem lies in Gemini’s design: after entering a search query on Google, you read a huge chunk of AI summary that takes up half of the top of the page — text that quite possibly includes wrong information — and only at the very bottom of the output do you get a half-hearted disclaimer that the foregoing could be wrong (“Generative AI is experimental”).

Google often presents AI-generated information in a way that appears highly trustworthy.

At the time of writing this article, Google included a very small “thumbs up or thumbs down” feedback option, as if Gemini were an airport bathroom whose cleanliness you were reporting on. There’s no way to contribute any real information, and no way to know whether someone on Google’s end will fix the problem.

There was no richer feedback mechanism than that, and it was difficult to even reach it before reading the often-incorrect information above it.

This has allowed benign jokes to propagate into Google’s LLM, like the supposed existence of “blinker fluid,” a prank errand often assigned to new hires at car repair shops. The blinker fluid joke will only cost the junior hire a few minutes of confusion — a hallucinated answer with wrong information about an actual car repair could damage a vehicle.

Many other LLM hallucinations, however, can be much more dangerous. And while LLM providers may try as hard as they can to design safety into the system, the organizations using LLMs typically aren’t trained in how to do safe integrations.

Creating Simulations before Deploying Live

Airplane designers put their craft through an entire suite of tests before they let airplanes fly with passengers. Even then, companies can rush products to market, with dire results.

The Boeing 737 MAX killed people because basic engineering safety approaches were not designed into the system. The MCAS system, designed to prevent stalls, relied on a single angle-of-attack sensor. If that sensor provided incorrect data, MCAS would automatically push the nose down, potentially causing a crash. Pilots weren’t trained to handle MCAS malfunctions, and the list of failures goes on.

Electrical and mechanical engineers know to run their systems in simulation so they can explore how designs and materials behave before building prototypes. Each of these early products must then be tested and certified against safety standards before going to market.

Circuit board design relies on simulation: the SPICE program helps people design and test circuits before building them in real life.

The SPICE Analysis tool allows people to run circuit simulations in different configurations before deployment.

Organizations rushing to deploy LLMs to look like they’re “with it” often don’t employ simulation tools or test methods to determine when to integrate AI features. Many companies are so afraid of falling behind that they inflict untested AI slop on their unwitting (and often unwilling) audiences. (Does anyone enjoy seeing a bright and shiny AI prompt box inside Adobe Acrobat Pro?)

A SPICE-like system for an LLM might look like this: a simulation of the automated system could be run against a set of varying inputs, so the system is exercised under many different conditions; engineers could then work on the output paths and flows that led to undesired outcomes.

Outputs could be color-labeled with a confidence level, so they could be adjusted to improve the resulting output, or serve as breakpoints that hand off to high-quality live customer support.
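
As a rough sketch of what that could look like in practice, the following Python is a hypothetical harness, not a real tool: run_llm_system and score_confidence stand in for whatever model call and evaluation method an organization actually uses, and the thresholds are placeholders the team would set for itself.

```python
# Hypothetical "SPICE for LLMs" harness, sketch only. run_llm_system and
# score_confidence are assumed stand-ins, not real APIs.

def simulate(run_llm_system, score_confidence, test_inputs,
             green_threshold=0.9, yellow_threshold=0.7):
    """Run the automated system over varied inputs and color-label each output."""
    results = []
    for prompt in test_inputs:
        output = run_llm_system(prompt)
        confidence = score_confidence(prompt, output)  # 0.0 to 1.0
        if confidence >= green_threshold:
            band = "green"    # ship as-is
        elif confidence >= yellow_threshold:
            band = "yellow"   # adjust prompts or flow paths, then re-test
        else:
            band = "red"      # breakpoint: route this path to a human
        results.append({"prompt": prompt, "output": output,
                        "confidence": confidence, "band": band})
    return results


# Example sweep: vary phrasing, edge cases, and prank inputs, the way a
# circuit simulation sweeps component values before a board is fabricated.
# report = simulate(run_llm_system, score_confidence, [
#     "How do I reset my router?",
#     "how do i reset my router???",
#     "Where can I buy blinker fluid?",
# ])
```

The input sweep plays the role that parameter sweeps play in circuit simulation: vary phrasing, edge cases, and adversarial prompts before deployment, and inspect every path that lands outside the green band.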

Different ways of visualizing data can help with analysis in different domains. Image from Artificial Intelligence in Medicine, Volume 140, June 2023, 102543.

One example of this is a color-coded visualization system, developed in 2023, applied to a machine learning model studying Alzheimer’s disease; researchers found that the visual output improved the likelihood of an accurate diagnosis.

Final Thoughts

Many AI organizations are themselves obsessed with safety and very concerned with implementing it well — but that doesn’t mean the organizations basing their operations on AI understand how to model safety or test output themselves.

When an organization like Google, which first built its brand on delivering trustworthy search results, hastily integrates untested and inaccurate LLM systems into those results — which will be seen by millions of end users — who is responsible for the handful of dangerous outcomes that are sure to occur? Had Google tested its system better, or approached it more like a civil engineering project, it could have aligned its implementation with better outcomes.

  1. The author would like to thank Dr. Michael Zargham for his presentation on engineering principles at the annual BlockScience Company Retreat, Oct. 16, 2024.



Written by Amber Case

Design advocate, founder of the Calm Technology Institute, speaker and author of Calm Technology. Former Research Fellow at MIT Media Lab and Harvard BKC.
