Cyber researchers discuss challenges to building resilient, software-intensive sociotechnical systems and glean insights from the commercial sector.

In 2017, the malware known as NotPetya tore through the internet, first crippling computers throughout Ukraine and then paralyzing companies and government agencies around the world. If such a cyberattack were to disrupt the U.S. military systems used for command and control, it could seriously harm U.S. missions and even result in loss of life. As attackers grow increasingly sophisticated, their success at infiltrating and significantly disrupting mission systems is not a question of if, but when.

Just as important as defending a mission system from exploitation is ensuring the mission's resiliency. A resilient mission system can keep fulfilling its most important functions and adapt to circumstances even when its walls are breached. Researchers in Lincoln Laboratory's Secure Resilient Systems and Technology Group are working to understand how to design for, improve, and test resiliency in government systems and networks. This work is part of the multiprogram Applied Resilience for Mission Systems (ARMS) effort under Martine Kalke within the Cyber Security and Information Sciences Division. ARMS is focused on defining and implementing resiliency for Laboratory prototypes and government programs.

"What we care about is how to ensure that our warfighters can still perform their missions even when the adversary is actively disrupting their capabilities," cyber researcher Orton Huang said. "Our capabilities will be disrupted and degraded; we'll lose information and secrets, but how do we make sure we can respond and recover so that some portion of our missions can keep going? And how do you not just reset to a prior state but actually change and adapt? This is not just about the technology but involves the whole organization as we need to learn and evolve."

Lincoln Laboratory hosted an invitation-only workshop earlier this year to discuss these challenges. The workshop focused on understanding and defining resiliency from different perspectives and included five panel discussions on how industries and the Department of Defense view, build, and manage resiliency within their organizations. More than 60 people attended, including invited panelists and moderators from Stanford University, Columbia University, Bank of America, Akamai, SEI CERT, Naval War College, New England Complex Systems Institute, Capsule8, Microsoft, George Mason University, Boston Cybernetics Institute, Threat Stack, and Atlantic Council.

One challenge faced by both industry and the government is the growing complexity of software and computing systems. "The use of complexity as a feature in systems drives a lot of the lack of resiliency," said Trey Herr, who is the director of the Cyber Statecraft Initiative for the Atlantic Council and was previously a senior security strategist with Microsoft. He participated in the workshop as a moderator. "There's convenience in stacking these tools that are compatible and add functionality. We need to be thinking of reverse incentives to unwind that."

Photo: About thirty people sit in a classroom setting, facing a panel at the front and projector screens showing a slide that reads "Resilience in Complex Systems."
Caption: At the Understanding Mission-Driven Resiliency Workshop hosted by Lincoln Laboratory, participants from industry, government, and academia discussed challenges in improving the resiliency of systems to cyberattacks.

As systems grow more complex, Jeremy Mineweaser, who co-organized the workshop with Huang, Kalke, Robert Lychev, and Reed Porada, sees it as increasingly important to integrate a system's developers into the later stages of its life cycle.

"We want to expand people's view about what technology is; it includes people and process and is part of a sociotechnical system," Mineweaser said. "The canonical example of a program in the military has contractors handing systems off to maintainers and moving on to build the next system. But we need to have a resiliency mindset, which calls for better strategies for transferring technology, knowledge, and skills to the operators. People who study and make the technology gain deep knowledge of how it works and understand its complexity and how to manage it. So, just handing it off to a third party is not positioning them for success. There needs to be a continuous and agile learning and feedback system."

Transitioning to a new model of working, called SecDevOps (Secure Development and Operations), will help. Under SecDevOps, the development and operations teams are no longer siloed but work together across the entire cycle of development, test, deployment, and operation. This model of working will make it easier for both parties to understand problems, push changes, and continuously evolve software. SecDevOps also brings security into the discussion from the start and takes advantage of DevOps best practices that have significant security benefits.

"Adopting SecDevOps is important to be able to build systems that can change and grow over time, like industry is able to do, but requires a change in government culture and mindset," Lychev said. Part of this mindset includes rethinking long development timelines, ingrained though they may be in how government programs are funded. "Apple, for example, doesn't announce the 2040 iPhone in 2020, whereas the Department of Defense might get funding from Congress for things planned over the next one to two decades," Mineweaser added.

Another cultural shift the team sees as vital to ensuring resiliency is leveraging "chaos engineering." The idea is to deliberately break a system in its operational environment to understand how it responds to disruptions, to learn from those disruptions, and to work out how to fix the system quickly. This type of live resiliency testing in operations or production is common in the commercial sector. Netflix, for example, developed a tool called Chaos Monkey, which its developers liken to "unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables" and which runs while Netflix operates its services. This way, engineers can identify, manage, and learn from weaknesses in their systems and then build automated recovery mechanisms to handle those failures if they recur when no one is watching.

Eric Lofgren, a fellow with the Mercatus Center at George Mason University and a panelist at the resiliency workshop, agreed that exposing systems and networks to threats "is the best way to identify, contain, and overcome downside risk while still allowing the system to move towards new structures using the principles of combinatorial innovation."

This rhythm of failing, learning, and automatically recovering is what the Laboratory researchers hope to develop for government systems. They also recognize that the stakes for government agencies are higher than those for Netflix — lives depend on defense systems working correctly. "You need a lot of monitoring and observability to understand what has broken, and a lot of infrastructure to support automatically restoring capability," Mineweaser said.
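The rhythm of failing, observing, and automatically recovering can be illustrated with a toy simulation. The sketch below is not Chaos Monkey and is not the Laboratory's tooling; the `Instance` class, the function names, and the availability figure are all hypothetical and exist only to show the shape of one chaos-engineering loop: disrupt a random instance, measure what is still available, then let an automated mechanism restore it.

```python
import random


class Instance:
    """A hypothetical service instance that is either healthy or down."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


def inject_failure(instances, rng):
    """Chaos step: take down one randomly chosen healthy instance."""
    healthy = [inst for inst in instances if inst.healthy]
    if not healthy:
        return None
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim


def recover(instances):
    """Automated recovery: restart every instance that is down."""
    restarted = [inst for inst in instances if not inst.healthy]
    for inst in restarted:
        inst.healthy = True
    return restarted


def run_experiment(n_instances=5, rounds=10, seed=0):
    """Run repeated disrupt / observe / recover rounds and log each one."""
    if n_instances < 1:
        raise ValueError("n_instances must be at least 1")
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    instances = [Instance(f"svc-{k}") for k in range(n_instances)]
    log = []
    for _ in range(rounds):
        victim = inject_failure(instances, rng)
        # Observe: fraction of instances still serving during the disruption.
        available = sum(inst.healthy for inst in instances) / n_instances
        restarted = recover(instances)
        log.append((
            victim.name if victim else None,
            available,
            [inst.name for inst in restarted],
        ))
    return instances, log
```

In a real system the "observe" step would draw on monitoring data rather than a single in-memory flag, and recovery would have to be validated carefully before any such test is run against a live mission system.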

To meet those needs, the team has been working for the past two years on the Resilient Systems Toolbox (RST). RST allows developers at the Laboratory to push their mission software, monitor it, disrupt it, observe the effects, and learn how to make it more resilient. Running these exercises in the testbed is helping the researchers build automated tools to enhance resiliency and provide techniques and metrics for evaluating the efficacy of these tools. Several programs at the Laboratory have begun using the resulting components.

As the research proceeds, the team continues to take stock of lessons learned from discussions with industry participants at the workshop. "The workshop was a great forum to drive forward and sustain this dialogue between communities," Herr said, adding that he hopes to see more events like it. Another resiliency workshop is planned for this spring.

"Many participants expressed a strong desire to continue the workshop series, and we intend on doing so," Lychev said. "Please contact us if you would like to participate in the future."
