Clear Prop #9 | Forum 79 Paper Spotlight #2
FEATURE: Concerns with using Python in Machine Learning Flight Critical Applications
This is the second in our Vertical Flight Society Forum 79 Feature Series, where we dive deep into exciting papers presented in West Palm Beach, Florida, May 16-18. The VFS Annual Forum, organized by the world's leading vertical flight non-profit, featured 250+ papers representing the latest & greatest in rotorcraft, drone, eVTOL, and AAM research. Highly recommended to attend - the best technical event of the year, hands down!
In this edition, I sit down with Harold Glenn Carter, Chief of Air Vehicle Management Systems at the U.S. Army Combat Capabilities Development Command (DEVCOM), and Jason Rupert, Senior Software Airworthiness Engineer at Modern Technology Solutions (MTSI). Together with two other authors from DEVCOM, Glenn and Jason wrote an intriguing paper (not yet publicly available) on a largely unexplored topic: the suitability of the Python programming language for flight-critical aircraft systems. In short, there are multiple reasons why engineers need to think hard before deploying Python in the cockpit.
Below are the key takeaways of the research, followed by our conversation touching on the “big picture”.
Flight-critical systems are written in C/C++, for which there are clear certification paths in aviation. However, with the advent of the machine learning (ML) age of computer vision, reinforcement learning, and autonomous decision-making, Python has become the go-to choice for data scientists. It is an easy language to code in, with an enormous ecosystem of machine learning libraries that enable developers to do extraordinary things without reinventing the wheel. However, Python is an interpreted language: its source code is compiled to bytecode that a virtual machine executes. This means that Python does not compile down to the native binary code of the machine it runs on but needs an additional layer to translate its “human-friendly” code into instructions the processor can execute. It also performs “garbage collection”, pausing the code on its own schedule to free up memory in an unpredictable way. All of this makes Python extremely challenging to certify for critical on-board systems.
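To make these two properties concrete, here is a minimal Python sketch (the `detect` function and its values are illustrative stand-ins, not from the paper): the standard-library `dis` module reveals the bytecode that the CPython virtual machine actually interprets, and the `gc` module exposes the garbage collector that engages on its own schedule.

```python
# Minimal sketch of two Python properties discussed above. The function
# name `detect` is an illustrative stand-in for flight-critical logic.
import dis
import gc
import io

def detect(threshold, score):
    return score > threshold

# (1) Python source is compiled to bytecode, which a virtual machine
# interprets at runtime -- there is no direct native-binary execution.
buf = io.StringIO()
dis.dis(detect, file=buf)
bytecode_listing = buf.getvalue()
# The comparison is a VM instruction, not a single processor instruction:
assert "COMPARE_OP" in bytecode_listing

# (2) The cyclic garbage collector engages implicitly once allocation
# counters cross thresholds -- i.e., on its own schedule, not the code's.
thresholds = gc.get_threshold()   # e.g. (700, 10, 10)
collected = gc.collect()          # a collection can also be forced manually
```

For certification purposes, the concern is that both layers shown here - the bytecode interpreter and the collector's scheduling - sit between the source code an auditor reviews and the instructions the hardware actually runs.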
Key takeaways:
Python’s ease of use, diverse library support, and broad open-source community make it the most popular programming language today. However, this is a double-edged sword: the same breadth makes it an intractable maze that is near-impossible to certify under DO-178C, the standard by which the FAA & EASA certify airborne software.
One course of action would be to certify, qualify, and mature a specific version of Python. Past experience shows that this could be a costly and lengthy endeavor: the Joint Strike Fighter (F-35) program required C++ coding standards to be developed, with the direct involvement of Bjarne Stroustrup, the creator of C++. Certifying just the CPython virtual machine may cost up to $80M (hundreds of dollars per line of code) - and this does not include Python's machine learning libraries such as Scikit-Learn, PyTorch, and TensorFlow.
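The $80M figure can be sanity-checked with back-of-envelope arithmetic. Both inputs below are assumptions of mine, not numbers from the paper: the CPython interpreter core is commonly quoted at a few hundred thousand lines of C, and rigorous avionics certification evidence is often estimated at low hundreds of dollars per line.

```python
# Back-of-envelope check of the ~$80M estimate. Both inputs are rough
# assumptions for illustration, not figures taken from the paper.
lines_of_code = 320_000   # assumed size of the CPython VM core (lines of C)
cost_per_line = 250       # assumed $/line for DO-178C-grade evidence

total = lines_of_code * cost_per_line
print(f"${total / 1e6:.0f}M")  # → $80M, in the ballpark of the cited figure
```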
A second option would be to develop the machine learning code in a certifiable language, i.e., C/C++, from the start. However, this would potentially lengthen the development process, since developers would lose access to the ML libraries that make model building so easy and fast in Python. There is also a risk that the resulting performance may not be as good.
The final option - the one recommended by the authors - would be to use Python to develop the machine learning code but then transform it into a certifiable language, i.e., C/C++, to run on the avionics systems. This has the advantage of letting developers stay “Python-native” during model building while enabling certifiable safety assurance on board the aircraft. What is less clear today, however, is the translation step needed to convert Python code into C/C++ for implementation (perhaps OpenAI GPT-like models can help here?).
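One plausible shape for that translation step - sketched here under my own assumptions, not the paper's - is to freeze a trained model's parameters out of Python into plain C constants, to be paired with a separately written (and separately certifiable) C/C++ inference routine. The `export_weights_as_c` helper below is hypothetical.

```python
# Hypothetical helper: freeze a trained layer's weights (here a plain
# nested list standing in for a real model's parameters) into a C array
# literal that a hand-written, certifiable C inference routine could use.
def export_weights_as_c(weights, name):
    rows, cols = len(weights), len(weights[0])
    flat = ", ".join(f"{w:.8f}f" for row in weights for w in row)
    return (f"static const float {name}[{rows * cols}] = {{{flat}}}; "
            f"/* {rows}x{cols}, row-major */")

# Toy "trained" 2x2 layer:
w = [[0.5, -1.0], [2.0, 0.25]]
header_line = export_weights_as_c(w, "layer0_w")
print(header_line)
```

The appeal of this split is that only the C side - fixed-size arrays and simple loops - would face the certification audit, while the Python side remains a development-time tool.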
Beyond the issues with the Python framework itself, machine learning - and thus the implementation of non-deterministic algorithms on board the aircraft - does not currently have a pathway to certification by the FAA & EASA. This is a multi-layered, highly complex problem that will probably take years to resolve.
The BFD: Autonomous flight is one of the pillars of Advanced Air Mobility. Whether we are talking about the world of small drones, eVTOLs, or electric fixed-wing aircraft, the introduction of autonomous systems makes the economics so much more attractive that it is impossible not to pursue it. Thus, the certification of trustworthy autonomy is of prime importance in this decade if we want aerial mobility services to be competitive with other modes of transport. The problem is that ML approaches mainly originate from domains ranging from B2B SaaS to Netflix recommendation algorithms, which have no need for real-time safety assurance. Therefore, there needs to be a clear path of transformation from “consumer ML” to “aviation ML” frameworks. As of now, beyond a few papers published by industry and regulators, it is not clear how this will take shape.
Pamir: Glenn, Jason - grateful to have you here. What are the main overarching concerns you have with certifying autonomy in general?
Glenn & Jason: There are a few clusters of problems that need to be solved if we are to certify autonomy in aviation. Perhaps the most impactful in terms of flight safety is the probabilistic, non-deterministic nature of machine learning algorithms. For example, when a camera on an aircraft detects surrounding aircraft using image recognition and a classifier, these are models based on neural networks - statistical representations of reality with no clear-cut causality.
Today’s machine learning models give accuracies in the high 80s to low 90s out of 100. If we are targeting a failure rate of 1 in a billion flight hours, that won’t do. And if your models are giving you a very high level of confidence in their outputs today, you should essentially throw them out, as they have most likely overfit your data. Moreover, neural nets are black boxes that offer the regulator little in the way of scrutiny and explainability. Therefore, there is no path to certification under the current regulations based on DO-178C.
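The gap described here can be put in rough numbers. This back-of-envelope sketch is illustrative only: mapping a per-decision model error rate to a per-flight-hour failure rate depends heavily on system architecture, redundancy, and monitoring.

```python
# Orders-of-magnitude gap between today's model accuracy and the
# catastrophic-failure targets used in aviation. Illustrative only.
import math

per_decision_error = 0.10    # ~90%-accurate model, as cited above
target_failure_rate = 1e-9   # ~1 failure per billion flight hours

gap = math.log10(per_decision_error / target_failure_rate)
print(f"{gap:.0f} orders of magnitude short")  # → 8 orders of magnitude short
```

Even under generous assumptions, the raw model output is about eight orders of magnitude away from the target before any mitigation such as redundancy or run-time monitors is applied.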
Secondly, the underlying code itself is unquantifiable in the context of flight because it’s typically based on Python. As the paper shows, this can lead to unintended, undesirable, and unexpected functionality that may be catastrophic in flight. We do not have empirical data showing what would actually happen in flight, but based on the known characteristics of Python, it is very likely.
Thirdly, the certifiability of the hardware the ML code will run on is key as well. This is being tackled by the industry, but it remains a hard problem, as the graphics processing units (GPUs) optimized to train and deploy ML models (e.g., NVIDIA's) are orders of magnitude more complex than multi-core processors.
There are a few more problems associated with certifying autonomy such as data assurance and learning assurance, which need a more in-depth analysis than we can give here.
Pamir: Could you give us the main reasons why you think Python, as it currently is today, may not work for flight-critical software?
Glenn & Jason: There are multiple reasons why we think Python may not work for flight-critical applications. One is the fact that Python is a dynamically typed, interpreted language, which makes the code dependent upon a virtual machine that converts Python code into something a machine can understand. This creates all kinds of unpredictability issues, since you now need to certify a virtual machine with tens of thousands of lines of code and no good documentation available. Python also has a garbage collector, which essentially makes the code pause on its own schedule, in a non-deterministic way. So the predictability of the code's behavior is an issue here.
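A small sketch of the dynamic-typing hazard mentioned here (the function and values are invented for illustration): nothing constrains argument types until a line actually executes, so a type fault that a compiled, statically typed language would reject at build time surfaces only at runtime.

```python
# Dynamic typing: the type error below survives every check Python
# performs before execution and appears only when the call runs.
import gc

def altitude_margin(current, floor):
    return current - floor        # legal for any types supporting "-"

assert altitude_margin(500.0, 200.0) == 300.0

caught = False
try:
    altitude_margin("500", 200.0)  # a string slips through unchecked
except TypeError:
    caught = True                  # the fault appears only at call time

# A common partial mitigation for garbage-collection non-determinism is
# to control the collector explicitly -- at the cost of managing memory
# growth yourself:
gc.disable()
gc.enable()
```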
We do not know exactly how these may affect particular machine learning features in flight, but we can generally say that in addition to the Python source code itself, if you are using various Python libraries, we do not know what those API calls are doing. It is like a black box. If each of those API calls were certified to its expected behavior, there wouldn’t be a concern. But how many Python libraries can you really go out and certify? That would mean reviewing millions of lines of code.
For advanced air mobility, machine learning is an amazing concept that enables us to do things we couldn't do with traditional software approaches. However, the outputs can be unreliable. Therefore, we need an increased level of safety analysis and engineering from the bottom up so that we can address these issues.
Pamir: Why aren’t developers writing ML code in C/C++ with certification in mind?
Glenn & Jason: The advantages of developing machine learning code in Python are obvious: you can develop and deploy code much faster and more easily than with a language like C or C++, with incredible libraries and solid community support. Therefore, we haven’t seen many companies opt to use C/C++ from the get-go. The logical way - and this is what we recommend in our paper - is to have two teams working together on avionics code. One would consist of data scientists developing the ML models, while the other would take this code and turn it into something “avionics friendly” and certifiable. Most likely, the first team would code in Python and the other in C or C++.
Your particular approach as a company also depends on whether your goal is just a product demo or product commercialization. If it’s the former, you can quickly build and deploy systems using Python. However, keep in mind that once you have built everything in Python, it is almost impossible to retrofit that existing code into a certifiable product. You would probably need to rebuild from scratch.
Engineers need to think about certification from the beginning. For example, choosing TensorFlow over other libraries such as PyTorch may be a sensible choice, as it is easier to adapt to C/C++. You can also build your pipeline in a modular way - rather than end-to-end - enabling you to construct simpler pipelines with fewer model-deployment challenges. In other words, rather than one super-large neural network trained through reinforcement learning, holding everything from object detection to decision-making in its model weights, build smaller neural nets that connect to one another. In practice, this means having a model for object detection, a model for segmentation, a model for decision-making, and so on, all connected in a modular way. This makes debugging easier, requires less data to train individual modules, and would arguably make certification swifter.
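As a hedged sketch of that modular idea (stage names and interfaces are invented for illustration, not taken from the paper), each stage can be developed, tested, and in principle certified on its own before being composed:

```python
# Modular pipeline: small, separately testable stages wired together
# instead of one end-to-end network. Stage logic here is a toy stand-in;
# in practice each stage would wrap its own (small) model.
from typing import Callable, List

def make_pipeline(stages: List[Callable]) -> Callable:
    """Compose stages left-to-right; each stage can be audited alone."""
    def run(frame):
        result = frame
        for stage in stages:
            result = stage(result)
        return result
    return run

detect  = lambda frame: {"objects": [o for o in frame if o > 0.5]}
segment = lambda d: {**d, "regions": len(d["objects"])}
decide  = lambda s: "avoid" if s["regions"] > 0 else "continue"

pipeline = make_pipeline([detect, segment, decide])
print(pipeline([0.9, 0.2, 0.7]))  # → avoid
```

Because each stage has a narrow, inspectable interface, a fault can be localized to one module, and retraining one stage does not invalidate the others.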
Pamir: What would be some of your main messages to our readers?
Glenn & Jason: Safety analysis for anything that flies is key. Analyses of functions, failure modes, and the whole architecture under different use cases are really important. Each subcomponent in the whole ML architecture needs to be scrutinized heavily, including the programming language and its dependencies as well.
When we think of autonomy, we think of the removal of the human aspect from the cockpit. However, unless we achieve Level 5 autonomy in aircraft, the role of the human interacting with the autonomous system becomes more important, not less! Autonomous systems can give humans a false sense of confidence. They can also divide human attention unnecessarily. Therefore, the human aspects of engineering an autonomous system are highly critical. We need to consider safety right from the start and understand where the human-machine mix may be problematic. Perhaps we will find out that it is much better to go directly to full autonomy and eliminate the human from the operations. Otherwise, catastrophic results can take place based on the complex interaction between the human and the machine.
This edition’s sponsor is Vertical Flight Society. VFS is an amazing resource that I use to access leading technical papers and workshop decks for my work. Whether you are an investor, engineer, researcher, or entrepreneur, becoming a VFS member can give you an asymmetric advantage in the AAM space.
Membership gives discounted access to more than 15,000 technical papers, presentations and articles. Learn more here.
I don't think it is realistic to develop something in Python and then "transform" it to C++. In Python, all variables live on the heap and everything is dynamic. I don't think that in a development process for safety-critical software you can connect two technologies or programming languages that are so different - that in itself is a high risk. What you propose sounds like using Python as a domain-specific language for your C++ code. But I don't think that is a good idea, because Python is a general-purpose language. You could also use Fortran or Haskell or Lisp or whatever for that purpose, but I doubt it would be efficient. Actually, Lisp would work even better than Python, because it is extremely minimalistic and with macros you can express exactly what you want to bring to C++. The idea of using "specification languages" is more complex than a simple decision to use Python for that purpose.