Strict Adherence to a Standard Does Not Guarantee Security or Safety
The recent publication “Refinement-based Specification and Security Analysis of Separation Kernels” by Zhao, et. al. is a reminder that strict adherence to an informal specification is not always the best policy for safety or security in your product. While there may be no leeway when implementing a formal specification, the ARINC 653 specification is itself informally written and has several areas where it is unclear and remains unclear despite multiple revisions. This means that an implementer needs to use at least some outside knowledge and understanding of the problem space in order to implement a standards compliant API that also adheres to the spirit of the specification and not just the literal text.
As part of our Small Business Innovation Research (SBIR) work we implemented the ARINC 653 Inter Partition Communication APIs as extensions to the Xen hypervisor. Since this work was done in 2011, the referenced paper was not available, but we still managed to avoid all avoidable IPC issues presented in the paper. It might seem like luck, but it was due to making careful implementation choices that put safety and security as the primary concern, and using the API description of the standard as a guideline instead of the literal required implementation.
Implementing ARINC 653 IPC APIs Without Sacrificing Security or Safety
There are three IPC issues presented in the paper by Zhao, each of which is listed below along with the reasoning that drove our implementation choices and allowed us to avoid these issues without needing to go through all the work of formally modeling the standard:
Issue #1: Queue Status as a Covert Channel
The first issue presented is categorized as a problem with the standard itself. The issue has two parts: 1) a receiver covertly communicates information (in this case full/empty status) back to the sender 2) due to back pressure on a queue interface a receiver can influence the sender by not taking messages off the queue. The latter being the bigger problem since it could cause a sender to halt or repeatedly retry the transmission. #1 is indeed a problem with the standard itself, and there is not much that can be done to avoid it without changing the standard to allow time out errors to be handled without necessarily notifying the partition whose transaction timed out. Whether or not #2 is a problem with the standard is a bit more complex.
The authors avoid this second issue by allowing the dropping of packets, which at first glance seems to be in violation of the standard. After all, the Queuing Send API makes no mention of dropping packets. However, while the authors claim that the ARINC 653 standard prohibits dropping of messages, the ARINC 653 standard actually does not say that. In fact it almost directly says the opposite, “However, if the receiving FIFO is full, new messages may be discarded”. Additionally the Receive Queuing API description says what to do on queue overflow. If the queue overflowed that must mean that messages have been dropped. This at least implies that the send interface may drop packets even if the send interface description does not explicitly mention doing so. This implication in combination with the previous statement makes it clear that dropping packets is not in violation of the standard, though individual implementations are not required to adopt this packet dropping behavior.
When addressing this issue of message overflow and dropping, we considered what would happen in the case of an intermediate I/O partition that was routing streaming I/O on behalf of ARINC 653 guest partitions. If the I/O partition was relaying I/O messages to a guest partition through a queuing IPC connection, and the receiving partition stopped processing queue messages, the I/O partition can only store so many new messages for the destination partition before it must start dropping packets. Since the I/O partition must tell the receiver that it dropped these packets, it makes sense to put the mechanism to handle this situation (including local storage and dropping on timeout status) in the Send/Receive APIs themselves.
Issue #2: Not checking if a port belongs to a partition.
The second issue is once again categorized as a problem with the standard, because the IPC send/write APIs do not mention that the ownership of a port by a partition needs to be verified. While the ARINC 653 standard does not dictate this checking, it still seems like an obvious oversight not to check, but since the authors of the paper found real ARINC 653 implementations that fall into this trap in must not be as obvious as we thought.
The API statement in question is “return INVALID_PARAM when the “[SAMPLING|QUEUEING]_PORT_ID does not identify an existing port”. In our view the intended interpretation of this statement is “…does not identify an existing port in the partition”, but since it is ambiguous the alternative interpretation of “… does not identify an existing port in the system” is plausible, though as pointed out by the paper’s authors such an interpretation is problematic for security.
That some ARINC 653 operating system vendors chose this interpretation is concerning. Admittedly, it is more a security issue than a safety one, since most safety systems are not concerned with malicious actors forging inputs. However, depending on how the underlying port IDs are generated, a simple programming error or a single random bit flip could allow a partition to access ports not belonging to it.
In our implementation, not only do we preclude a partition from using a port that does not belong to it, but we also do not even tell a partition that ports other than its own exist. Even letting a partition know of the existence of another partitions’ ports may not violate the letter of the law in ARINC 653, but it violates the spirit – that each partition behaves the same, whether other partitions are present or not. This concept is important for safety, but it turns out to be even more important for security.
Issue #3: Using an easily derived global id for port identifiers.
The last IPC issue is categorized as a potential implementation problem. In this case, the issue happens in systems that create global port ids in a way that can be easily guessed (or accidentally occur due to random bit errors). In concert with the previous issue, a misbehaving guest can easily start using a port to which it should not have access.
The same implementation decisions that negate the previous problem also apply here, but additionally in our implementation, port ids are partition specific, so a partition can only derive port id information about itself and not any other partition.
Our Choices Were Not in Vain
When implementing these APIs it was clear to us that just blindly using the informal ARINC 653 standard as the required implementation was not enough to guarantee safety or security, and we made appropriate choices to remedy these shortcomings. We are thankful to the authors of “Refinement-based Specification and Security Analysis of Separation Kernels” for providing a formal backing to our intuition and hard won battle scars.