Indestructible memory, indestructible processes


Having read recently ( 1 , 2 , 3 ) with what difficulty “space” processors are given, I involuntarily wondered if the “price” for stable iron is so high, maybe it’s worth taking a step and, on the other hand, making “software” resistant to special factors? But not application software, but rather its execution environment: compiler, OS. Is it possible to make the execution of the program at any time possible to interrupt, reboot the system and continue from the same (or almost the same) place. In the end, there is hibernation .

Radiation effects


Almost everything that flies from space is capable of disrupting the operation of the microcircuit, it is only a matter of the amount of energy that “it” brought with it. Even a photon, if it has a gamma-ray wavelength, is able to overcome several centimeters of aluminum and ionize the atom (s) or even cause a nuclear photoelectric effect . An electron cannot penetrate through any dense obstacle, but if it is accelerated more strongly, it will emit a gamma quantum when braking with all the ensuing consequences. Given that the half-life of a free neutron is about 10 minutes, a rare (and very fast) neutron reaches us from the Sun. But the nuclei of anything fly by and are also capable of doing things. The neutrinos are perhaps not seen in anything like that.

How can one not recall Piglet with him: “it is difficult to be brave when you are just a Very Small Being.”

The consequences of cosmic radiation entering a semiconductor can be different. This is the ionization of atoms and the violation of the crystal lattice and nuclear reactions. This is where silicon doping with thermal neutrons in an atomic reactor is described , when Si (30) turns into P (31), and the desired semiconductor properties are achieved. It’s not worth retelling the wonderful articles mentioned, we’ll only note the following -

  1. Some effects have a short-term effect and do not have long-term effects. They can lead to errors that can be fixed either hardware or software. In the worst case, a reboot helps.
  2. , . - .
  3. .

Note that effects of types 2 and 3, if they were able to be stopped, lead to gradual degradation of the microcircuit. For example, if one of the (even 4) adders “burned out” in the superscalar processor, you can (at least speculatively not difficult) physically disable the power to the victim and use the remaining three, outwardly only a drop in performance will be noticeable. Similarly, if one of the internal pool registers is damaged, it may be marked as “always busy” and will not be able to participate in the planning of operations. The memory unit may become unavailable. ... But if something irreparable has deteriorated, you will have to raise the cold reserve. If he is.

“Staying in a cold reserve, by the way, does not protect the microcircuit from the accumulation of the dose, and even from the accumulation of charge in the gate insulator. Moreover, microcircuits are known in which dose degradation without power supply is even worse than with it. But all the single effects that cause hard failures require the inclusion of the chip. With the power off, there may be bias effects, but they are not important for digital logic. ” (amartology)

Thus, there are two factors

  • at any time, a failure may occur, which is treated by a reboot
  • the system will gradually degrade (sequence of failures), most of the work will occur in conditions of partial degradation

How do you live with all this? Due to reservation / tripling with voting throughout the hierarchy of functional blocks. In itself, tripling is not a panacea, it is necessary in order to understand which of the results is correct when one of the components fails. Then the failed component can be restarted and brought into line with two workers. But in the event of a failure, when the component cannot be brought into working condition, only cold reserve, if any, will help.

Even if the failure does not look critical, it can cause serious problems. Suppose we have three synchronously working computers, in one of them (hypothetical, mentioned above) one of the adders failed. This is not a problem from the point of view of a computer that has remained operational, but it is a problem for the entire system since the affected computer will begin to be systematically late and serious efforts will be required for overall synchronization.

Another example, a memory failure, as a result of which some of its range (even one page) has become unusable, is not critical from the point of view of a single computer. After diagnosis, the operating system is able to cope with this problem without using this range. But from the point of view of the trojan system, this is a disaster. Now, if there is a failure (which is treated by rebooting), we will need to bring the failed computer to a state identical to any of the remaining ones, but this is impossible because on other computers, this range is working and probably used. In principle, it is possible to prohibit this range on all three computers, however, it is not obvious that it will be possible to do this without restarting all computers in turn.

It is a paradoxical situation when a system that is trojaned at the upper level is less reliable compared to a single computer that can adapt to gradual degradation.

It is worth mentioning the approach called Lock-step , when two cores perform the same task with a shift of one or two clock cycles, and after that the results are compared. If they are not equal, some piece of code is re-executed. This does not work if there is an error in the memory or the general cache, however, it has its own protection.

There is also an approach where the compiler repeats the execution of part of the commands and compares the results. Such a soft version of Lock-step.

Both of these approaches (thanksamartologyper tip) - an attempt to detect a failure and try to fix it with "little blood", without rebooting. We will more likely consider the situation when a serious failure or non-critical failure occurs and a reboot is inevitable. How to make sure that the program without any special efforts on its part could be interrupted at any moment, and then continue without serious losses.

How to teach hardware and OS to adapt to gradual degradation is a topic for another discussion.

What if


The idea of ​​a stable / persistent memory is not new in itself, so the respected Dmitry Zavalishin (dzavalishin) proposed his concept of persistent memory . In his hands, this gave rise to a whole persistent Phantom OS , in fact a virtual machine with corresponding overhead.

Perhaps, over time, MRAM or FRAM technologies will mature ... while they are raw.

There is also a legend about the on-board computer of the R-36M rocket (15L579?), Which was able to launch through a radioactive cloud immediately after a close nuclear explosion. Applied memory on ferrite cores is immune to radiation. The recording cycle of such a memory is of the order of units of microseconds, so during the time that the rocket flies a few decimeters, there was a physical opportunity to maintain the context of the processor - the contents of registers and flags. Waking up in a safe environment, the processor continued to work.
Sounds believable.

There are some buts:

  1. Hibernation in its current form is not suitable. It takes some effort and time. We are trying to protect ourselves from a sudden failure. It is not obvious that after this failure the processor is physically able to do at least something. Similarly, in 15L579, the system receives a warning before the troubles begin and has time to protect against them.
  2. “” — , , — . , () , .
  3. , , . — -.

In general, crash recovery is essentially a counterpart to exception handling. Actually, the failure itself in most cases begins as a hardware interrupt. The difference is that after the exception, we just can continue to work, and in this case we first need to restore the working context - the memory and state of the kernel of the operating system. But the final part looks the same.

First, how it should look from the side of the application programmer.

A look from outside the OS kernel


Since recovering from failures is similar to recovering from throwing an exception, then working with it may look similar. For example, in C ++ we inherit the std :: tremendous_error class from std :: exception, catch it in a regular try / catch block and organize the processing.

However, the author likes the semantics of setjmp / longjmp (SJLJ) more because:

  • this is concise, just call the analog setjmp (& buf) and resume work from the same place
  • even no “& buf” is required, just calling a system function that stores the current state
  • besides C ++, there are other great languages, not everywhere there is exception handling, but everywhere there is a call to system functions
  • and there’s no need to modify the language, because we were originally going to act as invasively as possible

At one time, SJLJ lost to the DWARF technique (strictly speaking, dwarf is just an information recording format) in exception handling due to poorer performance, performance is not so important here. In any case, maintaining the state will not be cheap; one must approach it responsibly.

A look from inside the OS kernel


What needs to be saved, what does the context of the process execution consist of?

  1. For each thread in user mode - the current “jmp_buf” with the necessary registers, this means that the OS must stop all threads of the calling process before saving the data
  2. , — . (: ), (ex: ).
  3. . (ex: ), (ex: TCP ). .
  4. , . ,
  5. . , . , — . .. .

    , , . .
  6. , .

Information is not required for transcoding from virtual memory to physical and vice versa; upon restart, this information will be recreated by itself, possibly in a different way.

As for working with the file system. Among file systems, there are transactional ones. If the application requires precisely transactional behavior, the preservation of the process context should be synchronized with the confirmation of the file system transaction. On the other hand, for example, for recording text logs, it is logical to use a regular file system, transactionality here would be strange.

Of all the above, the greatest questions are caused by the preservation of the contents of memory; the volume of everything else is insignificant in comparison with this.
For example, runtimethe library buffers memory allocations, asks them from the system in relatively large chunks, and distributes itself. Therefore, creating / deleting segments is relatively few.

But programs work continuously with memory, in essence it is the memory subsystem that is usually the bottleneck in the calculations. And all that can simplify our lives is hardware support for flags of modified pages. It is expected that between the state saves, not too many modified pages appear.

Based on this, in the future we will deal with the contents of memory.

Saving the contents of memory


The desired behavior is close to databases - the DBMS can “fall” at any time, but the work done will continue until the last commit. This is achieved by maintaining a transaction log, getting into which commit records will legalize all changes made to the transaction.

But, since the term “ transactional memory ” is busy, we will introduce another - “indestructible memory”.

Offhand you can see two methods by which this indestructible memory can be implemented.

Option one , let's call it “unpretentious”.
The main idea is that all data changed in a transaction should be placed in RAM. Those. during operation, the swap mechanism does not save anything to disk, but during the commit, all changed pages are saved to the swap file.

Information about the selected segments and their connection with the place in the swap file is written to the log. During operation, this information is accumulated and recorded during the commit. When restarting, the system has the ability to create segments anew. The swap mechanism will be able to pull them up and the interrupted program will magically receive its data.

However, in this mode it is impossible, for example, to allocate a calloc- th array larger than the available memory ( malloc- th, by the way, is possible). However, this in any case would not be a very good idea.

Even if such a regime applies only to processes that have declared themselves to be “indestructible”, the amount of memory occupied by the current transactions of all such processes cannot exceed the physically available. The swap mechanism actually stops swapping and turns into a mechanism for storing recent transactions.

All this imposes a certain discipline on application developers, can lead to an uneven load on the disk, in general, this is not quite what we wanted, but it can work in embedded systems.

A significant drawback of this option is that an fatal error during the commit, when only part of the pages were written, leads the corresponding process to an unstable state, after which it will have to be stopped.
It turns out some kind of 50% inviolability.

Option Two , “Shadow”
To act as a transaction manager, you need to be a transaction manager.

Let's define entities:

  1. The page file contains data pages, so the size is a multiple of the size of the page. We say the file, we mean the section rather, because a fixed size improves system stability.
  2. Paging file page allocator . It is necessary to select a page not only for user data, but also, for example, to record the state of the allocator itself. As well as all that was mentioned above.
  3. . , . , ,
    (= , ).
  4. . —
    • ID
    • ( )
    • ID .

    - TLB, .. .

    ( ) . . , ex: (Buddy Allocator) .

    , . .
  5. . COW (copy on write) . , . COW, - , . .

    — - , “dirty”. .
  6. (). .

    : , .

    . , . . , , . . , ? , .

    , .

    (= , , , ).


    (=, ). .

    . , . , , , .

    , . , . .

    — . , , , , .
    , , .

    .
  7. . , , — . , ?

    — . , . , . , .. .

    — , SSD ! , SSD ( “” ) .

    , .

    — . , . , . ( ).

    , , , . , , . , , , . , — . .
  8. Checkpoint.

    , , , , . — . , . checkpoint. .

    . . - checkpoint- . .

    , . - .

    checkpoint-. , / .

    -, - /, . , ( ...). .

    . . . , . — , checkpoint.


It is a pity that there are no storage devices that are completely resistant to long-term operation in space conditions. Ferrite cores were resistant to radiation, but had their own specific problems due to the large number of soldered joints. Plus low capacity, low speed and high complexity of manufacturing.

Nevertheless, you must be able to reliably write and read this data.

An obvious candidate is flash memory. Flash was not initially highly reliable due to the low number of valid write cycles, so special methods were developed to work with it .

It was previously mentioned that tripling is used to work with unreliable elements, RAID1 is enough herebecause if the recording fails due to the control values, it is known which of the two pages was written incorrectly and must be overwritten.

Total


Well, now we have in our hands all four letters of the word ACID .

A - atomicity, achieved
C - consistency,
I - isolation is evident , is achieved naturally. If you do not consider the case of shared memory. And at the moment we are not considering it.

D - persistence, the only time we tried to cheat when we released the process after a commit without waiting for the physical recording of all the data in its memory to disk. In the worst case, this will lead to a rollback to the previous transaction. It is unclear how critical this is for both performance and durability.

PS. Just a quick note. We do not have a mechanism for rollback transactions, rollback can only be a fatal error. Technically (it seems) it is easy to implement a transaction rollback program as an analogue of longjmp. But this is a much more advanced version of longjmp since completely restores the internal state of the process at the time of “setjmp”, avoiding memory leaks, allows the transition not only from the bottom up the stack ...

PPS . Perhaps, the OpenLink Virtouso DBMS server , available also as free software , can be considered the prototype of the transaction manager .

PPPS . Thanks to Valery Shunkov (amartology) and Anton Bondarev (abondarev) for a meaningful and very useful discussion.

PPPPS . Illustration by Anna Rusakova .

All Articles