..
    Copyright 2014 Modelling, Simulation and Design Lab (MSDL) at
    McGill University and the University of Antwerp (http://msdl.cs.mcgill.ca/)

    Licensed under the Apache License, Version 2.0 (the "License");
    you may not use this file except in compliance with the License.
    You may obtain a copy of the License at

        http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing, software
    distributed under the License is distributed on an "AS IS" BASIS,
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.

Checkpointing
=============

.. note:: Checkpointing is only possible in distributed simulation and only if the MPI backend is used.

Checkpointing allows the user to resume a computation from a previous simulation run. That run might have been interrupted, leaving only a partial simulation as a result; any tracers in use will likewise have written only part of their output. Restarting the simulation from scratch might be unacceptable because of the long time already spent on it. Checkpointing solves this problem by saving the current simulation state to a file after a fixed number of GVT computations.

The checkpointing algorithm is closely linked to the GVT algorithm, as this allows for several optimisations. Once the GVT is known, no message with a timestamp before it can still arrive, so all states from before the GVT can be removed. Since all nodes revert to the GVT after a checkpoint recovery, no states from after the GVT need to be saved either.

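The pruning of old states at the GVT can be illustrated with a minimal sketch. It is not the PythonPDEVS implementation: the function ``prune_states`` and the list-of-tuples layout are hypothetical, and the sketch assumes the usual Time Warp convention of keeping the most recent state at or before the GVT, since that is the state a node would resume from.

```python
from bisect import bisect_right


def prune_states(saved_states, gvt):
    """Return the saved states that are still needed once the GVT is known.

    saved_states: list of (timestamp, state) tuples, sorted by timestamp.
    (Hypothetical layout, chosen for illustration only.)
    """
    timestamps = [timestamp for timestamp, _ in saved_states]
    # Number of states with a timestamp at or before the GVT
    at_or_before = bisect_right(timestamps, gvt)
    # Keep the most recent of those (assumption: it is the state to resume from)
    keep_from = max(at_or_before - 1, 0)
    return saved_states[keep_from:]


states = [(0.0, "s0"), (1.5, "s1"), (3.0, "s2"), (4.2, "s3")]
kept = prune_states(states, gvt=3.5)
```

In this example the states at times 0.0 and 1.5 can never be needed again and are dropped, which is why a checkpoint taken at the GVT does not have to include them.
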
The only data that needs to be stored is thus the model itself. To keep the implementation somewhat simpler, some other data, such as the configuration options, is stored as well. In essence, a checkpoint is a selective *pickle* of the kernel at every location.

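A selective pickle can be sketched with the standard ``__getstate__`` and ``__setstate__`` hooks of Python's ``pickle`` module. The ``Kernel`` class below is a hypothetical stand-in with made-up attribute names, not the actual PythonPDEVS kernel; it only illustrates the idea of saving the model and options while leaving out everything else.

```python
import pickle


class Kernel:
    """Hypothetical stand-in for a simulation kernel (not the PythonPDEVS class)."""

    def __init__(self, model, options):
        self.model = model                    # must survive a checkpoint
        self.options = options                # stored for convenience
        self.old_states = [(0.0, "state")]    # history before the GVT: not needed
        self.inbox = ["pending message"]      # transient data: not needed

    def __getstate__(self):
        # Select only the attributes that are worth saving
        return {"model": self.model, "options": self.options}

    def __setstate__(self, saved):
        self.__dict__.update(saved)
        # Recreate the attributes that were left out, but empty
        self.old_states = []
        self.inbox = []


kernel = Kernel(model={"queue": [1, 2, 3]}, options={"checkpoint_interval": 1})
restored = pickle.loads(pickle.dumps(kernel))
```

After the round trip, ``restored`` holds the same model and options as ``kernel``, while its state history and inbox start out empty.
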
Now how do you actually use checkpointing? The first step is to enable it in the configuration options, like this::

    sim = Simulator(DQueue())
    # Name the checkpoints "myQueue" and create one after every GVT computation
    sim.setCheckpointing("myQueue", 1)
    sim.simulate()

The *setCheckpointing* function takes a name as its first parameter, which identifies the checkpoints and is used as a filename. The second parameter is the number of GVT computations that should pass before a checkpoint is made. This makes it possible to calculate the GVT frequently (e.g. after every 10 seconds of simulation), but to create a checkpoint only after a few minutes of simulation. Frequent GVT calculation might be necessary because it frees up memory. Creating checkpoints very often, on the other hand, is I/O intensive, and restoring a checkpoint will probably not be a matter of seconds either.

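The relation between GVT computations and checkpoints can be sketched as a simple counter. The ``CheckpointSchedule`` class is hypothetical and only mirrors the meaning of the second parameter described above; it is not how PythonPDEVS implements it.

```python
class CheckpointSchedule:
    """Hypothetical helper: decides after which GVT computations to checkpoint."""

    def __init__(self, interval):
        if interval < 1:
            raise ValueError("interval must be at least 1")
        self.interval = interval
        self.gvt_count = 0

    def gvt_computed(self):
        """Register one GVT computation; return True if a checkpoint is due."""
        self.gvt_count += 1
        return self.gvt_count % self.interval == 0


# With a GVT computation every 10 seconds, an interval of 18 would give
# one checkpoint every 3 minutes.
schedule = CheckpointSchedule(interval=18)
due = [n for n in range(1, 41) if schedule.gvt_computed()]
```

Out of 40 GVT computations, only the 18th and the 36th trigger a checkpoint here, so memory is still freed at every GVT computation while the I/O cost is paid much less often.
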
.. warning:: The first parameter of the *setCheckpointing* function is used as a filename, so make sure that it results in a valid file name.

When the simulation runs with these options, files are created in the current directory at every checkpoint step. The created files have the PDC extension, which stands for PythonDEVS Checkpoint. There are as many files as there are nodes running: one for each kernel. Furthermore, a basic file is created at the start, which contains the simulator that oversees the simulation. This file does not depend on the progress of the simulation, so it is not altered during the simulation itself.

Now that we have our checkpoints, we only need to be able to recover from them. This is as simple as calling the *loadCheckpoint* function **before** recreating a simulator and model. Calling it first is not strictly necessary, but a simulator and model created beforehand would be useless whenever recovery succeeds. The *loadCheckpoint* call automatically resumes the simulation as soon as all nodes are recovered. It returns *None* if no recovery is possible (e.g. when there are no checkpoint files), or a simulator object once the simulation has finished. It is therefore **only** necessary to create a new model and simulator if recovery fails. This gives the following code::

    sim = loadCheckpoint("myQueue")
    if sim is None:
        # No usable checkpoint was found: start the simulation from scratch
        sim = Simulator(DQueue())
        sim.setCheckpointing("myQueue", 1)
        sim.simulate()
    # Here, the simulation is finished and the Simulator object can be used as usual in both cases

The *loadCheckpoint* function automatically searches for the latest available checkpoint that is completely valid. If certain files are missing, the next most recent checkpoint is tried, until a usable one is found. Note that a checkpoint file can be corrupt, for example when the simulation crashes while writing it. The user will see this as a seemingly random *PicklingError*. In this case it is necessary to remove at least one of these files and retry. For this reason, older checkpoints are still kept.

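The idea of falling back to an older checkpoint can be sketched as follows. This is a simplified illustration with one file per checkpoint and made-up file names, not the PythonPDEVS recovery code; ``load_latest_valid`` is a hypothetical helper, and the exceptions it catches are the ones the standard ``pickle`` module is known to raise for a missing, empty or damaged file, which may differ from the error that PythonPDEVS reports.

```python
import os
import pickle
import tempfile


def load_latest_valid(paths):
    """Try checkpoint files from newest to oldest; return (path, data) or None.

    paths: file names ordered from oldest to newest (hypothetical layout).
    """
    for path in reversed(paths):
        try:
            with open(path, "rb") as handle:
                return path, pickle.load(handle)
        except (OSError, EOFError, pickle.UnpicklingError):
            # Missing, empty or corrupt file: fall back to an older checkpoint
            continue
    return None


with tempfile.TemporaryDirectory() as folder:
    older = os.path.join(folder, "older.bin")
    newer = os.path.join(folder, "newer.bin")
    missing = os.path.join(folder, "missing.bin")
    with open(older, "wb") as handle:
        pickle.dump({"gvt": 10.0}, handle)
    # An empty file, as if the crash happened before any data was written
    open(newer, "wb").close()
    found = load_latest_valid([older, newer, missing])
    found_name = os.path.basename(found[0])
    found_data = found[1]
```

Here the newest file does not exist and the one before it is empty, so the sketch ends up with the contents of the oldest file. This is also why keeping older checkpoints around is useful.
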
.. note:: On-the-fly recovery of a crashed node is not possible: all nodes have to stop and the simulation has to be restarted.