Restarting a Terminated Run using restart_file_info.dat
This post outlines the improved HydroGeoSphere restart functionality, designed to simplify resuming a model run after unexpected termination. Previously, restarting required modifying multiple input files, rerunning grok.exe, and manually appending outputs. Now, with the automatic generation of parallelindx.dat, restart_file_info.dat, and prefixo.restart, the process is much more efficient.
By updating the __Simulation_Restart value in the parallelindx.dat file, users can seamlessly restart simulations without adjusting model inputs. This approach ensures continuity, maintains model states, and offers flexibility in managing output files. This feature is particularly useful for long or complex simulations where interruptions may occur.
Once again, this ‘command of the week’ post is not going to highlight a particular HGS command but instead presents a bit of an advanced technique. This week’s post is all about restarting a model run that was terminated early for whatever reason. Fortunately, as of June, 2021 (revision 2270) we have overhauled the model restart process to make it much easier to implement!
There are several reasons for an HGS simulation to terminate early:
Power failures!
A new boundary condition comes into effect and causes the solution to diverge.
Manual termination to free up some CPU power for other tasks.
A supercomputing environment with a fixed maximum run-time policy, where your model takes much longer to run than the limit allows (resulting in several restarts).
Among HydroGeoSphere users, the most common method of resuming a simulation is probably to save the available outputs, retrofit the model *.grok file with the Initial head from output file command (for all active domains), and modify the initial time of the model using the initial time command. The previous head files are then used to initialize the head throughout the subsurface and surface domains, and HGS then calculates the velocity, flux, water saturation, etc. The re-initialized model then becomes identical to the terminated model at the final output time. However, this method requires you to make several adjustments to the model input files and re-run grok.exe. Most unfortunately, you would then have to spend considerable time updating your output files and concatenating data from multiple model runs.
Figure 1: Example ‘parallelindx.dat’ file. Note that only the final setting (__Simulation_Restart, the “restart index”) is used in the model restart procedure. Restart mode is activated if this value is greater than 1 when phgs.exe is launched.
The new restart functionality takes care of all these issues for you. To understand how this process works, there are a few "behind the scenes" things that you should understand:
When you initiate a model run, phgs.exe will create a file called ‘parallelindx.dat’. This file is used primarily to specify whether a model will be executed in parallel mode, but it does also include a flag (__Simulation_Restart) which indicates whether a model will be run from scratch (i.e., time = 0) or whether it should be restarted from a later time.
At every timestep, phgs.exe will update the binary ‘prefixo.restart’ file, which records the latest head (and concentration if transport is active) across all active model domains at the latest timestep.
At every timestep, phgs.exe will update the ‘restart_file_info.dat’ file, which records information required to initiate the model restart.
Figure 2: Example ‘restart_file_info.dat’ file. This file is updated automatically and does not require user input, unless the command Restart write off has been included (see Figure 3 below).
Using these three files, HydroGeoSphere is now able to initiate a model restart without requiring any changes to your *.grok file or to any other inputs (apart from the parallelindx.dat and restart_file_info.dat files).
Here is a quick overview of how the restart process works:
Figure 3: Reference Manual entry for Restart write off command
You run a model, it terminates prematurely, and you want to restart it from where you left off.
You open the parallelindx.dat file and change the restart index (i.e., __Simulation_Restart) to any integer greater than 1. Save and close the file.
When phgs.exe is initialized, it will recognize that a model restart is required.
An optional step is to open 'restart_file_info.dat' and change the __append_to_output_files logical flag.
By default, this flag is set to ‘T=true’, which means that all regularly generated output files (e.g., 'prefixo.lst', observation point/well outputs, species mass balance files, boundary condition output files, etc.) will have results appended to the existing file.
Setting this flag to ‘F=false’ will create new output files that incorporate the __Simulation_Restart value into their file names. For example, if the restart index is set to 2, the new *.lst file would be named 'prefixo.0002.lst'.
Run phgs.exe again. No further changes are needed!
phgs.exe recognizes that a restart is required based on the __Simulation_Restart value within the 'parallelindx.dat' file.
phgs.exe will read 'restart_file_info.dat' to determine the latest successfully completed timestep (__initial_time), the timestep size for the next timestep (__initial_timestep), the next timestep target (__ntloop_target), the starting index number for future binary output files (__iphead), and a logical flag indicating whether results should be appended to the existing output files or written to new ones (__append_to_output_files(F=false,T=true)). If transport is active, a starting index number for the concentration binary outputs (__ipconc) will also be included.
The 'prefixo.restart' file is used to update the initial heads and concentrations throughout the model, allowing the model to resume seamlessly from where it was terminated.
The model will carry on as though it never failed. Output files will be either appended, or new versions (with the __Simulation_Restart index in the filename) will be created.
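The file-naming rule for the non-append case can be sketched in a few lines of Python. This is only an illustration based on the single example given above ('prefixo.0002.lst' for a restart index of 2); the function name is hypothetical, and the four-digit zero padding is inferred from that one example rather than taken from the HGS documentation. Other output file types may place the index differently.

```python
def restart_output_name(prefix: str, extension: str, restart_index: int) -> str:
    """Build the name of a text output file written by a restarted run
    when __append_to_output_files is set to F.

    Assumption: the restart index is zero-padded to four digits and placed
    between the 'prefixo' stem and the extension, as in 'prefixo.0002.lst'.
    """
    return f"{prefix}o.{restart_index:04d}.{extension}"


print(restart_output_name("prefix", "lst", 2))  # prefixo.0002.lst
```

Keeping the index in the file name means that each restart leaves the earlier run's files untouched, which can be handy when you want to compare segments of a run.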
In certain situations you may want to disable updating of the restart files ‘prefixo.restart’ and ‘restart_file_info.dat’ during a simulation. You can do so via the command Restart write off (restart files are always updated at the end of a successful simulation regardless of this command). Use this command with caution, however, since if your simulation does not complete successfully, you will be unable to restart it via the restart feature.
Figure 4: The 'prefixo.eco' file can be used to identify the next target time and initial head index #s listed in the 'restart_file_info.dat' file.
Figure 5: The ‘restart_file_info.dat’ file for the abdul_transport problem after exactly 26 timesteps have been solved successfully.
You can easily test this new restart procedure yourself using any of the readily available verification models. In the images below I have highlighted some of the resulting files after running the abdul_transport verification problem and terminating the model run after it had successfully passed t=1800 (i.e., 26 timesteps). To terminate a model run prematurely, press CTRL+C in the command line while phgs.exe is running.
Before running phgs.exe, we can open the 'abdul_transport.eco' file to review the list of target times (see Figure 4).
After running phgs.exe and terminating the model after exactly 26 timesteps, we can see that the 'restart_file_info.dat' file has been updated (Figure 5). When this model is restarted, it will use a new initial time of 1800 seconds and an initial timestep of 100. The next target time index is 7 (i.e., t=3000).
The index numbers for binary outputs are given by the __iphead and __ipconc settings. These ensure that the correct output time index is applied to the end of binary output file names. In this case, since __iphead and __ipconc have a value of 6, the next output time (t=3000) will have binary outputs in the format 'prefixo.variable_domain.0007' (e.g., 'abdul_transporto.head_pm.0007'). As a result, the binary output file numbering follows on from the existing outputs without missing a beat.
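As a quick illustration of that numbering, here is a minimal Python sketch. The function name is hypothetical; the only facts it relies on are the ones stated above, namely that a stored index of 6 leads to a next file ending in '.0007' and that the file name follows the 'prefixo.variable_domain.NNNN' pattern.

```python
def next_binary_output_name(prefix: str, variable_domain: str, last_index: int) -> str:
    """Name of the next binary output file after a restart.

    last_index is the value of __iphead (or __ipconc) read from
    'restart_file_info.dat'; the next output uses last_index + 1.
    """
    next_index = last_index + 1
    return f"{prefix}o.{variable_domain}.{next_index:04d}"


print(next_binary_output_name("abdul_transport", "head_pm", 6))
# abdul_transporto.head_pm.0007
```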
Finally, non-binary output files for this model will be appended to existing files, as indicated by the ‘__append_to_output_files’ setting (see Figure 5).
To restart this model, simply open the ‘parallelindx.dat’ file and change the ‘__Simulation_Restart’ setting to any integer greater than 1, then save the file and launch phgs.exe from the command line.
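If you restart often (for example under a run-time limit on a cluster), the manual edit could be scripted. The Python sketch below is a rough illustration only. It assumes that 'parallelindx.dat' stores each setting as a label line followed by a value line; that layout is an assumption on my part, so check it against your own file before relying on this. The function name is hypothetical, and the function works on the file text so that it can be tried without touching a real model folder.

```python
def set_restart_index(text: str, value: int = 2) -> str:
    """Return the file text with the value under __Simulation_Restart replaced.

    Assumed layout (not confirmed here): the label sits on its own line
    and its value sits on the following line.
    """
    if value <= 1:
        raise ValueError("restart index must be an integer greater than 1")
    lines = text.splitlines()
    # Stop one line early so that a value line always exists after the label.
    for i, line in enumerate(lines[:-1]):
        if line.strip() == "__Simulation_Restart":
            lines[i + 1] = f" {value}"
            return "\n".join(lines) + "\n"
    raise ValueError("__Simulation_Restart label with a value line not found")


sample = "__Simulation_Restart\n 1\n"
print(set_restart_index(sample, 2))
```

To apply it to a real file you could read the text with `pathlib.Path("parallelindx.dat").read_text()`, pass it through the function, and write the result back with `write_text()`, ideally after making a backup copy.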
Figure 6: ‘abdul_transo.water_balance.dat’ after model termination/restart/completion.
You will see that all non-binary output files are appended, showing no sign of the model terminating. For example, Figure 6 shows the resulting ‘abdul_transo.water_balance.dat’ file after the restarted model ran to completion.
We hope that this new feature helps ease some of the frustrations of unexpected model terminations/failures. There may be some model states that are held in 'volatile' memory and would be lost in case of a model crash, but this new restart feature should be valid and appropriate for the vast majority of models. If you do notice any unusual behaviour after a model restart, please do let us know. And if you have any questions about this new feature, don’t hesitate to ask. We're here to help!