UtilsWorkflow
AutoRPE.UtilsWorkflow.BinaryTree
- class AutoRPE.UtilsWorkflow.BinaryTree.BinaryTree(root: Job, accuracy_test: object, local_folder: str, max_running_jobs: int = 100)[source]
Bases: object
Represents a binary tree of jobs for an experiment, where each job can have various statuses. Jobs are processed based on their statuses and are moved through different stages in the workflow.
PENDING: The job has been created and is ready to be launched.
RUNNING: The job is currently running.
SUCCESS: The job has completed and its results have been validated as successful.
FAILED: The job has completed with unsuccessful results and cannot be split.
SUSPENDED: The job has completed with unsuccessful results, has been split, and depends on the results of its descendant jobs.
Initializes the binary tree with a root job, accuracy test, local folder, and a maximum number of running jobs.
- Parameters:
root (Job) – The root job of the binary tree, which starts the experiment.
accuracy_test (object) – The accuracy test to evaluate the job results.
local_folder (str) – The local folder where the experiment data is stored.
max_running_jobs (int, optional) – The maximum number of jobs allowed to run concurrently (default is 100).
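The status lists described above can be pictured with a minimal sketch. It is not the AutoRPE implementation: `Status`, `StatusBoard`, and `move` are hypothetical names, and plain strings stand in for `Job` objects. It only illustrates the idea of keeping one list per status and moving a job between lists.

```python
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    SUSPENDED = "SUSPENDED"


class StatusBoard:
    """Hypothetical stand-in: one list of jobs per status."""

    def __init__(self, root):
        self.jobs = {status: [] for status in Status}
        # A newly created root job starts out as PENDING.
        self.jobs[Status.PENDING].append(root)

    def move(self, job, target):
        # Remove the job from whichever list holds it, then append to target.
        for bucket in self.jobs.values():
            if job in bucket:
                bucket.remove(job)
        self.jobs[target].append(job)

    def all(self):
        # Combined list of jobs from every status.
        return [job for bucket in self.jobs.values() for job in bucket]


board = StatusBoard("root-job")
board.move("root-job", Status.RUNNING)
board.move("root-job", Status.SUSPENDED)
```

After the two moves the root job sits only in the SUSPENDED list, which matches the documented path of a job that ran, failed, and was split.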
- all()[source]
Returns a list of all jobs across different statuses (excluding disinherited jobs).
- Returns:
A combined list of jobs from all statuses.
- Return type:
list
- ban_variable(job: Job, var_id: str)[source]
Bans a variable from being used in a job or its children if it is part of the reduced precision set.
- Parameters:
job (Job) – The job where the variable is banned.
var_id (str) – The ID of the variable to be banned.
- Returns:
None
- check_children(job: Job)[source]
Checks if all child jobs of a given job have finished. If all children are finished, updates the job status accordingly.
- Parameters:
job (Job) – The job whose children are being checked.
- Returns:
None
- check_pending(job: Job)[source]
Checks if a job is already in the queue. If not, it is submitted for execution.
- Parameters:
job (Job) – The job to be checked and potentially started.
- Returns:
None
- check_running(job: Job)[source]
Checks if a running job has finished and updates its status accordingly.
- Parameters:
job (Job) – The job to be checked.
- Returns:
None
- checkpoint(incremental_id: int)[source]
Saves a checkpoint of the experiment at the current status.
- Parameters:
incremental_id (int) – The ID of the job at which to save the checkpoint.
- Returns:
None
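A checkpoint of this kind could be written with `pickle`, as in the sketch below. The file name pattern, the shape of the saved state, and both helper functions are assumptions made for illustration; the actual on-disk layout used by `checkpoint` is not documented here.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, folder, incremental_id):
    # The file name pattern is hypothetical.
    path = os.path.join(folder, f"checkpoint_{incremental_id}.pkl")
    with open(path, "wb") as handle:
        pickle.dump(state, handle)
    return path


def load_checkpoint(path):
    # Restore exactly what save_checkpoint wrote.
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as folder:
    # Job IDs grouped by status; a stand-in for the real experiment state.
    state = {"PENDING": [3, 4], "SUCCESS": [1], "FAILED": [2]}
    saved_path = save_checkpoint(state, folder, incremental_id=4)
    restored = load_checkpoint(saved_path)
```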
- fail(job: Job)[source]
Handles the failure of a job, either by resubmitting it, subdividing it into children, or marking it as failed.
- Parameters:
job (Job) – The job that has failed.
- Returns:
None
- manage_reshuffled_job(job: Job)[source]
Manages a reshuffled job by determining whether it can be re-executed based on its child jobs’ statuses.
- Parameters:
job (Job) – The reshuffled job to be managed.
- Returns:
None
- move_to_failed(job: Job)[source]
Moves a job to the FAILED list after it has failed.
- Parameters:
job (Job) – The job to be moved to failed.
- Returns:
None
- move_to_pending(job: Job)[source]
Moves a job to the PENDING list, indicating it is ready to be re-executed.
- Parameters:
job (Job) – The job to be moved to pending.
- Returns:
None
- move_to_success(job: Job)[source]
Moves a job to the SUCCESS list after it has completed successfully.
- Parameters:
job (Job) – The job to be moved to success.
- Returns:
None
- move_to_suspended(job: Job)[source]
Moves a job to the SUSPENDED list, indicating it has failed and needs further evaluation.
- Parameters:
job (Job) – The job to be moved to suspended.
- Returns:
None
- print_status()[source]
Prints the current status of all jobs in the experiment, showing the number of jobs in each list.
- Returns:
None
AutoRPE.UtilsWorkflow.BinaryTreeSearch
- class AutoRPE.UtilsWorkflow.BinaryTreeSearch.BinaryTreeSearch(communicator: SSH, local_folder: str, analysis_status: dict, job_template: str, vault: Vault, original_precision_level: int, reduced_precision_level: int, accuracy_test: object, max_running_jobs: int, output_filename: str, experiment_name: str = 'BinaryTreeSearch')[source]
Bases: object
Initializes the experiment with given parameters and sets up necessary components.
- Parameters:
communicator (SSH) – The communicator object used to manage communication between the local and remote machines.
local_folder (str) – The local directory to store data.
analysis_status (dict) – Dictionary used for tracking the experiment’s progress (status of each job of the analysis).
job_template (str) – Filename (path) of template used to generate jobs for the analysis.
vault (Vault) – The vault containing variables used for the analysis.
original_precision_level (int) – The precision level to use in the original run (dp=52, sp=23, hp=10).
reduced_precision_level (int) – The precision level for the reduced precision run (dp=52, sp=23, hp=10).
accuracy_test (object) – Object containing the accuracy test to validate results.
max_running_jobs (int) – The maximum number of jobs that can run simultaneously.
output_filename (str) – The name of the output file where results will be saved.
experiment_name (str, optional) – The name of the experiment. Defaults to “BinaryTreeSearch”.
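The precision levels quoted above (dp=52, sp=23, hp=10) are the numbers of explicitly stored significand bits in the IEEE 754 double, single, and half precision formats. The sketch below only records that mapping; the dictionary and both helper functions are hypothetical names, and whether intermediate values are accepted is not stated in this reference.

```python
# Explicitly stored significand (mantissa) bits of the IEEE 754 binary formats.
SIGNIFICAND_BITS = {"dp": 52, "sp": 23, "hp": 10}


def precision_level(name):
    """Return the significand bit count for a precision label."""
    return SIGNIFICAND_BITS[name]


def is_reduction(original_level, reduced_level):
    # A reduced-precision run keeps fewer significand bits than the original.
    return reduced_level < original_level
```

For example, an experiment going from double to single precision would pass `original_precision_level=52` and `reduced_precision_level=23`.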
- binary_tree_search(id_forced_var: list = [], id_banned_var: list = [])[source]
Runs the binary tree search to evaluate the precision of variables.
- Parameters:
id_forced_var (list, optional) – List of variable IDs that are forced into the analysis.
id_banned_var (list, optional) – List of variable IDs that are excluded from the analysis.
- Returns:
The root job of the binary tree after the search is complete.
- Return type:
Job
- initial_check(forced_id: list = [], banned_id: list = [])[source]
Performs an initial test run with original precision to ensure the accuracy test works.
- Parameters:
forced_id (list, optional) – List of variable IDs that are forced into the test.
banned_id (list, optional) – List of variable IDs that are excluded from the test.
- Returns:
None
- Raises:
AssertionError – If the basic test doesn’t pass, an error is raised.
- print_root_configuration()[source]
Prints the configuration of the root job to the specified output file.
- recover_checkpoint()[source]
Recovers the analysis state from a previously saved checkpoint.
- Returns:
The binary tree object after recovery from the checkpoint.
- Return type:
BinaryTree
- root_job_succeeded()[source]
Checks if the root job of the binary tree search has successfully completed.
- Returns:
True if the root job succeeded, otherwise False.
- Return type:
bool
- Raises:
AssertionError – If the root job failed, an error is raised.
- setup(id_reduced_precision: list, forced_ids: list, id_banned_var: list)[source]
Sets up the experiment driver in preparation for starting the analysis.
- Parameters:
id_reduced_precision (list) – List of variable IDs to use with reduced precision.
forced_ids (list) – List of variable IDs that must be included in the analysis.
id_banned_var (list) – List of variable IDs that should be kept in original precision.
- Returns:
None
- class AutoRPE.UtilsWorkflow.BinaryTreeSearch.Counter[source]
Bases: object
A simple counter class to track increments.
- class AutoRPE.UtilsWorkflow.BinaryTreeSearch.GracefulKiller[source]
Bases: object
Lets the user stop an analysis at any point. Stopping creates a pause_checkpoint.pkl file, which can be used later to restart the analysis from the same point.
It uses the signal library to catch signals.
- exit_gracefully(signum: int, frame)[source]
Handle termination signals by setting the kill flag and printing a message.
- Parameters:
signum (int) – The signal number received.
frame (frame object) – The current stack frame (unused in this implementation).
- kill_now = False
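The general pattern behind a class like this can be sketched with the standard `signal` module. `KillFlag` and `run_steps` are hypothetical names, and the choice of SIGINT and SIGTERM is an assumption; the signals that `GracefulKiller` actually registers are not listed here.

```python
import signal


class KillFlag:
    """Hypothetical stand-in for the signal-catching pattern described above."""

    kill_now = False

    def __init__(self):
        # Route SIGINT (Ctrl+C) and SIGTERM to the same handler.
        # signal.signal must be called from the main thread.
        signal.signal(signal.SIGINT, self.exit_gracefully)
        signal.signal(signal.SIGTERM, self.exit_gracefully)

    def exit_gracefully(self, signum, frame):
        # Only set a flag; the main loop decides when it is safe to stop.
        print(f"Received signal {signum}; stopping after the current step.")
        self.kill_now = True


def run_steps(killer, steps):
    done = []
    for step in steps:
        if killer.kill_now:
            break  # a real driver would write its checkpoint here
        done.append(step)
    return done
```

Setting a flag in the handler, instead of exiting immediately, lets the loop finish its current step and save a consistent state before stopping.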
AutoRPE.UtilsWorkflow.Communicator
- class AutoRPE.UtilsWorkflow.Communicator.SSH(user: str, host: str, remote_scratch: str = '')[source]
Bases: object
Initializes an SSH connection and sets up SFTP.
- Parameters:
user (str) – The username for SSH authentication.
host (str) – The host address of the remote server.
remote_scratch (str, optional) – Path to the remote scratch directory. Defaults to “”.
- connect()[source]
Establishes an SSH connection to the remote host.
- Returns:
True if the connection is established, False otherwise.
- Return type:
bool
- Raises:
IOError – If the connection cannot be established.
- execute(command: str)[source]
Executes a command on the remote server.
- Parameters:
command (str) – The command to execute.
- Returns:
A tuple of file-like objects for the command’s stdin, stdout, and stderr.
- Return type:
(stdin, stdout, stderr)
- get(remote_path: str, local_path: str)[source]
Downloads a file from the remote server to the local machine.
- Parameters:
remote_path (str) – The remote file path.
local_path (str) – The local destination file path.
- is_remote_file(path: str)[source]
Checks if a given path on the remote server is a file.
- Parameters:
path (str) – The remote file path to check.
- Returns:
True if the path is a file, False otherwise.
- Return type:
bool
- list_dir(dir_path: str)[source]
Lists the contents of a remote directory.
- Parameters:
dir_path (str) – The remote directory path to list.
- Returns:
A list of file and directory names in the remote directory.
- Return type:
list[str]
- AutoRPE.UtilsWorkflow.Communicator.mkdir_p(sftp: SFTPClient, remote_directory: str)[source]
Recursively creates directories on the remote server if they do not exist.
- Parameters:
sftp (SFTPClient) – The SFTP client used for communication with the remote server.
remote_directory (str) – The remote directory path to create.
- Returns:
True if any directories were created, False otherwise.
- Return type:
bool
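A recursive remote `mkdir -p` can be sketched as follows, assuming an SFTP client that offers `stat()` (raising `IOError` for a missing path) and `mkdir()`, as paramiko's `SFTPClient` does. `mkdir_p_sketch` and `FakeSFTP` are hypothetical; the fake client exists only so the example runs without a server, and the real `mkdir_p` may differ in detail.

```python
import posixpath


def mkdir_p_sketch(sftp, remote_directory):
    """Create remote_directory and missing parents; return True if anything was created."""
    if remote_directory in ("", "/"):
        return False  # nothing to create at the root
    try:
        sftp.stat(remote_directory)
        return False  # already exists
    except IOError:
        # Make sure the parent exists first, then create this level.
        parent = posixpath.dirname(remote_directory.rstrip("/"))
        mkdir_p_sketch(sftp, parent)
        sftp.mkdir(remote_directory)
        return True


class FakeSFTP:
    """In-memory stand-in exposing only stat() and mkdir()."""

    def __init__(self):
        self.dirs = {"/scratch"}

    def stat(self, path):
        if path.rstrip("/") not in self.dirs:
            raise IOError(path)

    def mkdir(self, path):
        self.dirs.add(path.rstrip("/"))
```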
AutoRPE.UtilsWorkflow.ExceptionManager
Different ways of handling exceptions.
- AutoRPE.UtilsWorkflow.ExceptionManager.children_fail_test(analyzed_job: Job, binary_tree: BinaryTree)[source]
Handles failure of children jobs by separating them into batches.
- Parameters:
analyzed_job (Job) – The job being analyzed.
binary_tree (BinaryTree) – The binary tree managing the job states.
- Returns:
None
- AutoRPE.UtilsWorkflow.ExceptionManager.children_submit_batch(analyzed_job: Job, binary_tree: BinaryTree)[source]
Submits a batch of children jobs for analysis.
- Parameters:
analyzed_job (Job) – The job whose children are being submitted.
binary_tree (BinaryTree) – The binary tree managing job states.
- Returns:
None
- AutoRPE.UtilsWorkflow.ExceptionManager.choose_child(analyzed_job: Job)[source]
Selects the child job for further analysis based on variable restrictions.
- Parameters:
analyzed_job (Job) – The parent job with children to analyze.
- Returns:
The selected success_child and failed_child.
- Return type:
tuple
- AutoRPE.UtilsWorkflow.ExceptionManager.divide_and_force(analyzed_job: Job, binary_tree: BinaryTree)[source]
Handles exceptions by dividing and forcing variables for analyzed jobs.
- Parameters:
analyzed_job (Job) – The job being analyzed.
binary_tree (BinaryTree) – The binary tree managing the job states.
- Returns:
None
- AutoRPE.UtilsWorkflow.ExceptionManager.resolve_exception(analyzed_job: Job, binary_tree: BinaryTree)[source]
Resolves exceptions during job analysis.
- Parameters:
analyzed_job (Job) – The job that encountered an exception.
binary_tree (BinaryTree) – The binary tree managing job states.
- Returns:
None
AutoRPE.UtilsWorkflow.Job
- class AutoRPE.UtilsWorkflow.Job.Job(id_reduced_precision: list, forced_ids: list, banned_variables, analysis_variables: list, reduced_precision_level: int, communicator: SSH, vault: Vault, template: str, local_folder: str, result_filename: str, counter: Counter, analysis_status: str)[source]
Bases: RemoteManager
Represents a binary search job for precision analysis, extending the RemoteManager.
- ancestors()[source]
Retrieves all ancestor jobs of the current job.
- Returns:
List of ancestor jobs.
- Return type:
list[Job]
- create_child(_id_subset, _index)[source]
Creates a child job with a subset of variables.
- Parameters:
_id_subset (list) – Subset of reduced precision IDs.
_index (int) – Child index in the hierarchy.
- Returns:
Newly created child job.
- Return type:
Job
- create_children_batch()[source]
Groups child jobs into batches for submission, ordered by banned variables.
- descendants()[source]
Retrieves all descendant jobs recursively.
- Returns:
Descendants of the current job.
- Return type:
list[Job]
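The two traversals, `ancestors()` and `descendants()`, can be illustrated on a generic tree. `Node` is a hypothetical class with a parent link and a list of children; the ordering of the returned lists is an assumption, since the reference does not specify it.

```python
class Node:
    """Hypothetical tree node with a parent link and a list of children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def ancestors(self):
        # Walk the parent links up to the root, nearest ancestor first.
        found = []
        node = self.parent
        while node is not None:
            found.append(node)
            node = node.parent
        return found

    def descendants(self):
        # Depth-first: each child followed by that child's own descendants.
        found = []
        for child in self.children:
            found.append(child)
            found.extend(child.descendants())
        return found


root = Node("root")
left = Node("left", root)
right = Node("right", root)
leaf = Node("leaf", left)
```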
- divide_function_level(variables: list)[source]
Divides variables into subsets based on function levels.
- Parameters:
variables (list) – Variables to be divided.
- Returns:
List of subsets of variable IDs.
- Return type:
list
- divide_set_cluster()[source]
Divides variables into subsets based on clusters and module hierarchy.
- Returns:
Two subsets of variable IDs.
- Return type:
tuple
- divide_set_module(variables: list)[source]
Divides a set of variables into two groups based on module and routine.
- Parameters:
variables (list) – Variables to be divided.
- Returns:
Two subsets of variable IDs.
- Return type:
tuple
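One way to split a variable set in two while respecting module and routine boundaries is sketched below. This is an assumed strategy, not the documented algorithm: the `Variable` record, the helper `_halve`, and the fallback order (modules first, then routines, then a plain split) are all hypothetical.

```python
from collections import namedtuple

Variable = namedtuple("Variable", ["id", "module", "routine"])  # hypothetical record


def _halve(groups):
    # Split an ordered list of groups into a first and a second half.
    middle = len(groups) // 2
    first = [var.id for group in groups[:middle] for var in group]
    second = [var.id for group in groups[middle:] for var in group]
    return first, second


def divide_by_module(variables):
    """Return two lists of IDs, keeping modules (or routines) together when possible."""
    for key in (lambda v: v.module, lambda v: (v.module, v.routine)):
        groups = {}
        for var in variables:
            groups.setdefault(key(var), []).append(var)
        if len(groups) > 1:
            return _halve(list(groups.values()))
    # Single routine: fall back to splitting the plain list in half.
    return _halve([[var] for var in variables])
```

Grouping before splitting keeps variables that interact within the same module or routine in the same half wherever the set allows it.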
- fail_child(bad_child: Job)[source]
Marks a child job as failed and propagates failure to its descendants.
- Parameters:
bad_child (Job) – The failed child job.
- find_child_set(analysis_set_dict: dict)[source]
Finds and stores active children in the analysis set dictionary.
- Parameters:
analysis_set_dict (dict) – Dictionary to store active children.
- get_cluster_id()[source]
Determines the cluster IDs of variables under reduced precision.
- Returns:
Unique cluster IDs, empty if no clusters exist.
- Return type:
list
- get_variables_reduced_precision()[source]
Retrieves variables under reduced precision from the vault.
- Returns:
Variables with reduced precision.
- Return type:
list
- has_cluster()[source]
Checks whether the job contains clustered variables.
- Returns:
True if clusters exist, False otherwise.
- Return type:
bool
- kind_of_exception()[source]
Categorizes the exception type when merging sets fails. Types of exception:
No exception: The job has not failed.
Intra-routine: All variables belong to the same routine.
Intra-module: All variables belong to the same module, but not to the same routine.
Inter-module: The variables belong to different modules.
- Returns:
Type of exception (‘IntraRoutine’, ‘IntraModule’, ‘InterModule’, etc.).
- Return type:
str
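The classification above can be sketched as a small function. The input shape (a list of `(module, routine)` pairs plus a `failed` flag) and the `"NoException"` label are assumptions; only the `IntraRoutine`, `IntraModule`, and `InterModule` strings come from this reference.

```python
def classify_exception(failed, variables):
    """variables: list of (module, routine) pairs for the reduced-precision set."""
    if not failed:
        return "NoException"  # hypothetical label; the source names only three
    modules = {module for module, _ in variables}
    routines = {(module, routine) for module, routine in variables}
    if len(routines) == 1:
        return "IntraRoutine"  # one module and one routine
    if len(modules) == 1:
        return "IntraModule"  # one module, several routines
    return "InterModule"  # several modules
```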
- plot_variables(var_name: str)[source]
Plots a variable’s values over time and compares to error limits.
- Parameters:
var_name (str) – Name of the variable to plot.
- Returns:
None on success; otherwise an exception is raised.
- Return type:
None | Exception
AutoRPE.UtilsWorkflow.RemoteManager
- class AutoRPE.UtilsWorkflow.RemoteManager.RemoteManager(id_reduced_precision: list, forced_ids: list, reduced_precision_level: list, analysis_variables: list, communicator: SSH, vault: Vault, template: str, result_filename: str, counter: Counter, local_folder: str, analysis_status: str)[source]
Bases: object
Initializes the RemoteManager with the given parameters.
- Parameters:
id_reduced_precision (list) – List of reduced precision variable IDs.
forced_ids (list) – List of forced variable IDs, if any.
reduced_precision_level (int) – Precision level for the variables.
analysis_variables (list) – List of analysis variables.
communicator (SSH) – The communicator object for remote operations.
vault (Vault) – Vault containing variable data.
template (str) – Path to the job template.
result_filename (str) – Filename for the results.
counter (Counter) – Counter object for incrementing job IDs.
local_folder (str) – Local directory for storing job data.
analysis_status (dict) – Dictionary containing the status of various analyses.
- check_running()[source]
Checks if the job is currently running on the remote scheduler.
- Returns:
The current status of the job (e.g., “RUNNING”, “PENDING”).
- Return type:
str
- check_submitted()[source]
Checks if the job has already been submitted to the remote scheduler.
- Returns:
The job’s current status (e.g., “PENDING” or “TO_SUBMIT”).
- Return type:
str
- evaluate_simulation(accuracy_test: object)[source]
Evaluates the accuracy of the simulation using the provided accuracy test.
- Parameters:
accuracy_test (object) – The accuracy test to evaluate the simulation.
- Returns:
The result of the accuracy test evaluation.
- Return type:
str
- generate_jobscript()[source]
Generates a job script based on the template and current job parameters.
- Returns:
A job script ready for submission to the scheduler.
- Return type:
str
- generate_namelist()[source]
Generates a namelist for the variables being analyzed, setting their precision according to the reduced precision level.
- Returns:
A string representing the generated namelist.
- Return type:
str
- get_result(accuracy_test: object)[source]
Retrieves the result of the simulation, either from the dictionary or from the remote system.
- Parameters:
accuracy_test (object) – The accuracy test to evaluate the result.
- Returns:
The simulation result, either from the dictionary or from a remote file.
- Return type:
str
- property hash
Generates a unique hash based on the list of variable IDs and reduced precision level.
- Returns:
A hash string representing the unique identifier.
- Return type:
str
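A hash of this kind could be derived as in the sketch below. The digest algorithm (SHA-256), the sorting of IDs, and the function name `job_hash` are assumptions; the reference only states that the hash depends on the list of variable IDs and the reduced precision level.

```python
import hashlib


def job_hash(variable_ids, reduced_precision_level):
    # Sort so the same set of IDs always yields the same digest.
    payload = ",".join(str(i) for i in sorted(variable_ids))
    payload += f"|{reduced_precision_level}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

An identifier built this way lets two jobs with the same variable set and precision level be recognized as the same simulation, so a stored result can be looked up instead of rerunning the job.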
- job_status()[source]
Retrieves the current status of the job based on its ID.
- Returns:
The current status of the job (e.g., “COMPLETED”, “PENDING”).
- Return type:
str
- property remote_status
Retrieves the remote status of the job.
- Returns:
Current job status.
- Return type:
str
- run_job(force: bool = False)[source]
Submits the job to the remote scheduler if it has not been submitted or if forced.
- Parameters:
force (bool) – If True, forces the submission of the job even if it has already been submitted.
- Returns:
The status of the job after attempting to run.
- Return type:
str
- Raises:
ExceptionNotManaged – If the job status is unknown.
- property status
Retrieves the current status of the job.
- Returns:
Current job status.
- Return type:
str
- submit_job(check_low=True)[source]
Submits the job to the scheduler and returns the job ID.
- Parameters:
check_low (bool) – If True, checks if the job is queued and handles any low-level errors.
- Returns:
The job ID assigned by the scheduler.
- Return type:
str
- update_parameters()[source]
Updates the parameters used for remote job submission, including paths and job-related attributes.
- property variable_set
Returns the set of variable IDs, including forced IDs if provided.
- Returns:
List of variable IDs, including forced IDs.
- Return type:
list