Multi-Party Resource Coordination
1. Description
Resources are the basic engine resources: mainly the CPU and memory resources of the compute engine, and the CPU and network resources of the transport engine. Currently, only the CPU resources of the compute engine are managed.
2. Total resource allocation
- The current version does not automatically detect the resource size of the base engine, so you set it in the configuration file `$FATE_PROJECT_BASE/conf/service_conf.yaml`; the value is the amount of engine resources allocated to the FATE cluster
- When it starts, `FATE Flow Server` reads all base engine information from the configuration file and registers it in the database table `t_engine_registry`
- Once `FATE Flow Server` is running, the resource configuration can be changed by restarting `FATE Flow Server` or by reloading the configuration with the command `flow server reload`
- `total_cores` = `nodes` * `cores_per_node`
Example
`fate_on_standalone`: runs a standalone engine on the same machine as `FATE Flow Server` and is generally used for quick experiments. `nodes` is usually set to 1 and `cores_per_node` to the number of CPU cores of the machine; it can also be moderately over-provisioned.
fate_on_standalone:
  standalone:
    cores_per_node: 20
    nodes: 1
`fate_on_eggroll`: configured according to the actual deployment of the EggRoll cluster. `nodes` is the number of node manager machines and `cores_per_node` is the average number of CPU cores per node manager machine.
fate_on_eggroll:
  clustermanager:
    cores_per_node: 16
    nodes: 1
  rollsite:
    host: 127.0.0.1
    port: 9370
`fate_on_spark`: configured according to the resources that the Spark cluster allocates to the FATE cluster. `nodes` is the number of Spark nodes and `cores_per_node` is the average number of CPU cores per node allocated to the FATE cluster.
fate_on_spark:
  spark:
    # default use SPARK_HOME environment variable
    home:
    cores_per_node: 20
    nodes: 2
Note: Make sure that the Spark cluster actually allocates the corresponding amount of resources to the FATE cluster. If the Spark cluster allocates fewer resources than are configured in FATE here, the FATE job can still be submitted, but when FATE Flow submits a task to the Spark cluster, the task will not actually execute because the Spark cluster has insufficient resources.
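As a quick check of the `total_cores` formula, the following minimal sketch applies it to the three example configurations above. The dictionary is a hand-written mirror of those YAML snippets, and the names `ENGINE_CONFIGS` and `total_cores` are illustrative only; they are not part of FATE Flow.

```python
# Hypothetical in-memory copy of the three service_conf.yaml examples above.
# Only the compute engine entry of each section is kept.
ENGINE_CONFIGS = {
    "fate_on_standalone": {"standalone": {"cores_per_node": 20, "nodes": 1}},
    "fate_on_eggroll": {"clustermanager": {"cores_per_node": 16, "nodes": 1}},
    "fate_on_spark": {"spark": {"cores_per_node": 20, "nodes": 2}},
}


def total_cores(engine_config: dict) -> int:
    """Return nodes * cores_per_node for one compute engine entry."""
    return engine_config["nodes"] * engine_config["cores_per_node"]


# Compute the total for the single engine entry of each section.
totals = {
    entrance: total_cores(next(iter(engines.values())))
    for entrance, engines in ENGINE_CONFIGS.items()
}
print(totals)
```

With the example values, the Spark entry yields 40 cores (2 nodes of 20 cores each), so the Spark cluster would need to provide at least that much to FATE.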
3. Job request resource configuration
We generally use `task_cores` and `task_parallelism` to configure the resources a job requests, for example:
{
  "job_parameters": {
    "common": {
      "job_type": "train",
      "task_cores": 6,
      "task_parallelism": 2,
      "computing_partitions": 8,
      "timeout": 36000
    }
  }
}
The total resources requested by the job are `task_cores` * `task_parallelism`. When a job is created, FATE Flow distributes it to each party. Based on the configuration above, the role the party runs, and the engine the party uses (set via `default_engines` in `$FATE_PROJECT_BASE/conf/service_conf.yaml`), the actual parameters are calculated as follows.
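For the example configuration, the total request works out to 6 * 2 = 12 cores. A minimal sketch of that arithmetic, using only the standard `json` module (the variable names are illustrative):

```python
import json

# The job_parameters example from this section, embedded as a string.
JOB_CONF = """
{
  "job_parameters": {
    "common": {
      "job_type": "train",
      "task_cores": 6,
      "task_parallelism": 2,
      "computing_partitions": 8,
      "timeout": 36000
    }
  }
}
"""

common = json.loads(JOB_CONF)["job_parameters"]["common"]

# Total resources requested by the job: task_cores * task_parallelism.
requested_cores = common["task_cores"] * common["task_parallelism"]
print(requested_cores)  # 12
```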
4. The process of calculating the actual parameter adaptation for resource requests
- Calculate `request_task_cores`:
  - guest, host: `request_task_cores` = `task_cores`
  - arbiter, considering that the actual operation consumes very few resources: `request_task_cores` = 1
- Further calculate `task_cores_per_node`:
  - `task_cores_per_node` = max(1, `request_task_cores` / `task_nodes`)
  - If `eggroll_run` or `spark_run` resource configuration is used in the `job_parameters` above, the `task_cores` configuration is ignored and `task_cores_per_node` is taken from it instead:
    - `task_cores_per_node` = eggroll_run["eggroll.session.processors.per.node"]
    - `task_cores_per_node` = spark_run["executor-cores"]
- Convert to the parameters of the adapted engine (these are passed to the compute engine when the task runs):
  - fate_on_standalone/fate_on_eggroll:
    - eggroll_run["eggroll.session.processors.per.node"] = `task_cores_per_node`
  - fate_on_spark:
    - spark_run["num-executors"] = `task_nodes`
    - spark_run["executor-cores"] = `task_cores_per_node`
- The final result can be seen in the job's per-party runtime configuration, typically `$FATE_PROJECT_BASE/jobs/$job_id/$role/$party_id/job_runtime_on_party_conf.json`
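The adaptation steps in this section can be sketched as a single function. This is an illustration under stated assumptions, not FATE Flow's implementation: the function name `adapt_resources` is hypothetical, `task_nodes` is passed in by the caller because the text does not say how it is derived, the division is assumed to be integer division, and an explicit `eggroll_run` or `spark_run` value is assumed to override the computed value for every role.

```python
EGGROLL_KEY = "eggroll.session.processors.per.node"


def adapt_resources(role: str, common: dict, engine_entrance: str, task_nodes: int) -> dict:
    """Sketch of the parameter adaptation for one party (hypothetical helper)."""
    # Step 1: the arbiter uses very few resources, so it requests a single core.
    request_task_cores = 1 if role == "arbiter" else common["task_cores"]

    # Step 2: spread the request over the nodes (integer division is an assumption).
    task_cores_per_node = max(1, request_task_cores // task_nodes)

    result = {"request_task_cores": request_task_cores}
    if engine_entrance in ("fate_on_standalone", "fate_on_eggroll"):
        eggroll_run = dict(common.get("eggroll_run", {}))
        # An explicit engine setting takes precedence over task_cores.
        if EGGROLL_KEY in eggroll_run:
            task_cores_per_node = int(eggroll_run[EGGROLL_KEY])
        # Step 3: convert to the engine's own parameter.
        eggroll_run[EGGROLL_KEY] = task_cores_per_node
        result["eggroll_run"] = eggroll_run
    elif engine_entrance == "fate_on_spark":
        spark_run = dict(common.get("spark_run", {}))
        if "executor-cores" in spark_run:
            task_cores_per_node = int(spark_run["executor-cores"])
        spark_run["num-executors"] = task_nodes
        spark_run["executor-cores"] = task_cores_per_node
        result["spark_run"] = spark_run
    else:
        raise ValueError(f"unknown engine entrance: {engine_entrance}")

    result["task_cores_per_node"] = task_cores_per_node
    return result


# A guest with task_cores=6 on a two-node Spark allocation.
print(adapt_resources("guest", {"task_cores": 6}, "fate_on_spark", task_nodes=2))
```

Under these assumptions, the guest call above yields 3 executor cores on each of 2 executors, while an arbiter with the same job parameters would end up with a single core.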
5. Resource Scheduling Policy
- `total_cores`: see section 2, Total resource allocation
- `apply_cores`: see section 3, Job request resource configuration; `apply_cores` = `task_nodes` * `task_cores_per_node` * `task_parallelism`
- If every participant applies for resources successfully, that is, (`total_cores` - `apply_cores`) > 0 on each of them, the job's resource application succeeds
- If any participant fails to apply for resources, a resource rollback command is sent to the participants that have already applied successfully, and the job's resource application fails
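The all-or-nothing policy above can be sketched with a small in-memory model. This is only an illustration: the function names and the per-party dictionary of remaining cores are hypothetical stand-ins for the database-backed bookkeeping, and the strict `> 0` comparison simply mirrors the condition as written above.

```python
def compute_apply_cores(task_nodes: int, task_cores_per_node: int, task_parallelism: int) -> int:
    """apply_cores = task_nodes * task_cores_per_node * task_parallelism."""
    return task_nodes * task_cores_per_node * task_parallelism


def apply_job_resources(remaining_cores: dict, requests: dict) -> bool:
    """Try to reserve cores on every party; roll back if any party fails.

    remaining_cores maps party -> free cores and is updated in place.
    requests maps party -> apply_cores for this job.
    """
    granted = []
    for party, cores in requests.items():
        # Condition as stated in the text: (cores available - apply_cores) > 0.
        if remaining_cores[party] - cores > 0:
            remaining_cores[party] -= cores
            granted.append(party)
        else:
            # Roll back the parties that had already applied successfully.
            for done in granted:
                remaining_cores[done] += requests[done]
            return False
    return True


cluster = {"guest": 32, "host": 10}
cores = compute_apply_cores(task_nodes=1, task_cores_per_node=6, task_parallelism=2)
print(apply_job_resources(cluster, {"guest": cores, "host": cores}), cluster)
```

In this run the host has only 10 free cores for a 12-core request, so the reservation already made on the guest is rolled back and the application fails.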
6. Related commands
query
Queries the resources of the FATE system
flow resource query
Options
Returns
| parameter name | type | description | 
|---|---|---|
| retcode | int | return code | 
| retmsg | string | return message | 
| data | object | return data | 
Example
{
    "data": {
        "computing_engine_resource": {
            "f_cores": 32,
            "f_create_date": "2021-09-21 19:32:59",
            "f_create_time": 1632223979564,
            "f_engine_config": {
                "cores_per_node": 32,
                "nodes": 1
            },
            "f_engine_entrance": "fate_on_eggroll",
            "f_engine_name": "EGGROLL",
            "f_engine_type": "computing",
            "f_memory": 0,
            "f_nodes": 1,
            "f_remaining_cores": 32,
            "f_remaining_memory": 0,
            "f_update_date": "2021-11-08 16:56:38",
            "f_update_time": 1636361798812
        },
        "use_resource_job": []
    },
    "retcode": 0,
    "retmsg": "success"
}
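A response like the one above can be summarized in a few lines. The sketch below works on a trimmed copy of the example response and assumes that the cores currently in use are `f_cores` minus `f_remaining_cores`; the helper name `cores_in_use` is illustrative.

```python
import json

# Trimmed copy of the example response above.
RESPONSE = """
{
    "data": {
        "computing_engine_resource": {
            "f_cores": 32,
            "f_engine_name": "EGGROLL",
            "f_nodes": 1,
            "f_remaining_cores": 32
        },
        "use_resource_job": []
    },
    "retcode": 0,
    "retmsg": "success"
}
"""


def cores_in_use(response: dict) -> int:
    """Return f_cores - f_remaining_cores for the compute engine."""
    if response["retcode"] != 0:
        raise RuntimeError(response["retmsg"])
    engine = response["data"]["computing_engine_resource"]
    return engine["f_cores"] - engine["f_remaining_cores"]


parsed = json.loads(RESPONSE)
print(cores_in_use(parsed), len(parsed["data"]["use_resource_job"]))
```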
return
Returns the resources held by a job
flow resource return [options]
Options
| parameter name | required | type | description | 
|---|---|---|---|
| job_id | yes | string | job_id | 
Returns
| parameter name | type | description | 
|---|---|---|
| retcode | int | return code | 
| retmsg | string | return message | 
| data | object | return data | 
Example
{
    "data": [
        {
            "job_id": "202111081612427726750",
            "party_id": "8888",
            "resource_in_use": true,
            "resource_return_status": true,
            "role": "guest"
        }
    ],
    "retcode": 0,
    "retmsg": "success"
}
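When scripting around this command, it can be useful to check the response for parties whose resources were not returned. The sketch below assumes that `resource_return_status` being `true` means the return succeeded for that party; that reading, and the helper name `unreturned_parties`, are assumptions rather than documented behavior.

```python
import json

# The example response above, embedded as a string.
RESPONSE = """
{
    "data": [
        {
            "job_id": "202111081612427726750",
            "party_id": "8888",
            "resource_in_use": true,
            "resource_return_status": true,
            "role": "guest"
        }
    ],
    "retcode": 0,
    "retmsg": "success"
}
"""


def unreturned_parties(response: dict) -> list:
    """List (role, party_id) pairs whose resource_return_status is not true."""
    return [
        (item["role"], item["party_id"])
        for item in response["data"]
        if not item["resource_return_status"]
    ]


print(unreturned_parties(json.loads(RESPONSE)))  # []
```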