Introduction

Cyborg (previously known as Nomad) is an OpenStack project that aims to provide a general-purpose management framework for acceleration resources, i.e. various types of accelerators such as GPUs, FPGAs, ASICs, NPs, SoCs, NVMe/NOF SSDs, ODP, DPDK/SPDK and so on.

Overview

Cyborg internal architecture

cyborg-architecture
  • cyborg-api - cyborg-api is the Cyborg service that provides the REST API for the Cyborg project. It supports POST/PUT/DELETE/GET operations and interacts with cyborg-agent and the Cyborg database via cyborg-conductor.
  • cyborg-conductor - cyborg-conductor is the Cyborg service that coordinates interaction and database access between cyborg-api and cyborg-agent.
  • cyborg-agent - cyborg-agent is the Cyborg service responsible for interacting with accelerator backends via the Cyborg drivers. For now the only implementation in play is the Cyborg generic driver. It also handles communication with the Nova Placement service, and writes local accelerator events to a local cache.
  • Vendor drivers - Cyborg can be integrated with drivers for various accelerator device types, such as FPGA, GPU, NIC, and so forth. You are welcome to contribute your own driver for a new type of accelerator device.
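The agent/driver split above can be sketched in a few lines. Note this is an illustrative sketch: the `GenericDriver` base class shown here and the device-dict layout are assumptions for demonstration, not the exact Cyborg driver API.

```python
# Sketch of how vendor drivers might plug into cyborg-agent.
# GenericDriver and the device dict layout are illustrative assumptions.

class GenericDriver:
    """Base interface that cyborg-agent calls into."""

    def discover(self):
        """Return a list of accelerator devices found on this host."""
        raise NotImplementedError


class FakeGPUDriver(GenericDriver):
    """Toy vendor driver reporting one GPU-like device."""

    def discover(self):
        return [{"type": "GPU", "vendor": "10de", "product": "1eb8"}]


# The agent aggregates whatever each configured driver discovers.
agent_drivers = [FakeGPUDriver()]
devices = [dev for drv in agent_drivers for dev in drv.discover()]
```

A new accelerator type then only needs a driver class implementing `discover()` (and, as discussed later, a connect hook).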

Accelerator resource usage architecture

cyborg-nova-interaction

Blueprints for Cyborg

Cyborg API

This blueprint provides the initial design for the Cyborg API. The Cyborg API should support the basic operations concerning accelerators, and does not necessarily have to be a user-facing API at this early stage.

The api should support the following interfaces:

  • attach: either attaching existing physical accelerators or creating new virtual functions and then allocating to the VM
  • detach: detaching existing physical accelerators or deallocating virtual functions for the VM
  • list: list all the attached accelerators
  • update: make modifications to the accelerators (either the state or the device itself)
  • admin: for certain configurations that do not relate to the resource-centric CRUD operations.
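A thin client mapping these operations onto REST calls might look like the sketch below. The URL paths, payload shapes, and port are assumptions for illustration, not the finalized Cyborg API; `_req` returns the request tuple instead of issuing HTTP so the mapping stays visible.

```python
# Hypothetical client sketch for the attach/detach/list/update operations.
class CyborgClient:
    def __init__(self, endpoint):
        self.endpoint = endpoint.rstrip("/")

    def _req(self, method, path, body=None):
        # A real client would issue an HTTP request here.
        return (method, self.endpoint + path, body)

    def attach(self, vm_uuid, accelerator):
        return self._req("POST", "/accelerators",
                         {"vm": vm_uuid, "accelerator": accelerator})

    def detach(self, vm_uuid, acc_uuid):
        return self._req("DELETE", "/accelerators/%s" % acc_uuid,
                         {"vm": vm_uuid})

    def list(self):
        return self._req("GET", "/accelerators")

    def update(self, acc_uuid, patch):
        return self._req("PUT", "/accelerators/%s" % acc_uuid, patch)


client = CyborgClient("http://cyborg-api:6666/v1")
method, url, _ = client.list()
```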

Cyborg Agent

The Cyborg agent will reside on compute hosts and potentially other hosts that may make use of accelerators.

Agent responsibilities:

  • Inspect hardware to locate accelerators
  • Manage installing drivers, dependencies and other setup and teardown
  • Manage connecting the instance to the accelerator once it has spawned
  • Report data about available accelerators, status, and utilization to the Cyborg server

Hardware Discovery:
The host is scanned for accelerators, and the usage levels of existing accelerators, every few seconds; this information is reported in a heartbeat message to the Cyborg server to help manage scheduling and availability.
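The discovery/heartbeat loop can be sketched as below. The `scan_host()` and `send_heartbeat()` helpers are hypothetical stand-ins for the agent's real inspection and RPC reporting code.

```python
# Minimal sketch of the periodic discovery + heartbeat described above.
import time


def scan_host():
    # Would inspect PCI devices, sysfs, driver state, etc.;
    # here we return a fixed sample report.
    return [{"device": "GPU0", "utilization": 0.42}]


def send_heartbeat(report, sink):
    # Stand-in for the RPC call to the Cyborg server.
    sink.append({"timestamp": time.time(), "accelerators": report})


def run_once(sink):
    send_heartbeat(scan_host(), sink)


messages = []
run_once(messages)  # in the agent this would run every few seconds
```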

Hardware Management:
Ansible will be used to manage configuration files and other setup for each accelerator and its driver. Setup and teardown playbooks will be made for each set of supported hardware. A configuration change on Cyborg-managed hardware will boil down to running the uninstall playbook and the install playbook with different configuration options.
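A configuration change then reduces to two playbook runs. In the sketch below, the playbook file names and extra-vars keys are assumptions; only the `ansible-playbook` CLI invocation shape is standard.

```python
# Sketch of driving setup/teardown playbooks for a config change.
def build_playbook_command(playbook, config):
    extra_vars = ",".join("%s=%s" % kv for kv in sorted(config.items()))
    return ["ansible-playbook", playbook, "--extra-vars", extra_vars]


# A config change = teardown with the old options, setup with the new ones.
teardown = build_playbook_command("fpga_teardown.yml", {"bitstream": "old"})
setup = build_playbook_command("fpga_setup.yml", {"bitstream": "new"})
```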

Instance connection:
Once an instance that requires a specific accelerator on the host is spawned, the Cyborg server will send a message to the Cyborg agent to inform it of the new instance. Since the connection method may change dramatically between different accelerators, the driver should probably provide a connect function to call out to.
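That per-driver connect hook might be dispatched as sketched below; the driver registry and the `connect()` signature are illustrative assumptions.

```python
# Sketch of delegating instance attachment to the accelerator's driver.
def connect_instance(drivers, accelerator_type, instance_uuid, device):
    driver = drivers[accelerator_type]
    # Each driver knows its own attachment mechanics (PCI passthrough,
    # mdev, network attach, ...), so the agent just delegates.
    return driver.connect(instance_uuid, device)


class FakeFPGADriver:
    def connect(self, instance_uuid, device):
        return "attached %s to %s" % (device, instance_uuid)


result = connect_instance({"FPGA": FakeFPGADriver()},
                          "FPGA", "vm-123", "0000:3b:00.0")
```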

Workflow

Cyborg accelerator device discovery workflow

cyborg_acc_discover.drawio

The accelerator resource discovery workflow is as follows:

  1. Based on the configured driver information, call the specified driver's discover() method; the rest of this flow uses the NVIDIA GPU driver as an example.
  2. Run the system command lspci -nnn -D to enumerate all PCI devices.
  3. Match GPU device entries using VENDOR_ID and GPU_TAG.
  4. Extract and format the key fields from the command output with regular expressions.
  5. Depending on whether vGPU is enabled in the configuration file, decide which resource class to add and the number of available devices.
  6. Generate device-specific traits from VENDOR_ID and PRODUCT_ID.
  7. Instantiate DriverDevice().
  8. Return the instantiated DriverDevice() objects and send them via RPC to cyborg-conductor for resource reporting.
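Steps 2-4 can be sketched with a small `lspci` parser. The regex and the sample output line below are simplified assumptions; the real driver's VENDOR_ID/GPU_TAG matching is more involved.

```python
# Simplified sketch of matching NVIDIA GPUs in `lspci -nnn -D` output.
import re

NVIDIA_VENDOR_ID = "10de"
GPU_TAG = "3D controller"  # assumption: class string used to spot GPUs

# One line of sample `lspci -nnn -D` output (abbreviated).
SAMPLE = ("0000:3b:00.0 3D controller [0302]: NVIDIA Corporation "
          "TU104GL [Tesla T4] [10de:1eb8] (rev a1)")

LINE_RE = re.compile(
    r"^(?P<address>\S+) (?P<cls>[^\[]+) \[\d{4}\]: (?P<name>.+) "
    r"\[(?P<vendor>[0-9a-f]{4}):(?P<product>[0-9a-f]{4})\]")


def parse_gpus(lspci_output):
    gpus = []
    for line in lspci_output.splitlines():
        m = LINE_RE.match(line)
        if (m and m.group("vendor") == NVIDIA_VENDOR_ID
                and GPU_TAG in m.group("cls")):
            gpus.append(m.groupdict())
    return gpus


gpus = parse_gpus(SAMPLE)
```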

Cyborg and Nova interaction workflow

This flow is captured by the following sequence diagram, in which the Nova conductor and scheduler are together represented as the Nova controller.

cyborg-and-nova-interaction-workflow

A Cyborg client module is added to nova (cyborg-client-module). All Cyborg API calls are routed through that.

  1. The Nova API server receives a POST /servers API request with a flavor that includes a device profile name.

  2. The Nova API server calls the Cyborg API GET /v2/device_profiles?name=$device_profile_name and gets back the device profile. The request groups in that device profile are added to the request spec.

  3. The Nova scheduler invokes Placement and gets a list of allocation candidates. It selects one of those candidates and makes claim(s) in Placement. The Nova conductor then sends an RPC message build_and_run_instances to the Nova compute manager.

  4. Nova conductor manager calls the Cyborg API POST /v2/accelerator_requests with the device profile name. Cyborg creates a set of unbound ARQs for that device profile and returns them to Nova.

  5. The Cyborg client in Nova matches each ARQ to the resource provider picked for that accelerator.

  6. The Nova compute manager calls the Cyborg API PATCH /v2/accelerator_requests to bind the ARQ with the host name, device’s RP UUID and instance UUID. This is an asynchronous call which prepares or reconfigures the device in the background.

  7. Cyborg, on completion of the bindings (successfully or otherwise), calls Nova’s POST /os-server-external-events API with:

    {
        "events": [
            {
                "name": "accelerator-request-bound",
                "tag": $device_profile_name,
                "server_uuid": $instance_uuid,
                "status": "completed"  # or "failed"
            },
            ...
        ]
    }
  8. The Nova compute manager waits for the notification, subject to the timeout mentioned in Section Other deployer impact. It then calls the Cyborg REST API GET /v2/accelerator_requests?instance=<uuid>&bind_state=resolved.

  9. The Nova virt driver uses the attach handles returned from the Cyborg call to compose PCI passthrough devices into the VM’s definition.

  10. If there is any error after binding has been initiated, Nova must unbind the relevant ARQs by calling Cyborg API. It may then retry on another host or delete the (unbound) ARQs for the instance.
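The notification payload in step 7 can be assembled as follows; the helper name is an assumption, while the fields mirror the JSON quoted in step 7.

```python
# Sketch: building the os-server-external-events payload from step 7.
def make_bound_event(device_profile_name, instance_uuid, ok=True):
    return {"events": [{
        "name": "accelerator-request-bound",
        "tag": device_profile_name,
        "server_uuid": instance_uuid,
        "status": "completed" if ok else "failed",
    }]}


event = make_bound_event("dp-fpga", "inst-uuid-1", ok=False)
```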

Nova allocate mdevs

Source code: (stable/victoria) Nova Compute Libvirt Driver - def _allocate_mdevs(self, allocations)

nova_allocate_mdevs.drawio

The nova-compute workflow for creating mdev devices is as follows:

  1. Collect the vGPU information that needs to be allocated.
  2. Check whether the requested vGPUs are provided by the same resource provider (RP); currently vGPUs can only be created from a single RP, so if multiple RPs exist, the first one is selected.
  3. Fetch the vGPU RP details from Placement.
  4. Read the currently supported vGPU types from the configuration file.
  5. Based on the RP's device information and the supported vGPU types, compute the available mdevs (all mdevs minus the already-allocated ones).
  6. If an available mdev device exists, pop one directly.
  7. If no available mdev device exists, call nova.privsep.libvirt.create_mdev() to create one.
  8. Allocate mdev devices until the count equals the requested amount.
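Steps 5-8 can be sketched as a small allocation loop; `create_mdev()` below is a hypothetical stand-in for nova.privsep.libvirt.create_mdev().

```python
# Sketch of the allocate-or-create mdev loop described above.
_new_mdev_count = 0


def create_mdev():
    # Hypothetical stand-in: in Nova this writes to sysfs via privsep.
    global _new_mdev_count
    _new_mdev_count += 1
    return "mdev-new-%d" % _new_mdev_count


def allocate_mdevs(all_mdevs, allocated, requested):
    available = [m for m in all_mdevs if m not in allocated]  # step 5
    chosen = []
    while len(chosen) < requested:                            # step 8
        if available:
            chosen.append(available.pop())                    # step 6
        else:
            chosen.append(create_mdev())                      # step 7
    return chosen


# Two vGPUs requested; one existing mdev is free, one must be created.
chosen = allocate_mdevs(["mdev-a", "mdev-b"], {"mdev-a"}, 2)
```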

The PDB debugging walkthrough of this code is shown below:

pbd_allocate_mdevs_1 pbd_allocate_mdevs_2 pbd_allocate_mdevs_3 pbd_allocate_mdevs_4

Accelerator ARQ state flow diagram

```python
# State transition scope
# cyborg/common/constants.py
...
# TODO(Shaohe): maybe we can use oslo automaton lib
# ref: https://docs.openstack.org/automaton/latest/user/examples.html
# The states in the value list can transform to the key state
ARQ_STATES_TRANSFORM_MATRIX = {
    ARQ_INITIAL: [],
    ARQ_BIND_STARTED: [ARQ_INITIAL, ARQ_UNBOUND],
    ARQ_BOUND: [ARQ_BIND_STARTED],
    ARQ_UNBOUND: [ARQ_INITIAL, ARQ_BIND_STARTED, ARQ_BOUND, ARQ_BIND_FAILED],
    ARQ_BIND_FAILED: [ARQ_BIND_STARTED, ARQ_BOUND],
    ARQ_DELETING: [ARQ_INITIAL, ARQ_BIND_STARTED, ARQ_BOUND,
                   ARQ_UNBOUND, ARQ_BIND_FAILED]
}
# Notes:
# 1. Every state except Deleting may transition to Unbound, but the current
#    code only moves Bound to Unbound.
# 2. Every state may transition to Deleting, but the current delete code
#    removes the ARQ directly instead of moving it through Deleting.
...

# State range enforcement
# See: cyborg/objects/extarq/ext_arq_job.py > start_bind_job()
# Check whether the ARQ can be bound.
if (self.arq.state not in
        ARQ_STATES_TRANSFORM_MATRIX[constants.ARQ_BIND_STARTED]):
    raise exception.ARQInvalidState(state=self.arq.state)
```
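The transition matrix can be wrapped in a small validator, as sketched below; the state names here are simplified strings rather than the real constants module, and `can_transition()` is an illustrative helper, not Cyborg code.

```python
# Sketch: validating ARQ state transitions against the matrix.
# Each value list holds the states allowed to move INTO the key state.
ARQ_STATES_TRANSFORM_MATRIX = {
    "Initial": [],
    "BindStarted": ["Initial", "Unbound"],
    "Bound": ["BindStarted"],
    "Unbound": ["Initial", "BindStarted", "Bound", "BindFailed"],
    "BindFailed": ["BindStarted", "Bound"],
    "Deleting": ["Initial", "BindStarted", "Bound", "Unbound", "BindFailed"],
}


def can_transition(current, target):
    return current in ARQ_STATES_TRANSFORM_MATRIX[target]


ok = can_transition("Initial", "BindStarted")   # binding a fresh ARQ
bad = can_transition("Bound", "BindStarted")    # re-binding a bound ARQ
```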

References