One of the tasks of administering a cloud infrastructure is monitoring its components: it is important to notice malfunctions in the cloud and to identify and correct configuration errors in time. There are several ways to manage a VMware cloud:

  • for simple tasks in the Unix-way style, the vcd-cli utility is useful: it is easy to assemble a pipeline or write a shell script around it
  • writing PowerShell scripts? There is a PowerCLI module
  • more accustomed to writing in Python? There is the pyvcloud library (vcd-cli is built on top of it)
  • or you can work directly with the VMware Cloud Director API – this is a more time-consuming, but also more flexible way. The choice of language, libraries, and combinations of REST requests is entirely yours!

In this article, we will outline how to solve two practical problems through the API. There will be few code examples, but with a twist: we will add asynchrony to the code to speed up a batch of several thousand requests!

Preparation

Before diving into programming, you need to understand how to send requests, which requests will be needed, and in what format it is most convenient to receive a response. Postman or curl can be used to debug and experiment with requests.

According to the official documentation, working with the API starts with two requests:

1. Find out the authorization addresses, the API versions, and their support level by running a simple GET request:

PS > curl -X GET https://vcd.cloud4y.ru/api/versions

Sample code:

PS > curl -X GET https://vcd.cloud4y.ru/api/versions -H "Accept:application/*+json"
{
"versionInfo" : [ {
"version" : "30.0",
"loginUrl" : "https://vcd.cloud4y.ru/api/sessions",
"mediaTypeMapping" : [ ],
"any" : [ ],
"deprecated" : true,
"otherAttributes" : { }
}, {
"version" : "31.0",
"loginUrl" : "https://vcd.cloud4y.ru/api/sessions",
"mediaTypeMapping" : [ ],
"any" : [ ],
"deprecated" : true,
"otherAttributes" : { }
}, {
"version" : "32.0",
"loginUrl" : "https://vcd.cloud4y.ru/api/sessions",
"mediaTypeMapping" : [ ],
"any" : [ ],
"deprecated" : false,
"otherAttributes" : { }
}, {
"version" : "33.0",
"loginUrl" : "https://vcd.cloud4y.ru/api/sessions",
"mediaTypeMapping" : [ ],
"any" : [ ],
"deprecated" : false,
"otherAttributes" : { }
}, {
"version" : "34.0",
"loginUrl" : "https://vcd.cloud4y.ru/api/sessions",
"mediaTypeMapping" : [ ],
"any" : [ ],
"deprecated" : false,
"otherAttributes" : { }
}, {
"version" : "35.0",
"loginUrl" : "https://vcd.cloud4y.ru/api/sessions",
"mediaTypeMapping" : [ ],
"any" : [ ],
"deprecated" : false,
"otherAttributes" : { }
} ],
"schemaRoot" : "https://vcd.cloud4y.ru/api/v1.5/schema/",
"any" : [ ],
"otherAttributes" : { }
}
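
The version to put into the Accept header can be picked from this response programmatically. A minimal sketch, assuming the JSON layout shown above; the function name and the trimmed sample are illustrative and not part of the original script:

```python
def newest_supported_version(versions_response: dict) -> str:
    """Return the highest API version that is not marked as deprecated."""
    candidates = [
        info["version"]
        for info in versions_response["versionInfo"]
        if not info["deprecated"]
    ]
    if not candidates:
        raise ValueError("no non-deprecated API version advertised")
    # Versions look like "35.0"; compare them numerically, not as strings.
    return max(candidates, key=lambda v: tuple(int(p) for p in v.split(".")))


# Trimmed copy of the response shown above.
sample = {
    "versionInfo": [
        {"version": "31.0", "deprecated": True},
        {"version": "32.0", "deprecated": False},
        {"version": "35.0", "deprecated": False},
    ]
}
print(newest_supported_version(sample))  # 35.0
```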

2. Get temporary tokens to authorize further API requests. To do this, perform basic authentication against the cloud. If it succeeds, a session is opened and the necessary tokens are returned in the response headers.

In this request, the header must specify one of the already known API versions that you plan to work with: Accept:application/*+json;version=35.0. You also need to pass credentials in the Authorization header: the word Basic followed by a base64-encoded string of the form Login@vOrg:Password.

Sample code:

PS > curl -X POST https://vcd.cloud4y.ru/api/sessions -H "Accept:application/*+json;version=35.0" -H "Authorization:Basic TG9naW5Adk9yZzpQYXNzd29yZA==" -I
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 29 Apr 2021 09:09:31 GMT
Content-Type: application/vnd.vmware.vcloud.session+json;version=35.0
Content-Length: 4945
Connection: close
X-VMWARE-VCLOUD-REQUEST-ID: 58156abf-9f16-4082-9c42-f7f1e612be0c
X-VMWARE-VCLOUD-ACCESS-TOKEN: eugXbGciOiJSUzI1NiJ9.euJzdWIiOigXZG1pbmlzdHgXdG9yIiwiaXNzIjoiZTE5N2QwZGMtYTA1Ny00YgXlLTlkZTUtMDZlMzQxMDQ4YjgzQDlhNjI1YWUzLWJjNDEtNDc5ZS1hZWY3LTIwMDI3ODM1Yzg3ZiIsImV4cCI6MTYxOTc3MzcxMSwidmVyc2lvbiI6InZjbG91ZF8xLjAiLCJqdGkiOiJlNnFNnzBhYWU4ZTY0OWJiYNn0YTBiYjY1ODA1NjgwMSJ9.EGFg_MYPkEPOHUW-k7Dh5sg0h8BrVces3e_q7iiLZ5G8t6D3RhGb1g921qipLuHWksrSYXJxxU18icpyiUNI_uwFqz88BrCaaVag-LVsrpxRWVe3COyKDl9xBw45bmuhr1ZGRIwQr8B495fDhhaILg7yB7-PlRSTKYhn2Ratew6mdDjq57ddqg_p7oIqezkuZZQ3L-On3OHCELKhqqFZ6GzescPFii22NC9_0hh_hJvmoewgXo-S1o2E-2qY--muRJm2EWOn2wIdQg_hZtA7WjKggbQNGvWSyjL9AUTz6At-2lHuZXJoORpMt5I-9Jo9NOPPx8RVgfa8cg7O8qy8Gw
X-VMWARE-VCLOUD-TOKEN-TYPE: Bearer
x-vcloud-authorization: e2fa30aae8e649bbbc4a0bb658056801
X-VMWARE-VCLOUD-REQUEST-EXECUTION-TIME: 227
Cache-Control: no-store, must-revalidate
Vary: Accept-Encoding, User-Agent
Strict-Transport-Security: max-age=31536000
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
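
The headers for the login request above can be built in Python as well. This is a small sketch rather than the author's code: the function name is made up for illustration, and the credentials are the same placeholders as in the curl example.

```python
import base64

API_VERSION = "35.0"  # one of the versions returned by /api/versions


def login_headers(login: str, org: str, password: str) -> dict:
    """Build the headers for POST /api/sessions (basic authentication)."""
    credentials = f"{login}@{org}:{password}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {
        "Accept": f"application/*+json;version={API_VERSION}",
        "Authorization": f"Basic {encoded}",
    }


headers = login_headers("Login", "vOrg", "Password")
print(headers["Authorization"])  # Basic TG9naW5Adk9yZzpQYXNzd29yZA==
```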

At the moment, there are two methods of authorization through tokens:

1. Authorize the request with the x-vcloud-authorization token

PS > curl -X GET https://vcd.cloud4y.ru/api/query?type=edgeGateway -H "Accept:application/*+json;version=35.0" -H "x-vcloud-authorization:e2fa30aae8e649bbbc4a0bb658056801"
This method is suitable for debugging but not recommended for production scripts: it is deprecated and may be removed in future versions of the API.

2. Use the X-VMWARE-VCLOUD-ACCESS-TOKEN and X-VMWARE-VCLOUD-TOKEN-TYPE tokens for authorization

PS > curl -X GET https://vcd.cloud4y.ru/api/query?type=edgeGateway -H "Accept:application/*+json;version=35.0" -H "Authorization:Bearer eugXbGciOiJSUzI1NiJ9...9Jo9NOPPx8RVgfa8cg7O8qy8Gw"
This is the vendor's recommended way
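
In a script, the second method boils down to copying two values from the login response headers into the Authorization header of every subsequent request. A hedged sketch: the function name is hypothetical and the token value below is a shortened placeholder, not a real token.

```python
def bearer_headers(session_headers: dict, api_version: str = "35.0") -> dict:
    """Turn the headers of the /api/sessions response into request headers."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    lowered = {name.lower(): value for name, value in session_headers.items()}
    token = lowered["x-vmware-vcloud-access-token"]
    token_type = lowered["x-vmware-vcloud-token-type"]
    return {
        "Accept": f"application/*+json;version={api_version}",
        "Authorization": f"{token_type} {token}",
    }


# Placeholder values standing in for the real response headers.
login_response_headers = {
    "X-VMWARE-VCLOUD-ACCESS-TOKEN": "aaa.bbb.ccc",
    "X-VMWARE-VCLOUD-TOKEN-TYPE": "Bearer",
}
print(bearer_headers(login_response_headers)["Authorization"])  # Bearer aaa.bbb.ccc
```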

Now you have everything you need to send authorized requests. Let's move on to the tasks.

First: 'Edge Health Check'

You need to monitor the state of virtual edge routers and raise an alert if any of them is in a state other than normal.

Everything is quite simple here: healthy routers have the normal status in the vCloud Director Web UI. Any other status should be checked and, if necessary, fixed. For example, a router might be in the critical (Web UI) / UNREACHABLE (API) status for several reasons:

  • The router virtual machine is off. This usually happens after the end of test access, when the client's cloud infrastructure has been turned off and queued for deletion. In this case, the alert level is low, because no immediate action is expected from the engineers. Action is required only if the notification has been active for a long time.
  • The NSX and vCenter databases have gone out of sync. The alert level is higher: you need to fix it as soon as possible by performing a Redeploy on the router.

The correspondence of the web states of routers to their API counterparts is indicated in the table:

Web UI status | API status                                                            | Alert
normal        | READY, REALIZED                                                       | NO
warning       | FAILED_CREATION, FAILED_UNDEPLOYMENT, FAILED_REDEPLOYMENT             | YES
critical      | NOT_READY, UNREACHABLE, UNKNOWN, ERROR, REALIZATION_FAILED, undefined | YES
busy          | CONFIGURING, PENDING                                                  | YES
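
The table above translates directly into a lookup. The sketch below is one possible encoding, not the original script; treating any status that is missing from the table as critical is an assumption about what "undefined" means there.

```python
# API gatewayStatus -> Web UI status, as listed in the table.
WEB_UI_STATUS = {
    "READY": "normal",
    "REALIZED": "normal",
    "FAILED_CREATION": "warning",
    "FAILED_UNDEPLOYMENT": "warning",
    "FAILED_REDEPLOYMENT": "warning",
    "NOT_READY": "critical",
    "UNREACHABLE": "critical",
    "UNKNOWN": "critical",
    "ERROR": "critical",
    "REALIZATION_FAILED": "critical",
    "CONFIGURING": "busy",
    "PENDING": "busy",
}


def classify(gateway_status: str) -> tuple:
    """Return (web_ui_status, alert_needed) for an API gatewayStatus value."""
    # Assumption: a status that is not in the table is reported as critical.
    web_status = WEB_UI_STATUS.get(gateway_status, "critical")
    return web_status, web_status != "normal"


print(classify("READY"))        # ('normal', False)
print(classify("UNREACHABLE"))  # ('critical', True)
```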

Finding a suitable API request

Over 500 pages of documentation is not the most exciting read, and patience runs out quickly. You want the result right away, just like in the Web UI. So you can cheat and look up the desired request, along with its parameters, in the browser developer console!

GET https://vcd.cloud4y.ru/api/query?type=edgeGateway

This helped to locate the relevant section of the manual and choose the necessary parameters. To avoid disclosing sensitive information, the example response below is for a request with an additional filter.

Sample code:

PS > curl -X GET "https://vcd.cloud4y.ru/api/query?type=edgeGateway&filter=(gatewayStatus!=READY);(gatewayStatus!=REALIZED);(name==*mih*)" -H "Accept:application/*+json;version=35.0" -H "x-vcloud-authorization:e2fa30aae8e649bbbc4a0bb658056801"
{
"otherAttributes" : { },
"link" : [ {
"otherAttributes" : { },
"href" : "https://vcd.cloud4y.ru/api/query?type=edgeGateway&page=1&pageSize=25&format=references&filter=(gatewayStatus!=READY);(gatewayStatus!=REALIZED);(name==*mih*)",
"id" : null,
"name" : null,
"type" : "application/vnd.vmware.vcloud.query.references+xml",
"model" : null,
"rel" : "alternate",
"vCloudExtension" : [ ]
}, {
"otherAttributes" : { },
"href" : "https://vcd.cloud4y.ru/api/query?type=edgeGateway&page=1&pageSize=25&format=references&filter=(gatewayStatus!=READY);(gatewayStatus!=REALIZED);(name==*mih*)",
"id" : null,
"name" : null,
"type" : "application/vnd.vmware.vcloud.query.references+json",
"model" : null,
"rel" : "alternate",
"vCloudExtension" : [ ]
}, {
"otherAttributes" : { },
"href" : "https://vcd.cloud4y.ru/api/query?type=edgeGateway&page=1&pageSize=25&format=idrecords&filter=(gatewayStatus!=READY);(gatewayStatus!=REALIZED);(name==*mih*)",
"id" : null,
"name" : null,
"type" : "application/vnd.vmware.vcloud.query.idrecords+xml",
"model" : null,
"rel" : "alternate",
"vCloudExtension" : [ ]
}, {
"otherAttributes" : { },
"href" : "https://vcd.cloud4y.ru/api/query?type=edgeGateway&page=1&pageSize=25&format=idrecords&filter=(gatewayStatus!=READY);(gatewayStatus!=REALIZED);(name==*mih*)",
"id" : null,
"name" : null,
"type" : "application/vnd.vmware.vcloud.query.idrecords+json",
"model" : null,
"rel" : "alternate",
"vCloudExtension" : [ ]
} ],
"href" : "https://vcd.cloud4y.ru/api/query?type=edgeGateway&page=1&pageSize=25&format=records&filter=(gatewayStatus!=READY);(gatewayStatus!=REALIZED);(name==*mih*)",
"type" : "application/vnd.vmware.vcloud.query.records+json",
"name" : "edgeGateway",
"page" : 1,
"pageSize" : 25,
"total" : 2,
"record" : [ {
"_type" : "QueryResultEdgeGatewayRecordType",
"link" : [ ],
"metadata" : null,
"href" : "https://vcd.cloud4y.ru/api/admin/edgeGateway/62e4464a-905c-4dbc-adab-2504545d9ba6",
"id" : null,
"type" : null,
"otherAttributes" : {
"task" : "https://vcd.cloud4y.ru/api/task/7b950b14-8b49-4871-a2db-640fae971c0f",
"isSyslogServerSettingInSync" : "true",
"taskOperation" : "nsxProxyResourceConfigureServices",
"taskStatus" : "success",
"taskDetails" : " "
},
"advancedNetworkingEnabled" : true,
"availableNetCount" : 8,
"distributedRoutingEnabled" : false,
"edgeGatewayType" : "NSXV_BACKED",
"egressPointId" : null,
"gatewayStatus" : "UNREACHABLE",
"haStatus" : "DISABLED",
"isBusy" : false,
"name" : "mihailovgpuwin2019test2_EDGE",
"numberOfExtNetworks" : 1,
"numberOfOrgNetworks" : 1,
"orgName" : "mihailovgpuwin2019test2",
"orgVdcName" : "mihailovgpuwin2019test2_VDC_hk41gpu",
"vdc" : "https://vcd.cloud4y.ru/api/vdc/09fe3b51-c908-4e9a-a4b8-36d69a7853b8",
"vdcGroupId" : null,
"vdcGroupName" : null
}, {
"_type" : "QueryResultEdgeGatewayRecordType",
"link" : [ ],
"metadata" : null,
"href" : "https://vcd.cloud4y.ru/api/admin/edgeGateway/6942a0bc-7569-49da-9581-203a402386d8",
"id" : null,
"type" : null,
"otherAttributes" : {
"task" : "https://vcd.cloud4y.ru/api/task/313b3dcc-0cf0-42ac-858d-fcea13d49ed2",
"isSyslogServerSettingInSync" : "true",
"taskOperation" : "networkEdgeGatewayCreate",
"taskStatus" : "success",
"taskDetails" : " "
},
"advancedNetworkingEnabled" : true,
"availableNetCount" : 9,
"distributedRoutingEnabled" : false,
"edgeGatewayType" : "NSXV_BACKED",
"egressPointId" : null,
"gatewayStatus" : "UNREACHABLE",
"haStatus" : "DISABLED",
"isBusy" : false,
"name" : "mihailov-edge-health-check-demo",
"numberOfExtNetworks" : 1,
"numberOfOrgNetworks" : 0,
"orgName" : "mihailov-vorg",
"orgVdcName" : "mihailov-vdc_HM14",
"vdc" : "https://vcd.cloud4y.ru/api/vdc/6f5c8aaf-b5e0-4317-940b-cf22b6019229",
"vdcGroupId" : null,
"vdcGroupName" : null
} ],
"vCloudExtension" : [ ]
}

The most interesting thing awaits after the link section.

These parameters will help to get all objects (records) that meet our criteria:

"page" : 1, // current page number
"pageSize" : 25, // number of objects per page
"total" : 2, // total number of objects

And here are enough details to build a correct and understandable monitoring object:

"record": [ // list of objects with more detailed information
{
...
"gatewayStatus" : "UNREACHABLE", // the state that needs to be monitored
"name" : "mihailov-edge-health-check-demo", // the name of the router
"orgName" : "mihailov-vorg", // client infrastructure: the virtual organization
"orgVdcName" : "mihailov-vdc_HM14", // client infrastructure: the virtual data center
...
}
]
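
Turning such a response into monitoring objects takes only a few lines. A minimal sketch, assuming the response has already been parsed from JSON into a dict; the function name and the keys of the resulting objects are illustrative.

```python
HEALTHY_STATUSES = {"READY", "REALIZED"}


def to_alerts(query_response: dict) -> list:
    """Pick the fields needed for monitoring from an edgeGateway query response."""
    alerts = []
    for record in query_response.get("record", []):
        if record["gatewayStatus"] in HEALTHY_STATUSES:
            continue  # healthy routers do not produce alerts
        alerts.append({
            "name": record["name"],
            "status": record["gatewayStatus"],
            "org": record["orgName"],
            "vdc": record["orgVdcName"],
        })
    return alerts


# Trimmed copy of one record from the response shown above, plus a healthy one.
response = {
    "total": 2,
    "record": [
        {"gatewayStatus": "UNREACHABLE", "name": "mihailov-edge-health-check-demo",
         "orgName": "mihailov-vorg", "orgVdcName": "mihailov-vdc_HM14"},
        {"gatewayStatus": "READY", "name": "healthy-edge",
         "orgName": "some-vorg", "orgVdcName": "some-vdc"},
    ],
}
for alert in to_alerts(response):
    print(alert)
```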

So, the required queries have been identified. Implementing the complete solution in code will take some extra effort for many system administrators, who are rarely developers, but it is fairly straightforward and not of particular interest: Python 3 plus the requests library.

There is also nothing special to optimize here: in practice, the number of such objects ranges from zero to a handful, and all the information is collected in one request. If in your case the number of objects is on the order of several hundred, you can set the pageSize parameter to its maximum allowed value, 128, to collect the complete list in two or three requests.

Second: 'Disk Provisioning vs Storage Profile'

Find inconsistencies between VM disk types and storage profiles.

Firstly, different tasks require different disk performance: one VM can act as a file share, and it makes no sense to overpay for an ultra-performance profile that is suitable, for example, for a database. Therefore, profiles differ in performance and, accordingly, in price.

Secondly, there are several ways to provide space for VM disk files:

  • Thin Provisioning. "Thin" disks save storage space because they take up exactly as much space as the guest OS uses. For example, if you specified 200 GB with a margin in the VM properties but only 50 GB is actually used, the virtual machine disk file will occupy only 50 of the allocated 200 GB. On the other hand, this reduces the performance of the VM disk: there is overhead whenever the actual size of the disk changes.
  • Thick Provisioning. "Thick" disks, on the contrary, perform better, since the entire requested volume is allocated immediately and does not fluctuate with the current needs of the guest OS. The price is wasted space on the storage volumes.

Thus, when the highest performance of the VM disk is needed, it is assigned the maximum profile and made "thick". In all other cases, disks must be "thin", and inconsistencies must be identified and corrected.
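
The rule above can be written down as a small check. This is a sketch under stated assumptions: the profile name vcd-type-ultra is hypothetical, the Disk structure is invented for illustration, and a thin disk on the top profile is treated as a mismatch too, which is one reading of the rule.

```python
from dataclasses import dataclass

# Hypothetical name of the highest-performance storage profile.
TOP_PROFILES = frozenset({"vcd-type-ultra"})


@dataclass
class Disk:
    vm_name: str
    storage_profile: str
    thin_provisioned: bool


def is_consistent(disk: Disk, top_profiles=TOP_PROFILES) -> bool:
    """A disk is consistent when it is thick exactly on a top profile."""
    on_top_profile = disk.storage_profile in top_profiles
    is_thick = not disk.thin_provisioned
    return is_thick == on_top_profile


disks = [
    Disk("database", "vcd-type-ultra", thin_provisioned=False),  # fine
    Disk("file-share", "vcd-type-med", thin_provisioned=True),   # fine
    Disk("file-share-2", "vcd-type-med", thin_provisioned=False),  # thick on a regular profile
]
print([d.vm_name for d in disks if not is_consistent(d)])  # ['file-share-2']
```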

Selection of the necessary queries

This time, you have to dig deeper into the documentation, because all the information cannot be collected in a single API request. Yes, you can use the Query Service to collect a complete list of virtual machines. But it will contain only general information: VM names / id, vOrg, vDC, with no specifics on disks. The virtual machine configuration is requested individually, with a separate request for each VM, and the response is very detailed. At one point during debugging, both lists were saved to files and compared: 3+ MB for the general list versus 120+ MB for the detailed one.

Query schema in general terms:

1. General information on virtual machines

PS > curl -X GET "https://vcd.cloud4y.ru/api/query?type=adminVM" -H "Accept:application/*+json;version=35.0" -H "x-vcloud-authorization:e2fa30aae8e649bbbc4a0bb658056801"

The answer will be something like this:

{
...
"name": "adminVM",
"page": 1,
"pageSize": 128,
"total": 999,
"record": [
...
{
"_type": "QueryResultAdminVMRecordType",
"href": "https://vcd.cloud4y.ru/api/vApp/vm-8ed1331e-23b2-43b3-a869-6d324561d188", // direct link to request VM configuration
"containerName": "Ubuntu-20.04_Template",
"dateCreated": "2020-12-14T08:44:44.214Z",
"description": "шаблон ВМ",
"guestOs": "Ubuntu Linux (64-bit)",
"name": "base for template ubuntu-20.04",
"org": "https://vcd.cloud4y.ru/api/org/e197b8dc-a357-4d8e-9de9-06e341348b83", // by ID you can find out the name
"status": "POWERED_OFF",
"storageProfileName": "vcd-type-med",
"vdc": "https://vcd.cloud4y.ru/api/vdc/6f5c8aaf-b5e0-4317-940b-cf22b6019229", // by ID you can find out the name
"vmToolsVersion": 11301
},
...

Using total as a guide, we collect the remaining pages in a loop. As a result, we get a list with VM names, their direct links, and some additional information.
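
The page loop can be sketched as follows. To keep the example runnable without a cloud, the HTTP call is passed in as a function; in a real script, fetch_page would issue a GET to /api/query with the type, page and pageSize parameters. All names here are illustrative.

```python
import math
from typing import Callable


def collect_records(fetch_page: Callable[[int, int], dict], page_size: int = 128) -> list:
    """Collect the records from all pages of a Query Service response."""
    first = fetch_page(1, page_size)
    records = list(first["record"])
    # Use the page size reported by the server when counting the pages.
    pages = math.ceil(first["total"] / first["pageSize"])
    for page in range(2, pages + 1):
        records.extend(fetch_page(page, page_size)["record"])
    return records


def fake_fetch(page: int, page_size: int) -> dict:
    """Stand-in for the real HTTP request: serves 300 dummy records."""
    data = list(range(300))
    start = (page - 1) * page_size
    return {
        "page": page,
        "pageSize": page_size,
        "total": len(data),
        "record": data[start:start + page_size],
    }


print(len(collect_records(fake_fetch)))  # 300
```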

2. Detailed information

PS > curl -X GET "https://vcd.cloud4y.ru/api/vApp/vm-8ed1331e-23b2-43b3-a869-6d324561d188" -H "Accept:application/*+json;version=35.0" -H "x-vcloud-authorization:e2fa30aae8e649bbbc4a0bb658056801"

3. This is already enough: all the necessary information has been obtained. But people will be looking at the monitoring, so it would be nice to show the names of the virtual organization and the client's data center (vOrg & vDC) instead of ID links.

This is the same Query Service. The only new thing is the format=references parameter, which returns a minimum of details, unlike records.

Organizations:

PS > curl -X GET "https://vcd.cloud4y.ru/api/query?format=references&type=organization" -H "Accept:application/*+json;version=35.0" -H "x-vcloud-authorization:e2fa30aae8e649bbbc4a0bb658056801"

{
...
"reference": [
{
"otherAttributes": {},
"href": "https://vcd.cloud4y.ru/api/org/e197b8dc-a357-4d8e-9de9-06e341348b83",
"id": "urn:vcloud:org:e197b8dc-a357-4d8e-9de9-06e341348b83",
"name": "mihailov-vorg",
"type": "application/vnd.vmware.vcloud.org+xml",
"vCloudExtension": []
},
...

Data centers:

PS > curl -X GET "https://vcd.cloud4y.ru/api/query?format=references&type=adminOrgVdc" -H "Accept:application/*+json;version=35.0" -H "x-vcloud-authorization:e2fa30aae8e649bbbc4a0bb658056801"

{
...
"reference": [
{
"otherAttributes": {},
"href": "https://vcd.cloud4y.ru/api/admin/vdc/6f5c8aaf-b5e0-4317-940b-cf22b6019229",
"id": "urn:vcloud:vdc:6f5c8aaf-b5e0-4317-940b-cf22b6019229",
"name": "mihailov-vdc_HM14",
"type": "application/vnd.vmware.admin.vdc+xml",
"vCloudExtension": []
},
...
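
With both reference lists in hand, the ID links in a VM record can be replaced with names. Note that in the samples above the VM record points to /api/vdc/<id>, while the adminOrgVdc reference uses /api/admin/vdc/<id>, so this sketch matches on the trailing UUID instead of the full href. The function names are illustrative.

```python
def uuid_of(href: str) -> str:
    """Return the last path segment of an href, which holds the object's UUID."""
    return href.rstrip("/").rsplit("/", 1)[-1]


def name_index(references_response: dict) -> dict:
    """Map UUID -> name for a format=references query response."""
    return {uuid_of(ref["href"]): ref["name"] for ref in references_response["reference"]}


def with_names(vm: dict, org_names: dict, vdc_names: dict) -> dict:
    """Replace the org and vdc links of a VM record with readable names."""
    return {
        "vm": vm["name"],
        # Fall back to the raw link if the name is unknown.
        "org": org_names.get(uuid_of(vm["org"]), vm["org"]),
        "vdc": vdc_names.get(uuid_of(vm["vdc"]), vm["vdc"]),
    }


# Trimmed copies of the responses shown above.
orgs = {"reference": [{
    "href": "https://vcd.cloud4y.ru/api/org/e197b8dc-a357-4d8e-9de9-06e341348b83",
    "name": "mihailov-vorg"}]}
vdcs = {"reference": [{
    "href": "https://vcd.cloud4y.ru/api/admin/vdc/6f5c8aaf-b5e0-4317-940b-cf22b6019229",
    "name": "mihailov-vdc_HM14"}]}
vm = {
    "name": "base for template ubuntu-20.04",
    "org": "https://vcd.cloud4y.ru/api/org/e197b8dc-a357-4d8e-9de9-06e341348b83",
    "vdc": "https://vcd.cloud4y.ru/api/vdc/6f5c8aaf-b5e0-4317-940b-cf22b6019229",
}
print(with_names(vm, name_index(orgs), name_index(vdcs)))
```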

For dessert: fixing the problem with the "head-on" approach

Who is to blame?

The classic approach with requests is sequential: request the first object, receive it; request the second, receive it; and so on. The problem is the linear dependence: the more objects there are, the longer the loop takes.

Query Service requests can be slightly tweaked through the API: request 128 objects at a time instead of the default 25 – and slightly speed up the process. But the second page will still be requested only after the first is received. The collection of detailed information cannot be tweaked. When there are several thousand virtual machines in the cloud, the process takes tens of minutes or even hours, depending on the current load on the cloud.

Real example from the log, 58 minutes.

2021-04-02 23:16:14,063 | vam-py script | INFO | vam.py | main: 105 | ================================================================================
2021-04-02 23:16:14,063 | vam-py script | DEBUG | vam.py | main: 106 | Namespace(auth_probe=False, check_json=False, disk_info=True, edge_info=False)
2021-04-02 23:16:14,063 | vam-py script | DEBUG | vam.py | main: 143 | ['disk_info']
2021-04-02 23:16:14,063 | vam-py script | DEBUG | helper.py | update_tokens: 434 | running
...
2021-04-03 00:14:12,828 | vam-py script | DEBUG | helper.py | write_list_to_csv_file: 463 | running
2021-04-03 00:14:12,828 | vam-py script | DEBUG | helper.py | get_current_script_dir: 23 | running
2021-04-03 00:14:12,828 | vam-py script | DEBUG | helper.py | get_current_script_dir: 32 | done
2021-04-03 00:14:12,843 | vam-py script | DEBUG | helper.py | write_list_to_csv_file: 478 | done
2021-04-03 00:14:12,844 | vam-py script | INFO | helper_disk.py | foo: 83 | disk-status.csv report ready

On the one hand, this is tolerable. Firstly, identifying errors in disk provisioning is not an urgent task and does not require a response within minutes. Secondly, the finished script will run on a schedule, for example at night, so that the data is available in monitoring by morning.

Debugging, on the other hand, is a manual process that is repeated many times, so the time taken by individual operations becomes critical.

What to do?

It's time to learn new tricks. If we replace requests with aiohttp + asyncio, we get a 10x to 15x speedup: data preparation time dropped to 3-5 minutes! Debugging the script became much faster, and as a bonus, we quickly ran into a set of failure cases and improved the code.

What happened? The number of objects in the cloud, and therefore the volume of requests, remained more or less the same. But now the script sends requests one after another without waiting for the response to the previous one, and the responses are collected as soon as they are ready.

It works like this: using the general list of VMs, the script launches a batch of detailed requests, then collects and saves the successful responses. Failed requests are repeated with a small delay between them (80..100 ms). This is the remedy for 500 errors. It also happens that a VM is deleted after the run has started, and then there will be 400 errors. In this case, the retry logic, following the principle attributed to Einstein (doing the same thing over and over and expecting a different result is foolish), stops hammering the API in vain and returns the successful responses.
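
The scheme described above can be sketched with asyncio alone. This is not the original script: the HTTP call is injected as a coroutine so the example runs without a cloud (in a real script it would wrap an aiohttp session request), and the names fetch_all, max_retries and delay_range are illustrative.

```python
import asyncio
import random
from typing import Awaitable, Callable, Dict, Iterable, Tuple

Fetch = Callable[[str], Awaitable[Tuple[int, dict]]]  # url -> (status, body)


async def fetch_all(urls: Iterable[str], fetch: Fetch, max_retries: int = 3,
                    delay_range: Tuple[float, float] = (0.08, 0.10)) -> Dict[str, dict]:
    """Request all URLs concurrently, then retry server errors with a delay."""
    urls = list(urls)
    # Fire all requests at once instead of waiting for each response in turn.
    first_pass = await asyncio.gather(*(fetch(url) for url in urls))
    results: Dict[str, dict] = {}
    to_retry = []
    for url, (status, body) in zip(urls, first_pass):
        if status == 200:
            results[url] = body
        elif status >= 500:
            to_retry.append(url)  # server-side failure: worth another try
        # 4xx (for example, the VM was deleted): retrying will not help

    for _ in range(max_retries):
        if not to_retry:
            break
        still_failing = []
        for url in to_retry:
            await asyncio.sleep(random.uniform(*delay_range))  # 80..100 ms by default
            status, body = await fetch(url)
            if status == 200:
                results[url] = body
            elif status >= 500:
                still_failing.append(url)
        to_retry = still_failing
    return results


async def demo() -> Dict[str, dict]:
    calls: Dict[str, int] = {}

    async def fake_fetch(url: str) -> Tuple[int, dict]:
        """Stand-in for the real request: one flaky VM, one deleted VM."""
        calls[url] = calls.get(url, 0) + 1
        if url.endswith("flaky") and calls[url] == 1:
            return 500, {}
        if url.endswith("deleted"):
            return 400, {}
        return 200, {"href": url}

    return await fetch_all(["vm-ok", "vm-flaky", "vm-deleted"], fake_fetch,
                           delay_range=(0.0, 0.0))


print(sorted(asyncio.run(demo())))  # ['vm-flaky', 'vm-ok']
```

With several thousand requests in flight, it may also be worth capping concurrency, for example with asyncio.Semaphore, so that the cloud is not overloaded.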