The Self-Healing Datacenter 2.0

The post The Self-Healing Datacenter 2.0 appeared first on VMware Cloud Management.

About 18 months ago, I wrote a series of blog posts that many customers used to enable automation for self-healing IT operations.  At the time, it was a good choice if you really wanted to integrate your custom vRealize Orchestrator workflows with vRealize Operations alerts.  I am happy to report that, with the release of the vRealize Operations Management Pack for vRealize Orchestrator 2.0, you can now more easily and reliably use your own custom workflows as part of the Action framework in vRealize Operations to fully embrace self-driving operations.

In this blog post I will show you an example of using the management pack to integrate your own workflows for response to alerts for host compliance.  Specifically, you will see how to reduce risk from misconfigured NTP ESXi host settings by fixing them when they are found.

If you have not yet read my blog posts providing an overview of the management pack features and how to install the management pack, I recommend doing so before you go any further.  Also, the “level of difficulty” here is not very high if you are already familiar with vRealize Orchestrator.  If you are new to that product, you may find some of the concepts below a little confusing, but I encourage you to try this out as it is a great opportunity to learn how to use a fantastic automation tool!

Right, On Time

As I mentioned, the scenario I will use is ESXi host NTP configuration.  It could be any host setting, but I think this is a pretty common concern with customers, especially after they have enabled the vSphere Security Hardening Guide compliance feature in vRealize Operations.  When NTP isn’t working due to misconfigured settings, it can lead to outages and other issues.  Often, administrators don’t realize it is broken until it becomes a problem.

It is great to know that NTP is misconfigured before it causes issues, but why not just fix it for me?  It is a low-risk change and perfect for “task automation” with this solution.  That is exactly what I will show you, but first let me show you the workflows I have created for this situation.

  • First up is the workflow “SDO Set Host NTP” which accepts input of a VC:HostSystem and a string array of NTP servers. I will use this as an Action in vRealize Operations for ESXi host systems to make it easier to set NTP settings directly (rather than opening vCenter client).
  • The second workflow “SDO Reset Host NTP” accepts input of VC:HostSystem and will be used to automatically configure NTP settings on any ESXi host when vRealize Operations finds a problem with the settings.

Both are simple examples of the power available through this integration.  In the workflows, I have used the recommended vSphere Security Hardening Guide values for the NTP service settings.  The only input required from the user is the NTP servers and I will explain more about that in each example later in this post.

SDO Package Install

By the way, you can download a package that contains the workflows from Sample Exchange for your own use.  The “SDO” in the title of the package and contents stands for “Self-Driving Operations” if you were wondering.  Once you have the package downloaded, start up your vRealize Orchestrator client and change to the Design role.  Navigate to the Packages tab where you can import a package and select the downloaded file for import.

Be sure to select all the package elements for import.

At this point, you are ready to have the package “discovered” by the management pack in vRealize Operations.  The easiest way to do this is from the “vRealize Orchestrator Workflows Overview” dashboard that is included with the management pack.  Select your adapter instance and click the Action icon in the list widget to run the Configure Package Discovery workflow.

In the pop-up, you can click into the text area (you may need to expand the text input area) and add the name of the SDO Workflow Package.  It can be found in vRealize Orchestrator client under the Packages tab.

Click the Begin Action button on the pop-up to save the change.  On the next collection of the management pack adapter (every 5 minutes) the workflows in the package will be available for adding as Actions.

To add specific workflows as Actions, run the Create/Modify Workflow Action on vCenter Resources action on the adapter instance.

In the pop-up, scroll to find the custom workflows.  I will go ahead and add both now to save time, selecting the Host System resource type and Add as the operation.  You can always run this again to add or remove resource types for the workflows.

Click Begin Action to complete this task, which will add these workflows to the Actions in vRealize Operations on the next collection cycle of the adapter instance (5 minutes by default).

To confirm these have been added, you can navigate to Alerts > Alert Settings > Actions and filter for the name “SDO” and you will see the new Actions for these two custom workflows.  I always recommend using some standard naming for custom content such as Alert Definitions, Dashboards, Views and others so that you can easily find them.

Now we are ready to test out our two use cases, setting NTP configuration for hosts from the vRealize Operations UI and automatically resetting NTP configurations for hosts when vRealize Operations detects misconfiguration.

No Hassle NTP Configuration

Now that the custom workflows are part of the Action framework, associated with the Host System resource type, we should see them displayed as options in the Action menu for any host.  In the screen shot below, I have navigated to an ESXi host summary page and here you can see the new workflows that have been added.

For the first example, setting NTP configuration on a host, I will run the “SDO Set Host NTP” workflow.  You will see that there are two inputs: the host system and the NTP servers.

The “host” input is pre-filled by vRealize Operations since it understands that the workflow is looking for a VC:HostSystem data type.

Next, the “ntpServers” input is added by the user and I have added my desired NTP server IP addresses.

There are a few important points here, so let me explain:

  • The input names “host” and “ntpServers” are taken from the workflow inputs. So, it is a good idea to give any input parameters a descriptive name as there is no additional help available.
  • Workflows can use input parameters of VC plugin datatypes. Otherwise, only common datatypes such as string, number or Boolean are valid.
  • The workflow can accept an array of strings, which you can see in the example screenshot above. You simply need to enclose the values in brackets, braces or parentheses and use commas to separate the values.
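
To make the array format a little more concrete, here is a minimal sketch of how such an input string could be turned into a list of values.  The function name parseStringArray is hypothetical, and this is not the management pack’s own parser, only an illustration of the format described in the last bullet.

```javascript
// Hypothetical helper, NOT the management pack's parser: it only illustrates
// the "[a, b]", "{a, b}" or "(a, b)" input format described above.
function parseStringArray(text) {
  var trimmed = String(text).trim();
  var pairs = { "[": "]", "{": "}", "(": ")" };
  var close = pairs[trimmed.charAt(0)];

  // The value must start with an opening delimiter and end with its partner.
  if (!close || trimmed.charAt(trimmed.length - 1) !== close) {
    throw new Error("Expected values wrapped in [], {} or (): " + text);
  }

  return trimmed
    .slice(1, -1)                                    // drop the delimiters
    .split(",")                                      // one entry per comma
    .map(function (value) { return value.trim(); })  // ignore padding
    .filter(function (value) { return value.length > 0; });
}

// Example: parseStringArray("[10.0.0.1, 10.0.0.2]") returns ["10.0.0.1", "10.0.0.2"]
```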

As you can tell, this workflow will set the NTP servers for the selected host.  It does not make sure that NTP is running, is set to start with the host or that a firewall port is open for NTP.  Let’s take a quick peek at the code in the workflow.

The workflow is simply setting the NTP server value, as shown in the vCenter client.
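
The workflow script itself is not reproduced here, so the following is only a minimal sketch of what a scriptable task for this step might look like, not the exact code in the package.  It assumes the vCenter plug-in scripting classes VcHostNtpConfig and VcHostDateTimeConfig and the host’s dateTimeSystem.updateDateTimeConfig method; the function name setHostNtp and the stand-in classes are my own additions so that the sketch also runs outside vRealize Orchestrator.

```javascript
// Stand-ins so the sketch runs outside vRealize Orchestrator. In a scriptable
// task the vCenter plug-in supplies these classes, so remove these two lines.
var VcHostNtpConfig = function () {};
var VcHostDateTimeConfig = function () {};

// host: VC:HostSystem, ntpServers: array of strings (the two workflow inputs).
// setHostNtp is a hypothetical name used for illustration.
function setHostNtp(host, ntpServers) {
  if (!ntpServers || ntpServers.length === 0) {
    throw new Error("At least one NTP server is required");
  }

  var ntpConfig = new VcHostNtpConfig();
  ntpConfig.server = ntpServers;

  var dateTimeConfig = new VcHostDateTimeConfig();
  dateTimeConfig.ntpConfig = ntpConfig;

  // Only the server list changes here. The NTP service state, its startup
  // policy and the firewall ruleset are left untouched.
  host.configManager.dateTimeSystem.updateDateTimeConfig(dateTimeConfig);
  return dateTimeConfig;
}
```

Inside a workflow you would call setHostNtp(host, ntpServers) with the two inputs, or simply inline the body of the function in the scriptable task.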

That was easy, now let’s try something even more useful.

Someone Broke NTP

People make mistakes, it happens.  Sometimes people do things they shouldn’t or think they are doing the right thing.  In any case, this is where automation comes to the rescue to reduce risk.  The vSphere Security Hardening Guide provides recommendations for NTP settings, which are mapped to Alert Definitions and Symptoms in vRealize Operations.

This is great, but why just alert when you can fix the problem automatically?  That is the point of this next example.  The workflow “SDO Reset Host NTP” will be used to address each of the highlighted symptoms above automatically when a misconfiguration is detected by vRealize Operations.

First, I will create a specific Alert Definition for this, as I do not intend to address the other vSphere Security Hardening Guide issues.

I will clone the Alert Definition and edit it to remove all the other Symptoms.

Here I have trimmed it down to just the four symptoms and updated the name and description as well.

Now we need to create a new Recommendation to associate the workflow Action with.  Go to the Recommendations tab and click the green plus icon.

Create the new Recommendation by adding descriptive text that tells the user what has happened and what they should do next.  You will then add the workflow Action by selecting the vRealize Orchestrator Adapter type and then the SDO Reset Host NTP workflow.

Now you can find and add the recommendation.  This is where it is great to have your custom content prefixed with something easy to recognize (such as “SDO” in this case).  Simply drag the Recommendation over to the Alert Definition and then click Save.

Now the Alert Definition is ready, but we need to do one more thing.  You will recall that the “SDO Set Host NTP” workflow has a user input for ntpServers.  If we want to fully automate this, the workflow can only have a single input and that input must be a vRealize Orchestrator VC plugin type (host, VM, cluster, etc.).  So, how can you set the values for ntpServers, which will be different for every customer?

If you view the “SDO Reset Host NTP” workflow in vRealize Orchestrator, you can see that ntpServers is now a workflow attribute and is set to the value contained in a Configuration Element.

You simply need to edit the Configuration Element “SDO Host NTP Config” and set the ntpServers value to your organization’s NTP servers.  The vSphere Security Hardening Guide does not recommend using external NTP servers (such as pool.ntp.org) directly on ESXi hosts.  By default, though, the Configuration Element does have values for external NTP – so, use at your own risk or configure as recommended.

In vRealize Orchestrator client, click the Configurations tab and browse to SDO Configurations > SDO Host NTP Config.  Click the pencil icon to edit.

Click on the ntpServers value to modify the contents of the array with your own NTP servers, accept and then save and close.  Hint, use the red X icon to remove values from the array.
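
In the workflow the attribute is bound to the Configuration Element in the designer, so no code is needed.  If you would rather read the value from a script, a minimal sketch might look like the one below.  It assumes the Server.getConfigurationElementCategoryWithPath and getAttributeWithKey scripting methods; the function name readNtpServers is hypothetical, and the Server object is passed in as a parameter only so that the sketch can also run outside vRealize Orchestrator.

```javascript
// server: in vRealize Orchestrator pass the global Server object. Any object
// with the same method works, which keeps the sketch runnable elsewhere.
// readNtpServers is a hypothetical name used for illustration.
function readNtpServers(server) {
  var category = server.getConfigurationElementCategoryWithPath("SDO Configurations");
  if (!category) {
    throw new Error("Category 'SDO Configurations' not found");
  }

  var elements = category.configurationElements || [];
  for (var i = 0; i < elements.length; i++) {
    if (elements[i].name === "SDO Host NTP Config") {
      var attribute = elements[i].getAttributeWithKey("ntpServers");
      // Fail loudly rather than push an empty server list to a host.
      if (!attribute || !attribute.value || attribute.value.length === 0) {
        throw new Error("ntpServers is empty in 'SDO Host NTP Config'");
      }
      return attribute.value;
    }
  }
  throw new Error("Configuration Element 'SDO Host NTP Config' not found");
}
```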

Now we are ready to test things out.  First, we need to enable the alert.  Best practice is to create a custom group with the objects that will have the alert enabled and then create a policy specifically for that group.  This goes a bit beyond the scope of this (already long) blog post, so I will not cover that, but I do have a video to help you.

In this case, I’ll create a custom group “SDO NTP Configuration” containing the hosts I wish to monitor and enforce NTP settings on.  Then, I’ll create a child of the default policy with the “SDO Someone Broke NTP” Alert Definition enabled and assign it to the custom group.  The critical part here is to enable automation for this Alert Definition in the policy.

This allows vRealize Operations to automatically run this workflow to reset NTP whenever the alert is triggered.

You can optionally disable the Alert Definition in the default policy.  It is not set to run automatically by default, but it will trigger alerts.  I will leave it enabled in the default policy just to show the difference.

Let’s try it out!  Going to the alert page I can see that I have a few hosts in my lab that are not properly configured for NTP.  Let’s look specifically at one alert.

Here we can see that the alert triggered on the Symptom which checks to see if the NTP firewall rule allows all IPs.  The vSphere Security Hardening Guide recommends that you limit the allowed hosts for the rule to the NTP servers only to reduce risk.  So, I’ll go ahead and click on “Run Action” to launch the workflow manually.  The alert was triggered before I added the SDO NTP policy and made it effective for the host, and therefore it is not automated.  The next time this alert triggers, the automated policy setting will take effect.
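
As an aside, if you wanted to perform a similar check yourself from a workflow, a rough sketch is shown below.  This is not how vRealize Operations evaluates the Symptom, which is not covered in this post; it only assumes that the host exposes its firewall rulesets as host.config.firewall.ruleset and that the NTP client ruleset uses the key “ntpClient”.  The function name ntpFirewallAllowsAllIps is hypothetical.

```javascript
// host: VC:HostSystem (or any object with the same shape outside vRO).
// Returns true when the NTP client firewall ruleset accepts any IP address.
function ntpFirewallAllowsAllIps(host) {
  var rulesets = host.config.firewall.ruleset || [];
  for (var i = 0; i < rulesets.length; i++) {
    if (rulesets[i].key === "ntpClient") {
      var allowed = rulesets[i].allowedHosts;
      return Boolean(allowed && allowed.allIp === true);
    }
  }
  // No NTP client ruleset found, so there is nothing open to all IPs.
  return false;
}
```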

In vCenter client we can see that the workflow ran and reset NTP, making sure the service was started, updated the policy to “start with host” and added the NTP servers to the firewall ruleset.
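
For those curious about what a reset like this involves, here is a minimal sketch of the steps just described.  It is not the code from the package.  It assumes the vCenter plug-in classes VcHostNtpConfig, VcHostDateTimeConfig, VcHostFirewallRulesetIpList and VcHostFirewallRulesetRulesetSpec, the service key “ntpd”, the ruleset key “ntpClient”, and that the NTP servers are given as IP addresses so they can be reused in the firewall rule.  The function name resetHostNtp and the stand-in classes are my own additions.

```javascript
// Stand-ins so the sketch runs outside vRealize Orchestrator. In a scriptable
// task the vCenter plug-in supplies these classes, so remove these four lines.
var VcHostNtpConfig = function () {};
var VcHostDateTimeConfig = function () {};
var VcHostFirewallRulesetIpList = function () {};
var VcHostFirewallRulesetRulesetSpec = function () {};

// host: VC:HostSystem; ntpServers: array of IP address strings.
// resetHostNtp is a hypothetical name used for illustration.
function resetHostNtp(host, ntpServers) {
  var manager = host.configManager;

  // 1. Point the host at the desired NTP servers.
  var dateTimeConfig = new VcHostDateTimeConfig();
  dateTimeConfig.ntpConfig = new VcHostNtpConfig();
  dateTimeConfig.ntpConfig.server = ntpServers;
  manager.dateTimeSystem.updateDateTimeConfig(dateTimeConfig);

  // 2. Restrict the NTP client firewall rule to those servers only.
  var allowedHosts = new VcHostFirewallRulesetIpList();
  allowedHosts.allIp = false;
  allowedHosts.ipAddress = ntpServers;
  var rulesetSpec = new VcHostFirewallRulesetRulesetSpec();
  rulesetSpec.allowedHosts = allowedHosts;
  manager.firewallSystem.updateRuleset("ntpClient", rulesetSpec);
  manager.firewallSystem.enableRuleset("ntpClient");

  // 3. Set the policy to "start with host" and restart the daemon so it
  //    picks up the new servers. Whether a stopped service needs
  //    startService instead is worth verifying in your own environment.
  manager.serviceSystem.updateServicePolicy("ntpd", "on");
  manager.serviceSystem.restartService("ntpd");
}
```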

Now, I will intentionally misconfigure a host’s NTP settings by removing the NTP servers.

The alert is triggered within 5 minutes.

And the workflow runs, resetting my host NTP configuration to reduce risk.

Well, that was a pretty long post but I hope it is useful for customers who want to automate with Self-Driving Operations and create a Self-Healing Datacenter!

Posted by News Monkey