OpenStack: Viewing a Child Heat Resource


In this post we are going to walk through a process you can use to troubleshoot a failed Heat stack deploy by viewing information about nested resources.

In the command below we are looking for any failed resource. The grep, with a few lines of context, lets us grab the name of the parent resource as well.

[stack@undercloud] # heat resource-list -n5 overcloud | grep -A5 -B5 -i Failed
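If the grep context makes the table hard to scan, a small filter can narrow it down. The sketch below assumes the default pipe-delimited table output of the heat CLI; `failed_rows` is a hypothetical helper name, not part of the client.

```shell
# failed_rows: read "heat resource-list" output on stdin and print only
# the resource_name, resource_status, and parent columns of failed rows.
failed_rows() {
  awk -F'|' 'tolower($0) ~ /failed/ && NF > 6 {
    gsub(/^ +| +$/, "", $2);   # resource_name
    gsub(/^ +| +$/, "", $5);   # resource_status
    gsub(/^ +| +$/, "", $7);   # parent stack column
    print $2, $5, $7
  }'
}

# Usage (against a live undercloud):
#   heat resource-list -n5 overcloud | failed_rows
```

This keeps the full table available from the original command while giving you a quick view of which parent each failed child belongs to.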

The output below is hard to read, as I am unable to get the copy/paste to format correctly. What we are looking for is the parent resource of the failed child resource: the line with a CREATE_FAILED status, whose last column (Controller) names the parent.

| StoragePort | ed706420-e4e2-4e46-b3d3-8dfd47855a3a | OS::Neutron::Port | CREATE_COMPLETE | 2016-03-03T18:48:42Z | StorageVirtualIP |
| VipPort | 2833b468-5f0f-4162-a0bb-e928126b3767 | OS::Neutron::Port | CREATE_COMPLETE | 2016-03-03T18:48:42Z | RedisVirtualIP |
| 0 | 6c8b39f8-4ed2-47e4-957b-f74b8520d386 | OS::TripleO::Compute | CREATE_IN_PROGRESS | 2016-03-03T18:48:47Z | Compute |
| 0 | 829a6b13-7a92-483a-b9d3-52a99dc655f8 | OS::TripleO::Controller | CREATE_IN_PROGRESS | 2016-03-03T18:48:49Z | Controller |
| 1 | 556da671-baa6-4fc8-be83-f3acfb324f46 | OS::TripleO::Controller | CREATE_IN_PROGRESS | 2016-03-03T18:48:49Z | Controller |
| 2 | c28d5942-a206-4734-bfd1-c6d263fafc52 | OS::TripleO::Controller | CREATE_FAILED | 2016-03-03T18:48:49Z | Controller |
| InternalApiPort | 59800552-0b66-4437-bf43-833df8131673 | OS::TripleO::Compute::Ports::InternalApiPort | CREATE_COMPLETE | 2016-03-03T18:48:49Z | 0 |
| NetIpMap | 2d455861-3cbe-40cd-8443-f0e63b0c24b7 | OS::TripleO::Network::Ports::NetIpMap | CREATE_COMPLETE | 2016-03-03T18:48:49Z | 0 |
| NetworkConfig | 062eb0ed-7914-435b-8f1e-61395fed6548 | OS::TripleO::Compute::Net::SoftwareConfig | CREATE_COMPLETE | 2016-03-03T18:48:49Z | 0 |
| NetworkDeployment | 5a11128b-c62a-4720-a8b2-6f27b0d83d74 | OS::TripleO::SoftwareDeployment | CREATE_IN_PROGRESS | 2016-03-03T18:48:49Z | 0 |
| NodeUserData | 94deb049-9015-490a-9a2b-925ff17c5c9f | OS::TripleO::NodeUserData | CREATE_COMPLETE | 2016-03-03T18:48:49Z | 0 |


Our parent resource is "Controller", so let's take a look at that first. We want to grab the "physical_resource_id" value; again, it is hard to spot, so look for that property in the table below.

[stack@undercloud] # heat resource-show overcloud Controller

+------------------------+--------------------------------------+
| Property | Value |
+------------------------+--------------------------------------+
| attributes | { |
| | "attributes": null, |
| | "refs": null |
| | } |
| description | |
| links | http://172.16.0.10:8004/v1/94508024f96c426abac45b7e1acdfe39/stacks/overcloud/054fde22-87f2-44cb-8318-784d1fa46323/resources/Controller (self) |
| | http://172.16.0.10:8004/v1/94508024f96c426abac45b7e1acdfe39/stacks/overcloud/054fde22-87f2-44cb-8318-784d1fa46323 (stack) |
| | http://172.16.0.10:8004/v1/94508024f96c426abac45b7e1acdfe39/stacks/overcloud-Controller-swdfj36bsf6e/6faa959c-3f1b-4f60-aeee-c01e34f399e7 (nested) |
| logical_resource_id | Controller |
| physical_resource_id | 6faa959c-3f1b-4f60-aeee-c01e34f399e7 |
| required_by | ControllerNodesPostDeployment |
| | VipDeployment |
| | ControllerIpListMap |
| | ControllerBootstrapNodeDeployment |
| | ControllerClusterDeployment |
| | CephClusterConfig |
| | ControllerSwiftDeployment |
| | SwiftDevicesAndProxyConfig |
| | ControllerBootstrapNodeConfig |
| | ControllerCephDeployment |
| | allNodesConfig |
| | ControllerAllNodesDeployment |
| | ControllerClusterConfig |
| resource_name | Controller |
| resource_status | CREATE_IN_PROGRESS |
| resource_status_reason | state changed |
| resource_type | OS::Heat::ResourceGroup |
| updated_time | 2016-03-03T18:48:34Z |
+------------------------+--------------------------------------+

Now that we know the physical_resource_id, we can drill down into that resource. Note that resource "2" is the one we want to focus on.
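Rather than copying the id out of the table by hand, it can be pulled out with a short filter. This is a sketch that assumes the pipe-delimited resource-show output shown earlier; `phys_id` is a hypothetical helper name.

```shell
# phys_id: read "heat resource-show" output on stdin and print the value
# of the physical_resource_id property.
phys_id() {
  awk -F'|' '$2 ~ /physical_resource_id/ { gsub(/ /, "", $3); print $3 }'
}

# Usage (against a live undercloud):
#   heat resource-show overcloud Controller | phys_id
```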

[stack@undercloud] # heat resource-list 6faa959c-3f1b-4f60-aeee-c01e34f399e7
+---------------+--------------------------------------+-------------------------+--------------------+----------------------+
| resource_name | physical_resource_id | resource_type | resource_status | updated_time |
+---------------+--------------------------------------+-------------------------+--------------------+----------------------+
| 0 | 829a6b13-7a92-483a-b9d3-52a99dc655f8 | OS::TripleO::Controller | CREATE_IN_PROGRESS | 2016-03-03T18:48:49Z |
| 1 | 556da671-baa6-4fc8-be83-f3acfb324f46 | OS::TripleO::Controller | CREATE_IN_PROGRESS | 2016-03-03T18:48:49Z |
| 2 | c28d5942-a206-4734-bfd1-c6d263fafc52 | OS::TripleO::Controller | CREATE_FAILED | 2016-03-03T18:48:49Z |
+---------------+--------------------------------------+-------------------------+--------------------+----------------------+

Now we run "heat resource-show" using the physical resource id plus the resource_name (which is 2).

[stack@undercloud] # heat resource-show 6faa959c-3f1b-4f60-aeee-c01e34f399e7 2
+------------------------+--------------------------------------------+
| Property | Value |
+------------------------+--------------------------------------------+
| attributes | { |
| | "storage_mgmt_ip_address": null, |
| | "hostname": "overcloud-controller-2", |
| | "config_identifier": ",", |
| | "nova_server_resource": "52872fac-9db2-4913-8e02-281fc01f44eb", |
| | "tenant_ip_address": null, |
| | "external_ip_address": null, |
| | "swift_device": "r1z1-:%PORT%/d1", |
| | "corosync_node": { |
| | "ip": "172.16.0.136", |
| | "name": "overcloud-controller-2" |
| | }, |
| | "hosts_entry": " overcloud-controller-2.localdomain overcloud-controller-2 overcloud", |
| | "swift_proxy_memcache": ":11211", |
| | "storage_ip_address": null, |
| | "internal_api_ip_address": null, |
| | "ip_address": "172.16.0.136" |
| | } |
| description | |
| links | http://172.16.0.10:8004/v1/94508024f96c426abac45b7e1acdfe39/stacks/overcloud-Controller-swdfj36bsf6e/6faa959c-3f1b-4f60-aeee-c01e34f399e7/resources/2 (self) |
| | http://172.16.0.10:8004/v1/94508024f96c426abac45b7e1acdfe39/stacks/overcloud-Controller-swdfj36bsf6e/6faa959c-3f1b-4f60-aeee-c01e34f399e7 (stack) |
| | http://172.16.0.10:8004/v1/94508024f96c426abac45b7e1acdfe39/stacks/overcloud-Controller-swdfj36bsf6e-2-j5zi5fahlwf3/c28d5942-a206-4734-bfd1-c6d263fafc52 (nested) |
| logical_resource_id | 2 |
| parent_resource | Controller |
| physical_resource_id | c28d5942-a206-4734-bfd1-c6d263fafc52 |
| required_by | |
| resource_name | 2 |
| resource_status | CREATE_FAILED |
| resource_status_reason | CREATE aborted |
| resource_type | OS::TripleO::Controller |
| updated_time | 2016-03-03T18:48:49Z |
+------------------------+--------------------------------------------+

OpenStack: Introduction to Troubleshooting Heat


Introduction to Heat

Heat is the main orchestration engine for OpenStack, and is used by OpenStack director to install an OpenStack overcloud environment.

When we run the "openstack overcloud deploy" command, we are specifically
telling RHEL OSP director that we want it to use the pre-defined Heat templates from
/usr/share/openstack-tripleo-heat-templates/. OSP director will manage the
deployment of a new overcloud Heat stack, using files from this directory.
When RHEL OSP director calls the Heat stack, it needs the following data:

  • A top-level Heat template to use that describes the overall environment and the
    resources required.
  • An environment/resource registry to tell Heat where to find resource
    definitions for non-standard Heat elements, e.g. TripleO components.
  • A set of parameters to declare the deployment-specific options (via -e)

 

The most important files for us to focus on are in our deployment directory; these are the default files that get called by OSP director.

  • The top-level Heat template that OSP director uses for deployment is
    /usr/share/openstack-tripleo-heat-templates/overcloud-without-mergepy.yaml
  • The resource registry, which tells Heat where to find the templates for
    deployment resources, is
    /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml
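As a concrete illustration, a resource registry entry is just a mapping from a resource type to a template path. The excerpt below is hypothetical (the override path is an example, not taken from the default registry):

```yaml
# Hypothetical resource_registry excerpt: point a TripleO resource type
# at a local template override instead of the packaged default.
resource_registry:
  OS::TripleO::Compute::Net::SoftwareConfig: /home/stack/my_templates/nic-configs/compute.yaml
```

Environment files passed with -e are merged on top of this registry, which is how deployment-specific overrides take effect.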

Creating a Heat Stack

To create the stack we run the command below. This command instructs Heat to use
the templates in ~/my_templates/, as well as the override templates specified
with the -e option.

This is just an example of what I am using in my lab environment; your deploy command will likely differ. Also note that I have copied the templates from /usr/share/openstack-tripleo-heat-templates to ~/my_templates/.

#openstack overcloud deploy --debug --templates ~/my_templates/ \
--ntp-server 10.1.0.1 --control-scale 3 --compute-scale 2 \
-e ~/my_templates/advanced-networking.yaml

Troubleshooting a Failed Heat Stack

Unfortunately our deploy failed with the following errors.

Exception: Heat Stack create failed.
DEBUG: openstackclient.shell clean_up DeployOvercloud
DEBUG: openstackclient.shell got an error: Heat Stack create failed.
ERROR: openstackclient.shell Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/openstackclient/shell.py", line 176, in run
return super(OpenStackShell, self).run(argv)
File "/usr/lib/python2.7/site-packages/cliff/app.py", line 230, in run
result = self.run_subcommand(remainder)
File "/usr/lib/python2.7/site-packages/cliff/app.py", line 295, in run_subcommand
result = cmd.run(parsed_args)
File "/usr/lib/python2.7/site-packages/cliff/command.py", line 53, in run
self.take_action(parsed_args)
File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 864, in take_action
self._deploy_tripleo_heat_templates(stack, parsed_args)
File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 535, in _deploy_tripleo_heat_templates
parsed_args.timeout)
File "/usr/lib/python2.7/site-packages/rdomanager_oscplugin/v1/overcloud_deploy.py", line 478, in _heat_deploy
raise Exception("Heat Stack create failed.")
Exception: Heat Stack create failed.

We can verify that the deploy failed with the command below.

[stack@undercloud] # heat stack-list
+--------------------------------------+------------+---------------+----------------------+
| id | stack_name | stack_status | creation_time |
+--------------------------------------+------------+---------------+----------------------+
| ce993847-b0ee-4ea2-ac15-dc0ddc81825a | overcloud | CREATE_FAILED | 2016-02-29T20:40:54Z |
+--------------------------------------+------------+---------------+----------------------+

Since the stack deploy has failed, let’s take a closer look at the stack resources
and see if we can determine which resources failed.

Here we will make things simple by viewing only failed resources.

[stack@undercloud] # heat resource-list overcloud | grep -i failed
| Compute | c032c668-755f-422f-8ad1-4abf46b022ff | OS::Heat::ResourceGroup | CREATE_FAILED | 2016-02-29T20:40:55Z |
| Controller | 668d27e0-9ab1-4dbe-8445-1d1ee8839265 | OS::Heat::ResourceGroup | CREATE_FAILED | 2016-02-29T20:40:55Z |

The failed resources are named "Compute" and "Controller". Let's take a closer
look at those using the "resource-show" subcommand.

#heat resource-show overcloud Compute

| resource_status_reason | ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: StackValidationFailed: Property error : OsNetConfigImpl: config The Parameter (BondInterfaceOvsOptions) was not provided."" |

Let’s now do the same for Controller.

#heat resource-show overcloud Controller

| resource_status_reason | ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: ResourceUnknownStatus: Resource failed - Unknown status FAILED due to "Resource CREATE failed: StackValidationFailed: Property error : OsNetConfigImpl: config The Parameter (BondInterfaceOvsOptions) was not provided."" |

Apparently I have some issues with my OVS bonding options, so I need to get those straight before I can continue.
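One way to get those straight is to supply the missing parameter in the environment file passed with -e. A hedged sketch of what that excerpt might look like (the bond mode shown is an example value, not a recommendation):

```yaml
# Hypothetical excerpt from advanced-networking.yaml: supply the
# parameter the validation error complained about.
parameter_defaults:
  BondInterfaceOvsOptions: "bond_mode=active-backup"
```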

Deleting a Failed Heat Stack

Since our last deploy failed, we need to delete the failed stack before we can kick off another stack deploy. Below is an example of that command – note we are using the UUID of the stack.


[stack@vz-undercloud] # heat stack-delete 2b0da4f6-e6f8-41cd-89e8-bf070d0e0d15
+--------------------------------------+------------+--------------------+----------------------+
| id | stack_name | stack_status | creation_time |
+--------------------------------------+------------+--------------------+----------------------+
| 2b0da4f6-e6f8-41cd-89e8-bf070d0e0d15 | overcloud | DELETE_IN_PROGRESS | 2016-03-01T17:21:58Z |
+--------------------------------------+------------+--------------------+----------------------+

then…

[stack@vz-undercloud] # heat stack-list
+--------------------------------------+------------+---------------+----------------------+
| id | stack_name | stack_status | creation_time |
+--------------------------------------+------------+---------------+----------------------+
| 2b0da4f6-e6f8-41cd-89e8-bf070d0e0d15 | overcloud | DELETE_FAILED | 2016-03-01T17:21:58Z |
+--------------------------------------+------------+---------------+----------------------+

Now let's kick off another deploy.

#openstack overcloud deploy --debug --templates ~/my_templates/ \
--ntp-server 10.1.0.1 --control-scale 3 --compute-scale 2 \
-e ~/my_templates/advanced-networking.yaml

Unfortunately, this deploy failed as well.

OK, let's take a look at /var/log/heat/heat-engine.log for more details. I also suggest opening another ssh session and tailing the log while the stack operation is running.

If the output is too verbose to follow, I suggest attempting to thin out the output using the command below

#tail -f /var/log/heat/heat-engine.log | egrep 'error|fatal'

This led me to the following error.

2016-03-01 13:46:12.366 18554 ERROR heat.engine.resource [-] Error marking
resource as failed
2016-03-01 13:46:12.366 18554 TRACE heat.engine.resource DBConnectionError:
(_mysql_exceptions.OperationalError) (2003, "Can't connect to MySQL server on
'172.16.0.10' (111)")

MySQL is down? So now we need to look at the MariaDB logs, where we see the following.

160301 12:16:24 [Warning] Failed to setup SSL
160301 12:16:24 [Warning] SSL error: SSL_CTX_set_default_verify_paths failed

Apparently SELinux is blocking reads of the certificates.
There are two ways to work around this issue. You can run "restorecon -v
/path/to/certs/", or you can disable SELinux entirely, either by running
"setenforce 0" or by editing the /etc/selinux/config file and setting 'SELINUX=disabled'.
You may need to rerun the delete; in my case it was stuck in
"DELETE_IN_PROGRESS", so I restarted all Heat related services to force the delete to error out.

#systemctl restart openstack-heat-engine.service openstack-heat-api.service \
openstack-heat-api-cloudwatch.service openstack-heat-api-cfn.service

This will cause the delete to error. You can then retry the delete.

If the delete is taking a long time, you can dig a bit deeper into the delete
using the command below.

#heat resource-list overcloud
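To keep an eye on a slow delete without rerunning that command by hand, the listing can be filtered down to the rows still in flight. A sketch; `in_flight` is a hypothetical helper name.

```shell
# in_flight: read "heat resource-list" output on stdin and drop the
# header row plus any rows that have already reached a COMPLETE status.
in_flight() {
  grep '^|' | grep -vi 'COMPLETE' | grep -vi 'resource_name'
}

# Usage, refreshed every 10 seconds (against a live undercloud):
#   watch -n 10 'heat resource-list overcloud | in_flight'
```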

Now drill down more with the command below.

#heat event-list overcloud

Make note of the resource_name and its id and use them in the next command.
Note that stack name is still overcloud.

#heat event-show overcloud Compute d9e13b02-07b0-4beb-8442-f25de0e7ef8b

I have found that rebooting the undercloud will clear out any in-progress
tasks, you can then run the delete again.

You can also try to manually delete each node from Ironic by mimicking what the
Nova driver in Ironic does. This is shown below for reference.

$ ironic node-set-provision-state <node uuid> deleted

And to remove the instance_uuid:

$ ironic node-update <node uuid> remove instance_uuid
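Those two calls can be wrapped in a loop over every registered node. This is a sketch, not something to run blindly: it assumes the default ironic node-list table layout, and it tears down every node it finds, so review the list first. `node_uuids` is a hypothetical helper name.

```shell
# node_uuids: read "ironic node-list" output on stdin and print the UUID
# column of each data row (rows whose second field looks like a UUID).
node_uuids() {
  awk -F'|' '$2 ~ /[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+-[0-9a-f]+/ { gsub(/ /, "", $2); print $2 }'
}

# Usage (destructive -- review the output of node_uuids first):
#   for uuid in $(ironic node-list | node_uuids); do
#     ironic node-set-provision-state "$uuid" deleted
#     ironic node-update "$uuid" remove instance_uuid
#   done
```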

Additional Resources

https://wiki.openstack.org/wiki/Heat

http://hardysteven.blogspot.com/2015/04/debugging-tripleo-heat-templates.html