OpenStack Installer (for RHEL-OSP) pt5 – troubleshooting

Hi folks,

So in the last post, I finally gave up the goods and showed you how to use the RHEL-OSP Installer to blow out a 3-node non-HA OpenStack deployment. Hopefully, it worked for you without any issue whatsoever. Even better would be if you ran into at least 1 or 2 problems...

Yes, you heard me (read me) right. I hope you ran into 1 or 2 problems. Read on, and I’ll tell you why...

One of my early tech jobs was to teach new hires how to build RISC-based systems (IBM RS/6000, DEC Alpha, HP PA-RISC, etc). When I would teach new hires how to build these (sometimes) million-dollar servers, the first thing I would preach is “patience”. Don’t “save” 5 minutes here if it costs you 45 on the other side.

Most importantly, if it comes out right the first time, every time, then all you’ve learned to do is turn a screw. Or in this case, type a command.

Here is my point. If something goes wrong, it’s an opportunity for you to learn how it works. If you know exactly what went wrong, it’s an opportunity for you to remember to double check that detail next time. If it keeps happening, but you know you’re doing it right, then it’s an opportunity for the developer(s) to fix something – provided you open a Bugzilla report.

But mostly, it’s an opportunity for you to learn how it works. To learn where the different log files are, the different TCP/UDP ports needed, which errors can be recovered from immediately, and which errors mean start over. In other words, it’s the opportunity to really learn the product. This is the difference between understanding what’s going on under the covers versus being a trained monkey.*

Still, there is less stress when things go perfectly…

Tips for a perfect deployment:

  1. Label all Ethernet ports on your systems with at least the last 4 characters of the MAC address. Label makers are cheap. And fun.
  2. If eth0 (or eno0) is “PXE/Mgmt” on one node, make it the same across the board. Less confusion, easier troubleshooting.
  3. Be sure that if you need a node to have 3 configured interfaces that the installer sees that node with 3 interfaces. If not, re-plug the cables, then re-PXE boot the node and make sure that the 3 interfaces show up. Same story if it needs 2 interfaces.
  4. If you have more than 1 tunnel subnet, they can be in the same subnet range, but I highly advise not overlapping actual IP addresses. Network namespaces will keep them separate, but troubleshooting becomes a bitch. For example, giving Coke 192.168.100.10-20 and giving Pepsi 192.168.100.21-31 is fine, provided they are created separately and show up with different segmentation IDs. But don’t give them both 10-20, or anything else that actually overlaps.
  5. Clear the boot sector on the node(s) in question before every build/re-build.
  6. Double or even triple check your IP configuration of each node once you have your nodes assigned to roles. Your Controller node is easiest as it will only have 1 interface, Neutron nodes need to see everything, and Compute nodes will need to see everything but the public network. Again, if the installer can see the interface and has a configuration for the interface, the rest of the deployment will generally go smoothly.
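
Tip 4 is easy to sanity check before you create the tunnel networks. Here is a minimal sketch in Python, using only the standard library `ipaddress` module. The ranges are the Coke/Pepsi example from the list above, and the function name `ranges_overlap` is my own illustration, not something the Installer provides:

```python
import ipaddress


def ranges_overlap(range_a, range_b):
    """Return True if two inclusive (start, end) IP ranges share any address."""
    a_start, a_end = (ipaddress.ip_address(ip) for ip in range_a)
    b_start, b_end = (ipaddress.ip_address(ip) for ip in range_b)
    # Two inclusive ranges overlap when each one starts before the other ends.
    return a_start <= b_end and b_start <= a_end


# Allocation ranges from the example above (illustrative values only).
coke = ("192.168.100.10", "192.168.100.20")
pepsi = ("192.168.100.21", "192.168.100.31")

print(ranges_overlap(coke, pepsi))  # False: the ranges are disjoint
print(ranges_overlap(coke, coke))   # True: identical ranges overlap
```

If this returns True for any pair of tenant ranges, fix the ranges before you deploy rather than after.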

Where do I watch things happen, i.e. monitor progress and/or problems?

  1. The console of the node being installed (during install) – From here you can watch the PXE boot, the kickstart being pulled down, the OS install, and the post-configuration. The post-configuration includes registering to RHN, getting additional packages, and updating everything that is already installed. And because this is a Foreman/Puppet-based installer for OpenStack, it deploys Puppet-related packages at the end.
  2. The syslog of the node being configured (post-install) – After the initial installation, post-configuration, and reboot, the system should start a series of Puppet runs where it gets additional packages and configuration files. This can all be tracked in /var/log/messages.
  3. The installer deployment page – This is good for overall progress only. While the deployment is going on, you see the overall progress of each node. However, it does not account for the Puppet configuration runs that occur post-installation. So “100%” only means that a node completed the OS installation, kickstart %post, and reboot.
  4. The installer host page (during post-install Puppet runs) – Once the node in question is either in the midst of its Puppet runs or supposed to be in its Puppet runs, you can check progress in Hosts -> All Hosts -> (hostname) -> Reports -> (specific report), where “specific report” is listed as “2 minutes ago” or “5 minutes ago”, etc. If you see that a particular report includes failures, you can drill down and even sort on failures to see what the exact issue was.
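
If you would rather not eyeball /var/log/messages during the Puppet runs, a small filter helps. This is a rough sketch, not a definitive tool. It assumes the Puppet agent tags its syslog lines as `puppet-agent[PID]:` and that problem lines contain words like "err", "failed", or "could not". Both the tag and the word list are my assumptions, so adjust them to what you actually see on your nodes:

```python
import re

# Assumption: Puppet agent lines in syslog carry a "puppet-agent[PID]:" tag.
PUPPET_TAG = re.compile(r"puppet-agent\[\d+\]:")
# Assumption: lines worth a second look contain one of these words.
PROBLEM_WORDS = re.compile(
    r"\b(?:err|error|fail(?:ed|ure)?|could not)\b", re.IGNORECASE
)


def puppet_problems(lines):
    """Return the Puppet agent lines that look like errors or failures."""
    return [
        line.rstrip("\n")
        for line in lines
        if PUPPET_TAG.search(line) and PROBLEM_WORDS.search(line)
    ]


def scan_syslog(path="/var/log/messages"):
    """Read the syslog file at `path` and return suspicious Puppet lines."""
    with open(path, errors="replace") as log:
        return puppet_problems(log)
```

On the node being configured, calling `scan_syslog()` as root returns the suspicious lines as a list. Anything it finds is a starting point for the report drill-down in the installer host page, not a replacement for it.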

What can I recover from?

Pretty much everything except the interface problem described in the next section. For example, the original guide didn’t have all of the necessary ports listed for iptables, so getting through to Foreman didn’t work. As soon as I punched Foreman through the firewall it was fine. Another time, for whatever reason, the TFTP server failed to pick up the vmlinuz image and initramfs, so I had to copy them over from the ISO image. No big deal, but it was odd looking at the screen reading “image not found”.
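
A quick way to tell a firewall problem from a service problem is to test the TCP ports from another machine. Here is a minimal sketch using Python’s standard `socket` module. The function name is mine, and which ports you pass in depends on your guide and your setup. Note that this only covers TCP, so it will not tell you anything about TFTP, which runs over UDP:

```python
import socket


def check_tcp_ports(host, ports, timeout=3.0):
    """Try a TCP connection to each port and report which ones accept it."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or no route.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

For example, `check_tcp_ports("installer.example.com", [443, 8140])` would test HTTPS and the default Puppet master port against a placeholder hostname. A port that comes back False while the service is running on the installer is a good hint that the firewall is in the way.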

When do I need to start over?

If you didn’t double check your interface count before hitting “deploy”, and the Installer doesn’t see all of the interfaces, you are partially screwed. For example, if you have a compute node, the Installer needs to see 2 interfaces on it. If it only sees one, then the Puppet runs will fail when their time comes. Manually configuring the interface and restarting Puppet will not help. You need to tag the host for re-build, re-PXE, and ensure that the Installer sees all of the interfaces.
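
If you can get a shell on the node, you can at least confirm what its own kernel sees. The sketch below reads the Linux sysfs directory /sys/class/net and maps each interface to the last 4 characters of its MAC address, which lines up with the label-maker tip. It is only a sanity check under my own assumptions: it counts every non-loopback interface, including virtual ones like bridges, and it is no substitute for checking the host’s interface list in the Installer itself:

```python
import os


def list_interfaces(sysfs_net="/sys/class/net"):
    """Map each non-loopback interface name to the last 4 hex digits of its MAC."""
    interfaces = {}
    for name in sorted(os.listdir(sysfs_net)):
        if name == "lo":
            continue
        address_file = os.path.join(sysfs_net, name, "address")
        try:
            with open(address_file) as handle:
                mac = handle.read().strip()
        except OSError:
            continue  # entry without a readable MAC address; skip it
        interfaces[name] = mac.replace(":", "")[-4:]
    return interfaces


def has_enough_interfaces(expected, sysfs_net="/sys/class/net"):
    """Return True if at least `expected` non-loopback interfaces are present."""
    return len(list_interfaces(sysfs_net)) >= expected
```

On a compute node you would expect `has_enough_interfaces(2)` to be True. If it is False, re-plug, re-PXE, and check again before you hit “deploy”.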

The interface count was the biggest problem that I ran into, which is why I’ve mentioned it so many times. Let me know what issues you’ve run into. Maybe I can help (maybe not…)

Hope this helps,

Captain KVM

* In the early space program, monkeys (chimpanzees, mostly) could easily be trained to turn switches and dials in specific order at specific times and effectively “fly” a capsule/module. But if something went wrong, there was nothing the monkey or ground control could do. On the other hand, humans that truly understand how everything under the control panel works, like everyone involved with Apollo 13, can deal with adversity based on mastery of what is laid before them. I have nothing against monkeys – trained or otherwise.

 

 

2 thoughts on “OpenStack Installer (for RHEL-OSP) pt5 – troubleshooting”

  1. Hi, I am trying to deploy OpenStack using the RHEL-OSP Installer. I have a 3-node configuration (1 Controller and 2 Compute). On the compute node I get the following error once the deployment is started:

    “There was an error rendering the Kickstart RHEL default template: undefined method ‘has_vlanid?’ for NilClass::Jail (NilClass)”

    Could you please suggest what is wrong here?
