How OpenStack services like nova-api consume oslo.messaging. A walkthrough to understand how OpenStack services consume oslo.messaging and the RabbitMQ driver, and to understand the root cause of the AMQP heartbeat/eventlet issue under Apache MPM prefork.
The goal of this post is to centralize pieces of my analysis and of the original puzzle, related to an OpenStack issue with the AMQP heartbeat on nova-api under Apache mod_wsgi.
In this post I want to show you why the Apache MPM `prefork` module in use with eventlet is an issue. This information can be useful to other people, so I want to share it with you.
My walkthrough (cf. the section below) helped me understand how things work together, so I think it is useful to include it here even if it is a little bit out of the scope of the fix and the real issue.
I’m not a nova expert and I may be wrong on some points, so if you see errors in this walkthrough, do not hesitate to fix them and submit a pull request.
This post is for developers and engineers who want to understand why Apache with eventlet can be an issue in some situations.
You don’t need to be an OpenStack developer: if you want to use eventlet behind an HTTP RESTful API, this article can help you understand the dependencies in your stack and the execution model behind your choices.
We are facing an issue with the RabbitMQ driver heartbeat under the Apache MPM `prefork` module and mod_wsgi, when nova-api [monkey patches the stdlib by using eventlet](https://eventlet.net/doc/patching.html), i.e. nova-api calls `eventlet.monkey_patch()` when it runs under mod_wsgi.
This impacts the AMQP heartbeat thread, which is meant to be a native thread. Instead of checking the AMQP sockets every 15 seconds, it is now suspended and resumed by eventlet.
However, resuming green threads can take a very long time if mod_wsgi isn’t processing traffic regularly, which can cause RabbitMQ to close the AMQP connection.
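To make this concrete, here is a minimal sketch (assuming eventlet is installed; this is not nova’s actual code) showing that once the stdlib is monkey patched, a thread created with `threading.Thread` is actually a green thread driven by the eventlet hub:

```python
# Minimal sketch of the problem (not nova's actual code): after
# eventlet.monkey_patch(), "native" threads and time.sleep() are replaced
# by green threads and hub sleeps, so the loop below only progresses when
# the eventlet hub gets control.
import eventlet
eventlet.monkey_patch()  # what nova-api does at WSGI application init

import threading  # now eventlet's green version of threading
import time       # time.sleep() now yields to the eventlet hub


def heartbeat_loop():
    for _ in range(3):
        # In a real native thread this would run every 15s regardless of
        # what the main thread does; as a green thread it only runs when
        # the hub is scheduled, e.g. while mod_wsgi is handling requests.
        print("AMQP heartbeat check")
        time.sleep(15)


t = threading.Thread(target=heartbeat_loop)
t.start()
t.join()
```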
Here is a list of related issues:
The RabbitMQ heartbeat was introduced a few years ago to keep the connections from various components to RabbitMQ alive. In some situations, for example with a stateful firewall between the endpoints, an idle connection could be terminated without either endpoint being aware.
The oslo.messaging RabbitMQ driver, and especially its heartbeat, suffers from inheriting the execution model of the service which consumes it.
In this scenario nova-api needs green threads to manage cells and edge features, so nova-api monkey patches the stdlib to obtain async features, and the oslo.messaging RabbitMQ driver endures these changes.
In OpenStack the default Apache MPM module in use is the `prefork` module.
I think the main issue here is that nova-api wants async behaviour and uses eventlet green threads (which are based on epoll or libevent) to obtain it, in an environment based on the Apache MPM `prefork` module, which doesn’t support epoll and recent kernel features.
The Apache MPM `prefork` module is appropriate for sites that need to avoid threading for compatibility with non-thread-safe libraries.
So we suspect that the Apache MPM engine (`prefork`) in use here is also a part of the problem.
We have 2 possible solutions to fix the issue:
- allow oslo.messaging to run the heartbeat in a native thread (the `heartbeat_in_pthread` option described below);
- switch the Apache MPM module in use from `prefork` to `event`.
To avoid similar issues we want to allow the user to isolate the heartbeat execution model from the execution model inherited from the parent process, by passing the `heartbeat_in_pthread` option through the driver config.
While we use MPM `prefork` we want to avoid using libevent and epoll.
If the `heartbeat_in_pthread` option is given we want to force the use of the Python stdlib threading module to run the RabbitMQ heartbeat, to avoid issues related to a non-“standard” environment. I say “standard” because async features aren’t the default configuration in most cases, starting with Apache, which uses `prefork` as its default engine.
This is an experimental feature; it helps us ensure that the heartbeat runs in a classical Python thread, not built over epoll and libevent, to avoid issues with environments that don’t support these kernel features.
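As a rough sketch of the idea (this is not the actual oslo.messaging patch; the function below is hypothetical, only the `heartbeat_in_pthread` option name comes from the real change), the driver can keep a reference to the unpatched stdlib threading module and use it when the option is enabled:

```python
# Illustrative sketch, not the actual oslo.messaging code: keep a handle
# on the real (unpatched) threading module so the heartbeat can run in a
# genuine POSIX thread even if the consuming service monkey patched the
# stdlib with eventlet.
import threading  # possibly patched by eventlet in the parent process

import eventlet.patcher

# eventlet keeps the original modules around; this returns the unpatched
# stdlib threading module even after monkey_patch() has been called.
stdlib_threading = eventlet.patcher.original('threading')


def start_heartbeat(run_heartbeat, heartbeat_in_pthread):
    if heartbeat_in_pthread:
        # Real kernel-scheduled thread: runs independently of the eventlet
        # hub, so the heartbeat keeps firing even when no HTTP traffic
        # wakes up the mod_wsgi process.
        thread_cls = stdlib_threading.Thread
    else:
        # Whatever "threading" means in the parent process: a green thread
        # if the service called eventlet.monkey_patch().
        thread_cls = threading.Thread
    thread = thread_cls(target=run_heartbeat)
    thread.daemon = True
    thread.start()
    return thread
```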
Proposed fix:
We will try to switch [4][5] from the Apache `prefork` module to the `event` module, which supports non-blocking sockets and uses modern kernel features like epoll through APR.
Proposed fixes:
These changes are not fully tested yet with the oslo.messaging RabbitMQ heartbeat. I need to run some tests with a non-patched version of oslo.messaging to observe whether it works better with eventlet under Apache and mod_wsgi.
If these changes work as expected, the oslo.messaging fix (cf. the previous section) may become obsolete and we could restore the original behaviour: an inherited parent execution model would no longer be an issue for oslo.messaging.
I will update this section when I have more info to share with you about this.
To see how things work in OpenStack, we will do a walkthrough to inspect how nova-api consumes and uses the oslo.messaging RabbitMQ driver.
We will start by analyzing how OpenStack runs nova-api and the different approaches used to launch the service (puppet, ansible); then we will observe how the WSGI application for nova-api works; and finally we will see how nova-api establishes the connections with the RabbitMQ driver and how the health check heartbeat is launched.
I initiated this walkthrough to better understand how OpenStack services work and to try to fix an oslo.messaging issue related to the RabbitMQ driver when it runs through nova in an eventlet monkey patched environment under Apache mod_wsgi (cf. the previously described issue).
The puppet team provides a puppet project dedicated to nova which defines how to run nova under an Apache environment. It’s important to note that other deployment systems exist and are in use in OpenStack, like Ansible through the openstack-ansible project, which provides a different approach to deploying nova. There are many other deployment systems in use too (Chef, Salt, etc.).
This puppet project will configure Apache to run nova.
The `script_path` is used as the `wsgi_script_dir` when configuring the httpd conf.
In the end it is all used by https://github.com/openstack/puppet-openstacklib/blob/master/manifests/wsgi/apache.pp to configure the WSGI settings of the vhost.
We can observe that this puppet script defines the following configuration: `api_wsgi_script_source => '/usr/bin/nova-api-wsgi'`.
Nova will be launched by calling the WSGI script available at `/usr/bin/nova-api-wsgi`.
The nova-api will be run under the Apache MPM `prefork` module, as described previously.
This WSGI script is generated by pbr and setuptools during the nova install.
Nova defines it in its own setup.cfg file.
When it is called, this entry point triggers `nova.api.openstack.compute.wsgi:init_application`, which corresponds to the initialization of the nova WSGI application.
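For illustration, the generated `/usr/bin/nova-api-wsgi` script is roughly equivalent to the following simplified sketch (the real pbr-generated script contains more boilerplate):

```python
# Simplified sketch of what the pbr-generated /usr/bin/nova-api-wsgi
# script does: it resolves the wsgi_scripts entry point declared in
# nova's setup.cfg and exposes the resulting WSGI callable under the
# name "application", which is what mod_wsgi looks for.
from nova.api.openstack.compute.wsgi import init_application

application = init_application()
```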
To implement the Python Web Server Gateway Interface (WSGI) (PEP 3333), nova uses the `paste.deploy` Python module.
Paste Deploy was originally a submodule of the Paste module.
Basically, at some point in the past, people realized that `paste.deploy` was one of the main useful parts of Paste and they didn’t want all the rest of Paste, so it was extracted.
As Paste and paste.deploy both got “old”, their maintenance sort of diverged. They are still maintained, but as separate packages.
Even if the Paste module seems not to be used here, I will describe some Paste-specific behaviours, especially concerning request handling, threads, and the Python interpreter life cycle, which I think we need to take into account to really understand the eventlet issue and the green threads used for the heartbeat.
Indeed, to avoid issues with request management (freezes, memory usage, etc.), Paste manages threads with 3 states:
If a Paste thread initiates the oslo.messaging AMQP heartbeat, which runs asynchronously using a green thread, then the parent thread may consider that it can become idle, the main reason being that this thread (the heartbeat) is not a blocking thread.
In parallel, uwsgi’s support for Paste (uwsgi is not used here, but the heartbeat issue seems to occur within it too) faces some issues in multiple process/workers mode.
On the `paste.deploy` side, nova calls the `loadapp` method to serve nova-api.
Paste Deployment is a system for finding and configuring WSGI applications and servers. For WSGI application consumers it provides a single, simple function (loadapp) for loading a WSGI application from a configuration file or a Python Egg. For WSGI application providers it only asks for a single, simple entry point to your application, so that application users don’t need to be exposed to the implementation details of your application.
The nova service calls the `loadapp` method by passing the available configuration.
The configuration seems to be designed by following the nova configuration guide.
A sample configuration example for nova is available online (cf. the wsgi part).
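To make the `loadapp` call concrete, here is a minimal, hypothetical example of loading a WSGI application from a Paste Deploy configuration file (the path and application name are illustrative, not nova’s exact values):

```python
from paste.deploy import loadapp

# Hypothetical path and name, for illustration only; nova resolves its
# own api-paste.ini location and application name from its configuration.
config_path = '/etc/nova/api-paste.ini'
application = loadapp('config:%s' % config_path, name='osapi_compute')

# "application" is now a standard WSGI callable that mod_wsgi (or any
# other WSGI server) can serve.
```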
Paste Deploy doesn’t seem to define specific behaviours related to thread management. It only seems to help define application parts and their related URLs/URIs, database URLs, etc.
Well, now we will continue to follow the nova code.
During the init of the nova-api application, nova will try to set up some services.
These services are set up by using the `_setup_service` method.
Services to set up are defined in a service manager.
Nova manages services by using a dedicated service module.
The services also seem to be retrieved by querying the database, using the previously defined service module.
To continue this inspection of nova, we will now focus on the `ConsoleAuthManager` module.
This class will instantiate a `compute_rpcapi.ComputeAPI` object.
This object (`ComputeAPI`) defines a `router` method which returns an RPC client.
The RPC client returned by the nova rpc module is an oslo.messaging RPC client.
The oslo.messaging `RPCClient` is then instantiated.
The transport layer is defined by nova and is retrieved by using the oslo.messaging mechanisms, based on the config and the URL, and the drivers defined by oslo.messaging, loaded through stevedore.
In our case we will instantiate a `RabbitDriver`.
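As a rough illustration of this chain (simplified; nova wraps this in its own rpc module, and the topic/version values below are illustrative), obtaining a transport and an RPC client with oslo.messaging looks like this:

```python
import oslo_messaging as messaging
from oslo_config import cfg

conf = cfg.CONF
# get_rpc_transport() reads the transport_url from the configuration and
# loads the matching driver (e.g. rabbit://... -> RabbitDriver) through
# stevedore entry points.
transport = messaging.get_rpc_transport(conf)

# Topic and version are illustrative, not nova's exact values.
target = messaging.Target(topic='compute', version='5.0')
client = messaging.RPCClient(transport, target)

# client.call(...) / client.cast(...) then send RPC messages over the
# RabbitMQ connections managed by the driver's connection pool.
```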
The driver in use will initiate a connection pool by using the `Connection` class defined in the driver.
The oslo.messaging connection pool module will create the connection and also start the health check mechanism by triggering the heartbeat in a dedicated thread.
Also, on the oslo.messaging side we need to pay attention to the connection class execution model. Even if the rabbit driver has only one `Connection` class, this connection can be used for two purposes:
On the other hand, the connection pool seems to define the `_on_expire` event listener.
This listener seems to be called when an:
Idle connection has expired and been closed
The idle connection here seems to be the connection with the RabbitMQ server (for example) which has expired.
Then the “thread safe” `Pool` mechanism, modelled after the eventlet.pools.Pool interface but designed to be safe when using native threads without the GIL, defines the `expire` method which cleans expired connections from the pool based on a TTL.
I think we need to pay attention to the previous mechanism, because the connection and the heartbeat are invoked from the connection pool mechanism (cf. the previous lines about how the connection pool module starts the health check).
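Here is a minimal, illustrative sketch (not the oslo.messaging implementation) of a TTL-based pool expiry similar to what is described above: connections idle for longer than the TTL are closed and dropped from the pool.

```python
# Illustrative sketch of TTL-based pool expiry, not the oslo.messaging
# code: connections returned to the pool are timestamped, and expire()
# closes those that have been idle for longer than the TTL.
import collections
import threading
import time


class TTLPool:
    def __init__(self, ttl=1200):
        self.ttl = ttl
        self._lock = threading.Lock()
        # Each item is a (connection, time_it_was_put_back) pair.
        self._items = collections.deque()

    def put(self, connection):
        with self._lock:
            self._items.append((connection, time.monotonic()))

    def expire(self):
        """Close and drop connections idle for longer than the TTL."""
        now = time.monotonic()
        with self._lock:
            while self._items:
                connection, idle_since = self._items[0]
                if now - idle_since < self.ttl:
                    break
                self._items.popleft()
                connection.close()  # the "_on_expire" equivalent
```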
The nova rpc module also defines an RPC server inherited from oslo.messaging.
There the executor in use is the `eventlet` executor, so the initiated object will use eventlet and thus the heartbeat thread will use a green thread.
oslo.messaging then instantiates the RPC server (https://github.com/openstack/oslo.messaging/blob/40c25c2bde6d2f5a756e7169060b7ce389caf174/oslo_messaging/rpc/server.py#L190).
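For illustration (the endpoint class and topic below are hypothetical, and the `eventlet` executor was a valid choice in oslo.messaging at the time of writing), creating an RPC server with an explicit executor looks roughly like this:

```python
import oslo_messaging as messaging
from oslo_config import cfg


class ComputeEndpoint(object):
    """Hypothetical endpoint exposing methods callable over RPC."""

    def ping(self, ctxt, payload):
        return payload


conf = cfg.CONF
transport = messaging.get_rpc_transport(conf)
target = messaging.Target(topic='compute', server='host1')

# With executor='eventlet', incoming messages are dispatched on green
# threads; with executor='threading' they would run on native threads.
server = messaging.get_rpc_server(
    transport, target, [ComputeEndpoint()], executor='eventlet')
server.start()
```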
Some POCs are available at: https://github.com/4383/pyamqp-heartbeat/blob/master/POC.md
You can test behaviors between execution models and different versions of the oslo.messaging code base (patched/non-patched).
These POCs can be useful to compare the difference between using a native Python thread and an eventlet green thread under Apache MPM `prefork`.
These POCs don’t reflect the latest changes, especially the `heartbeat_in_pthread` option and the possibility to turn the feature on/off. In other words, in these POCs we always force the use of pthreads, and they allow you to compare the RabbitMQ heartbeat connection with a pthread versus a green thread.
Congratulations! You are now at the end of this article; you’ve read the biggest part of this post.
The execution environment can have an impact on services, applications, and libraries, as in the nova-api/oslo.messaging use case.
It’s not a trivial thing to debug; I hope my article can help some of you.
OpenStack is a little bit complicated to debug; to run a service we need to instantiate many environments and apps like Apache, mod_wsgi, etc.
With a stack like this it can be difficult to determine which part introduces the issue, and why the error occurs.
I hope you appreciated reading this article; don’t hesitate to fix errors if you see any, and don’t hesitate to contact me for further discussions and questions.
You can find further resources and useful links below.
Cheers!