- Apr 13, 2023
Matt Crees authored
With the addition of the variable `om_enable_rabbitmq_high_availability`, this feature in the upgrade task should be brought back. It is also now used in the deploy task. The `ha-all` policy is cleared only when `om_enable_rabbitmq_high_availability` is set to `false`.

Change-Id: Ia056aa40e996b1f0fed43c0f672466c7e4a2f547
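A minimal `/etc/kolla/globals.yml` sketch of the clearing behaviour described above; with the flag set to `false`, deploy and upgrade will remove any leftover `ha-all` policy:

```yaml
# /etc/kolla/globals.yml
# Illustration only: with HA disabled, the deploy/upgrade tasks clear a
# pre-existing ha-all policy from the cluster.
om_enable_rabbitmq_high_availability: false
```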
- Apr 12, 2023
Michal Nasiadka authored
Since RMQ 3.8 we can use rolling upgrade [1].

[1]: https://www.rabbitmq.com/upgrade.html#rolling-upgrades

Depends-On: https://review.opendev.org/c/openstack/kolla/+/872393
Change-Id: If6a7c6c12d9226a2406728108b3c87b3485ac55f
- Mar 21, 2023
John Garbutt authored
Following the ideas here: https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit

Make sure old messages with no consumer are dropped after the message TTL of 10 mins, longer than the 1 min RPC timeout. Also ensure queues expire after an hour of inactivity, so queues from removed or renamed nodes don't grow over time.

Change-Id: Ifb28ac68b6328adb604a7474d01e5f7a47b2e788
Matt Crees authored
Adds two new flags to alter behaviour in RabbitMQ:

* `rabbitmq_message_ttl_ms`, which lets you set a TTL on messages.
* `rabbitmq_queue_expiry_ms`, which lets you set an expiry time on queues.

See https://www.rabbitmq.com/ttl.html for more information on both.

Change-Id: I51ca37ffbb1bb5c07f2d39873f0f33ca20263f2a
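A `globals.yml` sketch combining both flags with the values from the commit above (both are in milliseconds):

```yaml
# /etc/kolla/globals.yml
rabbitmq_message_ttl_ms: 600000    # drop unconsumed messages after 10 minutes
rabbitmq_queue_expiry_ms: 3600000  # expire queues after 1 hour of inactivity
```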
- Feb 14, 2023
John Garbutt authored
Currently we do not follow the RabbitMQ advice on replicas here: https://www.rabbitmq.com/ha.html#replication-factor

Here we reduce the number of replicas to n // 2 + 1, as advised above. The hope is that this helps speed up recovery from rabbit issues.

Related-Bug: #1954925
Change-Id: Ib6bcb26c499c9884faa4a0cd51abaec00cacb096
Matt Crees authored
Adds the flag `rabbitmq_ha_replica_count` to change how many different nodes a queue should be mirrored across. If the value is not set, then it defaults to "ha-mode":"all". This value is unset by default to avoid any unexpected changes to the RabbitMQ definitions.json file, as that would trigger an unexpected restart of RabbitMQ during the next deploy.

Change-Id: Iee98cd937197a73a3b04aa8501fa325e8ecfff24
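A sketch for a three-node cluster, applying the n // 2 + 1 rule from the commit above (3 // 2 + 1 = 2):

```yaml
# /etc/kolla/globals.yml -- mirror each queue across 2 of the 3 nodes
rabbitmq_ha_replica_count: 2
```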
- Feb 09, 2023
John Garbutt authored
By default ha-promote-on-shutdown=when-synced. However, we are seeing issues with RabbitMQ automatically recovering when nodes are restarted: https://www.rabbitmq.com/ha.html#cluster-shutdown

Rather than waiting for operator intervention, it is better that we allow recovery to happen, even if that means we may lose some messages. A few failed and timed-out operations are better than a totally broken cloud. This is achieved using ha-promote-on-shutdown=always.

Note that when a node failure is detected, this is already the default behaviour from 3.7.5 onwards: https://www.rabbitmq.com/ha.html#promoting-unsynchronised-mirrors

This patch adds the option to change the ha-promote-on-shutdown definition, using the flag `rabbitmq_ha_promote_on_shutdown`. This value is unset by default to avoid any unexpected changes to the RabbitMQ definitions.json file, as that would trigger an unexpected restart of RabbitMQ during the next deploy.

Related-Bug: #1954925
Change-Id: I2146bda2c72ddac2c9923c6941b0596395fd9ab5
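A `globals.yml` sketch opting in to the recovery behaviour described above:

```yaml
# /etc/kolla/globals.yml -- promote an unsynchronised mirror on controlled
# shutdown instead of waiting for operator intervention (may lose messages)
rabbitmq_ha_promote_on_shutdown: "always"
```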
- Jan 17, 2023
Michal Arbet authored
As rabbitmq's configuration file is not an INI or YAML file, there is no option to extend the configuration with new options via merge_configs or merge_yaml. This patch moves the config options into a dictionary so they can be overridden in /etc/kolla/globals.yml.

Change-Id: I5cd772f4fb80a0e200fb24d67be735ca81e3fdeb
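A sketch of such an override; the dictionary name `rabbitmq_extra_config` is an assumption here (check the role defaults for the variable this commit introduces), and the keys shown are ordinary rabbitmq.conf settings:

```yaml
# /etc/kolla/globals.yml
# Assumed dictionary name; entries are rendered into rabbitmq.conf.
rabbitmq_extra_config:
  cluster_partition_handling: pause_minority
  vm_memory_high_watermark.relative: 0.5
```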
- Jan 13, 2023
Matt Crees authored
A combination of durable queues and classic queue mirroring can be used to provide high availability of RabbitMQ. However, these options should only be used together, otherwise the system will become unstable. Using the flag ``om_enable_rabbitmq_high_availability`` will either enable both options at once, or neither of them.

There are some queues that should not be mirrored:

* ``reply`` queues (these have a single consumer and TTL policy)
* ``fanout`` queues (these have a TTL policy)
* ``amq`` queues (these are auto-delete queues, with a single consumer)

An exclusionary pattern is used in the classic mirroring policy. This pattern is ``^(?!(amq\\.)|(.*_fanout_)|(reply_)).*``

Change-Id: I51c8023b260eb40b2eaa91bd276b46890c215c25
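Enabling the combined behaviour is then a single toggle; a minimal sketch:

```yaml
# /etc/kolla/globals.yml -- enables durable queues and classic queue
# mirroring (with the exclusion pattern above) together, as a unit
om_enable_rabbitmq_high_availability: "yes"
```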
- Jan 12, 2023
Mark Goddard authored
When running in check mode, some prechecks previously failed because they use the command module, which is silently not run in check mode. Other prechecks were not running correctly in check mode due to e.g. looking for a string in empty command output, or not querying which containers are running. This change fixes these issues.

Closes-Bug: #2002657
Change-Id: I5219cb42c48d5444943a2d48106dc338aa08fa7c
- Jan 09, 2023
Erik Berg authored
Using assert will also fail when the conditions are not met, it makes clear what we are actually testing, and it is not listed as a skipped task when the condition is OK.

Change-Id: I4c919b523dde2602c81179ab3d28b913650b4c9f
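A hypothetical before/after sketch of the pattern; the task names and the `rabbitmq_status` variable are invented for illustration:

```yaml
# Before: shows up as "skipped" in output whenever the check passes
- name: Fail if the RabbitMQ check did not succeed
  fail:
    msg: RabbitMQ cluster check failed
  when: rabbitmq_status.rc != 0

# After: states the expected condition and always reports ok/failed
- name: Assert that the RabbitMQ check succeeded
  assert:
    that: rabbitmq_status.rc == 0
    fail_msg: RabbitMQ cluster check failed
```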
- Dec 21, 2022
Matt Crees authored
Regularly, we experience issues in Kolla Ansible deployments because we use wrong options in OpenStack configuration files. This is because OpenStack services ignore unknown options. We also need to keep on top of deprecated options that may be removed in the future. Integrating oslo-config-validator into Kolla Ansible will greatly help.

Adds a shared role to run oslo-config-validator on each service. Takes into account that services have multiple containers, and these may also use multiple config files. Service roles are extended to use this shared role.

Executed with the new command ``kolla-ansible validate-config``.

Change-Id: Ic10b410fc115646d96d2ce39d9618e7c46cb3fbc
- Nov 02, 2022
Ivan Halomi authored
Second part of patchset: https://review.opendev.org/c/openstack/kolla-ansible/+/799229/ in which it was suggested to split the patch into smaller ones.

This change adds a container_engine variable to the kolla_container_facts module, preparing the module to be used with Docker and Podman without further changes in roles.

Signed-off-by: Ivan Halomi <i.halomi@partner.samsung.com>
Co-authored-by: Martin Hiner <m.hiner@partner.samsung.com>
Change-Id: I9e8fa30646844ab4a288555f3aafdda345b3a118
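A sketch of a task calling the module with an explicit engine after this change; the parameters follow the commit description:

```yaml
# The container_engine parameter selects docker or podman at runtime
- name: Get container facts
  become: true
  kolla_container_facts:
    container_engine: "{{ kolla_container_engine }}"
    name:
      - rabbitmq
  register: container_facts
```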
- Oct 28, 2022
Ivan Halomi authored
First part of patchset: https://review.opendev.org/c/openstack/kolla-ansible/+/799229/ in which it was suggested to split the patch into smaller ones.

This implements the kolla_container_engine variable in command calls of docker, so that later on it can also be used for podman without further change.

Signed-off-by: Ivan Halomi <i.halomi@partner.samsung.com>
Change-Id: Ic30b67daa2e215524096ad1f4385c569e3d41b95
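A sketch of the substitution this enables in command calls:

```yaml
# Before: the binary name is hard-coded
- name: List running containers
  command: docker ps -a

# After: the engine comes from a variable (docker by default), so podman
# can be swapped in later without touching each task
- name: List running containers
  command: "{{ kolla_container_engine }} ps -a"
```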
- Aug 09, 2022
Michal Arbet authored
This patch adds the loadbalancer-config role, which is a "wrapper" around the haproxy-config role and the proxysql-config role that will be added in follow-up patches.

Change-Id: I64d41507317081e1860a94b9481a85c8d400797d
- Jul 27, 2022
Radosław Piliszek authored
It is no longer needed per the removed comment.

Change-Id: I8d88c21c7e115b842a56f0ba5c780c3bde593964
- Jul 25, 2022
Michal Nasiadka authored
ansible-lint introduced var-spacing - let's fix our code.

Change-Id: I0d8aaf3c522a5a6a5495032f6dbed8a2be0251f0
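The var-spacing rule wants a single space inside Jinja2 delimiters; an illustrative before/after (variable names assumed):

```yaml
# Flagged by ansible-lint var-spacing
rabbitmq_image_full: "{{rabbitmq_image}}:{{rabbitmq_tag}}"

# Fixed: one space inside each set of braces
rabbitmq_image_full: "{{ rabbitmq_image }}:{{ rabbitmq_tag }}"
```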
- May 23, 2022
Radosław Piliszek authored
Change-Id: Ib4b15ed4feac82d8492b1c0f0238a752eac668e6
- Apr 20, 2022
Marcin Juszkiewicz authored
We have only one value for install_type now, and it gets removed from image names.

Change-Id: I8bf95fd7aa9dd26b80d618ca0fcb097003b4cb0a
- Mar 24, 2022
Sven Kieske authored
This adds back the ability to configure the rabbitmq/erlang kernel network interface, which was removed in https://review.opendev.org/#/c/584427/ seemingly by accident.

Closes-Bug: 1900160
Change-Id: I6f00396495853e117429c17fadfafe809e322a31
- Mar 18, 2022
Mark Goddard authored
A follow-up to I91d0e23b22319cf3fdb7603f5401d24e3b76a56e, fixing a conditional corner case when removing the ha-all policy.

Change-Id: Iea75551bc6d0da7dd10515dd8bd28c014eed7a5e
- Feb 21, 2022
Doug Szumski authored
When OpenStack is deployed with Kolla-Ansible, by default there are no durable queues or exchanges created by the OpenStack services in RabbitMQ. In Rabbit terminology, not being durable is referred to as `transient`, and this means that the queue is generally held in memory.

Whether OpenStack services create durable or transient queues is traditionally controlled by the Oslo Notification config option: `amqp_durable_queues`. In Kolla-Ansible, this remains set to the default of `False` in all services. The only `durable` objects are the `amq*` exchanges which are internal to RabbitMQ.

More recently, Oslo Notification has introduced support for Quorum queues [7]. These are a successor to durable classic queues, however it isn't yet clear if they are a good fit for OpenStack in general [8].

For clustered RabbitMQ deployments, Kolla-Ansible configures all queues as `replicated` [1]. Replication occurs over all nodes in the cluster. RabbitMQ refers to this as 'mirroring of classic queues'.

In summary, this means that a multi-node Kolla-Ansible deployment will end up with a large number of transient, mirrored queues and exchanges. However, the RabbitMQ documentation warns against this, stating that 'For replicated queues, the only reasonable option is to use durable queues' [2]. This is discussed further in the following bug report: [3].

Whilst we could try enabling the `amqp_durable_queues` option for each service (this is suggested in [4]), there are a number of complexities with this approach, not limited to:

1) RabbitMQ is planning to remove classic queue mirroring in favor of 'Quorum queues' in a forthcoming release [5].
2) Durable queues will be written to disk, which may cause performance problems at scale. Note that this includes Quorum queues, which are always durable.
3) Potential for race conditions and other complexity discussed recently on the mailing list under: `[ops] [kolla] RabbitMQ High Availability`.

The remaining option, proposed here, is to use classic non-mirrored queues everywhere, and rely on services to recover if the node hosting a queue or exchange they are using fails. There is some discussion of this approach in [6]. The downside of potential message loss needs to be weighed against the real upsides of increasing the performance of RabbitMQ, and moving to a configuration which is officially supported and hopefully more stable. In the future, we can then consider promoting specific queues to quorum queues, in cases where message loss can result in failure states which are hard to recover from.

[1] https://www.rabbitmq.com/ha.html
[2] https://www.rabbitmq.com/queues.html
[3] https://github.com/rabbitmq/rabbitmq-server/issues/2045
[4] https://wiki.openstack.org/wiki/Large_Scale_Configuration_Rabbit
[5] https://blog.rabbitmq.com/posts/2021/08/4.0-deprecation-announcements/
[6] https://fuel-ccp.readthedocs.io/en/latest/design/ref_arch_1000_nodes.html#replication
[7] https://bugs.launchpad.net/oslo.messaging/+bug/1942933
[8] https://www.rabbitmq.com/quorum-queues.html#use-cases

Partial-Bug: #1954925
Change-Id: I91d0e23b22319cf3fdb7603f5401d24e3b76a56e
- Jan 12, 2022
Michal Nasiadka authored
Change-Id: I547ab4b05aa14ed3bbee8be2dc77a6840d4816f6
- Jan 11, 2022
Mark Goddard authored
Move new variables added in I4d694d6224c813285d228d6bc7eece5731db1078 to role defaults.

Change-Id: Ie09a2dbae2701cb18fd1eb5bfab76e82f9920fb3
- Jan 09, 2022
LinPeiWen authored
RabbitMQ has built-in Prometheus support starting from 3.8.0, with the prometheus plugins enabled by default. When the environment sets `enable_prometheus` to "no", the rabbitmq role will disable the prometheus plugins.

Closes-Bug: #1885106
Change-Id: I4d694d6224c813285d228d6bc7eece5731db1078
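A minimal sketch of the toggle described:

```yaml
# /etc/kolla/globals.yml -- with Prometheus disabled globally, the rabbitmq
# role also disables RabbitMQ's built-in prometheus plugins
enable_prometheus: "no"
```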
- Dec 31, 2021
Pierre Riteau authored
Role vars have a higher precedence than role defaults. This allows importing default vars from another role via vars_files without overriding project_name (see the related bug for details).

Change-Id: I3d919736e53d6f3e1a70d1267cf42c8d2c0ad221
Related-Bug: #1951785
- Aug 10, 2021
Radosław Piliszek authored
We get a nice optimisation by using a filtered loop instead of task skipping per service with 'when'.

Partially-Implements: blueprint performance-improvements
Change-Id: I8f68100870ab90cb2d6b68a66a4c97df9ea4ff52
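A generic sketch of the pattern (the service structure is illustrative, not kolla-ansible's actual data):

```yaml
# Before: one task per service, most of them skipped
- name: Restart services
  include_tasks: restart.yml
  loop: "{{ services }}"
  when: item.enabled | bool

# After: disabled services are filtered out before the loop starts
- name: Restart services
  include_tasks: restart.yml
  loop: "{{ services | selectattr('enabled') | list }}"
```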
- Jul 28, 2021
Radosław Piliszek authored
As mentioned in the Iced014acee7e590c10848e73feca166f48b622dc commit message, in Ussuri+ we can use ``+sbwtdcpu none +sbwtdio none`` as well. This is due to relying on RMQ-provided Erlang in version 23.x.

This change adds the extra arguments by default. It should be backported down to Ussuri before we do a release with Iced014acee7e590c10848e73feca166f48b622dc.

Change-Id: I32e247a6cb34d7f6763b544f247fd408dce2b3a2
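Combined with the ``+S 2:2 +sbwt none`` tuning from the Jun 07, 2021 commit further down, the full set can be expressed via the role's Erlang-args variable; a sketch (variable name assumed from the kolla-ansible rabbitmq role defaults):

```yaml
# /etc/kolla/globals.yml
# +sbwtdcpu/+sbwtdio assume the RMQ-provided Erlang 23.x (Ussuri+)
rabbitmq_server_additional_erl_args: "+S 2:2 +sbwt none +sbwtdcpu none +sbwtdio none"
```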
- Jun 23, 2021
Mark Goddard authored
By default, Ansible injects a variable for every fact, prefixed with ansible_. This can result in a large number of variables for each host, which at scale can incur a performance penalty. Ansible provides a configuration option [0] that can be set to False to prevent this injection of facts. In this case, facts should be referenced via ansible_facts.<fact>.

This change updates all references to Ansible facts within Kolla Ansible from using individual fact variables to using the items in the ansible_facts dictionary. This allows users to disable fact variable injection in their Ansible configuration, which may provide some performance improvement.

This change disables fact variable injection in the ansible configuration used in CI, to catch any attempts to use the injected variables.

[0] https://docs.ansible.com/ansible/latest/reference_appendices/config.html#inject-facts-as-vars

Change-Id: I7e9d5c9b8b9164d4aee3abb4e37c8f28d98ff5d1
Partially-Implements: blueprint performance-improvements
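A sketch of the reference change this commit makes throughout the code base:

```yaml
# Before: relies on injected per-fact variables
- name: Show the host distribution
  debug:
    msg: "{{ ansible_distribution }}"

# After: works even with inject_facts_as_vars = False in ansible.cfg
- name: Show the host distribution
  debug:
    msg: "{{ ansible_facts.distribution }}"
```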
- Jun 08, 2021
Mark Goddard authored
The host list order seen during Ansible handlers may differ to the usual play host list order, due to race conditions in notifying handlers. This means that restart_services.yml for RabbitMQ may be included in a different order than the rabbitmq group, resulting in a node other than the 'first' being restarted first. This can cause some nodes to fail to join the cluster.

The include_tasks loop was introduced in [1]. This change fixes the issue by splitting the handler into two tasks, and restarting the first node before all others.

[1] https://review.opendev.org/c/openstack/kolla-ansible/+/763137

Change-Id: I1823301d5889589bfd48326ed7de03c6061ea5ba
Closes-Bug: #1930293
- Jun 07, 2021
John Garbutt authored
On machines with many cores, we were seeing excessive CPU load on systems that were not very busy. With the following Erlang VM argument we saw RabbitMQ CPU usage drop from about 150% to around 20%, on a system with 40 hyperthreads.

+S 2:2

By default RabbitMQ starts N schedulers where N is the number of CPU cores, including hyper-threaded cores. This is fine when you assume all your CPUs are dedicated to RabbitMQ. It's not a good idea in a typical Kolla Ansible setup. Here we go for two scheduler threads. More details can be found here: https://www.rabbitmq.com/runtime.html#scheduling and here: https://erlang.org/doc/man/erl.html#emulator-flags

+sbwt none

This stops busy waiting of the scheduler; for more details see: https://www.rabbitmq.com/runtime.html#busy-waiting

Newer versions of rabbit may need additional flags: "+sbwt none +sbwtdcpu none +sbwtdio none". But this patch should be backportable to older versions of RabbitMQ used in Train and Stein.

Note that information on this tuning was found by looking at data from: rabbitmq-diagnostics runtime_thread_stats. More details on that can be found here: https://www.rabbitmq.com/runtime.html#thread-stats

Related-Bug: #1846467
Change-Id: Iced014acee7e590c10848e73feca166f48b622dc
- May 21, 2021
Michal Arbet authored
Change-Id: If2fdab2ae0f981d9fcbb0fea7a92fcde325804f8
- Apr 14, 2021
LinPeiWen authored
This change enables the use of Docker healthchecks for rabbitmq services.

Implements: blueprint container-health-check
Depends-On: https://review.opendev.org/c/openstack/kolla/+/784562
Change-Id: I23a2c2efab858b9ed39c6ce0ec4a82df10e7f93d
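Healthchecks can be toggled globally; a sketch assuming the standard kolla-ansible flag:

```yaml
# /etc/kolla/globals.yml -- assumed global toggle; enables Docker
# healthchecks for supporting services, including rabbitmq
enable_container_healthchecks: "yes"
```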
- Dec 14, 2020
Mark Goddard authored
This reverts commit 9cae59be.

Reason for revert: This patch was found to introduce issues with fluentd customisation. The underlying issue is not currently fully understood, but could be a sign of other obscure issues.

Change-Id: Ia4859c23d85699621a3b734d6cedb70225576dfc
Closes-Bug: #1906288
- Nov 19, 2020
Victor Chembaev authored
Change-Id: I1ff4cbdf3f60cb7fd5fe5d3c5d498e05fe2df79a
Closes-Bug: #1904702
- Oct 27, 2020
Radosław Piliszek authored
Main plays are action-redirect-stubs, ideal for import_tasks. This avoids the 'include' penalty and makes logs/ARA look nicer. Also fixes haproxy and rabbitmq so that they do not check the host group.

Change-Id: I46136fc40b815e341befff80b54a91ef431eabc0
Partially-Implements: blueprint performance-improvements
- Oct 12, 2020
Radosław Piliszek authored
Config plays do not need to check containers. This avoids skipping tasks during the genconfig action. Ironic and Glance rolling upgrades are handled specially. Swift and Bifrost do not use the handlers at all.

Partially-Implements: blueprint performance-improvements
Change-Id: I140bf71d62e8f0932c96270d1f08940a5ba4542a
- Sep 17, 2020
Mark Goddard authored
This change adds support for encryption of communication between OpenStack services and RabbitMQ. Server certificates are supported, but currently client certificates are not.

The kolla-ansible certificates command has been updated to support generating certificates for RabbitMQ for development and testing.

RabbitMQ TLS is enabled in the all-in-one source CI jobs, or when the Zuul 'tls_enabled' variable is true.

Change-Id: I4f1d04150fb2b5af085b762890092f87ae6076b5
Implements: blueprint message-queue-ssl-support
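A sketch of enabling it (flag name assumed from the rabbitmq role):

```yaml
# /etc/kolla/globals.yml -- TLS on the RabbitMQ listener; server
# certificates only, per the commit above
rabbitmq_enable_tls: "yes"
```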
- Aug 28, 2020
Mark Goddard authored
Including tasks has a performance penalty when compared with importing tasks. If the include has a condition associated with it, then the overhead of the include may be lower than the overhead of skipping all imported tasks. For unconditionally included tasks, switching to import_tasks provides a clear benefit.

Benchmarking of include vs. import is available at [1]. This change switches from include_tasks to import_tasks where there is no condition applied to the include.

[1] https://github.com/stackhpc/ansible-scaling/blob/master/doc/include-and-import.md#task-include-and-import

Partially-Implements: blueprint performance-improvements
Change-Id: Ia45af4a198e422773d9f009c7f7b2e32ce9e3b97
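A sketch of the change for an unconditional include:

```yaml
# Dynamic include: resolved at runtime, each inclusion costs extra work
- name: Configure the service
  include_tasks: config.yml

# Static import: resolved at parse time, cheaper when there is no condition
- name: Configure the service
  import_tasks: config.yml
```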
- Aug 10, 2020
Mark Goddard authored
Previously we mounted /etc/timezone if the kolla_base_distro is debian or ubuntu. This would fail prechecks if debian or ubuntu images were deployed on CentOS. While this is not a supported combination, for correctness we should fix the condition to reference the host OS rather than the container OS, since that is where the /etc/timezone file is located.

Change-Id: Ifc252ae793e6974356fcdca810b373f362d24ba5
Closes-Bug: #1882553
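An illustrative sketch of conditioning on the host OS via facts rather than the image distro (task shape assumed, not the actual precheck):

```yaml
# /etc/timezone only exists on Debian-family hosts, so key the check on
# the host's os_family fact, not on kolla_base_distro (the container OS)
- name: Check that /etc/timezone exists
  stat:
    path: /etc/timezone
  register: etc_timezone
  when: ansible_facts.os_family == "Debian"
```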