Commits · e502b65ba1ef7ae0b321ed001948d96d29c57f08 · Very Demiurge Very Mindful / Kolla Ansible

Jan 17, 2024

Fix OpenSearch upgrade tasks idempotency · e502b65b

Matt Crees authored 1 year ago

Shard allocation is disabled at the start of the OpenSearch upgrade
task. This is set as a transient setting, meaning it will be removed
once the containers are restarted. However, if there is no change in
the OpenSearch container, it will not be restarted, so the cluster is left
in a broken state: unable to allocate shards.

This patch moves the pre-upgrade tasks to within the handlers, so
disabling shard allocation and the flush are only performed when the
OpenSearch container is going to be restarted.
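
As a rough illustration of the transient setting described above, the request body sent to the OpenSearch cluster settings API might be built as follows. This is a sketch only: the helper name is hypothetical, and the values `none` and `all` are assumptions, since the commit message does not show the actual task.

```python
import json


def shard_allocation_body(enabled: bool) -> str:
    """Build a body for PUT /_cluster/settings (hypothetical helper).

    Settings under "transient" are not persisted, which is why they
    disappear once the containers are restarted.
    """
    # Assumption: "none" disables shard allocation, "all" re-enables it.
    value = "all" if enabled else "none"
    return json.dumps({"transient": {"cluster.routing.allocation.enable": value}})


print(shard_allocation_body(enabled=False))
```

If the container is never restarted, nothing clears this setting, which matches the broken state described in the message.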

Closes-Bug: #2049512
Change-Id: Ia03ba23bfbde7d50a88dc16e4f117dec3c98a448

e502b65b

Jan 11, 2024

Fix trove failed to discover swift endpoint · 9eff4380

wu.chunyang authored 1 year ago

This change fixes Trove failing to discover the Swift endpoint
by adding service_credentials to guest-agent.conf.

Closes-Bug: #2048829

Change-Id: I185484d2a0d0a2d4016df6acf8a6b0a7f934c237

9eff4380

Fix trove failed to connect rabbitmq - quorum queues support · 57b24f01

wu.chunyang authored 1 year ago

This change fixes the Trove guest instance failing to connect to
RabbitMQ by adding quorum queues support to the oslo_messaging_rabbit
section in guest-agent.conf.

Closes-Bug: #2048822
Change-Id: I94908f8e20981f20fbe4dc18e2091d3798f8b801

57b24f01

Fix trove failed to connect rabbitmq - durable queues support · 6b96d098

wu.chunyang authored 1 year ago

This change fixes the Trove guest instance failing to connect to
RabbitMQ by adding durable queues support to the oslo_messaging_rabbit
section in guest-agent.conf.

Partial-Bug: #2048822

Change-Id: I8efc3c92e861816385e6cda3b231a950a06bf57d

6b96d098

Jan 08, 2024

Fix Nova scp failures on Debian Bookworm · bfa9dd97

Pierre Riteau authored 1 year ago

The addition of an instance resize operation [1] to CI testing is
triggering a failure in kolla-ansible-debian-ovn jobs, which are using a
nodeset with multiple nodes:

    oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
    Command: scp -r /var/lib/nova/instances/8ca2c7e8-acae-404c-af7d-6cac38e354b8_resize/disk 192.0.2.2:/var/lib/nova/instances/8ca2c7e8-acae-404c-af7d-6cac38e354b8/disk
    Exit code: 255
    Stdout: ''
    Stderr: "Warning: Permanently added '[192.0.2.2]:8022' (ED25519) to the list of known hosts.\r\nsubsystem request failed on channel 0\r\nscp: Connection closed\r\n"

This is not seen on Ubuntu Jammy, which uses OpenSSH 8.9, while Debian
Bookworm uses OpenSSH 9.2. This is likely related to this change in
OpenSSH 9.0 [2]:

    This release switches scp(1) from using the legacy scp/rcp protocol
    to using the SFTP protocol by default.

Configure the sftp subsystem as on RHEL 9 derivatives. Even though it is
not yet required for Ubuntu, we also configure it so we are ready for
the Noble release.

[1] https://review.opendev.org/c/openstack/kolla-ansible/+/904249
[2] https://www.openssh.com/txt/release-9.0

Closes-Bug: #2048700
Change-Id: I9f1129136d7664d5cc3b57ae5f7e8d05c499a2a5

bfa9dd97

Enable glance proxying behaviour · 9ecfcf5a

Michal Arbet authored 1 year ago

This patch sets the URL of the glance worker.
If this is set, other glance workers will know how to contact this one
directly if needed. For image import, a single worker stages the image
and other workers need to be able to proxy the import request to the
right one.

With the current setup, glance image import does not work.

Closes-Bug: #2048525

Change-Id: I4246dc8a80038358cd5b6e44e991b3e2ed72be0e

9ecfcf5a

Jan 05, 2024

cadvisor: Set housekeeping interval to Prometheus scrape interval · 97e5c0e9

Mark Goddard authored 1 year ago

The prometheus_cadvisor container has high CPU usage. On various
production systems I checked, it sits around 13-16% on controllers,
averaged over the Prometheus 1m scrape interval. When viewed with top, we
can see it is a bit spiky and can jump over 100%.

There are various bugs about this, but I found
https://github.com/google/cadvisor/issues/2523 which suggests increasing
the per-container housekeeping interval. This defaults to 1s, which
provides far greater granularity than we need with the default
Prometheus scrape interval of 60s.

Increasing the housekeeping interval to 60s on a production controller
reduced the CPU usage from 13% to 3.5% average. This still seems high,
but is more reasonable.
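
The mismatch between the two intervals can be made concrete with a small calculation. This is only a back-of-the-envelope sketch using the figures quoted in the message; the function name is hypothetical.

```python
def housekeeping_passes_per_scrape(scrape_interval_s: float,
                                   housekeeping_interval_s: float) -> float:
    """Number of housekeeping passes that run between two Prometheus scrapes."""
    return scrape_interval_s / housekeeping_interval_s


# Figures from the commit message: 60s scrape interval, 1s default housekeeping.
print(housekeeping_passes_per_scrape(60, 1))   # 60.0
print(housekeeping_passes_per_scrape(60, 60))  # 1.0
```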

Change-Id: I89c62a45b1f358aafadcc0317ce882f4609543e7
Closes-Bug: #2048223

97e5c0e9

Fix long service restarts while using systemd · b1fd2b40

Michal Arbet authored 1 year ago

Some containers exit with code 143 instead of 0, which is
still OK. As a fix, this patch allows exit code 143
(SIGTERM). See the bug report for details.
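
The value 143 follows the common convention of reporting 128 plus the signal number when a process is terminated by a signal; SIGTERM is signal 15 on POSIX systems. A minimal sketch of that arithmetic (the helper name is hypothetical):

```python
import signal


def exit_status_for_signal(sig: signal.Signals) -> int:
    """Return the conventional 128 + N exit status for termination by signal N."""
    return 128 + int(sig)


print(exit_status_for_signal(signal.SIGTERM))  # 143
```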

Services which exited with 143 (SIGTERM):

kolla-cron-container.service
kolla-designate_producer-container.service
kolla-keystone_fernet-container.service
kolla-letsencrypt_lego-container.service
kolla-magnum_api-container.service
kolla-mariadb_clustercheck-container.service
kolla-neutron_l3_agent-container.service
kolla-openvswitch_db-container.service
kolla-openvswitch_vswitchd-container.service
kolla-proxysql-container.service

Partial-Bug: #2048130
Change-Id: Ia8c85d03404cfb368e4013066c67acd2a2f68deb

b1fd2b40

Jan 04, 2024

ironic: Remove enable_ironic_pxe_uefi bits · d8700ad0

Michal Nasiadka authored 1 year ago

These were missed in I081aa1345603fa27c390e4e09231a5ff226bcb39

Change-Id: I2884bca3c06ff98004e318757a20b60c12375924

d8700ad0

Jan 03, 2024

Use service-images-pull role for letsencrypt and venus · 498d3243

Mark Goddard authored 1 year ago

This reduces code duplication.

Change-Id: Ie529875aaa42435835417468868250bbe4fcf649

498d3243

Jan 02, 2024

haproxy: Fix single frontend after LE cert path change · 21e5b21f

Michal Nasiadka authored 1 year ago

I35317ea0343f0db74ddc0e587862e95408e9e106 changed the certificate path but
omitted the single frontend template.

Change-Id: I638ba32e97234900745df62056710dcc37e7db77

21e5b21f

magnum: Disable CAPI driver when kubeconfig missing · 48796560

Michal Nasiadka authored 1 year ago

Closes-Bug: #2047360
Change-Id: I73490d84da39a74ea7ac493c7dd41fe7bfe2f578

48796560

Dec 28, 2023

post-2023.1: Remove keystone admin endpoint bits · 982c4d5e

Michal Nasiadka authored 1 year ago

Change-Id: I27028ffae26a57d510e1a78c38ead2f925396e81

982c4d5e

Remove after-Zed TODOs · 65a0cee7

Michal Nasiadka authored 1 year ago

Change-Id: I081aa1345603fa27c390e4e09231a5ff226bcb39

65a0cee7

Dec 21, 2023

Set a log retention policy for OpenSearch · 5e5a2dca

Doug Szumski authored 2 years ago

We previously used ElasticSearch Curator for managing log
retention. Now that we have moved to OpenSearch, we can use
the Index State Management (ISM) plugin which is bundled with
OpenSearch.

This change adds support for automating the configuration of
the ISM plugin via the OpenSearch API. By default, it has
similar behaviour to the previous ElasticSearch Curator
default policy.

Closes-Bug: #2047037

Change-Id: I5c6d938f2bc380f1575ee4f16fe17c6dca37dcba

5e5a2dca

Remove nova cell sync comment · e9e7362f

Alex-Welsh authored 1 year ago

Removed a comment suggesting we use nova-manage db sync --local_cell
when bootstrapping the nova service, since that suggestion has now been
implemented in Kolla. See [1] for more details.

[1]: https://review.opendev.org/c/openstack/kolla/+/902057

Related-Bug: #2045558
Depends-On: Ic64eb51325b3503a14ebab9b9ff2f4d9caec734a
Change-Id: I591f83c4886f5718e36011982c77c0ece6c4cbd7

e9e7362f

Dec 20, 2023

fluentd: Fix LE pos_file path after version bump · bf22f3dd

Michal Nasiadka authored 1 year ago

Change-Id: Ia6db7d6a41ddbda8fcbf563dc55a0c65ef8db9be

bf22f3dd

Dec 19, 2023

Rework quorum queues precheck · afa24178

Michal Nasiadka authored 1 year ago

Change-Id: Ic9bd25a09b860838910dbe3d55f94421a0461c57

afa24178

quorum: add missing octavia and masakari · be5dc32c

Michal Nasiadka authored 1 year ago

Change-Id: Ibf9a9a0c18938f638c8e8b00b6017c64f1523b23

be5dc32c

Dec 18, 2023

CI: fix two ansible-lint warnings · 176aa5a4

Sven Kieske authored 1 year ago


Signed-off-by: Sven Kieske <kieske@osism.tech>
Change-Id: I81a9b2dab7e9a4e2c8facaa0f32538f2884e3ca9

176aa5a4

Dec 14, 2023

Fix Docker health check for sahara_engine · 693c5c8b

Pierre Riteau authored 1 year ago

The wrong process name was being used.

Closes-Bug: #2046268
Change-Id: I5a5d4f227205e811732331ee6e020ccea67b6fab

693c5c8b

Dec 13, 2023

Add precheck for RabbitMQ quorum queues · 61f84e3b

Matt Crees authored 1 year ago

Adds a precheck to fail if non-quorum queues are found in RabbitMQ.

Currently excludes fanout and reply queues, pending support in
oslo.messaging [1].

[1]: https://review.opendev.org/c/openstack/oslo.messaging/+/888479
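
The check described above can be sketched roughly as follows. The data shape, the sample queue names, and the rule used to recognise reply and fanout queues are all assumptions for illustration; the commit message does not show how the precheck is implemented.

```python
def non_quorum_queues(queues):
    """Return the names of queues that would make the precheck fail.

    `queues` is a list of (name, type) pairs. Reply and fanout queues
    are skipped; matching them by name is an assumption of this sketch.
    """
    def is_excluded(name: str) -> bool:
        return name.startswith("reply_") or "fanout" in name

    return [
        name
        for name, queue_type in queues
        if queue_type != "quorum" and not is_excluded(name)
    ]


sample = [
    ("conductor", "quorum"),
    ("scheduler", "classic"),
    ("reply_0123", "classic"),
    ("scheduler_fanout_4567", "classic"),
]
print(non_quorum_queues(sample))  # ['scheduler']
```

A non-empty result corresponds to the failure case: at least one queue is neither a quorum queue nor one of the excluded transient queues.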

Closes-Bug: #2045887
Change-Id: Ibafdcd58618d97251a3405ef9332022d4d930e2b

61f84e3b

Dec 05, 2023

Fix broken list concatenation in horizon role · 97cd1731

Andrey Kurilin authored 1 year ago


Starting with ansible-core 2.13, the list concatenation format has changed
and concatenation operations outside of the Jinja template are no longer
supported.

The format change:

  "[1] + {{ [2] }}" -> "{{ [1] + [2] }}"

This affects the horizon role that iterates over existing policy files to
override and concatenate them into a single variable.

Co-Authored-By: Dr. Jens Harbott <harbott@osism.tech>

Closes-Bug: #2045660
Change-Id: I91a2101ff26cb8568f4615b4cdca52dcf09e6978

97cd1731

Support Ansible max_fail_percentage · af6e1ca4

Mark Goddard authored 3 years ago

This allows us to continue execution until a certain proportion of hosts
have failed. This can be useful at scale, where failures are common, and
restarting a deployment is time-consuming.

The default max failure percentage is 100, keeping the default
behaviour. A global max failure percentage may be set via
kolla_max_fail_percentage, and individual services may define a max
failure percentage via <service>_max_fail_percentage.

Note that all hosts in the inventory must be reachable for fact
gathering, even those not included in a --limit.
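
The settings described above can be illustrated with a short sketch. The precedence shown (service-specific value, then the global value, then the default of 100) is an assumption based on the wording of this message, and `nova_max_fail_percentage` is just an example of the `<service>_max_fail_percentage` pattern. The abort rule reflects Ansible's documented behaviour that the percentage must be exceeded, not merely equalled.

```python
def effective_max_fail_percentage(variables: dict, service: str) -> float:
    """Resolve the max failure percentage for a service.

    Assumed precedence: <service>_max_fail_percentage, then
    kolla_max_fail_percentage, then the default of 100.
    """
    default = variables.get("kolla_max_fail_percentage", 100)
    return variables.get(f"{service}_max_fail_percentage", default)


def play_aborts(failed_hosts: int, total_hosts: int, max_fail_percentage: float) -> bool:
    """A play aborts only when the failed percentage exceeds the limit."""
    return 100 * failed_hosts / total_hosts > max_fail_percentage


settings = {"kolla_max_fail_percentage": 50, "nova_max_fail_percentage": 25}
print(effective_max_fail_percentage(settings, "nova"))     # 25
print(effective_max_fail_percentage(settings, "neutron"))  # 50
print(play_aborts(1, 4, 25))  # False: 25% does not exceed 25%
```

With the default of 100, the failed percentage can never exceed the limit, so this setting does not abort the play early.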

Closes-Bug: #1833737
Change-Id: I808474a75c0f0e8b539dc0421374b06cea44be4f

af6e1ca4

Dec 02, 2023

Fix wsrep sync status task while switched to TCP/IP · 35c7a9eb

Maksim Malchuk authored 1 year ago


Follow-up on Id6eae798784126d4dd53adef15bdce6b47b4601f to fix an issue
that occurs when a client with a port provided tries to connect to
'localhost': when switching to TCP/IP, we need to explicitly provide the
host too.

Partial-Bug: #2024554
Change-Id: Ib08c159dadd69a1f44924d658f4afe1e794a18b0
Signed-off-by: Maksim Malchuk <maksim.malchuk@gmail.com>

35c7a9eb

Dec 01, 2023

magnum: support kubeconfig configuration file · c939504d

Christian Berendt authored 1 year ago

If a file {{ node_custom_config }}/magnum/kubeconfig exists, it is
copied to /var/lib/magnum/.kube/config in all Magnum service containers.
At this location, vexxhost/magnum-cluster-api will look for the kubeconfig
configuration file to control the Cluster API control plane. If
vexxhost/magnum-cluster-api is installed in the Magnum container images,
a Cluster API control plane can then be controlled via the Magnum API.

Depends-On: https://review.opendev.org/c/openstack/kolla/+/902101
Change-Id: I986c5192fe96b9c480a2d8fa87d719a50ce78186

c939504d

fluentd: Fix getting podman labels · bdd2aa37

Michal Nasiadka authored 1 year ago

podman_image_info returns Config dict, not ContainerConfig.

Change-Id: I9f813c90b42246c4835d7d7b18476a021d80548b

bdd2aa37

Nov 30, 2023

enable quorum queues · 64575519

Sven Kieske authored 1 year ago

This implements a global toggle `om_enable_rabbitmq_quorum_queues`
to enable quorum queues for each service in RabbitMQ, similar to
what was done for HA[0].

Quorum Queues are enabled by default.

Quorum queues are more reliable, safer, simpler and faster than
replicated mirrored classic queues[1].

Mirrored classic queues are deprecated and scheduled for removal
in RabbitMQ 4.0[2].

Note that we do not need a new policy in the RabbitMQ definitions
template, because their usage is enabled on the client side and can't
be set using a policy[3].

Note also that quorum queues are not yet enabled in oslo.messaging
for reply_ and fanout_ queues (transient queues).
This will change once [4] is merged.

[0]: https://review.opendev.org/c/openstack/kolla-ansible/+/867771
[1]: https://www.rabbitmq.com/quorum-queues.html
[2]: https://blog.rabbitmq.com/posts/2021/08/4.0-deprecation-announcements/
[3]: https://www.rabbitmq.com/quorum-queues.html#declaring
[4]: https://review.opendev.org/c/openstack/oslo.messaging/+/888479



Signed-off-by: Sven Kieske <kieske@osism.tech>
Change-Id: I6c033d460a5c9b93c346e9e47e93b159d3c27830

64575519

Nov 29, 2023

etcd: update to v3.4 · ccfa2a6c

Jan Gutter authored 1 year ago

* Updates etcd to v3.4
* Updated the config to use v3.4's logging mechanism
* Deprecated etcd CA parameters aren't used, so we are not affected
  by their removal.
* Note that we are not currently guarding against skip-version updates for
  etcd.

Notable non-voting jobs exercising some of this:
* kolla-ansible-ubuntu-upgrade-cephadm (cinder->tooz->etcd3gw->etcd)
* kolla-ansible-ubuntu-zun (see
  https://review.opendev.org/c/openstack/openstack-ansible/+/883194 )

Depends-On: https://review.opendev.org/c/openstack/kolla/+/890464
Change-Id: I086e7bbc7db64421445731a533265e7056fbdb43

ccfa2a6c

etcd: deduplicate environments for containers · ae21f317

Jan Gutter authored 1 year ago

* etcd service containers usually have a set of
  environment parameters required to boot the container.
* The short-lived etcd bootstrap containers pass extra
  ETCD_INITIAL_* environment variables, but still need to
  pass the ones that the service containers use.
* This uses ansible's `combine` filter to cut down on the
  duplication.
* This is intended to be just a straightforward refactor.
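
The deduplication described above can be sketched in Python, where merging two dictionaries plays the role of Ansible's `combine` filter. The variable names and values below are illustrative only and are not taken from the Kolla Ansible role.

```python
# Environment shared by the long-running etcd service containers.
# Names and values are illustrative assumptions.
service_environment = {
    "ETCD_NAME": "controller-1",
    "ETCD_DATA_DIR": "/var/lib/etcd",
}

# Extra variables that only the short-lived bootstrap container needs.
bootstrap_extras = {
    "ETCD_INITIAL_CLUSTER_STATE": "new",
}

# Rough equivalent of `service_environment | combine(bootstrap_extras)`:
# keys from the right-hand dict win on conflict.
bootstrap_environment = {**service_environment, **bootstrap_extras}
print(sorted(bootstrap_environment))
```

The shared variables are written once, and the bootstrap container's environment is derived from them rather than repeated.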

Change-Id: I04e95f92a8f365553afd618d58b99de595d48312

ae21f317

Nov 28, 2023

etcd: Add support for more scenarios · ed3b27cc

Jan Gutter authored 1 year ago

This commit addresses a few shortcomings in the etcd service:
  * Adding or removing etcd nodes required manual intervention.

  * The etcd service would have brief outages during upgrades or
    reconfigures because restarts weren't always serialised.

This makes the etcd service follow a similar pattern to mariadb:
  * There is now a distinction between bootstrapping the cluster
    and adding / removing another member.

  * This more closely follows etcd's upstream bootstrapping
    guidelines.

  * The etcd role now serialises restarts internally so the
    kolla_serial pattern is no longer appropriate (or necessary).

This does not remove the need for manual intervention in all
failure modes: the documentation has been updated to address the
most common issues.

Note that there's repetition in the container specifications: this
is somewhat deliberate. In a future cleanup, it's intended to reduce
the duplication.

Change-Id: I39829ba0c5894f8e549f9b83b416e6db4fafd96f

ed3b27cc

fluentd: Use labels for transition to v5 · 06baa8f6

Michal Nasiadka authored 1 year ago

Depends-On: https://review.opendev.org/c/openstack/kolla/+/901508
Change-Id: I8c7d3de95d0f1f8e57a993b8c3417d90459e19be

06baa8f6

Fix Horizon WSGI application log parsing · 4168b46c

Doug Szumski authored 1 year ago

Like other WSGI services in Kolla Ansible, the Horizon WSGI application
handles log output via the `wsgi.errors` object. See [1] for further
information. The problem is that this log output is written to a file called
`horizon.log`, causing it to be processed as an 'Oslo log' in the Fluentd
processing pipeline. Since the log format doesn't match the expected format,
this results in parsing errors.

This fix renames the log file and adjusts the format to match other WSGI
applications. The logs are then processed in the same way as other WSGI
application logs, resolving the issue.

[1] https://modwsgi.readthedocs.io/en/master/user-guides/debugging-techniques.html

Change-Id: I93777d1c53920f5470c78356e6b3a4064fbe04b4
Closes-Bug: #1898174

4168b46c

Revert "Enable RabbitMQ HA queues by default" · cdda49ec

Matt Crees authored 1 year ago

This reverts commit b86c304a.

Reason for revert: We want to enable Quorum Queues by default in Caracal, without requiring two queue migrations between releases. See etherpad for details: https://etherpad.opendev.org/p/kolla-ansible-rmq-quorum-queues-proposal

Change-Id: Ia19ab97f538125475297976347c5da332a7fdda7

cdda49ec

Nov 22, 2023

Fix octavia's proxysql configuration · ff785625

Michal Arbet authored 1 year ago

The patch [1] mentioned below added the jobboard
functionality to the octavia role, but unfortunately
it incorrectly implemented the functionality of users
and rules for proxysql.

This patch fixes this bug.

[1] https://review.opendev.org/c/openstack/kolla-ansible/+/888588

Closes-Bug: #2044293
Change-Id: I6524fabad19b438113db4affe05f5586db99dff4

ff785625

Fix expose prometheus externally with single frontend · 2c9dc5da

Will Szumski authored 1 year ago

Closes-Bug: #2043831
Change-Id: I010fabd255d93d5329de82af2b5d21c8fa7d93c4

2c9dc5da

Configure CloudKitty with Prometheus basic auth · 4131eb45

Pierre Riteau authored 1 year ago

Closes-Bug: #2044226
Change-Id: I5e17152584b758c9ca4f1cc14520337f979584b7

4131eb45

Nov 21, 2023

Move [oslo_policy] back inside Jinja if block · c2bd8914

Pierre Riteau authored 1 year ago

This avoids generating an empty [oslo_policy] section in nova.conf when
no custom policy file is defined.

Change-Id: I23fae8387573e7f37eda0f2a09cd937239afd93f

c2bd8914

Nov 17, 2023

Fix an issue with prometheus scraping itself · 775fac2b

Will Szumski authored 1 year ago

Closes-Bug: #2043829
Change-Id: Ic4cbaf592a2699d9c0312c575f68613c8681239f

775fac2b

Fix grafana prometheus datasource · dfce510c

Will Szumski authored 1 year ago

See:
https://grafana.com/docs/grafana/latest/administration/provisioning/

Closes-Bug: #2043828
Change-Id: I9ed07dc8c995adddf6d89838cd515af93d10bd00

dfce510c