Logging with Heka
https://blueprints.launchpad.net/kolla/+spec/heka
Kolla currently uses Rsyslog for logging. Change Request 252968 [1] suggests
using ELK (Elasticsearch, Logstash, Kibana) to index all the logs and
visualize them.
This spec suggests using Heka [2] instead of Logstash, while still using Elasticsearch for indexing and Kibana for visualization. It also discusses the removal of Rsyslog along the way.
What is Heka? Heka is an open-source stream processing tool created and maintained by Mozilla.
Using Heka will provide a lightweight and scalable log processing solution for Kolla.
Problem description
Change Request 252968 [1] adds an Ansible role named "elk" that enables
deploying ELK (Elasticsearch, Logstash, Kibana) on nodes with that role. This
spec builds on that work, proposing a scalable log processing architecture
based on the Heka [2] stream processing software.
We think that Heka provides a lightweight, flexible and powerful solution for processing data streams, including logs.
Our primary goal in using Heka is to distribute the log processing load across the OpenStack nodes, rather than relying on a centralized log processing engine that would be both a bottleneck and a single point of failure.
We also know from experience that Heka provides all the necessary flexibility for processing types of data streams other than log messages. For example, we already use Heka together with Elasticsearch for logs, but also with collectd and InfluxDB for statistics and metrics.
Proposed change
We propose to build on the ELK infrastructure brought by CR 252968 [1], and
use Heka to collect and process logs in a distributed and scalable way.
This is the proposed architecture:
In this architecture Heka runs on every node of the OpenStack cluster. It runs in a dedicated container, referred to as the Heka container in the rest of this document.
Each Heka instance reads and processes the logs local to the node it runs on, and sends these logs to Elasticsearch for indexing. Elasticsearch may be distributed over multiple nodes for resiliency and scalability, but that part is outside the scope of this specification.
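To make this concrete, a minimal per-node hekad configuration could look like the following sketch (the section names, paths, match pattern, decoder and Elasticsearch address are illustrative assumptions, not settled choices):

    # Read the service log files local to the node
    [openstack_logs]
    type = "LogstreamerInput"
    log_directory = "/var/log/kolla"           # assumed location of the local log files
    file_match = '(?P<Service>[^/]+)\.log'     # illustrative match pattern
    decoder = "OpenstackDecoder"               # hypothetical decoder for oslo.log lines

    # Ship every message to Elasticsearch for indexing
    [ESJsonEncoder]
    index = "log-%{%Y.%m.%d}"                  # one index per day (illustrative)

    [ElasticSearchOutput]
    message_matcher = "TRUE"
    server = "http://elasticsearch:9200"       # assumed Elasticsearch address
    encoder = "ESJsonEncoder"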
Heka, written in Go, is fast and has a small footprint, making it possible to run it on every node of the cluster. In contrast, Logstash runs in a JVM and is known [3] to be too heavy to run on every node.
Another important aspect is flow control and avoiding the loss of log messages in case of overload. Heka's filter and output plugins, and the Elasticsearch output plugin in particular, support the use of a disk based message queue. This message queue allows plugins to reprocess messages from the queue when downstream servers (Elasticsearch) are down or cannot keep up with the data flow.
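As a sketch of what this would look like with the Elasticsearch output above, buffering is turned on and tuned through a dedicated subsection (the size limits below are illustrative assumptions, not tuning advice):

    # Added to the ElasticSearchOutput section sketched earlier
    use_buffering = true

    [ElasticSearchOutput.buffering]
    max_file_size = 134217728      # 128 MiB per on-disk queue file (illustrative)
    max_buffer_size = 1073741824   # cap the on-disk queue at 1 GiB (illustrative)
    full_action = "block"          # block and retry rather than drop when the queue fills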
With Logstash it is often recommended [3] to use Redis as a centralized queue, which introduces some complexity and additional points of failure.
Remove Rsyslog
Kolla currently uses Rsyslog. The Kolla services are configured to write their
logs to Syslog. Rsyslog gets the logs from the /var/lib/kolla/dev/log Unix
socket and dispatches them to log files on the local file system. Since
Rsyslog runs in a Docker container, the log files are stored in a Docker
volume (named rsyslog).
With Rsyslog already running on each cluster node, the question of using two
log processing daemons, namely rsyslogd and hekad, has been raised on the
mailing list. This spec evaluates the possibility of using hekad only, based
on some prototyping work we have conducted [4].
Note: Kolla doesn't currently collect logs from RabbitMQ, HAProxy and
Keepalived. For RabbitMQ the problem is that RabbitMQ cannot write its logs
to Syslog. HAProxy and Keepalived do have that capability, but the
/var/lib/kolla/dev/log Unix socket file is currently not mounted into the
HAProxy and Keepalived containers.
Use Heka's DockerLogInput plugin
To remove Rsyslog and only use Heka, one option would be to make the Kolla
services write their logs to stdout (or stderr) and rely on Heka's
DockerLogInput plugin [5] for reading the logs. Our experiments have revealed
a number of problems with this option:
- The DockerLogInput plugin doesn't currently work for containers that have a tty allocated. And Kolla currently allocates a tty for all containers (for good reasons).
- When DockerLogInput is used there is no way to differentiate log messages for containers producing multiple log streams. neutron-agents is an example of such a container. (Sam Yaple has raised that issue multiple times.)
- If Heka is stopped and restarted later then log messages will be lost, as the DockerLogInput plugin doesn't currently have a mechanism for tracking its position in the log streams. This is in contrast to the LogstreamerInput plugin [6], which does include such a mechanism.
For these reasons we think that relying on the DockerLogInput plugin may not
be a practical option.
As a side note, our experiments have also shown that the logs the OpenStack
containers write to stdout are visible to neither Heka nor docker logs. This
problem is not reproducible when stderr is used rather than stdout. The cause
of this problem is currently unknown, and it looks like other people have
come across that issue [7].
Use local log files
Another option consists of configuring all the Kolla services to log into
local files, and using Heka's LogstreamerInput plugin [6].
This option involves using a Docker named volume, mounted both into the
service containers (in rw mode) and into the Heka container (in ro mode). The
services write logs into files placed in that volume, and Heka reads logs
from the files found in that volume.
This option doesn't present the problems described in the previous section,
and it relies on Heka's LogstreamerInput plugin, which, in our experience, is
efficient and robust.
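As a sketch, an input reading the Nova log files from the shared volume might be declared as follows (the directory layout, match pattern and decoder name are assumptions about how the volume would be organized):

    [nova_logstreamer]
    type = "LogstreamerInput"
    log_directory = "/var/log/kolla/nova"       # assumed mount point within the shared volume
    file_match = '(?P<Service>nova-\w+)\.log'   # one logstream per Nova service log file
    differentiator = ["nova.", "Service"]       # names the streams nova.<service>
    decoder = "OpenstackDecoder"                # hypothetical decoder for oslo.log lines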
Keeping a copy of the logs in files locally on the nodes has been established as a requirement by the Kolla developers. With this option the logs are already stored as files in the Docker volume, so meeting that requirement necessitates no additional mechanism.
For this option to be applicable the services must have the capability of logging into files. Most of the Kolla services have this capability. The exceptions are HAProxy and Keepalived, for which a different mechanism should be used (described further down in the document). Note that this will make it possible to collect logs from RabbitMQ, which does not support logging to Syslog but does support logging to a file.
Also, this option requires that the services have permission to create files
in the Docker volume, and that Heka has permission to read these files. This
means that the Docker named volume will have to have appropriate owner, group
and permission bits. With the Heka container running under a specific user
(see below), this will mean using an extend_start.sh script including sudo
chown and possibly sudo chmod commands. Our prototype [4] already includes
this.
As mentioned already, the LogstreamerInput plugin includes a mechanism for
tracking positions in log streams. This works with journal files stored on
the file system (in /var/cache/hekad). A specific volume, private to Heka,
will be used for these journal files. In this way no logs will be lost if the
Heka container is removed and a new one is created.
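The journal location is governed by hekad's global base directory, which the Heka-private volume would simply back (a sketch, with the default value shown explicitly):

    [hekad]
    base_dir = "/var/cache/hekad"   # journal files live here; backed by the private Docker volume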
Handling HAProxy and Keepalived
As already mentioned, HAProxy and Keepalived do not support logging to files. This means that some other mechanism is needed for these two services (and any other service that only supports logging to Syslog).
Our prototype has demonstrated that we can make Heka act as a Syslog server.
This works by using Heka's UdpInput plugin with its net option set to
unixgram.
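A minimal sketch of such an input, assuming the Rsyslog decoder shipped with Heka is used to parse the Syslog lines:

    [syslog_input]
    type = "UdpInput"
    net = "unixgram"                      # listen on a Unix datagram socket
    address = "/var/lib/kolla/dev/log"    # socket file created by Heka
    decoder = "RsyslogDecoder"

    [RsyslogDecoder]
    type = "SandboxDecoder"
    filename = "lua_decoders/rsyslog.lua"

    [RsyslogDecoder.config]
    # Assumed template matching what HAProxy and Keepalived emit
    template = '%TIMESTAMP% %HOSTNAME% %syslogtag%%msg:::sp-if-no-1st-sp%%msg:::drop-last-lf%\n'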
This also requires that a Unix socket is created by Heka, and that this
socket is mounted into the HAProxy and Keepalived containers. For that we
will use the same technique as the one currently used in Kolla with Rsyslog,
that is, mounting /var/lib/kolla/dev into the Heka container and mounting
/var/lib/kolla/dev/log into the service containers.
Our prototype already includes some code demonstrating this. See [4].
Also, to be able to store a copy of the HAProxy and Keepalived logs locally
on the node, we will use Heka's FileOutput plugin. We will possibly create
two instances of that plugin, one for HAProxy and one for Keepalived, with
specific filters (message_matcher).
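For instance, the HAProxy instance could look as follows, assuming the Rsyslog decoder populates a programname field (the path and encoder choice are illustrative assumptions); the Keepalived instance would differ only in its matcher and path:

    [haproxy_file_output]
    type = "FileOutput"
    message_matcher = "Fields[programname] == 'haproxy'"   # assumes the decoder sets programname
    path = "/var/log/kolla/haproxy/haproxy.log"            # assumed local log location
    encoder = "PayloadEncoder"                             # writes the raw log line

    [PayloadEncoder]
    append_newlines = true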