    =================
    Logging with Heka
    =================
    
    https://blueprints.launchpad.net/kolla/+spec/heka
    
    
    Kolla currently uses Rsyslog for logging. Change Request ``252968`` [1]
    suggests using ELK (Elasticsearch, Logstash, Kibana) to index all the logs
    and visualize them.

    This spec suggests using Heka [2] instead of Logstash, while still using
    Elasticsearch for indexing and Kibana for visualization. It also discusses
    the removal of Rsyslog along the way.
    
    
    What is Heka? Heka is an open-source stream processing software created and
    maintained by Mozilla.
    
    Using Heka will provide a lightweight and scalable log processing solution
    for Kolla.
    
    Problem description
    ===================
    
    Change Request ``252968`` [1] adds an Ansible role named "elk" that enables
    deploying ELK (Elasticsearch, Logstash, Kibana) on nodes with that role. This
    spec builds on that work, proposing a scalable log processing architecture
    based on the Heka [2] stream processing software.
    
    We think that Heka provides a lightweight, flexible and powerful solution
    for processing data streams, including logs.

    With Heka, our primary goal is to distribute the log processing load across
    the OpenStack nodes rather than using a centralized log processing engine that
    represents a bottleneck and a single point of failure.
    
    We also know from experience that Heka provides all the necessary flexibility
    for processing data streams other than log messages. For example, we already
    use Heka together with Elasticsearch for logs, but also with collectd and
    InfluxDB for statistics and metrics.
    
    Proposed change
    ===============
    
    We propose to build on the ELK infrastructure brought by CR ``252968`` [1], and
    use Heka to collect and process logs in a distributed and scalable way.
    
    This is the proposed architecture:
    
    .. image:: logging-with-heka.svg
    
    In this architecture Heka runs on every node of the OpenStack cluster. It runs
    in a dedicated container, referred to as the Heka container in the rest of this
    document.
    
    Each Heka instance reads and processes the logs local to the node it runs on,
    and sends these logs to Elasticsearch for indexing. Elasticsearch may be
    distributed on multiple nodes for resiliency and scalability, but that part is
    outside the scope of this specification.
    
    Heka, written in Go, is fast and has a small footprint, making it possible to
    run it on every node of the cluster. In contrast, Logstash runs in a JVM and
    is known [3] to be too heavy to run on every node.
    
    Another important aspect is flow control and avoiding the loss of log messages
    in case of overload. Heka's filter and output plugins, and the Elasticsearch
    output plugin in particular, support the use of a disk-based message queue.
    This message queue allows plugins to reprocess messages from the queue when
    downstream servers (Elasticsearch) are down or cannot keep up with the data
    flow.
    
    With Logstash it is often recommended [3] to use Redis as a centralized queue,
    which introduces some complexity and additional points of failure.
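
    The idea of a disk-based queue can be illustrated with a minimal Python
    sketch. It is not Heka's implementation: the ``DiskQueue`` class, its file
    names and its methods are hypothetical, and only show the general principle
    of appending messages to disk and advancing a persisted cursor once the
    downstream server has accepted them.

    .. code-block:: python

        import json
        import os


        class DiskQueue:
            """Append-only on-disk queue with a persisted read cursor."""

            def __init__(self, directory):
                # Hypothetical layout: one data file and one cursor file.
                self.data_path = os.path.join(directory, "queue.log")
                self.cursor_path = os.path.join(directory, "cursor")
                open(self.data_path, "a").close()  # make sure the file exists

            def _read_cursor(self):
                try:
                    with open(self.cursor_path) as f:
                        return int(f.read() or 0)
                except FileNotFoundError:
                    return 0

            def put(self, message):
                # One JSON document per line, appended to the data file.
                with open(self.data_path, "a") as f:
                    f.write(json.dumps(message) + "\n")

            def drain(self, send):
                """Send pending messages in order; stop at the first failure."""
                offset = self._read_cursor()
                with open(self.data_path, "rb") as f:
                    f.seek(offset)
                    while True:
                        line = f.readline()
                        if not line:
                            break  # nothing left to send
                        if not send(json.loads(line)):
                            break  # downstream unavailable: keep it queued
                        offset = f.tell()
                        with open(self.cursor_path, "w") as c:
                            c.write(str(offset))

    A caller would ``put()`` every processed message and call ``drain()``
    periodically with a function that returns ``True`` once the downstream
    server has accepted the message. Messages that could not be sent stay on
    disk and are retried on the next call.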
    
    Remove Rsyslog
    --------------
    
    
    Kolla currently uses Rsyslog. The Kolla services are configured to write their
    logs to Syslog. Rsyslog gets the logs from the ``/var/lib/kolla/dev/log`` Unix
    socket and dispatches them to log files on the local file system. Since
    Rsyslog runs in a Docker container, the log files are stored in a Docker
    volume (named ``rsyslog``).
    
    With Rsyslog already running on each cluster node, the question of using two
    log processing daemons, namely ``rsyslogd`` and ``hekad``, has been raised on
    the mailing list. This spec evaluates the possibility of using ``hekad`` only,
    based on some prototyping work we have conducted [4].
    
    Note: Kolla doesn't currently collect logs from RabbitMQ, HAProxy and
    Keepalived. RabbitMQ does not have the capability to write its logs to
    Syslog. HAProxy and Keepalived do have that capability, but the
    ``/var/lib/kolla/dev/log`` Unix socket file is currently not mounted into the
    HAProxy and Keepalived containers.
    
    Use Heka's ``DockerLogInput`` plugin
    ------------------------------------
    
    To remove Rsyslog and only use Heka, one option would be to make the Kolla
    services write their logs to ``stdout`` (or ``stderr``) and rely on Heka's
    ``DockerLogInput`` plugin [5] for reading the logs. Our experiments have
    revealed a number of problems with this option:
    
    * The ``DockerLogInput`` plugin doesn't currently work for containers that have
      a ``tty`` allocated, and Kolla currently allocates a tty for all containers
      (for good reasons).

    * When ``DockerLogInput`` is used there is no way to differentiate log messages
      for containers producing multiple log streams. ``neutron-agents`` is an
      example of such a container. (Sam Yaple has raised that issue multiple
      times.)

    * If Heka is stopped and restarted later then log messages will be lost, as the
      ``DockerLogInput`` plugin doesn't currently have a mechanism for tracking its
      positions in the log streams. This is in contrast to the ``LogstreamerInput``
      plugin [6], which does include that mechanism.
    
    For these reasons we think that relying on the ``DockerLogInput`` plugin may
    not be a practical option.
    
    For the record, our experiments have also shown that logs written to
    ``stdout`` by the OpenStack containers are visible to neither Heka nor
    ``docker logs``. This problem is not reproducible when ``stderr`` is used
    rather than ``stdout``. The cause of this problem is currently unknown, and it
    looks like other people have come across that issue [7].
    
    Use local log files
    -------------------
    
    Another option consists of configuring all the Kolla services to log to local
    files, and using Heka's ``LogstreamerInput`` plugin [6].
    
    This option involves using a Docker named volume, mounted both into the service
    containers (in ``rw`` mode) and into the Heka container (in ``ro`` mode). The
    services write logs into files placed in that volume, and Heka reads logs from
    the files found in that volume.
    
    This option doesn't present the problems described in the previous section.
    And it relies on Heka's ``LogstreamerInput`` plugin, which, based on our
    experience, is efficient and robust.
    
    Keeping log files locally on the nodes has been established as a requirement
    by the Kolla developers. With this option, and the Docker volume used, meeting
    that requirement necessitates no additional mechanism.
    
    For this option to be applicable the services must have the capability of
    logging to files. Most of the Kolla services have this capability. The
    exceptions are HAProxy and Keepalived, for which a different mechanism should
    be used (described further down in the document). Note that this will make it
    possible to collect logs from RabbitMQ, which does not support logging to
    Syslog but does support logging to a file.
    
    Also, this option requires that the services have the permission to create
    files in the Docker volume, and that Heka has the permission to read these
    files. This means that the Docker named volume will need the appropriate
    owner, group and permission bits. With the Heka container running under
    a specific user (see below) this will mean using an ``extend_start.sh`` script
    including ``sudo chown`` and possibly ``sudo chmod`` commands. Our prototype
    [4] already includes this.
    
    As mentioned already, the ``LogstreamerInput`` plugin includes a mechanism for
    tracking positions in log streams. This works with journal files stored on the
    file system (in ``/var/cache/hekad``). A specific volume, private to Heka,
    will be used for these journal files. In this way no logs will be lost if the
    Heka container is removed and a new one is created.
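
    The principle of position tracking can be sketched in a few lines of Python.
    This is only an illustration under simplifying assumptions, not the
    ``LogstreamerInput`` implementation: the ``read_new_lines`` function and the
    JSON journal format are hypothetical, and the sketch does not handle file
    rotation or truncation.

    .. code-block:: python

        import json


        def read_new_lines(log_path, journal_path):
            """Return the complete lines appended to log_path since last call."""
            # Load the last recorded offset, defaulting to the start of the file.
            try:
                with open(journal_path) as j:
                    offset = json.load(j)["offset"]
            except FileNotFoundError:
                offset = 0

            with open(log_path, "rb") as f:
                f.seek(offset)
                data = f.read()

            # Only consume complete lines; a partial last line is left for the
            # next call, once the writer has finished it.
            end = data.rfind(b"\n") + 1
            lines = data[:end].decode("utf-8", errors="replace").splitlines()

            # Persist the new offset so that a restarted reader resumes where
            # the previous one stopped.
            with open(journal_path, "w") as j:
                json.dump({"offset": offset + end}, j)
            return lines

    Because the offset lives in a separate journal file, keeping that file in a
    volume that outlives the reader is what allows a new reader to resume without
    losing or re-reading lines.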
    
    Handling HAProxy and Keepalived
    -------------------------------
    
    As already mentioned, HAProxy and Keepalived do not support logging to files.
    This means that some other mechanism should be used for these two services (and
    any other services that only support logging to Syslog).
    
    Our prototype has demonstrated that we can make Heka act as a Syslog server.
    This works by using Heka's ``UdpInput`` plugin with its ``net`` option set
    to ``unixgram``.
    
    This also requires that Heka creates a Unix socket and that this socket is
    mounted into the HAProxy and Keepalived containers. For that we will use the
    same technique as the one currently used in Kolla with Rsyslog, that is,
    mounting ``/var/lib/kolla/dev`` into the Heka container and mounting
    ``/var/lib/kolla/dev/log`` into the service containers.
    
    Our prototype already includes some code demonstrating this. See [4].
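
    To show what acting as a Syslog server on a Unix datagram socket involves,
    here is a minimal Python sketch. It is an analogy rather than Heka code: the
    ``open_syslog_socket``, ``receive_one`` and ``parse_priority`` functions are
    hypothetical helpers, and the message format is assumed to start with a
    ``<PRI>`` prefix as in the BSD Syslog protocol.

    .. code-block:: python

        import os
        import socket


        def open_syslog_socket(path):
            """Create a Unix datagram socket bound at path (e.g. a dev/log)."""
            if os.path.exists(path):
                os.unlink(path)  # remove a stale socket file left by a past run
            sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
            sock.bind(path)
            return sock


        def receive_one(sock, bufsize=65535):
            """Block until one datagram arrives and return it as text."""
            return sock.recv(bufsize).decode("utf-8", errors="replace")


        def parse_priority(message):
            """Split a '<PRI>' prefix into (facility, severity, rest)."""
            if message.startswith("<") and ">" in message:
                pri, rest = message[1:].split(">", 1)
                if pri.isdigit():
                    # PRI encodes facility * 8 + severity.
                    return int(pri) // 8, int(pri) % 8, rest
            return None, None, message

    Any process that can see the socket file, for example through a bind mount,
    can then send its Syslog messages to it with a datagram socket of the same
    family.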
    
    Also, to be able to store a copy of the HAProxy and Keepalived logs locally on
    the node, we will use Heka's ``FileOutput`` plugin. We will possibly create
    two instances of that plugin, one for HAProxy and one for Keepalived, with
    specific filters (``message_matcher``).
    
    Read Python Tracebacks
    ----------------------
    
    In case of exceptions the OpenStack services log Python Tracebacks as multiple
    log messages. If no special care is taken then the lines of a Python Traceback
    will be indexed as separate documents in Elasticsearch, and displayed as
    distinct log entries in Kibana, making them hard to read. To address that
    issue we will use a custom Heka decoder, which will be responsible for
    coalescing the log lines making up a Python Traceback into one message. Our
    prototype includes that decoder [4].
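
    The coalescing logic can be sketched as follows. This Python code is not the
    decoder from the prototype; it is a simplified illustration that assumes a
    line layout of ``<date> <time> <pid> <LEVEL> <logger> <text>``, and that a
    traceback ends with the first unindented line coming from the same process
    and logger. The ``coalesce_tracebacks`` name is hypothetical.

    .. code-block:: python

        import re

        # Assumed line layout: "<date> <time> <pid> <LEVEL> <logger> <text>".
        LINE_RE = re.compile(
            r"^(?P<ts>\S+ \S+) (?P<pid>\d+) (?P<level>[A-Z]+) "
            r"(?P<logger>\S+) (?P<text>.*)$"
        )
        TRACEBACK_START = "Traceback (most recent call last):"


        def coalesce_tracebacks(lines):
            """Yield one dict per log entry, merging traceback lines together."""
            pending = None  # traceback entry being accumulated, if any
            for line in lines:
                match = LINE_RE.match(line)
                fields = match.groupdict() if match else None
                if pending is not None:
                    same_source = fields is not None and (
                        fields["pid"] == pending["pid"]
                        and fields["logger"] == pending["logger"]
                    )
                    if same_source:
                        pending["text"] += "\n" + fields["text"]
                        # The unindented exception line closes the traceback.
                        if not fields["text"].startswith(" "):
                            yield pending
                            pending = None
                        continue
                    # A line from another source interrupts the traceback.
                    yield pending
                    pending = None
                if fields is None:
                    yield {"text": line}  # unparsed line, passed through
                elif fields["text"].startswith(TRACEBACK_START):
                    pending = fields
                else:
                    yield fields
            if pending is not None:
                yield pending

    Each yielded dictionary corresponds to one document to index, so a traceback
    shows up as a single entry whose ``text`` field holds all its lines.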
    
    Collect system logs
    -------------------
    
    In addition to container logs we think it is important to collect system logs
    as well. For that we propose to mount the host's ``/var/log`` directory into
    the Heka container, and configure Heka to get logs from standard log files
    located in that directory (e.g. ``kern.log``, ``auth.log``, ``messages``). The
    list of system log files will be determined at development time.
    
    Log rotation
    ------------
    
    
    Log rotation is an important aspect of the logging system. Currently Kolla
    doesn't rotate logs. Logs just accumulate in the ``rsyslog`` Docker volume.
    The work on Heka proposed in this spec isn't directly related to log rotation,
    but we suggest addressing this issue for Mitaka. This will mean creating a new
    container that uses ``logrotate`` to manage the log files created by the Kolla
    containers.
    
    Create a ``heka`` user
    ----------------------
    
    For security reasons a ``heka`` user will be created in the Heka container and
    the ``hekad`` daemon will run under that user. The ``heka`` user will be added
    to the ``kolla`` group, to make sure that Heka can read the log files created
    by the services.
    
    Security impact
    ---------------
    
    
    Heka is a mature product maintained and used in production by Mozilla, so we
    trust it to be secure. We also trust the Heka developers to respond seriously
    should security vulnerabilities be found in the Heka code.
    
    As described above, we are proposing to use a Docker volume between the service
    containers and the Heka container. The group of the volume directory and the
    log files will be ``kolla``, and the owner of the log files will be the user
    that executes the service producing logs. But the ``gid`` of the ``kolla``
    group and the ``uid`` values of the users executing the services may
    correspond to a different group and different users on the host system. This
    means that the permissions may not be right on the host system. This problem
    is not specific to this specification, and it already exists in Kolla (for
    the mariadb data volume for example).
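
    The mismatch comes from the fact that files only store numeric ids, which each
    system resolves against its own user and group databases. The following Python
    sketch, with a hypothetical ``describe_owner`` helper, shows how the ids of a
    log file could be inspected on the host to see which names they map to there.

    .. code-block:: python

        import grp
        import os
        import pwd
        import stat


        def describe_owner(path):
            """Report numeric ids of path and the names they resolve to here."""
            st = os.stat(path)
            try:
                user = pwd.getpwuid(st.st_uid).pw_name
            except KeyError:
                user = None  # the uid is unknown on this system
            try:
                group = grp.getgrgid(st.st_gid).gr_name
            except KeyError:
                group = None  # the gid is unknown on this system
            return {
                "uid": st.st_uid,
                "user": user,
                "gid": st.st_gid,
                "group": group,
                "mode": oct(stat.S_IMODE(st.st_mode)),
            }

    Running the same function inside a container and on the host for the same
    file would return identical ``uid`` and ``gid`` values, but possibly different
    (or missing) ``user`` and ``group`` names.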
    
    Performance Impact
    ------------------
    
    
    The ``hekad`` daemon will run in a container on each cluster node, but the
    ``rsyslogd`` daemon will be removed. We have also assessed that Heka is
    lightweight enough to run on every node. In addition, a possible option would
    be to constrain the Heka container to only use a defined amount of resources.
    
    Alternatives
    ------------
    
    An alternative to this proposal involves using Logstash in a centralized
    way as done in [1].
    
    Another alternative would be to execute Logstash on each cluster node, as this
    spec proposes with Heka. But this would mean running a JVM on each cluster
    node, and using Redis as a centralized queue.
    
    Also, as described above, we initially considered relying on services writing
    their logs to ``stdout`` and using Heka's ``DockerLogInput`` plugin. But our
    prototyping work has demonstrated the limits of that approach. See the
    ``DockerLogInput`` section above for more information.
    
    Implementation
    ==============
    
    Assignee(s)
    -----------
    
      Éric Lemoine (elemoine)
    
    Milestones
    ----------
    
    Target Milestone for completion: Mitaka 3 (March 4th, 2016).
    
    Work Items
    ----------
    
    1. Create a Heka Docker image
    2. Create a Heka configuration for Kolla
    3. Develop the necessary Heka decoders (with support for Python Tracebacks)
    4. Create Ansible deployment files for Heka
    5. Modify the services' logging configuration when required
    6. Correctly handle RabbitMQ, HAProxy and Keepalived
    7. Integrate with Elasticsearch and Kibana
    8. Verify that logs from all the Kolla services are collected
    9. Make the Heka container upgradable
    10. Integrate with kolla-mesos (will be done after Mitaka)
    
    Testing
    =======
    
    We will rely on the existing gate checks.
    
    Documentation Impact
    ====================
    
    The location of log files on the host will be mentioned in the documentation.
    
    References
    ==========
    
    
    [1] <https://review.opendev.org/#/c/252968/>
    [2] <http://hekad.readthedocs.org>
    [3] <http://blog.sematext.com/2015/09/28/recipe-rsyslog-redis-logstash/>
    [4] <https://review.opendev.org/#/c/269745/>
    [5] <http://hekad.readthedocs.org/en/latest/config/inputs/docker_log.html>
    [6] <http://hekad.readthedocs.org/en/latest/config/inputs/logstreamer.html>
    [7] <https://review.opendev.org/#/c/269952/>