Elasticsearch Data Streams, Aliases and Logstash

Posted by Attila on Mon, Mar 25, 2024

TLDR

Data streams are for storing time series data that is not modified after indexing. They also streamline index creation, rollover and lifecycle management, and Logstash benefits greatly from these features.

Indices in Elasticsearch

In Elasticsearch, indices are the basic unit for storing data. Each index is referred to by its unique name and is backed by one or more shards that split the data behind the scenes. Elasticsearch is a good illustration of how overloaded the term index is in computer science: its shards are actually Apache Lucene indices, which in turn have nothing to do with database indices. Lucene indices are beyond the scope of this post; more information can be found in the official Lucene documentation [2]. In the following, I will use the term index exclusively to refer to Elasticsearch indices.
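As a refresher, and to fix the console syntax used for the examples in this post, here is a minimal sketch of creating an index and indexing a document (Kibana Dev Tools syntax; the index name my-index and the document fields are made up for illustration):

# Create an index with explicit shard settings
# (my-index is an illustrative name).
PUT my-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

# Index a single document into it.
POST my-index/_doc
{
  "@timestamp": "2024-03-25T12:00:00Z",
  "message": "user logged in"
}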

Data Streams

A data stream is an abstraction on top of indices, introduced in Elasticsearch 7.9 (as of this writing, the current version is 8.12). Data streams store time series data across multiple indices while providing a single named resource to access the data. However, they are append-only: you cannot delete or modify documents in a data stream directly. If you need to update or delete documents, you must send the update or delete requests directly to the backing indices [1]. Conversely, indexing (append) and search requests should be sent to the data stream itself, and the stream will route them to the backing indices. Data streams are intended for time series data (logs, metrics, synthetics and traces), so every document must contain a @timestamp field.
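As an illustration, appends and searches both target the stream by name (a sketch in the same console syntax; the stream name logs-myapp-default is hypothetical, and the stream must already exist, e.g., via a matching index template as shown later):

# Append a document; data streams require a @timestamp field.
POST logs-myapp-default/_doc
{
  "@timestamp": "2024-03-25T12:00:00Z",
  "message": "user logged in"
}

# Search the stream by its single name; Elasticsearch routes the
# query to all backing indices.
GET logs-myapp-default/_search
{
  "query": { "match": { "message": "logged" } }
}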

Index Lifecycle Management, Aliases and Rollover

At first glance data streams may look similar to index aliases, albeit with the append-only restriction. Indeed, most Elasticsearch APIs accept an alias in place of an index or data stream name. However, an alias is only a secondary name, and it can refer either to one or more indices or to one or more data streams. There are some restrictions: an alias cannot refer to both an index and a data stream at the same time, and a data stream's backing index cannot be added to an index alias. So aliases can be used with both data streams and indices, while a data stream itself behaves much like an alias in that it refers to a set of backing indices.
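For comparison, this is roughly how an alias is attached to an index through the _aliases API (a sketch with illustrative names):

# Point the alias my-application-log at a concrete index;
# is_write_index marks the index that receives writes sent
# to the alias.
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "my-application-log-000001",
        "alias": "my-application-log",
        "is_write_index": true
      }
    }
  ]
}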

Next, it is not feasible to write time series data such as logs or metrics endlessly into a single index. To keep indexing and search performance up and resource consumption in check, you should only write to an index up to a particular limit. Once this limit is reached, it's time to create a new index and start writing to the new one. This process is referred to as rollover. It can be automated by using Index Lifecycle Management (ILM) or the data stream lifecycle, so you don't have to initiate it manually. Rollover can be triggered by the size of the index, the number of documents in it, or its age (e.g., after 9 days).
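A policy covering all three kinds of rollover condition could look like the following sketch (the policy name and the thresholds are made up; any one condition triggers the rollover):

# Roll over when the primary shard size, the document count or the
# index age exceeds a limit; delete indices 30 days after rollover.
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_docs": 100000000,
            "max_age": "9d"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}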

Data streams streamline the index creation, aliasing and rollover process. But first, let's look at how you would create an index with an alias and rollover using ILM, and how to set it up in Logstash.

ILM in Logstash

The workflow of using indices with aliases and rollover from Logstash is as follows:

  1. Create an index template with an ILM policy and alias in Elasticsearch (see the sketch after this list). NOTE: if no policy exists, Logstash will create a default one that rolls over the index when it reaches 50GB or becomes 30 days old. Similarly, if no alias is present, it creates a default alias named logstash.

  2. Create the first index (the bootstrap index), e.g., my-application-log-000001, with an alias pointing to this index. This can be cumbersome when many different data groups are to be ingested into different indices, or when trying to automate the process.

  3. Configure the Elasticsearch output of Logstash. It should contain the following:

output {
    elasticsearch {
        hosts => ["http://es1:9200"]
        ilm_rollover_alias => "my-application-log"
        ilm_enabled => "true"
        ilm_pattern => "000001"
        ilm_policy => "my_policy"
    }
}
  4. Start Logstash. It will create the first index (my-application-log-000001) via the alias if it does not exist yet, and the ILM policy will handle the rollover process.
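Steps 1 and 2 could be performed with requests along these lines (a sketch; the template name is made up, while the policy and alias names match the Logstash configuration above):

# Step 1: an index template binding the ILM policy and the rollover
# alias to every index matching the pattern.
PUT _index_template/my-application-log-template
{
  "index_patterns": ["my-application-log-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "my_policy",
      "index.lifecycle.rollover_alias": "my-application-log"
    }
  }
}

# Step 2: the bootstrap index, with the alias pointing at it as the
# current write index.
PUT my-application-log-000001
{
  "aliases": {
    "my-application-log": { "is_write_index": true }
  }
}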

The Logstash output above uses the following configuration options:

  • ilm_rollover_alias: the alias to write to. If it is omitted and not specified by an index template, the default alias will be used.
  • ilm_enabled: enables or disables index lifecycle management. This setting can be omitted, as Logstash detects whether the target Elasticsearch instance supports ILM.
  • ilm_pattern: the pattern appended to the rollover alias to name the backing indices. The resulting index name must end with a dash and a number that is incremented on rollover (e.g., my-application-log-000001).
  • ilm_policy: a custom ILM policy to use. It can be omitted if your index template refers to an ILM policy or if you wish to use the default policy created by Logstash.

Data Streams in Logstash

With data streams the backing indices are created automatically. In Logstash a configuration like the following could be used, which also lets you customize the name of the data stream:

output {
    elasticsearch {
        hosts => ["http://es1:9200"]
        data_stream => "true"
        data_stream_type => "metrics"
        data_stream_dataset => "foo"
        data_stream_namespace => "bar"
        action => "create"
    }
}

In this configuration the data_stream_* settings are used to set the name of the data stream as {type}-{dataset}-{namespace}, i.e., metrics-foo-bar here, and the backing indices follow the naming pattern .ds-<data-stream>-<yyyy.MM.dd>-<generation>. <data-stream> is the name of the data stream, followed by the creation date, and finally the generation, a number that is incremented on each rollover, similarly to the number in the ilm_pattern option.
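Once Logstash has written its first event, the stream and its current generation can be inspected (a sketch; metrics-foo-bar matches the configuration above):

# Lists the stream's backing indices and its generation, e.g.,
# .ds-metrics-foo-bar-2024.03.25-000001.
GET _data_stream/metrics-foo-bar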

The configuration options in detail are the following:

  • data_stream: whether data is indexed into a data stream. The other data_stream_* settings are ignored if this is set to false.
  • data_stream_type: the type part of the data stream name. Valid values are logs, metrics, synthetics and traces.
  • data_stream_dataset: the dataset part of the data stream name.
  • data_stream_namespace: the namespace part of the data stream name. Namespaces can be used to separate environments such as staging and production, and Kibana supports switching between namespaces in its user interface.
  • action: the default value for data streams is create: index the document and fail if a document with the same id already exists in the data stream.

When using data streams we do not need to create a bootstrap index; however, an ILM policy and an index template are required (these can be applied to multiple data streams).
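A minimal matching index template could look like the following sketch (the template and policy names are made up; the empty data_stream object is what makes Elasticsearch create a data stream instead of a plain index for matching names):

PUT _index_template/metrics-foo-template
{
  "index_patterns": ["metrics-foo-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "my_policy"
    }
  }
}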

In summary, data streams are for storing time series data that is not modified after indexing. They also streamline index creation, rollover and lifecycle management, and Logstash benefits greatly from these features.

  1. Elasticsearch Reference, Use a data stream: update or delete documents in a backing index. https://www.elastic.co/guide/en/elasticsearch/reference/current/use-a-data-stream.html#update-delete-docs-in-a-backing-index
  2. Apache Lucene, Index File Formats.