Elasticsearch Data Streams, Aliases and Logstash

Posted by Attila on Mon, Mar 25, 2024

TLDR

Data streams are for storing time series data that is not modified after indexing. They also streamline index creation, rollover and lifecycle management, and Logstash benefits greatly from these features.

Indices in Elasticsearch

In Elasticsearch, indices are the basic unit for storing data. Each index is referred to by its unique name and is backed by one or more shards that split the data behind the scenes. Elasticsearch is a good illustration of how overloaded the term index is in computer science: its shards are actually Apache Lucene indices, which in turn have nothing to do with database indices. Lucene indices are beyond the scope of this post; more information can be found in the official Lucene documentation [2]. In the following, I will use the term index exclusively to refer to Elasticsearch indices.
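As a refresher, and to fix the console syntax used for the examples in this post, here is a minimal sketch of creating an index and indexing a document (Kibana Dev Tools syntax; the index name my-index and the document fields are made up for illustration):

# Create an index with explicit shard settings
# (my-index is an illustrative name).
PUT my-index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}

# Index a single document into it.
POST my-index/_doc
{
  "@timestamp": "2024-03-25T12:00:00Z",
  "message": "user logged in"
}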

Data Streams

A data stream is an abstraction on top of indices, introduced in Elasticsearch 7.9 (as of this writing, the current version is 8.12). Data streams store time series data across multiple indices while providing a single named resource to access the data. However, they are append-only: you cannot delete or modify documents in a data stream directly. If you need to update or delete documents, you must send the update or delete requests directly to the backing indices [1]. Conversely, indexing (append) and search requests should be sent to the data stream itself, and the stream will route them to the backing indices. Data streams are intended for time series data (logs, metrics, synthetics and traces), so every document must contain a @timestamp field.
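As an illustration, appends and searches both target the stream by name (a sketch in the same console syntax; the stream name logs-myapp-default is hypothetical, and the stream must already exist, e.g., via a matching index template as shown later):

# Append a document; data streams require a @timestamp field.
POST logs-myapp-default/_doc
{
  "@timestamp": "2024-03-25T12:00:00Z",
  "message": "user logged in"
}

# Search the stream by its single name; Elasticsearch routes the
# query to all backing indices.
GET logs-myapp-default/_search
{
  "query": { "match": { "message": "logged" } }
}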

Index Lifecycle Management, Aliases and Rollover

At first glance data streams may look similar to index aliases, albeit with the append-only restriction. Indeed, most Elasticsearch APIs accept an alias in place of an index or data stream name. However, an alias is only a secondary name, and it can refer either to one or more indices or to one or more data streams. There are some restrictions: an alias cannot refer to both an index and a data stream at the same time, and a data stream's backing index cannot be added to an index alias. So aliases can be used with both data streams and indices, while a data stream itself behaves much like an alias in that it refers to a set of backing indices.
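For comparison, this is roughly how an alias is attached to an index through the _aliases API (a sketch with illustrative names):

# Point the alias my-application-log at a concrete index;
# is_write_index marks the index that receives writes sent
# to the alias.
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "my-application-log-000001",
        "alias": "my-application-log",
        "is_write_index": true
      }
    }
  ]
}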

Next, it is not feasible to write time series data such as logs or metrics endlessly into a single index. To keep indexing and search performance up and resource consumption in check, you should only write to an index up to a particular limit. Once this limit is reached, it's time to create a new index and start writing to the new one. This process is referred to as rollover. It can be automated by using Index Lifecycle Management (ILM) or the data stream lifecycle, so you don't have to initiate it manually. Rollover can be triggered by the size of the index, the number of documents in it, or its age (e.g., after 9 days).
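A policy covering all three kinds of rollover condition could look like the following sketch (the policy name and the thresholds are made up; any one condition triggers the rollover):

# Roll over when the primary shard size, the document count or the
# index age exceeds a limit; delete indices 30 days after rollover.
PUT _ilm/policy/my_policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_docs": 100000000,
            "max_age": "9d"
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}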

Data streams streamline the index creation, aliasing and rollover process. But first, let's look at how you would create an index with an alias and rollover using ILM, and how to set it up in Logstash.

ILM in Logstash

The workflow of using indices with aliases and rollover from Logstash is as follows:

  1. Create an index template with an ILM policy and alias in Elasticsearch (see the sketch after this list). NOTE: if no policy exists, Logstash will create a default one that rolls over the index when it reaches 50GB or becomes 30 days old. Similarly, if no alias is present, it creates a default alias named logstash.

  2. Create the first index (the bootstrap index), e.g., my-application-log-000001, with an alias pointing to this index. This can be cumbersome when many different data groups are to be ingested into different indices, or when trying to automate the process.

  3. Configure the Elasticsearch output of Logstash. It should contain the following:

output {
    elasticsearch {
        hosts => ["http://es1:9200"]
        ilm_rollover_alias => "my-application-log"
        ilm_enabled => "true"
        ilm_pattern => "000001"
        ilm_policy => "my_policy"
    }
}
  4. Start Logstash. It will create the first index (my-application-log-000001) via the alias if it does not exist yet, and the ILM policy will handle the rollover process.
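Steps 1 and 2 could be performed with requests along these lines (a sketch; the template name is made up, while the policy and alias names match the Logstash configuration above):

# Step 1: an index template binding the ILM policy and the rollover
# alias to every index matching the pattern.
PUT _index_template/my-application-log-template
{
  "index_patterns": ["my-application-log-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "my_policy",
      "index.lifecycle.rollover_alias": "my-application-log"
    }
  }
}

# Step 2: the bootstrap index, with the alias pointing at it as the
# current write index.
PUT my-application-log-000001
{
  "aliases": {
    "my-application-log": { "is_write_index": true }
  }
}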

The Logstash output above uses the following configuration options:

  • ilm_rollover_alias: the alias to write to. If it is omitted and not specified by an index template, the default alias will be used.
  • ilm_enabled: enables or disables index lifecycle management. This setting can be omitted, as Logstash detects whether the target Elasticsearch instance supports ILM.
  • ilm_pattern: the pattern appended to the rollover alias to name the backing indices. The resulting index name must end with a dash and a number that is incremented on rollover (e.g., my-application-log-000001).
  • ilm_policy: a custom ILM policy to use. It can be omitted if your index template refers to an ILM policy or if you wish to use the default policy created by Logstash.

Data Streams in Logstash

With data streams the backing indices are created automatically. In Logstash a configuration like the following could be used, which also lets you customize the name of the data stream:

output {
    elasticsearch {
        hosts => ["http://es1:9200"]
        data_stream => "true"
        data_stream_type => "metrics"
        data_stream_dataset => "foo"
        data_stream_namespace => "bar"
        action => "create"
    }
}

In this configuration the data_stream_* settings are used to set the name of the data stream as {type}-{dataset}-{namespace}, i.e., metrics-foo-bar here, and the backing indices follow the naming pattern .ds-<data-stream>-<yyyy.MM.dd>-<generation>. <data-stream> is the name of the data stream, followed by the creation date, and finally the generation, a number that is incremented on each rollover, similarly to the number in the ilm_pattern option.
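Once Logstash has written its first event, the stream and its current generation can be inspected (a sketch; metrics-foo-bar matches the configuration above):

# Lists the stream's backing indices and its generation, e.g.,
# .ds-metrics-foo-bar-2024.03.25-000001.
GET _data_stream/metrics-foo-bar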

The configuration options in detail are the following:

  • data_stream: whether data is indexed into a data stream. The other data_stream_* settings are ignored if this is set to false.
  • data_stream_type: the type part of the data stream name. Valid values are logs, metrics, synthetics and traces.
  • data_stream_dataset: the dataset part of the data stream name.
  • data_stream_namespace: the namespace part of the data stream name. Namespaces can be used to separate environments such as staging and production, and Kibana supports switching between namespaces in its user interface.
  • action: the default value for data streams is create: index the document and fail if a document with the same id already exists in the data stream.

When using data streams we do not need to create a bootstrap index; however, an ILM policy and an index template are required (these can be applied to multiple data streams).
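A minimal matching index template could look like the following sketch (the template and policy names are made up; the empty data_stream object is what makes Elasticsearch create a data stream instead of a plain index for matching names):

PUT _index_template/metrics-foo-template
{
  "index_patterns": ["metrics-foo-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "my_policy"
    }
  }
}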

In summary, data streams are for storing time series data that is not modified after indexing. They also streamline index creation, rollover and lifecycle management, and Logstash benefits greatly from these features.

  1. Elasticsearch Reference, Use a data stream: update or delete documents in a backing index. https://www.elastic.co/guide/en/elasticsearch/reference/current/use-a-data-stream.html#update-delete-docs-in-a-backing-index
  2. Apache Lucene, Index File Formats.