On Mappings, part I: Which Fields to Index?

By | September 17, 2016

The choices you make in your Mappings have the power to make your project a success or a complete mess. There, I said it. This is part I in a series of shorts about important details of a Mapping that can make or break your cluster’s ability to get the right answer to your queries, at scale.

In this post, we’ll take a look at choosing the right data structures for your data and use case. Data structures such as Doc Values make Elasticsearch fly like a unicorn on steroids, but the more precise you are in using the tools that Elasticsearch gives you, the better performance you will get.

To index, analyze, or not at all?

For every field in every index, you have at least one choice to make: do I index the value, or not? In this context, “index” means to make the data “active” in Elasticsearch. These are the fields we want to search through, filter on, aggregate on, sort on, etc. Putting any data at all into an Elasticsearch cluster is also referred to as “indexing”, but that has a different meaning.

To be clear, this is how I will use the term indexing:

  1. “index a document” means: putting data in Elasticsearch.
  2. “index a field” means: instructing Elasticsearch that we want to work with that field in queries, filters, scripts, and such.

Elasticsearch has many data structures that allow it to do its work at scale. Some of the most important ones are:

  • inverted index: a row-based index of a value plus a (sparse) matrix of the documents that contain that value
    • Requires on-disk resources
    • _all field: the _all field combines all other fields in a single string and analyzes and indexes that, so that you can search for values without specifying (or caring) which fields the value is in.
  • doc values: a column-based, inverted-inverted index that’s essentially a list of all the values in sequence
    • Requires on-disk resources, plus off-heap memory (if available)
  • fielddata: an in-heap version of doc values, now only used for analyzed text (5.0) and string (<= 2.4) fields
    • Requires in-heap memory, use with caution!
  • _source field: the original JSON data that was indexed into Elasticsearch, available as a field-of-fields for reindexing and as a return value to queries that ask for specific (lists of) documents. This field is compressed to save space (because we never need to scan through it, compression is favorable).
    • Requires on-disk resources
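
To make the difference between the first two structures concrete, here is a toy Python sketch. It is not how Lucene stores anything internally; the three documents and the helper names are made up for illustration. It only shows the direction each structure looks things up in: value to documents versus document to value.

```python
from collections import defaultdict

# Three toy documents, keyed by document id (made-up sample data).
docs = {
    0: {"country": "kenya", "year": 2012},
    1: {"country": "peru", "year": 2009},
    2: {"country": "kenya", "year": 2014},
}


def build_inverted_index(docs, field):
    """Map each value to the sorted ids of the documents that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        index[docs[doc_id][field]].append(doc_id)
    return dict(index)


def build_doc_values(docs, field):
    """Store the field as a column: one value per document, in doc id order."""
    return [docs[doc_id][field] for doc_id in sorted(docs)]


inverted = build_inverted_index(docs, "country")
column = build_doc_values(docs, "year")

# Searching or filtering asks: which documents contain this value?
print(inverted["kenya"])  # [0, 2]

# Sorting or aggregating asks: what is the value for each document?
print(max(column))  # 2014
```

Filtering on a value is a single lookup in the first structure, while sorting and aggregating walk the second one. That is why the two can be switched on and off independently per field.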

The index settings in the mapping have, you guessed it, great impact on the usage of these derived data structures. Some of the data structures listed above only work wonders when they fit in memory, and can slow down the cluster’s performance if there is not enough of it. Even when they are not taking precious memory space, they will certainly occupy (SSD) disk space. Whenever we take resources, we must do that consciously. Therefore, a good practice is to go over every field and fine-tune the mapping for it, providing your solution with exactly what it needs to do the work, while cutting off any excess fat from the data structures.

The index mapping controls the field index settings, and the available options differ per field datatype. Elastic’s defaults try to balance these concerns as best as possible. The defaults in Elasticsearch 5.0 are:

Option         Default    Extra info
index          true       Makes it possible to search and filter on this field. Sorting, aggregating and scripting rely on doc_values or fielddata.
fielddata      false      For text fields only. In-heap sort, aggregate and script access. Users beware of heap pressure!
doc_values     true       Only for non-text fields. Used to allow sort, aggregate, and script access. If disabled, you can still filter on it.
index_options  positions  Allows you to set how rich you want to index your data. See also the index_options documentation.
_all           true       Allows you to search for values without specifying in which fields. Takes resources in the inverted index.
_source        true       Allows you to request the original JSON data that was inserted into Elasticsearch, allows updates and reindexing.

And the defaults in the latest 2.x version, 2.4, are:

Option      Default                                             Extra info
index       analyzed for string fields, not_analyzed otherwise  If it's a full-text field, makes it possible to search with fuzzy matching, highlighting, etc. Makes it possible to sort, filter, aggregate, script on this field if not a full-text field. Set to "no" to disable indexing.
fielddata   enabled, loaded lazily                              For analyzed string fields only. In-heap sort, aggregate and script access. Users beware of heap pressure!
doc_values  true                                                Not available for analyzed string fields. Used to allow sort, aggregate, and script access. If disabled, you can still filter on it.
_all        true                                                Allows you to search for values without specifying in which fields. Takes resources in the inverted index.
_source     true                                                Allows you to request the original JSON data that was inserted into Elasticsearch, allows updates and reindexing.

Recommended index settings for your situation

Quite simply, it comes down to answering these questions (for 5.0 and generally also for 2.x):

  1. Does your field contain text that you would like to have analyzed (i.e. to perform full-text search, remove stop words, apply stemming, etc.)?
    1. Yes: You’re dealing with a text field. Use “index: true” (default) and pick the right setting for index_options (5.x only). Consider enabling fielddata, but be aware of its risks and look at the various ways to limit its heap usage.
    2. No: You’re dealing with a numerical field or a keyword field. Do you want to use the field “actively” in Elasticsearch (filter, sort, script)? If no, use “index: false”, and consider “doc_values: false” as well, since doc values are what sorting, aggregations and scripts read. Otherwise, use “index: true” (default).
  2. Are you sure that you don’t need reindexing, don’t need the original JSON data that you sent to the cluster, and don’t need to update existing documents? And do you need the additional disk space? Then consider disabling _source, but be aware of all the disadvantages.
  3. Do you want to search for values in your data without knowing which field(s) the data is in? Then leave _all enabled. But if you want to save resources, you can disable _all and search in specific fields instead. Or, you can include only certain fields in _all.

Someday, I might make a flowchart for this :).
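
Until then, the questions can be sketched as a small Python helper that builds a 5.0-style mapping body. This is only an illustration: the function names, parameters and field names (description, status, internal_code) are hypothetical. Switching off doc_values together with index for unused fields is my own addition, based on the fact that sorting, aggregating and scripting read doc values.

```python
import json


def field_settings(field_type, active=True, fielddata=False):
    """Question 1: index settings for one field (Elasticsearch 5.0 datatypes)."""
    settings = {"type": field_type}
    if field_type == "text":
        # Analyzed full-text field: keep it indexed, fielddata only on request.
        settings["index"] = True
        if fielddata:
            settings["fielddata"] = True
    else:
        # Numeric or keyword field: index it only if it is used "actively".
        settings["index"] = active
        # Assumption beyond the list above: sorting, aggregating and scripting
        # read doc values, so an unused field can drop those as well.
        settings["doc_values"] = active
    return settings


def type_mapping(properties, keep_source=True, keep_all=True):
    """Questions 2 and 3: _source and _all for the whole mapping type."""
    return {
        "_source": {"enabled": keep_source},
        "_all": {"enabled": keep_all},
        "properties": properties,
    }


# Hypothetical fields: one full-text, one keyword, one unused integer.
mapping = type_mapping(
    {
        "description": field_settings("text"),
        "status": field_settings("keyword"),
        "internal_code": field_settings("integer", active=False),
    },
    keep_all=False,
)
print(json.dumps(mapping, indent=2))
```

The printed JSON is what would go under a mapping type in a create-index request. Keeping _source on by default in the helper mirrors the advice further down to stay flexible.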

Common misconfigurations

Let’s turn it around:

  • If heap pressure is making Elasticsearch unstable, causing excessive GC, or even crashing nodes, you might be using too much fielddata. You can analyze your fielddata statistics with the cat fielddata API. Remove fielddata or add fielddata filters, and/or increase the heap size, and/or add data nodes.
  • If you have complex mappings (hundreds of fields) and use only a couple of them for searching, filtering, and aggregating, you should disable doc_values on the fields you don’t use, and probably exclude them from _all. If you don’t, they will take up non-heap RAM and disk.
  • If you have simple mappings, but very high throughput and high reliance on aggregations to interpret your data (for example a metrics use case), you might not need _all and _source to be enabled. The same hardware will allow much higher throughput and storage of data that way.
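
For the second case, going over hundreds of fields by hand is tedious, so here is a hedged Python sketch that slims down every field you do not use. It assumes a flat "properties" dictionary (no nested objects) and 5.0-style settings. The function name and the sample fields are hypothetical.

```python
import copy


def trim_unused_fields(properties, used_fields):
    """Return a copy of `properties` with the unused fields slimmed down."""
    trimmed = copy.deepcopy(properties)
    for name, settings in trimmed.items():
        if name in used_fields:
            continue
        # Keep the field out of the catch-all _all field.
        settings["include_in_all"] = False
        # text fields have no doc values to begin with, so skip them.
        if settings.get("type") != "text":
            settings["doc_values"] = False
    return trimmed


# Hypothetical mapping with three fields, of which only "status" is used.
properties = {
    "status": {"type": "keyword"},
    "notes": {"type": "text"},
    "legacy_id": {"type": "long"},
}
slim = trim_unused_fields(properties, used_fields={"status"})
print(slim["legacy_id"])
```

The original dictionary is left untouched, so you can compare both versions before creating the new index.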

See it in action

I’m taking a small sample data set of the World Bank, available at http://jsonstudio.com/resources/. The original JSON file is 2.8mb, which, as you will see later on, is much bigger than the size of the _source field. That’s because even though we store the original JSON data in _source, it’s compressed. I’ll simply send the data to a non-existing index, so Elasticsearch will take the defaults. I’m using Elasticsearch and Kibana 5.0.0-alpha5.

GET _cat/indices?v&h=health,status,index,pri.store.size

health   status  index                                 pri.store.size
green    open    worldbank                             4.4mb
green    open    worldbank-nosource                    3.3mb
green    open    worldbank-nosource-noall              2.4mb
green    open    worldbank-nosource-noall-nodocvalues  574.5kb
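
To put those numbers in perspective, the following Python sketch computes how much smaller each index is compared to the default one. It only uses the sizes listed above and assumes that the units are 1024-based. The helper names are made up.

```python
# Primary store sizes copied from the cat indices output above.
sizes = {
    "worldbank": "4.4mb",
    "worldbank-nosource": "3.3mb",
    "worldbank-nosource-noall": "2.4mb",
    "worldbank-nosource-noall-nodocvalues": "574.5kb",
}

# Assumption: kb/mb/gb are powers of 1024.
UNITS = {"kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3}


def to_bytes(size):
    """Convert a size such as '4.4mb' to a number of bytes."""
    return float(size[:-2]) * UNITS[size[-2:]]


def savings(sizes, baseline):
    """Percentage saved by each index relative to the baseline index."""
    base = to_bytes(sizes[baseline])
    return {name: 100 * (1 - to_bytes(size) / base) for name, size in sizes.items()}


saved = savings(sizes, baseline="worldbank")
for name, percent in saved.items():
    print(f"{name}: {sizes[name]} ({percent:.0f}% smaller than the default index)")
```

Going by the index names, each index drops one more data structure than the previous one, which works out to roughly 25%, 45% and 87% less disk space than the default index for this data set.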

Fielddata does not impact the on-disk size reported here, because it lives on the heap. But if you enable it and start sorting or aggregating on the full-text fields, you will see the fielddata size grow using the cat fielddata API:

# Execute a search query that sorts on a fielddata-enabled field
GET worldbank-withfielddata/_search
{
  "size": 1,
  "sort": [
    {
      "board_approval_month": {
        "order": "asc"
      }
    }
  ]
}

GET _cat/fielddata?v

# Fielddata size is non-zero
id                     host      ip        node    field                size
sjihDpYhSa6sSlafOhwf4A 127.0.0.1 127.0.0.1 sjihDpY board_approval_month 3.3kb

Keep _source, stay flexible

In the old days, your choice of mapping was definitive, and that made it hard to change existing data. Since Elasticsearch 2.3 those days are over. The awesome Reindex API allows you to easily change the mapping of existing data, by reindexing it inside the cluster to an index with a different mapping. Say that we would like to filter on a field that was not indexed because, in all our wisdom, we mistakenly thought we’d never want to filter on it. Enter Reindexing:

PUT tweets
{
  "mappings": {
    "tweet": {
      "properties": {
        "retweet_count": {
          "type": "integer",
          "index": false
        }
      }
    }
  }
}

Now retweet_count is not indexed, hence not filterable! Let’s create a derivative index with an updated mapping, and reindex:

PUT tweets2
{
  "mappings": {
    "tweet": {
      "properties": {
        "retweet_count": {
          "type": "integer",
          "index": true
        }
      }
    }
  }
}

POST _reindex
{
  "source": {
    "index": "tweets"
  },
  "dest": {
    "index": "tweets2"
  }
}

DELETE tweets

Now retweet_count has become filterable. However, this is only possible if you keep the original data in _source. So if you have an index like:

PUT tweets
{
  "mappings": {
    "tweet": {
      "_source": {
        "enabled": false
      },
      "properties": {
        "retweet_count": {
          "type": "integer",
          "index": true
        }
      }
    }
  }
}

The above mapping tells Elasticsearch not to store _source, hence you can no longer reindex this data from inside the cluster. You’ll have to reingest using raw data from another data store.
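
Before relying on the Reindex API, it can be worth checking whether an index still has its _source. The Python sketch below inspects a mapping as I would expect `GET tweets/_mapping` to return it on 5.0 (index name, then "mappings", then the type name). That response shape and the function name are assumptions, so check them against your own cluster.

```python
def source_enabled(mapping_response, index, doc_type):
    """True if _source is stored for this type; it is enabled unless switched off."""
    type_mapping = mapping_response[index]["mappings"][doc_type]
    return type_mapping.get("_source", {}).get("enabled", True)


# Shaped like the mapping above, with _source disabled.
without_source = {
    "tweets": {
        "mappings": {
            "tweet": {
                "_source": {"enabled": False},
                "properties": {"retweet_count": {"type": "integer"}},
            }
        }
    }
}

# A mapping that does not mention _source keeps the default (enabled).
with_source = {
    "tweets": {
        "mappings": {
            "tweet": {"properties": {"retweet_count": {"type": "integer"}}}
        }
    }
}

print(source_enabled(without_source, "tweets", "tweet"))  # False
print(source_enabled(with_source, "tweets", "tweet"))     # True
```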
