Postmortem: Outage due to Elasticsearch’s flexibility and our carelessness

On November 25 at 4:30 AM, our consumer apps stopped working because of some issue with our backend API. This article is a postmortem of what happened that night.

Some Background

Our product search and navigation is served from Elasticsearch. We create daily index of products, related merchant data and locality data in one index but under different mappings. This index is built on a daily basis and then the latest index is switched with an existing one under a fixed alias. This works well.

Most of our requests are geolocation dependent, which is the reason why we are so heavily dependent on Elasticsearch.

The current state of our systems is that we have an API (called Consumer API) which serves product data curated using another system called Content Management System (CMS). CMS is the system which we use for curating our catalog and other content that changes frequently (on a day-to-day basis or multiple times a day) and is published to our consumer app. The responsibility of curating this data using CMS is of our Content Team.

The Incident

A couple of weeks back, our consumer mobile apps stopped working because our Consumer API was sending 500 response for every request that depends on Elasticsearch. This happened at around 4:30 AM in the morning when obviously nobody from our engineering team was awake.

We had not pushed any code changes at this time so we were sure that there is no bug in the code querying Elasticsearch. On the basis of that and the exceptions thrown by Elasticsearch (java.lang.NumberFormatException), we deduced that something was wrong with the Elasticsearch index.

We re-build our indexes daily and thankfully keep the old ones for situations like such. Within minutes, we switched the search alias to use the older index and our app was working normally again, although serving stale data.

Postmortem

While our service was working normally again, we could not continue to serve stale data as we had switched to an old index.

Logs helped us figure out the failing queries. We used Sense and started playing around with those queries manually on the faulty index. The same queries were running fine on older indices. As mentioned earlier, we were seeing the following exception:

sort[<custom:\"price\": org.elasticsearch.index.fielddata.fieldcomparator.LongValuesComparatorSource@60d18f72>]: Query Failed [Failed to execute main query]]; nested: ElasticsearchException[java.lang.NumberFormatException: Invalid shift value (64) in prefixCoded bytes (is encoded value really an INT?)]  

What changed?

For the upcoming sale, our business functions behind the campaign wanted to run promotions on our app. To make the promotions related content controllable via the CMS, our engineering team worked towards adding certain features which allowed our Content Team to manage promotions that are shown in the app. This feature also implemented making promotions available for querying using Elasticsearch since promotions are also location specific.

To add promotions in the same Elasticsearch, the CMS team decided to add a new doctype/mapping in Elasticsearch called promotions. The feature was tested locally by the developer and it worked fine. No issues were caught anywhere during testing and the code was pushed to production.

The feature came into use when our content team started curating the content past midnight for the sale starting the next day. Once they added the content and started the re-indexing procedure (which is a manual button click), our consumer app stopped working. As soon as the team started indexing the content around 4:30 AM, our app stopped working. Our search queries started returning NumberFormatException on our price field.

Understanding Elasticsearch Better

With Elasticsearch things may seem very nice and simple from the outside. In reality, they may not be as nice and simple. One of the nice things that Elasticsearch has, you may say, is being schema-less and using JSON for storing documents. JSON by its very nature is very flexible to work with. Add or remove things as you please. And Elasticsearch would take care of the rest. But that's what most of us beginning with Elasticsearch believe to be true. The reality is quite different.

Elasticsearch is NOT schema-less. It just appears schema-less by assuming schema at the time of indexing a document. According to the official Elasticsearch Definitive guide:

When you index a document that contains a new field—one previously not seen—Elasticsearch will use dynamic mapping to try to guess the field type from the basic datatypes available in JSON, using the following rules:

JSON Type Field Type
Boolean: true or false boolean
Whole number: 123 long
Floating point: 123.45 double
String, valid date: "2014-09-15" date
String: "foo bar" string

How Elasticsearch really handles mappings

An Elasticsearch index is made of one or more shards. An index is nothing but a collection of shards. Practically speaking, one index with 5 shards or 5 indices with one shard each are essentially the same.

The documents are stored in shards. Shards are made up of Lucene Segments. Whenever you index a new document in Elasticsearch, it is first written in the transaction log (in-memory and file). After a fixed period of time (the refresh_interval), the documents from the transaction log are picked up and written to a Lucene Segment file. Lucene Segments are immutable which means that once written, they cannot be modified. Lucene Segments have the actual indexes that are used for performing searches. When you execute a search query, Elasticsearch opens the right segment files and uses them for getting the matching documents.

It is also important to understand how mappings/doctypes work in Elasticsearch. As mentioned earlier, Elasticsearch is not schema-less. It is just good at assuming schemas. Schemas can be defined as mappings in Elasticsearch. Mappings are the way to define the structure of a document and how the data of a document is stored at Lucene level.

Lucene Segments are not type-aware. Lucene Segments, basically, just store key-value pairs where keys are fields and values are lists of any kind of value that may be used in the search process. To quote the Elasticsearch guide:

A document in Lucene consists of a simple list of field-value pairs. A field must have at least one value, but any field can contain multiple values. Similarly, a single string value may be converted into multiple values by the analysis process. Lucene doesn’t care if the values are strings or numbers or dates—all values are just treated as opaque bytes.

Since all the inverted indices in an Elasticsearch index (don't get confused) are stored in segment files, it does not really matter if you have two different mappings with the same field names but with different types. The values for field price in products mapping and the values for field price in promotions mapping (in the same index) will essentially be added to the same list at Lucene Segment level. And it will not fail or throw an exception.

This introduces unexpected problems which can be difficult to deal with if you have not experienced this before yourself or not read the docs properly. In our case, this is precisely what happened. Our existing mappings were set by us where we had used integer type for price field in products mapping. When promotions mapping was added, we did not care to define the mapping and depended on Elasticsearch to take care of it. Elasticsearch did what it usually does - assumed the value to be long and went ahead and set the mapping to have long type for price field.

We were sorting on the price field in our queries from the consumer app. At some level because of this mess up, Elasticsearch failed to give expected results. How this is explained in Elasticsearch is that Elasticsearch is based around field data. So when all the field data is loaded in the memory, Elasticsearch will either see them as long or integer depending upon whatever it saw first. In our case, Elasticsearch probably fail to load long values in memory since it saw integer values first.

How we fixed it

As I mentioned, we had immediately switched back to previous day's index but that was just a quick interim fix. We had new data that we had to show to our consumers for the sale that was about to start in about 6 hours from the time of failure.

We first thought of re-indexing the entire index. However, our ETL process as of now takes about 8 hours as we had not worked on optimizing it and in a situation like this we could not even do much with it immediately. And if we started the re-indexing process, it would take so much time that it would not be done before the start of our sale.

We understood that we needed to delete the new promotions mapping somehow, set the correct mapping to ensure that price has integer type and re-index the promotions data which would be very quick as data size was extremely small. So we started with that. We deleted the mapping using the Delete Mapping API.

The documents got deleted and were not searchable any more. We hoped that this would work. But it didn't. Why? Go back to the fundamentals about Lucene Segments we discussed earlier - they are immutable. Deleting never really deletes documents and field data from segment files. Documents are just marked as deleted.

The actual deletion happens when segments are merged. Segments are merged so that we don't end up with too many segment files on disk (read why). So background threads in Elasticsearch pick up existing segment files and merge them and garbage is collected in the merging process, meaning that deleted documents are actually deleted.

We could not wait for this to happen in the background as we were short on time and had to make sure that things were actually fine or come up with an alternative solution. Thankfully, Elasticsearch has Optimize API which triggers segment merging.

We triggered optimization using this API. When the API succeeded, we hoped that things would work. But they didn't. This time around the issue was that Elasticsearch's Optimize API decides for itself how segments are going to be merged and if all segments will get merged or not. The only logical explanation we could make was that there were still segments lying around with invalid price values with them and did not get garbage collected.

It would have been nice if there was a way to merge all the segments. Thankfully, Optimize API has a parameter called max_num_segments which allows you to specify how many segments would you want after the optimization process has run. We set that to 1 and triggered optimization again. And... voila! We had everything working again.

We went ahead and defined the mapping properly for promotions, re-indexed promotion data quickly and had a peaceful sale for the next three days.

How could we have avoided this?

Better testing

  1. The CMS code that was pushed live was tested indeed but it was tested only locally and pushed to production as the requirements were urgent. But, everything is urgent in a startup. Its easy to dig your own grave if you don't follow the practices that are important. We cannot get careless.
  2. How was it not caught locally? The team testing the feature only tested the APIs that were used for getting promotions. They did not even imagine that the APIs that make use products mapping can get affected. On top of this, the issue surfaced up only when our content team started using the feature. We were actually sitting on a time bomb. Thank god it happened at night. Automated regression testing would have caught this issue. This is something we don't have and are prioritizing with every release.
  3. Putting a review process around everything that goes into Elasticsearch, just like how you do the same thing around your RDBMS (you do that right?).
  4. Separating concerns is important. For some circumstances at the beginning of how systems were built, we relied on the CMS to index documents in Elasticsearch for the consumer app to consume those documents. Perhaps, if we had depended on consumer app to index their documents themselves, this issue could have gotten caught in the testing process by the consumer app team. May be it is time for us to move out the responsibility of indexing documents for search from CMS.

Read The F***ing Docs

Very aptly put in our world - I can't give more emphasis on this phrase. Nothing is magic. Nothing just works. As engineers, its our responsibility to get to the root of things and understand how this works. Had we made an effort to understand mappings better, which is such an essential part of running Elasticsearch, we would have not committed this mistake.

Conclusion

  1. Elasticsearch is NOT bad at all. We love Elasticsearch. It would be difficult to imagine a lot of things we do today without Elasticsearch. But the low bar to entry can get you careless and have you shoot yourself in the foot.
  2. This incident proved how careless we were with testing changes in pre-production environments. One of the reasons why this happened to us was not having sanitized pre-production environments for testing. We are working on bringing more dev-prod parity in our software release cycles.
  3. This incident also taught us the importance of understanding the systems we use.

References

A few links for further reading:

  1. Elasticsearch From The Bottom Up
  2. Elasticsearch: The Definitive Guide - Mappings
  3. Elasticsearch: The Definitive Guide - Types and Mappings - has details on how Elasticsearch and Lucene work together under the hood
  4. Elasticsearch: The Definitive Guide - Segment Merging
  5. Elasticsearch - Indices Segments API
  6. Elasticsearch - Optimize API
  7. Lucene:
    1. What is a Lucene Segment?
    2. Visualizing Lucene's Segment Merges
    3. Lucene File Formats
  8. Mailing list threads:
    1. Number of Shards vs. Open Files
    2. Segment Merging

Discussion on Hacker News

Follow the discussion on Hacker News.