Elasticsearch is probably the most popular search engine today, with an active community, commercial support and a mountain of information on the web. However, that information is scattered and inconsistent.
The first and main misconception is “you need search, so grab Elastic!”. In reality, if you need fast search for a small or even a fairly large project, it is worth digging into the topic first; you may well decide against this particular system.
The second problem is that getting the big picture from scratch is not easy. Yes, there is a lot of information, but a sensible order of study only becomes obvious in hindsight. You have to run from books to documentation and from documentation back to books, googling the fine points along the way, just to understand what Elasticsearch is, why it works the way it does, whether to use it at all, and when to choose something simpler.
In this article I have tried to explain, step by step, what I consider the essentials of Elasticsearch: why it was invented and how it works.
For clarity, let’s invent a task for ourselves: implementing search across all materials and users of a collective blog. The system allows creating tags, communities, geotags and everything else that helps categorize a huge amount of information.
Data storage scheme
The actions we will perform on the data determine the storage scheme:
- the search engine will, above all, have to search quickly
- writes and deletes do not have to be especially fast; for a search engine, I think, this can be neglected
- the data structure will change intensively, and the storage may be filled from several independent sources (various databases, including ones external to our system)
Imagine once more how many attributes a publication can have and how many objects are associated with it: author, category, community, geo, media, tags, related publications. The list goes on for as long as imagination lasts. If we store all this in a familiar relational database, we end up with a million relations and a billion attributes. That is great for structured storage over many years, but it does not match the requirement of fast search.
But what if we want to add a couple of integrations with external systems? You have to implement additional tables or even databases. We will always need to add or change something in the objects available for search.
It is much faster to read from objects containing everything you need here and now. And it’s much easier to make changes to the unstructured data schema.
In addition, such data structures are easier to shard and distribute across different physical stores, because each object already contains everything it needs.
We can perceive these objects as separate pages, files or cards; all of them can be called documents. That is why this data storage model is called document-oriented.
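For illustration, here is roughly what one such document for our imaginary blog could look like, written as a plain Python dictionary (all field names here are invented for the example):

```python
# A hypothetical publication stored as a single self-contained document.
# Every field name below is made up for illustration purposes.
post = {
    "id": 42,
    "title": "Why we moved our search to a document store",
    "body": "Full text of the publication goes here...",
    "author": {"id": 7, "name": "alice"},
    "community": "engineering",
    "tags": ["search", "architecture"],
    "geo": {"lat": 59.93, "lon": 30.33},
    "related": [17, 23],
}
```

Everything needed to display or search the publication lives inside the object itself, with no joins required.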
Search
Now you need to decide on the search mechanisms. Data is organized as documents. How are we used to searching a document?
A typical example of a document would be a web page. If we try to search the entire page in a browser, the search will be performed on all the text contained. And it’s convenient for most cases.
Many search engines work in roughly the same way: the search runs over the entire text of the indexed pages rather than over individual fields, tags or headers. This is called full-text search.
We will be searching across a huge number of documents, so it would be wise to remember in advance which document contains what. In relational DBMSs we are used to optimizing such searches with indexes.
What is an index, really? Without going into details, an index is a balanced tree, that is, a tree in which the lengths of the paths (the number of steps between nodes) differ by no more than one.
For example, if we indexed our post, we would get a tree whose leaves are the words used in it. In simple terms, we would know in advance which words the document contains and how to find them quickly. Despite this convenient structuring, walking a tree does not sound like the best option for the fastest possible search.
But what if you do the opposite – collect a list of all the words used and find out in which documents they appear. Yes, indexing will take more time, but what we are primarily interested in is the speed of the search, not the indexing.
This index is called an inverted index and is used for full-text search.
A good example is the popular open-source full-text search library Apache Lucene, which is, of course, built around an inverted index.
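To make the idea concrete, here is a minimal sketch of an inverted index in Python, with naive whitespace tokenization and no stemming or scoring (real libraries such as Lucene do far more):

```python
from collections import defaultdict

documents = {
    1: "quick search for a small project",
    2: "search engines index a huge amount of text",
}

# Map each word to the set of document ids that contain it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)

# Which documents mention "search"? A single dictionary lookup, no tree walk.
print(inverted_index["search"])  # {1, 2}
```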
Scaling
No matter how well we optimize data structures and search algorithms, when it comes to really large data sets and a really large number of queries, we need the ability to influence system performance by adding hardware resources. In other words, we want to be able to add a bit more memory, CPU or disk so that everything runs faster. This is called scalability.
The easiest option is to add hardware on the server. If we represent each conventional unit of computing power as a wooden cube, then now we will put the cubes in one place or one on another, building the tower vertically. Such scaling is called vertical.
The second option is to divide our tasks into a group of machines. In this case, we also increase the hardware resource, but now we can arrange the cubes on an imaginary table in any way on its plane, horizontally. Guess what such scaling is called?
The first method guarantees a quick result without pain, but of course not everything is that smooth. How long can we keep beefing up a single machine? First, it is cheap only at the very beginning; later, one powerful server costs as much as several simpler machines. Second, sooner or later you hit a ceiling: hardware, drivers, bandwidth and a pile of other logical and physical limitations. And most importantly, a critical failure of that one machine takes down the entire system.
Unlike the first method, the second imposes no such obvious limits: we can add as many machines as we like, connecting them over a network. Of course, this brings its own overhead: transmission over the network is slow compared to processing on a single machine, and inter-node communication is not free. But the network has one very important property: high fault tolerance.
Distributed Index
Ok, we will use Lucene instances to store data and to search. But earlier we decided that, to scale horizontally, we need to be able to place data on different machines. Does it really matter how the data is stored physically? What matters is that we have a single logical storage. Each Lucene instance should become a part of one big index – a shard of the split index. A shard directly performs the data read and write operations.
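A natural way to decide which shard owns a given document is to hash its identifier; Elasticsearch uses a routing formula of the same spirit (a hash of the routing value modulo the number of primary shards). A toy sketch, with the shard count chosen arbitrarily for the example:

```python
import hashlib

NUM_PRIMARY_SHARDS = 3  # fixed when the index is created

def shard_for(doc_id: str) -> int:
    """Deterministically map a document id to a shard number."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PRIMARY_SHARDS

print(shard_for("post-42"))  # the same id always lands on the same shard
```

Because the mapping depends on the number of primary shards, that number cannot simply be changed later without re-routing the data.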
Cluster
We have settled on the basic concept of a distributed index. Now we need to decide how the individual storage instances will actually be managed.
Earlier we decided that a separate Lucene instance (a shard) is responsible for search and indexing operations. To access a distributed system of shards, we need some kind of coordinating node: it is the one that will receive requests and hand out tasks to write or fetch data. So, besides storing data, we identify another kind of behavior for a node – coordination.
Thus, we start with two types of nodes – CRUD nodes and coordinating nodes; let’s call them data nodes and coordinating nodes. We have a bunch of machines connected by a network, and all of this looks very much like a cluster.
Each type of node responsibility imposes certain system requirements. Obviously, data nodes will often access the disk and use significant amounts of memory during operation.
We can also note that not all data is requested equally often. Data gradually ‘cools down’ as requests become rarer; we can call this the data lifecycle. It would be a good idea to keep the hot publications where they can be fetched quickly.
The most important aspect in the use of distributed systems is the parallel execution of tasks. There is a popular distributed computing model that has the laconic name MapReduce. And it consists of dividing the implementation of the task into two big steps:
- Map – data preprocessing, task formulation and its subsequent transfer to the executing nodes
- Reduce – folding the set of results from the worker nodes into one final answer
This is exactly the mechanism that will help us run operations over the shards. The coordinating node receives a request, first reformulates it for intra-cluster communication, and then issues requests to our worker nodes (in this case, the data nodes).
Therefore, coordinating nodes must have a sufficient memory resource, CPU and fast network, but they can have a small disk, because they do not store data.
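As a rough sketch of this scatter/gather pattern: the coordinating node fans a query out to the shards in parallel (the map step) and then folds the partial answers into one result (the reduce step). The code below is a toy model, not the real intra-cluster protocol:

```python
from concurrent.futures import ThreadPoolExecutor

def search_shard(shard, query):
    """Map step: each shard returns its own partial list of hits."""
    return [doc for doc in shard if query in doc["body"]]

def coordinate(shards, query):
    # Fan the query out to every shard in parallel...
    with ThreadPoolExecutor() as pool:
        partial_results = pool.map(lambda shard: search_shard(shard, query), shards)
    # ...then reduce the partial answers into one final response.
    return [hit for part in partial_results for hit in part]

shards = [
    [{"id": 1, "body": "search is fast"}],
    [{"id": 2, "body": "indexing is slower"}],
]
print(coordinate(shards, "search"))  # [{'id': 1, 'body': 'search is fast'}]
```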
However, under a high request rate such nodes can become a bottleneck. We can go the usual way and turn the single external entry point into a whole plane of entry points: let there be many coordinating nodes.
This approach lets us balance queries across them; the balancing can be done directly in the client code or with any existing load balancer.
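A minimal sketch of such balancing on the client side, assuming we simply know the addresses of several coordinating nodes (the hostnames are made up; official clients can usually rotate over a node list for you):

```python
from itertools import cycle

# Hypothetical addresses of the coordinating nodes.
coordinating_nodes = cycle([
    "http://coord-1:9200",
    "http://coord-2:9200",
    "http://coord-3:9200",
])

def next_endpoint() -> str:
    """Pick the next coordinating node in round-robin order."""
    return next(coordinating_nodes)

print(next_endpoint())  # http://coord-1:9200
print(next_endpoint())  # http://coord-2:9200
```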
Cluster management
At this stage we can access data through many entry points – the coordinating nodes. That is not a problem as long as we are talking about simple reads and writes to an existing index. But once we talk about allocating new shards or moving them around, confusion can begin.
Suppose we let the coordinating nodes manage the cluster state. One node decides to move a shard to one data node, while a second node decides to move the same shard somewhere else. The list of possible cluster-wide actions is quite long, and the list of possible conflicts is even longer.
Obviously, such important decisions should be made by one central node. We determined that for each type of action it is necessary to allocate a separate role in order to avoid performance losses on the node. And the “chief in the cluster” sounds like a separate responsibility.
We will call such nodes master nodes. There should always be exactly one active master; it manages the cluster topology: creates a new index, allocates and distributes shards, moves them and merges them if necessary. The master always knows everything about the state of the cluster.
Data replication
Right now each record in our index exists in only one place, and losing the node that stores it means losing the data for an indefinite period. To avoid this, there is a replication mechanism. It is important not to confuse a replica with a backup: a backup lets you restore data after a loss, while a replica is a full, continuously updated copy of the data.
If we lose one of the data nodes, we can keep working with the replicas of its shards on another node and, in the meantime, bring the lost node back.
That means every shard must have at least one copy on another node. We could, of course, dedicate a separate machine to every replica, but that is very wasteful. Copies of the data must be placed on different nodes, yet this does not mean that those nodes should store nothing but replica shards.
Thus, we always have replicas of all shards and do not raise inefficient idle nodes.
The main shard is called the primary shard, any of its copies is a replica shard, and a primary shard together with its replicas forms a replication group.
With replicas, data is written in two stages: the first write affects only the primary shard, and only after the flush and commit operations in the Lucene index will an internal request be sent to update all the replicas.
For maximum cluster stability the number of data nodes must be at least the number of replicas plus one, since a replica is never placed on the same node as its primary.
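In Elasticsearch the number of primary shards and replicas is configured per index. A sketch of creating such an index over the HTTP API, assuming a node running on localhost:9200 and an index name invented for this example:

```python
import requests

# 3 primary shards, each with 1 replica placed on a different node.
settings = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
    }
}

resp = requests.put("http://localhost:9200/posts", json=settings)
print(resp.json())  # acknowledgement of the index creation
```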
Fault tolerance
Now the data remains available even if one of the storage nodes fails. But what if the cluster loses the master? Losing the only master is equivalent to losing the whole cluster.
Everything is according to the usual pattern – we raise a few masters.
But if we have, for example, two control nodes, how can we understand which of them should currently manage the cluster? How can they agree on their decisions? Obviously, at each moment of time there should be only one node managing the cluster.
That is, if the master is lost, one of the candidates should take its place.
Imagine: the main control node becomes unreachable for the cluster, the cluster takes the first candidate and puts it in the vacant position. After a while the original master returns to the cluster, not knowing that its place has already been taken. The master nodes are a kind of brain of the cluster, and now that brain is split. This is a classic problem of distributed systems and is called the split-brain problem.
In society, such problems are often resolved by voting. A similar mechanism is used in distributed systems. As soon as the cluster loses the control node, the voting process should be started.
It is important to determine which of the candidates is most suitable for the role of the main node. Such a candidate must possess the most relevant information about the cluster. Versioning can be used to briefly describe the relevance of cluster information. Each time the cluster changes, the main node will update some service information and increase the version number, then the same thing will happen in parallel in the candidate nodes.
By comparing the version numbers, we can determine the most suitable candidates for the role of the master. Now, if the fallen master node returns to the cluster, the voting process will start again and a single control node will be selected.
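A toy illustration of that comparison, assuming each candidate reports the version of the cluster state it last saw (node names and versions are invented):

```python
# Hypothetical master candidates and the cluster-state version each one holds.
candidates = [
    {"node": "node-a", "cluster_state_version": 41},
    {"node": "node-b", "cluster_state_version": 43},
    {"node": "node-c", "cluster_state_version": 43},
]

# The most suitable candidates are those with the freshest cluster state.
best_version = max(c["cluster_state_version"] for c in candidates)
eligible = [c["node"] for c in candidates if c["cluster_state_version"] == best_version]
print(eligible)  # ['node-b', 'node-c'] -- one of these will win the vote
```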
Now the important question: when can the vote be considered successful? When all participants have voted? Half of them? Some other magic number?
The solution to this problem is to determine the quorum. This is a smart name for the control number of voters.
Obviously, such an important decision as the choice of a master should be made on the basis of the majority, that is, 50% + one vote. Fair, reliable. This value will become a quorum.
Thus, the number of candidates for the master should be odd and not less than three. It is recommended to use a simple formula to calculate the optimal number of such nodes:
NUMBER_OF_CANDIDATES = TOTAL_NUMBER_OF_NODES / 2 + 1
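The same arithmetic as a tiny function (integer division); this is essentially the majority value that older, pre-7.x Elasticsearch versions asked you to configure by hand as discovery.zen.minimum_master_nodes:

```python
def quorum(master_eligible_nodes: int) -> int:
    """Majority: more than half of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1

print(quorum(3))  # 2
print(quorum(5))  # 3
```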
Decisions about any cluster-wide actions are taken by voting, and all the information needed for voting is kept in the voting configuration. Voting rights are yet another role: having the right to vote does not mean that a node can itself be a candidate.
Now, if the cluster splits into two parts, the nodes in the smaller part, by pinging the nodes they can still reach and comparing their count with the quorum, will realize that it is they who have been cut off from the cluster and must not take part in decision-making.
Transport
It’s time to talk about how external systems communicate with the cluster and how the nodes inside the cluster talk to each other. Both the traditional and the native protocols have their pros and cons; here is a brief comparison.

| Protocol | Advantages | Disadvantages |
| --- | --- | --- |
| HTTP | Low entry threshold compared to the native protocol: all you need is an HTTP client. The HTTP API does not break compatibility, so after upgrading the ES version your application keeps working as before. Proxying and load balancers can be used. JSON in, JSON out. | The client does not know the cluster topology, so it may need more requests to fetch the data. Extra overhead. |
| ES Native | The best choice for VERY big data. If you need to perform a large number of operations on the index, the native protocol speeds things up significantly. | JVM only. Using it couples you tightly to ES. Updates require recompiling and redeploying your clients. Compatibility-breaking updates are possible. |
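As an illustration of the HTTP side, a full-text query is a single JSON request. The sketch below assumes the hypothetical posts index from earlier and a node on localhost:9200:

```python
import requests

# Full-text match query against the "body" field of our posts.
query = {"query": {"match": {"body": "distributed search"}}}

resp = requests.post("http://localhost:9200/posts/_search", json=query)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```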
Conclusion
I would like to believe that after reading this article you understand the basics of distributed search engines. Scalability and resiliency capabilities are what Elasticsearch was created for and why it has gained popularity.
I have tried to explain briefly and consistently how and why it works exactly this way. In this article I deliberately did not touch the Elastic ecosystem, plugins, queries, tokenization, mapping and the rest. I also said nothing about ingest and machine-learning nodes: in my opinion they provide additional features and are not fundamental.