- Node - a server that stores data
- Cluster - a collection of nodes
- Index - a collection of similar documents
A shard is a subset of the index data.
Sharding solves the problem of an index whose size exceeds the hardware limits of a single node.
The default number of shards in an index is 5 (since Elasticsearch 7.0, the default is 1).
After you create the index you can't change the number of shards.
The purpose of having replicas in ES:
- High availability (in case a shard or node fails)
- Increase performance
The default number of replicas is 1 per shard.
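A minimal sketch of where these settings live, assuming a hypothetical index named `products`: the shard count is fixed at creation time, while the replica count can be changed later.

```python
# Hypothetical create-index body (index name "products" is made up).
# With the official Python client this would be passed to
# es.indices.create(index="products", body=settings).
settings = {
    "settings": {
        "number_of_shards": 5,    # fixed once the index is created
        "number_of_replicas": 1,  # can be updated on a live index
    }
}
print(settings)
```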
Routing - handled automatically by default. Ensures that documents are distributed evenly across shards.
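Routing can be sketched as a hash of the routing value (the document `_id` by default) modulo the number of primary shards. Elasticsearch uses murmur3; the md5 below is just a deterministic stand-in.

```python
import hashlib

def shard_for(routing_value: str, num_primary_shards: int) -> int:
    """Toy version of Elasticsearch's routing formula:
    shard = hash(_routing) % number_of_primary_shards.
    ES uses murmur3; md5 here is only a deterministic stand-in."""
    h = int(hashlib.md5(routing_value.encode()).hexdigest(), 16)
    return h % num_primary_shards

# By default the routing value is the document _id.
print(shard_for("doc-1", 5))
# Because the shard count is part of the formula, changing it would
# re-route existing documents - which is why the number of shards
# can't be changed after the index is created.
```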
Mapping defines how documents and their fields are stored and indexed. A field can have multiple mappings.
Data types can be divided into 4 categories:
- Core Data Types
- Complex Data Types
- Geo Data Types
- Specialized Data Types
- Text Data Type - used to index full-text values (e.g. descriptions)
- Keyword Data Type - used for structured data (e.g. tags, categories). Typically used for filtering and aggregations.
- Numeric Data Type
- Date Data Type
- Boolean Data Type
- Binary Data Type
- Range Data Type
- Object Data Type
- Array Data Type
- Nested Data Type
- Geo-Point Data Type
- Geo-Shape Data Type
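The data types above can be combined in one mapping; a hypothetical example (field names are made up for illustration):

```python
# Hypothetical mapping body combining several of the data types above.
mapping = {
    "mappings": {
        "properties": {
            "description": {"type": "text"},      # full-text, analyzed
            "tags":        {"type": "keyword"},   # exact values, not analyzed
            "price":       {"type": "float"},     # numeric
            "created_at":  {"type": "date"},
            "in_stock":    {"type": "boolean"},
            "location":    {"type": "geo_point"},
            "variants":    {"type": "nested"},    # objects queried independently
        }
    }
}
print(mapping)
```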
New Document -> Analysis -> Store Document
keyword fields DO NOT go through this process; text fields do.
You can control which analyzer to use. Results are added to the inverted index.
There's one inverted index per text field. It allows ES to perform full-text searches efficiently. It's essentially a mapping from each of the field's terms to the documents that contain that term.
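The idea can be sketched in a few lines - a toy inverted index with a trivial lowercase/split "analysis" step standing in for the real analyzer:

```python
from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    """Toy inverted index: maps each term to the set of document ids
    that contain it (after a trivial lowercase/split 'analysis')."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {"1": "The quick brown fox", "2": "The lazy dog"}
index = build_inverted_index(docs)
print(index["the"])  # both documents contain "the"
```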
The process has 3 steps:

| 1. Character Filter | 2. Tokenizer | 3. Token Filter |
|---|---|---|
| Can add, remove, or change characters. | Splits the text into tokens and removes punctuation (e.g. the standard tokenizer). | May add, change, or remove tokens (e.g. the lowercase, synonym, stemmer, and stop token filters; the stop filter removes words like `and`, `at`, `the`). |
Only the lowercase filter is enabled by default.
The Standard Analyzer removes punctuation and lowercases words. Optionally, the stop token filter can be enabled.
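A toy imitation of the three stages, assuming a simplified stop-word list (the real analyzer's tokenizer and filters are more sophisticated):

```python
import re

STOP_WORDS = {"and", "at", "the"}  # simplified list for illustration

def toy_standard_analyzer(text: str, remove_stop_words: bool = False) -> list:
    """Toy imitation of the standard analyzer's pipeline:
    character filter (none here), tokenizer (split on non-word
    characters, dropping punctuation), then token filters
    (lowercase, plus the optional stop filter)."""
    tokens = [t for t in re.split(r"\W+", text) if t]  # tokenizer
    tokens = [t.lower() for t in tokens]               # lowercase filter
    if remove_stop_words:                              # stop filter (off by default)
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

print(toy_standard_analyzer("The QUICK fox, and the dog!"))
```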
Term Frequency / Inverse Document Frequency (TF/IDF)
- Term Frequency - the more times a term appears in a field for a given document, the more relevant the document is.
- Inverse Document Frequency - how often the term appears within the index (across all documents). The logic here is that if a term appears in a lot of documents it has a lower weight; words that appear in many documents have less significance (e.g. `this`, `the`). If a document contains the term and the term is not frequent in the index, it's a signal that the document is relevant.
- Field-Length Norm - the longer the field, the less relevant a term in it (e.g. `nature` in a 50-character `title` is more relevant than `nature` in a 1,000-character `description`). A term appearing in a short field has more weight than a term appearing in a long field.

These 3 values are calculated and stored at index time (when a document is added or updated). They are used to calculate the weight of a given term for a particular document.
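The three factors can be combined in a toy scoring function - this is not Lucene's exact formula, just a sketch of how the factors interact:

```python
import math

def toy_tf_idf(term_freq: int, docs_with_term: int,
               total_docs: int, field_len: int) -> float:
    """Toy combination of the three TF/IDF factors described above
    (not Lucene's exact formula)."""
    tf = math.sqrt(term_freq)                          # term frequency
    idf = 1 + math.log(total_docs / (docs_with_term + 1))  # inverse doc freq
    norm = 1 / math.sqrt(field_len)                    # field-length norm
    return tf * idf * norm

# Same term statistics, shorter field -> higher weight:
print(toy_tf_idf(3, 10, 1000, 50) > toy_tf_idf(3, 10, 1000, 1000))
```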
BM25 improves on TF/IDF:
- Handles stop words better. Although the value of stop words is limited, they do have some value, and it's no longer common/recommended to remove them. In the TF/IDF algorithm, stop words are artificially boosted in longer fields (where they tend to appear more often, e.g. `description`). To solve this problem, BM25 uses Nonlinear Term Frequency Saturation: there's an upper limit on how much a term can be boosted based on how many times it appears, so as the number of appearances increases, each additional occurrence contributes less to the relevance score.
- Improves the field-length norm factor. Instead of treating a field in the same way across all documents, it takes the average field length into consideration.
- Can be configured with parameters (`k1` and `b`).
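The saturation and field-length behavior can be seen in BM25's term-frequency component (using the standard `k1` and `b` defaults):

```python
def bm25_tf(term_freq: float, field_len: float, avg_field_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """BM25's term-frequency component. As term_freq grows, the score
    approaches an upper bound of k1 + 1 (nonlinear saturation);
    field_len is normalized against the average field length."""
    norm = 1 - b + b * (field_len / avg_field_len)
    return term_freq * (k1 + 1) / (term_freq + k1 * norm)

# Each extra occurrence adds less than the one before (saturation):
print(bm25_tf(1, 100, 100), bm25_tf(10, 100, 100), bm25_tf(100, 100, 100))
```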
Query context - how well do the documents match? Affects the relevance score.
Filter context - do the documents match? A boolean yes/no evaluation; ES can cache filters.
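A hypothetical bool query showing both contexts (field names are made up): the `match` clause runs in query context and contributes to the score, while the `term` and `range` clauses run in filter context - yes/no only, and cacheable.

```python
# Hypothetical bool query mixing query context and filter context.
query = {
    "query": {
        "bool": {
            "must": [  # query context: scored
                {"match": {"description": "wireless headphones"}}
            ],
            "filter": [  # filter context: yes/no, cacheable
                {"term": {"status": "active"}},
                {"range": {"price": {"lte": 100}}}
            ]
        }
    }
}
print(query)
```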
Term-level queries search for exact matches (case sensitive - the query is not analyzed). Better suited for matching enums, numbers, and dates than sentences.
Range queries are used with number or date fields. See the Date Math docs.
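A hypothetical range query using date math (`now-1y/d` means one year ago, rounded down to the day; the `created_at` field name is illustrative):

```python
# Hypothetical range query: documents from the last year.
query = {
    "query": {
        "range": {
            "created_at": {
                "gte": "now-1y/d",  # date math: a year ago, rounded to the day
                "lte": "now"
            }
        }
    }
}
print(query)
```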
Full-text queries are analyzed using the analyzer defined for the search field, or the standard analyzer if none is defined.
Match query - a boolean query (the default operator is OR). The query goes through the analyzer specified in the mapping.
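Two hypothetical match queries illustrating the operator (the `title` field name is made up): with the default OR, a document matches if it contains any of the analyzed terms; `"operator": "and"` requires all of them.

```python
# Default operator (OR): matches docs containing ANY of the terms.
match_or = {"query": {"match": {"title": "quick brown fox"}}}

# Explicit AND: matches only docs containing ALL of the terms.
match_and = {
    "query": {
        "match": {
            "title": {"query": "quick brown fox", "operator": "and"}
        }
    }
}
print(match_or, match_and)
```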
Should Query - its behavior depends on the bool query as a whole and what else is in it:
- If the bool query is in a query context and contains a `must` or `filter` clause, the should queries don't need to match for the bool query to match; their only purpose is to influence the relevance score of the matching documents.
- If the bool query is in a filter context, or if it doesn't have a `must` or `filter` clause, at least one of the should queries must match.