# Elasticsearch

Properties

Keywords

* Document, index, partition, shard

Installation

Dev-tools

Discover

Create index, field mapping

Loading data into ES using Python

Query data using dev-tools

Query data using Python (create a search engine using Python)

* BM25 algorithm for searching
* Lucene software
* Query functionalities available: radial search, weighted search

### Update data using Python

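List indices using Python

    from elasticsearch import Elasticsearch

    # The host is left blank in these notes; fill in your own node address.
    es = Elasticsearch([{'host': '', 'port': 9200}])

    # get_alias() returns a dict keyed by index name.
    indices = es.indices.get_alias().keys()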

## Mapping or Schema in ES

## Querying in ES

term match

must vs should term search

exact match

radial search

Weighted search on different fields
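
The query types listed above could look roughly like the following request bodies, written as Python dicts. This is a sketch, not a tested setup: the field names (`title`, `body`, `status`, `location`) and all values are hypothetical, "radial search" is assumed to mean a `geo_distance` filter on a `geo_point` field, and the helper `as_search_body` is invented here for illustration.

```python
# All field names and values here are hypothetical placeholders.

# Full-text match: the query text is analyzed before matching.
match_query = {"match": {"title": "red carpet"}}

# Exact match: term queries are not analyzed, so they suit keyword fields.
exact_query = {"term": {"status": "published"}}

# must clauses are required; should clauses are optional but raise the score.
bool_query = {
    "bool": {
        "must": [{"match": {"title": "carpet"}}],
        "should": [{"match": {"title": "red"}}],
    }
}

# Radial search, read here as a geo_distance filter on a geo_point field.
radial_query = {
    "bool": {
        "filter": {
            "geo_distance": {
                "distance": "10km",
                "location": {"lat": 40.71, "lon": -74.0},
            }
        }
    }
}

# Weighted search: "title^3" boosts matches in the title field by 3.
weighted_query = {
    "multi_match": {"query": "red carpet", "fields": ["title^3", "body"]}
}


def as_search_body(query, size=10):
    """Wrap a query clause in a search request body (illustrative helper)."""
    return {"query": query, "size": size}


body = as_search_body(weighted_query, size=5)
```

The same dicts can be pasted as JSON into dev-tools, or sent from Python as the request body of a search call against a live cluster.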

## Lucene Search Engine

## Scoring in ES

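* The scoring function is a mathematical expression that assigns each document a value reflecting its relevance to the query relative to the other documents
* The formula below is based on TF-IDF, a weighting scheme widely used in document and information retrieval. It is Lucene's classic practical scoring function; Elasticsearch 5.0 and later default to BM25, which refines the same ideas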

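$$
\mathrm{score}(q,d) = \mathrm{queryNorm}(q) \cdot \mathrm{coord}(q,d) \cdot \sum_{t \in q} \left( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t,d) \right)
$$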

![](https://679135566-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-Ma8GKQU92FWew4_578V%2F-MfWrPixbvm3LmH19xj1%2F-MgLedXQBeSUo8JvBY9w%2Fimage.png?alt=media\&token=273d29e5-7756-46fa-9162-a91d01a1c26e)

* `score(q,d)`: the relevance score of document `d` for query `q`
* `queryNorm(q)`: query normalization factor
  * queryNorm = 1 / √sumOfSquaredWeights
  * `sumOfSquaredWeights` is the sum of the squared IDF of each term in the query
  * The query normalization factor aims to make the scores of different queries comparable. It is computed once at the start of each query using the formula above
* `coord(q,d)`: query coordination factor
  * coord = number of matching terms / total number of terms in the query. The summed term scores are multiplied by this factor.
  * In a multi-term query, the coordination factor rewards documents that contain more of the query terms, since such documents are more likely to be relevant.
  * For simplicity, say a query has three terms, “nice,” “red,” and “carpet,” and each contributes a score of 1.5.
  * A document that matches only “nice red” has a summed term score of 1.5 + 1.5 = 3.0, so its score is 3.0 \* 2 / 3 = 2.0. A document that contains all three terms scores 4.5 \* 3 / 3 = 4.5, so it ranks well above one that contains just two.
* `tf(t in d)`: term frequency of term `t` in document `d`
  * tf(t in d) = √frequency
  * `frequency` is the number of times the term appears in the document
* `idf(t)`: inverse document frequency of term `t`
  * idf = 1 + ln(numDocs / (docFreq + 1))
  * IDF assigns a low weight to terms that appear in many of the documents in the index, and a high weight to rare terms
* `t.getBoost()`: the boost applied to term `t` in the query
* `norm(t,d)`: the field-length norm
  * norm = 1 / √numFieldTerms
  * The value depends on the length of the document field in which the match was found: the shorter the field, the higher the norm
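
The components above can be sketched in plain Python to see how they combine. This is a simplified illustration under stated assumptions, not Lucene's implementation: documents are plain lists of tokens, a single boost is applied to every term, norms are not quantized, and all function names and the sample data are made up for this example.

```python
import math


def idf(num_docs, doc_freq):
    # idf = 1 + ln(numDocs / (docFreq + 1))
    return 1 + math.log(num_docs / (doc_freq + 1))


def tf(frequency):
    # tf = sqrt(frequency)
    return math.sqrt(frequency)


def field_norm(num_field_terms):
    # norm = 1 / sqrt(numFieldTerms)
    return 1 / math.sqrt(num_field_terms)


def query_norm(idfs):
    # queryNorm = 1 / sqrt(sum of squared IDFs of the query terms)
    return 1 / math.sqrt(sum(v ** 2 for v in idfs))


def coord(num_matching, num_query_terms):
    # Fraction of the query terms that the document contains.
    return num_matching / num_query_terms


def score(query_terms, doc_terms, doc_freqs, num_docs, boost=1.0):
    """Classic TF-IDF score of one tokenized document for one query."""
    idfs = {t: idf(num_docs, doc_freqs.get(t, 0)) for t in query_terms}
    matching = [t for t in query_terms if t in doc_terms]
    norm = field_norm(len(doc_terms))
    total = sum(
        tf(doc_terms.count(t)) * idfs[t] ** 2 * boost * norm
        for t in matching
    )
    return query_norm(idfs.values()) * coord(len(matching), len(query_terms)) * total


# Hypothetical corpus statistics: 10 documents, each term found in 4 of them.
query = ["nice", "red", "carpet"]
doc_freqs = {"nice": 4, "red": 4, "carpet": 4}
full_match = score(query, ["nice", "red", "carpet"], doc_freqs, num_docs=10)
partial_match = score(query, ["nice", "red", "sofa"], doc_freqs, num_docs=10)
```

With these inputs `full_match` comes out higher than `partial_match`, because the partial match loses both one term's contribution and a third of its coordination factor. The coordination example from the list is `coord(2, 3) * 3.0`, which evaluates to approximately 2.0.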

![](https://679135566-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-Ma8GKQU92FWew4_578V%2F-MfWrPixbvm3LmH19xj1%2F-MgLebZEW698cGK4Ine1%2Fimage.png?alt=media\&token=04f2ae5f-48b5-4ba7-a76f-4eab67c4fd0d)

![](https://679135566-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-Ma8GKQU92FWew4_578V%2F-MfWrPixbvm3LmH19xj1%2F-MgLeb0I_gFl42egANjA%2Fimage.png?alt=media\&token=c558bc94-b1d2-4952-adba-2a5148508e0c)

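#### Querying the same index twice can return documents in a different order

Elasticsearch can route each search to any available copy of a shard (primary or replica), and the copies can hold slightly different index statistics, for example because deleted documents have not yet been merged away on one of them. The same query can therefore produce slightly different scores and ordering from one run to the next. See <https://www.elastic.co/guide/en/elasticsearch/reference/current/consistent-scoring.html>.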
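
One mitigation described on the linked page is to send a `preference` string with each search, so that repeated searches from the same user are served by the same shard copies. A minimal sketch, assuming the 7.x-style Python client used earlier on this page; the index name `articles`, the session id `user-42`, and the helper `search_kwargs` are hypothetical.

```python
def search_kwargs(index, query, session_id):
    """Build keyword arguments for a search call (illustrative helper)."""
    return {
        "index": index,
        "body": {"query": query},
        # Any stable string works, for example a user or session id.
        "preference": session_id,
    }


kwargs = search_kwargs("articles", {"match": {"title": "red carpet"}}, "user-42")
# With a live cluster and a client `es`: es.search(**kwargs)
```

Using the same `preference` value for the same user keeps that user's result ordering stable, while different users can still be spread across shard copies.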

## Updating a Document in ES

## Reference for Further Reading

* [qbox scoring and relevancy](https://qbox.io/blog/practical-guide-elasticsearch-scoring-relevancy)
* [Coralogix: 42 Elasticsearch query examples, hands-on tutorial](https://coralogix.com/blog/42-elasticsearch-query-examples-hands-on-tutorial/)
