Elasticsearch

Properties

Keywords

  • Document, Index, partition, shards

Installation

Dev-tools

Discover

Create index, field mapping

Loading Data to ES using python

Query data using dev-tools

Query data using python (Create a search engine using python)

  • BM25 algorithms for searching

  • Lucene Software

  • Query Functionalities available - radial search, weighted search

Update data using python

List indices using python

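from elasticsearch import Elasticsearch

# Connect to the cluster. The host is left blank in these notes; fill in your own.
# Newer client versions (8.x) also expect a 'scheme' key such as 'http'.
es = Elasticsearch([{'host': '', 'port': 9200}])

# get_alias() returns a dict keyed by index name
indices = list(es.indices.get_alias().keys())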

Mapping or Schema in ES

Querying in ES

term match

must vs should term search

exact match

radial search

Weighted search on different fields

Lucene Search Engine

Scoring in ES

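  • The scoring function is a mathematical expression that assigns each document a value reflecting how relevant it is to the query relative to the other documents

  • The scoring described here is based on TF-IDF, a weighting scheme widely used in document and information retrieval. It is Lucene's classic similarity; Elasticsearch 5.0 and later use BM25 by default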

$$
\text{score}(q,d) = \text{queryNorm}(q) \cdot \text{coord}(q,d) \cdot \sum_{t \in q} \Big( \text{tf}(t \text{ in } d) \cdot \text{idf}(t)^2 \cdot t.\text{getBoost}() \cdot \text{norm}(t,d) \Big)
$$
  • score (q,d): a relevance score of a document d for query q

  • queryNorm (q): query normalization factor

    • queryNorm = 1 / √sumOfSquaredWeights

    • sumOfSquaredWeights is computed by adding together the IDF of each term in the query, squared

    • The query normalization factor aims to make the scores of different queries comparable. It is calculated once, at the beginning of each query, using the formula above

  • coord (q,d): query coordination factor

    • coord-adjusted score = sum of the matching term scores * number of matching terms / total number of terms in the query

    • In the case of a multi-term query, the coordination factor rewards the documents that contain a higher number of terms of that query.

    • The more query terms appear in the document the more relevance it might have.

    • For simplicity, let’s say you have a query with three terms: “nice,” “red,” and “carpet,” each with a 1.5 score.

    • A document that matches only “nice” and “red” has a summed term score of 3.0 and a final score of 3.0 * 2 / 3 = 2.0. A document that contains all three terms scores 4.5 * 3 / 3 = 4.5, so it ranks well above a document that contains just two of them.

  • tf (t in d): term frequency of the term t in document d

    • tf(t in d) = √frequency

    • frequency is the number of times the term t appears in the document; the more often it appears, the higher the score

  • idf (t): inverse document frequency for term t

    • idf = 1 + ln(numDocs/(docFreq + 1))

    • Inverse document frequency (IDF) assigns a low weight to terms that appear in many of the documents in the index, since common terms say little about relevance

  • t.getBoost(): the boost applied to term t in the query

  • norm (t,d): the field-length norm

    • norm = 1 / √numFieldTerms

    • The value of this parameter depends on the document field length in which a match with the query was found
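
The factors above can be combined in a few lines of Python. This is a minimal sketch of the formulas as written in these notes, not Lucene's actual implementation; the function names, the dictionary-based inputs, and the sample documents are assumptions made for illustration.

```python
import math

def tf(frequency):
    # tf(t in d) = sqrt(frequency)
    return math.sqrt(frequency)

def idf(num_docs, doc_freq):
    # idf = 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_field_terms):
    # norm = 1 / sqrt(numFieldTerms)
    return 1.0 / math.sqrt(num_field_terms)

def query_norm(idf_values):
    # queryNorm = 1 / sqrt(sumOfSquaredWeights)
    return 1.0 / math.sqrt(sum(w * w for w in idf_values))

def coord(num_matching, num_query_terms):
    # fraction of the query terms that the document contains
    return num_matching / num_query_terms

def score(query_terms, doc_term_freq, num_field_terms, num_docs, doc_freq, boosts=None):
    """Score one document field for a query (hypothetical helper).

    doc_term_freq: term -> occurrences in this document field
    doc_freq:      term -> number of documents containing the term
    boosts:        term -> boost, defaulting to 1.0
    """
    boosts = boosts or {}
    idfs = {t: idf(num_docs, doc_freq.get(t, 0)) for t in query_terms}
    matched = [t for t in query_terms if doc_term_freq.get(t, 0) > 0]
    if not matched:
        return 0.0
    term_sum = sum(
        tf(doc_term_freq[t]) * idfs[t] ** 2 * boosts.get(t, 1.0) * field_norm(num_field_terms)
        for t in matched
    )
    return query_norm(idfs.values()) * coord(len(matched), len(query_terms)) * term_sum

# The "nice red carpet" example: two of three terms match, each term scores 1.5
two_term_score = (1.5 + 1.5) * coord(2, 3)

# Made-up index statistics: 10 documents, every term appears in 9 of them
query = ["nice", "red", "carpet"]
stats = {"nice": 9, "red": 9, "carpet": 9}
score_two = score(query, {"nice": 1, "red": 1}, 4, 10, stats)
score_all = score(query, {"nice": 1, "red": 1, "carpet": 1}, 4, 10, stats)
```

With these inputs, score_all comes out higher than score_two, because the document that matches all three terms gains both from the larger sum and from the larger coordination factor.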

Querying the same index twice can return documents in a different order

https://www.elastic.co/guide/en/elasticsearch/reference/current/consistent-scoring.html
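
The page linked above suggests passing a stable string, such as a user or session ID, as the search `preference`, so that repeated searches from the same user are routed to the same shard copies. A minimal sketch of that idea follows; the helper name, the index name "my-index", the field "title", and the session ID are hypothetical, and the `body=` style assumes a client version that accepts it.

```python
def build_search_kwargs(index, query, session_id):
    """Build keyword arguments for Elasticsearch.search() (hypothetical helper).

    A fixed preference string per user keeps repeated searches on the
    same shard copies, which keeps the ordering stable for that user.
    """
    return {
        "index": index,
        "body": {"query": query},
        "preference": session_id,
    }

kwargs = build_search_kwargs(
    "my-index",                          # hypothetical index name
    {"match": {"title": "red carpet"}},  # hypothetical field and text
    "session-1234",                      # hypothetical session ID
)

# With a live cluster and an `es` client as created earlier:
# results = es.search(**kwargs)
```

A custom preference value should not start with an underscore, since that prefix is reserved for the built-in preference options.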

Updating a Document in ES

Reference for Further Reading
