Pages

Learning Elastic Search

Following are the software versions used while learning the below items.

Elastic search Version:  5.4.0
Java Version: 8

 Overview


  • Download and install
  • TFIDF
  • Building an index
  • Adding documents to index, individually and in bulk
  • Search queries - query DSL
  • Analysis of data , aggregations
  • Lucene - Java
  • Distributed - scales to many Nodes
  • Highly available - multiple copies of data
  • Restful APIs - CRUD, monitoring and other operations via simple JSON based HTTP calls
  • Power query DSL - Schemaless
  • Can be installed in machine, as well as cloud instance is available


Download

- download the latest version from www.elastic.co
- unzip and start
- By default it will start as a single node cluster
- cluster and node concept


CRUD operations


  • cURL (https://curl.haxx.se/download.html)
  • create
  • read/retrieve
  • update
  • delete
  • Bluk operations on indexed documents
  • Bulk creation of indices from json data


create a new index called products


curl -XPUT "localhost:9200/products?&pretty"

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "products"
}

  Requests

 curl -XPUT "localhost:9200/customers?&pretty"
 curl -XPUT "localhost:9200/orders?&pretty"

  - check the indices

  Request:

curl -XGET "localhost:9200/_cat/indices?v&pretty"

  Response:
health status index    uuid                       pri rep docs.count docs.deleted store.size pri.store.size
yellow open   products BkpL7fogS0uFYMkzV8TYZA     1   1   0          0            230        230b

Add documents to existing indices

-- Request to add Iphone7 Phone
 curl -XPUT "localhost:9200/products/mobiles/1?pretty" -H'Content-Type: application/json' -d'
 { "name": "Iphone 7", 
  "camera": "12MP", 
  "storage": "256GB", 
  "display": "4.7inch", 
  "battery": "1960mAh", 
  "reviews": ["Incredibly happy after having used it for one week", "Best phone so far", "Very expensive"]
 }
 '


  • here products is the index
  • mobiles is the documentType
  • can pass a documentId (1) to represent this document being created
  • PUT is used for Create or delete
  • Post is used to update


-- Response:
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "result" : "created",
   "_shards" : {
  "total" : 2,
  "successful" : 1,
  "failed" : 0
   },
   "_seq_no" : 0,
   "_primary_term" : 1
 }
-- Request to add Samsung Galaxy Phone
 curl -XPUT "localhost:9200/products/mobiles/2?pretty" -H'Content-Type: application/json' -d'
 { "name": "Samsung Galaxy", 
  "camera": "8MP", 
  "storage": "128GB", 
  "display": "5.2inch", 
  "battery": "1500mAh", 
  "reviews": ["Best phone ever", "Love the screen size", "Awesome"]
 }
 '
-- Request to add Pixel 3
 curl -XPUT "localhost:9200/products/mobiles/3?pretty" -H'Content-Type: application/json' -d'
 { "name": "Pixel 3", 
  "camera": "12.2MP", 
  "storage": "128GB", 
  "display": "5.5inch", 
  "battery": "2950mAh", 
  "reviews": ["I Love the camera on this phone", "Awesome google phone"]
 }
 '
-- Request to add Macbook pro Laptop (Doctype is different)
 curl -XPUT "localhost:9200/products/laptops/1?pretty" -H'Content-Type: application/json' -d'
 { "name": "Macbook Pro", 
  "storage": "500GB", 
  "RAM" : "8GB",
  "display": "13inch", 
  "os": "El capitan", 
  "reviews": ["Size is sleek compared to other laptops", "Storage capacity is great"]
 }
 '

NOTE: 

This request will fail because as of Lucene version 6.x, multiple doc types in a single index is not supported.
{
   "error" : {
  "root_cause" : [
    {
   "type" : "illegal_argument_exception",
   "reason" : "Rejecting mapping update to [products] as the final mapping would have more than 1 type: [mobiles, laptops]"
    }
  ],
  "type" : "illegal_argument_exception",
  "reason" : "Rejecting mapping update to [products] as the final mapping would have more than 1 type: [mobiles, laptops]"
   },
   "status" : 400
 }

Retrieving Documents

-- curl -XGET "localhost:9200/products/mobiles/1?pretty" { "_index" : "products", "_type" : "mobiles", "_id" : "1", "_version" : 1, "_seq_no" : 0, "_primary_term" : 1, "found" : true, "_source" : { "name" : "Iphone 7", "camera" : "12MP", "storage" : "256GB", "display" : "4.7inch", "battery" : "1960mAh", "reviews" : [ "Incredibly happy after having used it for one week", "Best phone so far", "Very expensive" ] } }
-- check if document exist without retrieving the source
 curl -XGET "localhost:9200/products/mobiles/1?pretty&_source=false" 
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "_seq_no" : 0,
   "_primary_term" : 1,
   "found" : true
 }
 -- to fetch certain fields only in the json document
 curl -XGET "localhost:9200/products/mobiles/1?pretty&_source=name,reviews" 
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "_seq_no" : 0,
   "_primary_term" : 1,
   "found" : true,
   "_source" : {
  "reviews" : [
    "Incredibly happy after having used it for one week",
    "Best phone so far",
    "Very expensive"
  ],
  "name" : "Iphone 7"
   }
 }

Update


  • Update document by id
  • Whole document
  • Partial document


-- update of a document can be done via a put request (whole document)
 curl -XPUT "localhost:9200/products/mobiles/1?pretty" -H'Content-Type: application/json' -d'
 {
  "name" : "Iphone 7",
  "camera" : "12MP",
  "storage" : "256GB",
  "display" : "4.7inch",
  "battery" : "1960mAh",
  "reviews" : [
    "Incredibly happy after having used it for one week",
    "Best phone so far",
    "Very expensive",
    "Much better than android phones"
  ]
 }
 '
 Response:
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 2,
   "result" : "updated",
   "_shards" : {
  "total" : 2,
  "successful" : 1,
  "failed" : 0
   },
   "_seq_no" : 3,
   "_primary_term" : 1
 }

  • partial update of a document can be done using the _update endpoint, use the POST command with a doc field

Request: add a new field color in the mobile 1
curl -XPOST "localhost:9200/products/mobiles/1/_update?pretty" -H'Content-Type: application/json' -d'
 {
  "doc": {
   "color": "black"
  }
 }'


  • script field can be used to update a field of a document


-- Request: increment the shoe size by 2
 curl -XPOST "localhost:9200/products/mobiles/1/_update?pretty" -H'Content-Type: application/json' -d'
 {
  "script": "ctx._source.size += 2"
 }'

Deletes


  • delete a document from an index

curl -XDELETE "localhost:9200/products/mobile/1?pretty

  • delete an entire index

curl -XDELETE "localhost:9200/products/mobile?pretty


Bulk operations


  • retrieve multiple documents
  • _mget api allows us to get multiple documents in one command


curl "localhost:9200/_mget?pretty" -d'
 {
  "docs": [
   {
    "_index": "products",
    "_type": "laptops",
    "_id": "1"
   },
   {
    "_index": "products",
    "_type": "laptops",
    "_id": "2"
   }
  ]
 }'

-- If all the documents trying to get is of same index, can be put in the url itself
curl -XGET "localhost:9200/products/mobiles/_mget?pretty" -H'Content-Type: application/json' -d'{"docs": [{"_id": "1"}, {"_id": "2"}]}'

Index multiple documents


  • The _bulk api allows to specify multiple operations in one go.


curl -XPOST "localhost:9200/_bulk?pretty" -H'Content-Type: application/json' -d'
 { "index": {"_index": "products", "_type": "mobiles", "_id": "3" } }
 { "name": "Puma", "size": 9, "color": "black" }
 { "index": {"_index": "products", "_type": "mobiles", "_id": "4" } }
 { "name": "New Balance", "size": 9, "color": "White" }
 '

Multiple operations in one command


  • Multiple operations can be done using the _bulk api.
  • create keyword can be used instead of index, to add a document to the index
  • for create and update operation, one json has to follow with the actual json document to be created or updated.


curl -XPOST "localhost:9200/products/shoes/_bulk?pretty" -H'Content-Type: application/json' -d'
 { "index": { "_id": "3" } }
 { "name": "Puma", "size": 9, "color": "black" }
 { "index": {"_id": "4" } }
 { "name": "New Balance", "size": 8, "color": "White" }
 {"delete": { "_id": "2"}}
 { "create": {"_id": "5" } }
 { "name": "Nike Power", "size": 11, "color": "red" }
 { "update": {"_id": "1" } }
 { "doc": {"color": "orange" }
 '

Bulk index documents from a json file

Searching and filtering

Random json generator: www.json-generator.com

-- Generate 1000 customer data and save it in json format
 Schema:
 [
  '{{repeat(1000, 1000)}}',
  {
   name: '{{firstName()}} {{surname()}}',
   age: '{{integer(18, 75)}}',
   gender: '{{gender()}}',
   email: '{{email()}}',
   phone: '+1 {{phone()}}',
   street: '{{integer(100, 999)}} {{street()}}',
   city: '{{city()}}',
   state: '{{state()}}, {{integer(100, 10000)}}'
  }
 ]

Two context of search


  • Query context
    • Every document has a relevance score which tells how well the document matches the search term
    • Search term can be specified as
      • URL query parameter
      • URL request body
  • use of the _search api

  curl -XGET "localhost:9200/customers/_search?q=wyoming&pretty"
  curl -XGET "localhost:9200/customers/_search?q=wyoming&sort=age:desc&pretty"

from=10
size=2


  • Filter context

curl -XGET "localhost:9200/products/_search?pretty" -d'
 {
  "query": {"match_all": {} },
  "size": 3,
  "from": 2,
  "sort": { "age": { "order": "desc" } }
 }


  • Can search multiple indices

 curl -XGET "localhost:9200/customers,products/_search?pretty"
 curl -XGET "localhost:9200/products/mobiles,laptops/_search?pretty"

  • We can search on fields that we are interested in "term"


curl "localhost:9200/customers/_search?pretty" -d'
 {
  "query": {
   "term": {"name": "gates"}
  }
 }

  • we can append "_source": false in the above request to eliminate the body from the response.
  •  _source field is very powerful and we can even specify regular expressions


{
  "_source": ["st*", "*n*"],
  "query": {
   "term": { "state": "washington"}
  }
 }


  • we can specify to include or exclude some pattern from the source fields

{
  "_source": {
   "includes": ["st*", "*n*"],
   "excludes": [ "*der"]
  },
  "query": {
   "term": { "state": "washington"}
  }
 }

Full text queries


  • match
  • match_phrase
  • match_phrase_prefix

curl "localhost:9200/customers/_search?pretty" -d'
 {
  "query": {
   "match": {
    "name": "webster"
   }
  }
 }'
 -- above match keyword can be used to perform not an exact term match, but other ways also (other parameters)
 
 {
  "query": {
   "match": {
    "name": {
     "query": "frank morris",
     "operator": "or"
    }
   }
  }
 }
 -- logical OR matches , all documents having frank or morris in the name field 
 -- default operator is OR 

{
  "query": {
   "match_phrase": {
    "name": "frank morris"
   }
  }
 }
 -- entire phrase has to match 

 {
  "query": {
   "match_phrase_prefix": {
    "name": "fr"
   }
  }
 }
 --  all names that begins with the prefix fr 
 -- this can be used as autocomplete

TFIDF

{
 "common": {
  "reviews": {
   "query": "this is great",
   "cutoff_frequency": 0.001
  }
 }
}

  • some of the terms in the query may be common words (stop words). treat any word with frequency > 0.1% as common word while searching


Compound queries


  • Boolean query
    • Matches documents by combining multiple queries using boolean operators such as AND, OR



  • Must clause

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "must": [
      {"match": { "street": "ditmas" } },
      {"match": { "street": "avenue" } }
     ]
    }
   }
  }
  '

  • Should clause

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "should": [
      {"match": { "street": "ditmas" } },
      {"match": { "street": "avenue" } }
     ]
    }
   }
  }
  '


  • must_not clause

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "must_not": [
      {"match": { "state": "california texas" } },
      {"match": { "street": "lane street" } }
     ]
    }
   }
  }
  '

  • filter clause


Term queries

The exact term needs to be found in inverted index for indexed documents
The terms found in the index may vary based on how you analyze them


  • simple term queries

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "should": [
      {"term": { "state": {"value": "california"} } },
      {"term": { "street": {"value": "idaho"} } }
     ]
    }
   }
  }
  '


  • Boost some terms over others

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "should": [
      {
       "term": { 
          "state": 
           {
            "value": "california",
            "boost": 2.0
           } 
         } 
      },
      {
       "term": 
       { 
        "street": 
        {
         "value": "idaho"
        } 
       } 
      }
     ]
    }
   }
  }
  '

Filters


  • the documents in the result are not scored.
  • just checks if the document should be included in the result or not.


-- the most common filter is the range filter
-- term and filters could be combined
curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "must": { "match_all": {} },
     "filter": [
      {
       "term": { 
        "gender": "female"
       }
      },
      {
       "range": {
        "age": {
         "gte": 20,
         "lte": 30
        }
       }
      }
     ]
    }
   }
  }
  '


Analytics and Aggregations


  • Different kind of aggregations that can be performed
  •  Implement queries for metrics and bucketing aggregations
  • Work with multi level nesting of aggregations


Four kind of Aggregations


  1. Metric
  2. Bucketing
  3. Matrix
  4. Pipeline



 Metric Aggregations


  • Aggregations over a set of documents
  •  All document in a search result
  •  Document within a logical group


Bucketing Aggregations


  • Logically group documents based on search query
  • A document falls into a bucket if the criteria matches
  • Each bucket associated with a key


Matrix Aggregations


  • Operates on multiple fields and produces a matrix result
  • Experimental and may change in the future releases
  • Not covered


Pipeline Aggregations


  • Aggregations tht work on the output of other aggregations
  • Experimental and may change in the future releases
  • Not covered


Metric Aggregations


  • numeric aggregations like sum, average, count, min, etc
  • multi value stats aggregations



  • aggregations are done by using the same _search api
  • aggregations are done by using aggs keyword in the request body
  • provide a name that you want to be assigned to the result - "avg_age"
  • avg is the keyword for average aggregations
  • field keyword specifies the field over which this aggregation is going to be performed
  • size = 0, means we do not want any documents to be returned, we just want the final aggregate value


curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "avg_age": {
    "avg": {
     "field": "age"
    }
   }
  }
 }
 '


  • metric aggregations become more powerful when combined with search or filter queries
  • the below query calcualtes the average age of all the customers who live in minnesota

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "query": {
   "bool": {
    "filter": {
     "match": { "state": "minnesota"}
    }
   }
  },
  "aggs": {
   "avg_age": {
    "avg": {
     "field": "age"
    }
   }
  }
 }
 '


  • elastic search can also calculate a whole range of statistics in one go
  • specify the "stats" aggregation keyword within the "aggs" field
  • "age_stats" is the field name that will appear in the response
  • "stats" calculates the count, min, max, avg, sum of the age field


curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "age_stats": {
    "stats": {
     "field": "age"
    }
   }
  }
 }
 '

Cardinality 


  • the number of unique values in a field across all documents
  • enabling cardinality aggregations on text fields require some special setup for the field data

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "age_count": {
    "cardanality": {
     "field": "age"
    }
   }
  }
 }
 '
-- since age is an integer value, the above query will directly work.
-- for text field, the above query will not work by default
-- have to enable fieldData for the text field

curl -XPUT "localhsot:9200/customers/_mapping/personal?pretty" -d'
  {
   "properties": {
    "gender": {
     "type": "text", 
     "fielddata": true
    }
   }
  }
 '
-- now you can run cardanality aggregation on the gender field
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_count": {
    "cardanality": {
     "field": "gender"
    }
   }
  }
 }

Bucketing


  • similar to the GROUP BY operation in sql


curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_bucket": {
    "terms": {
     "field": "gender"
    }
   }
  }
 }
 '

-- we can also bucket by range
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "age_range": {
    "range": {
     "field": "age",
     "ranges": [
      { "to": 30},
      { "from": 30, "to": 40},
      { "from": 40, "to": 55},
      { "from": 55 }
     ]
    }
   }
  }
 }
 '

  • "keyed": true can be specified which changes the way the response is returned,
  • also can specify key in the ranges


Multi level nested aggregations


  • example of a metric aggregation nested inside a bucketing aggregation
  • returns the average age of males and females

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_bucket": {
    "terms": {
     "field": "gender"
    },
    "aggs": {
     "average_age": {
      "avg": {
       "field": "age"
      }
     }
    }
   }
  }
 }
 '

- multi layer nesting of aggregations
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_bucket": {
    "terms": {
     "field": "gender"
    },
    "aggs": {
     "age_ranges": {
      "range": {
       "field": "age",
       "keyed": true,
       "ranges": [
        { "key": "young", "to": 30},
        { "key": "middle-aged","from": 30, "to": 55},
        { "key": "senior","from": 55 }
       ]
      },
      "aggs": {
       "average_age": {
        "avg": {
         "field": "age"
        }
       }
      }
     }
     
    }
   }
  }
 }
 '

Filter aggregation and filters keyword

- average age of customers from the state of texas
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "state": {
    "filter": { "term": { "state": "texas" } },
    "aggs": {
     "average_age": {
      "avg": {
       "field": "age"
      }
     }
    }
   }
  }
 }
 '

-- you can use multiple filters instead of just one "filters" keyword
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "state": {
    "filters": {
     "filters": {
      "washington" : { "match": { "state": "washington" } },
      "north carolina" : { "match": { "state": "north carolina" } },
      "south dakota" : { "match": { "state": "south dakota" } }
     }
    },
    "aggs": {
     "average_age": {
      "avg": {
       "field": "age"
      }
     }
    }
   }
  }
 }
 '

No comments:

Post a Comment

If you like to say anything (good/bad), Please do not hesitate...