Computer Freaks: Learning Elastic Search

The following software versions were used while working through the items below.

Elasticsearch version: 5.4.0
Java version: 8

Overview

Download and install
TFIDF
Building an index
Adding documents to index, individually and in bulk
Search queries - query DSL
Analysis of data, aggregations
Built on Lucene (Java)
Distributed - scales to many nodes
Highly available - multiple copies of data
RESTful APIs - CRUD, monitoring and other operations via simple JSON-based HTTP calls
Powerful query DSL - schemaless
Can be installed on a local machine; a hosted cloud instance is also available

Download

- download the latest version from www.elastic.co
- unzip and start
- By default it will start as a single node cluster
- cluster and node concept

CRUD operations

cURL (https://curl.haxx.se/download.html)
create
read/retrieve
update
delete
Bulk operations on indexed documents
Bulk creation of indices from json data

create a new index called products

curl -XPUT "localhost:9200/products?&pretty"

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "products"
}

Requests

 curl -XPUT "localhost:9200/customers?&pretty"
 curl -XPUT "localhost:9200/orders?&pretty"

- check the indices

Request:

curl -XGET "localhost:9200/_cat/indices?v&pretty"

Response:

health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   products BkpL7fogS0uFYMkzV8TYZA   1   1          0            0       230b           230b

Add documents to existing indices

-- Request to add Iphone7 Phone
 curl -XPUT "localhost:9200/products/mobiles/1?pretty" -H'Content-Type: application/json' -d'
 { "name": "Iphone 7", 
  "camera": "12MP", 
  "storage": "256GB", 
  "display": "4.7inch", 
  "battery": "1960mAh", 
  "reviews": ["Incredibly happy after having used it for one week", "Best phone so far", "Very expensive"]
 }
 '

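here products is the index
mobiles is the document type
1 is the document id, passed in the URL to identify the document being created
PUT creates the document, or replaces it if a document with that id already exists
POST is used for partial updates (via the _update endpoint) and for creating documents with an auto-generated id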

-- Response:
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "result" : "created",
   "_shards" : {
     "total" : 2,
     "successful" : 1,
     "failed" : 0
   },
   "_seq_no" : 0,
   "_primary_term" : 1
 }

-- Request to add Samsung Galaxy Phone
 curl -XPUT "localhost:9200/products/mobiles/2?pretty" -H'Content-Type: application/json' -d'
 { "name": "Samsung Galaxy", 
  "camera": "8MP", 
  "storage": "128GB", 
  "display": "5.2inch", 
  "battery": "1500mAh", 
  "reviews": ["Best phone ever", "Love the screen size", "Awesome"]
 }
 '

-- Request to add Pixel 3
 curl -XPUT "localhost:9200/products/mobiles/3?pretty" -H'Content-Type: application/json' -d'
 { "name": "Pixel 3", 
  "camera": "12.2MP", 
  "storage": "128GB", 
  "display": "5.5inch", 
  "battery": "2950mAh", 
  "reviews": ["I Love the camera on this phone", "Awesome google phone"]
 }
 '

-- Request to add Macbook pro Laptop (Doctype is different)
 curl -XPUT "localhost:9200/products/laptops/1?pretty" -H'Content-Type: application/json' -d'
 { "name": "Macbook Pro", 
  "storage": "500GB", 
  "RAM" : "8GB",
  "display": "13inch", 
  "os": "El capitan", 
  "reviews": ["Size is sleek compared to other laptops", "Storage capacity is great"]
 }
 '

NOTE:

This request will fail because, as of Elasticsearch 6.x, an index can contain only one document type.

 {
   "error" : {
     "root_cause" : [
       {
         "type" : "illegal_argument_exception",
         "reason" : "Rejecting mapping update to [products] as the final mapping would have more than 1 type: [mobiles, laptops]"
       }
     ],
     "type" : "illegal_argument_exception",
     "reason" : "Rejecting mapping update to [products] as the final mapping would have more than 1 type: [mobiles, laptops]"
   },
   "status" : 400
 }


Retrieving Documents
-- retrieve a document by its id
 curl -XGET "localhost:9200/products/mobiles/1?pretty"
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "_seq_no" : 0,
   "_primary_term" : 1,
   "found" : true,
   "_source" : {
     "name" : "Iphone 7",
     "camera" : "12MP",
     "storage" : "256GB",
     "display" : "4.7inch",
     "battery" : "1960mAh",
     "reviews" : [
       "Incredibly happy after having used it for one week",
       "Best phone so far",
       "Very expensive"
     ]
   }
 }

-- check whether a document exists without retrieving its source
 curl -XGET "localhost:9200/products/mobiles/1?pretty&_source=false" 
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "_seq_no" : 0,
   "_primary_term" : 1,
   "found" : true
 }
 -- fetch only certain fields of the JSON document
 curl -XGET "localhost:9200/products/mobiles/1?pretty&_source=name,reviews" 
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "_seq_no" : 0,
   "_primary_term" : 1,
   "found" : true,
   "_source" : {
     "reviews" : [
       "Incredibly happy after having used it for one week",
       "Best phone so far",
       "Very expensive"
     ],
     "name" : "Iphone 7"
   }
 }

Update

Update document by id
Whole document
Partial document

-- a whole document can be updated (replaced) with a PUT request to the same id; its _version is incremented
 curl -XPUT "localhost:9200/products/mobiles/1?pretty" -H'Content-Type: application/json' -d'
 {
  "name" : "Iphone 7",
  "camera" : "12MP",
  "storage" : "256GB",
  "display" : "4.7inch",
  "battery" : "1960mAh",
  "reviews" : [
    "Incredibly happy after having used it for one week",
    "Best phone so far",
    "Very expensive",
    "Much better than android phones"
  ]
 }
 '
 Response:
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 2,
   "result" : "updated",
   "_shards" : {
     "total" : 2,
     "successful" : 1,
     "failed" : 0
   },
   "_seq_no" : 3,
   "_primary_term" : 1
 }

A partial update of a document can be done using the _update endpoint: send a POST request whose body carries the changed fields inside a doc field.

Request: add a new field color to mobile 1

curl -XPOST "localhost:9200/products/mobiles/1/_update?pretty" -H'Content-Type: application/json' -d'
 {
  "doc": {
   "color": "black"
  }
 }'
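
To make the effect of a partial update concrete, here is a minimal Python sketch of the merge that the doc field describes. It only imitates the behaviour locally and does not talk to Elasticsearch; the function name apply_partial_update is made up for this illustration. The assumption is that nested objects are merged key by key while any other value (including a list) replaces the old one.

```python
import copy


def apply_partial_update(source, doc):
    """Return a new document with `doc` merged into `source`.

    Nested objects are merged key by key; any other value
    (including a list) replaces the existing value.
    """
    merged = copy.deepcopy(source)
    for key, value in doc.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_partial_update(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# The stored document (shortened) and the partial update from the request above.
phone = {"name": "Iphone 7", "camera": "12MP", "storage": "256GB"}
updated = apply_partial_update(phone, {"color": "black"})
print(updated)
```

The existing fields are kept and only color is added, which is the difference from a PUT of the whole document.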

The script field can be used to update a field of a document based on its current value.

-- Request: increment the size field by 2 (the document must already have a numeric size field)
 curl -XPOST "localhost:9200/products/mobiles/1/_update?pretty" -H'Content-Type: application/json' -d'
 {
  "script": "ctx._source.size += 2"
 }'

Deletes

delete a document from an index

curl -XDELETE "localhost:9200/products/mobiles/1?pretty"

delete an entire index

curl -XDELETE "localhost:9200/products?pretty"

Bulk operations

retrieve multiple documents
_mget api allows us to get multiple documents in one command

curl "localhost:9200/_mget?pretty" -H'Content-Type: application/json' -d'
 {
  "docs": [
   {
    "_index": "products",
    "_type": "mobiles",
    "_id": "1"
   },
   {
    "_index": "products",
    "_type": "mobiles",
    "_id": "2"
   }
  ]
 }'

-- If all the requested documents are in the same index and type, the index and type can be put in the URL itself

curl -XGET "localhost:9200/products/mobiles/_mget?pretty" -H'Content-Type: application/json' -d'{"docs": [{"_id": "1"}, {"_id": "2"}]}'

Index multiple documents

The _bulk API allows us to specify multiple operations in one request.

curl -XPOST "localhost:9200/_bulk?pretty" -H'Content-Type: application/json' -d'
 { "index": {"_index": "products", "_type": "mobiles", "_id": "3" } }
 { "name": "Puma", "size": 9, "color": "black" }
 { "index": {"_index": "products", "_type": "mobiles", "_id": "4" } }
 { "name": "New Balance", "size": 9, "color": "White" }
 '

Multiple operations in one command

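Different kinds of operations can be combined in a single _bulk request.
The create keyword can be used instead of index to add a document; unlike index, create fails if a document with that id already exists.
For index, create and update operations, the action line has to be followed by a line with the actual JSON document (or partial doc) to be created or updated; delete takes no such line.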

curl -XPOST "localhost:9200/products/shoes/_bulk?pretty" -H'Content-Type: application/json' -d'
 { "index": { "_id": "3" } }
 { "name": "Puma", "size": 9, "color": "black" }
 { "index": {"_id": "4" } }
 { "name": "New Balance", "size": 8, "color": "White" }
 {"delete": { "_id": "2"}}
 { "create": {"_id": "5" } }
 { "name": "Nike Power", "size": 11, "color": "red" }
 { "update": {"_id": "1" } }
 { "doc": {"color": "orange" } }
 '
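
Because a _bulk body is line-oriented, a single malformed line is easy to miss. The following is a rough Python sketch of a local sanity check, not part of Elasticsearch itself: the function name validate_bulk_body is made up, and it only verifies the rules described above (every line is valid JSON, index/create/update are followed by a document line, delete is not, and the body ends with a newline).

```python
import json

# Actions that must be followed by a document (or partial doc) line.
ACTIONS_WITH_BODY = {"index", "create", "update"}
ALL_ACTIONS = ACTIONS_WITH_BODY | {"delete"}


def validate_bulk_body(body):
    """Return a list of problems found in a _bulk request body.

    Line numbers in the messages count non-blank lines, starting at 1.
    """
    problems = []
    if not body.endswith("\n"):
        problems.append("body must end with a newline")
    lines = [line for line in body.split("\n") if line.strip()]
    i = 0
    while i < len(lines):
        try:
            action = json.loads(lines[i])
        except json.JSONDecodeError as err:
            problems.append(f"line {i + 1}: invalid JSON ({err.msg})")
            i += 1
            continue
        name = next(iter(action), None) if isinstance(action, dict) else None
        i += 1
        if name not in ALL_ACTIONS:
            problems.append(f"line {i}: expected an action line")
            continue
        if name in ACTIONS_WITH_BODY:
            if i >= len(lines):
                problems.append(f"{name} action on line {i} has no document line")
                break
            try:
                json.loads(lines[i])
            except json.JSONDecodeError as err:
                problems.append(f"line {i + 1}: invalid JSON ({err.msg})")
            i += 1
    return problems


good = (
    '{ "index": { "_id": "3" } }\n'
    '{ "name": "Puma", "size": 9, "color": "black" }\n'
    '{ "delete": { "_id": "2" } }\n'
    '{ "update": { "_id": "1" } }\n'
    '{ "doc": { "color": "orange" } }\n'
)
# Same body with the closing brace of the last line dropped.
bad = good.replace('{ "doc": { "color": "orange" } }', '{ "doc": { "color": "orange" }')

print(validate_bulk_body(good))
print(validate_bulk_body(bad))
```

The first call reports no problems; the second reports the unbalanced brace on the last line before the request is ever sent.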

Bulk index documents from a json file

Searching and filtering

Random json generator: www.json-generator.com

-- Generate 1000 customer records and save them in JSON format
 Schema:
 [
  '{{repeat(1000, 1000)}}',
  {
   name: '{{firstName()}} {{surname()}}',
   age: '{{integer(18, 75)}}',
   gender: '{{gender()}}',
   email: '{{email()}}',
   phone: '+1 {{phone()}}',
   street: '{{integer(100, 999)}} {{street()}}',
   city: '{{city()}}',
   state: '{{state()}}, {{integer(100, 10000)}}'
  }
 ]
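
The generator produces a single JSON array, while the _bulk API expects one action line followed by one document line per record. A small Python sketch of that conversion is shown below. The helper names, the two sample records and the sequential ids are all assumptions made for illustration, not output of the generator.

```python
import json


def to_bulk_lines(records, start_id=1):
    """Yield the lines of a _bulk body that indexes `records`.

    Each record becomes two lines: an "index" action carrying a
    sequential id, followed by the record itself.
    """
    for doc_id, record in enumerate(records, start=start_id):
        yield json.dumps({"index": {"_id": str(doc_id)}})
        yield json.dumps(record)


def write_bulk_file(records, path):
    """Write the bulk body to `path`, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in to_bulk_lines(records):
            handle.write(line + "\n")


# Two made-up records shaped like the schema above.
sample = [
    {"name": "Jane Doe", "age": 34, "gender": "female", "state": "Wyoming, 8842"},
    {"name": "John Roe", "age": 51, "gender": "male", "state": "Texas, 1207"},
]
bulk_body = "".join(line + "\n" for line in to_bulk_lines(sample))
print(bulk_body, end="")
```

A file written this way could then be sent with curl's --data-binary option, which preserves the newlines, for example: curl -XPOST "localhost:9200/customers/personal/_bulk?pretty" -H'Content-Type: application/json' --data-binary @customers_bulk.json (the file name is hypothetical).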

Two contexts of search

Query context

Every matching document gets a relevance score, which tells how well the document matches the search term
The search term can be specified as

a URL query parameter
the request body

use of the _search API with a URL query parameter

  curl -XGET "localhost:9200/customers/_search?q=wyoming&pretty"
  curl -XGET "localhost:9200/customers/_search?q=wyoming&sort=age:desc&pretty"

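Paging parameters can be appended to the URL in the same way:
from=10 skips the first 10 hits
size=2 returns only 2 hits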
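
As a small illustration of how from and size relate to page numbers, here is a Python sketch that builds the search URL for a given page. The function name search_url and the 1-based page numbering are assumptions for this example; only the q, from and size parameters come from the notes above.

```python
from urllib.parse import urlencode


def search_url(index, query, page, page_size, host="localhost:9200"):
    """Build a URI-search URL for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    params = {
        "q": query,
        "from": (page - 1) * page_size,  # number of hits to skip
        "size": page_size,               # number of hits to return
    }
    return f"http://{host}/{index}/_search?{urlencode(params)}&pretty"


print(search_url("customers", "wyoming", page=6, page_size=2))
# http://localhost:9200/customers/_search?q=wyoming&from=10&size=2&pretty
```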

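Filter context

Documents are only checked for inclusion in the result; no relevance score is computed (see Filters below).

Search using the request body

(The remaining requests omit the Content-Type header; Elasticsearch 6.0 and later require -H'Content-Type: application/json' on any request that has a body.)

curl -XGET "localhost:9200/customers/_search?pretty" -d'
 {
  "query": {"match_all": {} },
  "size": 3,
  "from": 2,
  "sort": { "age": { "order": "desc" } }
 }'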

Can search multiple indices

 curl -XGET "localhost:9200/customers,products/_search?pretty"
 curl -XGET "localhost:9200/products/mobiles,laptops/_search?pretty"

We can search a specific field for an exact value using a "term" query

curl "localhost:9200/customers/_search?pretty" -d'
 {
  "query": {
   "term": {"name": "gates"}
  }
 }'

We can add "_source": false to the above request body to leave the document source out of the response.
The _source field is very flexible: it also accepts wildcard patterns for field names

{
  "_source": ["st*", "*n*"],
  "query": {
   "term": { "state": "washington"}
  }
 }

we can also specify patterns to include in or exclude from the source fields

{
  "_source": {
   "includes": ["st*", "*n*"],
   "excludes": [ "*der"]
  },
  "query": {
   "term": { "state": "washington"}
  }
 }

Full text queries

match
match_phrase
match_phrase_prefix

curl "localhost:9200/customers/_search?pretty" -d'
 {
  "query": {
   "match": {
    "name": "webster"
   }
  }
 }'
 -- the match query does not look for an exact term: the query text is analyzed first, and additional parameters control how the resulting terms are combined
 
 {
  "query": {
   "match": {
    "name": {
     "query": "frank morris",
     "operator": "or"
    }
   }
  }
 }
 -- logical OR: matches all documents having frank or morris in the name field
 -- the default operator is OR

{
  "query": {
   "match_phrase": {
    "name": "frank morris"
   }
  }
 }
 -- entire phrase has to match 

 {
  "query": {
   "match_phrase_prefix": {
    "name": "fr"
   }
  }
 }
 -- all names containing a word that begins with the prefix fr
 -- this can be used for autocomplete
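
The differences between the three full text queries can be sketched in plain Python. This is only an approximation for intuition: the tokens function is a crude stand-in for an analyzer (lowercase, split on whitespace), the function names simply mirror the query names, and real scoring is ignored entirely.

```python
def tokens(text):
    """Crude stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def match(field_value, query, operator="or"):
    """True if any (operator "or") or all (operator "and") query terms occur."""
    wanted = tokens(query)
    if not wanted:
        return False
    doc_terms = set(tokens(field_value))
    hits = [term in doc_terms for term in wanted]
    return all(hits) if operator == "and" else any(hits)


def match_phrase(field_value, query):
    """True if the query terms occur next to each other, in order."""
    doc, wanted = tokens(field_value), tokens(query)
    if not wanted:
        return False
    return any(doc[i:i + len(wanted)] == wanted
               for i in range(len(doc) - len(wanted) + 1))


def match_phrase_prefix(field_value, query):
    """Like match_phrase, but the last query term only has to be a prefix."""
    doc, wanted = tokens(field_value), tokens(query)
    if not wanted:
        return False
    head, last = wanted[:-1], wanted[-1]
    for i in range(len(doc) - len(wanted) + 1):
        if doc[i:i + len(head)] == head and doc[i + len(head)].startswith(last):
            return True
    return False


print(match("Morris Webster", "frank morris"))          # True: one term is enough
print(match("Morris Webster", "frank morris", "and"))   # False: frank is missing
print(match_phrase("Morris Frank", "frank morris"))     # False: wrong order
print(match_phrase_prefix("Fran Gates", "fr"))          # True: fran starts with fr
```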

TFIDF

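{
 "query": {
  "common": {
   "reviews": {
    "query": "this is great",
    "cutoff_frequency": 0.001
   }
  }
 }
}

Some of the terms in the query may be common words (stop words). With the common terms query above, any word that appears in more than 0.1% of the documents is treated as a common word while searching, so the rarer terms carry most of the weight.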
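
For intuition about why rare terms matter more, here is a small Python sketch of textbook TF-IDF together with a cutoff check. Both functions are illustrative assumptions: the formula is the classic term count times log(N / document frequency), not Lucene's exact scoring (Elasticsearch 5.x uses BM25, a refinement of TF-IDF, by default), and common_terms only imitates the idea of a relative cutoff_frequency.

```python
import math
from collections import Counter


def tf_idf(term, document, corpus):
    """Textbook TF-IDF: (count of term in document) * log(N / document frequency)."""
    term_frequency = Counter(document)[term]
    document_frequency = sum(1 for doc in corpus if term in doc)
    if term_frequency == 0 or document_frequency == 0:
        return 0.0
    return term_frequency * math.log(len(corpus) / document_frequency)


def common_terms(query_terms, corpus, cutoff_frequency):
    """Query terms that occur in more than `cutoff_frequency` of the documents."""
    total = len(corpus)
    return [term for term in query_terms
            if sum(1 for doc in corpus if term in doc) / total > cutoff_frequency]


# Three tiny tokenized "reviews"; a real index would hold far more documents.
reviews = [
    "this is the best phone so far".split(),
    "this is very expensive".split(),
    "storage capacity is great".split(),
]

print(tf_idf("is", reviews[2], reviews))      # 0.0: "is" occurs in every review
print(tf_idf("great", reviews[2], reviews))   # log(3): "great" occurs in one review
print(common_terms(["this", "is", "great"], reviews, 0.5))   # ['this', 'is']
```

With only three documents a cutoff of 0.5 is used here; the 0.001 in the query above makes sense for an index with thousands of documents.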

Compound queries

Boolean query

Matches documents by combining multiple queries using boolean operators such as AND, OR

Must clause

all the queries in the list have to match (AND)

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "must": [
      {"match": { "street": "ditmas" } },
      {"match": { "street": "avenue" } }
     ]
    }
   }
  }
  '

Should clause

at least one of the queries in the list has to match (OR)

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "should": [
      {"match": { "street": "ditmas" } },
      {"match": { "street": "avenue" } }
     ]
    }
   }
  }
  '

must_not clause

none of the queries in the list may match (NOT)

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "must_not": [
      {"match": { "state": "california texas" } },
      {"match": { "street": "lane street" } }
     ]
    }
   }
  }
  '
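
The way the clauses combine can be sketched with plain Python predicates. This is a simplified model for intuition only: bool_matches and street_has are made-up names, the three customers are invented, and scoring is ignored. One detail it does model is that should clauses become optional once a must or filter clause is present.

```python
def bool_matches(doc, must=(), should=(), must_not=(), filters=()):
    """Return True if `doc` satisfies the clauses; each clause is a predicate."""
    if not all(clause(doc) for clause in must):
        return False
    if not all(clause(doc) for clause in filters):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    # With no must/filter clause, at least one should clause has to match.
    if should and not (must or filters):
        return any(clause(doc) for clause in should)
    return True


def street_has(word):
    """Predicate: the street field contains `word` as a whole token."""
    return lambda doc: word in doc["street"].lower().split()


customers = [
    {"name": "A", "street": "512 Ditmas Avenue"},
    {"name": "B", "street": "77 Ocean Avenue"},
    {"name": "C", "street": "9 Ditmas Lane"},
]
clauses = [street_has("ditmas"), street_has("avenue")]

print([c["name"] for c in customers if bool_matches(c, must=clauses)])      # ['A']
print([c["name"] for c in customers if bool_matches(c, should=clauses)])    # ['A', 'B', 'C']
print([c["name"] for c in customers if bool_matches(c, must_not=clauses)])  # []
```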

filter clause

like must, but matching documents are not scored (see Filters below)

Term queries

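The exact term has to be present in the inverted index for a document to match; the search term itself is not analyzed
The terms found in the index depend on how the field was analyzed; for example, the standard analyzer lowercases text, so a term query for "California" would not match a field indexed as "california"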

simple term queries

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "should": [
      {"term": { "state": {"value": "california"} } },
      {"term": { "street": {"value": "idaho"} } }
     ]
    }
   }
  }
  '

Boost some terms over others

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "should": [
      {"term": { "state": {"value": "california", "boost": 2.0} } },
      {"term": { "street": {"value": "idaho"} } }
     ]
    }
   }
  }
  '

Filters

the documents in the result are not scored.
a filter just checks whether the document should be included in the result or not.

-- the most common filter is the range filter
-- term and range filters can be combined

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "must": { "match_all": {} },
     "filter": [
      {
       "term": { 
        "gender": "female"
       }
      },
      {
       "range": {
        "age": {
         "gte": 20,
         "lte": 30
        }
       }
      }
     ]
    }
   }
  }
  '

Analytics and Aggregations

Different kinds of aggregations that can be performed
Implement queries for metric and bucketing aggregations
Work with multi-level nesting of aggregations

Four kinds of aggregations

Metric
Bucketing
Matrix
Pipeline

Metric Aggregations

Aggregations computed over a set of documents, either
all documents in a search result, or
the documents within a logical group

Bucketing Aggregations

Logically group documents based on a search query
A document falls into a bucket if it matches the bucket's criteria
Each bucket is associated with a key

Matrix Aggregations

Operate on multiple fields and produce a matrix result
Experimental and may change in future releases
Not covered

Pipeline Aggregations

Aggregations that work on the output of other aggregations
Experimental and may change in future releases
Not covered

Metric Aggregations

numeric aggregations like sum, average, count, min, etc.
multi-value stats aggregations

aggregations use the same _search API
they are specified with the aggs keyword in the request body
provide a name to be assigned to the result - "avg_age"
avg is the keyword for the average aggregation
the field keyword specifies the field over which the aggregation is performed
"size": 0 means we do not want any documents returned, just the final aggregate value

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "avg_age": {
    "avg": {
     "field": "age"
    }
   }
  }
 }
 '

metric aggregations become more powerful when combined with search or filter queries
the query below calculates the average age of all the customers who live in Minnesota

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "query": {
   "bool": {
    "filter": {
     "match": { "state": "minnesota"}
    }
   }
  },
  "aggs": {
   "avg_age": {
    "avg": {
     "field": "age"
    }
   }
  }
 }
 '

Elasticsearch can also calculate a whole set of statistics in one go
specify the "stats" aggregation keyword within the "aggs" field
"age_stats" is the name that will appear in the response
"stats" calculates the count, min, max, avg and sum of the age field

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "age_stats": {
    "stats": {
     "field": "age"
    }
   }
  }
 }
 '
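
What the avg and stats aggregations compute can be reproduced on a handful of in-memory records. The Python sketch below is only a local imitation: the stats function name mirrors the aggregation keyword, the customer records are invented, and the values returned for an empty input are an assumption rather than documented behaviour.

```python
def stats(values):
    """Compute count, min, max, avg and sum, like the stats aggregation."""
    values = list(values)
    count = len(values)
    if count == 0:
        # Assumed shape for an empty input.
        return {"count": 0, "min": None, "max": None, "avg": None, "sum": 0}
    total = sum(values)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": total / count,
        "sum": total,
    }


# Invented customers, shaped like the generated data.
customers = [
    {"age": 25, "state": "minnesota"},
    {"age": 31, "state": "texas"},
    {"age": 40, "state": "minnesota"},
    {"age": 52, "state": "wyoming"},
]

# Equivalent of the plain stats aggregation over all documents.
age_stats = stats(c["age"] for c in customers)
print(age_stats)

# Equivalent of combining a filter on state with the avg aggregation.
minnesota = stats(c["age"] for c in customers if c["state"] == "minnesota")
print(minnesota["avg"])   # 32.5
```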

Cardinality

the number of unique values in a field across all documents
running a cardinality aggregation on a text field requires fielddata to be enabled for that field

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "age_count": {
    "cardinality": {
     "field": "age"
    }
   }
  }
 }
 '

-- since age is an integer value, the above query will work directly.
-- for a text field, the above query will not work by default
-- fielddata has to be enabled for the text field first

curl -XPUT "localhost:9200/customers/_mapping/personal?pretty" -d'
  {
   "properties": {
    "gender": {
     "type": "text", 
     "fielddata": true
    }
   }
  }
 '

-- now you can run the cardinality aggregation on the gender field

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_count": {
    "cardinality": {
     "field": "gender"
    }
   }
  }
 }
 '
Bucketing

similar to the GROUP BY operation in sql

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_bucket": {
    "terms": {
     "field": "gender"
    }
   }
  }
 }
 '

-- we can also bucket by range

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "age_range": {
    "range": {
     "field": "age",
     "ranges": [
      { "to": 30},
      { "from": 30, "to": 40},
      { "from": 40, "to": 55},
      { "from": 55 }
     ]
    }
   }
  }
 }
 '

"keyed": true can be specified in the range aggregation, which returns the buckets as an object keyed by name instead of an array;
a "key" can also be given to each entry in ranges to name its bucket
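
The bucket boundaries are easy to get wrong, so here is a short Python sketch of how a range aggregation assigns values: from is inclusive and to is exclusive. The function name range_buckets, the sample ages and the simplified default key format (for example "30-40") are assumptions for this illustration; the result is shaped like a keyed response reduced to document counts.

```python
def range_buckets(values, ranges):
    """Count values per range; `from` is inclusive and `to` is exclusive."""
    buckets = {}
    for spec in ranges:
        low, high = spec.get("from"), spec.get("to")
        # Use the explicit key if given, else a simplified "low-high" label.
        default_key = f"{'*' if low is None else low}-{'*' if high is None else high}"
        key = spec.get("key", default_key)
        buckets[key] = sum(
            1 for value in values
            if (low is None or value >= low) and (high is None or value < high)
        )
    return buckets


ages = [18, 29, 30, 39, 40, 54, 55, 70]
ranges = [{"to": 30}, {"from": 30, "to": 40}, {"from": 40, "to": 55}, {"from": 55}]

print(range_buckets(ages, ranges))
# {'*-30': 2, '30-40': 2, '40-55': 2, '55-*': 2}

named = [{"key": "young", "to": 30}, {"key": "senior", "from": 55}]
print(range_buckets(ages, named))
# {'young': 2, 'senior': 2}
```

Note that the age 30 lands in the 30-40 bucket and 55 in the last one, so adjacent ranges never count a document twice.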

Multi level nested aggregations

example of a metric aggregation nested inside a bucketing aggregation
returns the average age of males and females

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_bucket": {
    "terms": {
     "field": "gender"
    },
    "aggs": {
     "average_age": {
      "avg": {
       "field": "age"
      }
     }
    }
   }
  }
 }
 '
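
The nested request above is the equivalent of a GROUP BY gender with AVG(age). A minimal Python sketch of the same computation on in-memory records follows; the function name terms_with_avg and the three customers are invented, and the result only loosely mirrors the shape of the real response.

```python
from collections import defaultdict


def terms_with_avg(docs, bucket_field, metric_field):
    """Group docs by `bucket_field` and average `metric_field` per group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[bucket_field]].append(doc[metric_field])
    return {
        key: {"doc_count": len(values), "average_age": sum(values) / len(values)}
        for key, values in groups.items()
    }


customers = [
    {"gender": "female", "age": 30},
    {"gender": "male", "age": 44},
    {"gender": "female", "age": 50},
]

print(terms_with_avg(customers, "gender", "age"))
# {'female': {'doc_count': 2, 'average_age': 40.0}, 'male': {'doc_count': 1, 'average_age': 44.0}}
```

The outer step corresponds to the terms bucket and the inner average to the nested avg metric; deeper nesting repeats the same idea inside each bucket.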

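-- aggregations can be nested several levels deep: gender buckets, then age ranges within each gender, then the average age within each range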

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_bucket": {
    "terms": {
     "field": "gender"
    },
    "aggs": {
     "age_ranges": {
      "range": {
       "field": "age",
       "keyed": true,
       "ranges": [
        { "key": "young", "to": 30},
        { "key": "middle-aged","from": 30, "to": 55},
        { "key": "senior","from": 55 }
       ]
      },
      "aggs": {
       "average_age": {
        "avg": {
         "field": "age"
        }
       }
      }
     }
     
    }
   }
  }
 }
 '

Filter aggregation and filters keyword

- average age of customers from the state of texas

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "state": {
    "filter": { "term": { "state": "texas" } },
    "aggs": {
     "average_age": {
      "avg": {
       "field": "age"
      }
     }
    }
   }
  }
 }
 '

-- the "filters" keyword accepts several named filters instead of just one; each filter gets its own bucket

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "state": {
    "filters": {
     "filters": {
      "washington" : { "match": { "state": "washington" } },
      "north carolina" : { "match": { "state": "north carolina" } },
      "south dakota" : { "match": { "state": "south dakota" } }
     }
    },
    "aggs": {
     "average_age": {
      "avg": {
       "field": "age"
      }
     }
    }
   }
  }
 }
 '
