The following software versions were used while working through the items below.
Elasticsearch version: 5.4.0
Java Version: 8
Overview
- Download and install
- TFIDF
- Building an index
- Adding documents to index, individually and in bulk
- Search queries - query DSL
- Analysis of data, aggregations
- Lucene - Java
- Distributed - scales to many Nodes
- Highly available - multiple copies of data
- Restful APIs - CRUD, monitoring and other operations via simple JSON based HTTP calls
- Powerful query DSL - schemaless
- Can be installed on a local machine; a hosted cloud offering is also available
Download
- download the latest version from
www.elastic.co
- unzip and start
- By default it will start as a single node cluster
- cluster and node concept
CRUD operations
- cURL (https://curl.haxx.se/download.html)
- create
- read/retrieve
- update
- delete
- Bulk operations on indexed documents
- Bulk creation of indices from json data
create a new index called products
curl -XPUT "localhost:9200/products?&pretty"
{
"acknowledged" : true,
"shards_acknowledged" : true,
"index" : "products"
}
Requests
curl -XPUT "localhost:9200/customers?&pretty"
curl -XPUT "localhost:9200/orders?&pretty"
- check the indices
Request:
curl -XGET "localhost:9200/_cat/indices?v&pretty"
Response:
health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   products BkpL7fogS0uFYMkzV8TYZA   1   1          0            0       230b           230b
Add documents to existing indices
-- Request to add Iphone7 Phone
curl -XPUT "localhost:9200/products/mobiles/1?pretty" -H'Content-Type: application/json' -d'
{ "name": "Iphone 7",
"camera": "12MP",
"storage": "256GB",
"display": "4.7inch",
"battery": "1960mAh",
"reviews": ["Incredibly happy after having used it for one week", "Best phone so far", "Very expensive"]
}
'
- here products is the index
- mobiles is the documentType
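- a documentId (1) can be passed in the url to identify the document being created
- PUT is used to create a document with a given id, or to replace it completely
- POST is used for partial updates through the _update endpoint, or to create a document with an auto-generated id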
-- Response:
{
"_index" : "products",
"_type" : "mobiles",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
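The index / documentType / documentId addressing used in the request above can be sketched in Python. This is a minimal sketch, not an official client: BASE_URL and the helper name build_index_request are assumptions made for illustration, and the request is only built, never sent.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: local single-node cluster


def build_index_request(index, doc_type, doc_id, document):
    """Build (but do not send) a PUT that creates or replaces one document."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}?pretty"
    body = json.dumps(document).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


request = build_index_request("products", "mobiles", 1,
                              {"name": "Iphone 7", "camera": "12MP"})
print(request.get_method(), request.full_url)
# Sending is left to the caller, e.g. urllib.request.urlopen(request)
```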
-- Request to add Samsung Galaxy Phone
curl -XPUT "localhost:9200/products/mobiles/2?pretty" -H'Content-Type: application/json' -d'
{ "name": "Samsung Galaxy",
"camera": "8MP",
"storage": "128GB",
"display": "5.2inch",
"battery": "1500mAh",
"reviews": ["Best phone ever", "Love the screen size", "Awesome"]
}
'
-- Request to add Pixel 3
curl -XPUT "localhost:9200/products/mobiles/3?pretty" -H'Content-Type: application/json' -d'
{ "name": "Pixel 3",
"camera": "12.2MP",
"storage": "128GB",
"display": "5.5inch",
"battery": "2950mAh",
"reviews": ["I Love the camera on this phone", "Awesome google phone"]
}
'
-- Request to add Macbook pro Laptop (Doctype is different)
curl -XPUT "localhost:9200/products/laptops/1?pretty" -H'Content-Type: application/json' -d'
{ "name": "Macbook Pro",
"storage": "500GB",
"RAM" : "8GB",
"display": "13inch",
"os": "El capitan",
"reviews": ["Size is sleek compared to other laptops", "Storage capacity is great"]
}
'
NOTE:
This request will fail because, as of Elasticsearch version 6.x, multiple doc types in a single index are not supported.
{
"error" : {
"root_cause" : [
{
"type" : "illegal_argument_exception",
"reason" : "Rejecting mapping update to [products] as the final mapping would have more than 1 type: [mobiles, laptops]"
}
],
"type" : "illegal_argument_exception",
"reason" : "Rejecting mapping update to [products] as the final mapping would have more than 1 type: [mobiles, laptops]"
},
"status" : 400
}
Retrieving Documents
-- curl -XGET "localhost:9200/products/mobiles/1?pretty"
{
"_index" : "products",
"_type" : "mobiles",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "Iphone 7",
"camera" : "12MP",
"storage" : "256GB",
"display" : "4.7inch",
"battery" : "1960mAh",
"reviews" : [
"Incredibly happy after having used it for one week",
"Best phone so far",
"Very expensive"
]
}
}
-- check if a document exists without retrieving the source
curl -XGET "localhost:9200/products/mobiles/1?pretty&_source=false"
{
"_index" : "products",
"_type" : "mobiles",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true
}
-- to fetch only certain fields of the json document
curl -XGET "localhost:9200/products/mobiles/1?pretty&_source=name,reviews"
{
"_index" : "products",
"_type" : "mobiles",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"reviews" : [
"Incredibly happy after having used it for one week",
"Best phone so far",
"Very expensive"
],
"name" : "Iphone 7"
}
}
Update
- Update document by id
- Whole document
- Partial document
-- update of a document can be done via a put request (whole document)
curl -XPUT "localhost:9200/products/mobiles/1?pretty" -H'Content-Type: application/json' -d'
{
"name" : "Iphone 7",
"camera" : "12MP",
"storage" : "256GB",
"display" : "4.7inch",
"battery" : "1960mAh",
"reviews" : [
"Incredibly happy after having used it for one week",
"Best phone so far",
"Very expensive",
"Much better than android phones"
]
}
'
Response:
{
"_index" : "products",
"_type" : "mobiles",
"_id" : "1",
"_version" : 2,
"result" : "updated",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 3,
"_primary_term" : 1
}
- a partial update of a document is done with a POST to the _update endpoint, passing the changed fields in a doc field
Request: add a new field color to mobile 1
curl -XPOST "localhost:9200/products/mobiles/1/_update?pretty" -H'Content-Type: application/json' -d'
{
"doc": {
"color": "black"
}
}'
- a script can be used to update a field of a document
-- Request: increment the size field by 2 (the document must already have a numeric size field)
curl -XPOST "localhost:9200/products/mobiles/1/_update?pretty" -H'Content-Type: application/json' -d'
{
"script": "ctx._source.size += 2"
}'
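What the doc field does to the stored document can be modelled locally. The sketch below assumes the merge is recursive (inner objects are merged, plain values and arrays are replaced); the function name apply_partial_doc is hypothetical and nothing here talks to a server.

```python
import copy


def apply_partial_doc(source, partial):
    """Return a copy of source with partial merged in.

    Nested objects are merged key by key; any other value (including
    arrays) replaces the old one.
    """
    merged = copy.deepcopy(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_partial_doc(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


phone = {"name": "Iphone 7", "storage": "256GB"}
updated = apply_partial_doc(phone, {"color": "black"})
print(updated)
```

The original dictionary is left untouched, which mirrors the fact that an update produces a new version of the document.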
Deletes
- delete a document from an index
curl -XDELETE "localhost:9200/products/mobiles/1?pretty"
- delete an entire index (deleting a single document type is not supported)
curl -XDELETE "localhost:9200/products?pretty"
Bulk operations
- retrieve multiple documents
- _mget api allows us to get multiple documents in one command
curl -XGET "localhost:9200/_mget?pretty" -H'Content-Type: application/json' -d'
{
"docs": [
{
"_index": "products",
"_type": "laptops",
"_id": "1"
},
{
"_index": "products",
"_type": "laptops",
"_id": "2"
}
]
}'
-- If all the requested documents are in the same index and type, these can be put in the url itself
curl -XGET "localhost:9200/products/mobiles/_mget?pretty" -H'Content-Type: application/json' -d'{"docs": [{"_id": "1"}, {"_id": "2"}]}'
Index multiple documents
- The _bulk api allows multiple operations to be specified in one request.
curl -XPOST "localhost:9200/_bulk?pretty" -H'Content-Type: application/json' -d'
{ "index": {"_index": "products", "_type": "mobiles", "_id": "3" } }
{ "name": "Puma", "size": 9, "color": "black" }
{ "index": {"_index": "products", "_type": "mobiles", "_id": "4" } }
{ "name": "New Balance", "size": 9, "color": "White" }
'
Multiple operations in one command
- Multiple operations can be done using the _bulk api.
- create keyword can be used instead of index, to add a document to the index
- for index, create and update operations, the action line has to be followed by a line holding the json document to be created, or the partial doc for an update; delete takes no second line.
curl -XPOST "localhost:9200/products/shoes/_bulk?pretty" -H'Content-Type: application/json' -d'
{ "index": { "_id": "3" } }
{ "name": "Puma", "size": 9, "color": "black" }
{ "index": {"_id": "4" } }
{ "name": "New Balance", "size": 8, "color": "White" }
{"delete": { "_id": "2"}}
{ "create": {"_id": "5" } }
{ "name": "Nike Power", "size": 11, "color": "red" }
{ "update": {"_id": "1" } }
{ "doc": {"color": "orange" } }
'
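The pairing rule for bulk bodies (an action line, then a document line for index, create and update, nothing for delete) is easy to get wrong by hand. A small checker can be sketched as follows; parse_bulk is a hypothetical helper and only looks at the structure of the body, not at whether the server would accept it.

```python
import json

# Actions that must be followed by a source (or partial doc) line.
NEEDS_PAYLOAD = {"index", "create", "update"}


def parse_bulk(body):
    """Pair each action line with its payload; raise ValueError if malformed."""
    lines = [line for line in body.splitlines() if line.strip()]
    operations, i = [], 0
    while i < len(lines):
        action = json.loads(lines[i])  # a JSON syntax error is a ValueError
        (name, meta), = action.items()  # exactly one action per line
        payload = None
        if name in NEEDS_PAYLOAD:
            if i + 1 >= len(lines):
                raise ValueError(f"{name} action has no payload line")
            payload = json.loads(lines[i + 1])
            i += 1
        elif name != "delete":
            raise ValueError(f"unknown bulk action: {name}")
        operations.append((name, meta, payload))
        i += 1
    return operations


body = """
{ "index": { "_id": "3" } }
{ "name": "Puma", "size": 9, "color": "black" }
{ "delete": { "_id": "2" } }
{ "update": { "_id": "1" } }
{ "doc": { "color": "orange" } }
"""
for operation in parse_bulk(body):
    print(operation)
```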
Bulk index documents from a json file
Searching and filtering
Random json generator:
www.json-generator.com
-- Generate 1000 customer data and save it in json format
Schema:
[
'{{repeat(1000, 1000)}}',
{
name: '{{firstName()}} {{surname()}}',
age: '{{integer(18, 75)}}',
gender: '{{gender()}}',
email: '{{email()}}',
phone: '+1 {{phone()}}',
street: '{{integer(100, 999)}} {{street()}}',
city: '{{city()}}',
state: '{{state()}}, {{integer(100, 10000)}}'
}
]
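The generator produces one plain JSON array, while the _bulk api expects an action line before every document. One possible conversion is sketched below; the function names and the choice of sequential ids starting at 1 are assumptions, not something the generator or Elasticsearch prescribes.

```python
import json


def to_bulk_body(records, first_id=1):
    """Turn a list of dicts into newline-delimited bulk 'index' actions."""
    lines = []
    for offset, record in enumerate(records):
        lines.append(json.dumps({"index": {"_id": str(first_id + offset)}}))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def convert_file(src_path, dst_path):
    """Read a JSON array from src_path and write the bulk body to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        records = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(to_bulk_body(records))
    return len(records)


sample = [{"name": "Ada Gates", "age": 36}, {"name": "Frank Morris", "age": 52}]
print(to_bulk_body(sample), end="")
```

The resulting file could then be posted with curl using --data-binary @<file> against an index/type _bulk url such as customers/personal/_bulk; the file name is up to you.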
Two contexts of search
- Query context
- Every document has a relevance score which tells how well the document matches the search term
- Filter context
- Documents are only included or excluded, no relevance score is computed (see Filters)
- Search term can be specified as
- URL query parameter
- Request body
- both use the _search api
curl -XGET "localhost:9200/customers/_search?q=wyoming&pretty"
curl -XGET "localhost:9200/customers/_search?q=wyoming&sort=age:desc&pretty"
- from and size can be added as query parameters to page through the results, e.g. &from=10&size=2
curl -XGET "localhost:9200/customers/_search?pretty" -H'Content-Type: application/json' -d'
{
"query": {"match_all": {} },
"size": 3,
"from": 2,
"sort": { "age": { "order": "desc" } }
}
'
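The from and size values are simple arithmetic on a page number. A minimal sketch, assuming 1-based page numbers and a hypothetical helper name page_query:

```python
import json


def page_query(page, page_size, sort_field="age", order="desc"):
    """Build a match_all search body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    return {
        "query": {"match_all": {}},
        "from": (page - 1) * page_size,  # number of hits to skip
        "size": page_size,               # number of hits to return
        "sort": {sort_field: {"order": order}},
    }


print(json.dumps(page_query(page=6, page_size=2), indent=2))
```

Page 6 with 2 hits per page gives from=10 and size=2.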
- Can search multiple indices, or multiple document types within one index
curl -XGET "localhost:9200/customers,products/_search?pretty"
curl -XGET "localhost:9200/products/mobiles,laptops/_search?pretty"
- We can search on a specific field with a "term" query
curl "localhost:9200/customers/_search?pretty" -H'Content-Type: application/json' -d'
{
"query": {
"term": {"name": "gates"}
}
}
'
- we can add "_source": false to the above request body to omit the document source from each hit.
- _source is very flexible: field names can be given as wildcard patterns (not full regular expressions)
{
"_source": ["st*", "*n*"],
"query": {
"term": { "state": "washington"}
}
}
- we can specify to include or exclude some pattern from the source fields
{
"_source": {
"includes": ["st*", "*n*"],
"excludes": [ "*der"]
},
"query": {
"term": { "state": "washington"}
}
}
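To see which fields the include and exclude patterns above would keep, the wildcard matching can be imitated locally with fnmatch. This is only a rough model for top-level field names; the sample customer and the helper name filter_source are made up for illustration.

```python
from fnmatch import fnmatchcase


def filter_source(source, includes=None, excludes=None):
    """Keep top-level fields matching an include pattern and no exclude pattern."""
    includes = includes or ["*"]
    excludes = excludes or []
    return {
        field: value
        for field, value in source.items()
        if any(fnmatchcase(field, pattern) for pattern in includes)
        and not any(fnmatchcase(field, pattern) for pattern in excludes)
    }


customer = {"name": "Ada Gates", "age": 36, "gender": "female",
            "email": "ada@example.com", "phone": "+1 555 0100",
            "street": "101 Ditmas Avenue", "city": "Olympia",
            "state": "Washington, 500"}
kept = filter_source(customer, includes=["st*", "*n*"], excludes=["*der"])
print(sorted(kept))
```

Here gender matches "*n*" but is dropped again by the "*der" exclude.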
Full text queries
- match
- match_phrase
- match_phrase_prefix
curl "localhost:9200/customers/_search?pretty" -d'
{
"query": {
"match": {
"name": "webster"
}
}
}'
-- the match query analyzes the search text instead of looking for one exact term, and accepts extra parameters such as operator
{
"query": {
"match": {
"name": {
"query": "frank morris",
"operator": "or"
}
}
}
}
-- logical OR: matches all documents having frank or morris in the name field
-- the default operator is OR; use "and" to require both terms
{
"query": {
"match_phrase": {
"name": "frank morris"
}
}
}
-- entire phrase has to match
{
"query": {
"match_phrase_prefix": {
"name": "fr"
}
}
}
-- matches all names containing a word that begins with the prefix fr
-- this can be used as autocomplete
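The difference between match, match_phrase and match_phrase_prefix can be illustrated with a toy model. It is a deliberate simplification: the tokens function is a crude stand-in for a real analyzer, there is no scoring, and all names are hypothetical.

```python
def tokens(text):
    """Very rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def match(field_value, query, operator="or"):
    """True if any (or) / all (and) query terms occur in the field."""
    doc, wanted = set(tokens(field_value)), tokens(query)
    hits = [term in doc for term in wanted]
    return all(hits) if operator == "and" else any(hits)


def match_phrase(field_value, query, prefix=False):
    """True if the query terms appear consecutively and in order.

    With prefix=True the last query term only has to be a prefix.
    """
    doc, wanted = tokens(field_value), tokens(query)
    if not wanted:
        return False
    for start in range(len(doc) - len(wanted) + 1):
        window = doc[start:start + len(wanted)]
        last_ok = (window[-1].startswith(wanted[-1]) if prefix
                   else window[-1] == wanted[-1])
        if window[:-1] == wanted[:-1] and last_ok:
            return True
    return False


print(match("Frank Morris", "frank webster"))          # one term is enough
print(match_phrase("Morris Frank", "frank morris"))    # order matters
print(match_phrase("Anna Franklin", "fr", prefix=True))
```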
TFIDF
{
"common": {
"reviews": {
"query": "this is great",
"cutoff_frequency": 0.001
}
}
}
- some of the terms in the query may be very common words (stop words); a cutoff_frequency of 0.001 treats any term that occurs in more than 0.1% of the documents as a common word while searching
Compound queries
- Boolean query
- Matches documents by combining multiple queries using boolean operators such as AND, OR
curl "localhost:9200/customers/_search?pretty" -d'
{
"query": {
"bool": {
"must": [
{"match": { "street": "ditmas" } },
{"match": { "street": "avenue" } }
]
}
}
}
'
curl "localhost:9200/customers/_search?pretty" -d'
{
"query": {
"bool": {
"should": [
{"match": { "street": "ditmas" } },
{"match": { "street": "avenue" } }
]
}
}
}
'
curl "localhost:9200/customers/_search?pretty" -d'
{
"query": {
"bool": {
"must_not": [
{"match": { "state": "california texas" } },
{"match": { "street": "lane street" } }
]
}
}
}
'
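Roughly speaking, must behaves like AND, should like OR, and must_not like NOT. When these bodies are generated by a program rather than typed by hand, a small builder keeps the nesting straight. The helper names below are hypothetical, and the output is only the JSON body for the _search api.

```python
import json


def match_clause(field, text):
    """One match query on a single field."""
    return {"match": {field: text}}


def bool_query(must=(), should=(), must_not=(), filter_=()):
    """Assemble a bool query body, leaving out clauses that are empty."""
    clauses = {"must": must, "should": should,
               "must_not": must_not, "filter": filter_}
    return {"query": {"bool": {name: list(queries)
                               for name, queries in clauses.items() if queries}}}


both = bool_query(must=[match_clause("street", "ditmas"),
                        match_clause("street", "avenue")])
either = bool_query(should=[match_clause("street", "ditmas"),
                            match_clause("street", "avenue")])
print(json.dumps(both))
print(json.dumps(either))
```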
Term queries
The exact term needs to be found in inverted index for indexed documents
The terms found in the index may vary based on how you analyze them
curl "localhost:9200/customers/_search?pretty" -d'
{
"query": {
"bool": {
"should": [
{"term": { "state": {"value": "california"} } },
{"term": { "street": {"value": "idaho"} } }
]
}
}
}
'
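Why a term query is sensitive to analysis can be shown with a tiny inverted index. This is a sketch under simplifying assumptions: analyze only lowercases and splits on non-alphanumeric characters, which approximates but does not reproduce a real analyzer, and the sample documents are invented.

```python
from collections import defaultdict


def analyze(text):
    """Rough analyzer stand-in: lowercase, split on non-alphanumerics."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def build_inverted_index(docs, field):
    """Map each analyzed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for term in analyze(doc[field]):
            index[term].add(doc_id)
    return index


def term_lookup(index, value):
    """A term query looks the value up as-is; the value is not analyzed."""
    return sorted(index.get(value, set()))


docs = {1: {"state": "California, 512"}, 2: {"state": "Idaho, 77"}}
index = build_inverted_index(docs, "state")
print(term_lookup(index, "california"))  # found: the index holds the lowercased term
print(term_lookup(index, "California"))  # not found: no such exact term in the index
```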
- Boost some terms over others
curl "localhost:9200/customers/_search?pretty" -d'
{
"query": {
"bool": {
"should": [
{
"term": {
"state":
{
"value": "california",
"boost": 2.0
}
}
},
{
"term":
{
"street":
{
"value": "idaho"
}
}
}
]
}
}
}
'
Filters
- the documents in the result are not scored.
- just checks if the document should be included in the result or not.
-- the most common filter is the range filter
-- term and filters could be combined
curl "localhost:9200/customers/_search?pretty" -d'
{
"query": {
"bool": {
"must": { "match_all": {} },
"filter": [
{
"term": {
"gender": "female"
}
},
{
"range": {
"age": {
"gte": 20,
"lte": 30
}
}
}
]
}
}
}
'
Analytics and Aggregations
- Different kind of aggregations that can be performed
- Implement queries for metrics and bucketing aggregations
- Work with multi level nesting of aggregations
Four kinds of Aggregations
- Metric
- Bucketing
- Matrix
- Pipeline
Metric Aggregations
- Aggregations over a set of documents
- All documents in a search result
- Documents within a logical group
Bucketing Aggregations
- Logically group documents based on search query
- A document falls into a bucket if the criteria matches
- Each bucket associated with a key
Matrix Aggregations
- Operates on multiple fields and produces a matrix result
- Experimental and may change in the future releases
- Not covered
Pipeline Aggregations
- Aggregations that work on the output of other aggregations
- Experimental and may change in the future releases
- Not covered
Metric Aggregations
- numeric aggregations like sum, average, count, min, etc
- multi value stats aggregations
- aggregations are done by using the same _search api
- aggregations are done by using aggs keyword in the request body
- provide a name that you want to be assigned to the result - "avg_age"
- avg is the keyword for average aggregations
- field keyword specifies the field over which this aggregation is going to be performed
- size = 0, means we do not want any documents to be returned, we just want the final aggregate value
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
{
"size": 0,
"aggs": {
"avg_age": {
"avg": {
"field": "age"
}
}
}
}
'
- metric aggregations become more powerful when combined with search or filter queries
- the below query calculates the average age of all the customers who live in minnesota
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
{
"size": 0,
"query": {
"bool": {
"filter": {
"match": { "state": "minnesota"}
}
}
},
"aggs": {
"avg_age": {
"avg": {
"field": "age"
}
}
}
}
'
- Elasticsearch can also calculate a whole range of statistics in one go
- specify the "stats" aggregation keyword within the "aggs" field
- "age_stats" is the field name that will appear in the response
- "stats" calculates the count, min, max, avg, sum of the age field
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
{
"size": 0,
"aggs": {
"age_stats": {
"stats": {
"field": "age"
}
}
}
}
'
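The five numbers that the stats aggregation reports can be reproduced locally on a handful of values, which is handy for sanity-checking a response. The function below is an illustrative sketch and the sample ages are made up.

```python
def stats(values):
    """Compute count, min, max, avg and sum for a list of numbers."""
    values = list(values)
    if not values:
        # This sketch returns None for undefined values on empty input.
        return {"count": 0, "min": None, "max": None, "avg": None, "sum": 0}
    total = sum(values)
    return {"count": len(values), "min": min(values), "max": max(values),
            "avg": total / len(values), "sum": total}


ages = [25, 31, 31, 58]
print(stats(ages))
```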
Cardinality
- the number of unique values in a field across all documents
- enabling cardinality aggregations on text fields requires some special setup for the field data
curl -XPOST "localhost:9200/customers/_search?pretty" -H'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"age_count": {
"cardinality": {
"field": "age"
}
}
}
}
'
-- since age is an integer value, the above query will directly work.
-- for text field, the above query will not work by default
-- have to enable fielddata for the text field
curl -XPUT "localhost:9200/customers/_mapping/personal?pretty" -H'Content-Type: application/json' -d'
{
"properties": {
"gender": {
"type": "text",
"fielddata": true
}
}
}
'
-- now you can run the cardinality aggregation on the gender field
curl -XPOST "localhost:9200/customers/_search?pretty" -H'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"gender_count": {
"cardinality": {
"field": "gender"
}
}
}
}
'
Bucketing
- similar to the GROUP BY operation in sql
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
{
"size": 0,
"aggs": {
"gender_bucket": {
"terms": {
"field": "gender"
}
}
}
}
'
-- we can also bucket by range
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
{
"size": 0,
"aggs": {
"age_range": {
"range": {
"field": "age",
"ranges": [
{ "to": 30},
{ "from": 30, "to": 40},
{ "from": 40, "to": 55},
{ "from": 55 }
]
}
}
}
}
'
- "keyed": true can be specified to return the buckets as an object keyed by range name instead of an array
- a custom key can also be specified for each range
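How documents fall into range buckets can be checked with a local sketch. In a range aggregation the from value is inclusive and the to value is exclusive, so an age of 30 lands in the 30 to 40 bucket, not in the bucket below it. The generated bucket names here only approximate the keys Elasticsearch produces, and range_buckets is a hypothetical helper.

```python
def range_buckets(values, ranges):
    """Count values per range; 'from' is inclusive and 'to' is exclusive."""
    buckets = {}
    for r in ranges:
        low, high = r.get("from"), r.get("to")
        default_key = f"{'*' if low is None else low}-{'*' if high is None else high}"
        key = r.get("key") or default_key
        buckets[key] = sum(
            1 for v in values
            if (low is None or v >= low) and (high is None or v < high)
        )
    return buckets


ages = [22, 30, 39, 40, 55, 71]
ranges = [{"to": 30}, {"from": 30, "to": 40}, {"from": 40, "to": 55}, {"from": 55}]
print(range_buckets(ages, ranges))
```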
Multi level nested aggregations
- example of a metric aggregation nested inside a bucketing aggregation
- returns the average age of males and females
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
{
"size": 0,
"aggs": {
"gender_bucket": {
"terms": {
"field": "gender"
},
"aggs": {
"average_age": {
"avg": {
"field": "age"
}
}
}
}
}
}
'
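A metric nested inside a bucket is the same idea as GROUP BY followed by AVG. The local sketch below mimics the shape of the result for a few invented customers; average_by is a hypothetical name and the output format is simplified compared to a real response.

```python
from collections import defaultdict


def average_by(docs, bucket_field, metric_field):
    """Group docs by bucket_field, then average metric_field in each bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[bucket_field]].append(doc[metric_field])
    return {
        key: {"doc_count": len(vals), "average_age": sum(vals) / len(vals)}
        for key, vals in groups.items()
    }


customers = [
    {"gender": "female", "age": 30},
    {"gender": "male", "age": 40},
    {"gender": "female", "age": 50},
]
print(average_by(customers, "gender", "age"))
```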
- multi layer nesting of aggregations
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
{
"size": 0,
"aggs": {
"gender_bucket": {
"terms": {
"field": "gender"
},
"aggs": {
"age_ranges": {
"range": {
"field": "age",
"keyed": true,
"ranges": [
{ "key": "young", "to": 30},
{ "key": "middle-aged","from": 30, "to": 55},
{ "key": "senior","from": 55 }
]
},
"aggs": {
"average_age": {
"avg": {
"field": "age"
}
}
}
}
}
}
}
}
'
Filter aggregation and filters keyword
- average age of customers from the state of texas
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
{
"size": 0,
"aggs": {
"state": {
"filter": { "term": { "state": "texas" } },
"aggs": {
"average_age": {
"avg": {
"field": "age"
}
}
}
}
}
}
'
-- you can apply several named filters at once, instead of just one, with the "filters" keyword
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
{
"size": 0,
"aggs": {
"state": {
"filters": {
"filters": {
"washington" : { "match": { "state": "washington" } },
"north carolina" : { "match": { "state": "north carolina" } },
"south dakota" : { "match": { "state": "south dakota" } }
}
},
"aggs": {
"average_age": {
"avg": {
"field": "age"
}
}
}
}
}
}
'