Query DSL - start

 build.gradle

ext {
    javaGeneratedSources = file("build/classes/generated")
}

idea {
    module {
        outputDir file('build/classes')
        sourceDirs += javaGeneratedSources
        generatedSourceDirs += javaGeneratedSources
    }
}

compileJava {
    doFirst {
        javaGeneratedSources.mkdirs()
    }
    options.compilerArgs += [
        '-parameters', '-s', javaGeneratedSources
    ]
}

dependencies {
    // QueryDSL
    implementation "com.querydsl:querydsl-jpa:4.2.2"
    annotationProcessor("com.querydsl:querydsl-apt:4.2.2:jpa")
    annotationProcessor("org.springframework.boot:spring-boot-starter-data-jpa")
}


Repository.java

Extend the repository with the QueryDSL repository interfaces and override the customize method:

public interface SomeRepository extends
        JpaRepository<SomeEntity, Long>,
        QuerydslPredicateExecutor<SomeEntity>,
        QuerydslBinderCustomizer<QSomeEntity> {

    @Override
    default void customize(QuerydslBindings bindings, QSomeEntity root) {
        bindings.bind(String.class)
                .first((SingleValueBinding<StringPath, String>) StringExpression::containsIgnoreCase);
    }
}

Neo4j Graph Database | Design tips

One thing I learned from working with the Neo4j graph database is that we need to avoid replication (duplicated data) as much as possible in a graph database. Replication can lead to a lot of inconsistency and can be the cause of many problems and maintenance issues.

We were using the graph database to represent the logical progression of states of, let's say, a smartphone. The phone is in a certain state, which means the phone has a certain set of hardware modules, and those modules are running a certain set of software that lets us properly use the smartphone.

What I learned from deploying my Node application in AWS cloud

Well, something needed to be done in the time of the Corona (COVID-19) virus. The schools are shut down, and the offices are shutting down as well. I had some free time, and so I started getting restless as I always do.

Long story short, I had an idea for a simple application, which I started building, and I ended up creating the basic functionality over a 3-day weekend. The feature was pretty straightforward: it is a live story-pointing application where a bunch of people join and point stories based on their complexity. I deployed it on an AWS instance under my own sub-domain. Then other questions started coming.

  1. How to make it more usable?
  2. How to add additional features?
  3. How to make it secure?
  4. What if the server restarts?
  5. How to make it scale if all the people in the whole world start using it?
  6. How much am I going to be billed?
  7. How to make it Https?

1. How to make it more usable?

Well, there's some work required to make an application work, but there's a lot more work required to make an application usable. I was able to develop a working application in a few hours (4-5 hrs), but making it simply usable took the other two days (10-15 hrs). There are just a lot of scenarios that can occur in the real world, and I was only able to think of a few of them. Rapid testing and fixing the bugs as they occurred was the way to go. Then there was work to be done on the look and feel of the application. I have never been a proper UI designer, so this is always a struggle for me. However, I was able to complete it using some of my previous designs.

2. How to add additional features?

This was one of the important questions in my mind. The first MVP I developed used a simple Node.js server with plain HTML and JavaScript. There's a lot of work you need to do in the plain languages, and if this project was to go anywhere, I needed to use some kind of framework. There are frameworks for everything nowadays. Now I had to decide on the backend framework, the CSS framework, the frontend framework, and so on. Then there is the effort required to port the application to the framework. It's a decent amount of work, and there's a significant learning curve.
I ended up using Express for the backend, Bootstrap for CSS, and React for the frontend.

3. How to make it secure?

Security was not a huge deal for this application. It does not hold any sensitive information, nor does it deal with any kind of financial transaction. So it was not a big deal. But there is always the thought: what if someone is able to hack my system and application? What if someone gets into the EC2 instance, or shuts down the application? Well, I did what I could to make it secure.

4. What if the server restarts?

Everybody should think about this point if they are deploying their application in the cloud. In the cloud, based on resource contention, the server can restart at any time. If that happens, you have to make sure your application restarts and is automatically available on the web. Previously I was using a Linux tool called screen to run the application in a background shell; that would not have worked.
I found pm2 instead, which runs the app in the background and can restart it when the server restarts. I still need to figure out how to see the logs though.
# to keep the process running even after a server restart
npm i -g pm2
pm2 start --name "app-name" npm -- start
pm2 startup # to make sure pm2 starts when the server starts
pm2 save
pm2 ls
# to stop and delete a running instance
pm2 stop app-name
pm2 delete app-name

 5. How to make it scale if all the people in the world start using it?

This question runs in the back of my mind all the time. What would happen if a lot of people start using it? Right now I am running my Node application on an EC2 micro instance with 1 GB of RAM and 8 GB of disk. I chose Node because it has a very small memory footprint and can serve a huge number of requests rapidly. It's not good for long-running tasks, which this application does not have. I do store some information in the server's memory, but I clean it up regularly and have been strict about this over a couple of iterations of my code. So I am pretty confident that my micro instance can handle a significant amount of load. Also, I can start utilizing the 8 GB of disk space instead of relying on memory, which will make things slightly slower, but I think overall it will help handle more data. One thing I am not sure about is how many parallel requests (TCP connections) the EC2 instance can handle at a time. I believe that could be a hard limit, and then I would have to go to other options like auto scaling. But let's see if I ever get there.

6. How much am I going to be billed for this?

This is always a major concern. But I have started feeling this way: if I am always worried about the bill, I will never do anything cool with my life. Besides, I cannot be billed thousands of dollars for this application. I am still eligible for the free tier, and if resource consumption starts increasing because of load, there will be some way to pay for it. So I am calm about this. However, I am prepared to pay up to around a hundred dollars a month.

7. How to make it HTTPS?

Nowadays, HTTPS is a must for any web application. Even Google will not rank you properly if you don't have HTTPS on your site. Because of this, I had to make my application use a certificate. There were other problems while running the application on EC2: a non-root process is not allowed to bind ports 80 or 443, the default ports for HTTP and HTTPS. So I used an Nginx server to route my HTTP and HTTPS traffic to the local port the application was running on.
# install nginx
sudo apt-get install nginx
# edit the config
sudo vi /etc/nginx/nginx.conf
# auto restart nginx on system restart
sudo systemctl enable nginx
# to start/stop nginx
sudo systemctl start nginx.service
sudo systemctl stop nginx.service

## route traffic in port 80 to local 8080
sudo vim /etc/nginx/nginx.conf
server {
  listen 80;
  server_name myservername;
  if ($http_x_forwarded_proto = 'http'){
       return 301 https://$host$request_uri;
  }
  # you can add this in the sites-available/default file as well
  location / {
    proxy_set_header  X-Real-IP  $remote_addr;
    proxy_set_header  Host       $http_host;
    proxy_pass        http://127.0.0.1:3000;
  }
}
# restart again
sudo systemctl restart nginx.service

Now even if the security group on your EC2 instance has only port 80 open, this will still work.
Then I researched and found that I could use AWS Certificate Manager to configure a certificate on an Elastic Load Balancer and point the load balancer at this EC2 instance. Then I point my domain at the load balancer. Now my application is safer, more secure, and load balanced as well.


Liquigraph as a data migration tool for Neo4j

I have been working with a graph database recently called Neo4j. It's interesting. Since it's a graph database, you have to think differently, and so it might be intimidating for those who are used to SQL databases, especially because they are familiar with how those work and all the tools around them. I was particularly focused on application database migration tools.

Straight to the point:
There are a couple of tools for automatic data migration for SQL databases:

  • Liquibase
  • Flyway
But since we are also using Neo4j (a graph database), the above migration tools won't work.
Well, I found a tool called
  • Liquigraph
It's basically similar to Liquibase.

How does it work?

Basically, most of the getting-started tutorial is here:
  • Add the dependency to the Spring project.
  • Configure the datasource (tricky with multiple databases).
  • Add migration scripts to changelog.xml in your resources/db/liquigraph directory.
Now the new changes logged in changelog.xml will be applied to the graph data.

Will it apply the change on every application start?

No. It creates some nodes in your graph database (_liquigraphchangelog, _liquigraphchangeset, _liquigraphquery) to track previous migrations. Liquigraph knows which changes have already been applied to the graph and which are new, and applies only the new ones.
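The tracking logic can be sketched in a few lines (the names here are hypothetical, not Liquigraph's actual API): keep the set of changeset ids already recorded in the graph, and execute only the ones that are not in it.

```python
# Sketch of Liquigraph-style changelog bookkeeping (hypothetical names):
# changesets whose ids are already recorded in the graph are skipped,
# only new ones are executed and then recorded.
def apply_new_changesets(changelog, applied_ids):
    """changelog: list of (changeset_id, cypher_query); applied_ids: ids already run."""
    executed = []
    for changeset_id, query in changelog:
        if changeset_id in applied_ids:
            continue                      # applied on a previous application start
        executed.append(query)            # a real tool would run the Cypher here
        applied_ids.add(changeset_id)     # record it so the next start skips it
    return executed

changelog = [
    ("hello-world", "CREATE (n:Sentence {text:'Hello world!'}) RETURN n"),
    ("set-author-in-sentence", "MATCH (n:Sentence) SET n.author='rajan' RETURN count(n)"),
]
applied = {"hello-world"}  # recorded during an earlier start
print(apply_new_changesets(changelog, applied))  # only the second query runs
```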

What are the dependencies?

Add the two dependencies

implementation ('org.liquigraph:liquigraph-spring-boot-starter:3.1.1')
implementation ('org.neo4j:neo4j-jdbc-bolt:3.5.1')

Add the properties to the application configuration (YAML) file

liquigraph:
  user: neo4j
  password: pswd
  url: jdbc:neo4j:bolt://localhost:7687

Put the changes in the changelog.xml
<?xml version="1.0" encoding="UTF-8"?>
<changelog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="http://www.liquigraph.org/schema/1.0/liquigraph.xsd">
    <changeset id="hello-world" author="you">
        <query>CREATE (n:Sentence {text:'Hello monde!'}) RETURN n</query>
    </changeset>
    <changeset id="hello-world-fixed" author="you">
        <query>MATCH (n:Sentence {text:'Hello monde!'}) SET n.text='Hello world!' RETURN n</query>
    </changeset>
    <changeset id="set-author-in-sentence" author="you">
        <query>match(n: Sentence) set n.author= 'rajan' return count(n)</query>
    </changeset>
</changelog>

Now, if these changes have not already been applied to the graph, they will be applied at application restart.

Problems/Confusions 

  • I was confused about specifying the liquigraph properties explicitly. I thought that since I had already configured the spring.data.neo4j properties, it would pick those up to configure the datasource. But it does not.
  • The tutorial tells you to add spring-data-jdbc to the dependencies. I was afraid it would mess with the spring-data-jpa I already had in my project, so the second way of creating a DataSource manually wouldn't work either. There were multiple databases in my project (MSSQL, MongoDB, Neo4j), so I faced some issues as to which DataSource Liquigraph was picking. I had to dig inside the library to find out that the liquigraph property could be the solution, and it worked.
  • Pick proper versions of the libraries

Learning Elastic Search

Following are the software versions used while learning the below items.

Elasticsearch version: 5.4.0
Java version: 8

 Overview


  • Download and install
  • TFIDF
  • Building an index
  • Adding documents to index, individually and in bulk
  • Search queries - query DSL
  • Analysis of data, aggregations
  • Lucene - Java
  • Distributed - scales to many Nodes
  • Highly available - multiple copies of data
  • Restful APIs - CRUD, monitoring and other operations via simple JSON based HTTP calls
  • Powerful query DSL - schemaless
  • Can be installed in machine, as well as cloud instance is available


Download

- download the latest version from www.elastic.co
- unzip and start
- By default it will start as a single node cluster
- cluster and node concept


CRUD operations


  • cURL (https://curl.haxx.se/download.html)
  • create
  • read/retrieve
  • update
  • delete
  • Bulk operations on indexed documents
  • Bulk creation of indices from json data


create a new index called products


curl -XPUT "localhost:9200/products?&pretty"

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "products"
}

  Requests

 curl -XPUT "localhost:9200/customers?&pretty"
 curl -XPUT "localhost:9200/orders?&pretty"

  - check the indices

  Request:

curl -XGET "localhost:9200/_cat/indices?v&pretty"

  Response:
health status index    uuid                       pri rep docs.count docs.deleted store.size pri.store.size
yellow open   products BkpL7fogS0uFYMkzV8TYZA     1   1   0          0            230        230b

Add documents to existing indices

-- Request to add Iphone7 Phone
 curl -XPUT "localhost:9200/products/mobiles/1?pretty" -H'Content-Type: application/json' -d'
 { "name": "Iphone 7", 
  "camera": "12MP", 
  "storage": "256GB", 
  "display": "4.7inch", 
  "battery": "1960mAh", 
  "reviews": ["Incredibly happy after having used it for one week", "Best phone so far", "Very expensive"]
 }
 '


  • here products is the index
  • mobiles is the documentType
  • can pass a documentId (1) to represent this document being created
  • PUT is used to create (or fully replace) a document at a known id
  • POST is used to create a document without an id, or to update via the _update endpoint


-- Response:
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "result" : "created",
   "_shards" : {
  "total" : 2,
  "successful" : 1,
  "failed" : 0
   },
   "_seq_no" : 0,
   "_primary_term" : 1
 }
-- Request to add Samsung Galaxy Phone
 curl -XPUT "localhost:9200/products/mobiles/2?pretty" -H'Content-Type: application/json' -d'
 { "name": "Samsung Galaxy", 
  "camera": "8MP", 
  "storage": "128GB", 
  "display": "5.2inch", 
  "battery": "1500mAh", 
  "reviews": ["Best phone ever", "Love the screen size", "Awesome"]
 }
 '
-- Request to add Pixel 3
 curl -XPUT "localhost:9200/products/mobiles/3?pretty" -H'Content-Type: application/json' -d'
 { "name": "Pixel 3", 
  "camera": "12.2MP", 
  "storage": "128GB", 
  "display": "5.5inch", 
  "battery": "2950mAh", 
  "reviews": ["I Love the camera on this phone", "Awesome google phone"]
 }
 '
-- Request to add Macbook pro Laptop (Doctype is different)
 curl -XPUT "localhost:9200/products/laptops/1?pretty" -H'Content-Type: application/json' -d'
 { "name": "Macbook Pro", 
  "storage": "500GB", 
  "RAM" : "8GB",
  "display": "13inch", 
  "os": "El capitan", 
  "reviews": ["Size is sleek compared to other laptops", "Storage capacity is great"]
 }
 '

NOTE: 

This request will fail because, as of Elasticsearch 6.x, multiple doc types in a single index are not supported.
{
   "error" : {
  "root_cause" : [
    {
   "type" : "illegal_argument_exception",
   "reason" : "Rejecting mapping update to [products] as the final mapping would have more than 1 type: [mobiles, laptops]"
    }
  ],
  "type" : "illegal_argument_exception",
  "reason" : "Rejecting mapping update to [products] as the final mapping would have more than 1 type: [mobiles, laptops]"
   },
   "status" : 400
 }

Retrieving Documents

-- retrieve a document by id
 curl -XGET "localhost:9200/products/mobiles/1?pretty"
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "_seq_no" : 0,
   "_primary_term" : 1,
   "found" : true,
   "_source" : {
     "name" : "Iphone 7",
     "camera" : "12MP",
     "storage" : "256GB",
     "display" : "4.7inch",
     "battery" : "1960mAh",
     "reviews" : [
       "Incredibly happy after having used it for one week",
       "Best phone so far",
       "Very expensive"
     ]
   }
 }
-- check if a document exists without retrieving the source
 curl -XGET "localhost:9200/products/mobiles/1?pretty&_source=false" 
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "_seq_no" : 0,
   "_primary_term" : 1,
   "found" : true
 }
 -- to fetch certain fields only in the json document
 curl -XGET "localhost:9200/products/mobiles/1?pretty&_source=name,reviews" 
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "_seq_no" : 0,
   "_primary_term" : 1,
   "found" : true,
   "_source" : {
  "reviews" : [
    "Incredibly happy after having used it for one week",
    "Best phone so far",
    "Very expensive"
  ],
  "name" : "Iphone 7"
   }
 }

Update


  • Update document by id
  • Whole document
  • Partial document


-- update of a document can be done via a put request (whole document)
 curl -XPUT "localhost:9200/products/mobiles/1?pretty" -H'Content-Type: application/json' -d'
 {
  "name" : "Iphone 7",
  "camera" : "12MP",
  "storage" : "256GB",
  "display" : "4.7inch",
  "battery" : "1960mAh",
  "reviews" : [
    "Incredibly happy after having used it for one week",
    "Best phone so far",
    "Very expensive",
    "Much better than android phones"
  ]
 }
 '
 Response:
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 2,
   "result" : "updated",
   "_shards" : {
  "total" : 2,
  "successful" : 1,
  "failed" : 0
   },
   "_seq_no" : 3,
   "_primary_term" : 1
 }

  • partial update of a document can be done using the _update endpoint, use the POST command with a doc field

-- Request: add a new field color to mobile 1
curl -XPOST "localhost:9200/products/mobiles/1/_update?pretty" -H'Content-Type: application/json' -d'
 {
  "doc": {
   "color": "black"
  }
 }'


  • script field can be used to update a field of a document


-- Request: increment a numeric size field by 2 using a script
 curl -XPOST "localhost:9200/products/mobiles/1/_update?pretty" -H'Content-Type: application/json' -d'
 {
  "script": "ctx._source.size += 2"
 }'

Deletes


  • delete a document from an index

curl -XDELETE "localhost:9200/products/mobiles/1?pretty"

  • delete an entire index

curl -XDELETE "localhost:9200/products?pretty"


Bulk operations


  • retrieve multiple documents
  • _mget api allows us to get multiple documents in one command


curl "localhost:9200/_mget?pretty" -d'
 {
  "docs": [
   {
    "_index": "products",
    "_type": "laptops",
    "_id": "1"
   },
   {
    "_index": "products",
    "_type": "laptops",
    "_id": "2"
   }
  ]
 }'

-- If all the documents to be fetched are in the same index and type, those can be specified in the URL itself
curl -XGET "localhost:9200/products/mobiles/_mget?pretty" -H'Content-Type: application/json' -d'{"docs": [{"_id": "1"}, {"_id": "2"}]}'

Index multiple documents


  • The _bulk api allows us to specify multiple operations in one go.


curl -XPOST "localhost:9200/_bulk?pretty" -H'Content-Type: application/json' -d'
 { "index": {"_index": "products", "_type": "shoes", "_id": "3" } }
 { "name": "Puma", "size": 9, "color": "black" }
 { "index": {"_index": "products", "_type": "shoes", "_id": "4" } }
 { "name": "New Balance", "size": 9, "color": "White" }
 '

Multiple operations in one command


  • Multiple operations can be done using the _bulk api.
  • create keyword can be used instead of index, to add a document to the index
  • for index, create, and update operations, the action line must be followed by the actual json document to be created or updated.


curl -XPOST "localhost:9200/products/shoes/_bulk?pretty" -H'Content-Type: application/json' -d'
 { "index": { "_id": "3" } }
 { "name": "Puma", "size": 9, "color": "black" }
 { "index": {"_id": "4" } }
 { "name": "New Balance", "size": 8, "color": "White" }
 {"delete": { "_id": "2"}}
 { "create": {"_id": "5" } }
 { "name": "Nike Power", "size": 11, "color": "red" }
 { "update": {"_id": "1" } }
 { "doc": {"color": "orange" } }
 '
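The bulk body format above is newline-delimited JSON: one action line, followed (for index/create/update, but not delete) by the document line. A small sketch of building such a body in Python (the helper function is mine, not an Elasticsearch client API):

```python
import json

# Build an NDJSON _bulk body: each action object on its own line, followed by
# the source document for index/create/update; delete has no document line.
def bulk_body(operations):
    lines = []
    for action, meta, doc in operations:
        lines.append(json.dumps({action: meta}))
        if action != "delete":
            lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the _bulk api requires a trailing newline

body = bulk_body([
    ("index",  {"_id": "3"}, {"name": "Puma", "size": 9, "color": "black"}),
    ("delete", {"_id": "2"}, None),
    ("update", {"_id": "1"}, {"doc": {"color": "orange"}}),
])
print(body)
```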

Bulk index documents from a json file

Searching and filtering

Random json generator: www.json-generator.com

-- Generate 1000 customer data and save it in json format
 Schema:
 [
  '{{repeat(1000, 1000)}}',
  {
   name: '{{firstName()}} {{surname()}}',
   age: '{{integer(18, 75)}}',
   gender: '{{gender()}}',
   email: '{{email()}}',
   phone: '+1 {{phone()}}',
   street: '{{integer(100, 999)}} {{street()}}',
   city: '{{city()}}',
   state: '{{state()}}, {{integer(100, 10000)}}'
  }
 ]

Two contexts of search


  • Query context
    • Every document gets a relevance score which tells how well the document matches the search term
    • The search term can be specified in the
      • URL query string
      • Request body
  • use of the _search api

  curl -XGET "localhost:9200/customers/_search?q=wyoming&pretty"
  curl -XGET "localhost:9200/customers/_search?q=wyoming&sort=age:desc&pretty"

from and size parameters can be appended for pagination (e.g. &from=10&size=2)


  • Filter context - documents are only checked for a match, not scored

curl -XGET "localhost:9200/customers/_search?pretty" -d'
 {
  "query": {"match_all": {} },
  "size": 3,
  "from": 2,
  "sort": { "age": { "order": "desc" } }
 }'


  • Can search multiple indices

 curl -XGET "localhost:9200/customers,products/_search?pretty"
 curl -XGET "localhost:9200/products/mobiles,laptops/_search?pretty"

  • We can search on the specific fields that we are interested in using a "term" query


curl "localhost:9200/customers/_search?pretty" -d'
 {
  "query": {
   "term": {"name": "gates"}
  }
 }'

  • we can add "_source": false to the above request to omit the document body from the response.
  • the _source field is very powerful; we can even specify wildcard patterns


{
  "_source": ["st*", "*n*"],
  "query": {
   "term": { "state": "washington"}
  }
 }


  • we can specify to include or exclude some pattern from the source fields

{
  "_source": {
   "includes": ["st*", "*n*"],
   "excludes": [ "*der"]
  },
  "query": {
   "term": { "state": "washington"}
  }
 }

Full text queries


  • match
  • match_phrase
  • match_phrase_prefix

curl "localhost:9200/customers/_search?pretty" -d'
 {
  "query": {
   "match": {
    "name": "webster"
   }
  }
 }'
 -- the match keyword above is not limited to a simple term match; other behaviors can be configured via additional parameters
 
 {
  "query": {
   "match": {
    "name": {
     "query": "frank morris",
     "operator": "or"
    }
   }
  }
 }
 -- logical OR match: all documents having frank or morris in the name field
 -- the default operator is OR

{
  "query": {
   "match_phrase": {
    "name": "frank morris"
   }
  }
 }
 -- entire phrase has to match 

 {
  "query": {
   "match_phrase_prefix": {
    "name": "fr"
   }
  }
 }
 -- matches all names that begin with the prefix fr
 -- this can be used for autocomplete

TFIDF

{
 "common": {
  "reviews": {
   "query": "this is great",
   "cutoff_frequency": 0.001
  }
 }
}

  • some of the terms in a query may be common words (stop words); with a cutoff_frequency of 0.001, any word with document frequency > 0.1% is treated as a common word while searching
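TF-IDF itself is easy to sketch: a term scores high when it is frequent in a document but rare across the corpus, which is exactly why common words like "this" and "is" contribute little. A toy version (not Lucene's exact formula, which adds normalization and smoothing):

```python
import math

# Toy TF-IDF: tf = term count in the document,
# idf = log(corpus size / number of documents containing the term).
def tfidf(term, doc, corpus):
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df) if df else 0.0

corpus = [
    ["this", "is", "great"],
    ["this", "is", "ok"],
    ["great", "battery", "great", "camera"],
]
# "great" is frequent in the last doc and rarer in the corpus than "this",
# so it scores higher there.
print(tfidf("great", corpus[2], corpus))  # 2 * ln(3/2)
print(tfidf("this", corpus[2], corpus))   # 0.0 (term absent from the doc)
```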


Compound queries


  • Boolean query
    • Matches documents by combining multiple queries using boolean operators such as AND, OR



  • Must clause

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "must": [
      {"match": { "street": "ditmas" } },
      {"match": { "street": "avenue" } }
     ]
    }
   }
  }
  '

  • Should clause

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "should": [
      {"match": { "street": "ditmas" } },
      {"match": { "street": "avenue" } }
     ]
    }
   }
  }
  '


  • must_not clause

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "must_not": [
      {"match": { "state": "california texas" } },
      {"match": { "street": "lane street" } }
     ]
    }
   }
  }
  '

  • filter clause


Term queries

The exact term needs to be found in the inverted index for a document to match.
The terms found in the index may vary based on how they are analyzed.


  • simple term queries

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "should": [
      {"term": { "state": {"value": "california"} } },
      {"term": { "street": {"value": "idaho"} } }
     ]
    }
   }
  }
  '


  • Boost some terms over others

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "should": [
      {
       "term": { 
          "state": 
           {
            "value": "california",
            "boost": 2.0
           } 
         } 
      },
      {
       "term": 
       { 
        "street": 
        {
         "value": "idaho"
        } 
       } 
      }
     ]
    }
   }
  }
  '

Filters


  • the documents in the result are not scored.
  • the filter just checks whether a document should be included in the result or not.


-- the most common filter is the range filter
-- term queries and filters can be combined
curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "must": { "match_all": {} },
     "filter": [
      {
       "term": { 
        "gender": "female"
       }
      },
      {
       "range": {
        "age": {
         "gte": 20,
         "lte": 30
        }
       }
      }
     ]
    }
   }
  }
  '


Analytics and Aggregations


  • Different kind of aggregations that can be performed
  •  Implement queries for metrics and bucketing aggregations
  • Work with multi level nesting of aggregations


Four kind of Aggregations


  1. Metric
  2. Bucketing
  3. Matrix
  4. Pipeline



 Metric Aggregations


  • Aggregations over a set of documents
  • All documents in a search result
  • Documents within a logical group


Bucketing Aggregations


  • Logically group documents based on search query
  • A document falls into a bucket if it matches the bucket's criteria
  • Each bucket associated with a key


Matrix Aggregations


  • Operates on multiple fields and produces a matrix result
  • Experimental and may change in the future releases
  • Not covered


Pipeline Aggregations


  • Aggregations that work on the output of other aggregations
  • Experimental and may change in the future releases
  • Not covered


Metric Aggregations


  • numeric aggregations like sum, average, count, min, etc
  • multi value stats aggregations



  • aggregations are done by using the same _search api
  • aggregations are done by using aggs keyword in the request body
  • provide a name that you want to be assigned to the result - "avg_age"
  • avg is the keyword for average aggregations
  • field keyword specifies the field over which this aggregation is going to be performed
  • size = 0, means we do not want any documents to be returned, we just want the final aggregate value


curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "avg_age": {
    "avg": {
     "field": "age"
    }
   }
  }
 }
 '


  • metric aggregations become more powerful when combined with search or filter queries
  • the below query calculates the average age of all the customers who live in minnesota

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "query": {
   "bool": {
    "filter": {
     "match": { "state": "minnesota"}
    }
   }
  },
  "aggs": {
   "avg_age": {
    "avg": {
     "field": "age"
    }
   }
  }
 }
 '


  • elastic search can also calculate a whole range of statistics in one go
  • specify the "stats" aggregation keyword within the "aggs" field
  • "age_stats" is the field name that will appear in the response
  • "stats" calculates the count, min, max, avg, sum of the age field


curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "age_stats": {
    "stats": {
     "field": "age"
    }
   }
  }
 }
 '

Cardinality 


  • the number of unique values in a field across all documents
  • enabling cardinality aggregations on text fields requires some special setup for the field data
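Conceptually, a cardinality aggregation is just a distinct count over a field; Elasticsearch computes it approximately (via the HyperLogLog++ algorithm) to keep memory bounded. The exact version is a one-liner:

```python
# Conceptual (exact) cardinality: the number of distinct values of a field.
docs = [{"age": 25}, {"age": 30}, {"age": 25}, {"age": 41}]
age_count = len({d["age"] for d in docs})
print(age_count)  # 3 distinct ages: 25, 30, 41
```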

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "age_count": {
     "cardinality": {
     "field": "age"
    }
   }
  }
 }
 '
-- since age is an integer field, the above query will work directly.
-- for a text field, the above query will not work by default
-- you have to enable fielddata for the text field

curl -XPUT "localhost:9200/customers/_mapping/personal?pretty" -d'
  {
   "properties": {
    "gender": {
     "type": "text", 
     "fielddata": true
    }
   }
  }
 '
-- now you can run a cardinality aggregation on the gender field
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_count": {
     "cardinality": {
     "field": "gender"
    }
   }
  }
 }
 '

Bucketing


  • similar to the GROUP BY operation in sql


curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_bucket": {
    "terms": {
     "field": "gender"
    }
   }
  }
 }
 '

-- we can also bucket by range
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "age_range": {
    "range": {
     "field": "age",
     "ranges": [
      { "to": 30},
      { "from": 30, "to": 40},
      { "from": 40, "to": 55},
      { "from": 55 }
     ]
    }
   }
  }
 }
 '

  • "keyed": true can be specified, which changes the way the response is returned (buckets keyed by name)
  • a key can also be specified for each range


Multi level nested aggregations


  • example of a metric aggregation nested inside a bucketing aggregation
  • returns the average age of males and females

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_bucket": {
    "terms": {
     "field": "gender"
    },
    "aggs": {
     "average_age": {
      "avg": {
       "field": "age"
      }
     }
    }
   }
  }
 }
 '
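What the nested aggregation above computes can be mimicked in a few lines: group the documents by the bucketing field, then apply the metric inside each bucket (a conceptual sketch, not how Elasticsearch actually executes it):

```python
from collections import defaultdict

# Terms bucket on one field, then an avg metric inside each bucket.
def terms_avg(docs, bucket_field, metric_field):
    buckets = defaultdict(list)
    for d in docs:
        buckets[d[bucket_field]].append(d[metric_field])
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}

docs = [
    {"gender": "female", "age": 30},
    {"gender": "male",   "age": 40},
    {"gender": "female", "age": 20},
]
print(terms_avg(docs, "gender", "age"))  # {'female': 25.0, 'male': 40.0}
```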

-- multi-level nesting of aggregations
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_bucket": {
    "terms": {
     "field": "gender"
    },
    "aggs": {
     "age_ranges": {
      "range": {
       "field": "age",
       "keyed": true,
       "ranges": [
        { "key": "young", "to": 30},
        { "key": "middle-aged","from": 30, "to": 55},
        { "key": "senior","from": 55 }
       ]
      },
      "aggs": {
       "average_age": {
        "avg": {
         "field": "age"
        }
       }
      }
     }
     
    }
   }
  }
 }
 '

Filter aggregation and filters keyword

- average age of customers from the state of texas
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "state": {
    "filter": { "term": { "state": "texas" } },
    "aggs": {
     "average_age": {
      "avg": {
       "field": "age"
      }
     }
    }
   }
  }
 }
 '

-- to use multiple filters instead of just one, use the "filters" keyword
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "state": {
    "filters": {
     "filters": {
      "washington" : { "match": { "state": "washington" } },
      "north carolina" : { "match": { "state": "north carolina" } },
      "south dakota" : { "match": { "state": "south dakota" } }
     }
    },
    "aggs": {
     "average_age": {
      "avg": {
       "field": "age"
      }
     }
    }
   }
  }
 }
 '