Pages

Liquigraph as a data migration tool for Neo4j

I have been working a graph database recently called Neo4j. Its interesting. Since its a graph database, you have to think differently. And so, it might be intimidating for those who are used to the sql databases. Especially because they are familiar with how it works and all the tools required to work with it. I was particularly focused on Application database migration tools.

Directly to the point:
There are a couple of tools for automatic data migration for sql databases

  • Liquibase
  • Flyway
But since we are also using Neo4j (graph database), above migration tools won't work.
Well I found a tool called 
  • LIQUIGRAPH
Its basically similar to Liquibase. 

How it works?

Basically most of the getting started tutorial is here
  • Add the dependency in the spring project. 
  • Configure the datasource. (Tricky is multiple databases)
  • Add migration scripts in changelog.xml at your resources/db/liquigraph directory
Now the new changes logged in the changelog.xml will be applied to the graph data. 

Will it apply the change on every application start?

No. It creates some nodes in your graph database (_liquigraphchangelog, _liquigraphchangeset, _liquigraphquery). These are to track the previous migrations. Liquigraph knows which changes are already applied to the graph and which are the new changes, and applies only the new changes.

What are the dependencies?

Add the two dependencies

implementation ('org.liquigraph:liquigraph-spring-boot-starter:3.1.1')
implementation ('org.neo4j:neo4j-jdbc-bolt:3.5.1')

Add the properties in the properties file

liquigraph:
  user: neo4j
  password: pswd
  url: jdbc:neo4j:bolt://localhost:7687

Put the changes in the changelog.xml
<?xml version="1.0" encoding="UTF-8"?><changelog xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"           xsi:noNamespaceSchemaLocation="http://www.liquigraph.org/schema/1.0/liquigraph.xsd">
    <changeset id="hello-world" author="you">
        <query>CREATE (n:Sentence {text:'Hello monde!'}) RETURN n</query>
    </changeset>
    <changeset id="hello-world-fixed" author="you">
        <query>MATCH (n:Sentence {text:'Hello monde!'}) SET n.text='Hello world!' RETURN n</query>
    </changeset>
    <changeset id="set-author-in-sentence" author="you">
        <query>match(n: Sentence) set n.author= 'rajan' return count(n)</query>
    </changeset>
</changelog>

Now if these changes are not already applied in the graph, They will be applied at the application restart.

Problems/Confusions 

  • I was confused about specifying the liquigraph properties explicitly. I thought since I have already configured the spring.data.neo4j properties, it would pick those properties to configure the Datasource. But it does not.
  • The tutorial tells you to put spring-data-jdbc to the dependencies. I was afraid that it would mess with spring-data-jpa that I had in my project. So the second way of creating a Datasource manually won't work either. There were miltiple databases in my project (MSSQL, Mongodb, Neo4j). So faced some issue as to which Datasource the Liquigraph was picking. I had to dig inside the library to find out that the liquigraph property could be the solution, and it worked.
  • Pick proper version of the libraries

Learning Elastic Search

Following are the software versions used while learning the below items.

Elastic search Version:  5.4.0
Java Version: 8

 Overview


  • Download and install
  • TFIDF
  • Building an index
  • Adding documents to index, individually and in bulk
  • Search queries - query DSL
  • Analysis of data , aggregations
  • Lucene - Java
  • Distributed - scales to many Nodes
  • Highly available - multiple copies of data
  • Restful APIs - CRUD, monitoring and other operations via simple JSON based HTTP calls
  • Power query DSL - Schemaless
  • Can be installed in machine, as well as cloud instance is available


Download

- download the latest version from www.elastic.co
- unzip and start
- By default it will start as a single node cluster
- cluster and node concept


CRUD operations


  • cURL (https://curl.haxx.se/download.html)
  • create
  • read/retrieve
  • update
  • delete
  • Bluk operations on indexed documents
  • Bulk creation of indices from json data


create a new index called products


curl -XPUT "localhost:9200/products?&pretty"

{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "products"
}

  Requests

 curl -XPUT "localhost:9200/customers?&pretty"
 curl -XPUT "localhost:9200/orders?&pretty"

  - check the indices

  Request:

curl -XGET "localhost:9200/_cat/indices?v&pretty"

  Response:
health status index    uuid                       pri rep docs.count docs.deleted store.size pri.store.size
yellow open   products BkpL7fogS0uFYMkzV8TYZA     1   1   0          0            230        230b

Add documents to existing indices

-- Request to add Iphone7 Phone
 curl -XPUT "localhost:9200/products/mobiles/1?pretty" -H'Content-Type: application/json' -d'
 { "name": "Iphone 7", 
  "camera": "12MP", 
  "storage": "256GB", 
  "display": "4.7inch", 
  "battery": "1960mAh", 
  "reviews": ["Incredibly happy after having used it for one week", "Best phone so far", "Very expensive"]
 }
 '


  • here products is the index
  • mobiles is the documentType
  • can pass a documentId (1) to represent this document being created
  • PUT is used for Create or delete
  • Post is used to update


-- Response:
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "result" : "created",
   "_shards" : {
  "total" : 2,
  "successful" : 1,
  "failed" : 0
   },
   "_seq_no" : 0,
   "_primary_term" : 1
 }
-- Request to add Samsung Galaxy Phone
 curl -XPUT "localhost:9200/products/mobiles/2?pretty" -H'Content-Type: application/json' -d'
 { "name": "Samsung Galaxy", 
  "camera": "8MP", 
  "storage": "128GB", 
  "display": "5.2inch", 
  "battery": "1500mAh", 
  "reviews": ["Best phone ever", "Love the screen size", "Awesome"]
 }
 '
-- Request to add Pixel 3
 curl -XPUT "localhost:9200/products/mobiles/3?pretty" -H'Content-Type: application/json' -d'
 { "name": "Pixel 3", 
  "camera": "12.2MP", 
  "storage": "128GB", 
  "display": "5.5inch", 
  "battery": "2950mAh", 
  "reviews": ["I Love the camera on this phone", "Awesome google phone"]
 }
 '
-- Request to add Macbook pro Laptop (Doctype is different)
 curl -XPUT "localhost:9200/products/laptops/1?pretty" -H'Content-Type: application/json' -d'
 { "name": "Macbook Pro", 
  "storage": "500GB", 
  "RAM" : "8GB",
  "display": "13inch", 
  "os": "El capitan", 
  "reviews": ["Size is sleek compared to other laptops", "Storage capacity is great"]
 }
 '

NOTE: 

This request will fail because as of Lucene version 6.x, multiple doc types in a single index is not supported.
{
   "error" : {
  "root_cause" : [
    {
   "type" : "illegal_argument_exception",
   "reason" : "Rejecting mapping update to [products] as the final mapping would have more than 1 type: [mobiles, laptops]"
    }
  ],
  "type" : "illegal_argument_exception",
  "reason" : "Rejecting mapping update to [products] as the final mapping would have more than 1 type: [mobiles, laptops]"
   },
   "status" : 400
 }

Retrieving Documents

-- curl -XGET "localhost:9200/products/mobiles/1?pretty" { "_index" : "products", "_type" : "mobiles", "_id" : "1", "_version" : 1, "_seq_no" : 0, "_primary_term" : 1, "found" : true, "_source" : { "name" : "Iphone 7", "camera" : "12MP", "storage" : "256GB", "display" : "4.7inch", "battery" : "1960mAh", "reviews" : [ "Incredibly happy after having used it for one week", "Best phone so far", "Very expensive" ] } }
-- check if document exist without retrieving the source
 curl -XGET "localhost:9200/products/mobiles/1?pretty&_source=false" 
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "_seq_no" : 0,
   "_primary_term" : 1,
   "found" : true
 }
 -- to fetch certain fields only in the json document
 curl -XGET "localhost:9200/products/mobiles/1?pretty&_source=name,reviews" 
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 1,
   "_seq_no" : 0,
   "_primary_term" : 1,
   "found" : true,
   "_source" : {
  "reviews" : [
    "Incredibly happy after having used it for one week",
    "Best phone so far",
    "Very expensive"
  ],
  "name" : "Iphone 7"
   }
 }

Update


  • Update document by id
  • Whole document
  • Partial document


-- update of a document can be done via a put request (whole document)
 curl -XPUT "localhost:9200/products/mobiles/1?pretty" -H'Content-Type: application/json' -d'
 {
  "name" : "Iphone 7",
  "camera" : "12MP",
  "storage" : "256GB",
  "display" : "4.7inch",
  "battery" : "1960mAh",
  "reviews" : [
    "Incredibly happy after having used it for one week",
    "Best phone so far",
    "Very expensive",
    "Much better than android phones"
  ]
 }
 '
 Response:
 {
   "_index" : "products",
   "_type" : "mobiles",
   "_id" : "1",
   "_version" : 2,
   "result" : "updated",
   "_shards" : {
  "total" : 2,
  "successful" : 1,
  "failed" : 0
   },
   "_seq_no" : 3,
   "_primary_term" : 1
 }

  • partial update of a document can be done using the _update endpoint, use the POST command with a doc field

Request: add a new field color in the mobile 1
curl -XPOST "localhost:9200/products/mobiles/1/_update?pretty" -H'Content-Type: application/json' -d'
 {
  "doc": {
   "color": "black"
  }
 }'


  • script field can be used to update a field of a document


-- Request: increment the shoe size by 2
 curl -XPOST "localhost:9200/products/mobiles/1/_update?pretty" -H'Content-Type: application/json' -d'
 {
  "script": "ctx._source.size += 2"
 }'

Deletes


  • delete a document from an index

curl -XDELETE "localhost:9200/products/mobile/1?pretty

  • delete an entire index

curl -XDELETE "localhost:9200/products/mobile?pretty


Bulk operations


  • retrieve multiple documents
  • _mget api allows us to get multiple documents in one command


curl "localhost:9200/_mget?pretty" -d'
 {
  "docs": [
   {
    "_index": "products",
    "_type": "laptops",
    "_id": "1"
   },
   {
    "_index": "products",
    "_type": "laptops",
    "_id": "2"
   }
  ]
 }'

-- If all the documents trying to get is of same index, can be put in the url itself
curl -XGET "localhost:9200/products/mobiles/_mget?pretty" -H'Content-Type: application/json' -d'{"docs": [{"_id": "1"}, {"_id": "2"}]}'

Index multiple documents


  • The _bulk api allows to specify multiple operations in one go.


curl -XPOST "localhost:9200/_bulk?pretty" -H'Content-Type: application/json' -d'
 { "index": {"_index": "products", "_type": "mobiles", "_id": "3" } }
 { "name": "Puma", "size": 9, "color": "black" }
 { "index": {"_index": "products", "_type": "mobiles", "_id": "4" } }
 { "name": "New Balance", "size": 9, "color": "White" }
 '

Multiple operations in one command


  • Multiple operations can be done using the _bulk api.
  • create keyword can be used instead of index, to add a document to the index
  • for create and update operation, one json has to follow with the actual json document to be created or updated.


curl -XPOST "localhost:9200/products/shoes/_bulk?pretty" -H'Content-Type: application/json' -d'
 { "index": { "_id": "3" } }
 { "name": "Puma", "size": 9, "color": "black" }
 { "index": {"_id": "4" } }
 { "name": "New Balance", "size": 8, "color": "White" }
 {"delete": { "_id": "2"}}
 { "create": {"_id": "5" } }
 { "name": "Nike Power", "size": 11, "color": "red" }
 { "update": {"_id": "1" } }
 { "doc": {"color": "orange" }
 '

Bulk index documents from a json file

Searching and filtering

Random json generator: www.json-generator.com

-- Generate 1000 customer data and save it in json format
 Schema:
 [
  '{{repeat(1000, 1000)}}',
  {
   name: '{{firstName()}} {{surname()}}',
   age: '{{integer(18, 75)}}',
   gender: '{{gender()}}',
   email: '{{email()}}',
   phone: '+1 {{phone()}}',
   street: '{{integer(100, 999)}} {{street()}}',
   city: '{{city()}}',
   state: '{{state()}}, {{integer(100, 10000)}}'
  }
 ]

Two context of search


  • Query context
    • Every document has a relevance score which tells how well the document matches the search term
    • Search term can be specified as
      • URL query parameter
      • URL request body
  • use of the _search api

  curl -XGET "localhost:9200/customers/_search?q=wyoming&pretty"
  curl -XGET "localhost:9200/customers/_search?q=wyoming&sort=age:desc&pretty"

from=10
size=2


  • Filter context

curl -XGET "localhost:9200/products/_search?pretty" -d'
 {
  "query": {"match_all": {} },
  "size": 3,
  "from": 2,
  "sort": { "age": { "order": "desc" } }
 }


  • Can search multiple indices

 curl -XGET "localhost:9200/customers,products/_search?pretty"
 curl -XGET "localhost:9200/products/mobiles,laptops/_search?pretty"

  • We can search on fields that we are interested in "term"


curl "localhost:9200/customers/_search?pretty" -d'
 {
  "query": {
   "term": {"name": "gates"}
  }
 }

  • we can append "_source": false in the above request to eliminate the body from the response.
  •  _source field is very powerful and we can even specify regular expressions


{
  "_source": ["st*", "*n*"],
  "query": {
   "term": { "state": "washington"}
  }
 }


  • we can specify to include or exclude some pattern from the source fields

{
  "_source": {
   "includes": ["st*", "*n*"],
   "excludes": [ "*der"]
  },
  "query": {
   "term": { "state": "washington"}
  }
 }

Full text queries


  • match
  • match_phrase
  • match_phrase_prefix

curl "localhost:9200/customers/_search?pretty" -d'
 {
  "query": {
   "match": {
    "name": "webster"
   }
  }
 }'
 -- above match keyword can be used to perform not an exact term match, but other ways also (other parameters)
 
 {
  "query": {
   "match": {
    "name": {
     "query": "frank morris",
     "operator": "or"
    }
   }
  }
 }
 -- logical OR matches , all documents having frank or morris in the name field 
 -- default operator is OR 

{
  "query": {
   "match_phrase": {
    "name": "frank morris"
   }
  }
 }
 -- entire phrase has to match 

 {
  "query": {
   "match_phrase_prefix": {
    "name": "fr"
   }
  }
 }
 --  all names that begins with the prefix fr 
 -- this can be used as autocomplete

TFIDF

{
 "common": {
  "reviews": {
   "query": "this is great",
   "cutoff_frequency": 0.001
  }
 }
}

  • some of the terms in the query may be common words (stop words). treat any word with frequency > 0.1% as common word while searching


Compound queries


  • Boolean query
    • Matches documents by combining multiple queries using boolean operators such as AND, OR



  • Must clause

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "must": [
      {"match": { "street": "ditmas" } },
      {"match": { "street": "avenue" } }
     ]
    }
   }
  }
  '

  • Should clause

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "should": [
      {"match": { "street": "ditmas" } },
      {"match": { "street": "avenue" } }
     ]
    }
   }
  }
  '


  • must_not clause

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "must_not": [
      {"match": { "state": "california texas" } },
      {"match": { "street": "lane street" } }
     ]
    }
   }
  }
  '

  • filter clause


Term queries

The exact term needs to be found in inverted index for indexed documents
The terms found in the index may vary based on how you analyze them


  • simple term queries

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "should": [
      {"term": { "state": {"value": "california"} } },
      {"term": { "street": {"value": "idaho"} } }
     ]
    }
   }
  }
  '


  • Boost some terms over others

curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "should": [
      {
       "term": { 
          "state": 
           {
            "value": "california",
            "boost": 2.0
           } 
         } 
      },
      {
       "term": 
       { 
        "street": 
        {
         "value": "idaho"
        } 
       } 
      }
     ]
    }
   }
  }
  '

Filters


  • the documents in the result are not scored.
  • just checks if the document should be included in the result or not.


-- the most common filter is the range filter
-- term and filters could be combined
curl "localhost:9200/customers/_search?pretty" -d'
  {
   "query": {
    "bool": {
     "must": { "match_all": {} },
     "filter": [
      {
       "term": { 
        "gender": "female"
       }
      },
      {
       "range": {
        "age": {
         "gte": 20,
         "lte": 30
        }
       }
      }
     ]
    }
   }
  }
  '


Analytics and Aggregations


  • Different kind of aggregations that can be performed
  •  Implement queries for metrics and bucketing aggregations
  • Work with multi level nesting of aggregations


Four kind of Aggregations


  1. Metric
  2. Bucketing
  3. Matrix
  4. Pipeline



 Metric Aggregations


  • Aggregations over a set of documents
  •  All document in a search result
  •  Document within a logical group


Bucketing Aggregations


  • Logically group documents based on search query
  • A document falls into a bucket if the criteria matches
  • Each bucket associated with a key


Matrix Aggregations


  • Operates on multiple fields and produces a matrix result
  • Experimental and may change in the future releases
  • Not covered


Pipeline Aggregations


  • Aggregations tht work on the output of other aggregations
  • Experimental and may change in the future releases
  • Not covered


Metric Aggregations


  • numeric aggregations like sum, average, count, min, etc
  • multi value stats aggregations



  • aggregations are done by using the same _search api
  • aggregations are done by using aggs keyword in the request body
  • provide a name that you want to be assigned to the result - "avg_age"
  • avg is the keyword for average aggregations
  • field keyword specifies the field over which this aggregation is going to be performed
  • size = 0, means we do not want any documents to be returned, we just want the final aggregate value


curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "avg_age": {
    "avg": {
     "field": "age"
    }
   }
  }
 }
 '


  • metric aggregations become more powerful when combined with search or filter queries
  • the below query calcualtes the average age of all the customers who live in minnesota

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "query": {
   "bool": {
    "filter": {
     "match": { "state": "minnesota"}
    }
   }
  },
  "aggs": {
   "avg_age": {
    "avg": {
     "field": "age"
    }
   }
  }
 }
 '


  • elastic search can also calculate a whole range of statistics in one go
  • specify the "stats" aggregation keyword within the "aggs" field
  • "age_stats" is the field name that will appear in the response
  • "stats" calculates the count, min, max, avg, sum of the age field


curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "age_stats": {
    "stats": {
     "field": "age"
    }
   }
  }
 }
 '

Cardinality 


  • the number of unique values in a field across all documents
  • enabling cardinality aggregations on text fields require some special setup for the field data

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "age_count": {
    "cardanality": {
     "field": "age"
    }
   }
  }
 }
 '
-- since age is an integer value, the above query will directly work.
-- for text field, the above query will not work by default
-- have to enable fieldData for the text field

curl -XPUT "localhsot:9200/customers/_mapping/personal?pretty" -d'
  {
   "properties": {
    "gender": {
     "type": "text", 
     "fielddata": true
    }
   }
  }
 '
-- now you can run cardanality aggregation on the gender field
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_count": {
    "cardanality": {
     "field": "gender"
    }
   }
  }
 }

Bucketing


  • similar to the GROUP BY operation in sql


curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_bucket": {
    "terms": {
     "field": "gender"
    }
   }
  }
 }
 '

-- we can also bucket by range
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "age_range": {
    "range": {
     "field": "age",
     "ranges": [
      { "to": 30},
      { "from": 30, "to": 40},
      { "from": 40, "to": 55},
      { "from": 55 }
     ]
    }
   }
  }
 }
 '

  • "keyed": true can be specified which changes the way the response is returned,
  • also can specify key in the ranges


Multi level nested aggregations


  • example of a metric aggregation nested inside a bucketing aggregation
  • returns the average age of males and females

curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_bucket": {
    "terms": {
     "field": "gender"
    },
    "aggs": {
     "average_age": {
      "avg": {
       "field": "age"
      }
     }
    }
   }
  }
 }
 '

- multi layer nesting of aggregations
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "gender_bucket": {
    "terms": {
     "field": "gender"
    },
    "aggs": {
     "age_ranges": {
      "range": {
       "field": "age",
       "keyed": true,
       "ranges": [
        { "key": "young", "to": 30},
        { "key": "middle-aged","from": 30, "to": 55},
        { "key": "senior","from": 55 }
       ]
      },
      "aggs": {
       "average_age": {
        "avg": {
         "field": "age"
        }
       }
      }
     }
     
    }
   }
  }
 }
 '

Filter aggregation and filters keyword

- average age of customers from the state of texas
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "state": {
    "filter": { "term": { "state": "texas" } },
    "aggs": {
     "average_age": {
      "avg": {
       "field": "age"
      }
     }
    }
   }
  }
 }
 '

-- you can use multiple filters instead of just one "filters" keyword
curl -XPOST "localhost:9200/customers/_search?&pretty" -d'
 {
  "size": 0,
  "aggs": {
   "state": {
    "filters": {
     "filters": {
      "washington" : { "match": { "state": "washington" } },
      "north carolina" : { "match": { "state": "north carolina" } },
      "south dakota" : { "match": { "state": "south dakota" } }
     }
    },
    "aggs": {
     "average_age": {
      "avg": {
       "field": "age"
      }
     }
    }
   }
  }
 }
 '