Learn Neo4j Cypher basics in 30 minutes
A practical tutorial for graph database beginners, based on a Neo4j workshop in Bangkok. It covers Neo4j Sandbox, graph data modeling, importing data from a CSV file, and Cypher examples and tasks.
[UPDATE] If you are interested in learning Neo4j and Cypher, take a look at the Graphville educational platform, where I transformed my articles into interactive lessons. Welcome to Graphville.
Goal
In this short tutorial-style article we are going to talk about the graph query language Cypher. The main goal of this material is to understand simple data modeling methods and the basic concepts of graph querying.
You will start by creating a Neo4j Sandbox and then explore the domain model of the entities in the data source. After that, you will import the data and write six Cypher queries that use different tricks.
All you need is your laptop and a desire to learn Cypher basics in the next 30 minutes. Let's start!
Setup
There are many options for running your own Neo4j instance in just a few minutes. For today, and for other study purposes, I recommend Neo4j Sandbox. It is a fast, free and easy way to run an isolated database to play with. Register an account and then launch a Sandbox: go to the Sandbox page to sign up and set up.
You can also go with Docker. Running Neo4j via Docker should also take just a few minutes. Go to Docker Hub.
Neo4j Browser
Before we dive deep into the domain and modeling, let's learn the basics of the database UI. Launching Browser will take you to the screen below.
Blue — Database & Connection information
Yellow — Nodes, Relationships and Properties (nothing here yet)
Red — Query input box (click ESC to expand/collapse)
Green — Execute query button (Clean and Add to Favorites actions nearby)
Here you are going to work with your graph: write and execute queries, and explore results in graph or table view. We will be back in Browser soon.
Domain Model
The domain model is very simple and can be expressed in just three sentences:
There are some cities.
There are some people, living in those cities.
People call each other.
Data Source
A graph database, like any other persistence layer, is used to store data, and for our lesson we need data too. Right now our graph is empty, so our first step is to import data into the database.
Currently, the source data for our “Calls” data model is stored in a CSV file. You can download the CSV file here: https://vbatushkov.bitbucket.io/log_of_calls.csv
As you can see from the CSV file, each row contains the following information: the call start and end timestamps and, for each person on the call, the name, gender, phone number and city.
In total there are 10 000 calls, 2 000 persons and 5 cities. The file covers calls made between 2019-01-01 and 2019-07-30.
To import data into the graph from the CSV file, we first have to design the structure of our future graph. Based on the graph schema we choose, we will write the import script.
Data Modeling
Let's analyze one random row of the CSV file.
How can this information be expressed in human language?
Cassidy made a call to Enrique.
Now let’s convert this information into a graph using 2 main building blocks: Nodes and Relationships.
Nodes
We need one node to represent Cassidy and one for Enrique. These are our first two nodes, with properties like name, gender, phone number and city. Both nodes belong to the same “type”: they both represent a Person. This type of a node is called a node label. One more node we want to extract from the row is a Call node. We can set its duration property as the difference in minutes between the start and end of the call.
Relationships
Currently, we have a few nodes, but this is still not a graph. There are no relationships between our nodes, and we need to fix that. From the fact that Cassidy made a call to Enrique, we can say that for Cassidy this call is outgoing, so we can name this relationship [OUT]. On the other side, Enrique received the call, so for him this call is incoming: [IN].
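The sentence "Cassidy made a call to Enrique" could be sketched in Cypher roughly as follows. This is only an illustration and not the import script: the property names, every property value and the direction of the [OUT] and [IN] relationships are assumptions. Run it only in an empty scratch database and clear it afterwards with MATCH (n) DETACH DELETE n.

```cypher
// Illustration only: property names, values and relationship directions are assumptions.
// Two nodes with the same label, Person, one for each participant of the call.
CREATE (cassidy:Person {name: 'Cassidy', phone: '000-0001', city: 'Bangkok'})
CREATE (enrique:Person {name: 'Enrique', phone: '000-0002', city: 'Bangkok'})
// One Call node; duration is the difference in minutes between start and end.
CREATE (call:Call {startTime: '2019-01-01T10:00', endTime: '2019-01-01T10:05', duration: 5})
// The call is outgoing for Cassidy and incoming for Enrique.
CREATE (cassidy)-[:OUT]->(call)
CREATE (call)-[:IN]->(enrique)
RETURN cassidy, call, enrique
```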
Normalization
As you can see, we currently have the city property inside the Person node. This structure leads to data duplication and potential inconsistency. A much better design is to have each City as a separate node, with a relationship from each person who lives there.
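A minimal sketch of the normalized shape, assuming the relationship is called [FROM] and points from the person to the city, and assuming (purely for illustration) that Cassidy lives in Bangkok. As with any write query in this tutorial, try it in a scratch database rather than on the imported data.

```cypher
// Illustration only: the city becomes its own node instead of a Person property.
MERGE (bangkok:City {name: 'Bangkok'})
MERGE (cassidy:Person {name: 'Cassidy'})
// Every person living in Bangkok points to the same single City node.
MERGE (cassidy)-[:FROM]->(bangkok)
RETURN cassidy, bangkok
```

Renaming a city now means updating one node, not every person who lives there.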
Import Data
Go to the gist, copy the script and execute it in Neo4j Browser.
From each line of the CSV file we will get 1 Call node, potentially 2 Person nodes and potentially 2 City nodes. Why “potentially”? Because we should not duplicate City and Person nodes for the same entity; we create each one only once and then reuse the node, adding more relationships. There should be only one Bangkok city in the graph, and this “single instance” requirement should hold for all cities, persons and calls.
The MERGE command helps us avoid node duplication. If the specified structure (a node or a combination of nodes and relationships) does not exist in the database, it will be created; otherwise the existing structure is matched and nothing new is created.
Since each row represents one call and we need all of them, we can simply CREATE calls without any risk of duplication within a single run. But running the script twice will create the same Call nodes again. You can try different scenarios to understand the idea.
Once again, the source code of the import script is here. To be able to execute multiple statements at once, enable that feature in the Browser settings. After a successful execution you should see this result message:
The import added 12005 labels, one for each node, and created 12005 nodes. All good:
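To see the difference between MERGE and CREATE without touching the tutorial data, here is a small experiment. The :Demo label and the 'merge-test' name are hypothetical and exist only for this sketch.

```cypher
// Hypothetical :Demo label, used so the experiment stays apart from the tutorial data.
// Run this statement several times: MERGE creates the node once and then only matches it.
MERGE (d:Demo {name: 'merge-test'})
WITH d
MATCH (n:Demo)
RETURN count(n) AS demoNodes
```

If you replace MERGE with CREATE, the count grows by one on every run, which is exactly what happens to Call nodes when the import script is executed twice. Clean up afterwards with MATCH (n:Demo) DETACH DELETE n.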
10K of calls + 2K of persons + 5 cities = 12005 nodes
The 36005 properties in total come from 3 properties on each of the 10K Calls, 3 properties on each of the 2K Persons and 1 name property on each of the 5 Cities:
3 x 10K + 3 x 2K + 5 = 36005 properties
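You can check these numbers against your own database. The query below is a sketch that assumes every node carries exactly one label, as in our model; it counts nodes and properties per label.

```cypher
// Count nodes and their properties, grouped by label.
MATCH (n)
RETURN labels(n) AS label,
       count(*) AS nodes,
       sum(size(keys(n))) AS properties
ORDER BY nodes DESC
```

If the import went as described, the node counts should add up to 12005 and the property counts to 36005.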
But the number of created relationships is something I want you to explain yourself. What formula gives us 22K relationships? Think about it and find the correct explanation.
Hint: the graph schema should help. Count relationships based on the number of nodes and the relationships between them. Execute the command: CALL db.schema.visualization()
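Once you have your own explanation, one way to verify it is to count the relationships per type. This sketch uses only built-in functions and makes no assumption about the relationship names.

```cypher
// Count relationships grouped by their type.
MATCH ()-[r]->()
RETURN type(r) AS relType, count(*) AS relationships
ORDER BY relType
```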
Cypher basics
Cypher is a declarative query language that borrows basic concepts and clauses from SQL and adds graph-specific functionality. The main idea to understand is the concept of graph pattern matching.
There are no tables to store our data; instead, we have nodes, and each node knows its own type through its node label. You can imagine an infinite flat surface with all the nodes placed on it somehow. Each of these nodes is marked with a label, (:City), (:Person) or (:Call), and connected to other nodes by the relationships [FROM], [OUT] and [IN]. So when we want to find something, we describe a pattern of these nodes and relationships. The concept of patterns is easy to understand by comparing the basics of SQL and Cypher queries.
In SQL we say: give me this SELECTed information FROM this table. In Cypher we say: RETURN the data that MATCHes this pattern. So the Cypher pattern plays the role of FROM, and RETURN plays the role of SELECT.
When we want to combine data from many sources in SQL, we use JOINs. In Cypher there are no tables and no joins, only nodes connected by relationships. So for us this is just a more complex pattern that expresses everything we want to RETURN.
When we need to filter, in both SQL and Cypher we simply use the WHERE clause. In Cypher a condition can be not only a property comparison such as name = “Robert”, but also an additional pattern that is checked together with the existing MATCH pattern.
Now that we know the idea, the next thing we need is the basic syntax for building simple MATCH patterns.
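As a small illustration of this comparison, here is the same question in both languages. The SQL table name person is hypothetical, and the Cypher query assumes the Person nodes store their name in a property called name.

```cypher
// SQL:    SELECT name FROM person LIMIT 10;
// Cypher: the pattern (p:Person) plays the role of FROM, RETURN plays the role of SELECT.
MATCH (p:Person)
RETURN p.name AS name
LIMIT 10
```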
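For example, where SQL would join a person table to a city table, Cypher simply extends the pattern. This sketch assumes the name property on both labels and writes the [FROM] relationship without an arrow, so it matches whichever direction the import script used.

```cypher
// SQL would JOIN person to city; in Cypher the relationship is part of the pattern.
MATCH (p:Person)-[:FROM]-(c:City)
RETURN p.name AS person, c.name AS city
LIMIT 10
```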
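A sketch of both kinds of filter in one WHERE clause, assuming the name property on Person and City. Whether a Robert from Bangkok exists in the data is not guaranteed, so the result may be empty.

```cypher
MATCH (p:Person)
// A plain property condition...
WHERE p.name = 'Robert'
  // ...combined with an additional pattern used as a filter.
  AND (p)-[:FROM]-(:City {name: 'Bangkok'})
RETURN p
```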
Node Syntax
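A node is written in round brackets. The sketch below lists the common forms as comments and then uses the most specific one in a query; the name property is an assumption about the imported data.

```cypher
// ()                            any node
// (p)                           any node, bound to the variable p
// (:Person)                     any node with the label Person
// (p:Person)                    a Person node, bound to the variable p
// (p:Person {name: 'Cassidy'})  a Person node with a matching property value
MATCH (p:Person {name: 'Cassidy'})
RETURN p
```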
Relationship Syntax
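A relationship is written as dashes with an optional arrow and an optional part in square brackets. The forms below are standard Cypher; the query underneath leaves out the arrow because the direction of [OUT] in the imported graph is an assumption I prefer not to make here.

```cypher
// (a)--(b)             any relationship, in either direction
// (a)-->(b)            any relationship from a to b
// (a)-[:OUT]->(b)      a relationship of type OUT from a to b
// (a)-[r:OUT]->(b)     the same, bound to the variable r
// (a)-[:OUT|IN]-(b)    a relationship of type OUT or IN, in either direction
MATCH (p:Person {name: 'Cassidy'})-[r:OUT]-(c:Call)
RETURN type(r) AS relType, c
LIMIT 5
```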
Cypher examples
It is important to mention Cypher's distinctive approach to aggregation. You don't need an explicit GROUP BY clause, and you can apply as many aggregation functions at once as you want. But keep in mind that the grouping key is the same for all aggregate functions in the clause: it is made up of the non-aggregated expressions returned next to them. You can read more details in the documentation.
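A minimal sketch of implicit grouping, assuming the name property on Person and City. The only non-aggregated expression is c.name, so it becomes the grouping key for both aggregate functions.

```cypher
MATCH (p:Person)-[:FROM]-(c:City)
// c.name is the grouping key; count() and collect() are both computed per city.
RETURN c.name AS city,
       count(p) AS people,
       collect(p.name)[..3] AS sampleNames
ORDER BY people DESC
```

Adding another non-aggregated expression to the RETURN would make the groups finer for every aggregate at once.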
Cypher Challenge
And now we come to the final part of the tutorial: the Cypher Challenge. To prove your understanding of Neo4j and Cypher theory, try to answer the questions prepared for you. Explore the imported data and write Cypher queries to find the correct results.
Please read each question carefully and work out what it is asking for. I know of a few cases where people submitted a number of calls instead of the name of a city. Use the hints provided with each question.
- How many calls were missed in May?
- Find the name of the man who received a call from Tiffany in May.
- Find the city with the lowest number of internal calls (calls within the same city).
- How many women from Pattaya received calls from Bangkok men?
- On April 25, find the woman with the lowest total duration of conversations.
- How many pairs of people called each other?
I’ve created a Google Form to make this tutorial more interactive and useful for you. I hope you can solve at least 4 of 6 questions. Good luck!
Outro
Thanks for reading. If you are interested in the next steps in Neo4j and graph database modeling, I recommend looking at the documentation below. Ping me if you need any help, and clap-clap-clap if you found this tutorial useful. Have fun with graphs!