Find Circular Money Flow with Neo4j

Vlad Batushkov
Neo4j Developer Blog
10 min readMay 27, 2020

--

How to use the subgraph technique to detect the circular money flow (money laundering) in the Neo4j Database of a finance system

Photo by Allie Smith on Unsplash

Goal

This article is written for educational purposes to share cool stuff about Neo4j graph data modelling, subgraph technique, and Cypher query building. Our goal is to illustrate a type of money laundering activity called “Circular money flow”, of a very simplified financial system using a graph database. This post provides a conceptual view of how this can be approached using graph techniques. It can be useful study-case material for general purposes for beginners, who are already familiar with basic concepts of the Neo4j Database.

Database Schema

We will use a financial system to represent how money transfer from one business entity to another. Business entities involved in this “money-game” are split into three types: Client, Company, and ATM.

The domain model initially based on the explanations from the article “How to generate a huge financial graph with money laundering patterns?” written by M grin. Thank you, M grin, for a well-explained domain.

Entity Nodes

(:Company) — a business entity, that can send and receive money.
(:Client) — a business entity, that can send and receive money.
(:ATM) — a business entity, used as an exit point for the money, so ATM cannot be a money sender.

In a real system, nodes would likely contain specific information about entities, such as identifiers and other attributes. However, for the purposes of this example, we are only concerned about the existence of nodes. In this tutorial, we don’t need any properties on our nodes.

Clients, Companies, and ATM and the money flows between them

The arrows represent money transfers. From the diagram, you may think, that Transactions can be relationships in our graph. Well, it is a possible option, but, I do not recommend you use relationships to represent Transactions in this instance. The Transaction should be a node, and here is why.

Data modelling of Transactions

Transaction — is a noun. A good old definition of a node for data modelling.

From the practical experience, it is also a valid choice. We want to manage them as independent entities. For example, querying all Transaction that were made in the last month is a trivial task on a :Transaction node label:

MATCH (t:Transaction)
WHERE t.timestamp.month = date().month - 1
RETURN t

Last, but not least: Indexes. The index is a well-known feature of Node’s properties and unfortunately non-trivial story for Relationships.

Transaction Node & Relationships

(:Transaction) — an entity, that represents an operation of money transfer between two members. It is a self-sufficient fact, that entity “A” sends money to entity “B”. Relationships between (A) — (Transaction) — (B) helps us to understand who is a sender and who is a receiver.

[:OUT] relationship type — connects the entity of the source node and Transaction. Relationship means that Entity is a money sender.
[:IN] relationship type — connects the entity of a destination node and Transaction. Relationship means that Entity is the receiver of money.

CALL db.schema.visualization()
Initial Schema in Neo4j Database

Reminder: ATM used as an exit point for the money, so ATM cannot be a money sender.

The Transaction node has 2 properties: the amount of money and timestamp of operation. This is important information for our task because we need to know how much money was sent and when it happened.

Visualization of a graph

Now, knowing all of the domain details, we can imagine our graph. Business entity nodes are connected to each other via Transaction nodes. It is also worth mentioning that between the same entities there can be many transactions.

Clients, Companies, and ATM send money to each other using transfer money Transactions.

Task Requirements

Circular money flows is a type of “money laundering” activity in which the money is sent to different accounts, and sent back to the starting node after a certain number of hops. Our goal is to find all of these circular flows in the graph. We don’t need to find extremely long chains, so we will limit our investigations to 10 transactions in the money flow.

Each participant in the money flow may send a marginally less money in the next transaction compare to the previous one. The total “payment” for the whole travel should not exceed some % loss of the source amount.

The flow of transactions will expect that each “next” transaction happens only after the “previous” one has completed. Transactions should be ordered by the creation timestamp.

Example of circular money flow with 3% loss, looped at the Red client

Graph Generation

Now, the practical part. First of all, we need an empty Neo4j Database on a local machine. I will use my Neo4j docker boilerplate to build a simple Docker image from the official one, containing the APOC & GDS plugins.

Simple way is to pull ready-to-use blank image from my docker hub profile and build it.

docker pull vladbatushkov/neo4j-apoc-gds:latestdocker run -p 7474:7474 -p 7473:7473 -p 7687:7687 --name=money1 vladbatushkov/neo4j-apoc-gds:latest

Important to setup enough space in the heap and page cache to be able to process a huge amount of data. My image configured for next settings.

ENV NEO4J_dbms_memory_heap_initial__size=4G
ENV NEO4J_dbms_memory_heap_max__size=4G
ENV NEO4J_dbms_memory_pagecache_size=2G

You can also build totally same image and configure it manually doing this.

Dockerfile

Build and run a custom docker container.

docker build . -t=money:dev --no-cachedocker run -p 7474:7474 -p 7473:7473 -p 7687:7687 --name=money1 money:dev

Data Generation

Now let’s talk about how to generate the required graph. The generation of a graph data set is always an interesting task to solve. In this particular case, I will use a well-known APOC Graph Generator procedure.

apoc.generate.er(noNodes, noEdges, label, type)
// generates a random graph according to the Erdos-Renyi model

My approach to building a required graph is based on the following steps:

  1. Generate N nodes (Client) with M random relationships (TRANSFER_TO) between them— our initial set of nodes and relationships of business entities.
  2. Partition them between Clients, Companies, and ATM .
  3. Break the previously created relationship TRANSFER_TO into the Transaction node structure, as follows:
 ()-[OUT]->(:Transaction)<-[:IN]-()

The final thing is to clean up the TRANSFER_TO relationships and add Indexes on Transaction properties. Do not forget to turn on the multi-statement mode in your browser.

The script is idempotent, you can run it many times with parameters to fit your needs. Parameters I used are:

500 000 Entities of

  • 5 000 ATMs
  • 50 000 Companies
  • 445 000 Clients

5 000 000 Transactions of

  • amount of money is a random value up to 1000 conventional units
  • timestamp is a random value from today minus up to 1000 minutes earlier
MATCH (n)-[:OUT]->(t:Transaction)<-[:IN]-(m) RETURN n,t,m LIMIT 500
Company (green), Client (khaki), ATM (red), Transaction (orange)

Solution idea

We now need to write a query to find all the chains, starting and ending at the same Client node. In addition, the amount of money on each Transaction node within the chain should not lose more than x% of the source amount. The number of transactions involved in the flow can be up to 10, and they should be ordered by time.

The solution we are thinking of will involve building a proper path pattern.

(c1)->(t1)->(e2)->(t2)->(e3)->(t3)->...->(t9)->(e10)->(t10)->(c1)

Here we have start and end in the same Client (c1), in the middle with some Client or Company nodes (e2, e3,…), and in total the longest path can have up to 10 Transactions (t1, t2, …).

(tj)->(ej)->(tj+1)

Each neighbouring Transactions should match the following conditions:

  • t1.timestamp < t2.timestamp
  • t1.amount > t2.amount
(t1)->...->(tN)

First and last Transactions also should match these additional conditions:

  • (t1.amount — tN.amount) / t1.amount < X
  • start and ends in the same Client: (t1)<-[:OUT]-(:Client)-[:IN]->(tN)

As you can see, it would be a hard problem to query with so many conditions and node types involved: Transactions and Business entities. We need to think about some smart way to minimize the complexity of the query. We need to think about how actually we can solve this problem.

Subgraph

The main idea behind a subgraph technique — to reshape the graph through adding new relationships. This can help us query the data easier. Our solution query going to work with a subgraph of original graph; with additional structures created to solve our specific problem. This approach is very useful and can be used in many other scenarios.

What is the money flow? It is a chain of Transactions. Let’s forget about business entities for a while and actually find-out, that all we need is actually just Transactions. We can build a new “network” of only Transactions, that we didn’t have before (our subgraph). We can then explore that “transaction network”, writing a query to detect circular money flows.

A subgraph of Transactions on the top of actual database entities

Now let’s build a subgraph by introducing a HOP relationship.

[:HOP] Relationship

The [:HOP] relationship between Transactions exists for the following conditions:

  • Transaction A { timestamp } earlier than Transaction B { timestamp }
  • Transaction B { amount } less than Transaction A { amount }
  • Transaction B { amount } compare to Transaction A { amount } not less than a defined loss percentage

Without the last condition, the subgraph becomes wide-range acceptable. You can use the same subgraph and apply different loss percentages in your future query. The disadvantage of this approach is that it gives us more relationships, and as a result query performance overhead.

In this generation script, I set the exact value of loss percentage, based on my needs. I define a loss as less than 25%, assuming that in my data 25% loss is still an acceptable value for circular money flow. Why it is so big? I have a random graph, and, assume, that a 10% loss for 10 hops is statistically a jackpot for randomly generated data. With a 25% loss, I’ll have a better chance to find a long-chain result.

MATCH (t1:Transaction)-[:HOP*2..10]->(t2:Transaction) 
RETURN t1,t2 LIMIT 500
Transaction hop chain

Solution Query

Now it is time to write an actual query to solve the task. Once again, a reminder of the circular money flow requirements: the same Client in the beginning and at the end, time-ordered and match less than 25% loss each progressive traversal.

As you can see, the query is really short and clear.

It is worth highlighting the MATCH pattern used to find a path from first to the last Transaction via the :HOP relationships — this is only one valid money flow, that match all chain requirements. Thanks to the :HOP relationship, we can be sure, that not only first and last Transactions satisfy requirements, but also all Transactions in between:

path = (t1:Transaction)-[:HOP*3..10]->(t2:Transaction)

The PROFILE statement is used only to look at the execution plan. Here is the query result and execution plan for 5 449 818 Nodes and 15 996 048 Relationships. Circular money flow result with longest circular money flow of 8 Transactions.

Example of Circular Money Flow result. Transaction (red), Client (green), Company (yellow).
761571154 total db hits in 1041340 ms

It takes around 17 minutes to analyse the whole graph. But do we need to analyze everything?

Use-case example

Depends on your situation you may think about query changes and apply more ideas, how to use this query.

For example, when a new Transaction created, we can run a query against only that Transaction. We will verify, that this money not came from circular money flow, organized by Client, who received money (incoming Transaction).

In this story, I have nothing except id of Node to use as a filter, so, my query change will be looks like this.

PROFILE MATCH path = (t1:Transaction)-[:HOP*3..10]->(t2:Transaction)
WHERE ID(t2) = 2801989 AND ...
...

I catch a potential scammer and it was really fast.

171 total db hits in 94 ms

Conclusion

I hope this story gives you a good understanding that graphs are a very good tool in these types of situations, and a perfect place to apply your creativity. Pick a small problem and try to solve it with a graph. If you need any help — ping me in LinkedIn, I would gladly chat with you about graphs.

Thanks for reading and enjoy Neo4j. Clap-clap-clap to catch all the scammers.

--

--

Vlad Batushkov
Neo4j Developer Blog

Engineering Manager @ Agoda. Neo4j Ninja. Articles brewed on modern tech, hops and indie rock’n’roll.