Analyse Neo4j Graph of Books domain model with Jaccard Similarity Algorithm

May 27: Library Day, One Month Graph Challenge

7 min readMay 27, 2019

Welcome word

In this series of small posts I do one simple graph daily. Domain model of graph somehow related to day’s history, some historical event, celebration or person. I do this challenge to learn Neo4j Data Modeling and Cypher. Every day. One month. Follow me. Maybe you will be inspired and next month would be yours One Month Graph Challenge. #OMGChallenge

Domain model

Today there are not much world wide events happened. And after additional research I decide to stop on best, from my point of view, celebration:
All-Russian Library Day, celebrated annually on May 27.

I think, it is a good possibility to talk about Russian Literature and many other great creations, that did huge impact on whole world culture.

My small graph would be based on books-related domain model. I want to try to find any interesting numbers and then try to expand relationships based on this mined data. Maybe also will try to find some useful recommendations for myself, based on my reading interests. Let’s see how far it will go.

Graph

First of all, let’s load HTML and parse data with help of APOC function from quite cute book online service I found. As you can see below, I decide to collect books from top pf Russian literature, “Best by century” and my lovely Sci-Fi genre.

UNWIND [
"113",//best russian literature
"16",//best of 19th
"6",//best of 20th
"7",//best of 21st
"82083",//best millennium
"114319"//best sci-fi of 20th
] as url
CALL apoc.load.html("https://www.goodreads.com/list/show/" + url, { main: "table.tableList.js-dataTooltip tbody tr td a.bookTitle", author: "table.tableList.js-dataTooltip tbody tr td a.authorName", score: "table.tableList.js-dataTooltip tbody tr td span.greyText.smallText.uitext" }) YIELD value
WITH collect(value) as data, url
WITH { main: data[2].main, author: data[1].author, score: data[0].score } as root, url
WITH [ x IN range(0, length(root.main) - 1) | { title: root.main[x].text, url: root.main[x].attributes.href, author: root.author[x].text, scoreBase: root.score[x].text  }] as result
UNWIND result as item
WITH item.title as title, item.url as url, item.author as author, apoc.text.regexGroups(item.scoreBase, "(\\d+.\\d+)(.+ )(\\d+,*\\d+)")[0] as groups
WITH title, url, author, toFloat(groups[1]) as score, toInteger(apoc.text.replace(groups[3], ",", "")) as readersNum
MERGE (b:Book { title: title })
ON CREATE SET b.url = url, b.score = score, b.readersNum = readersNum
ON MATCH SET b.url = url, b.score = score, b.readersNum = readersNum
MERGE (a:Author { name: author })
MERGE (b)-[:WRITTEN_BY]->(a)

Interesting case to mention, if you go with something like this you will got into a troubles:

MERGE (b:Book { title: title , url: url, score: score, readersNum: readersNum })

And here explanation why: same book can appear in several categories. And between html parsing of this book, people on web-site high can actively adding more scores and “reads”. This is why in sereval ms same book can already have new value of score and readersNum. And lead to new book node as we set all 4 property at once.

MERGE (b:Book { title: title })
ON CREATE SET b.url = url, b.score = score, b.readersNum = readersNum
ON MATCH SET b.url = url, b.score = score, b.readersNum = readersNum

This approach gives me correct result. I match in MERGE by title only and then update book properties with new values from website, whatever are they changed or not.

I have 521 books by 315 authors. Still not enough for me. I want to know genres for every book. This is why I include url property in previous query. Let’s add top 5 genres specified for each book.

Next query is high time consuming. Be sure you have cup of tea nearby.

MATCH (b:Book)
CALL apoc.load.html("https://www.goodreads.com" + b.url, { data: "div.left a.actionLinkLite.bookPageGenreLink" }) YIELD value
UNWIND value.data[..5] as item
MERGE (g:Genre { name: item.text })
MERGE (b)-[:OF]->(g)

Let’s see on authors with biggest number of books in classics genre.

MATCH (g:Genre)<-[:OF]-(b:Book)-[:WRITTEN_BY]->(a:Author)
WHERE g.name = "Classics"
RETURN a.name as author, count(b) as books
ORDER BY books DESC
LIMIT 10

I want to know the who have biggest amount of readers. But because we have here a big difference of books per each author, I think, it is good to compare them by some sum of readers only for N books. This N is an average number of books for author. This approach must be much more fair.

So, once again: I will find an average number of books (N) in graph, cut authors with less books, plus will take only N books from those, who have more books in this data. Then I will find sum of readers of these books and compare.

MATCH (g:Genre)<-[:OF]-(b:Book)-[:WRITTEN_BY]->(a:Author)
WHERE g.name = "Classics"
WITH g, a, count(b) as booksNum
WITH avg(booksNum) as booksAvg
WITH toInteger(ceil(booksAvg)) as avg
MATCH (g:Genre)<-[:OF]-(b:Book)-[:WRITTEN_BY]->(a:Author)
WHERE g.name = "Classics"
WITH a, count(b) as booksNum, collect({ readersNum: b.readersNum }) as readers, avg
WHERE booksNum >= avg
WITH a.name as author, readers, avg
WITH author, apoc.coll.sortMaps(readers, 'readersNum')[..avg] as topReaded
WITH author, reduce(r = 0, x IN topReaded | r + x.readersNum) as readersSum
RETURN author, readersSum
ORDER BY readersSum DESC
LIMIT 15

This is great to see, that Leo Tolstoy and Fyodor Dostoyevsky have really big readers auditory even at US located resource. Classics, what to say.

Now I want to come back to genres and analyse them a little bit more.

For example, top 10 author working in many genres:

MATCH (g:Genre)<-[:OF]-(b:Book)-[:WRITTEN_BY]->(a:Author)
WITH a.name as author, g.name as genre
WITH author, collect(DISTINCT genre) as genres
WITH author, size(genres) as genresNum
RETURN author, genresNum
ORDER BY genresNum DESC
LIMIT 10

Much more interesting to find authors by similarity of genres they both work with. To solve this task we can use Jaccard Similarity algorithm.

MATCH (g:Genre)<-[:OF]-(b:Book)-[:WRITTEN_BY]->(a:Author)
WITH { item: id(a), categories: collect(id(g)) } as authorData
WITH collect(authorData) as data
CALL algo.similarity.jaccard.stream(data) YIELD item1, item2, count1, count2, intersection, similarity
WITH algo.asNode(item1).name AS author1, algo.asNode(item2).name AS author2, intersection, similarity
WHERE similarity > 0.6
RETURN author1, author2, intersection, similarity
ORDER BY intersection DESC, similarity DESC
LIMIT 10

You can play with ORDER BY statements, put similarity into the first place to find pairs of authors similar by genres. Or, for example, count of pairs of totally “genre similar” authors.

...
WHERE similarity = 1.0
RETURN count(similarity) as c

Now, based on this information we can link authors with similarity relationship.

MATCH (g:Genre)<-[:OF]-(b:Book)-[:WRITTEN_BY]->(a:Author)
WITH { item: id(a), categories: collect(id(g)) } as authorData
WITH collect(authorData) as data
CALL algo.similarity.jaccard.stream(data) YIELD item1, item2, count1, count2, intersection, similarity
WITH algo.asNode(item1) AS author1, algo.asNode(item2) AS author2, intersection, similarity
WHERE similarity > 0.5
MERGE (author1)-[:SIMILAR_TO { value: similarity }]->(author2)

I want to find some similar authors in Sci-Fi genre, very close to Ray Bredbery and Philip K. Dick and good review score from readers.

MATCH (a:Author)-[s:SIMILAR_TO]->(aa:Author)<-[:WRITTEN_BY]-(b:Book)
WHERE (a.name = "Ray Bradbury" OR a.name = "Philip K. Dick") AND b.score > 4.0 AND s.value > 0.5
WITH aa.name as author, s.value as similarity, collect(b) as books
WITH author, similarity, apoc.coll.sortMaps(books, 'score') as books
RETURN author, similarity, [x IN books | x.title] as books
ORDER BY similarity DESC

Nice. Looks like I found some interesting books to read in the next month. If you have any idea about to use existing domain — share in comments!

Resume

Today I did pretty simple, but useful graph analysis. Books are wonderful. Graph of books wonderful twice.

Using the fact, that this is my blog, I want to give personal advice for those who love Sci-Fi, if you not mind: Ubik by Philip K. Dick and The Invincible by Stanislav Lem. Have a good reading!

Resources

APOC User Guide 3.5

apoc.path.expandConfig(startNode |Node|list…

neo4j-contrib.github.io

4.5. Mathematical functions - numeric - Chapter 4. Functions

These functions all operate on numeric expressions only, and will return an error if used on any other values. See also…

neo4j.com

8.1. The Jaccard Similarity algorithm - Chapter 8. Similarity algorithms

This section describes the Jaccard Similarity algorithm in the Neo4j Graph Algorithms library. Jaccard Similarity…