Non-trivial APOC Load HTML example to build 007 Neo4j Graph

May 28: James Bond, One Month Graph Challenge

Vlad Batushkov
4 min readMay 28, 2019

--

Welcome word

In this series of small posts I do one simple graph daily. Domain model of graph somehow related to day’s history, some historical event, celebration or person. I do this challenge to learn Neo4j Data Modeling and Cypher. Every day. One month. Follow me. Maybe you will be inspired and next month would be yours One Month Graph Challenge. #OMGChallenge

Domain model

Today is a birthday of Ian Lancaster Fleming. English writer, author of novels about James Bond. The world well-known character of agent 007 already has been represented everywhere: films, television, radio, comics and video games. Now it is time for Neo4j.

I plan to build a small graph of Bondiana movies. You can say, hey, another one topic about scores, ranks, best and worst? Definitely maybe.

After simple analysis of 24 movies about agent 007 (excluding upcoming movie number 25) I want to figure out why we love James Bond so much.

Graph

Data first. Need data? Start with Wikipedia.

WITH "https://en.wikipedia.org/wiki/James_Bond" as url
CALL apoc.load.html(url, { title: "table.wikitable tbody tr td:eq(0) a" }) YIELD value
UNWIND value.title[63..87] as item
MERGE (m:Movie { title: item.text })

Now, knowing films, I can append this nodes with more rich data from IMDB.

WITH "https://www.imdb.com/list/ls006405458/" as url
CALL apoc.load.html(url, { title: "div.lister-item-content h3.lister-item-header a:eq(1)", score: "div.lister-item-content div.ipl-rating-star.small span:eq(1)", bond: "div.lister-item-content p a:eq(2)", year: "div.lister-item-content span.lister-item-year", votes: "div.lister-item-content span[name=nv]:eq(1)" }) YIELD value
WITH collect(value) as data
WITH { title: data[3].title, score: data[0].score, bond: data[4].bond, year: data[1].year, votes: data[2].votes } as root
WITH [ x IN range(0, length(root.title) - 1) | { title: root.title[x].text, score: toFloat(root.score[x].text), bond: root.bond[x].text, year: apoc.text.regexGroups(root.year[x].text, "\\d{4}")[0][0], votes: toInteger(apoc.text.replace(root.votes[x].text, ",", "")) }] as result
UNWIND result as item
WITH item
WHERE item.bond <> "David Niven" AND item.bond <> "John Huston" AND item.title <> "Bond 25" // exclude some movies
MATCH (m:Movie { title: item.title })
SET m.year = toInteger(item.year), m.score = item.score, m.votes = item.votes
MERGE (a:Actor { name: item.bond })
MERGE (m)-[:BOND]->(a)

Here our 30 nodes of Bonds and movies, where they appeared. Let’s find score for each James Bond as avg score of all scores of his movies.

MATCH (m:Movie)-[:BOND]->(a:Actor)
WITH a, avg(m.score) as avg, sum(m.votes) as votes
WITH a, apoc.math.round(avg, 1) as avg, votes
SET a.score = avg
RETURN a.name as bond, a.score as score, votes
ORDER BY score DESC

Looks like, first and last Bond are the best. But this is not all the research.

Ok. What about the movies. For example, best movie for each Bond?

MATCH (m:Movie)-[:BOND]->(a:Actor)
WITH a, max(m.score) as max
MATCH (m:Movie { score: max })-[:BOND]->(a:Actor)
RETURN a.name as bond, m.title as movie, max as score
ORDER BY score DESC

Casino Royale. Damn. Seems like, I didn’t watch this movie. And it is a big blunder.

I don’t like the thing, that in this graph so little relationships. I want more. And need to find anything possible to link together. For example, let’s link all movies with NEXT relationship into one chain by year it’s released. As a diff property I will setup the difference between next movie release.

MATCH (m:Movie)
WITH collect(m) as m
WITH apoc.coll.sortMaps(m, '^year') as movies
UNWIND apoc.coll.pairsMin(movies) as value
MATCH (m1:Movie { title: value[0].title })
MATCH (m2:Movie { title: value[1].title })
MERGE (m1)-[:NEXT { diff: value[1].year - value[0].year }]->(m2)

Now I can represent the whole chain of Bond and their Movies.

MATCH (a1:Actor)<-[b1:BOND]-(m1:Movie)-[n:NEXT]-(m2:Movie)-[b2:BOND]->(a2:Actor)
RETURN *

And also find the longest period without James Bond new series.

MATCH (m1:Movie)-[n:NEXT]->(m2:Movie)
RETURN m1.year as from, m2.year as to, max(n.diff) as numberOfYearsWithoutBond
ORDER BY numberOfYearsWithoutBond DESC
LIMIT 1

Resume

That’s it. Today I built a super small graph and even invent few relationships to have at least something, to be able to spend my learning time on it.

Hope you find this stuff easy to try experimenting by yourself. Cheers.

Resources

--

--

Vlad Batushkov

Engineering Manager @ Agoda. Neo4j Ninja. Articles brewed on modern tech, hops and indie rock’n’roll.