Working with Neo4j Cypher Aggregating function AVG

May 20: Jeans, One Month Graph Challenge

Vlad Batushkov
6 min readMay 20, 2019

--

Welcome word

In this series of small posts I do one simple graph daily. Domain model of graph somehow related to day’s history, some historical event, celebration or person. I do this challenge to learn Neo4j Data Modeling and Cypher. Every day. One month. Follow me. Maybe you will be inspired and next month would be yours One Month Graph Challenge. #OMGChallenge

Domain model

May 20 is considered the official birthday of jeans. The history of denim has more than 200 years. Not one generation has grown up in these practical pants. The first patented jeans were released in America by a man named Levi Strauss in 1850 and sold at a price of 1 dollar 46 cents per pair.

Now, after 200 years, in era of internet, what is the main criteria to buy jeans: price or review score? Nice area to discover. So, today my plan is to grab initial information about jeans and build a simple but realistic part of “Sales” domain model on it.

Graph

Let’s parse jeans data from the amazon web-site. Sad to confess, but I am not a popular person, and by mentioning this or that website or brand in my blog posts I don’t advert anything. Only one reason why I do this — I just enjoy Cypher.

WITH "https://www.amazon.com/Mens-Jeans/b?ie=UTF8&node=1045564" as url
CALL apoc.load.html(url, { info: "div#mainResults div.s-item-container div.a-row.a-spacing-none.s-color-subdued", brand: "div#mainResults div.s-item-container div.a-row.a-spacing-small.a-grid-vertical-align.a-grid-center img" }) YIELD value
WITH collect(value) as data
WITH { info: data[1].info, brand: data[0].brand } as root
WITH [ x IN range(0, 29) | { info: root.info[x].text, brand: root.brand[x].attributes.title }] as result
UNWIND result as item
WITH apoc.text.regexGroups(item.info, "(.+)( \\$ )(\\d\\d \\d\\d)(.+)( out of 5 stars )([\\d,]+)") as group, item.brand as brand
WITH { name: apoc.text.replace(group[0][1], " from", ""), price: toFloat(apoc.text.replace(group[0][3], " ", ".")), score: toFloat(group[0][4]), reviewsNum: toInteger(apoc.text.replace(group[0][6], ",", "")), brand: brand } as item
MERGE (b:Brand { name: item.brand })
WITH item, b
CREATE (j:Jeans { name: item.name, price: item.price })
WITH item, b, j
MERGE (j)-[r:OF]->(b)
ON CREATE SET r.score = item.score, r.reviewsNum = item.reviewsNum

You can remember my day number 5, pretty same thing, yeah. But today my appetite become a little bit bigger.

Btw, I can explain magic of number 29. I faced broken jquery pattern again, this issue limited the amount of data. Also I failed with consistency for regex to fill all the properties. You can check by yourself and maybe help me with it. Anyway, I have enough results for the next step.

Let’s calculate average score of Brand and save this value in Brand node. I already know some average score based on amount of reviews for each jeans product. So, calucation formula should be like this:
(j1.s * j1.n + … + jN.s * jN.n)/ (j1.n + … + jN.n) = avgScore

For example, 10 reviews with score 5, 20 reviews with score 4, then the average score is: (10 * 5 + 20 * 4) / (10 + 20) = 4.33

MATCH (b:Brand)<-[r:OF]-()
WHERE r.score > 0
WITH sum(r.reviewsNum * r.score) as scoresSum, sum(r.reviewsNum) as reviewsSum, b
WITH scoresSum / reviewsSum as avgScore, b
SET b.avgScore = apoc.math.round(avgScore, 3)
RETURN b.name, avgScore
ORDER BY avgScore DESC

Another way to find this number is to use avg() function, but to do that, I need to append database schema with appropiate amount of reviews first. We can imagine that jeans, as a product, have a reviews posted by some persons. This is would be quite realistic database schema.

db schema

For simplicity, I will assume, that every review created by unique person and all scores are the same (current average score of product). For sure, to do this properly, I need to grab real data, N persons set X, M persons set Y, so average is Z. But… As you understand, I just need a proper databse model to be able to apply avg() function, so I don’t really want to spend time on load of real data.

This query create reviews for Wrangler brand.

MATCH (j:Jeans)-[r:OF]->(:Brand { name: "Wrangler" })
WHERE r.score > 0
WITH range(0, r.reviewsNum - 1) as reviews, r.score as score, j
UNWIND reviews as review
CREATE (p:Person { uuid: apoc.create.uuid() })-[:POST]->(r:Review { text: "I bought this jeans.", score: score })
MERGE (r)<-[:HAVE]-(j)

Now I can easely find avg score for each jeans product.

MATCH (b:Brand { name: "Wrangler" })<-[:OF]-(j:Jeans)-[:HAVE]->(r:Review)
RETURN j.name as name, avg(r.score) as score
ORDER BY score DESC

And average score easy to find for the whole Brand, but now, based on score of every review of brand’s product.

MATCH (b:Brand { name: "Wrangler" })<-[:OF]-(j:Jeans)-[:HAVE]->(r:Review)
RETURN b.name as name, avg(r.score) as score
ORDER BY score DESC

As expected, results for both approaches are totally same. Good practice.

Now let’s go back to prices and scores. I think, it is make sens to compare score of Brand vs avg score of all Brands. And jeans score with other jeans score of same brand. So, then I can see the most detailed picture to make a decision.

MATCH (b:Brand)
WITH apoc.math.round(sum(b.avgScore) / count(b), 2) as totalAvgScore
MATCH (b:Brand)<-[:OF]-(j:Jeans)
WITH b, apoc.math.round(sum(j.price) / count(j), 2) as brandAvgPrice, totalAvgScore
MATCH (b:Brand)<-[r:OF]-(j:Jeans)
WHERE r.score > 0
WITH j.name + "(" + b.name + ")" as jeans, j.price as price, brandAvgPrice, totalAvgScore, apoc.math.round(b.avgScore, 2) as brandAvgScore, r.score as jeansScore
RETURN jeans, toString(brandAvgScore) + " (" + toString(totalAvgScore) + ")" as globalBrandCompare, toString(jeansScore) + " (" + toString(brandAvgScore) + ")" as brandJeansCompare, toString(price) + " (" + toString(brandAvgPrice) + ")" as priceBrandCompare
ORDER BY price ASC, jeansScore DESC

To be true, “sum(b.avg Score) / count(b)” not really fair. Much more better to calculate global average score of all scores based on all reviews number. I definitely need all reviews, not only for Wranglers…

So what about my choice at the end of the end? I think, based on this results, I will buy “Men’s 501 Original-Fit Jean(Levi’s)” for $11.55. It is double time cheaper, than average. Not $1.46, unfortunately, but quite a good review score.

Resume

Ok, let’s conclude, what we have today: no algorithms, a little bit of schema, a little bit of queries. Maybe you can continue this topic and apply something really interesting, but my sandclocks and imagination for today are empty.

I just touched very small and simple things. As you can see relationships like “Brand — Product — Review — Person” is very common building block and can be used widely in any other “Sales” area. Maybe you can add more useful entities and make this domain model more fun. See you tomorrow.

Resources

--

--

Vlad Batushkov

Engineering Manager @ Agoda. Neo4j Ninja. Articles brewed on modern tech, hops and indie rock’n’roll.