= Getting lots of optional nodes in a single Cypher query
= Graph Initialization

I'm using Neo4j 2.0.1 with Cypher over the Batch REST API.

//hide
//setup
//output
[source,cypher]
----
CREATE (u:user {id: "u1", name: "Bill"})-[:CONTACT]->(c:contact {id: "c1"})
CREATE (u)-[:ADDRESS]->(:address {id: "a1", city: "New York"})
CREATE (u)-[:ADDRESS]->(:address {id: "a2", city: "San Francisco"})
CREATE (u)-[:PHONE]->(:phone {id: "p1", number: "555-1234"})
CREATE (c)-[:PHONE]->(:phone {id: "p2", number: "555-3456"})
CREATE (c)-[:PHONE]->(:phone {id: "p3", number: "555-7890"})
CREATE (c)-[:CREATED]->(:activity {id: "h1", createdat: 1000})<-[:USERACTIVITY]-(:contact {id: "c2"})
CREATE (c)-[:HISTORY]->(:activity {id: "h2", createdat: 2000})<-[:USERACTIVITY]-(:contact {id: "c3"})
----
= Simplified Data Model

Here is what the simplified data model looks like. I have a `user` node connected to a `contact` node. Both the `user` and `contact` nodes can have zero or more phone numbers and addresses associated with them. Each `contact` node also points to `activity` nodes that capture when the contact was created and when it was last edited. I'm trying to pull all of this data back in a single query to reduce round trips to the database.
// graph
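As a quick illustration of the optional fan-out described above (a sketch against the sample data from the setup block, not part of the original question):

[source,cypher]
----
// Each contact may have zero or more phones; OPTIONAL MATCH keeps
// contacts that have none, and COLLECT ignores the resulting nulls.
MATCH (u:user { id: "u1" })-[:CONTACT]->(c:contact)
OPTIONAL MATCH (c)-[:PHONE]->(p:phone)
RETURN c.id AS contact, COLLECT(p.number) AS numbers
----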
= Current Query

Here is the current query I'm using, but I'm sure it can be improved quite a bit. In my production queries, I use parameters instead of embedding the `id` values directly.
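For reference, the parameterized form mentioned above looks like this in Neo4j 2.0-era Cypher (a minimal sketch; `userid` is a placeholder name assumed here, supplied in the `params` map of the REST request body):

[source,cypher]
----
// The id literal becomes a {userid} parameter placeholder,
// which lets Neo4j cache and reuse the query plan.
MATCH (u:user { id: {userid} })-[:CONTACT]->(c:contact)
RETURN c.id
----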
//output
[source,cypher]
----
MATCH (u:user { id: "u1" })
WITH u
MATCH (u)-[:CONTACT]->(c:contact)
WITH u, c
OPTIONAL MATCH (c)-[:CREATED]->(xca:activity)<-[:USERACTIVITY]-(xcc:contact)
OPTIONAL MATCH (c)-[:HISTORY]->(xcu:activity)<-[:USERACTIVITY]-(xuc:contact)
OPTIONAL MATCH (c)-[:PHONE]->(xp:phone)
OPTIONAL MATCH (c)-[:ADDRESS]->(xa:address)
OPTIONAL MATCH (u)-[:PHONE]->(xup:phone)
OPTIONAL MATCH (u)-[:ADDRESS]->(xua:address)
WITH DISTINCT c AS x, u,
     COLLECT(DISTINCT xp) AS xps,
     COLLECT(DISTINCT xa) AS xas,
     COLLECT(DISTINCT xup) AS xups,
     COLLECT(DISTINCT xua) AS xuas,
     xca.createdat AS createdat,
     xcu.createdat AS updatedat,
     {id: xcc.id} AS createdby,
     {id: xuc.id} AS updatedby
RETURN COLLECT({
  id: x.id,
  name: COALESCE(u.name, x.name),
  createdat: createdat,
  createdby: createdby,
  updatedat: updatedat,
  updatedby: updatedby,
  phones: (CASE WHEN size(xps) = 0
                THEN NULL
                ELSE [xp IN xps | {id: xp.id, number: xp.number}]
           END),
  userphones: (CASE WHEN size(xups) = 0
                    THEN NULL
                    ELSE [xup IN xups | {id: xup.id, number: xup.number}]
               END),
  addresses: (CASE WHEN size(xas) = 0
                   THEN NULL
                   ELSE [xa IN xas | {id: xa.id, city: xa.city}]
              END),
  useraddresses: (CASE WHEN size(xuas) = 0
                       THEN NULL
                       ELSE [xua IN xuas | {id: xua.id, city: xua.city}]
                  END)
}) AS r
----
//table
= Results

The current performance I'm seeing is pretty dismal. If I remove all of the OPTIONAL MATCH clauses and just return the `contact` nodes, the query is about 4x faster, but still much too slow. When I use a similar query to return 1000 contact nodes, it starts to crawl (even after running it multiple times to make sure everything is cached).

I have indexes set up on both contacts and users by `id`. I've used the PROFILE command to see if anything looks off, but I see the indexes being used and no _rows_ or _db_hits_ counts above 3000. My database currently has about 12000 nodes in it, 1100 of which are in the result set I'm trying to return.
I've had problems with slow disks before, but since this is a read operation there should be no locks or disk syncs causing the performance issue. I've also moved to a new server: a 24-core Xeon with 128 GB of memory, with the database sitting by itself on six enterprise SSDs in RAID 10. I'm currently using the default settings for Neo4j, but since my graph is so small I'm not sure that should matter.
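For anyone reproducing the measurement above, PROFILE is simply prefixed to the query (a trimmed sketch; the body stands in for the full query and is not the complete original):

[source,cypher]
----
// Prefixing PROFILE makes Neo4j return an execution trace with
// rows and db_hits counters for each operator in the plan.
PROFILE
MATCH (u:user { id: "u1" })-[:CONTACT]->(c:contact)
OPTIONAL MATCH (c)-[:PHONE]->(xp:phone)
RETURN c.id, COLLECT(DISTINCT xp) AS xps
----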
= Question

There is obviously something wrong with the query pattern I'm using. How should I query for nodes when many of the relationships are optional? There are places throughout my database where nodes can optionally be connected to lots of other nodes (which was the primary reason for switching to a graph database in the first place). Thanks!
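For context, the restructuring most often suggested for this kind of pattern is to aggregate each optional branch before matching the next one, so the intermediate row count never multiplies across independent OPTIONAL MATCHes. A sketch under that assumption, covering only two of the branches (not a verified fix):

[source,cypher]
----
MATCH (u:user { id: "u1" })-[:CONTACT]->(c:contact)
OPTIONAL MATCH (c)-[:PHONE]->(xp:phone)
// Aggregate phones down to one row per contact before the next match,
// so the address rows do not multiply against every phone row.
WITH u, c, COLLECT(xp) AS xps
OPTIONAL MATCH (c)-[:ADDRESS]->(xa:address)
WITH u, c, xps, COLLECT(xa) AS xas
RETURN c.id AS id,
       COALESCE(u.name, c.name) AS name,
       [xp IN xps | {id: xp.id, number: xp.number}] AS phones,
       [xa IN xas | {id: xa.id, city: xa.city}] AS addresses
----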
//console