= Getting lots of optional nodes in a single Cypher query
= Graph Initialization

I'm using Neo4j 2.0.1 with Cypher over the Batch REST API.

//hide
//setup
//output
[source,cypher]
----
CREATE (u:user {id: "u1", name: "Bill"})-[:CONTACT]->(c:contact {id: "c1"})
CREATE (u)-[:ADDRESS]->(:address {id: "a1", city: "New York"})
CREATE (u)-[:ADDRESS]->(:address {id: "a2", city: "San Francisco"})
CREATE (u)-[:PHONE]->(:phone {id: "p1", number: "555-1234"})
CREATE (c)-[:PHONE]->(:phone {id: "p2", number: "555-3456"})
CREATE (c)-[:PHONE]->(:phone {id: "p3", number: "555-7890"})
CREATE (c)-[:CREATED]->(:activity {id: "h1", createdat: 1000})<-[:USERACTIVITY]-(:contact {id: "c2"})
CREATE (c)-[:HISTORY]->(:activity {id: "h2", createdat: 2000})<-[:USERACTIVITY]-(:contact {id: "c3"})
----
= Simplified Data Model

Here is what the simplified data model looks like. I have a `user` node connected to a `contact` node. Both the `user` and `contact` nodes can have zero or more phone numbers and addresses associated with them. Each `contact` node also points to `activity` nodes that capture when the contact was created and when it was last edited. I'm trying to pull all of this data back in a single query to reduce round trips to the database.
// graph
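As a quick illustration of the optional fan-out described above (a sketch against the sample data from the setup block, not part of the original question):

[source,cypher]
----
// Each contact may have zero or more phones; OPTIONAL MATCH keeps
// contacts that have none, and COLLECT ignores the resulting nulls.
MATCH (u:user { id: "u1" })-[:CONTACT]->(c:contact)
OPTIONAL MATCH (c)-[:PHONE]->(p:phone)
RETURN c.id AS contact, COLLECT(p.number) AS numbers
----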
= Current Query

Here is the current query I'm using, but I'm sure it can be improved quite a bit. In my production queries, I use parameters instead of embedding the `id` values directly.
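For reference, the parameterized form mentioned above looks like this in Neo4j 2.0-era Cypher (a minimal sketch; `userid` is a placeholder name assumed here, supplied in the `params` map of the REST request body):

[source,cypher]
----
// The id literal becomes a {userid} parameter placeholder,
// which lets Neo4j cache and reuse the query plan.
MATCH (u:user { id: {userid} })-[:CONTACT]->(c:contact)
RETURN c.id
----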
//output
[source,cypher]
----
MATCH (u:user { id: "u1" })
WITH u
MATCH (u)-[:CONTACT]->(c:contact)
WITH u, c
OPTIONAL MATCH (c)-[:CREATED]->(xca:activity)<-[:USERACTIVITY]-(xcc:contact)
OPTIONAL MATCH (c)-[:HISTORY]->(xcu:activity)<-[:USERACTIVITY]-(xuc:contact)
OPTIONAL MATCH (c)-[:PHONE]->(xp:phone)
OPTIONAL MATCH (c)-[:ADDRESS]->(xa:address)
OPTIONAL MATCH (u)-[:PHONE]->(xup:phone)
OPTIONAL MATCH (u)-[:ADDRESS]->(xua:address)
WITH DISTINCT c AS x, u,
     COLLECT(DISTINCT xp) AS xps,
     COLLECT(DISTINCT xa) AS xas,
     COLLECT(DISTINCT xup) AS xups,
     COLLECT(DISTINCT xua) AS xuas,
     xca.createdat AS createdat,
     xcu.createdat AS updatedat,
     {id: xcc.id} AS createdby,
     {id: xuc.id} AS updatedby
RETURN COLLECT({
  id: x.id,
  name: COALESCE(u.name, x.name),
  createdat: createdat,
  createdby: createdby,
  updatedat: updatedat,
  updatedby: updatedby,
  phones: (CASE WHEN size(xps) = 0
                THEN NULL
                ELSE [xp IN xps | {id: xp.id, number: xp.number}]
           END),
  userphones: (CASE WHEN size(xups) = 0
                    THEN NULL
                    ELSE [xup IN xups | {id: xup.id, number: xup.number}]
               END),
  addresses: (CASE WHEN size(xas) = 0
                   THEN NULL
                   ELSE [xa IN xas | {id: xa.id, city: xa.city}]
              END),
  useraddresses: (CASE WHEN size(xuas) = 0
                       THEN NULL
                       ELSE [xua IN xuas | {id: xua.id, city: xua.city}]
                  END)
}) AS r
----
//table
= Results

The current performance I'm seeing is pretty dismal. If I remove all of the OPTIONAL MATCH clauses and just return the `contact` nodes, the query is about 4x faster, but still much too slow. When I use a similar query to return 1000 contact nodes, it starts to crawl (even after running it multiple times to make sure everything is cached).

I have indexes set up on both contacts and users by `id`. I've used the PROFILE command to see if anything looks off, but I see the indexes being used and no _rows_ or _db_hits_ counts above 3000. My database currently has about 12000 nodes in it, 1100 of which are in the result set I'm trying to return.
I've had problems with slow disks before, but since this is a read operation there should be no locks or disk syncs causing the performance issue. I've also moved to a new server: a 24-core Xeon with 128 GB of memory, with the database sitting by itself on six enterprise SSDs in RAID 10. I'm currently using the default settings for Neo4j, but since my graph is so small I'm not sure that should matter.
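For anyone reproducing the measurement above, PROFILE is simply prefixed to the query (a trimmed sketch; the body stands in for the full query and is not the complete original):

[source,cypher]
----
// Prefixing PROFILE makes Neo4j return an execution trace with
// rows and db_hits counters for each operator in the plan.
PROFILE
MATCH (u:user { id: "u1" })-[:CONTACT]->(c:contact)
OPTIONAL MATCH (c)-[:PHONE]->(xp:phone)
RETURN c.id, COLLECT(DISTINCT xp) AS xps
----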
= Question

There is obviously something wrong with the query pattern I'm using. How should I query for nodes when many of the relationships are optional? There are places throughout my database where nodes can optionally be connected to lots of other nodes (which was the primary reason for switching to a graph database in the first place). Thanks!
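For context, the restructuring most often suggested for this kind of pattern is to aggregate each optional branch before matching the next one, so the intermediate row count never multiplies across independent OPTIONAL MATCHes. A sketch under that assumption, covering only two of the branches (not a verified fix):

[source,cypher]
----
MATCH (u:user { id: "u1" })-[:CONTACT]->(c:contact)
OPTIONAL MATCH (c)-[:PHONE]->(xp:phone)
// Aggregate phones down to one row per contact before the next match,
// so the address rows do not multiply against every phone row.
WITH u, c, COLLECT(xp) AS xps
OPTIONAL MATCH (c)-[:ADDRESS]->(xa:address)
WITH u, c, xps, COLLECT(xa) AS xas
RETURN c.id AS id,
       COALESCE(u.name, c.name) AS name,
       [xp IN xps | {id: xp.id, number: xp.number}] AS phones,
       [xa IN xas | {id: xa.id, city: xa.city}] AS addresses
----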
//console