Seven Databases: Neo4j and misunderstanding indexes

The Neo4j chapter of Seven Databases in Seven Weeks has a short discussion of indexing (starting on p241 of the P1.0 version of the PDF). It misled me into thinking there were two types of index, when really there are just two ways to query an index.

The book creates an index named authors by simply adding a key-value-node triple to it. It says that the resulting index is key-value or hash style, and shows that the node can be retrieved by supplying the key and value:

curl http://localhost:7474/db/data/index/node/authors/name/P.G.+Wodehouse

It contrasts this with what it calls a full-text search inverted index, which must be created with some configuration before we can add entries to it. Then it shows how we can use Lucene syntax to query this index (named fulltext):

curl http://localhost:7474/db/data/index/node/fulltext?query=name:P*

The implication is that this sort of prefix query is only supported by indexes created in this way.
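
For reference, the "some configuration" step can be sketched roughly as follows. This is a minimal sketch that assumes the legacy REST endpoint for creating node indexes described in the Neo4j documentation of the time, and the same local server URL used in the examples here. The helper names index_config and create_index are my own, not from the book.

```shell
# Hypothetical helper: build the JSON body that asks for a Lucene-backed
# index of a given type. No JSON escaping is done, so keep names simple.
index_config() {
  # $1 = index name, $2 = index type ("exact" or "fulltext")
  printf '{"name":"%s","config":{"type":"%s","provider":"lucene"}}' "$1" "$2"
}

# Hypothetical helper: send that body to the (assumed) index-creation
# endpoint on a local server.
create_index() {
  curl -X POST http://localhost:7474/db/data/index/node \
    -H "Content-Type: application/json" \
    -d "$(index_config "$1" "$2")"
}
```

Calling create_index fulltext fulltext would then request an index like the book's, while an index such as authors gets created implicitly the first time something is added to it.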

I thought that full-text search was more than just prefix searching, and so had a look at the Neo4j documentation. It turns out that both indexes can be queried with Lucene query syntax:

$ curl 'http://localhost:7474/db/data/index/node/fulltext?query=name:P*' > /tmp/a
$ curl 'http://localhost:7474/db/data/index/node/authors?query=name:P*' > /tmp/b
$ diff -s /tmp/a /tmp/b
Files /tmp/a and /tmp/b are identical
$ cat /tmp/a # to prove they didn't fail...
[ {
... snip ...
  "data" : {
    "genre" : "British Humour",
    "name" : "P.G. Wodehouse"
  },
... snip ...
} ]

and that you can even do fuzzy matching on both:

$ curl 'http://localhost:7474/db/data/index/node/authors?query=name:P.G.\+Woodhouse~' > /tmp/c
$ curl 'http://localhost:7474/db/data/index/node/fulltext?query=name:P.G.\+Woodhouse~' > /tmp/d
$ diff -s /tmp/a /tmp/c
Files /tmp/a and /tmp/c are identical
$ diff -s /tmp/a /tmp/d
Files /tmp/a and /tmp/d are identical
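
The compare-two-indexes routine above can be wrapped up in a couple of small functions. This is only a sketch: query_index and same_results are hypothetical names, the server URL is the one assumed throughout this post, and curl's -G and --data-urlencode options are used so that characters such as * and ~ are percent-encoded rather than left to the shell and the server to interpret. Note that with this encoding a literal + reaches Lucene as a plus sign rather than being decoded as a space, so queries may need writing slightly differently from the raw URLs above.

```shell
# Hypothetical helper: run a Lucene query against a named node index.
query_index() {
  # $1 = index name, $2 = Lucene query string (unencoded)
  curl -s -G "http://localhost:7474/db/data/index/node/$1" \
    --data-urlencode "query=$2"
}

# Hypothetical helper: succeed only if two indexes return identical output
# for the same query.
same_results() {
  # $1, $2 = index names, $3 = Lucene query string
  [ "$(query_index "$1" "$3")" = "$(query_index "$2" "$3")" ]
}
```

For example, same_results authors fulltext 'name:P*' && echo identical repeats the first comparison without temporary files.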

The index manager screen on the Neo4j webadmin tool shows the differences between the two indexes:

authors
{"type":"exact"}
fulltext
{"to_lower_case":"true", "type":"fulltext"}

and their meanings are given here: http://docs.neo4j.org/chunked/milestone/indexing-create-advanced.html. Both indexes use Lucene and are full-text search inverted indexes. Only the tokenizers and case-sensitivity differ. We can see the effect of case-sensitivity on the running example:

$ curl 'http://localhost:7474/db/data/index/node/authors?query=name:p.g.*'
[ ]
$ curl 'http://localhost:7474/db/data/index/node/fulltext?query=name:p.g.*'
[ {
... stuff ...
} ]

Demonstrating the tokenization difference requires a new example, as the "P.G.+Wodehouse" value we’ve been working with will be tokenized identically by both the KeywordTokenizer (used by "type":"exact") and the WhitespaceTokenizer (used by "type":"fulltext"). We’ll need a value that includes whitespace:

$ curl -X POST http://localhost:7474/db/data/index/node/fulltext \
-H "Content-Type: application/json" \
-d '{ "uri" : "http://localhost:7474/db/data/node/0",
"key" : "name_space", "value" : "P.G. Wodehouse"}'

$ curl -X POST http://localhost:7474/db/data/index/node/authors \
-H "Content-Type: application/json" \
-d '{ "uri" : "http://localhost:7474/db/data/node/0",
"key" : "name_space", "value" : "P.G. Wodehouse"}'

and a query on just the surname:

$ curl 'http://localhost:7474/db/data/index/node/authors?query=name_space:woodhouse~'
[ ]
$ curl 'http://localhost:7474/db/data/index/node/fulltext?query=name_space:woodhouse~'
[ {
... stuff ...
} ]
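
To see why the surname only matches in one index, here is a rough shell approximation of the two analyses. It is not Lucene, and the function names are made up for illustration; it just assumes that "exact" keeps the whole value as one case-preserved term while "fulltext" lower-cases the value and splits it on whitespace.

```shell
# Hypothetical approximation of "type":"exact": the whole value is a
# single term and its case is preserved.
exact_terms() {
  printf '%s\n' "$1"
}

# Hypothetical approximation of "type":"fulltext": lower-case the value,
# then emit one term per whitespace-separated word.
fulltext_terms() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr -s '[:space:]' '\n'
}

# Succeed if the given analysis of a value produces exactly this term.
has_term() {
  # $1 = analysis function, $2 = indexed value, $3 = term to look for
  "$1" "$2" | grep -qxF -- "$3"
}
```

Under these assumptions, has_term fulltext_terms 'P.G. Wodehouse' wodehouse succeeds, whereas has_term exact_terms 'P.G. Wodehouse' wodehouse fails because the only term stored is the complete string.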

Well, I think that clears that up. The chapter mistakenly makes a distinction between key-value or hash style indexes and full-text search inverted indexes. Instead, there are two ways of querying an index: one (which we saw first) that could be called key-value or hash style, and the one we’ve concentrated on, which I’ll call query style. Underneath, the two indexes are largely the same; they differ only in tokenization and case-sensitivity.

This small section of the book still leaves plenty to ponder though: developers need to manage adding nodes to and removing them from the index themselves, and the index keys and values aren’t restricted to properties of the nodes or edges being indexed…
