Implementing Full Text Search on Google App Engine
Despite being a product of search giant Google, App Engine doesn’t yet provide built-in support for full-text searching of substantial strings in your datastore entities. There are a few approaches to building your own, which involve using inequality filters to match on the start of a string, or ListProperties to hold lists of terms garnered from your text (as long as you stay within the limits of allowable indexes for a given entity). However, if you want to index larger documents or support more advanced search features like faceting and scoring, you may find yourself scratching your head. Unfortunately, the sandbox environment of GAE also restricts your ability to employ third-party open source search solutions like Lucene.
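As a rough sketch of those two home-grown approaches (not taken from any particular library): a prefix match can be written as a range over the property, and a term list can be built by tokenising the text before saving it in a list property. The helper names prefix_bounds and extract_terms, and the model and property names in the comments, are illustrative assumptions.

```python
import re

# U+FFFD is the sentinel conventionally appended to form the upper bound
# of a prefix range.
PREFIX_SENTINEL = u"\ufffd"


def prefix_bounds(prefix):
    """Return (lower, upper) so that lower <= value < upper matches
    (nearly all) strings that start with `prefix`."""
    return prefix, prefix + PREFIX_SENTINEL


def extract_terms(text, min_length=2):
    """Lower-case the text, split it on non-alphanumeric characters and
    return the sorted, de-duplicated terms for storing in a list property."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return sorted(set(w for w in words if len(w) >= min_length))


# Hypothetical usage inside an App Engine handler (Article, title, body and
# terms are made-up names):
#   lower, upper = prefix_bounds(u"app")
#   db.GqlQuery("SELECT * FROM Article WHERE title >= :1 AND title < :2",
#               lower, upper)
#   article.terms = extract_terms(article.body)
```

This keeps everything inside the datastore, but there is no stemming, scoring or faceting, which is where the approach runs out of road.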
Native full text search functionality will no doubt come to App Engine in due course. But in the meantime my solution has been to use a remote-hosted Solr instance from WebSolr and a slightly modified version of PySolr to get the job done. Why PySolr rather than other Python-based interface packages like Haystack or Sunburnt? The simple reason is that none of these will work out of the box on App Engine, and PySolr was the simplest of them all to modify for my (relatively modest) needs. You can grab a copy of the PySolr code modified for App Engine if you want it.
Here’s a quick overview of my setup in case you are looking to do something similar. I use Django as my framework, so your specifics may vary.
- Put a copy of PySolrGAE in your app directory so you’ll be able to import the module into your views as needed.
- Add the following variables in your settings:
SOLR_PATH = 'http://index.websolr.com/solr/yourkey/'
SOLR_BATCH_SIZE = 100
MAX_RESULT_SIZE = 100
(obviously, your values will differ!)
- Set up a schema document (XML) and put it up on your Solr instance so it knows which fields you will be passing to it and how it should tokenise, stem and otherwise work its magic on the text within them. The Solr documentation is pretty good, so it is easy to pick up.
- Import the module (e.g. from apps.search.pysolrGAE import Solr) into your views and use it to interface with your Solr instance. The ‘readme’ included with the modified PySolr code gives an overview of the syntax for adding, deleting and modifying entries in your index. I’ve managed to set up views to delete the index, re-create it, and return results which are then passed to a template. You can also set up a hook in the ‘save’ method of your models to incrementally add, modify or delete index entries depending on what you’ve done to a particular entity.
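A minimal sketch of the incremental add/delete hook mentioned in the last step. It assumes the modified client keeps the method names of stock PySolr, namely add() taking a list of dicts and delete(id=...); check the readme for the actual syntax. The names entity_to_doc, sync_entity and batches, and the field names, are hypothetical.

```python
def entity_to_doc(key, title, body):
    """Build the flat dict that will be sent to Solr for one entity.
    Field names must match your own schema; these are placeholders."""
    return {"id": str(key), "title": title, "body": body}


def sync_entity(solr, key, title=None, body=None, deleted=False):
    """Mirror one datastore change into the index.

    `solr` is any object exposing add(list_of_dicts) and delete(id=...),
    which is assumed (not verified) to match the modified PySolr client.
    """
    if deleted:
        solr.delete(id=str(key))
        return "deleted"
    solr.add([entity_to_doc(key, title, body)])
    return "indexed"


def batches(items, size):
    """Yield successive slices of `items`, e.g. with size=SOLR_BATCH_SIZE
    when re-creating the whole index."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

In a model's save method you might call something like sync_entity(Solr(settings.SOLR_PATH), self.key(), self.title, self.body) once the datastore write has succeeded; the constructor call and attribute names here are assumptions too.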
One of the nice things about Solr is that you can pass it a field which will not be indexed but is stored alongside an entry. You can get this field returned as part of a query response. Hence, you can set up an HTML-rendered version of the search result snippet for a particular entry and pass it to Solr at the time you add the entry to the index. Then, when you run a query, you can get that field back and simply pass it through to your template. This saves you a round trip to the datastore to get a copy of the entity for presentation. Sweet!
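Here is one way the stored-snippet trick could look, as a sketch only. It assumes a schema field, here called snippet_html (a made-up name), declared with indexed="false" and stored="true", and it assumes query results come back as dicts keyed by field name. The helper names are hypothetical as well.

```python
from xml.sax.saxutils import escape


def render_snippet(title, url, summary):
    """Render the small block of HTML shown for one search hit,
    escaping the text so it is safe to drop into a page."""
    return u'<div class="hit"><a href="%s">%s</a><p>%s</p></div>' % (
        escape(url, {'"': "&quot;"}), escape(title), escape(summary))


def build_doc(key, title, url, body):
    """Document to send to Solr: `body` is searched, while `snippet_html`
    is only carried along (stored, not indexed, in the schema)."""
    return {
        "id": str(key),
        "body": body,
        "snippet_html": render_snippet(title, url, body[:200]),
    }


def snippets_from_results(results):
    """Pull the pre-rendered HTML out of an iterable of result dicts,
    skipping any hit that lacks the field."""
    return [r["snippet_html"] for r in results if "snippet_html" in r]
```

The list returned by snippets_from_results can go straight into the template context. The trade-off is that a change to the snippet markup means re-indexing the affected entries.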
Paul 23:43 on Saturday, August 15, 2009 Permalink
I have put the slashes in, but it is still throwing the same error!!
/first-page/
Ben 16:24 on Monday, August 17, 2009 Permalink
Hmmm. When I was hunting around I saw a couple of other possible causes for this, but I suspect you’ve already come across those. From memory, I eventually found the source of my problem by playing around in the admin panel: after changing the page title a few times and trying to view the page from the admin panel, I realised Django wasn’t rendering the URL correctly. That at least narrowed it down to a pattern/URL problem rather than there being no communication with the flatpages app at all. Maybe a similar approach would help you narrow down the source of your issue.
Raj 13:14 on Thursday, May 27, 2010 Permalink
That was totally it! Thanks!
Scott Crosby 2:57 on Saturday, June 12, 2010 Permalink
Thanks! You saved me hours :-)
Michael 16:06 on Thursday, July 15, 2010 Permalink
I just came across your post and found a different problem with the same symptoms, so I wanted to post a comment to help out anyone in the future who stumbles across this: I received the same error because I had added “127.0.0.1:8000” as a separate site, rather than editing “example.com”, so my site ID was 2, rather than 1. Instead of modifying my settings.py file to have SITE_ID = 1, I went to the shell and changed the localhost site to have an id of 1, and then it worked.
Duy 19:29 on Tuesday, October 26, 2010 Permalink
Ha! Thanks a lot. This is what I’m looking for!
Srinivasa 1:30 on Friday, December 10, 2010 Permalink
The solutions provided by Michael and Ben work in different scenarios, but both are right. Thank you for the solutions. This will help people who overlook a few things while reading the second edition of ‘Practical Django Projects’.
Patrick 22:05 on Wednesday, February 9, 2011 Permalink
After days of debugging and trying to figure out why “get_object_or_404()” in django/contrib/flatpages/views.py was returning “No FlatPage matches the given query” I stumbled across this page. Thank you! Why is flatpages not more forgiving! (why does it not just append a slash if it doesn’t exist!!)
Ben 7:28 on Thursday, February 10, 2011 Permalink
Glad it helped Patrick. I’ve not looked at what it does for this sort of error, but you can add APPEND_SLASH = True to your settings.py to auto-append trailing slashes to incoming urls. This will only work if you have django.middleware.common.CommonMiddleware installed in your middleware settings, but it might sort out the slash-sensitive flatpage issue.
Nai 21:49 on Friday, March 4, 2011 Permalink
Had the same problem, different solution. I had to add ‘django.contrib.flatpages.middleware.FlatpageFallbackMiddleware’, to my middleware in settings.py
Hope this helps someone too
Mark Andrews 10:57 on Saturday, July 23, 2011 Permalink
thanks, man! that was driving me nuts!
Vitalii 23:57 on Sunday, August 28, 2011 Permalink
Very, very, very big THANKS to Michael!!!!!
Nima 6:41 on Saturday, December 3, 2011 Permalink
Hi
I have same problem
I set SITE_ID to 1
and also deleted example.com
and I am sure about slashes :)
but still same problem :(
mert ozcan 3:48 on Sunday, January 15, 2012 Permalink
I realized the book is not clear about the directory where we should place the default.html file.
It’s basically like this: ..cms/first-page/flatpages/default.html