Search can make or break your website. Sunspot and Solr give you an intuitive engine that maps directly to your Ruby objects.
As your site amasses content, be it stories, SKUs, or statistics, a tailored, effective, and exacting search engine becomes increasingly vital. Imagine a bookstore that doesn’t index its tomes by title or author, or a clothing retailer that doesn’t index garments by size. Without search, each site is useless. In general, the quality and relevance of search results make or break a site.
Typically, at least some search results are generated by the site’s underlying database. A database can maintain and catalog enormous volumes of data. Thus, an inquiry for an obscure piece of content would likely fall to the database, since it’s the canonical repository. Complex, multivariate queries may also fall to the database, as it’s designed specifically for the purpose.
However, a database query can be slow. Like any engineering feat, a database has strengths and weaknesses, and one size rarely fits all. Hence, it’s also typical for other actors to provide search results. For instance, a content management application might use an entirely separate engine to index the prose and respond to keyword, phrase, and proximity searches.
Of course, it’s also de rigueur for an application to maintain one or more RAM-resident indices to speed common lookups and preclude repetitive computations. Additionally, software such as memcached provides surrogate memory extending across multiple machines.
Each technique—the (relational or object) database, special-purpose engine, and in-memory data structures—is a viable option; whether one, some, or all of the approaches are valid depends on the application at hand.
You are no doubt familiar with MySQL, SQLite, Oracle, and any number of other database packages. Each is capable, and there are scores of strategies to tune performance, from writing efficient queries to tuning the database host’s I/O subsystem. You are also likely familiar with memcached, Squid, and other helpful proxies (both figurative and literal) that offload work from the database and application server. You can find plenty of books on those subjects.
And what about the specialized search engine? Certainly, there are plenty of commercial text indexers (FAST, Documentum) and even a few open source solutions (Zebra, Lucene, Sphinx), but what if you need to index 350,000 auto parts, 50,000 email messages, or 4,000 grocery items?
If you code in Ruby, Sunspot is an ideal solution for any of those inventories. Created by Mat Brown, Sunspot is an intuitive and expressive domain-specific language for indexing and searching Ruby objects. Sunspot is powered by Solr, the open source enterprise search server based on Lucene. Solr can highlight matches, replicate its index across many servers, shard indices, and, for the purposes of this discussion, perform advanced full-text and faceted searches.
Wanted: Indexing Engine. Apply Within.
To search your data with Sunspot, you create an index for one or more classes, populate the index, and then search using any of the indexed fields. Here’s an example use of Sunspot to index books for sale.
# A book includes instance variables for
# the author, a title, a publisher, an edition, a 10- and 13-digit
# ISBN number, a blurb, a publication date, and a price.
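The comment above describes the shape of a Book. As a minimal sketch, assuming a plain Ruby Struct is enough for illustration, such a class might look like the following. The attribute names other than title, isbn10, and isbn13 are assumptions, and the sample values are invented.

```ruby
# Hypothetical Book class. Attribute names such as :published_at
# and :blurb are assumptions chosen to match the description above.
Book = Struct.new(
  :author, :title, :publisher, :edition,
  :isbn10, :isbn13, :blurb, :published_at, :price,
  keyword_init: true
)

# Invented sample data, for illustration only.
book = Book.new(
  author:       'A. N. Author',
  title:        'The Zaphod Papers',
  publisher:    'Example Press',
  edition:      1,
  isbn10:       '0000000000',
  isbn13:       '0000000000000',
  blurb:        'Zaphod explains everything.',
  published_at: Time.utc(2000, 1, 1),
  price:        12.50
)

puts "#{book.title} (#{book.publisher}), edition #{book.edition}"
```

Any object that exposes these attributes as methods would serve equally well; the Struct is simply the shortest way to get readers and writers for every field.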
To create an index, call Sunspot.setup(class) with a block, where class is the Ruby class to catalog and the block is a list of Sunspot declarations (part of Sunspot’s domain-specific language) that describe how Solr should treat one, some, or all class attributes.
Sunspot.setup(Book) do
  string :isbn10, :isbn13
  string :sort_title do
    title.downcase.sub(/^(an?|the)\W+/, '') if title = self.title
  end
end
Keywords such as string, float, and time are part of the Sunspot DSL and define Solr field types. Most of the type keywords are self-explanatory, but string and text require a little clarification. Use the former for values (like a UPC code) where full-text indexing doesn’t make sense; use the latter for full-text fields.
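To make the distinction between string and text concrete, here is a toy sketch in plain Ruby. It does not use Sunspot or Solr, and the helper names string_match? and text_match? are hypothetical. It only mimics the idea that a string field matches as a whole value while a text field is broken into words that can be matched individually. Solr’s real text analysis is considerably richer.

```ruby
# Toy illustration only; not Sunspot's or Solr's implementation.

# A "string" field matches only when the whole value is identical.
def string_match?(stored, query)
  stored == query
end

# A "text" field is split into lowercase words, and a query matches
# when every query word appears among them.
def text_match?(stored, query)
  tokens = stored.downcase.scan(/\w+/)
  query.downcase.scan(/\w+/).all? { |word| tokens.include?(word) }
end

isbn  = '0000000000'                           # invented value
blurb = 'Zaphod steals a starship and flees.'  # invented value

exact_hit    = string_match?(isbn, '0000000000')   # whole value
partial_hit  = string_match?(isbn, '00000')        # fragment of the value
keyword_hit  = text_match?(blurb, 'starship')      # one word
multi_hit    = text_match?(blurb, 'ZAPHOD flees')  # two words, any case
keyword_miss = text_match?(blurb, 'robot')         # word not present

p [exact_hit, partial_hit, keyword_hit, multi_hit, keyword_miss]
```

Under these assumptions, an ISBN behaves like an identifier that is either right or wrong, while a blurb behaves like prose that a shopper searches by keyword.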
The code above builds an index for all attributes of a Book, since a consumer might want to search or sort on any of those values. The code also adds a virtual attribute, :sort_title, to help improve the readability of search results. You can search the entire text of :title and sort results either by :title or :sort_title.
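The expression behind :sort_title can be tried on its own, outside of Sunspot. The sketch below wraps the same downcase-and-strip logic in a standalone method (the method name and the sample titles are invented) to show how a leading “a”, “an”, or “the” is dropped so that titles sort by their first significant word.

```ruby
# Same normalization as the :sort_title block, as a plain method.
# Returns nil when there is no title, as the original guard does.
def sort_title(title)
  return nil if title.nil?
  title.downcase.sub(/^(an?|the)\W+/, '')
end

# Invented sample titles.
titles = ['The Zaphod Papers', 'A Brief Guide', 'Another Galaxy', 'An Atlas']

keys = titles.map { |t| sort_title(t) }
# "Another" keeps its leading letters because the article must be
# followed by a non-word character to be stripped.

naive_order  = titles.sort
sorted_order = titles.sort_by { |t| sort_title(t) }

puts keys.inspect
puts naive_order.inspect
puts sorted_order.inspect
```

Sorting on the raw title files “The Zaphod Papers” under T; sorting on the normalized key files it under Z, which is usually what a shopper expects.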
The next step is to populate the index with data. (The previous step merely describes which fields to index; it does not catalog any values.) Assuming you’ve created a new collection of books in the array book_list, adding the books to the index requires just two statements.
Sunspot.index(book_list)
Sunspot.commit
The first statement adds each book in book_list to the in-memory Solr index. The second statement commits the additions to the Solr instance’s persistent store.
Sunspot includes other methods to manipulate the index, too, such as remove(instances) and remove_all(classes). The former removes one or more instances from the index; a subsequent commit is required to affect the persistent index. (Optionally, you can call remove!(instances) to delete and commit in one fell swoop.) The latter removes all instances of one or more classes from the index and has an analog, remove_all!(classes), that commits implicitly.
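The split between staging a change and committing it can be easier to see in miniature. The following toy class is a sketch only. ToyIndex and its searchable method are hypothetical names, and nothing here reflects how Sunspot or Solr is implemented. It borrows the index, remove, remove!, and commit vocabulary to show that queued changes have no visible effect until a commit.

```ruby
require 'set'

# Toy model of a staged index. NOT Sunspot or Solr internals; it
# only mirrors the add/remove/commit vocabulary described above.
class ToyIndex
  def initialize
    @committed = Set.new  # what a search would see
    @pending   = []       # queued [operation, item] pairs
  end

  # Queue items for addition.
  def index(*items)
    items.flatten.each { |item| @pending << [:add, item] }
    self
  end

  # Queue items for removal.
  def remove(*items)
    items.flatten.each { |item| @pending << [:remove, item] }
    self
  end

  # Queue a removal and commit immediately.
  def remove!(*items)
    remove(*items)
    commit
  end

  # Apply every queued operation, in order, then clear the queue.
  def commit
    @pending.each do |op, item|
      op == :add ? @committed.add(item) : @committed.delete(item)
    end
    @pending.clear
    self
  end

  def searchable
    @committed.to_a
  end
end

idx = ToyIndex.new
idx.index(%w[book-1 book-2 book-3])
before_commit = idx.searchable   # additions are still pending
idx.commit
after_commit = idx.searchable
idx.remove('book-2')
after_remove = idx.searchable    # removal is still pending
idx.remove!('book-3')            # commits this and the pending removal
final = idx.searchable

p before_commit, after_commit, after_remove, final
```

Note that the bang variant flushes everything queued so far, not just its own argument; in this toy model, a commit always applies the whole pending list.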
And, naturally, there is a method to search the index on a variety of criteria.
The query above says, “Search the current index for all first edition books less than $19.99 where ‘Zaphod’ appears in a full-text field, and return the results sorted by the special sorting title.” In addition to those results, the statement facet :publisher also produces a list of publishers whose books match the criteria. You can use a facet to drill down and produce a list of matching books sold by a specific publisher.
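To see what those criteria amount to, here is the same request expressed over a small in-memory array in plain Ruby. This is a sketch of the logic only, not Sunspot’s query DSL. The sample books are invented, and a simple substring test on the title stands in for Solr’s full-text matching.

```ruby
# Invented sample inventory.
books = [
  { title: 'The Zaphod Papers',   publisher: 'Example Press', edition: 1, price: 12.50 },
  { title: 'A Day with Zaphod',   publisher: 'Sample House',  edition: 1, price: 9.99 },
  { title: 'Zaphod Returns',      publisher: 'Example Press', edition: 2, price: 8.00 },
  { title: 'An Expensive Zaphod', publisher: 'Sample House',  edition: 1, price: 24.95 },
  { title: 'The Quiet Planet',    publisher: 'Example Press', edition: 1, price: 5.00 },
  { title: 'Zaphod Again',        publisher: 'Example Press', edition: 1, price: 15.00 }
]

# Same normalization as the :sort_title field.
def sort_title(title)
  title.downcase.sub(/^(an?|the)\W+/, '')
end

# First editions under $19.99 that mention "Zaphod".
matches = books.select do |b|
  b[:edition] == 1 &&
    b[:price] < 19.99 &&
    b[:title].downcase.include?('zaphod')
end

# Sort by the special sorting title.
results = matches.sort_by { |b| sort_title(b[:title]) }

# A facet is a count of matching books per value of a field.
publisher_facet = matches.each_with_object(Hash.new(0)) do |b, counts|
  counts[b[:publisher]] += 1
end

# Drilling down narrows the results to one facet value.
drill_down = results.select { |b| b[:publisher] == 'Example Press' }

puts results.map { |b| b[:title] }.inspect
puts publisher_facet.inspect
puts drill_down.map { |b| b[:title] }.inspect
```

The facet counts are computed over the matching books rather than the whole inventory, which is what lets a storefront show only the publishers that are relevant to the current search.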