Looking for ways to overcome indexing bottlenecks at Craigslist lead to an investigation of Sphinx, a powerful, free full-text search engine that works extremely well with MySQL.
For many years, the only easy way to add decent search capability to a MySQL-backed web site was to use its full-text index support. It was fast, efficient, and reasonably configurable. But as more and more sites began to deal with larger datasets and moved from MyISAM to InnoDB, they found that it was harder to support their search needs. InnoDB does not provide full-text indexing, so that often meant keeping around a set of slaves that still ran MyISAM for the sole purpose of handling full-text search requests.
That’s exactly what we were doing at Craigslist before I began looking for a better way. My search for alternatives wasn’t motivated so much by the need to maintain the MyISAM slaves–that part was automated. The real problem is that we kept running into an invisible “glass ceiling” in performance. Low concurrency on the full-text searches eventually meant throwing far more hardware at the problem than was really necessary. Response times would go through the roof, CPU was not being overly taxed, disks were mostly idle, and there was still free memory.
As many DBAs have discovered, MyISAM does not scale well in a multi-core world. Even the mulitple key caches, which were added to partially addess this problem, are a fairly corase work around for the problem.
Sphinx was a project born out of the need for something better than MySQL’s full-text search for sites already using MySQL. It’s available both as a MySQL storage engine and as a stand-alone daemon. There are pros and cons to both deployment methods, but I’m going to focus mainly on the stand-alone daemon for this article. Not only is it what I’m most familiar with, I believe it to be far more powerful and flexible.
In fact, our migration from MySQL full-text to Sphinx has made a world of difference. We’re able to handle significantly higher query volumes with fewer machines, smoother scaling, and have a lot more features to take advantage of. So, what makes Sphinx so good? For the next few weeks, I’ll dig into Sphinx to help you understand what makes it so powerful, how it works, and how to put it to work.
Sphinx is designed to scale horizontally and take advantage of all available CPU power if needed. When Sphinx is started, a single searchd process optionally pre-loads some index data structures, binds to TCP port 3312 (the Sphinx default) and awaits connections. When a connection comes in, searchd forks a child to handle the request and goes back to waiting for new connections. The child takes the request, performs the search, and returns the results. If the client asked for a persistent connection, the searchd child awaits the next request. Otherwise it terminates.
This is the sort of “shared nothing” architecture you see in high-volume Apache/PHP or Apache/mod_perl applications. There’s no locking among the searchd processes, so it’s trivial to scale the number of concurrent requests you can process by making more CPU cores available. This design also means that a fatal bug in searchd will only affect the current process (and request) without others noticing. In other words, it’s a model that provides for excellent isolation too.
If you need to scale beyond a single machine, either to due to query volume or data size, Sphinx makes that easy using distributed indexes. You can add some trivial horizontal partitioning in your application to spread the indexes among Sphinx servers hosted on multiple machines which are the referred to as “agents.” Then you configure a sphinx instance that knows about all of the agents. That sphinx instance is what you’ll query from your application. It will, in turn, send the query to the individual agents in parallel, wait for the results to come back, combine them, and return them to you. From the application point of view, you don’t need to know if the indexes are located on a single machine or a dozen remote agents.
Of course, you can use your own vertical partitioning scheme too.
When I describe Sphinx performance, I often feel like I’m back in 2001 talking about MySQL. Sphinx is very very fast. In fact, I think most people will be shocked by how well it performs–not just in handling searches, but also when building or re-building its indexes. Part of the reason for this is Andrew’s relentless focus on performance. The Sphinx core is very efficient C++ that results in small, CPU-cache friendly binaries.
Sphinx works to pack as much data as possible into a small space, making every bit count. In fact, you can control the exact number of bits used to represent attributes of the documents that you index.
Another reason that Sphinx is so fast has to do with philosophy. Sphinx is fairly minimalistic in its approach. While it is configurable, it does not have a bewildering set of configuration options and dynamic on-the-fly options. Sphinx encourages you to sit down, define what you need, configure and test, and then deploy a highly efficient solution.
Under the Hood
At a high-level view, Sphinx provides very fast full-text search across multiple fields with custom ranking and filtering of the result set based on attributes. It supports common boolearn operators, proximity, wildcards, stemming, custom character sets, sorting, and more. But what does all that really mean?
Let’s say you’re building a Sphinx index of your email archive. You’d probably define an index with several text fields, just like you’d have columns in database: subject, body, to, from, and cc. Then you’d define a number of attributes as well–items you wish to use in filtering and/or sorting but that you don’t need to perform text searches on: date, size, attachment_count, etc. Then you’d feed all the documents (messages) to the Sphinx indexer, which takes care of breaking them into tokens and attributes, building the indexes, and any stemming, filtering, or transformation you need done.
By default, you might build a simple Google-like interface to your email archive. Type in a term or two like “finance spreadsheet”, hit the search button, and get back a list of highly relevant results. To do that behind the scenes, you’d issue an ANY query across all the fields (subject, body, to, from, and cc). Sphinx would find all of documents (messages) that match any of those terms in any of those fields, weight and sort them, and return the list of document ids (probably message ids in this example).
You have the flexibility to ask Sphinx to do the matching, weighting, sorting, and grouping in a number of different ways. The defaults are often quite sufficient for many applications
Next week I’ll walk through a simple Sphinx setup and demonstrate some features of the client API as well.
If you’re looking for a high-performance, stable, and very functional alternative to MySQL’s full-text search, give Sphinx a look. Odds are that it can do virtually everything you need and use fewer resources in the process.
Sphinx isn’t the only game in town. Apache Solr packages up the venerable Lucene search system into a very usable and highly configurable package that can run under a variety of application servers. Solr is also quite fast and is expendable via custom Java code. It has much better out of the box support for what’s known as “faceted search”–the sort of drill down into category and features you see at popular shopping web sites like NewEgg.The only downsides I saw were a steeper learning curve (mainly complexity and configuration) and the need for a bit more infrastructure if you’re not already a Java shop. Solr is definitely worth looking at.