Jeremy Zawodny Archive

Jeremy Zawodny is a software engineer at Craigslist where he works on MySQL, Search, and various back-end infrastructure. He's also the co-author of "High Performance MySQL" and blogs at http://jeremy.zawodny.com/blog/
InnoDB Performance Monitoring with innotop
Manually extracting relevant information from repeated incantations of SHOW ENGINE INNODB STATUS while trying to figure out what InnoDB is doing is not only error-prone, it's just plain hard to do. And since MySQL doesn't expose the data you really want in an INFORMATION_SCHEMA table (yet?), the best option is to use an external program to help: innotop.
My Top Resources of 2009
With the year winding to a close, it's a good time to look back and think about what some of the most interesting and useful resources were. This includes software tools, web sites, blogs, and so on.
Bang for the Buck
With the new year right around the corner, it's worth thinking about where you can get the biggest bang for your buck--quite literally. In a lot of organizations, budgeting is a funny exercise that requires you to "use it or lose it" at the end of the year while also having surprisingly detailed plans for next year's money.
MySQL Upgrade Testing
Upgrading MySQL can be a bit of a leap of faith. You hope for everything to go well but really don't want to rely on mere hope to ensure that you don't find yourself with a nasty surprise late one night.
Some Reasonable Defaults for MySQL Settings
Out of the box, MySQL isn't exactly tuned for resilience on a busy network where things occasionally go haywire.
Hacking with CouchDB
Working with CouchDB is very straightforward. There's virtually no setup involved and no complicated libraries to hassle with.
An Introduction to CouchDB
CouchDB is one of the most popular and mature document-oriented databases. Let's have a look at the features that make it so popular, get it installed, and start putting it to use.
Data By The Numbers
When dealing with large distributed systems, knowing some basic performance and failure numbers helps you understand what you can reasonably expect both in terms of performance and reliability.
NoSQL: Distributed and Scalable Non-Relational Database Systems
Non-SQL oriented distributed databases are all the rage in some circles. They're designed to scale from day 1 and offer reliability in the face of failures.
Everything is Unix
Programming in a higher-level language, it's often easy to forget about using lower-level Unix facilities in tricky situations. Here are a few examples to give you an idea of what you might be missing.
Consistent Hashing for Scaling Out
Consistent Hashing is a useful technique for horizontal scaling while also protecting yourself against future growth as well as server failures.
Expecting to Fail
How can you help to ensure availability of a web site or other network service? We'll look at some of the techniques at our disposal.
Database Storage Performance Testing in a Hurry
Sometimes you need answers to important questions quickly. When benchmarking new disk and disk-like subsystems, how can you get relevant and useful info without a lot of time?
The Curious Case of the Failing Connections, Part 2
This week we continue on the trek to locate the source of failing MySQL connections and uncover a solution.
The Curious Case of the Failing Connections
Troubleshooting is often an adventure that takes you to unexpected places and teaches you a lot along the way.
Redis: Lightweight key/value Store That Goes the Extra Mile
Need a key/value store that doesn't compromise functionality for performance? Have a look at Redis.
RethinkDB: Rethinking the Database using Modern Assumptions
As technology evolves, it's often worth re-thinking how we do things. A small group of engineers is doing just that for MySQL.
Quick and Dirty MySQL Performance Troubleshooting
What are the first things you should look at after learning of a sudden change in MySQL server performance?
PBXT: Your Next MySQL Storage Engine?
The PBXT storage engine for MySQL is nearing a stable release. What's so special about PBXT, anyway?
Understanding MySQL Appliances and Hardware Acceleration
Believe it or not, there are servers specifically designed to run MySQL -- not to mention other hardware that helps to accelerate your databases. Here's a look at what's out there.
MySQL Sandbox: Treat MySQL Instances like Virtual Machines
Install isolated side-by-side MySQL instances right the first time with this time-saving virtual manager.
Maatkit: A Great MySQL Toolbox
Here's a look at some of the most useful MySQL tools you've probably never seen (or heard of).
MySQL Performance from the Start
What are a few of the things you really need to keep in mind when starting that next big MySQL project?
bashreduce: A Bare-Bones MapReduce
Harness the power of distributed computing using everyday Unix command-line tools and a clever little bash script.
Sphinx: Queries and APIs
Now it's time to get serious and look at writing some simple code that can query a running Sphinx index and take advantage of its advanced query features.
Sphinx: Getting Practical
A continuing look at putting the Sphinx search engine to use.
Drizzle: Rethinking the MySQL Database Kernel
Drizzle is a re-thought and re-worked version of the MySQL kernel designed specifically for high-performance, high-concurrency environments. In this exclusive article, MySQL guru Jeremy Zawodny takes an inside look at the goals and state of Drizzle development.
Sphinx: Search Outside the Box
Looking for ways to overcome indexing bottlenecks at Craigslist led to an investigation of Sphinx, a powerful, free full-text search engine that works extremely well with MySQL.
XtraDB: InnoDB on Steroids
Percona's XtraDB is a fork of InnoDB with a ton of extra options and enhancements. Here's a look at what it can do for your busy database.
The State of MySQL
Robust development from outside the Sun/MySQL sphere, new storage engines, and the return of Monty are just some of the signs that MySQL is healthy, despite many reports to the contrary.
Choosing the Right Zip Code
In the old days, disk space cost a pretty penny, so saving space was essential. But now that disk space costs about $0.50 per gigabyte, a lot of folks never worry about deleting files, let alone compressing them. However, if you're administering a large, shared server (such as for email), it seems that you can never have too much space.
Keychain: Hassle-free SSH
If you're running Linux, you should be aware that using telnet is a no-no. With the wide availability of network sniffers and automated password grabbing tools, telnet is simply not a secure way to work. Instead, use ssh and keep your passwords in keychain.
Open Source and APIs
An open source application really isn’t open unless its APIs are open, too.
MySQL Cluster Configuration
Last month, we looked at MySQL's new storage engine, NDB (also known as NDBCluster or MySQL Cluster). Now it's time to look at the compilation, installation, and configuration process.
MySQL Clustering
This month and next, we'll look at the most significant addition to MySQL 4.1: native clustering. This month, let's start with an overview of the new clustering technology, see how it's been integrated into MySQL, and understand the benefits it provides. Next month, we'll cover the steps necessary to get a cluster up and running.
Understanding the Query Cache
When MySQL 4.0 was released, it included a host of new features. We've already discussed MySQL 4.0 several times in Linux Magazine, but the query cache only received a brief mention in the September 2002 "LAMP Post" column (available online at http://www.linuxdls.com/2002-09/lamp_01.html). And since the query cache is disabled by default, there's a good chance you've not stumbled across it yet.
The GIMP 2.0
In the early days of Linux, users had modest needs to create graphics, so the then-nascent GNU Image Manipulation Program (GIMP) served them well. However, as Linux and the GIMP became popular, more sophisticated users -- even some graphics professionals -- began to rely on the GIMP for their day-to-day needs. As often occurs, as demand for the GIMP grew, so did the number of feature requests. Fortunately, the GIMP developers worked hard to keep up with expensive, proprietary image editing software available on other platforms, and today, the GIMP is "the Photoshop of Linux," a category-killer application.
Peering Under the Hood, Part One
On a busy server, it's often hard to keep track of what's running and when, so from time to time, you may find yourself wondering what MySQL is doing. Luckily, MySQL provides a degree of transparency that makes it relatively easy to peer inside and see what's up.
Which Zip Is Right For You?
In the old days, disk space cost a pretty penny, so saving space was essential. But now that disk space costs about $0.50 per gigabyte, a lot of folks never worry about deleting files, let alone compressing them. However, if you're administering a large, shared server (such as for email), it seems that you can never have too much space.
MythTV: The Open Source PVR
For years now, Linux users have been proud of the fact that their favorite operating system is at the heart of the most popular personal video recorder (PVR) system around: TiVo.
Replication Tips and Tricks in MySQL
In March's "LAMP Post" column, we started to look at MySQL's replication subsystem. We covered how replication works, as well as putting it to use by configuring the master and slave(s). This month, in this inaugural "MySQL" column, let's spend some time looking at the lesser known aspects of MySQL replication, including filtering and log inspection.
Mozilla Goodies
A year and a half ago, we selected the Galeon web browser as "Project of the Month" (http://www.linuxdls.com/2002-07/potm_01.html). Galeon used the Gecko rendering engine from Mozilla and layered tons of useful features on top of it. It wasn't the leanest browser around, but it had nearly every feature you could want and was proving to be quite popular.
Replication in MySQL
Master/slave replication first appeared in a beta release of MySQL back in 2000. In the three or so years since then, replication has become an essential feature for most of MySQL's high-end users. And contrary to many assumptions, MySQL's replication is quite easy to use, especially when compared to the replication systems that are part of high-end commercial databases. This month and next, let's take a look at MySQL's replication feature and the various ways you can put it to use.
BitTorrent: ISOs for Everyone — Fast!
Have you ever tried to download the latest ISO images for your favorite Linux distribution during the first week that it's available? If so, you've probably even had trouble finding an up-to-date mirror that'd let you in, and after finding one, you were probably disappointed to see a 20 KB/sec download speed (or worse) on your cable modem or DSL line that normally downloads at 10 times that speed. And as Linux becomes more popular, the problem's only getting worse.
MyISAM Tables
In September 2002's "LAMP Post" column (http://www.linuxdls.com/2002-09/lamp_01.html), we briefly touched on the idea of using multiple storage backends (table types) in MySQL:
Dia: Diagram This!
Writing documentation for software is no fun. But what do you do when you have to document some code you've been working on, a network you're building, or a database you're designing? Get someone else to do it? Perhaps, but what if you have to do it? Easy. You draw impressive looking diagrams. After all, a picture is worth a thousand words, right?
Data Reduction, Part 2
Last month, we left off part way through a data reduction effort with ten million Apache log records in a single MySQL table that was taking up far too much disk space and memory. We analyzed the data, found ways to normalize the schema to reduce the space required, and created the new tables. Now let's finish the job by creating a script that can intelligently move data from the old table into the new ones.
Emacs Remote Editing with Tramp
Die-hard Emacs users often make a point of doing as much work as possible directly in their favorite editor. Indeed, mighty Emacs lets you do pretty much anything on your local machine. But as the world has become more and more network-centric, odds are that you'll need to edit a file on a remote machine. Sure, you can start a shell within Emacs, ssh to the remote machine, and edit the file. But Emacs doesn't perform terminal emulation very well, so your choice of editors is quite limited.
Data Reduction, Part 1
Roughly a year ago, we spent two months looking at logging web hits in MySQL, using Apache and mod_log_sql. In the October 2002 issue (available online at http://www.linuxdls.com/2002-10/lamp_01.html), we looked at what Chris Powell's mod_log_sql does for you and tried a basic configuration. (After that article appeared, Chris released a new version that fixed a few bugs we discovered in the process of writing that article. Consider upgrading if you haven't already.) Then in November 2002 (http://www.linuxdls.com/2002-11/lamp_01.html), we started building a basic web interface in PHP to present a view of the logged data. Using that framework, you could construct pages to list the most popular URIs, referers, and so on -- all in real-time. That, after all, is part of the beauty of mod_log_sql. You get the benefits of an SQL interface without any unnecessary delays.
SpamBayes: Get Rid of Spam
In Paul Graham's now-famous article, "A Plan for Spam" (http://www.paulgraham.com/spam.html), Graham argued for a much different and radically simplified approach to spam filtering. Instead of using extensive rule-based schemes, Graham suggested using a statistical approach that learned from your e-mail. Shortly after, Bayesian mail filters began popping up everywhere. This month let's look at SpamBayes, one of the most popular and effective Bayesian tools around.
Network Tricks with netcat
A Smarty Solution
Last month, we looked into Smarty (http://smarty.php.net), a powerful templating system for PHP. To recap, Smarty provides a separation of program logic from web site design. Your code does whatever it needs to do to fetch, process, and store data, and when you need to display something, you instantiate Smarty and ask it to render a particular template, passing in any necessary data. Smarty templates are written in a simple markup language that looks a bit like a cross between XML and PHP.
The Bitflux Editor
Every once in a while, we run across a piece of software that assembles technologies in a different way with surprising and impressive results. The Bitflux Editor (or BXE) is just that sort of software: it's a browser-based, WYSIWYG XML editor that runs inside Mozilla. And since Mozilla is a cross-platform browser, BXE is inherently a cross-platform XML WYSIWYG editor. Better yet, BXE requires no Java or ActiveX plug-ins at all. There aren't many other editors that can make that claim.
Smarter PHP with Smarty
Have you ever noticed that some of the features that make PHP so popular and useful are the very same features that come back to bite you as your project evolves and gets larger and larger?
Emulation, Virtualization, and More
When it comes to running non-Linux software or a second operating system under Linux, many users turn to a commercial solution such as VMWare (http://www.vmware.com), or a full-blown virtual machine, or CodeWeavers' CrossOver (http://www.codeweavers.com/products/crossover). But the open source world has a lot to offer, too.
Reverse Proxying with Squid
Many large organizations use caching proxy servers to save on network bandwidth utilization (and costs) and improve browsing response times. In fact, an entire industry has grown up around caching proxy appliances. But in the open source world, we've had one of the most advanced proxy servers for many, many years. Squid (http://www.squid-cache.org) is to caching proxy servers as Apache is to web servers -- the hands-down open source winner.
In the last year or so, we've looked at a lot of email tools in this column, including SpamAssassin, Squirrelmail, grepmail, and Mailman. But so far, we haven't looked at any desktop mail programs. To remedy that, let's look at the GNOME project's Balsa.
Adding Search to Your Site, Part 2
Last month, we looked at adding search to a site using the open source ht://Dig search tools. As you'll recall, ht://Dig handles the crawling, indexing, and search duties. However, not everyone has the access or resources required to install ht://Dig, so this month we'll try an alternative approach -- using Google from PHP.
Imagine having a Linux distribution that uses the latest open source software, auto-detects all of your hardware, and doesn't cost a dime. Now imagine that it can also be run completely from a CD, yet still contain full-blown desktop applications such as KDE, OpenOffice, KOffice, and so on.
Adding Search to Your Site, Part I
While most large organizations already have a search feature on their web site, many small- and medium-sized organizations do not. For whatever reason, there's long been a perception that getting good search results on your web site is complicated or expensive. This month's column begins a two-part series about adding search features to your web site.
With all of the fancy, graphical email applications available for Linux, newcomers are often surprised to learn that many long-time Linux users still use old, text-based email management tools. These old-timers thoroughly embrace the ancient Unix philosophy of using several small, discrete command-line tools rather than a single monolithic application. This month, we look at grepmail, one of the most indispensable command-line email utilities.
Benchmarking with Apache Bench
Last month, we looked at some of the issues that affect PHP performance and explored PHP caches and optimizers, two kinds of add-ons that can provide a substantial performance boost to your PHP web applications. Rather than dig into any of those products (they all have sufficient documentation and good support communities), let's focus on a related issue: performance testing. Or, said another way, once you've installed a performance boosting add-on or made a configuration change, how can you determine if it's helping or hurting?
The Spread Toolkit
This month's installment of "Do It Yourself" switches gears a bit. Rather than focus on an application, this month's column looks at a development library called the Spread Toolkit, a powerful network communication system. Spread isn't a new project. It's existed in one form or another for roughly five years. Strangely, during all that time, it hasn't received the attention it deserves.
PHP Caching and Optimization
PHP is an excellent language for building Web applications. PHP's syntax is likely to be familiar to anyone who's programmed in C/C++ or Perl, and PHP integrates with literally hundreds of third party libraries, providing access to everything from IMAP and MySQL to GD for image manipulation and SNMP for monitoring network devices.
Over the last several years, a small group of open source projects has evolved to become much more than just great software. One notable example is Mozilla. Long criticized for being bloated, bug-ridden, and behind schedule, Mozilla has evolved into a capable development platform. Other open source projects use Mozilla's core technology, such as the popular Gecko rendering engine, to build next-generation tools and applications. In many ways, Mozilla has become an umbrella for such projects, including this month's "Do It Yourself" software, the Phoenix Web browser.
As Web applications grow more complex, they also become more and more vulnerable. As with other areas of network security, it's best to have multiple levels of protection in place. Ideally, you'd have security controls in place on your router and firewall, as well as in your application and database servers.
Throttling Apache
Popularity often comes at a high price, especially on-line where news, fads, links, and word-of-mouth literally spread at the speed of light. The creators of the popular "Hot or Not" site (http://www.hotornot.com) learned this lesson the hard way. Overnight, their traffic went through the roof. In response, they had to spend a fair amount of time and effort figuring out how to manage (and pay for) the traffic their site generated.
MySQL 4.x:
MySQL powers countless databases and data-driven Web sites. MySQL 4, the latest release of the Open Source database, includes features that put it on par with products from database stalwarts Oracle and Microsoft. Unbelievable? Believe it.
The popularity and prevalence of always-on home network service (via DSL or cable modem) has changed the notion of what an Internet Service Provider (ISP) needs to provide. In the old days, an ISP hosted your Web site and email on their servers. You used their network to browse the Web, read newsgroups, and POP your email. Since your connection was temporary, there wasn't a way to get email delivered directly and reliably to your computer.
TWiki Extensions and Other Wikis
In last month's LAMP Post column (available online at http://www.linuxdls.com/2002-12/lamp_01.html), we looked at TWiki, a popular web collaboration tool. This month, let's dig a bit deeper into TWiki, and consider TWiki alternatives.
If you're a programmer, one of the great things about Linux and Unix is that everything is a file -- or at least acts like one. From devices to sockets, the "everything is a file" paradigm has served Unix well for a long, long time.
Wiki Time
The World Wide Web was created to enhance scientific collaboration and foster information sharing. But for some reason, the vast majority of the Web remains read-only. Few sites let random users make changes to the content, structure, or look and feel of a Web page. On most sites you can't easily edit what's there, or upload new documents and files, or easily allow others to do the same.
The popularity of free news and discussion Web sites like NewsForge (http://newsforge.com), Slashdot (http://www.slashdot.org), and Use Perl (http://use.perl.org) and the explosive growth of weblogs has created a need for good Rich Site Summary (RSS) aggregation software. (RSS is an XML-based file format that allows sites and weblogs to "export" headlines and story information for use elsewhere.) RSS aggregators monitor web sites and weblogs for new content, and create a list of the latest headlines and abstracts (when available).
Live Logs
Last month, we configured Apache with mod_log_sql to log all Web traffic to a MySQL database, using one table for each virtual domain.
Bash Completion
In the May 2002 Power Tools column (http://www.linuxdls.com/2002-05/power_01.html), we looked at one of the most compelling features of zsh: its ability to complete often-used command-line arguments and switches. If you were tempted to drop bash in favor of zsh, don't switch just yet -- bash still has a few tricks up its sleeve.
Getting a Handle on Traffic
When running a Web site of any size, it helps to learn about the visitors you're attracting. The traditional solution for monitoring Web traffic is a log file analysis tool such as analog (http://www.analog.cx). analog is very fast, but what if you'd like real-time or near-real-time statistics? You could run analog from cron every five minutes, but what if you also want to issue ad-hoc queries against your logs to answer very specific questions like, "What's the average number of pages that each Internet Explorer user views?"
Robocode: Virtual Robot Wars
Given the amount of press Java garners, you'd think that every programmer is busily building Java applications. The reality is that many programmers have yet to give Java a try. If you're one of those programmers, Robocode might just be the project you need to jump into Java.
What’s New in MySQL 4.0
Version 4 of MySQL (http://www.mysql.com) has been in development since 2001. By the time you read this, MySQL 4.0 should be a stable release (or at least be in late-beta -- not finished yet, but still quite suitable for development work that you expect to deploy later this year).
Building a Simple Calculator
Being a magazine editor isn't easy. In addition to fretting over deadlines and the regular care-and-feeding of authors, editors spend a lot of time and effort reading, re-writing, re-reading, and tweaking articles. Editing is really more of an art than a science -- and that's why editors haven't been replaced by computers. Yet.
Browsing the Web on a Linux box just got a lot better. Galeon, a GNOME-based browser, raises the bar for performance and ease-of-use. Galeon is fast, easy to configure, and packs features not available in mainstream browsers.
MySQL Administration Made Easy
MySQL is great for building database-driven Web sites of all shapes and sizes. It's fast, easy to configure, and incredibly reliable. But MySQL lacks a mature, easy-to-use GUI administration tool. Yes, you can use the mysql command-line, but that's rather tedious and you don't get a good overall picture of your server without doing a lot of typing.
Apache Toolbox
One of the keys to Apache's success is its extensible modular architecture. Developers have created custom modules for authentication, streaming audio, database access, and so on. However, Apache itself only comes with a handful of core modules installed.
Customizing PostNuke
Last month began our look at PostNuke (http://www.postnuke.com), a popular PHP-based Web site framework. Out of the box, PostNuke provides a modular and customizable interface for building community Web sites. However, PostNuke's default setup is only intended as a starting point. So this month we'll look at some ways to add features and flavor to your PostNuke site.
Mailman, the GNU List Manager
For nearly as long as folks have used the Internet for sending and receiving e-mail, mailing list management (MLM) software has been around. If you're looking to set up a mailing list server, Mailman is probably just what you need. It is popular, fast, easy to use, and easy to hack on.
Setting up PostNuke
Last month began our look at a class of Web-based tools for creating dynamic Web sites. The LAMP-powered tools provide a framework for news and announcements, threaded discussions, weblogs, polls, and dynamic content. This month we'll go through the process of installing and setting up PostNuke, a popular PHP-based system.
Building Community Web Sites
If you're a well-seasoned Net surfer, you've undoubtedly noticed that the number of "community" Web sites run by individuals, small companies, and other organizations has increased dramatically over the last few years. These sites serve as places for folks with common interests to meet, discuss ideas, share information, and collaborate in many interesting ways.
PHPLib and User Authentication
Last month we looked at basic user authentication with PHP. The methods shown were useful for simple applications that require only minimal security. More complex applications, however, tend to require a more flexible and robust authentication system as well as session handling, permissions, and so on. Building such a system on your own would probably require a lot of time, during which you'd be reinventing wheels (and bugs). This month we'll look at PHPLib and some of the features you could take advantage of.
Simple User Authentication
Web application security is often an afterthought. You start with the best of intentions -- building a quick prototype which allows your users to get a feel for how the application might work. But the next thing you know, they're using it regularly and you've invested quite a bit of time and effort in the former prototype.
Migrating to PHP
Welcome to the first of many columns in which we'll explore the various technologies in the LAMP family. For those who aren't familiar with the acronym, LAMP stands for Linux, Apache, MySQL, and Perl/Python/PHP -- some of the Open Source world's "best of breed" tools.
MySQL Server Performance Tuning
Get under the hood of MySQL to find out how you can speed up your database applications.
MySQL Performance Tuning
With tools like Apache, Perl, PHP, and Python, building great MySQL applications is easy. Making sure they are fast, however, requires a bit more insight. Here's what you need to know.