Paging all data exports

There’s always a need for importing, exporting, and synchronizing data between various applications, especially in corporate environments.

As data has grown by orders of magnitude in my databases, the amount of resources required to run these processes has shot up dramatically. Many databases are now 10 times larger and exports that once took 2 or 3 minutes now take a good hour or more.

Think about that for a second and you’ll notice something strange. Shouldn’t 10X more data only take 20-30 minutes to export rather than an hour.

I’ve recently had to go back and make changes to some (of my own) legacy code doing such an export and was so aggravated by the excessive run times, I decided to do something about it.

After looking into it, I figured out that the problem I was running into is the excessive use of garbage collection. By pushing gigabytes of data into memory, not only was the OS working overtime to reallocate all of that, the garbage collector was having to constantly run in an effort to keep memory use down. When I tried running the process on my macbook that “only” has 2 gig of ram, the process ate up all 2 gig and virtually starved itself to death. I finally gave up and killed it.

The solution may seem obvious, but I really haven’t come across it before:

Page all data that gets pulled.

I always thought of data paging as something done to improve the user experience and not a method of improving performance when you need to work on all of the data (quite the opposite really). While it does slow things down for a small number of results, it really makes a HUGE difference as your application grows.

As a Rubyist, I now rely on my relatively new friend paginating_find to meet my paging needs. I set up an enumerated version of my export process to pull 1000 record pages at a time. Experiment with results-per-page until you get your desired performance. A lower number of results_per_page will tax your database, while a higher number will tax your web server and increase your memory requirements. I found that at 1000 records the database would carry about 15% of the load, and about 100MB of memory would be used.

What was the performance difference? The process has gone from 60+ minutes per run to just over 7 minutes. Wow!

Posted by chrisp Sat, 27 Jan 2007 02:56:00 GMT


Comments

Leave a response

Leave a comment