Thursday, September 18, 2008

Speeding up HBase, part 1

UPDATE: This work on scanners is now committed into trunk, enjoy the new speed!

Stability and reliability were the focus of the previous HBase versions; now it is time for performance! In this first post on the subject, I will describe the work done on the scanners.

As described in the Performance Evaluation page of the wiki, our current numbers compared to Bigtable are not that great (though a lot better than they were). One big hotspot in HBase is remote procedure calls (RPC), and some profiling we did showed that their overhead is sometimes even bigger than that of HDFS. So, our strategy for HBase 0.19.0 will be to batch as many rows as we can in a single RPC in order to diminish its impact. This will require different implementations for different operations.

Implementation and tests

Scanners in HBase already provide the fastest way to retrieve data, but compared to a direct scan of an HDFS MapFile, it is clear that something is wrong (3,737 rows per second instead of a staggering 58,054). HBASE-877 is about fixing that hotspot. The implementation is simple: since we know that we want many rows, why not get a bunch of them each time we communicate with a region server and then serve them directly from a client-side cache? In other words, fetch x rows, serve from cache x times, repeat until the rows are exhausted. That works for big jobs, but what if a user wants to scan only 2-3 rows (or even 1)? Can we do something so that it doesn't fetch 30 rows? If we make that number configurable, it will also work fine for small scans.
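To make the idea concrete, here is a minimal sketch of the batching logic, not the actual patch: a hypothetical fetchRowsFromRegionServer call stands in for the region server RPC, and rows are served from a local queue until it runs dry.

import java.util.LinkedList;
import java.util.Queue;

class CachingScannerSketch {
    private final int caching;                          // rows fetched per RPC, e.g. 30
    private final Queue<byte[]> cache = new LinkedList<byte[]>();
    private boolean exhausted = false;

    CachingScannerSketch(int caching) {
        this.caching = caching;
    }

    /** Returns the next row, doing an RPC only when the local cache is empty. */
    byte[] next() {
        if (cache.isEmpty() && !exhausted) {
            // One round trip brings back up to `caching` rows.
            byte[][] batch = fetchRowsFromRegionServer(caching);
            if (batch.length == 0) {
                exhausted = true;
            } else {
                for (byte[] row : batch) {
                    cache.add(row);
                }
            }
        }
        return cache.poll(); // null once the scan is done
    }

    // Placeholder for the real region server call; made up for this sketch.
    private byte[][] fetchRowsFromRegionServer(int numRows) {
        return new byte[0][];
    }
}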

My first tests with my patch confirm that batching rows is efficient. In the following chart, I ran the Performance Evaluation test 30 times, going from batching 1 row to batching 30 rows:
As we can see, with a caching of 1 the test took around 108 seconds to scan all 1 million+ rows on my machine (2.66 GHz, 2 GB of memory, SATA disk). That's 9,692 rows per second, better than the PE numbers in the wiki, but my machine is also stronger. At 30 rows per fetch, though, it scanned 30,028 rows per second! That's a lot better for a very simple fix. I think this could be further optimized by fetching the rows in a second thread, with some logic to make sure we don't load the full table right away... but that's for a future version of HBase.

Configuration

There are two ways to configure how many rows are fetched. First, hbase-default.xml defines a default value for hbase.client.scanner.caching, which you can override in hbase-site.xml like any other value. The other way is to use HTable.setScannerCaching to set an instance-specific value. For example, if you have to scan thousands of rows and right after that you only have to scan 2 rows and close the scanner (all on the same table), you could pass 30 to setScannerCaching the first time and 2 the second time. That's a bit aggressive, but you can do it if you want.
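As a quick illustration, here is a minimal, hedged sketch of the per-instance approach, assuming the 0.19-era client API described above (the table name is made up for the example):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;

public class ScannerCachingExample {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), "mytable");

        // Big scan: fetch 30 rows per RPC.
        table.setScannerCaching(30);
        // ... open a scanner and read thousands of rows ...

        // Small scan right after, on the same table: only fetch 2 rows per RPC.
        table.setScannerCaching(2);
        // ... open a scanner, read the 2 rows, close it ...
    }
}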

Something you must verify before setting the caching to a high number is the size of your rows. While testing the patch, I used 100 MB rows to get many splits and kept the caching at 20, which means roughly 2 GB of row data buffered on the client at once. Bad idea: my machine began swapping like hell!

Conclusion

HBase 0.19.0 is all about speed, and the first optimization is done on scanners. By fetching more rows per RPC, the cost of those calls is greatly reduced. By keeping the batch size configurable, it also fits all use cases. In my opinion, those who are using HBase as a source for MapReduce jobs will see a big improvement, since such jobs do full scans of tables (as long as the caching is kept at around 30 or lower when rows are very big).

If you want to use this sweet improvement right now, just download HBase 0.18.0 and apply the patch from HBASE-887, or wait until it gets committed.

Happy scanning!

Monday, September 15, 2008

Small, frequently updated tables

New in 0.2.1 and 0.18.0 is a finer-grained configuration of major compactions. I won't go into the details of what a compaction is today, so I'll refer you to the architecture page of the wiki.

So, suppose you have a small table (10 rows) that is frequently updated (100 times / minute) and has MAX_VERSIONS set to 1. From a developer's point of view, you would expect this table to eat only a few MB in HDFS, but upon inspection you will see that it's not the case. What happens is that when you add new cells, the old ones are kept around on disk and are only cleared out by a major compaction, which happens once a day! One thing you can do is set hbase.hregion.majorcompaction to a smaller value, but this affects your whole cluster and is not recommended. With the introduction of HBASE-871, you can now set this value for each family. In Java, the code looks like:

family.setValue("hbase.hregion.majorcompaction", "1800");

Here, family is an HColumnDescriptor, and the value 1800 will get you major compactions every 30 minutes (more or less). That's it!
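For a bit more context, here is a rough sketch of setting this value when creating a table. It assumes the era's Java client API; the table name, family name, and surrounding calls are my own illustration, not from the post, so double-check them against your version.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class FrequentCompactionTable {
    public static void main(String[] args) throws Exception {
        // Hypothetical family for a small, frequently updated table.
        HColumnDescriptor family = new HColumnDescriptor(Bytes.toBytes("counters:"));
        // Per the post, 1800 gets this family major compacted roughly every
        // 30 minutes instead of the once-a-day cluster-wide default.
        family.setValue("hbase.hregion.majorcompaction", "1800");

        HTableDescriptor table = new HTableDescriptor("small_busy_table");
        table.addFamily(family);

        new HBaseAdmin(new HBaseConfiguration()).createTable(table);
    }
}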

Wednesday, September 3, 2008

My Personal Notes on 0.2.1

0.2.1 is a bug fix release with some improvements. It mainly fixes issues that were not discovered during the 0.2.0 release candidates.

BUG FIXES

  • Many serious issues regarding MAX_VERSIONS and deletes were fixed to the point that using 0.2.0 is not recommended.
  • Another very serious issue that was fixed: using any character that sorts below “,” in the ASCII table could result in unreachable rows. For examples, see HBASE-832. This issue alone should be considered a strong argument to upgrade.
  • The timeout for the scanner went from 30 seconds to 60 by default. This prevents getting UnknownScannerException during heavy tasks (mainly MapReduce).
  • Writing to a region while scanning it could result in a temporary deadlock if a split was required. For example, a MapReduce job with no reduce phase that modified each row it mapped and committed the change in the map would surely fail after some time.
  • The row counter that was present in HQL is now available in the new shell. Note that it is expected to be slow.
  • Committing a BatchUpdate with no row will now complain.
  • Various other issues were resolved: log messages were fixed, the code was cleaned up, and the deleteFamily API was fixed.

IMPROVEMENTS

  • Be aware that 0.2.1 is based on Hadoop 0.17.2.1 which was released as 0.17.2.
  • When dropping a table in the shell, you will now have a nice message if the table was not disabled.
  • The pauses were reduced so creating a new table, for example, is faster.
  • All methods that take a Text and were not yet deprecated now are. Be aware that these methods will be removed in 0.18, so this is your last warning!
  • Various speed improvements regarding compactions and Jenkins Hash (used to encode a region’s name).

INCOMPATIBLE CHANGES

  • The Thrift IDL was updated to match the new API. Be sure to have a look!