A report from OpenSQLCamp

What do you get when you put together 80 to 100 hard-core database geeks from ten different open source databases for a weekend? OpenSQLCamp, which was held most recently at MIT. Begun three years ago, OpenSQLCamp is a semi-annual unconference for open source database hackers to meet and collaborate on ideas and theories in the industry. It's held at various locations alternately in Europe and the United States, and organized and run by volunteers. This year's conference was organized by Sheeri Cabral, a MySQL community leader who works for PalominoDB.

This year's event included database hackers who work on MySQL, MariaDB, PostgreSQL, VoltDB, Tokutek, and Drizzle. In contrast to the popular perception that the various database systems are in a no-holds-barred competition for industry supremacy, most people who develop these systems are more interested in collaborating with their peers than arguing with them. And although it's OpenSQLCamp, programmers from "NoSQL" databases were welcome and present, including MongoDB, Membase, Cassandra, and BerkeleyDB.

While the attendees were mainly database engine developers, several high-end users were present, including staff from Rackspace, GoDaddy, VMWare, and WidgetBox. The conference's location meant the participation of a few MIT faculty, including conference co-chair Bradley Kuszmaul. While few of the students who registered actually turned up, attendees were able to learn informally about the software technologies which are now hot in universities (lots of work on multi-processor scaling, apparently).

Friday

The conference started with a reception at the WorkBar, a shared office space in downtown Boston. After a little drinking and socializing, participants slid immediately into discussing database and database industry topics, including speculation on what Oracle is going to do with all of its open source databases (answer: nobody knows, including the people who work there), recent releases of PostgreSQL and MySQL, and how VoltDB works. Whiteboard markers came out and several people shifted to technical discussions, which continued until 11pm.

Jignesh Shah of VMWare brought up some interesting SSD testing results. In high-transaction environments, it seems that batching database writes actually reduces throughput and increases response times, completely contrary to performance on spinning disks. For example, Jignesh had experimented with asynchronous commit with large buffers, which means that the database returns a success message to the client and fsyncs the data in batches afterward. This reduced database write throughput, whereas on a standard spinning disk RAID it would have increased it by up to 30%. There was a great deal of speculation as to why that was.

A second topic of discussion, which shifted to a whiteboard for comprehensibility, was how to put the "consistency" in "eventual consistency" without increasing response time. This became a session on Sunday. This problem, which is basic to distributed databases, is the question of how you can ensure that any write conflict is resolved in exactly the same way on all database nodes for a transactional database which is replicated or partitioned across multiple servers. Historical solutions have included attempting to synchronize timestamps (which is impossible), using centralized transaction counter servers (which become bottlenecks), and using vector clocks (which are insufficiently determinative on a large number of nodes). VoltDB addresses this by a two-phase commit approach in which the node accepting the writes checks modification timestamps on all nodes which could conflict. As with many approaches, this solution maintains consistency and throughput at a substantial sacrifice in response times.
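The limitation of vector clocks mentioned above can be illustrated with a minimal sketch. It is not taken from any of the databases discussed; the function names and node labels are hypothetical. A vector clock can tell whether one write happened before another, but when two nodes write independently it can only report that the writes are concurrent, leaving the actual conflict resolution to some other rule.

```python
from typing import Dict

# A vector clock maps a node name to the number of events seen from that node.
VectorClock = Dict[str, int]


def tick(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with the given node's counter advanced."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Combine two clocks by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify a relative to b: 'before', 'after', 'equal' or 'concurrent'."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Hypothetical scenario: both nodes start from the same version of a row,
# then each accepts a write without hearing about the other.
base = tick({}, "node_a")
write_a = tick(base, "node_a")
write_b = tick(base, "node_b")

print(compare(base, write_a))     # ordered: base precedes write_a
print(compare(write_a, write_b))  # neither precedes the other
```

In the second comparison the clocks detect the conflict but give no ordering, so every node would still need an additional, identical tie-breaking rule to resolve it the same way.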

Saturday

The conference days were held at MIT, rather ironically in the William H. Gates building. For those who haven't seen Frank Gehry's sculptural architecture feat, it's as confusing on the inside as it is on the outside, so the first day started late. As usual with unconferences, the first task was to organize a schedule; participants proposed sessions and spent a long time rearranging them in an effort to avoid double-scheduling, which led to some "concurrency issues" with different versions of the schedule. Eventually we had four tracks for the four rooms, nicknamed "SELECT, INSERT, UPDATE and DELETE".

As much as I wanted to attend everything, it wasn't possible, so I'll just write up a few of the talks here. Some of the talks and discussions will also be available as videos from the conference web site later. I attended and ran mostly discussion sessions, which I find to be the most useful events of an unconference.

Monty Taylor of Drizzle talked about their current efforts to add multi-tenancy support, and discussed implementations and tradeoffs with other database developers. Multi-tenancy is another hot topic now that several companies are going into "database as a service" (DaaS); it is the concept that multiple businesses can share the same physical database while having complete logical separation of data and being unaware of each other. The primary implementation difficulty is that there is a harsh tradeoff between security and performance, since the more isolated users are from each other, the fewer physical resources they share. As a result, no single multi-tenancy implementation can be perfect.

Since it was first described in the early 80's, many databases have implemented Multi-Version Concurrency Control (MVCC). MVCC is a set of methods which allow multiple users to read and modify the same data concurrently while minimizing conflicts and locks, supporting the "Atomicity", "Consistency", and "Isolation" in ACID transactions. While the concept is conventional wisdom at this point, implementations are fairly variable. So, on request, I moderated a panel on MVCC in PostgreSQL, InnoDB, Cassandra, CouchDB and BerkeleyDB. The discussion covered the basic differences in approach as well as the issues with data garbage collection.
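As a rough illustration of the idea, here is a deliberately simplified sketch of snapshot-based version visibility and garbage collection. It does not reflect how PostgreSQL, InnoDB, or any other database on the panel implements MVCC; the class and method names are hypothetical, and each write is assumed to commit immediately as its own transaction.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Version:
    value: Any
    created_at: int                   # commit number that created this version
    deleted_at: Optional[int] = None  # commit number that superseded it, if any


@dataclass
class VersionedStore:
    commit_counter: int = 0
    rows: Dict[str, List[Version]] = field(default_factory=dict)

    def snapshot(self) -> int:
        """A reader remembers the latest commit number when it starts."""
        return self.commit_counter

    def write(self, key: str, value: Any) -> int:
        """Add a new version instead of overwriting; readers are never blocked."""
        self.commit_counter += 1
        versions = self.rows.setdefault(key, [])
        if versions and versions[-1].deleted_at is None:
            versions[-1].deleted_at = self.commit_counter
        versions.append(Version(value, self.commit_counter))
        return self.commit_counter

    def read(self, key: str, snapshot: int) -> Any:
        """Return the version that was current as of the reader's snapshot."""
        for v in self.rows.get(key, []):
            if v.created_at <= snapshot and (
                v.deleted_at is None or v.deleted_at > snapshot
            ):
                return v.value
        return None

    def vacuum(self, oldest_snapshot: int) -> int:
        """Drop versions that no snapshot at or after oldest_snapshot can see."""
        removed = 0
        for key, versions in self.rows.items():
            keep = [
                v for v in versions
                if v.deleted_at is None or v.deleted_at > oldest_snapshot
            ]
            removed += len(versions) - len(keep)
            self.rows[key] = keep
        return removed


store = VersionedStore()
store.write("balance", 100)
reader_snapshot = store.snapshot()   # a long-running reader starts here
store.write("balance", 80)           # a concurrent writer updates the row

print(store.read("balance", reader_snapshot))   # the reader still sees its own version
print(store.read("balance", store.snapshot()))  # a new reader sees the update
```

The `vacuum` method hints at the garbage collection issue: old versions can only be discarded once no active reader could still need them, so a long-running reader delays cleanup.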

Jignesh Shah of VMWare and Tim Callaghan of VoltDB presented on current issues in database performance in virtualized environments. The first, mostly solved issue was figuring out degrees of overcommit for virtualized databases sharing the same physical machine. Jignesh had tested with PostgreSQL and found the optimal level in benchmark tests to be around 20% overcommit, meaning five virtual machines (VMs) each entitled to 25% of the server's CPU and RAM.
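The arithmetic behind that figure depends on how overcommit is defined, which the talk summary does not specify. The short sketch below, using a hypothetical helper function, works through both common readings: the excess measured against physical capacity, and the excess measured against the total amount allocated.

```python
from typing import Tuple


def overcommit(vm_count: int, share_per_vm: float) -> Tuple[float, float, float]:
    """Return (total allocated, excess vs. physical, excess vs. allocated).

    All values are fractions of one physical server's capacity.
    """
    allocated = vm_count * share_per_vm
    excess = allocated - 1.0
    return allocated, excess, excess / allocated


allocated, vs_physical, vs_allocated = overcommit(vm_count=5, share_per_vm=0.25)

print(f"allocated:           {allocated:.0%}")
print(f"excess vs. physical:  {vs_physical:.0%}")
print(f"excess vs. allocated: {vs_allocated:.0%}")
```

Five VMs at 25% each add up to 125% of the machine, so the quoted 20% matches the second reading, where one fifth of the allocated capacity is not backed by physical resources.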

One work in progress is I/O scheduling. While VMWare engineers have optimized sharing CPU and RAM among multiple VMs running databases on the same machine, sharing I/O without conflicts or severe overallocation still needs work.

The other major unsolved issue is multi-socket scaling. As it turns out, attempting to scale a single VM across multiple sockets is extremely inefficient with current software, resulting in tremendous drops in throughput as soon as the first thread migrates to a second socket. The current workaround is to give the VMs socket affinity and to run one VM per socket, but nobody is satisfied with this.

After lunch, Bradley ran a Q&A panel on indexing with developers from VoltDB, Tokutek, Cassandra, PostgreSQL, and Percona. Panelists answered questions about types of indexes, databases without indexes, performance optimizations, and whether server hardware advances would cause major changes in indexing technology in the near future. The short answer to that one is "no".

As is often the case with "camp" events, the day ended with a hacking session. However, only the Drizzle team really took advantage of it; for most attendees, it was a networking session.

Sunday

Elena Zannoni joined the conference in order to talk about the state of tracing on Linux. Several database geeks were surprised to find out that SystemTap was not going to be included in the Linux kernel, and that there was no expected schedule for release of utrace/uprobes. Many database engineers have been waiting for Linux to provide an alternative to DTrace, and it seems that we still have longer to wait.

The VoltDB folks, who are local to Boston, showed up in force and did a thorough presentation on their architecture, use case, and goals. VoltDB is a transactional, SQL-compliant distributed database with strong consistency. It's aimed at large companies building new in-house applications for which they need extremely high transaction processing rates and very high availability. VoltDB does this by requiring users to write their applications specifically for the database, including putting all transactions into stored procedures which are then precompiled and executed in batches on each node. It's an approach which sacrifices response times and general application portability in return for tremendous throughput, reaching hundreds of thousands of transactions per second.

Some of the SQL geeks at the conference discussed how to make developers more comfortable with SQL. Currently many application developers not only don't understand SQL, but actively hate and fear it. The round-table discussed why this is and some ideas for improvement, including: teaching university classes, contributing to object-relational mappers (ORMs), explaining SQL in relation to functional languages, doing fun "SQL tricks" demos, and working on improving DBA attitudes towards developers.

In the last track of the day, I moderated a freewheeling discussion on "The Future of Databases", in which participants tried to answer "What databases will we be using and developing in 2020?" While nobody there had a crystal ball, embedded databases with offline synchronization, analytical databases which support real-time calculations, and database-as-a-service featured heavily in the discussion.

Wrap-up

While small, OpenSQLCamp was fascinating due to the caliber of attendee; I learned more about several new databases over lunch than I had in the previous year of blog reading. If you work on open-source database technology, are a high-end user, or are just very interested in databases, you should consider attending next year. Watch the OpenSQLCamp web site for videos to be posted, and for the date and location of next year's conferences in the US and Europe.