Keywords:

H-Store, In-Memory Database System (IMDS).

Introduction:

Previous research shows that traditional modern RDMS databases are not performing optimal because most of the databases (DB2, SQL Server and Oracle) trace their roots to System R from 1970 [1, 2]. Moreover current online transactional processing (OLTP) databases spend only 12% of execution time on actual work and all the other time is spent on locking, logging and data buffering [1]. These databases were designed more than 25 years ago and hardware characteristics were quite different than today. In particular, the abundance of main memory, storage space in hard drives and processing speed of modern processors has improved significantly. One would expect that modern RDMS will be using the improvements in technology but surprisingly the architecture is still similar to System R, which limits the use of modern technology and hardware [2]. In summary, the current RDMSs were developed for business data processing market in a time of different user interfaces and hardware characteristics. Hence they include the following System R features.

  1. Disk oriented storage
  2. Multithreading to hide latency
  3. Locking based concurrency control
  4. Log based recovery

Since then, new markets have emerged in database market like text processing, data warehousing, stream processing and scientific and intelligence database. Specialized database solutions have already been developed for the mentioned database [2]. H-Store is an attempt to provide specialized database engine for OLTP market.

H-Store

H-Store is an open source, in-memory, row based and relational research database. It is a specialized database to handle only OLTP data. It is completely ACID complaint and runs on cluster of shared nothing machines. Multiple single threaded engines coordinate to provide efficient execution of OLTP transactions. H-Store is written from scratch and does not inherit anything from System R. It supports modern hardware and it significantly faster [3]. H-Store can be found at this Github url.

1. H-Store Cluster

H-Store system consists of cluster of two or more computational nodes deployed within the same administration domain. A node is a computational unit that hosts multiple sites. A site is a single threaded independent daemon that performs the basic operations. External OLTP applications connect to this site to execute transactions. Each node consists of multiple processor cores. So, each site is assigned a single processor core and a part of main memory. Sites do not share any data structure or memory with other sites on same machine or on different machine.

H-Store Node diagram

H-Store Node diagram

 

OLTP applications call H-Store system to request the pre-defined stored procedures written in Java. These parameterized stored procedure are executed on sites and contain both control code and SQL statements. An instance of stored procedure is called transaction in this system.

2. System Architecture

H-Store Architecture diagram

H-Store Architecture diagram

 

H-Store architecture can be divided into 2 separate parts

2.1 Deployment model

H-Store provides a deployment framework to administrators, which takes database schema, cluster information, stored procedures and sample workload as input and generates compiled stored procedures, skeleton query plans and physical layout of database as output.

Sample workload is a log file that contains a list of previously executed transactions. Deployment framework uses this to get hints for generating skeleton query plans. Compiled stored procedures are comprised of multiple distributed query skeleton plans, one for each query in the transaction. The deployment framework also outputs unique invocation handlers that can be used to reference the stored procedures at runtime. After this phase, all the compiled stored procedures are sent to each site in the cluster.

The database designer is responsible for splitting the data into separate autonomous partitions that can be sent to different sites in the cluster. The partitioning is done automatically based on the partitioning column defined by the developer as part of the database schema. This column should be carefully selected for each table because wrong partition can result in loss of performance. The goal of partitioning column is create as many single-sited transactions as possible. In case you change the partitioning column, you will have to stop the database, recompile, restart the database and reload the data. The deployment framework will automatically partition the whole data according to the new column.

Tree Schema of Customers Database

Tree Schema of Customers Database

 

In most of the OLTP workloads, tables have 1-n relationship to its ancestor table and the schema is a tree schema. Most of the OLTP databases are explicitly created like a tree schema [3]. Such schemas are popular; for example, “Customer” (root table) produces “Orders”, which have line items. These schemas have an obvious horizontal partitioning and root table can be range or hash partitioned on primary key(s). Every descendent table can be partitioned such that all equality-joins in the tree span only a single site. This makes the OLTP application single-sited because almost all of the transactions can be executed on a single site, which makes the database very efficient. For example, considering the previous example many SQL queries will be rooted with a specific customer and will include a where clause with customer id like customer_id=27. In some cases, if the application is not a tree schema but becomes one after removing read-only tables than those types of schema’s can also be made single-sited. In such cases we replicate the entire read-only table on each site to improve performance [2].

2.2 Runtime model

The runtime model consists of multiple components. OLTP application can access any site in the cluster to execute its request. The request is received at transaction initiator and coordinator component of site. This component decides that which site will execute the control code (Java code). It also annotates the skeleton query plans generated at deployment time with parameters received in request. The plan helps the site to figure out that which site in the cluster has the data and can execute the fragment of the SQL query. This component is also responsible for the coordination of distributed transactions. It uses messaging fabric to communicate with other sites using sockets.

Another important component of site is the transaction manager, which consists of three sub components. Stored procedure executor is responsible for executing the control code. Query execution engine is a C++ based query engine responsible for executing the parameterized SQL queries that are part of transaction. It uses the data in main memory of the site to run the query. Last component is a system catalog and it contains all the output generated at deployment time. Since each site has compiled stored procedures and query plans, the sites are independent.

At execution time, the client uses the unique stored procedure handles to initiate a new transaction request. All the parameters required to run the transaction are also passed to the transaction. Each site of node in cluster can execute the transaction even if it does not have the data needed for the transaction. Since on runtime we know all the parameters, we can annotate the skeleton query execution plans generated at deployment time with the correct target sites for each queries sub operation. The updated plan is passed to the transaction manager. If the plan needs to execute a sub part on some other site then it sends fragments of plan over socket to other sites. After the execution of the fragment the data is sent back to the initiating site. Unless all the results are received on the initiating site, the site that is running the control code is blocked. The final results are passed back to the client application.

3. Example

Let’s see a few examples that how read and write will work in H-Store (considering the distributed nature of the database). In each example, we are considering that we have 4 nodes in our cluster and each node has 2 cores and each core has its own partition of data and works in a shared nothing architecture. All the requests made by client application are passed onto H-Store and it decide which node to forward the request to.

3.1 Read

Lets try to execute the following query

In this specific case, the client application is requesting the data of customer with ID of 1. Since we know that data is replicated in multiple partitions, we can read data from any partition. So, the request is passed onto Node 1 and transaction coordinator passes the request to first core which returns the data from its partition.

H-Store Select query example

H-Store Select query example

 

3.2 Complex Read

Now, let’s try a more complex query that involves more nodes.

This is a distributed query because it requires access to multiple machines to get the final output. Let’s assume that the initial request was passed to Node 1 and all the data related to Orders table is places in Node 1 and Node 2 and there replica’s are places in Node 3 and Node 4. Node 1 will run the control code. Results will be fetched from Node 1 and 2 partitions and data will be aggregated by the control code running at Node 1. Node 1 will be in blocked state until the data from Node 2 is returned back.

H-Store Count(*) query

H-Store Count(*) query

3.3 Insert/Update/Delete

We have already seen the select queries. Let’s try the update query. The insert and delete work in a similar fashion.

Since, it is an update statement we also need to update the replica’s. Let’s say the data for customer 4 is placed in Node 2 in partition 4. Its replica is placed in Node 4 and partition 4. Let’s say that the request is again sent to Node 1 (Even though it can come to any node). Node 1 can run the control code or pass it to other node. In this case, let’s say that it passes to Node 2. Node 2 will run the control code. It will update the data in its partition and will wait for Node 4 to update its data too (since Node 4 contains the replica). Once, this is done the results will be returned back to the user.

H-Store Update query

H-Store Update query

3.4 Code example

We have seen diagrams but not code. A simple stored procedure in H-Store looks like the following. It is a stored procedure to add a new Order. Each stored procedure is written in JAVA and has to extend VoltProcedure class. It has a run method with several parameters that will be passed on runtime. The method voltQueueSQL puts the SQL query in a queue and voltExecuteSQL will execute it.

The parameters to the stored procedure will be passed on runtime. The code for client applications will look something like.

Sometimes we might not have a stored procedure for a query. In that case, we can run ad-hoc queries using the client object

4. Durability and fault tolerance

H-Store is based on main memory and data is distributed over multiple nodes in the cluster. Fault tolerance and durability of the system is very important. To the best of our knowledge we were not able to find any data on fault tolerance of H-Store but a commercial version of H-Store named VoltDB provides this information [4]. According to it, H-Store database provides a configurable time limit after which all the data from main memory of cluster is transferred to a persistent device (i.e. Spinning disk and/or SSDs). It supports both local replication (for quick fail over) and remote replication (for disaster recovery) [5].

It also maintains a command log, which logs all the transactions that were requested between the last snapshot and current time. So, if the system fails after the snapshot then redoing the transactions that exist in the command log can restore the state of cluster.

5. Scalability

H-Store consists of a cluster of shared nothing machines. This architecture makes it highly scalable. Again H-Store doesn’t provide much information here but its commercial version VoltDB answers this question [5]. This database can be scaled bidirectionally – by increasing the capacity of existing cluster nodes (scaling up) and by increasing the number of nodes in a cluster (scaling out). Scaling increases the number of database partitions and execution queues. More transaction requests can be handled per second by the cluster. Further, scaling doesn’t require any changes in the database schema or application code, nor replacing existing hardware. Reconfiguring the cluster hardware might require first saving and shutting down the database and restarting it. After rebooting, the deployment framework can take into account the new hardware that was added to the system.

6. High Availability

Data persistence is guaranteed by having the data reside in multiple main memories partitions. For high availability active-active architecture is used and each replica is an active partition, which can handle client application request. H-Store uses an innovative replication strategy called k-safety. If H-Store is configured for k-safety then it automatically replicates partitions so that database can withstand loss of “k” nodes (due to software or hardware problems). For example value of k=0 means that there will be no replication and database will be stopped in case of node loss. If we have k>0, then cluster can work normally without any interruption (even if a node fails) because the replica is a fully functioning member of the cluster. During normal operation all transactions are executed synchronously on all replicas.

In case a node goes offline it can be reintroduced in the running cluster once the hardware or software fault has been resolved. As it rejoins the cluster, it retrieves a copy of data for its partition from its sibling nodes. The siblings continue to work normally while they are updating the newly rejoined node. Once the node gets all the data, it begins normal operation and starts to accept client’s transaction requests.

7. Related Work

Since 1980, the main memory became relatively inexpensive and it became possible to keep large databases in memory. The concept of main memory DBMS also became usual. Many in memory DBMSs were developed like MonetDB, Altibase, IBMSolidDB and Oracle-TimesTen but many of these systems inherit the architectural baggage of System R [3]. Other distributed main memory databases have also focused on the migration of legacy architectural features to this environment. H-Store is a complete redesign and a lot of focus has been on automatic database partitioning [7] and optimizing the transaction execution in a parallel environment [6], which makes it quite optimized, and performance oriented.

 

References:

[1] S. Harizopoulos, D. J. Abadi, S. Madden, and M. Stonebraker. “OLTP through the looking glass, and what we found there,”

[2] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it’s time for a complete rewrite). In VLDB ’07, pages 1150– 1160, 2007

[3] R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi. H-Store: a HighPerformance, Distributed Main Memory Transaction Processing System
[4] VoltDB. High Performance, Scalable RDMS for Big Data and Real-Time Analytics. [http://downloads.voltdb.com/datasheets_collateral/technical _overview.pdf]

[5] Volt DB. Technical Overview: A High Performance, scalable RDMS for Big Data, High Velocity OLTP and Realtime Analytics [http://www.odbms.org/wpcontent/uploads/2013/11/VoltDBTechnicalOverview.pdf]

[6] A. Pavlo, P.C. Jones, S. Zdonik. On Predictive Modeling for Optimizing Transaction Execution in Parallel OLTP Systems

[7] A. Pavlo, C. Curino, S. Zdonik. Skew-Aware Automatic Database Partitioning in Shared-Nothing, Parallel OLTP Systems