



Joy Arulraj, Georgia Institute of Technology ... cloud-native database management systems (DBMSs) [12]. ... 3 CASE STUDY: INTEL DATA MANAGEMENT PLATFORM.




The availability of cost-effective, highly available, performant cloud computing platforms (e.g., Amazon Web Services, Microsoft Azure) over the last decade has given rise to a new class of cloud-native database management systems (DBMSs) [12]. These systems differ from their traditional counterparts in the following ways:
Cloud-native DBMSs adopt a shared-disk model so that they can independently scale the compute and storage resources. The canonical storage hierarchy of these systems consists of the following tiers:
Authors’ addresses: Joy Arulraj, Georgia Institute of Technology; David Cohen, Intel.
2 • Joy Arulraj and David Cohen
Fig. 1. DBMS Architectures - Comparison of (a) shared-nothing and (b) shared-disk architectures. Each panel shows CPU, DRAM, PM, Disk, and Network components.
assembly instructions [4, 11]. The DBMS uses the direct access (DAX) mechanism to bypass the traditional I/O stack (page cache and block layer). It manages the data on a file system extended for DAX-enabled PM (e.g., XFS on an fsdax device [6]). DAX enables direct, byte-addressable access to the contents of the file system.
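The byte-addressable access pattern described above can be sketched with an ordinary memory mapping. This is a minimal illustration, not the DBMS's actual code: the helper names `open_mapped` and `store` and the file name `pm_region.bin` are hypothetical, and a temporary file stands in for a file on a DAX-mounted file system. On a real fsdax device the same loads and stores would reach PM without passing through the page cache, and PM-aware code would typically rely on CPU cache-flush instructions rather than the `flush()` call used here.

```python
import mmap
import os
import tempfile


def open_mapped(path, size):
    """Create (or resize) a file of `size` bytes and map it into memory."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)
        # Default mapping is shared and read/write, so stores reach the file.
        return mmap.mmap(fd, size)
    finally:
        # The mapping remains valid after the descriptor is closed.
        os.close(fd)


def store(mm, offset, payload):
    """Write `payload` at a byte offset, then ask the OS to make it durable."""
    mm[offset:offset + len(payload)] = payload
    mm.flush()  # stand-in for the cache-line flushes used on real PM


# Assumption: a temporary file plays the role of a file on a DAX mount.
path = os.path.join(tempfile.mkdtemp(), "pm_region.bin")
mm = open_mapped(path, 4096)
store(mm, 128, b"hello")       # update five bytes in place, no page-sized I/O call
value = bytes(mm[128:133])     # read them back with a plain memory load
mm.close()
```

The point of the sketch is the access granularity: the program addresses individual bytes of the file directly instead of issuing block-sized read and write system calls.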
We now illustrate the design principles of cloud-native database systems through a case study. We present the architecture of the Data Management Platform (DMP) developed by Intel [5]. We later discuss how we extend a transactional DBMS to leverage PM on DMP.
Overview: DMP is a distributed data management system geared towards diverse workloads (e.g., transactional databases, machine learning pipelines). It manages a collection of disaggregated, containerized NVMe SSDs that are accessible via an NVMe-over-Fabric (NVMe-oF) interface [3]. The logical volumes residing in these containers are optimized for
for the duration of the operations of this DBMS instance. This ensures that the cloud object store has a record of all the modifications applied to the database, thereby enabling support for point-in-time recovery.
Distributed Shared Log: In a traditional DBMS, the write-ahead log (WAL) is the source of truth while recovering from a failure. In a cloud-native DBMS, we generalize this idea to a distributed shared log (i.e., an event stream) [12]. We could configure the WAL to use such a cloud event stream. However, the latency of immediately persisting entries to the distributed log is too high. So, as a stop-gap solution, the WAL is stored locally on PM. A replicated logging module that is tailored for PM would increase the write throughput achievable by the cloud-native DBMS.
PM-centric Optimizations: With minimal code modifications, DMP enables the MyRocks storage engine to exploit the byte-addressability of PM devices by eschewing the page-centric optimizations inherent in other cloud-native DBMSs [9]. The MyRocks engine maintains its local state across the combination of DRAM and PM (XFS on an fsdax device [6]). Since the PM device exhibits memory-like performance, the copies held in the page cache would be redundant with the data stored on that device. DAX eliminates these extra copies by performing reads and writes directly on the PM device (configured in App Direct Mode). The MyRocks engine maps the PM device directly into userspace. It accesses 256 B chunks of an SSTable mapped into PM as opposed to loading the entire 4 KB page. The engine stores the tail of the WAL and caches the top-level tiers of the LSM tree on local PM. It stores the mutable memory tables and a few SSTables in DRAM. We tune the RocksDB parameters to minimize the impact of flushing MemTables to new SSTable files and of SSTable compaction operations. We disable the RocksDB block cache. Thus, the engine directly fetches the blocks from the appropriate SSTable file cached in the locally-attached PM volume (i.e., Storage-over-AppDirect). We plan to leverage persistent skiplists in the future to guide the read operations through the LSM tree.
The MySQL cloud-native DBMS supports two key capabilities: (1) intra- and inter-cluster replication, and (2) point-in-time recovery (PiTR). It currently provides these capabilities by leveraging two logs: (1) the unmodified MySQL binary log (binlog), and (2) the MyRocks WAL. The unmodified binlog does not adhere to the principles of a cloud-native DBMS. The I/O overhead associated with the binlog and the MySQL group commit effectively throttles the achievable write throughput. We plan to address this issue in the future by developing a PM-aware, replicated tail-of-the-log module.
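To make the benefit of 256 B chunk access over 4 KB page loads concrete, the following sketch computes how many bytes each granularity touches when fetching a single record. It is an illustrative calculation only: the function names and the example offsets are assumptions, and it does not reflect how MyRocks lays out SSTable blocks.

```python
CHUNK = 256    # PM access granularity mentioned in the text
PAGE = 4096    # page size mentioned in the text


def aligned_span(offset, length, unit):
    """Smallest unit-aligned byte range [start, end) covering the request."""
    start = (offset // unit) * unit
    end = -(-(offset + length) // unit) * unit  # round up to a unit boundary
    return start, end


def bytes_touched(offset, length, unit):
    """Number of bytes read when access happens in whole `unit`-sized pieces."""
    start, end = aligned_span(offset, length, unit)
    return end - start


# Hypothetical record: 100 bytes located at byte offset 5000 of an SSTable file.
by_chunk = bytes_touched(5000, 100, CHUNK)   # one 256 B chunk  -> 256
by_page = bytes_touched(5000, 100, PAGE)     # one 4 KB page    -> 4096
amplification = by_page // by_chunk          # page access reads 16x more data

# A record that straddles a chunk boundary needs two chunks.
straddling = bytes_touched(250, 20, CHUNK)   # -> 512
```

For this hypothetical record, chunk-granular access moves one sixteenth of the data that a page-granular read would, which is why bypassing the page cache matters more as records get smaller relative to the page size.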
Several open problems arise with the advent of cloud-native DBMSs:
Leveraging Persistent Memory in Cloud-Native Database Systems • 5
Cloud-native DBMSs are moving away from the monolithic architecture of their traditional counterparts by decoupling storage and compute resources. The performance of these systems is, thus, constrained by the I/O performed over the network. Locally-attached persistent memory and remote PM-based SSD devices accessible via an NVMe-over-Fabric interface help alleviate this bottleneck, as we illustrated through our case study of the MyRocks storage engine on Intel’s Data Management Platform. The advent of cloud-native DBMSs has given rise to several open problems that should be of interest to both researchers and practitioners in storage systems.
[1] David DeWitt and Jim Gray. 1992. Parallel database systems: The future of high performance database processing. Technical Report. University of Wisconsin-Madison.
[2] Facebook. 2016. MyRocks: A space- and write-optimized MySQL database. https://engineering.fb.com/core-data/myrocks-a-space-and-write-optimized-mysql-database/.
[3] NVM Express Inc. 2016. NVM Express over Fabrics. https://nvmexpress.org/wp-content/uploads/NVMe_over_Fabrics_1_0_Gold_20160605-1.pdf.
[4] Intel. 2019. Intel Optane Memory. https://www.intel.com/content/www/us/en/architecture-and-technology/optane-memory.html.
[5] Intel. 2019. Manage and Monetize Exponential Data Growth with Intel’s Data Management Platform. https://itpeernetwork.intel.com/manage-monetize-exponential-data-growth-with-intels-data-management-platform/.
[6] Intel. 2019. Provision Intel Optane DC Persistent Memory. https://software.intel.com/en-us/articles/quick-start-guide-configure-intel-optane-dc-persistent-memory-on-linux.
[7] MinIO. 2019. High Performance Object Storage. https://min.io/.
[8] Rockset. 2019. RocksDB-Cloud: A Key-Value Store for Cloud Applications. https://github.com/rockset/rocksdb-cloud.
[9] Reza Sherkat, Colin Florendo, Mihnea Andrei, Rolando Blanco, Adrian Dragusanu, Amit Pathak, Pushkar Khadilkar, Neeraj Kulkarni, Christian Lemke, Sebastian Seifert, et al. 2019. Native store extension for SAP HANA. In VLDB 2019.
[10] Michael Stonebraker. 1986. The Case for Shared Nothing. Data Engineering Bulletin 1 (1986).
[11] Alexander van Renen, Viktor Leis, Alfons Kemper, Thomas Neumann, Takushi Hashida, Kazuichi Oe, Yoshiyasu Doi, Lilian Harada, and Mitsuru Sato. 2018. Managing non-volatile memory in database systems. In SIGMOD 2018.
[12] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design considerations for high throughput cloud-native relational databases. In SIGMOD.