360º View of Monitored Systems at Any Scale: LeanXcale Linear Horizontal Scalability

Aug 6, 2021

Motivation: Supporting 360º view of monitored systems at any scale

Monitoring applications must deal with systems of very different sizes; from a simple web server to several large data centers. Every device and aspect of the platform should be monitored individually to provide the highest level of granularity.

The bigger the scale of the monitored system, the higher the data rate that needs to be ingested, the bigger the number of KPIs that need to be computed and the larger the volume of historic data that needs to be queried in combination with the current data. This means that the underlying database of the monitoring system should be able to handle the load of the highest scale of the monitored systems.

If the underlying database is not scalable, the most common solution is sharding: running several independent databases, each dedicated to holding data about a fraction of the platform. This solution faces several issues:

KPIs that aggregate data from different database instances cannot be computed in real time. Alarms based on thresholds of these KPIs are not triggered at all, or are triggered late if a Manager of Managers (MoM) alternative is used.
Events that are stored in different shards and share a common root cause appear as independent events, because data across shards cannot be queried with SQL: each shard is an independent database manager with no visibility of the other shards.
Mutual relationships between different database metrics are not considered, reducing forecast accuracy.
Due to these limitations, managing the metrics inventory (deciding which metrics of a device are persisted) becomes more and more complex.
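
The first issue can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module with in-memory databases standing in for shards. The table name `metrics`, the column `cpu_pct`, and the sample values are hypothetical and chosen only for illustration; they are not taken from any real monitoring schema. With shards, the application has to collect partial sums and counts from every instance and merge them itself, whereas a single global database answers with one SQL statement.

```python
import sqlite3

def make_db(rows):
    """Create an in-memory SQLite database holding (device, cpu_pct) samples."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE metrics (device TEXT, cpu_pct REAL)")
    db.executemany("INSERT INTO metrics VALUES (?, ?)", rows)
    return db

# Hypothetical samples, split across two independent shards by device.
shard_a_rows = [("vm-1", 20.0), ("vm-1", 40.0), ("vm-2", 60.0)]
shard_b_rows = [("vm-3", 90.0)]
shards = [make_db(shard_a_rows), make_db(shard_b_rows)]

def global_avg_sharded(shards):
    """Scatter-gather: fetch partial SUM and COUNT per shard, merge in the application."""
    total, count = 0.0, 0
    for db in shards:
        s, c = db.execute(
            "SELECT COALESCE(SUM(cpu_pct), 0), COUNT(*) FROM metrics"
        ).fetchone()
        total += s
        count += c
    return total / count if count else None

def global_avg_single(db):
    """Single global database: one SQL statement computes the KPI."""
    return db.execute("SELECT AVG(cpu_pct) FROM metrics").fetchone()[0]

# The same samples stored in one global database.
single = make_db(shard_a_rows + shard_b_rows)

sharded_kpi = global_avg_sharded(shards)
single_kpi = global_avg_single(single)

# A tempting shortcut, averaging the per-shard averages, gives a different
# (wrong) answer because the shards hold different numbers of samples.
naive_kpi = sum(
    db.execute("SELECT AVG(cpu_pct) FROM metrics").fetchone()[0] for db in shards
) / len(shards)

print(sharded_kpi, single_kpi, naive_kpi)
```

Even for a plain average, the sharded version needs custom merge logic, and a careless merge silently produces a wrong KPI. Joins, percentiles, or root-cause correlations across shards require considerably more application code than this.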

What is really needed and valuable is a global 360° view of the monitored system that is independent of its scale. This goal can be achieved by using a single global database. However, to address the problem at any scale, the database needs to be horizontally scalable: when a particular deployment reaches its capacity, one just needs to add a new node to handle more data.

PROBLEM: SQL DATABASES DO NOT SCALE, NOSQL LACKS SQL QUERIES

Figure 1: Other Databases Scale Out Logarithmically so cost increases exponentially

A main issue is that today's operational SQL databases are either centralized or, if distributed, scale out logarithmically (see our blog post on Cluster Replication), resulting in very inefficient and quite limited scaling. A good number of NoSQL technologies can scale, and some of them, typically distributed key-value data stores, can scale out linearly. However, NoSQL data stores are not efficient at querying data.

Figure 2: NoSQL databases Result In Losing SQL as Query Language

This situation results in either going to NoSQL and accepting lower performance for queries over the NoSQL data, or creating a complex architecture, such as a lambda architecture, that combines SQL and NoSQL. Both approaches result in a high total cost of ownership (TCO), due to either the low efficiency or the higher cost of developing and maintaining the code.

SOLUTION: LEANXCALE HORIZONTAL LINEAR SCALABILITY

One key differentiating feature of LeanXcale is its horizontal linear scalability (see our blog post on Scalability). LeanXcale is a shared-nothing distributed database (see our blog post on Shared Nothing) that runs on commodity hardware, either on-premise or in the cloud. Figure 3 depicts the architecture of LeanXcale in terms of subsystems. At the bottom, there is LeanXcale’s own proprietary storage engine, known as KiVi. KiVi is an ultra-efficient distributed storage engine (actually, a relational key-value data store).

Figure 3: Architecture of LeanXcale Platform

Because KiVi is itself a distributed relational key-value data store, LeanXcale relational tables can be accessed from both the SQL interface and the KiVi native NoSQL interface. This dual interface is extremely convenient because data can be ingested efficiently at very high rates, with very few resources, by using the native key-value interface. KiVi implements all SQL algebraic operators except joins. This means that, through its native interface, it can process any query with filtering, aggregation, grouping and sorting, without the overhead of SQL processing. KiVi is integrated with LeanXcale's ultra-scalable transactional manager, which means that it is a fully ACID key-value data store.

On top of KiVi and the transactional manager sits our distributed SQL query engine. The SQL query engine enables access to the data using SQL. It provides full SQL, along with JDBC and ODBC drivers to access the database. The horizontal scalability delivered by LeanXcale is linear, which is the optimal type of scalability. If one uses a cluster with 200 nodes, then one gets 200 times the throughput of a single node. Although this sounds natural, it is very hard to obtain. Most distributed databases scale only logarithmically (and centralized databases do not scale out at all), so adding nodes brings little improvement over the throughput of a single-node deployment. Figure 4 shows the linear scalability achieved by LeanXcale with TPC-C, the industry benchmark for operational databases.

Figure 4: LeanXcale Linear Scalability with TPC-C Benchmark from 1 to 200 nodes

USE CASE EXAMPLE: SCALING IT INFRASTRUCTURE MONITORING (ITIM)

A monitoring tool is a system that retrieves information from different devices, agents or probes. It identifies the topology and creates a set of metrics and KPIs. These metrics and KPIs allow us to forecast the behavior of the system, spot problems and identify their root cause. They are persisted in a database. When the volume of metrics grows, several strategies need to be used. The most common are sharding and the use of complex architectures.

Let’s describe a fictional story in ITIM, featuring a new on-line monitoring company LxCMon, to characterize the limitations of these patterns and later highlight the benefits of linear scalability in this context. In this story, LxCMon has developed a multitenant SaaS platform to monitor VMs’ behavior. After some time, the SaaS platform starts to have remarkable traction in SME and corporate markets. At this point, the DevOps team realizes that their database’s capacity is about to be overwhelmed. The engineering team decides to use explicit sharding of the database to address the scalability issue, so the customer base is split across several isolated database instances.

However, the metrics from corporate customers become so massive that LxCMon needs to persist them in several independent database instances. Since some global metrics require the information stored in these independent database instances to be aggregated, LxCMon cannot compute them in real time for their corporate customers. This lack of real-time information negatively impacts their forecasting ability and their MTTR (Mean Time To Repair).

On the other hand, some SME customers are disruptive startups and others ultimately disappear. Balancing the data across the different shards is an impossible task, so some shards end up overloaded while others are underloaded, with the consequent waste of capacity. Deciding how to split the metrics from the customer base across the independent database instances to reduce the hardware footprint becomes a very challenging moving target.

To solve all these issues, the LxCMon engineering team implements a complex architecture with several types of databases. This complexity increases the total cost of ownership: more licenses, a larger footprint, multiple copies of the data (and therefore higher storage costs) and more experts required on different database engines. Furthermore, as the different database engines evolve, troubleshooting becomes more difficult because a problem can lie in one engine or another, or in the integration between them. The result is a higher price and longer outages, since the more complex the solution, the longer it takes to find and solve an issue. This makes the platform less appealing for users, who have to pay more for the same monitoring tool with a quality of service that keeps degrading over time. In addition, developing new functionality takes longer due to the platform's complexity. With a longer time to market (TTM), the product becomes less competitive and less disruptive. Ultimately, LxCMon's traction declines.

A database that scales linearly would have been the optimal solution for a monitoring platform such as LxCMon. A monitoring solution based on a linearly scalable database can:

Maintain the same total cost of ownership (TCO) per collected metric, regardless of how many metrics are collected in parallel.
Have a simple inventory process where every metric is in the single database.
Reduce the TTM since the architecture is just as simple as the one needed for a small monitored system, thus overcoming the scalability limitations without increasing software complexity.
Reduce the MTTR, since a single global database that is fed in real time increases forecast accuracy. This enables the general adoption of machine learning techniques, resulting in a boost for AIOps.

SCALABILITY ISSUES WITH SQL DATABASES

SQL databases mostly resort to cluster replication (see our blog post on Cluster Replication). Cluster replication relies on full replication at all nodes to scale out the read workload. However, the write workload has to be processed at every node, which yields logarithmic scalability (see Figure 1). This means that, at most, the throughput of a single node can be multiplied by 2 or 3 using a cluster of 5 to 10 nodes, which is highly inefficient.
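
To get a feel for these numbers, here is a rough back-of-the-envelope model. It is a simplification assumed for this sketch, not the analysis from the Cluster Replication post. Assume every node has the same capacity, every write is executed on all nodes, and reads are spread evenly. If a fraction w of the workload is writes, each of the N nodes handles a share w + (1 - w)/N of the total load, so the speedup over a single node is 1 / (w + (1 - w)/N). The function names and the write fraction of 0.3 are hypothetical. Note that this simple model levels off at 1/w instead of following an exact logarithm, but it shows the same diminishing returns.

```python
def replicated_speedup(nodes: int, write_fraction: float) -> float:
    """Speedup of a fully replicated cluster over one node (simplified model).

    Assumes each write runs on every node and reads are balanced evenly,
    so each node handles write_fraction + (1 - write_fraction) / nodes
    of the total workload.
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    return 1.0 / (write_fraction + (1.0 - write_fraction) / nodes)

def linear_speedup(nodes: int) -> float:
    """Ideal linear scale-out: N nodes deliver N times the throughput."""
    return float(nodes)

if __name__ == "__main__":
    # Hypothetical workload with 30% writes.
    for n in (1, 5, 10, 200):
        print(n, round(replicated_speedup(n, 0.3), 2), linear_speedup(n))
```

With 30% writes, this model gives roughly 2.3x at 5 nodes and 2.7x at 10 nodes, in line with the "2 or 3" range quoted above, and it can never exceed 1/0.3, about 3.3x, no matter how many nodes are added.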

Some other databases rely on a shared disk architecture (see our blog post on Shared Nothing for an overview of this architecture and a comparison with other database architectures). A shared disk architecture needs to perform distributed locking across the different nodes to prevent two nodes from updating the same disk block simultaneously. For each data block being modified, this distributed locking means broadcasting the lock request to all the nodes and then, before releasing the lock, broadcasting the changes in the block to all the nodes that hold a copy of it. This severely limits the scalability of the approach and, as with cluster replication, results in logarithmic scalability.

QUERY ISSUES WITH NOSQL DATA STORES

Continue reading on LeanXcale blog.