

The emergence of data lake solutions gave data engineers and developers the ability to work with structured and unstructured data quickly and conveniently, without extra flattening tasks.
Because a data lake relies on a file-system-like repository for its data source and leverages the power of Spark and Hadoop technologies, organizations choose it for their ever-growing datasets.
Driven by this demand, the data lake market is expected to keep growing year over year, at least in the U.S., according to this market analysis report. This is good news for users: data lake service providers will compete and keep enhancing their services to capture a larger share of the market.
In this article, we will look at three competing companies whose distinctive features aim to simplify data lake management and querying.
In traditional data lake solutions, it is common practice to choose a column for partitioning; once chosen, changing it means redistributing the data files. With small data and only a few data lake tables, this is not a problem. However, as table counts and data sizes grow, your data team can start to feel the maintenance burden. Varada's dynamic indexing prevents this.
In the background, Varada's technology breaks a large dataset down into nano blocks of 64K rows each. Dynamic indexing then inspects each nano block of the original dataset and automatically chooses an index for it, based on the block's actual data contents and structure.
For indexing, Varada uses several index types, including bitmap, dictionary, tree, and text-analysis indexes. A different indexing algorithm can be applied to each nano block, which makes indexing more precise and effective.
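The idea of choosing an index per nano block can be sketched as follows. This is a minimal illustration, not Varada's actual logic: the cardinality thresholds and the `choose_index` heuristic are invented for the example, and Varada's real selection algorithm is proprietary.

```python
NANO_BLOCK_ROWS = 64_000  # Varada's documented nano block size


def choose_index(values: list) -> str:
    """Pick an index type for one nano block from its actual contents.

    The thresholds below are illustrative assumptions only.
    """
    distinct = set(values)
    # Free-text values (strings containing spaces) -> text-analysis index.
    if distinct and all(isinstance(v, str) and " " in v for v in distinct):
        return "text"
    # Very low cardinality -> bitmap index.
    if len(distinct) <= 0.01 * len(values):
        return "bitmap"
    # Moderate cardinality -> dictionary index.
    if len(distinct) <= 0.2 * len(values):
        return "dictionary"
    # High cardinality / range lookups -> tree index.
    return "tree"


def index_column(values: list) -> list:
    """Split a column into nano blocks and choose an index per block,
    so different blocks of the same column can get different indexes."""
    return [
        choose_index(values[i : i + NANO_BLOCK_ROWS])
        for i in range(0, len(values), NANO_BLOCK_ROWS)
    ]
```

The point of the per-block decision is that a column whose early rows are low-cardinality codes and whose later rows are high-cardinality IDs would get a bitmap index for the first blocks and a tree index for the later ones, rather than one compromise index for the whole column.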
In another layer, Varada's monitoring system audits your queries and continuously evaluates cluster performance. Varada uses machine learning to identify repetitive patterns and hotspots in workloads and queries, which allows the platform to apply the optimal acceleration. The metrics are visible to users, so they can take informed action to optimize their queries, reduce time spent on data operations, and shorten time-to-insight.
Dremio's technology, called Data Reflections, is integrated into its data lake solution. As the name suggests, a Data Reflection represents the source data in a physically optimized way. When a user runs a query against a table, the table's Data Reflection is partially or entirely used to satisfy the query instead of reading through the whole set of source files. Since the table you want to query is already in a computed state (again, partially or entirely), this can enhance query performance.
Data Reflections are kept in a columnar representation based on Apache Parquet and Apache Arrow, further enhanced by compression techniques including dictionary encoding, run-length encoding, and delta encoding.
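The three encodings named above are standard columnar-compression techniques; a minimal sketch of each (illustrative only, not Dremio's or Parquet's internal implementation) looks like this:

```python
def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded


def delta_encode(values):
    """Store the first value, then successive differences (small deltas
    compress better than large absolute values)."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def dictionary_encode(values):
    """Replace repeated values with small integer codes plus a lookup table."""
    table = {}
    codes = [table.setdefault(v, len(table)) for v in values]
    return codes, list(table)
```

Each technique wins on a different data shape: run-length encoding on long runs of identical values, delta encoding on sorted or slowly changing numbers, and dictionary encoding on low-cardinality columns.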
There are three types of Data Reflections:
• Aggregation reflections: particularly useful for accelerating BI-style aggregation queries – those using GROUP BY and aggregate functions in the SELECT clause. These reflections are defined by providing a set of dimensions and measure values. When creating an aggregation reflection, Dremio automatically rolls up the selected measure values for all combinations of the specified dimension values. Then, at query time, the reflection can serve results from the pre-rolled-up version of the data instead of scanning and rolling up the raw data from scratch.
• External reflections: they can be used to leverage datasets that were created outside of Dremio as reflections. For example, a common use case is using aggregation tables created and maintained by an existing process outside of Dremio as a reflection, without having to replicate a similar pipeline in Dremio.
• Raw reflections: used to accelerate range lookups, common joins, repetitive transformation patterns, or simply slow datasets. These reflections are defined by providing a set of fields to be included in the reflection.
Although there are different reflection types, the underlying concept is the same: precompute what you want to query so that, at query time, you get faster performance.
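The precompute-then-serve idea behind aggregation reflections can be sketched in a few lines. The sample rows, column layout, and function names below are invented for illustration; in Dremio the materialization and query matching happen inside the engine, not in user code.

```python
from collections import defaultdict

# Hypothetical raw "source" rows: (region, product, revenue).
raw_rows = [
    ("EU", "widget", 10.0),
    ("EU", "widget", 15.0),
    ("EU", "gadget", 7.0),
    ("US", "widget", 20.0),
]


def build_aggregation_reflection(rows, dims=(0, 1), measure=2):
    """Pre-roll-up SUM(measure) for every combination of dimension values,
    analogous to materializing an aggregation reflection."""
    rollup = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)
        rollup[key] += row[measure]
    return dict(rollup)


# Built once, ahead of query time.
reflection = build_aggregation_reflection(raw_rows)


def query_total(region, product):
    """Answer a GROUP BY-style lookup from the reflection instead of
    rescanning raw_rows."""
    return reflection.get((region, product), 0.0)
```

At query time only the (much smaller) rollup is consulted, which is where the acceleration comes from: the cost of scanning and aggregating the raw rows is paid once, when the reflection is built or refreshed.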
Ahana is a fully integrated, managed data lake service for AWS. AWS Athena also uses Presto as its query engine, but Athena users can face limitations. The first hurdle is that they have no control over the Presto clusters: as a managed data lake service, AWS manages the clusters, presenting a black-box environment to its users. Furthermore, all users share the clusters. When you run a query, it enters a shared queue holding other users' query requests, which are then processed one by one.
In contrast, Ahana Cloud allows users to launch their own Presto clusters in AWS and gives them control over those clusters. With the Ahana SaaS Console, users can manage multiple Presto clusters as well as other system components such as the Hive Metastore. When you provision your Ahana environment, the solution also provides an instance of Apache Superset – an open-source BI and dashboarding tool – where you can run queries and visualize your data.
In this article, we highlighted three data lake solutions that can simplify data lake management and querying. Each has different features, and one may suit you better than the others. If you don't have existing data lake infrastructure, it will be important to discuss priorities internally before selecting a solution. If you are already using a data lake in your organization, you have probably experienced some inconveniences with it; use those pain points to ensure your next solution resolves them.