Physical or Virtualized Data Marts?

Data marts and data warehouses have served the business users’ requirements for a long time, but in today’s agile world, business users cannot wait for the data to be available after a few weeks or even months if the companies follow the SDLC cycle to create data marts.

The creation of data marts is required mainly in following scenarios:

The blending of data: Data coming from varied sources which may be a mix of relational and non-relational data. Most of the BI tools allow blending of data but the performance of the tools suffers especially if the volume of data is large, and user experience for drill downs and aggregations is also not as per expectations.

Data transformations and aggregations: Transformations to raw data and aggregations to a higher granular level are main use cases for using ETL tools to create data marts. Although basic transformations and aggregations can be done using BI tools, aggregating data on the fly could be costly and if the transformations are complex, we will have to use ETL tools.

Single source of truth:  For financial companies, and other companies where data accuracy, consistency, and reliability are of utmost importance, having a single source of truth is essential from a reporting perspective, and data marts can serve as a container for all financial and audit reporting.

Easy and direct integration with BI or visualization tools:  Data marts have been generally created in relational databases which provide a direct ODBC/JDBC source to the BI tools and don’t have integration challenges.

Can we fulfill these requirements without creating Data Marts?

The distributed computing and virtualization tools are helping to resolve some of these challenges, obliterating the need to build extensive and expensive data marts and providing ways to present data to the business in an agile manner.

Traditional virtualization tools like Composite / Denodo can provide a canonical layer from multiple sources of data. The virtualization layer provides pass-thru capabilities to the BI tools using direct ODBC / JDBC connectivity. But high-volume data may present challenges based on cluster sizing and server RAM limitations. Simple transformations can be done using these tools.This approach works fine with small to medium data sets with simple transformations.

With distributed computing, new frameworks and tools are now in the market, which provide big data (mostly Spark based) processing engines. These engines are not limited to a single server CPU/Memory for processing a single job, as they can distribute a single job to multiple servers. Installations of these tools can utilize existing Spark or Hadoop clusters. These tools can fulfill high-performance requirements with large volumes without creating physical data structures. The data can be kept virtualized or written back to databases or file system for audit purposes. A new open source framework called Apache Arrow has been developed and tools like Dremio look very promising for virtualization and big data processing.

Presto / Spark can be used for providing virtualized / real-time processing and some of these frameworks are being used by companies like airBNB, Netflix, Facebook. A new generation of visualization tools like druid can be used for real-time visualizations.

Big data based OLAP tools like Kyvos, Atscale can also be used as a virtual layer but need prep work to get data in cubes.

Data prep tools are a new set of lightweight ETL tools that are gaining popularity for agile data marts creation. These tools will create a data mart, but development and execution are more agile than the traditional ETL tools. Alteryx, Paxata are some of the prominent tools.

To conclude; companies which are looking for more agile options for providing blended high-volume data along with simple to complex transformations and persistence options should look at the above new frameworks and tools for BI / ETL and data prep. It could be little challenging to look at the plethora of tools which are in the market today, but starting with some quick POC’s is always the best option to understand their capabilities and what best fits in the organization based on access patterns and business requirements.

written by Rakesh Gupta

#Dataintegration #Bigdata #datawarehouse #dataanalytics #technology #businessintelligence #datavisualization