In this podcast I talk with Mike Rabinovici of Dimodelo Solutions about data being the new currency, the importance of showing customers the art of the possible, and last but not least my go-to TV show. Data Virtualization vs. ETL (extraction, transformation, and loading): if you are building a data warehouse, should you move all the source data into the data warehouse, or should you create a virtualization layer on top of the source data and keep it where it is?
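To make the question concrete, here is a minimal sketch of the two approaches using Python's built-in `sqlite3` module. It assumes a single in-memory database standing in for both the source system and the warehouse, and the names `source_orders`, `dw_orders`, and `v_orders` are hypothetical. A copied table plays the role of data movement, and a view plays the role of a virtualization layer; a real virtualization tool works across separate systems, which this sketch does not attempt to model.

```python
import sqlite3


def build_demo():
    """Create a toy source table, a moved (copied) table, and a virtual view."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE source_orders (id INTEGER PRIMARY KEY, amount_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO source_orders VALUES (?, ?)", [(1, 1000), (2, 2500)]
    )
    # Data movement (ETL): transform the rows and physically copy them once.
    conn.execute(
        "CREATE TABLE dw_orders AS "
        "SELECT id, amount_cents / 100.0 AS amount FROM source_orders"
    )
    # Virtualization: define the same transformation as a view, so the data
    # stays in the source table and is transformed at query time.
    conn.execute(
        "CREATE VIEW v_orders AS "
        "SELECT id, amount_cents / 100.0 AS amount FROM source_orders"
    )
    return conn


def totals(conn):
    """Return (total from the moved copy, total from the virtual view)."""
    moved = conn.execute("SELECT SUM(amount) FROM dw_orders").fetchone()[0]
    virtual = conn.execute("SELECT SUM(amount) FROM v_orders").fetchone()[0]
    return moved, virtual


conn = build_demo()
before = totals(conn)

# A new order arrives in the source system after the load has run.
conn.execute("INSERT INTO source_orders VALUES (3, 500)")
after = totals(conn)

print("before new order:", before)
print("after new order: ", after)
```

After the new order arrives, the view reflects it immediately while the copied table stays stale until the next load. The trade-off is that the copy can be indexed and laid out for fast queries, whereas the view pays the transformation cost on every read.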
Another common scenario is when you will frequently join data sets from multiple sources and the performance needs to be super fast. These turn out to be the scenarios for most data warehouse solutions. But there could be cases where you will have many ad-hoc queries that don’t need to be super fast. And you could certainly have a data warehouse that uses data movement for some tables and data virtualization for others. Also keep in mind that the virtualization tool you choose may not support some of your data sources. When it comes to building an enterprise reporting solution, there is a recently released reference architecture to help you choose the correct products. It will also help you get started quickly, as it includes an implementation component in Azure.
The idea is that you deploy a base architecture, then modify it as needed to fit your needs. The hard work of choosing the right products and building the starting architecture is done for you, reducing your risk and shortening development time. However, this does not mean you should use these chosen products in every situation. But for many who just need an enterprise reporting solution, this will do the job with little modification. Is the traditional data warehouse dead? This has led to a question I have started to see from customers: Do I still need a data warehouse or can I just put everything in a data lake and report off of that using Hive LLAP or Spark SQL?
Is the data warehouse dead? I think the ultimate question is: Can all the benefits of a traditional relational data warehouse be implemented inside of a Hadoop data lake with interactive querying via Hive LLAP or Spark SQL, or should I use both a data lake and a relational data warehouse in my big data solution? The short answer is you should use both. The rest of this post will dig into the reasons why. Why use a data lake?
What is a data lake? The main benefits I hear of a data lake-only approach: you don’t have to load data into another system and therefore manage schemas across different systems, and you avoid expensive data load times, data freshness challenges, the operational challenges of managing multiple systems, and extra cost. While these are valid benefits, I don’t feel they are enough to warrant not having a relational data warehouse in your solution. The term data lake usually refers to Hadoop-related technologies, but could mean a combination of Hadoop and relational technologies and tools. Should You Migrate Your Data Warehouse? I have seen Hadoop adopters typically falling into two broad categories: those who see it as a platform for big data innovation, and those who dream of it providing the same capabilities as an enterprise data warehouse but at a cheaper cost.
Big data innovators are thriving on the Hadoop platform, especially when it is used in combination with relational database technologies, mining and refining data at volumes that were never before possible. However, most of those who expected Hadoop to replace their enterprise data warehouse have been greatly disappointed, and in response have been building complex architectures that typically do not end up meeting their business requirements. The trade-off is between users reporting directly off the raw data and having IT make multiple copies of the data, cleaning, joining, and mastering it so that it is easier to report off of, while users wait for IT to do all this. Another risk in the first case is slower performance because the data is not laid out efficiently. Why use a SSAS cube?
Relational data warehouses continue to meet the information needs of users and continue to provide value. Many people use them, depend on them, trust them, and don’t want them to be replaced with a data lake. But not all data and information workers want to become power users. These people are best served with a data warehouse. I can’t stress this enough: if you need reports with high data quality, you need to apply the exact same transformations to the same data to produce that report, no matter what your technical implementation is. Whether you call it a data lake or a data warehouse, or use an ETL tool or Python code, the development and maintenance effort is still there.
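The point about applying the same transformations regardless of implementation can be sketched in a few lines of Python. This is only an illustration under assumed names: `standardize`, `total_by_customer`, and the sample rows are hypothetical and not taken from any particular tool. The same cleaning step is applied once at load time (warehouse style) and once at read time (data lake style), and the report is identical either way.

```python
def standardize(row):
    """Clean one raw record: normalize the customer name and parse the amount."""
    return {
        "customer": row["customer"].strip().title(),
        "amount": float(row["amount"]),
    }


def total_by_customer(clean_rows):
    """Build the report: total amount per customer from already-clean rows."""
    totals = {}
    for row in clean_rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


# Hypothetical raw data as it might land from a source system.
raw_rows = [
    {"customer": " alice ", "amount": "10.50"},
    {"customer": "ALICE", "amount": "4.50"},
    {"customer": "bob", "amount": "3"},
]

# Warehouse style: transform at load time, store the clean rows, report later.
warehouse_rows = [standardize(r) for r in raw_rows]
report_from_warehouse = total_by_customer(warehouse_rows)

# Data lake style: keep the raw rows, transform at read time for each report.
report_from_lake = total_by_customer(standardize(r) for r in raw_rows)

print(report_from_warehouse)
print(report_from_lake)
```

Either way, someone has to write and maintain `standardize`. Moving it from load time to read time changes when the work runs, not whether it has to exist.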