【转】Data Modeling vs Analytic Modeling in SAP HANA

时间:2021-11-29 00:38:21
I have been working with clients and helping them implement solutions on SAP HANA standalone for about four months. I have learned quite a bit from firsthand experience in dealing with items ranging from loading data, configuring security, and creating multi-dimensional models within HANA. I have also seen firsthand the potential HANA has to be a game changer in terms of data processing and analysis. In preparation for these projects I have conducted an exorbitant amount of research to help me gain a deeper understanding of the engines, capabilities and technology within HANA. I am now at a point where I feel that I have done my homework and can comment on a few observations based on both my experiences and research. The first item that I believe to be an important discussion point is differentiating the terms HANA Modeling and good old fashion Data Modeling. With the current version of HANA there is still a difference, and I will explain those differences in this article. For the record, this article is centered on standalone SAP HANA and not BW Powered by HANA. There is a significant difference between the two solutions and most aspects of this article will not apply to companies that plan to implement BW Powered by HANA.

Before I dig deep into the differences, let me define what I mean by Data Modeling. In general terms, Data Modeling is a mix of both business processes and ETL technology that results in the transformation of normalized and disparate data into a clean and organized structure suitable for decision making. It plays a key role in an organization’s Data Governance program. In many cases the result of a Data Modeling exercise will be a data warehouse or data mart. However, the primary purpose of the Data Modeling task is to provide consistent and quality access to data. Without following these processes, Business Intelligence for large and sometimes small organizations will be anything but Intelligent. I have seen this processes overlooked many times over the past decade and can attest, that organizations that fail to implement a solid data foundation will flounder in delivering meaningful analytics. If you would like to dive deeper into the various facets of Data Modeling, there is a descent description of it available on Wikipedia . In terms of this article and SAP HANA, I am referring to utilizing this process to transform, clean and load data into a physical star schema model utilizing HANA columnar tables as the target. If you have done any research or seen presentations on SAP HANA you are likely aware that the HANA studio client tool contains a subset of modeling tools and objects. HANA modeling allows you to create rich multi-dimensional analytics and views of data. The product is very similar to OLAP by result but fundamentally different in terms of data storage and indexing. Most OLAP tools on the market today will build a proprietary database called a Cube, and store it on a disk array (MOLAP). User will then access this data via OLAP tools or OLAP middleware. HANA modeling does not create persistent cubes, but rather allows developers to define metadata that produces CUBE like features. HANA will use this metadata and its calculation engines to retrieve data from the base in-memory columnar tables at lightning fast speeds. I could spend several thousand additional words describing all the reasons that HANA is revolutionary in terms of aggregating data, but there is more than enough information available on SAP SDN to explain how HANA works. For this article, I simply want to describe the differences between traditional Data Modeling and modeling analytics on SAP HANA. If you plan to implement solutions on SAP HANA standalone you need to understand that HANA alone does not solve problems dealing with the structure, organization or quality of your data. With that said, HANA will provide the ability to analyze, aggregate, compare, calculate and forecast extremely large data volumes at speeds unlike any database currently available. It truly is unique in terms of how it has merged in-memory database storage, parallelization of calculations and OLAP processing into a single solution. However, HANA is not an ETL tool and it is not a solution that allows you to forgo Data Governance and Data Modeling. To emphasize this point I want to speak technical about the types of transformations and de-normalization that cannot be achieved by HANA alone (in its current release). I will also review the types of transformations and de-normalization that can be achieved by HANA today. Based on my experience and research there are three types of transformations that will have to be solved outside of HANA. There are more, but I will remain focused on the actual problems I have experienced. In all of my projects to date, we selected Data Services 4.0 to load data into HANA. This turned out to be a wise decision, because it was able to help manage the complex data modeling problems that occurred due to the normalized nature of the data we loaded.

Transformation Example 1 The first problem comes in the form of normalized hierarchical data. Hierarchal data can be stored in relational tables in two extremes. The first extreme is based on a normalized approach. This is a technique where records are stored with a parent child relationship. In the example below each ITEM ID is stored with its PARENT ITEM ID, TYPE and DESCIRPTION. This is not a very practical example, but given the number of NDRs I have signed to date, I want to make sure that I don’t disclose any customer data or data structures. Hopefully you can use your imagination.

ITEM ID PARENT ITEM ID ITEM DESCIRPTION ITEM TYPE
1 North America Continent
2 1 Florida Stage / Region
3 2 Miami City
4 1 Georgia Stage / Region
5 4 Atlanta City
6 1 Texas Stage / Region
7 6 Dallas City
8 4 123 Main St Address
9 2 4356 Palms Ave. Address
10 6 456 Cowboy Way Address
11 4 123 Jones St Address

This storage method will reduce the overall size of the data within the table by eliminating duplicate character values. It will also increase the performance of inserting new values because you only have to perform a single row insert or update when modifying the relationships. This technique is very typical and you will see it implemented in OLTP optimized databases or source system applications for a variety of reasons. The other extreme of storing hierarchical data comes in a more de-normalized format. In this example the data values will be stored in a way that is highly column repetitive but low record cardinality. In this example the ADDRESS ID is the primary key of the table.

Address ID Address City Name State / Region Continent
8 4356 Palms Ave. Miami Florida North America
9 123 Main St Atlanta Georgia North America
10 456 Cowboy Way Dallas Texas North America
11 123 Jones St Atlanta Georgia North America

If you think in terms of creating Dimensions in a star schema, the de-normalized form of this data is required to eliminate rows from being repeated after joining this table to a Fact Table. If your source data in normalized you need a way to transform it into the de-normalized form before joining it to a Fact Table. While you might be able to devise a solution using a recursive procedure, a database view, or other HANA coding, you should keep in mind that this type of de-normalization processing will be performed each time a dependent query is executed. Even with HANA and all its features, this is neither a scalable methodology nor recommended approach. Given the above example, the best solution is to utilize Data Services to extract the data from the source system, transform (de-normalize) the data, then load it into HANA. With this approach, you perform the work on a scheduled or near real-time basis but not each time the data is queried. There are other reasons that storing this type of data in its de-normalized form is advantageous on HANA. For more details you can review the SAP HANA Database developer’s guide .

 Transformation Example 2 Another example where Data Services was needed occurred when a reverse pivot transformation was required. With normalized data it is highly likely that you find an example where multiple attributes of a single value are stored as multiple records. In the example below you will see that a single order has multiple status dates:

ORDER ID STATUS TYPE STATUS DATE
1 CREATED 1/1/2012
1 SHIPPED 1/4/2012
2 CREATED 1/5/2012
3 CREATED 1/5/2012
3 SHIPPED 1/8/2012
3 INVOICED 1/9/2012

If you join this table, in its native format, to the ORDER DETAILS table, you will again create a situation where the order amount is repeated for each status. In Data Modeling terms we can solve this problem by utilizing a reverse pivot transformation on the ORDER STATUS TABLE. Once pivoted the results will look like this:

ORDER ID DATE_CREATED DATE_SHIPPED DATE_INVOICED
1 1/1/2012 1/4/2012
2 1/5/2012
3 1/5/2012 1/8/2012 1/9/2012

Once the dates are pivoted on the ORDER ID axis we can be assured that there will be only one record for each ORDER ID. This will be very useful in situations where you need to combine these dates with other summarized order related metrics without affecting the overall cardinality of the results. Below is an example results sets.

ORDER ID DATE CREATED DATE SHIPPED DATE INVOICED Num Order Lines Order Amount
1 1/1/2012 1/4/2012 2 2345.45
2 1/5/2012 6 3456.90
3 1/5/2012 1/8/2012 1/9/2012 7 4567.19

Transformation Example 3 The final example revolves around the need to perform data quality updates or checks on data. There are no features or functions within HANA to perform data quality updates to source system data as it is loaded into HANA or queried from HANA. This can be very problematic if there are no constraints on the source system applications that populate the source data. There are several transforms in Data Services that allow developers to validate and consequently update data based on rules, master data, or other data sources. Imagine that you have an analytic that is based on city and month. If your city is derived from an address that is captured via a manual data input process and there are no validations being performed on the front end system, it is highly likely that there will be quality issues with your data. If you load the “dirty” or “trash” data into HANA, without validation, your analytic will contain multiple variations of city names. How valuable are the results of an analytic when they look like the following example?

Month CITY SALES
Jan Atlant 345,6787
Jan Atlanta_ 3,456
Jan antlanta 21,456
Jan Atlanta 1,345,678
Jan AtlantaGa 456
Jan Atlantas 234

In my opinion, just because SAP HANA is capable of performing calculations 3000 times faster than other legacy databases, does not mean that you can forgo the governance processes of data. I have seen postings and tweets on the internet where some seam to suggest that the rules of Data Governance and Data Modeling have been replaced by the capabilities of SAP HANA. If you are of that opinion, I would have to respectfully disagree. In terms of Data Governance and Data Modeling, SAP HANA is simply a platform to store massive amounts of multi-dimensional data and subsequently provide lightning fast access to that information. That means that other processes and tools must be leveraged in order to load quality data into HANA. Now that I have highlighted a few examples of where SAP HANA is unable to transform data, I would like to describe a few scenarios where SAP HANA is able to de-normalize or transform data. When developing Attribute Views (Dimensions) in HANA studio you are able to perform the following functions.

  • Join tables on one or more columns
  • Filter columns to eliminate various results
  • Derive columns based on standard HANA SQL functions
  • Create database views and incorporate them into the Attribute View
  • Build Date and Time based Attribute Views based on system tables stored in HANA. This is similar to building a Data Dimensions, but HANA can manage the dates.

You will find that some of these options are very similar to the capabilities of the BusinessObjects Universe but they are implemented very differently. Depending on the normalized nature of your source system data, the above options will likely be sufficient to “model” the data for analytic purposes. However these options will not provide data quality mechanisms nor advanced transformation capabilities as discussed before. To give you a better idea of what an attribute view is let me give you a few examples. If your company is selling computers you would likely have attribute views that generate a list of the products and product categories. You would also likely find Attribute Views for Sales Dates, Sales Locations, Distribution Centers, Vendors and many other items. When developing Analytic Views you will find a similar set of capabilities but the goal of the Analytic View is to define the data foundation and link it to existing Attribute Views. The data foundation is effectively the table that contains the values that are measured. Again, if your company is selling computers and storing the sales transactions in the sales_order_table, the sales_order_table would likely be your foundation table. Within the data foundation you will define private attributes and measures. The measures will represent items like “Sales Amount”, “Quantity” and “Sales Tax”. You would then join the sales_order_table, utilizing the private attributes defined in the foundation, to existing Attribute View using defined keys. In the end you should have a star-schema logic view of the data that contains both Dimensions and Measures. When developing calculation views you will find a few more options to help calculate measures that might span multiple Analytic Views. There are also options to help you forecast measures. The Calculation View is very unique in that provides support for both graphical and script based calculations. The script based interface allows developers to utilize ANSI92 SQL or special “CE” functions on SAP HANA to generate views of the data. The CE functions are preferred because they are optimized for the parallel execution of the code. The graphical interface will allow developers to perform common calculations such as aggregating values from multiple Analytic Views that share common Attributes and forecasting Measures. The scripting engine allows developers to perform more complex calculations that con not be expressed graphically. The modeling capabilities of SAP HANA are very unique in that they are directly imbedded into the RDMS. Other products will separate the RDMS and OALP tools into different engines. For example, Microsoft SQL Server and Microsoft SSAS are separate engines. With SAP HANA, the two components are intertwined (In-Memory) to allow for superior query performance. Hopefully after reading this posting you will have a better understanding of the modeling capabilities of SAP HANA. It is also my hope that you will understand the role that Data Services 4.0 plays in Data Modeling.