“SAP HANA smart data access enables remote data to be accessed as if they are local tables in SAP HANA, without copying the data into SAP HANA. Not only does this capability provide operational and cost benefits, but most importantly it supports the development and deployment of the next generation of analytical applications which require the ability to access, synthesize and integrate data from multiple systems in real-time regardless of where the data is located or what systems are generating it.”

Databases currently supported by SAP HANA smart data access include:

SAP Sybase Adaptive Server Enterprise: version 15.7 ESD#4
Intel Distribution for Apache Hadoop: version 2.3 (this includes Apache Hadoop version 1.0.3 and Apache Hive 0.9.0)

SAP Note 1868209: Additional information about SPS06 and smart data access
SAP Note 1868702: Information about installing the drivers that SAP HANA smart data access supports

UPDATE (Dec 04 2013): As of SPS07 Hortonworks HDP1.3 (When’s HDP 2.0 coming?) appears to have been added to the official list, and remote caching of HADOOP sources has been added, which should hopefully speed up queries for those tables in HADOOP that aren’t changing frequently.

UPDATE (Jan 29 2014): SAP HANA Academy now has a great collection of videos using Smart Data Access.

SDA: HADOOP - Configuring ODBC drivers | SAP HANA
SDA: HADOOP - using the remote HADOOP data source | SAP HANA

Using Smart Data Access (SDA) with HADOOP seems to me a great idea for balancing the strengths of both tools. Unfortunately, for real-time responsiveness HIVE SQL currently isn’t the best tool in HADOOP. Cloudera’s Impala, the Hortonworks Stinger initiative and MapR’s Drill are all trying to address real-time reporting. I’ve only tested Impala so far, but I’ve noticed speed improvements of 10 to 100 times over standard HIVE SQL queries. With that in mind I thought it would be interesting to test them both in HANA using SDA.

Unfortunately I’m using Cloudera’s open-source Apache Hadoop distribution (CDH), which isn’t on SAP’s approved list yet. However, since SDA uses ODBC, I’ve managed to get it working using a third-party ODBC driver from Progress|DataDirect.

NOTE: Since CDH is not currently on this list, I’m sure SAP will NOT recommend using this in a production environment. If you do get it working in a sandbox environment, though, why not help by adding your voice for it to be certified and added to the ‘official’ list.

With the disclaimers out of the way, this is how SDA works.

Once you have your ODBC drivers installed properly, Remote Sources can be added for both HIVE and IMPALA.

Expanding the Remote Sources shows the tables that can be accessed by HANA.

NOTE: For me, expanding a node in the HIVE1 tree takes almost 20 seconds each time (perhaps it uses MapReduce?), whereas IMPALA1 nodes in the hierarchy expanded quickly.

In the above screenshots you will notice that both HIVE1 & IMPALA1 share the same tables, as they use the same HADOOP metastore. Data is NOT replicated between HIVE tables and IMPALA tables. The metastore just points to the location of the table files within the HADOOP ecosystem, whether they are stored as text files, HBASE tables or column store PARQUET files (to list just a few). There are some table types (file types) that can only be read by HIVE or by IMPALA, but there is a large overlap and this may converge over time.

Select Create virtual tables, from your Remote Source, in the schema of your choice.

NOTE: I’ve previously created a ‘HADOOP’ schema in HANA to store these virtual tables.

Once created, you can open the definition of the new virtual tables, as per normal HANA tables.

Simple HIVE query on my extremely small and low-powered HADOOP cluster (23 seconds)

NOTE: In the HADOOP system, you can see above that HIVE’s MapReduce job is kicked off.

Simple IMPALA query on my extremely small and low-powered HADOOP cluster, reading the SAME table as HIVE (< 1 second)

With Impala the source table type may impact speeds as well, as these 2 simple examples demonstrate:

IMPALA HBASE table (40K records in 4 seconds)
IMPALA PARQUET Column Store (60 million records in 3 seconds)

HADOOP HBASE source tables are better for small writes and updates, but are slower at reporting. HADOOP IMPALA PARQUET tables use column store logic (similar to HANA column tables), which takes more effort to write to efficiently, but they are much faster at reads (assuming not all the fields in a row are returned, which is not that dissimilar to HANA column tables either). You can think of PARQUET tables as being like the part of a HANA column table after MERGE DELTA, whereas the HBASE table is more like the uncompressed part of a HANA column table PRIOR to MERGE DELTA.
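
The remote source and virtual table steps described in this post are done through the HANA Studio menus, but they can also be expressed as SQL. The following is only a rough sketch in Python that builds those statements as strings, so they can be reviewed before being pasted into a SQL console. The statement shapes follow HANA's `CREATE REMOTE SOURCE` and `CREATE VIRTUAL TABLE` syntax as I understand it. The adapter name, DSN, credentials, the remote path parts (`<NULL>`, `default`) and the table name `sales` are all assumptions for illustration, not values taken from my system.

```python
def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def quote_literal(value: str) -> str:
    """Quote an SQL string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def create_remote_source_sql(name, adapter, dsn, user, password):
    """Build a CREATE REMOTE SOURCE statement for an ODBC DSN.

    The adapter and DSN names are placeholders; use whatever your
    ODBC driver setup actually defines.
    """
    credentials = "user=" + user + ";password=" + password
    return (
        "CREATE REMOTE SOURCE " + quote_ident(name)
        + " ADAPTER " + quote_ident(adapter)
        + " CONFIGURATION " + quote_literal("DSN=" + dsn)
        + " WITH CREDENTIAL TYPE 'PASSWORD'"
        + " USING " + quote_literal(credentials)
    )


def create_virtual_table_sql(local_schema, local_table, remote_source, remote_path):
    """Build a CREATE VIRTUAL TABLE statement.

    remote_path holds the remaining parts of the remote object name
    (for example database, schema, table); what they look like depends
    on the remote source, so treat the example values as hypothetical.
    """
    target = quote_ident(local_schema) + "." + quote_ident(local_table)
    source = ".".join(quote_ident(part) for part in (remote_source, *remote_path))
    return "CREATE VIRTUAL TABLE " + target + " AT " + source


if __name__ == "__main__":
    # Hypothetical names throughout: only the HADOOP schema and the
    # IMPALA1 remote source name come from the post itself.
    print(create_remote_source_sql("IMPALA1", "hiveodbc", "impala1", "demo_user", "demo_pass"))
    print(create_virtual_table_sql("HADOOP", "IMPALA_SALES", "IMPALA1", ("<NULL>", "default", "sales")))
```

Generating the statements rather than executing them keeps the sketch independent of any particular HANA client library; the output can be checked by eye first, which seems sensible for an unsupported driver combination like this one.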