* developer
HBase : Apache?
Hive : Facebook
CloudBase : Business.com
* Hadoop
HBase – a release corresponding to each Hadoop version
Hive – Hadoop 0.17
CloudBase – Hadoop 0.18+
* SQL support
HBase : simple DB operations only; integrating with existing programs is difficult.
Hive : Hive QL (slightly different from SQL); a JDBC driver is provided, but integrating with existing programs is still difficult.
hive> CREATE TABLE pokes (foo INT, bar STRING);
Creates a table called pokes with two columns, the first being an integer and the other a string
hive> CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);
Creates a table called invites with two columns and a partition column called ds. The partition column is a virtual column. It is not part of the data itself but is derived from the partition that a particular dataset is loaded into.
By default, tables are assumed to be of text input format and the delimiters are assumed to be ^A (ctrl-a).
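If a different delimiter is needed, it can be declared when the table is created; a minimal sketch using HiveQL's ROW FORMAT clause (the table name pokes_tab and the tab delimiter are only illustrative):
hive> CREATE TABLE pokes_tab (foo INT, bar STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
Creates a table like pokes, but whose text files are expected to be tab-delimited.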
hive> SHOW TABLES;
lists all the tables
hive> SHOW TABLES '.*s';
lists all the tables that end with ‘s’. The pattern matching follows Java regular expressions; see http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html for documentation
hive> DESCRIBE invites;
shows the list of columns
As for altering tables, table names can be changed and additional columns can be added:
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
hive> ALTER TABLE events RENAME TO 3koobecaf;
Dropping tables:
hive> DROP TABLE pokes;
Loading data from flat files into Hive:
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Loads a file that contains two columns separated by ctrl-a into the pokes table. ‘local’ signifies that the input file is on the local file system. If ‘local’ is omitted then the file is looked for in HDFS.
The keyword ‘overwrite’ signifies that existing data in the table is deleted. If the ‘overwrite’ keyword is omitted, data files are appended to existing data sets.
NOTES:
- NO verification of data against the schema is performed by the load command.
- If the file is in HDFS, it is moved into the Hive-controlled file system namespace. The root of the Hive directory is specified by the option ‘hive.metastore.warehouse.dir’ in hive-default.xml. We advise users to create this directory before trying to create tables via Hive.
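For reference, a minimal sketch of what that entry looks like in hive-default.xml (the path shown is only the common default and may differ per installation):
<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>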
hive> LOAD DATA LOCAL INPATH './examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
The two LOAD statements above load data into two different partitions of the table invites. Table invites must be created as partitioned by the key ds for this to succeed.
SELECTS and FILTERS
hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';
selects column ‘foo’ from all rows of partition <DATE> of the invites table. The results are not stored anywhere, but are displayed on the console.
Note that in all the examples that follow, INSERT (into a hive table, local directory or HDFS directory) is optional.
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';
selects all rows from partition <DATE> of the invites table into an HDFS directory. The result data is in files (depending on the number of mappers) in that directory. NOTE: partition columns, if any, are selected by the use of *. They can also be specified in the projection clauses.
Partitioned tables must always have a partition selected in the WHERE clause of the statement.
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
Selects all rows from the pokes table into a local directory.
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='<DATE>';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
Sum of a column. AVG, MIN and MAX can also be used.
CloudBase : ANSI SQL; a JDBC driver is provided, but integrating with existing programs is still difficult.
CREATE TABLE table1 ( c1 VARCHAR, c2 INT) COLUMN SEP '|'
This creates an empty table, table1, with two columns, c1 and c2, of VARCHAR and INT data types respectively. The COLUMN SEP ('|') indicates that files added to this table will use '|' as the delimiter.
CREATE TABLE table1 ( c1 VARCHAR, c2 INT) COLUMN SEP '|' HDFS PATH 'logs/weblogs'
This creates a table, table1, and associates it with the path 'logs/weblogs' on the Hadoop file system. If this path exists and contains files, those files constitute the table data. The COLUMN SEP indicates that the '|' character is used as the delimiter in these files. If the path does not exist, it will be created; in that case the table will have no data, i.e. a SELECT query on it will return 0 rows.
CREATE TABLE table1 ( c1 VARCHAR, c2 FLOAT) COLUMN SEP '\t' IMPORT DATA FROM '/home/logs/weblogs'
This creates table1 and copies data from the given local path. TAB ('\t') is used as the delimiter character. The data is loaded onto the Hadoop file system in the directory cloudbase/data/<tablename>
CREATE XML TABLE <tablename>
( <column name> <data type> 'XJPath' [, <column name> <data type> 'XJPath' , …] )
START WITH 'start_tag' END WITH 'end_tag'
[ HDFS PATH <hdfs path> ]
[ IMPORT DATA FROM <local path> ]
[ COMMENT <string> ]
If the root tag has no attributes, one can simply use the ROOT TAG clause. The START WITH / END WITH form is preferred only when the XML root tag has attributes.
The input log file/data can be in any format as long as XML records can be identified by start and end tags. For example,
some text
<book> xml data record 1</book>
some text
<book> xml data record 2</book>
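To make the syntax above concrete, a minimal sketch for records delimited by the book tag (the table name, column names, XJPath expressions, start/end tag spelling and HDFS path are all illustrative assumptions):
CREATE XML TABLE books
( title VARCHAR 'book/title', price FLOAT 'book/price' )
START WITH '<book>' END WITH '</book>'
HDFS PATH 'logs/books'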
Drop table syntax-
DROP TABLE tablename [ DO NOT PURGE DATA ]
The INSERT statement is used to insert data-
• Into an already existent CloudBase table from local files.
• Into an already existent CloudBase table from another CloudBase table.
• Into an RDBMS table (via a database link).
Inserting data into CloudBase table from local files
INSERT INTO <tablename> IMPORT DATA FROM ‘local/path’
The local files should have the same structure as defined in the table's metadata, i.e. the same delimiter and the same number of fields.
Inserting data into CloudBase table from an existent table
INSERT INTO <tablename>
[ ( column_name1, column_name2, …) ]
<query>
Examples:
INSERT INTO table2
SELECT c1, c2, c3 FROM table1
This will dump the output of the query into table2
INSERT INTO table2
( col2, col1, col3 )
SELECT c1, c2, c3 FROM table1
This will dump the output of the query into table2 in the specified column order.
If fewer columns are specified than the table has, then the INSERT statement will insert NULL values for the remaining columns.
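For example, a sketch of an INSERT that names only two of table2's columns (col3, from the example above, would then be filled with NULL):
INSERT INTO table2
( col1, col2 )
SELECT c1, c2 FROM table1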
Inserting data into an external (RDBMS) table
INSERT INTO <tablename>@database_link_name
[ ( column_name1, column_name2, …) ]
<query>
Examples:
INSERT INTO table2@sql_server_link
SELECT c1, c2, c3 FROM table1
This will dump the output of query into table2 in the RDBMS as specified by the database link- sql_server_link
If fewer columns are specified than the table has, then the INSERT statement will insert NULL values for the remaining columns.
Select statement syntax
The syntax of SELECT statement is as follows-
SELECT [ DISTINCT ] [ TOP <number> ]
column1 [ , column2, column3, … ]
[ into_clause ]
FROM <from_list>
[ join_clause ]
[ where_clause ]
[ group_by_clause [ having_clause ] ]
[ order_by_clause ]
[ on error clause ]
Clauses shown in '[ ]' are optional.
FROM clause
One can use tables, views and/or subqueries in the FROM clause. Aliases can optionally be set for table names and views; however, setting aliases for subqueries is mandatory.
Examples:
• SELECT * FROM test_table1
• SELECT * FROM ( SELECT c1, c2 FROM test_table1) AS T
• SELECT DISTINCT c1 FROM test_table1
JOIN clause
CloudBase supports inner and outer (left, right, full) joins via ANSI SQL’s explicit join syntax-
• SELECT * FROM test_table1 AS t1 INNER JOIN test_table2 AS t2 ON t1.c1 = t2.c1
Join Syntax-
• SELECT * FROM test_table1 AS t1 [ OUTER ] LEFT | RIGHT | FULL JOIN test_table2 AS t2 ON <join condition>
Sub queries can also be used instead of tables/views –
• SELECT * FROM test_table1 AS t1 INNER JOIN (SELECT * FROM test_table2) AS t2 ON t1.c1 = t2.c1
WHERE clause
Rows returned by queries can be filtered using the WHERE clause. At present CloudBase supports the following conditions in the WHERE clause-
• Arithmetic condition- compare columns using arithmetic operators- <, >, =, <=, >=, !=, <> (either of !=, <> can be used for not equal to)
• Between condition- e.g. WHERE c1 BETWEEN 10 and 100. NOT can also be used with BETWEEN- WHERE c1 NOT BETWEEN 10 and 100
• IS NULL condition- e.g. WHERE c1 IS NULL. NOT can also be used- WHERE c1 IS NOT NULL
• LIKE condition- e.g. WHERE c1 LIKE '%abc%'. NOT can also be used- WHERE c1 NOT LIKE '%abc%'. To match on a single character use '?'. One can use regular expressions in the LIKE clause. See below for details.
Boolean AND / OR can be used to tie conditions together, e.g-
• SELECT * FROM test_table1 WHERE c1 LIKE ‘z%’ AND c2 > 10
• SELECT * FROM test_table1 WHERE c2 > 40 OR c3 < 20
CloudBase supports the SQL wild cards '%' and '?' in LIKE conditions. For conditions that cannot be expressed using these wild cards, one can use regular expressions. CloudBase supports regular expressions similar to those of the Java programming language; see the java.util.regex.Pattern documentation linked above.
Examples:
• SELECT c6 FROM test_table3 WHERE c6 LIKE 'b[ea]ta'
matches 'beta', 'bata'
• SELECT * FROM test_table3 WHERE c LIKE '(l|L)earning'
matches 'learning', 'Learning'
• SELECT c FROM test_table WHERE c LIKE '(z|Z)o{3}+m'
matches all values of column c that begin with z or Z followed by EXACTLY 3 o's and then m
• SELECT c FROM test_table WHERE c LIKE '(z|Z)o{3,}+m'
matches all values of column c that begin with z or Z followed by AT LEAST 3 o's and then m
• SELECT c FROM test_table WHERE c LIKE '(z|Z)o{3,6}+m'
matches all values of column c that begin with z or Z followed by AT LEAST 3 but NOT MORE THAN 6 o's and then m
To escape a regular expression construct, use '\'. For example-
• SELECT c FROM test_table WHERE c LIKE '\[abc\]'
matches the literal string '[abc]'
Aggregate functions and GROUP BY clause
CloudBase supports aggregate functions and the GROUP BY clause. The aggregate functions supported by CloudBase are SUM, COUNT, MAX, MIN, and AVG. One can use the GROUP BY clause in conjunction with the aggregate functions to group the result set by one or more columns. In the GROUP BY clause, one can use column names, aliases or the index of a column present in the select clause.
Examples:
• SELECT c4, COUNT(*) FROM test_table1 GROUP BY c4
• SELECT c1, SUM(c2) FROM test_table1 GROUP BY c1
• SELECT c1 as a, SUM(c2) FROM test_table1 GROUP BY a
• SELECT c1, c4, SUM(c2) FROM test_table1 GROUP BY 1,2
• SELECT COUNT(c1),MAX(c2),MIN(c3), c4 FROM test_table1 GROUP BY c4
DISTINCT in Aggregate functions
• SELECT COUNT( DISTINCT c1), MAX(c2) FROM test_table1
• SELECT COUNT( DISTINCT c1), MAX(c2), MIN(c3), COUNT(DISTINCT c4) FROM test_table1
DISTINCT in Aggregate functions with GROUP BY
• SELECT COUNT( DISTINCT c1), COUNT( DISTINCT c2), COUNT( DISTINCT c3), COUNT( DISTINCT c4), MAX(c2), MIN(c2), MAX(c3), MIN(c3), c4 FROM test_table1 GROUP BY c4
HAVING clause with GROUP BY
• SELECT COUNT(c1) cnt_c1, COUNT( DISTINCT c2) cnt_d_c1, SUM(c2) sum_c2, SUM(c3) sum_c3, c4 FROM test_table1 GROUP BY c4 HAVING cnt_d_c1 > 2 AND sum_c3 > 100
• SELECT COUNT(c1) cnt_c1, MAX(c4) max_c4, MIN(c4) min_c4, MAX(c5) max_c5, MIN(c5) min_c5, SUM(c4) sum_c4, SUM(c5) sum_c5, COUNT(DISTINCT c1), SUM(DISTINCT c4), c6 FROM test_table3 GROUP BY c6 HAVING c6 LIKE ‘%p%’
ORDER BY clause
The ORDER BY clause can be used to sort the result set on one or more columns. One can sort in ascending order (the default) or in descending order. Just like the GROUP BY clause, one can use a column name, alias or column index in the ORDER BY clause.
Examples:
• SELECT c1, c2, c3, c4 FROM test_table1 ORDER BY c1
• SELECT c1 as a, c2 as b, c3 as c, c4 as d FROM test_table1 ORDER BY a, b
• SELECT c1, c2, c3, c4 FROM test_table1 ORDER BY 3
• SELECT * FROM test_table1 ORDER BY c4 DESC, c2
For both Hive and CloudBase, insert works as a batch upload of files on the local file system or HDFS, of tables in a DB, or of the output of an external program fed in as a stream. There is no way to insert an individual record from within a program. The usage pattern is a batch job: upload a large amount of text data into the DB, run the desired DB operations, and collect the result. Rather than a DB to program against, it is a way of getting the desired result from large text data by borrowing DB-style operations, using only SQL queries instead of writing complicated MapReduce programs; the SQL statement is processed internally and executed as MapReduce jobs over HDFS. Whether individual-record inserts are possible would require checking the API docs or sample code, but the sample code on the documentation pages reviewed only shows batch uploads of data or XML files. In that respect, HBase actually integrates better with programs.
* non-Java support
HBase : jython, ruby, groovy script, REST, Thrift
Hive : Thrift, SerDe
CloudBase : jython
CloudBase, with its ANSI SQL support, is the best fit for programming, but apart from jython it has no non-Java support. Hive supports Thrift in addition to jython, but working through Thrift is expected to require roughly as much effort as porting to native Java.
* Conclusion
Rather than general-purpose DBs, their strength is that, for large text files such as daily search logs or Apache access logs, they can produce the desired result via MapReduce over HDFS using only SQL queries, without writing a separate MapReduce program. However, because insert or DB load ultimately means uploading the data onto HDFS, changes to the DB connection layer of existing applications are unavoidable, and non-Java support, in particular integration with C/C++, is not convenient.
Since an insert requires loading files on HDFS into the DB, and a select triggers a MapReduce job, they are not suitable for real-time services; they are suited to large batch jobs.
* reference :
1. Thrift – http://incubator.apache.org/thrift
2. REST – http://en.wikipedia.org/wiki/REST
3. Hive tutorial video – http://www.cloudera.com/hadoop-training-hive-tutorial