Using SQLContext, Apache Spark SQL can read data directly from the file system. This is useful when the data you want to analyze does not reside in Apache Hive (for example, JSON files stored in HDFS).
Using HiveContext, Spark SQL can also read data through the Hive Metastore. If you already use Hive, use HiveContext: it supports all Hive data formats and user-defined functions (UDFs), and it gives you full access to the HiveQL parser. HiveContext extends SQLContext, so HiveContext supports all SQLContext functionality.
[spark@dn04 root]$ spark-shell
scala> import org.apache.spark.sql._
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> val dfcsv = sqlContext.read.csv("/data/common/emp.csv")
scala> dfcsv.show(2)
+----+-----+--------+----+---------+----+----+---+
| _c0|  _c1|     _c2| _c3|      _c4| _c5| _c6|_c7|
+----+-----+--------+----+---------+----+----+---+
|7369|SMITH|   CLERK|7902|17-DEC-80| 800|null| 20|
|7499|ALLEN|SALESMAN|7698|20-FEB-81|1600| 300| 30|
+----+-----+--------+----+---------+----+----+---+
only showing top 2 rows
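Because the CSV file has no header row, Spark assigns the generic column names _c0 through _c7. A minimal sketch of assigning meaningful names with `toDF` is shown below; the column names here are assumed from the classic Oracle EMP table and may need to be adjusted for your data:

```scala
// Hypothetical sketch: renaming the header-less CSV columns.
// Column names are assumed (classic EMP schema); adjust to match your file.
val emp = sqlContext.read.csv("/data/common/emp.csv")
  .toDF("empno", "ename", "job", "mgr", "hiredate", "sal", "comm", "deptno")

// Columns can now be referenced by name instead of _c0, _c1, ...
emp.select("ename", "sal").show(2)
```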
Reading a JSON file in HDFS
scala> val df = spark.read.json("/flume/twitter/FlumeData.1501481206851")
scala> df.show(2)
scala> df.select("text").show(2)
scala> df.select($"text",$"id").show(2)
scala> df.count()
Query Hive Table
scala> sql("SELECT * FROM scott.dept").show()
+------+----------+--------+
|deptno|     dname|     loc|
+------+----------+--------+
|    10|ACCOUNTING|NEW YORK|
|    20|  RESEARCH|  DALLAS|
|    30|     SALES| CHICAGO|
|    40|OPERATIONS|  BOSTON|
+------+----------+--------+
Connect to Spark with beeline
By default, the Hortonworks Spark2 Thrift Server listens on port 10016; connect to that port.
[spark@dn04 current]$ beeline
Beeline version 1.2.1000.2.6.1.0-129 by Apache Hive
beeline> !connect jdbc:hive2://dn04:10016/scott
Connecting to jdbc:hive2://dn04:10016/scott
Enter username for jdbc:hive2://dn04:10016/scott:
Enter password for jdbc:hive2://dn04:10016/scott:
Connected to: Spark SQL (version 2.1.1.2.6.1.0-129)
Driver: Hive JDBC (version 1.2.1000.2.6.1.0-129)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://dn04:10016/scott> show tables;
+-----------+------------+--------------+--+
| database  | tableName  | isTemporary  |
+-----------+------------+--------------+--+
| scott     | dept       | false        |
| scott     | dual       | false        |
| scott     | emp        | false        |
+-----------+------------+--------------+--+
3 rows selected (0.121 seconds)
Connect to Spark SQL via JDBC/ODBC
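The same Thrift endpoint used by beeline above can be reached from any JDBC client. Below is a minimal, hypothetical Scala sketch: the URL, username, and table name are taken from the beeline session above, and it assumes the Hive JDBC driver (hive-jdbc and its dependencies) is on the classpath.

```scala
// Hypothetical sketch: querying Spark SQL over JDBC via the Hive JDBC driver.
// Assumes the Spark2 Thrift Server from above is listening on dn04:10016
// and that hive-jdbc is on the classpath.
import java.sql.DriverManager

object SparkSqlJdbcExample {
  def main(args: Array[String]): Unit = {
    val url = "jdbc:hive2://dn04:10016/scott"
    // Username/password as prompted by beeline above; password may be empty.
    val conn = DriverManager.getConnection(url, "spark", "")
    try {
      val stmt = conn.createStatement()
      val rs = stmt.executeQuery("SELECT deptno, dname, loc FROM dept")
      while (rs.next()) {
        println(s"${rs.getInt("deptno")}\t${rs.getString("dname")}\t${rs.getString("loc")}")
      }
    } finally {
      conn.close()
    }
  }
}
```

ODBC clients can reach the same endpoint through a Hive/Spark ODBC driver configured with the same host and port.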