
Note: All posts take a practical approach and avoid lengthy theory. Everything has been tested on development servers; please don't try any post on production servers until you are sure.

Monday, May 04, 2020

Connect to Presto from Spark

If you have a Presto cluster as your processing layer, you can connect to it from Spark using Scala.


1- Copy the Presto JDBC driver JAR (published on Maven Central as com.facebook.presto:presto-jdbc) to the Spark jars directory, e.g. /opt/progs/spark-2.4.5-bin-hadoop2.7/jars


2- Run the Spark shell and connect using Scala. (Alternatively, the driver JAR can be supplied at launch with spark-shell --jars instead of copying it.)

[solr@te1-hdp-rp-nn01 ~]$ spark-shell

// ######### READING FROM PRESTO ####################
// In spark-shell a SparkSession (spark) and a SparkContext (sc) are pre-created;
// in a standalone application you would build them yourself:
//import org.apache.spark.sql.SQLContext
//import org.apache.spark.sql.SparkSession
//import org.apache.spark.SparkContext
//val spark = SparkSession.builder.master("local").appName("Read From Presto").getOrCreate() // create a Spark session
//sc.stop() // stop the existing SparkContext first (only one may be active per JVM)
//val sc = new SparkContext() // then create your own SparkContext


val JDBC_DRIVER = "com.facebook.presto.jdbc.PrestoDriver"
val DB_URL = "jdbc:presto://x.x.44.135:6060/kudu/default" // format: jdbc:presto://host:port/catalog/schema
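
//sanity check (not in the original post): throws ClassNotFoundException if the driver JAR from step 1 is not on the classpath
Class.forName(JDBC_DRIVER)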


//set the jdbc options

val jdbcOptions = spark.read.format("jdbc")
jdbcOptions.option("driver", JDBC_DRIVER)
jdbcOptions.option("url", DB_URL)
jdbcOptions.option("user", "presto326")
//jdbcOptions.option("dbtable", "default.syslog") // use either dbtable or query, not both

//load the data into a DataFrame using the jdbc options

jdbcOptions.option("query", "SELECT * FROM default.syslog limit 15") // the query is pushed down to Presto
val df = jdbcOptions.load() // now sent to Presto
df.show()
df.createOrReplaceTempView("mysyslog") // temp view local to Spark; registerTempTable is deprecated since Spark 2.0
df.printSchema()
//sqlContext.sql("select * from mysyslog limit 5").show() // pre-2.0 equivalent, local query in Spark
spark.sql("select * from mysyslog limit 5").show() // local query, runs in Spark
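
For reference, the same read can also be written as one chained expression, which is the more idiomatic DataFrameReader style. A minimal sketch, reusing the placeholder host, port, and user from above (trailing dots keep it line-by-line friendly in spark-shell):

//same read as above, as a single chained expression
val syslogDF = spark.read.format("jdbc").
  option("driver", JDBC_DRIVER).
  option("url", DB_URL).
  option("user", "presto326").
  option("query", "SELECT * FROM default.syslog limit 15"). //pushed down to Presto
  load()
syslogDF.show()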

Notes:

1- SparkContext
- The main entry point for Spark functionality.
- A SparkContext represents the connection to a Spark cluster and can be used to create RDDs, accumulators, and broadcast variables on that cluster.
- Only one SparkContext may be active per JVM.
2- SparkSession
- The entry point for programming Spark with the Dataset and DataFrame API; a SparkSession wraps a SparkContext (see the snippet below).
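
To make the relationship between the two concrete, a small illustration (in spark-shell both objects are pre-created, so this is just a check, not a required step):

val ctx = spark.sparkContext //a SparkSession wraps the single active SparkContext
println(ctx eq sc) //true in spark-shell: same object as the pre-created sc
println(ctx.appName)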
