Please see my other blog for Oracle EBusiness Suite Posts - EBMentors

Search This Blog

Note: All the posts are based on practical approach avoiding lengthy theory. All have been tested on some development servers. Please don’t test any post on production servers until you are sure.

Friday, June 23, 2017

Installing and Configuring PrestoDB

Introduction

Presto (invented at Facebook) is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. It allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. Unlike Hive, Presto doesn’t use the map reduce framework for its execution. Instead, Presto directly accesses the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. A single Presto query can combine data (through pluggable connectors) from multiple sources, allowing for analytics across your entire organization. It is targeted at analysts who expect response times ranging from sub-second to minutes.


Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each per day.


Presto currently has limited fault tolerance capabilities when querying. If a process fails while processing, the whole query must be re-run. On the other hand, it executes queries 10-30x faster than Hive. Thus, even if there is a process failure and a query must be restarted, the total runtime will often still beat Hive’s significantly.


Another caveat is that Presto has an in-memory only architecture. So if there is a particularly large data set which exceeds the total memory capacity available to Presto, query execution will fail.



Architecture



Presto's full installation includes a coordinator and multiple workers. Queries are submitted from a client such as the Presto CLI to the coordinator. The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers. It requires Linux, java and Python and a data directory for storing logs.











[root@en01 ~]# useradd presto

[root@en01 ~]# passwd presto



Installing


1- Download the Presto server tarball from below location and unzip it.

https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.179/presto-server-0.179.tar.gz


[root@en01 ~]# mkdir /data/presto_data

[root@en01 hadoopsw]# tar -xzf presto-server-0.179.tar.gz

[root@en01 hadoopsw]# chown presto /data/presto_data

[root@en01 hadoopsw]# chown presto presto-server-0.179
[root@en01 hadoopsw]# chmod -R 755 /data/presto_data
[root@en01 hadoopsw]# chmod -R 755 /usr/hadoopsw/presto-server-0.179

Configuring Presto

Create an etc directory inside the installation directory. This will hold the following configuration files:
  • node.properties -  environmental configuration specific to each node
  • jvm.config -  command line options for the Java Virtual Machine
  • config.properties -  configuration for the Presto server
  • log.properties - optional log levels file
  • Catalog Properties eg; catalog/jmx.properties -  configuration for Connectors (data sources)
[root@en01 hadoopsw]# su - presto
[presto@en01 ~]$ export PRESTO_HOME=/usr/hadoopsw/presto-server-0.179
[presto@en01 ~]$ export PATH=$PATH:$PRESTO_HOME/bin
[presto@en01 ~]$ mkdir -p /usr/hadoopsw/presto-server-0.179/etc/catalog

vi .bash_profile
[presto@en01 ~]$ cat .bash_profile
# .bash_profile
PATH=$PATH:$HOME/.local/bin:$HOME/bin
export PATH
export PRESTO_HOME=/usr/hadoopsw/presto-server-0.179
export PATH=$PATH:$PRESTO_HOME/bin

source .bash_profile

vi $PRESTO_HOME/etc/node.properties
[presto@en01 ~]$ cat $PRESTO_HOME/etc/node.properties
node.environment=dev
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff #unique for every node
node.data-dir=/data/presto_data

All Presto nodes in a cluster must have the same environment name.

vi $PRESTO_HOME/etc/jvm.config
[presto@en01 ~]$ cat $PRESTO_HOME/etc/jvm.config
-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

vi $PRESTO_HOME/etc/config.properties
[presto@en01 ~]$ cat $PRESTO_HOME/etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://en01:8080

Every Presto server can function as both a coordinator and a worker, but dedicating a single machine to only perform coordination work provides the best performance on larger clusters. above configuration will set up a single machine for testing that will function as both a coordinator and worker. You could use the below minimal configuration for separate coordinator and worker.


For coordinator


coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://my.loaldomain.com:8080

For Worker

coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery.uri=http://my.loaldomain.com:8080

config.properties explanation


coordinator: Allow this Presto instance to function as a coordinator (accept queries from clients and manage query execution).

node-scheduler.include-coordinator: Allow scheduling work on the coordinator. For larger clusters, processing work on the coordinator can impact query performance because the machine’s resources are not available for the critical task of scheduling, managing and monitoring query execution.

http-server.http.port: Specifies the port for the HTTP server. Presto uses HTTP for all communication, internal and external.

query.max-memory: The maximum amount of distributed memory that a query may use.

query.max-memory-per-node: The maximum amount of memory that a query may use on any one machine.

discovery-server.enabled: Presto uses the Discovery service to find all the nodes in the cluster. Every Presto instance will register itself with the Discovery service on startup. In order to simplify deployment and avoid running an additional service, the Presto coordinator can run an embedded version of the Discovery service. It shares the HTTP server with Presto and thus uses the same port.

discovery.uri: The URI to the Discovery server. Because we have enabled the embedded version of Discovery in the Presto coordinator, this should be the URI of the Presto coordinator. Replace my.localdomain:8080 to match the host and port of the Presto coordinator. This URI must not end in a slash.


Catalog Properties

Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside of the catalog. For example, the Hive connector maps each Hive database to a schema, so if the Hive connector is mounted as the hive catalog, and Hive contains a table emp in database scott, that table would be accessed in Presto as hive.scott.emp.

Catalogs are registered by creating a catalog properties file in the etc/catalog directory. For example, create etc/catalog/hive.properties with the following contents to mount the hive-hadoop2 connector as the hive catalog.



[presto@en01 ~]$ vi $PRESTO_HOME/etc/catalog/hive.properties


connector.name=hive-hadoop2
hive.metastore.uri=thrift://en01:9083


[presto@en01 ~]$ vi $PRESTO_HOME/etc/catalog/jmx.properties

[presto@en01 ~]$ cat $PRESTO_HOME/etc/catalog/jmx.properties

connector.name=jmx

Running Presto

You can start presto daemon in background with below command.

[presto@en01 ~]$ launcher start
Started as 4253

It can be run in the foreground, with the logs and other output being written to stdout/stderr

[presto@en01 bin]$ launcher run
Unrecognized VM option 'ExitOnOutOfMemoryError'
Did you mean 'OnOutOfMemoryError=<value>'?
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

If you don't have required JDK installed you will get above error. Check java version

[presto@en01 bin]$ java -version
openjdk version "1.8.0_65"
OpenJDK Runtime Environment (build 1.8.0_65-b17)
OpenJDK 64-Bit Server VM (build 25.65-b01, mixed mode)

Get the latest java version (Presto requires Java 8u92+), install it and remove OpenJDK (in my case)

[root@en01 ~]# yum -y remove java*
After installing required java , check version again
[root@en01 ~]# java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Run presto again

[presto@en01 bin]$ launcher run
WARNING: Current OS file descriptor limit is 4096. Presto recommends at least 8192
2017-06-15T16:26:53.448+0300    INFO    main    io.airlift.log.Logging  Logging to stderr
2017-06-15T16:26:53.451+0300    INFO    main    Bootstrap       Loading configuration
2017-06-15T16:26:53.575+0300    INFO    main    Bootstrap       Initializing logging
2017-06-15T16:26:54.468+0300    INFO    main    Bootstrap       PROPERTY                                                              DEFAULT                           RUNTIME

....
....
2017-06-15T16:27:01.032+0300    INFO    main    com.facebook.presto.server.PrestoServer ======== SERVER STARTED ========

Presto provides a web interface for monitoring and managing queries. You can verify the presto cluster from browser.


http://prestoserver:8080



Installing CLI

The Presto CLI provides a terminal-based interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable.

Downloand Presto CLI from below location
https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.149/presto-cli-0.149-executable.jar

copy jar in /usr/hadoopsw/presto-server-0.179/bin and rename it and make it executable.

[presto@en01 bin]$ mv presto-cli-0.149-executable.jar prestocli

[root@en01 bin]# chmod +x prestocli

Run CLI

[presto@en01 bin]$ prestocli --server localhost:8080 --catalog jmx --schema default

presto:default>

[root@en01 bin]# jps
19848 prestocli
18478 PrestoServer
11903 QuorumPeerMain
19935 Jps

presto:default> show schemas from jmx;
       Schema
--------------------
 current
 history
 information_schema
(3 rows)

Query 20170615_150002_00004_m3cau, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:00 [3 rows, 47B] [21 rows/s, 330B/s]

No comments: