Installing and Configuring PrestoDB

Introduction

Presto (invented at Facebook) is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. It allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. Unlike Hive, Presto doesn’t use the map reduce framework for its execution. Instead, Presto directly accesses the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. A single Presto query can combine data (through pluggable connectors) from multiple sources, allowing for analytics across your entire organization. It is targeted at analysts who expect response times ranging from sub-second to minutes.

Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each per day.

Presto currently has limited fault tolerance capabilities when querying. If a process fails while processing, the whole query must be re-run. On the other hand, it executes queries 10-30x faster than Hive. Thus, even if there is a process failure and a query must be restarted, the total runtime will often still beat Hive’s significantly.

Another caveat is that Presto has an in-memory only architecture. So if there is a particularly large data set which exceeds the total memory capacity available to Presto, query execution will fail.

Architecture

Presto's full installation includes a coordinator and multiple workers. Queries are submitted from a client such as the Presto CLI to the coordinator. The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers. It requires Linux, java and Python and a data directory for storing logs.

[root@en01 ~]# useradd presto

[root@en01 ~]# passwd presto

Installing

1- Download the Presto server tarball from below location and unzip it.

https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.179/presto-server-0.179.tar.gz

[root@en01 ~]# mkdir /data/presto_data

[root@en01 hadoopsw]# tar -xzf presto-server-0.179.tar.gz

[root@en01 hadoopsw]# chown presto /data/presto_data

[root@en01 hadoopsw]# chown presto presto-server-0.179

[root@en01 hadoopsw]# chmod -R 755 /data/presto_data
[root@en01 hadoopsw]# chmod -R 755 /usr/hadoopsw/presto-server-0.179

Configuring Presto

Create an etc directory inside the installation directory. This will hold the following configuration files:

node.properties - environmental configuration specific to each node
jvm.config - command line options for the Java Virtual Machine
config.properties - configuration for the Presto server
log.properties - optional log levels file
Catalog Properties eg; catalog/jmx.properties - configuration for Connectors (data sources)

[root@en01 hadoopsw]# su - presto

[presto@en01 ~]$ export PRESTO_HOME=/usr/hadoopsw/presto-server-0.179

[presto@en01 ~]$ export PATH=$PATH:$PRESTO_HOME/bin

[presto@en01 ~]$ mkdir -p /usr/hadoopsw/presto-server-0.179/etc/catalog

vi .bash_profile

[presto@en01 ~]$ cat .bash_profile

# .bash_profile

PATH=$PATH:$HOME/.local/bin:$HOME/bin

export PATH

export PRESTO_HOME=/usr/hadoopsw/presto-server-0.179

export PATH=$PATH:$PRESTO_HOME/bin

source .bash_profile

vi $PRESTO_HOME/etc/node.properties

[presto@en01 ~]$ cat $PRESTO_HOME/etc/node.properties

node.environment=dev

node.id=ffffffff-ffff-ffff-ffff-ffffffffffff #unique for every node

node.data-dir=/data/presto_data

All Presto nodes in a cluster must have the same environment name.

vi $PRESTO_HOME/etc/jvm.config

[presto@en01 ~]$ cat $PRESTO_HOME/etc/jvm.config

-server

-Xmx4G

-XX:+UseG1GC

-XX:G1HeapRegionSize=32M

-XX:+UseGCOverheadLimit

-XX:+ExplicitGCInvokesConcurrent

-XX:+HeapDumpOnOutOfMemoryError

-XX:+ExitOnOutOfMemoryError

vi $PRESTO_HOME/etc/config.properties

[presto@en01 ~]$ cat $PRESTO_HOME/etc/config.properties

coordinator=true

node-scheduler.include-coordinator=true

http-server.http.port=8080

query.max-memory=5GB

query.max-memory-per-node=1GB

discovery-server.enabled=true

discovery.uri=http://en01:8080

Every Presto server can function as both a coordinator and a worker, but dedicating a single machine to only perform coordination work provides the best performance on larger clusters. above configuration will set up a single machine for testing that will function as both a coordinator and worker. You could use the below minimal configuration for separate coordinator and worker.

For coordinator

coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://my.loaldomain.com:8080

For Worker

coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
discovery.uri=http://my.loaldomain.com:8080

config.properties explanation

coordinator: Allow this Presto instance to function as a coordinator (accept queries from clients and manage query execution).

node-scheduler.include-coordinator: Allow scheduling work on the coordinator. For larger clusters, processing work on the coordinator can impact query performance because the machine’s resources are not available for the critical task of scheduling, managing and monitoring query execution.

http-server.http.port: Specifies the port for the HTTP server. Presto uses HTTP for all communication, internal and external.

query.max-memory: The maximum amount of distributed memory that a query may use.

query.max-memory-per-node: The maximum amount of memory that a query may use on any one machine.

discovery-server.enabled: Presto uses the Discovery service to find all the nodes in the cluster. Every Presto instance will register itself with the Discovery service on startup. In order to simplify deployment and avoid running an additional service, the Presto coordinator can run an embedded version of the Discovery service. It shares the HTTP server with Presto and thus uses the same port.

discovery.uri: The URI to the Discovery server. Because we have enabled the embedded version of Discovery in the Presto coordinator, this should be the URI of the Presto coordinator. Replace my.localdomain:8080 to match the host and port of the Presto coordinator. This URI must not end in a slash.

Catalog Properties

Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside of the catalog. For example, the Hive connector maps each Hive database to a schema, so if the Hive connector is mounted as the hive catalog, and Hive contains a table emp in database scott, that table would be accessed in Presto as hive.scott.emp.

Catalogs are registered by creating a catalog properties file in the etc/catalog directory. For example, create etc/catalog/hive.properties with the following contents to mount the hive-hadoop2 connector as the hive catalog.

[presto@en01 ~]$ vi $PRESTO_HOME/etc/catalog/hive.properties

connector.name=hive-hadoop2
hive.metastore.uri=thrift://en01:9083

[presto@en01 ~]$ vi $PRESTO_HOME/etc/catalog/jmx.properties

[presto@en01 ~]$ cat $PRESTO_HOME/etc/catalog/jmx.properties

connector.name=jmx

Running Presto

You can start presto daemon in background with below command.

[presto@en01 ~]$ launcher start

Started as 4253

It can be run in the foreground, with the logs and other output being written to stdout/stderr

[presto@en01 bin]$ launcher run

Unrecognized VM option 'ExitOnOutOfMemoryError'

Did you mean 'OnOutOfMemoryError=<value>'?

Error: Could not create the Java Virtual Machine.

Error: A fatal exception has occurred. Program will exit.

If you don't have required JDK installed you will get above error. Check java version

[presto@en01 bin]$ java -version

openjdk version "1.8.0_65"

OpenJDK Runtime Environment (build 1.8.0_65-b17)

OpenJDK 64-Bit Server VM (build 25.65-b01, mixed mode)

Get the latest java version (Presto requires Java 8u92+), install it and remove OpenJDK (in my case)

[root@en01 ~]# yum -y remove java*

After installing required java , check version again

[root@en01 ~]# java -version

java version "1.8.0_121"

Java(TM) SE Runtime Environment (build 1.8.0_121-b13)

Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Run presto again

[presto@en01 bin]$ launcher run

WARNING: Current OS file descriptor limit is 4096. Presto recommends at least 8192

2017-06-15T16:26:53.448+0300 INFO main io.airlift.log.Logging Logging to stderr

2017-06-15T16:26:53.451+0300 INFO main Bootstrap Loading configuration

2017-06-15T16:26:53.575+0300 INFO main Bootstrap Initializing logging

2017-06-15T16:26:54.468+0300 INFO main Bootstrap PROPERTY DEFAULT RUNTIME

....

2017-06-15T16:27:01.032+0300 INFO main com.facebook.presto.server.PrestoServer ======== SERVER STARTED ========

Presto provides a web interface for monitoring and managing queries. You can verify the presto cluster from browser.

http://prestoserver:8080

Installing CLI

The Presto CLI provides a terminal-based interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable.

Downloand Presto CLI from below location
https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.149/presto-cli-0.149-executable.jar

copy jar in /usr/hadoopsw/presto-server-0.179/bin and rename it and make it executable.

[presto@en01 bin]$ mv presto-cli-0.149-executable.jar prestocli

[root@en01 bin]# chmod +x prestocli

Run CLI

[presto@en01 bin]$ prestocli --server localhost:8080 --catalog jmx --schema default

presto:default>

[root@en01 bin]# jps

19848 prestocli

18478 PrestoServer

11903 QuorumPeerMain

19935 Jps

presto:default> show schemas from jmx;

Schema

--------------------

current

history

information_schema

(3 rows)

Query 20170615_150002_00004_m3cau, FINISHED, 1 node

Splits: 18 total, 18 done (100.00%)

0:00 [3 rows, 47B] [21 rows/s, 330B/s]

DBMentors - Inam Bukhari's Blog

Pages

Please see my other blog for Oracle EBusiness Suite Posts - EBMentors

Search This Blog

Friday, June 23, 2017