Introduction
Presto (invented at Facebook) is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. It allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. Unlike Hive, Presto doesn’t use the map reduce framework for its execution. Instead, Presto directly accesses the data through a specialized distributed query engine that is very similar to those found in commercial parallel RDBMSs. A single Presto query can combine data (through pluggable connectors) from multiple sources, allowing for analytics across your entire organization. It is targeted at analysts who expect response times ranging from sub-second to minutes.
Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each per day.
Presto currently has limited fault tolerance capabilities when querying. If a process fails while processing, the whole query must be re-run. On the other hand, it executes queries 10-30x faster than Hive. Thus, even if there is a process failure and a query must be restarted, the total runtime will often still beat Hive’s significantly.
Architecture
Presto's full installation includes a coordinator and multiple workers. Queries are submitted from a client such as the Presto CLI to the coordinator. The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers. It requires Linux, java and Python and a data directory for storing logs.
[root@en01 ~]# useradd presto
[root@en01 ~]# passwd presto
Installing
1- Download the Presto server tarball from below location and unzip it.
https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.179/presto-server-0.179.tar.gz
[root@en01 ~]# mkdir /data/presto_data
[root@en01 hadoopsw]# tar -xzf presto-server-0.179.tar.gz
[root@en01 hadoopsw]# chown presto /data/presto_data
[root@en01 hadoopsw]# chown presto presto-server-0.179
[root@en01 hadoopsw]# chmod -R 755 /usr/hadoopsw/presto-server-0.179
Configuring Presto
Create an etc directory inside the installation directory. This will hold the following configuration files:- node.properties - environmental configuration specific to each node
- jvm.config - command line options for the Java Virtual Machine
- config.properties - configuration for the Presto server
- log.properties - optional log levels file
- Catalog Properties eg; catalog/jmx.properties - configuration for Connectors (data sources)
[root@en01 hadoopsw]# su - presto
[presto@en01 ~]$ export PRESTO_HOME=/usr/hadoopsw/presto-server-0.179
[presto@en01 ~]$ export PATH=$PATH:$PRESTO_HOME/bin
[presto@en01 ~]$ mkdir -p /usr/hadoopsw/presto-server-0.179/etc/catalog
vi .bash_profile
[presto@en01 ~]$ cat .bash_profile
# .bash_profile
PATH=$PATH:$HOME/.local/bin:$HOME/bin
export PATH
export PRESTO_HOME=/usr/hadoopsw/presto-server-0.179
export PATH=$PATH:$PRESTO_HOME/bin
source .bash_profile
vi $PRESTO_HOME/etc/node.properties
[presto@en01 ~]$ cat $PRESTO_HOME/etc/node.properties
node.environment=dev
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff #unique for every node
node.data-dir=/data/presto_data
vi $PRESTO_HOME/etc/jvm.config
[presto@en01 ~]$ cat $PRESTO_HOME/etc/jvm.config
-server
-Xmx4G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
vi $PRESTO_HOME/etc/config.properties
[presto@en01 ~]$ cat $PRESTO_HOME/etc/config.properties
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=1GB
discovery-server.enabled=true
discovery.uri=http://en01:8080
config.properties explanation
coordinator: Allow this Presto instance to function as a coordinator (accept queries from clients and manage query execution).
http-server.http.port: Specifies the port for the HTTP server. Presto uses HTTP for all communication, internal and external.
query.max-memory: The maximum amount of distributed memory that a query may use.
query.max-memory-per-node: The maximum amount of memory that a query may use on any one machine.
discovery-server.enabled: Presto uses the Discovery service to find all the nodes in the cluster. Every Presto instance will register itself with the Discovery service on startup. In order to simplify deployment and avoid running an additional service, the Presto coordinator can run an embedded version of the Discovery service. It shares the HTTP server with Presto and thus uses the same port.
discovery.uri: The URI to the Discovery server. Because we have enabled the embedded version of Discovery in the Presto coordinator, this should be the URI of the Presto coordinator. Replace my.localdomain:8080 to match the host and port of the Presto coordinator. This URI must not end in a slash.
Catalog Properties
Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside of the catalog. For example, the Hive connector maps each Hive database to a schema, so if the Hive connector is mounted as the hive catalog, and Hive contains a table emp in database scott, that table would be accessed in Presto as hive.scott.emp.
[presto@en01 ~]$ vi $PRESTO_HOME/etc/catalog/hive.properties
[presto@en01 ~]$ vi $PRESTO_HOME/etc/catalog/jmx.properties
[presto@en01 ~]$ cat $PRESTO_HOME/etc/catalog/jmx.properties
connector.name=jmx
Running Presto
You can start presto daemon in background with below command.
[presto@en01 ~]$ launcher start
Started as 4253
[presto@en01 bin]$ launcher run
Unrecognized VM option 'ExitOnOutOfMemoryError'
Did you mean 'OnOutOfMemoryError=<value>'?
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
If you don't have required JDK installed you will get above error. Check java version
[presto@en01 bin]$ java -version
openjdk version "1.8.0_65"
OpenJDK Runtime Environment (build 1.8.0_65-b17)
OpenJDK 64-Bit Server VM (build 25.65-b01, mixed mode)
Get the latest java version (Presto requires Java 8u92+), install it and remove OpenJDK (in my case)
[root@en01 ~]# yum -y remove java*
After installing required java , check version again
[root@en01 ~]# java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
Run presto again
[presto@en01 bin]$ launcher run
WARNING: Current OS file descriptor limit is 4096. Presto recommends at least 8192
2017-06-15T16:26:53.448+0300 INFO main io.airlift.log.Logging Logging to stderr
2017-06-15T16:26:53.451+0300 INFO main Bootstrap Loading configuration
2017-06-15T16:26:53.575+0300 INFO main Bootstrap Initializing logging
2017-06-15T16:26:54.468+0300 INFO main Bootstrap PROPERTY DEFAULT RUNTIME
....
....
2017-06-15T16:27:01.032+0300 INFO main com.facebook.presto.server.PrestoServer ======== SERVER STARTED ========
Presto provides a web interface for monitoring and managing queries. You can verify the presto cluster from browser.
http://prestoserver:8080
Installing CLI
The Presto CLI provides a terminal-based interactive shell for running queries. The CLI is a self-executing JAR file, which means it acts like a normal UNIX executable.Downloand Presto CLI from below location
https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.149/presto-cli-0.149-executable.jar
copy jar in /usr/hadoopsw/presto-server-0.179/bin and rename it and make it executable.
[presto@en01 bin]$ mv presto-cli-0.149-executable.jar prestocli
[root@en01 bin]# chmod +x prestocli
Run CLI
[presto@en01 bin]$ prestocli --server localhost:8080 --catalog jmx --schema default
presto:default>
[root@en01 bin]# jps
19848 prestocli
18478 PrestoServer
11903 QuorumPeerMain
19935 Jps
presto:default> show schemas from jmx;
Schema
--------------------
current
history
information_schema
(3 rows)
Query 20170615_150002_00004_m3cau, FINISHED, 1 node
Splits: 18 total, 18 done (100.00%)
0:00 [3 rows, 47B] [21 rows/s, 330B/s]
No comments:
Post a Comment