Presto is an interesting technology. It isn’t really a database – it’s more of a query engine.
Presto connects to multiple data sources and allows you to query them at the same time. This is a concept known as “federated queries”, and it makes transferring data from, say, a production PostgreSQL database to a Hive data warehouse very easy! Presto also has tight integration with Hive, which makes it a great alternative to traditional data warehouse products like Vertica or Redshift.
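For example, once you have both a PostgreSQL catalog and a Hive catalog configured (this tutorial only sets up the Hive side), a single query can copy rows from one system into the other. The catalog, schema, and table names below are hypothetical:

INSERT INTO hive.analytics.users
SELECT id, name
FROM postgresql.public.users;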
Best of all, it is open source, and free!
In this tutorial, we will use AWS services to create a single node Presto cluster, and connect it to a managed Hive data warehouse service from AWS called AWS Glue.
We will be creating a Presto node from scratch, using an EC2 instance running Ubuntu. There are easier ways to do this, but due to fragmentation in the Presto ecosystem, there isn’t a single official source of AMIs or Docker images. This makes it hard to recommend anything other than building it yourself.
Anyway, let’s sign into the EC2 console and launch a new instance. Once it’s running, restrict the permissions on the key pair file you downloaded so SSH will accept it:
chmod 400 test_pair_3.pem
Then, get the public IP address of the instance from the EC2 dashboard, and run the following command to SSH into the machine.
ssh -i <your-ssh-key> ubuntu@<ec2-instance-ip-address>
Presto needs Python and Java to function. We can install both with Ubuntu’s built-in package manager, apt.
sudo apt update
sudo apt -y install python
sudo apt -y install default-jre
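A quick way to confirm both installed correctly is to print their versions:

python --version
java -version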
Now it’s time to download and install Presto onto the server we just started. Start by getting the download link from the Presto download page.
curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.244.1/presto-server-0.244.1.tar.gz
Then, extract the archive you just downloaded with:
tar -xvzf presto-server-0.244.1.tar.gz
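Presto reads its configuration from an etc directory inside the installation folder, so create that now (this assumes the archive was downloaded and extracted in your home directory):

cd presto-server-0.244.1
mkdir etc
cd etc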
There are three files we need to create inside etc to get Presto working in general: config.properties, jvm.config, and node.properties.
Also, we need to create a fourth file, hive.properties, inside etc/catalog, to allow Presto to connect to Hive, which is the data source we will be working with. Remember, Presto isn’t a database itself!
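Start with config.properties. From inside the etc directory, open it with Vim (any editor works):

vim config.properties

Press i to enter insert mode and paste in the following: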
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=5GB
query.max-total-memory-per-node=5GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
Press Esc, then type :wq to write the file and quit Vim.
Go through the same process to create a file called jvm.config with the following contents:
-server
-Djdk.attach.allowAttachSelf=true
-Xmx14G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
Once again, go through the same process for the node.properties file:
node.environment=testing
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
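Presto will write its runtime data and logs under this path. Starting the launcher with sudo should create it for us later, but it doesn’t hurt to create it ahead of time:

sudo mkdir -p /var/presto/data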
Now, still inside the etc directory, create a directory called catalog and go inside of it:

mkdir catalog
cd catalog
And once again use Vim to insert the following contents into hive.properties:
connector.name=hive-hadoop2
hive.metastore=glue
hive.allow-drop-table=true
hive.allow-rename-table=true
hive.s3-file-system-type=presto
hive.s3.max-client-retries=50
hive.s3.max-error-retries=50
hive.s3.max-connections=1000
hive.s3.connect-timeout=5m
hive.s3.socket-timeout=5m
Note the line hive.metastore=glue. That line makes Presto use AWS’s Glue service as its Hive metastore.
Normally we would have to spin up a Hadoop and Hive cluster to manage the data Presto is querying, but instead we will let AWS do all that for us in an S3 bucket. If you want to learn more about what’s going on here, check out the Presto Hive Connector documentation.
Presto is now all configured. Let’s use screen to open a few tabs so we can run Presto, view its logs, and access it with the CLI. If any of these directions are confusing to you, please reference the follow-along video at the top of the page.
# Start a screen
screen
# Start the Presto program
cd ../../bin
sudo ./launcher start
# Open logs to ensure Presto is running
# Create a new tab
Ctrl-a c
cd /var/presto/data/var/log
tail -n 1000 -f server.log
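Once you can see the server log ticking along, open one more tab for the CLI and head back to the home directory:

# Create another tab
Ctrl-a c
cd ~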
Let’s install the Presto CLI to run SQL commands! We can install it anywhere, so let’s just put it in our home directory.
# Download the CLI:
curl https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.244.1/presto-cli-0.244.1-executable.jar -o presto
# Give ourselves the ability to execute the file:
chmod +x presto
The CLI is just a binary that we can run. Do that with ./presto --server localhost:8080
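Once the presto> prompt appears, a quick sanity check is to list the catalogs that were loaded. You should see hive (from the hive.properties file we wrote) alongside the built-in system catalog:

SHOW CATALOGS;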
OK, we now have the Presto server running, as well as the CLI. The last thing we need to do before we start running commands is give the Hive data warehouse a place to store data, namely, an S3 bucket!
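If you don’t have a bucket yet, one way to create it is with the AWS CLI (this assumes the CLI is installed and configured with credentials; replace YOUR_BUCKET_NAME with a globally unique name):

aws s3 mb s3://YOUR_BUCKET_NAME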
Create a schema that points at your bucket:
CREATE SCHEMA hive.my_schema WITH (location = 's3://YOUR_BUCKET_NAME/');
Then create a table in that schema:
CREATE TABLE hive.my_schema.my_table (
id int,
name varchar
);
INSERT INTO hive.my_schema.my_table VALUES (1, 'cal');
SELECT * FROM hive.my_schema.my_table;
If everything went as planned, Presto, through Hive (which is managed by AWS Glue), will have dynamically created the table and written the inserted data into your S3 bucket in the ORC format. ORC is a column-oriented storage format, and it is part of what makes Presto so fast!
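You can verify this by listing the bucket’s contents (again assuming the AWS CLI is available):

aws s3 ls s3://YOUR_BUCKET_NAME/ --recursive

You should see a my_table prefix containing one or more ORC data files.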
From here, you can start to experiment with Presto’s other connectors to load real, useful data into an S3 bucket and connect a BI application to run big data analytics.
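For instance, hooking up a PostgreSQL database is just one more file in etc/catalog. Here is a rough sketch of a hypothetical postgresql.properties, with the host, database name, and credentials as placeholders you would fill in yourself:

connector.name=postgresql
connection-url=jdbc:postgresql://YOUR_HOST:5432/YOUR_DATABASE
connection-user=YOUR_USER
connection-password=YOUR_PASSWORD

After adding a new catalog file, restart Presto with sudo ./launcher restart so it gets picked up.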
I hope this was useful to you. Reach out if you find any errors!