Presto is an interesting technology. It isn’t really a database – it’s more of a query engine.
Presto connects to multiple data sources and allows you to query them at the same time. This is a concept known as “federated queries”, and it makes transferring data from, say, a production PostgreSQL database to a Hive data warehouse very easy! Presto also has tight integration with Hive, which makes it a great alternative to traditional data warehouse products like Vertica or Redshift.
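For example, once you have both a PostgreSQL catalog and a Hive catalog configured (this tutorial only sets up the Hive side), a single query can copy rows from one system into the other. The catalog, schema, and table names below are hypothetical:

INSERT INTO hive.analytics.users
SELECT id, name
FROM postgresql.public.users;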
Best of all, it is open source, and free!
In this tutorial, we will use AWS services to create a single node Presto cluster, and connect it to a managed Hive data warehouse service from AWS called AWS Glue.
We will be creating a Presto node from scratch, using an EC2 instance running Ubuntu. There are easier ways to do this, but due to fragmentation in the Presto ecosystem, there isn’t a single official source of AMIs or Docker images. This makes it hard to recommend anything other than building it yourself.
Anyway, let’s sign into the EC2 console and launch a new instance. Once it’s running, restrict the permissions on the key pair file you downloaded so SSH will accept it:
chmod 400 test_pair_3.pem
Then, get the public IP address of the instance from the EC2 dashboard, and run the following command to SSH into the machine.
ssh -i <your-ssh-key> ubuntu@<ec2-instance-ip-address>
Presto needs Python and Java to function. We can install both with Ubuntu’s built-in package manager, apt.
sudo apt update
sudo apt -y install python
sudo apt -y install default-jre
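A quick way to confirm both installed correctly is to print their versions:

python --version
java -version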
Now it’s time to download and install Presto onto the server we just started. Start by getting the download link from the Presto download page.
curl -O https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.244.1/presto-server-0.244.1.tar.gz
Then, extract the archive you just downloaded with:
tar -xvzf presto-server-0.244.1.tar.gz
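Presto reads its configuration from an etc directory inside the installation folder, so create that now (this assumes the archive was downloaded and extracted in your home directory):

cd presto-server-0.244.1
mkdir etc
cd etc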
There are three files we need to create inside etc to get Presto working in general: config.properties, jvm.config, and node.properties.
Also, we need to create a fourth file, hive.properties, inside etc/catalog, to allow Presto to connect to Hive, which is the data source we will be working with. Remember, Presto isn’t a database itself!
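Start with config.properties. From inside the etc directory, open it with Vim (any editor works):

vim config.properties

Press i to enter insert mode and paste in the following: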
coordinator=true
node-scheduler.include-coordinator=true
http-server.http.port=8080
query.max-memory=5GB
query.max-memory-per-node=5GB
query.max-total-memory-per-node=5GB
discovery-server.enabled=true
discovery.uri=http://localhost:8080
Press Esc, then type :wq to write the file and quit Vim.
Go through the same process to create a file called jvm.config with the following contents:
-server
-Djdk.attach.allowAttachSelf=true
-Xmx14G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError
Once again, go through the same process for the node.properties file:
node.environment=testing
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
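Presto will write its runtime data and logs under this path. Starting the launcher with sudo should create it for us later, but it doesn’t hurt to create it ahead of time:

sudo mkdir -p /var/presto/data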
Now, still inside the etc directory, create a directory called catalog and go inside of it:

mkdir catalog
cd catalog
And once again use Vim to insert the following contents into hive.properties:
connector.name=hive-hadoop2
hive.metastore=glue
hive.allow-drop-table=true
hive.allow-rename-table=true
hive.s3-file-system-type=presto
hive.s3.max-client-retries=50
hive.s3.max-error-retries=50
hive.s3.max-connections=1000
hive.s3.connect-timeout=5m
hive.s3.socket-timeout=5m
Note the line hive.metastore=glue. That line makes Presto use AWS’s Glue service as its Hive metastore.
Normally we would have to spin up a Hadoop and Hive cluster to manage the data Presto is querying, but instead we will let AWS do all that for us in an S3 bucket. If you want to learn more about what’s going on here, check out the Presto Hive Connector documentation.
Presto is now all configured. Let’s use screen to open a few tabs so we can run Presto, view its logs, and access it with the CLI. If any of these directions are confusing to you, please reference the follow-along video at the top of the page.
# Start a screen
screen
# Start the Presto program
cd ../../bin
sudo ./launcher start
# Open logs to ensure Presto is running
# Create a new tab
Ctrl-a c
cd /var/presto/data/var/log
tail -n 1000 -f server.log
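Once you can see the server log ticking along, open one more tab for the CLI and head back to the home directory:

# Create another tab
Ctrl-a c
cd ~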
Let’s install the Presto CLI to run SQL commands! We can install it anywhere, so let’s just put it in our home directory.
# Download the CLI:
curl https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.244.1/presto-cli-0.244.1-executable.jar -o presto
# Give ourselves the ability to execute the file:
chmod +x presto
The CLI is just a binary that we can run. Do that with ./presto --server localhost:8080
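Once the presto> prompt appears, a quick sanity check is to list the catalogs that were loaded. You should see hive (from the hive.properties file we wrote) alongside the built-in system catalog:

SHOW CATALOGS;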
OK, we now have the Presto server running, as well as the CLI. The last thing we need to do before we start running commands is give the Hive data warehouse a place to store data, namely, an S3 bucket!
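If you don’t have a bucket yet, one way to create it is with the AWS CLI (this assumes the CLI is installed and configured with credentials; replace YOUR_BUCKET_NAME with a globally unique name):

aws s3 mb s3://YOUR_BUCKET_NAME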
Create a schema that points at your bucket:
CREATE SCHEMA hive.my_schema WITH (location = 's3://YOUR_BUCKET_NAME/');
Then create a table in that schema:
CREATE TABLE hive.my_schema.my_table (
id int,
name varchar
);
INSERT INTO hive.my_schema.my_table VALUES (1, 'cal');
SELECT * FROM hive.my_schema.my_table;
If everything went as planned, Presto, through Hive (which is managed by AWS Glue), will have dynamically created the table and written the inserted data into your S3 bucket in the ORC format. ORC is a column-oriented storage format, and it is part of what makes Presto so fast!
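You can verify this by listing the bucket’s contents (again assuming the AWS CLI is available):

aws s3 ls s3://YOUR_BUCKET_NAME/ --recursive

You should see a my_table prefix containing one or more ORC data files.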
From here, you can start to experiment with Presto’s other connectors to load real, useful data into an S3 bucket and connect a BI application to run big data analytics.
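For instance, hooking up a PostgreSQL database is just one more file in etc/catalog. Here is a rough sketch of a hypothetical postgresql.properties, with the host, database name, and credentials as placeholders you would fill in yourself:

connector.name=postgresql
connection-url=jdbc:postgresql://YOUR_HOST:5432/YOUR_DATABASE
connection-user=YOUR_USER
connection-password=YOUR_PASSWORD

After adding a new catalog file, restart Presto with sudo ./launcher restart so it gets picked up.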
I hope this was useful to you. Reach out if you find any errors!