Tuesday, February 3, 2015

Apache Drill: Writing SQL Queries on NoSQL Databases

Apache Drill is an amazing project by Apache foundation, specifically made to enrich the self-service analytics. It makes querying the semi structured data ridiculously easy for the Business Analysts and Data Scientists. You'd definitely love this if you are working with NoSQL database, Hadoop or scratching your head to write a code for reading a JSON file.


All the documentation is available here, however, before you read it, I'd like to show it's magic by an example of joining a MongoDB collection with a JSON file and later I'll talk about its use cases.

Connecting to Drill: 

Don't bother much about the installation right now, it is all there Drill in 10 mins guide. I have drill 0.7 installed on my machine and after running the below command line utility from the $DRILL_HOME you'll see the drill prompt.
./sqlline -u jdbc:drill:zk=local

At the prompt type SHOW DATABASES; and hit enter. It will list down all the registered databases you have access to:

Screen Shot 2015-02-03 at 12.07.16 PM

Querying using Drill:

In my case, I have my MongoDB instance running and registered in the storage plugin.

Switch to the Mongo DB you want to query by typing USE <database _name>

Screen Shot 2015-02-03 at 12.09.56 PM

I have registered one of the local file system to access JSON files and the beauty is that I can still use it in my query even though the default schema is changed to MongoDB.

In my example, I have tweets sample data (from Twitter Streaming API )in MongoDB and simple language lookup file in JSON. The idea here is to show how Apache Drill supports heterogeneous sources in a single query as well as to find out the number to tweets based on language. Hence, I will get the count of tweets from Mongo and lookup the language name from the terms.json file. For that, our query will be:

SELECT count(*), L.lang_name from tech M, myFiles.`/Users/udaysharma/drill/terms.json`L WHERE M.lang=L.lang GROUP BY L.lang_name;

And the output is :

Screen Shot 2015-02-03 at 4.41.34 PM

Isn't it "Awesome"? The you can now just query any Big Data source by writing simple SQL queries. It does support almost all the ANSI SQL syntax and function. Though you might get some unexpected errors as you play around, I'm sure as the community works on the future releases, it will become more reliable and a key tool for the Data Analysts.

Read the following resources, if interested:
Happy Drilling! ;)

No comments :

Post a Comment