Data exploration tools and tips

by George Y. Kussumoto

Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming.

- Rob Pike

Python is popular among data science folks, and they are well versed in data exploration. Even though I've rarely worked in data science, I believe that understanding the data first, before committing a single line of code, is fundamental for any developer.

It takes time to understand data. There's a variety of tools that can help you advance your exploration; still, it's your job to uncover the story your data is trying to tell you. The following tools have helped me a great deal over the years.

The most popular format is CSV, along with the similar formats used by spreadsheet applications. Despite the lack of any standard and a lot of edge cases, everyone knows how to use a spreadsheet. csvkit is a gem for handling CSV: a set of utilities that is easy to script and use on the command line. csvstat is great for that first run, and it will give you a general understanding of the data: whether a column has null values, its most common, largest, and smallest values, and so on.
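
For instance, a minimal first pass might look like this (users.csv and the email column are hypothetical):

    # full per-column report: inferred type, nulls, unique/min/max/common values
    csvstat users.csv

    # or answer a single question, e.g. how many nulls in one column
    csvstat -c email --nulls users.csv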

The other utilities are more focused on extraction and formatting. If you are familiar with the command line, this is a huge productivity boost over GUI applications, since all the usual CLI tools are available as well (fzf, ag, sort, head, tr, and so on). Also, analyzing large amounts of data is resource-intensive, so it's good to keep lightweight alternatives at hand.
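
Because everything speaks plain text, the csvkit utilities compose with the rest of the shell. A small sketch, again assuming a hypothetical users.csv:

    # pick two columns, keep one year's signups, peek at the first rows
    csvcut -c name,signup_date users.csv \
      | csvgrep -c signup_date -m "2020-" \
      | head -n 10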

In particular, csvsql is a must-know. Here's my advice: once you are past the initial structures and relations in your data, use csvsql to populate an SQLite database. Now you have all the power of SQL at your disposal.
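
A minimal sketch of that move (file and database names are hypothetical):

    # infer a schema, create the tables, and insert the rows
    csvsql --db sqlite:///exploration.db --insert users.csv orders.csv

    # from here on it's plain SQL
    sqlite3 exploration.db "SELECT COUNT(*) FROM users WHERE email IS NULL;"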

I've used this very same approach to migrate and import data from third parties. It never ceases to amaze me how much software out there can only "communicate" using CSV.

Using a database, you can use one of the dbcli tools to stay on the CLI, but also sqlitebrowser or Beekeeper Studio. Not to mention the whole ecosystem of tools that support databases (including NoSQL and all that jazz). You can also combine Python and SQL using something like pugsql. Plain SQL is not so popular in the Python community, mainly, I believe, due to Django's powerful ORM. Still, don't hesitate to write simple SQL and avoid overly complex solutions. Tools such as redash and metabase are great if they are already in place at your company, but overkill to run on your own machine.
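
To stay on the CLI, the dbcli family has a member per engine; litecli is the SQLite one. A small sketch, continuing with the hypothetical exploration.db from above:

    # interactive shell with auto-completion and syntax highlighting
    litecli exploration.db

    # or keep it scriptable with the stock sqlite3 client
    sqlite3 exploration.db "SELECT status, COUNT(*) FROM orders GROUP BY status;"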

Moving forward to more interactive tools, there are Jupyter notebooks, and they are amazing. Even if you are not familiar with Python, I encourage you to give them a try.
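
Getting started is a two-liner (assuming a working Python installation):

    # install and launch; the notebook UI opens in your browser
    pip install notebook
    jupyter notebook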

Notebooks make the most sense when the exploration is more visual, or when it requires complex steps and evaluating multiple ways to figure out the data. They can also make collaboration easier through something like nbviewer. It's also cool, when you are sharing your screen to present something, to be able to open a shell right in the browser.

If setting up a notebook seems like bloat, there's datasette. It's not the same thing at all, but rather a layer on top of your database that provides an API, interactivity, and some visualization capabilities. I don't have experience with this one, but I'm curious to find a use case to experiment with it a little bit.
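
From what the docs show, pointing it at a SQLite file is enough to get a browsable UI and JSON API (again using the hypothetical exploration.db):

    # serve the database over HTTP; defaults to http://localhost:8001
    datasette exploration.db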

Speaking of APIs, JSON responses or files become easily greppable using gron. If you need more than that, jq is probably common knowledge already.
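
A quick sketch with a hypothetical api.json:

    # gron flattens JSON into discrete, greppable assignments
    gron api.json | grep "email"
    # json.users[0].email = "ada@example.com";

    # filter, then turn the matching lines back into JSON
    gron api.json | grep "email" | gron --ungron

    # jq for anything more elaborate
    jq '.users[] | select(.active) | .email' api.json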

If none of these fit, then maybe you'd like to do it manually. I have a few 🥁 handy tips:

  • Shell script: before reaching for a programming language, try something hacky first; you can rewrite it properly later (see the sketch after this list).
  • python -i (and ipython): executes the script and drops you into an interactive shell afterwards. Especially useful if you find yourself using history to load past commands over and over.
  • rich: so your eyes will not bleed while looking at cryptic characters, and rich.inspect() is truly awesome.
  • Iterate & consolidate: it's easy to get lost in your experiments. Get used to tracking your goals, what's in your way, and how to reproduce a result.
  • Pen/paper: it's still used for a reason, so keep it close. Don't use fancy Moleskines, use scratch paper. Fancy stuff makes you wonder whether or not it's worth it, and you'll probably regret whatever decision you make.
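
To make the first three tips concrete, here's how a session might unfold (script, file, and column names are hypothetical):

    # 1. hacky first pass: top signup dates, straight from the shell
    cut -d, -f3 users.csv | sort | uniq -c | sort -rn | head

    # 2. promote it to a script, and stay in the interpreter when it finishes
    python -i explore.py    # or: ipython -i explore.py

    # 3. inside that session, rich keeps the output readable:
    # >>> from rich import inspect
    # >>> inspect(result, methods=True)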

I'm pretty sure I'm missing a lot of other tools, but these are the ones I can personally vouch for. They have helped me many times and in the most diverse situations. Now that you are all geared up, what's next?

I can't possibly answer that. In my experience, we are usually looking for two main things and their variations: search and/or behavior. Examples:

  • Organize data to quickly answer certain queries
  • Find patterns over time / distribution / dependencies
  • Detect mutation / state transitions
  • Commonality / distinction between models
  • Space / time scalability
  • Storage constraints / partitions / consistency

Data is one of the fundamental building blocks of our profession. We are going to deal with it one way or another, so it's wise to get acquainted with tools that improve our productivity and understanding. It's the one thing we need to get right; if we do, everything else will follow.

Keep exploring!

~ 🤘