BRIM

Joining CSV Data Without SQL

An IP Geolocation Use Case
Author Phil Rzewski

Whether for website analytics or determining the origin of a network attack, Internet geolocation is handy technology. In brief, given an IP address associated with some network traffic, can we deduce from where on earth the traffic originated? There’s plenty of software and services out there to help provide this information, but a familiar one is the GeoLite2 data from MaxMind. Not only is it free, it’s provided in the form of downloadable data sets. This allows us to pick it apart in interesting ways.

MaxMind makes their GeoLite2 data available in a couple formats, one being the common (and sometimes loathed) CSV format. They provide helpful docs that describe the schemas and a handy cookbook for defining SQL tables, importing the CSV data into those tables, and then performing a join between two of the data sets. This guide is super helpful, and if followed to the letter, you can experience the joy at the end of entering an IP address and getting back a city name. But before you can get there, as a prerequisite you need to enumerate all the field names and desired data types to load it into a specific structured format. Maybe there’s another way?

To validate the Zed project’s usefulness in real-world use cases, I often apply Zed to data sets and guides like these. In Zed’s docs, we sometimes speak of how it offers a gradual slope. Whereas MaxMind’s guide effectively shows how “Your CSV data must become this SQL to ride this ride!”, Zed allows you to apply a minimal set of concepts from the language, data model, and tooling to solve your problem.

Replicating these operations on the GeoLite2 data with Zed shows a nice example of this gradual slope in action. Read on to climb the slope from a quick one-shot query to a performant join that avoids the hassle of SQL tables.

Examining the Data

Let’s say we’ve downloaded and unpacked the most recent GeoLite2 City: CSV Format ZIP. Much like MaxMind’s guide, we’ll focus on these three CSV files.

$ ls -l *Blocks* *en*
-rw-rw-r--@ 1 phil  staff  242435417 Oct 16 08:36 GeoLite2-City-Blocks-IPv4.csv
-rw-rw-r--@ 1 phil  staff   81213494 Oct 16 08:36 GeoLite2-City-Blocks-IPv6.csv
-rw-rw-r--@ 1 phil  staff   10036753 Oct 16 08:36 GeoLite2-City-Locations-en.csv

To peek at some sample data, we could refer to MaxMind’s description of the schemas or load the CSV files into tools like Excel or Google Sheets. However, Zed provides a handy point of entry thanks to its ability to auto-detect most common input formats. With a quick one-shot invocation of the zq command-line tool, we can use Zed’s head operator to see samples of the data.

$ zq -Z 'head' GeoLite2-City-Blocks-IPv4.csv
{
    network: "1.0.0.0/24",
    geoname_id: 2077456.,
    registered_country_geoname_id: 2077456.,
    represented_country_geoname_id: null,
    is_anonymous_proxy: 0.,
    is_satellite_provider: 0.,
    postal_code: null,
    latitude: -33.494,
    longitude: 143.2104,
    accuracy_radius: 1000.
}

$ zq -Z 'head' GeoLite2-City-Locations-en.csv
{
    geoname_id: 1392.,
    locale_code: "en",
    continent_code: "AS",
    continent_name: "Asia",
    country_iso_code: "IR",
    country_name: "Iran",
    subdivision_1_iso_code: 2.,
    subdivision_1_name: "Māzandarān",
    subdivision_2_iso_code: null,
    subdivision_2_name: null,
    city_name: "Shahr",
    metro_code: null,
    time_zone: "Asia/Tehran",
    is_in_european_union: 0.
}

Compared to the SQL cookbook, notice that we didn’t need to enumerate any column names or specify data types in order to read the CSV files. Much like we see with spreadsheet tools, Zed has smart default behaviors when reading CSV inputs, recognizing which fields can be treated as numbers or strings and where null values appear. For many one-shot queries, this treatment may be all that’s needed. For example, let’s say we wanted to see the cities in Iran.

$ zq -f text 'country_name=="Iran" | cut city_name' GeoLite2-City-Locations-en.csv
Shahr
Jahan
Kaman
...

Richer Data Types Enable Richer Queries

Now that we’re familiar with our data, let’s revisit our original goal of mapping an IP address to its geographical location.

We see the “blocks” data is organized by CIDR block. Since it came from CSV, the network field starts life as a string. The SQL cookbook showed how to turn this string into a column with PostgreSQL’s cidr type and then use their >>= operator to check if an IP is inside such a block.

Keeping with the theme of a gradual slope, we can apply the equivalent Zed concepts only where needed and rely on the default treatment of all other fields. Here’s what happens when we cast the network field to Zed’s net type.

$ zq -Z 'network:=net(network) | head' GeoLite2-City-Blocks-IPv4.csv
{
    network: 1.0.0.0/24,
    geoname_id: 2077456.,
    registered_country_geoname_id: 2077456.,
    represented_country_geoname_id: null,
    is_anonymous_proxy: 0.,
    is_satellite_provider: 0.,
    postal_code: null,
    latitude: -33.494,
    longitude: 143.2104,
    accuracy_radius: 1000.
}

Notice the lack of quotes around the value 1.0.0.0/24, indicating it’s no longer treated as a string. Now a simple search using Zed’s cidr_match function can show us the block that covers a particular IP address.

$ zq -Z 'network:=net(network) | cidr_match(network, 146.243.121.22)' GeoLite2-City-Blocks-IPv4.csv
{
    network: 146.243.120.0/23,
    geoname_id: 4952487.,
    registered_country_geoname_id: 6252001.,
    represented_country_geoname_id: null,
    is_anonymous_proxy: 0.,
    is_satellite_provider: 0.,
    postal_code: 2777.,
    latitude: 41.7505,
    longitude: -71.2089,
    accuracy_radius: 20.
}

Type casting is but one small part of the wider topic of Zed’s shaping and type fusion features. These flexible and powerful data transformation techniques can provide the “enumerate-every-field-and-data-type” treatment often seen with SQL… but only when you need to! Since we’re already making good progress having only surgically altered one field, let’s keep going and see how far we get.

Joining to City Data

We’ve mapped our IP address into a subnet, but that’s only gotten us details like postal code and latitude/longitude. This might be adequate for plotting points on a map, but what if we want to see a city name?

MaxMind’s cookbook describes the significance of the geoname_id field appearing in both data sets, then shows how to leverage it to perform a SQL join to get the city detail.

Zed has its own join operator and we can apply it here, once again without having had to turn our CSV into SQL tables. We’ll use Zed’s file operator to feed our multiple data sources into the respective left and right sides of the join. Finally, since our program is getting longer, we’ll spread it out over multiple lines.

$ zq -Z '
file GeoLite2-City-Blocks-IPv4.csv
| network:=net(network)
| cidr_match(network, 146.243.121.22)
| join (
  file GeoLite2-City-Locations-en.csv
) on geoname_id=geoname_id location:=this
'

{
    network: 146.243.120.0/23,
    geoname_id: 4952487.,
    registered_country_geoname_id: 6252001.,
    represented_country_geoname_id: null,
    is_anonymous_proxy: 0.,
    is_satellite_provider: 0.,
    postal_code: 2777.,
    latitude: 41.7505,
    longitude: -71.2089,
    accuracy_radius: 20.,
    location: {
        geoname_id: 4952487.,
        locale_code: "en",
        continent_code: "NA",
        continent_name: "North America",
        country_iso_code: "US",
        country_name: "United States",
        subdivision_1_iso_code: "MA",
        subdivision_1_name: "Massachusetts",
        subdivision_2_iso_code: null,
        subdivision_2_name: null,
        city_name: "Swansea",
        metro_code: 521.,
        time_zone: "America/New_York",
        is_in_european_union: 0.
    }
}

SQL users know join is a big topic all its own, and indeed, there’s a Zed join tutorial that walks through examples using the many available flavors. However, if the example in MaxMind’s cookbook made sense to you, hopefully you’ll find the Zed one above equally intuitive. In brief, the logic is:

  1. Filter the “blocks” input down to just the record containing our target CIDR block
  2. Locate the “city” record that matches our “block” by finding shared geoname_id values
  3. Copy the matching “city” data into our “block” record as a nested field called location

Wanna Jump in a Lake?

If all we needed was this one-shot result that our IP address is in Swansea, we’re done. We’d have saved ourselves plenty of time by not having to install database software, define tables, and load CSV files into them all as a prerequisite to running our join.

But what if we expect to run this kind of query repeatedly? On my aging MacBook, the query above took about 14 seconds with zq. Considering we read a few hundred megabytes of CSV on the fly, transformed a field, performed a CIDR match, then did a join between two data sources, maybe that’s a reasonable cost. But it’s going to add up over multiple queries.

This brings us to our next step up the gradual slope: The Zed lake. The Zed lake is another topic that can be explored to needed depth, but for our immediate needs we can think of it as a place to more efficiently store and query our data. With a Zed service running, the following commands load our minimally-scrubbed CSV inputs into respective pools of the lake.

$ zed create -orderby network blocks
$ zed create -orderby geoname_id locations
$ zq 'network:=net(network)' GeoLite2-City-Blocks-IPv*.csv | zed load -use blocks -
$ zed load -use locations GeoLite2-City-Locations-en.csv

We’ve created pools called “blocks” and “locations” to hold the data from our respective CSV inputs. The -orderby option allows us to specify a pool key field by which data will be sorted when stored. This has the potential to significantly speed up queries involving this field, much like a primary key does for SQL. Since we still want the benefits of the minimal casting of the network field in our “blocks” data, we repurpose our zq from earlier to perform that transform and feed that output into zed load to populate that pool. And since we don’t need to do any transforms on the city data, a straight call to zed load is adequate to populate that pool. Once loaded, all lake data is stored in Zed’s compact binary ZNG format.

By going through these one-time steps, we can now avoid repeating the expensive operations of CSV parsing or type casting in the future and instead jump right to executing efficient queries. Since we’re now hitting pools in the lake, the Zed query syntax changes minimally compared to what we saw with zq and file inputs. Let’s wrap our query with time and see how much faster it runs.

$ time zed query -Z '
from blocks
| cidr_match(network, 146.243.121.22)
| join (
  from locations
) on geoname_id=geoname_id location:=this
'

{
    network: 146.243.120.0/23,
    geoname_id: 4952487.,
    registered_country_geoname_id: 6252001.,
    represented_country_geoname_id: null,
    is_anonymous_proxy: 0.,
    is_satellite_provider: 0.,
    postal_code: 2777.,
    latitude: 41.7505,
    longitude: -71.2089,
    accuracy_radius: 20.,
    location: {
        geoname_id: 4952487.,
        locale_code: "en",
        continent_code: "NA",
        continent_name: "North America",
        country_iso_code: "US",
        country_name: "United States",
        subdivision_1_iso_code: "MA",
        subdivision_1_name: "Massachusetts",
        subdivision_2_iso_code: null,
        subdivision_2_name: null,
        city_name: "Swansea",
        metro_code: 521.,
        time_zone: "America/New_York",
        is_in_european_union: 0.
    }
}

real	0m0.193s
user	0m0.020s
sys	0m0.013s

Below 0.2 seconds? Not bad!

Parameters From the Shell

We’ve got good momentum going, so let’s take one more step up the gradual slope and see how to generalize this approach with Zed’s user-defined operators. This allows you to parameterize the variable portions of your logic to hide complexity and enable reusability.

Since the IP address is what’s changing here, we turn that into an argument addr. We’ll also store our program in a text file loc.zed.

op find_location(addr): (
  from blocks
  | cidr_match(network, addr)
  | join (
    from locations
  ) on geoname_id=geoname_id location:=this
)

Zed’s CLI tools provide a -I option to invoke programs from such files. Therefore we can now run the query from the shell like so.

$ zed query -Z -I loc.zed 'find_location(146.243.121.22)'
{
    network: 146.243.120.0/23,
    geoname_id: 4952487.,
    registered_country_geoname_id: 6252001.,
    represented_country_geoname_id: null,
    is_anonymous_proxy: 0.,
    is_satellite_provider: 0.,
    postal_code: 2777.,
    latitude: 41.7505,
    longitude: -71.2089,
    accuracy_radius: 20.,
    location: {
        geoname_id: 4952487.,
        locale_code: "en",
        continent_code: "NA",
        continent_name: "North America",
        country_iso_code: "US",
        country_name: "United States",
        subdivision_1_iso_code: "MA",
        subdivision_1_name: "Massachusetts",
        subdivision_2_iso_code: null,
        subdivision_2_name: null,
        city_name: "Swansea",
        metro_code: 521.,
        time_zone: "America/New_York",
        is_in_european_union: 0.
    }
}

And to achieve total shell nirvana, let’s dynamically query for our own IP and use Zed’s yield operator to see what city comes back.

$ zed query -Z -I loc.zed "find_location($(curl -s ifconfig.co)) | yield location.city_name"

"San Francisco"

As validated by the smell of sourdough bread emanating from my oven, this result is indeed correct!

Wrapping Up

Hopefully we’ve given you a better understanding of Zed’s gradual slope. Whether you need a one-shot query result to get on with your day or you’re looking to build a reusable production data pipeline, Zed lets you jump on board with the appropriate level of sophistication and performance.

Of course, this only scratches the surface of what’s possible. In addition to the links above, here are some suggestions of other areas to explore.