Wrangling JSON Arrays: Zed vs SQL

Author James Kerr

Arrays are prevalent in programming languages, NoSQL databases, and storage formats like JSON and YAML. However, most table-based systems were not designed to handle them. Storing arrays usually requires an extension and querying them with SQL is awkward. You certainly don’t want to export them to a CSV.

The Zed data model was designed to harness the power of both relational tables and schema-less objects. Let’s see how it handles arrays.

Example Problem

Imagine you have an API that returns an array of objects. You need to convert that API response into a CSV file to upload into your relational database. We will use Zed to transform this data.

    "data": [
        {"position": [0,0]},
        {"position": [0,1]},
        {"position": [1,0]},
        {"position": [1,1]},
        {"position": [0,2]},
        {"position": [2,0]},
        {"position": [2,2]}

To follow along, download the zq binary and use like so:

zq '<zed script>' <data file>

All the Zed examples in this post can be placed where the <zed script> placeholder is above.

The “over” Operator

Let’s start by removing that top-level object. In Zed, we can iterate over an array and move each item into the main sequence of values. We do this by using the over operator and passing it the name of the field that contains the array. Here we iterate over the data property.

over data	

This unwrapped the array and produced a sequence of object values instead. Now let’s pipe that output to our next segment which goes over the position property on each of these objects.

over data | over position

This is interesting. Zed unwrapped each array at the position property and put its values into the main sequence of result values. We essentially flattened the position arrays into a single sequence of integers. This might be useful for some problems, but not with this dataset.

Let’s keep those arrays intact, but remove the outer object.

over data | yield position

You can think of the yield operator as a “return” in programming languages. It outputs something to the next script segment. By default, all matching values are yielded at the end of a script, but here we only yield the position field.

The “scoped over” Variation

Now let’s suppose we need to add a constant to each coordinate point. To do this, we will see a variation of the over operator called the “scoped” over. It will create a sub-sequence of values for each array it encounters. That sequence can be operated on just like any other sequence of values. Having the scope is useful because the sub-sequence can be collected back up into a single array value. Take a look.

over data | over position => (this + 256 | collect(this))

You can think of the script between the parens as a streaming map function. The value of this in the parens is the integer within those position arrays. Each integer will have the number 256 added to it, then those values will be piped to collect, an aggregate function which receives a sequence of values and puts them all into a single array value.

Finally, let’s put this in the proper format for our CSV import.

over data 
| over position => (this + 256 | collect(this)) 
| yield {x: this[0], y: this[1]}

In this last segment of our script, we yield an object literal with an x property and y property. The value of this is the array containing the two numbers. We can access the items in the array with standard square bracket indexing passing in the index of the item, this[0].

In Zed, if you are yielding an object literal, you can omit the yield keyword as it is implied. The final version of our ETL script looks like this.

over data 
| over position => (this + 256 | collect(this)) 
| {x: this[0], y: this[1]}

Using Zed with zq

If you have this data in a file, you could create the CSV file with the zq command line utility.

 zq -f csv '
 over data 
 | over position => (this + 256 | collect(this)) 
 | {x: this[0], y: this[1]}
 ' my-data.json

The syntax for this zq command above is:

zq -f <output format> '<zed script>' <data file>

zq can read and write to many data formats including JSON, CSV, and Parquet. It’s like the Rosetta Stone of data.

Wrapping Up

In this article we’ve seen how to:

  • unwrap an array into a sequence of values,
  • flatten nested arrays by using a series of “over” operators,
  • operate on each item of an array,
  • collect a sequence of values into an array,
  • and access the items within an array.

It’s easy to get excited about all the ways Zed makes working with data simpler.

What is Zed?

Zed is a platform for managing, storing, and processing data. It’s a superset of both schema-defined tables and unstructured documents; an emerging concept we call super-structured data.

The storage layer, type system, query language, zq command-line utility, and Zui desktop app are a few of the Zed tools available to the data community.

How Would This Look in SQL?

With the JSON extension enabled in DuckDB, we can achieve the same results with these queries.

To flatten all the array items into a single column:

 duckdb -c "
 WITH arrays AS (
    SELECT unnest(data).position AS pos
    FROM read_json_auto('data.json')
 SELECT unnest(pos) from arrays
│ unnest(pos)│   uint64    │
14 rows   │

And our final output would look like this:

 duckdb -c "
 WITH arrays AS (
    SELECT list_transform(
        unnest(data).position, n -> (n + 256)::INTEGER
    ) AS pos
    FROM read_json_auto('lateral.json')
 SELECT pos[1] as x, pos[2] as y from arrays
│   x   │   y   │
│ int32 │ int32 │

Talk With Us

What do you think about the differences between the Zed and SQL version?

Share your thoughts with us on Hacker News or Twitter.