BRIM

Super-structured Data

Rethinking the Schema
Author Steven McCanne

We all know why dealing with real-world data is so hard. It’s a big, hairy mess.

Though it's a cliché nowadays, you're no doubt familiar with the "80/20 rule" in data analytics, and have probably even experienced it yourself:

80% of your time is spent gathering, cleansing, and storing data, while 20% of your time is spent actually analyzing it and getting real work done.

You often end up stuck between the document model of JSON and the relational model of SQL databases. Going back and forth between the two worlds is such a big headache.

Thank goodness there’s a new and better way. Let’s get schemas and messy JSON out of our way.

It’s called super-structured data.

Hold onto your hats.

The Authoritarian’s Way

The gold standard for data analytics is to “cleanse” your messy JSON data and organize it all in a data warehouse, where data must conform to relational schemas so everything fits neatly into tables.

In this world, data must conform to the “one true way” of the data warehouse.

Unanticipated data forms must be discarded or stored elsewhere until changes can be made to the “ingest pipeline” and to the warehouse schemas to accommodate any new shape of messy data.

A common trick is to make super-wide tables with lots of "nulls" that can hold all of the different shapes of data that might show up, only to be foiled eventually by a new form of messy data that doesn't fit.
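
As a hedged sketch of what such a wide table might look like (the table and column names here are hypothetical), imagine something like:

CREATE TABLE events (
	ts TIMESTAMP,
	user_id TEXT,
	http_status INTEGER,  -- null unless the event happens to be an HTTP event
	dns_query TEXT,       -- null unless the event happens to be a DNS event
	error_msg TEXT        -- null unless something went wrong
	-- ...plus one more nullable column for every shape anticipated so far
);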

Somehow this approach to cleaning data feels a bit too forced.

Metaphorically speaking, the relational model feels a lot like authoritarianism.

The Anarchist’s Way

Around 2010, the NoSQL movement arose in reaction to this schema-rigid authoritarianism.

In this approach, the database is “schema-less” and data of any shape can be stored anywhere in the database, typically structured around the document model of JSON.

This anything-goes approach, however, often leads to quite a mess in real-world deployments. It is easy and tempting to allow any data in the system as requirements evolve, leading to a mishmash of JSON data shapes that have to be teased apart through ever more complex application logic.

Extending our metaphor, the document model feels a lot like anarchy.

How Did We Get Here?

The authoritarians like to call the anarchists' mishmash of JSON data a "data swamp", while the anarchists insist that it's so much easier to get up and running with a document database that it's well worth coping with the potential mess.

Anarchy or authoritarianism? Pick your poison.

You all know the history.

Back in the 1980s, the database wars came to an end when SQL and the relational model emerged as the undeniable champions.

From there, SQL-based data warehouses appeared in the 1990s enabling the new concept of business intelligence, while in the late 1990s, the Internet and Web took off like a rocket.

Then, by the early 2000s, the predominance of Web-scale companies with tech stacks built entirely from scratch led to a massive proliferation of messy data. Unfortunately, the best warehouses of that day simply couldn’t scale to the data demands of the Googles and the Yahoos.

Necessity is the mother of invention and those big Web companies soon developed custom solutions for doing warehouse-style analytics across massive clusters of commodity servers. In 2004, Google published their influential paper on MapReduce, and Yahoo later released open-source software called Hadoop based on Google’s MapReduce programming model.

A bit later, researchers at UC Berkeley improved upon the Hadoop design with Spark.

It was the dawn of Big Data. 🤮

The Authoritarian Backlash

No good deed goes unpunished, and rest assured in 2008, DeWitt and Stonebraker famously ranted that MapReduce was

  • “a giant step backwards”,
  • “a poor implementation”,
  • “not novel at all”, and
  • “overlooked the lessons of 40 years of database technology”.

Of course, they were right.

But back then, the Web-scale anarchists couldn’t just go out and purchase a sufficiently large authoritarian warehouse license to solve their ever-growing problems with messy data. Those data warehouses didn’t mesh with the fast-moving anarchy of the day and weren’t economically viable at the massive scale required.

It would be another decade before the worlds of big data and relational warehouses truly began to converge.

NoSQL: The Anarchist’s Database

In the meantime, many application developers came to loathe the object-relational mapping (ORM) pattern that required a complex layer of moving parts between their dynamic and often messy application data and the authoritarian relational model.

Why couldn’t apps just write JSON data straight into a database? That would be so much easier.

So, along came the document-model database to the rescue.

Some of these systems, like MongoDB, embraced a pure "NoSQL" approach while others, like Couchbase, eventually added a SQL-like query language based on SQL++, which extends SQL to operate over the document data model of JSON.
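
To give a rough, illustrative flavor of SQL++ (the orders collection and its fields are hypothetical, and exact dialects vary), such a query can reach directly into nested JSON:

SELECT o.customer.name, i.price
FROM orders AS o UNNEST o.items AS i
WHERE i.price > 100;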

A Cautious Reaction

This NoSQL stuff got popular and the authoritarians spoke again.

A good eight years after the MapReduce rant, Stonebraker softened his critique of the anarchists, stating in the Red Book that

[NoSQL systems] are easy for a programmer to get going and do something productive. RDBMSs, in contrast, are very heavyweight, requiring a schema up front.

He concluded:

This is a wake-up call to the commercial vendors to make systems that are easier to use.

So what’s happened in the half dozen years since these insights from the Red Book?

Well, both the document and relational models have continued to thrive and have been firmly cemented into enterprise data stacks. Just look at the market capitalizations of all the companies involved (even after the bursting of Tech Bubble v2.0 in spring 2022).

While the anarchist NoSQL databases have managed to hold their own against the authoritarian databases as the storage tier for many application deployments, they’re not so hot at high-speed analytics and complex multi-dimensional warehouse problems.

You may have seen a project or two moving data out of these systems and into ClickHouse or a cloud warehouse when insurmountable scaling problems were hit. Clearly, the schema-rigid relational model has won the battle of analytics, and serves as the foundation for the modern cloud warehouse.

Schemas Go Viral

Given these trends, the schema concept has made a big move out of the database, and has become a fundamental design element that shows up everywhere these days.

If your data is going to land in a SQL warehouse, why not push schema enforcement as far upstream as possible? This way, the authoritarian data teams who own the model definitions can impose constraints on the anarchist engineering teams to prevent them from haphazardly creating messy data.

After all, those pesky engineers don’t understand the business value of data, right? So best to put some handcuffs on them. Isn’t authoritarian control so sweet?

To this end, schemas lie at the heart of popular data formats like Avro and Parquet. And, in the client-server realm, Protocol Buffers and Thrift configure the schema directly into the compiled implementations of the communicating end points.
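
As a concrete illustration, a minimal Avro schema for a hypothetical contacts record might look like the JSON below; both the writer and the reader must agree on this definition before any data can be encoded or decoded:

{
  "type": "record",
  "name": "contacts",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"}
  ]
}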

But there's a cost to pushing the authoritarian model upstream: a schema-rigid architecture creates brittle interdependencies, and implementing change can be difficult and time-consuming.

You want to make a change? Okay, update all your schema definitions, recompile everything, and redeploy. Someone makes a seemingly innocuous change to a client data structure used on your mobile app and your mission-critical data pipeline comes to a screeching halt. You know the fire drill.

While great for data modeling, schemas can really get in the way when you’re just trying to move and store data.

It turns out central control of everything can make things hard.

Just Add Thrust

With enough thrust, pigs can fly, or so goes the saying. So why not just throw more engineering at the schema problem?

And sure enough, a whole sub-industry has emerged to take your data from JSON cloud APIs and put it into schemas.

The idea here is that instead of manually creating schemas, what if the schemas were automatically created for you? When something doesn’t fit in a table, how about automatically adding columns for the missing fields?
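
In practice, this often boils down to automated DDL: when a new field shows up in the incoming JSON, the pipeline issues something along these lines (a sketch with a hypothetical table and column):

ALTER TABLE events ADD COLUMN device_type TEXT;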

This schema-oriented way of thinking has led to a world where schemas are a given and any impedance mismatch between real-world, messy data and tabular schemas shall be solved with ever more layers of software complexity and engineering.

Super-structured Data

We asked ourselves a crazy question: could it be that we’ve built everything upon the wrong foundational primitives?

Maybe Stonebraker was right? Maybe it’s the schemas that are getting in our way?

We realized this schemas-are-everywhere way of thinking is like putting a square peg (JSON) in a round hole (relational tables). Yes, you can do it, but there's nothing natural about it, and having two distinct ways of doing things creates friction and complexity that leads to wasted time and increased cost.

Could mixing a little controlled anarchy into our authoritarianism perhaps be helpful?

After working on this problem for a couple of years, we arrived at the concept of super-structured data, guided by the following principle:

Instead of pre-defining schemas to which all values must conform, data should instead be self-describing and organized around a deep type system, allowing each value to freely express its structure through its explicit type.

With super-structured data, the mishmash of relational tables and semi-structured data embedded in tables all turns into a well-defined set of values that all conform to precisely defined super-structured types. Both JSON anarchy and schema-rigid authoritarianism are just special cases of the super-structured model.

In other words, super-structured data is a superset of both JSON and relational tables. All JSON documents are super-structured values and any relational table can be represented with a super-structured type.

For example, the JSON value

{"s":"foo","a":[1,"bar"]}

would traditionally be called “schema-less” and in fact is said to have the vague type “object” in the world of JavaScript or “dict” in the world of Python. However, the super-structured interpretation of this value’s type is instead:

type record with field s of type string and field a of type array of type union of types integer and string

We call the former style of typing a “shallow” type system and the latter style of typing a “deep” type system. The hierarchy of a shallow-typed value must be traversed to determine its structure whereas the structure of a deeply-typed value is determined directly from its type.
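
In a compact, illustrative type notation where records are written with curly braces, arrays with square brackets, and a union as a parenthesized list of types (a sketch rather than formal syntax), the deep type above is roughly:

{s:string,a:[(int64,string)]}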

Given a deep type system, when a sequence of values in fact conforms to, say, a uniform "record type", such a collection of record values looks precisely like a relational table. For example, the sequence of JSON values

{"id":1,"name":"Alice"}
{"id":2,"name":"Bob"}
{"id":3,"name":"Carlos"}

has a natural correspondence with a SQL table created by

CREATE TABLE contacts (
	id INTEGER,
	name TEXT
);

In this case, the rows of this table are typed as type record with field id of type integer and field name of type string.

If we, in turn, employ named types as part of the super-structured type system, we can instead create a type called “contacts” that looks just like the SQL table:

type contacts {id:int64,name:string}

In the super-structured model, data is self-describing and we can employ decorators to bind the name to the type as in

{id:1,name:"Alice"}(=contacts)
{id:2,name:"Bob"}(=contacts)
{id:3,name:"Carlos"}(=contacts)

Since the underlying type of contacts is implied by the value, there is actually no need for an explicit type declaration.

Now the SQL statement

SELECT name FROM contacts WHERE id=2

could be interpreted either traditionally as a query for a row of a relational table named contacts, or in terms of super-structured data, as a query over a set of super-structured data where the FROM clause refers to a first projection by type contacts and the SELECT clause refers to a second projection of the column name.
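
As a sketch of that second interpretation, imagine a pipeline-style query (the syntax here is purely illustrative) that first selects values by their type, then filters and projects:

where typeof(this)==<contacts> | where id==2 | cut name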

In this way, anarchy and authoritarianism can live side by side with a single data model and authoritarian tables can be projected from a pool of super-structured data as a simple type query.

Hasn’t This Been Done Before?

Surely, this concept of super-structured data isn’t rocket science. Why don’t things already work this way?! From a 10,000 foot view, these ideas feel familiar.

The EdgeDB project advocates for “types not tables”, which certainly rhymes with the super-structured goal of using types instead of schemas to organize data. And while EdgeDB’s type system is deeply typed, its storage layer is just a traditional relational database. While this approach masterfully solves some important and thorny problems (all while strategically reusing mature relational database technology), it does not solve the underlying data representation problem. Instead, EdgeDB is essentially a new data silo whose type system cannot be used to serialize data external to the system.

Okay, but can’t we get super-structured properties with other existing data formats?

Let’s have a look.

JSON itself clearly isn't a candidate, but BSON and Ion are efficient binary cousins of JSON that were created to provide a type-rich elaboration of the semi-structured model. Unfortunately, both approaches have shallow type systems, so they are not candidates for super-structured data either.

But what about Parquet, Avro, or the hugely popular Arrow format?

Indeed, these formats all have deep typing but are schema rigid: an encoded sequence of values requires an up-front schema definition and all of the values in the sequence must conform to that one schema. Also, Parquet does not have union types, so mixed-type arrays and dictionaries aren’t expressible, though this could be addressed in a future version of the format.

In a nutshell,

  • JSON, BSON, and Ion are schema-less but have shallow typing, while
  • Parquet, Avro, and Arrow have deep typing but are schema rigid.

Super-structured data, on the other hand, provides the best of both worlds:

Super-structured data has deep types without schema rigidity.

Schema Registries to the Rescue

Wait a minute. Can’t you solve the schema rigidity problem with a schema registry?

Indeed, a number of years ago, developers wanting to transmit diversely typed sequences of data over a Kafka queue clearly tripped over the problems of schema-rigid formats.

They needed a solution: why not just use a schema registry to persist all the possible schemas in use?

In this approach, each transmitted value is tagged with a small-integer “schema ID” and the schema registry provides a centralized service for mapping these IDs to the intended schema. Consequently, a heterogeneous sequence of Avro or Protocol Buffers values can be transmitted over a Kafka topic by prepending the schema ID to each encoded value. The receiver can then look up and cache each schema using the ID and the schema registry.
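
Concretely, the framing can be as simple as a fixed-size schema ID prefixed to the encoded bytes, along these lines (a sketch; real registries pin down details like a magic byte and the integer's endianness):

[ 4-byte schema ID ][ Avro-encoded value bytes ]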

When deployed with Avro, this schema-registry pattern begins to resemble our model for super-structured data. In particular, not only does Avro have a deep type system, but it also includes union types, which accommodate multi-typed arrays and tuples. And it has a null type, which, when combined with a union type, can represent optional values in a record (or JSON object) just like optional values in a relational column.

Given all this, Avro with a schema registry comes closest to our concept of super-structured data. However, the schema-registry service not only creates operational overhead, but makes the approach entirely unsuitable for a self-contained format for data serialization. Without live, online access to the schema registry, a client of this approach cannot decode any Avro-encoded payload.

In short, a schema registry creates a parallel universe problem: everything is organized around schemas so data in flight and data at rest must both conform to the same set of schemas. When data at rest resides in a relational database, we now have to keep the tables in the database consistent with the schemas in the separate registry service.

The schemas are getting in the way again. What a mess!

The Zed Project

To tackle the myriad of challenges with schema-rigid authoritarianism juxtaposed with JSON anarchy, our small team at Brim Data has been developing, iterating, and refining the ideas for super-structured data under the umbrella of The Zed Project.

At the foundation of Zed, we’ve developed a family of super-structured formats that all adhere to a common Zed data model. The super-structured formats include

  • ZSON - a human-readable format based on Zed that is a superset of JSON
  • ZNG - an efficient binary format based on Zed, analogous to Avro
  • VNG - an efficient columnar format based on Zed, analogous to Parquet

A novel advantage of this design is that one cohesive data model supports the three important variations of serialized data:

  • a human-readable form for easy interpretation,
  • an efficient sequence form for search, and
  • an efficient columnar form for vectorized analytics.

Zed is the first system to unite these three models with a unified set of formats where converting between the various forms incurs no loss of information.

To crack the problem of efficiently representing super-structured types across a sequence of values, the ZNG and VNG formats utilize a concept called a type context. A type context allows us to replace a globally scoped schema registry with locally scoped type definitions that are embedded within the data sequence itself. Types need only be defined once and can then be reused. And the type context can always be "reset" within large files or data streams so that they remain seekable or can be fragmented into independently decodable chunks. Moreover, values can be moved from one context to another with a fast and simple table lookup.
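
Here is a purely conceptual sketch of a type context in action (this is not the actual ZNG encoding): type definitions appear inline in the stream, receive small-integer IDs, and subsequent values simply reference those IDs:

typedef 30 = {id:int64,name:string}     # defined once, locally in the stream
value 30 {1,"Alice"}
value 30 {2,"Bob"}
typedef 31 = {msg:string,count:uint64}
value 31 {"hello",5}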

At this point, you might wonder: why create Zed and these formats in the first place? Rest assured, we didn't just set out to work on super-structured data for its own sake.

Necessity is the mother of invention and our journey to super-structured data started when we found it hard to retain the rich and deeply typed event information from Zeek logs without force-fitting heterogeneous log data into warehouse tables or dumbing down Zeek events into JSON for storage in document-oriented search systems. We also realized that in order to do both search and analytics well, you had to stand up two systems: search systems aren't very good at analytics, and warehouse systems aren't very good at search.

To this end, we began prototyping these ideas in a command-line tool called zq, which is like jq, but of course operates upon super-structured Zed data instead of just JSON and has easy-to-use search built in. Also, since Zed is a superset of other data models, we’ve included support in zq for reading and writing data in other formats like JSON, CSV, and Parquet.
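
For example, a quick way to get a feel for the shape of a messy JSON file is to aggregate by type, along the lines of this zq invocation (illustrative; see the zq docs for exact syntax):

zq "count() by typeof(this)" messy.json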

Our vision is that super-structured data should make it really easy to scale down search and analytics to your laptop, or scale up to a large-scale cloud deployment of a data lake based on Zed, i.e., a "Zed lake". Thus, we've been developing a lake format based on Zed, which is managed and served by another command-line tool simply called zed.
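
As a rough sketch of the lake workflow (the command names and flags here are from memory and may differ across versions; consult the zed docs for authoritative usage), a local lake might be set up and queried like this:

zed init -lake scratch
zed create logs
zed use logs
zed load messy.json
zed query "count() by typeof(this)"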

A Zed lake is sort of like a lakehouse but is based on super-structured data, requires no schema definitions, and has a user-friendly, history-navigable commit model like Git. Our work on Zed lakes is less mature than zq and the Zed formats, but the lake implementation has already proven robust enough for many of our community users to run it in production at non-trivial scale.

To take advantage of the Zed data model, we have also developed a new search, query, and data-transformation language that we simply call the "Zed language". The Zed language is the primary means to interact with zq and the zed query commands.

To be honest, we struggled a bit as to whether we should just embrace SQL as the query interface. Does the world really need yet another query language?

Yet the problem with SQL for our use case is that it’s simply an awful user experience for search. Many of our community users use Zed in a lean-forward style of interactive keyword search with a certain amount of lightweight analytics. Forcing these users to switch to SQL would be a major step back for them.

In the end, we decided to continue to develop the Zed language and explore the audacious goal of blending keyword search, warehouse-style analytics, data exploration primitives, and data transformation logic all in one unified language. This might sound a bit crazy, but we think we’re onto something here.

In the long run, we’ll no doubt support a dedicated SQL query engine that can operate on virtualized SQL tables projected from Zed types, but for now, our team is small and we’re exploring how far we can go with the Zed language.

Finally, we’ve built a desktop application called “Zui” that provides an interactive search, analytics, and exploration experience for Zed data. Through its integrations with Zeek and Suricata, many of our community users rely upon Zui for threat hunting and incident response. Other users have implemented ETL pipelines in Zed and monitor and debug their pipelines using Zui. Some of our other users leverage Zui for exploratory data analysis when trying to decipher large, complex JSON objects that were produced elsewhere in their organization.

The Zui app utilizes the zed-js library to bring super-structured data and the Zed data model to the JavaScript world. We don’t aspire for Zui to be a notebook, but rather have leveraged zed-js to explore some initial integrations with notebook systems like Observable. In a future article, we’ll write about our Observable integration.

Try it out

If you'd like to try Zed and Zui, it's all pretty easy to get started.

We love working with all our users to help guide us to the best ways of solving your real, everyday problems. Give us a holler and we look forward to chatting.

Wrapping Up

It’s hard to make data easy and the jury is certainly out on Zed, but let’s see how far we can get.

Let’s see if we can use the Zed type system to get schemas out of our way.

Let’s see if we can do better than shaving the hard edges off JSON’s square peg to fit in the round hole of relational schemas and dataframes.

Maybe, just maybe, by mixing a bit of controlled anarchy into the world of schema-rigid authoritarianism, Zed can make data engineering much, much easier after all.

Acknowledgements

Noah Treuhaft coined the awesomely perfect term super-structured data to describe what we’ve been working on.

Garrison Hess came up with the clever metaphor of “anarchy vs. authoritarianism” as a reaction to the design motivation of Zed.