BRIM

Easier Data Debugging With Zed’s First-class Errors

Author: James Kerr

We’ve all experienced the pain of debugging our data when something upstream changes.

If unexpected input is received, the offending row, or sometimes the entire ingest, is rejected. The error is usually logged to a file somewhere, and it's up to us to match the logs against the source data to repair the problem. This recent tweet by Hamilton Ulmer gives voice to the issue.

Zed makes finding and fixing problem data much easier with its error data type.

error("missing field")
error({msg: "contains only zeros", src_array: [0,0,0]})

In Zed, errors are data. They persist on disk right next to your other data. We’ve built an entire approach to data based on types instead of schemas so errors can land anywhere a value can land without breaking anything. Wouldn’t it be great if you could see errors in place instead of mysterious NULLs?

The Error Data Type

error(value)

The error type can hold any single value within it. The value can be a simple string describing the problem or a complex record with nested fields providing context.

Errors are yielded by some of Zed's built-in functions, and users can create their own by invoking the error() function in a Zed script.
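For example, we can flag suspicious values ourselves. Here's a sketch (the field name and threshold are hypothetical) that wraps negative amounts in an error carrying the original record:

echo '{amount:-5.}' | zq -z 'yield amount < 0 ? error({msg: "negative amount", value: this}) : this' -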

What does this mean for data engineers?

1. Errors Don’t Halt Ingest

How many times has an upstream schema change arrested the flow of data through the pipeline? Zed's super-structured data model liberates us from rigid schemas, so ingest can continue even when unexpected problems occur.

2. Errors are Queryable

The Zed query filters below can be used to identify data containing errors.

kind(val) == "error"
// or
is_error(val)

Both expressions return true if val is an error.

has_error(val)

This returns true if val is an error or a complex type containing an error somewhere within it.
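The distinction matters for records: a record with an error in one of its fields is not itself an error. A quick sketch with zq reading ZSON from stdin:

echo '{a: error("oops"), b: 1}' | zq -z 'yield {is: is_error(this), has: has_error(this)}' -

Here is_error(this) should come back false while has_error(this) is true, since the error is nested inside the record rather than being the top-level value.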

3. Errors Can Contain Structured Data

A shaper script on ingest can identify the problematic data with has_error and can include the structured source data in the error value, giving you the context you need to address the problem. The bad data can be queried, cleaned, and restored.

A Practical Use Case

Let’s explore these concepts by ingesting bank transactions from mint.com into a local Zed lake.

We will perform the following tasks:

  1. Inspect a CSV of bank transactions
  2. Create a shaper to clean the data
  3. Load it into a pool in our local Zed lake
  4. Find errors
  5. Fix errors

To follow along, install the zed and zq CLI tools and clone the example repository.

Start a local Zed lake with the zed serve command.

zed serve -lake <directory_to_hold_data>

Here is a small sample of my bank transactions downloaded from mint.com. Look at all those coffee shops.

"Date","Description","Original Description","Amount","Transaction Type","Category","Account Name","Labels","Notes"
"5/12/2023","LYFT   *TEMP AUTH HOLD","LYFT   *TEMP AUTH HOLD","5.00","debit","Ride Share","CREDIT CARD","",""
"5/11/2023","TST* Automat","TST* Automat","5.22","debit","Restaurants","CREDIT CARD","",""
"5/10/2023","Spotify USA","Spotify USA","12.99","debit","Music","CREDIT CARD","",""
...

To show just how liberated from schemas we are with Zed, we’ll load this data directly into the lake without any transformation.

zed create raw
zed load -i csv -use raw mint.csv 

The first command made a new pool called “raw”. The second loaded mint.csv into it. Let’s run a query to make sure we have all 25 records.

zed query 'from raw | count()'

25 (uint64)

Instead of loading raw data into the lake, we could clean up the data locally with zq. The zq command is for processing data in local files, streams, and URLs while the zed query command is for interacting with data in a lake. They both use the same Zed language.
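For example, the count we just ran against the lake can be run directly against the local file with zq:

zq -i csv 'count()' mint.csv

This should report the same 25 records, with no lake in the picture.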

Here we create a shaper to give our data more form, enrich the types, and drop unwanted fields.

zq -i csv -Z '{
  date: time(Date),
  amount: float64(Amount),
  desc: lower(Description),
  type: this["Transaction Type"],
  account: lower(this["Account Name"]),
  category: lower(Category)
}' mint.csv

This is a query that yields a new record type for each row of the original CSV. It renames fields, lowercases strings, and casts values into richer types. The cleaned data now looks like this.

{
  date: 2023-05-12T00:00:00Z,
  amount: 5.,
  desc: "lyft   *temp auth hold",
  type: "debit",
  account: "credit card",
  category: "ride share"
}

By default, zq outputs a binary format called ZNG, which is one of the input formats automatically recognized by zed load. The lake accepts data from stdin, so we can pipe the zq output right into zed load, using a dash (-) to denote stdin.

zed create shaped
zq -i csv 'yield {
  date: time(Date),
  amount: float64(Amount),
  desc: lower(Description),
  type: this["Transaction Type"],
  account: lower(this["Account Name"]),
  category: lower(Category)
}' mint.csv | zed load -use shaped -

We made a new pool called “shaped”, transformed the CSV, and landed it in the lake. Let’s verify the data is clean.

zed query -Z 'from shaped | has_error(this)'

In Zed, “this” refers to the current input value. It appears that one record contains an error.

{
  date: 2022-12-05T00:00:00Z,
  amount: 127.12,
  desc: error("lower: string arg required"),
  type: "debit",
  account: "credit card",
  category: "check"
}

This is a record containing a field named “desc” with an error as its value. The error contains a string informing us that the lower function was called with a non-string argument. It’s good information, but we don’t know what the original argument was. We could scan the original CSV for the offending line, but there is a better way. Let’s enhance our shaper script.

{
  original: this,
  cleaned: {
    date: time(Date),
    amount: float64(Amount),
    desc: lower(Description),
    type: this["Transaction Type"],
    account: lower(this["Account Name"]),
    category: lower(Category)
  }
}
| yield has_error(cleaned)
  ? error({msg: "shaper error", original, cleaned}) 
  : cleaned

This now wraps the original data and the cleaned data in a record. It then checks if the cleaned data contains any errors. If it does, it yields an error value containing a string message along with the original and cleaned records from the previous query segment.

Because it’s becoming more complex, we’ll save this script to its own file called “shaper.zed”. We can include it before processing the CSV by passing the -I flag to zq.

zq -i csv -I shaper.zed -Z mint.csv

Let’s load this into a new pool and inspect the results.

zed create transactions
zq -i csv -I shaper.zed mint.csv | zed load -use transactions -

If we filter the data with is_error, we will get this one result.

zed query -Z 'from transactions | is_error(this)'

This query returns…

error({
  msg: "shaper error",
  original: {
    Date: "12/05/2022",
    Description: 1909.,
    "Original Description": 1909.,
    Amount: 127.12,
    "Transaction Type": "debit",
    Category: "Check",
    "Account Name": "CREDIT CARD",
    Labels: null,
    Notes: null
  },
  cleaned: {
    date: 2022-12-05T00:00:00Z,
    amount: 127.12,
    desc: error("lower: string arg required"),
    type: "debit",
    account: "credit card",
    category: "check"
  }
})

A Description field in the original CSV had a value of “1909”. It was a restaurant I visited in Temecula, CA. The zq CSV reader interpreted that as a float64 instead of a string. Let’s use the context in the error value to fix this one record.

zed query -Z '
from transactions
| is_error(this)
| under(this)
| { ...cleaned, desc: lower(string(original.Description)) }
'

We filter for error values and extract the value the error holds with the under function. The last line yields a new record with all the fields from the cleaned record and a new desc field. Before lowercasing the text, we cast it to a string. This is the result.

{
  date: 2022-12-05T00:00:00Z,
  amount: 127.12,
  desc: "1909.",
  type: "debit",
  account: "credit card",
  category: "check"
}

Much better. The Zed architecture really shines in this next example. To fix the data in the pool, we'll run the query above, then pipe the output right back into the pool.

zed query '
  from transactions
  | is_error(this)
  | under(this)
  | { ...cleaned, desc: lower(string(original.Description))}
' | zed load -use transactions -

We could also delete the error from the pool now that it’s fixed.

zed delete -use transactions -where 'is_error(this)'

Every Zed lake keeps a commit log of the data objects added or removed from the lake. In a production system, more data may have been loaded into the lake, so we’d want to delete data at a known commit to avoid unintended deletes. Here’s how to view the pool’s commit history.

zed log -use transactions
Author: jkerr@Jamess-MacBook-Pro-2.local
Date:   2023-05-17T22:50:47Z

    deleted 1 data object

    2PwMSNaFgIj3kEzdh1UAmEJy7ze 25 records in 1429 data bytes

    added 1 data object

    2PwMmDblQgY32iQ2eJCA5RCf0DC 24 records in 1163 data bytes

commit 2PwMlYWoVFDGzXa5J5AffqBmYis
Author: jkerr@Jamess-MacBook-Pro-2.local
Date:   2023-05-17T22:50:41Z

    loaded 1 data object

    2PwMlS3IFf8DcdENRXDnCaAQ9xh 1 record in 102 data bytes

commit 2PwMSNGKfhVioHOHTgFhoIAdcWB
Author: jkerr@Jamess-MacBook-Pro-2.local
Date:   2023-05-17T22:48:09Z

    loaded 1 data object

    2PwMSNaFgIj3kEzdh1UAmEJy7ze 25 records in 1429 data bytes
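Those commit IDs are useful beyond auditing. As a sketch, assuming Zed's pool@commit time-travel syntax, a query can be scoped to the pool as it existed at a given commit, such as the one that loaded the fixed record above:

zed query -Z 'from transactions@2PwMlYWoVFDGzXa5J5AffqBmYis | count()'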

Wrapping Up

First-class errors are a brand-new concept in data engineering formats. Does it strike a chord with you? We think it makes working with unpredictable real-world data a whole lot easier.

Note: The Zed lake is new technology nearing an MVP release; even so, it is already used in production at meaningful scale by a number of data engineering groups.

Next Steps

Ready to go further with Zed? Here are some suggestions.