Binning a data stream in Vega

By Mirek on 7/25/2022 (tags: charts, JSON, Vega, categories: tools, code)

In this post I’ll briefly show you how to perform a basic data transformation on a data stream in a Vega chart spec.

Vega has a very rich set of transforms that you can combine and apply to the source data. The full list is available in the Vega documentation.

In this post we will look at the bin transform. However, we will not cover every option and parameter; instead, we will focus on a real example to demonstrate the general idea of the bin transform. We will then use a couple of other transforms to shape the data to our needs.

To discover all of the possibilities, try the online editor, where you can play around with parameters and see a live preview of the generated chart, the transformed data stream, and much more. I highly recommend starting with Vega this way.

In short, as the documentation says,

The bin transform discretizes numeric values into a set of bins.

In other words, it assigns each data element to one of the bins, based on its numeric value.
So someone might ask: how is that different from grouping or aggregating elements?

Well, when you group data, you define a grouping key: the field (or fields) by which each data element is grouped. All elements with the same key value then go to the same group.

The bin transform is quite different, mostly for two reasons:

1. One bin can span elements with different key-field values, as long as they all fall into the bin extent (bin range), while in grouping all elements of a group have the same key value.

2. The bin transform does not produce groups; instead, it attaches a bin definition to each data element.
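For contrast, a plain grouping step could look like the sketch below. It assumes records with a dep field, like the sample data used later in this post, and the output name depCount is made up for illustration. It collapses the stream into one row per department:

```json
{
  "type": "aggregate",
  "groupby": ["dep"],
  "ops": ["count"],
  "as": ["depCount"]
}
```

Here two records share a group only when their dep values match exactly, whereas a bin collects every record whose value falls inside its range and leaves the records themselves in the stream.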

Let’s see that in an example. We will use the aforementioned Vega online editor.

Assume we have a data set in which each record represents the annual salary of one employee (in thousands) and the department of the company where he or she works.

[
  {"dep": "F", "salary": 48},
  {"dep": "C", "salary": 33},
  {"dep": "E", "salary": 47},
  {"dep": "A", "salary": 25},
  {"dep": "H", "salary": 42},
  {"dep": "A", "salary": 22},
  {"dep": "G", "salary": 51},
  {"dep": "B", "salary": 34},
  {"dep": "D", "salary": 41},
  {"dep": "A", "salary": 22},
  {"dep": "A", "salary": 26}, 
  {"dep": "F", "salary": 38},
  {"dep": "C", "salary": 32},
  {"dep": "E", "salary": 41},
  {"dep": "A", "salary": 45},
  {"dep": "H", "salary": 52},
  {"dep": "A", "salary": 23},
  {"dep": "G", "salary": 27},
  {"dep": "B", "salary": 54},
  {"dep": "D", "salary": 31},
  {"dep": "A", "salary": 24},
  {"dep": "A", "salary": 26}
]

Now we want to see how the salaries are distributed across different ranges. Let’s say we want to see how many people earn between 20k and 30k per year and how many between 40k and 50k. By the way, you can download the full spec for this demo at the bottom of the chart.

Let’s add the bin transform.

"transform": [
  {
    "type": "bin",
    "extent": [0, 100],
    "field": "salary",
    "step": 10,
    "as": ["salaryFrom", "salaryTo"]
  }
]

We tell the Vega engine to create bins with a step of 10, so 0-10, 10-20, 20-30, and so on. It then takes each data element, checks the value of its salary field, finds the bin this value belongs to, and attaches the bin bounds to the data element as the new fields salaryFrom and salaryTo, where the lower bound is inclusive and the upper one is exclusive. In the extent parameter we need to provide the minimum and maximum of the range that the bins should cover, a sort of domain. We can quite easily grab that from the data source by using the extent transform.
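As a sketch of that last idea, the extent transform can publish the minimum and maximum of the salary field as a signal, which the bin transform then reads in place of the hard-coded range. The signal name salaryExtent is an assumption chosen for this example:

```json
"transform": [
  {
    "type": "extent",
    "field": "salary",
    "signal": "salaryExtent"
  },
  {
    "type": "bin",
    "field": "salary",
    "extent": {"signal": "salaryExtent"},
    "step": 10,
    "as": ["salaryFrom", "salaryTo"]
  }
]
```

The extent transform has to come before the bin transform in the list, so that the signal is already available when the bins are computed.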

With the bin transform applied, we can now look at the resulting data stream in the editor.
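For reference, with an extent of [0, 100] and a step of 10, the first four records of the sample data should come out of the bin transform as shown below. This is worked out by hand from the sample data, not copied from the editor:

```json
[
  {"dep": "F", "salary": 48, "salaryFrom": 40, "salaryTo": 50},
  {"dep": "C", "salary": 33, "salaryFrom": 30, "salaryTo": 40},
  {"dep": "E", "salary": 47, "salaryFrom": 40, "salaryTo": 50},
  {"dep": "A", "salary": 25, "salaryFrom": 20, "salaryTo": 30}
]
```

Note that every record is still there; each one has simply gained two extra fields describing its bin.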

Great, but how do we proceed? We basically need to count the elements in each bin and display the count as the height of a bar. We want the bottom axis to represent the salary ranges (our bins) and the vertical axis the number of employees that fall into each salary range.
For that we can now use the aggregate transform with the count operation:

{
   "type": "aggregate",
   "groupby": [ "salaryFrom", "salaryTo" ],
   "ops": [ "count" ],
   "as": [ "binCount" ]
}

We group the data stream by both the salaryFrom and salaryTo fields. First, because we want both fields in the output groups, and second, because these fields are correlated with each other and always identify the same group, representing one bin. Then we perform a count operation in each group and return its value as an extra field called binCount.

So now we know how many items we have in each bin. Let’s see how it renders as a bar chart.

Very nice, but we want to have the bins in ascending order. Well, that’s also quite simple. We can use the collect transform, which basically lets us sort the data elements in the stream.

{
   "type": "collect",
   "sort": { "field": "salaryFrom" }
}

And we’ve got it!
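After sorting, the data stream for the sample data should contain four rows. The counts below were tallied by hand from the 22 sample records, so treat them as a cross-check rather than as editor output:

```json
[
  {"salaryFrom": 20, "salaryTo": 30, "binCount": 8},
  {"salaryFrom": 30, "salaryTo": 40, "binCount": 5},
  {"salaryFrom": 40, "salaryTo": 50, "binCount": 6},
  {"salaryFrom": 50, "salaryTo": 60, "binCount": 3}
]
```

The counts add up to 22, so no record was lost along the way.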

The full Vega spec is available below. Just copy the JSON and paste it into the Vega online editor. Feel free to play around.
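In case the attachment is not at hand, here is a minimal sketch that puts the three transforms together into a runnable spec. It is not the original attachment: the data set name salaries, the chart size, and the scale, axis, and mark settings are all assumptions made for this example.

```json
{
  "$schema": "https://vega.github.io/schema/vega/v5.json",
  "description": "Sketch only: bin, aggregate and collect on the sample salary data.",
  "width": 400,
  "height": 200,
  "padding": 5,
  "data": [
    {
      "name": "salaries",
      "values": [
        {"dep": "F", "salary": 48}, {"dep": "C", "salary": 33},
        {"dep": "E", "salary": 47}, {"dep": "A", "salary": 25},
        {"dep": "H", "salary": 42}, {"dep": "A", "salary": 22},
        {"dep": "G", "salary": 51}, {"dep": "B", "salary": 34},
        {"dep": "D", "salary": 41}, {"dep": "A", "salary": 22},
        {"dep": "A", "salary": 26}, {"dep": "F", "salary": 38},
        {"dep": "C", "salary": 32}, {"dep": "E", "salary": 41},
        {"dep": "A", "salary": 45}, {"dep": "H", "salary": 52},
        {"dep": "A", "salary": 23}, {"dep": "G", "salary": 27},
        {"dep": "B", "salary": 54}, {"dep": "D", "salary": 31},
        {"dep": "A", "salary": 24}, {"dep": "A", "salary": 26}
      ],
      "transform": [
        {
          "type": "bin",
          "extent": [0, 100],
          "field": "salary",
          "step": 10,
          "as": ["salaryFrom", "salaryTo"]
        },
        {
          "type": "aggregate",
          "groupby": ["salaryFrom", "salaryTo"],
          "ops": ["count"],
          "as": ["binCount"]
        },
        {
          "type": "collect",
          "sort": {"field": "salaryFrom"}
        }
      ]
    }
  ],
  "scales": [
    {
      "name": "x",
      "type": "band",
      "domain": {"data": "salaries", "field": "salaryFrom"},
      "range": "width",
      "padding": 0.1
    },
    {
      "name": "y",
      "type": "linear",
      "domain": {"data": "salaries", "field": "binCount"},
      "range": "height",
      "nice": true
    }
  ],
  "axes": [
    {"orient": "bottom", "scale": "x", "title": "Salary from (k)"},
    {"orient": "left", "scale": "y", "title": "Employees"}
  ],
  "marks": [
    {
      "type": "rect",
      "from": {"data": "salaries"},
      "encode": {
        "enter": {
          "x": {"scale": "x", "field": "salaryFrom"},
          "width": {"scale": "x", "band": 1},
          "y": {"scale": "y", "field": "binCount"},
          "y2": {"scale": "y", "value": 0},
          "fill": {"value": "steelblue"}
        }
      }
    }
  ]
}
```

The band scale takes its domain from the data in stream order, so the collect step is what puts the bars in ascending order on the bottom axis.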

Cheers

Download attachment - 4 KB