Crossfilter Tutorial

September 17, 2012


Earlier this year, Square released a JavaScript library called Crossfilter. Crossfilter is like a client-side OLAP server: it groups, filters, and aggregates tens or hundreds of thousands of rows of raw data very, very quickly. Crossfilter is intended to be combined with a graphing or charting library like D3, Highcharts, or the Google Visualization API; it doesn’t have a UI of its own.

If you have experience with OLAP or multi-dimensional processing, you should be able to ramp up on Crossfilter fairly quickly. If you only have experience with relational databases, it may take a little longer. If you’ve never used the SQL group-by feature, then you face a steep learning curve.

Quick Terminology Primer

First, you’ll need to understand facts, dimensions, and measures. (If you’re already familiar with these terms, then skip this section.)

Imagine you want to answer the question “How many orders do we process per week?”

You could calculate this by hand by iterating through all of the orders that your business has processed and grouping them into weeks. In this case, each order entry would be called a fact, and you would probably store these in an OrderFacts table. The week would be a dimension; it is a way you want to slice the data. And the count of orders would be a measure; it is a value that you want to calculate.

Imagine another question, “How much revenue do we book per salesperson per week?” Again, your facts would be stored in an OrderFacts table. You would now have two dimensions, salesperson and week. And finally, your measure is revenue: the dollars on each order, summed.
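
To make the terms concrete, here is a rough sketch of both questions in plain JavaScript, with no Crossfilter involved. The order data and the field names (week, salesperson, dollars) are made up for illustration.

```javascript
// Hypothetical order facts; the field names are invented for this sketch.
var orderFacts = [
  { week: "2012-W01", salesperson: "Ann", dollars: 100 },
  { week: "2012-W01", salesperson: "Bob", dollars: 250 },
  { week: "2012-W02", salesperson: "Ann", dollars: 75 }
];

// Dimension: week. Measure: the count of orders in each week.
var ordersPerWeek = {};
orderFacts.forEach(function(fact) {
  ordersPerWeek[fact.week] = (ordersPerWeek[fact.week] || 0) + 1;
});

// Dimensions: salesperson and week. Measure: the summed dollars.
var revenue = {};
orderFacts.forEach(function(fact) {
  var key = fact.salesperson + " / " + fact.week;
  revenue[key] = (revenue[key] || 0) + fact.dollars;
});

console.log(ordersPerWeek);
console.log(revenue);
```

Every row is a fact, the thing you bucket by is the dimension, and the number you accumulate per bucket is the measure.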

Below, we’re going to answer some questions like “How many living things live in my house?” and “How many legs of each type exist in my house?”

Getting Facts Into Crossfilter

It’s incredibly easy to get your fact data into Crossfilter: just pass it an array of JSON-style objects. Each row is a fact.

Below, we’ve created a Crossfilter object loaded with facts about the living things in my house.

(Note: These are, for the most part, “fictional” facts. I don’t actually have any pets, but it makes for a good tutorial.)

var livingThings = crossfilter([
  // Fact data.
  { name: "Rusty",  type: "human", legs: 2 },
  { name: "Alex",   type: "human", legs: 2 },
  { name: "Lassie", type: "dog",   legs: 4 },
  { name: "Spot",   type: "dog",   legs: 4 },
  { name: "Polly",  type: "bird",  legs: 2 },
  { name: "Fiona",  type: "plant", legs: 0 }
]);

That’s it. Now let’s find out some totals. For example, how many living things are in my house?

Calculating Totals

To do this, we’ll call the groupAll convenience function, which selects all records into a single group, and then the reduceCount function, which creates a count of the records. Not very useful so far.

// How many living things are in my house?
var n = livingThings.groupAll().reduceCount().value();
console.log("There are " + n + " living things in my house."); // 6

Now let’s get a count of all the legs in my house. Again, we’ll use the groupAll function to get all records in a single group, but then we call the reduceSum function. This is going to sum values together. What values? Well, we want legs, so let’s pass a function that extracts and returns the number of legs from the fact.

// How many total legs are in my house?
var legs = livingThings.groupAll().reduceSum(function(fact) { return fact.legs; }).value();
console.log("There are " + legs + " legs in my house."); // 14
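
If it helps to see what those two totals boil down to, here is a plain-JavaScript sketch that computes the same numbers with Array.prototype.reduce. It doesn’t use Crossfilter at all; the facts array is just a copy of the data above.

```javascript
// Same fact data as the tutorial, as a plain array.
var facts = [
  { name: "Rusty",  type: "human", legs: 2 },
  { name: "Alex",   type: "human", legs: 2 },
  { name: "Lassie", type: "dog",   legs: 4 },
  { name: "Spot",   type: "dog",   legs: 4 },
  { name: "Polly",  type: "bird",  legs: 2 },
  { name: "Fiona",  type: "plant", legs: 0 }
];

// Counting: add 1 for every fact.
var count = facts.reduce(function(total) { return total + 1; }, 0);

// Summing: add the value extracted from every fact.
var legTotal = facts.reduce(function(total, fact) { return total + fact.legs; }, 0);

console.log(count, legTotal); // 6 14
```

The difference is that Crossfilter keeps these totals up to date as filters change, instead of looping over every fact again.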

Filtering

Now let’s test out some of the filtering functionality.

I want to know how many living things in my house are dogs, and how many legs they have. For this, we’ll need a dimension. Remember that a dimension is something you want to group or filter by. Here, the dimension is going to be the type. Crossfilter can filter on dimensions in two ways: by exact value or by range.

Below, we construct a typeDimension and filter it:

// Filter for dogs.
var typeDimension = livingThings.dimension(function(d) { return d.type; });
typeDimension.filter("dog");

That’s it. Dimensions are stateful, so Crossfilter remembers our filter and applies it to all future operations, which now only see dogs. The one exception is any calculation performed directly on typeDimension, which ignores typeDimension’s own filter. This is expected behavior, but I’m not sure if it’s a design choice or a design necessity. (We’ll look at the workaround later.)

var n = livingThings.groupAll().reduceCount().value();
console.log("There are " + n + " dogs in my house."); // 2

var legs = livingThings.groupAll().reduceSum(function(fact) {
  return fact.legs;
}).value();
console.log("There are " + legs + " dog legs in my house."); // 8

Let’s clear the filter, then do some grouping.

// Clear the filter.
typeDimension.filterAll()
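
I said Crossfilter can also filter by range, but I only showed an exact match. As far as I know, the range form is dimension.filterRange([low, high]), and it treats the range as half-open: low is included and high is excluded. Check the API reference before relying on that. The sketch below is plain JavaScript that mimics that rule on the legs field; inRange is a helper I made up for illustration, not a Crossfilter function.

```javascript
// Same fact data as the tutorial, as a plain array.
var facts = [
  { name: "Rusty",  type: "human", legs: 2 },
  { name: "Alex",   type: "human", legs: 2 },
  { name: "Lassie", type: "dog",   legs: 4 },
  { name: "Spot",   type: "dog",   legs: 4 },
  { name: "Polly",  type: "bird",  legs: 2 },
  { name: "Fiona",  type: "plant", legs: 0 }
];

// Hypothetical helper: lower bound included, upper bound excluded.
function inRange(value, range) {
  return value >= range[0] && value < range[1];
}

// Keep everything with at least 2 legs but fewer than 4.
var twoToFour = facts.filter(function(fact) {
  return inRange(fact.legs, [2, 4]);
});

var names = twoToFour.map(function(fact) { return fact.name; });
console.log(names.join(", ")); // Rusty, Alex, Polly
```

Note that the dogs are left out, because 4 is the excluded upper bound.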

Grouping with Crossfilter

I want to know how many living things of each type are in my house. I already have a dimension on type called typeDimension.

Using typeDimension, I’m going to group the records by type, and then create a measure called countMeasure that returns the count. Once countMeasure is created, we can find the number of groups by calling countMeasure.size() (a.k.a. the cardinality of the type dimension), and we can get the actual counts by calling countMeasure.top(countMeasure.size()).

// How many living things of each type are in my house?
var countMeasure = typeDimension.group().reduceCount();
var a = countMeasure.top(4);
console.log("There are " + a[0].value + " " + a[0].key + "(s) in my house.");
console.log("There are " + a[1].value + " " + a[1].key + "(s) in my house.");
console.log("There are " + a[2].value + " " + a[2].key + "(s) in my house.");
console.log("There are " + a[3].value + " " + a[3].key + "(s) in my house.");

Awesome. Now let’s count legs by type. For this, we’ll create a measure called legMeasure. This will use the reduceSum function instead of reduceCount, and we’ll provide a function that tells Crossfilter what field we want to sum.

// How many legs of each type are in my house?
var legMeasure = typeDimension.group().reduceSum(function(fact) { return fact.legs; });
var a = legMeasure.top(4);
console.log("There are " + a[0].value + " " + a[0].key + " legs in my house.");
console.log("There are " + a[1].value + " " + a[1].key + " legs in my house.");
console.log("There are " + a[2].value + " " + a[2].key + " legs in my house.");
console.log("There are " + a[3].value + " " + a[3].key + " legs in my house.");
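
For anyone who hasn’t used group-by before, here is roughly what the leg count by type works out to in plain JavaScript. This is a sketch of the idea, not how Crossfilter does it internally; groupSum and top are helper names I invented for the example.

```javascript
// Same fact data as the tutorial, as a plain array.
var facts = [
  { name: "Rusty",  type: "human", legs: 2 },
  { name: "Alex",   type: "human", legs: 2 },
  { name: "Lassie", type: "dog",   legs: 4 },
  { name: "Spot",   type: "dog",   legs: 4 },
  { name: "Polly",  type: "bird",  legs: 2 },
  { name: "Fiona",  type: "plant", legs: 0 }
];

// Hypothetical helper: bucket rows by key and sum a value per bucket.
function groupSum(rows, keyOf, valueOf) {
  var totals = {};
  rows.forEach(function(row) {
    var key = keyOf(row);
    totals[key] = (totals[key] || 0) + valueOf(row);
  });
  return Object.keys(totals).map(function(key) {
    return { key: key, value: totals[key] };
  });
}

// Hypothetical helper: the k groups with the largest values, largest first.
function top(groups, k) {
  return groups.slice().sort(function(a, b) {
    return b.value - a.value;
  }).slice(0, k);
}

var legsByType = top(
  groupSum(facts,
    function(fact) { return fact.type; },
    function(fact) { return fact.legs; }),
  4
);

legsByType.forEach(function(group) {
  console.log("There are " + group.value + " " + group.key + " legs in my house.");
});
```

The result has the same { key, value } shape as the entries used above, with the biggest total (the dogs, with 8 legs) first.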

Filtering Gotchas

As mentioned earlier, when you filter on a dimension and then roll up using that same dimension, Crossfilter intentionally ignores any filter on that dimension.

For example, this does not return what you would expect:

// Filter for dogs.
typeDimension.filter("dog");

// How many living things of each type are in my house?
// You'd expect this to return 0 for anything other than dogs,
// but it doesn't because the following statement ignores any
// filter applied to typeDimension:
var countMeasure = typeDimension.group().reduceCount();
var a = countMeasure.top(4);
console.log("There are " + a[0].value + " " + a[0].key + "(s) in my house.");
console.log("There are " + a[1].value + " " + a[1].key + "(s) in my house.");
console.log("There are " + a[2].value + " " + a[2].key + "(s) in my house.");
console.log("There are " + a[3].value + " " + a[3].key + "(s) in my house.");

The workaround is to create another dimension on the same field, and filter on that:

// Filter for dogs.
var typeFilterDimension = livingThings.dimension(function(fact) { return fact.type; });
typeFilterDimension.filter("dog");

// Now this returns what you would expect.
var countMeasure = typeDimension.group().reduceCount();
var a = countMeasure.top(4);
console.log("There are " + a[0].value + " " + a[0].key + "(s) in my house.");
console.log("There are " + a[1].value + " " + a[1].key + "(s) in my house.");
console.log("There are " + a[2].value + " " + a[2].key + "(s) in my house.");
console.log("There are " + a[3].value + " " + a[3].key + "(s) in my house.");
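
If the rule still feels slippery, here is a toy model of it in plain JavaScript. It is my own simplification, not Crossfilter code: each “dimension” is just an object with a field and an optional filter, and countBy is a made-up helper that honors every filter except the one on the dimension being grouped.

```javascript
// Same fact data as the tutorial, as a plain array.
var facts = [
  { name: "Rusty",  type: "human", legs: 2 },
  { name: "Alex",   type: "human", legs: 2 },
  { name: "Lassie", type: "dog",   legs: 4 },
  { name: "Spot",   type: "dog",   legs: 4 },
  { name: "Polly",  type: "bird",  legs: 2 },
  { name: "Fiona",  type: "plant", legs: 0 }
];

// Toy dimensions: two separate dimensions on the same field.
var typeDimension = { field: "type", filter: null };
var typeFilterDimension = { field: "type", filter: null };
var dimensions = [typeDimension, typeFilterDimension];

// Count rows per key of groupDimension, applying every filter
// except the one set on groupDimension itself.
function countBy(rows, groupDimension) {
  var counts = {};
  rows.forEach(function(row) {
    var passes = dimensions.every(function(dimension) {
      return dimension === groupDimension ||
             dimension.filter === null ||
             row[dimension.field] === dimension.filter;
    });
    var key = row[groupDimension.field];
    counts[key] = (counts[key] || 0) + (passes ? 1 : 0);
  });
  return counts;
}

// Filter on the grouped dimension itself: the filter is ignored.
typeDimension.filter = "dog";
var ignored = countBy(facts, typeDimension);
console.log(ignored); // { human: 2, dog: 2, bird: 1, plant: 1 }

// Filter on the second dimension instead: the filter is applied.
typeDimension.filter = null;
typeFilterDimension.filter = "dog";
var applied = countBy(facts, typeDimension);
console.log(applied); // { human: 0, dog: 2, bird: 0, plant: 0 }
```

Notice that the filtered-out types don’t disappear from the result; they just drop to zero.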

Other Gotchas

Crossfilter is built to be insanely fast. To do that, rather than completely recalculating groups as filters are applied, it calculates incrementally. It does this by using a bitfield to track, for each fact, which dimensions’ filters that fact currently passes. For that reason, Crossfilter dimensions are expensive, so you should think carefully about creating them and create as few as possible.
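
Here is a very simplified sketch of the bitfield idea in plain JavaScript. It is not Crossfilter’s actual source, and the details are my assumptions: one bit per dimension, with a set bit meaning “this fact is filtered out by that dimension.” The names TYPE_BIT, LEGS_BIT, filterBits, and visibleTo are invented for the example.

```javascript
// Same fact data as the tutorial, as a plain array.
var facts = [
  { name: "Rusty",  type: "human", legs: 2 },
  { name: "Alex",   type: "human", legs: 2 },
  { name: "Lassie", type: "dog",   legs: 4 },
  { name: "Spot",   type: "dog",   legs: 4 },
  { name: "Polly",  type: "bird",  legs: 2 },
  { name: "Fiona",  type: "plant", legs: 0 }
];

// Assumption for this sketch: one bit per dimension.
var TYPE_BIT = 1 << 0;
var LEGS_BIT = 1 << 1;

// One small integer per fact; every bit starts at 0 (nothing filtered out).
var filterBits = new Uint8Array(facts.length);

// Applying the filter type === "dog" only flips bits on the facts it excludes.
facts.forEach(function(fact, i) {
  if (fact.type !== "dog") filterBits[i] |= TYPE_BIT;
});

// A group on the dimension with bit ownBit sees a fact when every
// OTHER dimension's bit is clear; its own bit is masked away.
function visibleTo(ownBit, i) {
  return (filterBits[i] & ~ownBit) === 0;
}

var seenByTypeGroup = 0;
var seenByLegsGroup = 0;
for (var i = 0; i < facts.length; i++) {
  if (visibleTo(TYPE_BIT, i)) seenByTypeGroup++;
  if (visibleTo(LEGS_BIT, i)) seenByLegsGroup++;
}

console.log(seenByTypeGroup, seenByLegsGroup); // 6 2
```

Under this model every dimension costs one more bit on every fact, plus whatever index it needs for filtering, which is one way to see why you’d want as few dimensions as possible.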

Shameless Plug

I’m the co-founder of FiveStreet, a technology startup company that helps leading real estate agents beat their competition when responding to online leads. We’re actively looking for a designer, business development folks, and (of course) more customers. If you can help me connect with any of the aforementioned folks, please get in touch.
