The Answer Is In Your Data. And It’s “No” Until You Ask.

Posted on 15 November 2012

This post covers some really basic stuff, but it’s fundamental to what we do, so I think it’s worth a review.

You’ve probably heard of “big data”, which after “cloud” is the most over-used, God-awful buzz-phrase of the past couple of years.

Basically, big data means aggregating and correlating very large sets of data, then analyzing them, visualizing them and using them to draw actionable conclusions.

Nota bene: very few of the things out there that claim to be “big data” actually are, in the purest sense – but that doesn’t really matter, because vendors have co-opted the phrase to mean, simply, data aggregation, analysis and visualization.

In other words, you as crime analysts do “big data” every day – or at least, you should.

All too often, though, we get trapped into using the same tools and creating the same products – because they work.

Also, many beginning crime analysts conflate visualization and analysis: placing stuff on a map is not analysis, it’s articulation. Merely providing geo-spatial visualization is incredibly useful, but it’s not analysis.

What is analysis, though, is a prioritized list: a subset of your total [whatever-you’re-measuring-here] placed on a map. That way the consumer of the intelligence product knows that each flag/tag/pin/dot on the map has already been analyzed and found to be of particular interest, so each dot is meaningful in the context of the question being asked – it’s not simply articulating facts.

A map showing the addresses of all burglaries in the past 30 days? Visualization.

A map showing the addresses of all burglaries in the past 30 days in which MO, stolen property and behavioral characteristics are similar? Analysis, visualized. (Also, a map with fewer pins.)
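To make the distinction concrete, here’s a minimal sketch in Python/pandas. Every column name, file name and code value below is a made-up stand-in for whatever your RMS export actually contains, and “similarity” is crudely approximated as an exact match on coded fields:

```python
import pandas as pd

# Hypothetical burglary export; every column name here is an assumption.
df = pd.read_csv("burglaries.csv", parse_dates=["reported"])

# Visualization: every burglary in the past 30 days goes on the map.
cutoff = pd.Timestamp("today") - pd.Timedelta(days=30)
recent = df[df["reported"] >= cutoff]
all_pins = recent[["lat", "lon"]]

# Analysis, visualized: only the incidents that look like the same offender,
# crudely approximated here by matching MO and stolen-property codes.
pattern = recent[(recent["mo_code"] == "REAR_WINDOW")
                 & (recent["property_code"] == "ELECTRONICS")]
fewer_pins = pattern[["lat", "lon"]]  # the map with fewer pins
```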

I think we often forget that the answer is in the data – we just need to ask it the right questions. It’s the little things about “big” data that get overlooked, at a firm’s peril. Let me give you a non-law-enforcement example:

If I’m signed into Delta.com, they know it’s me and they have access to my entire flying history. They know, for example, I’m a Diamond member (I fly a lot) and therefore automatically a (free) member of their SkyClub lounges. They know that, on this flight, my wicked-cheap economy fare was upgraded (free) to First Class (I fly a lot).

So now when I go to check in on their website, why would they show me an ad offering me the opportunity to buy a day’s worth of SkyClub for $39, and another one to “step up” to Economy Comfort for $19? Why not try to sell me something more relevant, something there’s a chance I might actually buy?

This is a problem that is so easily solved. They would have some chance greater than zero of selling me a whatsit if only they’d look at their data and stop trying to sell me something that they in fact know I don’t need.

That is really stupid.

Eric Olson talked to us in 2011 about how analysts need to look at their data with an eye to throwing most of it away in order to be left with a meaningful set against which to run queries. His concepts really make sense, and Dave and I try to use them all the time.

As a simple example, last week Dave and I were helping an agency select arrest warrants from its court database for a local operation, and our first step was to reduce the list from more than 4,500 candidates to fewer than 500 simply by removing everyone with (a sketch of this kind of filtering follows the list):

  • an incomplete address;
  • an address that couldn’t be geo-coded (indicating a problem with an element of it);
  • a warrant age over a certain number of days;
  • an address farther than 30 miles from the center of the jurisdiction;
  • and about three other things.
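We did this in a spreadsheet, but here is a minimal sketch of the same triage in Python/pandas, assuming a hypothetical warrants.csv export. Every column name, coordinate and cutoff below is an invented placeholder, not the agency’s actual data:

```python
import pandas as pd
import numpy as np

# Hypothetical court export; column names are assumptions.
df = pd.read_csv("warrants.csv", parse_dates=["warrant_date"])

# Illustrative jurisdiction center and cutoffs.
CENTER_LAT, CENTER_LON = 33.749, -84.388
MAX_AGE_DAYS = 365
MAX_MILES = 30

def haversine_miles(lat, lon, lat0, lon0):
    """Great-circle distance in miles from each point to a fixed center."""
    lat, lon, lat0, lon0 = map(np.radians, (lat, lon, lat0, lon0))
    a = (np.sin((lat - lat0) / 2) ** 2
         + np.cos(lat0) * np.cos(lat) * np.sin((lon - lon0) / 2) ** 2)
    return 3959 * 2 * np.arcsin(np.sqrt(a))

# 1. Drop records with incomplete addresses.
df = df.dropna(subset=["street", "city", "zip"])

# 2. Drop addresses the geocoder couldn't resolve (no coordinates).
df = df.dropna(subset=["lat", "lon"])

# 3. Drop warrants older than the cutoff.
age = (pd.Timestamp("today") - df["warrant_date"]).dt.days
df = df[age <= MAX_AGE_DAYS]

# 4. Drop addresses farther than 30 miles from the jurisdiction center.
dist = haversine_miles(df["lat"], df["lon"], CENTER_LAT, CENTER_LON)
df = df[dist <= MAX_MILES]

print(len(df), "candidates remain")
```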

Total time investment? Fifteen minutes.

Result? Every remaining address was at least an accurate (if not necessarily “good”) local one, backed by relatively fresh data.

Now we can get to work and start to ask useful questions that will pare that list down from 500 to the 50 that we believe will be good addresses – that is, addresses at which we will likely find our fugitives.

And 50 is a number that anyone can handle.

That’s not by any means “big data” – in fact, we were working with a spreadsheet. But if we learn from the way people look at big data, we can see that they spend most of their time doing similar operations.

Regardless of your opinion of the politics of the New York Times’ Nate Silver, his methodology is worth reading (and we note that he accurately predicted the electoral results in 50 out of 50 states this past election). Note that almost all of stages one, two and three consists of discarding and weighting data. His data is arguably not “big” data either, though we will say that it’s certainly bigger than the data most of us get to use on a regular basis.
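As a toy illustration of what “weighting” means here (this is not Silver’s actual model – the polls and the weighting formula are invented for the example):

```python
import numpy as np

# Hypothetical polls: (candidate share in %, sample size, days old).
polls = [(52.0, 1200, 2), (48.5, 400, 10), (51.0, 900, 5)]

shares = np.array([share for share, _, _ in polls])
# Trust large, recent polls more: weight grows with sample size and
# decays with age. The exact formula is an arbitrary stand-in.
weights = np.array([size / (1 + age) for _, size, age in polls])

print(f"weighted estimate: {np.average(shares, weights=weights):.1f}%")
```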

How do we feel when we realize that we had information about a certain individual in database two, but we only looked in databases one and three? Really stupid.

So look at your data: it’s not just telling you something – it’s telling you everything.