Tomorrow and Saturday, I’ll be attending the Sloan Sports Analytics Conference in Boston. For those of you who haven’t heard of it, it’s a conference where stat geeks get together to mingle and give presentations on recent advances in sports statistics. I’m going to post updates on the more interesting presentations at the conference over the next couple of days.
The area with the most buzz right now seems to be basketball – with 8 out of the 20 research presentations/posters accepted being about basketball (this may also have something to do with the fact that the conference is the brainchild of Daryl Morey – the stats-centric GM of the Houston Rockets).
The most popular sport in the world, soccer, has only one paper accepted. Its topic is a bit flimsy by advanced statistical analysis standards: the impact of high altitude on match outcomes. Compare this to the state of play in baseball, and it’s safe to say soccer is light years behind.
It’s easy to understand why. Baseball is simple compared to other sports, with most confrontations being one-on-one (batter vs. pitcher, fielder vs. batted ball, etc.). Because of this, baseball statistics are easy to collect, even in real time, and any stat geek can gain access to reams of data free of charge. The biggest advances of late have been in fielding statistics, which require another layer of data collection but are still relatively straightforward.
Soccer, on the other hand, is faster moving and more fluid, with many interdependencies between players. Many people have written about the difficulties of applying advanced statistical analysis to soccer; there’s an interesting post on this subject at the soccer blog Run of Play here: http://www.runofplay.com/2009/01/12/two-kinds-of-faulty-statistics-in-football/
After reading Moneyball and a few Sabermetrics tomes, I, like many people, started thinking about how advanced statistical techniques could be applied to sports beyond baseball, like soccer and American football. I, of course, came to the same conclusion as everyone else: the data wasn’t there. But in some ways, it didn’t really make sense. With all the money in sports, how could the data NOT be there? Why wouldn’t someone do whatever it took to collect it? The costs must outweigh the benefits, I was sure.
Coincidentally, I am on the board of a social enterprise called Digital Divide Data, which has the mission of helping the world’s poorest by creating sustainable IT jobs. DDD has been looking for ways to grow its impact by creating more jobs for the underprivileged in Southeast Asia. Lightbulb! What if we could train people to watch soccer matches and collect unbelievably detailed statistics? As soon as the idea was born, I couldn’t turn back.
StatDNA, the startup I now run, started officially in early 2010. We’ve trained 30 people and have had them analyzing in incredible detail the entire 2010 Brazilian Serie A and B leagues, with each game taking around 25 hours of analysis. We’ve also moved on to the English Premier League more recently and have over 20 million pieces of data in our database. Other companies do collect advanced soccer statistics, but no one can hold a candle to the amount of information we are collecting. We’ve collected data on things that matter like defensive pressure, shot power and passing trajectories, just to mention a few, that have never been collected before.
We know our data’s not perfect, since soccer is a difficult game to capture, but it’s definitely a step in the right direction. And we view this as just the first step. We’re just about to finish our first set of advanced analytics, and we’ve developed some pretty interesting concepts like “GoalEx,” the equivalent of RunEx from baseball. Goal expectancy is statistically modeled on every touch of the ball, and players are given credit for their contribution to goal scoring.
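To make the idea of per-touch credit concrete, here is a minimal sketch in Python. It is not the StatDNA model, which this post does not describe in detail. It assumes one simple scheme borrowed from the way run expectancy is typically used in baseball: each touch has a goal expectancy before and after it, and the player is credited with the change. The `Touch` class, the `credit_players` function, the player names, and all of the numbers are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Touch:
    """One touch of the ball (hypothetical record layout)."""
    player: str
    goalex_before: float  # assumed goal expectancy just before the touch
    goalex_after: float   # assumed goal expectancy just after the touch


def credit_players(touches):
    """Credit each player with the change in goal expectancy on their touches.

    This differencing scheme is an assumption made for illustration;
    a real model may assign credit quite differently.
    """
    credit = defaultdict(float)
    for touch in touches:
        credit[touch.player] += touch.goalex_after - touch.goalex_before
    return dict(credit)


# A made-up possession that ends in a goal (expectancy reaches 1.0).
possession = [
    Touch("Midfielder A", 0.02, 0.05),  # pass out of midfield
    Touch("Winger B", 0.05, 0.15),      # cross into the box
    Touch("Striker C", 0.15, 1.00),     # finish
]

credits = credit_players(possession)
for player, value in credits.items():
    print(f"{player}: {value:+.2f}")
```

Under this assumed scheme the credits across a possession add up to the total change in goal expectancy, so a player who moves the ball into a dangerous position gets some of the credit even without scoring or assisting.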
We have our own blog at blog.statdna.com and are on Twitter as statDNA, where we will post our findings over the coming months. We’re also hoping to light a fire under the soccer analytics world. While all advanced soccer statistics are currently proprietary, we are providing free access to over 300 games of data for soccer researchers, and are awarding a StatDNA research prize this fall. Let’s see if SSAC has a few more soccer papers next year.
As I go to SSAC with high hopes, I’m reminded of one of the central lessons of Moneyball: how long it took for sabermetrics to become mainstream in baseball. Hopefully, baseball has removed some barriers for the later-adopting sports.