Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
I’ve seen the above New York Times quote appear in dozens of Big Data blog posts and slide presentations since the referenced article was published in August. It is sinking in that algorithms applied to raw data won’t bring much insight. I wrote several posts about what this janitor work means in a Building Continuous Optimization context, here and here. For a deeper level of detail, consider this slide from the OSIsoft Real-Time Building Data project presentation put together by Carnegie Mellon University researcher Bertrand Lasternas: