Late last year I, along with a group of talented colleagues, was helping to organize a National Academies workshop on communicating the quality of federal data. The issue, at first glance, seems like it should be simple: If it’s a survey, present margins of error on the estimates. If it’s administrative data, provide metadata. Then call it a day.
However, as many readers will understand, data quality is much more complicated and nuanced than that. Quality is a huge umbrella under which sample size, universe, coverage, timeliness, representativeness, and many other dimensions fall. In short, it’s the series of questions any good data scientist will ask themselves when trying to pick the right dataset to answer a question: