As part of their final-year projects, I get my students to source their own datasets. I have several reasons for this, but the main one is that you don’t really appreciate how messy data can be until you try to put together a suitable dataset yourself…
Over the last few years it has become easier to source many different types of data, although the Office for National Statistics website search is still a mess. However, every year I still find data embedded within publicly available documents, but not in a very usable form.
One of my pet peeves is data being made available in the form of PDFs rather than in a more useful format that can be easily imported into statistical packages. A current example of this is data about the Ebola outbreak being made available in PDFs. However, some people [such as @cmrivers] with better scripting ability than I have managed to turn it into something more useful and have popped it onto GitHub.
Whatever about issues surrounding making data available in the first place, if data is to be made public, make it available in a useful format!