Thursday, May 23, 2013

Summer of Data Cleaning

I love almost everything about research: collecting data, reading theory, writing articles, analyzing data, etc. However, there is one thing I dislike very much about research, and that is processing all the data I collect so that I can analyze it. Even though el-wa7sh (my dissertation) came in at 400 pages (quite long for my field), it barely scratched the surface of the data I collected for it. On the one hand, this is useful, as I have lots of material to work towards tenure on. On the other, I have to do a considerable amount of data processing to make the data workable, and I also have to deidentify all of it in order to close the active study with the IRB and have them leave me alone. Data processing includes things like the dreaded transcription (I still have more to transcribe in a fashion suitable for linguistic, rather than content, analysis), blacking out names and pictures from Facebook pages (and of course the more active the user, and the better the data I got, the longer this takes), making nice tables for SPSS and linking them to my qualitative data, etc. Essentially, it is incredibly boring, and I have lots of it to do. While many academics resolve to write things over the summer, I have submitted two articles so far this year, so instead I'm resolving to finally get all of my dissertation data cleaned up (which then means more writing, a much more enjoyable task). I am secretly hoping that in the post-baby, breastfeeding, sleep-deprived state, this type of boring work will become challenging, and thus more interesting, but I have my doubts.

Planning classes for the Fall (that I won't be teaching due to maternity leave, but that's a different story) is my other summer occupation, but more on that later, including adventures in online language teaching!


  1. I have a ton of data collected via Facebook to de-identify - I'm curious, what do you find is the best way to do it?

  2. I use a program called PDFPen, which is Mac-only. It allows all sorts of PDF editing, so I save the page as a PDF. To black out faces, I use the rectangle tool, which lets me insert a black (or whatever color I choose) rectangle over the face. I actually made a special key command for Select Rectangle to make it go faster. Removing names is where PDFPen really shines, as there is a search-and-replace command, so I can replace the real names with pseudonyms, or redact them to nothing or a box. Since it replaces every occurrence at once, this can save a lot of time, as the same people tend to show up over and over on the page. It is still time-consuming and boring, but it is much faster than blacking out each name individually, as I was attempting to do before. PDFPen also crashes a lot when I do this, which is annoying, but it is still the best solution I've found so far. Let me know if you know of something else!
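
For data that can be exported as plain text (transcripts or copied post text, for example), the same replace-all idea could be sketched in a few lines of Python. This is only a rough sketch and has nothing to do with how PDFPen works internally: the `pseudonymize` function, the participant names, and the pseudonyms are all made up for illustration, and it does nothing about pictures or about names inside a PDF.

```python
import re

def pseudonymize(text, mapping):
    """Replace each whole-word occurrence of a real name with its pseudonym.

    `mapping` is a hypothetical dict of real name -> pseudonym.
    Matching ignores case, so "karim" and "Karim" are both replaced.
    """
    if not mapping:
        return text
    # Try longer names first so "Mona Hassan" is matched before "Mona".
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    # Lower-cased lookup table so case-insensitive matches find their pseudonym.
    lookup = {name.lower(): pseudo for name, pseudo in mapping.items()}
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)

# Invented example data, not real participants.
mapping = {
    "Mona Hassan": "Participant A",
    "Mona": "Participant A",
    "Karim": "Participant B",
}
sample = "Mona Hassan tagged Karim. karim replied to Mona."
print(pseudonymize(sample, mapping))
```

Keeping the name-to-pseudonym mapping in one place means the same person gets the same pseudonym everywhere they appear, which is the main time-saver of replace-all. Nicknames, misspellings, and names written in another script would each need their own entry in the mapping, so the output still needs a read-through.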