Team-Based Mixed-Methods Project Management using R, SQL, & Regex - Part I.

25 August 2017

This is the first of what I intend to be a series of posts documenting the troubleshooting adventures, ideas/solutions, and (some) results of my role as the project coordinator, data manager, and technological administrator for a team-based, semi-large-scale, mixed-methods research and data analytic project using, as the title implies, R, SQL, and (Perl-based) regular expressions (i.e., Regex).

About the project:

The project in question is a national telephone interview survey study focused on informing intimate-partner-violence-related policy and program implementation across the United States. The data were generated via semi-structured telephone interviews conducted with one or two representatives from each state’s committee or organization tasked with implementing and/or overseeing standards related to intimate partner violence perpetrator intervention programs.

Data Sources & Datasets

There are currently two primary datasets resulting from this study:

  1. A spreadsheet (“data/states-sim.csv”) containing the quantitative/discrete survey data, along with additional relevant data points that I collected from various public data sources.[1]
  2. An SQL database containing qualitative data generated from responses to the survey’s open-ended questions. These qualitative data were extracted from the transcript for each interview conducted and are organized according to the survey’s eight major sections. Within each of the eight sections, the source text is divided into excerpts pulled from each interview transcript. Each excerpt begins with a single line of text containing (1) the original transcript’s filename, which consists of the state name abbreviation and, for states with two interviews, an underscore (“_”) followed by either an “A” or “B” specifying the transcript, and (2) the character-count-based location of the excerpt in the original transcript.
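To make the excerpt header line concrete, here is a minimal sketch of how such a line could be parsed with a regular expression. It is written in Python only for illustration, and the header layout it expects (something like "OR_A.txt [1523:2987]"), the ".txt" extension, and the names HEADER_RE and parse_header are all assumptions of mine; the actual layout in the project database may differ.

```python
import re

# Hypothetical header layout (an assumption, not the project's actual format):
#   "<STATE>[_A|_B].txt [<start>:<end>]"
HEADER_RE = re.compile(
    r"^(?P<state>[A-Z]{2})"                 # two-letter state abbreviation
    r"(?:_(?P<interview>[AB]))?"            # optional "_A" or "_B" suffix
    r"\.txt\s+"                             # assumed extension and separator
    r"\[(?P<start>\d+):(?P<end>\d+)\]\s*$"  # assumed character-offset range
)


def parse_header(line):
    """Return the header fields as a dict, or None if the line is not a header."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        return None
    return {
        "state": m.group("state"),
        "interview": m.group("interview"),  # None for single-interview states
        "start": int(m.group("start")),
        "end": int(m.group("end")),
    }
```

Under these assumptions, parse_header("WA.txt [0:410]") yields a record whose interview field is None, while ordinary excerpt text returns None and can be treated as the body of the preceding excerpt.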

The latter database is my biggest hurdle in terms of team-based data analysis coordination and management, as this database is, perhaps obviously, the source for my team’s qualitative coding and data analytic phase of our overall mixed-methods analytic process. For the team-based coding, I trained my team members in using the RQDA R package and created six identical RQDA project files, one for each member of the team (including myself). The RQDA project files were “pre-loaded” with the codebook we developed for the project. This codebook is hierarchically organized, with codes separated into categories representing each of the 19 open-ended survey questions.

Exactly two team members assigned themselves to code each of the 19 open-ended survey questions. The codings need to be evaluated for inter-rater reliability between the two assigned coders for each open-ended question. Thus, the six databases need to be merged. This is a simple process in a simple world where all other things are constant and equal. Unfortunately, reality has never been simple, constant, or equal. In addition, my supervisor and I would like the codings to also be linked to the original transcript file, which is unfortunately, and through my own fault, not something the qualitative database was originally set up to handle (at least not programmatically).

Hence, my goal here is to find a way to programmatically merge the six databases and link the coding data to the original transcript files.
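As a rough illustration of the merging half of that goal, the sketch below stacks the codings from several per-coder databases into one table and tags each row with its coder. It is written in Python only for illustration and assumes each project file can be opened as an SQLite database with a table named coding holding the columns cid, fid, seltext, selfirst, and selend; that simplified schema, the coding_merged table, and the merge_codings function are assumptions of this sketch, not a description of the actual project files.

```python
import sqlite3


def merge_codings(coder_dbs, merged_path):
    """Copy each coder's `coding` rows into one table, tagged by coder.

    coder_dbs: dict mapping a coder label to that coder's database path.
    The `coding` schema used here is an assumed, simplified one.
    """
    merged = sqlite3.connect(merged_path)
    merged.execute(
        """CREATE TABLE IF NOT EXISTS coding_merged (
               coder TEXT, cid INTEGER, fid INTEGER,
               seltext TEXT, selfirst INTEGER, selend INTEGER)"""
    )
    for coder, path in coder_dbs.items():
        src = sqlite3.connect(path)
        rows = src.execute(
            "SELECT cid, fid, seltext, selfirst, selend FROM coding"
        ).fetchall()
        src.close()
        # Prepend the coder label so rows stay attributable after merging.
        merged.executemany(
            "INSERT INTO coding_merged VALUES (?, ?, ?, ?, ?, ?)",
            [(coder, *row) for row in rows],
        )
    merged.commit()
    return merged
```

Keeping the coder label as a column, rather than merging rows in place, means the two coders assigned to a question can later be compared side by side, provided the code and file ids still agree across the six project files.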


  1. e.g., state-level violence-related data from the CDC’s Violence Prevention Data Sources, and sociodemographic and public opinion data available from Pew Research Center and Gallup.com’s public polling data reports.



     
Design by Rachel ("Riley") Smith-Hunter based on the Tufte CSS theme by Dave Liepmann & Edward Tufte

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.