22nd April 2012 at 9:41 pm #2447
I am a PhD student interested in discussion around measuring reliability when examining qualitative data. I am aware of inter-rater reliability with respect to percentage of agreement, but I have been asked to explore correlational measures in the data I am working with, which is coded thematically from interview transcripts. Has anyone investigated ways of demonstrating/measuring the extent to which raters agree on their ratings?

23rd April 2012 at 1:39 am #2463
The essence of checking reliability for coding qualitative data is to make sure the coding is consistent across time and people. We therefore need to show that the coder codes the data consistently over time (intra-coder reliability) and that her coding would not be significantly different if others were to code the same data (inter-coder reliability).

The usual practice has been to randomly choose about 15-20% of the data (if there are 100 pages of interview transcription, for example, 15-20 pages) and make three copies of them. The first coder (usually the main coder) codes the first copy of the sample using the devised coding scheme. After about 1-2 weeks s/he codes the second copy of the sample data. The correlation between the two codings yields intra-coder reliability. The third copy of the sample data is coded by another coder who has been trained to code using the devised coding scheme; it is even recommended that there be some practice and discussion sessions before s/he embarks on coding the sample data. Once the third copy has been coded by the second coder, the correlation between the first coder's coding and the second coder's coding is used as an index of inter-coder reliability. Usually the intra-coder index is higher than the inter-coder index, for obvious reasons. Once intra- and inter-coder reliability indices fall within reasonable ranges, the first coder can do all the coding with the justification that the coding is systematic and consistent.
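The procedure above comes down to computing an agreement index between two codings of the same sample. As a minimal sketch (the thematic codes and segments below are invented purely for illustration), percentage agreement and Cohen's kappa between two coders can be computed with nothing beyond the Python standard library:

```python
from collections import Counter

def percent_agreement(codes_a, codes_b):
    """Proportion of segments to which both coders assigned the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

def cohens_kappa(codes_a, codes_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(codes_a)
    p_o = percent_agreement(codes_a, codes_b)
    freq_a = Counter(codes_a)
    freq_b = Counter(codes_b)
    # Expected chance agreement from each coder's marginal code frequencies
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical thematic codes assigned to 10 transcript segments
coder1 = ["identity", "agency", "identity", "support", "agency",
          "identity", "support", "support", "agency", "identity"]
coder2 = ["identity", "agency", "support", "support", "agency",
          "identity", "support", "agency", "agency", "identity"]

print(percent_agreement(coder1, coder2))        # 0.8
print(round(cohens_kappa(coder1, coder2), 3))   # 0.701
```

Note that kappa (0.701) is lower than raw agreement (0.8) because it discounts the agreement the two coders would reach by chance alone, which is why it is usually preferred for categorical codes.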
Hope this helps.
Mehdi Riazi

23rd April 2012 at 1:55 am #2462
Thank you Mehdi. Your comments are a timely reminder for me to consider intra-coder reliability as well. What are your thoughts on consensus estimates like Cohen's kappa versus, say, a consistency estimate like Pearson's r or Cronbach's alpha? How important do you think these are in reporting the results of qualitative data? Thanks again.

23rd April 2012 at 2:54 am #2461
You're welcome Susan. I've seen all three used and reported for this type of reliability, with Cohen's kappa being the most common.

23rd April 2012 at 11:38 pm #2460
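The distinction Susan raises matters in practice: a consistency estimate such as Pearson's r can be perfect even when two raters never agree exactly, which a consensus estimate would flag immediately. A small illustrative sketch (the 1-5 ratings are invented; plain Python, no libraries assumed):

```python
def pearson_r(x, y):
    """Pearson correlation: a consistency estimate for paired ratings."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def exact_agreement(x, y):
    """Consensus estimate: proportion of items rated identically."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

# Hypothetical 1-5 scale ratings: rater B sits exactly one point above rater A
rater_a = [1, 2, 3, 4, 2, 3]
rater_b = [2, 3, 4, 5, 3, 4]

print(pearson_r(rater_a, rater_b))        # 1.0 - perfectly consistent
print(exact_agreement(rater_a, rater_b))  # 0.0 - zero consensus
```

This is one reason kappa or raw agreement is the more informative report for categorical thematic codes, while correlational measures like Pearson's r or Cronbach's alpha suit ordinal or interval ratings where near-misses should still count for something.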
You haven't mentioned the context of your concern about reliability coding, other than that it is student work. The point is that in a very large proportion of circumstances where qualitative data are used, inter-rater reliability of coding is simply inappropriate, while in others it is necessary. If your analysis is interpretive/inductive/emergent (as in grounded theory, for example), then there is no reason why someone else, potentially with a different perspective/question/concern/experience, would interpret your data the same way you have. You would then have to train them, as Mehdi has suggested, and give them a very structured coding guide (which is antithetical to the method), and what does that prove other than that you can train someone to think like you? If, however, you are working in a team, then you need to build team consensus about what you are looking for, so together you might come up with a coding system (while retaining some flexibility) and develop systems to check and maintain consistency. Or, if you want to convert your qualitative coding to variable data for statistical analysis, then you need a higher level of consistency in the coding.
So – there is no one answer or pathway to the reliability issue!

23rd April 2012 at 11:53 pm #2459
Yes, that is true, and thank you Pat for raising this. The concern I have is the extent of 'proof' of agreement between two raters that I am being asked to demonstrate. My analysis uses an established theoretical framework. I have no desire to convert to a statistical analysis, so what would be the accepted norm in terms of establishing reliability in this instance – or is there none?

27th April 2012 at 6:37 pm #2458 Roger Gomm (Member)
In response to Pat's assertion that inter-rater reliability is sometimes inappropriate, I would say that in making your methods (your route to what you claim you have found) accountable, and in sharing your findings, you have to devise codings which are shareable, and that inter-rater reliability is always an important issue. After all, the end product a researcher is aiming for is something other people will agree with. By shareable I don't mean that other people have to agree with you, but that they should be able to see and follow the same classification procedures that you have used. In that sense the collaborator(s) who serve(s) as the other coder(s) stand(s) as a proxy for your readership, and an inter-rater reliability test should at least show up incomprehensible, unprincipled and idiosyncratic coding.
In addition, inter-rater reliability testing may also serve as a check on intra-rater reliability, insofar as disagreements between raters may be the result of inconsistency on the part of the researcher.
A poor level of inter-rater reliability can be interpreted as deriving from the coding behaviour of either (or all) coders. So a poor result should be the start of an investigation into the disagreements, rather than an end in itself.
With regard to intra-coder reliability, Pat is hinting at a very real problem where the research is emergent, because you would expect a coder to disagree with him/herself after sufficient time had elapsed to allow for enough forgetting to make the intra-coder reliability test a sensible procedure. Faced with this problem, the best antidote for intra-rater issues is an inter-rater test on the final coding scheme.

27th April 2012 at 10:16 pm #2457
Thank you Roger. Your comments have been very helpful and have assisted me in further developing the background thinking for my reliability measures.

3rd May 2012 at 4:00 pm #2456 bernard smith (Participant)
I think the question of inter- and intra-coder reliability is fascinating, not because we need to find ways to assure ourselves of the reliability of the work (although I imagine that is an important issue for some research), but because we might in fact treat seriously, as our very topic, the issues that Susan seems to be working to negate or resolve. In other words, as qualitative researchers it seems to me that we should focus on how people IN FACT make sense of data and so code, rather than look for ways to neutralize their sense-making in an attempt to provide for coder interchangeability. It's a little like the (apocryphal?) story of the psychologist who, rather than assume the intelligence of the children who scored below 100 on the IQ test, asked them to explain their answers. To his astonishment, the answers he received showed incredible creativity and thoughtfulness. That they were not the answers in the answer book mattered to those looking to measure the IQs of those kids; that the answers the children provided made great sense is what, sociologically speaking, helped show how "intelligence" is socially constructed.

3rd May 2012 at 10:44 pm #2455
What Bernard has said points to the way that I would prefer to argue reliability for a single investigator study – not as a measure, but to be very transparent in explaining how I arrived at the codes I am using, to provide examples of what is and is not included, and to show how I am using them, i.e., be thorough in your documentation and in keeping an audit trail of your thinking about them, in particular, showing how they both reflect the input of your participants (or other data) and serve the purpose of your research. So – like the children who can explain why they do things in non-conventional ways, you can explain why you’ve done what you did. This is far more powerful than an artificially constructed measure, whatever measure you use – and that is what I would be putting to the person who asked you to produce measures (presumably a thesis supervisor).
Incidentally, a number of the qualitative software programs will calculate reliability measures for you (such as kappa, and % agreement/disagreement), but more usefully, some (including NVivo, with which I am most familiar) will show you visually who coded which bit of text with a particular code – thus providing a basis for discussion about why you did or did not code a passage a certain way. If you are learning to code, this can be useful, or if you are in a team and need to coordinate coding, or if you are training a research assistant to think like you!

4th May 2012 at 12:08 am #2454
Thank you Pat and Bernard for your thoughts. You are correct, Pat, in that it is a requirement/suggestion (?) from a supervisor that I produce measures to demonstrate reliability in my coding. As a practitioner in my field of study (though not in research methodology) for over 30 years, I have carefully documented all my research processes and feel confident that I can explain what I have done and why. I am finding it very difficult even to think of a way of showing inter-rater agreement for some unexpected findings in the research which have ended up being quite significant. I have used NVivo and the kappa function for some of the data, but am puzzled about just how, or indeed whether, I should explore reliability for the unexpected outcomes. It's complicated!

4th May 2012 at 2:36 am #2453
To me it seems what Bernard nicely illustrated with the example of the "intelligence test" is more related to ontology (the reality; the object), and thus to the validity of the inferences we try to make from the data, than to the methodological issue of ensuring consistency in coding the data for analysis, which is what is expected in research reports and what Susan's supervisor asked for evidence of. The two aspects of the "intelligence test" Bernard elaborated on signify two ontological positions – positivism vs. social constructionism – which certainly pertain to our conceptions of what "intelligence" is and which will bear on the validity of the inferences we make from the data we have. This, I think, needs to be taken care of at the level of developing the coding scheme: the conceptualisation of the constructs to be coded in the data; the nature of what "intelligence", for example, is and what its constituents are. For the psychologist in Bernard's example to be able to show "creativity" and "thoughtfulness" (as two constituents of intelligence) in students' explanations of their answers, s/he certainly needed to code instances of students' utterances as examples of "creativity" and "thoughtfulness". It is, I think, at this level that the consistency of coding instances of what the psychologist defined/conceived as "creativity" and "thoughtfulness" comes into play. The question is whether we need to see any evidence that instances of "creativity" and "thoughtfulness" (a socially constructed definition, presumably) are coded in the same way all through the data sets, or not.
Mehdi

4th May 2012 at 3:21 am #2452 bernard smith (Participant)
Mehdi, while I am not disagreeing with you, I think my point lies elsewhere. What I am trying to suggest – and my point takes Susan's question for a walk, as it were – is that rather than treat inter- and intra-coder reliability as a problem to be solved (which is what Susan's research supervisor is asking), should we not be treating inter- and intra-coder reliability as a topic for exploration? Asking: how do we code? What are our practices? When and how do coders reach consensus? What prevents them? Those kinds of questions problematize – make visible – the practices we are in fact engaged in when we code data.
That, for all intents and purposes (sometimes more and sometimes less), we DO solve the problems of reliability among coders and treat the results of such solutions as the data simply ignores the work that is done in solving the problem. That work is pushed backstage. The REAL work is done afterwards. In other words, we (i.e., members of society) seem to want to treat our construction of reality – in every arena, from medical errors to suicide to social problems (my background is in medical sociology) – as something other than the work of our practices, when, in my opinion, the work we all do in constructing the reality that then confronts us, and which we need to live with and by and through, should be the subject of our research. But as social scientists we tend to act as if the everyday practices of folk (our own practices included) are problems to be mitigated, ameliorated, attenuated, removed, so that we can focus on reality much as those in the physical sciences do, and so view our relationship to this reality as fundamentally and ontologically distinct. In other words, to be good social scientists we – report – our – observations (and we hide the fact that our practices constitute the reality that we then report). I guess I am arguing that the only thing that makes good sense is to explore how we construct our reality – including, for example, how Susan and her colleagues code data, or how Susan codes data yesterday and today and tomorrow – rather than to be convinced that there is strong inter-coder and intra-coder reliability. But my question is not the question that either Susan or her supervisor is interested in exploring. They want to treat it as a problem to be solved and not a question to explore.

4th May 2012 at 3:44 am #2451
Seems to me that your supervisor is asking you to do something completely irrelevant to the issue you have – how to justify/explain/validate your unexpected findings. The more appropriate response would be:
a) explain carefully how the findings came about, including a detailed description, examples, and importantly, the context in which they have arisen – and then note why it is significant
b) 'theoretically sample' either amongst your existing participants or among new participants specifically chosen for the likelihood that they can give you more information on this particular aspect of your work.
From all that, you should be able to say where this comes up, for whom, etc.

4th May 2012 at 4:57 am #2450
Got your point, Bernard – it's a bitter reality that genuine research practice requirements sometimes turn into cliché formulations. I absolutely agree that we need to treat intra- and inter-coder reliability as a topic for exploration and explanation. My experience with qualitative data (L2 literacy and test validation) has been to devise a coding scheme based on the data and related literature, finalise it with co-researchers through discussion, and then use the coding scheme to codify the whole data set for subsequent analysis and inference. My understanding has been that by devising a coding scheme derived from the data sets, in consultation with the related literature (the community of researchers) and in discussion with like-minded people (co-researchers), we in fact socially construct the object of the study through defining key and significant concepts in a meaningful and systematic way. Once such a coding scheme is devised, its consistent use for coding the whole data set can be checked between myself and one of the co-researchers (who contributed to the development of the coding scheme).