Content analysis is a common data analysis process whereby researchers investigate the content within a message or text. It is often described as a replicable, systematic, objective, and quantitative description of communication content features within a specific context. Content analysis provides structured techniques for analyzing communication content that is commonly open-ended and fairly unstructured. The purpose of a content analysis may vary, from describing the characteristics or features of the content to drawing inferences about the cause and/or effect of the content. Content analysis techniques can be applied to a wide range of public and private written content, such as letters, newspaper articles, open-ended survey data, and transcribed interviews, as well as content in oral (e.g., live speeches or lectures) and visual form (e.g., videotaped interactions, pictures, film).
Content analysis allows researchers to examine and describe both the manifest and latent content meaning in a message. Manifest content refers to the surface or visible features of the message that need little interpretation by the reader. Manifest content analysis commonly includes features that are physically present and countable within a message. For example, a researcher may count the number of negative words or phrases used during a couple’s discussion about a current disagreement to better understand the couple’s conflict communication. Latent content refers to the underlying features or meaning within the manifest content. Latent content is the deep structural meaning conveyed in the message and requires more interpretation. Returning to the couple’s conflict example, a researcher may examine the communication content for features of power or dominance displayed by each individual during the conflict. Both manifest and latent content analysis require some interpretation; they differ in depth and level of abstraction.
The remainder of this entry discusses the process for conducting a content analysis, specifically sampling and data types, coding units, coding scheme, and code book. It also discusses coding, reliability, and finalizing data. Finally, this entry also provides a brief summary of some of the benefits and drawbacks in conducting a content analysis.
Based on the study’s objectives and research questions, the researcher will need to determine the sampling framework or data type relevant to the study. Sampling involves identifying and selecting the communication content the researcher intends to analyze. The sampling and data type largely depend on the nature of the communication content: whether it is open-ended survey responses, videotaped interactions, speeches, art, letters, or a television series, the sampling or data collection differs. For example, a researcher interested in advertisements in an online magazine needs to select a specific magazine and decide how many issues, and which years of publication, to examine. In a different study, a researcher may want to know how, if at all, individuals talk about infertility and may use survey-based data to collect open-ended narratives on how individuals discuss infertility. Following this sampling or data collection stage, the researcher then needs to decide on the unit of communication content or text to focus on during the coding process.
A coding unit refers to a specific portion of content or text to be coded. The researcher selects the coding unit based on the previously established objectives and research questions the study is designed to address. Broadly, coding units may include words, phrases, sentences, paragraphs, images, or an entire document or interaction.
Klaus Krippendorff proposed five different types of coding units for content analysis research. First, a researcher may code for physical units, which means counting the quantity of, or space devoted to, content. For example, this may include counting the number of articles published on children (e.g., 18 years old or younger) in communication journals in the past decade. A second type of coding unit involves counting references to people, objects, or issues within the content, which is commonly referred to as the referential unit. For example, in conducting a referential unit content analysis, a researcher may watch the television series Friday Night Lights and count the various issues that emerge, such as drugs, pregnancy, abuse, and bullying. A third form of coding unit is the syntactical unit. This type of coding unit involves examining words, sentences, paragraphs, or complete documents to count the number of times certain words or phrases are used within the content. For example, a researcher may want to examine how often men and women use the phrase “I’m sorry” or “I apologize” during a voice-recorded conversation of a couple’s disagreement. The fourth coding unit is the propositional unit (i.e., thought unit). This unit focuses on coding each time a person expresses or asserts his or her thoughts about a specific topic. This unit may range from a few sentences to multiple paragraphs. For example, in a study on parents’ and adult children’s reasons for estrangement, each reason for estrangement may vary from one to several sentences but counts as one unit. The final coding unit is the thematic unit, which commonly involves larger sections of communication content or text. This unit of analysis might involve asking participants to share a detailed story about a traumatic event or experience in their lives; the content would then be analyzed based on overall features or categories that emerge from the narratives.
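As an illustration of the syntactical unit described above, counting target phrases in a transcript can be sketched in a few lines of Python. The transcript and target phrases below are invented for illustration, not taken from an actual study:

```python
import re

def count_phrases(transcript, phrases):
    """Count case-insensitive occurrences of each target phrase in a transcript."""
    text = transcript.lower()
    return {phrase: len(re.findall(re.escape(phrase.lower()), text))
            for phrase in phrases}

# Hypothetical snippet of a couple's recorded disagreement.
transcript = "I'm sorry I forgot. Well, I apologize too. I'm sorry we argued."
counts = count_phrases(transcript, ["I'm sorry", "I apologize"])
print(counts)  # {"I'm sorry": 2, "I apologize": 1}
```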
Once the researcher has decided on the coding unit, the next step is to develop a coding scheme. A coding scheme involves developing specific categories that will be used to analyze the content. In this process, the researcher may use inductive or deductive methods to derive the coding categories. Inductive methods allow categories to emerge freely from the data, whereas deductive methods use established theory to guide the development of the categories. In parallel with the inductive or deductive process, the researcher must also make sure the categories are mutually exclusive (each coding unit fits in one and only one category) and exhaustive (every coding unit examined fits into one of the categories). This is often a time-consuming step in the content analysis process, as the researcher examines the content multiple times until there are clearly defined categories that are mutually exclusive and exhaustive. Researchers normally pilot-test the categories on a small sample of the data before beginning the full-scale content analysis. Piloting is important for identifying problems in the coding scheme and determining whether categories need to be added or collapsed.
The resulting final categories are detailed in a code book wherein each code is assigned a number, and each category is described (see Table 1). The code book helps to ensure clear, replicable, and systematic coding of the data. Once the code book is finalized, coding and reliability checks begin.
Table 1 Code Book Example

| Code | Category and Description |
|------|--------------------------|
| 1 | Abuse: Includes emotional, psychological, sexual, and physical forms of abuse. |
| 2 | Beliefs: Includes differences in religious, spiritual, sexual, and/or moral belief systems. |
| 3 | Deception: Includes lying and manipulation. |
| 4 | Control: Includes absence of privacy or intrusiveness. |
| 5 | Drug/alcohol use or abuse |
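When coding is carried out with software assistance, a code book like Table 1 can be represented as a simple lookup structure. The sketch below is a hypothetical illustration mirroring the table, not part of any published code book:

```python
# Hypothetical code book mapping numeric codes to category descriptions (after Table 1).
CODE_BOOK = {
    1: "Abuse: emotional, psychological, sexual, and physical forms of abuse",
    2: "Beliefs: differences in religious, spiritual, sexual, and/or moral belief systems",
    3: "Deception: lying and manipulation",
    4: "Control: absence of privacy or intrusiveness",
    5: "Drug/alcohol use or abuse",
}

def label(code):
    """Translate a numeric code back to its category description."""
    return CODE_BOOK[code]

print(label(3))  # Deception: lying and manipulation
```

Keeping the category definitions in one place helps ensure that all coders work from an identical, replicable scheme.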
Assigning a unit of content (e.g., letters, speeches, pictures) to a category is referred to as coding, and the individuals conducting the coding are called coders. Content analysis often involves a minimum of two coders so that the researcher can establish intercoder reliability. The researcher may serve as one of the coders, or may recruit two coders, preferably individuals who are blind to the study’s research questions so that they are not motivated by any bias to code in favor of a particular outcome. At this stage of the analysis, the researcher carefully trains the coders to use the coding categories established in the code book by having them code a small section of the data. This is often a lengthy process that can take multiple training sessions to ensure that the coders are prepared before they independently code the entire dataset or a percentage of overlapping data. To reach reliability, the two coders must establish consistency between their codes, or what is often referred to as intercoder reliability. In communication research, an acceptable intercoder reliability score is equal to or greater than .70. One way to calculate intercoder reliability is a simple reliability coefficient, which reflects twice the number of unit agreements between the two coders divided by the total number of units each coded, given by the following formula:
RC = 2A / (U1 + U2)
where RC = reliability coefficient, A = number of units agreed upon by the two independent coders, U1 = number of units identified by coder 1, and U2 = number of units identified by coder 2.
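This coefficient is straightforward to compute. The Python sketch below uses made-up unit counts purely for illustration:

```python
def reliability_coefficient(agreements, units_coder1, units_coder2):
    """Reliability coefficient: RC = 2A / (U1 + U2)."""
    return (2 * agreements) / (units_coder1 + units_coder2)

# Hypothetical example: 40 agreed-upon units, 50 units identified by each coder.
rc = reliability_coefficient(agreements=40, units_coder1=50, units_coder2=50)
print(rc)  # 0.8, which meets the .70 benchmark
```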
However, most communication researchers prefer to use a more robust measure to calculate intercoder reliability called Cohen’s kappa. Cohen’s kappa takes into account the coder agreement that would be expected based on chance alone. Here is the formula:
K = (Po – Pe) / (1 – Pe)
In this formula, K = kappa, Po = the observed agreement among coders, and Pe = the agreement expected by chance alone. Although these formulas are helpful for calculating reliability by hand, a statistical software program such as SPSS allows a researcher to enter both coders’ category scores and quickly compute coder agreement and Cohen’s kappa.
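Cohen’s kappa can also be computed in a few lines of code. The sketch below, with hypothetical category codes for ten units, follows the formula above: observed agreement minus chance agreement, divided by one minus chance agreement:

```python
from collections import Counter

def cohens_kappa(codes1, codes2):
    """Cohen's kappa for two coders' category assignments over the same units."""
    n = len(codes1)
    # Po: proportion of units on which the coders agree.
    po = sum(a == b for a, b in zip(codes1, codes2)) / n
    # Pe: chance agreement, from the product of each coder's marginal proportions.
    c1, c2 = Counter(codes1), Counter(codes2)
    pe = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    return (po - pe) / (1 - pe)

# Hypothetical codes assigned by two independent coders to the same ten units.
coder1 = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
coder2 = [1, 2, 2, 3, 2, 2, 3, 1, 1, 2]
print(round(cohens_kappa(coder1, coder2), 2))  # 0.69
```

Because kappa discounts agreement expected by chance, it is typically lower than the raw reliability coefficient for the same data.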
After all data have been coded and intercoder agreement has been established, the final step of the content analysis is for coders to meet and resolve any differences. This process involves coders going through the content together and coming to agreement on any codes that differ so that only a single code is assigned to each unit of data. Then, the coders supply their results to the researcher. The researcher then examines the results, explaining descriptive statistics (e.g., total number and percentages for each category), providing qualitative exemplars, and running further statistical tests, if necessary, to answer the study’s research questions.
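The descriptive statistics mentioned above (category totals and percentages) can be tallied directly from the final agreed-upon codes. The codes in this sketch are hypothetical:

```python
from collections import Counter

def category_summary(final_codes):
    """Return (count, percentage) for each coded category."""
    counts = Counter(final_codes)
    n = len(final_codes)
    return {code: (count, round(100 * count / n, 1))
            for code, count in counts.items()}

# Hypothetical final codes for ten units after coders resolved disagreements.
final_codes = [1, 1, 2, 3, 2, 2, 5, 1, 2, 4]
print(category_summary(final_codes))  # e.g., category 2: (4, 40.0)
```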
Content analysis is a useful research approach that can be applied to a wide variety of small and large bodies of content or text. This technique allows a researcher to collect more in-depth and rich data in a systematic, replicable, and objective way. In addition, a researcher can ask more complex questions about how communication content (i.e., what people actually say or write) relates to attitude and behavior variables in a study. For example, a study could use a content analysis to evaluate memorable parent-teenager sex-talk messages and then run statistical tests to see whether there is a relationship between the messages and teenagers’ self-reported sexual attitudes and behaviors.
Furthermore, content analysis is often an unobtrusive approach because researchers can examine communication content in written or oral form in more naturally occurring contexts compared with experimental contexts. For example, a researcher may collect individuals’ holiday letters to be analyzed or use previously recorded interactions of couples telling the story of how they fell in love. All these features add depth, rigor, and creativity to a study.
However, content analysis also has its drawbacks. Depending on the availability of the data or the sampling approach, it can introduce sampling bias; for instance, the data may be collected or chosen in a way that makes some data less likely to be included than other data. In addition, the process of developing the coding categories, as well as coding the content, involves some level of interpretation that may introduce researcher or coder bias. This can happen when an individual asserts his or her own opinions or knowledge of the subject when interpreting the data. Researchers also need to be aware that when examining a specific coding unit (e.g., a word, phrase, or paragraph) in isolation from the larger content or context, they run the risk of losing or altering the meaning of the content.