Week 4 Responses QR

Respond to each colleague's post separately:

Colleague post #1: Listed below is my response to this week's discussion question. I look forward to your comments and questions.

Null hypothesis testing tells the researcher that something has occurred and that it is significant. However, it cannot tell us exactly what occurred. It gives the researcher the ability to state that what occurred did not occur by chance, but the researcher cannot state specifically what occurred. When testing to discredit the null hypothesis, flawed research can result if the researcher conducts the test incorrectly, leading to Type I and Type II errors. Schmidt (1996) argues that relying on significance testing has systematically retarded the growth of cumulative knowledge in psychology (p. 115), while Cohen (1990) argues that testing the null hypothesis encourages misinterpretation, misuse, and overconfidence, resulting in poor scientific judgment and distorted conclusions across studies (p. 1308). These two articles got me thinking about what else would be affected if the findings were incorrect. This is where my past classes came in: I began to examine threats to validity and reliability, specifically how flawed research compromises the validity and reliability of study findings. In research, there are two forms of validity: external validity and internal validity. External validity refers to how findings in one study could apply to a population in another study; this is known as generalizability (Pearl & Bareinboim, 2022). Thus, if the calculations in a study are done incorrectly, a researcher could argue that the findings do not apply to other settings because of x, y, z. Internal validity measures how well a study is conducted (its structure) and how accurately its results reflect the studied group (Cuncic, 2025, para. 2). The same argument applies here: if the statistical tests are not done correctly, the results will not be accurate and will not reflect the studied group (Cuncic, 2025, para. 2). Reliability refers to the consistency and reproducibility of measurements.
It assesses the degree to which a measurement tool produces stable and dependable results when used repeatedly under the same conditions (McLeod, 2024, para. 1). In the statistical-testing section of a report, this is where the researcher says that the x, y, z test was used to examine a, b, c, and that the test is deemed reliable because it has been used in x number of studies with y consistent findings. To the average person, this sounds valid. However, the average person does not consider how the test was conducted, what the population was, or whether the study was conducted correctly. I teach this concept in my business classes using late-night infomercials as the example, specifically how statistics and statements are skewed to sell a product. I tell my students: before you go out and buy the latest thing you saw on late-night TV, ask yourself whether you can find the research results they are talking about. If not, do not buy the product, because it is most likely a gimmick and they are using statistics to make it sound legitimate.

References

Abos, P. (2024). Validity and reliability: The extent to which your research findings are accurate and consistent. Retrieved February 4, 2026, from https://www.researchgate.net/publication/384402476_Validity_and_Reliability_The_extent_to_which_your_research_findings_are_accurate_and_consistent

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.

Cuncic, A. (2025). Internal validity vs. external validity in research. Retrieved February 4, 2026, from https://www.verywellmind.com/internal-and-external-validity-4584479

McLeod, S. (2024). Reliability vs. validity in research. Retrieved February 4, 2026, from Reliability vs Validity in Research

Pearl, J., & Bareinboim, E. (2022). External validity: From do-calculus to transportability across populations. ACM Books.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129.

Colleague post #2: The Major Flaws of Null Hypothesis Significance Testing

Null hypothesis significance testing (NHST) has been the dominant framework in psychological research for decades, yet many scholars argue that it contains fundamental flaws that limit the field's ability to build reliable and cumulative scientific knowledge. Two of the most influential critiques come from Cohen (1990) and Schmidt (1996), both of whom highlight how NHST is frequently misunderstood, misapplied, and overvalued in psychological science (Cohen, 1990; Schmidt, 1996).

Cohen (1990) argues that researchers routinely misinterpret the meaning of the p-value, treating it as a direct indicator of truth rather than a conditional probability based on hypothetical repeated sampling. He emphasizes that statistical significance does not imply practical significance, noting that trivial effects can become significant with large enough samples (Cohen, 1990). Cohen also points out that NHST encourages researchers to ignore effect sizes and statistical power, which are essential for understanding the magnitude and reliability of findings. His critique suggests that the field often mistakes detectability for importance, leading to a literature filled with statistically significant but scientifically uninformative results (Cohen, 1990; Schmidt, 1996).
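Cohen's point that trivial effects become significant with large enough samples can be sketched numerically. The snippet below is my own illustrative, stdlib-only sketch (not taken from either article): it computes an approximate two-sided p-value for a fixed, trivially small standardized effect (d = 0.05) at increasing sample sizes, using a normal approximation to the two-sample test.

```python
import math

def p_value_two_sample(d, n_per_group):
    """Approximate two-sided p-value for a two-sample z-test, given a
    standardized mean difference d (Cohen's d) and equal group sizes.
    Normal approximation; for illustration only."""
    # Standard error of a standardized mean difference is sqrt(2 / n)
    z = d / math.sqrt(2 / n_per_group)
    # p = 2 * (1 - Phi(|z|)), with Phi built from math.erf
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# The same trivial effect (d = 0.05) flips from "nonsignificant" to
# "highly significant" purely because the sample grows:
for n in (50, 500, 5000, 50000):
    print(f"n = {n:5d} per group: p = {p_value_two_sample(0.05, n):.4f}")
```

The effect size never changes; only the verdict does, which is exactly the detectability-versus-importance confusion Cohen describes.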

Schmidt (1996) extends this critique, arguing that NHST actively prevents psychology from developing cumulative knowledge. Because NHST reduces every result to a binary decision (significant or not), it obscures the true size and consistency of effects across studies (Schmidt, 1996). Schmidt contends that this dichotomous thinking contributes to publication bias, unstable findings, and a lack of theoretical progress. He also notes that NHST is overly sensitive to sample size, producing significant results for trivial effects in large samples and nonsignificant results for meaningful effects in small samples (Schmidt, 1996). According to Schmidt, the belief that NHST provides objective scientific rigor is illusory, and the field must shift toward effect sizes, confidence intervals, and meta-analytic thinking to advance (Schmidt, 1996).
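Schmidt's vote-counting argument can be illustrated with a small simulation (my own sketch, not from the article): every simulated study below estimates the same genuinely real effect (d = 0.5), yet with only 20 participants per group a minority of studies reach p < .05, so a literature built from them would look contradictory even though the underlying effect is identical.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is reproducible

def one_study(true_d=0.5, n=20):
    """Simulate one two-group study of a real effect (d = 0.5) with a
    small sample; return True if it reached 'significance' (|z| > 1.96)."""
    control = [random.gauss(0.0, 1.0) for _ in range(n)]
    treated = [random.gauss(true_d, 1.0) for _ in range(n)]
    se = math.sqrt(statistics.variance(control) / n
                   + statistics.variance(treated) / n)
    z = (statistics.mean(treated) - statistics.mean(control)) / se
    return abs(z) > 1.96

results = [one_study() for _ in range(200)]
# Roughly a third of studies "find" the effect; the rest "fail" to,
# purely through sampling error and low power.
print(f"{sum(results)} of 200 identical studies were 'significant'")
```

Counting significant versus nonsignificant studies here would wrongly suggest an inconsistent literature, which is the illusion Schmidt attributes to dichotomous NHST thinking.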

Together, Cohen's and Schmidt's critiques reveal that NHST is not merely a flawed statistical tool but a barrier to scientific progress when used uncritically. Their work has helped push psychology toward more informative approaches, such as estimation statistics, effect size reporting, and meta-analysis, that better support cumulative knowledge and theoretical development. As the field continues to confront issues such as the replication crisis, these critiques remain highly relevant and underscore the need for statistical practices that prioritize meaning over mere significance (Cohen, 1990; Schmidt, 1996). During the replication crisis, another study, Bargh et al. (1996), did not directly critique NHST, but it became one of the most influential examples of the flaws that Cohen (1990) and Schmidt (1996) warned about. The priming effects reported by Bargh et al. were statistically significant but later failed to replicate, illustrating how NHST can produce unstable findings, encourage overreliance on p-values, and hinder cumulative scientific progress. Replication failures (e.g., Doyen et al., 2012) provide empirical support for Cohen's and Schmidt's critiques (Cohen, 1990; Bargh et al., 1996; Schmidt, 1996; Doyen et al., 2012).

What do these flaws mean for the field of psychology? First, psychology risks building theories on unstable foundations: because statistical significance is driven by sample size rather than by meaningful effects, the field may chase 'significant' but trivial findings. Second, NHST contributes to the replication crisis: low power, overreliance on p-values, and publication bias all contribute to poor replicability, one of the biggest issues in modern psychology. Third, the field needs to shift toward estimation and cumulative science (Cohen, 1990; Schmidt, 1996); both Cohen and Schmidt advocate effect sizes, confidence intervals, meta-analysis, power analysis, and transparent reporting. Finally, NHST should not be the primary decision-making tool. Neither Cohen nor Schmidt argued for eliminating NHST entirely, but both insisted it should be supplemented or replaced by more informative statistical approaches. Their critiques helped spark the modern movement toward effect sizes, confidence intervals, meta-analysis, and estimation-based statistics, approaches that provide richer, more meaningful information than a simple p-value (Cohen, 1990; Schmidt, 1996).

Summary Table

Issue                                        Cohen (1990)   Schmidt (1996)
Misinterpretation of the p-value             yes            yes
Statistical vs. practical significance       yes            yes
Low power in psychology                      yes
NHST blocks cumulative knowledge                            yes
Dichotomous thinking                                        yes
Sample size distortions                      yes            yes
Need for effect sizes & power                yes            yes

References

Bargh, J. A., Chen, M., & Burrows, L. (1996). Automaticity of social behavior: Direct effects of trait construct and stereotype activation on action. Journal of Personality and Social Psychology, 71(2), 230-244.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312. https://doi.org/10.1037/0003-066X.45.12.1304

Doyen, S., Klein, O., Pichon, C.-L., & Cleeremans, A. (2012). Behavioral priming: It's all in the mind, but whose mind? PLOS ONE, 7(1), e29081. https://doi.org/10.1371/journal.pone.0029081

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129. https://doi.org/10.1037/1082-989X.1.2.115

Colleague post #3: Null hypothesis significance testing has several major flaws that have long been recognized as problematic for psychological science. One key issue, emphasized by Cohen (1990), is that statistical significance is often misunderstood as evidence that an effect is important or meaningful, when it only indicates that an effect is unlikely to be zero given the sample size. Because the null hypothesis almost always states that an effect is exactly zero, a condition that is rarely true in real-world psychological phenomena, researchers are often rejecting a hypothesis that was never plausible to begin with. As a result, null hypothesis significance testing encourages a misleading focus on p values rather than on the size of effects, their practical importance, or the theoretical meaning of the findings. This has led to a culture in which results are reduced to binary decisions (significant vs. not significant), oversimplifying complex psychological processes.

Another serious problem, highlighted by Schmidt (1996), is that reliance on significance testing actively slows the development of cumulative knowledge in psychology. Because null hypothesis significance testing focuses heavily on controlling Type I error (false positives), it often ignores Type II error (false negatives) and statistical power. This leads to many real effects going undetected, especially in studies with small sample sizes. Schmidt shows that entire research literatures can appear contradictory simply because some studies reach statistical significance and others do not, even when all are estimating the same underlying effect. This vote-counting approach creates the false impression that effects are inconsistent or unreliable, when in fact the inconsistency is largely due to sampling error and low power rather than true differences in psychological phenomena.
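The low-power problem described above can be made concrete with a rough calculation. The sketch below is my own illustration using a normal approximation (not taken from Schmidt): for a medium effect (d = 0.5), studies with small samples fall far below the conventional 80% power target, so Type II errors become the typical outcome.

```python
import math

def phi(z):
    """Standard normal CDF, built from math.erf."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power_two_sample(d, n_per_group, z_crit=1.96):
    """Approximate power of a two-sided two-sample test (normal
    approximation) for standardized effect d with equal group sizes.
    Ignores the negligible lower-tail rejection probability."""
    z_effect = d / math.sqrt(2 / n_per_group)
    return 1 - phi(z_crit - z_effect)

# A real medium effect (d = 0.5) is usually missed at small n:
for n in (10, 20, 50, 100):
    print(f"n = {n:3d} per group: power = {power_two_sample(0.5, n):.2f}")
```

With 20 participants per group, power is only around one-third, meaning most such studies would return a "nonsignificant" result for a perfectly real effect, exactly the pattern that makes literatures look contradictory.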

Together, Cohen (1990) and Schmidt (1996) argue that the field of psychology has been misled by overreliance on null hypothesis significance testing and that this has serious consequences for theory building, replication, and practical application. Both authors emphasize that researchers should shift their focus toward effect sizes, confidence intervals, and meta-analysis, which provide more informative and honest summaries of research findings. Without this shift, psychology risks continuing to produce fragmented and confusing research literatures that obscure rather than clarify real effects. Moving beyond simple significance testing is therefore essential for advancing cumulative knowledge and improving the scientific credibility of the field.

References

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304-1312.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129.
