
Library: Topics in Applied Clinical Pharmacology

The topics are grouped into main sections. Within each section, a series of contributions is set out as an ‘accordion’ (each item expands when you click on its headline). Within a contribution you will find links either to other URLs or to PDF texts for download. For larger writings (e.g. ‘manuals’) you may be asked to contact us to obtain access. This is a continuous project that will be updated regularly, also based on your feedback, comments and suggestions.


Excellent platforms to search disease specifications (epidemiology, etiology, diagnosis, treatment, prognosis, etc.):



EU National




{06.Jun.2020 – U: 01.Nov.2020 | ACPS-CdM}

{ACPS-CdM: 15.Sep.2020}

Biomedical Calculators

Organ specific

  • GCS Glasgow Coma Scale
    The Glasgow Coma Score is calculated as the sum of scores for the three domains (eye, verbal, motor; 4, 5, and 6 scorable items, respectively; max: 15 points)
    *Individual components may not be testable for any of the following reasons (this is not a comprehensive list): Eye: local injury and/or oedema | Verbal: intubation | All (eye, verbal, motor): sedation, paralysis, and ventilation eliminating all responses
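
The summation described above can be sketched as follows. This is a minimal illustration, not a validated clinical tool: the function name `gcs_total` and the convention of returning `None` when a component is not testable are assumptions made here, and the lower bound of 1 point per domain follows the standard GCS definition rather than the text above.

```python
from typing import Optional

# Maximum points per domain, as stated in the text (eye 4, verbal 5, motor 6).
GCS_MAX = {"eye": 4, "verbal": 5, "motor": 6}


def gcs_total(eye: Optional[int], verbal: Optional[int], motor: Optional[int]) -> Optional[int]:
    """Return the GCS sum, or None if any component is not testable (passed as None)."""
    components = {"eye": eye, "verbal": verbal, "motor": motor}
    if any(value is None for value in components.values()):
        # Assumed convention: no total is reported unless all three domains were tested.
        return None
    for name, value in components.items():
        if not 1 <= value <= GCS_MAX[name]:
            raise ValueError(f"{name} score must be between 1 and {GCS_MAX[name]}")
    return sum(components.values())


print(gcs_total(4, 5, 6))     # all three domains at their maximum
print(gcs_total(3, None, 5))  # verbal component not testable, e.g. intubation
```

Returning `None` rather than a partial sum keeps an untestable component from being mistaken for a low score.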

ICU focused

Scores that assess disease severity on admission and use it to predict outcome

  • APGAR evaluated in newborns 1 and 5 minutes after birth
    The Apgar score is a method to quickly summarize the health of newborn infants. Virginia Apgar, an anaesthesiologist at NewYork–Presbyterian Hospital, developed the score in 1952 to quantify the effects of obstetric anaesthesia on babies. The five criteria are summarized using words chosen to form a backronym (Appearance, Pulse, Grimace, Activity, Respiration; i.e. skin colour, pulse rate, reflex irritability/grimace, muscle tone, and respiratory effort). Each criterion is scored 0 (absent), 1 (intermediate), or 2 (present); the minimum total score is 0 and the maximum is 10. The initial test for “Reflex Irritability” was nasal and oropharyngeal suctioning with a rubber catheter, meant to elicit a grimace, sneeze, or cough. Later, a rapid, tangential slap of the sole of the foot was found to be an easier and more effective stimulus.
    {Apgar V. A proposal of a New Method of Evaluation of the Newborn Infant. Current Researches in Anesthesia and Analgesia. 1953, 32: 261-267 | Apgar V, Holaday DA, James LS, et al. Evaluation of the newborn infant. JAMA. 1958, 168: 1985-1988 | Casey BM, McIntire DD, Leveno KJ. The continuing value of the Apgar score for the assessment of newborn infants. N Engl J Med. 2001 Feb 15;344(7):467-71}
  • APACHE: Acute Physiology and Chronic Health Evaluation
    • APACHE II | APACHE Calculator
      The APACHE II score provides an estimate of ICU mortality based on a number of laboratory values and patient signs, taking both acute and chronic disease into account.
      Note: The data used should be from the initial 24 hours in the ICU, and the worst value (furthest from baseline/normal) should be used.
      Definitions of severe organ insufficiency or immunosuppression (chronic health conditions):
      Liver insufficiency: Biopsy proven cirrhosis | Documented portal hypertension | Episodes of past upper GI bleeding attributed to portal hypertension | Prior episodes of hepatic failure / encephalopathy / coma
      Cardiovascular: New York Heart Association Class IV Heart Failure
      Respiratory: Chronic restrictive, obstructive or vascular disease resulting in severe exercise restriction, i.e. unable to climb stairs or perform household duties | Documented chronic hypoxia, hypercapnia, secondary polycythemia, severe pulmonary hypertension (>40 mmHg), or respirator dependency
      Renal: Receiving chronic dialysis
      Immunosuppression: The patient has received therapy that suppresses resistance to infection e.g. immuno-suppression, chemotherapy, radiation, long term or recent high-dose steroids, or has a disease that is sufficiently advanced to suppress resistance to infection, e.g. leukaemia, lymphoma, AIDS
      {Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: A severity of disease classification system. Critical Care Medicine 13:818–829, 1985}
      The APACHE III score consists of several parts, including the primary reason for ICU admission, age, sex, race, pre-existing comorbidities, and location prior to ICU admission. The APACHE III score ranges from 0 to 299 points
      {Knaus WA, Wagner DP, Draper EA, et al. The APACHE III prognostic system. Risk prediction of hospital mortality for critically ill hospitalized adults. Chest. 1991;100(6):1619–36}
      The APACHE II score was published in 1985; APACHE IV is the latest version, published in 2006. Built on the study of a more recent patient population and standard of care, it is now the recommended score, to be used instead of APACHE II and III. APACHE scores are probably the most widely used scores in intensive care for quantifying the severity of illness. They are a useful tool for comparing patient populations in clinical studies or in quality audits.
      Like APACHE II, APACHE IV provides the basis for calculating an estimated risk of death. It also provides an estimate of the length of stay.
      From APACHE II to APACHE IV, the scoring system has become more complex: it takes a greater number of variables into account (and derives scores from several variables in a slightly different way), and it refers to a large database of coefficients, individualized for an increased number of diseases, to calculate the risk of death and the length of stay. APACHE III and IV are very similar and use the same variables; only the disease-specific coefficients have been updated.
  • Acute Physiology Score
    • APS: Acute Physiology Score = APACHE IV minus patient’s age and Chronic Health Condition details.
  • (Simplified) Acute Physiology Score
    • SAPS II: Simplified Acute Physiology Score
      SAPS was first described in 1984 as an alternative to APACHE. The original score was assessed in the first 24 h after admission to the ICU and included 14 physiological variables, but did not include previous diseases. It has now been replaced by the SAPS II and SAPS III scores, which include 12 physiological variables during the first 24 h after admission to the ICU as well as the reason for admission (planned or emergency surgery or other reasons), the previous medical condition, and age. It is not recommended for children under 18 years of age, patients with burns, or cardiac patients
    • SAPS III: Simplified Acute Physiology Score
      {Y Sakr, C. Krauss, ACKB Amaral, et al. Comparison of the performance of SAPS II, SAPS 3, APACHE II, and their customized prognostic models in a surgical intensive care unit, BJA: British Journal of Anaesthesia. 2008;101: 798–803}
  • MPM: The Mortality Prediction Model
    The MPM is an estimator of in-hospital mortality based on chronic medical conditions, acute diagnoses, and physiological variables. The admission model (MPM0) includes 15 variables; subsequent MPMs contain 5 of the admission variables and 8 additional variables selected to reflect the condition of patients who remain in the ICU for more than 24 h. Scoring can be repeated every day during the stay of critically ill patients in the ICU (e.g., MPM24, MPM48, MPM72)
  • Mortality Prediction Models-Admission (MPM 0)
  • Mortality Prediction Models-24 (MPM-24)
  • Mortality Prediction Models-48 (MPM-48)
  • Mortality Prediction Models-Over Time
  • PRISM: Pediatric Risk of Mortality


  • SI: Shock Index = ratio of HR to SBP (HR/SBP)
    The SI was developed to identify trauma patients in hypovolemic shock {Allgower, M.; Burri, C. [“Shock index”]. Dtsch. Med. Wochenschr. 1967, 92, 1947–1950}.
    A value of 0.7 represents normal SI, whereas SI of >1 is highly indicative of hemodynamic instability and mortality upon arrival at the ED {Mutschler, M.; Nienaber, U.; Munzberg, M.; Wolfl, C.; Schoechl, H.; Paffrath, T.; Bouillon, B.; Maegele, M. The Shock Index revisited—A fast guide to transfusion requirement? A retrospective analysis on 21,853 patients derived from the TraumaRegister DGU. Crit. Care 2013, 17, R172. & Rady, M.Y.; Nightingale, P.; Little, R.A.; Edwards, J.D. Shock index: A re-evaluation in acute circulatory failure. Resuscitation 1992, 23, 227–234.}.
    An SI of ≥1 generally indicates an uncompensated shock state of the patient and resuscitation may be necessary
    {Nakasone, Y.; Ikeda, O.; Yamashita, Y.; Kudoh, K.; Shigematsu, Y.; Harada, K. Shock index correlates with extravasation on angiographs of gastrointestinal hemorrhage: A logistics regression analysis. Cardiovasc. Interv. Radiol. 2007, 30, 861–865. & Sloan, E.P.; Koenigsberg, M.; Clark, J.M.; Weir, W.B.; Philbin, N. Shock index and prediction of traumatic hemorrhagic shock 28-day mortality: Data from the DCLHb resuscitation clinical trials. West. J. Emerg. Med. 2014, 15, 795–802.}.
    An SI of ≥1 is also associated with higher mortality rate {Mitra, B.; Fitzgerald, M.; Chan, J. The utility of a shock index ≥1 as an indication for pre-hospital oxygen carrier administration in major trauma. Injury 2014; 45: 61–65.}
  • rSI: reversed Shock Index = ratio of SBP to HR
    The reverse shock index (rSI) specifically targets the hemodynamic condition of trauma patients. The patient is potentially in shock when the SBP has decreased to below the HR (i.e., rSI of <1). The concept is intuitive: it compares two vital signs (SBP and HR) without any additional calculation, so it can be used quickly in a prehospital scenario or a crowded ED.
    An rSI of <1 was associated with poor outcome in trauma patients and helps to identify patients with a high risk of mortality, even when there is no notable hypotension
    {Chuang, J.F.; Rau, C.S.; Wu, S.C.; Liu, H.T.; Hsu, S.Y.; Hsieh, H.Y.; Chen, Y.C.; Hsieh, C.H. Use of the reverse shock index for identifying high-risk patients in a five-level triage system. Scand. J. Trauma Resusc. Emerg. Med. 2016; 24: 12.
    Kuo, S.C.; Kuo, P.J.; Hsu, S.Y.; Rau, C.S.; Chen, Y.C.; Hsieh, H.Y.; Hsieh, C.H. The use of the reverse shock index to identify high-risk trauma patients in addition to the criteria for trauma team activation: A cross-sectional study based on a trauma registry system. BMJ Open 2016; 6: e011072.
    Lai, W.H.; Rau, C.S.; Hsu, S.Y.; Wu, S.C.; Kuo, P.J.; Hsieh, H.Y.; Chen, Y.C.; Hsieh, C.H. Using the Reverse Shock Index at the Injury Scene and in the Emergency Department to Identify High-Risk Patients: A Cross-Sectional Retrospective Study. Int. J. Environ. Res. Public Health 2016; 13: 357.
    Lai, W.H.; Wu, S.C.; Rau, C.S.; Kuo, P.J.; Hsu, S.Y.; Chen, Y.C.; Hsieh, H.Y.; Hsieh, C.H. Systolic Blood Pressure Lower than Heart Rate upon Arrival at and Departure from the Emergency Department Indicates a Poor Outcome for Adult Trauma Patients. Int. J. Environ. Res. Public Health 2016; 13: 528}
  • rSIG: reverse Shock Index (rSI) multiplied by the Glasgow Coma Scale (GCS) score
    = GCS score multiplied by systolic blood pressure (SBP)/heart rate (HR)
    A retrospective multicentre study using registry data of 168,517 patients from the Japan Trauma Data Bank proposed that a new score, the rSI multiplied by the GCS score (rSIG = SBP/HR × GCS score), can be used to identify trauma patients with a high risk of mortality and of requiring a blood transfusion within 24 h {Kimura, A.; Tanaka, N. Reverse shock index multiplied by Glasgow Coma Scale score (rSIG) is a simple measure with high discriminant ability for mortality risk in trauma patients: An analysis of the Japan Trauma Data Bank. Crit. Care 2018, 22: 87}
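
The three ratios defined in this list (SI, rSI and rSIG) can be sketched as below. The function names and the input checks are assumptions made for illustration; the thresholds quoted in the comments are those given in the text and are not validated decision rules.

```python
def _check_positive(hr: float, sbp: float) -> None:
    """Both vital signs must be positive for the ratios to be defined."""
    if hr <= 0 or sbp <= 0:
        raise ValueError("HR and SBP must be positive")


def shock_index(hr: float, sbp: float) -> float:
    """SI = HR / SBP (HR in beats/min, SBP in mmHg)."""
    _check_positive(hr, sbp)
    return hr / sbp


def reverse_shock_index(hr: float, sbp: float) -> float:
    """rSI = SBP / HR; a value below 1 means SBP is lower than HR."""
    _check_positive(hr, sbp)
    return sbp / hr


def rsig(hr: float, sbp: float, gcs: int) -> float:
    """rSIG = (SBP / HR) x GCS score."""
    if not 3 <= gcs <= 15:
        raise ValueError("GCS score must be between 3 and 15")
    return reverse_shock_index(hr, sbp) * gcs


# Illustrative values only (not patient data).
hr, sbp, gcs = 120.0, 90.0, 12
print(shock_index(hr, sbp) >= 1)         # SI of >=1: threshold quoted in the text
print(reverse_shock_index(hr, sbp) < 1)  # rSI of <1: SBP lower than HR
print(rsig(hr, sbp, gcs))
```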

Trauma Severity

  • AIS: Abbreviated Injury Scale used to grade the injury severity to an anatomical location on a six-point ordinal scale, ranging from minor (1 point), moderate (2 points), serious (3 points), severe (4 points), critical (5 points), to unsurvivable (6 points). {Rating the severity of tissue damage. I. The abbreviated scale. JAMA 1971, 215, 277–280.}
  • ISS: Injury Severity Score is commonly used to grade the injury severity of trauma patients. It is the sum of the squares of the highest AIS scores in the three most severely injured of six predefined body regions (head & neck; face; chest; abdomen; extremities; external – max. value: 75). {Baker, S.P.; O’Neill, B.; Haddon, W., Jr.; Long, W.B. The injury severity score: A method for describing patients with multiple injuries and evaluating emergency care. J. Trauma 1974; 14: 187–196}
  • RTS: Revised Trauma Score
    The Revised Trauma Score includes Glasgow Coma Scale (GCS), systolic blood pressure (SBP), and respiratory rate (RR) and excludes capillary refill and respiratory expansion, which are difficult to assess in the field. Two versions of the revised score have been developed, one for triage (T-RTS) and another for use in outcome evaluations and to control for injury severity (RTS). T-RTS, the sum of the coded values of GCS, SBP, and RR, demonstrated increased sensitivity and some loss in specificity when compared with a triage criterion based on Trauma Score (TS) and GCS values.
    {De Munter, L.; Polinder, S.; Lansink, K.W.; Cnossen, M.C.; Steyerberg, E.W.; de Jongh, M.A. Mortality prediction models in the general trauma population: A systematic review. Injury 2017, 48, 221–229.}

    RTS: weighted sum of the coded values (each coded 0 to 4) of GCS, SBP, and RR = 0.9368 × GCS + 0.7326 × SBP + 0.2908 × RR
    RTS values range from 0 to 7.84.
  • TRISS: Trauma-Injury Severity Score
    The Trauma-Injury Severity Score was created by Boyd et al. following a 1987 study that aimed to integrate the evaluation of trauma into a generic procedure. The TRISS became a standardized approach for evaluating the outcome of trauma care. It is based on the patient’s age in years and the components of the RTS and ISS.
    {Boyd CR, Tolson MA, Copes WS. Evaluating trauma care: the TRISS method. Trauma Score and the Injury Severity Score. J Trauma. 1987; 27(4):370-8.
    Stewart TC, Lane PL, Stefanits T. An evaluation of patient outcomes before and after trauma center designation using Trauma and Injury Severity Score analysis. J Trauma. 1995; 39(6):1036-40.
    Singh J, Gupta G, Garg R, Gupta A. Evaluation of trauma and prediction of outcome using TRISS method. J Emerg Trauma Shock. 2011; 4(4): 446–449.
    Thanapaisal C, Saksaen P. A comparison of the Acute Physiology and Chronic Health Evaluation (APACHE) II score and the Trauma-Injury Severity Score (TRISS) for outcome assessment in Srinagarind Intensive Care Unit trauma patients. J Med Assoc Thai. 2012;95(11): S25–S33}.
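
The ISS and RTS calculations described in this section can be sketched as follows. Several details are assumptions that are not stated above: the function and region names are hypothetical, the coding bands for GCS, SBP and RR follow the commonly published RTS coding table, and an AIS of 6 is taken to set the ISS to its maximum of 75. The sketch should be checked against the original publications before any real use.

```python
from typing import Mapping

ISS_REGIONS = ("head_neck", "face", "chest", "abdomen", "extremities", "external")


def injury_severity_score(worst_ais: Mapping[str, int]) -> int:
    """ISS from the highest AIS score per body region (0 = no injury, 1 to 6)."""
    for region, ais in worst_ais.items():
        if region not in ISS_REGIONS:
            raise ValueError(f"unknown body region: {region}")
        if not 0 <= ais <= 6:
            raise ValueError("AIS scores must lie between 0 and 6")
    if 6 in worst_ais.values():
        return 75  # assumed convention: an unsurvivable injury gives the maximum
    top_three = sorted(worst_ais.values(), reverse=True)[:3]
    return sum(ais ** 2 for ais in top_three)


def code_gcs(gcs: int) -> int:
    """Coded GCS value (assumed bands: 13-15, 9-12, 6-8, 4-5, 3)."""
    if gcs >= 13:
        return 4
    if gcs >= 9:
        return 3
    if gcs >= 6:
        return 2
    if gcs >= 4:
        return 1
    return 0


def code_sbp(sbp: float) -> int:
    """Coded SBP value in mmHg (assumed bands: >89, 76-89, 50-75, 1-49, 0)."""
    if sbp > 89:
        return 4
    if sbp >= 76:
        return 3
    if sbp >= 50:
        return 2
    if sbp >= 1:
        return 1
    return 0


def code_rr(rr: float) -> int:
    """Coded RR value per minute (assumed bands: 10-29, >29, 6-9, 1-5, 0)."""
    if rr > 29:
        return 3
    if rr >= 10:
        return 4
    if rr >= 6:
        return 2
    if rr >= 1:
        return 1
    return 0


def triage_rts(gcs: int, sbp: float, rr: float) -> int:
    """T-RTS: unweighted sum of the three coded values (0 to 12)."""
    return code_gcs(gcs) + code_sbp(sbp) + code_rr(rr)


def revised_trauma_score(gcs: int, sbp: float, rr: float) -> float:
    """RTS: weighted sum of the coded values, using the weights given in the text."""
    return 0.9368 * code_gcs(gcs) + 0.7326 * code_sbp(sbp) + 0.2908 * code_rr(rr)


# Illustrative values only (not patient data).
print(injury_severity_score({"head_neck": 4, "chest": 3, "abdomen": 2, "extremities": 1}))
print(triage_rts(15, 120, 16), round(revised_trauma_score(15, 120, 16), 2))
```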

Scores that assess the presence and severity of organ dysfunction

  • MODS: Multiple Organ Dysfunction Score
    The MODS is constructed by scoring the extent of functional impairment of six organ systems, which is strongly correlated with mortality in the ICU and in hospital overall. MODS covers six domains: the cardiovascular system (heart rate × right atrial pressure / mean arterial pressure), the pulmonary system (PaO2/FiO2 ratio), the renal system (serum creatinine), the hepatic system (serum bilirubin), the haematologic system (platelet count), and the central nervous system (Glasgow Coma Score). Each domain is scored 0 to 4 (0: normal function; 4: marked physiologic derangement). The score in each domain provides a measure of dysfunction in the system of interest, and the composite score provides a measure of global dysfunction on each day in the ICU.
  • SOFA: Sequential Organ Failure Assessment
    SOFA (originally the Sepsis-related Organ Failure Assessment score) was designed by a group of scientists from the European Society of Intensive Care Medicine to describe the degree of organ dysfunction in sepsis. Since then, it has been used to assess organ dysfunction in critically ill patients regardless of the cause.
    It rates the 6 major organ systems (respiratory, cardiovascular, CNS, renal, liver, and coagulation), each from 0 to 4 points, giving a total score of 0 to 24 points {Sawicka W, Owczuk R, Wujtewicz MA, Wujtewicz M: The effectiveness of the APACHE II, SAPS II and SOFA prognostic scoring systems in patients with haematological malignancies in the intensive care unit. Anaesthesiol Intensive Ther, 2014; 46(3): 166–70}
  • LODS (Logistic Organ Dysfunction Score) is determined in the first 24 h.
    The LODS lies somewhere between a mortality prediction score and an organ failure score because it combines a global score summarizing the total degree of organ dysfunction across the organ systems with a logistic regression equation that can be used to convert the score into a probability of mortality {Cook DJ, Fuller HD, Guyatt GH, Marshall JC, Leasa D, Hall R, Winton TL, Rutledge F, Todd TJR, Roy P, Lacroix J, Griffith L, Willan A, for the Canadian Critical Care Trials Group: Risk factors for gastrointestinal bleeding in critically ill patients. N Engl J Med 1994, 330:377-381 | Le Gall, et al. in JAMA, Sept 11, 1996; 276: 802-810}

Scores that assess nursing workload use

  • TISS: Therapeutic Intervention Scoring System
    The Therapeutic Intervention Scoring System (TISS) quantifies the type and number of intensive care treatments. It therefore indicates the workload of intensive care and may be used for calculating costs in the ICU. TISS was originally elaborated by Cullen in 1974, based on 57 therapeutic procedures, and was designed to assess the severity of illness in the ICU. Each intervention scores 1 to 4 points, and patients are grouped into 5 classes. It was not sufficiently validated and was abandoned; however, it first introduced the idea of “patient points managed per nurse”. A TISS update was elaborated by Keene in 1983, in which the number of therapeutic procedures was increased to 76. It was assumed that a single nurse can manage 40 to 50 points per day. Though not validated, it became the most used tool to assess complexity of treatment and nurse/patient ratio. However, many of its items are obsolete and frequently relate to the severity of illness rather than to specific nursing interventions
    {Mälstam J, Lind L. Therapeutic intervention scoring system (TISS)–a method for measuring workload and calculating costs in the ICU. Acta Anaesthesiol Scand. 1992 Nov;36(8):758-63}
  • NEMS, Nine Equivalents of Nursing Manpower Use Score
    This score was elaborated by Miranda in 1997. It was derived from TISS and TISS28. Only 9 items are considered; they relate to specific organ support, nursing, and diagnostic/therapeutic interventions inside or outside the ICU. These items were weighted by multivariate analysis, yielding a score comparable to the TISS28 score. Each nurse can deal with 45 to 50 points per day.
    {Guccione A, Morena A, Pezzi A, Iapichino G. [The assessment of nursing workload]. Minerva Anestesiol. 2004 May;70(5):411-6.}

Traumatic Brain Injury (TBI)

  • Glasgow Outcome Scales
    There are two Glasgow outcome scales, an original version and an extended one. These are presented below one after another to offer clinicians a rapid but extensive stratification tool for the assessment of patients with traumatic brain injury. The GOS was first described in 1975 in a study by Jennett and Bond {Jennett B, Bond M. Assessment of outcome after severe brain damage. Lancet. 1975; 1(7905):480-4}. It consists of five statuses that make up an objective evaluation of initial status and recovery. The GOS has also been used successfully in predicting long-term rehabilitation after TBI.
    The Extended Glasgow Outcome Scale, or GOS-E, was meant to address the shortfalls of the original version and consists of eight statuses and a structured interview to be applied with it. GOS-E has shown more reliability and content validity in practice, and it is more sensitive than GOS to changes in mild and moderate traumatic brain injuries. Both scales provide a general assessment of mental function, trauma severity and outcome after head injury. In research on traumatic injury, the GOS scales are recommended to be administered at 3, 6 and 12 months. For patients with TBI there are also rehabilitation models available, such as the Disability Rating Scale, which consists of interview questions that provide information about the patient’s status, rehabilitation and perceived disability

    • GOS Glasgow Outcome Scale:
      1: Dead |
      2: Persistent Vegetative State / Patient exhibits no obvious cortical function |
      3: Severe Disability (Conscious but disabled) / Patient depends upon others for daily support due to mental or physical disability or both. |
      4: Moderate Disability (Disabled but independent) / Patient is independent as far as daily life is concerned. The disabilities found include varying degrees of dysphasia, hemiparesis, or ataxia, as well as intellectual and memory deficits and personality changes. |
      5: Good Recovery / Resumption of normal activities even though there may be minor neurological or psychological deficits.
    • GOS-E Glasgow Outcome Scale – Extended Version:
      1: Dead |
      2: Vegetative State / Condition of unawareness with only reflex responses but with periods of spontaneous eye opening. |
      3: Lower Severe Disability |
      4: Upper Severe Disability / Patient who is dependent on daily support because of mental or physical disability, usually a combination of both. If the patient can be left alone at home for more than 8 h, it is the upper level of SD; if not, it is the lower level of SD |
      5: Lower Moderate Disability |
      6: Upper Moderate Disability / Patients have some disability such as aphasia, hemiparesis or epilepsy and/or deficits of memory or personality but are able to look after themselves. They are independent at home but dependent outside. If they are able to return to work, even with special arrangements, it is the upper level of MD; if not, it is the lower level of MD |
      7: Lower Good Recovery (GR) |
      8: Upper Good Recovery / Resumption of normal life with the capacity to work even if pre-injury status has not been achieved. Some patients have minor neurological or psychological deficits. If these deficits are not disabling, it is the upper level of GR; if disabling, it is the lower level of GR.


{12.May.2020 | ACPS-CdM}

{under construction} coming soon!

{12.May.2020 | ACPS-CdM}

Renal function:

CKD: Chronic Kidney Disease

  • CKD: Kidney Failure Risk 8-criteria equation (Estimate risk of progression to end-stage renal disease in CKD patients: sex | age | eGFR | urinary albumin-creatinine-ratio | serum calcium | serum phosphorus | serum bicarbonate | serum albumin)
  • CKD: Kidney Failure Risk 4-criteria equation (Estimate risk of progression to end-stage renal disease in CKD patients: sex | age | eGFR | urinary albumin-creatinine-ratio)

AKI: Acute Kidney Injury

{27.Mar.2018 U: 12.May.2020 | ACPS-CdM}

Applied Clinical Pharmacokinetics - BABE Guidelines

{27.Mar.2018 | ACPS-CdM}

BABE-reporting templates including BA-specification:


General provisions

Fixed Dose Combinations

  • EMA/CHMP/158268/2017: Guideline on clinical development of fixed combination medicinal products – Mar.2017
    • {EMA/757184/2015: Submission of comments on ‘Guideline on Clinical Development of Fixed Combination Medicinal Products’ (EMA/CHMP/281825/2015) – Oct.2018}
    • {CHMP/EWP/191583/2005: Questions And Answers Document On The Clinical Development Of Fixed Combinations Of Drugs Belonging To Different Therapeutic Classes In The Field Of Cardiovascular Treatment And Prevention – 2005}
    • {EMA/CHMP/779887/2012: Concept paper on the need to revise the Guideline on the clinical development of fixed dose combinations of medicinal products regarding dossier content requirements – Mar.2013}

Modified Release Products – General

  • EMA/CHMP/EWP/280/96: Guideline on the pharmacokinetic and clinical evaluation of modified release dosage forms (EMA/CPMP/EWP/280/96 Corr1) – Nov.2014

Oral Modified Release products

Transdermal patches

LaLa: Locally Applied, Locally Acting Products

  • CPMP/EWP/239/95 final: Note for guidance on the clinical requirements for locally applied, locally acting products containing known constituents – Nov.1995
  • EMA/CHMP/QWP/558185/2014: Concept paper on the development of a guideline on quality and equivalence of topical products – Dec.2014
  • EMA/CHMP/558326/2013: Concept paper on the development of a guideline on the demonstration of therapeutic equivalence for locally applied and locally acting products in the gastrointestinal tract – Sep.2013
  • CPMP/EWP/239/95 Rev. 1: Guideline on equivalence studies for the demonstration of therapeutic equivalence for products that are locally applied, locally acting in the gastrointestinal tract as addendum to the guideline on the clinical requirements for locally applied, locally acting products containing known constituents [Draft] – Mar.2017

GCP-Inspections (BABE)


General – CFR

General – Statistical Approaches

Early Pharmacokinetics – General Considerations

BA & BE for Oral Products – General Considerations

BA (& BE) for Oral Products – NDAs or INDs

BE for Oral Products – ANDAs

Product-specific BE Guidance


Combination Products

Modified-release Products

See also: Wang YL, Chang YT, Yang SY, Chang YW, Kuan MH, Tu CL, Hong HC, Lai IC, Gau CS, Hsu LF. Approval of modified-release products by FDA without clinical efficacy/safety studies: A retrospective survey from 2008 to 2017. Regul Toxicol Pharmacol. 2019 Apr;103:174-180.

Topical – Locally applied, Locally acting


Therapeutic Proteins

    • FDA Guidance for Industry – Considerations in Demonstrating Interchangeability With a Reference Product (Final – May.2019)
      {… to demonstrate that a proposed therapeutic protein product is interchangeable with a reference product for the purposes of submitting a marketing application or supplement under section 351(k) of the Public Health Service Act (PHS Act) (42 U.S.C. 262(k))}












{27.Mar.2018 | Update: 26.Oct.2020 | ACPS-CdM}

The GCP Inspectors Working Group has developed procedures for the coordination, preparation, conduct and reporting of GCP inspections carried out in the context of the Centralised Procedure. These inspections are adopted by the CHMP and may be routine or may be triggered by issues arising during the assessment of the dossier or by other information such as previous inspection experience. They are usually requested during the initial review of a Marketing Authorisation Application, but could arise post-authorisation (e.g. inspection of studies conducted or completed as part of the condition of a marketing authorisation, or because of concerns arising about the studies previously submitted).

The EMA has established and maintains a website platform that is specifically dedicated to the guidelines re. GCP Inspections for clinical trials including bioequivalence studies. In addition, attention is drawn to two important aspects of prime interest:

{ACPS-CdM: 20.Oct.2020}

Background – Recommended Reading

(Regulatory) Guidance

BCS Monographs

{ACPS-CdM: 20.Oct.2020}

Applied Clinical Pharmacokinetics - Semantics

Establishing and maintaining blinding throughout the complex process of capturing and analysing data may be challenging, but avoids compromising objectivity. Bioanalytical (BA) determinations and pharmacokinetic (PK) analyses are at risk of being result-driven unless blinded. In the EU, BABE Guidance specifically instructs that bioanalysis should be conducted without information on treatment. However, most bioanalysts will be aware of the subject and period ID; no such restrictions are imposed with regard to the PK-analysts.

Blinding could be established by:

  • Using coded sample IDs so that the actual sample ID (subject, treatment, time after dosing) of blood/plasma samples is not disclosed
  • Imposing a rule that any re-analysis must be reported
  • Setting strictly restrictive rules on bioanalytical re-analysis (EU BABE-Guidance: » Reanalysis of study samples should be predefined in the study protocol (and/or SOP) before the actual start of the analysis of the samples. Normally reanalysis of subject samples because of a pharmacokinetic reason is not acceptable. This is especially important for bioequivalence studies, as this may bias the outcome of such a study «)
  • Disclosing sample IDs only after all BA-determinations have been completed
  • Using coded profile IDs so that the actual PK-profile ID (subject, treatment) of the time courses of the blood/plasma concentrations is not disclosed
  • Disclosing profile IDs only after all PK-analyses have been completed

We consider these steps to be highly recommendable; however, we are not aware of any guideline that encourages/enforces such blinding, at PK-analyst level, in particular.
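
As an illustration of the coded sample IDs listed above, the following minimal sketch assigns each real sample ID a random code and keeps the blinding key separate from the labels handed to the analyst. The function name `make_blinding_key`, the ID layout and the 8-character code format are assumptions made here; a real study would define them in the protocol or an SOP.

```python
import secrets
from typing import Dict, Iterable, Tuple

# Hypothetical ID layout: (subject, treatment, time after dosing in hours).
SampleId = Tuple[str, str, float]


def make_blinding_key(sample_ids: Iterable[SampleId]) -> Dict[SampleId, str]:
    """Assign each real sample ID a random code that carries no study information."""
    key: Dict[SampleId, str] = {}
    used = set()
    for sample_id in sample_ids:
        code = secrets.token_hex(4).upper()
        while code in used:  # redraw in the unlikely event of a collision
            code = secrets.token_hex(4).upper()
        used.add(code)
        key[sample_id] = code
    return key


samples = [
    (subject, treatment, time)
    for subject in ("S01", "S02")
    for treatment in ("T", "R")
    for time in (0.0, 1.0, 2.0)
]
key = make_blinding_key(samples)           # held by an unblinded third party
blinded_labels = sorted(key.values())      # sorted so that list order reveals nothing
print(len(blinded_labels), "coded samples ready for bioanalysis")
```

The key would be disclosed only after all determinations are complete, which corresponds to the fourth and sixth items of the list.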

{27.Mar.2018 U: 25.Oct.2020| ACPS-CdM}

Most regulatory pharmacokinetic (PK) arguments rely on summary values to describe and compare the time courses of the blood/plasma concentrations (the “PK-profile”) across treatments (e.g. formulations, doses, co-medications, co-morbidities, etc.).

To this purpose, the descriptors of the PK-profile should be chosen such that they are sufficient to identify meaningful differences in systemic exposure that might be relevant for the treatment outcome in terms of efficacy and safety.

This is generally accepted to be well covered by non-compartmental analysis (NCA). NCA relies on very basic, unpretentious mathematics and does not impose model assumptions; NCA is therefore unlikely to be confounded by whether the chosen mechanistic interpretation of the PK-profile is ‘correct’ or not.

Because it relies on very basic mathematics, NCA can easily be processed with unsophisticated software. There is a large range of PK software tools and packages, most of which also include NCA: see for instance PharmPK-List of PK-Software.

Among these, we have excellent experience with PCModfit, which operates with all analytical steps well documented within a single platform; this is important for documenting the selection of the data points used to derive the apparent terminal log-linear disposition rate constant and half-life. This tool operates reliably and we have documented excellent agreement of its results with other software tools (incl. WinNonlin).

There is no regulatory ruling on the use of specific PK-software; nevertheless, it is a well-established urban myth that regulators only accept analyses processed with the NCA-subroutines of expensive pharmacometric software packages such as Phoenix WinNonlin. This package may reasonably claim to be “The Industry Standard for PK/PD Modeling and Simulation”; however, its cost is high and out of proportion for just doing NCA.

{27.Mar.2018 | ACPS-CdM}

Most regulatory pharmacokinetic (PK) arguments rely on summary values to describe and compare the time courses of the blood/plasma concentrations (the “PK-profile”) across treatments (e.g. formulations, doses, co-medications, co-morbidities, etc.). This is generally accepted to be well covered by non-compartmental analysis (NCA). NCA relies on very basic, unpretentious mathematics and does not impose model assumptions; it is therefore unlikely to be confounded by whether the chosen mechanistic interpretation of the PK-profile is ‘correct’ or not.

NCA derives simple descriptors of the PK-profile: maximum observed concentrations (Cmax), time of occurrence of Cmax after dosing (tmax), quantifiable i.e. truncated area under the time course of the concentrations (AUC_tz), total i.e. extrapolated AUC (AUC_∞), apparent terminal disposition rate constant (λ), apparent terminal disposition half-life (t½), the quantifiable and total area under the statistical first-moment curve (AUMC), and mean residence time (MRT).

Among the NCA-criteria, Cmax and AUC_∞ are of prime relevance since they represent the peak and total (area) systemic exposure; the MRT is a highly robust expression of how the AUC is distributed over time (since it relies on all profiling points). Cmax/AUC_∞ is useful since it allows qualifying differences in the rate of systemic bioavailability: it reflects changes in Cmax that are disproportionate to changes in the amount of bioavailability.

For good reason, the AUC_∞ (=F.Dose/CL) is a prime characteristic since it allows quantifying differences in the fractional amount of systemic bioavailability (F) across treatments if the clearance (CL) can be accepted to be treatment-independent.

Since the AUC_∞ expresses the balance between the amount of systemically bioavailable drug (F.Dose) and the disposition thereof (CL), the latter can be calculated from the AUC if F is known: CL = F.Dose/AUC_∞. F is known only for intravascular dosing (iv-bolus or iv-infusion); in contrast, for extravascular dosing F is generally incomplete to an unknown extent.
Nevertheless, NCA-programs often calculate Dose/AUC_∞ and report it as CL/F. This so-called ‘apparent oral clearance’ is not meaningful since it results from two ‘unknowns’ (F and CL) and most certainly does not reflect actual CL or changes thereof in any meaningful fashion. If anything, CL/F is nothing but the reciprocal of the dose-normalised AUC_∞.
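The point can be illustrated with simple arithmetic. The sketch below uses hypothetical numbers (dose, F and CL are invented for illustration) to show that two quite different combinations of F and CL yield the same AUC_∞ and therefore the same ‘CL/F’, and that this value is simply the reciprocal of the dose-normalised AUC_∞.

```python
import math

def auc_inf(f, dose, cl):
    """AUC_inf = F * Dose / CL."""
    return f * dose / cl

def apparent_clearance(dose, auc):
    """Dose / AUC_inf, the quantity NCA programs report as CL/F."""
    return dose / auc

dose = 100.0  # mg (hypothetical)

# Scenario 1: F = 0.50, CL = 10 L/h
# Scenario 2: F = 0.25, CL = 5 L/h (half the bioavailability, half the clearance)
auc_1 = auc_inf(0.50, dose, 10.0)
auc_2 = auc_inf(0.25, dose, 5.0)

cl_f_1 = apparent_clearance(dose, auc_1)
cl_f_2 = apparent_clearance(dose, auc_2)

print(auc_1, auc_2)    # identical AUC_inf in both scenarios
print(cl_f_1, cl_f_2)  # identical 'CL/F' although the actual CL differs two-fold

# 'CL/F' is nothing but the reciprocal of the dose-normalised AUC_inf
assert math.isclose(cl_f_1, 1 / (auc_1 / dose))
```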

Another such oddity is that NCA-programs often report Vd/F (‘apparent oral distribution volume’), which can only be used as an estimate of the distribution volume (Vd) if F were known (Vd/F = (CL/F)/λ or Vd/F = (CL/F).MRT). Although it is no measure of volume, Vd/F serves some mechanistic role since it represents the composite dilutional effect on how a given dose results in the observed concentrations depending on a) the fraction of the ‘dose’ actually reaching the systemic circulation and b) how this residual is then diluted by the impact of all distribution compartments.

We advocate not using CL/F or Vd/F unless F is reliably known. This is particularly important in SmPC-descriptions of a drug’s PK-behaviour since the terms are often used without clearly specifying them as ‘apparent’ in this context.


This is how we would have liked to leave things until recently. To our surprise, CL/F and Vd/F are mentioned in the draft 2019 FDA Guidance on FED studies as criteria that should be reported (FDA Guidance for Industry – Assessing the Effects of Food on Drugs in INDs and NDAs – Clinical Pharmacology Considerations – Draft – Feb.2019). Possibly, this stems from the same desk that lists CL/F as a “physiologically plausible structural element” of a population PK model (FDA Guidance to Industry – Population Pharmacokinetics – Jul.2019). We have rarely seen a better illustration of the intellectual constraints of model-driven pharmacometrics.

{09.May.2018 – Update:22.Oct.2020 | ACPS-CdM }

Topics in Bioequivalence Testing

Well-groomed regulatory rules are in place to facilitate the development and authorization of multi-origin, i.e. generic drugs. These rules are focused on developing mere copies of the originator and securing their prescribability by evidencing average bioequivalence (ABE). The rulings do not secure switchability (unless constraints are imposed that limit the asymmetry of the GMR point estimates). Little is left of the exciting biostatistical debate of the 1990s about alternative approaches such as individual and population BE, except for the now well-established practice of replication designs and reference-scaling for high-variability drugs and drug products. Also, several limitations become apparent when conventional ABE-rulings (as conceived for oral immediate-release products) are amended to cover other regulatory scenarios (e.g. modified-release products and fixed-combination products).

In the following, several such topics are discussed. For each topic, references are presented that guide further reading.

Marketing authorization of multi-origin (generic) drugs is based on the argument of equivalence vs the originator. To this purpose, therapeutic equivalence studies can be waived if pharmacokinetic equivalence can be evidenced (“bioequivalence” [BE]). This is generally based on the testing of average bioequivalence (ABE), i.e. on comparing the 90% confidence interval estimates of the true ratios of the treatment means of the peak and area exposure levels for test:reference relative to a predefined default tolerance zone of 80.00 to 125.00%. Such ABE is the well-established quality mark of in vivo biopharmaceutical performance that is accepted to suffice for the regulatory authorization of generics (prescribability).
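The ABE decision rule described above can be sketched as follows. This is a simplified illustration, not a regulatory-grade analysis: it assumes that the log-scale point estimate and its standard error are already available from the study analysis, and it uses a normal quantile where the actual analysis would use a Student t quantile with the residual degrees of freedom of the ANOVA. The function name `abe_decision` and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def abe_decision(log_gmr, se, lower=0.80, upper=1.25, level=0.90):
    """Return the confidence limits of the T:R ratio and the ABE verdict.

    log_gmr: estimated difference of the log-transformed means (test - reference)
    se:      standard error of that difference
    Assumption: normal quantile instead of the Student t quantile.
    """
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci_low = math.exp(log_gmr - z * se)
    ci_high = math.exp(log_gmr + z * se)
    # ABE is concluded only if the whole interval lies within the tolerance zone
    return ci_low, ci_high, (ci_low >= lower and ci_high <= upper)

# Hypothetical result for one exposure metric (e.g. AUC): GMR 0.95, SE 0.05
ci_low, ci_high, bioequivalent = abe_decision(math.log(0.95), 0.05)
print(f"90% CI: {100 * ci_low:.2f} to {100 * ci_high:.2f} %, ABE: {bioequivalent}")
```

The same check is applied separately to the peak and the area exposure metric; both need to pass.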

Since the early 1990s, the concern is commonly expressed that ABE is not well-suited to evidence comparability between (different) formulations in terms of within-subject switchability (switching R to T rather than repeating R). Other approaches such as individual and population BE might be better suited – biostatistically – to assess interchangeability of multi-origin products within the target population. In the late 1990s, the FDA published several draft guidance documents favoring such alternative approaches. Individual bioequivalence is particularly well-suited for certain drugs by addressing within-subject variabilities, allowing for scaling of the tolerance zone to the reference variability, and by assessing subject-by-formulation interaction. In spite of their biostatistical elegance and mathematical sophistication, these alternatives have never been a real challenge to ABE as the decisive quality mark for the authorization of generics. However, these discussions helped to endorse the use of replication designs and reference scaling (possibly with a point-estimate constraint) for ABE of high-variability drugs (HVD); others might argue that high-variability itself is a sufficient reason not to use ABE (even with replication) for the validation of generic products.

There is no upper limit on the sample size for ABE. Therefore, it is possible to force the interval estimate of the relative exposure levels within the tolerance zone in spite of a large asymmetry of the point estimate by using a very large sample. This asymmetry increases the risk of poor switchability: although generic A (BE point estimate 0.90) and generic B (BE point estimate 1.10) are each equivalent to the reference originator, they are not equivalent to each other (BE point estimate: 1.22 for B to A).
This phenomenon has been referred to as bioavailability or bioequivalence “drifting”.
This issue and the need to resolve possible concerns about it are not addressed in the respective guidelines for the testing of bioequivalence of immediate release or modified release formulations. By the intervention of the Czech regulatory authorities (see Clinical Aspects Of The Development Of Fixed Dose Combination Products by Jiří Haman & Marina Feřtek (SÚKL)), it comes up as a rather enigmatic comment in the guideline on fixed-dose combination with possibly quite broad implications (since little is known about how to resolve this issue).
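Both aspects of the drifting argument above, the effect of sample size and the indirect B-to-A contrast, can be illustrated numerically. The sketch below is an approximation under stated assumptions: a balanced 2×2 crossover with n subjects in total, a hypothetical within-subject standard deviation of 0.30 on the log scale, a standard error of σw·√(2/n), and a normal quantile instead of the Student t quantile. The helper `ci90` is hypothetical.

```python
import math
from statistics import NormalDist

def ci90(gmr, sigma_w, n):
    """Approximate 90% CI of a T:R ratio in a balanced 2x2 crossover."""
    se = sigma_w * math.sqrt(2 / n)   # SE of the log-scale treatment difference
    z = NormalDist().inv_cdf(0.95)    # normal approximation (assumption)
    return gmr * math.exp(-z * se), gmr * math.exp(z * se)

def within_tolerance(ci, lower=0.80, upper=1.25):
    return ci[0] >= lower and ci[1] <= upper

sigma_w = 0.30                  # hypothetical within-subject SD (log scale)
gmr_a, gmr_b = 0.90, 1.10       # point estimates of A and B versus the reference

# An asymmetric point estimate fails with a small sample but passes with a large one
for n in (24, 200):
    ci = ci90(gmr_a, sigma_w, n)
    print(n, [round(x, 3) for x in ci], within_tolerance(ci))

# Indirect contrast between the two generics
gmr_b_vs_a = gmr_b / gmr_a
print(round(gmr_b_vs_a, 2))
```

The indirect point estimate for B versus A lies close to the upper tolerance limit, so an interval estimate around it would be hard to keep within 80.00 to 125.00%.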

Inadequacy of the common ABE practice to resolve concerns about switchability is not just a biostatistical issue. To an important extent, treatment adherence relies on how the patient relates to his/her medication. This depends on the product’s name, package, form, size, smell, taste, etc. These aspects ought to be taken into account when considering switching a patient on chronic therapy to an alternative product. In the EU, this is particularly important, since a generic product and a reference product may be considered to have the same pharmaceutical form if they have the same form of administration as defined by the Pharmacopoeia; furthermore, Article 10(2)(b) of the amended Directive 2001/83/EC [Community Code] provides that the various immediate-release oral forms, i.e. tablets, capsules, oral solutions, and suspensions, are considered to be the same pharmaceutical form for the purposes of Article 10 (see also Eudralex Volume 2A Notice to Applicants – Section


Prescribability and Switchability

Switchability and BA/BE-drifting

Individual BE

{ACPS-CdM: 08.Oct.2020}

Replication Designs and Reference Scaling for High-Variability Drugs

In the EU, fixed-dose combinations (FDC) are regulated by EMA/CHMP/158268/2017 (“Guideline on clinical development of fixed combination medicinal products” – Mar.2017); this guideline is best read in conjunction with the public review comments received when it was drafted (EMA/757184/2015 “Submission of comments on ‘Guideline on Clinical Development of Fixed Combination Medicinal Products’” – Oct.2018). In comparison, in the US, there is an elaborate Guidance for newly developed FDC (FDA Guidance for Industry – Codevelopment Of Two Or More New Investigational Drugs For Use In Combination – Jun.2013), but not for FDC under an ANDA.

Any FDC-product, in the EU, is expected to meet the following three basic scientific requirements:

  • Justification of the pharmacological and medical rationale for the combination
  • Establishment of the evidence base for the:

    • relevant contribution of all active substances to the desired therapeutic effect (efficacy and/or safety):
    • positive benefit-risk for the combination in the targeted indication
  • Demonstration that the evidence presented – if based on combined administration of separate active substances – is relevant to the fixed combination medicinal product for which the application is made

Three therapeutic scenarios are foreseen: add-on treatment, substitution therapy and initial combination treatment; each requires its own type of evidence.

For generic-FDC, the documentation expected to be presented largely depends on i) the scenario of development and authorisation of the reference-FDC (“add-on” | “substitution” – see below) and ii) the extent of pivotal documentation on the efficacy and safety of the combination available for the reference-FDC (or for the loose combination upon which the authorisation of the reference-FDC took reference). Also, the Guideline has introduced several calls for caution (raising concerns that require extra work) that lack precision and distinction (also depending on what may be expected for an “add-on”-FDC as opposed to a “substitution”-FDC).

The need for studies and the type of studies for generic FDC-products is regulated by the FDC-Guideline (EMA/CHMP/158268/2017 “Guideline on clinical development of fixed combination medicinal products” – Mar.2017). Only a relatively short paragraph (section 4.5) is dedicated to generic-FDCs. Understanding this section and its implications is a complex challenge since it is full of riddles.

In order to understand the implications, you need to clarify upfront:

  • Was the reference-FDC developed/authorised as an “add-on” product (acc. Section 4.1 of the FDC-Guideline) or as a “substitution” product (acc. Section 4.2)?
  • Have pivotal phase-3 studies been carried out with the reference-FDC in the indications presently applied for?
    Alternatively: have pivotal phase-3 studies been carried out with the loose combination and has the reference-FDC been shown to be bioequivalent with this loose combination (“formulation” effect – see next)?
  • Do the components carry a high likelihood of interaction (e.g. high protein binding; single elimination pathway; elimination via enteric/hepatic CYP3A4, etc.) or can it be claimed that the likelihood of interactions between the components is negligible?
  • Are any PK studies available (published/accessible) on the reference-FDC product?
    • “Interaction-effect” studies (as for “add-on” products): FDC vs each of its components administered separately: {AB} vs {A} and {AB} vs {B}
    • “Formulation-effect” studies (as for “substitution” products): FDC vs its components administered simultaneously: {AB} vs {A} +{B}

With this in mind, the next step is to read carefully what is specified in the pertinent section of the Guideline:

EMA/CHMP/158268/2017 – Section 4.5 Generic medicinal products

» The development of a generic medicinal product is based on demonstrating bioequivalence with the reference fixed combination medicinal product (see also first paragraph section 4.6) «

See below regarding possible implications of Section 4.6

» This should be demonstrated for all active substances in the fixed combination medicinal product according to the relevant guidelines that are mentioned in section 3. «

In this case, this means that CPMP/EWP/QWP/1401/98 Rev. 1 (“Guideline On The Investigation Of Bioequivalence” – Jan.2010) needs to be complied with.

However, it is our understanding that a straightforward BE-comparison of the generic-FDC with its originator-FDC will only suffice if pivotal phase-3 studies were carried out with the originator-FDC or with the loose combination upon which the authorisation of the originator-FDC took reference. In the absence of such data, the applicant for marketing authorisation of the generic-FDC may be expected to present extra data; however, it is unclear what data this ought to be.

» …. Also, for generic fixed combination medicinal products it needs to be verified that the evidence base that may have been generated for the reference product with individual active substances (rather than with the fixed combination medicinal product, to which reference is being made) applies to the generic fixed combination medicinal product. In this case, two pharmacokinetics bridges may need to be built, one between the reference fixed combination medicinal product and its active substances and one between the generic and reference fixed combination medicinal product. «

For FDCs that are positioned as “substitution” this would mean “two pharmacokinetics bridges may need to be built, one between the reference fixed combination medicinal product and its active substances taken simultaneously” (i.e. testing for equivalence or absence of formulation effect).

In contrast, for FDCs developed as “add-on” the requirement would need to mean “two pharmacokinetics bridges may need to be built, one between the reference fixed combination medicinal product and its active substances taken separately” (i.e. testing for interaction).

In both cases, the requirement imposed on the generic is odd since it implies doing, repeating or extending the work that ought to have been done by the originator in the first place.

The last sentence of section 4.5 raises even more questions:
» A justification should be provided why ‘drifting’ of bioavailability is not considered relevant and hence why the original demonstration of efficacy and safety is relevant to the generic «

This sentence “hangs” without context. To understand it, one needs to look at EMA/757184/2015 (“Submission of comments on ‘Guideline on Clinical Development of Fixed Combination Medicinal Products’ (EMA/CHMP/281825/2015)” – Oct.2018), which reports this concern extensively. It relates to shifts in the between-treatment contrasts of complex bioequivalence series; in simplest terms: generic-A is equivalent with R (GMR for A/R: 0.90), generic-B is equivalent with R (GMR for B/R: 1.10); since both are equivalent with the reference, they are “prescribable”; however, they are obviously not “switchable” (GMR for B/A: 1.22 – see also Section 5.4.2).
Applied in the present context, this might mean:
i) reference-FDC{AB} is equivalent with {A}+{B} (the loose combination of both components given simultaneously)
ii) test-FDC{AB} is equivalent with reference-FDC{AB}
→ can this be accepted to mean that test-FDC{AB} is equivalent with {A}+{B}?
It is our understanding that this might (only) be resolved without an extra study (like the one recommended in Section 4.6 of the FDC-Guideline) if the contrasts for i) and ii) are not too asymmetric (i.e. their GMR Point Estimates are close to 1.00).

However, in some PARs we found the awkward statement that BA/BE drifting was not a problem since pivotal studies had been undertaken with the reference (we wonder how the assessor came to this intriguing conclusion).

This hurdle is repeated in the next section of the FDC-Guideline:

EMA/CHMP/158268/2017 – 4.6. Bridging the evidence base to the fixed combination medicinal product

» Clinical data establishing the contribution of each active substance and the positive benefit-risk are often obtained from the combined use of individual active substances. In this case demonstration of similar pharmacokinetics (usually through demonstrating bioequivalence) of the fixed combination medicinal product versus its individual active substances taken simultaneously is required. This is to satisfy the third basic requirement for an MAA for fixed combination medicinal products «

Irrespective of the type of FDC (“add-on” or “substitution”) this means that the test-FDC{AB} needs to be tested for equivalence (i.e. absence of formulation effects) with the loose combination of the components {A} and {B} that founded the authorisation of the originator administered simultaneously. Obviously, it may prove very difficult to identify the precise products that were used to this purpose. Furthermore, when taking reference on the safety of {A} or {B}, this cannot be confined to data on the combined use of {A}+{B} alone; unavoidably, reference will also need to be taken on the use of {A} or {B} alone.

In consequence, a generic-FDC needs to be tested for bioequivalence vs the originator-FDC. Extra studies may be demanded if no pivotal efficacy and safety studies were carried out with the originator-FDC product or with the loose combination of its components upon which the authorisation of the originator-FDC took reference. This could mean:

  • A study of the originator-FDC vs its components
    • administered simultaneously if the FDC is based upon a “substitution” scenario (i.e. testing for “no” formulation effect [BE])
    • administered separately if the FDC is based upon an “add-on” scenario (i.e. testing for interaction [may need a BE approach – and BE sample size – if it is intended to demonstrate the absence of any interaction, but may also be conceived descriptively])
  • A BE-study of the generic-FDC vs the loose combination of the components upon which the authorisation of the originator-FDC took reference (possibly also to resolve concerns about “drifting” bioequivalence)

We think that both sections 4.5 and 4.6 of the FDC-Guideline ought to have been less confused & confusing. Applicants and sponsors would have benefited from better guidance.

{ACPS-CdM: 10.Oct.2020 | U: 25.Jan.2021}

Applied Statistics

In Applied Clinical Pharmacology we work with data. Within a GCP-framework, we install and manage risk-based Quality Management Systems, and operate within such RB-QMS to ensure that trial data is “credible” (per ICH GCP E6(R2)), i.e. fit to serve a tangible worthwhile (scientific, medical/ethical) purpose. Accordingly, we have a first-line involvement on how data are selected, weighted, collected, recorded, verified, processed and reported. Even if important parts of this responsibility reside with the sponsor and the sponsor may choose to outsource such services, a CP-investigator may be expected to have a sound working knowledge of statistics in order to communicate and collaborate efficiently with these expert support groups and/or statistical co-investigators.

In the following, we present a series of statistical topics that have implications on how we, as non-statisticians, communicate/collaborate with statisticians. Unavoidably, one of the hot issues in this domain is the “tyranny of statistical significance” and the “need for a ban on p-values!” This ought to be high on the list of our priorities also in the perspective of whether and how these issues might be resolved by estimation (confidence intervals) rather than hypothesis testing.
This is an ongoing project and more topics will follow (noninferiority, staged designs, multiplicity and hierarchy of outcome criteria, etc.).

The topic is a bit ‘nerdy’. But we try to lighten this up by using collapsible text items. In the list of references “recommended for reading”, we have added a click button to each reference that allows you to access the abstract offline; in addition, we have added links to abstract or full paper pdfs online.

General statistical resources

Sample size calculators

FARTSSIE: “Free Analysis Research Tool for Sample Size Iterative Estimation”.

Statistical fragility Index
The fragility index is a measure of the robustness (or fragility) of the results of a clinical trial.
The fragility index is a number indicating how many patients would need to have a different outcome (e.g. an event instead of a non-event) to convert a trial from being statistically significant to not significant (p ≥ 0.05). The larger the fragility index, the more robust a trial’s results are. The fragility index is intended to be used in conjunction with the P-value, the 95% confidence interval, and various measures describing benefit or risk (relative risk reduction, absolute risk reduction, etc.).
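A minimal sketch of how such an index can be computed for a two-arm trial with a binary outcome is given below. It rests on assumptions that are ours for illustration: significance is judged with a two-sided Fisher exact test, and outcomes are changed one patient at a time (non-event to event) in the arm with fewer events, keeping the arm sizes fixed, until p ≥ 0.05. The function names and the trial numbers are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    n = r1 + r2

    def prob(x):
        # Hypergeometric probability of x events in row 1 given fixed margins
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum over all tables that are no more probable than the observed one
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Number of outcome changes needed to lose significance (None if p >= alpha)."""
    # Work on the arm with fewer events
    if events_1 > events_2:
        events_1, n_1, events_2, n_2 = events_2, n_2, events_1, n_1
    if fisher_exact_two_sided(events_1, n_1 - events_1, events_2, n_2 - events_2) >= alpha:
        return None
    flips = 0
    while (events_1 < n_1 and
           fisher_exact_two_sided(events_1, n_1 - events_1,
                                  events_2, n_2 - events_2) < alpha):
        events_1 += 1   # one patient changes from non-event to event
        flips += 1
    return flips

# Hypothetical trial: 1/100 events on treatment versus 15/100 on control
print(fragility_index(1, 100, 15, 100))
```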

False Positive Risk (FPR) acc. Colquhoun (2019)
Colquhoun (2019) proposes continuing the use of continuous p-values, but only in conjunction with the “false positive risk (FPR).” The FPR answers the question, “If you observe a ‘significant’ p-value after doing a single unbiased experiment, what is the probability that your result is a false positive?” It tells you what most people mistakenly still think the p-value does, Colquhoun says. The problem, however, is that to calculate the FPR you need to specify the prior probability that an effect is real, and it’s rare to know this. Colquhoun suggests that the FPR could be calculated with a prior probability of 0.5, the largest value reasonable to assume in the absence of hard prior data. The FPR found this way is in a sense the minimum false positive risk (mFPR); less plausible hypotheses (prior probabilities below 0.5) would give even bigger FPRs, Colquhoun says, but the mFPR would be a big improvement on reporting a p-value alone. He points out that p-values near 0.05 are, under a variety of assumptions, associated with minimum false positive risks of 20–30%, which should stop a researcher from making too big a claim about the “statistical significance” of such a result.
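Why the prior probability matters can be shown with the simplest form of this calculation. Note the assumptions: the sketch below is the plain “p-less-than” version based on the significance level and an assumed power of 0.80; it is not Colquhoun’s own “p-equals” calculation, which conditions on the observed p-value and yields the larger figures quoted above. The function name is hypothetical.

```python
def false_positive_risk(alpha, power, prior):
    """Share of 'significant' results that are false positives.

    alpha: significance level
    power: probability of a significant result when the effect is real (assumed)
    prior: prior probability that the effect is real
    """
    false_positives = alpha * (1 - prior)   # no real effect, yet significant
    true_positives = power * prior          # real effect, detected
    return false_positives / (false_positives + true_positives)

# The less plausible the hypothesis, the larger the false positive risk
for prior in (0.5, 0.1, 0.01):
    fpr = false_positive_risk(alpha=0.05, power=0.80, prior=prior)
    print(f"prior {prior}: FPR {fpr:.2f}")
```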


{27.Mar.2018 U: 22.Feb.2020 | ACPS-CdM}

I keep six honest serving-men (They taught me all I knew). Their names are What and Why and When and How and Where and Who. Rudyard Kipling (1902)- “The Elephant’s Child” from Just So Stories

In experimental clinical pharmacology, we usually generate rather than test (verify, i.e. falsify) hypotheses. We are not often directly exposed to what Stang et al. 2010 termed the “tyranny of statistical significance”, which they described as:

» It has been stated that the P-value is perhaps the most misunderstood statistical concept in clinical research. As in the social sciences, the tyranny of SST is still highly prevalent in the biomedical literature even after decades of warnings against statistical significance testing (SST). The ubiquitous misuse and tyranny of SST threatens scientific discoveries and may even impede scientific progress. In the worst case, misuse of significance testing may even harm patients who eventually are incorrectly treated because of improper handling of P-values. «

Growing out of and beyond the relatively secure professional environment of a research job and entering the privileged charade of decision making in pharmaceutical drug development, I became increasingly exposed to this “tyranny”, although I challenged it most stubbornly. I soon discovered that whatever my seniority, my fight against idle p-values was very much in vain. P-values mattered with stakeholders and stockholders; they mattered to, and with, regulators; they were crucial for investigators who wanted to publish their trial data, etc. A P value on the right side of whatever critical level makes or breaks drugs, efforts, and people.

» The p-value quantifies the probability of observing results at least as extreme as the ones observed given that the null hypothesis is true. It is then compared against a pre-determined significance level (α). If the reported p-value is smaller than α the result is considered statistically significant. « {Vidgen B, Yasseri T. P-Values: Misunderstood and Misused. Frontiers Physics 2016}

» Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g. the sample mean difference between two compared groups) would be equal to or more extreme than its observed value «{Wasserstein RL, Lazar NA. The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician 2016}
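The definition can be made concrete with an exact permutation test for the difference between two group means. This is a sketch under an assumption we chose for illustration: the “specified statistical model” is that group labels are exchangeable (no treatment effect), so every re-allocation of the observations to the two groups is equally likely. The p-value is then the share of re-allocations whose mean difference is equal to or more extreme than the observed one. The function name and data are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))

    as_extreme = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        # Count re-allocations at least as extreme as the observed difference
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            as_extreme += 1
        total += 1
    return as_extreme / total

# Hypothetical data: complete separation of two groups of three
print(permutation_p_value([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

With three subjects per group there are only 20 possible allocations, so even complete separation of the groups cannot yield a two-sided p-value below 0.10. The p-value describes the data under the assumed model; it is not a statement about the probability of the hypothesis.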

Note: unsure whether you should write “P value”, “p value”, “P value”, or “P-value”? – See: GraphPad KNOWLEDGEBASE – ARTICLE #1885

Thus, P-values mean something other than what most users and readers might think. It is easier to understand the issue by discussing what P isn’t and how it ought not to be used, as per Greenland:

  • The P-value is the probability that the test hypothesis is true; e.g. if a test of the null hypothesis gave P = 0.01, the null hypothesis has only a 1 % chance of being true; if instead it gave P = 0.40, the null hypothesis has a 40 % chance of being true. No!

    The P-value assumes the test hypothesis is true—it is not a hypothesis probability and may be far from any reasonable probability for the test hypothesis. The P value simply indicates the degree to which the data conform to the pattern predicted by the test hypothesis and all the other assumptions used in the test (the underlying statistical model). Thus P = 0.01 would indicate that the data are not very close to what the statistical model (including the test hypothesis) predicted they should be, while P = 0.40 would indicate that the data are much closer to the model prediction, allowing for chance variation
  • The P-value for the null hypothesis is the probability that chance alone produced the observed association; for example, if the P value for the null hypothesis is 0.08, there is an 8% probability that chance alone produced the association. No!

    To say that chance alone produced the observed association is logically equivalent to asserting that every assumption used to compute the P value is correct, including the null hypothesis.
  • A significant test result (P ≤ 0.05) means that the test hypothesis is false or should be rejected. No!

    A small P-value simply flags the data as being unusual if all the assumptions used to compute it (including the test hypothesis) were correct; it may be small because there was a large random error or because some assumption other than the test hypothesis was violated (for example, the assumption that this P value was not selected for presentation because it was below 0.05). P ≤ 0.05 only means that a discrepancy from the hypothesis prediction (e.g., no difference between treatment groups) would be as large or larger than that observed no more than 5 % of the time if only chance were creating the discrepancy (as opposed to a violation of the test hypothesis or a mistaken assumption).
  • A nonsignificant test result (P > 0.05) means that the test hypothesis is true or should be accepted. No!

    A large P-value only suggests that the data are not unusual if all the assumptions used to compute the P-value (including the test hypothesis) were correct. The same data would also not be unusual under many other hypotheses. Furthermore, even if the test hypothesis is wrong, the P-value may be large because it was inflated by a large random error or because of some other erroneous assumption. P > 0.05 only means that a discrepancy from the hypothesis prediction (e.g., no difference between treatment groups) would be as large or larger than that observed more than 5 % of the time if only chance were creating the discrepancy.
  • A large P-value is evidence in favor of the test hypothesis. No!

    In fact, any P value less than 1 implies that the test hypothesis is not the hypothesis most compatible with the data, because any other hypothesis with a larger P-value would be even more compatible with the data. A P value cannot be said to favor the test hypothesis except in relation to those hypotheses with smaller P values. Furthermore, a large P-value often indicates only that the data are incapable of discriminating among many competing hypotheses (as would be seen immediately by examining the range of the confidence interval). For example, many authors will misinterpret P = 0.70 from a test of the null hypothesis as evidence for no effect, when in fact it indicates that, even though the null hypothesis is compatible with the data under the assumptions used to compute the P-value, it is not the hypothesis most compatible with the data—that honor would belong to a hypothesis with P = 1. But even if P = 1, there will be many other hypotheses that are highly consistent with the data, so that a definitive conclusion of “no association” cannot be deduced from a P-value, no matter how large.
  • A null-hypothesis P-value greater than 0.05 means that no effect was observed, or that absence of an effect was shown or demonstrated. No!

    Observing P > 0.05 for the null hypothesis only means that the null is one among the many hypotheses that have P > 0.05. Thus, unless the point estimate (observed association) equals the null value exactly, it is a mistake to conclude from P > 0.05 that a study found “no association” or “no evidence” of an effect. If the null P-value is less than 1 some association must be present in the data, and one must look at the point estimate to determine the effect size most compatible with the data under the assumed model.
  • Statistical significance indicates a scientifically or substantively important relation has been detected. No!

    Especially when a study is large, very minor effects or small assumption violations can lead to statistically significant tests of the null hypothesis. Again, a small null P-value simply flags the data as being unusual if all the assumptions used to compute it (including the null hypothesis) were correct; but the way the data are unusual might be of no clinical interest. One must look at the confidence interval to determine which effect sizes of scientific or other substantive (e.g., clinical) importance are relatively compatible with the data, given the model.
  • Lack of statistical significance indicates that the effect size is small. No!

    Especially when a study is small, even large effects may be “drowned in noise” and thus fail to be detected as statistically significant by a statistical test. A large null P-value simply flags the data as not being unusual if all the assumptions used to compute it (including the test hypothesis) were correct; but the same data will also not be unusual under many other models and hypotheses besides the null. Again, one must look at the confidence interval to determine whether it includes effect sizes of importance.
  • The P-value is the chance of our data occurring if the test hypothesis is true; for example, P = 0.05 means that the observed association would occur only 5 % of the time under the test hypothesis. No!

    The P-value refers not only to what we observed, but also observations more extreme than what we observed (where “extremity” is measured in a particular way). And again, the P-value refers to a data frequency when all the assumptions used to compute it are correct. In addition to the test hypothesis, these assumptions include randomness in sampling, treatment assignment, loss, and missingness, as well as an assumption that the P-value was not selected for presentation based on its size or some other aspect of the results.
  • If you reject the test hypothesis because P ≤ 0.05, the chance you are in error (the chance your “significant finding” is a false positive) is 5 %. No! - Tick for more information

    To see why this description is false, suppose the test hypothesis is in fact true. Then, if you reject it, the chance you are in error is 100 %, not 5 %. The 5 % refers only to how often you would reject it, and therefore be in error, over very many uses of the test across different studies when the test hypothesis and all other assumptions used for the test are true. It does not refer to your single use of the test, which may have been thrown off by assumption violations as well as random errors. This is yet another version of misinterpretation #1.
  • P = 0.05 and P ≤ 0.05 mean the same thing. No! - Tick for more information

    This is like saying reported height = 2 m and reported height ≤2 m are the same thing: “height = 2 m” would include few people and those people would be considered tall, whereas “height ≤2 m” would include most people including small children. Similarly, P = 0.05 would be considered a borderline result in terms of statistical significance, whereas P ≤ 0.05 lumps borderline results together with results very incompatible with the model (e.g., P = 0.0001) thus rendering its meaning vague, for no good purpose.
  • P-values are properly reported as inequalities (e.g., report ‘P < 0.02’ when P = 0.015 or report ‘P > 0.05’ when P = 0.06 or P = 0.70). No! - Tick for more information

    This is bad practice because it makes it difficult or impossible for the reader to accurately interpret the statistical result. Only when the P-value is very small (e.g., under 0.001) does an inequality become justifiable: There is little practical difference among very small P-values when the assumptions used to compute P-values are not known with enough certainty to justify such precision, and most methods for computing P-values are not numerically accurate below a certain point.
  • Statistical significance is a property of the phenomenon being studied, and thus statistical tests detect significance. No! - Tick for more information

    This misinterpretation is promoted when researchers state that they have or have not found ‘evidence of’ a statistically significant effect. The effect being tested either exists or does not exist. ‘Statistical significance’ is a dichotomous description of a P-value (that it is below the chosen cut-off) and thus is a property of a result of a statistical test; it is not a property of the effect or population being studied.
  • One should always use two-sided P-values. No! - Tick for more information

    Two-sided P-values are designed to test hypotheses that the targeted effect measure equals a specific value (e.g., zero), and is neither above nor below this value. When, however, the test hypothesis of scientific or practical interest is a one-sided (dividing) hypothesis, a one-sided P-value is appropriate. For example, consider the practical question of whether a new drug is at least as good as the standard drug for increasing survival time. This question is one-sided, so testing this hypothesis calls for a one-sided P-value. Nonetheless, because two-sided P-values are the usual default, it will be important to note when and why a one-sided P-value is being used instead.

Some of the most severe distortions of the scientific literature produced by statistical testing involve erroneous comparison and synthesis of results from different studies or study subgroups. Among the worst are:

When the same hypothesis is tested in different studies and none or a minority of the tests are statistically significant (all P > 0.05), the overall evidence supports the hypothesis. No! - Tick for more details

This belief is often used to claim that a literature supports no effect when the opposite is the case. It reflects a tendency of researchers to “overestimate the power of most research”. In reality, every study could fail to reach statistical significance and yet when combined show a statistically significant association and persuasive evidence of an effect. For example, if there were five studies each with P = 0.10, none would be significant at the 0.05 level; but when these P-values are combined using the Fisher formula, the overall P-value would be 0.01. There are many real examples of persuasive evidence for important effects when few studies or even no study reported “statistically significant” associations. Thus, lack of statistical significance of individual studies should not be taken as implying that the totality of evidence supports no effect.
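The five-study example can be checked numerically. The following is a minimal sketch, assuming the "Fisher formula" refers to Fisher's method for combining independent P-values (the statistic −2 Σ ln p compared against a chi-square distribution with 2k degrees of freedom); the function name is illustrative and not taken from the cited paper.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent P-values with Fisher's method (illustrative helper).

    X = -2 * sum(ln p_i) follows a chi-square distribution with 2k degrees
    of freedom when all k test hypotheses and other assumptions hold.
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # this is X / 2
    # Closed-form chi-square upper tail for an even number (2k) of degrees
    # of freedom: P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    return math.exp(-half_x) * sum(
        half_x ** i / math.factorial(i) for i in range(k)
    )

# Five studies, each with P = 0.10 (none "significant" at the 0.05 level)
combined = fisher_combined_p([0.10] * 5)
print(f"combined P = {combined:.3f}")
```

The combined value comes out close to 0.01, in line with the figure quoted in the text.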

When the same hypothesis is tested in two different populations and the resulting P-values are on opposite sides of 0.05, the results are conflicting. No! - Tick for more details

Statistical tests are sensitive to many differences between study populations that are irrelevant to whether their results are in agreement, such as the sizes of compared groups in each population. As a consequence, two studies may provide very different P-values for the same test hypothesis and yet be in perfect agreement (e.g., may show identical observed associations). For example, suppose we had two randomized trials A and B of a treatment, identical except that trial A had a known standard error of 2 for the mean difference between treatment groups whereas trial B had a known standard error of 1 for the difference. If both trials observed a difference between treatment groups of exactly 3, the usual normal test would produce P = 0.13 in A but P = 0.003 in B. Despite their difference in P-values, the test of the hypothesis of no difference in effect across studies would have P = 1, reflecting the perfect agreement of the observed mean differences from the studies. Differences between results must be evaluated directly, for example by estimating and testing those differences to produce a confidence interval and a P-value comparing the results (often called analysis of heterogeneity, interaction, or modification).
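The numbers for trials A and B can be reproduced with a short calculation. This is a sketch under the stated assumptions of the example (normal estimates with known standard errors, independent trials); the helper names are hypothetical.

```python
import math

def two_sided_p(estimate, se, null_value=0.0):
    """Two-sided P-value from the usual normal (z) test."""
    z = (estimate - null_value) / se
    return math.erfc(abs(z) / math.sqrt(2))

def heterogeneity_p(est_a, se_a, est_b, se_b):
    """Test of no difference in effect between two independent studies."""
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)  # SE of the difference
    return two_sided_p(est_a - est_b, se_diff)

p_a = two_sided_p(3, 2)              # trial A: difference 3, SE 2
p_b = two_sided_p(3, 1)              # trial B: difference 3, SE 1
p_het = heterogeneity_p(3, 2, 3, 1)  # A versus B

print(f"P(A) = {p_a:.2f}, P(B) = {p_b:.3f}, P(A vs B) = {p_het:.2f}")
```

The two trials land on opposite sides of 0.05, while the direct comparison of their estimates gives P = 1.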

When the same hypothesis is tested in two different populations and the same P-values are obtained, the results are in agreement. No! - Tick for more details

Again, tests are sensitive to many differences between populations that are irrelevant to whether their results are in agreement. Two different studies may even exhibit identical P-values for testing the same hypothesis yet also exhibit clearly different observed associations. For example, suppose randomized experiment A observed a mean difference between treatment groups of 3.00 with standard error 1.00, while B observed a mean difference of 12.00 with standard error 4.00. Then the standard normal test would produce P = 0.003 in both; yet the test of the hypothesis of no difference in effect across studies gives P = 0.03, reflecting the large difference (12.00 − 3.00 = 9.00) between the mean differences.

If one observes a small P-value, there is a good chance that the next study will produce a P-value at least as small for the same hypothesis. No! - Tick for more details

This is false even under the ideal condition that both studies are independent and all assumptions including the test hypothesis are correct in both studies. In that case, if (say) one observes P = 0.03, the chance that the new study will show P ≤ 0.03 is only 3 %; thus the chance the new study will show a P-value as small or smaller (the “replication probability”) is exactly the observed P-value! If on the other hand the small P-value arose solely because the true effect exactly equaled its observed estimate, there would be a 50 % chance that a repeat experiment of identical design would have a larger P-value. In general, the size of the new P-value will be extremely sensitive to the study size and the extent to which the test hypothesis or other assumptions are violated in the new study; in particular, P may be very small or very large depending on whether the study and the violations are large or small.
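Both replication figures (3 % and 50 %) follow from the normal model. The sketch below assumes a two-sided z-test and a repeat study of identical design whose z-statistic is normal with unit variance; the function name and the `true_z` parameter are illustrative assumptions, not terms from the source.

```python
from statistics import NormalDist

std_normal = NormalDist()

def prob_new_p_at_most(p_obs, true_z):
    """Chance that a new independent study gives a two-sided P <= p_obs,
    when its z-statistic is normal with mean true_z and unit variance."""
    z_crit = std_normal.inv_cdf(1 - p_obs / 2)
    new_z = NormalDist(mu=true_z, sigma=1.0)
    # P <= p_obs exactly when |z| >= z_crit
    return new_z.cdf(-z_crit) + (1 - new_z.cdf(z_crit))

p_obs = 0.03
z_obs = std_normal.inv_cdf(1 - p_obs / 2)

# Case 1: the test hypothesis is true (true standardized effect is zero)
under_null = prob_new_p_at_most(p_obs, true_z=0.0)

# Case 2: the true effect exactly equals the observed estimate
effect_equals_estimate = prob_new_p_at_most(p_obs, true_z=z_obs)

print(f"test hypothesis true:   {under_null:.3f}")
print(f"effect equals estimate: {effect_equals_estimate:.3f}")
```

In the first case the replication probability equals the observed P-value; in the second it is about one half, so a larger P-value is roughly as likely as a smaller one.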

Caution is also required with regard to misinterpretations of confidence intervals as providing sharp answers when none are warranted. The hypothesis which says the point estimate is the correct effect will have the largest P-value (P = 1 in most cases), and hypotheses inside a confidence interval will have higher P-values than hypotheses outside the interval. The P-values will vary greatly, however, among hypotheses inside the interval, as well as among hypotheses on the outside. Also, two hypotheses may have nearly equal P-values even though one of the hypotheses is inside the interval and the other is outside. Thus, if we use P-values to measure compatibility of hypotheses with data and wish to compare hypotheses with this measure, we need to examine their P-values directly, not simply ask whether the hypotheses are inside or outside the interval. This need is particularly acute when (as usual) one of the hypotheses under scrutiny is a null hypothesis.

The specific 95% confidence interval presented by a study has a 95% chance of containing the true effect size. No! - Tick for more information

A reported confidence interval is a range between two numbers. The frequency with which an observed interval (e.g., 0.72–2.88) contains the true effect is either 100% if the true effect is within the interval or 0% if not; the 95 % refers only to how often 95% confidence intervals computed from very many studies would contain the true size if all the assumptions used to compute the intervals were correct. It is possible to compute an interval that can be interpreted as having 95% probability of containing the true value; nonetheless, such computations require not only the assumptions used to compute the confidence interval,  but also further assumptions about the size of effects in the model. These further assumptions are summarized in what is called a prior distribution, and the resulting intervals are usually called Bayesian posterior (or credible) intervals to distinguish them from confidence intervals.

An effect size outside the 95% confidence interval has been refuted (or excluded) by the data. No! - Tick for more information

As with the P-value, the confidence interval is computed from many assumptions, the violation of which may have led to the results. Thus it is the combination of the data with the assumptions, along with the arbitrary 95 % criterion, that are needed to declare an effect size outside the interval is in some way incompatible with the observations. Even then, judgements as extreme as saying the effect size has been refuted or excluded will require even stronger conditions.

If two confidence intervals overlap, the difference between two estimates or studies is not significant. No! - Tick for more information

The 95 % confidence intervals from two subgroups or studies may overlap substantially and yet the test for difference between them may still produce P < 0.05. Suppose for example, two 95 % confidence intervals for means from normal populations with known variances are (1.04, 4.96) and (4.16, 19.84); these intervals overlap, yet the test of the hypothesis of no difference in effect across studies gives P = 0.03. As with P-values, comparison between groups requires statistics that directly test and estimate the differences across groups. It can, however, be noted that if the two 95 % confidence intervals fail to overlap, then when using the same assumptions used to compute the confidence intervals we will find P < 0.05 for the difference; and if one of the 95 % intervals contains the point estimate from the other group or study, we will find P > 0.05 for the difference.
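The overlap example can be verified by recovering each estimate and standard error from its interval and then testing the difference directly. This is a minimal sketch assuming symmetric normal-theory intervals (estimate ± 1.96 standard errors) and independent studies; the helper name is hypothetical.

```python
import math

Z95 = 1.959963984540054  # 97.5th percentile of the standard normal

def estimate_and_se(lower, upper):
    """Recover point estimate and SE from a symmetric 95 % interval."""
    return (lower + upper) / 2, (upper - lower) / (2 * Z95)

ci_1 = (1.04, 4.96)
ci_2 = (4.16, 19.84)

est_1, se_1 = estimate_and_se(*ci_1)  # about 3 and 1
est_2, se_2 = estimate_and_se(*ci_2)  # about 12 and 4

# The intervals overlap if each starts before the other ends
overlap = ci_1[0] <= ci_2[1] and ci_2[0] <= ci_1[1]

# Direct test of the difference between the two estimates
z_diff = (est_2 - est_1) / math.sqrt(se_1 ** 2 + se_2 ** 2)
p_diff = math.erfc(abs(z_diff) / math.sqrt(2))

print(f"overlap: {overlap}, P for difference = {p_diff:.2f}")
```

The intervals overlap, yet the direct test of the difference gives P of about 0.03, which is why overlap alone cannot stand in for a comparison.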

An observed 95% confidence interval predicts that 95 % of the estimates from future studies will fall inside the observed interval. No! - Tick for more information

This statement is wrong in several ways. Most importantly, under the model, 95 % is the frequency with which other unobserved intervals will contain the true effect, not how frequently the one interval being presented will contain future estimates. In fact, even under ideal conditions the chance that a future estimate will fall within the current interval will usually be much less than 95 %. For example, if two independent studies of the same quantity provide unbiased normal point estimates with the same standard errors, the chance that the 95 % confidence interval for the first study contains the point estimate from the second is 83 % (which is the chance that the difference between the two estimates is less than 1.96 standard errors). Again, an observed interval either does or does not contain the true effect; the 95 % refers only to how often 95 % confidence intervals computed from very many studies would contain the true effect if all the assumptions used to compute the intervals were correct.
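The 83 % figure can be derived from the fact that the difference between two independent estimates with equal standard errors has a standard error √2 times as large. A short sketch, assuming unbiased normal estimates as in the example (the function name is illustrative):

```python
import math
from statistics import NormalDist

def prob_ci_contains_next_estimate(level=0.95):
    """Chance that a confidence interval from study 1 contains the point
    estimate from an independent study 2 with the same standard error."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - (1 - level) / 2)  # 1.96 for a 95 % interval
    # The difference of the two estimates has SE sqrt(2) * se, so it lies
    # within z * se exactly when a standard normal lies within z / sqrt(2).
    return 2 * std_normal.cdf(z / math.sqrt(2)) - 1

capture = prob_ci_contains_next_estimate(0.95)
print(f"chance = {capture:.2f}")
```

The result is about 0.83, well short of the 95 % that the misinterpretation would suggest.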

If one 95% confidence interval includes the null value and another excludes that value, the interval excluding the null is the more precise one. No! - Tick for more information

When the model is correct, precision of statistical estimation is measured directly by confidence interval width (measured on the appropriate scale). It is not a matter of inclusion or exclusion of the null or any other value. Consider two 95 % confidence intervals for a difference in means, one with limits of 5 and 40, the other with limits of −5 and 10. The first interval excludes the null value of 0, but is 35 units wide. The second includes the null value, but is less than half as wide (15 units) and therefore much more precise.

Also, 95 % confidence intervals force the 0.05-level cutoff on the reader, lumping together all effect sizes with P > 0.05, and in this way are as bad as presenting P-values as dichotomies. Nonetheless, many authors agree that confidence intervals are superior to tests and P-values because they allow one to shift focus away from the null hypothesis, toward the full range of effect sizes compatible with the data—a shift recommended by many authors and a growing number of journals.

On the need for caution with P-values and SST

Stang A, Poole C, Kuss O. The ongoing tyranny of statistical significance testing in biomedical research. Eur J Epidemiol. 2010 Apr;25(4):225-30. {ABS-Link} Click for Details

» Since its introduction into the biomedical literature, statistical significance testing (abbreviated as SST) caused much debate. The aim of this perspective article is to review frequent fallacies and misuses of SST in the biomedical field and to review a potential way out of the fallacies and misuses associated with SSTs. Two frequentist schools of statistical inference merged to form SST as it is practised nowadays: the Fisher and the Neyman-Pearson school. The P-value is both reported quantitatively and checked against the alpha-level to produce a qualitative dichotomous measure (significant/nonsignificant). However, a P-value mixes the estimated effect size with its estimated precision. Obviously, it is not possible to measure these two things with one single number. For the valid interpretation of SSTs, a variety of presumptions and requirements have to be met. We point here to four of them: study size, correct statistical model, correct causal model, and absence of bias and confounding. It has been stated that the P-value is perhaps the most misunderstood statistical concept in clinical research. As in the social sciences, the tyranny of SST is still highly prevalent in the biomedical literature even after decades of warnings against SST. The ubiquitous misuse and tyranny of SST threatens scientific discoveries and may even impede scientific progress. In the worst case, misuse of significance testing may even harm patients who eventually are incorrectly treated because of improper handling of P-values. For a proper interpretation of study results, both estimated effect size and estimated precision are necessary ingredients «

Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. Eur J Epidemiol. 2016; 31: 337–350 {TXT-Link} Click for Details

» Misinterpretation and abuse of statistical tests, confidence intervals, and statistical power have been decried for decades, yet remain rampant. A key problem is that there are no interpretations of these concepts that are at once simple, intuitive, correct, and foolproof. Instead, correct use and interpretation of these statistics requires an attention to detail which seems to tax the patience of working scientists. This high cognitive demand has led to an epidemic of shortcut definitions and interpretations that are simply wrong, sometimes disastrously so—and yet these misinterpretations dominate much of the scientific literature. In light of this problem, we provide definitions and a discussion of basic statistics that are more general and critical than typically found in traditional introductory expositions. Our goal is to provide a resource for instructors, researchers, and consumers of statistics whose knowledge of statistical theory and technique may be limited but who wish to avoid and spot misinterpretations. We emphasize how violation of often unstated analysis protocols (such as selecting analyses for presentation based on the P values they produce) can lead to small P values even if the declared test hypothesis is correct, and can lead to large P values even if that hypothesis is incorrect. We then provide an explanatory list of 25 misinterpretations of P values, confidence intervals, and power. We conclude with guidelines for improving statistical interpretation and reporting «

Chavalarias D, Wallach JD, Li AH, Ioannidis JP. Evolution of Reporting P Values in the Biomedical Literature, 1990-2015. JAMA. 2016 Mar 15;315(11):1141-8 {TXT-Link} Click for Details

» In this analysis of P values reported in MEDLINE abstracts and in PMC articles from 1990-2015, more MEDLINE abstracts and articles reported P values over time, almost all abstracts and articles with P values reported statistically significant results, and, in a subgroup analysis, few articles included confidence intervals, Bayes factors, or effect sizes. Rather than reporting isolated P values, articles should include effect sizes and uncertainty metrics «

Vidgen B, Yasseri T. P-Values: Misunderstood and Misused. Frontiers Physics 2016; 4: Article 8 {TXT-Link} Click for Details

» P-values are widely used in both the social and natural sciences to quantify the statistical significance of observed results. The recent surge of big data research has made the p-value an even more popular tool to test the significance of a study. However, substantial literature has been produced critiquing how p-values are used and understood. In this paper we review this recent critical literature, much of which is rooted in the life sciences, and consider its implications for social scientific research. We provide a coherent picture of what the main criticisms are, and draw together and disambiguate common themes. In particular, we explain how the False Discovery Rate is calculated, and how this differs from a p-value. We also make explicit the Bayesian nature of many recent criticisms, a dimension that is often underplayed or ignored. We conclude by identifying practical steps to help remediate some of the concerns identified. We recommend that (i) far lower significance levels are used, such as 0.01 or 0.001, and (ii) p-values are interpreted contextually, and situated within both the findings of the individual study and the broader field of inquiry (through, for example, meta-analyses) «

» The p-value quantifies the probability of observing results at least as extreme as the ones observed given that the null hypothesis is true. It is then compared against a pre-determined significance level (α). If the reported p-value is smaller than α the result is considered statistically significant. «

Wasserstein RL, Lazar NA. The ASA Statement on p-Values: Context, Process, and Purpose. The American Statistician 2016;70: 129-133 {TXT-Link} Click for Details

» Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (e.g. the sample mean difference between two compared groups) would be equal to or more extreme than its observed value «

  • P-values can indicate how incompatible the data are with a specified statistical model.
  • P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  • Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  • Proper inference requires full reporting and transparency.
  • A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  • By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

Kang J, Hong J, Esie P, Bernstein KT, Aral S. An Illustration of Errors in Using the P Value to Indicate Clinical Significance or Epidemiological Importance of a Study Finding. Sex Transm Dis. 2017 Aug; 44(8): 495–497 {TXT-Link} Click for Details

» A simulation study was carried out to illustrate that P values can suggest but not confirm statistical significance; and they may not indicate epidemiological significance (importance). We recommend that researchers consider reporting effect sizes as P values in conjunction with confidence intervals or point estimates with standard errors to indicate precision (uncertainty) «

Amrhein V, Korner-Nievergelt F, Roth T. The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. Peer J. 2017; 5: e3544 {TXT-Link} Click for Details

» The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p values at face value, but mistrust results with larger p-values. In either case, p-values tell little about reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Also significance (p≤0.05) is hardly replicable: at a good statistical power of 80%, two studies will be ‘conflicting’, meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. 
Also larger p-values offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis, falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about interpretation of larger p values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment «

Vetter TR. Descriptive Statistics: Reporting the Answers to the 5 Basic Questions of Who, What, Why, When, Where, and a Sixth, So What? Anesth Analg. 2017 Nov;125(5):1797-1802 {TXT-Link} Click for Details

» Descriptive statistics are specific methods basically used to calculate, describe, and summarize collected research data in a logical, meaningful, and efficient way. Descriptive statistics are reported numerically in the manuscript text and/or in its tables, or graphically in its figures. This basic statistical tutorial discusses a series of fundamental concepts about descriptive statistics and their reporting. The mean, median, and mode are 3 measures of the center or central tendency of a set of data. In addition to a measure of its central tendency (mean, median, or mode), another important characteristic of a research data set is its variability or dispersion (ie, spread). In simplest terms, variability is how much the individual recorded scores or observed values differ from one another. The range, standard deviation, and interquartile range are 3 measures of variability or dispersion. The standard deviation is typically reported for a mean, and the interquartile range for a median. Testing for statistical significance, along with calculating the observed treatment effect (or the strength of the association between an exposure and an outcome), and generating a corresponding confidence interval are 3 tools commonly used by researchers (and their collaborating biostatistician or epidemiologist) to validly make inferences and more generalized conclusions from their collected data and descriptive statistics. A number of journals, including Anesthesia & Analgesia, strongly encourage or require the reporting of pertinent confidence intervals. A confidence interval can be calculated for virtually any variable or outcome measure in an experimental, quasi-experimental, or observational research study design. Generally speaking, in a clinical trial, the confidence interval is the range of values within which the true treatment effect in the population likely resides. 
In an observational study, the confidence interval is the range of values within which the true strength of the association between the exposure and the outcome (eg, the risk ratio or odds ratio) in the population likely resides. There are many possible ways to graphically display or illustrate different types of data. While there is often latitude as to the choice of format, ultimately, the simplest and most comprehensible format is preferred. Common examples include a histogram, bar chart, line chart or line graph, pie chart, scatterplot, and box-and-whisker plot. Valid and reliable descriptive statistics can answer basic yet important questions about a research data set, namely: “Who, What, Why, When, Where, How, How Much?” «

Kennedy-Shaffer L. When the Alpha is the Omega: P-Values, “Substantial Evidence,” and the 0.05 Standard at FDA. Food Drug Law J. 2017; 72(4): 595–635 {TXT-Link} Click for Details

» A test provides ranges of that test statistic for which the null hypothesis will be accepted or rejected. Within this hypothesis testing framework, the p-value, a number between 0 and 1, can be defined in several equivalent ways. The formulation most commonly used in the medical literature defines the p-value as the “probability, under the assumption of no effect or no difference, (the null hypothesis), of obtaining a result equal to or more extreme than what was actually observed.” If an event is only of interest if it is more extreme in the same direction as the observed results (compared to the null hypothesis), then we use only that one-sided probability. More commonly, however, a two-sided probability is calculated that is agnostic to whether the more extreme event is in the same or opposite direction as the observed results. A (one-sided or two-sided) p-value is generally then compared to some pre-specified alpha level or significance level. If it is below the alpha level, the null hypothesis is rejected; if it is above the alpha level, the null hypothesis is not rejected. One can equivalently define the p-value, then, as the value of alpha for which the data would be on the border between rejecting and not rejecting the null hypothesis «

Leek J, McShane BB, Gelman A, Colquhoun D, Nuijten MB, Goodman SN. Five ways to fix statistics. Nature 2017;551:557-559 {TXT-Link}

van Rijn MHC, Bech A, Bouyer J, van den Brand JAJG. Statistical significance versus clinical relevance. Nephrol Dial Transplant. 2017 Apr 1;32(suppl-2):ii6-ii12 {TXT-Link} Click for Details

» In March this year, the American Statistical Association (ASA) posted a statement on the correct use of P-values, in response to a growing concern that the P-value is commonly misused and misinterpreted. We aim to translate these warnings given by the ASA into a language more easily understood by clinicians and researchers without a deep background in statistics. Moreover, we intend to illustrate the limitations of P-values, even when used and interpreted correctly, and bring more attention to the clinical relevance of study findings using two recently reported studies as examples. We argue that P-values are often misinterpreted. A common mistake is saying that P < 0.05 means that the null hypothesis is false, and P ≥0.05 means that the null hypothesis is true. The correct interpretation of a P-value of 0.05 is that if the null hypothesis were indeed true, a similar or more extreme result would occur 5% of the times upon repeating the study in a similar sample. In other words, the P-value informs about the likelihood of the data given the null hypothesis and not the other way around. A possible alternative related to the P-value is the confidence interval (CI). It provides more information on the magnitude of an effect and the imprecision with which that effect was estimated. However, there is no magic bullet to replace P-values and stop erroneous interpretation of scientific results. Scientists and readers alike should make themselves familiar with the correct, nuanced interpretation of statistical tests, P-values and CIs. «

Wellek S. A critical evaluation of the current “p-value controversy”. Biom J. 2017 Sep;59(5):854-872 {ABS-Link} Click for Details

» This article has been triggered by the initiative launched in March 2016 by the Board of Directors of the American Statistical Association (ASA) to counteract the current p‐value focus of statistical research practices that allegedly “have contributed to a reproducibility crisis in science.” It is pointed out that in the very wide field of statistics applied to medicine, many of the problems raised in the ASA statement are not as severe as in the areas the authors may have primarily in mind, although several of them are well‐known experts in biostatistics and epidemiology. This is mainly due to the fact that a large proportion of medical research falls under the realm of a well developed body of regulatory rules banning the most frequently occurring misuses of p‐values. Furthermore, it is argued that reducing the statistical hypotheses tests nowadays available to the class of procedures based on p‐values calculated under a traditional one‐point null hypothesis amounts to ignoring important developments having taken place and going on within the statistical sciences. Although hypotheses testing is still an indispensable part of the statistical methodology required in medical and other areas of empirical research, there is a large repertoire of methods based on different paradigms of inference that provide ample options for supplementing and enhancing the methods of data analysis blamed in the ASA statement for causing a crisis «

Gagnier JJ, Morgenstern H. Misconceptions, Misuses, and Misinterpretations of P Values and Significance Testing. J Bone Joint Surg Am. 2017 Sep 20;99(18):1598-1603 {ABS-Link} Click for Details

» The interpretation and reporting of p values and significance testing in biomedical research are fraught with misconceptions and inaccuracies. Publications of peer-reviewed research in orthopaedics are not immune to such problems. The American Statistical Association (ASA) recently published an official statement on the use, misuse, and misinterpretation of statistical testing and p values in applied research. The ASA statement discussed 6 principles: (1) “P-values can indicate how incompatible the data are with a specified statistical model.” (2) “P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.” (3) “Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.” (4) “Proper inference requires full reporting and transparency.” (5) “A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.” (6) “By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.” The purpose of this article was to discuss these principles. We make several recommendations for moving forward: (1) Authors should avoid statements such as “statistically significant” or “statistically nonsignificant.” (2) Investigators should report the magnitude of effect of all outcomes together with the appropriate measure of precision or variation. (3) Orthopaedic residents and surgeons must be educated in biostatistics, the ASA principles, and clinical epidemiology. (4) Journal editors and reviewers need to be familiar with and enforce the ASA principles «

Katz JN, Losina E. Uses and Misuses of the P Value in Reporting Results of Orthopaedic Research Studies. J Bone Joint Surg Am. 2017 Sep 20;99(18):1507-1508 {TXT-Link} Click for Details

» We provide context for the ASA report and the article by Gagnier and Morgenstern in the form of 2 (fictitious) research abstracts that base their conclusions solely on a p value «

Blume JD, D’Agostino McGowan L, Dupont WD, Greevy, Jr RA. Second-generation p-values: Improved rigor, reproducibility, & transparency in statistical analyses. PLoS One. 2018; 13(3): e0188299 {TXT-Link} Click for Details

» Verifying that a statistically significant result is scientifically meaningful is not only good scientific practice, it is a natural way to control the Type I error rate. Here we introduce a novel extension of the p-value—a second-generation p-value (pδ)–that formally accounts for scientific relevance and leverages this natural Type I Error control. The approach relies on a pre-specified interval null hypothesis that represents the collection of effect sizes that are scientifically uninteresting or are practically null. The second-generation p-value is the proportion of data-supported hypotheses that are also null hypotheses. As such, second-generation p-values indicate when the data are compatible with null hypotheses (pδ = 1), or with alternative hypotheses (pδ = 0), or when the data are inconclusive (0<pδ<1). Moreover, second-generation p-values provide a proper scientific adjustment for multiple comparisons and reduce false discovery rates. This is an advance for environments rich in data, where traditional p-value adjustments are needlessly punitive. Second-generation p-values promote transparency, rigor and reproducibility of scientific results by a priori specifying which candidate hypotheses are practically meaningful and by providing a more reliable statistical summary of when the data are compatible with alternative or null hypotheses «
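
The second-generation p-value described above can be sketched as the overlap between an interval estimate and a pre-specified interval null hypothesis. The function below is a minimal sketch under stated assumptions: the function name and all numeric inputs are hypothetical, and the correction factor for very wide intervals follows our reading of the definition in the paper, which should be checked against the original before any real use.

```python
def second_generation_p(ci_low: float, ci_high: float,
                        null_low: float, null_high: float) -> float:
    """Fraction of the interval estimate that lies inside the interval null.

    Includes a correction so that very wide (uninformative) interval
    estimates give values near 0.5 instead of values near 0.
    """
    if ci_high <= ci_low or null_high <= null_low:
        raise ValueError("both intervals must have positive length")
    ci_len = ci_high - ci_low
    null_len = null_high - null_low
    # Length of the intersection of the two intervals (0 if they do not overlap).
    overlap = max(0.0, min(ci_high, null_high) - max(ci_low, null_low))
    # Correction factor: only active when the estimate is more than twice as wide as the null.
    correction = max(ci_len / (2.0 * null_len), 1.0)
    return (overlap / ci_len) * correction


# Hypothetical interval null: effects between -0.2 and 0.2 are considered practically null.
print(second_generation_p(0.5, 1.5, -0.2, 0.2))   # estimate entirely outside the null interval
print(second_generation_p(-0.1, 0.1, -0.2, 0.2))  # estimate entirely inside the null interval
print(second_generation_p(0.0, 0.4, -0.2, 0.2))   # partial overlap, inconclusive
```

In this sketch the interval estimate is passed in directly, so any interval type (for example a 95% confidence interval) can be used. The choice of the interval null is a scientific decision that has to be made before looking at the data.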

Schober P, Bossers SM, Schwarte LA. Statistical Significance Versus Clinical Importance of Observed Effect Sizes: What Do P Values and Confidence Intervals Really Represent? Anesth Analg. 2018 Mar;126(3):1068-1072 {TXT-Link} Click for Details

» Effect size measures are used to quantify treatment effects or associations between variables. Such measures, of which >70 have been described in the literature, include unstandardized and standardized differences in means, risk differences, risk ratios, odds ratios, or correlations. While null hypothesis significance testing is the predominant approach to statistical inference on effect sizes, results of such tests are often misinterpreted, provide no information on the magnitude of the estimate, and tell us nothing about the clinical importance of an effect. Hence, researchers should not merely focus on statistical significance but should also report the observed effect size. However, all samples are to some degree affected by randomness, such that there is a certain uncertainty on how well the observed effect size represents the actual magnitude and direction of the effect in the population. Therefore, point estimates of effect sizes should be accompanied by the entire range of plausible values to quantify this uncertainty. This facilitates assessment of how large or small the observed effect could actually be in the population of interest, and hence how clinically important it could be. This tutorial reviews different effect size measures and describes how confidence intervals can be used to address not only the statistical significance but also the clinical significance of the observed effect or association. Moreover, we discuss what P values actually represent, and how they provide supplemental information about the significant versus nonsignificant dichotomy. This tutorial intentionally focuses on an intuitive explanation of concepts and interpretation of results, rather than on the underlying mathematical theory or concepts «
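
As an illustration of reporting an effect size together with its range of plausible values, the sketch below computes a risk difference with a simple Wald confidence interval. The event counts are invented for the example, and the Wald interval is only one of several available methods (it can perform poorly with small samples or proportions near 0 or 1).

```python
from statistics import NormalDist


def risk_difference_ci(events_treat: int, n_treat: int,
                       events_ctrl: int, n_ctrl: int,
                       level: float = 0.95) -> tuple[float, float, float]:
    """Risk difference (treatment minus control) with a Wald confidence interval."""
    p_treat = events_treat / n_treat
    p_ctrl = events_ctrl / n_ctrl
    rd = p_treat - p_ctrl
    # Standard error of the difference between two independent proportions.
    se = (p_treat * (1 - p_treat) / n_treat + p_ctrl * (1 - p_ctrl) / n_ctrl) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return rd, rd - z * se, rd + z * se


# Hypothetical trial: 30/200 events under treatment versus 50/200 under control.
rd, lo, hi = risk_difference_ci(30, 200, 50, 200)
print(f"Risk difference: {rd:.3f} (95% CI {lo:.3f} to {hi:.3f})")
```

With these invented counts the interval excludes zero, but it spans absolute risk reductions from roughly 2 to roughly 18 percentage points. Whether the result is clinically important therefore depends on which reduction is considered relevant, something the P-value alone cannot show.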

Gates S, Ealing E. Reporting and interpretation of results from clinical trials that did not claim a treatment difference: survey of four general medical journals. BMJ Open. 2019; 9(9): e024785 {TXT-Link} Click for Details

» The majority of trials (54.2%) inappropriately interpreted a result that was not statistically significant as indicating no treatment benefit. Very few studies interpreted the result as indicating a lack of evidence against the null hypothesis of zero difference between the trial arms «

Wasserstein RL, Schirm AL, Lazar NA. Moving to a World Beyond “p<0.05”. The American Statistician, 2019; 73(sup1): 1-19 {TXT-Link} Click for Details


  • Don’t base your conclusions solely on whether an association or effect was found to be “statistically significant” (i.e., the p-value passed some arbitrary threshold such as p < 0.05).
  • Don’t believe that an association or effect exists just because it was statistically significant.
  • Don’t believe that an association or effect is absent just because it was not statistically significant.
  • Don’t believe that your p-value gives the probability that chance alone produced the observed association or effect or the probability that your test hypothesis is true.
  • Don’t conclude anything about scientific or practical importance based on statistical significance (or lack thereof).


  • Accept Uncertainty (» In the real world, data provide a noisy signal. Variation, one of the causes of uncertainty, is everywhere. Exact replication is difficult to achieve. So it is time to get the right (statistical) gear and “move toward a greater acceptance of uncertainty and embracing of variation” (Gelman, 2016) «)

Amrhein V, Greenland S, McShane B. Retire statistical significance. Nature 2019;567:305-307 {TXT-Link} Click for Details

» Our call to retire statistical significance and to use confidence intervals as compatibility intervals is not a panacea. Although it will eliminate many bad practices, it could well introduce new ones. Thus, monitoring the literature for statistical abuses should be an ongoing priority for the scientific community. But eradicating categorization will help to halt overconfident claims, unwarranted declarations of ‘no difference’ and absurd statements about ‘replication failure’ when the results from the original and replication studies are highly compatible. The misuse of statistical significance has done much harm to the scientific community and those who rely on scientific advice. P values, intervals and other statistical measures all have their place, but it’s time for statistical significance to go «

Dunkler D, Haller M, Oberbauer R, Heinze G. To test or to estimate? P-values versus effect sizes. Transpl Int. 2020 Jan;33(1):50-55 {TXT-Link} Click for Details

» The P‐value measures the compatibility of the observed data with the null hypothesis. Technically, it expresses the probability with which, given the null hypothesis was true, data with an effect size as extreme as the observed one or more extreme than the observed one can be obtained. The P‐value cannot separate implausibility of the null hypothesis from implausibility of any of the assumptions: A small P‐value gives evidence that the data are not compatible with the specified model – encompassing the null hypothesis and all assumptions. Hence, the P‐value should be viewed as a continuous measure of compatibility of the data to the model ranging from 0 (complete incompatibility) to 1 (complete compatibility). Consequently, precise P‐values should be presented (e.g., P = 0.07 and not P = NS or P > 0.05). «
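
The recommendation to present precise P-values, and the point that a P-value is not an effect size, can be illustrated by holding a (hypothetical) mean difference fixed while the sample size grows. The sketch assumes a two-sample z-test with a known common standard deviation. The helper names and the reporting convention "P < 0.001" for very small values are assumptions of this example, not prescriptions from the article.

```python
from statistics import NormalDist


def p_value_for_mean_difference(diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided P-value of a two-sample z-test with known common SD."""
    se = sd * (2.0 / n_per_group) ** 0.5
    z = diff / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def format_p(p: float) -> str:
    """Report the precise P-value instead of 'NS' or 'P > 0.05'."""
    return "P < 0.001" if p < 0.001 else f"P = {p:.3f}"


# Same hypothetical effect (0.2 SD) in three studies of different size.
sample_sizes = (20, 100, 500)
p_values = [p_value_for_mean_difference(0.2, 1.0, n) for n in sample_sizes]
for n, p in zip(sample_sizes, p_values):
    print(f"n per group = {n:3d}: {format_p(p)}")
```

The effect size is identical in all three lines. Only the P-value changes, which is why a P-value should be read as a continuous measure of compatibility with the model and should accompany an effect estimate, not replace it.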

Calin-Jageman RJ, Cumming G. The New Statistics for Better Science: Ask How Much, How Uncertain, and What Else is Known. Am Stat. 2019; 73(Suppl 1): 271–280 {TXT-Link} Click for Details

» The “New Statistics” emphasizes effect sizes, confidence intervals, meta-analysis, and the use of Open Science practices. We present 3 specific ways in which a New Statistics approach can help improve scientific practice: by reducing over-confidence in small samples, by reducing confirmation bias, and by fostering more cautious judgments of consistency. We illustrate these points through consideration of the literature on oxytocin and human trust, a research area that typifies some of the endemic problems that arise with poor statistical practice «

{18.May.2020 | ACPS-CdM}