Introduction
Like words, numbers comprise a representational language. As such,
it is fair to expect that, in practical applications, numbers can be
shown to be anchored to real objects and real processes
in the real world. In our present context, statistical findings
reported from a controlled or other so-called “systematic”
trial must be expected to reflect accurately the essential qualities
of the medical (or other) practices under review. But such
a happy coincidence cannot always be confirmed.
Thus, in a trial of homeopathy, if verum is found to perform no
better than placebo, it is usually concluded that homeopathy “failed”
the test. But the corresponding question must also be addressed,
both as a qualitative test of the specific trial and
as a principled challenge to the capabilities of the controlled
trial format: is such a format –
the idealized or abstract representation of the
controlled trial – capable in the first place of measuring
the relatively evanescent properties of homeopathic medicine?
At the outset, let me be clear that I do, in fact, consider the
controlled trial capable of measuring homeopathic action; I have
not always been convinced this was true, but at this point I absolutely
endorse such a belief. Nevertheless, we encounter at least three
matters of concern:
first, in adapting to homeopathic practices, the controlled trial
is an experimental method that tests “real objects and real
processes” indirectly, that is, it measures perception
of those processes, rather than the processes themselves, as for
example might be accomplished through laboratory testing;
second, we are not able to confirm that the controlled trial
can adequately account for the evanescent quality of
those processes in homeopathy, such as the generally subtle symptomatic
display generated by homeopathic remedies, or the gradual appearance
of symptomatic responses over a period of days or weeks; and,
third, in designing a trial environment – a protocol –
that seeks to measure homeopathy, it is unclear whether researchers can consistently
control for confounding factors that arise from the complications
inherent in the processes and products being examined: for example,
whether they can assure us that they have harvested all
symptomatic responses in the verum group, even when the remedy
being tested has a proving record of hundreds or thousands of
symptoms.
In the present installment of this two-part series, we will explore
the implications of the second point, above, in establishing limits
to the ability of the controlled trial to record subtle differences
between “real” objects and processes, when we attempt
to measure those processes indirectly. In discussing this question,
I will not restrict myself to a review only of issues directly related
to homeopathy, but will draw on a fairly broad range of examples,
to clarify the types of concerns that must be addressed, as the
controlled trial continues to develop into a more mature research
method.
The limitations encountered in applying the controlled trial to
various practices are obvious, palpable, and well recognized, even
in the skeptical camp, though no one seems until now to have made
a particular point of them. In the case of the general academic and
medical communities, this failure can be attributed to a lack of
expertise in so-called “systematic” research, and even
to a lack of interest in it. In the case of the skeptical community,
the failure to recognize obvious limitations of their practice can
only be attributed – and must be attributed –
to bias, the significance of which is clear for our
judgment regarding the ultimate merits of this art form.
No sarcasm is intended: after all, we are all, in our own domains, quite
accustomed to acknowledging that, even as scientists or professionals,
the quality of our work depends on concrete knowledge, but also
on our individual abilities to apply our knowledge and skills “artfully.”
In the second part of this two-part series, we will explore the
more practical end of this question, that is, the problem of evaluating
the quality, reliability, and credibility of the “evidence”
produced by specialists in one or another research methodology.
Specifically, we will focus on the evidentiary value of clinical
observations, for example, as reported in the traditional clinical
case study, and on the quality of evidence produced from so-called
“systematic” research.
Known Standards
That human perception is, or is made out to be, notoriously unreliable
in recording objective reality, is the cornerstone of the modern
faith in numbers, in the form of the controlled trial, as being
capable of providing relief from distortions introduced by observer
bias. It is ironic, then, that in the so-called “systematic”
trial, we invariably are measuring human perception, and only
human perception, which, to underscore the obvious, we already know
to be inaccurate. Therefore, the conclusion suggests itself, if
homeopathy fails in a controlled trial, that all we have demonstrated,
with certainty, is that fallible human perception failed to distinguish
real effects from imaginary ones. To the skeptic who is hostile
to the bone to homeopathic practice, such a circular, self-reinforcing
scenario must be very reassuring.
But in the face of 200 years of unrelenting controversy, and the
competition of evidence produced empirically versus evidence produced
by quantitative measurements, we need something to break the deadlock,
beyond a self-satisfied presumption that numbers are more “objective”
than perceptions. Reinforcing this need is the fact, remarked above,
that the controlled trial is doing little more
than “observing the observer”; in other words, it is
simply applying the human perceptual apparatus through a different
filter than usual. In short, we must find a way
to confirm what the statistician asserts as a matter of faith:
that numbers really are more objective and more reliable than observation.
And to do that, we must measure the capability of the statistician’s
methods against real differences between real objects and real processes
in the real world, differences that are already known and readily
demonstrated.
In other words, if we take two objects, which are differently
constituted and readily distinguished, and if in the context of
a “scientific” experiment the controlled trial fails
to distinguish between them, then we have established a limitation
to the methodology of controlled research. Further, if
such a limitation exists in one domain, it must be assumed that
it exists in other domains as well – as we assume as a matter
of principle, in any case, since even a skeptic would not insist
that the controlled trial is “good to go” for any scientific
application one cared to imagine.
Once this essential point is finally grasped, it becomes apparent
that the practitioner of the art of controlled research must henceforth
be expected to present not only a protocol for a proposed trial,
but also a Statement of Efficacy, placing the trial
within the context of the characteristics of the object under his
gaze. The statistician must be expected to define precisely
– in fact, statistically – how sensitive an instrument
is needed, to accurately measure the object, or process, he seeks
to measure. By way of analogy, the naked eye possesses
sufficient power to observe Betelgeuse unaided. On the other hand,
a sixteen-inch catadioptric telescope, a clear night sky, and a
time exposure might be needed before one can record the presence
of a 15th-magnitude dwarf.
It is beyond the scope of this paper to suggest specific formulae
to accomplish this task, in specifying the limits or tolerances
of the controlled trial in measuring medical process. But the forthcoming
discussion does intend to establish, at least in a rudimentary way,
what variety of process it is, in a trial of homeopathy, that demands
more precise (statistical) characterization, in the service of improving
precision and accuracy of research.
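For orientation only, the kind of calculation a statistician already has at hand, when asked how sensitive a trial must be, can be sketched in a few lines of Python. The sketch uses the textbook normal-approximation formula for comparing two response rates. The function name, the default significance level and power, and the response rates are all assumptions chosen for illustration; none is drawn from any trial of homeopathy.

```python
import math
from statistics import NormalDist


def participants_per_arm(p_control, p_verum, alpha=0.05, power=0.80):
    """Approximate participants needed per arm to tell p_verum from p_control.

    Normal-approximation formula for a two-sided comparison of two
    proportions. Every input is a hypothetical value supplied by the reader.
    """
    if p_control == p_verum:
        raise ValueError("the two response rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)           # desired statistical power
    variance = p_control * (1 - p_control) + p_verum * (1 - p_verum)
    effect = p_verum - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Hypothetical rates: the control arm responds at 30% throughout,
    # while the assumed verum rate moves closer and closer to it.
    for p_verum in (0.50, 0.40, 0.35):
        print(p_verum, participants_per_arm(0.30, p_verum))
```

The subtler the expected difference between the arms, the larger the trial must be: the statistical counterpart of the telescope needed for the fainter star.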
In our more earthly domain – as compared to the stargazer
– a placebo control may be all that is needed to establish
the influence of bias on observation. But, in some circumstances,
randomization between the arms of a trial may be essential to ensure
accurate results. In other circumstances, a crossover design may
be needed. These are by now standard modifications to the format
of the controlled trial, long since introduced to adapt this important
methodology to varying circumstances in trialing conventional medications.
It is unfortunate that no one has thought it advisable
to introduce similar, standardized, demonstrably effective safeguards
into trials of practices, medical or otherwise, that do not conform
in structure or process to the characteristic profile of the allopathic
method. One can only believe this failure reflects the
roots of controlled research in the allopathic tradition, and the
reluctance of those who are possessed of a violent disdain for alternative
practices to explore technical innovations that might cast more
light than shadow on those practices.*
The Pepsi Challenge
Blinded, randomized, and replicated many times over, The Pepsi
Challenge was one of the most successful marketing campaigns in
recent memory, demonstrating to an affable public the unquestionable
superiority of taste that Pepsi enjoyed over Coke, especially considering
that devotees of Coke, participating in these trials, often revealed
through the shocked expression on their faces, that they preferred
the “wrong” product. Of course, we must note that the
reported outcomes were not nitpicked by a panel of scientists, and
the advertising spots were paid for, rather than being earned through
a blinded peer review screening process…
…so, in other words, these methodological inadequacies allowed
numerous confounders to create suspicion, in the minds of some,
concerning the validity of the findings. For example, in one iteration
of the “trial,” the bottle of Pepsi was covered with
a wrapper marked with an “M,” while the bottle of Coke
was covered with a wrapper marked with a “Q.” Of course,
the management of the Coca-Cola bottling company objected that,
because of this, the only fair conclusion regarding this trial was
that people liked the letter “M” better than they liked
the letter “Q.”
Ahh, peer review, indeed!
In any case, what these “trials” really do
demonstrate is what the skeptical community has told us all along:
human perception can’t be trusted. People don’t even
know what they like well enough to distinguish it from a competitor.
However, these trials also tell us that experimental measurements
of human perception are unable to distinguish one product
from another: Coke and Pepsi have similar, but nevertheless differing
recipes. And they do not taste the same. Many people have no preference
between them, but many others are devoted to one or the other. And
yet, under the watchful eye of the experimentalist, they cannot
tell the difference between them. And, since the trial, when replicated
by the manufacturers of Coke, showed that people
preferred Coke over Pepsi, we are forced to conclude that the randomized,
blinded trial produced no consistent evidence of a
difference in taste between the two products, although of course
the real differences between the products are documented
in the proprietary recipes of each.
In short, the differences between Coke and Pepsi escaped the notice
of the controlled trial. Do we on that account conclude that the
two products are indistinguishable? I think instead we are well
advised to seek an answer to the questions: why did human perception
fail, and why was the controlled trial unable to confirm real differences
between real products in the real world?
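One conventional way to score a blinded preference test is an exact binomial test against the assumption that tasters are choosing at random. The Python sketch below is a minimal illustration of that procedure. The function name and the counts of tasters are hypothetical and are not taken from any actual Pepsi or Coke campaign.

```python
from math import comb


def two_sided_binomial_p(successes, trials):
    """Exact two-sided p-value against the null of random choice (p = 0.5)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    # Under p = 0.5 the distribution is symmetric, so the two-sided
    # p-value is twice the upper tail beyond the more extreme count.
    extreme = max(successes, trials - successes)
    tail = sum(comb(trials, k) for k in range(extreme, trials + 1)) / 2 ** trials
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical outcome: 54 of 100 blinded tasters prefer one cola.
    p_value = two_sided_binomial_p(54, 100)
    print("reject random choice at 5%?", p_value < 0.05)
```

With 54 of 100 hypothetical tasters preferring one cola, the test gives no grounds to reject random choice. That is a statement about what this particular trial was able to detect, not a demonstration that the two recipes are the same.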
Clearly, if the skeptical community wishes to establish, scientifically,
the superiority of one product over the other, it will have to improve
the methodology embodied in its trials, to better account for whatever
confounders it finds to have gummed up the works.
Facetiousness aside, however, I would suggest that the same challenge
awaits them in their efforts to trial homeopathy, high-end audio,
and other non-allopathic processes. A few suggestions may suffice
to illustrate the types of experiments that might be constructed,
to confirm whether the controlled trial is capable of measuring
real objects and real processes in the real world:
a) In a blinded trial, auditors will be asked to identify which
performance of a piano sonata or a song is being played on a very
expensive audio system, and which is performed by a live artist,
sitting at a real piano positioned between the loudspeakers before
them.
b) Blinded auditors identify which selection is performed on
a Stradivarius, and which on an instrument by Joseph Guarneri
del Gesu.
If the auditors cannot distinguish live from recorded performance,
would the experimentalist therefore assure us there was no difference
between the two performances? If the auditors cannot distinguish
Stradivarius from Guarneri, would the experimentalist therefore
assure us there was no difference between the instruments?
Being charitable, I assume the experimentalist – the statistician
– would concede the trial outcome was invalid. Which is exactly
the point: we have run into a limit on the capability of the controlled
trial to measure reality.
Taking this a step further, I would suggest a series of trials
of high-end audio gear, to begin to establish the gradations of
qualitative steps, the continuum, that characterizes the range of
quality found in increasingly fine audio components.
a) Blinded comparison of a boom box (System 1: S1) with a $300,000
high-end stereo system (S2).
b) …replace the boom box with a Bose (S3)
c) …replace the Bose with a $10,000 component system. (S4)
d) …replace the amplifier in S4 with a $10,000 pre-amp
and a $10,000 power amp. (S5)
e) …replace the power amp in S5 with an $80,000 pair of
Class A mono-blocks. (S6)
f) …replace the $100 interconnects with $1,000 interconnects.
(S7)
g) … (S8, S9, …)
The hypothesis: as the component systems become increasingly expensive,
the auditor will be less successful distinguishing which “arm”
utilized, for example, the expensive cable and which utilized the
inexpensive cable: in other words, in a $300,000 stereo system,
the sound quality might be so good that the improvement achieved
by incorporating more expensive cable would be relatively
less noticeable, against the “background” of higher
quality sound reproduction.
But experimentation on intermediate system set-ups could help isolate
measurable effects of specific components, for example, by comparing
S9 with itself, alternately played through the expensive and the
inexpensive interconnects (the usual procedure), but also testing
it against a less expensive system altogether: does changing the
interconnect affect the ability of the auditor to distinguish S9
from S3?
Not to be cynical, but I can already hear the excuses emanating
from the skeptical community, about how impractical and expensive
it would be to conduct such a series of trials.
But our eminently practical friends ignore, as usual, the fact
that science is not the handmaiden of practicality. Knowledge comes
at a price: it takes effort, and it takes a determination to be
exhaustive in our analyses; it requires that we resist assuming
that our methodology works - especially in the absence of independent
corroboration of its findings - and that, instead, we seek confirmation
of our results.
Science takes work: work such as the indefatigable efforts of Darwin
to record and organize the full range of empirical facts to be discovered
in the field by the persistent observer; or the efforts of Freud
to document a nearly endless variety of mental “products”
and to analyze their inter-relationships exhaustively, and publish
his findings in the 24 volumes that make up his lifetime’s
labors; or the life’s work of Hahnemann, his tireless effort
to record each and every symptomatic response to remedies, in order
to provide a comprehensive database on which to build a reliable
medical practice.
Against such untiring labor, such monumental achievements, frankly,
it is hard to credit very much the efforts of a relative handful
of researchers, in producing fewer than 200 hastily contrived controlled
trials over a period of several decades, that passed minimal scrutiny
for methodological adequacy, and that present negative findings
against homeopathy. Have any of these gentlemen, or all of them
together, committed more than a few months, or even cumulatively
more than a few years, to experimentation into homeopathic medicine?
How do we credit the allopathic physician, who manages to fit a
few trials of homeopathic remedies and herbal teas, into the time
that is left to him after he completes a busy day practicing in
the ER, or scrubbing up for surgery? The more so, as these trials
are easy to nitpick analytically – they simply fail to measure
accurately, that which they claim to measure.
Placebo, or Verum?
The Goldmann Visual Field Test measures peripheral vision. The patient
looks into a bowl-shaped perimeter, in which points of light flash on and
off at different places in the field. The patient presses a button
to register when he sees a point of light, and the points he misses
define the impairment he may have in his visual field. But it is
common in these tests for subjects to “see” a point
of light when there isn’t one there. In this situation, the
examiner knows when there is a point of light, and when there isn’t,
so there isn’t any question when the subject gets it “wrong.”
And yet, in this test it doesn’t matter if the patient
is wrong: all that counts are the right answers. The wrong
answers may be chalked up to placebo effect, but they have no bearing
on whether other perceptions are real, or not, and do not affect
the patient’s score.
This is relevant to the questions with which we are faced, since
everyone can produce a placebo response, even those who are receiving
verum. Because of this, some percentage of verum subjects will produce
both verum and placebo responses, and this little twist must be
accounted for in calculating outcomes in a controlled trial: the
real rate of verum response will be greater than what is reflected
in the statistics, because a greater or lesser percentage of those
responses will be concealed, statistically, within the response
rate returned by the control. True, in most situations the impact
of this fact on our confidence in findings will not be great, but
we cannot assume that this is always true, unless we are satisfied
with being completely irresponsible in our approach to scientific
investigation.
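The size of this concealment can be sketched with simple arithmetic. The Python fragment below assumes, purely for illustration, that verum and placebo responses arise independently in each participant. The function name and the rates are hypothetical.

```python
def responder_rates(verum_rate, placebo_rate):
    """Return (experimental_arm_rate, control_arm_rate, apparent_excess).

    Assumption for illustration only: within a participant, the verum
    response and the placebo response occur independently of one another.
    """
    for rate in (verum_rate, placebo_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must lie between 0 and 1")
    # A participant in the experimental arm is counted as a responder
    # if either kind of response occurs.
    experimental = 1 - (1 - verum_rate) * (1 - placebo_rate)
    control = placebo_rate
    return experimental, control, experimental - control


if __name__ == "__main__":
    experimental, control, excess = responder_rates(0.30, 0.40)
    print(f"{experimental:.2f} vs {control:.2f}, apparent excess {excess:.2f}")
```

With a hypothetical verum rate of 30% and a placebo rate of 40%, the arms would show 58% against 40%: an apparent excess of 18 points, although 30% of the experimental arm actually responded to the remedy.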
In the case of trials of homeopathy, this dynamic takes on an added
twist, since every trial participant is capable of producing proving
symptoms. Because of this, in principle, every placebo response
of a verum subject must be considered, in a homeopathic proving
trial, to be, potentially, superimposed over a verum response, whether
the verum response is observed, or observable, or not.
Such a dynamic is at work in all clinical trials, of allopathic
as well as homeopathic medicines. Yet it is characteristic of homeopathy,
and of homeopathic trials, that most symptomatic responses are mild,
and that the small doses applied mean that only those trial participants,
who are most sensitive to the remedies, will respond to them in
the first place.
These facts have the unexpected consequence that those participants
who populate the group that responds to verum, will, because of
their sensitivities, also populate the group that is most subject
to the power of suggestion. In short, in a homeopathic proving trial,
the same group of participants - the most sensitive participants
in the trial - will tend to respond to both verum and placebo: statistically,
this will mean that the experimental group, to take the most extreme
case, may in reality produce twice the number of symptom responses
as the control, yet "perform" no better than placebo,
statistically.
In practice, no effort is made in the controlled trial to distinguish
placebo from verum responses. In a formula, both (placebo and verum)
may be represented by "s" (symptom), and the formula looks
like this:
(fig. 1)
s : s
If there were 20 responders in both the experimental arm and the
control arm, then this outcome shows that verum performed no better
than placebo:
(fig. 2)
20 : 20
Of course, in this scenario - the usual practice in the controlled
trial - "s" refers not to each symptom that is produced,
but to the fact that individual participants in the trial produced
one or more symptoms. In the control arm, only placebo responses
occur, for the simple reason, of course, that verum has not been
administered.
But in the experimental arm, verum responses may be masked - statistically
- because the single group of 20 participants can account for both
numbers: the 20 individuals who produced the placebo symptoms, may
also have produced the verum symptoms, yet they get counted only
once.
Thus, to accurately record outcomes in such a trial, the
type of symptomatic response must be differentiated as either a
verum symptom (vs) or a placebo symptom (ps):
(fig. 3)
vs + ps : ps
In the experimental arm, therefore, if "John" produced
a verum symptom and also a placebo symptom, in the first formula
(fig. 1), he would be counted only once, as a "responder"
in the experimental group, even though he produced two symptoms.
In short, the verum response is lost to the final count (fig.
2).
Given these considerations, it is imperative that the symptoms
produced by trial participants be analyzed clinically, not just
statistically, since “real” symptomatic responses will
typically present a different profile than symptomatic responses
produced by placebo. Thus, a clinically based analysis of responses
may be able to differentiate, in many cases, those responses that
were truly "placebo responses," from those that represented
proving symptoms.**
In this case, then, "John" - a "complex responder"
as we might call him - would be counted twice, his placebo response
chalked up as "ps" and his verum response chalked up as
"vs." Then, when the final tally was made, we would have,
potentially, a record (just as an example, for 20 responders in
the experimental group) of 20 placebo responses and as many as another
20 verum responses. The formula then shows...
(fig. 4)
vs + ps : ps
20 + 20 : 20 =
40 : 20
...et voilà! Verum has outperformed placebo, after all!
As a precautionary note to researchers who may choose to implement
this measure, it should be emphasized that a trial participant should
only be counted once as a placebo responder, and once as a verum
responder. After all, the purpose of this "double count"
is to assure that verum symptoms are not neutralized by placebo
symptoms, in the summary results of the statistical survey. But
we do not want to count all symptoms, just whether the
individual participant should be counted as responding to placebo
and/or verum: after all, if "John" produced 38 placebo
responses, for example, he could single-handedly bury the verum
group by his own extreme susceptibility!
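The two counting rules can be set side by side in a short Python sketch. It assumes that each symptom has already been classified, by clinical analysis, as a verum or a placebo symptom. The class, the function names, and the participant data are hypothetical and simply reproduce the 20-responder example above.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    name: str
    verum_symptoms: int = 0    # symptoms judged clinically to be proving symptoms
    placebo_symptoms: int = 0  # symptoms judged clinically to be placebo responses


def responder_count(arm):
    """Usual practice (fig. 1): one count per participant with any symptom."""
    return sum(1 for p in arm if p.verum_symptoms + p.placebo_symptoms > 0)


def differentiated_count(arm):
    """Proposed practice (fig. 3): at most one 'vs' and one 'ps' per participant."""
    vs = sum(1 for p in arm if p.verum_symptoms > 0)
    ps = sum(1 for p in arm if p.placebo_symptoms > 0)
    return vs, ps


# Hypothetical arms: 20 "complex responders" on verum, 20 placebo responders.
experimental = [
    Participant(f"E{i}", verum_symptoms=1, placebo_symptoms=1) for i in range(20)
]
experimental[0].placebo_symptoms = 38  # "John": many symptoms, still one 'ps'
control = [Participant(f"C{i}", placebo_symptoms=1) for i in range(20)]

print(responder_count(experimental), ":", responder_count(control))  # 20 : 20
vs, ps = differentiated_count(experimental)
print(vs + ps, ":", differentiated_count(control)[1])                # 40 : 20
```

Because each participant contributes at most once to each tally, the 38 placebo symptoms attributed to "John" weigh no more than a single placebo symptom from anyone else.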
The failure of statisticians, over the past 200 years, to notice
this situation is simply one more example of bias. It represents
the universal human tendency to sustain faith in the value
of our own beliefs, in the face of a challenge from unfamiliar sources.
In the present context, the statistician - conveniently - sees no
reason to question the methods he has used to test his preferred
(allopathic) medical practices, and assumes they are, prima facie,
adequate for testing everything else as well.
In short, remembering that homeopathic remedies typically produce
mild symptoms, and that the small homeopathic dose is specifically
designed to produce mild, subtle effects, it is not surprising that
such effects are often missed in “objective” trials.
As we now see, this problem may be exacerbated, since many of the
symptoms may be masked by a cloud of undifferentiated statistics,
counting one thing but claiming to have counted everything else
in the world, too.
So long as no one is challenging them, the statisticians amongst
us appear quite content insisting the marbles are either black or
white ... never mind that one over there, with the stripes.
Finally, in this context, it is worth recalling
that the statistician has no working definition of placebo, no laboratory-based,
no clinical, and no descriptive standards by which to measure the
specific symptomatic product. Instead, he relies exclusively
on the statistical behavior of the control group, to set the standard
for the experimental group. But such a procedure guarantees that
the "complex responder," as we have dubbed him, is counted
only once, eliminating the (statistical) testimony of his second
symptom, the one that, arguably, is a response to the real medicine.
Such, at least, is the attitude that must be adopted in designing
a trial intended to measure homeopathic action: in other words,
the statistician must be able to prove whether such a process is
or is not affecting the observed results of the trial … assuming
it is not asking too much, to expect accuracy to characterize the
products of scientific inquiry.
Summary
In the present paper I have tried to show, what should in any case
be obvious, that reality is sometimes difficult to measure. It may
be costly and inconvenient, but it remains true that scientific
research that deserves the name demands that we demonstrate that we
have in fact measured our subject accurately and exhaustively.
To that end, the scientific community, the general public, and
our government leaders - who often look to research scientists for
guidance - should insist, at long last, that standards be established,
based on proofs, that our research methods return honest results.
After all, making one's way, blindfolded, through a maze of self-referencing
calculations, is not a substitute for good observation or accurate
measurement.
_________________
* In the editorial at the head of the present issue, I have discussed
this point in some detail, in respect to the example of Belladonna,
a homeopathic remedy with 1040 associated symptoms, as compared
to the one symptom that, archetypically, characterizes the field
of action of the allopathic medicament.
** See my book review in the March 2006 issue of this journal for
a discussion of this process.