sci.stat.math / Q right way to interpret a test with multiple metrics

Subject / Author
* Q right way to interpret a test with multiple metrics -- Cosine
`* Re: Q right way to interpret a test with multiple metrics -- David Jones
 `* Re: Q right way to interpret a test with multiple metrics -- Rich Ulrich
  +* Re: Q right way to interpret a test with multiple metrics -- David Jones
  |`- Re: Q right way to interpret a test with multiple metrics -- Rich Ulrich
  `- Re: Q right way to interpret a test with multiple metrics -- David Jones

Subject: Q right way to interpret a test with multiple metrics
From: Cosine
Newsgroups: sci.stat.math
Date: Sat, 18 Mar 2023 00:47 UTC

Hi:

We could easily find in the literature that a study used more than one performance metric for the hypothesis test without explicitly and clearly stating what hypothesis this study aims to test. Often the paper only states that it intends to test if a newly developed object (algorithm, drug, device, technique, etc) would perform better than some chosen benchmarks. Then the paper presents some tables summarizing the results of many comparisons. Among the tables, the paper picks those comparisons having better values of some performance metric and showing statistical significance. Finally, the paper claims that the new object is successful since it has some favorable results that are statistically significant.

This looks odd. Shouldn't we clearly define the hypothesis before conducting any tests? For example, shouldn't we define the success of the object to be "having all the chosen metrics have better results"? Otherwise, why would we test so many metrics, instead of only one?

The aforementioned approach looks like this: we do not know what would happen. So let's pick some commonly used metrics to test if we could get some of them to show favorable and significant results.

Anyway, what are the correct or rigorous ways to conduct tests with multiple metrics?
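
A minimal Python simulation sketch of the concern described above, under an assumed setup (k independent metrics, every null true, so any "significant" metric is a false positive). With ten metrics, picking out whichever ones cross p < 0.05 succeeds by chance alone about 40% of the time:

import numpy as np

# Assumed illustrative setup: 10 metrics, all nulls true, tests independent.
rng = np.random.default_rng(0)
k, alpha, n_sim = 10, 0.05, 100_000

# Under a true null, each p-value is Uniform(0, 1).
p = rng.uniform(size=(n_sim, k))

# Chance that at least one of the k metrics looks "significant".
fwer = (p.min(axis=1) < alpha).mean()

print(f"analytic 1-(1-alpha)^k: {1 - (1 - alpha) ** k:.3f}")   # ~0.401
print(f"simulated:              {fwer:.3f}")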

Subject: Re: Q right way to interpret a test with multiple metrics
From: David Jones
Newsgroups: sci.stat.math
Organization: A noiseless patient Spider
Date: Sat, 18 Mar 2023 01:25 UTC
References: 1

Cosine wrote:

> Hi:
>
> We could easily find in the literature that a study used more than
> one performance metric for the hypothesis test without explicitly and
> clearly stating what hypothesis this study aims to test. Often the
> paper only states that it intends to test if a newly developed object
> (algorithm, drug, device, technique, etc) would perform better than
> some chosen benchmarks. Then the paper presents some tables
> summarizing the results of many comparisons. Among the tables, the
> paper picks those comparisons having better values of some
> performance metric and showing statistical significance. Finally, the
> paper claims that the new object is successful since it has some
> favorable results that are statistically significant.
>
> This looks odd. Shouldn't we clearly define the hypothesis before
> conducting any tests? For example, shouldn't we define the success of
> the object to be "having all the chosen metrics have better results"?
> Otherwise, why would we test so many metrics, instead of only one?
>
> The aforementioned approach looks like this: we do not know what
> would happen. So let's pick some commonly used metrics to test if we
> could get some of them to show favorable and significant results.
>
> Anyway, what are the correct or rigorous ways to conduct tests
> with multiple metrics?

You might want to search for the terms "multiple testing" and
"Bonferroni correction".

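As a minimal sketch of that correction (the p-values here are hypothetical): with m tests, compare each raw p-value to alpha/m, or equivalently multiply each p-value by m and cap at 1, which keeps the family-wise error rate at or below alpha:

alpha = 0.05
p_values = [0.003, 0.021, 0.049, 0.30]            # hypothetical raw p-values
m = len(p_values)

adjusted = [min(1.0, p * m) for p in p_values]    # Bonferroni-adjusted p-values
reject = [p < alpha / m for p in p_values]        # equivalent decision rule

for p, p_adj, r in zip(p_values, adjusted, reject):
    print(f"raw={p:.3f}  adjusted={p_adj:.3f}  reject={r}")

# Only 0.003 survives: it is the only raw p-value below 0.05/4 = 0.0125.
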
Subject: Re: Q right way to interpret a test with multiple metrics
From: Rich Ulrich
Newsgroups: sci.stat.math
Date: Sat, 18 Mar 2023 18:48 UTC
References: 1 2

On Sat, 18 Mar 2023 01:25:44 -0000 (UTC), "David Jones"
<dajhawkxx@nowherel.com> wrote:

>Cosine wrote:
>
>> Hi:
>>
>> We could easily find in the literature that a study used more than
>> one performance metric for the hypothesis test without explicitly and
>> clearly stating what hypothesis this study aims to test.

That sounds like a journal with reviewers who are not doing their job.
A new method may have better sensitivity or specificity, making it
useful as a second test. If it is cheaper/easier, that virtue might
justify slight inferiority. If it is more expensive, there should be
a gain in accuracy to justify its application (or, it deserves further
development).

> Often the
>> paper only states that it intends to test if a newly developed object
>> (algorithm, drug, device, technique, etc) would perform better than
>> some chosen benchmarks. Then the paper presents some tables
>> summarizing the results of many comparisons. Among the tables, the
>> paper picks those comparisons having better values of some
>> performance metric and showing statistical significance. Finally, the
>> paper claims that the new object is successful since it has some
>> favorable results that are statistically significant.
>>
>> This looks odd. Shouldn't we clearly define the hypothesis before
>> conducting any tests? For example, shouldn't we define the success of
>> the object to be "having all the chosen metrics have better results"?
>> Otherwise, why would we test so many metrics, instead of only one?
>>
>> The aforementioned approach looks like this: we do not know what
>> would happen. So let's pick some commonly used metrics to test if we
>> could get some of them to show favorable and significant results.

I am not comfortable with your use of the word 'metrics' -- I like
to think of improving the metrics of a scale by taking a power
transformation, like, square root for Poisson, etc.

Or, your metric for measuring 'size' might be area, volume, weight....

>>
>> Anyway, what are the correct or rigorous ways to conduct tests
>> with multiple metrics?
>
>You might want to search for the terms "multiple testing" and
>"Bonferroni correction".

That answers the final question -- assuming that you do have
some stated hypothesis or goal.

--
Rich Ulrich

Subject: Re: Q right way to interpret a test with multiple metrics
From: David Jones
Newsgroups: sci.stat.math
Organization: A noiseless patient Spider
Date: Sun, 19 Mar 2023 10:58 UTC
References: 1 2 3

Rich Ulrich wrote:

> On Sat, 18 Mar 2023 01:25:44 -0000 (UTC), "David Jones"
> <dajhawkxx@nowherel.com> wrote:
>
> > Cosine wrote:
> >
> >>
> >> Anyway, what are the correct or rigorous ways to conduct tests
> >> with multiple metrics?
> >
> > You might want to search for the terms "multiple testing" and
> > "Bonferroni correction".
>
> That answers the final question -- assuming that you do have
> some stated hypothesis or goal.

Not quite. The "Bonferroni correction" is an approximation, and one
needs to think about that, and more deeply than just the approximation
to 1-(1-p)^n. That formula is exact if all the
test-statistics are statistically independent; it is conservative if
there is positive dependence (and so "OK"). But, theoretically, it
might be wildly wrong if there is negative dependence.
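
A minimal simulation sketch of that point (assumed setup: two one-sided z-tests with correlated statistics, both nulls true; the exact-under-independence formula 1-(1-p)^n is sometimes called the Sidak correction):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n_sim, alpha = 500_000, 0.05
crit = norm.ppf(1 - alpha)          # one-sided critical value, about 1.645

def any_rejection_rate(rho):
    # Two standard-normal test statistics with correlation rho, both nulls true.
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal([0.0, 0.0], cov, size=n_sim)
    return (z > crit).any(axis=1).mean()

print("independence formula 1-(1-alpha)^2:", 1 - (1 - alpha) ** 2)   # 0.0975
for rho in (0.0, 0.9, -0.9):
    print(f"rho={rho:+.1f}  P(at least one rejection) = {any_rejection_rate(rho):.4f}")

# rho=0 matches 0.0975; positive dependence gives a lower rate (the formula
# is conservative); negative dependence pushes the rate above the formula,
# toward the Bonferroni bound of 2*alpha = 0.10.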

Subject: Re: Q right way to interpret a test with multiple metrics
From: David Jones
Newsgroups: sci.stat.math
Organization: A noiseless patient Spider
Date: Sun, 19 Mar 2023 11:26 UTC
References: 1 2 3

Rich Ulrich wrote:

> On Sat, 18 Mar 2023 01:25:44 -0000 (UTC), "David Jones"
> <dajhawkxx@nowherel.com> wrote:
>
> > Cosine wrote:
> >
> >> Hi:
> >>
> >> We could easily find in the literature that a study used more than
> >> one performance metric for the hypothesis test without explicitly
> >> and clearly stating what hypothesis this study aims to test.
>
> That sounds like a journal with reviewers who are not doing their job.
> A new method may have better sensitivity or specificity, making it
> useful as a second test. If it is cheaper/easier, that virtue might
> justify slight inferiority. If it is more expensive, there should be
> a gain in accuracy to justify its application (or, it deserves further
> development).
>
> > Often the
> >> paper only states that it intends to test if a newly developed
> >> object (algorithm, drug, device, technique, etc) would perform
> >> better than some chosen benchmarks. Then the paper presents some
> >> tables summarizing the results of many comparisons. Among the
> >> tables, the paper picks those comparisons having better values of
> >> some performance metric and showing statistical significance.
> >> Finally, the paper claims that the new object is successful since
> >> it has some favorable results that are statistically significant.
> >>
> >> This looks odd. Shouldn't we clearly define the hypothesis before
> >> conducting any tests? For example, shouldn't we define the success
> >> of the object to be "having all the chosen metrics have better
> >> results"? Otherwise, why would we test so many metrics, instead
> >> of only one?
> >>
> >> The aforementioned approach looks like this: we do not know what
> >> would happen. So let's pick some commonly used metrics to test if
> >> we could get some of them to show favorable and significant
> >> results.
>
> I am not comfortable with your use of the word 'metrics' -- I like
> to think of improving the metrics of a scale by taking a power
> transformation, like, square root for Poisson, etc.
>
> Or, your metric for measuring 'size' might be area, volume, weight....
>
>
> >>
> >> Anyway, what are the correct or rigorous ways to conduct tests
> >> with multiple metrics?
> >
> > You might want to search for the terms "multiple testing" and
> > "Bonferroni correction".
>
> That answers the final question -- assuming that you do have
> some stated hypothesis or goal.

My other answer concentrated on the case where you put all attention on
the null hypothesis "no effect of any kind", but one could also think
of finding whether any of the alternatives on which the test-statistics
are based are of any importance, and if so, which one(s).

In theory the "Bonferroni correction" approach doesn't deal with this.
One presumably would need to go back to estimates of effect sizes. But,
if the plan was to do further experiments targeted at getting better
estimates of particular effects, how do you choose how many, and which,
effects to investigate further? The original experiment might suggest
the one with the smallest p-value, but that might just be a chance
event, with some other one being better.
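
A minimal simulation sketch of that selection effect (assumed setup: eight metrics whose true effects are all identical, so the "winner" is pure chance and its estimate is inflated):

import numpy as np

rng = np.random.default_rng(2)
n_sim, k, true_effect, se = 50_000, 8, 0.2, 0.1

# k estimated effects per experiment, all with the same true value.
est = rng.normal(true_effect, se, size=(n_sim, k))

# Pick the effect with the largest z-score, i.e. the smallest one-sided p-value.
winner = (est / se).argmax(axis=1)
winner_est = est[np.arange(n_sim), winner]

print(f"true effect (every metric):  {true_effect:.3f}")
print(f"mean estimate, all metrics:  {est.mean():.3f}")        # ~0.200, unbiased
print(f"mean estimate, the winner:   {winner_est.mean():.3f}")  # ~0.34, inflated

# Following up only the smallest-p effect means chasing an estimate that is
# biased upward by the selection itself; a replication should regress to 0.2.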

Subject: Re: Q right way to interpret a test with multiple metrics
From: Rich Ulrich
Newsgroups: sci.stat.math
Date: Mon, 20 Mar 2023 00:48 UTC
References: 1 2 3 4

On Sun, 19 Mar 2023 10:58:42 -0000 (UTC), "David Jones"
<dajhawkxx@nowherel.com> wrote:

>Rich Ulrich wrote:
>
>> On Sat, 18 Mar 2023 01:25:44 -0000 (UTC), "David Jones"
>> <dajhawkxx@nowherel.com> wrote:
>>
>> > Cosine wrote:
>> >
>> >>
>> >> Anyway, what are the correct or rigorous ways to conduct tests
>> >> with multiple metrics?
>> >
>> > You might want to search for the terms "multiple testing" and
>> > "Bonferroni correction".
>>
>> That answers the final question -- assuming that you do have
>> some stated hypothesis or goal.
>
>Not quite. The "Bonferroni correction" is an approximation, and one

The sufficient answer started with "search for the terms" -- you
should find much more than just "how to" apply the Bonferroni correction.

Multiple testing is also a broad topic. The original question was
not very specific, but there should be a GOAL, something about
making some /decision/ or reaching a conclusion.

Here's some open-ended thinking about an open-ended question.

I think I can usually work a decision into some hypothesis; but
"p-level of 0.05" is a convention of social science research. Not
every hypothesis merits that test.

Some areas (tests for new atomic particles, say) use far more stringent
nominal levels ... I think the official logic incorporates
"Bonferroni"-type considerations. But for decisions in general,
in other areas, sometimes we settle for "50%" (or worse).

>needs to think about that, and more deeply than just the approximation
>to 1-(1-p)^n. That formula is exact if all the
>test-statistics are statistically independent; it is conservative if
>there is positive dependence (and so "OK"). But, theoretically, it
>might be wildly wrong if there is negative dependence.

--
Rich Ulrich
