# Institute of Statistics

Nov 7, 2013

Title

Statistical Inference under the General Response Transformation
Heteroscedastic Robust Regression Model

Principal Investigator

Chih-Rung Chen

National Science Council

Keywords

Bayesian Inference, Power Transformation, Exponential Transformation,
Aranda-Ordaz Transformation, Heteroscedasticity

When there exist heteroscedastic errors and/or departures from normality in the
data, a popular approach is to transform the response. Originally, transforming the
response was proposed both as a means of achieving homoscedasticity and
approximate normality and for inducing a simpler linear model for the transformed
response (Box and Cox, 1964). In such situations, Box and Cox (1964) propose the
following response transformation normal homoscedastic regression model for
modeling independent continuous data:

h(yi;λ) = f(xi;β) + εi, i = 1, …, n,

where yi is the observation for subject i, λ is a finite-dimensional transformation
parameter vector, h(·;λ) is a strictly increasing and differentiable transformation, xi is
a known covariate vector for subject i, β is a finite-dimensional regression parameter
vector, f(·;β) is a regression function, and the εi are i.i.d. N(0,σ²) errors with unknown
variance σ² > 0.

When heteroscedastic errors and departures from normality cannot both be
removed simultaneously by any single transformation, the Box-Cox model
is further generalized to the following response transformation normal heteroscedastic
regression model for modeling independent continuous data:

h(yi;λ) = f(xi;β) + g(f(xi;β),xi;γ) εi, i = 1, …, n,

where γ is a variance parameter vector, g(·,·;γ) is a positive weight function, and
the εi are i.i.d. N(0,1) standardized errors.

However, if the range of the response transformation is different from R
(≡ (−∞,∞)), the corresponding errors cannot be normally distributed. Commonly used
examples are the power transformations (Box and Cox, 1964), exponential
transformations (Manly, 1976), and Aranda-Ordaz transformations (Aranda and Ordaz,
1981). Moreover, the corresponding errors do not even share the same distribution,
because they may have different supports.
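
The range restriction can be made concrete with the Box-Cox power transformation, h(y;λ) = (y^λ − 1)/λ for λ ≠ 0 and log y for λ = 0, defined for y > 0. The following is a minimal sketch only; the function names are illustrative and are not taken from the project.

```python
import math
from typing import Tuple

def box_cox(y: float, lam: float) -> float:
    """Box-Cox power transformation of a positive response y."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def box_cox_range(lam: float) -> Tuple[float, float]:
    """Open interval of values h(y; lam) can take as y ranges over (0, inf)."""
    if lam > 0:
        return (-1.0 / lam, math.inf)   # bounded below by -1/lam
    if lam < 0:
        return (-math.inf, -1.0 / lam)  # bounded above by -1/lam
    return (-math.inf, math.inf)        # the log transform covers all of R
```

For any λ ≠ 0 the transformed response is confined to a half-line, so an error term added to f(xi;β) cannot be exactly normal; only the log case (λ = 0) maps onto the whole real line.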

Thus, Chen and Wang (2003) propose the following general transformation
truncated normal heteroscedastic regression model

h(yi;λ) = f(xi;β) + g(f(xi;β),xi;γ) εi, i = 1, …, n,

where the εi are independent standardized truncated normal errors with median 0.

In this project, we shall first utilize the likelihood function proposed in Chen and
Wang (2003) and then propose the following general transformation truncated normal
heteroscedastic regression model

h(yi;λ) = f(xi;β) + g(f(xi;β),xi;γ) εi, i = 1, …, n,

where yi is the observation for subject i, λ is a finite-dimensional random
transformation parameter vector with normal (or truncated normal or vague) prior
distribution, h(·;λ) is a strictly increasing and differentiable transformation, xi is a
known covariate vector for subject i, β is a finite-dimensional random regression
parameter vector with normal (or truncated normal or vague) prior distribution, f(·;β)
is a regression function, γ is a random variance parameter vector with inverse Wishart
(or truncated inverse Wishart or vague) prior distribution, g(·,·;γ) is a positive weight
function, and the εi are independent standardized truncated normal errors with median 0.

Next, we shall propose the corresponding Markov chain Monte Carlo (MCMC)
posterior estimation, hypothesis testing, credible region, and prediction procedures,
and establish the corresponding finite-sample and large-sample properties for the
proposed Bayesian regression model.

NSC96-2118-M-009-003- (96N214)

--------------------------------------------------------------------------------------------------------------

Title

A Study for Tolerance Interval

Principal Investigator

Lin-An Chen

National Science Council

Keywords

The tolerance interval is often used to investigate whether a lot contains a specified
percentage of acceptable products at some desired confidence. This paper shows that
this confidence, with the percentage fixed, is actually an unknown parameter, and that
the popular shortest tolerance interval by Eisenhart et al. (1940) is not capable of
serving as a test statistic for the hypothesis that the unknown confidence equals a
desired constant q0. A new test is shown to be more capable for this purpose. Sample
size determination based on this new test, which protects the manufacturer’s benefits
and risks when the specification limits indicate a true confidence well above and well
below q0, respectively, has also been studied.
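
As background for the notion of "confidence" used here, the sketch below estimates by simulation how often a normal-theory interval of the form mean ± k·sd captures at least a given proportion of the population. It illustrates only the textbook definition, not the new test described in this abstract. The names `interval_content` and `estimated_confidence` and all default values are assumptions made for illustration. Because the quantity is location and scale invariant, a standard normal population is used without loss of generality.

```python
import random
from statistics import NormalDist, mean, stdev

STD_NORMAL = NormalDist(0.0, 1.0)

def interval_content(lower, upper):
    """Proportion of a standard normal population lying in [lower, upper]."""
    return STD_NORMAL.cdf(upper) - STD_NORMAL.cdf(lower)

def estimated_confidence(n, k, proportion, reps=2000, seed=0):
    """Monte Carlo estimate of P(content of mean +/- k*sd >= proportion)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        centre, spread = mean(sample), stdev(sample)
        if interval_content(centre - k * spread, centre + k * spread) >= proportion:
            hits += 1
    return hits / reps
```

Calling `estimated_confidence(20, k, 0.90)` for increasing values of k shows the estimated confidence rising toward 1, which is the trade-off that a tolerance factor is chosen to balance.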

NSC96-2119-M-009-002- (96N215)

--------------------------------------------------------------------------------------------------------------

Title

Statistical Validation and Inferences of Endophenotypes (1/2)

Principal Investigator

Guan-Hua Huang

National Science Council

Keywords

Endophenotype, Genetic Analysis, Heritability, Variance Component Analysis

Endophenotypes, which involve the same biological pathways as diseases but
presumably are closer to the relevant gene action than diagnostic phenotypes, have
emerged as an important concept in the genetic studies of complex diseases. In this
project, we propose to develop a formal statistical methodology for validating
endophenotypes. The proposed method is motivated by the conditioning strategy
used for surrogate endpoints commonly seen in clinical research. Indices such as
proportion of heritability explained, […]ive heritability are used as operational
criteria of validation. In addition, we will provide relevant confidence intervals for
these indices for making statistical inferences. Using these confidence intervals, we
will construct some criteria to help us search for a useful endophenotype. The
usefulness of the proposed methods will be demonstrated through computer
simulations.

NSC96-2118-M-009-001-MY2 (96N185)

--------------------------------------------------------------------------------------------------------------

Title

The Bayesian Inference and Weighted Frequentist Inference (2/2)

Principal Investigator

Hui-Nien Hung

National Science Council

Keywords

In statistical theory, the frequentist and Bayesian points of view are different.
Sometimes, for computational purposes, frequentist methods use a Bayesian prior as a
computing tool. Bayesian statisticians, however, always treat those methods as
Bayesian methods and criticize them from the Bayesian point of view. I do not think
that is the right way. In many statistical problems there is, from the frequentist point
of view, no best solution (for example, a UMP test does not always exist). In these
situations, if we can put the “right” weight on the whole or part of the parameter
space, then we may find the best solution from the frequentist point of view.

There are three major points in this two-year project. The first is that, even though
the weight function in the frequentist point of view and the prior in the Bayesian
point of view are similar, the criteria for the “best method” differ between Bayesians
and frequentists, the meaning of the “right weight” differs, and sometimes the
frequentist puts weight on only part of the parameter space. Therefore, we need to
think about the differences between the two methods. The second is that the weight
function or the prior may be improper, and we may then have an improper posterior.
From the frequentist point of view, we think that the improper weight function should
be taken as the limit of a sequence of proper weight functions. Unfortunately, if we
choose different sequences to approach the same improper weight function, we may
obtain different results. Therefore, we will try to find a “good” parameterization of
the parameter space in order to find a “right” sequence of weight functions
approaching the improper weight function. The final goal is to find, from the
frequentist point of view, the “best” weight function on the whole or part of the
parameter space. This is not an easy problem; we hope to make some progress within
these two years.

NSC95-2118-M-009-004-MY2 (95R144-1)

--------------------------------------------------------------------------------------------------------------

Title

A Study on SPC Phase I Process Monitoring (2/2)

Principal Investigator

Shiau, J.-J. H.

National Science Council

Keywords

Univariate Control Charts, Multivariate Control Charts, Phase I Process
Monitoring, Overall False Alarm Rate, Family-wise Error Rate, False
Discovery Rate, Multiple Comparisons, Profile Monitoring

The implementation of SPC process monitoring usually consists of two phases,
Phase I and Phase II. The main task of Phase I is to detect and filter out the
out-of-control data points from the historical data so that the remaining in-control
data can be used to establish appropriate control limits for Phase II process
monitoring. Phase I is an iterative process in which trial control limits are
recalculated each time some data points are declared out of control and assignable
causes are found. However, the effectiveness of the traditional Phase I approach may
be doubtful, since existing out-of-control data points may inflate the variability
estimate, so that some out-of-control data points may go undetected, which in turn
affects the performance of Phase II. To our knowledge, there is no statistical study of
how effective the Phase I approach is. In recent years, researchers have started to
evaluate Phase I methods from the multiple comparisons point of view. However,
controlling the overall false alarm rate (also called the family-wise error rate, FWER)
of the whole Phase I monitoring at a certain level, say 0.05, and assigning a false
alarm rate to each individual test by the Bonferroni approach leads to very low
detection power for each individual test, even when the number of tests in Phase I is
fairly small. In this project, we propose a detailed study of Phase I performance,
including the effectiveness of the current iterative approach and of the new approach
of controlling the overall false alarm rate. To remedy the above-mentioned problems,
we propose (i) using robust methods for choosing in-control data points and
(ii) controlling the FDR (false discovery rate), a popular criterion in bioinformatics,
instead of the overall false alarm rate, to obtain higher detection power. We will study
the performance of the proposed methods theoretically and/or by simulation. In the
first year of the project, we will concentrate on commonly used univariate control
charts such as X̄-R charts or X̄-S charts. In the second year, we will focus on
multivariate control charts, which are much more complicated than the univariate
case. If time and manpower permit, we will extend the study further to profile
monitoring.
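
The contrast between FWER control by Bonferroni and FDR control can be sketched as follows, using the Benjamini-Hochberg step-up procedure as one standard way of controlling the FDR. The abstract does not say which FDR procedure the project adopts, so this choice is an assumption. The function names and the p-values are also made up for illustration.

```python
def bonferroni_reject(pvalues, alpha=0.05):
    """Flag test i when p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(pvalues)
    return [p <= alpha / m for p in pvalues]

def benjamini_hochberg_reject(pvalues, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= k * alpha / m.
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank * alpha / m:
            cutoff = rank
    reject = [False] * m
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject

# Hypothetical p-values, one per Phase I subgroup test.
example_pvalues = [0.001, 0.008, 0.012, 0.03, 0.2, 0.5, 0.7, 0.9, 0.95, 0.99]
n_bonferroni = sum(bonferroni_reject(example_pvalues))
n_bh = sum(benjamini_hochberg_reject(example_pvalues))
```

On these ten made-up p-values Bonferroni flags one subgroup while the step-up rule flags three, which is the kind of gain in detection power the abstract refers to.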

NSC95-2118-M-009-006-MY2 (95R146-1)

--------------------------------------------------------------------------------------------------------------

Title

Statistical Analysis of Large Genetic Networks in Yeast (1/2)

Principal Investigator

Lu, H. H.-S.

National Science Council

Keywords

Systems Biology, Computational Complexity, Dimension Reduction,
Multi-Dimensional Scaling (MDS), Cell Cycle, Clustering, Classification,
Registration, Diauxic Shift, Fermentation, Boolean Networks, Bayesian Networks,
Protein Interaction Network

Is it possible to develop simplified models to gain deep insights into large and
complex biological networks? This is a top challenge for systems biology in the era of
post-genomic studies. We plan to develop statistical methods for this purpose. The
large genetic networks in yeast will be used as examples.

First of all, it is crucial to reduce the computational complexity of statistical
methods for analyzing large genetic networks. For instance, we plan to develop
improved methods with low computational complexity for dimension reduction
techniques, including multi-dimensional scaling (MDS) and related methods in
nonlinear dimension reduction. These improved methods will be applied to the
analysis of yeast cell cycles and their genetic networks.

Secondly, it is often necessary to develop statistical methods to analyze gene
expression curves for investigating large genetic networks. The gene expression
curves could have time shifts that will need registration in clustering and
classification. We plan to develop statistical methods for analyzing the gene
expression curves of diauxic shift in fermentation for yeast. Network analysis by
Boolean and Bayesian networks will be used for the follow-up analysis.

Finally, it is challenging to develop statistical methods for analyzing the
interaction patterns of large genetic networks. The interaction patterns could be
distinct, and the resulting observation types in various experimental techniques will
be different. Hence, we plan to develop statistical methods of estimation and inference
for analyzing interaction patterns in yeast protein interaction networks by integrating
databases from different experimental techniques and laboratories.

At the end of this long-term project, we will have developed improved statistical
methods with low computational complexity for dimension reduction, network
analysis, and interaction pattern analysis in yeast genetic networks. These methods can
be applied to study large genetic networks in human and other species for the
investigation of systems biology.

NSC96-2118-M-009-004-MY2 (96N186)

--------------------------------------------------------------------------------------------------------

Title

Analysis of Instant Trend of the Price of a Stock by Solving the Model of the
Markov Chains in Random Environments

Principal Investigator

Nan Fu Peng

National Science Council

Keywords

We use the model of Markov chains in random environments to analyze the
instant trend of the price of a stock. We first assume the price of a stock to be a
geometric Brownian motion. Conditional on the Brownian motion, the trend follows a
two-state Markov chain. Our goal is to find the finite-time distributions and the
limiting distributions of the stochastic transition probabilities of the Markov chain.
Extensions of this model are also explored.

NSC96-2118-M-009-002- (96N213)

--------------------------------------------------------------------------------------------------------------

Title

Exact Confidence Coefficients of Confidence Intervals for Discrete Distributions

Principal Investigator

Hsiuying Wang

National Science Council

Keywords

For a confidence interval (L(X), U(X)) of a parameter θ in discrete distributions, the
coverage probability is a variable function of θ. The confidence coefficient is the
infimum of the coverage probabilities, inf_θ P_θ(θ ∈ (L(X), U(X))). Since we do not
know at which point in the parameter space the infimum coverage probability occurs,
the exact confidence coefficients are unknown.
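
To illustrate why the infimum is hard to locate, the sketch below computes the exact coverage probability of a confidence interval for a binomial proportion p at any given p, and then takes the minimum over a finite grid. The Wald interval, the grid, and the function names are assumptions chosen for illustration; the abstract does not say which intervals or distributions the project treats. A grid minimum only bounds the true infimum from above, because the coverage function is discontinuous in p.

```python
import math

def wald_interval(x, n, z=1.959963984540054):
    """Wald interval for a binomial proportion (nominal 95% by default)."""
    p_hat = x / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def coverage_probability(p, n, interval=wald_interval):
    """Exact P_p(p in (L(X), U(X))) for X ~ Binomial(n, p)."""
    total = 0.0
    for x in range(n + 1):
        lower, upper = interval(x, n)
        if lower < p < upper:
            total += math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)
    return total

def min_coverage_on_grid(n, grid_size=999, interval=wald_interval):
    """Smallest coverage over an equally spaced grid in (0, 1), with its location."""
    grid = [(j + 1) / (grid_size + 1) for j in range(grid_size)]
    return min((coverage_probability(p, n, interval), p) for p in grid)
```

For example, `min_coverage_on_grid(20)` returns a pair (coverage, p) showing that the Wald interval's coverage falls far below its nominal level near the boundary of the parameter space.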

Besides confidence coefficients, a confidence interval can also be evaluated by its
average coverage probability. Usually, the exact average coverage probability is also
unknown, and it is approximated by taking the mean of the coverage probabilities at
some randomly chosen points in the parameter space. In this research, we plan to
propose methodologies for computing the exact confidence coefficients of confidence
intervals for discrete distributions in the first year, and methodologies for computing
the exact average coverage probabilities of confidence intervals for discrete
distributions in the second year.

NSC95-2118-M-009-011-MY2 (95R748-1)

--------------------------------------------------------------------------------------------------------------

Title

Semi-Parametric Estimation for Dependent Truncation Data (2/3)

Principal Investigator

Wei Jing Wang

National Science Council

Keywords

Archimedean Copula Model, Semi-Parametric Inference, Truncation

In many useful applications, the variable of interest may be truncated by
another random variable. Most existing inference methods are derived under the
assumption that the truncation variable is independent of the variable of interest.
Although a couple of papers have discussed testing quasi-independence between
the two variables, assessing the underlying dependence relationship is still an open
problem in the literature. In this project, we assume that the dependence structure
follows a semi-parametric copula model. Objectives of statistical inference include
estimation of the marginal distribution functions, the truncation proportion, and the
association parameter. The whole problem is quite challenging, since all three
quantities are unknown and estimating each of them under truncation is not easy.
Simulations will be performed to assess the validity of the estimators and evaluate
their finite-sample performance. Large-sample theory for the proposed method
will be developed.
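
A small simulation can show what dependent truncation data look like. The sketch below draws pairs from a Clayton copula (one member of the Archimedean family named in the keywords), gives both variables exponential margins, and keeps a pair only when the truncation variable T does not exceed the variable of interest X. The Clayton family, the exponential margins, the truncation direction, and every name in the code are assumptions made for illustration, not the project's model.

```python
import math
import random

def sample_clayton_pair(theta, rng):
    """Draw (u, v) from a Clayton copula with theta > 0 (conditional method)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids u == 0
    w = 1.0 - rng.random()
    inner = u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0
    return u, inner ** (-1.0 / theta)

def simulate_truncated_sample(n_draws, theta, rate_t=1.0, seed=0):
    """Return the observed (x, t) pairs and the fraction of draws that were kept."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n_draws):
        u, v = sample_clayton_pair(theta, rng)
        x = -math.log(u)           # variable of interest, Exp(1) margin
        t = -math.log(v) / rate_t  # truncation variable, Exp(rate_t) margin
        if t <= x:                 # the pair is observed only when T <= X
            observed.append((x, t))
    return observed, len(observed) / n_draws
```

In real data only the retained pairs are available, so the margins, the retained fraction, and the association parameter theta all have to be recovered from a sample that has already been selected.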

NSC95-2118-M-009-005-MY3 (95R145-1)

--------------------------------------------------------------------------------------------------------------