MINIMAX ESTIMATION OF THE MEAN MATRIX OF THE
MATRIX VARIATE NORMAL DISTRIBUTION UNDER THE
DIVERGENCE LOSS FUNCTION
Shokofeh Zinodiny, Amirkabir University of Technology, Tehran, Iran
Sadegh Rezaei, Amirkabir University of Technology, Tehran, Iran
Saralees Nadarajah¹, University of Manchester, Manchester M13 9PL, UK
1. INTRODUCTION
Let X = (x_{i,j}) be a p × m random matrix having a matrix variate normal distribution with mean matrix Θ = (θ_{i,j}) and covariance matrix V ⊗ I_m, where V is known or unknown, I_m is the m × m identity matrix and ⊗ denotes the Kronecker product. This note considers estimation of Θ relative to the divergence loss function

L(a; Θ) = (1 / (β(1 − β))) [1 − ∫_{R^{p×m}} f^β(X | a) f^{1−β}(X | Θ) dX],  (1)

where 0 < β < 1, f(X | Θ) is the conditional density function of X given Θ and a is an estimator of Θ.
Estimators for the mean matrix have been proposed under different loss functions. Efron and Morris (1972) proposed an empirical Bayes estimator outperforming the maximum likelihood estimator, X, for the case m > p + 1. Stein (1973), Zhang (1986a), Zhang (1986b), Bilodeau and Kariya (1989), Ghosh and Shieh (1991), Konno (1991), Tsukuma (2008), Tsukuma (2010) and others found minimax estimators better than the maximum likelihood estimator for quadratic loss and general quadratic loss functions.
The most recent minimax estimator for Θ is that due to Zinodiny et al. (2017), who estimated Θ under the balanced loss function. However, Zinodiny et al. (2017) considered only the case V = σ²I_p, where σ² is unknown. In this note, we consider three cases as outlined in the abstract. Furthermore, the divergence loss function in (1) has attractive properties not shared by other loss functions, including the balanced loss function. Firstly, it is invariant to one-to-one coordinate transformations of a. Secondly, the Fisher information function, commonly used to measure the sensitivity of estimates, is proportional to (1). Thirdly, many loss functions, including the balanced loss function, are symmetric, but (1) is not symmetric: if β is closer to 1, more weight is given to a; if β is closer to 0, more weight is given to Θ. Other attractive properties of (1) can be found in Kashyap (1974).

¹ Corresponding author. E-mail: mbbsssn2@manchester.ac.uk
The divergence loss function directly compares the densities f(X | Θ) and f(X | a). Robert (2001) refers to it as an intrinsic loss function; see also Amari (1982) and Cressie and Read (1984). The divergence loss function has been applied in many areas. Some examples include: discrimination between stationary Gaussian processes (Corduas, 1985); the Vocal Joystick engine, a real-time software library which can be used to map non-linguistic vocalizations into realizable continuous control signals (Malkin et al., 2011); a risk-aware modeling framework for speech summarization (Chen and Lin, 2012); models for web image retrieval (Yang et al., 2012); models for coclustering the mouse brain atlas (Ji et al., 2013); properties of record values (Paul and Thomas, 2016).
The aim of this note is to estimate the mean matrix of the matrix variate normal distribution under the divergence loss function. The contents are organized as follows: Section 2 shows that the empirical Bayes estimators dominate the maximum likelihood estimator under (1) for m > p + 1, and hence that the latter is inadmissible, for the case V is known. We find a general class of minimax estimators using a technique due to Stein (1973) when V = I_p. Section 3 shows that X is inadmissible for m > p + 1 when V = σ²I_p, where σ² is unknown. Section 4 shows that X is inadmissible for the case V is unknown. These sections in fact extend the results of Ghosh et al. (2008) and Ghosh and Mergel (2009) to the multivariate case. Section 5 performs a simulation study to compare one of the derived estimators with that in Zinodiny et al. (2017). Section 6 concludes the note.
2. ESTIMATION OF THE MEAN MATRIX WHEN V IS KNOWN
Here, we suppose

X ∼ N_{p×m}(Θ, V ⊗ I_m),

where N_{p×m}(Θ, V ⊗ I_m) denotes the matrix variate normal distribution with mean matrix Θ and covariance matrix V ⊗ I_m, where V is assumed known. We consider the problem of estimating the mean matrix Θ under the loss function (1), namely

L(a; Θ) = (1 / (β(1 − β))) [1 − ∫_{R^{p×m}} f^β(X | a) f^{1−β}(X | Θ) dX]
        = (1 / (β(1 − β))) [1 − exp(−β(1 − β) tr((a − Θ)′ V^{−1} (a − Θ)) / 2)],  (2)

where tr(A) and A′ denote, respectively, the trace and the transpose of a matrix A, and f(X | Θ) denotes the matrix variate normal density function. The usual estimator for Θ is the maximum likelihood estimator, X.
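To make the closed form in (2) concrete, the following sketch (ours, in Python; not part of the original derivation) evaluates the loss from the trace formula and, in the one-dimensional case, checks the defining integral numerically against the exponential term. The function and variable names are illustrative.

```python
import numpy as np

def divergence_loss(a, theta, V_inv, beta):
    # Closed form of the loss in (2):
    # L(a; Theta) = (1 - exp(-beta(1-beta) tr((a-Theta)' V^{-1} (a-Theta)) / 2)) / (beta(1-beta))
    d = a - theta
    q = np.trace(d.T @ V_inv @ d)
    return (1.0 - np.exp(-beta * (1.0 - beta) * q / 2.0)) / (beta * (1.0 - beta))

# Sanity check in the 1x1 case: the integral of f^beta(x|a) f^{1-beta}(x|theta)
# over the real line should reproduce the exponential term of (2).
beta, a, theta, v = 0.3, 1.2, 0.5, 2.0
x = np.linspace(-25.0, 25.0, 400001)
dx = x[1] - x[0]
pdf = lambda mu: np.exp(-(x - mu) ** 2 / (2.0 * v)) / np.sqrt(2.0 * np.pi * v)
integral = np.sum(pdf(a) ** beta * pdf(theta) ** (1.0 - beta)) * dx
closed_form = np.exp(-beta * (1.0 - beta) * (a - theta) ** 2 / (2.0 * v))
```

The agreement of `integral` with `closed_form` is exactly the Gaussian identity used to pass from the integral form of (1) to the exponential form of (2).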
Lemma 1 shows that the maximum likelihood estimator is minimax under the loss function (2).
LEMMA 1. The estimator X is minimax under the loss function (2).
PROOF. Suppose the proper prior sequence π_n(Θ) is distributed according to a matrix variate normal distribution with mean matrix zero and covariance matrix nC ⊗ I_m, i.e. Θ ∼ N_{p×m}(0_{p×m}, nC ⊗ I_m), where C is a p × p known positive definite matrix. Then, the posterior distribution is

Θ | X ∼ N_{p×m}([I_p − V(nC + V)^{−1}] X, [V^{−1} + (nC)^{−1}]^{−1} ⊗ I_m).

Thus, the Bayes estimator is

δ^{π_n}(X) = E[Θ | X] = [I_p − V(nC + V)^{−1}] X.

Moreover, the risk function of X and the Bayes risk function of δ^{π_n}(X) are

R(X; Θ) = (1 − [1 + β(1 − β)]^{−pm/2}) / (β(1 − β))

and

r_n = r(π_n, δ^{π_n}(X)) = (1 − |V^{−1} + (nC)^{−1}|^{m/2} |[1 + β(1 − β)] V^{−1} + (nC)^{−1}|^{−m/2}) / (β(1 − β)),

respectively, where |A| denotes the determinant of A. Since the determinant of a matrix is a continuous function, we have

lim_{n→∞} r_n = (1 − [1 + β(1 − β)]^{−pm/2}) / (β(1 − β)) = sup_Θ R(X; Θ).
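The constant risk of X can be checked by simulation. The sketch below (ours, not from the paper) uses the fact that, for a = X and V = I_p, the quadratic form tr((X − Θ)′(X − Θ)) is chi-square with pm degrees of freedom, so the Monte Carlo average of the loss should approach the stated formula. Sample sizes and parameter values are illustrative.

```python
import numpy as np

# Monte Carlo check of the risk formula for the maximum likelihood estimator:
# R(X; Theta) = (1 - [1 + beta(1-beta)]^{-pm/2}) / (beta(1-beta)).
rng = np.random.default_rng(0)
p, m, beta = 2, 3, 0.3
b = beta * (1.0 - beta)

q = rng.chisquare(p * m, size=200_000)             # tr((X - Theta)'(X - Theta))
mc_risk = np.mean((1.0 - np.exp(-b * q / 2.0)) / b)
exact_risk = (1.0 - (1.0 + b) ** (-p * m / 2.0)) / b
```

The check relies on E[exp(−t χ²_k)] = (1 + 2t)^{−k/2} with t = β(1 − β)/2, which is exactly how the closed-form risk arises.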
Hence, using Theorem 1.12 on page 613 of Lehmann and Casella (1998) and results of Blyth (1951), X is minimax under the loss function (2). □

Now we construct a class of empirical Bayes estimators better than X. Assume we have some additional information about Θ that can be written as

Θ ∼ N_{p×m}(0_{p×m}, A ⊗ I_m).
The conditional distribution of Θ given X is

N_{p×m}([I_p − V(V + A)^{−1}] X, [V^{−1} + A^{−1}]^{−1} ⊗ I_m).

So, the Bayes estimator of Θ under (2) is

Θ̂_B = E[Θ | X] = [I_p − V Σ^{−1}] X,

where Σ = A + V. In an empirical Bayes scenario, Σ is unknown and is estimated from the marginal distribution of X. The marginal distribution of X is N_{p×m}(0_{p×m}, Σ ⊗ I_m). So, S = XX′ is completely sufficient for Σ and an empirical Bayes estimator is

Θ̂_EB = [I_p − V Σ̂^{−1}(S)] X,

where Σ̂^{−1}(S) is an estimator for Σ^{−1} and Σ̂^{−1}(S) depends on X only through S. According to Ghosh and Shieh (1991), a natural candidate for Σ̂^{−1}(S) is (m − p − 1) S^{−1}. Ghosh and Shieh (1991) obtained this empirical Bayes estimator when the loss function was

L_1(δ; Θ) = tr[(δ − Θ)′ Q (δ − Θ)],  (3)

where Q is a p × p known positive definite matrix.
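The empirical Bayes rule with the natural candidate Σ̂^{−1}(S) = (m − p − 1) S^{−1} is straightforward to implement. The sketch below (ours, not from the paper) uses an example with orthogonal rows so the shrinkage factors can be read off by hand.

```python
import numpy as np

def eb_estimator(X, V):
    # Empirical Bayes rule Theta_EB = (I_p - V Sigma_hat^{-1}(S)) X with the
    # natural candidate Sigma_hat^{-1}(S) = (m - p - 1) S^{-1}, S = X X'.
    p, m = X.shape
    S = X @ X.T
    return (np.eye(p) - (m - p - 1) * V @ np.linalg.inv(S)) @ X

# Example with V = I_2, p = 2, m = 4 (so m - p - 1 = 1): the rows of X are
# orthogonal, so S is diagonal and each row is shrunk towards zero separately.
X = np.array([[2.0, 0.0, 0.0, 0.0],
              [0.0, 3.0, 0.0, 0.0]])
theta_eb = eb_estimator(X, np.eye(2))
```

Here S = diag(4, 9), so the shrinkage factors are 1 − 1/4 = 0.75 and 1 − 1/9 = 8/9 applied to the two rows.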
To continue, we use the following notations borrowed from Ghosh and Shieh (1991): let E_Θ denote the expectation conditional on Θ; Ẽ denotes the expectation over the marginal distribution of Θ; E denotes the expectation over the joint distribution of X and Θ. Let R_i(δ; Θ) = E_Θ[L_i(δ; Θ)], i = 1, 2, 3 and R(δ; Θ) = E_Θ[L(δ; Θ)]. We use the notation D = (d_{i,j}), where d_{i,j} = ((1 + δ_{i,j})/2) ∂/∂s_{i,j} (see the last line of page 308 in Ghosh and Shieh (1991)), δ_{i,j} being the Kronecker deltas and s_{i,j} the (i, j)th element of S. Note that D is a differential operator in the form of a matrix. If we assume f(A) is a real-valued function of a p × p matrix A, then D f(A) is a p × p matrix with (i, j)th element ((1 + δ_{i,j})/2) ∂f(A)/∂s_{i,j}. For example, if p = 2 and f(A) = s²_{1,1} + s²_{2,2}, then

D f(A) = diag(2 s_{1,1}, 2 s_{2,2}).

Also, for any p × p matrix T, D(T) is a p × p matrix with (i, j)th element Σ_{l=1}^{p} d_{i,l} t_{l,j}. For example, if p = 2, then

D(S) = diag(3/2, 3/2).
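The operator D can be checked numerically by finite differences, remembering that s_{i,l} and s_{l,i} are the same variable of the symmetric matrix S. The sketch below (ours, not from the paper) reproduces the example D(S) = (3/2) I_2; the function names and step size are illustrative.

```python
import numpy as np

def D_operator(T_fn, S, eps=1e-6):
    # Apply D = (d_ij), d_ij = ((1 + delta_ij)/2) d/ds_ij, to a matrix-valued
    # function T_fn of a symmetric matrix S, via central finite differences:
    # (D T)_{i,j} = sum_l d_{i,l} t_{l,j}.
    p = S.shape[0]
    out = np.zeros((p, p))
    for i in range(p):
        for l in range(p):
            Sp, Sm = S.copy(), S.copy()
            # s_{i,l} and s_{l,i} are one variable of the symmetric matrix
            Sp[i, l] += eps; Sm[i, l] -= eps
            if i != l:
                Sp[l, i] += eps; Sm[l, i] -= eps
            dT = (T_fn(Sp) - T_fn(Sm)) / (2.0 * eps)   # d T / d s_{i,l}
            coef = 1.0 if i == l else 0.5              # (1 + delta_{i,l}) / 2
            for j in range(p):
                out[i, j] += coef * dT[l, j]
    return out

S = np.array([[2.0, 0.7], [0.7, 3.0]])
DS = D_operator(lambda A: A, S)   # expect (3/2) I_2
```

The diagonal entries pick up 1 from the diagonal variable plus 1/2 from the symmetric off-diagonal pair, giving 3/2, while the off-diagonal entries vanish.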
THEOREM 2. The empirical Bayes estimator Θ̂_EB is minimax under (2) if it is minimax under the quadratic loss function (3) when Q = V^{−1}.
PROOF. The difference of the risks of Θ̂_EB and X is

R(Θ̂_EB; Θ) − R(X; Θ) = (1 / (β(1 − β))) [g(X; Θ) − g(Θ̂_EB; Θ)],

where

g(δ; Θ) = E_Θ[exp(−β(1 − β) tr((δ − Θ)′ V^{−1} (δ − Θ)) / 2)].

Using the fact that e^{−x} ≥ 1 − x, we see that

g(Θ̂_EB; Θ) ≥ g(X; Θ) − (β(1 − β)/2) [1 + β(1 − β)]^{−pm/2} E_Θ[tr(X′ Σ̂^{−1} V′ V^{−1} V Σ̂^{−1} X) − 2 tr(X′ Σ̂^{−1} V′ V^{−1} (X − Θ))],

where E_Θ is taken with respect to N_{p×m}(Θ, V / (1 + β(1 − β)) ⊗ I_m). Using Theorem 2 in Ghosh and Shieh (1991) and the notation R_1(δ; Θ) = E_Θ[L_1(δ; Θ)], we can write

g(Θ̂_EB; Θ) − g(X; Θ) ≥ −(β(1 − β)/2) [1 + β(1 − β)]^{−pm/2} [R_1(Θ̂_EB; Θ) − R_1(X; Θ)].  (4)

Hence, if R_1(Θ̂_EB; Θ) ≤ R_1(X; Θ), then R(Θ̂_EB; Θ) ≤ R(X; Θ). □

Similar to Ghosh and Shieh (1991), we can write (4) as

g(Θ̂_EB; Θ) − g(X; Θ) ≥ −(β(1 − β)/2) [1 + β(1 − β)]^{−pm/2} E[tr(V^{1/2} U_1 V^{1/2})],

where

U_1 = Σ̂^{−1}(S) S Σ̂^{−1}(S) − 4 D[S Σ̂^{−1}(S)] − 2(m − p − 1) Σ̂^{−1}(S),

and R_1(Θ̂_EB; Θ) ≤ R_1(X; Θ) if U_1 ≤ 0 with positive probability for some Θ. If Σ̂^{−1}(S) is chosen in such a way that Θ̂_EB improves on X under the quadratic loss function (3), then it improves on X also under (2).
Let O_p be the set of orthogonal matrices of order p and let V_{m,p} be the Stiefel manifold, namely,

V_{m,p} = {V ∈ R^{m×p} : V′V = I_p}.

Write the singular value decomposition of X as ULV′, where U ∈ O_p, V ∈ V_{m,p} and L = diag(l_1, l_2, ..., l_p) with l_1 > l_2 > ··· > l_p > 0. Ghosh and Shieh (1991) showed that if Σ̂^{−1}(S) = U L^{−1} Ψ*(L) U′, where Ψ*(L) is a diagonal matrix with diagonal elements equal to ψ*_i(L), i = 1, ..., p, then Θ̂_EB is minimax under the following conditions when the loss function is (3) with Q = V^{−1}.
THEOREM 3. Suppose that ψ*_i(L), i = 1, ..., p satisfy

I. 0 < ψ*_i(L) ≤ 2(m − p − 1) for i = 1, ..., p,

II. for every i = 1, ..., p, ∂ψ*_i(L)/∂l_i ≥ 0,

III. the ψ*_i(L) are similarly ordered to the l_i, that is, (ψ*_i(L) − ψ*_t(L))(l_i − l_t) ≥ 0 for every 1 ≤ i, t ≤ p.

Then, Θ̂_EB = [I_p − V U L^{−1} Ψ*(L) U′] X improves on X for m > p + 1 under the loss function (3).
PROOF. See Ghosh and Shieh (1991). □
Using Theorem 2, we can see that Θ̂_EB = [I_p − V U L^{−1} Ψ*(L) U′] X is minimax for m > p + 1 under the conditions of Theorem 3 when the loss function is (2). This means that X is inadmissible for m > p + 1 under the loss function (2). Examples 1 and 2 in Ghosh and Shieh (1991) illustrate this result. Of course, one may obtain estimators of this form which dominate X when the conditions of Theorem 3 are not met. Some examples are given in Ghosh and Shieh (1991).

In the rest of this section, we consider the problem of estimating Θ under the loss function (2) for the case V = I_p. Let δ = X + G, where G = G(X) is a p × m matrix valued function of X. Also let ∇ be a p × m matrix with (i, j)th element equal to the differential operator ∂/∂x_{i,j}. We obtain a general condition for minimaxity.
THEOREM 4. Suppose δ = X + G is minimax under (3). Then it is also minimax under (2).

PROOF. The difference of the risks of δ and X is

R(δ; Θ) − R(X; Θ) = (1 / (β(1 − β))) [g(X; Θ) − g(δ; Θ)],

where

g(δ; Θ) = E_Θ[exp(−β(1 − β) tr((δ − Θ)′ (δ − Θ)) / 2)].

Using e^{−x} ≥ 1 − x, we see that

g(δ; Θ) ≥ g(X; Θ) − (β(1 − β)/2) [1 + β(1 − β)]^{−pm/2} E_Θ[tr(G′G) + 2 tr(G′(X − Θ))],

where E_Θ is taken with respect to N_{p×m}(Θ, I_p / (1 + β(1 − β)) ⊗ I_m). Using Theorem 2 in Ghosh and Shieh (1991), we can write

g(δ; Θ) − g(X; Θ) ≥ −(β(1 − β)/2) [1 + β(1 − β)]^{−pm/2} [R_1(δ; Θ) − R_1(X; Θ)].

Hence, if E_Θ[tr(G′G) + 2 tr(G′(X − Θ))] ≤ 0, then R(δ; Θ) ≤ R(X; Θ). Using Stein (1973)'s identities, (1 + β(1 − β)) E_Θ[tr(G′(X − Θ))] = E_Θ[tr(∇G′)], so we can write

E_Θ[tr(G′G) + 2 tr(G′(X − Θ))] = E_Θ[tr(G′G) + 2 [1 + β(1 − β)]^{−1} tr(∇G′)].

So, if tr(G′G) + 2 [1 + β(1 − β)]^{−1} tr(∇G′) ≤ 0 with positive probability for some Θ, then δ is minimax under the loss function (2).
Now, assume the class of shrinkage estimators

δ = [I_p − U F^{−1} Ψ(F) U′] X,  (5)

where Ψ(F) = diag(ψ_1, ..., ψ_p) is a diagonal matrix with the ψ_i being functions of F = L². We obtain the following by applying Theorem 4.

COROLLARY 5. Suppose the ψ_i satisfy

I. for fixed f_1, ..., f_{i−1}, f_{i+1}, ..., f_p, ∂ψ_i/∂f_i ≥ 0, where i = 1, ..., p,

II. 0 ≤ ψ_p ≤ ··· ≤ ψ_1 ≤ 2(m − p − 1).

Then, δ in (5) is minimax under the loss function (2).

PROOF. With the notation Φ(F) = F^{−1} Ψ(F), Stein (1973) showed that δ = [I_p − U Φ(F) U′] X is minimax under conditions I and II when the loss function is (3). Using Theorem 4, we can see that δ = X + G is minimax under (2), where G = −U Φ(F) U′ X. □
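The spectral form of the shrinkage class (5) is easy to compute from the SVD of X. The sketch below (ours, not from the paper) uses the constant choice ψ_i = m − p − 1, which satisfies conditions I and II, and checks that it coincides with the empirical Bayes rule (I_p − (m − p − 1) S^{−1}) X, since S = XX′ = U L² U′.

```python
import numpy as np

rng = np.random.default_rng(1)
p, m = 3, 6
X = rng.standard_normal((p, m))

# Shrinkage class (5): delta = (I_p - U F^{-1} Psi(F) U') X with F = L^2
# from the SVD X = U diag(l) V'. Constant psi_i = m - p - 1 is one valid choice.
U, l, Vt = np.linalg.svd(X, full_matrices=False)
F = l ** 2
psi = np.full(p, float(m - p - 1))
delta = (np.eye(p) - U @ np.diag(psi / F) @ U.T) @ X

# With constant psi this is exactly the empirical Bayes rule, because
# S = X X' = U diag(F) U' and hence U diag(1/F) U' = S^{-1}.
S = X @ X.T
delta_eb = (np.eye(p) - (m - p - 1) * np.linalg.inv(S)) @ X
```

Non-constant choices of ψ_i (still obeying the monotonicity and ordering conditions) shrink the small singular values of X more aggressively than the constant rule.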
Tsukuma (2008) showed that the proper Bayes estimators with respect to the prior

Θ | Λ ∼ N_{p×m}(0_{p×m}, [Λ^{−1}(I_p − Λ)] ⊗ I_m)

are of the form δ = [I_p − U F^{−1} Ψ(F) U′] X, where Λ is a p × p random matrix with the density function

π(Λ) ∝ |Λ|^{a/2 − 1} I(0_{p×p} < Λ < I_p).

For certain values of a, these estimators are minimax under the loss function (3); thus, we can see from Corollary 5 that they are minimax under (2) as well.
3. ESTIMATION OF THE MEAN MATRIX WHEN V = σ²I_p AND σ² IS UNKNOWN

Here, we assume that

X ∼ N_{p×m}(Θ, σ²I_p ⊗ I_m),

where σ² is unknown. So, S is independent of X and

S ∼ σ²χ²_n,

where χ²_n denotes a chi-square random variable with n degrees of freedom. We consider estimation of Θ under

L(a; Θ) = (1 / (β(1 − β))) [1 − ∫_{R^{p×m}} f^β(X | a) f^{1−β}(X | Θ) dX]
        = (1 / (β(1 − β))) [1 − exp(−β(1 − β) tr((a − Θ)′(a − Θ)) / (2σ²))].  (6)
Similar to Lemma 1, one can show that the maximum likelihood estimator, X, is minimax under the loss function (6). Tsukuma (2010) obtained general conditions for minimaxity of estimators having the form δ_1 = X + G(X, S), where G(X, S) is a p × m matrix valued function of X and S, under the quadratic loss function

L_2(a; Θ) = tr((a − Θ)′(a − Θ)) / σ².  (7)
THEOREM 6. The estimator δ_1 = X + G(X, S) is minimax under the loss function (6) if it is minimax under (7) for m > p + 1.
PROOF. We have

R(δ_1; Θ) − R(X; Θ) = (1 / (β(1 − β))) [g(X; Θ) − g(δ_1; Θ)],

where g(δ; Θ) = E_Θ[exp(−β(1 − β) tr((δ − Θ)′(δ − Θ)) / (2σ²))]. Using e^{−x} ≥ 1 − x,

g(δ_1; Θ) ≥ g(X; Θ) − (β(1 − β)/2) [1 + β(1 − β)]^{−pm/2} E_Θ[(tr(G′G) + 2 tr(G′(X − Θ))) / σ²]

and

g(δ_1; Θ) − g(X; Θ) ≥ −(β(1 − β)/2) [1 + β(1 − β)]^{−pm/2} [R_2(δ_1; Θ) − R_2(X; Θ)],

where R_2(δ; Θ) = E_Θ[L_2(δ; Θ)]. So, if R_2(δ_1; Θ) ≤ R_2(X; Θ), then R(δ_1; Θ) ≤ R(X; Θ). The proof is complete. □
Tsukuma (2010) showed that if

(n − 2) tr(G′G) / s + 2 tr(∇G′) + ∂tr(G′G)/∂s ≤ 0,

then δ_1 = X + G(X, S) is minimax under the loss function (7). So, using Theorem 6, we see that δ_1 is minimax under the loss function (6).
THEOREM 7. Consider the estimator

δ_2(X, S) = [I_p − U F^{−1} Ψ(F, S) U′] X,

where F = L/s = diag(f_1, ..., f_p) and Ψ(F, s) = diag(ψ_1, ..., ψ_p) are diagonal matrices with the ψ_i being functions of F and s. If the following conditions hold, then δ_2 is minimax under the loss function (6):

I. for fixed F, ∂ψ_i/∂s ≤ 0,

II. for fixed f_1, ..., f_{j−1}, f_{j+1}, ..., f_p, ∂ψ_i/∂f_j ≥ 0, where i, j = 1, ..., p,

III. 0 ≤ ψ_p ≤ ··· ≤ ψ_1 ≤ 2(m − p − 1)/(n + 2).

PROOF. Tsukuma (2010) showed that δ_2 is minimax under conditions I-III when the loss function is (7). So, using Theorem 6, we see that δ_2 is minimax under the loss function (6) if conditions I-III hold. □
4. ESTIMATION OF THE MEAN MATRIX WHEN V IS UNKNOWN

Here, we assume

X ∼ N_{p×m}(Θ, V ⊗ I_m),

where V is unknown. So, V̂ is independent of X and

V̂ ∼ W_p(V, n),

where W_p(V, n) denotes a Wishart random matrix with n degrees of freedom and mean nV. We consider estimating Θ under the loss function

L(a; Θ) = (1 / (β(1 − β))) [1 − ∫_{R^{p×m}} f^β(X | a) f^{1−β}(X | Θ) dX]
        = (1 / (β(1 − β))) [1 − exp(−β(1 − β) tr((a − Θ)′ V^{−1} (a − Θ)) / 2)].  (8)

Similar to Lemma 1, one can show that the maximum likelihood estimator, X, is minimax under the loss function (8).
We now consider the relationship between this loss function and the following quadratic loss function

L_3(a; Θ, V) = tr((a − Θ)′ Q* (a − Θ)),  (9)

where Q* = V^{−1/2} Q V^{−1/2}. Assume δ_3 = X + G(X, V̂), where G(X, V̂) is a p × m matrix valued function of X and V̂.
THEOREM 8. The estimator δ_3 = X + G(X, V̂) is minimax under the loss function (8) if it is minimax under (9).

PROOF. We have

R(δ_3; Θ) − R(X; Θ) = (1 / (β(1 − β))) [g(X; Θ) − g(δ_3; Θ)],

where

g(δ_3; Θ) = E[exp(−β(1 − β) tr((δ_3 − Θ)′ V^{−1} (δ_3 − Θ)) / 2)].

So,

g(δ_3; Θ) ≥ g(X; Θ) − (β(1 − β)/2) [1 + β(1 − β)]^{−pm/2} E_Θ[tr(G′ Q* G) + 2 tr(G′ Q* (X − Θ))],

where E_Θ is taken with respect to N_{p×m}(Θ, V / (1 + β(1 − β)) ⊗ I_m). Using R_3(δ; Θ) = E_Θ[L_3(δ; Θ)], we can write

g(δ_3; Θ) − g(X; Θ) ≥ −(β(1 − β)/2) [1 + β(1 − β)]^{−pm/2} [R_3(δ_3; Θ) − R_3(X; Θ)].

Hence, if R_3(δ_3; Θ) ≤ R_3(X; Θ), then R(δ_3; Θ) ≤ R(X; Θ). □
Shieh (1993) considered empirical Bayes estimators of the form

Θ̂_1EB = [I_p − V̂ S^{−1} τ(V̂, S)] X,

where S = XX′ and τ(V̂, S) is a symmetric matrix. Shieh (1993) showed that Θ̂_1EB is better than the maximum likelihood estimator, X, under (9). For example, he showed that if τ(V̂, S) = (m − p − 1) I_p, then Θ̂_1EB improves on X for m > p + 1.
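The unknown-covariance rule with the simple choice τ(V̂, S) = (m − p − 1) I_p can be sketched as follows (our illustration, not the paper's code); with V̂ = I_p it reduces to the known-covariance empirical Bayes rule of Section 2.

```python
import numpy as np

def shieh_eb_estimator(X, V_hat):
    # Shieh (1993)-type rule Theta_1EB = (I_p - V_hat S^{-1} tau(V_hat, S)) X
    # with the simple symmetric choice tau(V_hat, S) = (m - p - 1) I_p,
    # which improves on X for m > p + 1.
    p, m = X.shape
    S = X @ X.T
    tau = (m - p - 1) * np.eye(p)
    return (np.eye(p) - V_hat @ np.linalg.inv(S) @ tau) @ X

# Example with p = 2, m = 4 and V_hat = I_2, so S = diag(4, 9) and the rows
# are shrunk by the factors 1 - 1/4 and 1 - 1/9.
X = np.array([[2.0, 0.0, 0.0, 0.0],
              [0.0, 3.0, 0.0, 0.0]])
est = shieh_eb_estimator(X, np.eye(2))
```

In practice V̂ would be the independent Wishart estimate of V, and other symmetric choices of τ(V̂, S) are possible.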
For the case Q = V = I_p, Konno (1990), Konno (1991) and Konno (1992) considered the estimators

δ_4 = [I_p − R F^{−1} Φ(F) R′] X if p ≤ m,   δ_4 = [I_p − Q F^{−1} Φ(F) Q^{−1}] X if p > m,

where, for p ≤ m, X V̂^{−1} X′ = R F R′ with R ∈ O_p and, for p > m, Q V̂^{−1} Q′ = I_p and Q′ X X′ Q = F = diag(f_1, ..., f_{m∧p}) with f_1 ≥ ··· ≥ f_{m∧p} > 0. Under the following conditions, Konno (1990), Konno (1991) and Konno (1992) showed that δ_4 improves on X when the loss function is (9):

i) for i = 1, ..., m ∧ p, φ_i(F) is nondecreasing in f_i;

ii) 0 ≤ φ_{m∧p}(F) ≤ φ_{m∧p−1}(F) ≤ ··· ≤ φ_1(F) ≤ 2(m ∨ p − m ∧ p − 1)/(n + (2m − p) ∧ p + 1).

So, using Theorem 8, we see that these estimators are minimax under (8).

5. SIMULATION STUDY
As mentioned in Section 1, Zinodiny et al. (2017) estimated Θ under the balanced loss function when V = σ²I_p, where σ² is unknown. Here, we perform a simulation study to compare this estimator with the estimator under the divergence loss function (see Theorem 8). The simulation study was performed as follows:

1. simulate 10000 random samples, each of size n, from a matrix variate normal distribution with zero means and V = σ²I_p;

2. compute Zinodiny et al. (2017)'s estimator and the estimator in Theorem 8 for each of the samples, say {Θ̂_1, Θ̂_2, ..., Θ̂_10000} and {Θ̃_1, Θ̃_2, ..., Θ̃_10000};

3. compute the biases of Zinodiny et al. (2017)'s estimator and the estimator in Theorem 8 as

(1 / (10000 m p)) Σ_{i=1}^{10000} Σ_{j=1}^{p} Σ_{k=1}^{m} Θ̂_{i,j,k}  and  (1 / (10000 m p)) Σ_{i=1}^{10000} Σ_{j=1}^{p} Σ_{k=1}^{m} Θ̃_{i,j,k};

4. compute the mean squared errors of Zinodiny et al. (2017)'s estimator and the estimator in Theorem 8 as

(1 / (10000 m p)) Σ_{i=1}^{10000} Σ_{j=1}^{p} Σ_{k=1}^{m} Θ̂²_{i,j,k}  and  (1 / (10000 m p)) Σ_{i=1}^{10000} Σ_{j=1}^{p} Σ_{k=1}^{m} Θ̃²_{i,j,k}.

We repeated this procedure for n = 10, 11, ..., 109. We chose σ = 1, m = 10 and p = 5.
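The steps above can be sketched in code. The version below (ours) is a simplified illustration, not the study itself: a generic (m − p − 1)S^{−1} shrinkage rule stands in for the estimators actually compared, and the replication count is reduced; since the true mean is zero, the bias and mean squared error are simple entrywise averages.

```python
import numpy as np

def simulate_bias_mse(estimator, n_rep=2000, p=5, m=10, sigma=1.0, seed=0):
    # Steps 1-4: simulate X ~ N(0, sigma^2 I_p (x) I_m), apply the estimator
    # of the zero mean matrix, then average entries and squared entries.
    rng = np.random.default_rng(seed)
    total, total_sq = 0.0, 0.0
    for _ in range(n_rep):
        X = sigma * rng.standard_normal((p, m))
        est = estimator(X)
        total += est.sum()
        total_sq += (est ** 2).sum()
    denom = n_rep * m * p
    return total / denom, total_sq / denom   # bias, MSE (true Theta = 0)

def mle(X):
    return X

def shrink(X):
    # Illustrative shrinkage rule (I_p - (m - p - 1) S^{-1}) X, S = X X'.
    p, m = X.shape
    return (np.eye(p) - (m - p - 1) * np.linalg.inv(X @ X.T)) @ X

bias_mle, mse_mle = simulate_bias_mse(mle)
bias_sh, mse_sh = simulate_bias_mse(shrink)
```

Under Θ = 0 the maximum likelihood estimator has per-entry MSE σ² = 1, while the shrinkage rule has a noticeably smaller MSE, mirroring the qualitative pattern reported in Figures 1 and 2.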
Figure 1 – Biases of Zinodiny et al. (2017)’s estimator (solid curve) and the estimator in Theorem 8 (broken curve) versusn.
Figure 2 – Mean squared errors of Zinodiny et al. (2017)’s estimator (solid curve) and the estimator in Theorem 8 (broken curve) versusn.
The biases versus n are drawn in Figure 1. The mean squared errors versus n are drawn in Figure 2. We see that the estimator in Theorem 8 has consistently smaller bias and consistently smaller mean squared error for every n. We took σ = 1, m = 10 and p = 5 in the simulations, but the same observation was noted for a wide range of other values of σ, m and p. Hence, the estimator in Theorem 8 is a better estimator with respect to both bias and mean squared error.
6. CONCLUSIONS
We have considered the problem of estimating the mean of a matrix variate normal distribution. We have provided minimax estimators under the divergence loss function. The divergence loss function has several attractive features compared to other loss functions considered in the literature.
We have given minimax estimators for the mean considering different forms for the covariance matrix. The forms considered include the cases that the covariance matrix is known and the covariance matrix is completely unknown.
We have performed a simulation study to compare one of our estimators with the most recently proposed estimator. The simulation shows that our estimator has consistently smaller bias and consistently smaller mean squared error.
ACKNOWLEDGEMENTS
The authors would like to thank the referee and the Editor for careful reading and comments which greatly improved the paper.
REFERENCES
S. AMARI (1982). Differential geometry of curved exponential families-curvatures and information loss. Annals of Statistics, 10, pp. 357-387.

M. BILODEAU, T. KARIYA (1989). Minimax estimators in the normal MANOVA model. Journal of Multivariate Analysis, 28, pp. 260-270.

C. R. BLYTH (1951). On minimax statistical decision procedures and their admissibility. Annals of Mathematical Statistics, 22, pp. 22-42.

B. CHEN, S. H. LIN (2012). A risk-aware modeling framework for speech summarization. IEEE Transactions on Audio, Speech, and Language Processing, 20, pp. 211-222.

M. CORDUAS (1985). On the divergence between linear processes. Statistica, 45, pp. 393-401.

N. CRESSIE, T. R. C. READ (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society Series B, 46, pp. 440-464.

B. EFRON, C. MORRIS (1972). Empirical Bayes on vector observations: An extension of Stein's method. Biometrika, 59, pp. 335-347.

M. GHOSH, V. MERGEL (2009). On the Stein phenomenon under divergence loss and an unknown variance-covariance matrix. Journal of Multivariate Analysis, 100, pp. 2331-2336.

M. GHOSH, V. MERGEL, G. S. DATTA (2008). Estimation, prediction and the Stein phenomenon under divergence loss. Journal of Multivariate Analysis, 99, pp. 1941-1961.

M. GHOSH, G. SHIEH (1991). Empirical Bayes minimax estimators of matrix normal means. Journal of Multivariate Analysis, 38, pp. 306-318.

S. JI, W. ZHANG, R. LI (2013). A probabilistic latent semantic analysis model for coclustering the mouse brain atlas. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10, pp. 1460-1468.

R. L. KASHYAP (1974). Minimax estimation with divergence loss function. Information Sciences, 7, pp. 341-364.

Y. KONNO (1990). Families of minimax estimators of matrix of normal means with

Y. KONNO (1991). On estimation of a matrix of normal means with unknown covariance matrix. Journal of Multivariate Analysis, 36, pp. 44-55.

Y. KONNO (1992). Improved estimation of matrix of normal mean and eigenvalues in the multivariate F-distribution. Ph.D. thesis, Institute of Mathematics, University of Tsukuba.

E. L. LEHMANN, G. CASELLA (1998). Theory of Point Estimation. Springer Verlag, New York, 2nd ed.

J. MALKIN, X. LI, S. HARADA, J. LANDAY, J. BILMES (2011). The Vocal Joystick engine v1.0. Computer Speech and Language, 25, pp. 535-555.

J. PAUL, P. Y. THOMAS (2016). Sharma-Mittal entropy properties on record values. Statistica, 76, pp. 273-287.

C. P. ROBERT (2001). The Bayesian Choice. Springer Verlag, New York, 2nd ed.

G. SHIEH (1993). Empirical Bayes minimax estimators of matrix normal means for arbitrary quadratic loss and unknown covariance matrix. Statistics and Decisions, 11, pp. 317-341.

C. STEIN (1973). Estimation of the mean of a multivariate normal distribution. In J. HAJEK (ed.), Proceedings of the Prague Symposium on Asymptotic Statistics. Universita Karlova, Prague, pp. 345-381.

H. TSUKUMA (2008). Admissibility and minimaxity of Bayes estimators for a normal mean matrix. Journal of Multivariate Analysis, 99, pp. 2251-2264.

H. TSUKUMA (2010). Proper Bayes minimax estimators of the normal mean matrix with common unknown variances. Journal of Statistical Planning and Inference, 140, pp. 2596-2606.

L. YANG, B. GENG, A. HANJALIC, X. S. HUA (2012). A unified context model for web image retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications, 8, Article No. 28.

Z. ZHANG (1986a). On estimation of matrix of normal mean. Journal of Multivariate Analysis, 18, pp. 70-82.

Z. ZHANG (1986b). Selecting a minimax estimator doing well at a point. Journal of Multivariate Analysis, 19, pp. 14-23.

S. ZINODINY, S. REZAEI, S. NADARAJAH (2017). Bayes minimax estimation of the mean matrix of matrix-variate normal distribution under balanced loss function. Statistics and Probability Letters, 125, pp. 110-120.
SUMMARY
The problem of estimating the mean matrix of a matrix-variate normal distribution with a covariance matrix is considered under two loss functions. We construct a class of empirical Bayes estimators which are better than the maximum likelihood estimator under the first loss function and hence show that the maximum likelihood estimator is inadmissible. We find a general class of minimax estimators. Also, we give a class of estimators that improve on the maximum likelihood estimator under the second loss function and hence show that the maximum likelihood estimator is inadmissible.

Keywords: Empirical Bayes estimation; Matrix variate normal distribution; Mean matrix; Minimax estimation.