{"title": "Entropy Rate Estimation for Markov Chains with Large State Space", "book": "Advances in Neural Information Processing Systems", "page_first": 9781, "page_last": 9792, "abstract": "Entropy estimation is one of the prototypical problems in distribution property testing. To consistently estimate the Shannon entropy of a distribution on $S$ elements with independent samples, the optimal sample complexity scales sublinearly with $S$ as $\\Theta(\\frac{S}{\\log S})$ as shown by Valiant and Valiant \\cite{Valiant--Valiant2011}. Extending the theory and algorithms for entropy estimation to dependent data, this paper considers the problem of estimating the entropy rate of a stationary reversible Markov chain with $S$ states from a sample path of $n$ observations. We show that\n\\begin{itemize}\n\t\\item Provided the Markov chain mixes not too slowly, \\textit{i.e.}, the relaxation time is at most $O(\\frac{S}{\\ln^3 S})$, consistent estimation is achievable when $n \\gg \\frac{S^2}{\\log S}$.\n\t\\item Provided the Markov chain has some slight dependency, \\textit{i.e.}, the relaxation time is at least $1+\\Omega(\\frac{\\ln^2 S}{\\sqrt{S}})$, consistent estimation is impossible when $n \\lesssim \\frac{S^2}{\\log S}$.\n\\end{itemize}\nUnder both assumptions, the optimal estimation accuracy is shown to be $\\Theta(\\frac{S^2}{n \\log S})$. In comparison, the empirical entropy rate requires at least $\\Omega(S^2)$ samples to be consistent, even when the Markov chain is memoryless. 
In addition to synthetic experiments, we also apply the estimators that achieve the optimal sample complexity to estimate the entropy rate of the English language in the Penn Treebank and the Google One Billion Words corpora, which provides a natural benchmark for language modeling and relates it directly to the widely used perplexity measure.", "full_text": "Entropy Rate Estimation for Markov Chains with Large State Space

Yanjun Han
Department of Electrical Engineering
Stanford University
Stanford, CA 94305
yjhan@stanford.edu

Jiantao Jiao
Department of Electrical Engineering and Computer Sciences
University of California, Berkeley
Berkeley, CA 94720
jiantao@berkeley.edu

Chuan-Zheng Lee, Tsachy Weissman
Department of Electrical Engineering
Stanford University
Stanford, CA 94305
{czlee, tsachy}@stanford.edu

Yihong Wu
Department of Statistics and Data Science
Yale University
New Haven, CT 06511
yihong.wu@yale.edu

Tiancheng Yu
Department of Electronic Engineering
Tsinghua University
Haidian, Beijing 100084
thueeyutc14@foxmail.com

Abstract

Entropy estimation is one of the prototypical problems in distribution property testing. To consistently estimate the Shannon entropy of a distribution on S elements with independent samples, the optimal sample complexity scales sublinearly with S as Θ(S/log S), as shown by Valiant and Valiant [4]. Extending the theory and algorithms for entropy estimation to dependent data, this paper considers the problem of estimating the entropy rate of a stationary reversible Markov chain with S states from a sample path of n observations.
We show that

• Provided the Markov chain mixes not too slowly, i.e., the relaxation time is at most O(S/ln³ S), consistent estimation is achievable when n ≫ S²/log S.

• Provided the Markov chain has some slight dependency, i.e., the relaxation time is at least 1 + Ω(ln² S/√S), consistent estimation is impossible when n ≲ S²/log S.

Under both assumptions, the optimal estimation accuracy is shown to be Θ(S²/(n log S)). In comparison, the empirical entropy rate requires at least Ω(S²) samples to be consistent, even when the Markov chain is memoryless. In addition to synthetic experiments, we also apply the estimators that achieve the optimal sample complexity to estimate the entropy rate of the English language in the Penn Treebank and the Google One Billion Words corpora, which provides a natural benchmark for language modeling and relates it directly to the widely used perplexity measure.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

1 Introduction

Consider a stationary stochastic process {Xt}, t = 1, 2, . . ., where each Xt takes values in a finite alphabet X of size S. The Shannon entropy rate (or simply entropy rate) of this process is defined as [1]

H̄ = lim_{n→∞} (1/n) H(Xⁿ),   (1)

where

H(Xⁿ) = Σ_{xⁿ ∈ Xⁿ} P_{Xⁿ}(xⁿ) ln(1/P_{Xⁿ}(xⁿ))

is the Shannon entropy (or entropy) of the random vector Xⁿ = (X1, X2, . . . , Xn) and P_{Xⁿ}(xⁿ) = P(X1 = x1, . . . , Xn = xn) is the joint probability mass function. Since the entropy of a random variable depends only on its distribution, we also refer to the entropy H(P) of a discrete distribution P = (p1, p2, . . . , pS), defined as H(P) = Σ_{i=1}^S p_i ln(1/p_i).

The Shannon entropy rate is the fundamental limit of the expected logarithmic loss when predicting the next symbol, given all past symbols. It is also the fundamental limit of data compression for stationary stochastic processes, in terms of the average number of bits required to represent each symbol [1, 7]. Estimating the entropy rate of a stochastic process is a fundamental problem in information theory, statistics, and machine learning, and it has diverse applications; see, for example, [3, 2, 3, 3, 4, 2].
There exists extensive literature on entropy rate estimation. It is known from data compression theory that the normalized codelength of any universal code is a consistent estimator of the entropy rate as the number of samples approaches infinity. This observation has inspired a large variety of entropy rate estimators; see, e.g., [4, 2, 1, 6, 1]. However, most of this work has been in the asymptotic regime [3, 8]. Attention to non-asymptotic analysis is more recent and, to date, almost exclusively for i.i.d. data. There has been little work on the non-asymptotic performance of entropy rate estimators for dependent data, that is, where the alphabet size is large (making asymptotically large datasets infeasible) and the stochastic process has memory. An understanding of this large-alphabet regime is increasingly important in modern machine learning applications, in particular language modeling. There have been substantial recent advances in probabilistic language models, which are widely used in applications such as machine translation and search query completion. The entropy rate of (say) the English language represents a fundamental limit on the efficacy of a language model (measured by its perplexity), so it is of great interest to language model researchers to obtain an accurate estimate of the entropy rate as a benchmark.
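To make the definitions above concrete, here is a minimal Python sketch (the function name is ours, not from the paper) that computes H(P) in nats and checks two textbook facts: the uniform distribution on S elements attains ln S, and a point mass has zero entropy.

```python
import math

def shannon_entropy(p):
    """H(P) = sum_i p_i * ln(1/p_i), in nats; terms with p_i = 0 contribute 0."""
    return sum(-x * math.log(x) for x in p if x > 0)

S = 8
uniform = [1.0 / S] * S
# The uniform distribution maximizes entropy: H = ln S.
assert abs(shannon_entropy(uniform) - math.log(S)) < 1e-12
# A deterministic outcome carries no uncertainty.
assert shannon_entropy([1.0, 0.0, 0.0]) == 0.0
```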
However, since the alphabet size here is exceedingly large (Google's One Billion Words corpus includes about two million unique words),¹ it is unrealistic to assume large-sample asymptotics, especially when dealing with combinations of words (bigrams, trigrams, etc.). It is therefore of significant practical importance to investigate the optimal entropy rate estimator with limited sample size.

¹This exceeds the estimated vocabulary of the English language partly because different forms of a word count as different words in language models, and partly because of edge cases in tokenization, the automatic splitting of text into "words".

In the context of non-asymptotic analysis for i.i.d. samples, Paninski [3] first showed that the Shannon entropy can be consistently estimated with o(S) samples when the alphabet size S approaches infinity. The seminal work of [4] showed that when estimating the entropy of an i.i.d. source, n ≫ S/log S samples are necessary and sufficient for consistency. The entropy estimators proposed in [4] and refined in [4], based on linear programming, have not been shown to achieve the minimax estimation rate. Another estimator proposed by the same authors [4] has been shown to achieve the minimax rate in the restrictive regime S/ln S ≲ n ≲ S^{1.03}/ln S. Using the idea of best polynomial approximation, the independent work of [4] and [1] obtained estimators that achieve the minimax mean-square error Θ((S/(n log S))² + (log² S)/n) for entropy estimation. The intuition for the Θ(S/log S) sample complexity in the independent case can be interpreted as follows: as opposed to estimating the entire distribution, which has S − 1 parameters and requires Θ(S) samples, estimating the scalar functional (entropy) can be done with a logarithmic factor reduction of samples.
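The sample-size barrier for the naive plug-in approach is easy to observe numerically. The short sketch below (our own illustration, using the classical first-order bias term −(S − 1)/(2n) of the plug-in estimator rather than any machinery from this paper) shows that with n comparable to S, the empirical entropy substantially underestimates the truth:

```python
import math
import random

def plugin_entropy(samples):
    """Shannon entropy (nats) of the empirical distribution of the samples."""
    n = len(samples)
    counts = {}
    for x in samples:
        counts[x] = counts.get(x, 0) + 1
    return sum(-(c / n) * math.log(c / n) for c in counts.values())

random.seed(0)
S, n, trials = 100, 100, 200  # n comparable to S: plug-in is badly biased
avg = sum(plugin_entropy([random.randrange(S) for _ in range(n)])
          for _ in range(trials)) / trials
truth = math.log(S)
# The plug-in estimate always underestimates on average (Jensen), and here
# the gap is far larger than the first-order term (S - 1) / (2n) suggests.
assert avg < truth
assert truth - avg > 0.3
```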
For Markov chains, which are characterized by a transition matrix consisting of S(S − 1) free parameters, it is reasonable to expect a Θ(S²/log S) sample complexity. Indeed, we will show that this is correct provided the mixing is not too slow.
Estimating the entropy rate of a Markov chain falls in the general area of property testing and estimation with dependent data. The prior work [2] provided a non-asymptotic analysis of maximum-likelihood estimation of the entropy rate in Markov chains and showed that it is necessary to impose assumptions on the mixing time, for otherwise the entropy rate is impossible to estimate. There has been some progress on the related questions of estimating the mixing time from a sample path [1, 2], estimating the transition matrix [1], and testing symmetric Markov chains [1]. The current paper contributes to this growing field. In particular, the main results of this paper are highlighted as follows:

• We provide a tight analysis of the sample complexity of the empirical entropy rate for Markov chains when the mixing time is not too large. This refines results in [2] and shows that when mixing is not too slow, the sample complexity of the empirical entropy rate does not depend on the mixing time. Precisely, the bias of the empirical entropy rate vanishes uniformly over all Markov chains, regardless of mixing time and reversibility, as long as the number of samples grows faster than the number of parameters. It is the variance that may explode when the mixing time becomes gigantic.

• We obtain a characterization of the optimal sample complexity for estimating the entropy rate of a stationary reversible Markov chain in terms of the sample size, state space size, and mixing time, and partially resolve one of the open questions raised in [2].
In particular, we show that when the mixing is neither too fast nor too slow, the sample complexity (up to a constant) does not depend on the mixing time. In this regime, the performance of the optimal estimator with n samples is essentially that of the empirical entropy rate with n log n samples. As opposed to the lower bound for estimating the mixing time in [1], obtained by applying Le Cam's method to two Markov chains which are statistically indistinguishable, the minimax lower bound in the current paper is much more involved; in addition to a series of reductions by means of simulation, it relies on constructing two stationary reversible Markov chains with random transition matrices [4], so that the marginal distributions of the sample paths are statistically indistinguishable.

• We construct estimators that are efficiently computable and achieve the minimax sample complexity. The key step is to connect the entropy rate estimation problem to Shannon entropy estimation on large alphabets with i.i.d. samples. The analysis uses the idea of simulating Markov chains from independent samples by Billingsley [3] and concentration inequalities for Markov chains.

• We compare the empirical performance of various entropy rate estimators on a variety of synthetic data sets, and demonstrate the superior performance of the information-theoretically optimal estimators compared to the empirical entropy rate.

• We apply the information-theoretically optimal estimators to estimate the entropy rate of the Penn Treebank (PTB) and the Google One Billion Words (1BW) datasets. We show that, even with estimates using only up to 4-grams, there may exist language models that achieve better perplexity than the current state of the art.

The rest of the paper is organized as follows. After setting up preliminary definitions in Section 2, we summarize our main findings in Section 3, with proofs sketched in Section 4.
Section 5 provides empirical results on estimating the entropy rate of the Penn Treebank (PTB) and the Google One Billion Words (1BW) datasets. Detailed proofs and more experiments are deferred to the appendices.

2 Preliminaries

Consider a first-order Markov chain X0, X1, X2, . . . on a finite state space X = [S] with transition kernel T. We denote the entries of T by Tij, that is, Tij = P_{X2|X1}(j|i) for i, j ∈ X. Let Ti denote the ith row of T, which is the conditional law of X2 given X1 = i. Throughout the paper, we focus on first-order Markov chains, since any finite-order Markov chain can be converted to a first-order one by extending the state space [3].
We say that a Markov chain is stationary if the distribution of X1, denoted by π ≜ P_{X1}, satisfies πT = π. We say that a Markov chain is reversible if it satisfies the detailed balance equations, πi Tij = πj Tji for all i, j ∈ X. If a Markov chain is reversible, the (left) spectrum of its transition matrix T contains S real eigenvalues, which we denote by 1 = λ1 ≥ λ2 ≥ · · · ≥ λS ≥ −1. We define the spectral gap and the absolute spectral gap of T as γ(T) = 1 − λ2 and γ*(T) = 1 − max_{i≥2} |λi|, respectively, and the relaxation time of a reversible Markov chain as

τrel(T) = 1/γ*(T).   (2)

The relaxation time of a reversible Markov chain (approximately) captures its mixing time, which, roughly speaking, is the smallest n for which the marginal distribution of Xn is close to the Markov chain's stationary distribution. We refer to [3] for a survey.
We consider the following observation model. We observe a sample path of a stationary finite-state Markov chain X0, X1, . . .
, Xn, whose Shannon entropy rate H̄ in (1) reduces to

H̄ = Σ_{i=1}^S πi Σ_{j=1}^S Tij ln(1/Tij) = H(X1, X2) − H(X1),   (3)

where π is the stationary distribution of this Markov chain. Let M2(S) be the set of transition matrices of all stationary Markov chains on a state space of size S, and let M2,rev(S) be the set of transition matrices of all stationary reversible Markov chains on a state space of size S. We define the following class of stationary reversible Markov chains whose relaxation time is at most 1/γ*:

M2,rev(S, γ*) = {T ∈ M2,rev(S) : γ*(T) ≥ γ*}.   (4)

The goal is to characterize the sample complexity of entropy rate estimation as a function of S, γ*, and the estimation accuracy.
Note that the entropy rate of a first-order Markov chain can also be written as

H̄ = Σ_{i=1}^S πi H(X2|X1 = i).   (5)

Given a sample path X = (X0, X1, . . . , Xn), let π̂ denote the empirical distribution of states, and let X(i) = {Xj : Xj ∈ X, Xj−1 = i, j ∈ [n]} denote the subsequence of X containing the elements following any occurrence of the state i. A natural idea to estimate the entropy rate H̄ is to use π̂i to estimate πi and an appropriate Shannon entropy estimator to estimate H(X2|X1 = i). We define two estimators:

1. The empirical entropy rate: H̄emp = Σ_{i=1}^S π̂i Ĥemp(X(i)). Note that Ĥemp(Y) computes the Shannon entropy of the empirical distribution of its argument Y = (Y1, Y2, . . . , Ym).

2. Our entropy rate estimator: H̄opt = Σ_{i=1}^S π̂i Ĥopt(X(i)), where Ĥopt is any minimax rate-optimal Shannon entropy estimator designed for i.i.d.
data, such as those in [4, 4, 1].

3 Main results

Our first result provides performance guarantees for the empirical entropy rate H̄emp and our entropy rate estimator H̄opt:

Theorem 1. Suppose (X0, X1, . . . , Xn) is a sample path from a stationary reversible Markov chain with spectral gap γ. If S^{0.01} ≲ n ≲ S^{2.99} and 1/γ ≲ S³/(n ln n ln³ S), there exists some constant C > 0 independent of n, S, γ such that the entropy rate estimator H̄opt satisfies:² as S → ∞,

P( |H̄opt − H̄| ≤ C ( S/(ln n ln² S) ∧ S²/(n ln S) ) ) → 1.   (6)

Under the same conditions, there exists some constant C′ > 0 independent of n, S, γ such that the empirical entropy rate H̄emp satisfies: as S → ∞,

P( |H̄emp − H̄| ≤ C′ S²/n ) → 1.   (7)

²The asymptotic results in this section are interpreted by parameterizing n = n_S and γ = γ_S and letting S → ∞ subject to the conditions of each theorem.

Theorem 1 shows that when the sample size is not too large and the mixing is not too slow, it suffices to take n ≫ S²/ln S for the estimator H̄opt to achieve a vanishing error, and n ≫ S² for the empirical entropy rate. Theorem 1 improves over [2] in the analysis of the empirical entropy rate in the sense that, unlike the error term O(S²/(nγ)), our dominating term O(S²/n) does not depend on the mixing time. Note that we have made mixing time assumptions in the upper bound analysis of the empirical entropy rate in Theorem 1, which is natural since [2] showed that it is necessary to impose mixing time assumptions to provide meaningful statistical guarantees for entropy rate estimation in Markov chains.
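The two-stage plug-in construction of H̄emp from Section 2 can be sketched in a few lines of Python (helper names are ours; this is a minimal illustration, not the paper's implementation of H̄opt):

```python
import math
from collections import Counter, defaultdict

def empirical_entropy(samples):
    """Plug-in Shannon entropy (nats) of the empirical distribution."""
    n = len(samples)
    return sum(-(c / n) * math.log(c / n) for c in Counter(samples).values())

def empirical_entropy_rate(path):
    """H_emp = sum_i pihat_i * H_emp(X^(i)), where X^(i) collects the symbols
    immediately following each visit to state i, and pihat_i = n_i / n."""
    successors = defaultdict(list)
    for prev, cur in zip(path, path[1:]):
        successors[prev].append(cur)
    n = len(path) - 1  # number of observed transitions
    return sum((len(s) / n) * empirical_entropy(s)
               for s in successors.values())

# Sanity check: a deterministic cycle 0 -> 1 -> 0 -> 1 ... has entropy rate 0,
# since each conditional law H(X2 | X1 = i) is a point mass.
path = [0, 1] * 50
assert empirical_entropy_rate(path) == 0.0
```

To obtain H̄opt one would replace `empirical_entropy` with a minimax rate-optimal i.i.d. entropy estimator, exactly as in the definition of H̄opt above.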
The following result shows that mixing assumptions are only needed to control the variance of the empirical entropy rate: the bias of the empirical entropy rate vanishes uniformly over all Markov chains, regardless of reversibility and mixing time assumptions, as long as n ≫ S².

Theorem 2. Let n, S ≥ 1. Then

sup_{T ∈ M2(S)} |H̄ − E[H̄emp]| ≤ (2S²/n) ln(n/S² + 1) + (S² + 2) ln 2 / n.   (8)

Theorem 2 implies that if n ≫ S², the bias of the empirical entropy rate vanishes uniformly over all stationary Markov chains.
Now we turn to the lower bounds, which show that the scalings in Theorem 1 are in fact tight. The next result shows that the bias of the empirical entropy rate H̄emp is non-vanishing unless n ≫ S², even when the data are independent.

Theorem 3. If {X0, X1, . . . , Xn} are mutually independent and uniformly distributed, then

|H̄ − E[H̄emp]| ≥ ln( S²/(n + S − 1) ).   (9)

The following corollary is immediate.

Corollary 1. There exists a universal constant c > 0 such that when n ≤ cS², the absolute value of the bias of H̄emp is bounded away from zero, even if the Markov chain is memoryless.

The next theorem presents a minimax lower bound for entropy rate estimation which applies to any estimation scheme regardless of its computational cost. In particular, it shows that H̄opt is minimax rate-optimal under mild assumptions on the mixing time.

Theorem 4. For n ≥ S²/ln S, ln n ≪ S/(ln S)², and γ* ≤ 1 − C2 √(S ln³ S / n), we have

lim inf_{S→∞} inf_{Ĥ} sup_{T ∈ M2,rev(S, γ*)} P( |Ĥ − H̄| ≥ C1 S²/(n ln S) ) ≥ 1/2.   (10)

Here C1, C2 are the universal constants from Theorem 6.

The following corollary, which follows from Theorems 1 and 4, presents the critical scaling that determines whether consistent estimation of the entropy rate is possible.

Corollary 2. If ln³ S / S ≪ γ* ≤ 1 − C2 ln² S / √S, there exists an estimator Ĥ which estimates the entropy rate with a uniformly vanishing error over Markov chains M2,rev(S, γ*) if and only if n ≫ S²/ln S.

To conclude this section, we summarize our results in terms of the sample complexity for estimating the entropy rate within a few bits (ε = Θ(1)), classified according to the relaxation time:

• τrel = 1: this is the i.i.d. case, and the sample complexity is Θ(S/ln S);

• 1 < τrel < 1 + O(ln² S / √S): in this narrow regime the sample complexity is at most O(S²/ln S), and no matching lower bound is known;

• 1 + Ω(ln² S / √S) ≤ τrel ≲ S/ln³ S: the sample complexity is Θ(S²/ln S);

• τrel ≳ S/ln³ S: the sample complexity is Ω(S²/ln S), and no matching upper bound is known. In this case the chain mixes very slowly and it is likely that the variance will dominate.

4 Sketch of the proof

In this section we sketch the proofs of Theorems 1, 2 and 4, and defer the details to the appendix.

4.1 Proof of Theorem 1

A key step in the analysis of H̄emp and H̄opt is the idea of simulating a finite-state Markov chain from independent samples [3, p.
19]: consider an independent collection of random variables X0 and Win (i = 1, 2, . . . , S; n = 1, 2, . . .) such that P_{X0}(i) = πi and P_{Win}(j) = Tij. Imagine the variables Win set out in the following array:

W11 W12 . . . W1n . . .
W21 W22 . . . W2n . . .
 .    .        .
WS1 WS2 . . . WSn . . .

First, X0 is sampled. If X0 = i, then the first variable in the ith row of the array is sampled, and the result is assigned by definition to X1. If X1 = j, then the first variable in the jth row is sampled, unless j = i, in which case the second variable is sampled. In any case, the result of the sampling is by definition X2. The next variable sampled is the first one in row X2 which has not yet been sampled. This process thus continues. After collecting {X0, X1, . . . , Xn} from the model, we denote the last variable sampled from row i by W_{i,ni}. It can be shown that observing a Markov chain {X0, X1, . . . , Xn} is equivalent to observing {X0, {Wij : i ∈ [S], j ∈ [ni]}}, and consequently π̂i = ni/n and X(i) = (Wi1, . . . , W_{i,ni}).
The main reason to introduce the above framework is to analyze Ĥemp(X(i)) and Ĥopt(X(i)) as if the argument X(i) were an i.i.d. vector. Specifically, although Wi1, . . . , Wim conditioned on ni = m are not i.i.d., they are i.i.d. with common distribution Ti for any fixed m. Hence, using the fact that each ni concentrates around nπi (cf. Definition 2 and Lemma 4 for details), we may use the concentration properties of Ĥemp and Ĥopt (cf. Lemma 3) on i.i.d. data for each fixed m ≈ nπi and apply the union bound in the end.
Based on this alternative view, we have the following theorem, which implies Theorem 1.

Theorem 5. Suppose (X0, X1, . . . , Xn) comes from a stationary reversible Markov chain with spectral gap γ.
Then, with probability tending to one, the entropy rate estimators satisfy

|H̄opt − H̄| ≲ S²/(n ln S) + (S/n)^{0.495} + S ln S / n^{0.999} + S ln S ln n / (nγ) + √( S ln n ln² S / (nγ) ),   (11)

|H̄emp − H̄| ≲ S²/n + (S/n)^{0.495} + S ln S / n^{0.999} + S ln S ln n / (nγ) + √( S ln n ln² S / (nγ) ).   (12)

4.2 Proof of Theorem 2

By the concavity of entropy, the empirical entropy rate H̄emp underestimates the truth H̄ in expectation. On the other hand, the average codelength L̄ of any lossless source code is at least H̄ by Shannon's source coding theorem. As a result, H̄ − E[H̄emp] ≤ L̄ − E[H̄emp], and we may find a good lossless code to make the RHS small.
Since the conditional empirical distribution maximizes the likelihood for Markov chains (Lemma 13), for any Q we have

E_P[ (1/n) ln( 1 / Q_{X1ⁿ|X0}(X1ⁿ|X0) ) ] ≥ E_P[ (1/n) ln( 1 / P_{X1ⁿ|X0}(X1ⁿ|X0) ) ] = H̄,   (13)

E_P[ (1/n) ln( 1 / Q_{X1ⁿ|X0}(X1ⁿ|X0) ) ] ≥ E_P[ min_{P ∈ M2(S)} (1/n) ln( 1 / P_{X1ⁿ|X0}(X1ⁿ|X0) ) ] = E[H̄emp],   (14)

where M2(S) denotes the space of all first-order Markov chains with state space [S]. Hence,

|H̄ − E[H̄emp]| ≤ inf_Q sup_{P ∈ M2(S), x0ⁿ} (1/n) ln( P(x1ⁿ|x0) / Q(x1ⁿ|x0) ).   (15)

The following lemma provides a non-asymptotic upper bound on the RHS of (15) and completes the proof of Theorem 2.

Lemma 1.
[3] Let M2(S) denote the space of Markov chains with alphabet size S for each symbol. Then the worst-case minimax redundancy is bounded as

inf_Q sup_{P ∈ M2(S), x0ⁿ} (1/n) ln( P(x1ⁿ|x0) / Q(x1ⁿ|x0) ) ≤ (2S²/n) ln(n/S² + 1) + (S² + 2) ln 2 / n.   (16)

4.3 Proof of Theorem 4

To prove the lower bound for Markov chains, we first introduce an auxiliary model, namely the independent Poisson model, and show that the sample complexity of the Markov chain model is lower bounded by that of the independent Poisson model. Then we apply the so-called method of fuzzy hypotheses [4, Theorem 2.15] (see also [1, Lemma 11]) to prove a lower bound for the independent Poisson model. We introduce the independent Poisson model below, which is parametrized by an S × S symmetric matrix R, an integer n, and a parameter λ > 0.

Definition 1 (Independent Poisson model). Given an S × S symmetric matrix R = (Rij) with Rij ≥ 0 and a parameter λ > 0, under the independent Poisson model we observe X0 ∼ π = π(R) and an S × S matrix C = (Cij) with independent entries distributed as Cij ∼ Poi(λ Rij), where

πi = πi(R) = ri / r,   ri = Σ_{j=1}^S Rij,   r = Σ_{i=1}^S ri.   (17)

For each symmetric matrix R, by normalizing the rows we can define a transition matrix T = T(R) of a reversible Markov chain with stationary distribution π = π(R). Upon observing the Poisson matrix C, the functional to be estimated is the entropy rate H̄ of T(R). Given τ > 0 and γ, q ∈ (0, 1), define the following collection of symmetric matrices:

R(S, γ, τ, q) = { R ∈ R₊^{S×S} : R = R⊤, γ*(T) ≥ γ, Σ_{i,j} Rij ≥ τ, πmin ≥ q },   (18)

where πmin = mini πi.
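The row-normalization step in Definition 1 can be sketched directly; the toy instance below (variable names are ours) checks the two facts the reduction relies on: π(R) is stationary for T(R), and T(R) is reversible, since πi Tij = Rij / r is symmetric in (i, j) whenever R is symmetric.

```python
def chain_from_symmetric(R):
    """Given a symmetric matrix R with nonnegative entries, return (T, pi)
    with T_ij = R_ij / r_i and pi_i = r_i / r, as in Definition 1."""
    S = len(R)
    row_sums = [sum(row) for row in R]          # r_i
    r = sum(row_sums)                           # r
    T = [[R[i][j] / row_sums[i] for j in range(S)] for i in range(S)]
    pi = [ri / r for ri in row_sums]
    return T, pi

R = [[2.0, 1.0, 0.5],
     [1.0, 3.0, 1.5],
     [0.5, 1.5, 1.0]]
T, pi = chain_from_symmetric(R)

# Stationarity: (pi T)_j = pi_j for every j.
for j in range(3):
    assert abs(sum(pi[i] * T[i][j] for i in range(3)) - pi[j]) < 1e-12
# Detailed balance: pi_i T_ij = pi_j T_ji.
for i in range(3):
    for j in range(3):
        assert abs(pi[i] * T[i][j] - pi[j] * T[j][i]) < 1e-12
# In the observation model, one would then draw C_ij ~ Poisson(lambda * R_ij).
```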
The reduction to the independent Poisson model is summarized below:

Lemma 2. If there exists an estimator Ĥ1 for the Markov chain model with parameter n such that P(|Ĥ1 − H̄| ≥ ε) ≤ δ under any T ∈ M2,rev(S, γ), then there exists another estimator Ĥ2 for the independent Poisson model with parameter λ = 4n/τ such that

sup_{R ∈ R(S, γ, τ, q)} P( |Ĥ2 − H̄(T(R))| ≥ ε ) ≤ δ + 2S n^{−c3²/(4+10c3)} + S n^{−c3/2},   (19)

provided q ≥ c3 ln n / (nγ), where c3 ≥ 20 is a universal constant.

To prove the lower bound for the independent Poisson model, the goal is to construct two symmetric random matrices (whose distributions serve as the priors) such that: (a) they are sufficiently concentrated near the desired parameter space R(S, γ, τ, q) for properly chosen parameters γ, τ, q; (b) their entropy rates are separated; (c) the induced marginal laws of the sufficient statistic C = X0 ∪ {Cij + Cji : i ≠ j, 1 ≤ i ≤ j ≤ S} ∪ {Cii : 1 ≤ i ≤ S} are statistically indistinguishable. The prior construction in Definition 4 satisfies all three properties (cf. Lemmas 10, 11, 12), and thereby leads to the following theorem:

Theorem 6. If n ≥ S²/ln S, ln n ≪ S/(ln S)², and γ* ≤ 1 − C2 √(S ln³ S / n), we have

lim inf_{S→∞} inf_{Ĥ} sup_{R ∈ R(S, γ*, τ, q)} P( |Ĥ − H̄| ≥ C1 S²/(n ln S) ) ≥ 1/2,   (20)

where τ = S, q = 1/(5√(n ln S)), and C1, C2 > 0 are two universal constants.

5 Application: Fundamental limits of language modeling

In this section, we apply entropy rate estimators to estimate the fundamental limits of language modeling.
A language model specifies the joint probability distribution of a sequence of words, Q_{Xⁿ}(xⁿ). It is common to use a (k − 1)th-order Markov assumption to train these models, using sequences of k words (also known as k-grams,³ sometimes with Latin prefixes: unigrams, bigrams, etc.), with values of k of up to 5 [2]. A commonly used metric to measure the efficacy of a model Q_{Xⁿ} is the perplexity (whose logarithm is called the cross-entropy rate):

perplexity_Q(Xⁿ) = ( 1 / Q_{Xⁿ}(Xⁿ) )^{1/n}.

If a language is modeled as a stationary and ergodic stochastic process with entropy rate H̄, and Xⁿ is drawn from the language with true distribution P_{Xⁿ}, then [2]

H̄ ≤ lim inf_{n→∞} (1/n) log( 1 / Q_{Xⁿ}(Xⁿ) ) = lim inf_{n→∞} log[ perplexity_Q(Xⁿ) ],

with equality when Q = P. In this section, all logarithms are with respect to base 2 and all entropies are measured in bits.
The entropy rate of the English language is of significant interest to language model researchers: since 2^H̄ is a tight lower bound on perplexity, this quantity indicates how close a given language model is to the optimum. Several researchers have presented estimates in bits per character [3, 9, 5]; because language models are trained on words, these estimates are not directly relevant to the present task. In one of the earliest papers on this topic, Claude Shannon [3] gave an estimate of 11.82 bits per word. This latter figure has been comprehensively beaten by recent models; for example, [2] achieved a perplexity corresponding to a cross-entropy rate of 4.55 bits per word.
To produce an estimate of the entropy rate of English, we used two well-known linguistic corpora: the Penn Treebank (PTB) and Google's One Billion Words (1BW) benchmark. Results based on these corpora are particularly relevant because of their widespread use in training models.
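The identity perplexity = 2^(cross-entropy rate) used throughout this section can be checked numerically; the sketch below is a toy illustration with made-up per-word model probabilities (base-2 logarithms, as in this section):

```python
import math

def cross_entropy_rate(model_probs):
    """(1/n) * sum_t log2(1 / q_t) for per-word model probabilities q_t."""
    n = len(model_probs)
    return sum(-math.log2(q) for q in model_probs) / n

def perplexity(model_probs):
    """n-th root of 1 / Q(X^n); equals 2 ** cross_entropy_rate."""
    n = len(model_probs)
    prod = 1.0
    for q in model_probs:
        prod *= q
    return prod ** (-1.0 / n)

probs = [0.25, 0.5, 0.125, 0.25]  # hypothetical per-word probabilities
assert abs(perplexity(probs) - 2 ** cross_entropy_rate(probs)) < 1e-12
# A model assigning probability 1/8 to every word has perplexity 8.
assert abs(perplexity([0.125] * 10) - 8.0) < 1e-12
```

This is why a lower bound of H̄ bits per word on the cross-entropy rate translates directly into a lower bound of 2^H̄ on the achievable perplexity.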
We used the conditional approach proposed in this paper with the JVHW estimator described in Appendix D. The PTB corpus contains about n ≈ 1.2 million words, of which S ≈ 47,000 are unique. The 1BW corpus contains about n ≈ 740 million words, of which S ≈ 2.4 million are unique.
We estimate the conditional entropy H(Xk|X^{k−1}) for k = 1, 2, 3, 4, and our results are shown in Figure 1. The estimated conditional entropy Ĥ(Xk|X^{k−1}) provides a refined analysis of the intrinsic uncertainty in language prediction with a context length of only k − 1. For 4-grams, using the JVHW estimator on the 1BW corpus, our estimate is 3.46 bits per word. With current state-of-the-art models trained on the 1BW corpus having a cross-entropy rate of about 4.55 bits per word [2], this indicates that language models are still at least 0.89 bits per word away from the fundamental limit. (Note that since H(Xk|X^{k−1}) is decreasing in k, H(X4|X³) > H̄.) Similarly, for the much smaller PTB corpus, we estimate an entropy rate of 1.50 bits per word, compared to state-of-the-art models that achieve a cross-entropy rate of about 5.96 bits per word [4], at least 4.4 bits away from the fundamental limit.
More detailed analysis, e.g., of the accuracy of the JVHW estimates, is given in Appendix E.

³In the language modeling literature these are typically known as n-grams, but we use k to avoid conflict with the sample size.

[Figure 1 about here: estimated conditional entropy Ĥ(Xk|X^{k−1}) versus memory length k = 1, . . . , 4 for the PTB and 1BW corpora, shown against the cross-entropy rates of the best known models for PTB [4] and for 1BW [2].]

Figure 1: Estimates of conditional entropy based on linguistic corpora

References

[1] Jayadev Acharya, Hirakendu Das, Alon Orlitsky, and Ananda Theertha Suresh.
A unified maximum likelihood approach for estimating symmetric properties of discrete distributions. In International Conference on Machine Learning, pages 11–21, 2017.

[2] András Antos and Ioannis Kontoyiannis. Convergence properties of functional estimates for discrete distributions. Random Structures & Algorithms, 19(3-4):163–193, 2001.

[3] Patrick Billingsley. Statistical methods in Markov chains. The Annals of Mathematical Statistics, pages 12–40, 1961.

[4] Charles Bordenave, Pietro Caputo, and Djalil Chafai. Spectrum of large random reversible Markov chains: two examples. ALEA: Latin American Journal of Probability and Mathematical Statistics, 7:41–64, 2010.

[5] Peter F. Brown, Vincent J. Della Pietra, Robert L. Mercer, Stephen A. Della Pietra, and Jennifer C. Lai. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40, March 1992.

[6] Haixiao Cai, Sanjeev R. Kulkarni, and Sergio Verdú. Universal entropy estimation via block sorting. IEEE Transactions on Information Theory, 50(7):1551–1561, 2004.

[7] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[8] Gabriela Ciuperca and Valerie Girardin. On the estimation of the entropy rate of finite Markov chains. In Proceedings of the International Symposium on Applied Stochastic Models and Data Analysis, 2005.

[9] Thomas Cover and Roger King. A convergent gambling estimate of the entropy of English. IEEE Transactions on Information Theory, 24(4):413–421, 1978.

[10] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley, New York, second edition, 2006.

[11] Constantinos Daskalakis, Nishanth Dikkala, and Nick Gravin. Testing symmetric Markov chains from a single trajectory. arXiv preprint arXiv:1704.06850, 2017.

[12] Michelle Effros, Karthik Visweswariah, Sanjeev R. Kulkarni, and Sergio Verdú.
Universal lossless source coding with the Burrows–Wheeler transform. IEEE Transactions on Information Theory, 48(5):1061–1081, 2002.

[13] Moein Falahatgar, Alon Orlitsky, Venkatadheeraj Pichapati, and Ananda Theertha Suresh. Learning Markov distributions: Does estimation trump compression? In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 2689–2693. IEEE, 2016.

[14] Yanjun Han, Jiantao Jiao, Tsachy Weissman, and Yihong Wu. Optimal rates of entropy estimation over Lipschitz balls. arXiv preprint arXiv:1711.02141, November 2017.

[15] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[16] Daniel J. Hsu, Aryeh Kontorovich, and Csaba Szepesvári. Mixing time estimation in reversible Markov chains from a single sample path. In Advances in Neural Information Processing Systems, pages 1459–1467, 2015.

[17] Qian Jiang. Construction of transition matrices of reversible Markov chains. M.Sc. Major Paper, Department of Mathematics and Statistics, University of Windsor, 2009.

[18] Jiantao Jiao, Haim H. Permuter, Lei Zhao, Young-Han Kim, and Tsachy Weissman. Universal estimation of directed information. IEEE Transactions on Information Theory, 59(10):6220–6242, 2013.

[19] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5):2835–2885, 2015.

[20] Jiantao Jiao, Kartik Venkat, Yanjun Han, and Tsachy Weissman. Maximum likelihood estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 63(10):6774–6798, October 2017.

[21] Daniel Jurafsky and James H. Martin. Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2009.

[22] Sudeep Kamath and Sergio Verdú.
Estimation of entropy rate and Rényi entropy rate for Markov chains. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 685–689. IEEE, 2016.

[23] John C. Kieffer. Sample converses in source coding theory. IEEE Transactions on Information Theory, 37(2):263–268, 1991.

[24] Ioannis Kontoyiannis, Paul H. Algoet, Yu. M. Suhov, and A. J. Wyner. Nonparametric entropy estimation for stationary processes and random fields, with applications to English text. IEEE Transactions on Information Theory, 44(3):1319–1327, 1998.

[25] Coco Krumme, Alejandro Llorente, Manuel Cebrian, Alex Pentland, and Esteban Moro. The predictability of consumer visitation patterns. Scientific Reports, 3, 2013.

[26] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. CoRR, abs/1703.10722, 2017.

[27] J. Kevin Lanctot, Ming Li, and En-hui Yang. Estimating DNA sequence entropy. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, pages 409–418, 2000.

[28] David A. Levin and Yuval Peres. Estimating the spectral gap of a reversible Markov chain from a short trajectory. arXiv preprint arXiv:1612.05330, 2016.

[29] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

[30] Ravi Montenegro and Prasad Tetali. Mathematical aspects of mixing times in Markov chains. Foundations and Trends in Theoretical Computer Science, 1(3):237–354, 2006.

[31] Liam Paninski. Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253, 2003.

[32] Liam Paninski. Estimating entropy on m bins given fewer than m samples. IEEE Transactions on Information Theory, 50(9):2200–2203, 2004.

[33] Daniel Paulin.
Concentration inequalities for Markov chains by Marton couplings and spectral methods. Electronic Journal of Probability, 20, 2015.

[34] Claude E. Shannon. Prediction and entropy of printed English. The Bell System Technical Journal, 30(1):50–64, January 1951.

[35] Paul C. Shields. The Ergodic Theory of Discrete Sample Paths. Graduate Studies in Mathematics, American Mathematical Society, 1996.

[36] Chaoming Song, Zehui Qu, Nicholas Blumm, and Albert-László Barabási. Limits of predictability in human mobility. Science, 327(5968):1018–1021, 2010.

[37] Taro Takaguchi, Mitsuhiro Nakamura, Nobuo Sato, Kazuo Yano, and Naoki Masuda. Predictability of conversation partners. Physical Review X, 1(1):011008, 2011.

[38] Terence Tao. Topics in Random Matrix Theory, volume 132. American Mathematical Society, Providence, RI, 2012.

[39] Kedar Tatwawadi, Jiantao Jiao, and Tsachy Weissman. Minimax redundancy for Markov chains with large state space. arXiv preprint arXiv:1805.01355, 2018.

[40] Alexandre B. Tsybakov. Introduction to Nonparametric Estimation. Springer-Verlag, 2008.

[41] Gregory Valiant and Paul Valiant. Estimating the unseen: an n/log n-sample estimator for entropy and support size, shown optimal via new CLTs. In Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, pages 685–694. ACM, 2011.

[42] Gregory Valiant and Paul Valiant. The power of linear estimators. In Foundations of Computer Science (FOCS), 2011 IEEE 52nd Annual Symposium on, pages 403–412. IEEE, 2011.

[43] Paul Valiant and Gregory Valiant. Estimating the unseen: improved estimators for entropy and other properties. In Advances in Neural Information Processing Systems, pages 2157–2165, 2013.

[44] Chunyan Wang and Bernardo A. Huberman. How random are online social interactions? Scientific Reports, 2, 2012.

[45] Yihong Wu and Pengkun Yang.
Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Transactions on Information Theory, 62(6):3702–3720, 2016.

[46] Aaron D. Wyner and Jacob Ziv. Some asymptotic properties of the entropy of a stationary ergodic data source with applications to data compression. IEEE Transactions on Information Theory, 35(6):1250–1258, 1989.

[47] Jacob Ziv and Abraham Lempel. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory, 24(5):530–536, 1978.

[48] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016.