Minimum $L^q$-distance estimators for non-normalized parametric models

We propose and investigate a new estimation method for the parameters of models consisting of smooth density functions on the positive half axis. The procedure is based on a recently introduced characterization result for the respective probability distributions, and is to be classified as a minimum distance estimator, incorporating the $L^q$-norm as a distance function. Throughout, we deal rigorously with issues of existence and measurability of these implicitly defined estimators. Moreover, we provide consistency results in a common asymptotic setting, and compare our new method with classical estimators for the exponential, the Rayleigh, and the Burr Type XII distributions in Monte Carlo simulation studies. We also assess the performance of different estimators for non-normalized models in the context of an exponential-polynomial family.


Introduction
One of the most classical problems in statistics is the estimation of the parameter vector of a parametrized family of probability distributions. It presents itself in a significant share of applications because parametric models often constitute a reasonable compromise between flexibility in the shape of the statistical model and meaningfulness of the conclusions that can be drawn from it. As a consequence, all kinds of professions are confronted with the issue of parameter estimation, be it meteorologists, engineers, or biologists. Throughout the last decades, a vast number of highly focused estimation procedures for all kinds of situations has been provided, but the procedure that is arguably used most often remains the maximum likelihood estimator. Apart from its (asymptotic) optimality properties, its popularity is presumably in direct relation with its universality: For the professions mentioned above, and many more, whose prime interest is not the study of sophisticated statistical procedures, it is essential to have at hand a method that is both easily communicated and applicable to a wide range of model assumptions. A second class of methods incorporates the idea of using as an estimator the value that minimizes some goodness-of-fit measure. To implement this type of estimator, the empirical distribution, quantile, or characteristic function is compared to its theoretical counterpart from the underlying parametric model in a suitable distance, and the term is minimized over the parameter space, see Wolfowitz (1957), or Parr (1981) for an early bibliography. These procedures provide some freedom in adapting the estimation method to the intended inferences from the model, and they regularly possess good robustness properties [see Parr and Schucany (1980) as well as Millar (1981)]. An example which was discussed recently, and which goes by the name of minimum CRPS estimation, see Gneiting et al. (2005), is tailored to the practice of issuing forecasts: As argued there, a good probabilistic forecast minimizes a (strictly) proper scoring rule such as the continuous ranked probability score (CRPS), and after constructing a suitable model it appears somewhat more natural to use as an estimator the one that minimizes the scoring rule instead of a classical estimation method like maximum likelihood [for a comparison see Gebetsberger et al. (2018)]. As it happens, these rather universal procedures easily run into computational hardships. Just consider that even for 'basic' models, density functions can take complicated forms, and distribution or characteristic functions, or even normalization constants, may be nowhere near an explicit formula. This is where we tie in. In a recent work, Betsch and Ebner (2019a) established distributional characterizations that, from a practical point of view, are comparable to the characterization of a probability distribution through its distribution function. Their results, which are given in terms of the derivative of a density function and the density itself, provide explicit formulae that simplify the dependence of the terms on the parameters (even for rather complicated models), and extend characterizations via the zero-bias or equilibrium transformation [Goldstein and Reinert (1997), Peköz and Röllin (2011), respectively] that arise in the context of Stein's method, cf. Chen et al. (2011).
The aim of this work is to investigate these characterizations, which were already used to construct goodness-of-fit tests [see Betsch and Ebner (2020), Betsch and Ebner (2019b)], more closely in the context of parameter estimation. An advantage of the resulting estimators lies in the way the density function of the underlying model appears in the characterization, and thus also in the estimation method. When considering, for some (positive) density function $p$, the quotient $p'/p$, the term no longer depends on the integration constant which ensures that the function integrates to one, but only on the functional form of the density. As indicated before, our estimators depend on the underlying model precisely via this quotient, so they are applicable in cases where the normalization constant is unknown. Models of this type occur (though often in discrete settings) in such applied areas as image modeling [using Markov random fields, see Li (2009)] and machine learning, or in any other area where models are complex enough to render the calculation of the normalization constant impractical. For more specific discussions of such applications, we refer to the introduction of the work by Uehara et al. (2019a). The problem was already addressed by Hyvärinen (2005), who set out to find an estimation method which only takes into account the functional form of a density. The approach introduced there goes by the name of 'score matching', and the estimation method involves terms of the form $\frac{p''}{p} - \frac{1}{2}\big(\frac{p'}{p}\big)^2$ and hence does not depend on the normalization constant either. In the univariate case we discuss here, our method provides a good supplement, as it contains no second derivatives and may thus be applicable in cases where other methods fail. Several other approaches, by Pihlaja et al. (2010), Matsuda and Hyvärinen (2019), and Uehara et al. (2019b), are also available. Later on we also discuss noise-contrastive estimation, a concept introduced by Gutmann and Hyvärinen (2010). All these references indicate that statistical inference for non-normalized models is a topic of very recent investigation that also interests researchers in machine learning, a fact which we further allude to at the end of the following section.
In Section 2 we introduce this new class of parameter estimators, which are comparable, in their range of applicability in the given setting, to the maximum likelihood and minimum Cramér-von Mises distance estimators [as discussed by Parr and Schucany (1980) or Parr and De Wet (1981)]. We rigorously deal with the existence and measurability of our estimators in Section 3. In Section 4 we provide results on consistency. Thereafter, we give as (normalized) examples the exponential (Section 5), the Rayleigh (Section 7), and the Burr Type XII distribution (Section 8). For each of the three parametric models we compare our new method to classical methods like the maximum likelihood and minimum Cramér-von Mises distance estimators in competitive Monte Carlo simulation studies. The Burr distribution [cf. Burr (1942), Rodriguez (1977), Tadikamalla (1980), Section 6.2 of Kleiber and Kotz (2003), or Kumar (2017)] is relevant as a model in econometrics, initiated by Singh and Maddala (1976) [see also Schmittlein (1983)], and in other areas like engineering, hydrology, and quality assurance, see Shah and Gokhale (1993) for corresponding references. However, the parameter estimation is non-trivial and can even cause computational issues. Thus, providing a new estimation method could prove useful in applications. In Section 9 we discuss an exponential-polynomial model for which the normalization constant is intractable, and we compare the new estimators with the score matching and noise-contrastive estimation approaches.

The new estimators
To be specific, recall that the problem of parameter estimation for continuous, univariate probability distributions presents itself as follows. Consider, for $\Theta \subset \mathbb{R}^d$, a parametric family of probability density functions $\mathcal{P}_\Theta = \{ p_\vartheta \mid \vartheta \in \Theta \}$, and let $X_1, \dots, X_n$ be a sample consisting of independent real-valued random variables with a distribution from $\mathcal{P}_\Theta$, that is, there exists some $\vartheta_0 \in \Theta$ such that $X_i$ has density function $p_{\vartheta_0}$ ($X_i \sim p_{\vartheta_0}$, for short) for $i = 1, \dots, n$. Denote by $P_\vartheta$ the distribution function corresponding to $p_\vartheta$. The task is to construct an estimator of the unknown $\vartheta_0$ based on $X_1, \dots, X_n$.
For the construction of our new estimation method, we first recall, in a non-technical fashion, a famous distributional characterization that can be traced back to Charles Stein, see Chapter VI of Stein (1986). In the more elaborate version of Ley and Swan (2013) it establishes that, given a suitable probability density function $p$, the distribution of a real-valued random variable $X$ is given through the density function at hand if, and only if,
$$\mathbb{E}\left[ f'(X) + \frac{p'(X)}{p(X)}\, f(X) \right] = 0$$
for a large enough class of suitably chosen test functions $f$. Motivated by the well-known zero-bias distribution, Betsch and Ebner (2019a) used the above characterization in a recent publication to derive explicit identities which retain the essence of the characterizing property. Indeed, they were able to derive from the Stein characterization that, for a suitable density function $p$ on the positive half axis with few technical assumptions (which we adopt below), the distribution of a positive random variable $X$ (satisfying a weak integrability property) is given through $p$ if, and only if, the distribution function $F_X$ corresponding to $X$ satisfies
$$F_X(t) = \mathbb{E}\left[ -\frac{p'(X)}{p(X)}\, \min\{X, t\} \right], \quad t > 0. \qquad (1)$$
As we intend to use this result as a foundation for our estimation method in parametric models for non-negative quantities, assume that the support of each density function in $\mathcal{P}_\Theta$ is $(0, \infty)$. In particular, suppose that each $p_\vartheta$ is positive and continuously differentiable on $(0, \infty)$. Also assume that $\lim_{x \to \infty} p_\vartheta(x) = 0$. These presumptions were made by Betsch and Ebner (2019a) to derive the characterization given above, and they are straightforward to check for most common density functions. Particularly the last condition is exhaustively discussed in Proposition 3.7 of Döbler (2015). Let $X$ be a positive random variable with
$$\mathbb{E}\left| X\, \frac{p_\vartheta'(X)}{p_\vartheta(X)} \right| < \infty, \qquad (2)$$
and define the function
$$\eta(t, \vartheta) = \mathbb{E}\left[ \frac{p_\vartheta'(X)}{p_\vartheta(X)}\, \min\{X, t\} \right] + F_X(t)$$
for $(t, \vartheta) \in (0, \infty) \times \Theta$. Then, the characterization of Betsch and Ebner (2019a), as built up in Equation (1) and as given in their Corollary 3, states that $X$ has density function $p_\vartheta$ if, and only if, $\eta(t, \vartheta) = 0$ for every $t > 0$. Therefore, if we assume initially that $X \sim p_{\vartheta_0}$ [note that (2) is satisfied by the requirements on $p_\vartheta$], then $\|\eta(\cdot, \vartheta)\|_{L^q} = 0$ if, and only if, $\vartheta = \vartheta_0$.
Here, $L^q = L^q\big( (0, \infty), \mathcal{B}(0, \infty), w(t)\, \mathrm{d}t \big)$, $1 \le q < \infty$, denotes the usual $L^q$-space over $(0, \infty)$, where $w$ is a positive and integrable weight function, and
$$\|f\|_{L^q} = \left( \int_0^\infty |f(t)|^q\, w(t)\, \mathrm{d}t \right)^{1/q}, \qquad \langle f, g \rangle = \int_0^\infty f(t)\, g(t)\, w(t)\, \mathrm{d}t,$$
for $f \in L^q$, $g \in L^{q'}$ ($1/q + 1/q' = 1$), are the usual norm and duality in $L^q$. Thus, with the empirical version
$$\eta_n(t, \vartheta) = \frac{1}{n} \sum_{j=1}^{n} \left[ \frac{p_\vartheta'(X_j)}{p_\vartheta(X_j)}\, \min\{X_j, t\} + \mathbb{1}\{X_j \le t\} \right] \qquad (3)$$
of $\eta$, based on a sample of independent and identically distributed (i.i.d.) random variables $X_1, \dots, X_n$ with $X_1 \sim p_{\vartheta_0}$, a reasonable estimator for the unknown $\vartheta_0$ is
$$\widehat{\vartheta}_{n,q} = \arg\min\big\{ \|\eta_n(\cdot, \vartheta)\|_{L^q} \,\big|\, \vartheta \in \Theta \big\}, \qquad (4)$$
that is, we choose $\widehat{\vartheta}_{n,q}$ such that $\|\eta_n(\cdot, \widehat{\vartheta}_{n,q})\|_{L^q} \le \|\eta_n(\cdot, \vartheta)\|_{L^q}$ for each $\vartheta \in \Theta$. Heuristically, $\|\eta_n(\cdot, \vartheta)\|_{L^q}$ approximates $\|\eta(\cdot, \vartheta)\|_{L^q}$, so $\widehat{\vartheta}_{n,q}$ should provide an estimate for the minimum of $\vartheta \mapsto \|\eta(\cdot, \vartheta)\|_{L^q}$, which coincides with $\vartheta_0$, the (unique) zero of this function. At this point, of course, questions of existence and measurability of such an estimator arise, and we handle these questions in full detail in Section 3. Intuitively, one might argue to replace $F_X$ and the empirical distribution function in the definition of $\eta$ and $\eta_n$, respectively, with the theoretical distribution function $P_\vartheta$. However, there is a bit of a technical point involved, and the characterizations by Betsch and Ebner (2019a) do not include results that give a rigorous handle on this slightly (yet decisively) different situation. There are, however, similar characterizations for univariate distributions with supports other than the positive half axis. We allude to that setting in Section 10. Note that the availability of the term $p_\vartheta'/p_\vartheta$ for the model in consideration is rather essential. If this term is not amenable explicitly, it might still be calculable using numerical differentiation (so that $\eta_n$ and the estimator could be computed numerically), but it would be hard to theoretically justify the validity of the conditions on $p_\vartheta$. In our experience, however, the term $p_\vartheta'/p_\vartheta$ is readily available whenever $p_\vartheta$ can be differentiated explicitly, and this seems a manageable assumption.
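To make the construction concrete, the following minimal Python sketch implements $\eta_n$ from (3) and the minimization of $\|\eta_n(\cdot, \vartheta)\|_{L^2}$ by numerical quadrature. It is an illustration only: the helper names ('psi_sq', 'lq_estimate') and the generic 'score' argument for $p_\vartheta'/p_\vartheta$ are our own choices, and the empirical version (3) is assumed as reconstructed above.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize

# Minimal sketch of the minimum L^q-distance estimator for q = 2, assuming the
# empirical version eta_n in (3) as reconstructed above. The argument 'score'
# is the user-supplied map (x, theta) -> p'_theta(x) / p_theta(x); note that it
# never involves the normalization constant of p_theta.

def psi_sq(theta, x, score, a=1.0):
    """Squared weighted L^2-norm of eta_n(., theta) with weight w(t) = exp(-a*t)."""
    def eta_n(t):
        return np.mean(score(x, theta) * np.minimum(x, t) + (x <= t))
    val, _ = quad(lambda t: eta_n(t) ** 2 * np.exp(-a * t), 0.0, np.inf, limit=200)
    return val

def lq_estimate(x, score, theta_init, bounds, a=1.0):
    """Numerical minimizer of psi_{n,2}(theta)^2 over the parameter space."""
    res = minimize(psi_sq, x0=np.asarray(theta_init, dtype=float),
                   args=(x, score, a), method="L-BFGS-B", bounds=bounds)
    return res.x
```

For models admitting closed formulae (as in the examples below), this numerical route is of course not needed.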
As we have outlined above, our new estimators are eventually based on Stein characterizations which rely on some suitable class of test functions [for an overview in the univariate case, and a record of the vast literature on these identities, see Ley et al. (2017b)]. The goal of Betsch and Ebner (2019a) was to derive from these Stein identities new characterizations that no longer involve the classes of test functions. While this approach apparently leads to feasible applications in statistics, other methods are based directly on the Stein characterizations, using Stein discrepancies, which gradually become popular in machine learning. The idea in the context of parameter estimation, in heuristic terms, boils down to choosing as a parameter estimator the value which (approximately) minimizes
$$\sup_{f} \left| \mathbb{E}\left[ f'(X) + \frac{p_\vartheta'(X)}{p_\vartheta(X)}\, f(X) \right] \right|,$$
where the supremum is over all test functions in consideration. By the Stein characterization detailed above, the expectation is $0$ for every test function precisely when $\vartheta = \vartheta_0$, as we assume that $X \sim p_{\vartheta_0}$. However, it is not clear how to calculate the supremum in practice, given that the class of test functions is very large. The theory developed around Stein discrepancies has produced different formal methods to evaluate such terms. Other than the fact that they are based on the Stein characterization, the identities derived by Betsch and Ebner (2019a) are not related to the framework of Stein discrepancies, and so it is surprising that merely measuring the difference between the functions in (1) in an $L^q$-norm, which is what we do to construct our estimators, leads back to so-called feature Stein discrepancies. Indeed, upon defining the 'feature' function $\Phi(x, t) = \min\{x, t\}$, $x, t > 0$, and considering the Langevin-Stein operator
$$(\mathcal{T}_\vartheta f)(x) = f'(x) + \frac{p_\vartheta'(x)}{p_\vartheta(x)}\, f(x),$$
as applied to suitable functions $f : (0, \infty)^2 \to \mathbb{R}$, we obtain
$$\mathbb{E}\big[ \big( \mathcal{T}_\vartheta\, \Phi(\cdot, t) \big)(X) \big] = \eta(t, \vartheta),$$
which is the right-hand side of Equation (1) in the paper by Huggins and Mackey (2018). So, by retracing their calculation, $\|\eta_n(\cdot, \vartheta)\|_{L^q}$ coincides with the feature Stein discrepancy based on the class $\mathcal{G}_{\Phi,q}$ of test functions as defined by Huggins and Mackey (2018) (the precise form of which is inessential at this point). This means that we can embed our setting into the framework of these feature Stein discrepancies, as the construction of our estimator culminating in (4) corresponds to minimizing the quantity at the beginning of this paragraph which sought to motivate these discrepancies. Now, of course, the starting point of our estimation method being the characterizations by Betsch and Ebner (2019a), we already had our estimator at hand explicitly and could choose the feature function accordingly. Still, the fact that both the characterization of Betsch and Ebner (2019a) and the (feature) Stein discrepancy approach (for the above feature function), when translated into an estimation method, lead to the same procedure is remarkable and makes it worthwhile to study the method further, especially as it can be rather hard to find explicit examples for which the Stein discrepancy approach is feasible in practice.
To complete this insightful tour into the realm of Stein discrepancies, we mention some contributions which solve various statistical problems based on those discrepancies. In particular, Chwialkowski et al. (2016), Liu et al. (2016), and Yang et al. (2018) construct tests of fit, Gorham and Mackey (2015) measure sample quality, and Barp et al. (2019) develop estimation methods for non-normalized statistical models.

Existence and measurability
We discuss the measurability properties of η n and derive an existence result for a measurable version of (approximate) estimators of the type in (4). The result that is central to us in this section can be found in Chapter III of Castaing and Valadier (1977) [see the references therein and Chapter 8 by Cohn (2013) for further background]. Before we summarize these results, recall that a Suslin space is a Hausdorff topological space which is the image of a separable, completely metrizable topological space under a continuous map [for an overview, consult Chapter II of Schwartz (1973)]. See also Remark A.1 in Appendix A for more information.
Theorem 3.1. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a complete probability space and $(S, \mathcal{O}_S)$ a Suslin topological space with Borel-$\sigma$-field $\mathcal{B}(S)$. Assume that $\Gamma$ maps $\Omega$ into the non-empty subsets of $S$, and that the graph $\{ (\omega, s) \in \Omega \times S \mid s \in \Gamma(\omega) \}$ belongs to $\mathcal{F} \otimes \mathcal{B}(S)$. Then, there exists an $\mathcal{F}, \mathcal{B}(S)$-measurable map $\vartheta : \Omega \to S$ such that $\vartheta(\omega) \in \Gamma(\omega)$ for every $\omega \in \Omega$. Additionally, for every $\mathcal{F} \otimes \mathcal{B}(S), \overline{\mathcal{B}}$-measurable function $g : \Omega \times S \to \overline{\mathbb{R}}$, the map $\omega \mapsto \inf_{s \in \Gamma(\omega)} g(\omega, s)$ is $\mathcal{F}, \overline{\mathcal{B}}$-measurable. Here, $(\overline{\mathbb{R}}, \overline{\mathcal{B}})$ denotes the extended real line with its usual $\sigma$-field, and we write $\otimes$ for the product of $\sigma$-fields.
To apply Theorem 3.1, we first have to investigate the measurability properties of $\eta_n$. In the setting of Section 2, assume the following regularity condition:

(R1) For every $x > 0$, the map $\Theta \ni \vartheta \mapsto p_\vartheta'(x) / p_\vartheta(x)$ is continuous.
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a complete probability space, which is assumed to underlie all random quantities of the previous and subsequent sections. Notice that the function $\eta_n$ defined in (3) depends on the random variables $X_1, \dots, X_n$ defined on $\Omega$; hence $\eta_n$ (as a random quantity) can be understood as a map $\Omega \times (0, \infty) \times \Theta \to \mathbb{R}$. Exploiting the structure of $\eta_n$, we obtain the following lemma. The proof is simple, and the basic thoughts can be found in Appendix A.
Whenever we refer to an estimator that satisfies (4), we mean precisely such an (approximate) measurable version. This settles the existence problem and for our asymptotic studies we have measurability of ϑ n,q at hand.

Consistency
In this section, we investigate the asymptotic behavior of our estimators. Unfortunately, we cannot apply the general results for minimum distance estimators given by Millar (1984), since a major assumption in that work is that the term inside the norm is differentiable (with respect to $\vartheta$) with a derivative not depending on $\omega$; that is, in a sense, the parameter and the 'uncertainty' have to be separated, which is clearly not the case in our setting. Thus, we need to deal with the empirical process involved.
Assume the setting from Section 2. For brevity, we keep the notation $\psi_{n,q}(\vartheta) = \psi_{n,q}(\omega, \vartheta) = \|\eta_n(\omega, \cdot, \vartheta)\|_{L^q}$ and set $\psi_q(\vartheta) = \|\eta(\cdot, \vartheta)\|_{L^q}$. Recall from the construction that $\widehat{\vartheta}_{n,q}$ (approximately) minimizes $\psi_{n,q}$ [see (6)], and that $\vartheta_0$ is the unique minimum of $\psi_q$. The heuristic of the consistency statement proven in this section is as follows: If $\psi_{n,q}$ converges to $\psi_q$ in a suitable function space, then the random minimal points $\widehat{\vartheta}_{n,q}$ converge to $\vartheta_0$. In order to establish convergence of $\psi_{n,q}$, we need the functions to be sufficiently smooth in $\vartheta$. In most applications the mapping $\vartheta \mapsto p_\vartheta(x)$ will be continuously differentiable for every $x > 0$, which can often be used to verify the following Hölder-type regularity condition (R3).
Now, let $K \ne \emptyset$ be an arbitrary compact subset of $\Theta$. Then, on $\Omega$ and for $\vartheta^{(1)}, \vartheta^{(2)} \in K$, we have the Hölder-type bound
$$\big| \psi_{n,q}\big(\omega, \vartheta^{(1)}\big) - \psi_{n,q}\big(\omega, \vartheta^{(2)}\big) \big| \le H\, \big\| \vartheta^{(1)} - \vartheta^{(2)} \big\|^{\alpha},$$
with $H$ and $\alpha$ as in (R3). In particular, $K \ni \vartheta \mapsto \psi_{n,q}(\omega, \vartheta)$ is continuous for every $\omega \in \Omega$, and, by Lemma 3.2, it constitutes a product measurable map. This already implies that $\vartheta \mapsto \psi_{n,q}(\vartheta)$ is a random element of $C(K)^+$ [see Lemma 3.1 of Kallenberg (2002)], the space of continuous functions from $K$ to $[0, \infty)$, which is a complete, separable metric space (endowed with the usual metric that induces the uniform topology). From (R3) it also follows that $K \ni \vartheta \mapsto \psi_q(\vartheta)$ is an element of $C(K)^+$. We can now state the convergence results for $\psi_{n,q}$ that are essential for our consistency proof.
The proof of this lemma is rather technical and deferred to Appendix B. Note that the term $\inf_{\vartheta \in F} \psi_{n,q}(\vartheta)$ is a random variable by Theorem 3.1 (cf. the measurability of $m_{n,q}$ in the previous section). The following theorem uses the above lemma to establish consistency. In the second statement, we assume that the parameter space $\Theta$ is compact, thus rendering Lemma 4.1 applicable on the whole of $\Theta$, which will turn out to be essential for proving strong consistency. For most practical purposes this is sufficient, since parameters relevant for modeling in applications can usually be taken to stem from some (huge) compact set. Note that with this compactness assumption we actually do not need the $\varepsilon_n$-term in (6), since $\psi_{n,q}$ is lower semi-continuous by (R1) and Fatou's lemma, and thus attains its minimum in $\Theta$. The first statement of the following theorem shows that if the sequence $\widehat{\vartheta}_{n,q}$ is already known to be tight, no compactness assumption is needed, but we can only expect weak consistency in general, denoting by '$\stackrel{\mathbb{P}}{\longrightarrow}$' convergence in probability. After the proof of the theorem, we provide insight into the cases in which this is possible (Remark 4.3).
Proof. In the proof of (i) we follow Theorem 3.2.2 of van der Vaart and Wellner (2000), but we adapt the reasoning to our setting, using the measurability properties we established in Section 3, and Lemma 4.1. For completeness, as well as to prepare the proof of the second result, we give a full proof. We start with a preliminary observation, establishing that the minimum at $\vartheta_0$ is (locally) well separated: If $K$ is a compact subset of $\Theta$, and $O$ an open subset of $\mathbb{R}^d$ which contains $\vartheta_0$, then
$$\inf_{\vartheta \in K \setminus O} \psi_q(\vartheta) > 0. \qquad (7)$$
Indeed, if this is not the case, we find a sequence $(\vartheta_k)_k \subset K \setminus O$ with $\psi_q(\vartheta_k) \to 0$; by compactness, a subsequence converges to some $\vartheta^* \in K \setminus O$, and the continuity of $\psi_q$ yields $\psi_q(\vartheta^*) = 0$, a contradiction to the uniqueness of the zero $\vartheta_0$. Now fix $\varepsilon, \delta > 0$, choose by tightness a compact set $K \subset \Theta$ with $\limsup_{n \to \infty} \mathbb{P}\big( \widehat{\vartheta}_{n,q} \notin K \big) \le \delta$, and put $F = K \setminus B_\varepsilon(\vartheta_0)$, where $B_\varepsilon(\vartheta_0)$ denotes the open ball in $\mathbb{R}^d$ of radius $\varepsilon$ around $\vartheta_0$. Applying Lemma 4.1 and (7) to $K$ and $F$, together with (6) and the Portmanteau theorem [cf. Theorem 2.1 of Billingsley (1968)], we conclude that $\limsup_{n \to \infty} \mathbb{P}\big( \widehat{\vartheta}_{n,q} \notin B_\varepsilon(\vartheta_0) \big) \le \delta$; note that if $F = \emptyset$, the inequality holds trivially. Since both $\varepsilon$ and $\delta$ were arbitrary, the claim follows. For this first part of the proof, we only needed the convergences provided by Lemma 4.1 to be valid in probability. For the following proof of (ii), we rely on the stronger, almost sure result. The arguments we use are scattered over Section 3 of the work by Sahler (1970). For the reasons alluded to in Remark A.1, and since that work contains some typos, we provide the adapted arguments. Let $\varepsilon > 0$ and define $\beta_\varepsilon = \inf_{\vartheta \in \Theta \setminus B_\varepsilon(\vartheta_0)} \psi_q(\vartheta)$. By (7), we have $\beta_\varepsilon > 0$. Using the well-known equivalent criterion for almost sure convergence, Lemma 4.1 gives
$$\sup_{\vartheta \in \Theta} \big| \psi_{n,q}(\vartheta) - \psi_q(\vartheta) \big| \longrightarrow 0 \quad \mathbb{P}\text{-a.s., as } n \to \infty.$$
By definition of $\beta_\varepsilon$, this implies that, $\mathbb{P}$-almost surely and for all sufficiently large $n$,
$$\inf_{\vartheta \in \Theta \setminus B_\varepsilon(\vartheta_0)} \psi_{n,q}(\vartheta) > \frac{\beta_\varepsilon}{2}.$$
Moreover, $\psi_{n,q}(\vartheta_0) + \varepsilon_n \longrightarrow \psi_q(\vartheta_0) = 0$ $\mathbb{P}$-a.s., as $n \to \infty$, and thus, eventually,
$$\psi_{n,q}\big( \widehat{\vartheta}_{n,q} \big) \le \psi_{n,q}(\vartheta_0) + \varepsilon_n \le \frac{\beta_\varepsilon}{2}.$$
Putting everything together, $\widehat{\vartheta}_{n,q} \in B_\varepsilon(\vartheta_0)$ $\mathbb{P}$-almost surely for all sufficiently large $n$, that is, $\widehat{\vartheta}_{n,q} \longrightarrow \vartheta_0$ $\mathbb{P}$-a.s., as $n \to \infty$.

Remark 4.3. [A priori tightness of the sequence of estimators]
We provide a tool for proving tightness of the estimators before having established consistency, which we can use in Theorem 4.2 to get consistency even for unbounded parameter spaces. The statement essentially yields that if $\psi_{n,q}$ is strictly convex, then $\big( \widehat{\vartheta}_{n,q} \big)_{n \in \mathbb{N}}$ is tight. More precisely, suppose that conditions (R1)-(R3) hold. Let $\Theta$ be convex with $\vartheta_0 \in \Theta^\circ$, the interior of $\Theta$. Further, let $\psi_{n,q}$ be strictly convex (almost surely). Then the sequence of estimators $\widehat{\vartheta}_{n,q}$ is tight in $\Theta$. The proof is straightforward, and some hints are given in exercise problem 4 in Section 3.2 of van der Vaart and Wellner (2000) (for more details, find the proof in Appendix B).
Example: The exponential distribution

Let $\Theta = (0, \infty)$ and consider the family of exponential density functions $p_\vartheta(x) = \vartheta\, e^{-\vartheta x}$, $x > 0$, $\vartheta \in \Theta$, for which $p_\vartheta'(x) / p_\vartheta(x) = -\vartheta$. This trivially is an admissible class of density functions. Moreover, let $\vartheta_0 \in \Theta$, $X \sim p_{\vartheta_0}$, and take a sample $X_1, \dots, X_n$ of i.i.d. copies of $X$. An easy calculation gives
$$\eta(t, \vartheta) = \Big( 1 - \frac{\vartheta}{\vartheta_0} \Big) \big( 1 - e^{-\vartheta_0 t} \big), \quad t > 0,$$
which nicely illustrates that $\vartheta_0$ is indeed the unique zero of this function. For the particular choice of weight $w(t) = \exp(-at)$, $t > 0$, with some tuning parameter $a > 0$, and in the case $q = 2$, straightforward calculations show that $\psi_{n,2}(\vartheta)^2$ is a quadratic polynomial in $\vartheta$,
$$\psi_{n,2}(\vartheta)^2 = \Psi_n^{(1)}\, \vartheta^2 + \Psi_n^{(2)}\, \vartheta + \Psi_n^{(3)},$$
with coefficients depending on $a$ and on the ordered sample $X_{(1)} < \dots < X_{(n)}$. Using that $e^{-aX_{(k)}} < e^{-aX_{(j)}}$ $\mathbb{P}$-a.s. for $j < k$, and since $1 + aX_{(j)} < e^{aX_{(j)}}$ $\mathbb{P}$-a.s., we have $\Psi_n^{(1)} > 0$ almost surely. Therefore, $\psi_{n,2}^2$ is strictly convex (almost surely) and has a unique minimum. By Remark 4.3 and Theorem 4.2 (i), the estimator is consistent for $\vartheta_0$ (over the whole of $\Theta$). Note that we have not made the dependence of $\Psi_n^{(1)}$, $\Psi_n^{(2)}$, and $\Psi_n^{(3)}$ on $a$ explicit, to prevent overloading the notation. With a similar argument as above, we may show that $\Psi_n^{(2)} < 0$ almost surely, and thus we can calculate the estimator explicitly as $\widehat{\vartheta}_{n,2}^{(a)} = -\Psi_n^{(2)} / \big( 2\, \Psi_n^{(1)} \big)$.

To provide insight on the performance of this estimator, we compare it with the maximum likelihood estimator and the minimizer of the mean squared error (for $n \ge 3$), which are given as
$$\widehat{\vartheta}_n^{ML} = \frac{n}{\sum_{j=1}^{n} X_j} \qquad \text{and} \qquad \widehat{\vartheta}_n^{MSE} = \frac{n - 2}{\sum_{j=1}^{n} X_j},$$
respectively, as well as with the minimum Cramér-von Mises distance estimator discussed in the introduction, namely
$$\widehat{\vartheta}_n^{CM} = \arg\min_{\vartheta \in \Theta} \int_0^\infty \big( F_n(x) - P_\vartheta(x) \big)^2 \, \mathrm{d}P_\vartheta(x),$$
where $P_\vartheta(x) = 1 - \exp(-\vartheta x)$, $x > 0$, denotes the distribution function of the exponential distribution, and where $F_n$ is the empirical distribution function of $X_1, \dots, X_n$. For this comparison we simulate (for fixed values of $n$ and $\vartheta_0$) $D = 100{,}000$ samples of size $n$ from an exponential distribution with parameter $\vartheta_0$, calculate the values of the estimator for each sample, yielding values $\widehat{\vartheta}_1, \dots, \widehat{\vartheta}_D$, and approximate the bias and mean squared error (MSE) via
$$\widehat{\mathrm{bias}} = \frac{1}{D} \sum_{\ell=1}^{D} \widehat{\vartheta}_\ell - \vartheta_0 \qquad \text{and} \qquad \widehat{\mathrm{MSE}} = \frac{1}{D} \sum_{\ell=1}^{D} \big( \widehat{\vartheta}_\ell - \vartheta_0 \big)^2$$
for each of the above estimators. We perform all simulations with Python 3.7.2 (as provided by the Python Software Foundation, https://www.python.org, accessed 28 August 2019). For the minimization required to calculate the minimum Cramér-von Mises distance estimator, we choose as initial value the maximum likelihood estimator and use a sequential least squares programming method ('SLSQP') [cf. Kraft (1988)] implemented in the 'optimize.minimize' function of the Python module 'scipy', see Jones et al. (2001). Tables 1 and 2 below contain the results for the bias and MSE values.

Table 1: Approximated biases calculated with 100,000 exponentially distributed Monte Carlo samples.
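For illustration, a condensed Monte Carlo comparison along these lines might look as follows; this is a sketch, not the code used for the tables, and it reuses the hypothetical 'lq_estimate' from the sketch in Section 2 together with the closed-form competitors stated above.

```python
import numpy as np

# Sketch of the Monte Carlo comparison for the exponential model, reusing the
# hypothetical lq_estimate from the sketch in Section 2. For
# p_theta(x) = theta * exp(-theta * x) the score is p'/p = -theta. D is kept
# small here; the study in the text uses D = 100,000 replications.

def bias_mse(estimate, theta0, n, D=1000, seed=0):
    rng = np.random.default_rng(seed)
    est = np.array([estimate(rng.exponential(1.0 / theta0, size=n)) for _ in range(D)])
    return est.mean() - theta0, np.mean((est - theta0) ** 2)

score = lambda x, th: -th[0] * np.ones_like(x)
new = lambda x: lq_estimate(x, score, theta_init=[1.0], bounds=[(1e-6, None)], a=0.25)[0]
mle = lambda x: x.size / x.sum()           # maximum likelihood: n / sum(X_j)
mmse = lambda x: (x.size - 2) / x.sum()    # minimum-MSE estimator: (n-2) / sum(X_j)

for name, est in [("new (a=0.25)", new), ("ML", mle), ("min-MSE", mmse)]:
    b, m = bias_mse(est, theta0=1.0, n=50)
    print(f"{name:>12s}: bias {b:+.4f}, MSE {m:.4f}")
```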
As for the biases, the maximum likelihood estimator and the minimum MSE estimator perform almost identically in terms of the absolute bias, and the minimum Cramér-von Mises distance estimator has a slight edge. Our new estimator outperforms all other methods (virtually) uniformly. More precisely, it seems as if for larger tuning parameters $a$ the bias decreases. We will show, however, that this observation is not correct in that generality. The results for the mean squared error reveal that the minimum MSE estimator is the best method with respect to this measure of quality, which is no surprise, as it is constructed to minimize the MSE. For sample size $n = 10$ the superiority is particularly obvious, but for larger samples, the maximum likelihood estimator is only slightly worse. Our new estimator shows almost identical results (for $a = 0.25$) as the maximum likelihood estimator, underscoring that the method is sound and powerful. In contrast to the observation for the bias values, the MSE appears to increase with $a$. This nicely illustrates the variance-bias trade-off commonly observed in the context of estimation problems.

The case $a \to \infty$

As discussed previously, the simulation results for the exponential distribution somewhat indicate that as the tuning parameter $a$ grows, the bias decreases while the MSE increases. Interestingly, we can put observations for $a \to \infty$ on a rigorous theoretical basis. To be precise, observe the following general result.
Theorem 6.1. Consider the setting from Section 2 with weight function $w(t) = e^{-at}$, $a > 0$. For the quantity $\psi_{n,q}(\vartheta, a) = \psi_{n,q}(\vartheta) = \|\eta_n(\cdot, \vartheta)\|_{L^q}$ from the end of Section 3, we make the dependence on the tuning parameter $a$ explicit. Then, on a set of measure one,
$$\lim_{a \to \infty} a^{q+1}\, \psi_{n,q}(\vartheta, a)^q = \Gamma(q + 1) \left| \frac{1}{n} \sum_{j=1}^{n} \frac{p_\vartheta'(X_j)}{p_\vartheta(X_j)} \right|^q,$$
where $\Gamma(\cdot)$ denotes the Gamma function.
The proof consists of an almost trivial application of an Abelian theorem for the Laplace transform, see p. 182 of Widder (1959), or the work by Baringhaus et al. (2000). Since $a, q > 0$, the functions $\psi_{n,q}(\vartheta)$ and $a^{q+1}\, \psi_{n,q}(\vartheta)^q$ attain their minimum at the same point. Thus, in the limit $a \to \infty$, our procedure essentially yields as an estimator the minimizer of the quantity
$$\left| \frac{1}{n} \sum_{j=1}^{n} \frac{p_\vartheta'(X_j)}{p_\vartheta(X_j)} \right|.$$
In the situation of the exponential distribution as discussed in Section 5, the result reduces to $\lim_{a \to \infty} a^3\, \psi_{n,2}(\vartheta, a)^2 = 2 \vartheta^2$, so in the limit $a \to \infty$, the procedure will choose $\vartheta = 0 \notin \Theta$ as the estimator, which leads to a bias of $-\vartheta_0$ and an MSE of $\vartheta_0^2$. The observation from the simulations is, therefore, not universal. An example for which the limit in Theorem 6.1 is less trivial is the Rayleigh distribution.
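The stated limit can also be checked numerically. The following sketch (reusing the hypothetical 'psi_sq' from the sketch in Section 2) evaluates $a^3\, \psi_{n,2}(\vartheta, a)^2$ for the exponential model, where the limit should equal $\Gamma(3)\, \vartheta^2 = 2 \vartheta^2$.

```python
import numpy as np
from math import gamma

# Numerical check of the limit in Theorem 6.1 for the exponential model:
# p'/p = -theta and q = 2, so a^3 * psi_{n,2}(theta, a)^2 should approach
# Gamma(3) * theta^2 = 2 * theta^2. For very large 'a' the integrand is
# sharply peaked near 0, so the quadrature may need extra care.
rng = np.random.default_rng(4)
x = rng.exponential(size=30)
theta = 1.3
score = lambda u, th: -th[0] * np.ones_like(u)
for a in [1.0, 10.0, 100.0]:
    val = a ** 3 * psi_sq([theta], x, score, a=a)
    print(f"a = {a:7.1f}: {val:.4f}  (limit: {gamma(3) * theta ** 2:.4f})")
```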

Example: Rayleigh distribution
Let $\Theta = (0, \infty)$ and take the density function of the Rayleigh distribution with parameter $\vartheta \in \Theta$,
$$p_\vartheta(x) = \frac{x}{\vartheta^2}\, \exp\Big( -\frac{x^2}{2 \vartheta^2} \Big), \quad x > 0.$$
It is easy to check that the Rayleigh density satisfies all regularity conditions stated throughout the work, and that we have
$$\frac{p_\vartheta'(x)}{p_\vartheta(x)} = \frac{1}{x} - \frac{x}{\vartheta^2}, \quad x > 0.$$
The limit in Theorem 6.1 thus takes the form
$$\lim_{a \to \infty} a^{q+1}\, \psi_{n,q}(\vartheta, a)^q = \Gamma(q + 1) \left| \frac{1}{n} \sum_{j=1}^{n} \Big( \frac{1}{X_j} - \frac{X_j}{\vartheta^2} \Big) \right|^q,$$
where $X_1, \dots, X_n$ are i.i.d. random variables which follow the Rayleigh law, $X_1 \sim p_{\vartheta_0}$, for some unknown scale parameter $\vartheta_0 \in \Theta$. In the case $q = 2$, it is easy to calculate that the minimum of the above function over $\vartheta > 0$ is attained at
$$\widetilde{\vartheta}_n = \left( \frac{\sum_{j=1}^{n} X_j}{\sum_{j=1}^{n} X_j^{-1}} \right)^{1/2}.$$
Strikingly, this asymptotically derived moment-type estimator is itself consistent for $\vartheta_0$, since
$$\widetilde{\vartheta}_n \longrightarrow \left( \frac{\mathbb{E}\, X_1}{\mathbb{E}\, X_1^{-1}} \right)^{1/2} = \left( \frac{\vartheta_0 \sqrt{\pi / 2}}{\sqrt{\pi / 2}\, /\, \vartheta_0} \right)^{1/2} = \vartheta_0$$
$\mathbb{P}$-a.s., as $n \to \infty$, where we used the law of large numbers, as well as the fact that $X_1, \dots, X_n$ all follow the Rayleigh distribution with parameter $\vartheta_0$. We compare this estimator with other methods. Among them is our new estimator $\widehat{\vartheta}_{n,2}^{(a)} = \arg\min\big\{ \psi_{n,2}(\vartheta)^2 \mid \vartheta > 0 \big\}$, where, as in the exponential case, $\psi_{n,2}(\vartheta)^2$ takes an explicit form, now as a quadratic polynomial in $\vartheta^{-2}$,
$$\psi_{n,2}(\vartheta)^2 = \vartheta^{-4}\, \Psi_n^{(1)} + \vartheta^{-2}\, \Psi_n^{(2)} + \Psi_n^{(3)},$$
with coefficients depending on $a$ and on the ordered sample $X_{(1)} < \dots < X_{(n)}$. It is easily seen that if both $\Psi_n^{(1)} > 0$ and $\Psi_n^{(2)} < 0$ $\mathbb{P}$-a.s., then the minimum can be calculated explicitly as
$$\widehat{\vartheta}_{n,2}^{(a)} = \left( -\frac{2\, \Psi_n^{(1)}}{\Psi_n^{(2)}} \right)^{1/2},$$
and indeed, using that $e^{-aX_{(k)}} < e^{-aX_{(j)}}$ for $j < k$ and $1 - e^{-aX_{(j)}} - aX_{(j)}\, e^{-aX_{(j)}} > 0$ $\mathbb{P}$-a.s., we have $\Psi_n^{(1)} > 0$, and, with similar thoughts, $\Psi_n^{(2)} < 0$ $\mathbb{P}$-a.s. Additionally, we consider the maximum likelihood estimator and a moment estimator, which are given as
$$\widehat{\vartheta}_n^{ML} = \left( \frac{1}{2n} \sum_{j=1}^{n} X_j^2 \right)^{1/2} \qquad \text{and} \qquad \widehat{\vartheta}_n^{Mom} = \sqrt{\frac{2}{\pi}}\; \frac{1}{n} \sum_{j=1}^{n} X_j,$$
respectively. Note in particular that the moment estimator is unbiased, and we can expect it to outperform the other estimators in this regard. Finally, we include the minimum Cramér-von Mises distance estimator given through
$$\widehat{\vartheta}_n^{CM} = \arg\min_{\vartheta \in \Theta} \int_0^\infty \big( F_n(x) - P_\vartheta(x) \big)^2 \, \mathrm{d}P_\vartheta(x), \qquad P_\vartheta(x) = 1 - \exp\Big( -\frac{x^2}{2 \vartheta^2} \Big),$$
where we solve the minimization numerically via a sequential least squares programming method as in the case of the exponential distribution in Section 5, using as initial value the maximum likelihood estimator. The execution of the comparison is as in the example on the exponential distribution, and the results are displayed in Tables 3 and 4.

Table 3: Approximated biases calculated with 100,000 Rayleigh-distributed Monte Carlo samples.
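For concreteness, the explicit Rayleigh estimators of this section can be coded in a few lines; this sketch assumes the scale parametrization and the formulas as reconstructed above.

```python
import numpy as np

# Sketch of the explicit Rayleigh estimators from this section, under the scale
# parametrization p_theta(x) = (x / theta^2) * exp(-x^2 / (2 * theta^2)).

def rayleigh_mle(x):
    return np.sqrt(np.mean(x ** 2) / 2.0)        # sqrt( (1/(2n)) * sum X_j^2 )

def rayleigh_moment(x):
    return np.sqrt(2.0 / np.pi) * np.mean(x)     # unbiased, since E X = theta*sqrt(pi/2)

def rayleigh_limit_a(x):
    return np.sqrt(np.sum(x) / np.sum(1.0 / x))  # a -> infinity moment-type estimator

rng = np.random.default_rng(1)
theta0 = 2.0
x = theta0 * np.sqrt(-2.0 * np.log(rng.uniform(size=100)))   # inverse-CDF sampling
print(rayleigh_mle(x), rayleigh_moment(x), rayleigh_limit_a(x))
```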
Apparently, the moment estimator $\widehat{\vartheta}_n^{Mom}$ outperforms the other estimators with respect to the bias values, while the maximum likelihood estimator $\widehat{\vartheta}_n^{ML}$ attains the smallest MSE. The estimator we obtained via the limit results from the previous section seems sound in itself, but it is not competitive with the other methods. In terms of bias, the minimum Cramér-von Mises distance estimator is preferable to the maximum likelihood method, and both are outdone by our new estimator, which even keeps up with the unbiased moment estimator for the smaller values of the parameter $\vartheta_0$. Notice that the maximum likelihood and moment estimators tend to underestimate the parameter, while the other procedures tend to a slight overestimation. As for the MSE, the moment estimator and our new method perform similarly and follow the maximum likelihood estimator closely. The minimum Cramér-von Mises distance estimator is a bit behind. To summarize, the maximum likelihood and moment estimators for the Rayleigh parameter are both simple and very convincing, but the newly proposed method keeps up (for suitably chosen tuning parameter) and appears to find a good compromise between bias and MSE. The only more serious weakness shows for the large parameter value $\vartheta_0 = 10$ and small sample sizes $n = 10, 25$.

Example: The Burr Type XII distribution
Consider the density function
$$p_\vartheta(x) = c\, k\, x^{c-1}\, \big( 1 + x^c \big)^{-k-1}, \quad x > 0,$$
where $\vartheta = (c, k) \in (0, \infty)^2 = \Theta$. It is not exactly trivial, but still straightforward, to prove that this is an admissible distribution in terms of the setting in Section 2 [see also Betsch and Ebner (2019a)] and the conditions (R1)-(R3). With $q = 2$ and weight $w(t) = e^{-at}$, where $a > 0$, the function $\psi_{n,2}(\vartheta) = \|\eta_n(\cdot, \vartheta)\|_{L^2}$ from Section 3 (see also Section 2) can be calculated explicitly in terms of the ordered sample $X_{(1)} < \dots < X_{(n)}$. Our estimator $\widehat{\vartheta}_{n,2}^{(a)}$, as in (4), can be calculated as the minimizer of this function over $\Theta$. We use the 'L-BFGS-B' method [L-BFGS-B algorithm, see Byrd et al. (1995) and Zhu et al. (1997)] implemented in the 'optimize.minimize' function of 'scipy' to solve the minimization numerically, using $(1, 1)$ as initial value. (Note that in preliminary simulations we have tried several other optimization routines, like a truncated Newton algorithm or the 'SLSQP' from previous sections, but the 'L-BFGS-B' method appeared to be the most reliable for our purpose.)

Table 5: Approximated biases calculated with 100,000 Burr-distributed Monte Carlo samples.
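A sketch of this numerical minimization, reusing the hypothetical 'lq_estimate' from the sketch in Section 2, is given below; the score $p_\vartheta'/p_\vartheta$ follows directly from the stated density, and the inverse-CDF sampler is included only to make the example self-contained.

```python
import numpy as np

# Sketch for the Burr Type XII model, reusing the hypothetical lq_estimate from
# Section 2. From log p_theta(x) = log(c*k) + (c-1)*log(x) - (k+1)*log(1+x^c):
#   (p'/p)(x) = (c-1)/x - (k+1) * c * x^(c-1) / (1 + x^c).

def burr_score(x, theta):
    c, k = theta
    return (c - 1.0) / x - (k + 1.0) * c * x ** (c - 1.0) / (1.0 + x ** c)

def sample_burr(c, k, size, rng):
    # Inverse-CDF sampling from F(x) = 1 - (1 + x^c)^(-k).
    u = rng.uniform(size=size)
    return ((1.0 - u) ** (-1.0 / k) - 1.0) ** (1.0 / c)

rng = np.random.default_rng(5)
x = sample_burr(2.0, 1.0, size=200, rng=rng)
theta_hat = lq_estimate(x, burr_score, theta_init=[1.0, 1.0],
                        bounds=[(1e-4, None), (1e-4, None)], a=3.0)
print(theta_hat)  # should be close to (2.0, 1.0)
```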
As competitors to our estimator we consider the maximum likelihood estimator, with implementation as suggested by Shah and Gokhale (1993) [for a different algorithm, see Wingo (1983)], as well as the minimum Cramér-von Mises distance estimator (the minimization is solved numerically, similar to our new estimator). More precisely, for the maximum likelihood estimator we use the Newton-Raphson method (with initial value $c = 1$) to find the root of the profile likelihood equation in $c$, in which the $k$-parameter is expressed through the terms $\log\big( 1 + X_j^c \big)$. Note that there have been further contributions to the estimation of the Burr parameters [see Schmittlein (1983), Shah and Gokhale (1993), Wingo (1993), and Wang and Cheng (2010)]. As for the exponential and Rayleigh distributions, we approximate bias and MSE of these estimators and show the results in Tables 5 and 6. For each value of $\vartheta_0$ and $n$, the first line corresponds to the bias/MSE of the estimator for the $c$-parameter, and the second line corresponds to the $k$-parameter. As before, it becomes evident that our new procedure outperforms the maximum likelihood and minimum Cramér-von Mises distance estimators in terms of the bias. Unlike for the exponential distribution, the dependence on the tuning parameter $a$ is less clear: For a great deal of parameter values and sample sizes, the estimator $\widehat{\vartheta}_{n,2}^{(3)}$ yields the best result, but in some cases (mostly for the $k$-parameter) the estimator $\widehat{\vartheta}_{n,2}^{(0.25)}$, with tuning parameter from the other end of the spectrum, performs best. Also observe the oddity that in some cases the estimator fares noticeably worse for $a = 0.5$ than for both smaller and larger tuning parameters. Thus, if one seeks to minimize some measure of quality of the estimators, an optimal, data-dependent choice of the tuning parameter would be useful (more on this in Section 10). In the light of our simulations, we suggest the use of $\widehat{\vartheta}_{n,2}^{(3)}$ in practice as long as no adaptive tuning is available. Both in the bias and in the MSE simulation, the maximum likelihood estimator ran into computational issues for sample size $n = 10$. The minimum Cramér-von Mises distance estimator is more stable in this regard, but still a lot less so than our new estimators, which show notably slighter outliers only for large values of the Burr parameters. Once samples get larger ($n \ge 50$), the asymptotic optimality properties of the maximum likelihood estimator appear to kick in, as its performance stabilizes. Still, for suitably chosen tuning parameter, our estimators are very close in virtually all instances. The small sample behavior of the maximum likelihood estimator poses a huge drawback for applications, and the problem is well-known.

Example: Exponential-polynomial models
We now proceed to consider an example of a non-normalized parametric model, one of the major motivations for this work. In particular, let
$$p_\vartheta(x) = C(\vartheta)\, \exp\Big( \sum_{k=1}^{d} \vartheta_k\, x^k \Big), \quad x > 0, \qquad (8)$$
where $\vartheta = (\vartheta_1, \dots, \vartheta_d)$ with $\vartheta_d < 0$, and where $C(\vartheta)$ denotes the normalization constant. These density functions correspond to a so-called exponential-polynomial model, which constitutes a special type of exponential family. It is trivial to see that these density functions obey the regularity assumptions (R1)-(R3), and it is also not hard to verify that the regularity conditions stated by Betsch and Ebner (2019a) (as summarized in Section 2) are satisfied. Thus, we can first of all note, as a corollary to Theorem 3 of Betsch and Ebner (2019a), the following characterization.

Corollary 9.1. A positive random variable $X$ with $\mathbb{E}\, X^d < \infty$ follows the exponential-polynomial model in (8) if, and only if, the distribution function $F_X$ of $X$ satisfies
$$F_X(t) = \mathbb{E}\left[ -\Big( \sum_{k=1}^{d} k\, \vartheta_k\, X^{k-1} \Big) \min\{X, t\} \right], \quad t > 0.$$
This is the characterization which underlies our new estimation method as constructed in Section 2.
Notice that $C(\vartheta)$ cannot be written in closed form, so maximum likelihood estimators are not readily available for the model in (8). Using the method of holonomic gradient descent, introduced by Nakayama et al. (2011), Hayakawa and Takemura (2016) identify a differential equation which allows to numerically calculate $C(\vartheta)$ and its derivatives, and thus to get an approximation of the maximum likelihood estimator. In our simulations, however, we focus on methods that do not try to approximate $C(\vartheta)$ numerically, but get rid of the normalization constant altogether. Namely, we consider our new method and compare it to the well-known score matching approach of Hyvärinen (2007), a generalization of the method introduced in Hyvärinen (2005), as well as to the noise-contrastive estimation technique of Gutmann and Hyvärinen (2012). In the case of non-negative, univariate observations, the score matching approach boils down to finding the minimum of
$$\widetilde{J}_n(\vartheta) = \frac{1}{n} \sum_{j=1}^{n} \left[ 2\, X_j\, \psi_\vartheta(X_j) + X_j^2\, \psi_\vartheta'(X_j) + \frac{1}{2}\, X_j^2\, \psi_\vartheta(X_j)^2 \right], \qquad \psi_\vartheta = \frac{p_\vartheta'}{p_\vartheta},$$
[see Section 3 of Hyvärinen (2007)], where $X_1, \dots, X_n$ are i.i.d. random variables with $X_1 \sim p_{\vartheta_0}$, for some unknown $\vartheta_0 \in \Theta$. Clearly, the quantity does not rely on $C(\vartheta)$. As for the estimator constructed in this paper, fixing $q = 2$ and the weight $w(t) = e^{-at}$, where $a > 0$ is a tuning parameter, we may calculate $\psi_{n,2}(\vartheta) = \|\eta_n(\cdot, \vartheta)\|_{L^2}$ (see Sections 2 and 3) explicitly in terms of the ordered values $X_{(1)} < \dots < X_{(n)}$. This formula is notably more complicated than the one resulting from the score matching approach, but in the two-parameter setting we now turn to, both estimators can be calculated explicitly. More precisely, to keep the presentation clear, we intend to focus on a two-parameter case, but in order not to end up with a Gaussian-type model, we consider $d = 3$ and fix $\vartheta_2 = 0$, thus effectively considering the model
$$p_\vartheta(x) = C(\vartheta)\, \exp\big( \vartheta_1 x + \vartheta_3 x^3 \big), \quad x > 0, \qquad \vartheta_3 < 0.$$
In this case, each time by solving a quadratic minimization problem in $\vartheta_1$ and $\vartheta_3$ (which is obtained by simplifying the above quantities $\widetilde{J}_n$ and $\psi_{n,2}$ further), we obtain the estimators explicitly: The score matching estimators for $\vartheta = (\vartheta_1, \vartheta_3)$ are given in terms of the sample moments $m_k = \sum_{j=1}^{n} X_j^k$, and our new estimators take a similar explicit form involving, in addition, the terms $e^{-aX_{(j)}}$. Moreover, we consider the noise-contrastive estimators in the refined version of Gutmann and Hyvärinen (2012) that generalizes the initial results of Gutmann and Hyvärinen (2010). The idea is motivated by a binary classification problem and proceeds to consider the unknown normalization constant as an additional parameter to be estimated. The objective function is constructed in such a way that the obtained estimator for the normalization constant truly provides (in numerical approximation) a normalized density, without any further constraints on the optimization. Following Gutmann and Hyvärinen (2012), we implement this technique as follows. Given the sample $X_1, \dots, X_n$, choose the noise sample size $T_n = \nu \cdot n$ (for some fixed $\nu \in \mathbb{N}$, in our case $\nu = 10$) and sample from the noise distribution (in our case, the exponential distribution with rate parameter $\lambda_n = n / \sum_{j=1}^{n} X_j$) to obtain values $Y_1, \dots, Y_{T_n}$. Then, minimize the objective function to obtain an estimator $\widehat{\vartheta}_n^{NC}$ for the unknown parameters $(\vartheta_1, \vartheta_3)$ as well as for the logarithm of the inverse of the normalization constant. In our simulations we used the 'L-BFGS-B' method, which we have also applied in previous examples, for this optimization [with initial values $(0, -0.1, 0)$ and with the second parameter constrained to the negative numbers].
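To illustrate the score matching route, the following sketch solves the quadratic minimization explicitly; the $2 \times 2$ linear system is our own derivation from the objective as reconstructed above (with $\psi_\vartheta(x) = \vartheta_1 + 3 \vartheta_3 x^2$), and the crude rejection sampler only serves to test the estimator.

```python
import numpy as np

# Sketch of the explicit score matching estimator for the two-parameter model
# p(x) ~ exp(th1*x + th3*x^3), x > 0, th3 < 0, assuming the objective
# reconstructed above. The 2x2 linear system below is our own derivation
# (gradient of the quadratic objective set to zero).

def score_matching_exp_poly(x):
    m = {k: np.sum(x ** k) for k in (1, 2, 3, 4, 6)}
    A = np.array([[m[2], 3.0 * m[4]],
                  [3.0 * m[4], 9.0 * m[6]]])
    b = np.array([-2.0 * m[1], -12.0 * m[3]])
    return np.linalg.solve(A, b)          # (th1_hat, th3_hat)

def sample_exp_poly(th1, th3, size, rng):
    # Rejection sampler with an Exp(1) proposal; the log acceptance ratio
    # g(x) = (th1+1)*x + th3*x^3 is maximized at x* = sqrt((th1+1)/(-3*th3))
    # if th1 + 1 > 0, and at x = 0 otherwise.
    if th1 + 1.0 > 0.0:
        xs = np.sqrt((th1 + 1.0) / (-3.0 * th3))
        log_m = (th1 + 1.0) * xs + th3 * xs ** 3
    else:
        log_m = 0.0
    out = np.empty(0)
    while out.size < size:
        y = rng.exponential(size=size)
        keep = np.log(rng.uniform(size=size)) < (th1 + 1.0) * y + th3 * y ** 3 - log_m
        out = np.concatenate([out, y[keep]])
    return out[:size]

rng = np.random.default_rng(3)
x = sample_exp_poly(1.0, -0.5, size=1000, rng=rng)
print(score_matching_exp_poly(x))  # should be near (1.0, -0.5)
```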
It is immediate that our new estimator and the noise-contrastive estimator outperform the score matching method distinctly over all tuning parameters, sample sizes, and parameter values, for both the bias and the MSE, with the only exception being the parameter vector $(1, -0.05)$, for which the estimator $\widehat{\vartheta}_{n,2}^{(5)}$ fares worse than the score matching approach in MSE terms. We propose as a very good compromise choice of the tuning parameter the use of $\widehat{\vartheta}_{n,2}^{(1)}$ as an estimator. This particular estimator outperforms the score matching method by factors of (at least) 4 in terms of MSE and also fares notably better in terms of the bias. It also outperforms the noise-contrastive estimation method uniformly, except for four instances in the MSE values (in three of which our method still performs better when another tuning parameter is chosen). The simulation in this non-normalized model conforms with the observation from previous examples that the new method fares remarkably well bias-wise. We also note that all of the estimators admit a large mean squared error for very small sample sizes, a behavior to be expected. From our simulations we conclude that the new estimation method is clearly to be preferred over the other approaches in this univariate setting of the exponential-polynomial models, but of course larger-scale simulations involving different types of multi-parameter versions of the model would be needed to further strengthen this position [also, generalizations of the score matching technique, like Yu et al. (2019), could be taken into account]. One massive advantage of the score matching and noise-contrastive estimation approaches, however, is that they readily generalize to the multivariate situation, a generalization we were not (yet) able to establish for our approach (see the last paragraph of Section 10).
Remark 9.2. We observed in our simulations that the noise-contrastive estimators can run into computational problems when the exponentials in the objective function raise an overflow warning. A step-by-step analysis of the code suggests that for large noise sample sizes $T_n$ (that is, for large $\nu$) one tends to obtain some large values in the sample $Y_1, \dots, Y_{T_n}$ which are cubed in the exponential terms and thus become very (if not too) large. The behavior seems to appear more often for small parameter values $\vartheta_3^{(0)}$, but it seems to affect only single evaluations of the objective function during the optimization routine. We believe that most values for the noise-contrastive estimation approach in the table are intact, and they also replicated when we reran the whole simulation, with a bit of an exception in the case of the parameter vector $(1, -0.05)$, where the values show a rather noticeable dependence on the initial value chosen for the optimization (though this does not happen for the other parameter values). Therefore, one possible way to reduce the occurrence of overflows, namely choosing small initial values for the $\vartheta_3$-parameter in the optimization routine, has an impact on the performance of the estimator. Another way out could be to adopt noise distributions with extremely short tails. It could prove useful to see if our observations replicate in other simulation studies. Note that no computational issues arise for the score matching and our new approach, where the estimators can be calculated explicitly.
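On the implementation side, one remedy for the overflow issue is to evaluate the logistic terms of the noise-contrastive objective on the log scale. The sketch below assumes the standard form of the Gutmann and Hyvärinen (2012) objective and uses 'np.logaddexp', which keeps the exponentials of cubed sample values from overflowing.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of noise-contrastive estimation for p ~ exp(th1*x + th3*x^3),
# assuming the standard logistic objective of Gutmann and Hyvärinen (2012).
# All logistic terms are evaluated on the log scale via np.logaddexp.

def nce_estimate(x, nu=10, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    n = x.size
    lam = n / x.sum()                                  # rate of the exponential noise
    y = rng.exponential(scale=1.0 / lam, size=nu * n)  # noise sample, T_n = nu * n

    def log_model(u, th):
        th1, th3, c = th                               # c = log(1 / normalization const.)
        return th1 * u + th3 * u ** 3 + c

    def log_noise(u):
        return np.log(lam) - lam * u

    def neg_objective(th):
        gx = log_model(x, th) - log_noise(x)
        gy = log_model(y, th) - log_noise(y)
        # log h = -log(1 + nu*exp(-G));  log(1 - h) = log(nu) - G + log h
        log_h_x = -np.logaddexp(0.0, np.log(nu) - gx)
        log_1mh_y = np.log(nu) - gy - np.logaddexp(0.0, np.log(nu) - gy)
        return -(np.sum(log_h_x) + np.sum(log_1mh_y)) / n  # scaling keeps the argmin

    res = minimize(neg_objective, x0=np.array([0.0, -0.1, 0.0]), method="L-BFGS-B",
                   bounds=[(None, None), (None, -1e-8), (None, None)])
    return res.x   # (th1_hat, th3_hat, log-normalization estimate)
```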

Notes and comments
Note that there remain some problems for further research on our newly proposed estimators, the discussion or extension of which would be too extensive for this contribution. First, for all estimators we considered explicitly, we incorporate a tuning parameter $a$ on which the performance depends strongly. It would be beneficial to have an adaptive choice of this parameter [see Allison and Santana (2015), and the refinements by Tenreiro (2019), who discuss such a method in the context of goodness-of-fit testing problems], ideally adaptable to the criterion (minimal bias etc.) the estimator should satisfy. In the context of deriving results for $a \to \infty$, we obtained another consistent estimator for the Rayleigh parameter, and it would be interesting to see if such results can be derived for other distributions. Also, we have not used in practice the flexibility gained by providing all results for the general $L^q$-spaces, but restricted our attention to the case $q = 2$, mostly because of the explicit formulae obtainable in that case. If no closed formula for $\psi_{n,q}$ is feasible, either because some $q \ne 2$ is used or because some advanced weight function $w$ is chosen, the integral in $\psi_{n,q}$ has to be solved numerically, which could lead to a computationally highly demanding procedure overall. As for the choice of a specific weight function $w$, to the best of our knowledge there exist no theoretical results which favor specific choices over others. Considering the vast amount of weighted $L^2$-statistics put to use in goodness-of-fit testing problems, it seems we cannot hope for general results in that direction. As such, the choice of the weight function provides some flexibility, but without clear guidance to satisfy specific objectives other than $\psi_{n,q}$ being calculable explicitly.
We have proven, in a quite usual setting, the consistency of our estimators. Surely, a limit theorem of the type
$$s(n)\, \big( \widehat{\vartheta}_{n,q} - \vartheta_0 \big) \xrightarrow{\;\mathcal{D}\;} \mathcal{P}, \quad \text{as } n \to \infty,$$
where $s(n) \to \infty$, as $n \to \infty$, and where $\mathcal{P}$ is some limit distribution (e.g. the normal distribution), is desirable. Such a result would pave the way for constructing confidence regions for the true parameter based on our method. The main hurdle in direct approaches to proving such a limit result, like a Taylor expansion or methods from empirical process theory, is that the terms involved in such calculations become too complicated and make the endeavor appear impractical to us. One hope is that, since Barp et al. (2019) provide limit results for special classes of Stein discrepancy-based estimators, the interpretation of our estimation method in terms of the feature Stein discrepancy might at some point lead to advances.
Moreover, a larger-scale simulation study, involving more underlying parameters, sample sizes, and distributions, could provide further insight into the estimation method. Improvements from a numerical point of view would, of course, benefit the approach. From a theoretical perspective, an important step in this last direction is to study whether the minimization method that is used in cases where the estimators cannot be calculated explicitly will always find a global minimum, or, if not, in which situations it is likely to get stuck in some local minimum.
Note that Betsch and Ebner (2019a) also give characterization results for density functions on bounded intervals or on the whole real line. These can be used to construct similar estimation methods in the corresponding cases. To sketch the idea in the case of parametric models on the whole real line, assume that the support of each density function $p_\vartheta$ in $\mathcal{P}_\Theta$ is the whole real line (and that some mild regularity conditions hold). Let $X$ be a real-valued random variable and consider the analogue
$$\eta(t, \vartheta) = \mathbb{E}\left[ \frac{p_\vartheta'(X)}{p_\vartheta(X)}\, \min\{X, t\} \right] + F_X(t)$$
of the function from Section 2, for $(t, \vartheta) \in \mathbb{R} \times \Theta$. Then, similar to our elaborations in Section 2, Theorem 4.1 of Betsch and Ebner (2019a) shows that $X \sim p_{\vartheta_0}$ if, and only if, $\eta(t, \vartheta_0) = 0$ for every $t \in \mathbb{R}$. Therefore, if, initially, $X \sim p_{\vartheta_0}$, then $\|\eta(\cdot, \vartheta)\|_{L^q} = 0$ if, and only if, $\vartheta = \vartheta_0$. Here, $L^q = L^q\big( \mathbb{R}, \mathcal{B}^1, w(t)\, \mathrm{d}t \big)$, $1 \le q < \infty$, with a positive and integrable weight function $w$. Thus, with an empirical version $\eta_n$ as before, a reasonable estimator for $\vartheta_0$ is $\widehat{\vartheta}_{n,q} = \arg\min\big\{ \|\eta_n(\cdot, \vartheta)\|_{L^q} \,\big|\, \vartheta \in \Theta \big\}$.
Apparently, once we switch to density functions supported by the whole real line, the characterization result due to Betsch and Ebner (2019a), and thus our estimator, take slightly different forms, but using the results from Section 3, we could still prove existence and measurability for this type of estimator, and give a formal definition as in (6). Moreover, a classical proof via the law of large numbers for random elements in separable Banach spaces and the Arzelà-Ascoli theorem [considering the modulus of continuity, as employed by Billingsley (1968)] yields the convergence results from Lemma 4.1 for $\psi_{n,q} = \|\eta_n(\cdot, \vartheta)\|_{L^q}$, but with all convergences only in probability. That result can then be used to derive consistency as in Theorem 4.2, again with all convergences only in probability. However, choosing a fixed (i.e. parameter-independent) weight function on $\mathbb{R}$ with a mere scale-tuning, as we employ it throughout (using the weight $t \mapsto e^{-at}$), appears not to be sufficient to account for the possible location-dependence of the model. Thus, in simulations (for instance with the Cauchy distribution) the problem, to us, seems empirically more involved and is therefore not addressed in the work at hand. Still, we deem it possible to apply our new type of estimator to models which are supported by any connected subset of $\mathbb{R}$, as indicated in the previous lines. Of course, the next question which forces itself on us is whether a similar method can be devised for multivariate models. Here the frontiers are somewhat blurry: The Stein density approach identity which appears at the beginning of Section 2 is not yet fully understood in the multivariate case.

Appendix A

Remark A.1. A classical measurability result which could replace Theorem 3.1 is that of Brown and Purves (1973), but it requires $\sigma$-compactness of the parameter space, thus essentially reducing the study to Euclidean parameters (a Banach space is $\sigma$-compact if, and only if, it is of finite dimension, which follows easily from Baire's category theorem). Of course this is enough for our purposes, but currently the interest in statistical inference for infinite dimensional models grows remarkably. Hence, if a statistician was to investigate measurability of an estimator for some infinite dimensional quantity, she would have to resort to a result in the generality of Theorem 3.1. Another reason for us to build on Theorem 3.1 is that other measurability results known to us do not quite fit the construction of our estimators. For instance, Sahler (1970) considers minimum discrepancy estimators, where discrepancies are (certain) functions on the Cartesian product of a suitable set of probability measures with itself. It is (formally) not possible to identify such a set of probability measures in our setting, as we ought to introduce the empirical distribution of a sample into the discrepancy function, while only considering parametric distributions with a continuously differentiable density. Even though we believe this to be a purely formal issue which might be resolved to render the results from Sahler (1970) applicable, additional caution is needed that Theorem 3.1 does not require. Likewise, the setting considered by Pfanzagl (1969) does not cover our estimators.
Note that since completing (the σ-field of) an underlying probability space does not interfere with measurability properties of random maps, nor does it meddle with push-forward measures, the corresponding assumption in Theorem 3.1 is no restriction. If S is a complete, separable metric space and the map Γ from Theorem 3.1 takes compact subsets of S as values, the condition imposed on the graph is equivalent to Γ being measurable with respect to the Borel-σ-field generated by the Hausdorff topology [see Theorems III.2 and III.30 by Castaing and Valadier (1977)]. Likewise, if S is a locally compact, complete, separable metric space and Γ maps into the closed subsets of S, the condition is equivalent to Γ being measurable with respect to the Borel-σ-field generated by the Fell topology [this can be proven using results from Beer (1993) and Castaing and Valadier (1977)].
Proof of Lemma 3.2. First recall the following lemma on product-measurability, the proof of which is an easy exercise: Let $(A, \mathcal{A})$ be a measurable space, $I$ a separable metric space, and $T$ a metric space, and suppose that $h : A \times I \to T$ is such that $h(\cdot, t)$ is $\mathcal{A}, \mathcal{B}(T)$-measurable for every $t \in I$, and $h(a, \cdot)$ is continuous for every $a \in A$. Then $h$ is $\mathcal{A} \otimes \mathcal{B}(I), \mathcal{B}(T)$-measurable.
Since $\big\{ \mathbb{P}^{\widehat{\vartheta}_{n,q}} \mid n \le n_0 \big\}$ is a finite set of measures, there exists a compact set $K \subset \mathbb{R}^d$ such that $\mathbb{P}\big( \widehat{\vartheta}_{n,q} \in K \big) \ge 1 - \frac{\varepsilon}{2}$ for all $n \le n_0$. The set $K \cap B \subset \Theta$ is a compact subset of $\mathbb{R}^d$, and thus also of $\Theta$, for a compact metric space is a compact subset of every metric space it embeds into continuously [see p. 21, Theorem 3, of Kuratowski (1968)]. By the choice of the sets, $\mathbb{P}\big( \widehat{\vartheta}_{n,q} \in K \cap B \big) \ge 1 - \varepsilon$, which is the claim.