Positivity-hardness results on Markov decision processes

This paper investigates a series of optimization problems for one-counter Markov decision processes (MDPs) and integer-weighted MDPs with finite state space. Specifically, it considers problems addressing termination probabilities and expected termination times for one-counter MDPs, as well as satisfaction probabilities of energy objectives, conditional and partial expectations, satisfaction probabilities of constraints on the total accumulated weight, the computation of quantiles for the accumulated weight, and the conditional value-at-risk for accumulated weights for integer-weighted MDPs. Although algorithmic results are available for some special instances, the decidability status of the decision versions of these problems is unknown in general. The paper demonstrates that these optimization problems are inherently mathematically difficult by providing polynomial-time reductions from the Positivity problem for linear recurrence sequences. This is a well-known number-theoretic problem whose decidability status has been open for decades; moreover, it is known that decidability of the Positivity problem would have far-reaching consequences in analytic number theory. The reductions presented in the paper hence show that an algorithmic solution to any of the investigated problems is not possible without a major breakthrough in analytic number theory. The reductions rely on the construction of MDP-gadgets that encode the initial values and linear recurrence relations of linear recurrence sequences. These gadgets can flexibly be adjusted to prove the various Positivity-hardness results.


Introduction
When modelling and analyzing computer systems and their interactions with their environment, two qualitatively different kinds of uncertainty about the evolution of the system execution play a central role: non-determinism and probabilism. If a system is, for example, employed in an unknown environment or depends on user inputs or concurrent processes, modelling the system as non-deterministic accounts for all possible external influences, sequences of user inputs, or possible orders in which concurrent events take place. If transition probabilities between the states of a system, such as the failure probability of components or the probabilities in a probabilistic choice employed in a randomized algorithm, are known or can be estimated, it is appropriate to model this behavior as probabilistic. A pure worst- or best-case analysis is not very informative in such cases and the additional probabilistic information available should be put to use. Markov decision processes (MDPs) are a standard operational model combining non-deterministic and probabilistic behavior and are widely used in operations research, artificial intelligence, and verification, among other fields.
In each state of an MDP, there is a non-deterministic choice from a set of actions. Each action specifies a probability distribution over the possible successor states according to which a transition is chosen randomly. Typical optimization problems on MDPs require resolving the non-deterministic choices by specifying a scheduler such that a quantitative objective function is optimized. For example, the standard model-checking problem asks for the minimal or maximal probability that an execution satisfies a given linear-time property. Here, minimum and maximum range over all resolutions of the non-deterministic choices, i.e., over all schedulers. This model-checking problem is known to be 2EXPTIME-complete if the property is given in linear temporal logic (LTL) [29] and solvable in polynomial time if the property is given by a deterministic automaton [30,10]. Many quantitative aspects of a system can be modeled by equipping an MDP with weights that are collected in each step. These weights might represent time, energy consumption, utilities, or, generally speaking, any sort of costs or rewards incurred.
Classical optimization problems in this context that are known to be solvable in polynomial time include the optimization of the expected value of the total accumulated weight before a target state is reached, the so-called stochastic shortest path problem (SSPP) [16,30,5], the expected value of the reward earned on average per step, the so-called expected mean payoff or long-run average, or the expected discounted accumulated weight where after each step a discount factor is applied to all future weights (for the latter two, see, e.g., [40,60]).
Of course, there is a vast landscape of further optimization problems on finite-state MDPs that have been analyzed. We are, nevertheless, not aware of natural decision problems for standard (finite-state) MDPs with a single weight function and a single objective that are known to be undecidable. Undecidability results have been established for more expressive models.
The problems we investigate turn out to be closely related to the Skolem problem and the Positivity problem for linear recurrence sequences: the Skolem problem asks whether a given linear recurrence sequence has a zero member, while the Positivity problem asks whether all members of the sequence are non-negative. Decidability is known for both problems for linear recurrence sequences of low order or for restricted classes of sequences [62,66,53,54,55]. A proof of decidability or undecidability of the Positivity problem for arbitrary sequences, however, has withstood all known number-theoretic techniques. In [54], it is shown that decidability of the Positivity problem (already for linear recurrence sequences of order 6) would entail a major breakthrough in the field of Diophantine approximation of transcendental numbers, an area of analytic number theory.
We call a problem to which the Positivity problem is reducible Positivity-hard. (We do not distinguish between the Positivity problem and its complement in the sequel. So, we also refer to the problem whether there is an $n$ such that $u_n < 0$ as the Positivity problem.) From a complexity-theoretic point of view, the Positivity problem is known to be at least as hard as the decision problem for the universal fragment of the theory of the reals with addition, multiplication, and order [55], a problem known to be coNP-hard and to lie in PSPACE [22]. As most of the problems we will address are PSPACE-hard, the reductions in this paper do not provide new lower bounds on the computational complexity. The hardness results in this paper hence refer to the far-reaching consequences on major open problems that a decidability result would imply. Furthermore, of course, the undecidability of the Positivity problem would entail the undecidability of any Positivity-hard problem.
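To make the problem concrete, the following Python sketch (ours, not from the paper; the function name and the search bound are illustrative) performs the obvious semi-decision search for a negative member of a linear recurrence sequence, using exact rational arithmetic to avoid rounding issues. If a negative member exists, the search eventually finds it; the open question is how one could ever certify the answer "no".

```python
from fractions import Fraction

def first_negative_index(coeffs, init, bound):
    """Search the first `bound` members of the linear recurrence sequence
    u_{n+k} = a_1 * u_{n+k-1} + ... + a_k * u_n for a negative member.
    Returns the first index n with u_n < 0, or None if none is found
    within the bound (which, since Positivity is open, certifies nothing)."""
    a = [Fraction(c) for c in coeffs]          # a_1, ..., a_k
    window = [Fraction(v) for v in init]       # u_0, ..., u_{k-1}
    for n, u in enumerate(window):
        if u < 0:
            return n
    for n in range(len(window), bound):
        nxt = sum(ai * ui for ai, ui in zip(a, reversed(window)))
        if nxt < 0:
            return n
        window = window[1:] + [nxt]
    return None

# u_0 = u_1 = 1, u_{n+2} = u_{n+1} - (1/3) u_n: the first negative member is u_6.
print(first_negative_index([1, Fraction(-1, 3)], [1, 1], 100))
```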

Problems under investigation and related work on these problems
In the sequel, we briefly describe the problems studied in this paper and discuss related work on them. In general, the decidability status of all of these problems is open, and we will prove them to be Positivity-hard.
Energy objectives, one-counter MDPs, and quantiles. If weights model a resource like energy that can be consumed and gained during a system execution, a natural problem is to determine the worst- or best-case probability that the system never runs out of the resource. This is known as the energy objective. There has been work on combinations of the energy objective with further objectives such as parity objectives [23,46] and expected mean payoffs [20]. Previous work on this objective focused on the possibility to satisfy the objective (or the combination of objectives) almost surely. The quantitative problem whether it is possible to satisfy an energy objective with probability greater than some threshold $\vartheta$ is open.
The complement of the energy objective can be found in the context of one-counter MDPs (see [19,18,21]): equipping an MDP with a counter that can be increased and decreased provides a way to model a simple form of recursion, and such one-counter MDPs can be seen as a special case of pushdown MDPs.
The process is said to terminate as soon as the counter value drops below 0, and the standard task is to compute maximal or minimal termination probabilities. In one-counter MDPs that terminate almost surely, one can furthermore ask for the extremal expected termination times, i.e., the expected number of steps until termination. On the positive side, for one-counter MDPs, it is decidable in polynomial time whether there is a scheduler that ensures termination with probability 1 [19]. Furthermore, selective termination, which requires termination to occur inside a specified set of states, can be decided in exponential time [19]. On the other hand, the computation of the optimal value and the quantitative decision problem whether the optimal value exceeds a threshold $\vartheta$ are left open in the literature. For selective termination, even the question whether the supremum of termination probabilities over all schedulers is 1 is open. Furthermore, the problem to compute the minimal or maximal expected termination time of a one-counter MDP that terminates almost surely under any scheduler is also open. There are, however, approximation algorithms for the optimal termination probability [18] and for the expected termination time of almost surely terminating one-counter MDPs [21].
One-counter MDPs can be seen as a special case of recursive MDPs [32]. For general recursive MDPs, the qualitative decision problem whether the maximal termination probability is 1 is undecidable, while for restricted forms, so-called 1-exit recursive MDPs, the qualitative and also the quantitative problem are decidable in polynomial space [32]. One-counter MDPs can be seen as a special case of 1-box recursive MDPs in the terminology of [32], a restriction orthogonal to 1-exit recursive MDPs.
The termination probability of one-counter MDPs and the satisfaction probability of the energy objective are closely related to the computation of quantiles (see [64,7,61]). Given a probability value $p$, the task here is to compute the best bound $b$ such that the maximal or minimal probability that the accumulated weight exceeds the bound is at most or at least $p$.
The decision version, i.e., whether the maximal or minimal probability that the accumulated weight before reaching a target state exceeds $b$ is at least or at most $p$, is also known as the cost problem (see [37,38,5]). The computation of quantiles and the cost problem have been addressed for MDPs with non-negative weights and are solvable in exponential time in this setting [64,37]. The decision version of the cost problem with non-negative weights is furthermore PSPACE-hard for a single inequality on the accumulated weight and EXPTIME-complete if a Boolean combination of inequality constraints on the accumulated weight is considered [37]. For the setting with arbitrary weights, [5] provides solutions to the qualitative question whether a constraint on the accumulated weight is satisfied with probability 1 (or > 0). Further, it is known that the quantitative problem is undecidable if multiple objectives with multiple weight functions have to be satisfied simultaneously [61].
Non-classical stochastic shortest path problems (SSPPs). The classical SSPP described above requires that a goal state is reached almost surely. In many situations, however, there might be no schedulers reaching the target with probability 1, or schedulers that miss the target with positive probability are of interest, too. Two non-classical variants that drop this requirement are the conditional SSPP (see [11,59]) and the partial SSPP (see [25,26]). In the conditional SSPP, the goal is to optimize the conditional expected accumulated weight before reaching the target under the condition that the target is reached. In other words, the average weight of all paths reaching the target has to be optimized. In the partial SSPP, paths not reaching the target are not ignored, but assigned weight 0. Possible applications for these non-classical SSPPs include the analysis of probabilistic programs where no guarantees on almost-sure termination can be given (see, e.g., [36,42,13,24,50]), the analysis of fault-tolerant systems where error scenarios might occur with small, but positive probability, or the trade-off analysis with conjunctions of utility and cost constraints that are achievable with positive probability, but not almost surely (see, e.g., [8]). In [25] and [11], partial and conditional expectations, respectively, have been addressed in the setting of non-negative weights. In both cases, the optimal value can be computed in exponential time [25,11] while the threshold problem is PSPACE-hard [59,11]. In MDPs with positive and negative weights, it is known that the optimal values might be irrational and that optimal schedulers might require infinite memory [59].
Conditional expectations also play an important role for some risk measures. The conditional value-at-risk (CVaR) is an established risk measure (see, e.g., [65,1]) defined as the conditional expected outcome under the condition that the outcome belongs to the $p$ worst outcomes, for a given probability value $p$. In the context of optimization problems on weighted MDPs, the CVaR has been studied for mean payoffs and for weighted reachability where only one terminal weight is collected per run (see [43]), and for the accumulated weight before reaching a target state in MDPs with non-negative weights (see [3]). The CVaR for accumulated weights can be optimized in MDPs with non-negative weights in exponential time [58,48].
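For orientation, one standard formalization of these notions (our reconstruction; the paper's exact notation may differ, and we use the convention that low outcomes are bad) reads as follows, where $X$ is the random outcome and $v = \mathrm{VaR}_p(X)$:
\[
\mathrm{VaR}_p(X) \;=\; \sup\{r \in \mathbb{R} \mid \Pr(X < r) \le p\},
\qquad
\mathrm{CVaR}_p(X) \;=\; \frac{1}{p}\Big(\mathbb{E}\big[X \cdot \mathbb{1}_{X < v}\big] \;+\; v \cdot \big(p - \Pr(X < v)\big)\Big).
\]
The correction term $v \cdot (p - \Pr(X < v))$ accounts for the case that the outcome $v$ itself has positive probability, as discussed later in the section on the conditional value-at-risk.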

Contribution
We develop a technique to provide reductions from the Positivity problem to threshold problems on MDPs, asking whether the optimal value of a quantity strictly exceeds a given rational threshold. The resulting reductions are based on the construction of MDP-gadgets that allow us to encode the linear recurrence relation of a linear recurrence sequence and the initial values, respectively. The approach turns out to be quite flexible. By adjusting the gadgets encoding initial values, we can provide reductions of the same overall structure for several of the optimization problems we discussed. Through further chains of reductions depicted in Figure 1, we establish Positivity-hardness for the full series of optimization problems under investigation. The main result of this paper consequently is the following:

Main result. The threshold problems for all of the optimization problems under investigation are Positivity-hard.

Related work on Skolem- and Positivity-hardness in verification
In [4], the Positivity-hardness of decision problems for Markov chains has been established.
The problems studied in [4] are (1) to decide whether, for given states $s$, $t$ and a rational number $p$, there is a positive integer $n$ such that the probability to reach $t$ from $s$ in $n$ steps is at least $p$, and (2) the model-checking problem for a probabilistic variant of monadic logic and a variant of LTL that treats Markov chains as linear transformers of probability distributions. A connection between similar problems and the Skolem problem and Positivity problem has also been conjectured in [14,2]. These decision problems are of a quite different nature than the problems studied here. In particular, the problems are shown to be Positivity-hard already in Markov chains.
In contrast, e.g., partial and conditional expectations in Markov chains can be computed in polynomial time [59] and the threshold problem for the termination probability of recursive Markov chains, which subsume one-counter Markov chains, can be solved in polynomial space [31]. So, the Positivity-hardness of the corresponding problems on MDPs is not inherited from Positivity-hardness on Markov chains. Instead, our reductions show how the non-determinism in MDPs allows encoding linear recurrence sequences in terms of optimal values of various quantitative objectives by forcing an optimal scheduler to take certain decisions. Consequently, the reductions are of a different nature than the reductions in [4]. There, the behavior of a Markov chain in $n$ steps can directly be expressed by $M^n$, where $M$ is the transition probability matrix; this resembles the matrix formulation of the Positivity problem, which asks, for a matrix $M$ and an initial vector $v$, whether there is an $n$ such that $M^n v$ lies within a half-space $H$.
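To illustrate this matrix formulation (our worked example, not from the paper), a linear recurrence sequence of order 2 with $u_{n+2} = \alpha_1 u_{n+1} + \alpha_2 u_n$ can be written in terms of powers of its companion matrix:
\[
\begin{pmatrix} u_{n+2} \\ u_{n+1} \end{pmatrix}
= \begin{pmatrix} \alpha_1 & \alpha_2 \\ 1 & 0 \end{pmatrix}
  \begin{pmatrix} u_{n+1} \\ u_n \end{pmatrix},
\qquad
u_n = \begin{pmatrix} 0 & 1 \end{pmatrix}
      \begin{pmatrix} \alpha_1 & \alpha_2 \\ 1 & 0 \end{pmatrix}^{\!n}
      \begin{pmatrix} u_1 \\ u_0 \end{pmatrix},
\]
so that asking whether some $u_n$ is negative is exactly asking whether some $M^n v$ leaves the half-space $\{x \mid (0\;\,1)\,x \ge 0\}$.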
In this context, also the results of [27] and [45] should be mentioned: there, the complexity of deciding termination and of related questions about the expected termination time for purely probabilistic programs formulated in the probabilistic fragment of the probabilistic guarded command language (pGCL) [47] is pinpointed to levels of the arithmetical hierarchy (for details on the arithmetical hierarchy, see, e.g., [49]). The results reach up to $\Pi^0_3$-completeness for deciding universal almost-sure termination with finite expected termination time ($\Pi^0_1$-complete problems are already undecidable while still co-recursively enumerable). Undecidability is not surprising, as the programs subsume ordinary programs. But the universal halting problem for ordinary programs is only $\Pi^0_2$-complete, showing that deciding universal termination with finite expected termination time of probabilistic programs is strictly harder. Similarly, deciding termination from a given initial configuration is $\Sigma^0_1$-complete for ordinary programs (halting problem), while deciding almost-sure termination with finite expected termination time for probabilistic programs from a given initial configuration is $\Sigma^0_2$-complete. Operational semantics of pGCL-programs can be given as infinite-state MDPs [36]. Applied to the purely probabilistic fragment, this leads to infinite-state Markov chains.

Outline
In the following Section 2, we provide necessary definitions and present our notation. In Section 3, we outline the general structure of the gadget-based reductions from the Positivity problem and construct an MDP-gadget in which a linear recurrence relation can be encoded in terms of the optimal values for a variety of optimization problems (Section 3.2). Afterwards, we construct gadgets encoding also the initial values of a linear recurrence sequence and provide the reductions from the Positivity problem and all subsequent reductions as depicted in Figure 1 (Section 4). We conclude with final remarks and an outlook on future work (Section 5).

Preliminaries
We assume some familiarity with Markov decision processes and briefly introduce our notation in the sequel. More details can be found in textbooks such as [60].

Markov decision process. A Markov decision process is a tuple $\mathcal{M} = (S, Act, P, s_{init})$,
where $S$ is a finite set of states, $Act$ is a finite set of actions, $P : S \times Act \times S \to [0,1] \cap \mathbb{Q}$ is the transition probability function for which we require that $\sum_{s' \in S} P(s, \alpha, s') \in \{0, 1\}$ for all $(s, \alpha) \in S \times Act$, and $s_{init} \in S$ is the initial state. Depending on the context, we enrich MDPs with a weight function $wgt : S \times Act \to \mathbb{Z}$, a finite set of atomic propositions AP and a labeling function $L : S \to 2^{AP}$, or a designated set of goal states Goal. The size of an MDP $\mathcal{M}$, denoted by size($\mathcal{M}$), is the sum of the number of states plus the total sum of the lengths of the encodings of the non-zero probability values $P(s, \alpha, s')$ as fractions of co-prime integers in binary and, if present, the lengths of the encodings of the weight values $wgt(s, \alpha)$ in binary.
We write $Act(s)$ for the set of actions that are enabled in a state $s$, i.e., $\alpha \in Act(s)$ if and only if $\sum_{s' \in S} P(s, \alpha, s') = 1$. Whenever the process is in a state $s$, a non-deterministic choice between the enabled actions in $Act(s)$ has to be made. We call a state absorbing if the only enabled actions lead to the state itself with probability 1 and weight 0. If there are no enabled actions, we call a state terminal or a trap state. The paths of $\mathcal{M}$ are finite or infinite sequences $s_0\,\alpha_0\,s_1\,\alpha_1\,s_2\,\alpha_2\ldots$ where states and actions alternate such that $P(s_i, \alpha_i, s_{i+1}) > 0$ for all $i \ge 0$. Throughout this section, we assume that all states are reachable from the initial state in any MDP, i.e., that there is a finite path from $s_{init}$ to each state $s$. We extend the weight function to finite paths: for a finite path $\pi = s_0\,\alpha_0\,s_1\,\alpha_1 \ldots \alpha_{n-1}\,s_n$, we denote its accumulated weight by $wgt(\pi) = wgt(s_0, \alpha_0) + \ldots + wgt(s_{n-1}, \alpha_{n-1})$.
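As a concrete illustration of these definitions (our sketch, not part of the paper; all names are ours), a finite MDP with rational transition probabilities and integer weights can be represented as follows:

```python
from dataclasses import dataclass, field
from fractions import Fraction

@dataclass
class MDP:
    """Minimal finite MDP matching the definitions above.
    P[(s, a)] maps successor states to rational probabilities;
    an action a is enabled in s iff these probabilities sum to 1."""
    states: set
    P: dict                                   # (state, action) -> {state: Fraction}
    wgt: dict = field(default_factory=dict)   # (state, action) -> int
    s_init: object = None

    def enabled(self, s):
        """Act(s): actions whose outgoing probabilities from s sum to 1."""
        return [a for (q, a), dist in self.P.items()
                if q == s and sum(dist.values()) == 1]

    def path_weight(self, path):
        """Accumulated weight wgt(pi) of a finite path s0 a0 s1 a1 ... sn,
        given as an alternating list of states and actions."""
        return sum(self.wgt.get((path[i], path[i + 1]), 0)
                   for i in range(0, len(path) - 1, 2))
```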
A one-counter MDP is an MDP equipped with a counter. Each state-action pair increases or decreases the counter or leaves the counter unchanged. A one-counter MDP is said to terminate if the counter value drops below zero. We view one-counter MDPs as MDPs with a weight function $wgt : S \times Act \to \{-1, 0, +1\}$. In this formulation, a one-counter MDP terminates when a prefix $\pi$ of a path satisfies $wgt(\pi) < 0$.
A Markov chain is an MDP in which the set of actions is a singleton. There are no non-deterministic choices in a Markov chain and hence we drop the set of actions. Consequently, a Markov chain is a tuple $\mathcal{M} = (S, P, s_{init})$, possibly extended with a weight function, a labeling, or a designated set of goal states. The transition probability function $P$ is then a function from $S \times S$ to $[0,1] \cap \mathbb{Q}$ such that $\sum_{s' \in S} P(s, s') \in \{0, 1\}$ for all $s \in S$.

Scheduler.
A scheduler for an MDP $\mathcal{M} = (S, Act, P, s_{init})$ is a function $\sigma$ that assigns to each finite path $\pi$ not ending in a trap state a probability distribution over $Act(last(\pi))$, where $last(\pi)$ denotes the last state of $\pi$. This probability distribution indicates which of the enabled actions is chosen with which probability under $\sigma$ after the process has followed the finite path $\pi$.
We allow schedulers to be randomized and history-dependent. By restricting the possibility to randomize over actions or by restricting the amount of information from the history of a run that can affect the choice of a scheduler, we obtain the following types of schedulers: A scheduler $\sigma$ is called deterministic if it does not make use of the possibility to randomize over actions, i.e., if $\sigma(\pi)$ is a Dirac distribution for each path $\pi$. Such a scheduler $\sigma$ can be viewed as a function that assigns an action to each finite path $\pi$. A scheduler $\sigma$ is called memoryless if $\sigma(\pi) = \sigma(\pi')$ for all finite paths $\pi, \pi'$ with $last(\pi) = last(\pi')$. In this case, $\sigma$ can be viewed as a function that assigns to each state $s$ a distribution over $Act(s)$. A memoryless deterministic scheduler hence can be seen as a function from states to actions. In an MDP with a weight function, a scheduler $\sigma$ is said to be weight-based if $\sigma(\pi) = \sigma(\pi')$ for all finite paths $\pi, \pi'$ with $wgt(\pi) = wgt(\pi')$ and $last(\pi) = last(\pi')$. Such a scheduler assigns distributions over actions to state-weight pairs from $S \times \mathbb{Z}$.
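Continuing the hypothetical MDP sketch above (again ours; the constructor names are illustrative), these scheduler classes correspond to progressively smaller inputs:

```python
from fractions import Fraction

# A scheduler maps a finite path (alternating states and actions, ending in
# a state) to a distribution {action: probability} over enabled actions.

def memoryless(choice):           # choice: state -> {action: Fraction}
    return lambda path: choice[path[-1]]

def weight_based(choice, mdp):    # choice: (state, weight) -> {action: Fraction}
    return lambda path: choice[(path[-1], mdp.path_weight(path))]

def deterministic(f):             # f: path -> action; yields a Dirac distribution
    return lambda path: {f(path): Fraction(1)}
```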
Probability measure. Given an MDP $\mathcal{M} = (S, Act, P, s_{init})$, a state $s$, and a scheduler $\sigma$, we obtain a probability measure $\Pr^{\sigma}_{\mathcal{M},s}$ on the set of maximal paths of $\mathcal{M}$ that start in $s$: For each finite path $\pi = s_0\,\alpha_0\,s_1\,\alpha_1 \ldots \alpha_{n-1}\,s_n$ with $s_0 = s$, we denote the cylinder set of all its maximal extensions by $Cyl(\pi)$. The probability mass of this cylinder set is then given by
\[
\Pr^{\sigma}_{\mathcal{M},s}(Cyl(\pi)) \;=\; \prod_{i=0}^{n-1} \sigma(s_0\,\alpha_0 \ldots s_i)(\alpha_i) \cdot P(s_i, \alpha_i, s_{i+1}).
\]
Recall that $\sigma(s_0\,\alpha_0 \ldots s_i)$ is a probability distribution over actions and that $\sigma(s_0\,\alpha_0 \ldots s_i)(\alpha_i)$ denotes the probability that the scheduler $\sigma$ chooses action $\alpha_i$ after the prefix $s_0\,\alpha_0 \ldots s_i$ of $\pi$. The set of cylinder sets forms the basis of the standard tree topology on the set of maximal paths. By Carathéodory's extension theorem, we can extend the pre-measure $\Pr^{\sigma}_{\mathcal{M},s}(Cyl(\pi))$ defined on the cylinder sets to a probability measure on the Borel $\sigma$-algebra of the space of maximal paths with the standard tree topology. We sometimes drop the subscript $s$ if $s$ is the initial state $s_{init}$ of $\mathcal{M}$. In a Markov chain $\mathcal{N}$, we drop the reference to a scheduler and write $\Pr_{\mathcal{N},s}$.
Let $X$ be a random variable on the set of maximal paths of $\mathcal{M}$ starting in $s$, i.e., $X$ is a function assigning values from $\mathbb{R} \cup \{-\infty, +\infty\}$ to maximal paths. We denote the expected value of $X$ under the probability measure $\Pr^{\sigma}_{\mathcal{M},s}$ by $E^{\sigma}_{\mathcal{M},s}(X)$.
The values we are typically interested in are the worst- or best-case probabilities of an event or the worst- or best-case expected values of a random variable. Worst or best case refers to the possible ways to resolve the non-deterministic choices. Hence, these values are formally expressed by taking the supremum or infimum over all schedulers. Given an MDP $\mathcal{M}$, a state $s$, an event $E$ (i.e., a measurable set of maximal paths), or a random variable $X$ on the maximal paths of $\mathcal{M}$, we define
\[
\Pr^{\min}_{\mathcal{M},s}(E) = \inf_{\sigma} \Pr^{\sigma}_{\mathcal{M},s}(E), \quad
\Pr^{\max}_{\mathcal{M},s}(E) = \sup_{\sigma} \Pr^{\sigma}_{\mathcal{M},s}(E), \quad
E^{\min}_{\mathcal{M},s}(X) = \inf_{\sigma} E^{\sigma}_{\mathcal{M},s}(X), \quad
E^{\max}_{\mathcal{M},s}(X) = \sup_{\sigma} E^{\sigma}_{\mathcal{M},s}(X),
\]
where inf and sup range over all schedulers $\sigma$ for $\mathcal{M}$.
We use LTL-like notation such as "♢(accumulated weight < 0)" to denote the event that a prefix of a path has a negative accumulated weight. Note that this event expresses the termination of a one-counter MDP in our view of one-counter MDPs as MDPs with a weight function taking only values in {−1, 0, +1}.

Classical stochastic shortest path problem.
Let $\mathcal{M}$ be an MDP with a weight function $wgt : S \times Act \to \mathbb{Z}$ and a designated set of terminal goal states Goal. We define the random variable $\boxplus Goal$ on maximal paths $\pi$ of $\mathcal{M}$ as the accumulated weight $wgt(\pi')$ of the shortest prefix $\pi'$ of $\pi$ ending in Goal; on paths not reaching Goal, the value is undefined. The expected accumulated weight before reaching Goal under a scheduler $\sigma$ is given by the expected value $E^{\sigma}_{\mathcal{M},s_{init}}(\boxplus Goal)$. It is evident that this expected value is only defined if $\Pr^{\sigma}_{\mathcal{M},s_{init}}(\lozenge Goal) = 1$. The classical stochastic shortest path problem asks for the optimal value $\sup_{\sigma} E^{\sigma}_{\mathcal{M},s_{init}}(\boxplus Goal)$, where the supremum ranges over all schedulers $\sigma$ with $\Pr^{\sigma}_{\mathcal{M},s_{init}}(\lozenge Goal) = 1$. The classical stochastic shortest path problem can be solved in polynomial time [16,30,5].
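The polynomial-time solution cited above is based on linear programming; as a simpler illustration (our sketch, building on the hypothetical MDP class from above, and assuming every scheduler reaches Goal almost surely so that the iteration converges), value iteration approximates the same optimum:

```python
from fractions import Fraction

def sspp_value_iteration(mdp, goal, rounds=1000):
    """Value-iteration sketch for the classical SSPP (maximization).
    Assumes every scheduler reaches `goal` almost surely and that every
    non-goal state has at least one enabled action. This is only an
    illustration, not the polynomial-time LP method cited in the text."""
    V = {s: Fraction(0) for s in mdp.states}
    for _ in range(rounds):
        for s in mdp.states:
            if s in goal:
                continue                      # goal states are terminal
            V[s] = max(
                Fraction(mdp.wgt.get((s, a), 0))
                + sum(p * V[t] for t, p in mdp.P[(s, a)].items())
                for a in mdp.enabled(s)
            )
    return V
```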

Outline of the Positivity-hardness proofs
The Positivity-hardness results in this paper are obtained by the sequences of reductions depicted in Figure 1. The key steps for these sequences are the three direct reductions from the Positivity problem to the threshold problems for the maximal termination probability of one-counter MDPs, the maximal partial expectation, and the maximal conditional value-at-risk, respectively.

Structure of the MDP constructed for the direct reductions from the Positivity problem
The three direct reductions from the Positivity problem (at the top of Figure 1) follow a modular approach: The MDPs constructed for the reductions are obtained by putting together three gadgets as sketched in Figure 2. One gadget encodes a linear recurrence relation, exploiting how the optimal values from different starting states and after different amounts of accumulated weight depend on each other. A second gadget encodes the initial values of a linear recurrence sequence. Together, these two gadgets allow us to encode linear recurrence sequences. Finally, an initial gadget is added in which each positive amount of weight $n$ is accumulated with positive probability. Afterwards, the gadget is left and a scheduler has to decide how to leave the initial gadget. The optimal decision if weight $n$ has been accumulated directly corresponds to whether the $n$th member of the given linear recurrence sequence is non-negative.
More precisely, let a rational linear recurrence sequence be given in terms of the initial values $\beta_0, \ldots, \beta_{k-1}$ and the coefficients $\alpha_1, \ldots, \alpha_k$ of the linear recurrence relation. The three gadgets are connected via two states $s$ and $t$ as depicted in Figure 2. In states $s$ and $t$, actions $\beta_0, \ldots, \beta_{k-1}$ and $\gamma_0, \ldots, \gamma_{k-1}$, respectively, leading to the gadget encoding the initial values, and actions $\tau_s$ and $\tau_t$, respectively, leading to the gadget encoding the linear recurrence relation, are enabled. The gadgets will be constructed such that an optimal scheduler has to choose action $\beta_j$ or $\gamma_j$ if the accumulated weight in state $s$ or $t$ is a value $j$ with $0 \le j < k$, and such that it has to choose $\tau_s$ or $\tau_t$ if the accumulated weight is at least $k$. After $\tau_s$ or $\tau_t$ is chosen, the accumulated weight is decreased within the gadget encoding the linear recurrence relation before the MDP moves back to the states $s$ and $t$ with positive probability. For each of the three direct reductions from the Positivity problem, we construct one such gadget tailored to the three respective quantities.
For accumulated weights $w$ of at least $k$, the gadget encoding the recurrence will exploit the dependency of the optimal values $f(s, w)$ and $f(t, w)$ on the optimal values when starting with lower accumulated weight. This gadget can be used in all reductions and will be described in the next subsection.
Put together, these two gadgets ensure that $f(n) = u_n$ for all $n \ge 0$, where $f(n)$ denotes the difference $f(s, n) - f(t, n)$ of the optimal values in $s$ and $t$. To complete the reductions, we add an initial gadget $\mathcal{I}$, depicted in Figure 3, in which each amount $n$ of weight is accumulated with positive probability. Afterwards, a scheduler has to choose whether to move to state $s$ or state $t$. It is optimal to move to $t$ if and only if $u_n \ge 0$. Let now $\sigma$ be the scheduler that always moves to $t$ when leaving the initial gadget and afterwards behaves optimally when choosing from $\beta_0, \ldots, \beta_{k-1}$ and $\tau_s$, or from $\gamma_0, \ldots, \gamma_{k-1}$ and $\tau_t$, as described above. This scheduler is optimal if and only if the given linear recurrence sequence is non-negative. The final step to complete the reduction is to compute the value $\vartheta$ that is achieved by $\sigma$ starting from the initial state. In all three reductions, we can compute this rational value via converging matrix series. The optimal value that can be achieved from the initial state now equals $\vartheta$ if and only if the given linear recurrence sequence is non-negative; otherwise, it strictly exceeds $\vartheta$.
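Schematically (our paraphrase of the construction just described), each of the three direct reductions establishes an equivalence of the following shape, where $\vartheta$ is the rational value of the fixed scheduler $\sigma$ and $\mathrm{val}^{\max}$ is the optimal value of the quantity at hand (termination probability, partial expectation, or the CVaR-related expectation):
\[
\mathrm{val}^{\max}_{\mathcal{M}}(s_{init}) > \vartheta
\quad\Longleftrightarrow\quad
\exists n \ge 0.\; u_n < 0 .
\]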

MDP-gadget for linear recurrence relations
In this section, we demonstrate how to construct the gadget ensuring that the difference of optimal values $f(s, w) - f(t, w)$ follows a given linear recurrence relation with respect to different weight levels $w$. In the next section, the initial values of a linear recurrence sequence will be encoded in MDP-gadgets tailored to the different quantities we address.

Optimality equations.
By choosing suitable scaling factors, we can scale down the initial values and coefficients of the recurrence relation for any given input.
To obtain precise bounds that will be used throughout the following sections, let $\lambda := \sum_{1 \le i \le k} |\alpha_i|$.

[Figure: The gadget $G_{\bar\alpha}$ to encode linear recurrence relations. The example here is depicted for a linear recurrence of depth 2 with $\alpha_1 \ge 0$ and $\alpha_2 < 0$. The outgoing actions lead to the gadget encoding initial values as depicted in Figure 2.]
Note that this linear recurrence relation also holds for the optimal values in the classical stochastic shortest path problem, for example. So, the gadget alone is not yet enough for a hardness proof. The missing ingredient is the encoding of the initial values of a linear recurrence sequence. In order to include the encoding of the initial values in our approach, it is necessary that optimal schedulers cannot be chosen to be memoryless. The optimal decisions have to depend on the weight that has been accumulated in the history of a run. If this is the case, we aim to encode the initial values by adding further outgoing actions to the states $s$ and $t$. By fine-tuning the weights and probabilities of these actions, we can achieve that for small weights $w$ some of the new actions are optimal, while for large weights the actions $\tau_s$ and $\tau_t$ of the gadget are optimal.
If we manage to design the other actions such that the differences $f(s, w+j) - f(t, w+j)$ are equal to given starting values $\beta_j$ for a sequence of weights $w, w+1, \ldots, w+k-1$, while the actions $\tau_s$ and $\tau_t$ are optimal for weights of at least $w + k$, we can encode arbitrary linear recurrence sequences. This is the goal of the subsequent section.
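In symbols (our restatement of the intended interplay, writing $f(v)$ for the difference $f(s, v) - f(t, v)$ of optimal values), the two gadgets together are designed to enforce the system
\[
f(w + j) = \beta_j \quad (0 \le j \le k-1),
\qquad
f(w') = \alpha_1 f(w'-1) + \cdots + \alpha_k f(w'-k) \quad (w' \ge w + k),
\]
whose unique solution is $f(w + n) = u_n$ for all $n \ge 0$.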

Reductions from the Positivity problem
To encode initial values of a linear recurrence sequence, we construct further MDP gadgets.
For the termination probability and expected termination time of one-counter MDPs and for partial expectations, we can construct these gadgets directly. For the conditional value-at-risk, we use an intermediate auxiliary random variable. Putting together these gadgets with the gadget $G_{\bar\alpha}$ from the previous section, we obtain the basis for the Positivity-hardness results of the respective threshold problems. The Positivity-hardness of the remaining problems is obtained as a consequence of these results via further reductions. An overview of the chains of reductions used is presented in Figure 1.

One-counter MDPs, energy objectives, cost problems, and quantiles
The first problem we will show to be Positivity-hard is the threshold problem for the optimal termination probability of one-counter MDPs. From this result, Positivity-hardness results for energy objectives, cost problems, and the computation of quantiles follow easily. Afterwards, we adjust the reduction to show Positivity-hardness of the threshold problem for the optimal expected termination time of almost-surely terminating one-counter MDPs.
Termination probability of one-counter MDPs. We formulated the termination of a one-counter MDP in terms of weighted MDPs $\mathcal{M}$. Recall that a one-counter MDP terminates if the counter value drops below zero. If we consider the weight that has been accumulated instead of the counter value, the quantities we are interested in are $\Pr^{opt}_{\mathcal{M}}(\lozenge\, \text{accumulated weight} < 0)$ for $opt = \max$ and $opt = \min$. The main result we prove in this section is the following (Theorem 4.1): The Positivity problem is reducible in polynomial time to the following problems: Given an MDP $\mathcal{M}$ and a rational $\vartheta \in (0,1)$, 1. decide whether $\Pr^{\max}_{\mathcal{M},s_{init}}(\lozenge(\text{accumulated weight} < 0)) > \vartheta$, and 2. decide whether $\Pr^{\min}_{\mathcal{M},s_{init}}(\lozenge(\text{accumulated weight} < 0)) < \vartheta$.
Note that if weights are encoded in unary, we can transform a weighted MDP in polynomial time into a one-counter MDP that can only increase or decrease the counter value by 1 in each step. The MDPs that are constructed from a linear recurrence sequence of depth $k$ in the proof of Theorem 4.1 will contain only weights with an absolute value of at most $k$. So, they can be transformed to one-counter MDPs in time linear in the size of the original input, and we conclude that the following two threshold problems for the optimal termination probability of one-counter MDPs are Positivity-hard: The Positivity problem is reducible in polynomial time to the following problems: Given a one-counter MDP $\mathcal{M}$, viewed as an MDP with weights in $\{-1, 0, +1\}$, and a rational $\vartheta \in (0,1)$, 1. decide whether $\Pr^{\max}_{\mathcal{M},s_{init}}(\lozenge(\text{accumulated weight} < 0)) > \vartheta$, and 2. decide whether $\Pr^{\min}_{\mathcal{M},s_{init}}(\lozenge(\text{accumulated weight} < 0)) < \vartheta$.
Among the direct reductions from the Positivity problem we present, the construction of the gadget encoding the initial values of a linear recurrence sequence is arguably the simplest for these optimal termination probabilities. In the formulation with weighted MDPs, the termination of a one-counter MDP is moreover the complement of the energy objective "□ accumulated weight ≥ 0". We will first prove Positivity-hardness for the threshold problem for maximal termination probabilities and outline the necessary adjustments to show Positivity-hardness also for the threshold problem for minimal termination probabilities afterwards.
We split the proof of Theorem 4.1 into four parts. First, we provide the construction of an MDP from a linear recurrence sequence. Then, we show that the linear recurrence sequence is correctly encoded in this MDP in terms of the maximal termination probabilities. To complete the proof of item 1, we then show how to compute the threshold $\vartheta$ for the threshold problem and how this establishes the correctness of the reduction. Finally, we show how to adapt the construction to prove hardness of the threshold problem for minimal termination probabilities. The values $p(q, w)$ in an MDP with state space $S$ now satisfy the optimality equation (*) from Section 3.2, where $p(q, w)$ takes the role of $f(q, w)$ in (*).

Proof of Theorem 4.1(1): correctness of the encoding.
We have $p(q, w) = 1$ for all states $q$ and all $w < 0$; for all $q \in S$ and $w \ge 0$, the values $p(q, w)$ satisfy the optimality equation (*).
So, to capture the linear recurrence relation, we will be able to make use of the gadget $G_{\bar\alpha}$ from Section 3.2. The missing ingredient is a gadget to encode the initial values of a linear recurrence sequence.
The new gadget $O_{\bar\beta}$ encoding the initial values $\bar\beta$ is depicted in Figure 5 and works as follows: For $0 \le j \le k-1$, the action $\beta_j$ enabled in $s$ leads to state $s_j$ with probability $\frac{k-j}{k+1} + \beta_j$. By the assumption on $\beta_j$, this probability is less than $\frac{k-j+1}{k+1}$. The remaining probability leads to trap. In state $t$, the action $\gamma_j$ leads to $t_j$ with probability $\frac{k-j}{k+1}$ and to trap with the remaining probability. In order to terminate, the accumulated weight has to drop below 0 before reaching trap. As soon as the trap state is reached with non-negative accumulated weight, the process cannot terminate anymore. The optimal decision in order to maximize the termination probability in state $s$ is now easy to determine. Let $\ell$ be the current weight. If $0 \le \ell \le k-1$, choosing action $\tau_s$ leads to termination with probability less than $1/(k+1)$, as trap is reached immediately with probability at least $k/(k+1)$ due to our assumption that $\lambda = \sum_{1 \le i \le k} |\alpha_i| < 1/(k+1)$. Choosing action $\beta_j$ makes it impossible to terminate if $\ell > j$. If $\ell \le j$, then choosing $\beta_j$ lets the process terminate if $s_j$ is reached, which happens with probability $\frac{k-j}{k+1} + \beta_j$. As $\beta_j < 1/(k+1)$ for all $j$, the maximal termination probability is reached when choosing $\beta_\ell$. If $\ell \ge k$, then $\beta_j$ leads to termination with probability 0 for all $j$; hence, action $\tau_s$ is optimal. Analogously, we see that the optimal choice in state $t$ with weight $\ell$ is $\gamma_\ell$ if $\ell \le k-1$ and $\tau_t$ otherwise.
For counter values $j \le k-1$, we have seen that $\beta_j$ and $\gamma_j$, respectively, are the optimal actions. Hence, $f(j) = \beta_j$ in this case, as we have just seen that the optimal termination probability when starting with weight $j \le k-1$ is $\frac{k-j}{k+1} + \beta_j$ in $s$ and $\frac{k-j}{k+1}$ in $t$. For the values at higher weights, we consider the Markov chain $\mathcal{C}$ that is also depicted in Figure 7; for better readability, it is depicted for the case $k = 2$ there. We group the optimal values into vectors $v_n$ for $n \in \mathbb{N}$ containing the optimal termination probabilities when starting in $s$ or $t$ with an accumulated weight from $\{n, \ldots, n+k-1\}$. The vector $v_0$ is the column vector $(p(s, k-1), \ldots, p(s, 0), p(t, k-1), \ldots, p(t, 0))^{\top}$, and these values occur as transition probabilities in $\mathcal{M}$ under the actions $\beta_{k-1}, \ldots, \beta_0$ and $\gamma_{k-1}, \ldots, \gamma_0$.
As the reachability probabilities in $\mathcal{C}$ are rational and computable in polynomial time, we conclude from equation (*) that there is a matrix $A \in \mathbb{Q}^{2k \times 2k}$, computable in polynomial time, such that $v_{n+1} = A v_n$ for all $n \in \mathbb{N}$. So, $v_n = A^n v_0$ for all $n \in \mathbb{N}$.
Hence, we can write the value $\vartheta$ achieved by $\sigma$ as a converging series over the entries of the vectors $v_n$, weighted with the probabilities with which the corresponding weights are accumulated in the initial gadget. We have to subtract $p(s, 0)$, as the state $choice$ cannot be reached with weight 0 but the summand $1 \cdot p(s, 0)$ occurs in the sum. As $p(s, 0) = \frac{k}{k+1} + \beta_0$, this does not cause a problem.
We claim that the matrix series involved converges to a rational matrix. We observe that the maximal row sum in $A$ is at most 1, so the series $\sum_{n \ge 0} (A/2)^n$ converges to $(I_{2k} - A/2)^{-1}$, where $I_{2k}$ is the identity matrix of size $2k \times 2k$. So, the threshold $\vartheta$ is computable in polynomial time.
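As an illustration of this last step (our sketch; the matrix $A$ and vector $v_0$ here carry illustrative toy values, not values from the construction), the geometric matrix series can be evaluated exactly with rational arithmetic:

```python
from sympy import Matrix, Rational, eye

def geometric_matrix_series(A, v0):
    """Evaluate sum_{n>=0} (1/2)^n A^n v0 = (I - A/2)^{-1} v0 exactly.
    Converges whenever the spectral radius of A/2 is below 1, which holds
    here since the row sums of A are at most 1. A and v0 are assumed to be
    given as exact rational sympy matrices."""
    d = A.shape[0]
    return (eye(d) - Rational(1, 2) * A).inv() * v0

# Toy 2x2 instance with illustrative values (not taken from the paper):
A = Matrix([[Rational(1, 3), Rational(1, 6)],
            [Rational(1, 4), Rational(1, 2)]])
v0 = Matrix([Rational(1, 2), Rational(1, 3)])
print(geometric_matrix_series(A, v0))
```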

■
All in all, this finishes the proof of item (1) of Theorem 4.1.

Proof of Theorem 4.1(2).
The construction we provided so far shows that the threshold problem for the maximal termination probability of one-counter MDPs is Positivity-hard. Using exactly the same ideas, we can show that the threshold problem for the minimal termination probability is Positivity-hard as well. Let us describe the necessary changes to the construction, which are also depicted in Figure 8. We rename the state trap to trap′ and add a transition with negative weight to a new absorbing state trap. For all $0 \le j \le k-1$, state trap is now reached directly with probability 1 and weight $-j$ from the states $s_j$ and $t_j$. Furthermore, the probability to reach $s_j$ when choosing $\beta_j$ in $s$ is changed to $\frac{j+1}{k+1} + \beta_j$, and the probability to reach trap′ is adjusted accordingly. The analogous change is performed for $\gamma_j$. Now, it is easy to check that the optimal choice to minimize the termination probability in state $s$ is to choose $\tau_s$ if the accumulated weight is $\ge k$; in this case, the probability of termination is less than $\frac{1}{k+1}$. If the accumulated weight is $0 \le \ell < k$, the optimal choice is $\beta_\ell$. The analogous result holds in state $t$. From then on, the proof is analogous to the proof for the maximal termination probability, with the change that we have to consider the scheduler $\sigma$ that always moves to $s$ from the state $choice$ this time. This scheduler is optimal to minimize the termination probability if and only if the given linear recurrence sequence is non-negative. With these adjustments, we conclude (Corollary 4.5): The Positivity problem is reducible in polynomial time to the following problem: Given an MDP $\mathcal{M}$ and a rational $\vartheta \in (0,1)$, decide whether $\Pr^{\min}_{\mathcal{M},s_{init}}(\lozenge(\text{accumulated weight} < 0)) < \vartheta$.

Energy objectives.
As the energy objective □(accumulated weight ≥ 0) is satisfied if and only if ♢(accumulated weight < 0) does not hold, the Positivity-hardness of the threshold problem for the optimal satisfaction probability of an energy objective follows easily. We conclude: The Positivity problem is reducible in polynomial time to the following problems: Given an MDP $\mathcal{M}$ and a rational $\vartheta \in (0,1)$, decide whether $\Pr^{\max}_{\mathcal{M},s_{init}}(\square(\text{accumulated weight} \ge 0)) > \vartheta$, and the analogous problem for the minimal satisfaction probability.
Cost problems and quantiles.
By the close relation between termination probabilities, the cost problem, and the computation of quantiles described in the introduction, the analogous Positivity-hardness results also hold for the threshold problems for constraints on the total accumulated weight and for the computation of quantiles.

Termination times of one-counter MDPs.
To conclude the section, we show that not only the threshold problems for optimal termination probabilities, but also those for the optimal expected termination times in one-counter MDPs that terminate almost surely, are Positivity-hard. We again work with weighted MDPs. Let $T$ be the random variable that assigns to each path in a weighted MDP $\mathcal{M}$ the length of the shortest prefix $\pi$ such that $wgt(\pi) < 0$. To reflect precisely the behavior of a one-counter MDP, we now work with MDPs in which the weight is reduced or increased by at most 1 in each step. We make a small change to the MDP constructed for the proof of Corollary 4.5 that is depicted in Figure 8. The initial component (which is not depicted) stays unchanged. Among the remaining transitions, all transitions reduce the weight or leave it unchanged. Transitions with weight 0 do not occur directly after each other, except for the loop at the state trap that we adjust in a moment. Hence, we can add auxiliary states such that along each path starting from $s$ or $t$ and not reaching the state trap, the weight is left unchanged and reduced by 1 in an alternating fashion. So, if a path starts in state $s$ or $t$ with accumulated weight $\ell$ and terminates (i.e., reaches accumulated weight $-1$) before reaching the state trap, this takes $2(\ell+1)$ steps. Now, we replace the loop at the state trap by the gadget depicted in Figure 9, and we call the resulting MDP $\mathcal{N}$. When reaching trap, the accumulated weight is increased by 1 before it is reduced in every other step until termination. That means that if a path starting in state $s$ or $t$ with weight $\ell$ does not terminate before reaching trap, the termination time from $s$ or $t$ is $2(\ell+1) + 3$ steps. Now, let $\sigma$ be a scheduler and denote the probability not to terminate before reaching trap under $\sigma$ by $q_\sigma$. For the expected termination time $T$ in $\mathcal{N}$, we now have the equation displayed below. The summands $(1/2)^{\ell}(\ell + 2(\ell+1))$ correspond to the probability to accumulate weight $\ell$ in the initial component, which takes $\ell$ steps, and the $2(\ell+1)$ steps needed to terminate by alternatingly leaving the weight unchanged and reducing it by 1. The three additional steps after trap occur precisely with probability $q_\sigma$.
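Spelled out (our reconstruction of the displayed equation, assuming the initial component accumulates weight $\ell$ with probability $(1/2)^{\ell}$ for $\ell \ge 1$):
\[
\mathbb{E}^{\sigma}_{\mathcal{N}}(T) \;=\; \sum_{\ell \ge 1} \Big(\tfrac{1}{2}\Big)^{\ell} \big(\ell + 2(\ell+1)\big) \;+\; 3\,q_\sigma \;=\; 8 + 3\,q_\sigma .
\]
In particular, $\mathbb{E}^{\sigma}_{\mathcal{N}}(T)$ depends on the scheduler only through $q_\sigma$.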
Not terminating before trap corresponds exactly to not terminating at all in the MDP constructed for Corollary 4.5. The termination probability there is hence $1 - q_\sigma$ for any scheduler $\sigma$.
It is hence possible to terminate with a probability less than $\vartheta$ in that MDP if and only if it is possible to reach an expected termination time of more than $11 - 3\vartheta$ in $\mathcal{N}$. By Corollary 4.5 and the fact that termination is reached almost surely in $\mathcal{N}$ under any scheduler, we hence conclude: Let $\mathcal{M}$ be a one-counter MDP with initial state $s_{init}$ that terminates almost surely under any scheduler, let $\vartheta$ be a rational, and let $T$ be the random variable assigning the termination time to runs. The Positivity problem is polynomial-time reducible to the problem whether $E^{\max}_{\mathcal{M},s_{init}}(T) > \vartheta$. An analogous argument, with similar changes to the MDP used in the proof of Theorem 4.1, can be used to show the analogous result for the problem whether $E^{\min}_{\mathcal{M},s_{init}}(T) < \vartheta$.

Partial and conditional stochastic shortest path problems
Our next goal is to prove that the partial and conditional SSPPs are Positivity-hard. Note that this stands in strong contrast to the classical SSPP, which is solvable in polynomial time [16,30,5]. We start by providing a formal definition of the decision versions of these two problems.
Let $\mathcal{M}$ be an MDP with a designated set of terminal states Goal. We define the random variable $\oplus Goal$ on maximal paths $\pi$ of $\mathcal{M}$ as displayed below: paths reaching Goal are assigned the accumulated weight of the prefix up to the first visit to Goal, and all other paths are assigned weight 0. The objective in the partial SSPP is to maximize the expected value of $\oplus Goal$, which we call the partial expected accumulated weight, or partial expectation for short, i.e., to compute the value $PE^{\max}_{\mathcal{M}}$, where the supremum ranges over all schedulers $\sigma$. The threshold problem asks, given a rational $\vartheta$, whether $PE^{\max}_{\mathcal{M}} > \vartheta$. Note that the minimization of the partial expectation can be reduced to the maximization by multiplying all weights with $-1$.
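In symbols (our reconstruction of the definition, consistent with the prose above):
\[
\oplus Goal(\pi) \;=\;
\begin{cases}
wgt(\pi') & \text{if } \pi \text{ reaches Goal and } \pi' \text{ is the shortest prefix of } \pi \text{ ending in Goal},\\
0 & \text{if } \pi \text{ does not reach Goal},
\end{cases}
\qquad
PE^{\max}_{\mathcal{M}} \;=\; \sup_{\sigma}\, \mathbb{E}^{\sigma}_{\mathcal{M}}(\oplus Goal).
\]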
The conditional expectation under a scheduler $\sigma$ that reaches Goal with positive probability is the value $CE^{\sigma}_{\mathcal{M}} = E^{\sigma}_{\mathcal{M}}(\oplus Goal) / \Pr^{\sigma}_{\mathcal{M}}(\lozenge Goal)$. Again, we are interested in the maximal value $CE^{\max}_{\mathcal{M}}$, where the supremum ranges over all schedulers $\sigma$ with $\Pr^{\sigma}_{\mathcal{M}}(\lozenge Goal) > 0$. Consequently, the threshold problem asks, for a given rational $\vartheta$, whether $CE^{\max}_{\mathcal{M}} > \vartheta$.
Again, multiplying all weights with $-1$ reduces the minimization of the conditional expectation to the maximization. Furthermore, given a further set of states $F$, the problem to maximize $E^{\sigma}_{\mathcal{M}}(\oplus Goal \mid \lozenge F)$ among all schedulers $\sigma$ that reach $F$ with positive probability can be reduced to the conditional SSPP in our formulation, as shown in [11]. (In [11], only MDPs with non-negative weights are considered; the reduction of [11], however, does not require the restriction to non-negative weights.)

Partial SSPP.
In the sequel, we will provide a direct reduction from the Positivity problem to the partial SSPP using our modular approach via MDP-gadgets to prove the following result:

Theorem 4.10. The Positivity problem is polynomial-time reducible to the decision version of the partial SSPP, i.e., the question whether $PE^{\max}_{\mathcal{M}} > \vartheta$ for a given MDP $\mathcal{M}$ and a given rational $\vartheta$.
Again, we split up the proof of the theorem into the construction of the MDP with the proof of the correctness of the encoding of the linear recurrence sequence, and the computation of the threshold $\vartheta$.
Proof of Theorem 4.10: construction of the MDP and correctness of the encoding of a linear recurrence sequence. Let $k$ be a natural number and let $(u_n)_{n \ge 0}$ be the linear recurrence sequence given by rationals $\alpha_i$ for $1 \le i \le k$ and $\beta_j$ for $0 \le j \le k-1$ via $u_0 = \beta_0, \ldots, u_{k-1} = \beta_{k-1}$ and $u_{n+k} = \alpha_1 u_{n+k-1} + \cdots + \alpha_k u_n$ for all $n \ge 0$. By Assumption 3.1, we can assume w.l.o.g. that $\sum_{1 \le i \le k} |\alpha_i| < \frac{1}{4k}$ and that $0 \le \beta_j < \frac{1}{4k^{2k+2}}$ for all $j$. We begin by constructing a gadget $P_{\bar\beta}$ that encodes the initial values $\beta_0, \ldots, \beta_{k-1}$. The gadget is depicted in Figure 10 and contains states $s$, $t$, goal, and fail. For each $0 \le j \le k-1$, it additionally contains states $s_j$ and $t_j$. In state $s_j$, there is one action enabled that leads to goal with probability $\frac{1}{2^{2(k-j)}} + \beta_j$ and to fail otherwise. From state $t_j$, goal is reached with probability $\frac{1}{2^{2(k-j)}}$ and fail otherwise. In state $s$, there is an action $\beta_j$ leading to $s_j$ with weight $k - j$ for each $0 \le j \le k-1$. Likewise, in state $t$ there is an action $\gamma_j$ leading to $t_j$ with weight $k - j$ for each $0 \le j \le k-1$.
We furthermore reuse the initial gadget $\mathcal{I}$ and the gadget $G_{\bar\alpha}$ encoding the linear recurrence relation from the previous section. In the gadget $G_{\bar\alpha}$, we rename the absorbing state trap to the terminal state goal, which is the target state for the partial SSPP. As before, we glue together the three gadgets $\mathcal{I}$, $G_{\bar\alpha}$, and $P_{\bar\beta}$ at the states $s$, $t$, and goal. Let us call the full MDP that we obtain in this way $\mathcal{M}$; it is depicted in Figure 11. We denote its state space by $S$.
The somewhat complicated choice of probability values leads to the following lemma, showing via straightforward computations that the interplay between the gadgets is correct: Starting with accumulated weight $-(k-1) + j$ in state $s$, choosing action $\beta_i$ reaches state $s_i$ with weight $1 + j - i$. For $i > j$, this value is $\le 0$ and hence $\beta_i$ is certainly not optimal. For $i = j$, we obtain a partial expectation of $\frac{1}{2^{2(k-j)}} + \beta_j$.
For $i < j$, state $s_i$ is reached with weight $1 + j - i \ge 2$. So, the partial expectation obtained via $\beta_i$ is at most $(1 + j - i) \cdot (\frac{1}{2^{2(k-i)}} + \beta_i)$, which is smaller. So, indeed, action $\beta_j$ maximizes the partial expectation among the actions $\beta_i$ with $0 \le i \le k-1$ when the accumulated weight in state $s$ is $-(k-1) + j$. The argument for state $t$ is the same with $\beta_i = 0$ for all $i$. It is easy to see that for accumulated weight $-(k-1) + j$ with $0 \le j \le k-1$, the actions $\tau_s$ or $\tau_t$ are not optimal in state $s$ or $t$: If goal is reached immediately, the weight is not positive, and otherwise states $s$ or $t$ are reached again with lower accumulated weight. The values $\beta_j$ are chosen small enough such that also a switch from state $s$ to $t$ while accumulating negative weight does not lead to a higher partial expectation.
For positive accumulated weight $w$, the optimal partial expectation when choosing $\tau_s$ first is at least $\frac{3w}{4}$ by construction and by the fact that a positive value can be achieved from any possible successor state via one of the actions $\beta_i$ and $\gamma_i$ with $0 \le i \le k-1$. Choosing $\beta_i$, on the other hand, results in a partial expectation of at most $(w + k) \cdot (\frac{1}{2^{2(k-i)}} + \beta_i)$, which is smaller. As in the previous section, we group the optimal values into vectors $v_n$ for $n \in \mathbb{N}$ containing the optimal values for the partial expectation when starting in $s$ or $t$ with an accumulated weight from $\{n+1, \ldots, n+k\}$. Further, we define the vector containing the optimal values for weights in $\{-k+1, \ldots, 0\}$, which are the least values of accumulated weight reachable under the scheduler $\sigma$:
\[
v_{-1} = (pe(s, 0), pe(s, -1), \ldots, pe(s, -k+1), pe(t, 0), pe(t, -1), \ldots, pe(t, -k+1))^{\top}.
\]
As we have seen, these values are given as follows: $pe(s, -k+1+j) = \frac{1}{2^{2(k-j)}} + \beta_j$ and $pe(t, -k+1+j) = \frac{1}{2^{2(k-j)}}$ for $0 \le j \le k-1$. So, we have an explicit representation for $v_{-1}$. The value we are interested in is the series over the summands $(1/2)^{\ell}\, pe(t, \ell)$ arising from the initial gadget.
Conditional SSPP. The Positivity-hardness of the threshold problem for the maximal conditional expectation is obtained via a reduction from the partial SSPP stated in the following lemma. Note that a reduction in the other direction is provided in [59], rendering the two problems polynomial-time inter-reducible. Hence, the Positivity problem is polynomial-time reducible to the following problem: Given an MDP $\mathcal{M}$ and a rational $\vartheta$, decide whether $CE^{\max}_{\mathcal{M}} > \vartheta$.
Two-sided partial SSPP. To conclude this section, we prove the Positivity-hardness of a two-sided version of the partial SSPP with two non-negative weight functions. The key idea is that, instead of using arbitrary integer weights, we can simulate the non-monotonic behavior of the accumulated weight along a path in the partial SSPP with arbitrary weights by two non-negative weight functions. In the definition of the random variable $\oplus Goal$, we can replace the convention that paths not reaching Goal are assigned weight 0 by a second weight function. Let $\mathcal{M} = (S, Act, P, s_{init}, wgt_{goal}, wgt_{fail}, goal, fail)$ be an MDP with two designated terminal states goal and fail and two non-negative weight functions $wgt_{goal} : S \times Act \to \mathbb{N}$ and $wgt_{fail} : S \times Act \to \mathbb{N}$.
Assume that $\Pr^{\min}_{\mathcal{M},s_{init}}(\lozenge\{goal, fail\}) = 1$. Define the following random variable $X$ on maximal paths $\pi$: the value $X(\pi)$ is the accumulated $wgt_{goal}$-weight of $\pi$ if $\pi$ reaches goal, and the accumulated $wgt_{fail}$-weight of $\pi$ if $\pi$ reaches fail. The threshold problem for the two-sided partial expectation, i.e., for the optimal expected value of $X$, is Positivity-hard as well, by a small adjustment of the construction above.
The optimal scheduler $\sigma$ for the partial expectation in the adjusted MDP $\mathcal{M}'$ is the same as in the MDP $\mathcal{M}$ above.
Also, the value $\vartheta$ of this scheduler can be computed as in the corresponding lemma for the partial SSPP above.

Conditional value-at-risk for accumulated weights
Lastly, we aim to prove the Positivity-hardness of the threshold problem for the CVaR in this section. To this end, we provide a further direct reduction from the Positivity problem to the threshold problem for the expected value of an auxiliary random variable closely related to the CVaR, using our MDP-gadgets.
Conditional Value-at-Risk.
Outcomes of $X$ which are less than the value-at-risk $v$ are treated differently from outcomes equal to $v$, as it is possible that the outcome $v$ has positive probability and we only want to account for exactly the $p$ worst outcomes. Hence, we take only $p - \Pr^{\sigma}_{\mathcal{M}}(X < v)$ of the outcomes which are exactly $v$ into account as well. To provide worst-case guarantees or to find risk-averse policies, we are interested in optimizing the CVaR. For the reduction, we construct a gadget encoding the initial values that contains states $s_j$ and $t_j$ for $0 \le j \le k-1$; the probability $\lambda$ is $\sum_{1 \le i \le k} |\alpha_i|$. After gluing together this gadget with the gadget $G_{\bar\alpha}$ at the states $s$, $t$, and goal, we prove that the interplay between the gadgets is correct: Let $0 \le j \le k-1$. Starting with accumulated weight $-k + j$ in state $s$, the action $\beta_j$ maximizes the expectation of the auxiliary random variable $\widehat{goal}$ among the actions $\beta_0, \ldots, \beta_{k-1}$. Likewise, $\gamma_j$ is optimal when starting in $t$ with weight $-k + j$. If the accumulated weight is non-negative in state $s$ or $t$, then $\tau_s$ or $\tau_t$ are optimal. The idea is that for positive starting weights, the tail loss of $\beta_i$ and $\gamma_i$ is relatively high, while for weights just below 0, the chance to reach goal with positive weight again outweighs this tail loss.
First, we estimate the expectation of $\widehat{goal}$ when choosing $\beta_i$ or $\gamma_i$ while the accumulated weight is $-k + j$ in $s$. If $i > j$, then $\beta_i$ and $\gamma_i$ lead to goal directly with probability $1 - \lambda$ and weight $\le -1$. So, the expectation is less than $-(1 - \lambda) \le -1 + \frac{1}{5(k+1)}$.
If $i \le j$, then with probability $1 - \lambda$, goal is reached with positive weight; hence $\widehat{goal}$ is 0 on these paths.
With probability $\beta_i$, goal is reached via $s'_i$. In this case, all runs reach goal with negative weight. On the way to $s'_i$, weight 2 is added, but afterwards subtracted again at least once; in expectation, weight 2 is subtracted $\frac{\lambda+1}{\lambda}$ many times. Together with the starting weight of $-k + j$, these paths hence contribute a negative amount to the expectation of $\widehat{goal}$.
The remaining paths reach goal via $s_i$ and all reach goal with negative weight as well.
The probability to reach $s_i$ is $\lambda - \beta_i$. On the way to $s_i$, the initial weight of $-k + j$ is decreased further, and afterwards negative weight is accumulated in expectation a bounded number of times, so that these remaining paths contribute at least $(-3k + j + i - 1) \cdot (\lambda - \beta_i)$ to the expectation. So, all in all, as $\lambda \le \frac{1}{5(k+1)}$ and $\beta_i \le \frac{\lambda}{3}$ for all $i$, we see that the expectation of $\widehat{goal}$ in this situation is at least $-(3k + 2)\lambda \ge -1 + \frac{1}{5(k+1)}$. The optimum among $i \le j$ is obtained for $i = j$, as $\beta_i \le \lambda/3$ for all $i$. Hence, indeed, $\beta_j$ is the optimal action. For $\gamma_j$, the same proof with $\beta_i = 0$ for all $i$ leads to the same result. Now assume that the accumulated weight in $s$ or $t$ is $\ell \ge 0$. Then, all actions lead to goal with a positive weight with probability $1 - \lambda$; in this case, $\widehat{goal}$ is 0. However, a scheduler $\sigma$ which always chooses $\tau_s$ and $\tau_t$ is better than a scheduler choosing $\beta_i$ or $\gamma_i$ for any $i \le k-1$: Under scheduler $\sigma$, starting from $s$ or $t$, a run returns to $\{s, t\}$ with probability $\lambda$ while accumulating weight $\ge -k$, and the process is repeated. After choosing $\beta_i$ or $\gamma_i$, the run moves to $s_i$, $t_i$, or $s'_i$ while accumulating a negative weight; from then on, in each step it will stay in that state with probability greater than $\lambda$ and accumulate negative weight. Hence, the expectation of $\widehat{goal}$ is lower under $\beta_i$ or $\gamma_i$ than under $\sigma$. Therefore, indeed, $\tau_s$ and $\tau_t$ are the best actions for non-negative accumulated weight in states $s$ and $t$.
Let now $e(s, w)$ and $e(t, w)$ denote the optimal expectations of $\widehat{goal}$ when starting in $s$ or $t$ with weight $w$. Further, let $f(w) = e(s, w) - e(t, w)$. From the argument above, we also learn that the difference $f(-k + j)$ is equal to $\beta_j$ for $0 \le j \le k-1$. Put together with the linear recurrence encoded in $G_{\bar\alpha}$, this shows that $f(-k + n) = u_n$ for all $n$, where $(u_n)_{n \in \mathbb{N}}$ is the linear recurrence sequence specified by the $\alpha_i$ and $\beta_j$, $1 \le i \le k$ and $0 \le j \le k-1$.
Finally, we add the same initial component as in the previous section to obtain an MDP M.
The scheduler $\sigma$ that always moves to $t$ when leaving the initial gadget and afterwards follows the optimal actions as described above is optimal if and only if the linear recurrence sequence stays non-negative. The remaining argument goes completely analogously to the proof of Theorem 4.1.
Grouping together the optimal values in vectors $v_n$ with $2k$ entries as done there, we can use the same Markov chain as in that proof to obtain a matrix $A$ such that $v_{n+1} = A v_n$. This allows us to compute the rational value $\vartheta = E^{\sigma}_{\mathcal{M},s_{init}}(\widehat{goal})$ via a matrix series in polynomial time, and $E^{\max}_{\mathcal{M},s_{init}}(\widehat{goal}) > \vartheta$ if and only if the given linear recurrence sequence has a negative member. ■ By the discussion above, this lemma directly implies Theorem 4.17. With adaptations similar to the previous section, it is possible to obtain the analogous result for the minimal expectation of $\widehat{goal}$. This implies that also the threshold problem whether the minimal conditional value-at-risk is less than a threshold $\vartheta$, i.e., whether $\mathrm{CVaR}^{\min}_{p}(\widehat{goal}) < \vartheta$, is Positivity-hard.

Conclusion
The Positivity-hardness results we established all follow the same gadget-based pattern: the quantity in question has to satisfy a linear recurrence-style optimality equation with respect to the accumulated weight. In addition, the optimum must not be achievable with memoryless schedulers; rather, the optimal decisions have to depend on the accumulated weight, to make it possible to encode the initial values of a linear recurrence sequence. This combination of conditions is quite common, as we have seen.
Furthermore, our Positivity-hardness results can be used to establish Positivity-hardness of further decision problems on MDPs which are, at first sight, of a rather different nature: In [58,57], it is shown how our proof of the Positivity-hardness of the two-sided partial SSPP can be modified to prove the Positivity-hardness of two problems concerning the long-run satisfaction of path properties, namely the threshold problem for long-run probabilities and the model-checking problem of frequency-LTL. Both of these problems address the degree to which a property is satisfied by the sequence of suffixes of a run in order to analyze the long-run behavior of systems. The long-run probability of a property $\varphi$ in an MDP $\mathcal{M}$ under a scheduler $\sigma$ is the expected long-run average of the probability that a suffix generated by $\sigma$ in $\mathcal{M}$ satisfies $\varphi$.
Similarly, frequency-LTL extends LTL by an operator that requires a certain percentage of the suffixes of a run to satisfy a property. Long-run probabilities and frequency-LTL in MDPs have been investigated in [6] and [34,35], respectively, where decidable special cases of the mentioned decision problems have been identified. In general, however, the decidability status of these problems is open. The reductions in [58,57] show how the two-sided partial SSPP can be encoded into the long-run probability as well as the long-run frequency of the satisfaction of a simple regular co-safety property, i.e., the negation of a safety property, yielding Positivity-hardness for the threshold problem for long-run probabilities and the model-checking problem of frequency-LTL in MDPs.
It is worth mentioning that, in the special case of Markov chains, several of the problems investigated here are decidable: in Markov chains, partial and conditional expectations can be computed in polynomial time [59]. Furthermore, one-counter Markov chains constitute a special case of recursive Markov chains, for which the threshold problem for the termination probability can be decided in polynomial space [31]. Remarkably, however, the threshold problem for the probability that the accumulated cost satisfies a Boolean combination of inequality constraints in finite-state Markov chains is open [38].
Finally, the Positivity-hardness results leave open the possibility that some or all of the problems we studied are in fact harder than the Positivity problem. In particular, it could be the case that the problems are undecidable and that a proof of the undecidability would yield no implications for the Positivity problem. For this reason, investigating whether some or all of the threshold problems are reducible to the Positivity problem constitutes a very interesting, and challenging, direction for future work. Such an inter-reducibility result would show that studying any of the discussed optimization problems on MDPs could be a worthwhile direction of research to settle the decidability status of the Positivity problem. Some hope for an inter-reducibility result can be drawn from the fact that the optimal values are approximable for several of the problems: for termination probabilities and expected termination times of one-counter MDPs, this was shown in [18,21], and for partial and conditional expectations in [59]. This indicates that there is at least a major difference to undecidable problems in a similar context, such as the emptiness problem for probabilistic finite automata, where the optimal value cannot be approximated [56,28].

Figure 1. Overview of the dependencies between the Positivity-hardness results. The squares refer to the threshold problems for the respective quantities.

Figure 3. The initial gadget I.

THEOREM 4.1, step (1): construction of the MDP. Given a linear recurrence sequence in terms of the rational coefficients α_1, ..., α_k of the linear recurrence relation as well as the rational initial values β_0, ..., β_{k−1} for k ≥ 2, our first goal is to construct an MDP M and a rational θ ∈ (0,1) such that Pr^max_{M,init}(♢(accumulated weight < 0)) > θ if and only if u_n < 0 for some n ≥ 0. By Assumption 3.1, we can assume that the input values are sufficiently small. More precisely, we assume that Σ_{i=1}^k |α_i| < 1/(k+1) and that 0 ≤ β_j < 1/(k+1) for all 0 ≤ j ≤ k−1; this is ensured by the bounds in Assumption 3.1, and because the Positivity problem becomes trivial if one of the values β_j with 0 ≤ j ≤ k−1 is negative. We denote the supremum of possible termination probabilities in terms of the current state s and counter value (accumulated weight) w by v(s,w). More precisely, in an MDP M, for w ≥ 0 we define v(s,w) := Pr^max_{M,s}(♢ accumulated weight < −w).
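To make the target of the reduction concrete: deciding whether u_n < 0 for some n is exactly the complement of the Positivity problem. Naively unrolling the recurrence yields only a semi-decision procedure; the sketch below (with an arbitrary cut-off max_terms of our choosing) can confirm a negative term but can never certify positivity.

```python
from fractions import Fraction

def first_negative_index(alphas, betas, max_terms=10_000):
    """Unroll u_{n+k} = alpha_1*u_{n+k-1} + ... + alpha_k*u_n with exact
    rational arithmetic; return the first n with u_n < 0, or None if no
    negative term appears among the first max_terms terms. Exhausting the
    bound proves nothing: this is only a semi-decision procedure."""
    alphas = [Fraction(a) for a in alphas]
    window = [Fraction(b) for b in betas]   # holds u_n, ..., u_{n+k-1}
    for n in range(max_terms):
        if window[0] < 0:
            return n
        nxt = sum(a * u for a, u in zip(alphas, reversed(window)))
        window = window[1:] + [nxt]
    return None

# Example: u_0 = 2, u_1 = 1 with u_{n+2} = u_{n+1} - u_n gives u_2 = -1.
print(first_negative_index([1, -1], [2, 1]))   # -> 2
```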

Figure 5. The gadget O_β̄ encoding initial values of a linear recurrence sequence in terms of maximal termination probabilities of one-counter MDPs.

Figure 6. The full MDP for the reduction to the threshold problem for termination probabilities of one-counter MDPs. The MDP contains the upper part for all 0 ≤ j ≤ k − 1. The middle part is depicted for k = 2, α_1 ≥ 0, and α_2 < 0.

Figure 8. Necessary changes to the construction for the result for minimal termination probabilities. The initial component of the MDP is omitted here and stays unchanged.

Figure 9. Necessary changes to the construction for the result for maximal expected termination times.

Figure 10. The gadget P_β̄ encoding the initial values in the reduction to the threshold problem for partial expectations.

Figure 12. The gadget T_β̄ encoding initial values in terms of two-sided partial expectations.

In the two-sided partial SSPP, a run ρ is assigned the value X(ρ) = wgt_goal(ρ) if ρ ⊨ ♢goal and X(ρ) = wgt_fail(ρ) if ρ ⊨ ♢fail. Due to the assumption that goal or fail is reached almost surely under any scheduler, the expected value E^σ_{M,init}(X) is well-defined for all schedulers σ for M. We call the value E^max_{M,init}(X) = sup_σ E^σ_{M,init}(X) the optimal two-sided partial expectation. We can show that the corresponding threshold problem is again Positivity-hard.

Figure 13. The gadget encoding initial values for the reduction to the threshold problem for the conditional value-at-risk. The gadget contains the depicted states and actions for each 0 ≤ j ≤ k − 1.

Scaling down coefficients of a linear recurrence sequence. Given the coefficients α_1, ..., α_k and initial values u_0 = β_0, ..., u_{k−1} = β_{k−1} of a linear recurrence sequence, we have to assume for the following constructions that all of these values are sufficiently small. Let us clarify why we can assume this without loss of generality and provide precise bounds. Let (u_n)_{n≥0} be a linear recurrence sequence specified by the initial values u_0 = β_0, ..., u_{k−1} = β_{k−1} and the linear recurrence relation u_{n+k} = α_1 · u_{n+k−1} + ... + α_k · u_n for all n ≥ 0. For any c > 0 and d > 0, the sequence (u′_n)_{n≥0} defined by u′_n = c · u_n · d^n for all n is non-negative if and only if (u_n)_{n≥0} is non-negative. Furthermore, (u′_n)_{n≥0} satisfies the linear recurrence relation with coefficients α′_1 := d · α_1, α′_2 := d² · α_2, ..., α′_k := d^k · α_k, and it has the initial values β′_j := u′_j = c · β_j · d^j for j < k. Let a := max_{1≤i≤k} |α_i|. If a > 1, then we choose d = 1/(a · (5k+5)), and else d = 1/(5k+5). As the numerical value of d is linear in the size of the given original input, d and the coefficients α′_1, ..., α′_k can be computed in polynomial time. Similarly, with m := max_{0≤j<k} |β_j|, we can choose c := min(c′, 1) for a rational c′ computable from m and k in polynomial time; the choice of c guarantees that max_{0≤j<k} β′_j < min(1/(4k²+2), β′_0/4), and the initial values β′_0, ..., β′_{k−1} of the new sequence are computable in polynomial time as well.

ASSUMPTION 3.1. W.l.o.g., the coefficients and initial values of a given linear recurrence sequence are scaled down as just described; in particular, they satisfy the bounds required by the respective constructions, such as Σ_{i=1}^k |α_i| < 1/(k+1) and 0 ≤ β_j < 1/(k+1) for all 0 ≤ j ≤ k−1.
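A minimal sketch of this rescaling, using exact rational arithmetic; the concrete choice of c below is only illustrative, while d follows the case distinction described above.

```python
from fractions import Fraction

def rescale_lrs(alphas, betas):
    """Rescale an LRS (u_n) to (u'_n) with u'_n = c * u_n * d**n.
    The new sequence is non-negative iff the old one is; its coefficients
    and initial values shrink accordingly."""
    k = len(alphas)
    a = max(abs(Fraction(x)) for x in alphas)
    d = Fraction(1, 5 * k + 5) if a <= 1 else Fraction(1, 5 * k + 5) / a
    m = max(abs(Fraction(b)) for b in betas)
    # Illustrative choice of c; the paper's bound on c is stricter.
    c = min(Fraction(1), Fraction(1) / m) if m > 0 else Fraction(1)
    new_alphas = [Fraction(x) * d ** (i + 1) for i, x in enumerate(alphas)]
    new_betas = [c * Fraction(b) * d ** j for j, b in enumerate(betas)]
    return new_alphas, new_betas

# Example: the rescaled coefficients satisfy sum_i |alpha'_i| < 1/(k+1).
print(rescale_lrs([3, -2], [1, 4]))
```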
Let us start with the following observations on the well-known relation between the optimal values at different states in the classical stochastic shortest path problem, i.e., the maximal expected accumulated weights before reaching a goal state (defined in Section 2). Let M = (S, Act, P, s_init, wgt, Goal) be an MDP. The solution to the classical stochastic shortest path problem satisfies the so-called Bellman equation: if E(s) denotes the value when starting in state s, i.e., the maximal expected accumulated weight before reaching Goal from state s, then

E(s) = max_{a∈Act(s)} ( wgt(s,a) + Σ_{t∈S} P(s,a,t) · E(t) ) for s ∉ Goal, and E(s) = 0 for s ∈ Goal.

This simple form of optimality equation implies the existence of optimal memoryless deterministic schedulers for the classical stochastic shortest path problem (in case optimal schedulers exist, i.e., if the optimal values are finite). For problems like the optimization of the termination probability of one-counter MDPs, it is, however, clearly not sufficient to consider the optimal values only in dependency on the starting state: the counter value, i.e., the weight that has been accumulated so far, is essential. So, let v(s,w) denote the maximal termination probability of a one-counter MDP when starting in state s with counter value w. Letting v(s,w) = 1 if w < 0, we obtain the following equation for all states s and all values w ≥ 0:

v(s,w) = max_{a∈Act(s)} Σ_{t∈S} P(s,a,t) · v(t, w + wgt(s,a)).   (*)

Already in this equation, the value v(s,w) hence possibly depends on values of the form v(t, w−i) for some i. We want to exploit this interrelation to encode linear recurrence relations u_{n+k} = α_1 · u_{n+k−1} + ... + α_k · u_n into the optimal values v(s,w). Of course, the transition probabilities P(s,a,t) are all non-negative. So, we cannot directly encode a linear recurrence into the optimal values for different weight levels at one state, as the coefficients might be negative. To overcome this problem, we instead consider the difference v(s,w) − v(t,w) for two different states s and t.

MDP-gadget for linear recurrence relations. Given the coefficients α_1, ..., α_k of a linear recurrence relation satisfying Assumption 3.1, we construct the MDP-gadget depicted in Figure 4. The gadget contains states s, t, and trap as well as s_1, ..., s_k and t_1, ..., t_k. In state s, an action a is enabled which has weight 0 and leads to state s_i with probability α_i if α_i > 0 and to state t_i with probability |α_i| if α_i < 0, for all i. The remaining probability leads to trap. From each state s_i, there is an action leading to s with weight −i. The action b enabled in t as well as the actions leading from the states t_i to t are constructed analogously: if α_i is negative, action b reaches state s_i with probability |α_i|; otherwise, it reaches t_i with probability α_i. The state trap is absorbing. As the gadget depends on the inputs ᾱ = (α_1, ..., α_k), we call it G_ᾱ. This gadget G_ᾱ will be integrated into MDPs without further outgoing edges from the states s_1, ..., s_k, t_1, ..., t_k. For any optimization problem for which the optimal values V depend on the state and the weight accumulated so far and satisfy equation (*), we can encode a linear recurrence in an MDP containing this gadget (and possibly further actions for the states s and t): if we know that an optimal scheduler chooses action a in state s and action b in state t when the accumulated weight is w, then

V(s,w) − V(t,w) = (1 − Σ_{i=1}^k |α_i|) · (V(trap,w) − V(trap,w)) + Σ_{1≤i≤k, α_i≥0} ( α_i · V(s,w−i) − α_i · V(t,w−i) ) + Σ_{1≤i≤k, α_i<0} ( (−α_i) · V(t,w−i) − (−α_i) · V(s,w−i) ) = Σ_{i=1}^k α_i · (V(s,w−i) − V(t,w−i)).
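The displayed computation can be checked mechanically: the following sketch applies one step of equation (*) in G_ᾱ to arbitrary values at the lower weight levels and asserts the difference identity. The function and variable names are ours, not the paper's.

```python
from fractions import Fraction
import random

def gadget_step(alphas, Vs_below, Vt_below, V_trap):
    """One application of the optimality equation (*) in the gadget G_alpha:
    compute V(s,w) and V(t,w) from the values V(s,w-i), V(t,w-i), i = 1..k,
    assuming the actions a and b are the optimal choices in s and t."""
    rest = 1 - sum(abs(a) for a in alphas)   # probability of moving to trap
    Vs = rest * V_trap
    Vt = rest * V_trap
    for a, vs, vt in zip(alphas, Vs_below, Vt_below):
        if a >= 0:
            Vs += a * vs     # s --a--> s_i --(weight -i)--> s
            Vt += a * vt
        else:
            Vs += -a * vt    # negative coefficients swap the roles of s and t
            Vt += -a * vs
    return Vs, Vt

# Check: V(s,w) - V(t,w) = sum_i alpha_i * (V(s,w-i) - V(t,w-i)).
alphas = [Fraction(1, 8), Fraction(-1, 16)]
Vs_below = [Fraction(random.randint(0, 100), 100) for _ in alphas]
Vt_below = [Fraction(random.randint(0, 100), 100) for _ in alphas]
Vs, Vt = gadget_step(alphas, Vs_below, Vt_below, Fraction(1, 2))
assert Vs - Vt == sum(a * (vs - vt)
                      for a, vs, vt in zip(alphas, Vs_below, Vt_below))
```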
In the constructed MDP M, we consider the value θ = Pr^σ_{M,init}(♢(accumulated weight < 0)) as the threshold for the reduction. The state choice is reached with any positive accumulated weight with positive probability. For the optimal choices in the state choice with accumulated weight w, we observe that choosing τ is optimal if and only if D(w) ≥ 0; by Lemma 4.3, this holds if and only if the corresponding term u_n of the linear recurrence sequence satisfies u_n ≥ 0. Consider now the scheduler σ which always chooses τ in state choice and afterwards behaves according to the optimal choices as described in the proof of Lemma 4.3. This scheduler σ is optimal if and only if the sequence (u_n)_{n≥0} is non-negative. To complete the reduction, we compute the value θ := Pr^σ_{M,init}(♢(accumulated weight < 0)). We will see that θ is a rational computable in polynomial time, and we know that Pr^max_{M,init}(♢(accumulated weight < 0)) ≤ θ if and only if the scheduler σ is optimal, which is the case if and only if (u_n)_{n≥0} is non-negative.

LEMMA 4.4. The value θ = Pr^σ_{M,init}(♢(accumulated weight < 0)) is a rational number computable in polynomial time.

PROOF. In order to compute the value θ, we first provide a recursive expression for the termination probabilities v(s,w) and v(t,w). By the definition of σ, these are precisely the termination probabilities under σ when starting from s or t with some positive accumulated weight w ∈ N, because σ behaves optimally as soon as state s or t has been reached.
The transitions in C behave like the actions a and b in M, but the decrease of the accumulated weight is explicitly encoded into the state space.
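As an illustration of encoding weight levels into the state space, the sketch below builds a small finite chain whose states are the weight levels of a single control state and computes termination probabilities by solving the resulting linear equation system; the transition structure is a simplified stand-in for C, not the chain from the construction.

```python
import numpy as np

# Illustrative chain: from weight level w, move to level w-1 with probability
# q (terminating when the weight would drop below 0), and get stuck in an
# absorbing non-terminating state with probability 1-q.
N, q = 5, 0.3
P = np.zeros((N + 2, N + 2))   # 0..N-1: levels, N: terminated, N+1: stuck
for w in range(N):
    P[w, w - 1 if w > 0 else N] = q
    P[w, N + 1] = 1 - q
P[N, N] = P[N + 1, N + 1] = 1.0

# Termination probabilities x solve x = A x + b for the transient part.
A, b = P[:N, :N], P[:N, N]
x = np.linalg.solve(np.eye(N) - A, b)
print(x)   # x[w] = q**(w+1): terminating needs w+1 consecutive decreases
```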
Hence, Pr^max_{M,init}(♢(accumulated weight < 0)) > θ if and only if u_n < 0 for some n ≥ 0. The proof of the Positivity-hardness of the threshold problem for the termination probability of one-counter MDPs in fact also serves as a proof that cost problems and the computation of quantiles of the accumulated weight before reaching a goal state are Positivity-hard. Observe that, in the MDP constructed for Theorem 4.1 and Corollary 4.5, almost all paths π under any scheduler satisfy ♢(accumulated weight < 0) if and only if they satisfy ⊕trap(π) < 0, i.e., if and only if their total accumulated weight is less than 0. Thus, the Positivity problem is also reducible to the following problems: given an MDP M with a designated set of trap states Goal and a rational θ ∈ (0,1), decide whether Pr^max_{M,init}(⊕Goal < 0) > θ, as well as analogous threshold problems for constraints on the accumulated weight and the computation of quantiles.

For the reduction to the threshold problem for partial expectations, for each weight w, denote by e(s,w) and e(t,w) the optimal partial expectation when starting in state s or t with accumulated weight w in M, as if the respective state was reached from the initial state with weight w and probability 1. For each weight w ≥ −k+1, denote by D(w) = e(s,w) − e(t,w) the difference between these optimal partial expectations when starting in s and t with weight w. Comparing the actions a_j and b_j for the starting weight −(k−1)+j, we conclude from the previous lemma that the difference between the optimal values D(−(k−1)+j) is equal to β_j, for 0 ≤ j ≤ k−1. The important fact we use next is that, for partial expectations, the optimal values e(s,w) for states s ∈ S \ {goal} and starting weights w ∈ Z satisfy the optimality equation (*) from Section 3.2 when setting e(goal,w) = w, as already shown in [25]:

e(s,w) = max_{a∈Act(s)} Σ_{s′∈S} P(s,a,s′) · e(s′, w + wgt(s,a)).

Using the fact that G_ᾱ encodes the given linear recurrence relation as soon as a and b are the optimal actions, as shown in Section 3.2, we conclude the following lemma:

LEMMA 4.12. Consider the linear recurrence sequence (u_n)_{n≥0} given above by α_1, ..., α_k and β_0, ..., β_{k−1}, and the MDP M constructed from this sequence. Then D(−(k−1)+n) = u_n for all n ≥ 0.

We now group the optimal values together in the following column vectors: x_w = ( e(s,w+k), e(s,w+k−1), ..., e(s,w+1), e(t,w+k), ..., e(t,w+1) )^⊤.
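Grouping weight levels into a vector turns the recurrence into a single linear map, which is the mechanism behind the matrix used above. A small companion-matrix sketch (with arbitrary example coefficients) of a step of the form x_{n+1} = A · x_n:

```python
import numpy as np

# Companion matrix: x_{n+1} = A x_n where x_n = (u_{n+k-1}, ..., u_n)^T and
# u_{n+k} = alpha_1 u_{n+k-1} + ... + alpha_k u_n.
alphas = [0.5, -0.25]              # alpha_1, alpha_2 (arbitrary example)
k = len(alphas)
A = np.zeros((k, k))
A[0, :] = alphas                   # first row produces the next term
A[1:, :-1] = np.eye(k - 1)         # remaining rows shift the window down

x = np.array([2.0, 1.0])           # x_0 = (u_1, u_0) = (2, 1)
terms = [1.0, 2.0]                 # u_0, u_1
for _ in range(4):
    x = A @ x
    terms.append(x[0])             # x[0] is the newest term of the sequence
print(terms)                       # u_2 = 0.5*2 - 0.25*1 = 0.75, etc.
```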
Let M be an MDP with a designated terminal target state goal, and let ϑ be a rational number. We construct an MDP N such that PE^max_M > ϑ if and only if CE^max_N > ϑ. PROOF. We obtain N by adding a new initial state s′_init, renaming the state goal to goal′, and adding a new state goal to M. In s′_init, one action with weight 0 is enabled, leading to the old initial state s_init and to goal with probability 1/2 each. From goal′, there is one new action leading to goal with probability 1 and a fixed non-negative weight. Any scheduler σ for M can be seen as a scheduler for N and vice versa. Now, we observe that for any scheduler σ, the conditional expectation CE^σ_N can be expressed in terms of the partial expectation PE^σ_M, which yields the claimed equivalence of the threshold problems.
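For intuition on the quantities being related: the partial expectation weighs goal-paths by their probability without renormalizing, while the conditional expectation divides by the probability of reaching goal. A toy computation with hypothetical run weights:

```python
# Three terminal runs: (probability, accumulated weight, reaches goal?)
runs = [(0.25, 4, True), (0.25, -2, True), (0.5, 7, False)]

p_goal = sum(p for p, _, g in runs if g)         # 0.5
partial = sum(p * w for p, w, g in runs if g)    # 0.25*4 + 0.25*(-2) = 0.5
conditional = partial / p_goal                   # 1.0
print(p_goal, partial, conditional)
```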
THEOREM 4.16. The Positivity problem is polynomial-time reducible to the following problem: given an MDP M = (S, Act, Pr, s_init, wgt_goal, wgt_fail, goal, fail) as above and a rational ϑ, decide whether E^max_{M,init}(X) > ϑ.

PROOF. Given the parameters α_1, ..., α_k and β_0, ..., β_{k−1} of a rational linear recurrence sequence, we can construct an MDP M′ = (S, Act, Pr, s_init, wgt, goal, fail) with one weight function wgt : S × Act → Z similar to the MDP M depicted in Figure 11. W.l.o.g., we again assume that Σ_{i=1}^k |α_i| < 1/4 and that 0 ≤ β_j < 1/(4k²+2) for all j; the non-negativity of the values β_j can be assumed, as the Positivity problem is trivial otherwise. The initial gadget and the gadget G_ᾱ are as before. The gadget P_β̄, however, is slightly modified and replaced by the gadget T_β̄ depicted in Figure 12. With the transitions as in the figure, the probability to reach goal or fail and the weight accumulated do not change when choosing action a_j or b_j compared to the gadget P_β̄. The only difference is that the expected time to reach goal or fail changes. The steps alternate between probability 1−p and probability 0 of reaching goal or fail, just as in the gadget G_ᾱ. In this way, it makes no difference for the expected time before reaching goal or fail when a scheduler stops choosing a and b. We can, in fact, compute the expected time T to reach goal or fail from s_init under any scheduler quite easily: reaching s or t takes 3 steps in expectation. Afterwards, the number of steps taken is 1 + 2ℓ with probability p^ℓ · (1 − p), which yields 1 + 2p/(1−p) steps in expectation, so T = 4 + 2p/(1−p).
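The expected-time computation is a standard geometric series; the following quick check of Σ_ℓ (1+2ℓ) · p^ℓ · (1−p) = 1 + 2p/(1−p) uses an arbitrary test value of p:

```python
p = 0.15
series = sum((1 + 2 * l) * p**l * (1 - p) for l in range(10_000))
closed = 1 + 2 * p / (1 - p)
print(series, closed)   # both ~1.3529...
```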
By Lemma 4.13, PE^max_{M′,init} > ϑ if and only if the given linear recurrence sequence is eventually negative. Note that all weights in M′ are ≥ −W for a non-negative integer W. We define two new weight functions to obtain an MDP N from M′: we let wgt_goal(s,a) = wgt(s,a) + W and wgt_fail(s,a) = +W for all (s,a) ∈ S × Act. Both weight functions take only non-negative integer values. Any scheduler σ for M′ can be viewed as a scheduler for N, and vice versa, as the two MDPs only differ in the weight functions. Further, we observe that for each maximal path π ending in goal or fail in M′, and at the same time in N, we have X(π) = ⊕goal(π) + W · length(π). (Recall that ⊕goal(π) equals wgt(π) if π reaches goal and 0 if π reaches fail.) As the expected time before goal or fail is reached is constant, namely T, under any scheduler, it follows that for all schedulers σ we have E^σ_{N,init}(X) = PE^σ_{M′,init} + W · T.
Therefore, E^max_{N,init}(X) > ϑ + W · T if and only if the given linear recurrence sequence eventually becomes negative.
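The passage from wgt to wgt_goal and wgt_fail is a per-step translation by W; a sketch of the bookkeeping on a single hypothetical path:

```python
W = 5                          # W >= -min weight, so shifted weights are >= 0
path_weights = [-3, 2, -5, 1]  # per-step weights of a hypothetical path in M'
reaches_goal = True

oplus_goal = sum(path_weights) if reaches_goal else 0
# In N, every step gains +W, on top of wgt for paths reaching goal:
X = (sum(w + W for w in path_weights) if reaches_goal
     else W * len(path_weights))
assert X == oplus_goal + W * len(path_weights)
print(X)
```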
Given an MDP M = (S, Act, P, s_init, wgt, Goal) with a scheduler σ, a random variable X defined on the runs of the MDP with values in R, and a value p ∈ [0,1], we define the value-at-risk as VaR^σ_p(X) = sup{r ∈ R | Pr^σ_M(X ≤ r) ≤ p}. So, the value-at-risk is the point at which the cumulative distribution function of X reaches or exceeds p. The conditional value-at-risk is now the expectation of X under the condition that the outcome belongs to the p worst outcomes, in this case, the p lowest outcomes. Denote VaR^σ_p(X) by v. Following the treatment of random variables that are, in general, not continuous in [43], we define the conditional value-at-risk as CVaR^σ_p(X) = (1/p) · ( E^σ_M(X · 1_{X<v}) + v · (p − Pr^σ_M(X < v)) ).
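A small sketch computing both quantities for a finite distribution according to this definition; the distribution is an arbitrary example, and in the continuous-like case CVaR_p reduces to the mean of the p worst outcomes.

```python
def var_cvar(dist, p):
    """dist: list of (probability, value) pairs of a finite random variable X.
    Returns (VaR, CVaR) for the lower tail: VaR = sup{r | Pr(X <= r) <= p},
    CVaR = (1/p) * (E[X; X < VaR] + VaR * (p - Pr(X < VaR))), 0 < p < 1."""
    vals = sorted(dist, key=lambda pv: pv[1])
    cum = 0.0
    for prob, value in vals:       # VaR: first value where the CDF exceeds p
        cum += prob
        if cum > p:
            var = value
            break
    below = [(prob, value) for prob, value in vals if value < var]
    pr_below = sum(prob for prob, _ in below)
    partial = sum(prob * value for prob, value in below)
    cvar = (partial + var * (p - pr_below)) / p
    return var, cvar

# Uniform on {1,2,3,4}: the worst half is {1,2}, so CVaR_0.5 = 1.5.
print(var_cvar([(0.25, 1), (0.25, 2), (0.25, 3), (0.25, 4)], 0.5))
```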