Master of Science in Space Engineering

ADAPTIVE ZEM/ZEV FEEDBACK GUIDANCE FOR RENDEZVOUS IN LUNAR NRO WITH COLLISION AVOIDANCE

Advisor: Prof. Mauro Massari
Co-Advisor: Prof. Roberto Furfaro

Andrea Scorsoglio Student ID 852310


Adaptive ZEM/ZEV feedback guidance for rendezvous in lunar NRO with collision avoidance | Master of Science in Space Engineering, Politecnico di Milano. © Copyright July 2018.

Politecnico di Milano:

www.polimi.it

School of Industrial and Information Engineering:


Since the beginning of my studies at Politecnico di Milano I always wanted to do some kind of experience abroad. I have to thank professors Francesco Topputo and Pierluigi di Lizia for giving me the possibility to do so and go abroad to develop this thesis. I also thank them for their important input and advice on the project itself. I thank my advisor, Mauro Massari, for helping me develop this thesis both from a distance and from here in Milan; his advice was always important to overcome the myriad of problems that arose during the project and I will be forever thankful for that. I would also like to thank Roberto Furfaro of the University of Arizona for the amazing work we have done together; the advice he gave me was always important to proceed with the work and to grow as a researcher and engineer. I also thank professor Richard Linares for his important advice.

My "brothers", Marcello and Enrico, far at times but always and forever close, these 5 years have been awesome with you. I also thank my american and less american friends I met in Arizona: Patrick, Vida, Lang, Makayla and of course Shahin and Enrico. They have made this experience unforgettable and helped relieve the stressful moments during my stay. I thank my mates Marcello, Ilaria and Riccardo: sharing this experience with you has been amazing, our movies and trips together helped make this experience rich and will never be forgotten.

And of course I thank my family, who will always be with me, whether I'm at home or 6000 miles away. I know, even if they don't show it, how hard it is for them to let me go; I will always be thankful for the sacrifices they have made for me throughout my life and will never thank them enough.

Milano, July 2018 A. S.


Since the beginning of my studies at Politecnico di Milano, I have always wanted to do some kind of experience abroad. I therefore have to thank professors Francesco Topputo and Pierluigi di Lizia for believing in me and giving me the possibility to go abroad to develop this thesis. I also thank them for their important suggestions on the project itself. I thank my advisor, professor Mauro Massari, for helping me carry out this thesis, both from a distance and here in Milan. His advice has always been fundamental to solve the problems encountered along the way and I am grateful to him for that. I would like to thank professor Roberto Furfaro of the University of Arizona for the work we did together; the suggestions he gave me were always important to proceed with the work and to grow as a researcher and engineer. I also thank professor Richard Linares for his precious advice.

My "brothers", Marcello and Enrico, far at times but always in touch: these 5 years together have been fantastic. I also thank the American and less American friends I met in Arizona: Patrick, Vida, Lang, Makyla and of course Shahin and Enrico; without them this experience would not have been the same. They made it unforgettable and helped me relieve the inevitable stress of certain moments. I also thank my companions in this adventure, Marcello, Ilaria and Riccardo: sharing this experience with you has been fantastic; the movies and trips together enriched my stay in the States and I will never forget them.

And finally, of course, I thank my family, who will always hold a special place, whether I am at home or 6000 miles away. I know, even if they do not show it, how hard it is for my parents to let me go; I will always be grateful for the sacrifices they have made for me and I will never thank them enough.

Milano, July 2018 A. S.


to my family and friends, who make my life better every day.


Since the beginning of space exploration in the late fifties, human and robotic flights have relied heavily on automation. The tasks that a spacecraft can accomplish are varied, ranging from space and Earth weather monitoring and telecommunications to celestial body observation and space exploration in general. In cases where direct human intervention is not an option, autonomous navigation and guidance is of pivotal importance. In recent years there have been many advancements in this field, with commercial companies and space agencies starting to work heavily on autonomous landing mechanisms (SpaceX's Falcon 9) and autonomous docking systems (ESA's ATV and Roscosmos' Progress). Moreover, with the newly announced long-term project for human exploration towards the Moon aimed at the construction of the Lunar Orbital Platform-Gateway, new guidance algorithms for relative maneuvering in this particular environment will be increasingly important in the future. This is why this thesis aims to develop a new feedback guidance algorithm for docking maneuvers in the cislunar environment. In particular, the aim is to create a guidance algorithm that is lightweight and closed-loop, so that it can be implemented directly on board, and capable of taking constraints on the relative position into account.

The problem has been solved starting from the well-known Zero-Effort-Miss/Zero-Effort-Velocity feedback algorithm, using machine learning to improve its capabilities and widen its field of applicability. The algorithm has been developed in the circular restricted three body problem (CRTBP) framework for Near Rectilinear Orbits (NRO), but the results are easily generalizable to many more guidance problems. The results are satisfactory. It has been shown that reinforcement learning can be effectively used to solve spacecraft guidance problems; the proposed algorithm is in fact capable not only of improving the performance of classical ZEM/ZEV when constraints are not present, but also of solving the guidance problem in the presence of a wide variety of peculiar constraint scenarios.


Since the beginning of space exploration in the late fifties, automation has always been important, both for human and robotic missions. The tasks that a spacecraft has to carry out are varied, from weather forecasting and telecommunications to the observation and exploration of celestial bodies. In cases where direct human intervention in the control loop is not a viable option, autonomous navigation and guidance become vitally important. In past years there have been important advancements in this field, with space agencies and private companies starting to work assiduously on autonomous landing (SpaceX's Falcon 9) and docking (ESA's ATV and Roscosmos' Progress) mechanisms. Moreover, with the announcement of the future human exploration missions towards the Moon aimed at the construction of the Lunar Orbital Platform-Gateway, the importance of new guidance algorithms for docking and precision landing in unconventional environments is evident. This is why this thesis aims to create a guidance algorithm for docking in the cislunar environment. In particular, the objective is to create an algorithm that is computationally lightweight and closed-loop, so that it can be implemented directly on board, and capable of taking constraints on the relative position into account.

The problem has been solved starting from the well-known guidance algorithm called Zero-Effort-Miss/Zero-Effort-Velocity (ZEM/ZEV), using some artificial intelligence techniques to improve its performance and increase its potential. The algorithm has been developed in the circular restricted three body problem for NRO (Near Rectilinear Orbits), but the results are generalizable to many other guidance problems. The results are satisfactory. It has been shown that artificial intelligence can be used to solve spacecraft guidance problems; the resulting algorithm is in fact capable not only of improving the performance of classical ZEM/ZEV in problems without position constraints, but also in cases where constraints are present and the classical algorithm fails.


Contents

1 Introduction
 1.1 Why NROs in L2
 1.2 Why relative dynamics constrained guidance problem
 1.3 Why ZEM/ZEV closed-loop feedback guidance
 1.4 Why machine learning
 1.5 Thesis outline

2 Theoretical Background
 2.1 Circular restricted three body problem for Earth-Moon system
  2.1.1 Equations of motion and libration points
  2.1.2 Near Rectilinear Orbits
 2.2 Classical Zero-Effort-Miss/Zero-Effort-Velocity algorithm
 2.3 Machine learning
  2.3.1 Reinforcement learning
  2.3.2 Extreme learning machines

3 Problem Formalization
 3.1 NRO rendezvous problem formalization
 3.2 NRO relative motion
  3.2.1 Reference frames
  3.2.2 Relative equations of motion

4 Adaptive-ZEM/ZEV algorithm
 4.1 General description of the algorithm
 4.2 Samples generation
  4.2.1 Policy
  4.2.2 Features
 4.3 Critic neural network
 4.4 Policy update
 4.5 Remarks on the A-ZEM/ZEV

5 Numerical Results
 5.1 GPOPS
 5.2 Clohessy-Wiltshire equations
  5.2.1 Single test trajectory
  5.2.2 Multiple test trajectories
 5.3 Non-linear relative equations
  5.3.1 Single test trajectory
  5.3.2 Multiple test trajectories

6 Stability Analysis
 6.1 Transformation of LTV systems into LTI systems
 6.2 Stability of the A-ZEM/ZEV algorithm

7 Conclusions and future work

List of Figures

1.1 Orbits considered for Lunar Orbital Platform-Gateway
2.1 Non-dimensional synodic reference frame (Rem)
2.2 Lagrangian points of the Earth-Moon system in the rotating non-dimensional synodic reference frame
2.3 NROs in the Earth-Moon system originating from L1 and L2. The black sphere is the Moon while the red dot is the libration point (L1 or L2)
2.4 Single layer feedforward network
3.1 Rendezvous spheres
3.2 Reference systems utilized
4.1 Schematic representation of the actor-critic algorithm
4.2 Policy neural network
4.3 Summary of the A-ZEM/ZEV algorithm
5.1 NRO orbit selected
5.2 Conical constraint scenario for CW case
5.3 Spherical constraints scenario for CW case
5.4 Keep-out sphere scenario
5.5 Trajectory in LVLH frame for unconstrained problem with CW equations
5.6 Thrust, spacecraft mass and guidance gains for unconstrained problem with CW equations
5.7 Test trajectories and cost during training for unconstrained problem with CW equations
5.8 Trajectory in LVLH frame for conical constraint problem with CW equations
5.9 Thrust, spacecraft mass and guidance gains for conical constraint problem with CW equations
5.10 Test trajectories and cost during training for conical constraint problem with CW equations. The test trajectories are green (first iteration) to yellow (last iteration)
5.11 Trajectory in LVLH frame for spherical constraint problem with CW equations
5.12 Thrust, spacecraft mass and guidance gains for spherical constraints problem with CW equations
5.13 Test trajectories and cost during training for spherical constraint problem with CW equations. The test trajectories are green (first iteration) to yellow (last iteration)
5.14 Trajectory in LVLH frame for KOS problem with CW equations
5.15 Thrust, spacecraft mass and guidance gains for KOS problem with CW equations
5.16 Test trajectories and cost during training for KOS problem with CW equations. The test trajectories are green (first iteration) to yellow (last iteration)
5.17 Test trajectories and cost for unconstrained problem with CW equations and multiple test trajectories. The green trajectories are referred to the first iteration, the light green ones to an intermediate iteration and the yellow ones to the final iteration
5.18 Test trajectories during training for conical constraint problem with CW equations. The green trajectories are referred to the first iteration, the light green ones to an intermediate iteration and the yellow ones to the final iteration
5.19 Cost and impacts percentage for conical constraint problem with CW equations and multiple test trajectories
5.20 Test trajectories during training for spherical constraint problem with CW equations. The green trajectories are referred to the first iteration, the light green ones to an intermediate iteration and the yellow ones to the final iteration
5.21 Cost and impacts percentage for spherical constraint problem with CW equations
5.22 Test trajectories during training for KOS problem with CW equations. The green trajectories are referred to the first iteration, the light green ones to an intermediate iteration and the yellow ones to the final iteration
5.23 Cost and impacts percentage for KOS problem with CW equations
5.24 Trajectory in LVLH frame for unconstrained problem with NLR equations
5.25 Thrust, spacecraft mass and guidance gains for unconstrained problem with NLR equations
5.26 Test trajectories and cost during training for unconstrained problem with NLR equations
5.27 Trajectory in LVLH frame for conical constraint problem with NLR equations
5.28 Thrust, spacecraft mass and guidance gains for conical constraint problem with NLR equations
5.29 Test trajectories and cost during training for conical constraint problem with NLR equations. The test trajectories are green (first iteration) to yellow (last iteration)
5.30 Trajectory in LVLH frame for spherical constraint problem with NLR equations
5.31 Thrust, spacecraft mass and guidance gains for spherical constraint problem with NLR equations
5.32 Test trajectories and cost during training for spherical constraint problem with NLR equations. The test trajectories are green (first iteration) to yellow (last iteration)
5.33 Trajectory in LVLH frame for KOS problem with NLR equations
5.34 Thrust, spacecraft mass and guidance gains for KOS problem with NLR equations
5.35 Test trajectories and cost during training for KOS problem with NLR equations. The test trajectories are green (first iteration) to yellow (last iteration)
5.36 Test trajectories and cost during training for unconstrained problem with NLR equations and multiple test trajectories. The green trajectories are referred to the first iteration, the light green ones to an intermediate iteration and the yellow ones to the final iteration
5.37 Test trajectories for conical constraint problem with NLR equations and multiple test trajectories. The green trajectories are referred to the first iteration, the light green ones to an intermediate iteration and the yellow ones to the final iteration
5.38 Cost and impacts percentage for conical constraint problem with NLR equations and multiple test trajectories
5.39 Test trajectories for spherical constraints problem with NLR equations. The green trajectories are referred to the first iteration, the light green ones to an intermediate iteration and the yellow ones to the final iteration
5.40 Cost and impacts percentage for spherical constraints problem with NLR equations
5.41 Test trajectories for KOS problem with NLR equations. The green trajectories are referred to the first iteration, the light green ones to an intermediate iteration and the yellow ones to the final iteration
5.42 Cost and impacts percentage for KOS problem with NLR equations
6.1 Eigenvalues of the system during entire mission for all cases with CW equations
6.2 Eigenvalues of the system during entire mission for all cases with NLR equations
6.3 State Transition Matrix components for all cases


List of Tables

1.1 Orbits comparison from [45]
5.1 Performance comparison for CW equations
5.2 Performance comparison for NLR equations


Introduction

Accurate feedback guidance algorithms have always been of utmost importance for space exploration. Both for manned and unmanned flights, the success of a mission depends on the ability to maneuver the spacecraft in an efficient, precise and safe way. With the Lunar Orbital Platform-Gateway (LOP-G) [30] set to become the new outpost for human exploration of the solar system, and especially of the Moon, relative dynamics guidance in the cislunar environment will be of pivotal importance in the near future. NASA has stated that in the next decade the Moon will be one of the primary objectives for space exploration, both for its scientific value and as a proving ground for further advancements in human exploration (e.g. Mars). NASA plans to enlist a series of commercial robotic landers and rockets to meet lunar payload delivery and service needs. In this context, the Lunar Orbital Platform-Gateway will serve NASA and its commercial and international partners as a uniquely valuable staging point and communications relay for exploration and science missions in deep space. Near Rectilinear Halo Orbits (NRHO or NRO) [45] in the Earth-Moon three-body framework are considered the most promising environment for this kind of mission because of their advantageous shape in terms of Earth and Moon visibility and insertion cost. The relative rendezvous problem in this environment has already been studied and formalized [6], but there is little to no literature on the guidance and control side of the problem. The aim of this work is to propose a new guidance algorithm capable of operating in such an environment. The idea is to use some concepts of artificial intelligence, especially reinforcement learning [1, 39, 46] and extreme learning machines [5, 21, 22], to create a zero-effort-miss/zero-effort-velocity [17] based closed-loop algorithm able to solve this kind of problem.


Figure 1.1: Orbits considered for Lunar Orbital Platform-Gateway

1.1 Why NROs in L2

Maneuvering in Lagrangian-point orbits has been important since the beginning of solar system exploration. Examples of spacecraft that make use of the advantageous position of these particular points are the solar wind monitoring probes (ACE, SOHO, DSCOVR, WIND) positioned at the Earth-Sun L1 Lagrangian point [3, 4, 9, 10].

Lagrangian points have also become increasingly important with the announcement of the James Webb telescope, which is currently undergoing final testing and will launch in March 2021 directed to a halo orbit originating from the Earth-Sun L2 Lagrangian point [25, 31]. Moreover, interest in the Earth-Moon Lagrangian points has risen lately, as cislunar Near Rectilinear Orbits (NRO) will probably be the destination of the Lunar Orbital Platform-Gateway. The project will be the first outpost for future human exploration of the solar system once the International Space Station is retired. Its power and propulsion module is set to launch in 2022. The reasons why NROs will probably be selected for future missions to the Moon are numerous. An important study by NASA, reported by Campolo [6, 45], has shown some important advantages over other cislunar orbits. Even though there is no official definition for NROs, they can be defined as degenerate halo orbits whose projection on the xy-plane of the closest point to the mean lunar surface lies inside the circle defined by the projection of the Moon's sphere on the same plane. A representation of an NRO in the synodic reference frame can be seen in Figure 1.1, together with the other orbits considered in the study. A summary of the results of the above-mentioned study can be seen in Table 1.1.


Table 1.1: Orbits comparison from [45]

From the study it is evident why NROs are particularly interesting and preferable to other orbits in the cislunar environment for a mission like LOP-G: their particular shape gives them continuous coverage of one side of the Moon while remaining continuously visible from Earth. Moreover, they are advantageous in terms of ∆V for transfers to and from the Earth and the lunar surface; the same study has in fact shown that they are within the launch capabilities of an SLS-Orion mission. Finally, they have a small ∆V requirement for station keeping and favorable thermal characteristics. This is why this thesis is focused on the development of a new guidance algorithm for missions involving NROs. The northern or southern family is selected depending on whether coverage of the Moon's north or south pole is preferred. In the case of this project, which is a proof of concept more than a project tied to a particular mission profile, the algorithm has been tested on an orbit of the northern family, but the concepts are general and can be applied to any NRO of any family.


1.2 Why relative dynamics constrained guidance problem

As said above, NROs are set to become increasingly important for robotic and manned missions. Many of the operations in this environment will rely on precision insertion maneuvers and relative guidance. The latter is of pivotal importance for docking to build the lunar gateway, autonomous resupply missions and astronaut transfer. The safety requirements defined in [6] and summarized in Chapter 3 impose constraints that the guidance must follow. Up until now, guidance algorithms have almost always relied on open-loop architectures that are either defined beforehand on the ground or depend on direct human intervention in the case of manned missions. Examples of almost automated docking are ESA's ATV [36] and Roscosmos' Progress [49]. Although they perform well, they still rely heavily on planning from the ground and do not use a completely autonomous guidance approach. Moreover, they work well for docking to the ISS, hence in Low Earth Orbit (LEO), but there is no assurance that they would work in a cislunar environment. All in all, the problem of relative guidance in this environment should be formalized and addressed thoroughly, given the increasing interest in the matter and the relative lack of solutions.

1.3 Why ZEM/ZEV closed-loop feedback guidance

Over the past few years, researchers have been exploring the performance of the generalized Zero-Effort-Miss/Zero-Effort-Velocity (ZEM/ZEV) feedback guidance for soft landing, intercept and rendezvous problems [17, 18, 48]. The ZEM/ZEV feedback guidance is attractive because of its analytical simplicity and accuracy: guidance mechanization is straightforward, and it can theoretically drive the spacecraft to a target autonomously and with minimal guidance errors, regardless of the equations of motion. Moreover, it has been shown to be globally finite-time stable and robust to perturbations and uncertainties in the model if a proper sliding parameter is added (Optimal Sliding Guidance) [12]. One of its biggest strengths is its closed-loop nature: the guidance action in a particular state is derived directly from information on the current state and the target state. This is powerful because there is no need to integrate ground operations in the control loop as is done with open-loop architectures, making the spacecraft much more autonomous. It is, all in all, a very straightforward way of solving the closed-loop guidance problem when constraints are not an issue. Nevertheless, the algorithm has two major limitations:


• It solves the guidance problem optimally only in cases where the gravity field, and the acceleration components in general, are constant or solely dependent on time.

• It is virtually impossible to embed constraints directly into the algorithm.

These are strong limitations. In certain cases the algorithm can generate solutions that are as much as 100% sub-optimal, which is undesirable for missions with a high fuel consumption. The second limitation makes the algorithm unsuitable for a wide range of problems where constraints must be enforced. For example, relative motion operations for docking normally have path constraints to be followed, so it is clear that the classical algorithm is not feasible for this kind of problem. The aim of this project is to create a new algorithm that retains the strengths of classical ZEM/ZEV and overcomes its major limitations by making use of machine learning techniques.

1.4 Why machine learning

Machine learning has been a tool for carrying out data analysis and solving particularly difficult tasks for many years now. The basis for neural networks and many of the learning algorithms known today was developed and theorized during the 20th century. In the last decade, interest in the matter has risen because of the increased computing power available, which has made it possible to successfully implement ideas that had previously only been theorized. Machine learning is defined as the part of Artificial Intelligence (AI) that gives machines the ability to learn without being programmed a priori to solve a particular task. This definition is rather vague, so a more detailed discussion of the matter is needed. Machine learning tasks are typically divided into three branches [27, 41]:

• Supervised learning: the machine is provided with samples from the process to be learned, composed of inputs and outputs. The learning task consists in learning the rule that maps the inputs to the outputs.

• Unsupervised learning: the machine is fed only with inputs and is left on its own to learn particular patterns in the input data.

• Reinforcement learning: derived from animal training, it is based on the concept of learning by trial and error. The machine is left free to operate in a certain environment and is given positive or negative rewards based on the task to be solved. The machine learns the optimal way of acting, or policy, that allows it to optimize the total reward.

In the case of this thesis, the first and the last types of learning will be used. Reinforcement learning in particular is not new to applications in robot locomotion and guidance in general. Reinforcement learning algorithms have been successfully used in the information technology community to solve many problems related to robot locomotion, games of chance and even Atari video games. Although they have been used to solve many robotic motion tasks [16, 23, 29, 34, 35, 40], they have not been used frequently in spacecraft guidance. There is an example, involving a landing problem, in which reinforcement learning has been used to select the optimal sequence of waypoints in a waypoint-based ZEM/ZEV algorithm [11], but it has never been used, to the author's knowledge, to solve the guidance problem directly. The idea behind this project is to use reinforcement learning, specifically an actor-critic algorithm (described in detail in Section 2.3 and Chapter 4), to solve the constrained guidance problem, paving the way for truly autonomous spacecraft guidance.

1.5 Thesis outline

The thesis is organized into the following chapters:

• Chapter 1: Introduction. It contains the motivations of the thesis and an introduction to the problem.

• Chapter 2: Theoretical background. In this chapter an introduction to all the concepts used throughout the project is presented. Specifically the Circular Restricted Three Body Problem (CRTBP) and Near Rectilinear Orbits (NROs), the classical Zero-Effort-Miss/Zero-Effort-Velocity (ZEM/ZEV) guidance algorithm, actor-critic algorithms and Extreme Learning Machines (ELM).

• Chapter 3: Problem formalization. In this chapter the rendezvous problem in the CRTBP is formally introduced and the reference frames and equations of motion are described.

• Chapter 4: Adaptive-ZEM/ZEV algorithm. In this chapter the new guidance algorithm is described in detail.


• Chapter 5: Numerical results. In this chapter the numerical results obtained after testing of the A-ZEM/ZEV algorithm are presented and analyzed.

• Chapter 6: Stability analysis. In this chapter the convergence and stability properties of the A-ZEM/ZEV algorithm are addressed.

• Chapter 7: Conclusions and future work. In this chapter the major results of the work are summarized and an insight into future possible developments is presented.


Theoretical Background

In this chapter, the main theoretical concepts on which the entire work is based are presented, namely the Circular Restricted Three Body Problem (CRTBP) and Near Rectilinear Orbits (NROs), the classical Zero-Effort-Miss/Zero-Effort-Velocity (ZEM/ZEV) feedback guidance algorithm, the stochastic actor-critic algorithm and Extreme Learning Machines (ELM).

2.1 Circular restricted three body problem for Earth-Moon system

Consider the motion of a particle in the presence of two main bodies, or primaries, m1 and m2, where the only means of interaction between the bodies is gravitational attraction. In general, under these conditions, all bodies are free to move. In the Restricted Three Body Problem, the primaries are considered to orbit around the center of mass of the system. The mass of the third body is small enough to be negligible with respect to the primaries, which makes its motion independent of its mass. A further assumption is introduced in the Circular Restricted Problem: the motion of the two primaries around the center of mass is constrained to be circular. This model is valid in many cases in which the spacecraft is attracted by two large bodies, such as the Earth and the Sun or the Earth and the Moon, and it is in general valid in many gravitational systems throughout the solar system in which the primaries have nearly circular orbits. Historically, the dynamics of the problem is expressed in the synodic reference frame, which for the Earth-Moon system, and for the remainder of this thesis, will be called Rem.


Figure 2.1: Non-dimensional synodic reference frame (Rem)

The origin of this frame is positioned at the center of mass of the system G, the x-axis is aligned with the line connecting the two primaries, the z-axis is parallel to the angular momentum vector of the primaries and the y-axis completes the orthonormal triad. As can be deduced from its definition, the frame is non-inertial: it rotates with an angular velocity equal to the mean angular motion of the two primaries around their center of mass. Moreover, quantities in this reference frame are made non-dimensional by introducing some normalization parameters, as explained in the next section.

2.1.1 Equations of motion and libration points

The equations of motion governing the dynamics of a particle of mass m are expressed in the synodic non-dimensional reference frame of Figure 2.1. The system is made non-dimensional with the introduction of a particular set of units: the unit of length L is the constant separation between the primaries, the unit of velocity V is the mean orbital velocity of m2 and the unit of time T is chosen such that the orbital period of the primaries equals 2π in non-dimensional units. The relationship between the non-dimensional units and the primed dimensional units is:

\[ d' = L\,d \tag{2.1} \]
\[ s' = V\,s \tag{2.2} \]
\[ t' = \frac{T}{2\pi}\,t \tag{2.3} \]

The only parameter governing the dynamics of the system is the mass parameter

\[ \mu = \frac{m_2}{m_1 + m_2} \tag{2.4} \]

In this reference system, the equations of motion describing the dynamics of the particle are the following:

\[
\begin{cases}
\ddot{x} - 2\dot{y} = x - \dfrac{1-\mu}{r_1^3}\,(x + \mu) - \dfrac{\mu}{r_2^3}\,\bigl(x - (1-\mu)\bigr)\\[2mm]
\ddot{y} + 2\dot{x} = y - y\left(\dfrac{1-\mu}{r_1^3} + \dfrac{\mu}{r_2^3}\right)\\[2mm]
\ddot{z} = -z\left(\dfrac{1-\mu}{r_1^3} + \dfrac{\mu}{r_2^3}\right)
\end{cases}
\tag{2.5}
\]

with

\[ r_1 = \sqrt{(x+\mu)^2 + y^2 + z^2} \tag{2.6} \]
\[ r_2 = \sqrt{\bigl(x-(1-\mu)\bigr)^2 + y^2 + z^2} \tag{2.7} \]

If the non-dimensional potential U is considered and defined as

\[ U = \frac{1}{2}\left(x^2 + y^2\right) + \frac{1-\mu}{r_1} + \frac{\mu}{r_2} \tag{2.8} \]

equation 2.5 can be written as

\[
\begin{cases}
\ddot{x} - 2\dot{y} = \dfrac{\partial U}{\partial x}\\[2mm]
\ddot{y} + 2\dot{x} = \dfrac{\partial U}{\partial y}\\[2mm]
\ddot{z} = \dfrac{\partial U}{\partial z}
\end{cases}
\tag{2.9}
\]

A more comprehensive study on the problem and on the procedure to derive the equations of motion can be found in [24].
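As an illustration of how equations 2.5-2.7 can be integrated numerically, a minimal Python sketch is given below. It is only illustrative: the value of the Earth-Moon mass parameter is approximate and the initial state is arbitrary.

```python
import numpy as np
from scipy.integrate import solve_ivp

MU = 0.01215  # approximate Earth-Moon mass parameter mu = m2/(m1 + m2)

def crtbp_rhs(t, s, mu=MU):
    """Right-hand side of the non-dimensional CRTBP equations of motion (2.5)."""
    x, y, z, vx, vy, vz = s
    r1 = np.sqrt((x + mu)**2 + y**2 + z**2)        # distance from the larger primary (2.6)
    r2 = np.sqrt((x - (1 - mu))**2 + y**2 + z**2)  # distance from the smaller primary (2.7)
    ax = 2*vy + x - (1 - mu)*(x + mu)/r1**3 - mu*(x - (1 - mu))/r2**3
    ay = -2*vx + y - y*((1 - mu)/r1**3 + mu/r2**3)
    az = -z*((1 - mu)/r1**3 + mu/r2**3)
    return [vx, vy, vz, ax, ay, az]

# Example: ballistic propagation of an arbitrary initial state for one time unit
s0 = [1.02, 0.0, -0.18, 0.0, -0.10, 0.0]
sol = solve_ivp(crtbp_rhs, (0.0, 1.0), s0, rtol=1e-12, atol=1e-12)
```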

Equilibrium solutions. Although equations 2.5 do not have a closed-form analytical solution, it is possible to determine the location of the equilibrium points of the CRTBP using the potential defined above. The equilibrium points, also called Lagrangian or libration points, are stationary points of the potential function U and are the solutions of the equation

\[ \nabla U = 0 \tag{2.10} \]

Figure 2.2: Lagrangian points of the Earth-Moon system in the rotating non-dimensional synodic reference frame

The equilibrium points are locations at which the secondary mass m would appear motionless in the rotating synodic frame. There are five of them: three lie on the x-axis and are called collinear points, and two form equilateral triangles with the two primaries and are called equilateral or triangular points. A representation of the Lagrangian points of the Earth-Moon system expressed in the synodic reference frame is given in Figure 2.2.
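For the collinear points, equation 2.10 reduces to a scalar root-finding problem along the x-axis. The sketch below is an illustration under simple assumptions (approximate mass parameter, hand-picked brackets), not the procedure used in the references cited above.

```python
import numpy as np
from scipy.optimize import brentq

MU = 0.01215  # approximate Earth-Moon mass parameter

def dUdx_on_axis(x, mu=MU):
    """x-component of the gradient of the potential U (eq. 2.8) evaluated on the x-axis."""
    r1 = abs(x + mu)
    r2 = abs(x - (1 - mu))
    return x - (1 - mu)*(x + mu)/r1**3 - mu*(x - (1 - mu))/r2**3

eps = 1e-6
L1 = brentq(dUdx_on_axis, 0.5, 1 - MU - eps)   # between the primaries
L2 = brentq(dUdx_on_axis, 1 - MU + eps, 2.0)   # beyond the Moon
L3 = brentq(dUdx_on_axis, -2.0, -MU - eps)     # beyond the Earth
# The triangular points L4 and L5 sit at (1/2 - mu, +/- sqrt(3)/2) in the same frame
print(L1, L2, L3)  # roughly 0.837, 1.156, -1.005
```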

2.1.2 Near Rectilinear Orbits

In the CRTBP framework, there exists a wide variety of trajectories that result in periodic motion. Belonging to this subset are all the trajectories that, starting from a point and moving under the sole influence of gravity, periodically revisit that point. Over the years many types, or families, of orbits around all five libration points have been discovered. They can be divided into two main groups: in-plane and out-of-plane orbits. Near Rectilinear Orbits, or NROs, belong to the second group; more specifically, they are a degenerate subset of halo orbits whose projection on the x-y plane of the closest point to one of the primaries lies inside the circle defined by the projection of that primary on the same plane. Representations of NROs can be seen in Figure 1.1 and later in this chapter. The generation of periodic orbits in this framework is not straightforward. As stated in Section 2.1.1, equations 2.5 are non-linear and have no analytical solution, so trajectories have to be found using a shooting algorithm based on a multi-variable Newton method. The whole process of finding these orbits is described thoroughly in [15] and [32] and, since it is not the main subject of this thesis, it will not be described in detail. In the following, a summary of the method is presented in order to formally introduce NROs and to show how the orbits for the problem were computed and selected.

Multi-variable Newton method

The multi-variable Newton method, as described in [6, 15] and summarized below, is the main concept on which the generation of orbits in the CRTBP framework is based, and it relies on some definitions. If the system of equations 2.5 is written as

\[ \dot{\mathbf{x}} = \mathbf{f}(t, \mathbf{x}) \tag{2.11} \]

where x is the state vector of six components (position and velocity), its dynamics is related to the initial state by the State Transition Matrix (STM), Φ(t, t0). The dynamics of the STM is described by the following equation:

\[ \dot{\Phi}(t, t_0) = A(t)\,\Phi(t, t_0) \tag{2.12} \]

where

\[ \Phi(t, t_0) = \frac{\partial \mathbf{x}}{\partial \mathbf{x}_0} \tag{2.13} \]

and A(t) is the Jacobian matrix relating the variations of the vector f to the variations of the state x:

\[ A(t) = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \tag{2.14} \]

In the CRTBP the matrix A(t) has the following form:

\[
A(t) =
\begin{bmatrix}
0 & 0 & 0 & 1 & 0 & 0\\
0 & 0 & 0 & 0 & 1 & 0\\
0 & 0 & 0 & 0 & 0 & 1\\
U_{xx} & U_{xy} & U_{xz} & 0 & 2 & 0\\
U_{yx} & U_{yy} & U_{yz} & -2 & 0 & 0\\
U_{zx} & U_{zy} & U_{zz} & 0 & 0 & 0
\end{bmatrix}
\tag{2.15}
\]

where the coefficients U_{ij} are the second derivatives of the potential function U and can be seen in extended form in [6]. The multi-variable Newton method, like its single-variable counterpart, seeks to determine the zeros x* of a particular function

\[ f(\mathbf{x}^*) = 0 \tag{2.16} \]

In this case the free-variable vector comprises the n design variables that depend on the particular problem to be solved,

\[
\mathbf{X} =
\begin{bmatrix}
x_1\\
\vdots\\
x_n
\end{bmatrix}
\tag{2.17}
\]

These variables are then subject to m scalar constraint equations,

\[
\mathbf{F}(\mathbf{X}) =
\begin{bmatrix}
F_1(\mathbf{X})\\
\vdots\\
F_m(\mathbf{X})
\end{bmatrix}
= 0
\tag{2.18}
\]

The goal of the method is to find the solution X* that satisfies the constraint function F(X*) = 0 within some numerical tolerance.

Single shooting method for NRO computation

The Newton method described above is used to compute NROs in the context of this project by incorporating it into a single shooting algorithm. This is a way of solving a Two Point Boundary Value Problem (TPBVP) by numerically integrating the equations of motion of the problem (in this case equations 2.5) together with equation 2.12, and using the Newton method to update the initial conditions in order to minimize the error at the end of the trajectory, as expressed by the constraint equations, until the convergence condition is reached. It is called "single shooting" because a single integrated arc is used.

In order to compute the halo family (and hence the NRO family), the planar, or Lyapunov, family should be computed first [14, 32]. The method in fact relies on a continuation scheme to find the entire group of orbits, so an initial orbit is needed in order to find the rest of them. In the case of halo orbits, the first one is found by first locating the bifurcating orbit of the Lyapunov family and then introducing a small perturbation in the z direction to start the continuation. This is why it is necessary to first find the Lyapunov family.

Lyapunov family. By exploiting the symmetry of this kind of orbit with respect to the x-axis, the corrector is built in such a way that it ensures a perpendicular crossing of the axis after half a period. This means that the x component of the velocity and the y component of the position must be zero at the end of the arc. The free variables in this case are x0, ẏ0 and T0, so the initial state vector for the equations of motion is:

\[ \mathbf{q}_0 = \begin{bmatrix} x_0 & 0 & 0 & 0 & \dot{y}_0 & 0 \end{bmatrix} \tag{2.19} \]

and the free-variable vector of the method is:

\[ \mathbf{X} = \begin{bmatrix} x_0 & \dot{y}_0 & T \end{bmatrix}^T \tag{2.20} \]

The constraint condition is

\[ \mathbf{F}(\mathbf{X}) = \begin{bmatrix} y_t & \dot{x}_t \end{bmatrix} = 0 \tag{2.21} \]

where t = T/2 and T is the period of the orbit. Using the multi-variable Newton method described above, the free-variable vector at the (j+1)-th iteration is

\[ \mathbf{X}_{j+1} = \mathbf{X}_j - D\mathbf{F}(\mathbf{X}_j)^{-1}\,\mathbf{F}(\mathbf{X}_j) \tag{2.22} \]

where DF(X) = ∂F(X)/∂X is the Jacobian of the method. It should be noted that, if the number of free variables is not equal to the number of constraints, the system is not square and the pseudo-inverse of the Jacobian must be used instead:

\[ \mathbf{X}_{j+1} = \mathbf{X}_j - D\mathbf{F}(\mathbf{X}_j)^{+}\,\mathbf{F}(\mathbf{X}_j) \tag{2.23} \]

In this specific case the DF matrix is:

\[
D\mathbf{F}(\mathbf{X}) =
\begin{bmatrix}
\Phi_{21} & \Phi_{25} & \dot{y}\\
\Phi_{41} & \Phi_{45} & \ddot{x}
\end{bmatrix}
\tag{2.24}
\]

so equation 2.23 must be used. In order to obtain the entire family, a continuation scheme has been implemented. The first two Lyapunov orbits, computed using the initial guess from reference [14], are used as seeds to compute all the other orbits of the family. By applying a small variation to the initial conditions, which depends on the previous orbits, the new orbit is obtained by running the corrector again on the perturbed initial conditions as in 2.25 and 2.26:

\[
\begin{aligned}
dx_0 &= x_0^{(j-1)} - x_0^{(j-2)}\\
d\dot{y}_0 &= \dot{y}_0^{(j-1)} - \dot{y}_0^{(j-2)}\\
dT_0 &= T_0^{(j-1)} - T_0^{(j-2)}
\end{aligned}
\tag{2.25}
\]

\[
\begin{aligned}
x_0^{j} &= x_0^{(j-1)} + dx_0\\
\dot{y}_0^{j} &= \dot{y}_0^{(j-1)} + d\dot{y}_0\\
T_0^{j} &= T_0^{(j-1)} + dT_0
\end{aligned}
\tag{2.26}
\]
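To make the corrector concrete, the following Python sketch applies the update of equations 2.22-2.23 to the planar case of equations 2.19-2.21. It is an illustration under simplifying assumptions, not the implementation used in this work: it assumes the crtbp_rhs function from the sketch in Section 2.1.1 is in scope and, for brevity, approximates the Jacobian DF by finite differences instead of propagating the State Transition Matrix.

```python
import numpy as np
from scipy.integrate import solve_ivp
# assumes crtbp_rhs from the Section 2.1.1 sketch is in scope

def half_period_constraints(X):
    """F(X) = [y(T/2), xdot(T/2)] of eq. 2.21 for a planar guess X = [x0, ydot0, T]."""
    x0, ydot0, T = X
    s0 = [x0, 0.0, 0.0, 0.0, ydot0, 0.0]              # initial state of eq. 2.19
    sol = solve_ivp(crtbp_rhs, (0.0, T/2.0), s0, rtol=1e-12, atol=1e-12)
    return np.array([sol.y[1, -1], sol.y[3, -1]])     # y and xdot at half period

def correct(X, tol=1e-10, step=1e-8, max_iter=20):
    """Differential corrector: Newton/pseudo-inverse update of eqs. 2.22-2.23.
    The Jacobian DF is approximated here by finite differences instead of the STM."""
    X = np.asarray(X, dtype=float)
    for _ in range(max_iter):
        F = half_period_constraints(X)
        if np.linalg.norm(F) < tol:
            break
        DF = np.zeros((F.size, X.size))
        for j in range(X.size):                        # finite-difference Jacobian
            Xp = X.copy()
            Xp[j] += step
            DF[:, j] = (half_period_constraints(Xp) - F) / step
        X = X - np.linalg.pinv(DF) @ F                 # 2x3 system: pseudo-inverse (2.23)
    return X
```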

Halo and NRO family. The corrector used to find halo orbits and NROs derives from the one used for Lyapunov orbits. However, in this case two different correctors are needed to compute the entire family. Since the family extends in both the x and z directions, the initial conditions are first perturbed in the z direction until a certain condition is reached, and are then perturbed in the x direction.

Corrector with z0 fixed. The symmetry of the orbits with respect to the xz-plane is used in this case, which translates into a perpendicular crossing of the xz-plane at half a period of the orbit. The state vector for this case is:

\[ \mathbf{q}_0 = \begin{bmatrix} x_0 & 0 & z_0 & 0 & \dot{y}_0 & 0 \end{bmatrix} \tag{2.27} \]

where z0 is imposed. The degrees-of-freedom vector is the same as in the Lyapunov case:

\[ \mathbf{X} = \begin{bmatrix} x_0 & \dot{y}_0 & T \end{bmatrix}^T \tag{2.28} \]

while the constraint that ensures the perpendicular crossing of the xz-plane is:

\[ \mathbf{F}(\mathbf{X}) = \begin{bmatrix} y_t & \dot{x}_t & \dot{z}_t \end{bmatrix} = 0 \tag{2.29} \]

In this case the Jacobian becomes:

\[
D\mathbf{F}(\mathbf{X}) =
\begin{bmatrix}
\Phi_{21} & \Phi_{25} & \dot{y}\\
\Phi_{41} & \Phi_{45} & \ddot{x}\\
\Phi_{61} & \Phi_{65} & \ddot{z}
\end{bmatrix}
\tag{2.30}
\]

The system is square, so equation 2.22 can be used to obtain the corrected solution.

Corrector with x0 fixed. The corrector is similar to the previous one. The state vector is the same as before, except that x0 is now imposed while z0 is left free to vary. The degrees-of-freedom vector is then:

\[ \mathbf{X} = \begin{bmatrix} z_0 & \dot{y}_0 & T \end{bmatrix}^T \tag{2.31} \]

while the constraint is the same as in the other case. The Jacobian is:

\[
D\mathbf{F}(\mathbf{X}) =
\begin{bmatrix}
\Phi_{23} & \Phi_{25} & \dot{y}\\
\Phi_{43} & \Phi_{45} & \ddot{x}\\
\Phi_{63} & \Phi_{65} & \ddot{z}
\end{bmatrix}
\tag{2.32}
\]

It is a square matrix, so again equation 2.22 can be applied.

Continuation. The continuation method starts from the identification of the bifurcating orbit of the corresponding Lyapunov family. From theory, this is identified by the condition in which the stable and unstable eigenvalues of the monodromy matrix (the state transition matrix evaluated at t = T, with T being the period) jump from the real axis to the unit circle in the complex plane. Stable and unstable eigenvalues are, respectively, those with magnitude smaller and greater than one. A first estimate of the position of the bifurcating orbit is obtained by computing the eigenvalues of the monodromy matrix and stopping at the first orbit that meets the condition described above. Once an interval is identified, a bisection method is applied on the initial state vector in order to obtain the bifurcating orbit within a certain tolerance.

Starting from the bifurcating orbit, the first two out-of-plane orbits are found by applying a displacement of an arbitrarily small value in the z direction, either in the north (positive) or in the south (negative) direction. These two orbits are then used as seeds for the continuation scheme. The increments are:

\[
\begin{aligned}
dx_0 &= x_0^{(j-1)} - x_0^{(j-2)}\\
d\dot{y}_0 &= \dot{y}_0^{(j-1)} - \dot{y}_0^{(j-2)}\\
dz_0 &= z_0^{(j-1)} - z_0^{(j-2)}\\
dT_0 &= T_0^{(j-1)} - T_0^{(j-2)}
\end{aligned}
\tag{2.33}
\]

and the initial conditions at each step are:

\[
\begin{aligned}
x_0^{j} &= x_0^{(j-1)} + dx_0\\
\dot{y}_0^{j} &= \dot{y}_0^{(j-1)} + d\dot{y}_0\\
z_0^{j} &= z_0^{(j-1)} + dz_0\\
T_0^{j} &= T_0^{(j-1)} + dT_0
\end{aligned}
\tag{2.34}
\]

Since the halo orbit family develops in both the z and x directions, the corrector is switched from z0-fixed to x0-fixed when the following condition is met:

\[ |dx_0| > |dz_0| \tag{2.35} \]
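The logic of equations 2.33-2.34 amounts to extrapolating the last two converged family members and re-converging the guess with the differential corrector. A generic sketch of such a continuation loop is given below; it builds on the hypothetical correct() routine from the previous sketch and, for brevity, omits the corrector switch of equation 2.35.

```python
import numpy as np
# builds on the hypothetical correct() routine from the previous sketch

def continue_family(member_a, member_b, n_new=50):
    """Natural-parameter continuation in the spirit of eqs. 2.33-2.34: extrapolate the
    last two converged members and re-converge the guess with the differential corrector."""
    family = [np.asarray(member_a, dtype=float), np.asarray(member_b, dtype=float)]
    for _ in range(n_new):
        d = family[-1] - family[-2]      # increments between the last two members (2.33)
        guess = family[-1] + d           # perturbed initial conditions (2.34)
        family.append(correct(guess))    # run the corrector on the perturbed guess
    return family
```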

The resulting NROs originating from L1 and L2 in the Earth-Moon system are shown in Figure 2.3; the Lyapunov and other halo orbits are omitted. Although everything developed in the following is valid for any of the NRO families, the algorithm has been tested on an orbit of the northern family of NROs originating from L2. It should be clear, though, that the equations of motion are valid for any orbit in any family and that the algorithm, with the proper adjustments to its parameters, can work for any of the above-mentioned NRO families.

Figure 2.3: NROs in the Earth-Moon system originating from L1 and L2: (a) L1 northern family, (b) L1 southern family, (c) L2 northern family, (d) L2 southern family. The black sphere is the Moon while the red dot is the libration point (L1 or L2).

2.2 Classical Zero-Effort-Miss/Zero-Effort-Velocity algorithm

The optimal feedback guidance algorithm based on zero-effort-miss (ZEM) and zero-effort-velocity (ZEV) is presented in this section to give an overview of its architecture and its field of application. Consider a mission from time t0 to tf; the optimal control acceleration a is the solution that minimizes the performance index:

\[ J = \frac{1}{2}\int_{t_0}^{t_f} \mathbf{a}^T\mathbf{a}\;dt \tag{2.36} \]


for a body subjected to the following general dynamic equations, valid in any case, even for non-inertial systems:

\[
\begin{aligned}
\dot{\mathbf{r}} &= \mathbf{v}\\
\dot{\mathbf{v}} &= \mathbf{a} + \mathbf{f}(\mathbf{r}, \mathbf{v}), \qquad \mathbf{a} = \mathbf{T}/m
\end{aligned}
\tag{2.37}
\]

with r, v, T and a the position, velocity, thrust and acceleration command vectors respectively, and f(r, v) the generalized acceleration term containing the gravitational and non-inertial acceleration contributions, with the following given boundary conditions:

\[ \mathbf{r}(t_0) = \mathbf{r}_0, \qquad \mathbf{r}(t_f) = \mathbf{r}_f \tag{2.38} \]
\[ \mathbf{v}(t_0) = \mathbf{v}_0, \qquad \mathbf{v}(t_f) = \mathbf{v}_f \tag{2.39} \]

The Hamiltonian function for this problem is then defined as

\[ H = \frac{1}{2}\mathbf{a}^T\mathbf{a} + \mathbf{p}_r^T\mathbf{v} + \mathbf{p}_v^T(\mathbf{f} + \mathbf{a}) \tag{2.40} \]

where pr and pv are the costate vectors associated with the position and velocity vectors respectively. The time-to-go is defined as tgo = tf − t. The optimal acceleration at any time t is expressed as

\[ \mathbf{a} = -t_{go}\,\mathbf{p}_r(t_f) - \mathbf{p}_v(t_f) \tag{2.41} \]

By substituting equation 2.41 into the dynamics equations to solve for pr(tf) and pv(tf), the optimal control solution with specified rf, vf and tgo is obtained as:

\[ \mathbf{a} = \frac{6\left[\mathbf{r}_f - (\mathbf{r} + t_{go}\mathbf{v})\right]}{t_{go}^2} - \frac{2(\mathbf{v}_f - \mathbf{v})}{t_{go}} + \frac{6\int_t^{t_f}(\tau - t)\,\mathbf{f}(\tau)\,d\tau}{t_{go}^2} - \frac{4\int_t^{t_f}\mathbf{f}(\tau)\,d\tau}{t_{go}} \tag{2.42} \]

The ZEM distance and the ZEV error are defined, respectively, as the difference between the desired final position and the projected final position, and between the desired final velocity and the projected final velocity, if no additional control is commanded from time t onward. Assuming this is a problem for which f(r, v) = g(t), ZEM and ZEV have the following expressions:

\[
\begin{aligned}
\mathbf{ZEM} &= \mathbf{r}_f - \left[\mathbf{r} + t_{go}\mathbf{v} + \int_t^{t_f}(t_f - \tau)\,\mathbf{g}(\tau)\,d\tau\right]\\
\mathbf{ZEV} &= \mathbf{v}_f - \left[\mathbf{v} + \int_t^{t_f}\mathbf{g}(\tau)\,d\tau\right]
\end{aligned}
\tag{2.43}
\]

Then the optimal control law 2.42 can be expressed as:

\[ \mathbf{a} = \frac{6}{t_{go}^2}\,\mathbf{ZEM} - \frac{2}{t_{go}}\,\mathbf{ZEV} \tag{2.44} \]

In any other case, in which f(r, v) ≠ g(t), the control law is still usable, but it will not necessarily be optimal, because the hypotheses under which it has been shown to solve the energy-optimal problem fail. When the equations of motion are non-linear and, in general, when 2.43 does not apply, ZEM and ZEV are expressed in a slightly different way. The projected position and velocity cannot be recovered analytically: they must be obtained by integrating the equations of motion from the current time instant to the end of the mission with the control set to zero:

\[
\begin{aligned}
\mathbf{ZEM} &= \mathbf{r}_f - \mathbf{r}_{nc}\\
\mathbf{ZEV} &= \mathbf{v}_f - \mathbf{v}_{nc}
\end{aligned}
\tag{2.45}
\]

where rnc and vnc are, respectively, the position and velocity at the end of the mission if no control action is given from the considered time onward. It should be noted that the formulation in 2.44, which will be called classical ZEM/ZEV from now on, can still produce valid trajectories even when the generalized acceleration term is arbitrary. In these environments, however, using the definition of ZEM and ZEV in 2.45, the control gains that solve the optimal problem are no longer those in 2.44. This leads to the definition of the Generalized-ZEM/ZEV algorithm [17], valid in any environment:

\[ \mathbf{a} = \frac{K_R}{t_{go}^2}\,\mathbf{ZEM} + \frac{K_V}{t_{go}}\,\mathbf{ZEV} \tag{2.46} \]

The non-linear acceleration components in the relative equations of motion of the problem under investigation in this project, described in Chapter 3, justify the use of this generalized form of the guidance algorithm.
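As an illustration of equations 2.45-2.46, the sketch below computes the generalized ZEM/ZEV command by ballistically propagating the uncontrolled dynamics to tf. It is a hedged example, not the implementation developed in this work: f_rhs stands for any user-supplied right-hand side of the form of equation 2.37 with a = 0, and the default gains simply reproduce the classical values of equation 2.44.

```python
import numpy as np
from scipy.integrate import solve_ivp

def zem_zev_acceleration(r, v, t, t_f, r_f, v_f, f_rhs, K_R=6.0, K_V=-2.0):
    """Generalized ZEM/ZEV command of eq. 2.46 with ZEM and ZEV from eq. 2.45.
    f_rhs(t, s) is the uncontrolled dynamics (eq. 2.37 with a = 0) returning
    [vx, vy, vz, ax, ay, az]; the defaults K_R = 6, K_V = -2 recover eq. 2.44."""
    t_go = t_f - t
    s0 = np.concatenate([r, v])
    sol = solve_ivp(f_rhs, (t, t_f), s0, rtol=1e-10, atol=1e-10)
    r_nc, v_nc = sol.y[:3, -1], sol.y[3:, -1]   # projected final state with no control
    zem = np.asarray(r_f) - r_nc
    zev = np.asarray(v_f) - v_nc
    return (K_R / t_go**2) * zem + (K_V / t_go) * zev
```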

2.3 Machine learning

In this section, an overview of the machine learning methods used in the implementation of the algorithm is presented. In particular, focus is put on Reinforcement Learning (RL), specifically stochastic actor-critic algorithms and Extreme Learning Machines (ELM).


2.3.1 Reinforcement learning

Reinforcement Learning (RL) can be defined as the formalization of learning by trial and error: it is based on the idea that a machine can autonomously learn the optimal behavior, or policy, to carry out a particular task, given the environment, by maximizing (or minimizing) a cumulative reward (or cost).

Reinforcement learning for optimal control

Reinforcement Learning is closely related to classical optimal control theory. Both reinforcement learning and optimal control address the problem of finding an optimal policy (or control policy), in other words a way of interacting with the environment, that optimizes an objective function, be it a cost to be minimized or a reward to be maximized. They both rely on the notion of a system described by an underlying set of states, controls and a model describing the transitions between states given the action; practically speaking, a physical model. However, optimal control is somewhat limited by the fact that it assumes perfect knowledge of the system model. When this is true, optimal control offers very good guarantees, but the strong assumption of perfect knowledge of the system often limits its applicability in practice. In contrast, reinforcement learning operates directly on measured data and rewards/costs obtained from interaction with the environment. This is very powerful in cases in which the environment is not known a priori, and it is why it has been selected for this study. RL algorithms work on systems that are formalized as Markov Decision Processes [39, 41, 43].

Markov decision processes

The reinforcement learning problem is generally modeled as a Markov Decision Process (MDP), which is composed of: a state space X; an action space U; an initial state distribution with density p1(x1) representing the initial state of the system; a transition dynamics distribution with conditional density

\[ p(x_{t+1}\,|\,x_t, u_t) = \int_{x_{t+1}} f(x_t, u_t, x')\,dx' \tag{2.47} \]

representing the dynamic relationship between a state and the next, given action u (it should be noted that if the dynamics of the system is considered completely deterministic, this probability is always 0 except when action ut brings the state from xt to xt+1); and a reward function r: X × U → R that depends, in general, on the previous state, the current state and the action taken. The reward function r is assumed to be bounded. A policy is used by the agent to select actions given a certain state. The policy is stochastic and denoted by πϑ : X → P(U), where P(U) is the set of probability measures on U, ϑ ∈ Rⁿ is a vector of n parameters and πϑ(ut|xt) is the probability of selecting action ut given state xt. The agent uses the policy to interact with the MDP and generate a trajectory made of a sequence of states, actions and rewards. The return

\[ r_t^{\gamma} = \sum_{k=t}^{\infty} \gamma^{k-t}\, r(x_k, u_k) \tag{2.48} \]

is the discounted reward along the trajectory from time step t onward, with 0 < γ ≤ 1. The agent's goal is to obtain a policy that maximizes the discounted cumulative reward from the start state to the end state, denoted by the performance objective J(π) = E[r1γ | π]. Denoting the density at state x' after transitioning for t time steps from state x by p(x → x', t, π), the discounted state distribution is

\[ \rho^{\pi}(x') := \int_X \sum_{t=1}^{\infty} \gamma^{t-1}\, p_1(x)\, p(x \to x', t, \pi)\, dx \tag{2.49} \]

The performance objective can then be written as an expectation:

\[ J(\pi_{\vartheta}) = \int_X \rho^{\pi}(x) \int_U \pi_{\vartheta}(x, u)\, r(x, u)\, du\, dx = \mathbb{E}_{x \sim \rho^{\pi},\, u \sim \pi_{\vartheta}}\left[r(x, u)\right] \tag{2.50} \]

where E_{x∼ρπ} denotes the improper expected value with respect to the discounted state distribution ρ(x).
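For concreteness, the discounted return of equation 2.48 for a finite sampled trajectory can be computed with a simple backward recursion; the short Python sketch below is illustrative only (the reward values are made up), and averaging G[0] over many sampled trajectories gives a Monte-Carlo estimate of J(π).

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted return of eq. 2.48 at every step of a finite trajectory,
    G[t] = sum_{k >= t} gamma^(k - t) * r_k, via a backward recursion."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# Averaging G[0] over many trajectories sampled with the current policy gives a
# Monte-Carlo estimate of the performance objective J(pi). Rewards here are made up.
print(discounted_returns([0.0, 0.0, 1.0, 0.0, 5.0], gamma=0.9))
```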

During training, the agent has to estimate the reward-to-go function J for a given policy π: this procedure is called policy evaluation. The resulting estimate of J is called a value function, and two definitions of it exist. The state value function

\[ V^{\pi}(x) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{k+1} \,\Big|\, x_0 = x, \pi\right] \tag{2.51} \]

only depends on the state x. The state-action value function

\[ Q^{\pi}(x, u) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{k+1} \,\Big|\, x_0 = x, u_0 = u, \pi\right] \tag{2.52} \]

depends on the state x but also on the action u. The relationship between the two is:

\[ V^{\pi}(x) = \mathbb{E}_{u \sim \pi}\left[Q^{\pi}(x, u)\right] \tag{2.53} \]

The above-mentioned V and Q in recursive form become:

\[ V^{\pi}(x) = \mathbb{E}\left[\rho(x, u, x') + \gamma V^{\pi}(x')\right] \tag{2.54} \]

and

\[ Q^{\pi}(x, u) = \mathbb{E}\left[\rho(x, u, x') + \gamma Q^{\pi}(x', u')\right] \tag{2.55} \]

which are called the Bellman equations. Optimality for both Vπ and Qπ is governed by the Bellman optimality equation. Let V*(x) and Q*(x, u) be the optimal value and action-value functions respectively; the corresponding Bellman optimality equations are:

\[ V^{*}(x) = \max_{u}\, \mathbb{E}\left[\rho(x, u, x') + \gamma V^{*}(x')\right] \tag{2.56} \]

\[ Q^{*}(x, u) = \mathbb{E}\left[\rho(x, u, x') + \gamma \max_{u'} Q^{*}(x', u')\right] \tag{2.57} \]

The goal of reinforcement learning is to find the policy π that maximizes Vπ, Qπ or J(πϑ); in other words, to find V* or Q* satisfying the Bellman optimality equation.

Stochastic policy gradient theorem

Policy gradient algorithms are among the most popular classes of continuous action and state space reinforcement learning algorithms. The fundamental idea on which they are based is to adjust the parameters ϑ of the policy πϑ in the direction of the performance objective gradient ∇ϑJ(πϑ). The biggest challenge is to compute the gradient ∇ϑJ(πϑ) effectively, so that at each iteration the policy becomes better than the one at the previous iteration. It turns out, from the work by Williams [46] who theorized the REINFORCE algorithms, that the gradient of the performance objective can be estimated using samples from experience, that is, without actually computing it and without complete knowledge of the environment (such methods are sometimes referred to as model-free algorithms). A direct implication of [46] is the policy gradient theorem:

\[ \nabla_{\vartheta} J(\pi_{\vartheta}) = \int_X \rho^{\pi}(x) \int_U \nabla_{\vartheta}\pi_{\vartheta}(u|x)\, Q^{\pi}(x, u)\, du\, dx = \mathbb{E}_{x \sim \rho^{\pi},\, u \sim \pi_{\vartheta}}\left[\nabla_{\vartheta}\log \pi_{\vartheta}(u|x)\, Q^{\pi}(x, u)\right] \tag{2.58} \]

where Q(x, u) is the state-action value function expressing the expected total discounted reward obtained by being in state x and taking action u. The theorem is important because it reduces the computation of the performance gradient, which could be hard to compute analytically, to an expectation that can be estimated with a sample-based approach. It is important to note that this estimate is demonstrated to be unbiased, so it assures that a policy is at least as good as the one at the previous iteration. Once ∇ϑJ(πϑ) is computed, the policy update is simply done in the direction of the gradient:

\[ \vartheta_{k+1} = \vartheta_k + \alpha_k \nabla_{\vartheta} J_k \tag{2.59} \]

where α is the learning rate and is supposed to be bounded.

One important issue to be addressed is how to estimate the Q function effectively; all of the above is in fact valid when Q represents the true action-value function. In the case of continuous action and state spaces, obtaining an unbiased estimate of it is difficult. One of the simplest approaches is to use the single-sample discounted return rtγ to estimate Q, which is the idea behind the REINFORCE algorithm [46]. This is demonstrated to be unbiased, but the variance is high, which leads to slow convergence. One way of estimating the action-value function that reduces the variance while keeping the error contained is the introduction of a critic in the algorithm.
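A minimal sketch of the REINFORCE estimator just described is given below, assuming a Gaussian policy with a linear mean; this parametrization is an arbitrary choice for illustration and not the one actually used in this work, which is described in Chapter 4. The sampled discounted return G_t replaces Q^π in equation 2.58 and the parameters move along the estimated gradient as in equation 2.59.

```python
import numpy as np

def gauss_logpi_grad(theta, x, u, sigma=0.1):
    """Gradient of log pi_theta(u|x) for a Gaussian policy with mean theta^T x
    and fixed standard deviation sigma (an illustrative parametrization)."""
    mean = theta.T @ x
    return np.outer(x, u - mean) / sigma**2

def reinforce_update(theta, trajectory, alpha=1e-3, gamma=0.99, sigma=0.1):
    """One REINFORCE step: the sampled discounted return G_t replaces Q in eq. 2.58
    and the parameters move along the estimated gradient as in eq. 2.59.
    trajectory is a list of (state, action, reward) tuples."""
    grad = np.zeros_like(theta)
    G = 0.0
    for x, u, r in reversed(trajectory):   # accumulate G_t backwards in time
        G = r + gamma * G
        grad += gauss_logpi_grad(theta, x, u, sigma) * G
    return theta + alpha * grad / len(trajectory)
```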

Stochastic actor-critic algorithm

The actor-critic is a widely used architecture based on the policy gradient. It consists of two major components. The actor adjusts the parameters ϑ of the stochastic policy πϑ(x) by stochastic gradient ascent (or descent). The critic evaluates the goodness of the generated policy by estimating some kind of value function. If a critic is present, an estimated action-value function Qw(x, u) is used in equation 2.58 instead of the true action-value function Qπ(x, u). Provided, in fact, that the estimator is compatible with the policy parametrization, meaning that

\[ \frac{\partial Q_{w}(x, u)}{\partial w} = \frac{\partial \pi(x, u)}{\partial \vartheta}\,\frac{1}{\pi(x, u)} \tag{2.60} \]

then Qw(x, u) can be substituted for Qπ(x, u) in 2.58 and the gradient still assures improvement by moving in that direction. It is important to note that Qw(x, u) is required to have zero mean for each state: Σu π(x, u) Qw(x, u) = 0, ∀x ∈ X. In this sense it is better to think of Qw(x, u) as an approximation of the advantage function Aπ(x, u) = Qπ(x, u) − Vπ(x) rather than of Qπ(x, u). This is in fact what will be used in the following.

In general, introducing an estimate of the action-value function may introduce bias, but the overall variance of the method is decreased, which ultimately leads to faster convergence. The critic network's goal is to estimate the action-value function, providing a better estimate of the expectation of the reward than the single-sample reward-to-go given state x and action u. This happens because the action-value function is estimated from an average over all the samples, not just from the ones belonging to a single trajectory. This will become clearer in Chapter 4, where the details of the algorithm are discussed.
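The overall scheme can be summarized by the generic sketch below, which is an illustration only; the actual A-ZEM/ZEV actor-critic is detailed in Chapter 4. It reuses the hypothetical gauss_logpi_grad helper from the previous sketch, and critic stands for any regressor exposing fit/predict, for instance the Extreme Learning Machine of the following section, used here as a state-value baseline so that the policy gradient is weighted by an advantage estimate.

```python
import numpy as np
# reuses the hypothetical gauss_logpi_grad helper from the REINFORCE sketch

def actor_critic_update(theta, critic, trajectories, alpha=1e-3, gamma=0.99, sigma=0.1):
    """One generic actor-critic iteration: the critic is fit to the sampled returns and
    used as a state-value baseline, so the actor gradient of eq. 2.58 is weighted by an
    advantage estimate instead of the raw single-sample return."""
    states, actions, returns = [], [], []
    for traj in trajectories:                       # 1. sample collection
        G = 0.0
        for x, u, r in reversed(traj):
            G = r + gamma * G
            states.append(x); actions.append(u); returns.append(G)
    states, returns = np.array(states), np.array(returns)
    critic.fit(states, returns)                     # 2. critic step (policy evaluation)
    advantages = returns - critic.predict(states)
    grad = np.zeros_like(theta)                     # 3. actor step (policy improvement)
    for x, u, adv in zip(states, actions, advantages):
        grad += gauss_logpi_grad(theta, x, u, sigma) * adv
    return theta + alpha * grad / len(states)
```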

2.3.2 Extreme learning machines

Extreme Learning Machines (ELM) are a particular kind of Single Layer Feedforward Network (SLFN) with a single layer of hidden neurons which does not use backpropagation as its learning algorithm. Backpropagation is a multiple-step iterative process; ELMs instead use a learning method which allows for learning in a single step. The concepts behind ELMs had already been present in the scientific community for years before Huang theorized and formally introduced them as Extreme Learning Machines in 2004 [21, 22]. According to their creator, they can produce very good results with a learning time that is a fraction of the time needed by algorithms based on backpropagation.

Consider a simple SLFN; the universal approximation theorem states that any continuous target function f(x) can be approximated by SLFNs with a set of hidden nodes and appropriate parameters. Mathematically speaking, given any small positive ε, for SLFNs with a sufficiently large number of neurons L it is verified that:

$\|f_L(x) - f(x)\| < \varepsilon$ (2.61)

where

$f_L(x) = \sum_{i=1}^{L} \beta_i h_i(x) = H(x)\beta$ (2.62)

is the output of the SLFN, β being the output weights matrix and H(x) = σ(Wx + b) the output of the hidden layer for input x, with W and b being the input weights and biases vectors respectively. σ is the activation function of the hidden neurons. A representation of an SLFN can be seen in Figure 2.4. In conventional SLFN, input weights wi, biases

bi and output weights βi are learned via backpropagation3. ELM are a particular type

of SLFN that have the same structure but only βi are learned, while input weights and

3 Backpropagation is an optimization technique based on iteratively updating the weights and biases of a neural network according to the gradient of the loss function to be minimized. It is called backpropagation because the error is calculated at the output and distributed back through the network layers. Details in [19].


Figure 2.4: Single layer feedforward network

biases are assigned randomly at the beginning of training without the knowledge of the training data and are never changed. It is demonstrated that, for any randomly generated set {W, b} of input weights and biases,

$\lim_{L \to \infty} \|f(x) - f_L(x)\| = 0$ (2.63)

holds if the output weights matrix β is chosen so that it minimizes 2.61, which is equivalent to saying that it minimizes the loss function $\|f(x) - f_L(x)\|$. Equation 2.61, after some manipulation, becomes

$\|H\beta - Y\| < \varepsilon$ (2.64)

where $Y = [y_1, \ldots, y_N]^T$ are the target labels and $H = [h^T(x_1), \ldots, h^T(x_N)]$ the hidden layer output. Given N training samples $\{x_i, y_i\}_{i=1}^{N}$, the training problem is reduced to:

Hβ = Y (2.65)

The output weights are then simply:

$\beta = \tilde{H} Y$ (2.66)

where $\tilde{H}$ is the Moore-Penrose generalized inverse of H. This is demonstrated to minimize the loss 2.64, given a large enough set of training points and neurons. This is another way of saying that β represents the minimum-norm least-squares solution of 2.65.

This will be used in the actor-critic algorithm and will be explained in Chapter 4.
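As a minimal sketch of the one-step training described above (Python/NumPy, assuming a sigmoid activation; names and sizes are illustrative and not the networks used in Chapter 4):

import numpy as np

def elm_fit(X, Y, L=100, seed=0):
    # X: (N, n_in) training inputs, Y: (N, n_out) training targets
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], L))     # random input weights, never trained
    b = rng.standard_normal(L)                   # random biases, never trained
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))       # hidden layer output, Eq. 2.62
    beta = np.linalg.pinv(H) @ Y                 # minimum-norm least-squares solution, Eq. 2.66
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

The only trained quantity is beta, obtained in a single step through the pseudoinverse, which is what makes the training time so short compared to back-propagation.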


Problem Formalization

3.1 NRO rendezvous problem formalization

The problem faced in this project is the creation of a guidance algorithm for performing rendezvous in the context of cislunar Near Rectilinear Orbits. Rendezvous is defined as the sequence of maneuvers that a chaser spacecraft has to perform in order to bring itself alongside a target spacecraft, considered passive and non-maneuvering. The operations, as defined in [6], are often divided into two phases: a "far approach" phase, starting at the departure of the chaser from the phasing orbit and ending at the beginning of robust relative navigation, and a "close approach" phase, starting at the end of the first phase and ending with the docking. The problem has already been formalized in [6]: Campolo, starting from regulations on rendezvous in two-body dynamics, defined the guidelines for rendezvous in the CRTBP. In the following, a summary of that study is presented in order to formally define the problem faced and to clarify some of the choices made in the following chapters.

Since the cislunar short-term relative dynamics are quasi-straight, the constraints and safety procedures developed for the faster dynamics in the neighborhood of a strong central body are no longer valid. ESA, in collaboration with NASA, has defined some guidelines for future Orion missions in the cislunar environment. The regulations define four areas around the target, related to different phases of the rendezvous procedure:

• Keep-Out Sphere (KOS) - roughly 200 m radius sphere around the target center of mass

• Approach/Departure Corridors - Shapes defined within the KOS. They are generally defined with respect to the target spacecraft; in the case of ISS rendezvous, they have a conical shape with a half-cone angle of 15°. They are the only regions within the KOS in which the chaser is allowed to move.

• Approach Sphere (AS) - The AS is a 1 km radius sphere around the center of mass of the target. The Approach Initiation (AI) burn is the first burn allowed to target within the AS. Integrated operations must begin before the chaser is on a trajectory that would enter the AS and may end when the chaser is outside the AS and confirmed to be on a safe free-drift trajectory.

• Rendezvous Sphere (RS) - The RS is a 10 km radius sphere around the target center of mass and is used to govern the Rendezvous Orbit Entry (ROE) decision.

Figure 3.1: Rendezvous spheres


In this work, however, the approach corridor within the KOS is not conical but rather a combination of a conical and a cylindrical corridor; its shape will be described in detail in Chapter 5, where the numerical results are shown.
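To make the geometry concrete, a minimal containment check for the zones defined above could be sketched as follows (Python/NumPy; the radii and the 15° half-cone angle come from the guidelines quoted above, while the corridor axis and all function names are illustrative assumptions, not the implementation of Chapter 5):

import numpy as np

KOS_RADIUS = 200.0       # m, Keep-Out Sphere
AS_RADIUS = 1000.0       # m, Approach Sphere
RS_RADIUS = 10000.0      # m, Rendezvous Sphere

def zone(r_rel):
    # r_rel: chaser position relative to the target [m]
    d = np.linalg.norm(r_rel)
    if d < KOS_RADIUS:
        return "KOS"
    if d < AS_RADIUS:
        return "AS"
    if d < RS_RADIUS:
        return "RS"
    return "outside"

def in_conical_corridor(r_rel, axis, half_angle_deg=15.0):
    # True if the chaser lies inside a cone of the given half-angle around 'axis'
    axis = axis / np.linalg.norm(axis)
    d = np.linalg.norm(r_rel)
    if d == 0.0:
        return True
    return np.dot(r_rel, axis) / d >= np.cos(np.radians(half_angle_deg))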

In this project, the focus is only on the close-approach part of the problem. Specifically, it is assumed that the most critical part is the one related to precision guidance inside the AS, so this is the environment in which the algorithm is developed. Although this choice has been made, the concepts are general and can be applied to the broader rendezvous problem as a whole.

Chaser and target are assumed to be already on the same NRO, at a relative distance of about 1 km, hence within the limits of the Approach Sphere as described in the guidelines. Their motion is described by equations 2.5 in the non-dimensional synodic reference frame. These equations, however, are not well suited to describing the relative guidance and control problem, so the introduction of relative reference frames and relative dynamics equations is necessary.

3.2 NRO relative motion

The motion of the chaser as seen from the target-centered reference frame is defined as relative motion. In the three-body environment, and for NROs in particular, relative motion has not been studied as extensively as for LEOs; [6] is one of the first examples of formalization of the problem. The solution proposed there is interesting and is the starting point for this thesis work. In the following, the reference frames and the equations of motion used in the rest of the work are presented.

3.2.1 Reference frames

The absolute dynamics in the Earth-Moon CRTBP are developed in the absolute synodic non-dimensional frame Rem introduced in section 2.1. The description of rendezvous

dynamics, however, is normally done using a reference frame relative to the target. In the case of two-body dynamics this is generally the Local-Vertical-Local-Horizon frame (LVLH). The LVLH frame has been defined also for the CRTBP [6]. The problem is intrinsically different with respect to the two-body case. It has been demonstrated, however, that the short-term NRO dynamics can be described in an LVLH frame defined with respect to a Moon-Centered Inertial (MCI) frame. It should be noted that this is not a consequence of the


shape of NROs being similar to Keplerian orbits in the vicinity of the Moon, but rather of their dynamics. The halo-like shape is in fact obtained in the Rem frame, which is not inertial.

In the following sections all the above-mentioned reference frames are formally introduced, together with the Earth-Moon relative synodic frame (EMRS), which will become important later. The changes of coordinates between the reference frames can be found in [6]; since this is not the main focus of this thesis, they are not reported in full here.

The Earth-Moon synodic frame (EMS)

The Earth-Moon synodic frame Rem is a rotating non-dimensional frame centered in

the center of mass of the system. It was already introduced in Chapter 2 and is reported here for completeness. The x-axis is aligned with the line passing through the centers of Earth and Moon, so that the two bodies are fixed points in this frame. The z-axis is aligned with the angular momentum of the system and the y-axis completes the right-handed triad. The fixed positions of the two primaries $r^e_{em}$ and $r^m_{em}$ in this frame are:

$r^e_{em} = \begin{bmatrix} -\mu & 0 & 0 \end{bmatrix}^T$ (3.1)

$r^m_{em} = \begin{bmatrix} 1-\mu & 0 & 0 \end{bmatrix}^T$ (3.2)

The frame is rotating and it is normalized using the normalization parameters defined in section 2.1.

The Moon-Centered Inertial frame (MCI)

The MCI reference frame Rm is an inertial frame defined as follows:

• The origin is in the center of the Moon

• The z-axis is aligned with the z-axis of Rem

• At t = 0, Rem and Rm overlap

Letting $r^p_m$ be the absolute position vector in $R_m$, the change of coordinates between $R_{em}$ and $R_m$ is:

$\begin{pmatrix} r^p_m \\ \dot{r}^p_m \end{pmatrix} = R_s(t) \begin{pmatrix} r^p_{em} - r^m_{em} \\ \dot{r}^p_{em} \end{pmatrix}$ (3.3)


where

$R_s(t) = \begin{pmatrix} R(t) & 0 \\ \dot{R}(t) & R(t) \end{pmatrix}$ (3.4)

where 0 is the 3×3 null matrix and R(t) is the rotation matrix:

$R(t) = \begin{bmatrix} \cos(t) & -\sin(t) & 0 \\ \sin(t) & \cos(t) & 0 \\ 0 & 0 & 1 \end{bmatrix}$ (3.5)
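As a sketch of how this change of coordinates could be evaluated numerically (Python/NumPy, in the non-dimensional units in which the synodic frame rotates at unit angular rate; function and variable names are illustrative):

import numpy as np

def synodic_to_mci(r_em, v_em, r_moon_em, t):
    # r_em, v_em: position and velocity of a point in the synodic frame (non-dimensional)
    # r_moon_em: fixed Moon position in the synodic frame, [1 - mu, 0, 0]
    c, s = np.cos(t), np.sin(t)
    R = np.array([[c, -s, 0.0],
                  [s,  c, 0.0],
                  [0.0, 0.0, 1.0]])           # rotation matrix R(t), Eq. 3.5
    Rdot = np.array([[-s, -c, 0.0],
                     [ c, -s, 0.0],
                     [0.0, 0.0, 0.0]])        # time derivative of R(t)
    r_rel = r_em - r_moon_em                  # Moon-centered position in the synodic frame
    r_m = R @ r_rel                           # position block of Eq. 3.3
    v_m = Rdot @ r_rel + R @ v_em             # velocity block of Eq. 3.3
    return r_m, v_m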

The NRO Local Vertical Local Horizon frame (LVLH)

The NRO Local Vertical Local Horizon frame $R_l$ is a target-centered, time-dependent reference frame used to intuitively describe the relative motion between chaser and target. The frame depends on the instantaneous motion of the target in the MCI frame. Let $x^u_m = (r^u_m \;\; \dot{r}^u_m)^T$ be the state of the target in $R_m$ and let $(e_1, e_2, e_3)$ be the orthonormal basis associated to $R_l$; then the basis can be defined as:

• The e1-e3 plane is the instantaneous plane of motion of the target

• e3 is opposite to the position vector $r^u_m$ in $R_m$

• e2 is normal to the instantaneous plane of motion of the target, opposite to its angular momentum with respect to the Moon

• e1 completes the right-hand orthonormal system

This is obtained operatively as:

$e_3 = -\frac{r^u_m}{\|r^u_m\|}, \qquad e_2 = -\frac{r^u_m \times \dot{r}^u_m}{\|r^u_m \times \dot{r}^u_m\|}, \qquad e_1 = e_2 \times e_3$ (3.6)
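A minimal numerical construction of this basis could look as follows (Python/NumPy; the inputs are the target position and velocity in the MCI frame, and the names are illustrative):

import numpy as np

def lvlh_basis(r_target_m, v_target_m):
    # Build the (e1, e2, e3) LVLH basis from the target state in the MCI frame, Eq. 3.6
    e3 = -r_target_m / np.linalg.norm(r_target_m)
    h = np.cross(r_target_m, v_target_m)        # angular momentum w.r.t. the Moon
    e2 = -h / np.linalg.norm(h)
    e1 = np.cross(e2, e3)
    return np.vstack((e1, e2, e3))              # rows are the LVLH axes expressed in MCI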

Earth-Moon Relative Synodic reference system (EMRS)

Another reference system is needed to describe the relative motion along NROs: the relative synodic reference system $R_{rem}$. It is the target-centered, relative version of the synodic frame $R_{em}$. Its axes are aligned with the latter at all times and its origin, as said above, is coincident with the target. In this reference frame the state is simply obtained by differencing the chaser and target states in the synodic



Figure 3.2: Reference systems utilized

non-dimensional reference frame. Its importance will become clear in the next section where the relative equations of motion are presented.

A representation of all the reference systems can be seen in Figure 3.2, and a detailed explanation of the transformations between them can be found in [6].

3.2.2 Relative equations of motion

Once the reference frames have been defined, the relative motion equations can be introduced. Following again the work by Campolo [6], the relative motion in NROs can fundamentally be described using two models, depending on the position along the orbit. It has been shown that in the portion of the orbit where the gravitational influence of the Moon is strong, i.e. close to the periselene, the problem dynamically resembles the two-body problem, hence the Clohessy-Wiltshire (CW) equations expressed in the LVLH frame can be employed with little error. In the other regions of the orbit, the Non-Linear Relative equations (NLR) defined in the relative synodic reference system (EMRS) must be employed instead.
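For reference, the textbook form of the CW equations is recalled below in a standard Hill-type frame (x radial, y along-track, z cross-track, n the mean motion of the reference circular orbit, a the control acceleration). This is only a reminder of the model structure under that standard axis convention; the LVLH axes defined above follow a different orientation, so it should not be read as the exact formulation used in [6]:

$\ddot{x} - 3n^2 x - 2n\dot{y} = a_x, \qquad \ddot{y} + 2n\dot{x} = a_y, \qquad \ddot{z} + n^2 z = a_z$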
