

Figure 4.7: Relaxed scan (RS) C13=C14 photoisomerization paths along S1 for L83Q^ASR_AT, WT^ASR_AT and W76S/Y179F^ASR_AT, respectively, computed in Ref. 49 with the original ARM. (A)-(C) CASPT2//CASSCF/AMBER/6-31G(d) energy profiles along the S1 (squares) isomerization paths. S0 (diamonds) and S2 (triangles) profiles along the S1 path are also given. The S1 path is computed as a relaxed scan along the C12-C13=C14-C15 dihedral angle. Adapted with permission from Marín et al.[49].

otherwise, the rhodopsin is discarded.

In our example (see Figure 4.6 and Figure 4.7), the RS of L83Q^ASR_AT is barrierless (see Figure 4.7A) and the variant is discarded, whereas WT^ASR_AT (see Figure 4.7B) and W76S/Y179F^ASR_AT (see Figure 4.7C) present an E^f_S1 of ca. 2.8 and 6.4 kcal mol⁻¹, respectively, and are used to build the list of potentially fluorescent candidates. Finally, the computed E^f_S1 values are compared and correlated with the ESL. Please note that, using the protocol presented here, L83Q would have already been discarded at a previous step. In order to perform Phase III on L83Q, one would need to manually choose a starting structure, since there is no PLA a-ARM QM/MM model. Marín et al. decided to use an extrapolated model when performing the original calculation.

4.1.4 Protocol Automation

Above I have described the three phases (sections 4.1.1-4.1.3) that make it possible to categorize rhodopsin variants as dim-fluorescent or enhanced-fluorescent systems. As anticipated above, I focused on the fast-arising fluorescence of the variant DA state (mechanism A of Figure 4.1) and not on that of its photocycle intermediates (mechanism B of Figure 4.1). Each phase is implemented in ARM as an independent, stand-alone module that operates automatically (i.e., via predefined command-line arguments) and, in principle, provides useful, but specific, information on the fluorescent features of a target set of variants specified in the calculation input. More specifically, Phase I is driven by the a_arm_emission module (see Section A.2.2.2), Phase II by the a_arm_fc module (see Section A.2.2.3) and Phase III by the a_arm_relaxed_scan module (see Section A.2.2.4).

In order to achieve a high level of automation and, potentially, a high-throughput screening of rhodopsin variants with enhanced fluorescence, I designed and implemented a general driver that links Phases I-III in the ARM framework (see Appendix A). This is the a_arm_fluorescence_searcher driver illustrated in Figure 4.8 which automates the pipeline connecting an input list of target rhodopsin variants along with their S₀ a-ARM


[Flowchart of Figure 4.8]

Input: list of rhodopsin variants, each with its a-ARM QM/MM final equilibrated structure.

Phase I: Location of the First Excited State Minimum. First excited state (S₁) geometry optimization. Decision: S₁ minimum? No: discard candidate. Yes: continue to Phase II.

Phase II: Computation of Semiclassical Franck-Condon (FC) Trajectories. Franck-Condon trajectory. Decision: conical intersection reached within ≈ 200 fs? Yes: discard candidate. No: continue to Phase III.

Phase III: Calculation of the Excited State Reaction Path. Relaxed scan along the isomerization coordinate. Decision: torsional barrier? No: dim fluorescence, discard candidate. Yes: significant fluorescence, select candidate.

Output: list of potential candidates.

Figure 4.8: General workflow of the three-phase a-ARM rhodopsin fluorescence screening protocol. This diagram displays the methodology for the automatic search of fluorescent rhodopsins. The protocol is composed of three phases: i) location of the first excited state minimum, ii) Franck-Condon trajectory calculation, and iii) relaxed scan along the isomerization path; each of these phases serves as a criterion to select/discard possible fluorescent candidates.

QM/MM models, to a new list containing the potentially fluorescent candidates, along with their corresponding computed trends in maximum absorption (λ^a_max) and emission (λ^f_max) wavelengths and energy barriers (E^f_S1).

In other words, the driver provides a “one-click” architecture that performs all the operations required by the protocol (i.e., quantum chemistry calculations, FC trajectory and RS calculations, classification of rhodopsins) without any user decision or intervention beyond the provided input, and also produces formatted tables and graphical representations. This makes possible the fast and parallel study of large arrays of rhodopsins.

For example, the current work presents fluorescence analyses conducted in parallel on a set of 27 rhodopsins. Considering the time required to build and process all 27 models manually, the presented research would not have been feasible within a reasonable time frame without the achieved automation. In fact, to the best of our knowledge, this is the first reported effort to provide a unified platform for the automatic search of fluorescent proteins.

The default parameters for each phase (e.g., number of states to be computed, number of constraints and step size for the relaxed scan) are predefined in a single input file, as shown in Figure 4.9. All of these parameters can be customized at the input level; however, the user is advised to use the default values, which were determined via the benchmark calculations. In addition, it is mandatory that all the files of the S₀ a-ARM QM/MM model (see Section A.2.1.3) share the root name provided in the &LIST_OF_TARGET_RHODOPSINS section. For instance, for RHODOPSIN_1, one must place in the root folder the following bundle of files: WT_ASR_AT.Final.xyz, WT_ASR_AT.key, WT_ASR_AT.Espf.Data, WT_ASR_AT.JobIph, WT_ASR_AT.pdb and WT_ASR_AT.cavity, which contain the information necessary for the different calculations.
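As an illustration only, the file-bundle requirement and the creation of one working sub-folder per rhodopsin could be checked along the following lines. This is a minimal sketch, not the driver's actual code: the function names `missing_files` and `prepare_workdirs` are hypothetical, and only the six file extensions are taken from the text.

```python
from pathlib import Path

# File extensions listed in the text for one S0 a-ARM QM/MM model.
REQUIRED_SUFFIXES = (".Final.xyz", ".key", ".Espf.Data", ".JobIph", ".pdb", ".cavity")


def missing_files(root: Path, root_name: str) -> list:
    """Return the required file names that are absent from `root` (hypothetical helper)."""
    return [
        root_name + suffix
        for suffix in REQUIRED_SUFFIXES
        if not (root / (root_name + suffix)).is_file()
    ]


def prepare_workdirs(root: Path, root_names: list) -> dict:
    """Check every bundle first, then create one working sub-folder per rhodopsin."""
    problems = {}
    for name in root_names:
        absent = missing_files(root, name)
        if absent:
            problems[name] = absent
    if problems:
        raise FileNotFoundError(f"Incomplete model bundles: {problems}")
    workdirs = {}
    for name in root_names:
        workdir = root / name
        workdir.mkdir(exist_ok=True)  # safe to call again when a run is restarted
        workdirs[name] = workdir
    return workdirs
```

Checking all bundles before creating any folder means that an incomplete input is reported in one pass rather than one variant at a time.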

The driver starts by verifying that all the files specified above are in the root folder for each rhodopsin variant. Then, one working sub-folder is generated for each rhodopsin. After this preparation step, the protocol starts with the parallel execution of Phase I (i.e., via the a_arm_emission module), as shown in Figure 4.8, for all variants (i.e., each on a different processor). Phase I finishes with the output file of the S₁ geometry optimization (i.e., PLA structure). The convergence of this calculation is evaluated by using the following criterion:


&GENERAL_INFO
project_name : ASR_variants

&LIST_OF_TARGET_RHODOPSINS
NUMBER_OF_RHODOPSINS : 3
RHODOPSIN_1 : WT_ASR_AT
RHODOPSIN_2 : L83Q_ASR_AT
RHODOPSIN_3 : W76S-Y179F_ASR_AT

&EMISSION_MODULE
N_ROOTS_S1 : 2
M_ROOTS_S1 : 3

&FC_MODULE
N_ROOTS_FC : 2
M_ROOTS_FC : 3
N_STEPS : 400
GRAD : NONE

&RS_MODULE
N_ROOTS_RS : 2
M_ROOTS_RS : 3
STEP_SIZE_RS : 5

Figure 4.9: Input file required for the a-ARM rhodopsin fluorescence screening protocol. Example of input file, in yaml format, for the a_arm_fluorescence_searcher driver. Each section starts with an & command and defines the parameters to be passed to the driver (general name of the project and list of rhodopsin variants to be analyzed) and to the individual modules that it controls.
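To make the structure of the input file in Figure 4.9 concrete, the sketch below reads the `&SECTION` / `KEY : value` layout into nested dictionaries. It is an assumption-laden illustration: it presumes one key per line, the function names `parse_searcher_input` and `target_rhodopsins` are hypothetical, and the parser actually used by the driver is not described in the text.

```python
EXAMPLE_INPUT = """\
&GENERAL_INFO
project_name : ASR_variants

&LIST_OF_TARGET_RHODOPSINS
NUMBER_OF_RHODOPSINS : 3
RHODOPSIN_1 : WT_ASR_AT
RHODOPSIN_2 : L83Q_ASR_AT
RHODOPSIN_3 : W76S-Y179F_ASR_AT

&FC_MODULE
N_ROOTS_FC : 2
M_ROOTS_FC : 3
N_STEPS : 400
GRAD : NONE
"""


def parse_searcher_input(text: str) -> dict:
    """Map each '&SECTION' header to a dict of its 'KEY : value' lines (values kept as strings)."""
    sections = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("&"):
            current = line[1:]
            sections[current] = {}
        elif ":" in line and current is not None:
            key, value = line.split(":", 1)
            sections[current][key.strip()] = value.strip()
        else:
            raise ValueError(f"Cannot parse line: {raw_line!r}")
    return sections


def target_rhodopsins(sections: dict) -> list:
    """Return the root names RHODOPSIN_1 ... RHODOPSIN_N in order."""
    block = sections["LIST_OF_TARGET_RHODOPSINS"]
    count = int(block["NUMBER_OF_RHODOPSINS"])
    return [block[f"RHODOPSIN_{index}"] for index in range(1, count + 1)]


if __name__ == "__main__":
    print(target_rhodopsins(parse_searcher_input(EXAMPLE_INPUT)))
```

Reading `NUMBER_OF_RHODOPSINS` first and then looking up each numbered key makes a mismatch between the declared count and the listed variants fail loudly with a `KeyError`.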

if the geometry optimization converges within 100 optimization steps, the rhodopsin is considered a potentially fluorescent candidate and continues to Phase II (section 4.1.2); otherwise, the rhodopsin is discarded. The output is a list of potentially fluorescent candidates, along with their λ^f_max, calculated from the PLA structure. Phase II starts (i.e., via the a_arm_fc module) immediately after the presence of the PLA is verified. When the FC trajectory calculation reaches a threshold time of 200 fs, the following criterion is used: if the CI has not been reached, the rhodopsin is considered a potentially fluorescent candidate, the FC calculation continues until it completes 500 fs, and the variant then passes to Phase III (section 4.1.3); otherwise, the rhodopsin is discarded. The output is a list of potentially fluorescent candidates, along with the corrected λ^f_max, this time calculated as the average ∆E^f_S1−S0 along the FC trajectory.[49] Finally, Phase III is computed (i.e., via the a_arm_relaxed_scan module) for the candidates selected in Phase II. The main output is a list of potentially fluorescent variants, along with the calculated E^f_S1. In addition, the output files of Phases II and III include both a graphical representation and the raw data of the computed FC trajectory and photoisomerization path, respectively. This includes information not only on the S₀, S₁ and S₂ energy profiles, but also on complementary properties,


such as: Mulliken charges calculated for the reactive fragment, oscillator strength, bond length alternation (BLA) and hydrogen-out-of-plane (HOOP).
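The three selection criteria described above can be summarized as a short decision function. The sketch below is illustrative rather than the driver's implementation: the `VariantResult` container and the `classify` function are hypothetical names, and only the thresholds (100 optimization steps, 200 fs, presence of a torsional barrier) come from the text.

```python
from dataclasses import dataclass
from typing import Optional

MAX_OPT_STEPS = 100       # Phase I: the S1 optimization must converge within this many steps
CI_THRESHOLD_FS = 200.0   # Phase II: variants reaching the CI by this time are discarded


@dataclass
class VariantResult:
    """Hypothetical per-variant summary of the three phases."""
    name: str
    opt_steps: Optional[int]       # steps needed to converge, None if not converged
    ci_time_fs: Optional[float]    # time at which the CI was reached, None if never reached
    barrier_kcal: Optional[float]  # E^f_S1 from the relaxed scan, None if barrierless


def classify(result: VariantResult) -> str:
    """Apply the Phase I, II and III criteria in order and stop at the first failure."""
    if result.opt_steps is None or result.opt_steps > MAX_OPT_STEPS:
        return "discarded (Phase I: no PLA)"
    if result.ci_time_fs is not None and result.ci_time_fs <= CI_THRESHOLD_FS:
        return "discarded (Phase II: CI reached)"
    if result.barrier_kcal is None or result.barrier_kcal <= 0.0:
        return "discarded (Phase III: barrierless)"
    return "candidate"


# The barriers of 2.8 and 6.4 kcal/mol are the values quoted in the text;
# the step counts are made-up placeholders used only to exercise the function.
examples = [
    VariantResult("WT_ASR_AT", opt_steps=40, ci_time_fs=None, barrier_kcal=2.8),
    VariantResult("W76S-Y179F_ASR_AT", opt_steps=40, ci_time_fs=None, barrier_kcal=6.4),
    VariantResult("L83Q_ASR_AT", opt_steps=None, ci_time_fs=None, barrier_kcal=None),
]

if __name__ == "__main__":
    for item in examples:
        print(item.name, "->", classify(item))
```

Evaluating the phases in order mirrors the workflow of Figure 4.8, in which a variant discarded in an earlier phase never reaches the later, more expensive calculations.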

I stress that, unlike Phase I as explained above, neither Phase II nor Phase III starts at the same time for all the variants, since once the driver is launched each rhodopsin is processed as a separate thread. This architecture avoids dead times and makes it possible to easily restart the calculations without the need to start from scratch in case of technical problems (e.g., the cluster is turned off). As further described in Section A.2.2, each of the three phases is composed of different stages/routines that work as a thread (i.e., the input of one stage is the output of the previous stage). Therefore, the stages and phases communicate with each other through a communicator file with the extension *.finished. In this regard, when one routine of the module finishes, a signal is produced by generating the communicator file, which contains information on the module name (e.g., a_arm_emission) and the current stage of the module. The information contained in such a file is managed by the a_arm_crontab module to schedule the execution of the next routine via the Python crontab utility. This procedure is the same for each of the modules and drivers of the ARM package.
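The communicator-file mechanism can be pictured with the following sketch. It is hypothetical throughout: the text only states that a `*.finished` file records the module name and the current stage, so the file layout, the stage names and the helper functions below are assumptions, and the scheduling step performed by a_arm_crontab is not reproduced.

```python
from pathlib import Path
from typing import Optional


def write_finished(workdir: Path, module: str, stage: str) -> Path:
    """Signal that `stage` of `module` is done by writing a '<module>.finished' file."""
    marker = workdir / f"{module}.finished"
    marker.write_text(f"module: {module}\nstage: {stage}\n")
    return marker


def read_finished(marker: Path) -> dict:
    """Read back the 'key: value' lines of a communicator file."""
    fields = {}
    for line in marker.read_text().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def next_stage(stages: list, finished_stage: str) -> Optional[str]:
    """Return the stage that follows `finished_stage`, or None after the last one."""
    index = stages.index(finished_stage)
    return stages[index + 1] if index + 1 < len(stages) else None


# Stage names are placeholders, not the routines of the actual module.
EMISSION_STAGES = ["stage_1", "stage_2", "stage_3"]
```

Because the state of each variant lives in a file rather than in memory, a scheduler that is restarted after an interruption can read the last communicator file and resume from the following stage.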

4.2 Benchmarking and application of the protocol

In this section I report on the performance of the proposed a-ARM rhodopsin fluorescence screening protocol, illustrated in Figure 4.8, as a computational tool for the parallel and, therefore, relatively fast screening of large arrays of light-emitting rhodopsins (mechanism A of Figure 4.1). To this aim, I employ three different sets of rhodopsins, each intended for a specific scope, as follows:

In Section 4.2.1 I introduce and discuss the first set, from now on called the benchmark set. As observed in Table 4.2, it is composed of 43 rhodopsin variants with available experimental data on λ^a_max, ranging from 470 nm to 628 nm. It includes vertebrate (V), invertebrate (I), microbial (M) and heliorhodopsin (H) variants that feature either all-trans, 11-cis or 9-cis rPSB chromophore isomers. The objective of this set is to extend the previously reported benchmark of a-ARM models,[62, 75, 79] presented in Figure 3.4, for testing the reliability of the ground-state (S₀) equilibrium QM/MM models generated with the a-ARM rhodopsin model building protocol,[61, 62, 75] based on their ability to reproduce experimental trends in λ^a_max. In addition, I discuss the presence of a few customized, rather than default, models. This is important to establish the quality and limitations of the automatically generated input S₀ structures for fluorescence screening. I stress that, as indicated in Table 4.2, this building protocol takes advantage of the availability of several X-ray crystallographic structures. The models of the set members for which an experimental structure is not available (e.g., several mutants) were built via a comparative (homology) modeling protocol.

The second and third sets are actually subsets of the benchmark set. In Section 4.2.2


Table 4.2: Benchmark, application and search sets including wild-type and mutant rhodopsins.

Name type^a variant PDB-ID RET-C^b code absorption^c: (nm) (kcal mol⁻¹) (eV)

benchmark set

Rh V WT 1U19[22] 11-cis WT^Rh_11C 498[22] 57.4 2.49

JSR1 I WT 6I9K[127] 9-cis WT^JSR1_9C 505[127] 56.6 2.45

JSR1 I WT 6I9K[127] 11-cis WT^JSR1_11C 535[127] 53.4 2.32

ChR2 M WT 6EID[89] all-trans WT^ChR2_AT 470[89] 60.8 2.64

BPR M WT 4JQ6[83] all-trans WT^BPR_AT 490[83] 58.3 2.53

HeR-48C12 H WT 6UH3[128] all-trans WT^HeR-48C12_AT 541[128] 52.8 2.29

TaHeR H WT 6IS6[23] all-trans WT^TaHeR_AT 541[23] 52.8 2.29

KR2 M WT 6REW[21] all-trans WT^KR2_AT 528[21] 54.1 2.35

D116N all-trans D116N^KR2_AT 565[29] 50.6 2.19

GVirus M WT 6JO0[16] all-trans WT^GVirus_AT 509[16] 56.2 2.43

OLPVRII M WT 6SQG[17] all-trans WT^OLPVRII_AT 514[17] 55.6 2.41

RxR M WT 6KFQ[129] all-trans WT^RxR_AT 540[129] 52.9 2.29

bR M WT 6G7H[81] all-trans WT^bR_AT 568[81] 50.4 2.19

PoXeR M WT^e all-trans WT^PoXeR_AT 564[29] 50.7 2.20

D216N all-trans D216N^PoXeR_AT 571[29] 50.1 2.17

application set

Arch3 M WT 6GUX all-trans WT^Arch3_AT 556[120] 51.4 2.23

D95E/T99C^d all-trans D95E/T99C^Arch3_AT 626[120] 45.6 1.98

D95E/T99C/V59A^d all-trans D95E/T99C/V59A^Arch3_AT 622[120] 46.0 1.99

D95E/T99C/P60L^d all-trans D95E/T99C/P60L^Arch3_AT 624[120] 45.8 1.99

D95E/T99C/P196S^d all-trans D95E/T99C/P196S^Arch3_AT 628[120] 45.5 1.97

Arch5^d all-trans Arch5^Arch3_AT 622[120] 46.0 1.99

Arch7^d all-trans Arch7^Arch3_AT 616[120] 46.4 2.01

QuasAr1^d all-trans QuasAr1^Arch3_AT 580[116] 49.3 2.14

QuasAr2^d all-trans QuasAr2^Arch3_AT 590[116] 48.5 2.10

Archon2^d all-trans Archon2^Arch3_AT 581[118] 49.2 2.13

search set

ASR M WT 1XIO[80] all-trans WT^ASR_AT 550[80] 52.0 2.25

V112N all-trans V112N^ASR_AT 532[130] 53.7 2.33

W76F all-trans W76F^ASR_AT 529[130] 54.0 2.34

L83Q all-trans L83Q^ASR_AT 517[49] 55.3 2.40

P206C all-trans P206C^ASR_AT 542[29] 52.8 2.29

P206H all-trans P206H^ASR_AT 525[29] 54.5 2.36

P206K all-trans P206K^ASR_AT 519[29] 55.1 2.39

P206Q all-trans P206Q^ASR_AT 527[29] 54.4 2.36

P206Y all-trans P206Y^ASR_AT 529[29] 54.1 2.35

S214D all-trans S214D^ASR_AT 550[29] 52.0 2.25

S214D/D217E all-trans S214D/D217E^ASR_AT 547[29] 54.0 2.34

S86D all-trans S86D^ASR_AT 549[29] 52.1 2.26

W76S/Y179F all-trans W76S/Y179F^ASR_AT 488[49] 58.6 2.54

Y73Q all-trans Y73Q^ASR_AT 552[29] 51.8 2.25

D217N all-trans D217N^ASR_AT 554[29] 51.5 2.23

D217E all-trans D217E^ASR_AT 555[29] 51.6 2.24

E36Q all-trans E36Q^ASR_AT 554[29] 51.5 2.23

D75E all-trans D75E^ASR_AT 526[29] 54.4 2.36

^a Vertebrate (V), invertebrate (I), microbial (M) and heliorhodopsin (H); ^b retinal configuration; ^c experimental maximum absorption wavelength, λ^a_max, expressed in nm and eV and as first vertical excitation energy, ∆E^a_S1−S0, in kcal mol⁻¹. Structures obtained via comparative modeling using as template ^d 6GUX and ^e 4TL3.[86]

I introduce and discuss the second set, hereinafter referred to as the application set. As observed in Table 4.2, it incorporates the 10 microbial (Archaea) rhodopsin variants (wild-type Archaerhodopsin-3 and 9 mutants) of the benchmark set with available experimental data not only on λ^a_max, but also on photophysical properties related to their fluorescent behavior. Such properties are λ^f_max, ESL and φ^f (see Table 4.1). As mentioned above, Arch3-based rhodopsin variants have been experimentally demonstrated to be fluorescent and, in specific


cases, employed as fluorescent probes in optogenetics experiments. Therefore, this set is used for testing the ability of the proposed screening protocol to correctly predict trends in rhodopsin fluorescence and, therefore, to select the most likely fluorescent candidates.

As anticipated above, this is done by using computational criteria based on the values of the transition oscillator strength (f_Osc) at an S₁ stable structure and the barrier height along the S₁ isomerization path, to assess a qualitative consistency with the available observed photophysical data (e.g., most relevantly, φ^f).

Finally, in Section 4.2.3 I introduce and discuss the third set, called the search set. As observed in Table 4.2, it includes 14 rhodopsin variants whose fluorescent behavior has not yet been characterized and 3 that have been previously studied in Ref. 49, for a total of 17 variants. However, all the variants have available experimental data on λ^a_max (see Table 4.2).

More specifically, the set includes WT ASR_AT (WT^ASR_AT) and 16 mutants. The latter are intended to extend the set of three ASR variants studied in Ref. 49, by selecting mutations of residues that either directly interact with the chromophore or form part of the chromophore cavity.⁴ This set is employed for predicting the excited-state behavior of existing (i.e., successfully expressed in the lab) rhodopsin mutants and for ranking these mutants according to their chances of being fluorescent, which will then have to be experimentally confirmed.

Notice that, while the scope of this work is the selection of fluorescent rhodopsin candidates, the produced S₀ and S₁ equilibrium structures can also be employed in subsequent mechanistic studies, which are, however, outside the scope of the present work. Further studies devoted to the elucidation of the fluorescence mechanism of rhodopsins from the application set are currently ongoing in our laboratory (doctoral thesis of Ph.D. candidate Leonardo Barneschi).

4.2.1 Benchmark Set Results

As a first step, I computed the trend in Maximum absorption wavelength (λ^a_max)⁵ for the benchmark set and contrasted it with experimental data (see Table 4.3 and Figure 4.10).

This was done to assess the quality of the automatically built ARM QM/MM models representing the input S₀ equilibrium structures for the proposed a-ARM rhodopsin fluorescence screening protocol (see Figure 4.8). To this aim, I employed an error of 4.0 kcal mol⁻¹ as accuracy threshold, as previously determined for the a-ARM rhodopsin model building protocol in paper [I] (see section 3.1.1).[61, 62, 75, 79]
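For readers who wish to reproduce the unit conversions behind the comparison, the sketch below converts an experimental λ^a_max into a vertical excitation energy and applies the 4.0 kcal mol⁻¹ threshold. It is a worked illustration under stated assumptions: the function names are hypothetical, the conversion constants are standard physical constants rather than values quoted in the text, and treating an error exactly equal to the threshold as acceptable is a choice made here.

```python
HC_EV_NM = 1239.841984       # h*c in eV nm
KCAL_PER_EV = 23.060548      # 1 eV expressed in kcal/mol
THRESHOLD_KCAL = 4.0         # accuracy threshold quoted in the text


def wavelength_to_ev(wavelength_nm: float) -> float:
    """Vertical excitation energy in eV for a maximum at `wavelength_nm`."""
    return HC_EV_NM / wavelength_nm


def wavelength_to_kcal(wavelength_nm: float) -> float:
    """Vertical excitation energy in kcal/mol for a maximum at `wavelength_nm`."""
    return wavelength_to_ev(wavelength_nm) * KCAL_PER_EV


def within_threshold(calc_kcal: float, exp_nm: float,
                     threshold: float = THRESHOLD_KCAL) -> bool:
    """True if |calculated - experimental| does not exceed `threshold` (kcal/mol)."""
    return abs(calc_kcal - wavelength_to_kcal(exp_nm)) <= threshold


if __name__ == "__main__":
    # WT ASR: experimental maximum at 550 nm, computed 52.3 kcal/mol (Table 4.3).
    print(round(wavelength_to_kcal(550.0), 1), round(wavelength_to_ev(550.0), 2))
    print(within_threshold(52.3, 550.0))
```

For the 550 nm example the conversion gives about 52.0 kcal mol⁻¹ (2.25 eV), in line with the experimental entry for WT^ASR_AT in Table 4.2.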

Figure 4.10 displays, for each member of the benchmark set, the computed average λ^a_max (green up-triangles) from the N =10 independently generated a-ARM replicas (see Section A.2.1.3 and Refs. 61, 62 and 75) expressed in terms of Vertical Excitation energy (∆E^a_S1−S0).

Each value is given along with its error bar (i.e. standard deviation) (Figure 4.10A) and

⁴ Further details on the generation of the ARM QM/MM models for the ASR mutants are provided in Section 3.1.4 (see Table 3.1).

⁵ The ARM QM/MM models were generated using the a-ARM version of the protocol (section 3.1.1) implemented in ARM as the a_arm_protocol (section A.2.1).


Table 4.3: Ground-state vertical excitation energy (∆E^a_S1−S0, in kcal mol⁻¹, with eV in parentheses), maximum absorption wavelength (λ^a_max, nm), and oscillator strength (f_Osc), calculated using the a-ARM_default and the a-ARM_customized approaches. Differences between calculated and experimental data (∆∆E^a,Exp_S1−S0, ∆λ^a,Exp_max) are also presented.

Columns: Model; Experimental ∆E^a,Exp_S1−S0 and λ^a,Exp_max; Calculated^a ∆E^a_S1−S0, λ^a_max and f_Osc; Error ∆∆E^a,Exp_S1−S0 and ∆λ^a,Exp_max.

Model                       ∆E^a,Exp     λ^a,Exp  ∆E^a (calc)        λ^a   f_Osc  ∆∆E^a,Exp     ∆λ^a,Exp

benchmark set
WT^HeR-48C12_AT             52.8 (2.29)  541      52.8 ± 0.7 (2.29)  541   1.12   0.0 (0.00)    0
WT^TaHeR(c)_AT              52.8 (2.29)  541      55.2 ± 0.4 (2.40)  518   1.22   2.4 (0.10)    -23
WT^JSR1_9C                  56.6 (2.46)  505      55.8 ± 0.7 (2.42)  512   0.96   -0.8 (-0.03)  7
WT^JSR1_11C                 53.4 (2.32)  535      52.8 ± 0.7 (2.29)  541   0.83   -0.6 (-0.03)  6
WT^Rh_11C                   57.4 (2.49)  498      57.5 ± 0.5 (2.49)  497   0.88   0.1 (0.00)    -1
WT^BPR(c)_AT                58.3 (2.53)  490      58.0 ± 1.0 (2.52)  493   0.73   -0.3 (-0.01)  3
WT^ChR2(c)_AT               60.8 (2.64)  470      62.2 ± 0.6 (2.70)  459   1.13   1.4 (0.06)    -11
D216N^PoXeR_AT              50.1 (2.17)  571      50.6 ± 0.4 (2.20)  565   1.44   0.6 (0.02)    -6
WT^bR_AT                    50.3 (2.18)  568      53.9 ± 0.3 (2.33)  531   1.22   3.6 (0.15)    -37
D116N^KR2_AT                50.6 (2.19)  565      52.5 ± 0.3 (2.28)  544   1.39   1.9 (0.08)    -21
WT^PoXeR_AT                 50.7 (2.20)  564      50.4 ± 0.3 (2.19)  567   1.47   -0.3 (-0.01)  3
WT^RxR_AT                   52.9 (2.30)  540      56.7 ± 0.5 (2.45)  504   1.09   3.8 (0.15)    -36
WT^OLPVRII_AT               55.6 (2.41)  514      54.8 ± 0.1 (2.38)  521   1.16   -0.8 (-0.03)  7
WT^KR2(c)_AT                54.1 (2.35)  528      56.0 ± 0.0 (2.43)  511   1.16   1.9 (0.08)    -17
WT^GVirus_AT                56.2 (2.44)  509      55.6 ± 1.0 (2.41)  515   1.25   -0.6 (-0.03)  5

application set
WT^Arch3_AT                 51.4 (2.23)  556      54.3 ± 0.7 (2.35)  527   1.25   2.9 (0.12)    -29
Arch5^Arch3_AT              46.0 (1.99)  622      48.0 ± 0.4 (2.08)  596   1.34   2.0 (0.09)    -26
Arch7^Arch3_AT              46.4 (2.01)  616      48.8 ± 0.1 (2.12)  586   1.29   2.4 (0.10)    -30
Archon2^Arch3_AT            49.2 (2.13)  581      53.0 ± 0.8 (2.30)  540   1.31   3.8 (0.16)    -41
QuasAr1^Arch3_AT            49.3 (2.14)  580      53.0 ± 0.4 (2.30)  540   1.28   3.7 (0.16)    -40
QuasAr2^Arch3_AT            48.5 (2.10)  590      51.8 ± 0.6 (2.25)  552   1.38   3.3 (0.14)    -38
D95E/T99C^Arch3_AT          45.7 (1.98)  626      49.7 ± 0.8 (2.16)  575   1.41   4.0 (0.18)    -51
D95E/T99C/P60L^Arch3_AT     45.5 (1.97)  628      49.9 ± 0.6 (2.17)  573   1.35   4.4 (0.19)    -55
D95E/T99C/P196S^Arch3_AT    46.0 (1.99)  622      47.3 ± 0.2 (2.05)  604   1.41   1.4 (0.06)    -18
D95E/T99C/V59A^Arch3_AT     45.8 (1.99)  624      49.7 ± 0.0 (2.15)  576   1.35   3.8 (0.17)    -48

search set^c
WT^ASR_AT                   52.0 (2.25)  550      52.3 ± 0.2 (2.27)  547   1.29   0.3 (0.01)    -3
Y73Q^ASR_AT [R1]            51.8 (2.25)  552      52.5 ± 0.7 (2.28)  544   1.27   0.7 (0.03)    -8
S214D^ASR_AT [R1]           52.0 (2.25)  550      53.2 ± 0.4 (2.31)  538   1.26   1.2 (0.05)    -12
S86D^ASR_AT [R1]            52.1 (2.26)  549      52.4 ± 0.6 (2.27)  545   1.28   0.4 (0.02)    -4
P206C^ASR_AT [R1]           52.8 (2.29)  542      52.9 ± 0.7 (2.29)  541   1.32   0.1 (0.01)    -1
P206H^ASR_AT [R1]           54.5 (2.37)  525      56.0 ± 0.2 (2.43)  510   1.11   1.5 (0.07)    -15
P206K^ASR(c)_AT [R3]        55.1 (2.40)  519      55.3 ± 0.2 (2.40)  517   1.19   0.1 (0.01)    -2
P206Q^ASR_AT [R1]           54.3 (2.36)  527      54.4 ± 0.3 (2.36)  526   1.07   0.1 (0.00)    -1
P206Y^ASR_AT [R1]           54.1 (2.35)  529      53.4 ± 1.5 (2.32)  536   1.19   -0.7 (-0.03)  +7
S214D/D217E^ASR_AT [R1/R1]  53.3 (2.27)  547      54.0 ± 0.4 (2.35)  529   1.23   1.7 (0.07)    -18
D217N^ASR_AT [R1]           51.5 (2.24)  554      52.6 ± 0.5 (2.29)  544   1.31   1.1 (0.05)    -10
D217E^ASR_AT [R1]           51.6 (2.24)  555      52.6 ± 0.8 (2.29)  544   1.28   1.0 (0.04)    -11
E36Q^ASR_AT [R1]            51.6 (2.24)  555      52.9 ± 0.4 (2.30)  540   1.29   1.3 (0.06)    -14
D75E^ASR_AT [R1]            54.4 (2.37)  526      53.2 ± 0.3 (2.31)  538   1.28   -1.2 (-0.05)  +12
V112N^ASR(c)_AT [R3]        53.7 (2.33)  532      54.1 ± 1.6 (2.35)  528   1.23   0.4 (0.02)    -4
W76F^ASR_AT [R1]            54.0 (2.34)  529      55.3 ± 0.7 (2.40)  517   1.14   1.3 (0.06)    -12
L83Q^ASR(c)_AT [R2]         55.3 (2.40)  517      55.7 ± 0.4 (2.42)  513   1.02   0.4 (0.02)    -4
W76S/Y179F^ASR_AT [R1/R1]   58.6 (2.54)  488      56.8 ± 2.0 (2.46)  503   1.06   -1.8 (-0.08)  15

AD_max^b: 4.4 (0.19)
MAE ± MAD of ∆∆E^Exp_S1−S0^b: 1.5 ± 1.1 (0.07 ± 0.05)

^a Average value of 10 replicas ± the corresponding standard deviation.
^b The 36 rhodopsins are considered.
^c The selected rotamer is specified in square brackets (see Section 3.1.4).
The (c) symbol stands for customized models constructed with the a-ARM_customized approach.

[Figure 4.10, plots. Panel A: ∆E^a_S1−S0 (kcal mol⁻¹ and eV) for each rhodopsin variant of the benchmark set, application set and search set; legend: a-ARM_default (N = 10), a-ARM_customized (N = 10), Experimental. Panel B: ∆∆E^a,Exp_S1−S0 (kcal mol⁻¹ and eV) for each rhodopsin variant.]

Figure 4.10: Extended benchmark of the a-ARM protocol, in terms of reproduction of experimental trends in λ^a_max (benchmark set, application set and search set). The computed data were obtained using the a_arm_protocol driver, where the average value from the N = 10 replicas is plotted (green up-triangles) (A) along with the corresponding error bars (B). Models were constructed with the a-ARM_default approach, with the exception of those circled in red, which were constructed with the a-ARM_customized approach. (x axis: M indicates microbial, H heliorhodopsin, V vertebrate and I invertebrate rhodopsins).

difference with respect to corresponding experimental data (∆^Exp_calc∆E^a_S1−S0) (Figure 4.10B).

When all models were generated automatically with the a-ARM_default approach, the results show that 39 out of the 43 models (91%) have an absolute ∆^Exp_calc∆E^a_S1−S0 value below 4.0 kcal mol⁻¹. The four outliers were customized with the a-ARM_customized approach, so as to improve the quality of the models.⁶ As illustrated in Table 4.4, the customization of the WTs was achieved by exploring different choices for the protonation state of certain ionizable residues based on either chemical reasoning or experimental information. Moreover, the customization that requires a choice among the side-chain conformations for 3 of the ASR_AT mutants is explained in section 3.1.4.

As will be further discussed in section 5.1.3 and paper [V], KR2 is a good case study for customization. The default model of this rhodopsin has two negatively charged aspartic acid residues, Asp-116 and Asp-251, forming the counterion complex around the rPSB. However, as previously discussed in paper [I], two negative charges outweigh the single positive charge of the rPSB, producing a λ^a_max that is ca. 15 kcal mol⁻¹ blue-shifted with respect to the experimental value. Instead, the protonation/neutralization of the second counterion (Asp-251), through a customized model, allows a more charge-balanced model

⁶ Both the a-ARM_default and a-ARM_customized approaches are described in section 3.1.1 and detailed in papers [I] and [III].


Table 4.4: Setup of the protonation states for a-ARM_default and a-ARM_customized models. The residues with different protonation states are highlighted. Asp and Glu are deprotonated, while Ash and Glh are protonated.

Rhodopsin      a-ARM_default       a-ARM_customized
WT^BPR_AT      GLH-90, GLH-124     GLH-90, GLU-124
WT^TaHeR_AT    HIE-23, HID-82      HIS-23, HIE-82
WT^KR2_AT      ASP-251, GLH-160    ASH-251, GLH-160

Charge of histidine: +1 when both the δ-nitrogen and the ε-nitrogen of the imidazole ring are protonated (HIS), while it is neutral when only the δ-nitrogen (HID) or only the ε-nitrogen (HIE) is protonated.

that reproduces the experimental λ^a_max with an error of ca. 1.9 kcal mol⁻¹.

While the construction of customized models (i.e., through the modification of the protonation state of specific residues) made it possible to obtain ∆^Exp_calc∆E^a_S1−S0 values falling within the error bar, it has to be acknowledged that this is not possible when a fully automated preparation of the input is required. Most importantly, in the absence of experimental λ^a_max values (e.g., for rhodopsin variants not yet expressed and/or spectroscopically studied in the lab) it will not be possible to detect which model has to be customized. This also applies to the subroutine for mutant generation presented in section 3.1.4, where the choice of the mutated side-chain rotamer relies on the experimental λ^a_max. Additional limitations and pitfalls related to the ground-state a-ARM QM/MM model building are explicitly described in section 3.1.1 (see page 51). These issues impose a first limit on the quality of an automated fluorescent rhodopsin screening and, therefore, an error bar.

In conclusion, the general trend in absorption energy can be qualitatively reproduced by using the a-ARM protocol for the 43 rhodopsin variants reported in Table 4.2. The corresponding ARM QM/MM models can be used for the ensuing excited-state calculations, as detailed in sections 4.2.2 and 4.2.3. However, if a fully automated procedure were to be used for the input generation, only 91% of the models would be of a quality matching the selected threshold. Methods for improving the prediction of the residue protonation states during the construction of ARM QM/MM models are presently being investigated in our laboratory (see, for instance, a preliminary study in Ref. 93).
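The summary statistics used in this benchmark (fraction of models within the threshold, mean absolute error and mean absolute deviation) can be computed as sketched below. The sketch rests on assumptions: the function names are hypothetical, and MAD is taken here as the mean absolute deviation of the absolute errors around the MAE, a definition that the text does not spell out. The example errors are the ∆∆E values of the application set from Table 4.3.

```python
from statistics import mean


def mae(errors: list) -> float:
    """Mean absolute error of signed errors (kcal/mol)."""
    return mean(abs(error) for error in errors)


def mad(errors: list) -> float:
    """Mean absolute deviation of the absolute errors around the MAE (assumed definition)."""
    centre = mae(errors)
    return mean(abs(abs(error) - centre) for error in errors)


def fraction_within(errors: list, threshold: float = 4.0) -> float:
    """Fraction of models whose absolute error does not exceed `threshold`."""
    return sum(abs(error) <= threshold for error in errors) / len(errors)


# Errors (kcal/mol) of the ten application-set models, copied from Table 4.3.
APPLICATION_ERRORS = [2.9, 2.0, 2.4, 3.8, 3.7, 3.3, 4.0, 4.4, 1.4, 3.8]

if __name__ == "__main__":
    print(round(mae(APPLICATION_ERRORS), 2), round(mad(APPLICATION_ERRORS), 2))
    print(fraction_within(APPLICATION_ERRORS))
    print(round(100 * 39 / 43))  # share of default models within the threshold
```

The last line reproduces the 91% quoted in the text from the 39 out of 43 default models that fall within the threshold.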

4.2.2 Application Set Results

In this Section I examine the results obtained by applying the protocol of Figure 4.8 to the application set: a set of experimentally investigated fluorescent rhodopsins for which default S₀ ARM QM/MM models have been automatically generated⁷. For each phase I discuss

⁷ As specified in Table 4.2, the initial structures for the Arch3-based mutants were generated via comparative (homology) modeling, using as template the X-ray structure of WT Arch3 [PDBID 6GUX].


the resulting screening procedure and the corresponding selection outcome.

4.2.2.1 Phase I: application set

As introduced in Section 4.1.1, in Phase I the protocol looks for the existence of an S₁ excited-state planar minimum (PLA) located near the FC region of the corresponding PES (see Figure 1.3) computed at the CASSCF level of theory and, therefore, corresponding to a rhodopsin where the rPSB structure is nearly planar. The existence of a PLA classifies the given candidate as potentially fluorescent and allows it to move on to Phase II. However, it does not provide information on the actual stability of the PLA, which is important for having a long enough ESL to promote light emission. Phase II and Phase III address this point.

Before discussing the outcome of Phase I, I first evaluate how the quality of the initial ground-state S₀ ARM QM/MM models is reflected in the quality of the output S₁ models. In this regard, I consider two different aspects. The first one (i) is related to the effect of the choice of the initial S₀ geometry on the location of the PLA structure. The second aspect (ii) is, instead, related to the possible dependency of the computed trend in λ^f_max on the quality of the computed trend in λ^a_max.

Regarding aspect (i), Phase I receives as input, for each variant, the representative S₀ a-ARM QM/MM structure, i.e., the replica whose λ^a_max is closest to the average value. Since Ref. 61 demonstrated that at least 10 replicas of the ground-state ARM QM/MM model are necessary for the correct description of λ^a_max, the question arises whether the evaluation of a single replica of the excited-state S₁ ARM QM/MM model is enough for the location of the PLA structure and the subsequent description of λ^f_max.

Figure 4.11A presents a close view of the trend in absorption energy (turquoise triangles) computed for the application set (i.e., average of N =10 replicas), taken from Figure 4.10.

Notice that, in this figure, the individual ∆E^a_S1−S0 computed for the chosen representative replica of the ARM QM/MM model (i.e., the one closest to the average) is also reported (dark-blue circles). As shown in the figure and detailed in Table 4.3, the standard deviation in ∆E^a_S1−S0 for members of the application set is relatively small, ranging between 0.0 and 0.8 kcal mol⁻¹. Accordingly, I do not expect the 10 S₀ ARM replicas of the same variant to feature significant structural differences and, thus, I expect that all of the replicas lead to the same PLA structure (or to no PLA).
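The choice of the representative replica (the one whose ∆E^a_S1−S0 is closest to the average over the replicas) amounts to a simple selection, sketched below. The function name `representative_replica` is hypothetical, the use of the sample standard deviation is an assumption, and the energies in the usage example are made-up numbers, not results from this work.

```python
from statistics import mean, stdev


def representative_replica(energies: list) -> tuple:
    """Return (index of the replica closest to the mean, mean, sample standard deviation)."""
    average = mean(energies)
    index = min(range(len(energies)), key=lambda i: abs(energies[i] - average))
    return index, average, stdev(energies)


if __name__ == "__main__":
    # Placeholder excitation energies (kcal/mol) for five replicas of one variant.
    replicas = [54.0, 54.5, 53.8, 55.1, 54.3]
    index, average, spread = representative_replica(replicas)
    print(index, round(average, 2), round(spread, 2))
```

When two replicas are equally close to the mean, `min` returns the first of them, so the selection is deterministic for a given ordering of the replicas.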

In order to corroborate the above hypothesis, I evaluated Phase I for the 10 replicas of a reduced sub-set of the application set. Notice that, for this test, the definition of a sub-set was needed considering the large number of required calculations and, consequently, the required computational wall-time.⁸ Thus, the members of the sub-set were chosen as those six variants heavily featured in a number of optogenetics studies (i.e., Arch3, Arch5,

⁸ As observed in Figure A.4, the execution of the a_arm_emission module for a single replica implies 7 QM/MM calculations. The evaluation of the 10 seeds for each of the 10 members of the application set would require around 700 calculations.

[Figure 4.11, plots. Panel A: ∆E^a_S1−S0 (kcal mol⁻¹ and eV) for each rhodopsin variant of the application set; legend: a-ARM (N = 10), a-ARM (N = 1), Exp. Panel B: ∆∆E^a,Exp_S1−S0 (kcal mol⁻¹ and eV) per rhodopsin variant. Panel C: ∆E^f_S1−S0 (kcal mol⁻¹ and eV) per rhodopsin variant; legend: PLA (N = 1), PLA (N = 10), FC, Exp. Panel D: ∆∆E^f,Exp_S1−S0 (kcal mol⁻¹ and eV) per rhodopsin variant.]

Figure 4.11: Trends in vertical absorption (∆E^a_S1−S0, left) and emission (∆E^f_S1−S0, right) energies for the rhodopsins of the application set. (A) Computed vertical absorption values (∆E^a_S1−S0) exclusively for the application set. Experimental (yellow triangles) and N = 10 replicas values (green up-triangles) were taken from Figure 4.10. The values relative to the replica used for the PLA calculations are included as dark-blue circles. (B) Difference between calculated and experimental data (∆^Exp_calc∆E^a_S1−S0). Green bars as in Figure 4.10, while dark-blue bars are relative to the replica used for the PLA. (C) Computed ∆E^f_S1−S0 (via the a_arm_emission module) using the representative replica (red squares), along with experimental data (indigo circles). Data are also presented (green triangles) for a subset of the application set, where values were computed as the average of ten replicas, with their corresponding error bars. Finally, values corrected with kinetic energy (via the a_arm_fc module) are shown as orange squares. (D) Difference between calculated and experimental data (∆^Exp_calc∆E^f_S1−S0). Bars are coloured according to the computed data reported in C. Data on emission for Arch3 are not presented, since its reported fluorescence comes from a photointermediate rather than from the DA state, as is the case for its variants.

Arch7, Archon2, QuasAr1 and QuasAr2). With this choice, the total number of QM/MM calculations was reduced from ca. 700, considering the whole application set, to ca. 420. The resulting trend in λ^f_max, with the corresponding standard deviation, is reported in Figure 4.11C (green triangles) for the variants with a located PLA.
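The calculation counts quoted above follow from simple bookkeeping, sketched here. The only input taken from the text is the figure of 7 QM/MM calculations per a_arm_emission run on a single replica. The function and constant names are hypothetical.

```python
CALCS_PER_REPLICA = 7  # QM/MM calculations per a_arm_emission run (see footnote 8)

def qmmm_budget(n_variants, n_replicas, per_replica=CALCS_PER_REPLICA):
    """Total number of QM/MM calculations for a set of variants."""
    return n_variants * n_replicas * per_replica

full_set = qmmm_budget(n_variants=10, n_replicas=10)  # whole application set
sub_set = qmmm_budget(n_variants=6, n_replicas=10)    # six-variant sub-set

print(f"full application set: {full_set} calculations")
print(f"six-variant sub-set:  {sub_set} calculations")
```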

Most of the variants of the sub-set were found to present a PLA structure for each of their 10 initial S₀ structures (Arch5, Arch7, Archon2, QuasAr1 and QuasAr2), with a standard deviation (see green error bar in panel C) comparable to the one obtained for absorption (see turquoise error bars in panel A). However, for WT Arch3, only 8 out of the 10 initial S₀ replicas provide a PLA structure. Therefore, the results of Phase I for Arch3 are not conclusive and it has to be evaluated in Phase II. Consequently, I have used the a_arm_fc module (see Section A.2.2.3) to compute the FC trajectory of Arch3.
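The way the Phase I outcome is read for a set of replicas can be summarized with a small sketch. This is an illustrative reading of the text, not the logic of the a_arm_emission module itself: the function name, the return labels and the treatment of the "no PLA in any replica" case are assumptions.

```python
def triage_phase_one(pla_found):
    """Classify a variant from per-replica PLA searches.

    pla_found: list of booleans, one per S0 replica (True if a PLA was located).
    """
    if not pla_found:
        raise ValueError("at least one replica is required")
    n_found = sum(pla_found)
    if n_found == len(pla_found):
        return "PLA located"        # e.g. Arch5, Arch7, Archon2, QuasAr1, QuasAr2
    if n_found == 0:
        return "no PLA"             # assumed label for variants without any PLA
    return "inconclusive: evaluate in Phase II"

# WT Arch3: 8 of the 10 initial S0 replicas provide a PLA structure.
arch3_outcome = triage_phase_one([True] * 8 + [False] * 2)
print(arch3_outcome)
```

Under this reading, a mixed result such as that of WT Arch3 is passed on to the FC trajectory of Phase II rather than being decided in Phase I.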

The trajectory energy profile plotted in Figure 4.12(a) shows a decay channel (located in the vicinity of a CI showing a ca. 90 degrees twisted C13=C14 bond), which is reached in less than 200 fs. This finding indicates that the WT^Arch3_AT candidate must be readily discarded as

Table 4.5: First excited-state vertical excitation energies (∆E_S1-S0, in kcal mol⁻¹, with eV in italics and in parentheses), maximum emission wavelengths (λ^f_max, nm), and oscillator strengths (fOsc), calculated using the a_arm_emission module. Differences between calculated and experimental data (∆∆E^f,Exp_S1-S0, ∆λ^f,Exp_max) are also presented.

Columns: Rhodopsin variant | Experimental: ∆E^f,Exp_S1-S0, λ^f,Exp_max | Calculated^a: ∆E^f_S1-S0, λ^f_max, fOsc | Error: ∆∆E^f,Exp_S1-S0, ∆λ^f,Exp_max

Arch5^Arch3_AT | 39.1 (1.70) | 731 | 34.1 (1.48) | 838 | 1.44 | -5.0 | 107
Arch7^Arch3_AT | 39.3 (1.70) | 727 | 33.9 (1.47) | 843 | 1.44 | -5.4 | 116
Archon2^Arch3_AT | 38.8 (1.69) | 735 | 35.8 (1.55) | 799 | 1.25 | -3.0 | 64
QuasAr2^Arch3_AT | 40.0 (1.73) | 715 | 36.2 (1.57) | 789 | 1.38 | -3.8 | 74
QuasAr1^Arch3_AT | 40.0 (1.73) | 715 | 35.6 (1.55) | 802 | 1.31 | -4.3 | 87
D95E/T99C^Arch3_AT | 39.1 (1.70) | 731 | 35.1 (1.52) | 814 | 1.50 | -4.0 | 83
D95E/T99C/V59A^Arch3_AT | 39.2 (1.70) | 728 | 36.4 (1.58) | 786 | 1.56 | -2.8 | 44
D95E/T99C/P196S^Arch3_AT | 39.1 (1.70) | 731 | 36.3 (1.58) | 787 | 1.60 | -2.8 | 56
D95E/T99C/P60L^Arch3_AT | 39.1 (1.70) | 731 | 35.0 (1.52) | 817 | 1.50 | -4.1 | 86

AD_max^b: 5.4 (0.23)
MAE ± MAD of ∆^Exp_calc∆E^f_S1−S0^b: 3.9 ± 0.8 (0.17 ± 0.03)

^a One replica, the one closest to the average.

its PLA structures are unstable, being prone to undergo double-bond photoisomerization.

This appears to be, at least partially, consistent with experimental data since, as reported in Table 4.1, WT^Arch3_AT features the lowest observed φ^f value, which is at least two orders of magnitude smaller than that of the other members of the set and, as discussed above, its fluorescence comes from an intermediate and not from the DA state.

On the other hand, with the aim of comparing the average ∆E^f_S1−S0 values with those produced by means of the representative replica, I plotted in Figure 4.11 (see red squares) the latter values for each member of the application set (see Table 4.5). As observed, the values computed with the representative replica are in line with the average of the 10 replicas. These results suggest that the use of a single replica provides a good estimate of the located PLA and, in turn, of the emission energy, reducing in this case the number of required calculations from ca. 420 to 60.
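The relation between the energy and wavelength columns of Table 4.5 can be checked with a short sketch. It assumes the standard conversion factors 1 eV = 23.0605 kcal mol⁻¹ and hc = 1239.84 eV nm. The function names are hypothetical and the example uses the Arch5 row of the table.

```python
KCAL_PER_EV = 23.0605   # kcal/mol per eV
HC_EV_NM = 1239.84      # h*c in eV*nm

def kcal_to_ev(energy_kcal):
    """Convert an energy gap from kcal/mol to eV."""
    return energy_kcal / KCAL_PER_EV

def kcal_to_nm(energy_kcal):
    """Convert a vertical energy gap (kcal/mol) to a wavelength (nm)."""
    return HC_EV_NM / kcal_to_ev(energy_kcal)

def emission_errors(e_calc, e_exp):
    """Return (energy error in kcal/mol, wavelength error in nm), calc minus exp."""
    return e_calc - e_exp, kcal_to_nm(e_calc) - kcal_to_nm(e_exp)

# Arch5 row of Table 4.5: calculated 34.1 kcal/mol, experimental 39.1 kcal/mol.
d_energy, d_lambda = emission_errors(e_calc=34.1, e_exp=39.1)
print(f"energy error: {d_energy:.1f} kcal/mol, wavelength error: {d_lambda:.0f} nm")
```

Because the wavelength is inversely proportional to the energy gap, an underestimated emission energy appears as a red-shifted (positive) wavelength error.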

From the above results, I conclude that the choice of a single replica represents a good compromise between computational cost and quality of the results.

Therefore, as shown in Figures 4.11C-D, Phase I was applied to the rest of the application set (D95E/T99C, D95E/T99C/V59A, D95E/T99C/P196S and D95E/T99C/P60L) using the representative replica.

I now analyze aspect (ii) above, focusing on the trends (i.e., for both absorption and emission) and evaluate whether they are consistent with the experimental observations.

As anticipated above, both the computed and observed trends, as well as the differences between computed and experimental λ^f_max for each rhodopsin, are plotted in Figure 4.11. I first analyze the trend in λ^a_max. Since the 9 mutants were generated from the same template X-ray structure (i.e., Arch3, PDBID 6GUX) via comparative modeling, I would expect all of them to present an error bar consistent with that of WT Arch3. However, a close inspection of the trend (Figure 4.11A-B) reveals that the shape of the turquoise curve describing the calculated results does not accurately resemble the shape of the yellow curve describing the experimental trend. Indeed, only the first 6 variants reproduce the