### Chapter 5

## Choice of the algorithm and per- formance review

### 5.1 Introduction

In this chapter we will compare the algorithms explained in chapter 4, and choose the one best suited for our needs. We will discuss the specific implementation we have selected and justify it.

In the last part of the chapter we will see the result of some test runs, and compare them with the results obtained, with the existing program we had available.

### 5.2 Comparison of the possible algorithms

In the previous chapter we discussed a number of possible approaches to the problem of computing the Green’s functions by means of the recursive Green’s function technique. We will now compare the results we found, but first of all we need to recall some important detail about the case in which every algorithm can be used.

What we are referring to is the fact that some algorithm only works for hard-wall potentials. The Block Algorithm and the Memory Algorithm

belong to this category, while the Simple Algorithm and both versions of the Metalidis-Bruno Algorithm have no such restriction.

First of all let us consider the case of a system characterized by a hard- wall potential. In Fig. 5.1 we can see a summary of the results we found.

We will use the Simple Algorithm as the reference for the comparisons, in both the case of hard-wall and that of soft-wall device potential. As in the previous chapter P will be the number of slices the probe is discretized into, and Pm the number of columns affected by the probe.

### Simple Algorithm Block Algorithm Memory Algorithm 3(P−1)/(P+1) + 2

### P+1 7

### Metalidis−Bruno Algorithm Metalidis−Bruno Algorithm by Slice

### by Column

### 2P P

_{m}

### COMPUTATIONAL COSTS

Fig. 5.1 Recap of the algorithms analyzed in chapter 4.

The first comparisons will be with the Block Algorithm. We want to know which one is faster, and, in case the answer is not definitive, we want to know when one is faster and how large the difference between their speed is. In order to do this we must solve the following problem:

Cost (SA) ≥ Cost (BA)

When we substitute the values, we find:

P + 1 ≥ 7

or rather

Cost (SA) ≥ Cost (BA) ⇒ P ≥ 6 (5.1)

The speed-up of the Block Algorithm compared to the Simple Algorithm is:

Speed − up SA BA

= P + 1

7 (5.2)

When comparing the Simple Algorithm with the Memory Algorithm we find:

P + 1 ≥ 3P − 1 P + 1 + 2

(P + 1)^{2}−3 (P − 1) − 2 (P + 1) = P^{2}−3P + 2 = (P − 2) (P − 1) ≥ 0

Obviously only values for P ≥ 4 make sense, since the Memory Algorithm can not be applied for P < 3, and for P = 3 its cost is computed using a different formula.

In the end we find:

Cost (SA) ≥ Cost (M A) ⇒ P ≥ 4 (5.3)

while for the speed-up:

P + 1
3^{P −1}_{P +1} + 2
which can be rewritten as

Speed − up SA M A

= P^{2}+ 2P + 1

5P − 1 (5.4)

It should be noted that the speed-up for probes with a very large number of slices, in both (5.4) and (5.2) is of order P . More precisely P/7 for (5.2) and P/5 for (5.4).

The comparison between the Metalidis-Bruno Algorithm by Slice and the Simple Algorithm is rather simple:

Cost (M BAC) ≥ Cost (SA) ⇒ P ≥ 0 (5.5)

so this time the speed-up is for the Simple Algorithm compared to the other one, since the Metalidis-Bruno Algorithm by Slice actually causes a speed down compared to the Simple Algorithm

We can see how the speed-up is essentially costant with P , and for really large P it approaches the limit value of 0.5.

The last comparison is straightforward, but requires some interpretation.

Cost (M BAC) ≥ Cost (SA) ⇒ Pm≥P (5.7)

Speed − up

SA

M BAC

= P + 1 Pm

(5.8)

If we sum the number of columns the P slices are made of, in the case of the Simple Algorithm, by definition we obtain Pm. So it is impossible for Pm to be smaller of P . It can, at most, be equal to it, when we work with slices made by one column only (or rather, when we use the Simple Algorithm on a by column basis).

What equation (5.8) means is that the speed down of the Metalidis- Bruno Algorithm by Column is proportional to the average number of columns per slice.

Thanks to the summary in Fig. 5.2 we can see immediatly that the Metalidis- Bruno idea performs very poorly for us, since in both cases it results in a speed down compared to the Simple Algorithm. On the other hand, both the Memory Algorithm and the Block Algorithm exhibit speed-ups proportional to P .

In case of an ideal system characterized by a hard-wall potential pro- file, the Memory Algorithm is, without a doubt, the clear winner. However

Other slices to compose?

Prepare slice n

Prepare slice n of the probe

Other slices to compose?

probe positions?

Other possible

Output routine Find eigenvalues and

eigenvectors

for isolated slice (and overlap matrices as needed)

Compute Green’s function Prepare slice n

Setup

(iterations on unperturbed system)

Probe spectroscopy (iterations on perturbed slices)

Yes

No Yes

Yes No No

Compose new slice Compose new slice

for the unperturbed part Compose with block(s)

START

END

### FLOW CHART

Fig. 5.2 Flow chart of the program.

we need to keep in mind that longitudinal discontinuities of the transverse potential affect negatively the speed of this algorithm.

In the realistic case of a soft-wall potential, sadly, the only algorithms we can use are the simple one and those based on the Metalidis-Bruno idea.

Which means we must choose the Simple Algorithm.

### 5.3 Structure of the program

We want a program capable of handling soft-wall potentials, which rules out the possibilities described in paragraph (4.8) about exploiting possible repetitions. In addition, as we have seen in the last paragraph, our best choice is the Simple Algorithm, since we want to use a by-slice approach.

For every system to be studied, the algorithm will consider the leftmost and rightmost slice to be “reflectionless” leads for the reasons we discussed in the first chapter. Thus, for those slices in the isolated case, the Green’s functions computed will be those for a semiinfinite chain.

For every other slice, the formula for the finite slice will be used when calculating the Green’s functions for the isolated slice.

Every iteration we will compute the Schr¨odinger equation. The results of such calculations will allow us to compute the Green’s function as well as the overlap matrix with every adjacent slice (after we have done such calculations for them too, of course).

As explained in section 4.4, we will take advantage of the idea of section 4.3 in order to avoid repetitions in our calculations. Thus the next step is

to do a scan from left to right and store the resulting Green’s functions, then repeat the scan in the opposite direction, storing again everything we computed.

Actually the programs does the two steps we just explained in the same cycle, so that we decrease data transfer across the memory, thus rising the chances to have our data on lower level cash. Since the L1 and L2 cash memories are much faster than the normal RAM (they are on chip after all), it can be worth paying attention to these small details.

The involved speed-up will not probably be very significant, but with many small gains we can save more time than we could imagine. For example, just by rearranging the code in a different way the program took 8.34% more time to completely solve a problem.

Once all the needed data have been computed, the program starts scan- ning the probe. For every iteration it will compute the new Green’s functions for an isolated slice, for every the slice affected by the probe, and connect them togheter.

After the Green’s functions for the block of slices affected by the probe is found, we need to connect it with the Green’s functions for the part(s) of the system that are unaffected by the probe. The resulting Green’s functions

are used to compute the transmission and reflection matrix, and thus the conductance and the shot noise in the device under study.

### 5.4 Final consideratins and output examples

During the development phase we used relatively small hard-wall struc- tures to debug the program. Once everything worked fine, we moved on to a real test by simulating a quantum point contact with a soft-wall defined by a split gate. The setup was exactly the same as in [8]. In the following figures we can see the potential profile used and the results for conductance and shot noise. The probe was an ideal hard-wall probe, the discretization grid was 200x300 points (width times length).

Since the test gave the result we hoped for, we were finally ready to test the program performances on real structures.

Our first test was to compare the results of two runs on the same 200x300 structure, analyzed with different probes. In one of the runs the new and old program were required to use a hard-wall probe, in the other they were required to use a Lorentizan probe (39 slice long, 79 mesh points wide). The form of the potential of a Lorenzian probe is given by the following formula:

Vprobe = V0

1

1 + ^{mesh}^{x}^{∆x+mesh}_{N}2 ^{y}^{∆y}

Fig. 5.3 Potential profile for the soft-wall test run.

Where ∆x and ∆y are the distance of the point considered from the center of the probe, V0 is the potential of the tip of the probe, and N is the length after which the potential of the probe is half of that of its tip.

The potential profile we used for the system was a hard-wall quantum point contact, but we included the effects of screening, which, using the

Fig. 5.4 Variation of the conductance in the soft-wall test run.

results of [9], can be written as:

φ(r) = Ze

4πǫ0ǫr

Z ^{∞}

0

k

k + sJ0(kr)e^{−}^{kd}dk

where J0 is the Bessel function of order 0, r is the distance from the orthog- onal projection of the donor position onto the 2DEG plane, Z indicates the donor charge, d is the distance between the 2DEG and the donor layer, and s is the screening length.

Fig. 5.5 Shot noise results for the soft-wall test run.

The result for our program was of 48238.97 seconds for the Lorenzian probe, and 1312.39 seconds for the hard-wall probe.

The old program needed 75388.11 seconds for the Lorenzian probe, and 28778.61 seconds for the hard-wall probe.

The speed-up in the case of Lorenzian probe was 156.28%, for the hard-

Fig. 5.6 Potential profile for hard-wall qpc with potential modified by the effect of the dopants.

Aftere these first results, we proceeded with our tests and eventually found that the effective speed-up ranges between ∼ 150% to ∼ 320% when studying structures made by 200x300 points, with probes of lengths ranging from 39 to 21 points.

Despite our efforts, not much can be done to further increase the per- formance of the previous program. Essentially all of the speed-up is due to

Fig. 5.7 Conductance result for the hard-wall probe.

the use of the idea discussed in section 4.3: we compute only once, at the beginning of the run, every Green’s function we will need. Every Green’s function for regions unaffected by the probe, that is. From this follows that the speed-up depends on the dimension of the device and if the probe only.

It is interesting to digress a little, in order to briefly discuss the way

Fig. 5.8 Conductance result for the Lorenzian probe.

The problem we are solving is parallel by nature, since there is no corre- lation between the calculations performed to evaluate the Green’s function of the whole system for two different positions of the probe. Which means that we can use this parallelism by dividing the iterations of the probe position over more tasks. Actually, this is performed by running different instances of the program. For each of thes we assigned the task to scan a different part of the system.

Fig. 5.9 Shot noise result for the hard-wall probe.

The speed-up we just spoke about is calculated comparing the durations of the time spent by these tasks. More precisely, we used the “time” command (the cluster operating system was unix-like), which differentiate between the time really used by the program, the time used for system calls and the total time the program needed to run, which is influenced not only by the

Fig. 5.10 Shot noise result for the Lorenzian probe.

run the time command reported that the program received by the scheduler only 94% − 96% of the CPU time, which leads to noticable differences in the time elapsed from the moment we started the execution to the moment we received the output.

We conclude this chapter with two other simulations of a hard-wall quan- tum point contact with potential modified by screening.

These simulations, along with others not included, were used to test different setups of probe and device dimension. We used the same potential as before, but we changed some of the parameters, like the Fermi energy.

Fig. 5.12 Conductance result for the first simulation.

Fig. 5.13 Shot noise result for the first simulation.

Fig. 5.15 Conductance result for the second simulation.

Fig. 5.16 Shot noise results for the second simulation.