
Master in High Performance Computing

Large-Scale Implementation of the Density Matrix Renormalization Group Algorithm

Supervisors: Marcello Dalmonte, Ivan Girotto
Candidate: James Vance

3rd edition, 2016–2017


Abstract

The density matrix renormalization group (DMRG) algorithm, a numerical technique that has been successfully used for investigating the low-energy properties of one-dimensional (1D) strongly correlated quantum systems, has recently emerged as an effective tool for studying two-dimensional (2D) systems as well. At the core of DMRG is a general decimation procedure that allows the systematic truncation of the Hilbert space, leaving only the most relevant basis states. However, studying 2D systems requires more degrees of freedom and greater computational resources. To address this computational roadblock, we develop a massively parallel implementation of the DMRG algorithm that targets a large number of basis states. It relies on parallel linear algebra libraries that distribute the generation and diagonalization of large sparse matrices, as these remain the most time-consuming steps in DMRG. We tailor the code for efficient performance on two partitions of CINECA Marconi, a Tier-0 class supercomputing infrastructure, and evaluate its performance and scalability on up to thousands of processors. From the performance analysis we identify some limitations in scalability and suggest possible ways to rectify them.


Acknowledgments

I would like to express my deepest gratitude to my supervisors, Marcello Dalmonte and Ivan Girotto, for their expertise, guidance, and support from the initial stages up to the writing of this thesis. I would also like to thank Marlon Brenes Navarro for sharing his knowledge on PETSc and SLEPc, and Noel Lamsen for the helpful discussions on the DMRG algorithm.

I would like to acknowledge the ICTP Programme for Training and Research in Italian Laboratories (TRIL) for funding my participation in the MHPC programme. I would also like to thank CINECA and ICTP for providing access to a world-class supercomputing infrastructure.


Contents

Abstract
Acknowledgments

1 Introduction

2 Background
2.1 Strongly Correlated Quantum Systems
2.2 The DMRG Algorithm
2.2.1 Infinite-Size Algorithm
2.2.2 Application to the Heisenberg Spin Chain
2.2.3 Quantum Numbers and Symmetries
2.3 Libraries for Massively Parallel Linear Algebra

3 Implementation
3.1 Basic Implementation
3.1.1 Workflow
3.1.2 Kronecker Product of Distributed Sparse Matrices
3.1.3 Ground State Solution
3.1.4 Construction of the Reduced Density Matrices
3.1.5 Construction of the Transformation Matrices
3.1.6 Basis Truncation
3.2 Quantum Number Conservation
3.2.1 Tracking of Magnetization Sectors
3.2.2 Solving for the Ground State in a Target Sector
3.2.3 Kronecker Product with Index Slicing
3.2.4 Block Diagonal Reduced Density Matrices

4 Results and Performance
4.1 Computational Tools
4.1.1 System Architecture
4.1.2 Software Dependencies and Configuration
4.2 Convergence of Ground State Energy Per Site
4.3 Performance Analysis
4.3.1 Comparison Between Implementations
4.3.2 Strong Scalability
4.3.3 Scaling of Computational Resources with Problem Size
4.4 Discussion

5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work


CHAPTER 1

Introduction

The study of two-dimensional many-body quantum systems is crucial for understanding fundamental and technologically relevant problems such as high-temperature superconductivity [1], frustrated magnetism [2, 3], and topological phases of matter [4].

From the physics side, the study of two-dimensional (2D) systems is challenging: quantum fluctuations are too strong for the mean-field methods that work in three dimensions to be applied reliably, and the advanced theoretical machinery available for one-dimensional (1D) systems does not carry over. Thus, the study of 2D systems often calls for computational approaches [2].

The density matrix renormalization group (DMRG), a method originally formulated for solving the ground-state properties of 1D quantum spin systems [5], has recently proven effective for 2D systems as well [6–8]. It has been especially useful for problems that are intractable with traditional methods such as exact diagonalization (due to the large number of basis states) [9] and quantum Monte Carlo (due to the negative sign problem) [10].

DMRG is a well-understood, numerically stable and widely applicable algorithm. The accuracy of its results can be systematically controlled through two parameters: the number of kept states m and the number of sweeps Ns [11]. However, even for 1D systems, DMRG can become computationally heavy. In this study, we implement DMRG for distributed-memory architectures and enable the developed code to perform efficiently on a world-class Tier-0 HPC infrastructure such as CINECA Marconi. Ultimately, our goal is to perform parallel large-scale DMRG on 1D and 2D frustrated magnets. Solving this HPC problem will give further insight into previously inaccessible regimes of physical problems.


CHAPTER 2

Background

2.1 Strongly Correlated Quantum Systems

The many-body problem is one of the grand challenges of quantum physics, owing to the large number of degrees of freedom needed to represent a quantum state. Oftentimes, this number grows exponentially with the size N of the system under consideration [12]. Physical systems can be effectively simplified into single-particle models when many-particle interactions are treated perturbatively. However, this simplification breaks down in strongly correlated systems, in which the interactions among the constituents of the system are non-trivial.

In recent years, several methods have been developed to study strongly correlated quantum systems. A common theme among these methods is that they focus on finding the ground-state energy and wavefunction of the system, from which one can calculate observables. These standard methods include exact diagonalization, quantum Monte Carlo, and the density matrix renormalization group algorithm.

Within exact diagonalization (ED), the Hamiltonian of a quantum system is represented by a sparse matrix whose eigenvectors and eigenvalues may be obtained using several direct and iterative methods, such as Krylov subspace techniques. However, even with many simplifying assumptions and symmetries, ED still suffers from the exponential growth of the required number of basis states. Consequently, the computational resources needed to calculate even a portion of its eigenspectrum also grow exponentially with the system size [13].

Quantum Monte Carlo (QMC) algorithms use importance sampling to evaluate the high-dimensional integrals that appear in the calculation of expectation values, and are thus able to bypass this increased complexity. Although the number of possible configurations still scales exponentially with the number of particles, with QMC the computational effort scales only polynomially in N [14]. QMC methods have been used successfully for bosonic systems. However, they often fail for fermionic and frustrated spin systems because of the so-called 'sign problem', which invalidates importance sampling and results in an exponential increase of the time to solution with the system size [10].

Addressing the limitations of ED and QMC is the density matrix renormalization group (DMRG) algorithm, which has recently emerged as a numerically precise and highly reliable method for studying many-body quantum systems [5]. Its efficient decimation procedure allows the study of large systems that are intractable with ED, and it does not suffer from the frustrated-spin or fermionic sign problem found in QMC [15].

2.2 The DMRG Algorithm

The density matrix renormalization group (DMRG) algorithm is a variational method used to study the ground-state properties of low-dimensional many-body quantum systems such as spin lattices or molecular orbitals. The core idea of the method is to obtain a truncated wavefunction in a reduced Hilbert space whose basis is chosen to minimize the loss of information. This is achieved by decomposing the lattice into smaller sub-blocks and iteratively increasing the size of each block while truncating the operator bases at each step. With DMRG, the error resulting from truncation can be measured and controlled, and large systems can be treated very accurately to obtain ground-state energies and gaps [11].

The method was conceptualized in 1992 by Steven White as an improvement to the block decimation procedure in Wilson's Numerical Renormalization Group (NRG) [5, 16]. Since then, DMRG has evolved and has been applied to different problems in domains outside of condensed matter, including nuclear physics [17, 18] and quantum chemistry [19, 20].

In the following section we illustrate the traditional DMRG algorithm following the original paper [5] and subsequent detailed reviews and tutorials [11, 15, 21]. For simplicity we will be using spins in our description. However, the same procedure is immediately applicable to fermions and bosons.


Figure 2.1: Block-growing scheme in the infinite-size DMRG algorithm. Starting with the blocks of length L (R) and basis size DL (DR) formed in the previous step (a), we grow each block by adding one site at their interface (b). We construct the superblock Hamiltonian and diagonalize it to obtain the ground state (c). We calculate the reduced density matrix for each block and use its m largest-weight eigenvectors to rotate the operators to a truncated basis for the next step (d). [15]

2.2.1 Infinite-Size Algorithm

The starting point of the algorithm is two blocks of spin sites, referred to as the left and right blocks, or the system and environment blocks. The traditional DMRG algorithm then proceeds in two distinct phases: the infinite-size algorithm and the finite-size algorithm. The infinite-size DMRG (iDMRG) algorithm serves as the warm-up stage for the full DMRG algorithm, since the resulting block and site operators may be stored and reused in the finite-size algorithm. Its implementation also provides a template for the finite-size algorithm, as the same steps are performed in that procedure.

The algorithm, illustrated in Figure 2.1, starts with the left (L) and right (R) blocks, each containing one site with local dimension d. These blocks are denoted by B(1, d), where the arguments indicate the length and dimension of the block, respectively. We construct all the operator matrices \hat{O} required to represent the interactions between sites. For simplicity in our illustration of the algorithm, we consider systems with nearest-neighbor interactions, in which the Hamiltonian usually takes the form

H = \sum_{i} \sum_{n} \left( a_n S_{n,i} T_{n,i+1} + b_n V_{n,i} \right)   (2.1)

where a_n and b_n are coefficients, S_{n,i}, T_{n,i} and V_{n,i} are operators acting on the ith site, and n serves as an index for the various terms of the Hamiltonian [21].

From a block of length L, the resulting enlarged block with L + 1 sites has a Hamiltonian matrix of the form

H_{L\circ} = H_L \otimes 1_{\circ} + 1_L \otimes H_{\circ} + H_{L \leftrightarrow \circ}   (2.2)

and correspondingly, for the right block with R sites enlarged to R + 1, the Hamiltonian is given by

H_{\circ R} = 1_{\circ} \otimes H_R + H_{\circ} \otimes 1_R + H_{\circ \leftrightarrow R}   (2.3)

where \otimes is the Kronecker product, the circle symbol \circ denotes the added site, the last terms indicate the interaction between the block and the added site, and 1 is the identity matrix on the block or site subspace indicated by its subscript.

Initially, we perform the addition of sites in a numerically exact manner until the blocks grow to B(n, D_{L(R)}), where the number of states D_{L(R)} = d^n needed to exactly represent the block is just below or equal to our desired maximum number of basis states m. After reaching this threshold, the blocks are further enlarged, but a truncation procedure is performed at each successive iteration to ensure that at most m states are kept to describe the enlarged block. Taking the interactions of the identical enlarged blocks B(n + 1, dD_{L(R)}), we form the superblock Hamiltonian

H_{L\circ\circ R} = H_{L\circ} \otimes 1_{\circ R} + 1_{L\circ} \otimes H_{\circ R} + H_{L\circ \leftrightarrow \circ R}   (2.4)

where the last term denotes the interaction between the connecting sites of the two blocks [11, 21].

We solve for the ground state of the superblock Hamiltonian, corresponding to the eigenvalue problem posed by the time-independent Schrödinger equation

H_{L\circ\circ R} |\psi_{GS}\rangle = E_{GS} |\psi_{GS}\rangle   (2.5)

using iterative diagonalization techniques such as the Lanczos or Davidson algorithms. The ground state vector takes the form

|\psi_{GS}\rangle = \sum_{l,r} \psi_{l,r} \, |l\rangle \otimes |r\rangle   (2.6)

expressed in terms of the basis vectors of the left and right enlarged blocks. From the ground state eigenvector, we calculate the reduced density matrix of each block by tracing out the other block:

\rho_{L\circ} = \mathrm{Tr}_{\circ R} |\psi_{GS}\rangle\langle\psi_{GS}|   (2.7)

\rho_{\circ R} = \mathrm{Tr}_{L\circ} |\psi_{GS}\rangle\langle\psi_{GS}|   (2.8)

We take the full eigenvalue decomposition of each block's density matrix and use it to express the reduced density matrices as

\rho_{L\circ} = \sum_{\alpha} \omega_{\alpha} \, |\alpha\rangle_{L\circ} \, {}_{L\circ}\langle\alpha|.   (2.9)

Taking the eigenvectors with the m largest weights \omega_\alpha, we construct a rotation matrix U_{L\circ} of dimension dD_L \times m, where D_L = m if a previous truncation has been performed. We use this matrix to rotate the operator matrices O_{L\circ} of dimension dD_L acting on the enlarged block to

\tilde{O}_{L\circ} = (U_{L\circ})^{\dagger} O_{L\circ} U_{L\circ}   (2.10)

The transformed operator matrix now has dimension min(m, dD_L). We perform the corresponding steps on the enlarged right block \circ R. After setting L \leftarrow L\circ and R \leftarrow \circ R as the starting blocks, the block enlargement and basis truncation steps are repeated. The density matrix truncation is applied to both blocks at each successive iteration until a suitable system size or a given tolerance on the energy error is reached [11, 15].

2.2.2 Application to the Heisenberg Spin Chain

The Heisenberg model is one of the simplest non-trivial models studied with DMRG. Given a one-dimensional spin-1/2 chain of length N, as illustrated in Figure 2.2, the Heisenberg Hamiltonian is defined as

H = \sum_{i=1}^{N-1} J_i \, \mathbf{S}_i \cdot \mathbf{S}_{i+1}   (2.11)

where J_i is a coupling constant and \mathbf{S}_i = (S^x_i, S^y_i, S^z_i) are the spin operators acting on site i.


Figure 2.2: One-dimensional spin chain with open boundary conditions and nearest-neighbor interaction.

Introducing the raising and lowering operators S^\pm = S^x \pm i S^y allows us to rewrite the Hamiltonian in the more convenient form

H = \sum_{i=1}^{N-1} \left( J_z S^z_i S^z_{i+1} + \frac{J}{2} \left[ S^+_i S^-_{i+1} + S^-_i S^+_{i+1} \right] \right)   (2.12)

where J and J_z are coupling constants and the index i runs over nearest-neighbor pairs of sites. For spin-1/2 systems with J = J_z = 1, the Bethe ansatz gives the exact ground-state energy per site, e_0 = 1/4 - \ln 2 \approx -0.4431471805599. Several DMRG studies have also explored the low-energy properties of the Heisenberg model in different geometries and higher dimensions, such as two-dimensional ladders [8] and the kagome lattice [22]. For our implementation, we use the one-dimensional spin-1/2 Heisenberg model as the test Hamiltonian because of its simplicity and the existence of an exact solution for comparison.
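For reference, the numerical value follows directly from the closed form:

e_0 \;=\; \tfrac{1}{4} - \ln 2 \;=\; 0.25 - 0.6931471805599\ldots \;=\; -0.4431471805599\ldots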

A single site of a spin-1/2 chain is described by a state in the basis {|↑⟩, |↓⟩}, giving a local dimension of d = 2. To explicitly construct (2.12), we use the following matrix representations of the single-site spin operators, which are derived from the Pauli matrices:

S^z = \begin{pmatrix} 1/2 & 0 \\ 0 & -1/2 \end{pmatrix}, \quad
S^+ = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}, \quad
S^- = \begin{pmatrix} 0 & 0 \\ 1 & 0 \end{pmatrix}   (2.13)

With respect to the entire N-spin chain, an operator acting on the ith site is explicitly represented by

S^q_{i,N} = \left( \bigotimes_{j=1}^{i-1} 1 \right) \otimes S^q \otimes \left( \bigotimes_{j'=i+1}^{N} 1 \right)   (2.14)

where q \in \{z, +, -\} and 1 is the d \times d identity matrix. If the operator acts on the last site of the left (right) block of length L (R), we simply denote it as S^q_{L(R)}, and on the last site of the enlarged block of length L + 1 (R + 1) as S^q_{L\circ(\circ R)}. With this expression, we can explicitly construct the Hamiltonian (2.12) and the spin operators at each site.
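For illustration, a minimal sketch of how the d = 2 single-site operators in (2.13) might be assembled as distributed PETSc matrices (a sketch under the real-scalar PETSc configuration described in Chapter 4; the function name is illustrative and not part of the actual code):

#include <petscmat.h>

/* Build the 2x2 single-site operators S^z and S^+ of (2.13) as sparse
 * MATAIJ matrices. S^- is obtained later as the Hermitian conjugate of
 * S^+ and therefore does not need to be stored explicitly. */
static PetscErrorCode BuildSingleSiteOperators(MPI_Comm comm, Mat *Sz, Mat *Sp)
{
  PetscErrorCode ierr;
  PetscMPIInt    rank;
  ierr = MPI_Comm_rank(comm, &rank); CHKERRQ(ierr);

  ierr = MatCreate(comm, Sz); CHKERRQ(ierr);
  ierr = MatSetSizes(*Sz, PETSC_DECIDE, PETSC_DECIDE, 2, 2); CHKERRQ(ierr);
  ierr = MatSetType(*Sz, MATAIJ); CHKERRQ(ierr);
  ierr = MatSetUp(*Sz); CHKERRQ(ierr);
  if (!rank) {                                           /* diagonal entries +1/2, -1/2 */
    ierr = MatSetValue(*Sz, 0, 0, +0.5, INSERT_VALUES); CHKERRQ(ierr);
    ierr = MatSetValue(*Sz, 1, 1, -0.5, INSERT_VALUES); CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(*Sz, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*Sz, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

  ierr = MatCreate(comm, Sp); CHKERRQ(ierr);
  ierr = MatSetSizes(*Sp, PETSC_DECIDE, PETSC_DECIDE, 2, 2); CHKERRQ(ierr);
  ierr = MatSetType(*Sp, MATAIJ); CHKERRQ(ierr);
  ierr = MatSetUp(*Sp); CHKERRQ(ierr);
  if (!rank) {                                           /* <up| S^+ |down> = 1 */
    ierr = MatSetValue(*Sp, 0, 1, 1.0, INSERT_VALUES); CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(*Sp, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(*Sp, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  return 0;
}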


Alternatively, we can also construct the Hamiltonian iteratively using (2.2) and (2.12). We begin with the case of two spins and construct its Hamiltonian explicitly as

H_2 = J_z S^z \otimes S^z + \frac{J}{2} \left[ S^+ \otimes S^- + S^- \otimes S^+ \right].   (2.15)

Then we can iteratively enlarge the chain by adding another site to each block, so that the Hamiltonian of a left (right) enlarged block containing L + 1 (R + 1) sites is given by the recursion

H_{L\circ} = H_L \otimes 1_{\circ} + J_z S^z_L \otimes S^z + \frac{J}{2} \left[ S^+_L \otimes S^- + S^-_L \otimes S^+ \right]   (2.16)

H_{\circ R} = 1_{\circ} \otimes H_R + J_z S^z \otimes S^z_R + \frac{J}{2} \left[ S^+ \otimes S^-_R + S^- \otimes S^+_R \right]   (2.17)

where S^q_L = S^q_{L,L} and S^q_R = S^q_{R,R} are the operators of the last sites added to the left and right blocks, with (possibly truncated) basis dimensions D_L and D_R, respectively. Using (2.4) we can now construct the superblock Hamiltonian as

H_{L\circ\circ R} = H_{L\circ} \otimes 1_{\circ R} + 1_{L\circ} \otimes H_{\circ R} + J_z S^z_{L\circ} \otimes S^z_{\circ R} + \frac{J}{2} \left[ S^+_{L\circ} \otimes S^-_{\circ R} + S^-_{L\circ} \otimes S^+_{\circ R} \right]   (2.18)

with a total length of L + R + 2 and a basis size of d^2 D_L D_R [11].

2.2.3 Quantum Numbers and Symmetries

For many systems studied with DMRG, the Hamiltonian conserves some quantum numbers, in the sense that it commutes with the corresponding operators. These symmetries may be exploited to decrease the Hilbert space dimension, which reduces memory consumption and computation time. Here we discuss the most frequently used symmetry, U(1) symmetry, in which the total magnetization and/or total particle number are conserved quantum numbers. In the context of the Heisenberg model, U(1) symmetry conserves the total magnetization S^z_tot = \sum_{i=1}^{N} S^z_i, allowing us to express the operators in terms of smaller blocks corresponding to a given set of quantum numbers [15].

In our particular spin-1/2 model, the S^z_j are the eigenvalues of the S^z operator (2.13) at site j, which are simply {+1/2, -1/2}. Thus, for a block of N sites, the possible quantum numbers are {N/2, N/2 - 1, ..., -N/2}. If we label the block and site states of the subsystems with their magnetizations and preserve the total magnetization S^z_tot of the superblock, we get

S^z_{tot} = S^z_L + S^z_{\sigma_L} + S^z_{\sigma_R} + S^z_R   (2.19)

where \sigma_{L(R)} denote the sites added to the left and right blocks, respectively. Thus, we can target a particular magnetization sector for the ground state solution (2.5), usually S^z_tot = 0, or study spin gaps by solving for the ground state at different target magnetizations. The reduced density matrices (2.7)-(2.8) also acquire a block-diagonal structure, which allows us to perform the spectral decomposition (2.9) on each of the diagonal blocks separately [15].

2.3 Libraries for Massively Parallel Linear Algebra

While simple DMRG applications require only moderate resources, going towards higher dimensions is computationally challenging since it requires keeping a large number of states. This results in large matrices and computationally intensive operations. There is thus a need for a parallel implementation that uses scalable linear algebra libraries for the management of data and communication, and for the implementation of the various solvers.

The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines for the scalable parallel solution of scientific applications, originally designed for solving partial differential equations. It uses the Message Passing Interface (MPI) for distributed-memory parallelism, and it includes an implementation of parallel matrices and vectors together with basic operations on these objects, as well as several linear system solvers [23]. One of the packages that build upon PETSc is the Scalable Library for Eigenvalue Problem Computations (SLEPc), which is used for the eigenvalue decomposition of large sparse matrices on parallel distributed-memory architectures. Together, these two packages provide the framework in which we implement our scalable DMRG calculations [24]. In this section we explain the general features of these packages and what makes them suitable for our DMRG calculations.

PETSc implements a class of sparse matrices called MATAIJ which stores the elements in a compressed sparse row (CSR) format. In the CSR format, the non-zero elements of each row are represented as a list of integer column indices and floating-point values. Distributed-memory storage of this matrix is achieved by assigning ownership of an almost uniform number of rows to each MPI process. Correspondingly, vectors are also stored in a distributed manner following the layout of the matrix which acts on them.

For the ubiquitous matrix-vector product operations, overlap between communication and computation is achieved by separately storing a diagonal block of matrix elements that act on the local portion of the vector, and an off-diagonal block of matrix elements that act on the non-local elements of the vector, which are gathered to complete the operation [25]. Thus, when constructing a matrix operator, knowing how much memory to preallocate for the diagonal and off-diagonal blocks is essential for good performance. PETSc also has a MATSHELL matrix type with a customizable interface that allows for a matrix-free approach, especially for matrix-vector products [23].
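As a minimal sketch of this preallocation pattern (the function name is illustrative; the per-row counts d_nnz and o_nnz are assumed to have been computed beforehand, as done in Chapter 3):

#include <petscmat.h>

/* Create an M x M distributed MATMPIAIJ matrix with per-row preallocation.
 * d_nnz[i] and o_nnz[i] give, for each locally owned row i, the number of
 * non-zeros falling inside and outside the locally owned column range. */
static PetscErrorCode CreatePreallocatedMatrix(MPI_Comm comm, PetscInt M,
                                               const PetscInt d_nnz[],
                                               const PetscInt o_nnz[], Mat *A)
{
  PetscErrorCode ierr;
  ierr = MatCreate(comm, A); CHKERRQ(ierr);
  ierr = MatSetSizes(*A, PETSC_DECIDE, PETSC_DECIDE, M, M); CHKERRQ(ierr);
  ierr = MatSetType(*A, MATMPIAIJ); CHKERRQ(ierr);
  ierr = MatMPIAIJSetPreallocation(*A, 0, d_nnz, 0, o_nnz); CHKERRQ(ierr);
  return 0;
}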

On the other hand, SLEPc contains eigensolvers that are most efficient at finding a small number of eigenvalues of large sparse matrices, based on several iterative algorithms. It includes the methods commonly used for Hamiltonian matrices, such as the Lanczos, Arnoldi and Krylov-Schur algorithms. Additionally, it provides functionality for the partial SVD of rectangular matrices, solutions to generalized, non-Hermitian and non-linear eigenvalue problems, and an interface to serial LAPACK routines [24].

The combination of PETSc and SLEPc provides the basic functionality needed for a large-scale DMRG calculation, including parallel sparse eigensolvers and basic linear algebra operations. They also enable scalability through distributed-memory parallelism. Although the DMRG workflow is inherently serial, the generation of and operation on large sparse operator matrices can greatly benefit from the speedup brought about by this parallelism. In fact, a previous thesis has demonstrated a massively parallel implementation of the time evolution of disordered quantum systems using the Krylov subspace techniques provided by these libraries [26, 27].


CHAPTER 3

Implementation

In this chapter, we present our implementation of the infinite-size DMRG algorithm detailed in the previous chapter. In Section 3.1 we describe the most basic but widely applicable implementation, which does not take symmetries into account but instead solves the entire Hamiltonian. We present an optimized version of this implementation in Section 3.2, which targets a specific magnetization sector of the Hamiltonian and thereby dramatically decreases the overall computational cost. For concreteness, we use the one-dimensional spin-1/2 Heisenberg chain with local dimension d = 2 as the explicit model for our construction, while keeping the notation as general as possible so that the same procedure can also be applied to other systems.

3.1 Basic Implementation

We first discuss the basic implementation, which follows the traditional formulation in Section 2.2.1 where quantum numbers and symmetries are not considered. This approach is more computationally intensive, but its implementation is simpler and it applies to a wider range of physical problems.

3.1.1 Workflow

Our implementation starts with the construction of the d×d operators H, S^z and S^+ for a single site. The operator S^- and its representation relative to a block can be obtained from the Hermitian conjugate S^-_{i,N} = (S^+_{i,N})^\dagger whenever needed. With these matrices, we form the left and right blocks containing one site each and perform the steps of the DMRG iteration. Based on the details provided in Section 2.2.1, we divide a single iteration of the DMRG algorithm into several distinct phases, which we have implemented as member functions of a C++ class called iDMRG (a possible skeleton of this class is sketched after the list below).

• BuildBlockLeft and BuildBlockRight involve the addition of a site to the left and right blocks to form the enlarged blocks L◦ and ◦R, respectively. These steps comprise the calculation of the enlarged block Hamiltonians H_{L◦} (H_{◦R}) given in (2.2)-(2.3), and the construction, using (2.14), of the operators S^z_{L◦} and S^+_{L◦} (S^z_{◦R} and S^+_{◦R}) for the newly added sites relative to their respective blocks. These steps result in the creation of several matrices of size dD_{L(R)} × dD_{L(R)}. [28]

• BuildSuperblock constructs the superblock Hamiltonian H_{L◦◦R} following equation (2.4), which results in a large sparse matrix of size d^2 D_L D_R × d^2 D_L D_R in the MATAIJ format.

• SolveGroundState obtains the ground state solution of the superblock Hamiltonian by solving the eigenvalue problem posed in (2.5).

• BuildReducedDMs constructs the reduced density matrices ρ_{L◦} and ρ_{◦R} from the ground state eigenvector following (2.7)-(2.8). This results in smaller matrices of maximum size dD_{L(R)} × dD_{L(R)}.

• GetRotationMatrices solves for the full spectrum of the reduced density matrices using SLEPc's singular value decomposition (SVD) class, which has an interface to dense LAPACK routines. It also constructs the rotation matrix U_{L◦} (and likewise U_{◦R}) using the singular vectors of the reduced density matrices.

• TruncateOperators performs the rotation of the block Hamiltonian and site operators using (2.10).
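A possible skeleton of the iDMRG class referenced above is sketched here; the data members and exact signatures are illustrative assumptions rather than the actual interface of the code:

#include <petscmat.h>
#include <slepceps.h>

// Sketch of the iDMRG driver class; each method corresponds to one
// phase of a single infinite-size DMRG iteration described above.
class iDMRG {
public:
  explicit iDMRG(MPI_Comm comm, PetscInt mstates) : comm_(comm), m_(mstates) {}

  PetscErrorCode BuildBlockLeft();      // enlarge left block:  H_L -> H_{Lo}
  PetscErrorCode BuildBlockRight();     // enlarge right block: H_R -> H_{oR}
  PetscErrorCode BuildSuperblock();     // assemble H_{Lo oR} as a MATAIJ matrix
  PetscErrorCode SolveGroundState();    // SLEPc EPS, smallest real eigenvalue
  PetscErrorCode BuildReducedDMs();     // rho_{Lo} and rho_{oR} from |psi_GS>
  PetscErrorCode GetRotationMatrices(); // eigendecomposition of the RDMs
  PetscErrorCode TruncateOperators();   // O -> U^dagger O U for all operators

private:
  MPI_Comm comm_;
  PetscInt m_;                                // maximum number of kept states
  Mat HL_, HR_, SzL_, SpL_, SzR_, SpR_;       // block and site operators
  Mat superblock_H_;                          // superblock Hamiltonian
  Vec gsv_;                                   // ground state vector
};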

3.1.2 Kronecker Product of Distributed Sparse Matrices

The key operation in the construction of the matrix operators for the enlarged blocks and the superblock Hamiltonian is the Kronecker product. Given two matrices A and B of dimensions (M_A, N_A) and (M_B, N_B), respectively, the Kronecker product C = A ⊗ B of dimension (M_A M_B, N_A N_B) has elements given by

C_{\gamma\delta} = A_{ij} B_{kl}   (3.1)

where \gamma = i \cdot M_B + k and \delta = j \cdot N_B + l.


Algorithm 1 Local Row and Column Ownership of Global Matrices
Input: Global matrix size M_global, MPI communicator rank n_rank and size n_procs
Output: Row ownership range [i_start, i_end)

function SPLITOWNERSHIPRANGE(M_global, n_rank, n_procs)
    i_offset ← M_global mod n_procs
    m_local ← M_global / n_procs                ▷ Number of locally-owned rows/columns
    if n_rank < i_offset then                   ▷ Distribute the remainder to the first few processes
        i_offset ← n_rank
        m_local ← m_local + 1
    end if
    i_start ← (M_global / n_procs) · n_rank + i_offset
    i_end ← i_start + m_local
    return i_start, i_end
end function

Looking at (2.2)-(2.4), we can see that the matrices to be constructed often involve a linear combination of Kronecker products of the form

C = \sum_{n} a_n \, A_n \otimes B_n   (3.2)

where n is an index over terms and a_n is a coefficient. As of version 3.7.6, PETSc does not yet support this operation for general sparse matrices, so we have written our own custom implementation with the simplifying assumption that the sets of matrices {A_n} and {B_n} each have uniform dimensions among themselves.

There are several constraints on our implementation. First, the matrix operators are stored in a compressed sparse row (CSR) format, which is more efficient for accessing and writing elements contiguously in row-major order than in column-major order. Second, row ownership is distributed among all MPI processes in the communicator, so that some rows of A_n and B_n must be sent to the processes where they are needed to construct the local rows of the resultant matrix C. Finally, preallocation for the non-zeros of the matrix must be done to improve efficiency during matrix construction, so the number of non-zeros of C must be known before construction. Taking these into account, we divide our implementation of the Kronecker product into three stages: submatrix collection, preallocation, and matrix construction.

In submatrix collection, we determine and communicate the entire rows of A_n and B_n needed to construct the local rows of C on each MPI process. We begin by determining the row ownership of the resultant


Algorithm 2 Kronecker Product: Submatrix Collection
Input:
    MPI matrices {A_n} of size (M_A, N_A) and {B_n} of size (M_B, N_B), n = 1, ..., M
    MPI communicator rank n_rank and size n_procs
Output: Sequential matrices {A_n,sub} and {B_n,sub} containing the required non-local rows

procedure KRONGETSUBMATRICES
    M_C ← M_A · M_B                             ▷ Global number of rows of C
    N_C ← N_A · N_B                             ▷ Global number of columns of C
    i_C,start, i_C,end ← SPLITOWNERSHIPRANGE(M_C, n_rank, n_procs)
    for i_C ∈ [i_C,start, i_C,end) do
        IS_A ← insert(i_C / M_B)                ▷ Index set of required rows of A
        IS_B ← insert(i_C mod M_B)              ▷ Index set of required rows of B
    end for
    for n ∈ {1, ..., M} do
        A_n,sub ← MatGetSubMatrices of all rows of A_n listed in IS_A
        B_n,sub ← MatGetSubMatrices of all rows of B_n listed in IS_B
    end for
end procedure

matrix so that it conforms to PETSc's default layout. This is achieved by using the function SPLITOWNERSHIPRANGE shown in Algorithm 1, which outputs the range of row indices [i_C,start, i_C,end) belonging to the current process. In Algorithm 2 we describe the remaining procedure, which involves determining the indices of the rows of A_n and B_n needed by a particular MPI process. These rows are then collected into submatrices using PETSc's MatGetSubMatrices function [23]. Additionally, a mapping object is created that takes a non-local row index and maps it to the local row index in the sequential submatrix.

A compressed sparse row matrix A may be logically thought of as an array of rows, where each row A[i_A] is a list of tuples (col_A, val_A) representing the column indices and values of only the non-zero elements in that row. Knowing how many non-zeros the matrix will contain, and preallocating the corresponding amount of memory, is crucial for attaining good performance, as it avoids the repeated reallocation of memory for the non-zero rows during matrix construction. During preallocation, we determine the required storage for each row of the resultant MPI (global) sparse matrix following PETSc's MATMPIAIJ format, which is assembled from two sequential blocks in the MATSEQAIJ format: the diagonal block and the off-diagonal block. An element whose column index falls within the range [j_C,start, j_C,end) calculated from Algorithm 1 belongs to the diagonal block; otherwise it belongs to the off-diagonal block.


Algorithm 3 Kronecker Product: Preallocation
Input: Sets of sequential submatrices {A_n,sub} and {B_n,sub}
Output: Arrays DNNZ[] and ONNZ[] containing the number of non-zeros in the diagonal and off-diagonal blocks, respectively

procedure KRONPREALLOCATION
    j_C,start, j_C,end ← SPLITOWNERSHIPRANGE(N_C, n_rank, n_procs)
    for i_C ∈ [i_C,start, i_C,end) do
        DNNZ[i_C] ← 0, ONNZ[i_C] ← 0
        for n ∈ {1, ..., M} do
            for (col_A, val_A) ∈ A_n,sub[i_C / M_B] do
                for (col_B, val_B) ∈ B_n,sub[i_C mod M_B] do
                    col_C ← col_A · N_B + col_B
                    if j_C,start ≤ col_C < j_C,end then
                        DNNZ[i_C] ← DNNZ[i_C] + 1
                    else
                        ONNZ[i_C] ← ONNZ[i_C] + 1
                    end if
                end for
            end for
        end for
    end for
end procedure

The number of elements in these blocks for each row is calculated and stored in the arrays DNNZ and ONNZ for the diagonal and off-diagonal blocks, respectively, as illustrated in Algorithm 3. These arrays are then fed into the MatMPIAIJSetPreallocation and MatSeqAIJSetPreallocation functions of PETSc. [23, 29]

Finally, we use the column indices and values of the sequential submatrices to construct the local rows of C in the procedure KRONMATRIXCONSTRUCTION, illustrated in Algorithm 4. Here, we loop through the rows of A_n and B_n following the definition of the Kronecker product in (3.1), and additionally loop through the terms of the linear combination in (3.2). The insertion of values into the matrix is achieved using the MatSetValues function provided by PETSc.
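For illustration, the innermost insertion could be organized as follows; this is a sketch in which the helper name is hypothetical, the non-zeros of the required rows are assumed to already be available as plain arrays, and ADD_VALUES is used so that different terms of the linear combination hitting the same entry are accumulated (the actual code follows Algorithm 4):

#include <petscmat.h>

/* Add the contribution of one term a_n * (A_n kron B_n) to the locally owned
 * row rowC of C, given the non-zeros of row rowA of A_n,sub and row rowB of
 * B_n,sub as CSR-like arrays (colA, valA) and (colB, valB). */
static PetscErrorCode KronInsertRowTerm(Mat C, PetscInt rowC, PetscScalar an,
                                        PetscInt nnzA, const PetscInt colA[], const PetscScalar valA[],
                                        PetscInt nnzB, const PetscInt colB[], const PetscScalar valB[],
                                        PetscInt NB)
{
  PetscErrorCode ierr;
  for (PetscInt j = 0; j < nnzA; ++j) {
    for (PetscInt k = 0; k < nnzB; ++k) {
      PetscInt    colC = colA[j] * NB + colB[k];   /* Kronecker column index (3.1) */
      PetscScalar valC = an * valA[j] * valB[k];   /* scaled product of values */
      ierr = MatSetValues(C, 1, &rowC, 1, &colC, &valC, ADD_VALUES); CHKERRQ(ierr);
    }
  }
  return 0;
}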

Together, these procedures allow us to explicitly construct the matrix representation of the block operators and the superblock Hamiltonian during the BuildBlockLeft, BuildBlockRight and BuildSuperblock stages of the workflow, and distribute them among different MPI processes.


Algorithm 4 Kronecker Product: Matrix Construction
Input:
    Coefficients {a_n} of the linear combination (3.2)
    Sequential submatrices {A_n,sub} and {B_n,sub} containing the required non-local rows
Output: Resultant MPI matrix C of (3.2)

procedure KRONMATRIXCONSTRUCTION
    for row_C ∈ [i_C,start, i_C,end) do
        row_A ← row_C / M_B
        row_B ← row_C mod M_B
        for n ∈ {1, ..., M} do
            for (col_A, val_A) ∈ A_n,sub[row_A] do
                for (col_B, val_B) ∈ B_n,sub[row_B] do
                    col_C ← col_A · N_B + col_B
                    val_C ← a_n · val_A · val_B
                    C[row_C] ← insert((col_C, val_C))
                end for
            end for
        end for
    end for
end procedure

3.1.3 Ground State Solution

After building the superblock Hamiltonian, the next step is finding the ground state solution. This corresponds to the time-independent Schrödinger equation H|ψ⟩ = E|ψ⟩, for which we solve for the eigenpair with the lowest eigenvalue. Since the Hamiltonian matrix is sparse and very large, and only an extremal part of the eigenspectrum is needed, a full eigendecomposition is impractical. Instead, iterative diagonalization techniques such as the Lanczos and Davidson algorithms are used. These and many other techniques are implemented in SLEPc and have been tested for scalability up to thousands of cores [24].

In particular, the Krylov-Schur algorithm, which is based on Krylov subspace methods, has shown good performance for both Hermitian and non-Hermitian eigenproblems, making it SLEPc's default eigensolver. In our case, where the Hamiltonian matrix is Hermitian, the Krylov-Schur algorithm is equivalent to a thick-restart Lanczos method, which maintains an upper bound on the number of Lanczos vectors [30, 31].

Once the Hamiltonian matrix has been constructed as described in the previous section, using SLEPc's Eigenvalue Problem Solver (EPS) object to find the ground state integrates easily into our workflow.


Listing 3.1 Code snippet for calling the eigensolver

EPS eps;
EPSCreate( PETSC_COMM_WORLD, &eps );
EPSSetOperators( eps, superblock_H_, NULL );
EPSSetProblemType( eps, EPS_HEP );               // Hermitian eigenproblem
EPSSetWhichEigenpairs( eps, EPS_SMALLEST_REAL ); // Ground state
EPSSetFromOptions( eps );                        // Specify options at run time
EPSSolve( eps );

The main portion of the calling sequence is shown in Listing 3.1.

Parameters such as the algorithm to use, the kind and number of eigenpairs to solve for, and the tolerance can be specified at run time [32]. For our purposes, we use the Krylov-Schur algorithm and request the smallest real eigenvalues with the relative tolerance set to 10^{-12}.
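Equivalently, these choices could be fixed programmatically on the eps object of Listing 3.1 before the call to EPSSolve; a minimal sketch (in practice the run-time options remain the more flexible route):

// Krylov-Schur, one eigenpair, relative tolerance 1e-12;
// PETSC_DEFAULT keeps the default subspace sizes and iteration limit.
ierr = EPSSetType( eps, EPSKRYLOVSCHUR ); CHKERRQ(ierr);
ierr = EPSSetDimensions( eps, 1, PETSC_DEFAULT, PETSC_DEFAULT ); CHKERRQ(ierr);
ierr = EPSSetTolerances( eps, 1e-12, PETSC_DEFAULT ); CHKERRQ(ierr);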

3.1.4 Construction of the Reduced Density Matrices

When the eigensolver has reached the desired tolerance, the ground state eigenvector may be extracted from the EPS object. This vector takes the form

|\psi_{GS}\rangle = \sum_{l,r} \psi_{l,r} \, |l\rangle \otimes |r\rangle   (3.3)

where the indices l and r run over the bases of the enlarged blocks L\circ and \circ R, respectively, and it has a global length of d^2 D_L D_R. Its corresponding density matrix takes the form

\rho_{L\circ\circ R} = |\psi_{GS}\rangle\langle\psi_{GS}| = \sum_{l,r,l',r'} \psi_{l,r} \psi^{*}_{l',r'} \, |l\rangle\langle l'| \otimes |r\rangle\langle r'|   (3.4)

To get the reduced density matrix for the left (right) block, we trace out the right (left) block subsystem:

\rho_{L\circ} = \mathrm{Tr}_{\circ R} |\psi_{GS}\rangle\langle\psi_{GS}| = \sum_{l,l'} \left( \sum_{r} \psi_{l,r} \psi^{*}_{l',r} \right) |l\rangle\langle l'|   (3.5)

\rho_{\circ R} = \mathrm{Tr}_{L\circ} |\psi_{GS}\rangle\langle\psi_{GS}| = \sum_{r,r'} \left( \sum_{l} \psi_{l,r} \psi^{*}_{l,r'} \right) |r\rangle\langle r'|.   (3.6)


where the inner summations resemble matrix multiplications [33]. This means that we can reshape the vector |\psi_{GS}\rangle in row-major order into a matrix \Psi of size dD_L \times dD_R, reflecting the Kronecker product structure. The reduced density matrices are then simply

\rho_{L\circ} = \Psi \Psi^{\dagger}   (3.7)

\rho_{\circ R} = \Psi^{\dagger} \Psi   (3.8)

where \dagger denotes the conjugate transpose.

We note that the ground state vector has a parallel layout following the row-wise distribution of the superblock Hamiltonian. Thus, to construct (3.7) and (3.8), whether in parallel or in serial, the reshaped ground state vector has to be redundantly present on the participating processes. This is not a problem, since it takes up much less memory than the superblock Hamiltonian, and it can be achieved using the VecScatter functionality in PETSc. This routine is implemented in the procedure BuildReducedDensityMatrices detailed in Algorithm 5, with run-time options to construct the matrices either in serial on process 0 or in parallel.

Algorithm 5 Construction of the reduced density matrices
Input:
    Global ground state vector ψ_GS of length dD_L · dD_R
    Run-time boolean option GLOBAL on whether to produce distributed (global) output
Output: Reduced density matrices ρ_L◦ (dD_L × dD_L) and ρ_◦R (dD_R × dD_R)

procedure BUILDREDUCEDDENSITYMATRICES
    ψ ← full local copy of ψ_GS using VecScatter
    Prepare the dD_L × dD_R local matrix Ψ
    for i ∈ [0, dD_L) do
        for j ∈ [0, dD_R) do
            Ψ[i, j] ← ψ[i · dD_R + j]                       ▷ Copy values in row-major order
        end for
    end for
    if GLOBAL then
        i_C,start, i_C,end ← SPLITOWNERSHIPRANGE(dD_L, n_rank, n_procs)
        ρ_L◦[rows ∈ [i_C,start, i_C,end)] ← Ψ[rows ∈ [i_C,start, i_C,end)] · Ψ†
        i_C,start, i_C,end ← SPLITOWNERSHIPRANGE(dD_R, n_rank, n_procs)
        ρ_◦R[rows ∈ [i_C,start, i_C,end)] ← (Ψ†)[rows ∈ [i_C,start, i_C,end)] · Ψ
    else
        ρ_L◦ ← ΨΨ†
        ρ_◦R ← Ψ†Ψ
    end if
end procedure


3.1.5 Construction of the Transformation Matrices

Having constructed the reduced density matrices, we proceed to obtain their spectral decompositions, which we use to transform the spin and block operators for the next iteration. This requires the full spectral decomposition of relatively small dense matrices. Since the reduced density matrices are Hermitian and positive semi-definite, this is equivalent to either an eigenvalue decomposition or a singular value decomposition, both of which are available in SLEPc as interfaces to dense LAPACK routines as well as through its own iterative sparse solvers. The eigenpairs are then sorted in descending order of eigenvalue, either internally in the SLEPc routine or after the eigenpairs have been solved for. This straightforward procedure is illustrated in Algorithm 6 and is performed on the density matrices of both the left and right blocks.

Algorithm 6 Construction of the rotation matrices
Input:
    Reduced density matrix ρ of size dD × dD
    Number of states to keep m
Output: Rotation matrix U, truncation error ε_tr

procedure GETROTATIONMATRICES
    {(w_i, v_i)}, i = 1, ..., dD ← eigenvalues and eigenvectors of ρ
    Sort {(w_i, v_i)} in descending order of w
    ε_tr ← 1 − Σ_{i=1}^{m} w_i
    U ← [v_1, v_2, ..., v_m]
end procedure

3.1.6 Basis Truncation

The last step in the iteration is the rotation of the enlarged block and site operators to the new basis. This is implemented in TruncateOperators, named as such since the new basis is either of the same size as or smaller than the original basis. The new operators are obtained as Õ = U†OU for all operators O in both the left and right blocks. This is implemented using the MatMatMatMult function in PETSc, which performs the triple matrix product in parallel. In addition, there is also a function called MatPTAP which avoids building the explicit conjugate transpose, but it is only applicable to the case of real scalars.
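A sketch of the rotation of a single operator (O and U are assumed to be assembled PETSc matrices; with the real-scalar build used here the conjugate transpose reduces to a transpose, so MatPTAP applies directly, while the commented lines show the general triple-product route):

#include <petscmat.h>

/* Rotate an operator to the truncated basis: Orot = U^dagger O U. */
static PetscErrorCode RotateOperator(Mat O, Mat U, Mat *Orot)
{
  PetscErrorCode ierr;
  ierr = MatPTAP(O, U, MAT_INITIAL_MATRIX, PETSC_DEFAULT, Orot); CHKERRQ(ierr);
  /* Complex-scalar alternative via the explicit triple product:
   *   Mat Udag;
   *   ierr = MatHermitianTranspose(U, MAT_INITIAL_MATRIX, &Udag); CHKERRQ(ierr);
   *   ierr = MatMatMatMult(Udag, O, U, MAT_INITIAL_MATRIX, PETSC_DEFAULT, Orot); CHKERRQ(ierr);
   */
  return 0;
}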


3.2 Quantum Number Conservation

Many of the computationally intensive operations in DMRG can be optimized by taking conserved quantum numbers into account. However, the basic implementation described in the previous section does not explicitly track the basis that constitutes the blocks' Hilbert spaces. Taking conserved quantum numbers into account therefore requires modifying the procedures of the basic implementation to explicitly track the basis states and to divide the full Hilbert space into sectors with the same quantum numbers. In this section, we detail how we implement the U(1) symmetry discussed in Section 2.2.3, which in the context of the Heisenberg model corresponds to the conservation of the total magnetization S^z_tot. The general structure of this implementation is adapted from the Simple-DMRG program written in Python [34].

3.2.1 Tracking of Magnetization Sectors

One of the first changes introduced into our previous implementation is the tracking of the magnetization sectors in the blocks' Hilbert spaces. The magnetization values, which are eigenvalues of the S^z operator, are represented as an ordered tuple whose entries correspond to each state in our chosen matrix representation. For a block of one site containing a single spin-1/2, the magnetization quantum numbers corresponding to the matrix representations in (2.13) form the ordered list

S^z_1 = [ +1/2, -1/2 ]   (3.9)

This means that the Hilbert space contains two magnetization sectors containing one state each.

These conserved quantum numbers are additive among the constituent sites of a block. Each time an enlarged block is produced, this list is expanded following an outer-sum structure corresponding to the Kronecker product, as shown in Algorithm 7. For example, when another site is added to (3.9), the enlarged block's magnetization list becomes

S^z_2 = [+1, 0, 0, -1]   (3.10)


The corresponding Hilbert space thus contains the sectors S^z = +1 and S^z = -1 with one state each, and S^z = 0 containing two states. This additional step is performed during the BuildBlockLeft and BuildBlockRight procedures. The magnetization sector list is implemented in our code as a C++ standard vector for easy resizability.

On the other hand, we also need to track the magnetization sectors themselves by creating a map from each magnetization value to the set of indices of the basis states that belong to that sector. We implement this as a C++ standard map from the scalar magnetization value to a standard vector containing the indices of that sector. This subroutine, illustrated in Algorithm 8, will be useful later when determining which sectors to pair up in constructing the superblock Hamiltonian with a target magnetization.

Algorithm 7 Outer sum of quantum numbers between two blocks
Input: Quantum number lists S^z_A (size N_A) and S^z_B (size N_B)
Output: Quantum number list S^z_AB of the combined block

procedure OUTERSUMFLATTEN
    for i ∈ [0, N_A) do
        for j ∈ [0, N_B) do
            S^z_AB[i · N_B + j] ← S^z_A[i] + S^z_B[j]
        end for
    end for
end procedure

Algorithm 8 Creating the map from quantum number to sector indices
Input: Magnetization list S^z with N entries
Output: Mapping object SECTORINDICES

procedure INDEXMAP
    Prepare SECTORINDICES: float → vector of integers
    for i ∈ [0, N) do
        Append i to SECTORINDICES[S^z[i]]
    end for
end procedure
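A compact C++ sketch of Algorithms 7 and 8 using the standard containers mentioned above (function and variable names are illustrative; the thesis code may differ in its details):

#include <cstddef>
#include <map>
#include <vector>

// Outer sum of quantum numbers: entry (i*N_B + j) of the combined block
// carries magnetization SzA[i] + SzB[j], mirroring the Kronecker product.
std::vector<double> OuterSumFlatten(const std::vector<double>& SzA,
                                    const std::vector<double>& SzB)
{
  std::vector<double> SzAB(SzA.size() * SzB.size());
  for (std::size_t i = 0; i < SzA.size(); ++i)
    for (std::size_t j = 0; j < SzB.size(); ++j)
      SzAB[i * SzB.size() + j] = SzA[i] + SzB[j];
  return SzAB;
}

// Map each magnetization value to the indices of the basis states in that
// sector. Half-integer magnetizations are exactly representable, so using
// a floating-point key is safe here.
std::map<double, std::vector<std::size_t>> IndexMap(const std::vector<double>& Sz)
{
  std::map<double, std::vector<std::size_t>> sectors;
  for (std::size_t i = 0; i < Sz.size(); ++i)
    sectors[Sz[i]].push_back(i);
  return sectors;
}

Applied to the single-site list (3.9) and itself, OuterSumFlatten reproduces the enlarged-block list (3.10).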


3.2.2 Solving for the Ground State in a Target Sector

When solving for a specific state, one can often find it within a specific quantum number sector. For example, the ground state of the spin-1/2 Heisenberg chain lies in the magnetization sector S^z_tot = 0, while the first excited state is the lowest state in the sector S^z_tot = +1. Moreover, the most time-consuming and memory-intensive steps in the DMRG algorithm are the construction and diagonalization of the superblock Hamiltonian, owing to the large number of basis states. Thus, tracking the sector of each state allows us to use only those states that belong to the target sector when constructing the Hamiltonian, thereby reducing the computational cost.

We can use the INDEXMAP procedure in Algorithm 8 to group together the states that belong to a specific sector. Then, we select which sectors from the left and right blocks are to be paired together and determine the Kronecker product indices that belong to the target sector. This additional step, shown in Algorithm 9, is performed at the beginning of the BuildSuperblock procedure. The index list V is used to select which states belong to the target sector in the superblock Hamiltonian. A mapping M'_L is also created to track the sectors of the resulting ground state vector.

Algorithm 9 Selection of states in the target sector
Input:
    Magnetization lists S^z_L and S^z_R
    Target magnetization sector S^z_tot
Output:
    List of indices V of states belonging to the superblock's target sector
    Map M'_L of the left block's sector indices in the superblock

procedure SELECTSECTORSTATES
    M_L ← INDEXMAP(S^z_L)
    M_R ← INDEXMAP(S^z_R)
    for (s_L, V_L) ∈ M_L do                     ▷ s_L is an S^z value and V_L its sector indices
        s_R ← S^z_tot − s_L
        V_R ← M_R[s_R]
        for i ∈ V_L do
            for j ∈ V_R do
                Append (i · N_R + j) to V
                Append length(V) to M'_L[s_L]
            end for
        end for
    end for
end procedure


3.2.3 Kronecker Product with Index Slicing

To construct the superblock Hamiltonian in a restricted basis, we have to modify the Kronecker product routine itself. First, a new input list V is introduced, containing the basis indices produced in Algorithm 9. We use these indices to perform array slicing on both the rows and columns of the full Kronecker product matrix. However, to reduce memory usage and communication, we avoid building the full matrix itself. Taking the CSR storage format of our matrices into account, we perform the array slicing on the rows first, since this can be done much more easily. This procedure is shown in Algorithm 10 as a modification of Algorithm 2, looping over the indices in V during submatrix collection.

Having collected the needed submatrices on each MPI process, we build an intermediate submatrix that represents the row-sliced superblock Hamiltonian. We preallocate the memory needed for this intermediate sequential submatrix as shown in Algorithm 11, which is much simpler than the preallocation for the global matrix in Algorithm 3 since local submatrices are not divided into diagonal and off-diagonal blocks.

The entries of each row are then explicitly constructed into an intermediate sequential matrix C' in the same manner as in Algorithm 4. We can then perform the column slicing on each process' sequential matrix C' using PETSc's MatGetSubMatrices function. These matrices are finally stitched together into a single global matrix using MatCreateMPIMatConcatenateSeqMat, which handles everything including preallocation. We show the details of these steps in Algorithm 12. [23, 29]
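Before the detailed algorithms below, a sketch of this final column-slicing and concatenation step (PETSc 3.7 naming; the function name is illustrative, and nlocal, nV and the index array V are assumed to be available):

#include <petscmat.h>

/* Column-slice the local row-sliced sequential matrix Cprime to the
 * restricted basis V (of length nV), then stitch the per-process pieces
 * into a single distributed matrix C. */
static PetscErrorCode RestrictAndConcatenate(Mat Cprime, PetscInt nlocal,
                                             PetscInt nV, const PetscInt V[], Mat *C)
{
  PetscErrorCode ierr;
  IS             is_row, is_col;
  Mat            *Csub;   /* array of sequential submatrices returned by PETSc */

  ierr = ISCreateStride(PETSC_COMM_SELF, nlocal, 0, 1, &is_row); CHKERRQ(ierr); /* all local rows */
  ierr = ISCreateGeneral(PETSC_COMM_SELF, nV, V, PETSC_COPY_VALUES, &is_col); CHKERRQ(ierr);
  ierr = MatGetSubMatrices(Cprime, 1, &is_row, &is_col, MAT_INITIAL_MATRIX, &Csub); CHKERRQ(ierr);
  /* PETSC_DECIDE lets PETSc choose the local column layout; preallocation of
   * the final matrix is handled internally. */
  ierr = MatCreateMPIMatConcatenateSeqMat(PETSC_COMM_WORLD, Csub[0], PETSC_DECIDE,
                                          MAT_INITIAL_MATRIX, C); CHKERRQ(ierr);
  ierr = ISDestroy(&is_row); CHKERRQ(ierr);
  ierr = ISDestroy(&is_col); CHKERRQ(ierr);
  return 0;
}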


Algorithm 10 Indexed Kronecker Product: Submatrix Collection
Input:
    MPI matrices {A_n} of size (M_A, N_A) and {B_n} of size (M_B, N_B), n = 1, ..., M
    Indices V of the restricted basis
    MPI communicator rank n_rank and size n_procs
Output: Sequential matrices {A_n,sub} and {B_n,sub} containing the required non-local rows

procedure KRONIDXGETSUBMATRICES
    M_C ← length(V)                             ▷ Global number of rows of the restricted C
    N_C ← N_A · N_B                             ▷ Global number of columns of the full C
    i_C,start, i_C,end ← SPLITOWNERSHIPRANGE(M_C, n_rank, n_procs)
    for i_C ∈ [i_C,start, i_C,end) do
        IS_A ← insert(V[i_C] / M_B)             ▷ Index set of required rows of A
        IS_B ← insert(V[i_C] mod M_B)           ▷ Index set of required rows of B
    end for
    for n ∈ {1, ..., M} do
        A_n,sub ← MatGetSubMatrices of all rows of A_n listed in IS_A
        B_n,sub ← MatGetSubMatrices of all rows of B_n listed in IS_B
    end for
end procedure

Algorithm 11 Indexed Kronecker Product: Preallocation of the Intermediate Matrix
Input: Sets of submatrices {A_n,sub} and {B_n,sub}
Output: Array NNZ[] containing the number of non-zeros per row of the sequential matrix

procedure KRONIDXPREALLOCATION
    for i_C ∈ [i_C,start, i_C,end) do
        NNZ[i_C] ← 0
        for n ∈ {1, ..., M} do
            NNZ_A ← length(A_n,sub[V[i_C] / M_B])
            NNZ_B ← length(B_n,sub[V[i_C] mod M_B])
            NNZ[i_C] ← NNZ[i_C] + NNZ_A · NNZ_B
        end for
    end for
end procedure


Algorithm 12 Indexed Kronecker Product: Matrix Construction
Input:
    Coefficients {a_n} of the linear combination (3.2)
    Sets of local submatrices {A_n,sub} and {B_n,sub} containing the required non-local rows
Output: Resultant matrix C

procedure KRONIDXMATRIXCONSTRUCTION
    for row_C ∈ [i_C,start, i_C,end) do
        row_A ← V[row_C] / M_B
        row_B ← V[row_C] mod M_B
        for n ∈ {1, ..., M} do
            for (col_A, val_A) ∈ A_n,sub[row_A] do
                for (col_B, val_B) ∈ B_n,sub[row_B] do
                    col_C ← col_A · N_B + col_B
                    val_C ← a_n · val_A · val_B
                    C'[row_C] ← insert((col_C, val_C))      ▷ Row-sliced sequential matrix
                end for
            end for
        end for
    end for
    for j_C ∈ V do
        IS_C ← insert(j_C)                                  ▷ Index set for columns
    end for
    C'' ← MatGetSubMatrices of all columns of C' listed in IS_C
    C ← MatCreateMPIMatConcatenateSeqMat of the C'' from all processes
end procedure


3.2.4 Block Diagonal Reduced Density Matrices

The iterative diagonalization of the superblock Hamiltonian using SLEPc's EPS class proceeds in the same manner as in Listing 3.1. In addition, we now have explicit information on the quantum number associated with each index of the ground state vector, provided by M'_L in Algorithm 9. By fixing the total magnetization, the corresponding reduced density matrices acquire a block-diagonal structure [15]. Thus, instead of diagonalizing a reduced density matrix of size dD × dD, we can work on its smaller diagonal blocks. The construction of the block-diagonal reduced density matrices is shown in Algorithm 13. It uses the same intermediate matrix Ψ as in Algorithm 5 to build each diagonal block, but with the smaller dimension N_L × N_R.

The spectral decomposition can likewise be performed on each diagonal block of the reduced density matrices, as shown in Algorithm 14, which involves significantly more steps than Algorithm 6 but is computationally more efficient since it requires the eigendecomposition of smaller matrices. The final truncation of all operators proceeds in the same manner as in the basic implementation, using Õ = U†OU for all operators O in both the left and right blocks.


Algorithm 13 Construction of the block-diagonal reduced density matrices
Input:
    Global ground state vector ψ_GS
    Sector indices map M'_L of the superblock Hamiltonian
    Sector indices maps M_L and M_R of the left and right blocks
    Target magnetization S^z_tot
Output: S^z sectors and reduced density matrix blocks {(s^q_L, ρ^q_L◦)}, q = 1, ..., Q_L, and {(s^q_R, ρ^q_◦R)}, q = 1, ..., Q_R

procedure BUILDBLOCKDIAGONALREDUCEDDENSITYMATRICES
    ψ ← full local copy of ψ_GS using VecScatter
    q ← 0                                       ▷ Quantum number counter
    for (s_L, V_L) ∈ M'_L do
        s_R ← S^z_tot − s_L
        N_L ← length(M_L[s_L])                  ▷ Number of states in the sector of the left block
        N_R ← length(M_R[s_R])                  ▷ Number of states in the sector of the right block
        if N_L · N_R = 0 then
            Continue
        end if
        Prepare the N_L × N_R local matrix Ψ
        for i ∈ [0, N_L) do
            for j ∈ [0, N_R) do
                Ψ[i, j] ← ψ[V_L[i · N_R + j]]   ▷ Copy the sector's values in row-major order
            end for
        end for
        (s^q_L, ρ^q_L◦) ← (s_L, ΨΨ†)
        (s^q_R, ρ^q_◦R) ← (s_R, Ψ†Ψ)
        q ← q + 1
    end for
end procedure


Algorithm 14 Construction of the rotation matrices from the block-diagonal reduced density matrices
Input:
    Sectors and diagonal blocks {(s^q, ρ^q)}, q = 1, ..., Q, of a reduced density matrix ρ
    Sector indices map M of the corresponding block
    Number of states to keep m
Output: Rotation matrix U, truncation error ε_tr

procedure GETBLOCKDIAGONALROTATIONMATRICES
    Let D_q be the number of states in each diagonal block, q = 1, ..., Q
    D ← Σ_{q=1}^{Q} D_q                         ▷ Total number of states of ρ
    Prepare Λ ≡ {(w_i, v_i)}, the spectral decomposition of ρ
    for q ∈ {1, 2, ..., Q} do
        Λ_q ← {(w^q_i, v^q_i)}, i = 1, ..., D_q, the spectral decomposition of the sub-block ρ^q
        Append Λ_q to Λ
    end for
    Sort Λ in descending order of w and keep only the first m entries
    Truncation error: ε_tr ← 1 − Σ_{i=1}^{m} w_i
    Prepare U of dimension D × m
    for each (w^q_i, v^q_i) ∈ Λ, with column index i do
        for k ∈ [0, D_q) do
            U[M[s^q][k], i] ← (v^q_i)_k         ▷ Scatter the eigenvector into the rows of its sector
        end for
    end for
end procedure


CHAPTER 4

Results and Performance

In this chapter we evaluate the performance and scalability of our DMRG application. In Section 4.1, we describe the target hardware architecture and the compilation of the software dependencies used for our numerical calculations. In Section 4.2 we briefly look at how we evaluate the accuracy of our numerical results. We then present the results of our performance measurements in Section 4.3, followed by a discussion of these results in Section 4.4.

4.1 Computational Tools

4.1.1 System Architecture

The systems we targeted with our DMRG implementation are two partitions of CINECA Marconi, a Tier-0 class supercomputer based on the Intel Xeon processor family. Specifically, we perform our numerical experiments on the Marconi A1 partition, which runs on Broadwell (BDW) processors, and on the Marconi A2 partition, which runs on Knights Landing (KNL) Xeon Phi processors. The relevant specifications of each partition are provided in Table 4.1.

From the table, several key differences between BDW and KNL are evident. KNL nodes have a higher count of slower cores, but a higher peak performance than BDW. The BDW nodes have more total memory than KNL, but the KNL processors add 16 GB of on-chip high-bandwidth MCDRAM on top of the regular DDR4. The production KNL nodes use MCDRAM in cache mode, in which it functions as an L3 cache invisible to the user, instead of acting as a separate physical address space [35].


Table 4.1: Details of the CINECA Marconi BDW and KNL partitions [35]

                          Marconi A1 (BDW)                       Marconi A2 (KNL)
Processors                2x 18-core Intel Xeon E5-2697          68-core Intel Xeon Phi 7250
                          (Broadwell) at 2.3 GHz                 (Knights Landing) at 1.40 GHz
Cores                     36 cores/node; 54,432 cores total      68 cores/node; 244,800 cores total
Instruction Set Ext.      AVX 2.0                                AVX-512
RAM                       128 GB/node                            16 GB/node MCDRAM + 96 GB/node DDR4
Max Memory Bandwidth      76.8 GB/s                              115.2 GB/s
Network                   Intel Omni-Path, 100 Gb/s (both partitions)


BDW processors use the standard 64-bit instruction set with the Advanced Vector Extensions 2.0 (AVX2), which enable 256-bit wide vectorization. KNL adds support for AVX-512, which expands the vector width to 512 bits. This means that, in general, binaries compiled for BDW processors are also compatible with KNL, but not vice versa.

4.1.2 Software Dependencies and Configuration

Our DMRG application was compiled with software libraries tailored for Intel processors to ensure the best possible performance on our target computing systems. For the performance tests on BDW and KNL, we used PETSc version 3.7.2 and SLEPc version 3.7.3, which are provided as modules on Marconi and which were built with the Intel MPI compilers and the Intel Math Kernel Library (MKL) 2017 for the linear algebra routines. The libraries were configured with the default 64-bit real scalars and 32-bit integers, and with Fortran kernels enabled for array operations.

The DMRG application that we have written is compiled using Makefiles with directives that include the configuration files of the targeted build of PETSc and SLEPc. The C++ files are compiled with the -std=c++11 flag indicating the language standard and the -O3 optimization flag.
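A minimal Makefile of the kind described here might look as follows (a sketch: the include path follows the standard SLEPc makefile convention, while the target and object names are illustrative):

# Pull in compilers, flags and link lines from the PETSc/SLEPc builds
# pointed to by the PETSC_DIR and SLEPC_DIR environment variables.
include ${SLEPC_DIR}/lib/slepc/conf/slepc_common

CXXFLAGS += -std=c++11 -O3

dmrg.x: idmrg.o main.o
	${CLINKER} -o $@ $^ ${SLEPC_EPS_LIB}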


4.2 Convergence of Ground State Energy Per Site

Figure 4.1: Left: ground state energy per site e^N_0(m) for different numbers of sites N of the superblock, obtained during a single infinite-size DMRG run. Right: corresponding error of e^N_0(m) relative to the known value at the thermodynamic limit, e^∞_0. Values were obtained with m = 512 kept states.

Figure 4.2: Convergence of e^N_0(m) with an increasing number of kept states. The error was calculated with respect to e^N_0(m_max = 2048).

We test the accuracy of our DMRG implementation by tracking the ground state energy per site of the spin-1/2 Heisenberg chain, obtained from the diagonalization of the superblock Hamiltonian at each iteration. There are two ways to evaluate accuracy and convergence. First, the ground state energy per site e^N_0(m) for N sites and m kept states should converge towards the known result at the thermodynamic limit, e^∞_0 = -0.4431..., although the calculated energy will always be slightly larger due to finite-size corrections [36]. The convergence of the energy is shown in Figure 4.1. Second, the values of e^N_0(m) should also converge as the number of kept states increases. This is shown in Figure 4.2, in which e^N_0(m) converges towards the value obtained with the largest number of kept states, e^N_0(m_max).


4.3 Performance Analysis

In this section, we present the results of the performance measurements of our DMRG implementation. The numerical calculations were done on the Marconi A1 (BDW) and A2 (KNL) clusters. Elapsed time and memory consumption were measured using PETSc profiling tools from the root MPI process (rank 0) and scaled accordingly. All DMRG simulations presented here involve growing a superblock chain of N = 36 sites with various numbers of kept states after truncation. At each iteration, we calculate the ground state, which resides in the magnetization sector S^z_tot = 0. Elapsed times are shown color-coded for each component step of our workflow (Section 3.1.1), while memory consumption is shown for the objects with the largest memory usage.
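The measurements rely on PETSc's built-in profiling interface. As a rough illustration (not the application's actual source; the function and stage names below are ours), a workflow step such as BuildSuperBlock can be wrapped in a named log stage, timed with PetscTime(), and its memory usage queried with PetscMemoryGetCurrentUsage():

    /* Simplified profiling sketch with PETSc; not the exact thesis code.
       A named log stage makes the step appear separately in -log_view output,
       while PetscTime() and PetscMemoryGetCurrentUsage() report wall time and
       the resident memory of the calling process. */
    #include <petscsys.h>
    #include <petsctime.h>

    PetscErrorCode TimedBuildSuperBlock(void)
    {
      PetscErrorCode       ierr;
      static PetscLogStage stage = -1;
      PetscLogDouble       t_start, t_end, mem;

      if (stage < 0) {  /* register the stage only once */
        ierr = PetscLogStageRegister("BuildSuperBlock", &stage); CHKERRQ(ierr);
      }
      ierr = PetscLogStagePush(stage); CHKERRQ(ierr);
      ierr = PetscTime(&t_start); CHKERRQ(ierr);

      /* ... construction of the superblock Hamiltonian would go here ... */

      ierr = PetscTime(&t_end); CHKERRQ(ierr);
      ierr = PetscLogStagePop(); CHKERRQ(ierr);

      /* Resident memory of this process; the figures in this chapter are taken from rank 0. */
      ierr = PetscMemoryGetCurrentUsage(&mem); CHKERRQ(ierr);
      ierr = PetscPrintf(PETSC_COMM_WORLD, "BuildSuperBlock: %g s, memory: %g MB\n",
                         t_end - t_start, mem / (1024.0 * 1024.0)); CHKERRQ(ierr);
      return 0;
    }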

4.3.1 Comparison Between Implementations

We first compare the two DMRG implementations detailed in the previous chapter: the basic implementation without symmetries, and the implementation with U(1) symmetry, i.e. conservation of the total magnetization S^z_tot. We look at the improvement from the basic implementation, which solves for the ground state among all magnetization sectors, to the second implementation, which targets the ground state in the S^z_tot = 0 sector.

Figure 4.3: Comparison of memory usage of the dominant PETSc objects (matrices, vectors, and the direct solver) between the basic implementation without symmetries, which targets all S^z_tot sectors, and the implementation which targets the S^z_tot = 0 sector.

As shown in Figure 4.3, memory consumption is significantly reduced in the second implementation. This is because the S^z_tot = 0 sector is a smaller subset of all possible states of the superblock, which results in a smaller Hamiltonian matrix; this in turn produces smaller vectors and is diagonalized with less memory usage.



Figure 4.4: Comparison of the elapsed time of simulations with m = 1024 kept states between the basic implementation without symmetries, which targets all S^z_tot sectors, and the implementation which targets the S^z_tot = 0 sector, run on 4 nodes (272 processes) and 8 nodes (544 processes). Left: all steps in the algorithm. Right: the same measurements showing all steps except SolveGroundState.

From the plots in Figure 4.4, we can see that in both implementations the most time-consuming steps are the construction of the large superblock Hamiltonian (BuildSuperBlock), its iterative diagonalization (SolveGroundState), the eigendecomposition of the reduced density matrices (GetRotationMatrices), and the rotation of the operators to the truncated basis (TruncateOperators). In the implementation with a target magnetization, we immediately see an order-of-magnitude improvement in the time to solution. The largest improvement comes from SolveGroundState, as it benefits directly from the smaller size of the Hamiltonian matrix.

There is also a significant improvement in the elapsed time of the last three steps, which involve operations on smaller matrices. The BuildReducedDMs and GetRotationMatrices steps are faster because of the block diagonal structure of the reduced density matrices ρ_L(R): tracking magnetization sectors allows us to work only on several smaller diagonal blocks instead of a full matrix. The TruncateOperators step speeds up the most, since the transformation matrices U_L(R), which were constructed as dense matrices in the basic implementation, become sparse and block diagonal in the second.
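As an aside, one way such a rotation, O' = U^T O U, can be expressed with PETSc is through MatPtAP. The sketch below is only illustrative: it is not the routine used in the application, and it assumes the default real scalars, so that U^T coincides with U^†.

    /* Illustrative sketch, not the thesis routine: rotate a block operator into the
       truncated basis, O_trunc = U^T O U, using MatPtAP (which computes P^T A P).
       When U is sparse and block diagonal, as in the implementation that conserves
       S^z_tot, this triple product is far cheaper than with a dense U. */
    #include <petscmat.h>

    PetscErrorCode RotateOperator(Mat O, Mat U, Mat *O_trunc)
    {
      PetscErrorCode ierr;
      ierr = MatPtAP(O, U, MAT_INITIAL_MATRIX, PETSC_DEFAULT, O_trunc); CHKERRQ(ierr);
      return 0;
    }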

During the BuildSuperBlock step, in which the superblock Hamiltonian is constructed, the slicing of rows and columns corresponding to the target sector states is done in separate steps to reduce memory consumption. This means that within this step intermediate matrices have to be created and destroyed, and values have to be accessed and copied between them. Thus, although the overall matrix memory consumption is significantly reduced, in terms of elapsed time this step improves only slightly in the second implementation.
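For illustration, a two-pass restriction of this kind can be written with PETSc 3.7's MatGetSubMatrix. The sketch below uses our own naming (RestrictToSector is hypothetical) and assumes that each rank passes only its locally owned share of the sector indices:

    /* Simplified sketch (not the thesis code) of restricting the superblock Hamiltonian
       to a target magnetization sector in two passes, so that only one intermediate
       matrix is alive at a time. Each rank supplies its locally owned sector indices. */
    #include <petscmat.h>

    PetscErrorCode RestrictToSector(Mat H_full, PetscInt n_local, const PetscInt *sector_idx,
                                    Mat *H_sector)
    {
      PetscErrorCode ierr;
      MPI_Comm       comm = PetscObjectComm((PetscObject)H_full);
      IS             is_sector, is_allcols, is_allrows;
      Mat            H_rows;
      PetscInt       rstart, rend, cstart, cend;

      /* Parallel index set of the target-sector states. */
      ierr = ISCreateGeneral(comm, n_local, sector_idx, PETSC_COPY_VALUES, &is_sector); CHKERRQ(ierr);

      /* Pass 1: keep only the target rows; retain all columns of H_full. */
      ierr = MatGetOwnershipRangeColumn(H_full, &cstart, &cend); CHKERRQ(ierr);
      ierr = ISCreateStride(comm, cend - cstart, cstart, 1, &is_allcols); CHKERRQ(ierr);
      ierr = MatGetSubMatrix(H_full, is_sector, is_allcols, MAT_INITIAL_MATRIX, &H_rows); CHKERRQ(ierr);
      ierr = ISDestroy(&is_allcols); CHKERRQ(ierr);

      /* Pass 2: keep only the target columns; then discard the intermediate matrix. */
      ierr = MatGetOwnershipRange(H_rows, &rstart, &rend); CHKERRQ(ierr);
      ierr = ISCreateStride(comm, rend - rstart, rstart, 1, &is_allrows); CHKERRQ(ierr);
      ierr = MatGetSubMatrix(H_rows, is_allrows, is_sector, MAT_INITIAL_MATRIX, H_sector); CHKERRQ(ierr);

      ierr = MatDestroy(&H_rows); CHKERRQ(ierr);
      ierr = ISDestroy(&is_allrows); CHKERRQ(ierr);
      ierr = ISDestroy(&is_sector); CHKERRQ(ierr);
      return 0;
    }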



Since the implementation with U(1) symmetry gives a substantial advantage in terms of both time to solution and memory consumption, it is used in all numerical simulations presented in the following sections.

4.3.2 Strong Scalability

In this section, we evaluate the strong scalability of our DMRG code; that is, we quantify the improvement in the time to solution for a fixed problem size as the number of computational resources, such as processors and nodes, is increased. Strong scalability is measured through the speedup and the parallel efficiency. The speedup s_p is defined as the ratio

    s_p = t_1 / t_p,    (4.1)

where t_p is the elapsed time with p processing elements (nodes or cores) and t_1 is the best time with 1 processing element. Ideal speedup, s_p = p, occurs when additional resources produce a directly proportional improvement in performance. In some instances t_1 cannot be obtained directly, for example when the problem does not fit in the memory of one node; in that case we use t_1 = q t_q for some baseline number of processing elements q. Parallel efficiency, on the other hand, is the ratio between the obtained speedup and the ideal speedup,

    e_p = s_p / p = t_1 / (p t_p),    (4.2)

with an ideal value of e_p = 1 [37].
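As a small worked example of Eqs. (4.1) and (4.2) with a baseline of q processing elements, the helper below is purely illustrative; its numbers are not measurements from our runs.

    /* Hypothetical helper, not part of the DMRG code: speedup and parallel efficiency
       relative to a baseline of q processing elements, using t_1 = q * t_q. */
    #include <cstdio>

    void report_scaling(int q, double t_q, int p, double t_p)
    {
      double t_1        = q * t_q;      /* surrogate single-element time     */
      double speedup    = t_1 / t_p;    /* Eq. (4.1): s_p = t_1 / t_p        */
      double efficiency = speedup / p;  /* Eq. (4.2): e_p = s_p / p          */
      std::printf("p = %d: s_p = %.2f, e_p = %.2f\n", p, speedup, efficiency);
    }

    int main()
    {
      /* Illustrative numbers only: baseline of q = 4 processes taking 400 s,
         compared with p = 8 processes taking 230 s. */
      report_scaling(4, 400.0, 8, 230.0);   /* prints s_8 = 6.96, e_8 = 0.87 */
      return 0;
    }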

We first look at the case of a few processors, with problem sizes of m = 512 and m = 768 kept states whose memory requirements fit on a single node. Figure 4.5 shows the elapsed time for the different values of m and for the two sections of Marconi. We can immediately see that runs performed on BDW are generally faster than those on KNL when comparing the same number of processes.

From these run times, we obtain the speedup and efficiency shown in Figure 4.6. Although the BDW cores are faster, the results from KNL show better scalability. If we also look at the parallel efficiency of the most time-consuming steps of the algorithm, namely BuildSuperBlock and SolveGroundState, both steps show similar scaling behavior on the two architectures since they are both memory-bound operations. Thus, the most probable reason



Figure 4.5: Elapsed time for DMRG runs with m = 512 (top) and m = 768 (bottom) kept states on 4 to 64 MPI processes, showing the strong scaling behavior of the program on BDW (left) and KNL (right). Times are broken down into the component steps of the algorithm.
