Algorithm for solving tridiagonal matrix problems in parallel
Nathan Mattor, Timothy J. Williams, and Dennis W. Hewett 
Lawrence Livermore National Laboratory 
Livermore, California 94550 
FAX: (510) 4233484 email: mattor@m5.llnl.gov 
[Published as Parallel Computing 21 1769 (1995)]
Abstract
A new algorithm is presented, designed to solve tridiagonal matrix
problems efficiently with parallel computers (multiple instruction stream,
multiple data stream (MIMD) machines with distributed memory).
The algorithm is designed to be extendable to higher order banded diagonal systems.
I. Introduction
Currently, there are several popular methods for parallelization of the
tridiagonal problem.
The ``most important'' of these have
recently been described with a unified approach, through parallel
factorization (Amodio and Brugnano).
Essentially, parallel factorization divides and solves the problem by the
following steps:

1.

Factor the original matrix into a product of a block matrix (that
can be divided up between processors) and a reduced matrix, which
couples the block problems.

2.

Solve each block problem with one processor.

3.

Solve the reduced matrix problem.
Here, we propose a new approach to parallel solution of such systems.
It is conceptually different from parallel factorization, in that
the first step is avoided: no manipulations are performed on the original
matrix before subdividing it among the processors.
Avoiding this step has three advantages: simplicity, speed, and one less stability
concern.
The method here is analogous to the solution of an inhomogeneous linear differential
equation, where the solution is a ``particular'' solution added to an arbitrary linear
combination of ``homogeneous'' solutions.
The coefficients of the homogeneous solutions are later determined by boundary
conditions.
In our parallel method, each processor is given a contiguous subsection of a tridiagonal
system.
Even with no information about the neighboring subsystems, the solution can be obtained
up to two constants.
This completes the bulk of the necesary calculations.
Once each processor has obtained such a solution, the global solution
can be found by matching endpoints.
This approach allows a number of desirable features in both generality and
efficient implementation.
First, the method is readily generalizable to 5diagonal and higher banded systems,
as discussed in Sec VI.
Second, the particular and homogeneous solutions can be calculated quite
efficiently, since there are a number of overlapping calculations.
Usual serial LU decomposition of a single M×M tridiagonal system requires 8M floating point operations and a temporary storage array of M elements [Press et al. or
Hockney and Eastwood].
For the parallel routine below, the three solutions (one particular and two homogeneous)
are calculated with 13M operations (not 3×8M = 24M) and no additional temporary
storage arrays, while leaving the input data intact.
Third, the interprocessor communication can be performed quite efficiently,
by the alltoall broadcast method described in Sec. IV.
Finally, vector processors can be utilized effectively in cases where there are
a number of banded diagonal systems to be solved.
This paper gives an implementation of this method.
The algorithm is designed with the following objectives, listed in order of priority.
First, the algorithm must minimize the number of interprocessor communications opened,
since this is the most time consuming process.
Second, the algorithm allows flexibility of the specific solution method of the
tridiagonal submatrices.
Here, we employ a variant of LU decomposition, but this is easily replaced with cyclic
reduction or other.
Third, we wish to minimize storage needs.
The remainder of this paper is organized as follows.
Sec. II outlines the analysis underlying the routine.
Sec. III describes an algorithm for computing the particular and two
homogeneous routines in 13M operations.
Sec. IV gives a method to assemble the reduced system efficiently in each
processor, solve it, and complete the solution.
Sec. V covers time consumption and performance of the algorithm.
Sec. VI gives some conclusions and generalizations of this routine.
The Appendix gives program segments.
II. Basic Algorithm
We consider the problem of solving the N×N linear system
with
Λ = 
‾         _

 
‾    
    _

, (22) 

on a parallel computer with P processors.
For simplicity, we assume N = PM, with M an integer.
Our algorithm is as follows.
First, the linear system of order N is subdivided into P subsystems of
order M.
Thus, the N×N matrix Λ is divided into P
submatrices L_{p}, each of dimension M×M,
Λ = 
‾    
    _







 





 





 





 




c_{M}^{P1}e_{M}e_{1}^{T} 
 





 
‾    
    _

, (25) 

where e_{p} is the pth column of the M×M identity matrix.
Similarly, the N dimensional vectors X and R are divided into
P subvectors x and r, each of dimension M
Here the submatrix L_{p} is given by


‾    
    _

 
‾    
    _


 

‾    
    _

 
‾    
    _

, 
 

for p = 1,2,…P, with similar and obvious definitions for x and
r.
For brevity, we have defined a_{m}^{p} ≡ A_{M(p1)+m}, corresponding to
the m^{th} subdiagonal element of the p^{th} submatrix.
Similarly, b_{m}^{p} ≡ B_{M(p1)+m} and c_{m}^{p} ≡ C_{M(p1) +m}.
For each subsystem p we define three vectors x_{p}^{R},
x_{p}^{UH}, and x_{p}^{LH} as the solutions to the equations
The superscripts on the x stand for ``particular'' solution, ``upper homogeneous''
solution, and ``lower homogeneous'' solution respectively, from the inhomogeneous
differential equation analogy.
The general solution of subsystem p consists of x_{p}^{R} added to an arbitrary
linear combination of x_{p}^{UH} and x_{p}^{LH},
x_{p} = x_{p}^{R} + x_{p}^{UH}x_{p}^{UH} + x_{p}^{LH}x_{p}^{LH}, (212) 

where x_{p}^{UH} and x_{p}^{LH} are yet undetermined coefficients that depend
on coupling to the neighboring solutions.
To find x_{p}^{UH} and x^{LH}, Eq. (212) is substituted into Eq. (21).
Straightforward calculation and various definitions in this section
show that x_{1}^{UH} = x_{P}^{LH} = 0, and the
remaining 2P2 coefficients are determined by the solution to the
following tridiagonal linear system, or "reduced" system

‾       
       _

 
‾       
       _


‾       
       _

 
‾       
       _

=  
‾       
       _

 
‾       
       _

, (213) 

where x_{p,m} refers to the m^{th} element of the appropriate solution from
the p^{th} submatrix.
The solution of Eq. (21), via subdivision and reassembly, is now complete.
The outline of the algorithm is as follows:

1.

For each processor, find x_{p}^{R}, x_{p}^{UH}, and
x_{p}^{LH} by solving Eqs. (29)(211).

2.

Assemble the ``reduced'' system, Eq. (213), and
solve for the x_{p}^{UH} and x_{p}^{LH}.

3.

For each processor, compute the final solution by Eq. (212).
This algorithm is our main result.
The remainder of this paper discusses practical details of reducing operation
count, memory requirement, and communication time.
III. Computing the particular and homogeneous solutions
The three solutions x_{p}^{R}, x_{p}^{UH}, and x_{p}^{LH} are
obtained by solving Eqs. (29)(211).
The method here is based on LU decomposition, which is usually the most efficient.
With vector processors, cyclic reduction may be more efficient (Hockney and
Jesshope), but only if just one system need be solved.
Many applications require solution of multiple tridiagonal systems, for which the
most efficient use of vectorization is to use LU decomposition and vectorize across the
multiple systems.
This is discussed further in Sec. VI.
Equations (29)(211) represent three M×M tridiagonal systems.
If these systems were independent, then solution via LU decomposition would
require 3×8M = 24M binary floating point operations.
However, exploiting overlapping calculations and elements with value 0, gives the
following algorithm, which can be implemented with 13M binary floating point operations.
 Forward elimination:  
ω_{i} = 
c_{i} b_{i}a_{i}ω_{i1}

i = 2,3,…M 
 
γ_{i} = 
r_{i}a_{i}γ_{i1} b_{i}a_{i}ω_{i1}

i = 2,3,…M 
 Back substitution:  
x^{R}_{i} = γ_{i}  ω_{i} x^{R}_{i+1} i = M1,M2,…1 
 
x^{LH}_{i} = ω_{i} x^{LH}_{i+1} i = M1,M2,…1 
 
ω^{UH}_{i} = 
a_{i} b_{i}c_{i}ω_{i+1}^{UH}

i = M1,M2,…1 
 Forward substitution:  
x^{UH}_{i} = ω^{UH}_{i}x^{UH}_{i1} i = 2,3,…M, 


where the processor index p is implicitly present on all variables, and we
have assumed that end elements a_{1} and c_{M} are written in the appropriate
positions in the a and c arrays.
This algorithm is similar to the usual LU decomposition algorithm, (see, e.g.,
Press et al. or Hockney and Eastwod), but with an extra forward substitution.
The sample FORTRAN segment in the Appendix implements this with no temporary
storage arrays.
If the tridiagonal matrix is constant, and only the right hand side changes from
one iteration to the next, then the vectors ω_{i},
1/(b_{i}a_{i}ω_{i}), x^{UH}_{i}, and x^{LH}_{i} can be precalculated
and stored.
The computation then requires only 5M binary floating point operations.
IV. Construction and solution of the reduced matrix
Once each processor has determined x_{p}^{R}, x_{p}^{UH},
and x_{p}^{LH}, it is time to construct and solve the reduced system of
Eq. (213).
This section describes an algorithm for this.
We assume that the following subroutines are available for interprocessor
communication:
 Send(ToPid,data,n): When this is invoked by processor
FromPid, the array data of length n is
sent to processor ToPid.
It is assumed that Send is nonblocking, in that the
processor does not wait for the data to be received by
ToPid before continuing.
 Receive(FromPid,data,n): To complete data transmission,
Receive is invoked by processor ToPid.
Upon execution, the array sent by processor FromPid is stored in
the array data array of length n,
It is assumed that Receive is blocking, in that the processor
waits for the data to be received before continuing.
Opening interprocessor communications is generally the most
timeconsuming step in the entire tridiagonal solution process, so it is important
to minimize this.
The following algorithm consumes a time of T = (log_{2}P)t_{c} in opening
communication channels (where t_{c} is the time to open one channel).
 Each processor writes whatever data it has that is relevant to
Eq. (213) in the array OutData.

The OutData arrays from each processor are concatenated as
follows (Fig. 1):

Each procesor p sends its OutData array to processor
p1 mod P, and receives a corresponding array from processor
p+1 mod P, as depicted in Fig. 1a.
The incoming array is concatenated to the end
of OutData.

At the i^{th} step, repeat the first step, except sending to
processor p2^{i1} mod P, and receiving from processor p+2^{i1} mod P
(Fig. 1b,c), for i = 1,2,….
After log_{2}P iterations (or the next higher integer), each
processor has the contents of the reduced matrix in the OutData
array.

Each processor rearranges the contents of its OutData array
into the reduced tridiagonal system, and then solves. (Each processor solves the same
reduced system.)
This communication is dense (every processor communicates in every step), and
periodic, so that upon completion every processor contains the fully concatenated
OutData array).

Figure 1.

Illustration of the method used to pass reduced
matrix data between processors, with P = 5 for definiteness.
 step 1:

In the first iteration, each processor p
sends data to processor (p1) mod P+1, receives data from
processor (p+1) mod P+1, and concatenates the data arrays.
The result is that each OutData array contains the
data from the processors shown, in order shown. The remaining
elements of the OutData array are not used.
 step 2:

Each processor sends its OutData
array to processor (p2) mod P+1, receives data from processor
(p+2) mod P+1, again concatenating.
 step 3:

In the i^{th} iteration (here third and
final), each processor sends its OutData array to
processor (p2^{i1}) mod P+1, receives data from processor
(p+2^{i1}) mod P+1, again concatenating. The final
OutData array contains all the information of the
reduced matrix, ordered cyclically beginning with the contributions
of the p^{th} processor. The information beyond the element
Outdata(8P) is not used.
A sample program segment is provided in the Appendix.
If the elements of the tridiagonal matrix are constants, then the reduced
matrix can be precalculated and only the reduced right hand side needs to be
assembled.
In this case, the above routine could be rewritten to pass 1/4 as many real
numbers.
This does not represent a very large time savings, since generally it is the
channel openings that is the most costly, not the amount of data passage.
At this point, the necessary components to Eq. (212) are stored in each
processor.
All that is left is the trivial task of picking out the correct coefficients
and constructing the final solution.
A sample program segment is provided in the Appendix.
V. Performance
In this section we discuss execution time of this algorithm,
and present scaling tests made with a working code.
The time consumption for this routine is as follows.

1.

To calculate the three roots x^{R}, x^{UH}, and
x^{LH} requires 13M binary floating point operations by each
processor, done in parallel.

2.

To assemble the reduced matrix in each processor requires
log_{2}P steps where interprocessor communications are opened, and
the i^{th} opening passes 8×2^{i1} real numbers.

3.

Solution of the reduced system through LU decomposition requires
8(2P2) binary floating point operations by each processor.

4.

Calculation of the final solution requires 4M binary floating point
operations by each processor, done in parallel.
If t_{b} is the time of one binary floating point operation,
t_{c} is the time required to open a communication channel (latency), and
t_{p} is the time to pass one real number once communication is opened,
then the time to execute this parallel routine is given by (optimally)

@ 13Mt_{b} + (log_{2}P)t_{c} + 8(P1)t_{p} + 8(2P2)t_{b} + 4Mt_{b} 
 
@ (17M+16P)t_{b} + (log_{2}P)t_{c} + 8Pt_{p}, 
 

for P >> 1.
For cases of present interest, T_{P} is dominated by (log_{2}P)t_{c} and 17M t_{b}.
The parallel efficiency is defined by e_{P} ≡ [(T_{S})/(P T_{P})],
where T_{S} is the execution time of a serial code which solves by LU decomposition.
Since serial LU decomposition solve an N×N system in a time T_{S} = 8Nt_{b}, then
the predicted parallel efficiency is
e_{P} = 
8 17+16P^{2}/N+(log_{2}P)Pt_{c}/Nt_{b} + 8P^{2}t_{p}/Nt_{b}

. (52) 

To test these claims empirically, the execution times of working serial and parallel codes
were measured, and e_{P} was calculated both through its definition and through
Eq. (52).
Fig. 2 shows e_{P} as a function of P for two cases, N = 200 and N = 50,000.
We conclude from Fig. 2 that Eq. (52) (smooth lines) is reasonably accurate, both
for the theoretical maximum efficiency (47%, acheived for small P and large N) and
for the scaling with large P.

Figure 2.

Results of scaling runs, comparing the parallel
time with serial LU decomposition time. Here, e_{P} is the
parallel efficiency and P is the number of processors. The
smooth lines represent Eq. (52), and the points are empirical
results. The upper line and diamonds use N = 200, and the lower line
and squares use N = 50,000.
These timings were made on the BBN TC2000 MIMD machine at Lawrence Livermore National
Laboratory.
This machine has 128 M88100 RISC processors, connected by a butterflyswitch
network.
To calculate the predictions of Eq. (51), the times t_{c}, t_{p}, and t_{b}
were obtained as follows.
We chose t_{c} = 750msec, based on the average time of a send/receive pair
measured in our code.
Based on communications measurements for the BBN (Leith), we chose
the passage time of a single 64bit real number as t_{p} = 9msec.
[In comparison, the peak bandwidth specification for the machine of 38 MB/sec per path
(BBN Advanced Computers Inc.) would yield t_{p} = 0.2msec.
Using this value instead of 9msec makes no visible differences in Figure 2.]
We chose t_{b} = 1.4msec, based on our measured timing of 0.00218 sec for the
serial algorithm on the N = 200 case.
(In comparison, the peak performance specification of 10 MFLOPS for the M88100
on 64bit numbers would yield t_{b} = 0.1msec.
Using this value instead of 1.4msec would yield curves which fall well below
our measured parallel efficiency values.)
All measurements were made with 64bit floating point arithmetic.
It is difficult to make a comprehensive comparison with all other parallel tridiagonal
solvers, but we can compare our speed with a popular algorithm by H. Wang.
That routine requires 2P2 steps where communications are opened (compared with
log_{2}P such steps here), and 21M binary operations per processor (compared with 17M
here).
We have not performed empirical comparisons, but since both dominant processes
have lower counts, it seems reasonable to believe the present algorithm is generally
faster.
VI. Discussion and Conclusions
In this paper, we have described a numerical routine for solving tridiagonal
matrices using a multiple instruction stream/multiple data stream (MIMD) parallel
computer with distributed memory.
The routine has the advantage over existing methods in that the initial factorization
step is not used, leading to a simpler, and probably faster, routine.
Stability of this algorithm is similar to that of serial LU decomposition of a tridiagonal
matrix.
If the L_{i} are unstable to LU decomposition, then pivoting could be used.
If the L_{i} are singular, then LU decomposition fails and some alternative should
be devised.
If the large matrix Λ is diagonally dominant,
(A_{i} > B_{i}+C_{i}) then so too are the L_{i}.
If the reduced system is unstable to LU decomposition, this can be replaced by a
different solution scheme, with little loss of overall speed (if P << M).
This routine is generalizable from tridiagonal to higher systems.
For example, in a 5diagonal system, there would be four homogeneous solutions,
each with an undetermined coefficient.
The coefficients of the homogeneous solutions would be determined by a reduced
system analogous to Eq. (213), except with O(4P) equations, not 2P2.
We briefly discuss implementation of this routine in a problem where there are
many tridiagonal systems to solve.
This situation arises in many of the forseeable applications of parallel tridiagonal
solvers, such as solving differential equations by the alternatingdirection
implicit method (ADI) (Press et al.) on a multidimensional grid.
For definiteness we consider the two dimensional grid depicted in Fig. 3.
For each grid line in y, there is a tridiagonal system to solve, with the
index running in the x direction.

Figure 3.

Schematic representation of
a typical set of tridiagonal systems that might arise in a two
dimensional grid. For each grid line in y, there is a
tridiagonal system to solve.
The routine described in this paper is well suited for this problem, which
may be handled efficiently as follows.

1.

The grid is divided into block subdomains in x and y.
Each processor is assigned one subdomain, with the submatrices and
subright hand sides stored locally.
If the number of y grid lines in each subdomain is L, and the number of x
grid points is M, then each processor has L systems with M equations each.

2.

Each processor finds x^{R}, x^{UH}, and x^{LH}, for each y grid line in its subdomain, by the algorithm described
in Sec. III.

3.

To assemble the reduced systems, each family of processors colinear
in x passes the reduced system data for all y grid lines in the
subdomain, by the algorithm described in Sec. IV.
(Note that the number of communication openings can be minimized by passing
all L reduced systems together.)
After this step is complete, each processor contains L reduced systems, one
for each y grid line in its subdomain.

4.

Each processor solves the L reduced systems, then assembles the
final solution, as described at the end of Sec. IV.
This method has several desirable features.
First, the time spent opening interprocessor communications is P_{x}log_{2}P_{x} (where P_{x} is the number of processors colinear in x), which is not
greater than for solving a single system with P_{x} processors.
Insofar as this is the most time consuming step, multidimensional efficiency
is quite good for this algorithm.
Second, if the processors are vector processors, the calculations for the
y grid lines (steps 1 and 3 above) can be carried out in parallel.
This would seem to be a more efficient use of vectorization than replacing
the LU decomposition with cyclic reduction, since the former involves fewer
operations.
Third, the algorithm is easily converted to a system where the roles of x and
y are reversed; all that needs to be done is exchange indices.
Complicated rearrangement of subdomains is not necessary.
This work was performed for the U.S. Department of Energy at Lawrence
Livermore National Laboratory under contract W7405ENG48.
Appendix
This Appendix gives sample FORTRAN routines for the algorithms in Secs. III and IV.
This makes the algorithms more concrete, and also gives some time and memory
saving steps not mentioned above.
The particular and homogeneous solutions for each submatrix are computed by the algorithm
in Sec. III.
This is to be run by each processor, and it is assumed that the arrays
a_{1}^{p}…a_{M}^{p}, b_{1}^{p}…b_{M}^{p}, c_{1}^{p}…c_{M}^{p}, and
r_{1}^{p}…r_{M}^{p} are stored locally in each processor, with the index p
omitted.
No temporary arrays are needed; all intermediate storage is done in the final
solution arrays, xr, xuh, and xlh
(each with M elements).
!forward elimination:
xuh(1) = c(1)/b(1)
xlh(1) = r(1)/b(1)
do i = 2,M,1
denom = b(i)  c(i)*xuh(i1)
if (denom.eq.0) pause !LU decomposition fails
xuh(i) = c(i)/denom
xlh(i) = (r(i)  a(i)*xlh(i1))/denom
end do
!back substitution:
xr(M) = xlh(M)
xlh(M) = xuh(M)
xuh(M) = a(M)/b(M)
do i = M1,1,1
xr(i) = xlh(i)  xuh(i)*xr(i+1)
xlh(i) = xuh(i)*xlh(i+1)
denom = b(i)  c(i)*xuh(i+1)
if (denom.eq.0) pause !LU decomposition fails
xuh(i) = a(i)/denom
end do
!forward substitution:
xuh(1) = xuh(1)
do i = 2,M,1
xuh(i) = xuh(i)*xuh(i1)
end do
Sec. IV describes how the reduced matrix is written to each processor.
A sample routine follows, to be executed by each processor.
We assume the processors are numbered p = 1,2,…,P.
The integer pid is the local processor number (called p in the
mathematical parts of this paper), and the integer nprocs is the
code name for P.
The integer log2P is the smallest integer greater than or equal
to log_{2}(P).
The real array OutData has 8×2^{log2P}
elements.
The subroutine tridiagonal(a,b,c,r,sol,n) (not given here)
is a serial subroutine that returns the solution sol
for the tridiagonal system with subdiagonal a, diagonal
b, superdiagonal c, and right hand side
r, all of length n.
!write contributions of current processor into OutData:
OutData(1) = 1.
OutData(2) = xuh(1)
OutData(3) = xlh(1)
OutData(4) = xr(1)
OutData(5) = xuh(M)
OutData(6) = xlh(M)
OutData(7) = 1.
OutData(8) = xr(M)
!concatenate all the OutData arrays:
log2P = log(nprocs)/log(2)
if (2**log2P.lt.nprocs) log2P = log2P+1
do i = 0,log2P1,1
nxfer = 8*(2**i)
ToProc = 1 + mod(pid2**i+2*nprocs,nprocs)
FromProc = 1 + mod(pid+2**i,nprocs)
call Send(ToProc,OutData,nxfer)
call Receive(FromProc,OutData(nxfer+1),nxfer)
end do
!Put OutData into reduced tridiagonal form:
nsig=8*nprocs !no. of significant entries in OutData
ifirst=8*(nprocspid) + 5 !index of a(1) in OutData
do i = 1,2*nprocs2,1
ibase = mod(ifirst+4*(i1),nsig)
reduca(i) = OutData(ibase)
reducb(i) = OutData(ibase+1)
reducc(i) = OutData(ibase+2)
reducr(i) = OutData(ibase+3)
end do
!solve reduced system:
call tridiagonal(reduca,reducb,reducc,reducr,coeffs,2*nprocs2)
Once the reduced matrix is solved, then the solution can be assembled in each processor
as follows.
!pick out the appropriate elements of coeffs:
if (processor.ne.1) then
uhcoeff = coeffs(2*processor2)
else
uhcoeff = 0.
end if
if (processor.ne.nprocs) then
lhcoeff = coeffs(2*processor1)
else
lhcoeff = 0.
end if
!compute the final solution:
do i = 1,M,1
x(i) = xr(i) + uhcoeff*xuh(i) + lhcoeff*xlh(i)
end do
References
P. Amodio and L. Brugnano, Parallel Factorizations and Parallel
Solvers for Tridiagonal Linear Systems, Linear algebra and its
applications 172: 347364 (1992).
BBN Advanced Computers Inc., Inside the TC2000 Computer, Cambridge, MA, 25
(1989).
R. W. Hockney and J. W. Eastwood, Computer Simulation Using
Particles, copyright 1988, Adam Hilger, Bristol, p. 185.
R. W. Hockney and C. R. Jesshope, Parallel Computers 2 (copyright
1988 by IOP Publishing, Ltd., Bristol).
C. E. Leith,``Domain Decomposition Message Passing for Diffusion and Fluid Flow,''
in The MPCI Yearly Report: The Attack of the Killer Micros, UCRLID107022
(1991).
William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling, Numerical Recipes, copyright 1986, Cambridge
University Press, pp. 1940.
H. H. Wang, A parallel method for tridiagonal equations, ACM
Trans. Math. Software 7:170183 (1981).
File translated from
T_{E}X
by
T_{T}H,
version 2.60.
On 16 Dec 1999, 16:22.