Solving Nonlinear Equations: Newton’s method
Newton’s method (also known as the Newton–Raphson method) is a method for finding successively
better approximations to the roots (or zeroes) of a real-valued function.
The Newton–Raphson method in one variable is implemented as follows:
Given a function f defined over the reals x, and its derivative f', we begin with a first guess x0 for a root of
the function f. Provided the function satisfies all the assumptions made in the derivation of the formula, a
better approximation x1 is

x1 = x0 − f(x0) / f'(x0)

Geometrically, (x1, 0) is the intersection with the x-axis of the tangent to the graph of f at (x0, f(x0)).
The process is repeated in the recurrence relation

xn+1 = xn − f(xn) / f'(xn)

until a sufficiently accurate value is reached.
This algorithm is first in the class of Householder’s methods, succeeded by Halley’s method. The method
can also be extended to complex functions and to systems of equations.
The algorithm of the method is as follows: one starts with an initial guess which is reasonably close to the
true root, then the function is approximated by its tangent line (which can be computed using the tools of
calculus), and one computes the x-intercept of this tangent line (which is easily done with elementary
algebra). This x-intercept will typically be a better approximation to the function’s root than the original
guess, and the method can be iterated.
Suppose ƒ : [a, b] → R is a differentiable function defined on the interval [a, b] with values in the real
numbers R. The formula for converging on the root can be easily derived. Suppose we have some current
approximation xn. Then we can derive the formula for a better approximation, xn+1 by referring to the
diagram on the right. The equation of the tangent line to the curve y = f(x) at the point x = xn is

y = f(xn) + f'(xn)(x − xn)

where f' denotes the derivative of the function f.
The x-intercept of this line (the value of x such that y = 0) is then used as the next approximation to the root,
xn+1. In other words, setting y to zero and x to xn+1 gives

0 = f(xn) + f'(xn)(xn+1 − xn)

Solving for xn+1 gives

xn+1 = xn − f(xn) / f'(xn)
We start the process off with some arbitrary initial value x0. (The closer to the zero, the better. But, in the
absence of any intuition about where the zero might lie, a “guess and check” method might narrow the
possibilities to a reasonably small interval by appealing to the intermediate value theorem.) The method
will usually converge, provided this initial guess is close enough to the unknown zero, and that ƒ'(x0) ≠ 0.
Furthermore, for a zero of multiplicity 1, the convergence is at least quadratic (see rate of convergence) in
a neighbourhood of the zero, which intuitively means that the number of correct digits roughly at least
doubles in every step.
Newton’s method is an extremely powerful technique—in general the convergence is quadratic: as the
method converges on the root, the difference between the root and the approximation is squared (the
number of accurate digits roughly doubles) at each step. However, there are some difficulties with the
method.
Difficulty in calculating the derivative of a function
Newton’s method requires that the derivative be calculated directly. An analytical expression for the
derivative may not be easily obtainable and could be expensive to evaluate. In these situations, it may be
appropriate to approximate the derivative by using the slope of a line through two nearby points on the
function. Using this approximation would result in something like the secant method whose convergence is
slower than that of Newton’s method.
Failure of the method to converge to the root
It is important to review the proof of quadratic convergence of Newton’s Method before implementing it.
Specifically, one should review the assumptions made in the proof. For situations where the method fails to
converge, it is because the assumptions made in this proof are not met.
If the first derivative is not well behaved in the neighborhood of a particular root, the method may
overshoot, and diverge from that root. An example of a function with one root, for which the derivative is
not well behaved in the neighborhood of the root, is

f(x) = |x|^a,   0 < a < 1/2

for which the root will be overshot and the sequence of xn will diverge. For a = 1/2, the root will still be
overshot, but the sequence will oscillate between two values. For 1/2 < a < 1, the root will still be
overshot but the sequence will converge, and for a ≥ 1 the root will not be overshot at all.
If a stationary point of the function is encountered, the derivative is zero and the method will terminate
due to division by zero.
Poor initial estimate
A large error in the initial estimate can contribute to non-convergence of the algorithm.
Trying to avoid non-convergence
It is common to place limits on the number of iterations, bound the solution to an interval known to
contain the root, and combine the method with a more robust root finding method.
Newton’s method is only guaranteed to converge if certain conditions are satisfied.
Bad starting points
In some cases the conditions on the function that are necessary for convergence are satisfied, but the point
chosen as the initial point is not in the interval where the method converges. This can happen, for example,
if the function whose root is sought approaches zero asymptotically as x goes to ∞ or −∞. In such cases a
different method, such as bisection, should be used to obtain a better estimate for the zero to use as an
initial point.
Iteration point is stationary
Consider the function:

f(x) = 1 − x²

It has a maximum at x = 0 and solutions of f(x) = 0 at x = ±1. If we start iterating from the stationary point
x0 = 0 (where the derivative is zero), x1 will be undefined, since the tangent at (0, 1) is parallel to the x-axis:

x1 = x0 − f(x0) / f'(x0) = 0 − 1/0
The same issue occurs if, instead of the starting point, any iteration point is stationary. Even if the
derivative is small but not zero, the next iteration will be a far worse approximation.
Starting point enters a cycle
The tangent lines of x³ − 2x + 2 at 0 and 1 intersect the x-axis at 1 and 0 respectively, illustrating why
Newton’s method oscillates between these values for some starting points.
For some functions, some starting points may enter an infinite cycle, preventing convergence. Let

f(x) = x³ − 2x + 2

and take 0 as the starting point. The first iteration produces 1 and the second iteration returns to 0 so the
sequence will alternate between the two without converging to a root. In fact, this 2-cycle is stable: there
are neighborhoods around 0 and around 1 from which all points iterate asymptotically to the 2-cycle (and
hence not to the root of the function). In general, the behavior of the sequence can be very complex.
If the function is not continuously differentiable in a neighborhood of the root then it is possible that
Newton’s method will always diverge and fail, unless the solution is guessed on the first try.
Derivative does not exist at root
A simple example of a function where Newton’s method diverges is the cube root, which is continuous and
infinitely differentiable, except for x = 0, where its derivative is undefined (this, however, does not affect
the algorithm, since it will never require the derivative if the solution is already found):

f(x) = x^(1/3)

For any iteration point xn, the next iteration point will be:

xn+1 = xn − f(xn) / f'(xn) = xn − xn^(1/3) / ((1/3) xn^(−2/3)) = xn − 3xn = −2xn

The algorithm overshoots the solution and lands on the other side of the y-axis, farther away than it
initially was; applying Newton’s method actually doubles the distance from the solution at each iteration.
In fact, the iterations diverge to infinity for every f(x) = |x|^α, where 0 < α < 1/2. In the limiting case
of α = 1/2 (square root), the iterations will alternate indefinitely between points x0 and −x0, so they do not
converge in this case either.
If the derivative is not continuous at the root, then convergence may fail to occur in any neighborhood of
the root. Consider the function

f(x) = x + x² sin(2/x)  for x ≠ 0,   f(0) = 0

Its derivative is:

f'(x) = 1 + 2x sin(2/x) − 2 cos(2/x)  for x ≠ 0,   f'(0) = 1

Within any neighborhood of the root, this derivative keeps changing sign as x approaches 0 from the right
(or from the left) while f(x) ≥ x − x² > 0 for 0 < x < 1.
So f(x)/f'(x) is unbounded near the root, and Newton’s method will diverge almost everywhere in any
neighborhood of it, even though:
• the function is differentiable (and thus continuous) everywhere;
• the derivative at the root is nonzero;
• f is infinitely differentiable except at the root; and
• the derivative is bounded in a neighborhood of the root (unlike f(x)/f'(x)).
Numerical Integration: Simpson’s rule
Simpson’s rule is a method for numerical integration, the numerical approximation of definite integrals.
Specifically, it is the following approximation:

∫[a,b] f(x) dx ≈ ((b − a)/6) [ f(a) + 4 f((a + b)/2) + f(b) ]
Simpson’s rule is a staple of scientific data analysis and engineering.
The midpoint and the trapezoidal rules
The derivation constructs Simpson’s rule from two simpler approximations: the midpoint rule

M = (b − a) f((a + b)/2)

and the trapezoidal rule

T = ((b − a)/2) [ f(a) + f(b) ]

The errors in these approximations are

(1/24)(b − a)³ f''(a) + O((b − a)⁴)   and   −(1/12)(b − a)³ f''(a) + O((b − a)⁴)

respectively, where O((b − a)⁴) denotes a term asymptotically proportional to (b − a)⁴. The two leading
terms are not equal: the trapezoidal error has twice the magnitude of the midpoint error and the opposite
sign. It follows from the above formulas for the errors of the midpoint and
trapezoidal rule that the leading error term vanishes if we take the weighted average

(2M + T) / 3

This weighted average is exactly Simpson’s rule.
Using another approximation (for example, the trapezoidal rule with twice as many points), it is possible to
take a suitable weighted average and eliminate another error term. This is Romberg’s method.
The error in approximating an integral by Simpson’s rule is

−(1/2880) (b − a)⁵ f''''(ξ)

where ξ is some number between a and b.
The error is asymptotically proportional to (b − a)⁵. However, the above derivations suggest an error
proportional to (b − a)⁴. Simpson’s rule gains an extra order because the points at which the integrand is
evaluated are distributed symmetrically in the interval [a, b].
Composite Simpson’s rule
If the interval of integration is in some sense “small”, then Simpson’s rule will provide an adequate
approximation to the exact integral. By small, what we really mean is that the function being integrated is
relatively smooth over the interval . For such a function, a smooth quadratic interpolant like the one
used in Simpson’s rule will give good results.
However, it is often the case that the function we are trying to integrate is not smooth over the interval.
Typically, this means that either the function is highly oscillatory, or it lacks derivatives at certain points. In
these cases, Simpson’s rule may give very poor results. One common way of handling this problem is by
breaking up the interval into a number of small subintervals. Simpson’s rule is then applied to each
subinterval, with the results being summed to produce an approximation for the integral over the entire
interval. This sort of approach is termed the composite Simpson’s rule.
Suppose that the interval [a, b] is split up into n subintervals, with n an even number. Then, the composite
Simpson’s rule is given by

∫[a,b] f(x) dx ≈ (h/3) [ f(x0) + 4 f(x1) + 2 f(x2) + 4 f(x3) + … + 2 f(xn−2) + 4 f(xn−1) + f(xn) ]

where xi = a + i·h for i = 0, 1, …, n with h = (b − a)/n; in particular, x0 = a and
xn = b. The above formula can also be written

∫[a,b] f(x) dx ≈ (h/3) [ f(x0) + 2 Σ f(x2j) (j = 1, …, n/2 − 1) + 4 Σ f(x2j−1) (j = 1, …, n/2) + f(xn) ]

The error committed by the composite Simpson’s rule is bounded (in absolute value) by

(h⁴/180) (b − a) max |f''''(ξ)|,   ξ in [a, b]

where h is the “step length”, given by h = (b − a)/n.
This formulation splits the interval [a, b] into subintervals of equal length. In practice, it is often advantageous
to use subintervals of different lengths and concentrate the effort on the places where the integrand is
less well behaved. This leads to the adaptive Simpson’s method.
Alternative extended Simpson’s rule
This is another formulation of a composite Simpson’s rule: instead of applying Simpson’s rule to disjoint
segments of the integral to be approximated, Simpson’s rule is applied to overlapping segments, yielding:
The formula above is obtained by combining the original composite Simpson’s rule with the one consisting
of using Simpson’s 3/8 rule in the end subintervals and the standard 3-point rule in the remaining
subintervals. The result is then obtained by taking the mean of the two formulas.
Numerical Differentiation
The ability to differentiate numerically is vital because in Engineering many physical systems are modelled
as differential equations and their solution is often impossible analytically. This is especially true when real
world problems are studied using partial differential equations.
For most mathematical models the differential equations only involve first or second derivative terms,
thus we will only examine the approximation of first and second derivatives.
To approximate the first derivative we undo the limiting process and calculate f'(x) in terms of finite
differences:

forward difference: f'(x) ≈ (f(x + h) − f(x)) / h
backward difference: f'(x) ≈ (f(x) − f(x − h)) / h

These can be derived by using the Taylor series expansion:

f(x + h) = f(x) + h f'(x) + (h²/2) f''(x) + (h³/6) f'''(x) + …
f(x − h) = f(x) − h f'(x) + (h²/2) f''(x) − (h³/6) f'''(x) + …

Subtracting the backward expansion from the forward expansion gives:

f'(x) ≈ (f(x + h) − f(x − h)) / (2h),   error O(h²)

This is our approximation for the first derivative f'(x) and it also gives an indication of the error involved:
the error is QUADRATIC in h.
By adding the forward expansion to the backward expansion, we can obtain an approximation to the
second derivative f''(x):

f''(x) ≈ (f(x + h) − 2 f(x) + f(x − h)) / h²,   error O(h²)

This also shows that the error in this numerical approximation is QUADRATIC. These expressions are very
important in solving differential equations.
Random number generation
A random number generator (RNG) is a computational or physical device designed to generate a sequence
of numbers or symbols that lack any pattern, i.e. appear random.
The many applications of randomness have led to the development of several different methods for
generating random data.
Several computational methods for (pseudo)random number generation exist. Many fall short of the goal
of true randomness — though they may meet, with varying success, the weaker standard of pseudorandomness.
Practical applications and uses
Note that, in general, where unpredictability is paramount – such as in security applications – hardware
generators are generally preferred (where feasible) over pseudo-random algorithms.
Random number generators are very useful in developing Monte Carlo-method simulations. They are also
used in cryptography – so long as the seed is secret. Sender and receiver can generate the same set of
numbers automatically to use as keys.
The generation of pseudo-random numbers is an important and common task in computer programming.
While cryptography and certain numerical algorithms require a very high degree of apparent randomness,
many other operations only need a modest amount of unpredictability.
“True” random numbers vs. pseudo-random numbers
There are two principal methods used to generate random numbers. The first method measures some
physical phenomenon that is expected to be random and then compensates for possible biases in the
measurement process. Example sources include measuring atmospheric noise, thermal noise, and other
external electromagnetic and quantum phenomena. For example, cosmic background radiation or
radioactive decay as measured over short timescales represent sources of natural entropy.
The second method uses computational algorithms that can produce long sequences of apparently random
results, which are in fact completely determined by a shorter initial value, known as a seed or key. The
latter type are often called pseudorandom number generators. These types of generators do not typically
rely on naturally occurring sources.
A “random number generator” based solely on deterministic computation cannot be regarded as a “true”
random number generator in the purest sense of the word, since their output is inherently predictable if all
seed values are known. In practice, they are sufficient for most tasks. Carefully designed and implemented
pseudo-random number generators can even be certified for security-critical cryptographic purposes.
Pseudo-random number generators (PRNGs) are algorithms that can automatically create long runs of
numbers with good random properties but eventually the sequence repeats (or the memory usage grows
without bound). The string of values generated by such algorithms is generally determined by a fixed
number called a seed. One of the most common PRNGs is the linear congruential generator, which uses the recurrence

Xn+1 = (a Xn + c) mod m

to generate numbers. The maximum number of numbers the formula can produce is the modulus, m. To
avoid certain non-random properties of a single linear congruential generator, several such random
number generators with slightly different values of the multiplier coefficient a can be used in parallel, with
a “master” random number generator that selects from among the several different generators.
Most computer programming languages include functions or library routines that provide random number
generators. They are often designed to provide a random byte or word, or a floating point number
uniformly distributed between 0 and 1.
The quality, i.e. randomness, of such library functions varies widely, from completely predictable output to
cryptographically secure. The default random number generator in many languages, including Python,
Ruby, R, IDL and PHP is based on the Mersenne Twister algorithm and is not sufficient for cryptography
purposes. Such library functions often have poor statistical properties and some will repeat patterns after
only tens of thousands of trials.
They are often initialized using a computer’s real-time clock as the seed, since such a clock generally
measures in milliseconds, far beyond a person’s precision. These functions may provide enough
randomness for certain tasks (for example video games) but are unsuitable where high-quality randomness
is required, such as in cryptography applications, statistics or numerical analysis.
Most programming languages, including those mentioned above, provide a means to access these higher-quality
sources of randomness.
An example of a simple pseudo-random number generator is the multiply-with-carry method. It is
computationally fast and has good (albeit not cryptographically strong) randomness properties
// global variables: choose initial values to be the seed
unsigned int m_w = <choose-initializer>; /* must not be zero, nor 0x464fffff */
unsigned int m_z = <choose-initializer>; /* must not be zero, nor 0x9068ffff */

// random number generating function
unsigned int get_random(void)
{
    m_z = 36969 * (m_z & 65535) + (m_z >> 16);
    m_w = 18000 * (m_w & 65535) + (m_w >> 16);
    return (m_z << 16) + m_w; /* 32-bit result */
}
Random numbers uniformly distributed between 0 and 1 can be used to generate random numbers of
any desired distribution by passing them through the inverse cumulative distribution function (CDF) of
the desired distribution.
Generating the Gaussian (Normal) distribution
Suppose U1 and U2 are independent random variables that are uniformly distributed in the interval (0, 1]
and we let:

Z0 = sqrt(−2 ln U1) · cos(2π U2)
Z1 = sqrt(−2 ln U1) · sin(2π U2)

then Z0 and Z1 are independent random variables with a standard normal distribution.
(This is known as the Box–Muller transformation.)
A simple implementation in C++ scales the deviates by a variance parameter. If unit-variance noise is
wanted, the multiplication by the variance in the implementation below can simply be removed.

#define TWO_PI 6.2831853071795864769252866

double generateGaussianNoise(const double &variance)
{
    static bool haveSpare = false;
    static double rand1, rand2;

    if (haveSpare)
    {
        haveSpare = false;
        return sqrt(variance * rand1) * sin(rand2);
    }

    haveSpare = true;

    rand1 = rand() / ((double) RAND_MAX);
    if (rand1 < 1e-100) rand1 = 1e-100;
    rand1 = -2 * log(rand1);
    rand2 = (rand() / ((double) RAND_MAX)) * TWO_PI;

    return sqrt(variance * rand1) * cos(rand2);
}
A faster algorithm avoids having to calculate trigonometric functions.
We can use the fact that, in a two-dimensional Cartesian system where the X and Y coordinates are described
by two independent and normally distributed random variables, the random variables R² and Θ in the
corresponding polar coordinates are also independent and can be expressed as:

R² = −2 ln U1   and   Θ = 2π U2

Two uniformly distributed values, u and v, are used to produce the value s = R² = u² + v², which is likewise
uniformly distributed.
Given u and v, independent and uniformly distributed in the closed interval [−1, +1], set s = R² = u² + v².
(Clearly 0 ≤ s ≤ 2.) If s = 0 or s ≥ 1, throw u and v away and try another pair (u, v). Because u and v are
uniformly distributed and because only points within the unit circle have been admitted, the values of s will
be uniformly distributed in the open interval (0, 1), too. The latter can be seen by calculating the
cumulative distribution function for s in the interval (0, 1): this is the area of a circle with radius √s, divided
by π, which equals s. From this we find the probability density function to have the constant value 1 on the interval (0, 1).
Equally so, the angle θ divided by 2π is uniformly distributed in the interval [0, 1) and independent of s.
We now identify the value of s with that of U1 and θ/(2π) with that of U2. The values of cos θ and sin θ in
the original Box–Muller formulas can then be replaced with the ratios cos θ = u/R = u/√s and sin θ = v/R = v/√s, giving

z0 = u · sqrt(−2 ln s / s)   and   z1 = v · sqrt(−2 ln s / s)
The advantage is that calculating the trigonometric functions directly can be avoided. This is helpful when
trigonometric functions are more expensive to compute than the single division that replaces each one.
This produces two standard normal deviates.
When a computer is used to produce a uniform random variable it will inevitably have some inaccuracies
because there is a lower bound on how close numbers can be to 0. For a 32-bit computer the smallest
number that can be generated is 2⁻³². When U1 and U2 are equal to this, the Box–Muller transform
produces a normal random variable equal to

sqrt(−2 ln 2⁻³²) ≈ 6.66

This means that the algorithm will not produce random variables more than 6.66 standard deviations from
the mean. This corresponds to a proportion of about 2(1 − Φ(6.66)) ≈ 2.7 × 10⁻¹¹ lost due to the truncation.
Since R² = −2 ln U1 is the square of the norm of the standard bivariate normal variable (X, Y), it has the chi-squared
distribution with two degrees of freedom. In the special case of two degrees of freedom, the chi-squared
distribution coincides with the exponential distribution, and the equation for R2 above is a simple way of
generating the required exponential variate.
Solution of Differential Equations
Many engineering problems are examined by setting up a mathematical model for them and then solving
that model. The most useful mathematical model for these problems is the differential equation. Most of
these models require only up to second-order differentials but often they are complicated by being non-linear, and for real-world problems they require more than one variable (they are multi-dimensional).
Differential equations that involve more than one variable are partial differential equations. These variables are
typically one, two or three spatial coordinates and time.
Often simplifications are possible, where one or two spatial dimensions can be omitted from the model.
Rarely can non-linear differential equations be solved analytically and often even linear models require
boundary conditions that make an analytic solution impossible. Thus, in practice, we use discrete numerical
approximation for the differentials and a computer to carry out the calculations.
We have already seen how to approximate first and second differentials and so the differential equations
can be reformulated in terms of discretised variables. It is helpful to investigate the solution of differential
equations by classifying them according to the order of the differentials and how they are combined.
The classes are:
• Elliptic e.g. Poisson’s Equation
• Parabolic e.g The Diffusion Equation
• Hyperbolic e.g. The Wave Equation
We have two potential solution techniques:
• Finite Differences
• Finite Elements
In this course we use Finite Differences and investigate Elliptic and Parabolic equations:
Typical elliptic equations in a two-dimensional Cartesian system are Laplace’s equation and Poisson’s equation:

∂²u/∂x² + ∂²u/∂y² = 0   (Laplace)
∂²u/∂x² + ∂²u/∂y² = f(x, y)   (Poisson)

As this involves two dimensions we use partial derivatives. We could study a one-dimensional problem by
removing one of the variables or a three-dimensional problem by adding another variable.
Applying the finite difference approximations to the partial derivatives we get the five-point
central-difference formula for the Laplace equation:

(u(i+1,j) − 2 u(i,j) + u(i−1,j)) / Δx²  +  (u(i,j+1) − 2 u(i,j) + u(i,j−1)) / Δy²  =  0

This is accurate to O(Δx², Δy²) and applies ONLY at discrete points. Note the spacing in the horizontal and
vertical directions: Δx and Δy need not be equal.
This can be generalised:
This expression can be used to solve for the discrete value of the field variable U at the point (i,j) using
values of U in the neighbourhood. Of course these points may themselves need to be calculated and this
complicates the solution, resulting in the need to build up a solution algorithm.
NOTE: Any differential equation can only be solved numerically under specific boundary/initial conditions
and thus these must be specified. An example illustrates the process.
When Δx = Δy, the finite difference form reduces to

u(i+1,j) + u(i−1,j) + u(i,j+1) + u(i,j−1) − 4 u(i,j) = 0
In order to explore various solution procedures, first consider a square domain with Dirichlet Boundary
For instance, a simple 6×6 grid system subject to the following B.C.s is considered:
x=0 u=u2, y=0 u=u1
x=L u=u4, y=H u=u3
The resulting solution requires 16 linear equations to be written for each of the 16 interior points where U
is to be calculated.
These can be written in matrix form because this makes it easier to see a pattern and set up a numerical
scheme.
You have learned several techniques for solving such problems. Gauss elimination is a good example, but
the special structure of the matrix makes it much easier to use an iterative approach such as the
Gauss–Seidel method. (See notes on the Jacobi & Gauss–Seidel methods later.)
The finite difference equation is given by:

u(i,j) = (1/4) [ u(i+1,j) + u(i−1,j) + u(i,j+1) + u(i,j−1) ]

For the first point (2,2), with superscript k as the iteration counter:

u(2,2)^(k+1) = (1/4) [ u(3,2)^k + u(1,2)^k + u(2,3)^k + u(2,1)^k ]
These equations mostly represent processes that evolve through time. They can potentially also involve two
and three spatial dimensions, but we will only investigate one spatial dimension.
Since they involve time, we talk about the problem having an initial condition as well as boundary
conditions in one, two or three spatial dimensions.
There are two approaches to their solution that have different stability properties and involve different
levels of computational complexity.
• Explicit (Forward time, central space)
• Implicit (Backward time, central space)
A typical parabolic second-order PDE is the unsteady heat conduction equation, which is considered first in
one space dimension. It has the following form:

∂u/∂t = α ∂²u/∂x²

where α is the diffusion coefficient. The time derivative ∂u/∂t
is expressed by a forward difference approximation which is of order ∆t.
Using the second-order central difference of order ∆x² for the spatial term, the diffusion equation can be
written:

u(i)^(k+1) = u(i)^k + (α ∆t / ∆x²) [ u(i+1)^k − 2 u(i)^k + u(i−1)^k ]   (2)

This explicit formulation is stable provided:

α ∆t / ∆x² ≤ 1/2

The new value is easily calculated for each discrete point based on the initial condition for U. The time evolution of U
can be determined at each time step ∆t using the simple formula (2). There are no unknowns: all the Ui
values are known at either the initial boundary or from the previous time step.
The main drawback with this method is the stability condition because it often results in the need to make
VERY small time steps and the solution takes a long time to progress meaningfully.
Implicit approach (backward/central space) method:
In this formulation the adjacent spatial variable values at the ‘current’ time step are required in order to
calculate the current value of the central variable:

u(i)^(k+1) − (α ∆t / ∆x²) [ u(i+1)^(k+1) − 2 u(i)^(k+1) + u(i−1)^(k+1) ] = u(i)^k

This makes a straightforward calculation of the current values impossible: a set of linear equations has to
be set up and solved using either the Gauss–Seidel or Gauss elimination techniques.
However, the implicit method is unconditionally stable and thus large time steps may be taken.
By extending either the Explicit or Implicit approach the diffusion equation in two spatial dimensions may
be solved using Finite Differences.
By extending these ideas, any form of ODE or PDE in any shaped environment or with any initial conditions
may be solved.
Working with ‘real’ numbers in a Computer
Computers naturally deal with integer quantities: there are exact one-to-one relationships between the
integers 0, 1, 2, 3, … and their binary equivalents (up to the capacity of the machine). The same thing can
NOT be said of ‘real’ (floating-point) numbers, nor of fractions such as 1/3, nor of irrational numbers
such as √2 or π.
Single-precision binary floating-point format: binary32
The IEEE 754 standard specifies a binary32 as having:
• Sign bit: 1 bit
• Exponent width: 8 bits
• Significand precision: 24 bits (23 explicitly stored)
This gives from 6 to 9 significant decimal digits of precision (if a decimal string with at most 6 significant
decimal digits is converted to IEEE 754 single precision and then converted back to the same number of
significant decimal digits, then the final string should match the original; and if an IEEE 754 single-precision
value is converted to a decimal string with at least 9 significant decimal digits and then converted back to single,
then the final number must match the original).
The sign bit determines the sign of the number, which is the sign of the significand as well. The exponent is either
an 8 bit signed integer from −128 to 127 (2’s Complement) or an 8 bit unsigned integer from 0 to 255 which
is the accepted biased form in IEEE 754 binary32 definition. If the unsigned integer format is used, the
exponent value used in the arithmetic is the exponent shifted by a bias – for the IEEE 754 binary32 case, an
exponent value of 127 represents the actual zero (i.e. for 2e − 127 to be one, e must be 127).
The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to
the left of the binary point) with value 1 unless the exponent is stored with all zeros. Thus only 23 fraction
bits of the significand appear in the memory format but the total precision is 24 bits (equivalent to
log10(2²⁴) ≈ 7.225 decimal digits). The bits are laid out as follows: sign (bit 31), biased exponent (bits
30–23), fraction (bits 22–0).
The real value assumed by a given 32-bit binary32 datum with a given biased exponent e (the 8-bit unsigned
integer) and a 23-bit fraction is, for the normal case 0 < e < 255,

value = (−1)^sign × (1.fraction)₂ × 2^(e − 127)
The single-precision binary floating-point exponent is encoded using an offset-binary representation, with
the zero offset being 127; also known as exponent bias in the IEEE 754 standard.
• Emin = 01H−7FH = −126
• Emax = FEH−7FH = 127
• Exponent bias = 7FH = 127
Thus, in order to get the true exponent as defined by the offset binary representation, the offset of 127 has
to be subtracted from the stored exponent.
The stored exponents 00H and FFH are interpreted specially.
In the IEEE 754 standard, zero is signed, meaning that there exist both a “positive zero” (+0) and a
“negative zero” (−0). In most run-time environments, positive zero is usually printed as “0”, while negative
zero may be printed as “-0”. The two values behave as equal in numerical comparisons, but some
operations return different results for +0 and −0. For instance, 1/(−0) returns negative infinity, while 1/+0
returns positive infinity (so that the identity 1/(1/±∞) = ±∞ is maintained). Other common functions with a
discontinuity at x=0 which might treat +0 and −0 differently include log(x), signum(x), and the principal
square root of y + xi for any negative number y. As with any approximation scheme, operations involving
“negative zero” can occasionally cause confusion. For example, in IEEE 754, x = y does not imply 1/x = 1/y,
as 0 = −0 but 1/0 ≠ 1/−0.
The infinities of the extended real number line can be represented in IEEE floating-point datatypes, just like
ordinary floating-point values like 1, 1.5, etc. They are not error values in any way, though they are often
(but not always, as it depends on the rounding) used as replacement values when there is an overflow.
Upon a divide-by-zero exception, a positive or negative infinity is returned as an exact result. An infinity can
also be introduced as a numeral (like C’s “INFINITY” macro, or “∞” if the programming language allows that
syntax).
IEEE 754 requires infinities to be handled in a reasonable way, such as
• (+∞) + (+7) = (+∞)
• (+∞) × (−2) = (−∞)
• (+∞) × 0 = NaN – there is no meaningful thing to do
IEEE 754 specifies a special value called “Not a Number” (NaN) to be returned as the result of certain
“invalid” operations, such as 0/0, ∞×0, or sqrt(−1). In general, NaNs will be propagated i.e. most operations
involving a NaN will result in a NaN, although functions that would give some defined result for any given
floating-point value will do so for NaNs as well, e.g. NaN ^ 0 = 1. There are two kinds of NaNs: the default
quiet NaNs and, optionally, signaling NaNs. A signaling NaN in any arithmetic operation (including
numerical comparisons) will cause an “invalid” exception to be signaled.
The representation of NaNs specified by the standard has some unspecified bits that could be used to
encode the type or source of error; but there is no standard for that encoding. In theory, signaling NaNs
could be used by a runtime system to flag uninitialized variables, or extend the floating-point numbers
with other special values without slowing down the computations with ordinary values, although such
extensions are not common.
Converting from decimal representation to binary32 format
In general, refer to the IEEE 754 standard itself for the strict conversion (including the rounding behaviour)
of a real number into its equivalent binary32 format.
Here we can show how to convert a base-10 real number into an IEEE 754 binary32 format using the following outline:
• consider a real number with an integer and a fraction part such as 12.375
• convert and normalize the integer part into binary
• convert the fraction part using the following technique as shown here
• add the two results and adjust them to produce a proper final conversion
Conversion of the fractional part:
Consider 0.375, the fractional part of 12.375. To convert it into a binary fraction, multiply the fraction by 2,
take the integer part and re-multiply the new fraction by 2 until a fraction of zero is found or until the precision
limit is reached, which is 23 fraction digits for the IEEE 754 binary32 format.
0.375 x 2 = 0.750 = 0 + 0.750 => b−1 = 0, the integer part represents the binary fraction digit. Re-multiply
0.750 by 2 to proceed
0.750 x 2 = 1.500 = 1 + 0.500 => b−2 = 1
0.500 x 2 = 1.000 = 1 + 0.000 => b−3 = 1, fraction = 0.000, terminate
We see that (0.375)₁₀ can be exactly represented in binary as (0.011)₂. Not all decimal fractions can be
represented in a finite-digit binary fraction. For example, decimal 0.1 cannot be represented in binary
exactly, so it is only approximated.
Therefore (12.375)₁₀ = (12)₁₀ + (0.375)₁₀ = (1100)₂ + (0.011)₂ = (1100.011)₂
Also, the IEEE 754 binary32 format requires that real values be represented in the normalized 1.m × 2^n format,
so 1100.011 is shifted to the right by 3 digits to become 1.100011 × 2^3
Finally we can see that: (12.375)₁₀ = (1.100011)₂ × 2^3
From which we deduce:
• The exponent is 3 (and in the biased form it is therefore 130 = 1000 0010)
• The fraction is 100011 (looking to the right of the binary point)
From these we can form the resulting 32 bit IEEE 754 binary32 format representation of 12.375 as: 0-
10000010-10001100000000000000000 = 41460000H
Note: consider converting 68.123 into IEEE 754 binary32 format: using the above procedure you expect to
get 42883EF9H, with the last 4 bits being 1001. However, due to the default rounding behaviour of the IEEE 754
format, what you get is 42883EFAH, whose last 4 bits are 1010.
Ex 1: Consider decimal 1. We can see that: (1)₁₀ = (1.0)₂ × 2^0
From which we deduce:
• The exponent is 0 (and in the biased form it is therefore 127 = 0111 1111 )
• The fraction is 0 (looking to the right of the binary point in 1.0 is all 0 = 000…0)
From these we can form the resulting 32 bit IEEE 754 binary32 format representation of real number 1 as:
0-01111111-00000000000000000000000 = 3f800000H
Ex 2: Consider a value of 0.25. We can see that: (0.25)₁₀ = (1.0)₂ × 2^−2
From which we deduce:
• The exponent is −2 (and in the biased form it is 127+(−2)= 125 = 0111 1101 )
• The fraction is 0 (looking to the right of binary point in 1.0 is all zeros)
From these we can form the resulting 32 bit IEEE 754 binary32 format representation of real number 0.25
as: 0-01111101-00000000000000000000000 = 3e800000H
Ex 3: Consider a value of 0.375. We saw that (0.375)₁₀ = (0.011)₂ = (1.1)₂ × 2^−2
Hence after determining the representation of 0.375 as 1.1 × 2^−2 we can proceed as above:
• The exponent is −2 (and in the biased form it is 127+(−2)= 125 = 0111 1101 )
• The fraction is 1 (looking to the right of binary point in 1.1 is a single 1 = x1)
From these we can form the resulting 32 bit IEEE 754 binary32 format representation of real number 0.375
as: 0-01111101-10000000000000000000000 = 3ec00000H
These examples are given in bit representation, in hexadecimal, of the floating-point value. This includes
the sign, (biased) exponent, and significand.
3f80 0000 = 1
c000 0000 = −2
7f7f ffff ≈ 3.4028234 × 10^38
(max single precision)
0000 0000 = 0
8000 0000 = −0
7f80 0000 = infinity
ff80 0000 = −infinity
3eaa aaab ≈ 1/3
By default, 1/3 rounds up, instead of down like double precision, because of the even number of bits in the
significand. The bits of 1/3 beyond the rounding point are 1010… which is more than 1/2 of a unit in the last place.
Converting from single-precision binary to decimal
We start with the hexadecimal representation of the value, 41c80000, in this example, and convert it to binary:
41c8 0000₁₆ = 0100 0001 1100 1000 0000 0000 0000 0000₂
then we break it down into three parts; sign bit, exponent and significand.
Sign bit: 0
Exponent: 1000 0011₂ = 83₁₆ = 131
Significand: 100 1000 0000 0000 0000 0000₂ = 480000₁₆
We then add the implicit 24th bit to the significand
Significand: 1100 1000 0000 0000 0000 0000₂ = C80000₁₆
and decode the exponent value by subtracting 127
Raw exponent: 83₁₆ = 131
Decoded exponent: 131 − 127 = 4
Each of the 24 bits of the significand (including the implicit 24th bit), bit 23 to bit 0, represents a value,
starting at 1 and halves for each bit, as follows
bit 23 = 1
bit 22 = 0.5
bit 21 = 0.25
bit 20 = 0.125
bit 19 = 0.0625
…
bit 0 = 0.00000011920928955078125
The significand in this example has three bits set, bit 23, bit 22 and bit 19. We can now decode the
significand by adding the values represented by these bits.
Decoded significand: 1 + 0.5 + 0.0625 = 1.5625 = C80000₁₆/2^23
Then we need to multiply with the base, 2, to the power of the exponent to get the final result
1.5625 × 2^4 = 25
41c8 0000 = 25
This is equivalent to:
value = (−1)^s × (1 + m/2^23) × 2^(e − 127)
where s is the sign bit, e is the biased exponent, and m is the significand (fraction).
Double-precision floating-point format
Double-precision floating-point format is a computer number format that occupies 8 bytes (64 bits) in
computer memory and represents a wide dynamic range of values by using floating point.
Computers with 32-bit storage locations use two memory locations to store a 64-bit double-precision
number (a single storage location can hold a single-precision number). Double-precision floating-point
format usually refers to binary64, as specified by the IEEE 754 standard, not to the 64-bit decimal format decimal64.
IEEE 754 double-precision binary floating-point format: binary64
Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over
single-precision floating point, in spite of its performance and bandwidth cost. As with single-precision
floating-point format, it lacks precision on integer numbers when compared with an integer format of the
same size. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having:
• Sign bit: 1 bit
• Exponent width: 11 bits
• Significand precision: 53 bits (52 explicitly stored)
This gives 15 to 17 significant decimal digits of precision. If a decimal string with at most 15 significant
digits is converted to IEEE 754 double precision representation and then converted back to a string with the
same number of significant digits, then the final string should match the original. If an IEEE 754 double
precision is converted to a decimal string with at least 17 significant digits and then converted back to
double, then the final number must match the original.
The format is written with the significand having an implicit integer bit of value 1 (except for special
datums, see the exponent encoding below). With the 52 bits of the fraction significand appearing in the
memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log10(2) ≈
15.955). The bits are laid out as follows:
The real value assumed by a given 64-bit double-precision datum with a given biased exponent e and a 52-bit fraction m is
value = (−1)^sign × (1 + m/2^52) × 2^(e − 1023)
Between 2^52 = 4,503,599,627,370,496 and 2^53 = 9,007,199,254,740,992 the representable numbers are
exactly the integers. For the next range, from 2^53 to 2^54, everything is multiplied by 2, so the representable
numbers are the even ones, etc. Conversely, for the previous range from 2^51 to 2^52, the spacing is 0.5, etc.
The spacing as a fraction of the numbers in the range from 2^n to 2^(n+1) is 2^(n−52). The maximum relative
rounding error when rounding a number to the nearest representable one (the machine epsilon) is therefore 2^−53.
The 11-bit width of the exponent allows the representation of numbers with a decimal exponent between
10^−308 and 10^308, with full 15–17 decimal digits of precision. By compromising precision, the subnormal
representation allows values as small as about 5 × 10^−324.
The double-precision binary floating-point exponent is encoded using an offset-binary representation, with
the zero offset being 1023; also known as exponent bias in the IEEE 754 standard. Examples of such
representations would be:
• Emin (1) = −1022
• E (50) = −973
• Emax (2046) = 1023
Thus, as defined by the offset-binary representation, in order to get the true exponent the exponent bias of
1023 has to be subtracted from the stored exponent.
The exponents 000₁₆ and 7ff₁₆ have a special meaning:
• 000₁₆ is used to represent a signed zero (if M = 0) and subnormals (if M ≠ 0); and
• 7ff₁₆ is used to represent ∞ (if M = 0) and NaNs (if M ≠ 0),
where M is the fraction mantissa. All bit patterns are valid encodings.
Except for the above exceptions, the entire double-precision number is described by:
(−1)^sign × 1.fraction × 2^(E − 1023)
In the case of subnormals (E = 0) the double-precision number is described by:
(−1)^sign × 0.fraction × 2^−1022
3ff0 0000 0000 0000₁₆ = 1
3ff0 0000 0000 0001₁₆ ≈ 1.0000000000000002, the smallest number > 1
3ff0 0000 0000 0002₁₆ ≈ 1.0000000000000004
4000 0000 0000 0000₁₆ = 2
c000 0000 0000 0000₁₆ = –2
0000 0000 0000 0001₁₆ = 2^(−1022−52) = 2^−1074
≈ 4.9406564584124654 × 10^−324 (Min subnormal positive double)
000f ffff ffff ffff₁₆ = 2^−1022 − 2^(−1022−52)
≈ 2.2250738585072009 × 10^−308 (Max subnormal double)
0010 0000 0000 0000₁₆ = 2^−1022
≈ 2.2250738585072014 × 10^−308 (Min normal positive double)
7fef ffff ffff ffff₁₆ = (1 + (1 − 2^−52)) × 2^1023
≈ 1.7976931348623157 × 10^308 (Max Double)
0000 0000 0000 0000₁₆ = 0
8000 0000 0000 0000₁₆ = –0
7ff0 0000 0000 0000₁₆ = ∞
fff0 0000 0000 0000₁₆ = −∞
3fd5 5555 5555 5555₁₆ ≈ 1/3
By default, 1/3 rounds down, instead of up like single precision, because of the odd number of bits in the significand.
In more detail:
Given the hexadecimal representation 3FD5 5555 5555 5555₁₆,
Sign = 0
Exponent = 3FD₁₆ = 1021
Exponent Bias = 1023 (constant value; see above)
Significand = 5 5555 5555 5555₁₆
Value = 2^(Exponent − Exponent Bias) × 1.Significand – note the significand must not be converted to decimal here
= 2^−2 × (15 5555 5555 5555₁₆ × 2^−52)
= 2^−54 × 15 5555 5555 5555₁₆
Execution speed with double-precision arithmetic
Using double-precision floating-point variables and mathematical functions (e.g., sin(), cos(), atan2(), log(),
exp(), sqrt()) is slower than working with their single-precision counterparts. One area of computing
where this is a particular issue is parallel code running on GPUs. For example, when using NVIDIA’s
CUDA platform, calculations with double precision can take roughly three times longer than the equivalent
single-precision calculations.
The fact that floating-point numbers cannot precisely represent all real numbers, and that floating-point
operations cannot precisely represent true arithmetic operations, leads to many surprising situations. This
is related to the finite precision with which computers generally represent numbers.
For example, the non-representability of 0.1 and 0.01 (in binary) means that the result of attempting to
square 0.1 is neither 0.01 nor the representable number closest to it. In 24-bit (single precision)
representation, 0.1 (decimal) was given previously as e = −4; s = 110011001100110011001101, which is exactly 0.100000001490116119384765625.
Squaring this number gives
Squaring it with single-precision floating-point hardware (with rounding) gives
But the representable number closest to 0.01 is
Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not
yield a result of infinity, nor will it even overflow. It is simply not possible for standard floating-point
hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation
#include <math.h>
/* Enough digits to be sure we get the correct approximation. */
double pi = 3.1415926535897932384626433832795;
double z = tan(pi/2.0);
will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be −22877332.0.
By the same token, an attempted computation of sin(π) will not yield zero. The result will be
(approximately) 0.1225×10−15 in double precision, or −0.8742×10−7 in single precision.
While floating-point addition and multiplication are both commutative (a + b = b + a and a×b = b×a), they
are not necessarily associative. That is, (a + b) + c is not necessarily equal to a + (b + c). Using 7-digit
significand decimal arithmetic:
a = 1234.567, b = 45.67834, c = 0.0004
(a + b) + c:
  1234.567     (a)
+   45.67834   (b)
____________
 1280.24534    rounds to 1280.245
 1280.245      (a + b)
+    0.0004    (c)
____________
 1280.2454     rounds to 1280.245 <— (a + b) + c

a + (b + c):
   45.67834    (b)
+    0.0004    (c)
____________
   45.67874    (b + c)
 1234.567      (a)
+   45.67874   (b + c)
____________
 1280.24574    rounds to 1280.246 <— a + (b + c)
They are also not necessarily distributive. That is, (a + b) × c may not be the same as a × c + b × c:
1234.567 × 3.333333 = 4115.223
1.234567 × 3.333333 = 4.115223
4115.223 + 4.115223 = 4119.338
1234.567 + 1.234567 = 1235.802
1235.802 × 3.333333 = 4119.340
In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight
inaccuracies, the following phenomena may occur:
• Cancellation: subtraction of nearly equal operands may cause extreme loss of accuracy. When we
subtract two almost equal numbers we set the most significant digits to zero, leaving ourselves with
just the insignificant, and most erroneous, digits. For example, when determining a derivative of a
function the following formula is used:
f′(a) ≈ (f(a + h) − f(a)) / h
Intuitively one would want an h very close to zero; however, when using floating-point operations,
the smallest number won’t give the best approximation of a derivative. As h grows smaller the
difference between f (a + h) and f(a) grows smaller, cancelling out the most significant and least
erroneous digits and making the most erroneous digits more important. As a result, the smallest
possible value of h will give a more erroneous approximation of the derivative than a somewhat
larger value. This is perhaps the most common and serious accuracy problem.
• Conversions to integer are not intuitive: converting (63.0/9.0) to integer yields 7, but converting
(0.63/0.09) may yield 6. This is because conversions generally truncate rather than round. Floor and
ceiling functions may produce answers which are off by one from the intuitively expected value.
• Limited exponent range: results might overflow yielding infinity, or underflow yielding a subnormal
number or zero. In these cases precision will be lost.
• Testing for safe division is problematic: Checking that the divisor is not zero does not guarantee that
a division will not overflow.
• Testing for equality is problematic. Two computational sequences that are mathematically equal
may well produce different floating-point values.
• On 25 February 1991, a loss of significance in a MIM-104 Patriot missile battery prevented it
intercepting an incoming Scud missile in Dhahran, Saudi Arabia, contributing to the death of 28
soldiers from the U.S. Army.
Minimizing the effect of accuracy problems
Although, as noted previously, individual arithmetic operations of IEEE 754 are guaranteed accurate to
within half a ULP, more complicated formulae can suffer from larger errors due to round-off. The loss of
accuracy can be substantial if a problem or its data are ill-conditioned, meaning that the correct result is
hypersensitive to tiny perturbations in its data. However, even functions that are well-conditioned can
suffer from large loss of accuracy if an algorithm that is numerically unstable for that data is used: apparently
equivalent formulations of expressions in a programming language can differ markedly in their numerical
stability. One approach to remove the risk of such loss of accuracy is the design and analysis of numerically
stable algorithms, which is an aim of the branch of mathematics known as numerical analysis. Another
approach that can protect against the risk of numerical instabilities is the computation of intermediate
(scratch) values in an algorithm at a higher precision than the final result requires, which can remove, or
reduce by orders of magnitude, such risk: IEEE 754 quadruple precision and extended precision are
designed for this purpose when computing at double precision.
For example, the following algorithm is a direct implementation to compute the function A(x) = (x−1)/(exp(x−1) − 1),
which is well-conditioned at 1.0; however, it can be shown to be numerically unstable and lose
up to half the significant digits carried by the arithmetic when computed near 1.0.
double A(double X)
{
   double Y, Z;                       // (1)
   Y = X - 1.0;
   Z = exp(Y);
   if (Z != 1.0) Z = Y/(Z - 1.0);     // (2)
   return Z;
}
If, however, intermediate computations are all performed in extended precision (e.g. by setting line (1) to
C99 long double), then up to full precision in the final double result can be maintained. Alternatively, a
numerical analysis of the algorithm reveals that if the following non-obvious change to line (2) is made:
if (Z != 1.0) Z = log(Z)/(Z - 1.0);
then the algorithm becomes numerically stable and can compute to full double precision.
To maintain the properties of such carefully constructed numerically stable programs, careful handling by
the compiler is required. Certain “optimizations” that compilers might make (for example, reordering
operations) can work against the goals of well-behaved software. There is some controversy about the
failings of compilers and language designs in this area: C99 is an example of a language where such
optimizations are carefully specified so as to maintain numerical precision.
Several rules of thumb can substantially decrease, by orders of magnitude, the risk of numerical
anomalies, in addition to, or in lieu of, a more careful numerical analysis. These include computing all
expressions and intermediate results in the highest precision supported in hardware (a common rule of
thumb is to carry twice the precision of the desired result i.e. compute in double precision for a final single
precision result, or in double extended or quad precision for up to double precision results); and rounding
input data and results to only the precision required and supported by the input data (carrying excess
precision in the final result beyond that required and supported by the input data can be misleading,
increases storage cost and decreases speed, and the excess bits can affect convergence of numerical
procedures: notably, the first form of the iterative example given below converges correctly when using
this rule of thumb). Brief descriptions of several additional issues and techniques follow.
As decimal fractions can often not be exactly represented in binary floating-point, such arithmetic is at its
best when it is simply being used to measure real-world quantities over a wide range of scales (such as the
orbital period of a moon around Saturn or the mass of a proton), and at its worst when it is expected to
model the interactions of quantities expressed as decimal strings that are expected to be exact. An
example of the latter case is financial calculations. For this reason, financial software tends not to use a
binary floating-point number representation. The “decimal” data type of the C# and Python programming
languages, and the decimal formats of the IEEE 754-2008 standard, are designed to avoid the problems of
binary floating-point representations when applied to human-entered exact decimal values, and make the
arithmetic always behave as expected when numbers are printed in decimal.
Expectations from mathematics may not be realized in the field of floating-point computation. Identities
that hold in exact arithmetic cannot be relied on when the quantities involved are the result of
floating-point computation.
The use of the equality test (if (x==y) …) requires care when dealing with floating-point numbers. Even
simple expressions like 0.6/0.2-3==0 will, on most computers, fail to be true (in IEEE 754 double precision,
for example, 0.6/0.2-3 is approximately equal to -4.44089209850063e-16). Consequently, such tests are
sometimes replaced with “fuzzy” comparisons (if (abs(x-y) < epsilon) …, where epsilon is sufficiently small
and tailored to the application, such as 1.0E−13). The wisdom of doing this varies greatly, and can require
numerical analysis to bound epsilon. Calculations of values derived from the primary data representation,
and their comparisons, should be performed in a wider, extended precision to minimize the risk of such
inconsistencies due to round-off errors. It is often better to organize the code in such a way that such tests
are unnecessary. For example, in computational geometry, exact tests of whether a point lies off or on a
line or plane defined by other points can be performed using adaptive precision or exact arithmetic
methods. Small errors in floating-point arithmetic can grow when mathematical algorithms perform
operations an enormous number of times. A few examples are matrix inversion, eigenvector computation,
and differential equation solving. These algorithms must be very carefully designed, using numerical
approaches such as iterative refinement, if they are to work well.
Summation of a vector of floating-point values is a basic algorithm in scientific computing, and so an
awareness of when loss of significance can occur is essential. For example, if one is adding a very large
number of numbers, the individual addends are very small compared with the sum. This can lead to loss of
significance. A typical addition would then be something like:
The low 3 digits of the addends are effectively lost. Suppose, for example, that one needs to add many
numbers, all approximately equal to 3 (say 3.00001). After 1000 of them have been added, the running sum is about
3000; the digits lost at each addition (the sum absorbs roughly 3.000, not the full 3.00001) are not regained.
Big-O notation is a relative representation of the complexity of an algorithm.
There are some important and deliberately chosen words in that sentence:
• relative: you can only compare apples to apples. You can’t compare an algorithm to do
arithmetic multiplication to an algorithm that sorts a list of integers. But a comparison of two
algorithms to do arithmetic operations (one multiplication, one addition) will tell you something
• representation: Big-O (in its simplest form) reduces the comparison between algorithms to a
single variable. That variable is chosen based on observations or assumptions. For example,
sorting algorithms are typically compared based on comparison operations (comparing two
nodes to determine their relative ordering). This assumes that comparison is expensive. But
what if comparison is cheap but swapping is expensive? It changes the comparison; and
• complexity: if it takes me one second to sort 10,000 elements how long will it take me to sort
one million? Complexity in this instance is a relative measure to something else.
The best example of Big-O I can think of is doing arithmetic. Take two numbers (123456 and 789012). The
basic arithmetic operations we learnt in school were:
• addition;
• subtraction;
• multiplication; and
• division.
Each of these is an operation or a problem. A method of solving these is called an algorithm.
Addition is the simplest. You line the numbers up (to the right) and add the digits in a column writing the
last number of that addition in the result. The ‘tens’ part of that number is carried over to the next column.
Let’s assume that the addition of these numbers is the most expensive operation in this algorithm. It stands
to reason that to add these two numbers together we have to add together 6 digits (and possibly carry a
7th). If we add two 100 digit numbers together we have to do 100 additions. If we add two 10,000 digit
numbers we have to do 10,000 additions.
See the pattern? The complexity (being the number of operations) is directly proportional to the number
of digits n in the larger number. We call this O(n) or linear complexity.
Subtraction is similar (except you may need to borrow instead of carry).
Multiplication is different. You line the numbers up, take the first digit in the bottom number and multiply
it in turn against each digit in the top number and so on through each digit. So to multiply our two 6 digit
numbers we must do 36 multiplications. We may need to do as many as 10 or 11 column adds to get the
end result too.
If we have two 100-digit numbers we need to do 10,000 multiplications and 200 adds. For two one-million-digit
numbers we need to do one trillion (10^12) multiplications and two million adds.
As the algorithm scales with n-squared, this is O(n^2) or quadratic complexity. This is a good time to
introduce another important concept:
We only care about the most significant portion of complexity.
The astute may have realized that we could express the number of operations as: n^2 + 2n. But as you saw
from our example with two numbers of a million digits apiece, the second term (2n) becomes insignificant
(accounting for 0.0002% of the total operations by that stage).
One can notice that we’ve assumed the worst-case scenario here. While multiplying 6-digit numbers, if one
of them has 4 digits and the other 6 digits, then we only have 24 multiplications. Still, we calculate the
worst-case scenario for that ‘n’, i.e. when both are 6-digit numbers. Hence Big-O notation is about the
worst-case scenario of an algorithm.
The Telephone Book
If you were instructing a computer to look up the phone number for “John Smith” in a telephone book that
contains 1,000,000 names, what would you do? Ignoring the fact that you could guess how far in the S’s
started (let’s assume you can’t), what would you do?
A typical implementation might be to open up to the middle, take the 500,000th name and compare it to “Smith”.
If it happens to be “Smith, John”, we just got real lucky. Far more likely is that “John Smith” will be before
or after that name. If it’s after we then divide the last half of the phone book in half and repeat. If it’s
before then we divide the first half of the phone book in half and repeat. And so on.
This is called a binary search and is used every day in programming whether you realize it or not.
So if you want to find a name in a phone book of a million names you can actually find any name by doing
this at most 20 times. In comparing search algorithms we decide that this comparison is our ‘n’.
• For a phone book of 3 names it takes 2 comparisons (at most).
• For 7 it takes at most 3.
• For 15 it takes 4.
• For 1,000,000 it takes 20.
That is staggeringly good isn’t it?
In Big-O terms this is O(log n) or logarithmic complexity. Now the logarithm in question could be ln (base
e), log10, log2 or some other base. It doesn’t matter; it’s still O(log n), just like O(2n^2) and O(100n^2) are still
O(n^2).
It’s worthwhile at this point to explain that Big O can be used to determine three cases with an algorithm:
• Best Case: In the telephone book search, the best case is that we find the name in one
comparison. This is O(1) or constant complexity;
• Expected Case: As discussed above this is O(log n); and
• Worst Case: This is also O(log n).
Normally we don’t care about the best case. We’re interested in the expected and worst case. Sometimes
one or the other of these will be more important.
Back to the telephone book.
What if you have a phone number and want to find a name? The police have a reverse phone book but
such look-ups are denied to the general public. Or are they? Technically you can reverse look-up a number
in an ordinary phone book. How?
You start at the first name and compare the number. If it’s a match, great, if not, you move on to the next.
You have to do it this way because the phone book is unordered (by phone number anyway).
So to find a name given the phone number (reverse lookup):
• Best Case: O(1);
• Expected Case: O(n) (for 500,000); and
• Worst Case: O(n) (for 1,000,000).
The Travelling Salesman
This is a famous problem with many analogues in science and engineering (optimisation) and deserves a
mention. In this problem you have N towns. Each of those towns is linked to 1 or more other towns by a
road of a certain distance. The Travelling Salesman problem is to find the shortest tour that visits every town.
Sounds simple? Think again.
If you have 3 towns A, B and C with roads between all pairs then you could go:
• A → B → C
• A → C → B
• B → C → A
• B → A → C
• C → A → B
• C → B → A
Well actually there’s less than that because some of these are equivalent (A → B → C and C → B → A are
equivalent, for example, because they use the same roads, just in reverse).
In actuality there are 3 possibilities.
• Take this to 4 towns and you have (iirc) 12 possibilities.
• With 5 it’s 60.
• 6 becomes 360.
This is a function of a mathematical operation called a factorial. Basically:
• 5! = 5 × 4 × 3 × 2 × 1 = 120
• 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720
• 7! = 7 × 6 × 5 × 4 × 3 × 2 × 1 = 5040
• 25! = 25 × 24 × … × 2 × 1 = 15,511,210,043,330,985,984,000,000
• 50! = 50 × 49 × … × 2 × 1 = 3.04140932 × 10^64
So the Big-O of the Travelling Salesman problem is O(n!) or factorial or combinatorial complexity.
By the time you get to 150 towns, it would take the fastest computer known today about 5 × 10^275 years
to calculate the optimal route.
Any algorithm that has a complexity of O(n^a) is said to have polynomial complexity or is solvable in
polynomial time. O(n), O(n^2) etc. are all polynomial time. Some problems cannot be solved in polynomial
time. Certain things are used in the world because of this. Public Key Cryptography is a prime example. It is
computationally hard to find two prime factors of a very large number. If it wasn’t, we couldn’t use the
public key systems we use.
Additional Background Material
Iterative solution of a set of linear equations
Given a square system of n linear equations Ax = b:
Then A can be decomposed into a diagonal component D, and the remainder R: A = D + R.
The solution is then obtained iteratively via
x^(k+1) = D^(−1)(b − R x^(k))
The element-based formula is thus:
x_i^(k+1) = (1/a_ii) (b_i − Σ_{j≠i} a_ij x_j^(k)),   i = 1, 2, …, n
The computation of x_i^(k+1) requires each element in x^(k) except itself. Unlike the Gauss–Seidel method, we
can’t overwrite x_i^(k) with x_i^(k+1), as that value will be needed by the rest of the computation. The minimum
amount of storage is two vectors of size n.
The method is guaranteed to converge if the matrix A is strictly or irreducibly diagonally dominant. Strict
row diagonal dominance means that for each row, the absolute value of the diagonal term is greater than
the sum of absolute values of the other terms:
|a_ii| > Σ_{j≠i} |a_ij|
The Jacobi method sometimes converges even if these conditions are not satisfied.
A linear system of the form Ax = b with initial estimate x^(0) is given. We use the equation
x^(k+1) = D^(−1)(b − R x^(k)), described above, to estimate x. First, we rewrite the
equation in a more convenient form x^(k+1) = T x^(k) + C, where T = −D^(−1)(L + U) and
C = D^(−1) b. Note that R = L + U, where L and U are the strictly lower and strictly upper triangular parts of A. From
D and R we determine T, and C is found likewise.
With T and C calculated, we estimate x as x^(1) = T x^(0) + C.
The next iteration yields x^(2) = T x^(1) + C.
This process is repeated until convergence (i.e., until the residual is small). The solution after 25
iterations closely approximates the exact solution.
Suppose we are given the following linear system:
Suppose we choose (0, 0, 0, 0) as the initial approximation, then the first approximate solution is given by
Using the approximations obtained, the iterative procedure is repeated until the desired accuracy has been
reached. After five iterations, the approximate solution is already close to the exact solution of the system, (1, 2, −1, 1).
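The element-based Jacobi formula translates directly into code. Below is a minimal Python sketch (the function name, tolerance, and iteration cap are illustrative choices), applied to a strictly diagonally dominant 4×4 system whose exact solution is (1, 2, −1, 1):

```python
def jacobi(A, b, x0, tol=1e-10, max_iter=200):
    """Jacobi iteration: x_i^(k+1) = (b_i - sum_{j!=i} a_ij x_j^(k)) / a_ii."""
    n = len(b)
    x = list(x0)
    for _ in range(max_iter):
        # The new vector is built entirely from the previous iterate,
        # so two vectors of size n are required (unlike Gauss-Seidel).
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x

A = [[10.0, -1.0, 2.0, 0.0],
     [-1.0, 11.0, -1.0, 3.0],
     [2.0, -1.0, 10.0, -1.0],
     [0.0, 3.0, -1.0, 8.0]]
b = [6.0, 25.0, -11.0, 15.0]
x = jacobi(A, b, [0.0, 0.0, 0.0, 0.0])
```

Because A is strictly diagonally dominant, the iteration converges, and x ends up within rounding error of (1, 2, −1, 1).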
Gauss–Seidel Method
The Gauss–Seidel method is an iterative technique for solving a square system of n linear equations Ax = b with unknown x. It is defined by the iteration
L* x^(k+1) = b − U x^(k),
where the matrix A is decomposed into a lower triangular component L* and a strictly upper triangular component U: A = L* + U.
In more detail, write out A, x, and b in their components: A = (a_ij), x = (x_1, …, x_n)^T, and b = (b_1, …, b_n)^T. Then the decomposition of A into its lower triangular component L* and its strictly upper triangular component U is given by:
(L*)_ij = a_ij for j ≤ i and 0 otherwise;   (U)_ij = a_ij for j > i and 0 otherwise.
The system of linear equations may be rewritten as:
L* x = b − U x.
The Gauss–Seidel method now solves the left-hand side of this expression for x, using the previous value of x on the right-hand side. By taking advantage of the triangular form of L*, the elements of x^(k+1) can be computed sequentially using forward substitution:
x_i^(k+1) = (1 / a_ii) (b_i − Σ_{j<i} a_ij x_j^(k+1) − Σ_{j>i} a_ij x_j^(k)),   i = 1, 2, …, n.
Consider row i: when working on iteration k + 1, the elements x_1^(k+1), …, x_{i−1}^(k+1) have already been calculated, while the elements after i have not yet been advanced.
The procedure is generally continued until the changes made by an iteration are below some tolerance,
such as a sufficiently small residual.
The element-wise formula for the Gauss–Seidel method is extremely similar to that of the Jacobi method. The computation of x_i^(k+1) uses only the elements of x^(k+1) that have already been computed, and only the elements of x^(k) that have not yet been advanced to iteration k + 1. This means that, unlike the Jacobi method, only one storage vector is required, as elements can be overwritten as they are computed, which can be advantageous for very large problems.
However, unlike the Jacobi method, the computations for each element cannot be done in parallel.
Furthermore, the values at each iteration are dependent on the order of the original equations.
The convergence properties of the Gauss–Seidel method are dependent on the matrix A. Namely, the
procedure is known to converge if either:
• A is symmetric positive-definite, or
• A is strictly or irreducibly diagonally dominant.
The Gauss–Seidel method sometimes converges even if these conditions are not satisfied.
Since elements can be overwritten as they are computed in this algorithm, only one storage vector is required. Suppose we are given k equations in unknowns x_1, …, x_k and a starting point x^(0). Solve the first equation for x_1 in terms of the remaining unknowns; for each subsequent equation, substitute the most recently computed values of the x's.
To make it clear, let's consider an example: the same linear system solved above with the Jacobi method. Solving the equations for x_1, x_2, x_3, and x_4 respectively gives:
x_1 = (6 + x_2 − 2x_3) / 10
x_2 = (25 + x_1 + x_3 − 3x_4) / 11
x_3 = (−11 − 2x_1 + x_2 + x_4) / 10
x_4 = (15 − 3x_2 + x_3) / 8
Suppose we choose (0, 0, 0, 0) as the initial approximation; then the first approximate solution is given by
x_1 = 0.6
x_2 = (25 + 0.6) / 11 ≈ 2.3273
x_3 = (−11 − 1.2 + 2.3273) / 10 ≈ −0.9873
x_4 = (15 − 3(2.3273) − 0.9873) / 8 ≈ 0.8789
Using the approximations obtained, the iterative procedure is repeated until the desired accuracy has been reached. After four iterations, the approximate solution is already close to the exact solution of the system, (1, 2, −1, 1).
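The in-place update is the only change relative to the Jacobi sketch. Below is a minimal Python sketch of Gauss–Seidel (function name, tolerance, and iteration cap are illustrative), run on a strictly diagonally dominant 4×4 system with exact solution (1, 2, −1, 1):

```python
def gauss_seidel(A, b, x0, tol=1e-10, max_iter=200):
    """Gauss-Seidel: new components are used as soon as they are computed."""
    n = len(b)
    x = list(x0)  # a single storage vector suffices
    for _ in range(max_iter):
        change = 0.0
        for i in range(n):
            # x[j] for j < i already holds iteration k+1 values;
            # x[j] for j > i still holds iteration k values.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new = (b[i] - s) / A[i][i]
            change = max(change, abs(new - x[i]))
            x[i] = new
        if change < tol:
            break
    return x

A = [[10.0, -1.0, 2.0, 0.0],
     [-1.0, 11.0, -1.0, 3.0],
     [2.0, -1.0, 10.0, -1.0],
     [0.0, 3.0, -1.0, 8.0]]
b = [6.0, 25.0, -11.0, 15.0]
x = gauss_seidel(A, b, [0.0, 0.0, 0.0, 0.0])
```

Note that the inner loop cannot be parallelized: each component depends on the ones updated just before it, which is exactly the trade-off discussed above.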
Gaussian Elimination
Gaussian elimination is an algorithm for solving systems of linear equations. It is usually understood as a
sequence of operations performed on the associated matrix of coefficients. This method can also be used
to find the rank of a matrix, to calculate the determinant of a matrix, and to calculate the inverse of an
invertible square matrix.
To perform row reduction on a matrix, one uses a sequence of elementary row operations to modify the
matrix until the lower left-hand corner of the matrix is filled with zeros, as much as possible. There are
three types of elementary row operations: 1) Swapping two rows, 2) Multiplying a row by a non-zero
number, 3) Adding a multiple of one row to another row. Using these operations, a matrix can always be
transformed into an upper triangular matrix, and in fact one that is in row echelon form. Once all of the
leading coefficients (the left-most non-zero entry in each row) are 1, and in every column containing a
leading coefficient has zeros elsewhere, the matrix is said to be in reduced row echelon form. This final
form is unique; in other words, it is independent of the sequence of row operations used. For example, in
the following sequence of row operations (where multiple elementary operations might be done at each
step), the third and fourth matrices are the ones in row echelon form, and the final matrix is the unique
reduced row echelon form.
Using row operations to convert a matrix into reduced row echelon form is sometimes called Gauss–Jordan elimination.
Definitions and example of algorithm
The process of row reduction makes use of elementary row operations, and can be divided into two parts.
The first part (sometimes called Forward Elimination) reduces a given system to row echelon form, from
which one can tell whether there are no solutions, a unique solution, or infinitely many solutions. The
second part (sometimes called back substitution) continues to use row operations until the solution is
found; in other words, it puts the matrix into reduced row echelon form.
There are three types of elementary row operations which may be performed on the rows of a matrix:
Type 1: Swap the positions of two rows.
Type 2: Multiply a row by a nonzero scalar.
Type 3: Add to one row a scalar multiple of another.
If the matrix is associated to a system of linear equations, then these operations do not change the solution
set. Therefore, if one’s goal is to solve a system of linear equations, then using these row operations could
make the problem easier.
For each row in a matrix, if the row does not consist of only zeros, then the left-most non-zero entry is
called the leading coefficient (or pivot) of that row. So if two leading coefficients are in the same column,
then a row operation of type 3 (see above) could be used to make one of those coefficients zero. Then by
using the row swapping operation, one can always order the rows so that for every non-zero row, the
leading coefficient is to the right of the leading coefficient of the row above. If this is the case, then matrix
is said to be in row echelon form. So the lower left part of the matrix contains only zeros, and all of the
zero rows are below the non-zero rows. The word “echelon” is used here because one can roughly think of
the rows being ranked by their size, with the largest being at the top and the smallest being at the bottom.
For example, the following matrix is in row echelon form:
[ 0  2  1  −1 ]
[ 0  0  3   1 ]
[ 0  0  0   0 ]
It is in echelon form because the zero row is at the bottom, and the leading coefficient of the second row (the 3 in the third column) is to the right of the leading coefficient of the first row (the 2 in the second column).
A matrix is said to be in reduced row echelon form if furthermore all of the leading coefficients are equal
to 1 (which can be achieved by using the elementary row operation of type 2), and in every column
containing a leading coefficient, all of the other entries in that column are zero (which can be achieved by
using elementary row operations of type 3).
Example of the algorithm
Suppose the goal is to find and describe the set of solutions to the following system of linear equations:
2x + y − z = 8
−3x − y + 2z = −11
−2x + y + 2z = −3
The table below is the row reduction process applied simultaneously to the system of equations, and its
associated augmented matrix. In practice, one does not usually deal with the systems in terms of equations
but instead makes use of the augmented matrix, which is more suitable for computer manipulations. The
row reduction procedure may be summarized as follows: eliminate x from all equations below the first, and then eliminate y from all equations below the second. This will put the system into triangular form. Then, using back-substitution, each unknown can be solved for.
Working on the augmented matrix, the row reduction proceeds as follows:
[  2   1  −1 |   8 ]
[ −3  −1   2 | −11 ]
[ −2   1   2 |  −3 ]
Add 3/2 × (row 1) to row 2, and add row 1 to row 3:
[  2   1    −1  |  8 ]
[  0  1/2  1/2  |  1 ]
[  0   2    1   |  5 ]
Add −4 × (row 2) to row 3:
[  2   1    −1  |  8 ]
[  0  1/2  1/2  |  1 ]
[  0   0   −1   |  1 ]
The matrix is now in echelon form (also called triangular form). For the first step, x was eliminated from the second equation by adding 3/2 times the first equation to it; next, x was eliminated from the third equation by adding the first equation to it. Continuing with row operations until the matrix is in reduced row echelon form yields:
[ 1  0  0 |  2 ]
[ 0  1  0 |  3 ]
[ 0  0  1 | −1 ]
Once y is also eliminated from the third row, the result is a system of linear equations in triangular form,
and so the first part of the algorithm is complete. From a computational point of view, it is faster to solve for the variables in reverse order, a process known as back-substitution. One sees that the solution is z = −1, y = 3,
and x = 2. So there is a unique solution to the original system of equations.
Instead of stopping once the matrix is in echelon form, one could continue until the matrix is in reduced row echelon form. The process of row reducing until the matrix is reduced is sometimes referred to as Gauss–Jordan elimination, to distinguish it from stopping after reaching echelon form.
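The two-part procedure (forward elimination to triangular form, then back-substitution) can be sketched in a few lines of Python. The coefficients below correspond to a system with solution x = 2, y = 3, z = −1, matching the worked example; the function name and layout are illustrative:

```python
def solve(A, b):
    """Solve Ax = b by forward elimination and back substitution."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for k in range(n):                     # forward elimination
        for i in range(k + 1, n):
            factor = M[i][k] / M[k][k]     # assumes nonzero pivots (no pivoting)
            for j in range(k, n + 1):
                M[i][j] -= factor * M[k][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):         # back substitution, reverse order
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

x = solve([[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]],
          [8.0, -11.0, -3.0])
```

This sketch omits pivoting for clarity; the partial-pivoting variant discussed later is what production code should use.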
The historically first application of the row reduction method is for solving systems of linear equations.
Here are some other important applications of the algorithm.
To explain how Gaussian elimination allows one to compute the determinant of a square matrix, we have to
recall how the elementary row operations change the determinant:
• Swapping two rows multiplies the determinant by -1
• Multiplying a row by a nonzero scalar multiplies the determinant by the same scalar
• Adding to one row a scalar multiple of another does not change the determinant.
If the Gaussian elimination applied to a square matrix A produces a row echelon matrix B, let d be the
product of the scalars by which the determinant has been multiplied, using above rules. Then the
determinant of A is the quotient by d of the product of the elements of the diagonal of B.
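This rule can be sketched in code. Partial pivoting is added here as an assumption beyond the plain procedure above (so that the division is safe); with only row swaps and type-3 operations used, the accumulated factor d is just ±1 from the swaps:

```python
def det(A):
    """Determinant via row reduction: det(A) = d * (product of B's diagonal),
    where d collects the -1 factors contributed by row swaps."""
    n = len(A)
    M = [row[:] for row in A]
    d = 1.0
    for k in range(n):
        # Partial pivot: bring the largest entry in column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        if M[p][k] == 0.0:
            return 0.0                 # singular matrix
        if p != k:
            M[k], M[p] = M[p], M[k]
            d = -d                     # a swap multiplies the determinant by -1
        for i in range(k + 1, n):      # type-3 ops: determinant unchanged
            f = M[i][k] / M[k][k]
            for j in range(k, n):
                M[i][j] -= f * M[k][j]
    result = d
    for k in range(n):
        result *= M[k][k]
    return result

d_val = det([[2.0, 1.0, -1.0], [-3.0, -1.0, 2.0], [-2.0, 1.0, 2.0]])
```

For the sample matrix the result is −1, which can be checked by cofactor expansion.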
It should be emphasized that, for an n×n matrix, this method needs only O(n^3) arithmetic operations, while the elementary methods usually taught in introductory courses (such as cofactor expansion) need O(2^n) or O(n!) operations. Even on the fastest computers, the elementary methods are impracticable for n ≥ 21.
Finding the inverse of a matrix
A variant of Gaussian elimination called Gauss–Jordan elimination can be used to find the inverse of a matrix, if it exists. If A is an n by n square matrix, then one can use row reduction to compute its inverse. First, the n by n identity matrix is augmented to the right of A, forming an n by 2n block
matrix [A | I]. Now through application of elementary row operations, find the reduced echelon form of
this n by 2n matrix. The matrix A is invertible if and only if the left block can be reduced to the identity matrix I; in this case the right block of the final matrix is A^(-1). If the algorithm is unable to reduce the left block to I, then A is not invertible.
For example, consider the following matrix:
A = [  2  −1   0 ]
    [ −1   2  −1 ]
    [  0  −1   2 ]
To find the inverse of this matrix, one takes the matrix augmented by the identity and row-reduces it as a 3 by 6 matrix:
[A | I] = [  2  −1   0 | 1 0 0 ]
          [ −1   2  −1 | 0 1 0 ]
          [  0  −1   2 | 0 0 1 ]
By performing row operations, one can check that the reduced row echelon form of this augmented matrix is
[ 1  0  0 | 3/4  1/2  1/4 ]
[ 0  1  0 | 1/2   1   1/2 ]
[ 0  0  1 | 1/4  1/2  3/4 ]
The matrix on the left is the identity, which shows A is invertible. The 3 by 3 matrix on the right, B, is the
inverse of A. This procedure for finding the inverse works for square matrices of any size.
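The [A | I] procedure can be sketched directly in Python. The sample matrix here is an assumed symmetric tridiagonal example (its inverse is easy to verify by multiplication); the pivot threshold and partial pivoting are pragmatic additions:

```python
def invert(A):
    """Gauss-Jordan inversion: row-reduce [A | I]; if the left block
    becomes the identity, the right block is A's inverse."""
    n = len(A)
    # Augment A with the n by n identity on the right.
    M = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(A)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))  # partial pivot
        if abs(M[p][k]) < 1e-12:
            raise ValueError("matrix is not invertible")
        M[k], M[p] = M[p], M[k]
        pivot = M[k][k]
        M[k] = [v / pivot for v in M[k]]   # scale the leading coefficient to 1
        for i in range(n):                 # zero out the rest of column k
            if i != k:
                f = M[i][k]
                M[i] = [v - f * w for v, w in zip(M[i], M[k])]
    return [row[n:] for row in M]          # right block = A^(-1)

B = invert([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
```

Multiplying A by the returned B gives the identity to within rounding error.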
The number of arithmetic operations required to perform row reduction is one way of measuring the
algorithm’s computational efficiency. For example, to solve a system of n equations for n unknowns by
performing row operations on the matrix until it is in echelon form, and then solving for each unknown in
reverse order, requires n(n−1)/2 divisions, (2n^3 + 3n^2 − 5n)/6 multiplications, and (2n^3 + 3n^2 − 5n)/6 subtractions, for a total of approximately 2n^3/3 operations. Thus it has an arithmetic complexity of O(n^3) (see Big O notation). This arithmetic complexity is a good measure of the time needed for the whole
computation when the time for each arithmetic operation is approximately constant. This is the case when
the coefficients are represented by floating point numbers or when they belong to a finite field. If the
coefficients are integers or rational numbers exactly represented, the intermediate entries can grow
exponentially large, so the bit complexity is exponential. However, there is a variant of Gaussian elimination, called the Bareiss algorithm, that avoids this exponential growth of the intermediate entries and, with the same arithmetic complexity of O(n^3), has a bit complexity of O(n^5).
This algorithm can be used on a computer for systems with thousands of equations and unknowns.
However, the cost becomes prohibitive for systems with millions of equations. These large systems are
generally solved using iterative methods. Specific methods exist for systems whose coefficients follow a regular pattern (see the sparse and tridiagonal cases below). To put an n by n matrix into reduced echelon form by row operations, one needs about n^3 arithmetic operations, which is approximately 50% more computation steps than for echelon form.
One possible problem is numerical instability, caused by the possibility of dividing by very small numbers.
If, for example, the leading coefficient of one of the rows is very close to zero, then to row reduce the
matrix one would need to divide by that number so the leading coefficient is 1. This means any error that
existed for the number which was close to zero would be amplified. Gaussian elimination is numerically
stable for diagonally dominant or positive-definite matrices. For general matrices, Gaussian elimination is
usually considered to be stable, when using partial pivoting, even though there are examples of stable
matrices for which it is unstable.
As explained above, Gaussian elimination writes a given m × n matrix A uniquely as a product of an
invertible m × m matrix S and a row-echelon matrix T. Here, S is the product of the matrices corresponding
to the row operations performed.
The formal algorithm to compute the row echelon form B from A follows. We write A[i, j] for the entry in row i, column j of matrix A, with 1 being the first index. The transformation is performed in place, meaning that the original matrix A is lost, being successively replaced by B.
for k = 1 … m:
    Find pivot for column k:
    i_max := argmax(i = k … m, abs(A[i, k]))
    if A[i_max, k] = 0:
        error "Matrix is singular!"
    swap rows(k, i_max)
    Do for all rows below pivot:
    for i = k + 1 … m:
        Do for all remaining elements in current row:
        for j = k + 1 … n:
            A[i, j] := A[i, j] − A[k, j] * (A[i, k] / A[k, k])
        Fill lower triangular matrix with zeros:
        A[i, k] := 0
This algorithm differs slightly from the one discussed earlier, because before eliminating a variable, it first
exchanges rows to move the entry with the largest absolute value to the pivot position. Such partial
pivoting improves the numerical stability of the algorithm; some other variants are used.
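The pseudocode above can be rendered as runnable Python (0-based indexing instead of 1-based; the in-place behavior is preserved). The 2×2 test matrix is a deliberately chosen example whose leading entry is zero, so the partial-pivot swap is exercised:

```python
def row_echelon(A):
    """Forward elimination with partial pivoting, performed in place."""
    m, n = len(A), len(A[0])
    for k in range(min(m, n)):
        # Find pivot for column k: the row with the largest absolute value.
        i_max = max(range(k, m), key=lambda i: abs(A[i][k]))
        if A[i_max][k] == 0.0:
            raise ValueError("Matrix is singular!")
        A[k], A[i_max] = A[i_max], A[k]    # swap rows(k, i_max)
        for i in range(k + 1, m):          # all rows below the pivot
            f = A[i][k] / A[k][k]
            for j in range(k + 1, n):      # remaining elements in the row
                A[i][j] -= A[k][j] * f
            A[i][k] = 0.0                  # fill lower triangle with zeros
    return A

# A leading zero forces a row swap before elimination can proceed.
R = row_echelon([[0.0, 1.0], [2.0, 3.0]])
```

Without the swap, the very first division A[i][k] / A[k][k] would be a division by zero; this is the numerical-stability point made above.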
Upon completion of this procedure the augmented matrix will be in row-echelon form and may be solved by back-substitution.
With modern computers, Gaussian elimination is not always the fastest algorithm to compute the row
echelon form of a matrix.
In numerical analysis, a sparse matrix is a matrix in which most of the elements are zero. By contrast, if
most of the elements are nonzero, then the matrix is considered dense. The fraction of zero elements
(non-zero elements) in a matrix is called the sparsity (density).
Conceptually, sparsity corresponds to systems which are loosely coupled. Consider a line of balls connected
by springs from one to the next: this is a sparse system as only adjacent balls are coupled. By contrast, if
the same line of balls had springs connecting each ball to all other balls, the system would correspond to
a dense matrix. The concept of sparsity is useful in combinatorics and application areas such as network
theory, which have a low density of significant data or connections.
Large sparse matrices often appear in scientific or engineering applications when solving partial differential equations.
When storing and manipulating sparse matrices on a computer, it is beneficial and often necessary to use
specialized algorithms and data structures that take advantage of the sparse structure of the matrix.
Operations using standard dense-matrix structures and algorithms are slow and inefficient when applied to
large sparse matrices as processing and memory are wasted on the zeroes. Sparse data is by nature more
easily compressed and thus requires significantly less storage. Some very large sparse matrices are
infeasible to manipulate using standard dense-matrix algorithms.
Tridiagonal matrices (Special Cases)
A tridiagonal matrix is a matrix that has nonzero elements only on the main diagonal, the first diagonal
below this, and the first diagonal above the main diagonal.
For example, the following matrix is tridiagonal:
[ 1  4  0  0 ]
[ 3  4  1  0 ]
[ 0  2  3  4 ]
[ 0  0  1  3 ]
(Example of a sparse matrix, from the previous section: of its 35 elements, only 9 are nonzero and 26 are zero.)
The determinant of a tridiagonal matrix is given by a continuant of its elements.
Many linear algebra algorithms require significantly less computational effort when applied to diagonal
matrices, and this improvement often carries over to tridiagonal matrices as well.
The determinant of a tridiagonal matrix A of order n can be computed from a three-term recurrence relation. Write f_1 = |a_1| = a_1, and let f_i denote the determinant of the leading principal i × i submatrix of A, whose diagonal, superdiagonal, and subdiagonal entries are a_i, b_i, and c_i respectively.
The sequence (f_i) is called the continuant and satisfies the recurrence relation
f_i = a_i f_{i−1} − c_{i−1} b_{i−1} f_{i−2}
with initial values f0 = 1 and f-1 = 0. The cost of computing the determinant of a tridiagonal matrix using this
formula is linear in n, while the cost is cubic for a general matrix.
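The recurrence fits in a short loop. Below is a Python sketch using the notation above (diagonal a of length n, superdiagonal b and subdiagonal c of length n − 1); the 3×3 test matrix is an arbitrary example whose determinant, −26, is easy to check by cofactor expansion:

```python
def tridiag_det(a, b, c):
    """Determinant of a tridiagonal matrix via the continuant recurrence:
    f_i = a_i * f_{i-1} - c_{i-1} * b_{i-1} * f_{i-2},
    with initial values f_0 = 1 and f_{-1} = 0.  O(n) work, not O(n^3)."""
    f_prev2, f_prev1 = 0.0, 1.0   # f_{-1} and f_0
    for i in range(len(a)):
        cross = c[i - 1] * b[i - 1] * f_prev2 if i > 0 else 0.0
        f_prev2, f_prev1 = f_prev1, a[i] * f_prev1 - cross
    return f_prev1

# Determinant of [[1, 4, 0], [3, 4, 1], [0, 2, 3]]:
d3 = tridiag_det([1.0, 4.0, 3.0], [4.0, 1.0], [3.0, 2.0])
```

Only the previous two continuant values are kept, so the storage cost is constant.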
The inverse of a non-singular tridiagonal matrix T, with diagonal entries a_i, superdiagonal entries b_i, and subdiagonal entries c_i, is given by
(T^(-1))_ij = (−1)^(i+j) b_i ⋯ b_{j−1} θ_{i−1} ϕ_{j+1} / θ_n   for i ≤ j,
(T^(-1))_ij = (−1)^(i+j) c_j ⋯ c_{i−1} θ_{j−1} ϕ_{i+1} / θ_n   for i > j,
where the θ_i satisfy the recurrence relation
θ_i = a_i θ_{i−1} − b_{i−1} c_{i−1} θ_{i−2},   i = 2, 3, …, n,
with initial conditions θ_0 = 1, θ_1 = a_1, and the ϕ_i satisfy
ϕ_i = a_i ϕ_{i+1} − b_i c_i ϕ_{i+2},   i = n − 1, …, 1,
with initial conditions ϕ_{n+1} = 1 and ϕ_n = a_n.
Solution of linear system
A system of equations Ax = b can be solved by an efficient form of Gaussian elimination when A is tridiagonal, called the tridiagonal matrix algorithm, requiring only O(n) operations.
Tridiagonal matrix algorithm
The tridiagonal matrix algorithm (also known as the Thomas algorithm) is a simplified form of Gaussian elimination that can be used to solve tridiagonal systems of equations. A tridiagonal system for n unknowns may be written as
a_i x_{i−1} + b_i x_i + c_i x_{i+1} = d_i,
where a_1 = 0 and c_n = 0.
For such systems, the solution can be obtained in O(n) operations instead of the O(n^3) required by general Gaussian elimination. A first sweep eliminates the a_i's, and then an (abbreviated) backward substitution produces
the solution. Examples of such matrices commonly arise from the discretization of 1D Poisson equation
(e.g., the 1D diffusion problem) and natural cubic spline interpolation; similar systems of matrices arise in
tight binding physics or nearest neighbor effects models.
The forward sweep consists of modifying the coefficients as follows, denoting the new modified coefficients with primes:
c'_1 = c_1 / b_1,   c'_i = c_i / (b_i − a_i c'_{i−1})   for i = 2, …, n − 1,
d'_1 = d_1 / b_1,   d'_i = (d_i − a_i d'_{i−1}) / (b_i − a_i c'_{i−1})   for i = 2, …, n.
The solution is then obtained by back substitution:
x_n = d'_n,   x_i = d'_i − c'_i x_{i+1}   for i = n − 1, …, 1.
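The complete algorithm (forward sweep plus back substitution) fits in a few lines of Python. The test system below is an assumed example, a discretized 1D Poisson problem with diagonal 2, off-diagonals −1, and right-hand side (1, 0, 1), whose solution is (1, 1, 1):

```python
def thomas(a, b, c, d):
    """Tridiagonal solver: a = subdiagonal (a[0] unused), b = diagonal,
    c = superdiagonal (c[-1] unused), d = right-hand side.  O(n) work."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]                       # forward sweep
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = [0.0] * n
    x[n - 1] = dp[n - 1]                      # abbreviated back substitution
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

x = thomas([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0],
           [1.0, 0.0, 1.0])
```

Only three coefficient arrays and two work arrays are touched, which is why the cost is linear in n.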
The derivation of the tridiagonal matrix algorithm involves manually performing some specialized Gaussian
elimination in a generic manner.
Suppose that the unknowns are x_1, …, x_n, and that the equations to be solved are:
b_1 x_1 + c_1 x_2 = d_1,
a_i x_{i−1} + b_i x_i + c_i x_{i+1} = d_i,   i = 2, …, n − 1,
a_n x_{n−1} + b_n x_n = d_n.
Consider modifying the second (i = 2) equation with the first equation as follows: multiply the second equation by b_1 and subtract a_2 times the first equation, which would give:
(b_2 b_1 − c_1 a_2) x_2 + c_2 b_1 x_3 = d_2 b_1 − d_1 a_2,
and the effect is that x_1 has been eliminated from the second equation. Using a similar tactic with the modified second equation on the third equation yields:
(b_3 (b_2 b_1 − c_1 a_2) − c_2 b_1 a_3) x_3 + c_3 (b_2 b_1 − c_1 a_2) x_4 = d_3 (b_2 b_1 − c_1 a_2) − (d_2 b_1 − d_1 a_2) a_3.
This time x_2 was eliminated. If this procedure is repeated until the nth row, the (modified) nth equation will involve only one unknown, x_n. This may be solved for and then used to solve the (n − 1)th equation, and so on until all of the unknowns are solved for.
Clearly, the coefficients of the modified equations get more and more complicated if stated explicitly. By examining the procedure, the modified coefficients (notated with tildes) may instead be defined recursively:
ã_i = 0,
b̃_1 = b_1,   b̃_i = b_i b̃_{i−1} − a_i c̃_{i−1},
c̃_1 = c_1,   c̃_i = c_i b̃_{i−1},
d̃_1 = d_1,   d̃_i = d_i b̃_{i−1} − a_i d̃_{i−1}.
To further hasten the solution process, b̃_i may be divided out (if there is no division-by-zero risk); the newer modified coefficients, notated with a prime, will be:
c'_1 = c_1 / b_1,   c'_i = c_i / (b_i − a_i c'_{i−1}),
d'_1 = d_1 / b_1,   d'_i = (d_i − a_i d'_{i−1}) / (b_i − a_i c'_{i−1}).
This gives the following system with the same unknowns and coefficients defined in terms of the original ones:
x_i + c'_i x_{i+1} = d'_i,   i = 1, …, n − 1,
x_n = d'_n.
The last equation involves only one unknown. Solving it in turn reduces the next-to-last equation to one unknown, so that this backward substitution can be used to find all of the unknowns:
x_n = d'_n,   x_i = d'_i − c'_i x_{i+1}   for i = n − 1, …, 1.