这个作业是C语言牛顿法求解非线性方程

Solving Nonlinear Equations: Newton’s method

Newton’s method (also known as the Newton–Raphson method), is a method for finding successively

better approximations to the roots (or zeroes) of a real-valued function.

The Newton–Raphson method in one variable is implemented as follows:

Given a function ƒ defined over the reals x, and its derivative ƒ’, we begin with a first guess x0 for a root of

the function f. Provided the function satisfies all the assumptions made in the derivation of the formula, a

better approximation x1 is

Geometrically, (x1, 0) is the intersection with the x-axis of the tangent to the graph of f at (x0, f (x0)).

The process is repeated in the recurrence relation as:

until a sufficiently accurate value is reached.

This algorithm is first in the class of Householder’s methods, succeeded by Halley’s method. The method

can also be extended to complex functions and to systems of equations.

Algorithm

The algorithm of the method is as follows: one starts with an initial guess which is reasonably close to the

true root, then the function is approximated by its tangent line (which can be computed using the tools of

calculus), and one computes the x-intercept of this tangent line (which is easily done with elementary

algebra). This x-intercept will typically be a better approximation to the function’s root than the original

guess, and the method can be iterated.

Suppose ƒ : [a, b] → R is a differentiable function defined on the interval [a, b] with values in the real

numbers R. The formula for converging on the root can be easily derived. Suppose we have some current

approximation xn. Then we can derive the formula for a better approximation, xn+1 by referring to the

diagram on the right. The equation of the tangent line to the curve y = ƒ(x) at the point x=xn is

where, ƒ’ denotes the derivative of the function ƒ.

The x-intercept of this line (the value of x such that y=0) is then used as the next approximation to the root,

xn+1. In other words, setting y to zero and x to xn+1 gives

Solving for xn+1 gives

We start the process off with some arbitrary initial value x0. (The closer to the zero, the better. But, in the

absence of any intuition about where the zero might lie, a “guess and check” method might narrow the

possibilities to a reasonably small interval by appealing to the intermediate value theorem.) The method

will usually converge, provided this initial guess is close enough to the unknown zero, and that ƒ'(x0) ≠ 0.

Furthermore, for a zero of multiplicity 1, the convergence is at least quadratic (see rate of convergence) in

a neighbourhood of the zero, which intuitively means that the number of correct digits roughly at least

doubles in every step.

Practical considerations

Newton’s method is an extremely powerful technique—in general the convergence is quadratic: as the

method converges on the root, the difference between the root and the approximation is squared (the

number of accurate digits roughly doubles) at each step. However, there are some difficulties with the

method.

Difficulty in calculating derivative of a function

Newton’s method requires that the derivative be calculated directly. An analytical expression for the

derivative may not be easily obtainable and could be expensive to evaluate. In these situations, it may be

appropriate to approximate the derivative by using the slope of a line through two nearby points on the

function. Using this approximation would result in something like the secant method whose convergence is

slower than that of Newton’s method.

Failure of the method to converge to the root

It is important to review the proof of quadratic convergence of Newton’s Method before implementing it.

Specifically, one should review the assumptions made in the proof. For situations where the method fails to

converge, it is because the assumptions made in this proof are not met.

Overshoot

If the first derivative is not well behaved in the neighborhood of a particular root, the method may

overshoot, and diverge from that root. An example of a function with one root, for which the derivative is

not well behaved in the neighborhood of the root, is

for which the root will be overshot and the sequence of x will diverge. For a = 1/2, the root will still be

overshot, but the sequence will oscillate between two values. For 1/2 < a < 1, the root will still be

overshot but the sequence will converge, and for a ≥ 1 the root will not be overshot at all.

Stationary point

If a stationary point of the function is encountered, the derivative is zero and the method will terminate

due to division by zero.

Poor initial estimate

A large error in the initial estimate can contribute to non-convergence of the algorithm.

Trying to avoid of non-convergence

It is common to place limits on the number of iterations, bound the solution to an interval known to

contain the root, and combine the method with a more robust root finding method.

Failure

Newton’s method is only guaranteed to converge if certain conditions are satisfied.

Bad starting points

In some cases the conditions on the function that are necessary for convergence are satisfied, but the point

chosen as the initial point is not in the interval where the method converges. This can happen, for example,

if the function whose root is sought approaches zero asymptotically as x goes to or . In such cases a

different method, such as bisection, should be used to obtain a better estimate for the zero to use as an

initial point.

Iteration point is stationary

Consider the function:

It has a maximum at x = 0 and solutions of f(x) = 0 at x = ±1. If we start iterating from the stationary point x0

= 0 (where the derivative is zero), x1 will be undefined, since the tangent at (0,1) is parallel to the x-axis:

The same issue occurs if, instead of the starting point, any iteration point is stationary. Even if the

derivative is small but not zero, the next iteration will be a far worse approximation.

Starting point enters a cycle

The tangent lines of x

3

– 2x + 2 at 0 and 1 intersect the x-axis at 1 and 0 respectively, illustrating why

Newton’s method oscillates between these values for some starting points.

For some functions, some starting points may enter an infinite cycle, preventing convergence.

Example: Let

and take 0 as the starting point. The first iteration produces 1 and the second iteration returns to 0 so the

sequence will alternate between the two without converging to a root. In fact, this 2-cycle is stable: there

are neighborhoods around 0 and around 1 from which all points iterate asymptotically to the 2-cycle (and

hence not to the root of the function). In general, the behavior of the sequence can be very complex.

Derivative issues

If the function is not continuously differentiable in a neighborhood of the root then it is possible that

Newton’s method will always diverge and fail, unless the solution is guessed on the first try.

Derivative does not exist at root

A simple example of a function where Newton’s method diverges is the cube root, which is continuous and

infinitely differentiable, except for x = 0, where its derivative is undefined (this, however, does not affect

the algorithm, since it will never require the derivative if the solution is already found):

For any iteration point xn, the next iteration point will be:

The algorithm overshoots the solution and lands on the other side of the y-axis, farther away than it

initially was; applying Newton’s method actually doubles the distances from the solution at each iteration.

In fact, the iterations diverge to infinity for every , where . In the limiting case

of (square root), the iterations will alternate indefinitely between points x0 and −x0, so they do not

converge in this case either.

Discontinuous derivative

If the derivative is not continuous at the root, then convergence may fail to occur in any neighborhood of

the root. Consider the function

Its derivative is:

Within any neighborhood of the root, this derivative keeps changing sign as x approaches 0 from the right

(or from the left) while f(x) ≥ x − x

2 > 0 for 0 < x < 1.

So f(x)/f'(x) is unbounded near the root, and Newton’s method will diverge almost everywhere in any

neighborhood of it, even though:

• the function is differentiable (and thus continuous) everywhere;

• the derivative at the root is nonzero;

• f is infinitely differentiable except at the root; and

• the derivative is bounded in a neighborhood of the root (unlike f(x)/f'(x)).

Numerical Integration: Simpson’s rule

Simpson’s rule is a method for numerical integration, the numerical approximation of definite integrals.

Specifically, it is the following approximation:

Simpson’s rule is a staple of scientific data analysis and engineering.

Derivation

The midpoint and the trapezoidal rules

Derivation constructs Simpson’s rule from two simpler approximations: the midpoint rule

and the trapezoidal rule

The errors in these approximations are

respectively, where denotes a term asymptotically proportional to . The two

terms are not equal;. It follows from the above formulas for the errors of the midpoint and

trapezoidal rule that the leading error term vanishes if we take the weighted average

This weighted average is exactly Simpson’s rule.

Using another approximation (for example, the trapezoidal rule with twice as many points), it is possible to

take a suitable weighted average and eliminate another error term. This is Romberg’s method.

Error

The error in approximating an integral by Simpson’s rule is

where is some number between and .

The error is asymptotically proportional to . However, the above derivations suggest an error

proportional to . Simpson’s rule gains an extra order because the points at which the integrand is

evaluated are distributed symmetrically in the interval [a, b].

Composite Simpson’s rule

If the interval of integration is in some sense “small”, then Simpson’s rule will provide an adequate

approximation to the exact integral. By small, what we really mean is that the function being integrated is

relatively smooth over the interval . For such a function, a smooth quadratic interpolant like the one

used in Simpson’s rule will give good results.

However, it is often the case that the function we are trying to integrate is not smooth over the interval.

Typically, this means that either the function is highly oscillatory, or it lacks derivatives at certain points. In

these cases, Simpson’s rule may give very poor results. One common way of handling this problem is by

breaking up the interval into a number of small subintervals. Simpson’s rule is then applied to each

subinterval, with the results being summed to produce an approximation for the integral over the entire

interval. This sort of approach is termed the composite Simpson’s rule.

Suppose that the interval is split up in subintervals, with an even number. Then, the composite

Simpson’s rule is given by

where for with ; in particular, and

. The above formula can also be written

as:

The error committed by the composite Simpson’s rule is bounded (in absolute value) by

where is the “step length”, given by

This formulation splits the interval in subintervals of equal length. In practice, it is often advantageous

to use subintervals of different lengths, and concentrate the efforts on the places where the integrand is

less well-behaved. This leads to the adaptive Simpson’s method.

Alternative extended Simpson’s rule

This is another formulation of a composite Simpson’s rule: instead of applying Simpson’s rule to disjoint

segments of the integral to be approximated, Simpson’s rule is applied to overlapping segments, yielding:[

The formula above is obtained by combining the original composite Simpson’s rule with the one consisting

of using Simpson’s 3/8 rule in the end subintervals and the standard 3-point rule in the remaining

subintervals. The result is then obtained by taking the mean of the two formulas.

Numerical Differentiation

The ability to differentiate numerically is vital because in Engineering many physical systems are modelled

as differential equations and their solution is often impossible analytically. This is especially true when real

world problems are studied using partial differential equations.

For most mathematical models the differential equations only involve first for second derivative terms,

thus we will only examine the approximation of first and second derivates:

To approximate the first derivative we (undo) the limiting process and calculate f’(x) in terms of finite

differences:

This can be derived in by using the Taylor series expansion:

Subtracting the backward difference from the forward difference give:

This is our approximation for the first derivate f’(x) and it also gives an indication of the error involved, the

error is QUADRATIC

By adding the forward difference to the backward difference, we can obtain an approximation to the

second derivative f’’(x)

This also shows that the error in this numerical approximation is QUADRATIC. These expressions are very

important in solving differential equations.

Random number generation &

Statistical Distributions

A random number generator (RNG) is a computational or physical device designed to generate a sequence

of numbers or symbols that lack any pattern, i.e. appear random.

The many applications of randomness have led to the development of several different methods for

generating random data.

Several computational methods for (pseudo)random number generation exist. Many fall short of the goal

of true randomness — though they may meet, with varying success, pseudo randomness.

Practical applications and uses

Note that, in general, where unpredictability is paramount – such as in security applications – hardware

generators are generally preferred (where feasible) over pseudo-random algorithms.

Random number generators are very useful in developing Monte Carlo-method simulations. They are also

used in cryptography – so long as the seed is secret. Sender and receiver can generate the same set of

numbers automatically to use as keys.

The generation of pseudo-random numbers is an important and common task in computer programming.

While cryptography and certain numerical algorithms require a very high degree of apparent randomness,

many other operations only need a modest amount of unpredictability.

“True” random numbers vs. pseudo-random numbers

There are two principal methods used to generate random numbers. The first method measures some

physical phenomenon that is expected to be random and then compensates for possible biases in the

measurement process. Example sources include measuring atmospheric noise, thermal noise, and other

external electromagnetic and quantum phenomena. For example, cosmic background radiation or

radioactive decay as measured over short timescales represent sources of natural entropy.

The second method uses computational algorithms that can produce long sequences of apparently random

results, which are in fact completely determined by a shorter initial value, known as a seed or key. The

latter type are often called pseudorandom number generators. These types of generators do not typically

rely on naturally occurring sources.

A “random number generator” based solely on deterministic computation cannot be regarded as a “true”

random number generator in the purest sense of the word, since their output is inherently predictable if all

seed values are known. In practice, they are sufficient for most tasks. Carefully designed and implemented

pseudo-random number generators can even be certified for security-critical cryptographic purposes.

Computational generation

Pseudo-random number generators (PRNGs) are algorithms that can automatically create long runs of

numbers with good random properties but eventually the sequence repeats (or the memory usage grows

without bound). The string of values generated by such algorithms is generally determined by a fixed

number called a seed. One of the most common PRNG is the linear congruential generator, which uses the

recurrence relationship:

to generate numbers. The maximum number of numbers the formula can produce is the modulus, m. To

avoid certain non-random properties of a single linear congruential generator, several such random

number generators with slightly different values of the multiplier coefficient a can be used in parallel, with

a “master” random number generator that selects from among the several different generators.

Most computer programming languages include functions or library routines that provide random number

generators. They are often designed to provide a random byte or word, or a floating point number

uniformly distributed between 0 and 1.

The quality i.e. randomness of such library functions varies widely from completely predictable output, to

cryptographically secure. The default random number generator in many languages, including Python,

Ruby, R, IDL and PHP is based on the Mersenne Twister algorithm and is not sufficient for cryptography

purposes. Such library functions often have poor statistical properties and some will repeat patterns after

only tens of thousands of trials.

They are often initialized using a computer’s real time clock as the seed, since such a clock generally

measures in milliseconds, far beyond the person’s precision. These functions may provide enough

randomness for certain tasks (for example video games) but are unsuitable where high-quality randomness

is required, such as in cryptography applications, statistics or numerical analysis.

Most programming languages, including those mentioned above, provide a means to access these higher

quality sources.

An example of a simple pseudo-random number generator is the multiply-with-carry method. It is

computationally fast and has good (albeit not cryptographically strong) randomness properties

// global variables choose initial values to be the seed

unsigned int m_w = <choose-initializer>; /* must not be zero, nor 0x464fffff */

insigned int m_z = <choose-initializer>; /* must not be zero, nor 0x9068ffff */

// random number generating function

unsigned int get_random(void)

{

m_z = 36969 * (m_z & 65535) + (m_z >> 16);

m_w = 18000 * (m_w & 65535) + (m_w >> 16);

return (m_z << 16) + m_w; /* 32-bit result */

}

Random numbers uniformly distributed between 0 and 1 can be used to generate random numbers of

any desired distribution by passing them through the inverse cumulative distribution function (CDF) of

the desired distribution.

Generating the Gaussian (Normal) distribution

Algorithm

Suppose U1 and U2 are independent random variables that are uniformly distributed in the interval (0, 1]

and we let:

and

then Z0 and Z1 are independent random variables with a standard normal distribution.

(This is known as the Box-Muller transformation.)

Implementation

A simple implementation in C++ using the variance. If the noise is generated without the variance, the

multiplication by the variance in the implementation below can simply be removed.

#define TWO_PI 6.2831853071795864769252866

double generateGaussianNoise(const double &variance)

{

static bool haveSpare = false;

static double rand1, rand2;

if(haveSpare)

{

haveSpare = false;

time_t t;

srand(time(&t));

rand1 = (double)rand() / ((double) RAND_MAX);

return sqrt(variance * rand1) * sin(rand2);

}

haveSpare = true;

rand1 = (double)rand() / ((double) RAND_MAX);

if(rand1 < 1e-100) rand1 = 1e-100;

rand1 = -2 * log(rand1);

rand2 = (rand() / ((double) RAND_MAX)) * TWO_PI;

return sqrt(variance * rand1) * cos(rand2);

}

A faster algorithm that avoids having to calculate trigonometric functions.

We can use the fact that, in a two-dimensional Cartesian system where X and Y coordinates are described

by two independent and normally distributed random variables, the random variables for R

Z and Θ (shown

above) in the corresponding polar coordinates are also independent and can be expressed as:

and

Two uniformly distributed values, u and v are used to produce the value s = R2

, which is likewise uniformly

distributed.

Given u and v, independent and uniformly distributed in the closed interval [−1, +1], set s = R2 = u2 + v

2

.

(Clearly .) If s = 0 or s ≥ 1, throw u and v away and try another pair (u, v). Because u and v are

uniformly distributed and because only points within the unit circle have been admitted, the values of s will

be uniformly distributed in the open interval (0, 1), too. The latter can be seen by calculating the

cumulative distribution function for s in the interval (0, 1). This is the area of a circle with radius , divided

by . From this we find the probability density function to have the constant value 1 on the interval (0, 1).

Equally so, the angle θ divided by is uniformly distributed in the interval [0, 1) and independent of s.

We now identify the value of s with that of U1 and with that of U2. As shown in the figure, the values

of and I n the original can be replaced with the ratios and

, respectively.

The advantage is that calculating the trigonometric functions directly can be avoided. This is helpful when

trigonometric functions are more expensive to compute than the single division that replaces each one.

This produces two standard normal deviates.

and

Tail truncation

When a computer is used to produce a uniform random variable it will inevitably have some inaccuracies

because there is a lower bound on how close numbers can be to 0. For 32 bit computer the smallest

number that can be generated is . When and are equal to this the Box–Muller transform

produces a normal random variable equal to

This means that the algorithm will not produce random variables more than 6.66 standard deviations from

the mean. This corresponds to a proportion of lost due to the truncation.

Other Distributions

Because R2

is the square of the norm of the standard bivariate normal variable (X, Y), it has the chi-squared

distribution with two degrees of freedom. In the special case of two degrees of freedom, the chi-squared

distribution coincides with the exponential distribution, and the equation for R2 above is a simple way of

generating the required exponential variate.

Solution of Differential Equations

Many engineering problems are examined by setting up a mathematical model for them and then solving

that model. The most useful mathematical model for these problems is the differential equation. Most of

these models require only up to second order differentials but often they are complicated by being nonlinear and for real world problems they require more than one variable (they are multi-dimensional.)

Differential equation that involve more than variable are Partial Differential Equations. These variables are

typically, one,two or three spatial coordinates and time.

Often simplifications are possible, where one or two spatial dimensions can be omitted from the model.

Rarely can non-linear differential equations be solved analytically and often even linear models require

boundary conditions that make an analytic solution impossible. Thus, in practice, we use discrete numerical

approximation for the differentials and a computer to carry out the calculations.

We have already seen how to approximate first and second differentials and so the differential equations

can be reformulated in terms of discretised variables. It is helpful to investigate the solution of differential

equations by classifying them according to the order of the differentials and how they are combined.

The class are:

• Elliptic e.g. Poisson’s Equation

• Parabolic e.g The Diffusion Equation

• Hyperbolic e.g. The Wave Equation

We have two potential solution techniques:

• Finite Differences

• Finite Elements

In this course we use Finite Differences and investigate Elliptic and Parabolic equations:

Elliptic Equations:

Typical elliptic equations in a two-dimensional Cartesian system are Laplace’s equations, and Poisson’s

equation:

As this involves two dimensions we use partial derivatives. We could study a one-dimensional problem by

removing one of the variables or a three-dimensional problem by adding another variable.

Applying the finite difference approximations to the partial derivates we get:

Five-point central-difference for the Laplace equation:

With the finite difference approximation we get:

0

2 2

2

, 1 , , 1

2

1, , 1,

y

u u u

x

ui j ui j ui j i j i j i j

This is accurate to:

2 2

o x , y

And applies ONLY at discrete points. Note the spacing in horizontal and vertical directions;

This can be generalised:

This expression can be used to solve for the discrete value of the field variable U at the point (i,j) using

values of U in the neighbourhood. Of course these points may themselves need to be calculated and this

this complicates the solution resulting in the need to build up a solution algorithm.

NOTE: Any differential equation can only be solved numerically under specific boundary/initial conditions

and thus these must be specified. An example illustrates the process.

y

x

u u u u u

u u u

y

x

u u u

i j i j i j i j i j

i j i j i j i j i j i j

2 1 0…

2 2 0

,

2

, 1 , 1

2

1, 1,

, 1 , , 1

2

1, , 1,

if x y 1

ui1, j

ui1, j

ui, j1

ui, j1

4ui, j

0

In order to explore various solution procedures, first consider a square domain with Dirichlet Boundary

conditions.

For instance, a simple 6×6 grid system subject to the following B.C.s. is considered:

x=0 u=u2, y=0 u=u1

x=L u=u4, y=H u=u3

The resulting solution requires 16 linear equations to be written for each of the 16 interior points where U

is to be calculated.

These can be written in matrix form because this makes it easier to see a pattern and set up a numerical

computer solution:

You have learned several techniques for solving such problems. Gauss Elimination is a good example but

the special structure of the matrix make it much easier to use an iterative approach such as the Gauss

Siedel method. (See notes on Jacobi & Gauss Siedel method Later.)

The finite difference equation is given by:

, 1 , 1

2

, 2 1, 1,

2 1

1

ui j

ui j ui j ui j ui j

For the first point (2,2)

2,3 2,1

2

2 3,2 1,2

1

2,2

2 1

1

u u u u u

k k k

Parabolic Equations

These equations mostly represent processes that evolve through time. The can potentially also involve two

and three spatial dimensions but we will only investigate one spatial dimension.

Since they involve time we talk about the problem having an Initial Condition as well as one/two/three

boundary conditions.

There are two approaches to their solution that have different stability properties and involve different

levels of computational complexity.

• Explicit (Forward time, central space)

• Implicit (Backward time, central space)

A typical parabolic second-order PDE is the unsteady heat conduction equation, which is considered first in

one-space dimension. It has the following form:

is expressed by a forward difference approximation which is of order ∆t.

… (1)

Using the second order central difference of order ∆x2

for the spatial term the diffusion equation can be

written as:

This Explicit formulation is stable provided:

And is easily calculated for each discrete point based on the initial condition for U. The time-evolution of U

can be determined at each time step ∆t using the simple formula (2). There are no unknowns, all the Ui

values are known at either the initial boundary or from the previous time step.

The main drawback with this method is the stability condition because it often results in the need to make

VERY small time steps and the solution takes a long time to progress meaningfully.

Implicit approach (backward/central space) method:

In this formulation the adjacent spatial variable values at the ‘current’ time step are required in order to

calculate the current value of the central variable. This makes a straightforward calculation of the current

values impossible and a set of linear equations have to be set up and these need to be solved using either

the Gauss-Seidel or Gauss Elimination techniques:

However, the implicit method is unconditionally stable and thus large time steps may be taken.

By extending either the Explicit or Implicit approach the diffusion equation in two spatial dimensions may

be solved using Finite Differences.

By extending these ideas, any form of ODE or PDE in any shaped environment or with any initial conditions

may be solved.

Working with ‘real’ numbers in a Computer

Computers naturally deal with Integer quantities, there are exact one-to-one relationships between the

integers 0,1,2,3,… and their binary equivalents (up to the capacity of the machine) the same thing can

NOT be said of ‘real’ (floating point) numbers, nor the fractions such as 1/3 nor the irrational numbers

such as √2 or π.

Single-precision binary floating-point format: binary32

The IEEE 754 standard specifies a binary32 as having:

• Sign bit: 1 bit

• Exponent width: 8 bits

• Significand precision: 24 bits (23 explicitly stored)

This gives from 6 to 9 significant decimal digits precision (if a decimal string with at most 6 significant

decimal is converted to IEEE 754 single precision and then converted back to the same number of

significant decimal, then the final string should match the original; and if an IEEE 754 single precision is

converted to a decimal string with at least 9 significant decimal and then converted back to single, then the

final number must match the original).

Sign bit determines the sign of the number, which is the sign of the significand as well. Exponent is either

an 8 bit signed integer from −128 to 127 (2’s Complement) or an 8 bit unsigned integer from 0 to 255 which

is the accepted biased form in IEEE 754 binary32 definition. If the unsigned integer format is used, the

exponent value used in the arithmetic is the exponent shifted by a bias – for the IEEE 754 binary32 case, an

exponent value of 127 represents the actual zero (i.e. for 2e − 127 to be one, e must be 127).

The true significand includes 23 fraction bits to the right of the binary point and an implicit leading bit (to

the left of the binary point) with value 1 unless the exponent is stored with all zeros. Thus only 23 fraction

bits of the significand appear in the memory format but the total precision is 24 bits (equivalent to

log10(224) ≈ 7.225 decimal digits). The bits are laid out as follows:

The real value assumed by a given 32 bit binary32 data with a given biased exponent e (the 8 bit unsigned

integer) and a 23 bit fraction is where more precisely we have

.

In this example:

•

•

•

thus:

•

Exponent encoding

The single-precision binary floating-point exponent is encoded using an offset-binary representation, with

the zero offset being 127; also known as exponent bias in the IEEE 754 standard.

• Emin = 01H−7FH = −126

• Emax = FEH−7FH = 127

• Exponent bias = 7FH = 127

Thus, in order to get the true exponent as defined by the offset binary representation, the offset of 127 has

to be subtracted from the stored exponent.

The stored exponents 00H and FFH are interpreted specially.

Signed zero

In the IEEE 754 standard, zero is signed, meaning that there exist both a “positive zero” (+0) and a

“negative zero” (−0). In most run-time environments, positive zero is usually printed as “0”, while negative

zero may be printed as “-0”. The two values behave as equal in numerical comparisons, but some

operations return different results for +0 and −0. For instance, 1/(−0) returns negative infinity, while 1/+0

returns positive infinity (so that the identity 1/(1/±∞) = ±∞ is maintained). Other common functions with a

discontinuity at x=0 which might treat +0 and −0 differently include log(x), signum(x), and the principal

square root of y + xi for any negative number y. As with any approximation scheme, operations involving

“negative zero” can occasionally cause confusion. For example, in IEEE 754, x = y does not imply 1/x = 1/y,

as 0 = −0 but 1/0 ≠ 1/−0.

Infinities

The infinities of the extended real number line can be represented in IEEE floating-point datatypes, just like

ordinary floating-point values like 1, 1.5, etc. They are not error values in any way, though they are often

(but not always, as it depends on the rounding) used as replacement values when there is an overflow.

Upon a divide-by-zero exception, a positive or negative infinity is returned as an exact result. An infinity can

also be introduced as a numeral (like C’s “INFINITY” macro, or “∞” if the programming language allows that

syntax).

IEEE 754 requires infinities to be handled in a reasonable way, such as

• (+∞) + (+7) = (+∞)

• (+∞) × (−2) = (−∞)

• (+∞) × 0 = NaN – there is no meaningful thing to do

NaNs

IEEE 754 specifies a special value called “Not a Number” (NaN) to be returned as the result of certain

“invalid” operations, such as 0/0, ∞×0, or sqrt(−1). In general, NaNs will be propagated i.e. most operations

involving a NaN will result in a NaN, although functions that would give some defined result for any given

floating-point value will do so for NaNs as well, e.g. NaN ^ 0 = 1. There are two kinds of NaNs: the default

quiet NaNs and, optionally, signaling NaNs. A signaling NaN in any arithmetic operation (including

numerical comparisons) will cause an “invalid” exception to be signaled.

The representation of NaNs specified by the standard has some unspecified bits that could be used to

encode the type or source of error; but there is no standard for that encoding. In theory, signaling NaNs

could be used by a runtime system to flag uninitialized variables, or extend the floating-point numbers

with other special values without slowing down the computations with ordinary values, although such

extensions are not common.

Converting from decimal representation to binary32 format

In general refer to the IEEE 754 standard itself for the strict conversion (including the rounding behaviour)

of a real number into its equivalent binary32 format.

Here we can show how to convert a base 10 real number into an IEEE 754 binary32 format using the

following outline:

• consider a real number with an integer and a fraction part such as 12.375

• convert and normalize the integer part into binary

• convert the fraction part using the following technique as shown here

• add the two results and adjust them to produce a proper final conversion

Conversion of the fractional part:

consider 0.375, the fractional part of 12.375. To convert it into a binary fraction, multiply the fraction by 2,

take the integer part and re-multiply new fraction by 2 until a fraction of zero is found or until the precision

limit is reached which is 23 fraction digits for IEEE 754 binary32 format.

0.375 x 2 = 0.750 = 0 + 0.750 => b−1 = 0, the integer part represents the binary fraction digit. Re-multiply

0.750 by 2 to proceed

0.750 x 2 = 1.500 = 1 + 0.500 => b−2 = 1

0.500 x 2 = 1.000 = 1 + 0.000 => b−3 = 1, fraction = 0.000, terminate

We see that (0.375)10 can be exactly represented in binary as (0.011)2. Not all decimal fractions can be

represented in a finite digit binary fraction. For example decimal 0.1 cannot be represented in binary

exactly. So it is only approximated.

Therefore (12.375)10 = (12)10 + (0.375)10 = (1100)2 + (0.011)2 = (1100.011)2

Also IEEE 754 binary32 format requires that you represent real values in format, so

that 1100.011 is shifted to the right by 3 digits to become

Finally we can see that:

From which we deduce:

• The exponent is 3 (and in the biased form it is therefore 130 = 1000 0010)

• The fraction is 100011 (looking to the right of the binary point)

From these we can form the resulting 32 bit IEEE 754 binary32 format representation of 12.375 as: 0-

10000010-10001100000000000000000 = 41460000H

Note: consider converting 68.123 into IEEE 754 binary32 format: Using the above procedure you expect to

get 42883EF9H with the last 4 bits being 1001 However due to the default rounding behaviour of IEEE 754

format what you get is 42883EFAH whose last 4 bits are 1010 .

Ex 1: Consider decimal 1 We can see that:

From which we deduce:

• The exponent is 0 (and in the biased form it is therefore 127 = 0111 1111 )

• The fraction is 0 (looking to the right of the binary point in 1.0 is all 0 = 000…0)

From these we can form the resulting 32 bit IEEE 754 binary32 format representation of real number 1 as:

0-01111111-00000000000000000000000 = 3f800000H

Ex 2: Consider a value 0.25 . We can see that:

From which we deduce:

• The exponent is −2 (and in the biased form it is 127+(−2)= 125 = 0111 1101 )

• The fraction is 0 (looking to the right of binary point in 1.0 is all zeros)

From these we can form the resulting 32 bit IEEE 754 binary32 format representation of real number 0.25

as: 0-01111101-00000000000000000000000 = 3e800000H

Ex 3: Consider a value of 0.375 . We saw that

Hence after determining a representation of 0.375 as we can proceed as above:

• The exponent is −2 (and in the biased form it is 127+(−2)= 125 = 0111 1101 )

• The fraction is 1 (looking to the right of binary point in 1.1 is a single 1 = x1)

From these we can form the resulting 32 bit IEEE 754 binary32 format representation of real number 0.375

as: 0-01111101-10000000000000000000000 = 3ec00000H

Single-precision examples

These examples are given in bit representation, in hexadecimal, of the floating-point value. This includes

the sign, (biased) exponent, and significand.

3f80 0000 = 1

c000 0000 = −2

7f7f ffff ≈ 3.4028234 × 1038

(max single precision)

0000 0000 = 0

8000 0000 = −0

7f80 0000 = infinity

ff80 0000 = −infinity

3eaa aaab ≈ 1/3

By default, 1/3 rounds up, instead of down like double precision, because of the even number of bits in the

significand. The bits of 1/3 beyond the rounding point are 1010… which is more than 1/2 of a unit in the

last place.

Converting from single-precision binary to decimal

We start with the hexadecimal representation of the value, 41c80000, in this example, and convert it to

binary

41c8 000016 = 0100 0001 1100 1000 0000 0000 0000 00002

then we break it down into three parts; sign bit, exponent and significand.

Sign bit: 0

Exponent: 1000 00112 = 8316 = 131

Significand: 100 1000 0000 0000 0000 00002 = 48000016

We then add the implicit 24th bit to the significand

Significand: 1100 1000 0000 0000 0000 00002 = C8000016

and decode the exponent value by subtracting 127

Raw exponent: 8316 = 131

Decoded exponent: 131 − 127 = 4

Each of the 24 bits of the significand (including the implicit 24th bit), bit 23 to bit 0, represents a value,

starting at 1 and halves for each bit, as follows

bit 23 = 1

bit 22 = 0.5

bit 21 = 0.25

bit 20 = 0.125

bit 19 = 0.0625

.

.

bit 0 = 0.00000011920928955078125

The significand in this example has three bits set, bit 23, bit 22 and bit 19. We can now decode the

significand by adding the values represented by these bits.

Decoded significand: 1 + 0.5 + 0.0625 = 1.5625 = C80000/223

Then we need to multiply with the base, 2, to the power of the exponent to get the final result

1.5625 × 24 = 25

Thus

41c8 0000 = 25

This is equivalent to:

where is the sign bit, is the exponent, and is the significand.

Double-precision floating-point format

Double-precision floating-point format is a computer number format that occupies 8 bytes (64 bits) in

computer memory and represents a wide dynamic range of values by using floating point.

Computers with 32-bit storage locations use two memory locations to store a 64-bit double-precision

number (a single storage location can hold a single-precision number). Double-precision floating-point

format usually refers to binary64, as specified by the IEEE 754 standard, not to the 64-bit decimal format

decimal64.

IEEE 754 double-precision binary floating-point format: binary64

Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over

single-precision floating point, in spite of its performance and bandwidth cost. As with single-precision

floating-point format, it lacks precision on integer numbers when compared with an integer format of the

same size. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having:

• Sign bit: 1 bit

• Exponent width: 11 bits

• Significand precision: 53 bits (52 explicitly stored)

This gives from 15–17 significant decimal digits precision. If a decimal string with at most 15 significant

digits is converted to IEEE 754 double precision representation and then converted back to a string with the

same number of significant digits, then the final string should match the original. If an IEEE 754 double

precision is converted to a decimal string with at least 17 significant digits and then converted back to

double, then the final number must match the original.

The format is written with the significand having an implicit integer bit of value 1 (except for special

datums, see the exponent encoding below). With the 52 bits of the fraction significand appearing in the

memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log10(2) ≈

15.955). The bits are laid out as follows:

The real value assumed by a given 64-bit double-precision datum with a given biased exponent and a 52-

bit fraction is

or

Between 252=4,503,599,627,370,496 and 253=9,007,199,254,740,992 the representable numbers are

exactly the integers. For the next range, from 253 to 254, everything is multiplied by 2, so the representable

numbers are the even ones, etc. Conversely, for the previous range from 251 to 252, the spacing is 0.5, etc.

The spacing as a fraction of the numbers in the range from 2n

to 2n+1 is 2n−52. The maximum relative

rounding error when rounding a number to the nearest representable one (the machine epsilon) is

therefore 2−53

.

The 11 bit width of the exponent allows the representation of numbers with a decimal exponent between

10−308 and 10308, with full 15-17 decimal digits precision. By compromising precision, subnormal

representation allows values smaller than 10−323

.

Exponent encoding

The double-precision binary floating-point exponent is encoded using an offset-binary representation, with

the zero offset being 1023; also known as exponent bias in the IEEE 754 standard. Examples of such

representations would be:

• Emin (1) = −1022

• E (50) = −973

• Emax (2046) = 1023

Thus, as defined by the offset-binary representation, in order to get the true exponent the exponent bias of

1023 has to be subtracted from the written exponent.

The exponents 00016 and 7ff16 have a special meaning:

• 00016 is used to represent a signed zero (if M=0) and subnormals (if M≠0); and

• 7ff16 is used to represent ∞ (if M=0) and NaNs (if M≠0),

where M is the fraction mantissa. All bit patterns are valid encoding.

Except for the above exceptions, the entire double-precision number is described by:

In the case of subnormals (E=0) the double-precision number is described by:

Double-precision examples

3ff0 0000 0000 000016 = 1

3ff0 0000 0000 000116 ≈ 1.0000000000000002, the smallest number > 1

3ff0 0000 0000 000216 ≈ 1.0000000000000004

4000 0000 0000 000016 = 2

c000 0000 0000 000016 = –2

0000 0000 0000 000116 = 2−1022−52 = 2−1074

≈ 4.9406564584124654 × 10−324 (Min subnormal positive double)

000f ffff ffff ffff16 = 2−1022 − 2−1022−52

≈ 2.2250738585072009 × 10−308 (Max subnormal double)

0010 0000 0000 000016 = 2−1022

≈ 2.2250738585072014 × 10−308 (Min normal positive double)

7fef ffff ffff ffff16 = (1 + (1 − 2−52)) × 21023

≈ 1.7976931348623157 × 10308 (Max Double)

0000 0000 0000 000016 = 0

8000 0000 0000 000016 = –0

7ff0 0000 0000 000016 = ∞

fff0 0000 0000 000016 = −∞

3fd5 5555 5555 555516 ≈ 1/3

By default, 1/3 rounds down, instead of up like single precision, because of the odd number of bits in the

significand.

In more detail:

Given the hexadecimal representation 3FD5 5555 5555 555516,

Sign = 0

Exponent = 3FD16 = 1021

Exponent Bias = 1023 (constant value; see above)

Significand = 5 5555 5555 555516

Value = 2(Exponent − Exponent Bias) × 1.Significand – Note the Significand must not be converted to decimal here

= 2−2 × (15 5555 5555 555516 × 2−52)

= 2−54 × 15 5555 5555 555516

= 0.333333333333333314829616256247390992939472198486328125

≈ 1/3

Execution speed with double-precision arithmetic

Using double precision floating-point variables and mathematical functions (e.g., sin(), cos(), atan2(), log(),

exp(), sqrt()) are slower than working with their single precision counterparts. One area of computing

where this is a particular issue is for parallel code running on GPUs. For example when using NVIDIA’s

CUDA platform, calculations with double precision take three times longer

Accuracy problems

The fact that floating-point numbers cannot precisely represent all real numbers, and that floating-point

operations cannot precisely represent true arithmetic operations, leads to many surprising situations. This

is related to the finite precision with which computers generally represent numbers.

For example, the non-representability of 0.1 and 0.01 (in binary) means that the result of attempting to

square 0.1 is neither 0.01 nor the representable number closest to it. In 24-bit (single precision)

representation, 0.1 (decimal) was given previously as e = −4; s = 110011001100110011001101, which is

0.100000001490116119384765625 exactly.

Squaring this number gives

0.010000000298023226097399174250313080847263336181640625 exactly.

Squaring it with single-precision floating-point hardware (with rounding) gives

0.010000000707805156707763671875 exactly.

But the representable number closest to 0.01 is

0.009999999776482582092285156250 exactly.

Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not

yield a result of infinity, nor will it even overflow. It is simply not possible for standard floating-point

hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation

in C:

/* Enough digits to be sure we get the correct approximation. */

double pi = 3.1415926535897932384626433832795;

double z = tan(pi/2.0);

will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be

−22877332.0.

By the same token, an attempted computation of sin(π) will not yield zero. The result will be

(approximately) 0.1225×10−15 in double precision, or −0.8742×10−7 in single precision.

While floating-point addition and multiplication are both commutative (a + b = b + a and a×b = b×a), they

are not necessarily associative. That is, (a + b) + c is not necessarily equal to a + (b + c). Using 7-digit

significand decimal arithmetic:

a = 1234.567, b = 45.67834, c = 0.0004

(a + b) + c:

1234.567 (a)

+ 45.67834 (b)

____________

1280.24534 rounds to 1280.245

1280.245 (a + b)

+ 0.0004 (c)

____________

1280.2454 rounds to 1280.245 <— (a + b) + c

a + (b + c):

45.67834 (b)

+ 0.0004 (c)

____________

45.67874

1234.567 (a)

+ 45.67874 (b + c)

____________

1280.24574 rounds to 1280.246 <— a + (b + c)

They are also not necessarily distributive. That is, (a + b) ×c may not be the same as a×c + b×c:

1234.567 × 3.333333 = 4115.223

1.234567 × 3.333333 = 4.115223

4115.223 + 4.115223 = 4119.338

but

1234.567 + 1.234567 = 1235.802

1235.802 × 3.333333 = 4119.340

In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight

inaccuracies, the following phenomena may occur:

• Cancellation: subtraction of nearly equal operands may cause extreme loss of accuracy. When we

subtract two almost equal numbers we set the most significant digits to zero, leaving ourselves with

just the insignificant, and most erroneous, digits. For example, when determining a derivative of a

function the following formula is used:

Intuitively one would want an h very close to zero, however when using floating-point operations,

the smallest number won’t give the best approximation of a derivative. As h grows smaller the

difference between f (a + h) and f(a) grows smaller, cancelling out the most significant and least

erroneous digits and making the most erroneous digits more important. As a result the smallest

number of h possible will give a more erroneous approximation of a derivative than a somewhat

larger number. This is perhaps the most common and serious accuracy problem.

• Conversions to integer are not intuitive: converting (63.0/9.0) to integer yields 7, but converting

(0.63/0.09) may yield 6. This is because conversions generally truncate rather than round. Floor and

ceiling functions may produce answers which are off by one from the intuitively expected value.

• Limited exponent range: results might overflow yielding infinity, or underflow yielding a subnormal

number or zero. In these cases precision will be lost.

• Testing for safe division is problematic: Checking that the divisor is not zero does not guarantee that

a division will not overflow.

• Testing for equality is problematic. Two computational sequences that are mathematically equal

may well produce different floating-point values.

Incidents

• On 25 February 1991, a loss of significance in a MIM-104 Patriot missile battery prevented it

intercepting an incoming Scud missile in Dhahran, Saudi Arabia, contributing to the death of 28

soldiers from the U.S. Army.

Minimizing the effect of accuracy problems

Although, as noted previously, individual arithmetic operations of IEEE 754 are guaranteed accurate to

within half a ULP, more complicated formulae can suffer from larger errors due to round-off. The loss of

accuracy can be substantial if a problem or its data are ill-conditioned, meaning that the correct result is

hypersensitive to tiny perturbations in its data. However, even functions that are well-conditioned can

suffer from large loss of accuracy if an algorithm numerically unstable for that data is used: apparently

equivalent formulations of expressions in a programming language can differ markedly in their numerical

stability. One approach to remove the risk of such loss of accuracy is the design and analysis of numerically

stable algorithms, which is an aim of the branch of mathematics known as numerical analysis. Another

approach that can protect against the risk of numerical instabilities is the computation of intermediate

(scratch) values in an algorithm at a higher precision than the final result requires, which can remove, or

reduce by orders of magnitude, such risk: IEEE 754 quadruple precision and extended precision are

designed for this purpose when computing at double precision.

For example, the following algorithm is a direct implementation to compute the function A(x) = (x–1)/(

exp(x–1) – 1) which is well-conditioned at 1.0, however it can be shown to be numerically unstable and lose

up to half the significant digits carried by the arithmetic when computed near 1.0.

double A(double X)

{

double Y, Z; // [1]

Y = X – 1.0;

Z = exp(Y);

if (Z != 1.0) Z = Y/(Z – 1.0); // [2]

return(Z);

}

If, however, intermediate computations are all performed in extended precision (e.g. by setting line [1] to

C99 long double), then up to full precision in the final double result can be maintained.[33] Alternatively, a

numerical analysis of the algorithm reveals that if the following non-obvious change to line [2] is made:

if (Z != 1.0) Z = log(Z)/(Z – 1.0);

then the algorithm becomes numerically stable and can compute to full double precision.

To maintain the properties of such carefully constructed numerically stable programs, careful handling by

the compiler is required. Certain “optimizations” that compilers might make (for example, reordering

operations) can work against the goals of well-behaved software. There is some controversy about the

failings of compilers and language designs in this area: C99 is an example of a language where such

optimizations are carefully specified so as to maintain numerical precision.

Several rules of thumb that can substantially decrease, by orders of magnitude the risk of numerical

anomalies, in addition to, or in lieu of, a more careful numerical analysis. These include, computing all

expressions and intermediate results in the highest precision supported in hardware (a common rule of

thumb is to carry twice the precision of the desired result i.e. compute in double precision for a final single

precision result, or in double extended or quad precision for up to double precision results); and rounding

input data and results to only the precision required and supported by the input data (carrying excess

precision in the final result beyond that required and supported by the input data can be misleading,

increases storage cost and decreases speed, and the excess bits can affect convergence of numerical

procedures: notably, the first form of the iterative example given below converges correctly when using

this rule of thumb). Brief descriptions of several additional issues and techniques follow.

As decimal fractions can often not be exactly represented in binary floating-point, such arithmetic is at its

best when it is simply being used to measure real-world quantities over a wide range of scales (such as the

orbital period of a moon around Saturn or the mass of a proton), and at its worst when it is expected to

model the interactions of quantities expressed as decimal strings that are expected to be exact. An

example of the latter case is financial calculations. For this reason, financial software tends not to use a

binary floating-point number representation. The “decimal” data type of the C# and Python programming

languages, and the decimal formats of the IEEE 754-2008 standard, are designed to avoid the problems of

binary floating-point representations when applied to human-entered exact decimal values, and make the

arithmetic always behave as expected when numbers are printed in decimal.

Expectations from mathematics may not be realized in the field of floating-point computation. For

example, it is known that , and that , however these

facts cannot be relied on when the quantities involved are the result of floating-point computation.

The use of the equality test (if (x==y) …) requires care when dealing with floating-point numbers. Even

simple expressions like 0.6/0.2-3==0 will, on most computers, fail to be true (in IEEE 754 double precision,

for example, 0.6/0.2-3 is approximately equal to -4.44089209850063e-16). Consequently, such tests are

sometimes replaced with “fuzzy” comparisons (if (abs(x-y) < epsilon) …, where epsilon is sufficiently small

and tailored to the application, such as 1.0E−13). The wisdom of doing this varies greatly, and can require

numerical analysis to bound epsilon. Values derived from the primary data representation and their

comparisons should be performed in a wider, extended, precision to minimize the risk of such

inconsistencies due to round-off errors. It is often better to organize the code in such a way that such tests

are unnecessary. For example, in computational geometry, exact tests of whether a point lies off or on a

line or plane defined by other points can be performed using adaptive precision or exact arithmetic

methods. Small errors in floating-point arithmetic can grow when mathematical algorithms perform

operations an enormous number of times. A few examples are matrix inversion, eigenvector computation,

and differential equation solving. These algorithms must be very carefully designed, using numerical

approaches such as Iterative refinement, if they are to work well.

Summation of a vector of floating-point values is a basic algorithm in scientific computing, and so an

awareness of when loss of significance can occur is essential. For example, if one is adding a very large

number of numbers, the individual addends are very small compared with the sum. This can lead to loss of

significance. A typical addition would then be something like:

3253.671

+ 3.141276

——–

3256.812

The low 3 digits of the addends are effectively lost. Suppose, for example, that one needs to add many

numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about

3000; lost digits are not regained. (e.g 3.00001 not 3000.001)

Big-O notation is a relative representation of the complexity of an algorithm.

There are some important and deliberately chosen words in that sentence:

• relative: you can only compare apples to apples. You can’t compare an algorithm to do

arithmetic multiplication to an algorithm that sorts a list of integers. But a comparison of two

algorithms to do arithmetic operations (one multiplication, one addition) will tell you something

meaningful;

• representation: Big-O (in its simplest form) reduces the comparison between algorithms to a

single variable. That variable is chosen based on observations or assumptions. For example,

sorting algorithms are typically compared based on comparison operations (comparing two

nodes to determine their relative ordering). This assumes that comparison is expensive. But

what if comparison is cheap but swapping is expensive? It changes the comparison; and

• complexity: if it takes me one second to sort 10,000 elements how long will it take me to sort

one million? Complexity in this instance is a relative measure to something else.

The best example of Big-O I can think of is doing arithmetic. Take two numbers (123456 and 789012). The

basic arithmetic operations we learnt in school were:

• addition;

• subtraction;

• multiplication; and

• division.

Each of these is an operation or a problem. A method of solving these is called an algorithm.

Addition is the simplest. You line the numbers up (to the right) and add the digits in a column writing the

last number of that addition in the result. The ‘tens’ part of that number is carried over to the next column.

Let’s assume that the addition of these numbers is the most expensive operation in this algorithm. It stands

to reason that to add these two numbers together we have to add together 6 digits (and possibly carry a

7th). If we add two 100 digit numbers together we have to do 100 additions. If we add two 10,000 digit

numbers we have to do 10,000 additions.

See the pattern? The complexity (being the number of operations) is directly proportional to the number

of digits n in the larger number. We call this O(n) or linear complexity.

Subtraction is similar (except you may need to borrow instead of carry).

Multiplication is different. You line the numbers up, take the first digit in the bottom number and multiply

it in turn against each digit in the top number and so on through each digit. So to multiply our two 6 digit

numbers we must do 36 multiplications. We may need to do as many as 10 or 11 column adds to get the

end result too.

If we have two 100-digit numbers we need to do 10,000 multiplications and 200 adds. For two one million

digit numbers we need to do one trillion (1012) multiplications and two million adds.

As the algorithm scales with n-squared, this is O(n2

) or quadratic complexity. This is a good time to

introduce another important concept:

We only care about the most significant portion of complexity.

The astute may have realized that we could express the number of operations as: n2 + 2n. But as you saw

from our example with two numbers of a million digits apiece, the second term (2n) becomes insignificant

(accounting for 0.0002% of the total operations by that stage).

One can notice that we’ve assumed the worst case scenario here. While multiplying 6 digit numbers if one

of them is 4 digit and the other one is 6 digit, then we only have 24 multiplications. Still we calculate the

worst case scenario for that ‘n’, i.e when both are 6 digit numbers. Hence Big-O notation is about the

Worst-case scenario of an algorithm

The Telephone Book

If you were instructing a computer to look up the phone number for “John Smith” in a telephone book that

contains 1,000,000 names, what would you do? Ignoring the fact that you could guess how far in the S’s

started (let’s assume you can’t), what would you do?

A typical implementation might be to open up to the middle, take the 500,000th and compare it to “Smith”.

If it happens to be “Smith, John”, we just got real lucky. Far more likely is that “John Smith” will be before

or after that name. If it’s after we then divide the last half of the phone book in half and repeat. If it’s

before then we divide the first half of the phone book in half and repeat. And so on.

This is called a binary search and is used every day in programming whether you realize it or not.

So if you want to find a name in a phone book of a million names you can actually find any name by doing

this at most 20 times. In comparing search algorithms we decide that this comparison is our ‘n’.

• For a phone book of 3 names it takes 2 comparisons (at most).

• For 7 it takes at most 3.

• For 15 it takes 4.

• …

• For 1,000,000 it takes 20.

That is staggeringly good isn’t it?

In Big-O terms this is O(log n) or logarithmic complexity. Now the logarithm in question could be ln (base

e), log10, log2 or some other base. It doesn’t matter it’s still O(log n) just like O(2n2

) and O(100n2

) are still

both O(n2

).

It’s worthwhile at this point to explain that Big O can be used to determine three cases with an algorithm:

• Best Case: In the telephone book search, the best case is that we find the name in one

comparison. This is O(1) or constant complexity;

• Expected Case: As discussed above this is O(log n); and

• Worst Case: This is also O(log n).

Normally we don’t care about the best case. We’re interested in the expected and worst case. Sometimes

one or the other of these will be more important.

Back to the telephone book.

What if you have a phone number and want to find a name? The police have a reverse phone book but

such look-ups are denied to the general public. Or are they? Technically you can reverse look-up a number

in an ordinary phone book. How?

You start at the first name and compare the number. If it’s a match, great, if not, you move on to the next.

You have to do it this way because the phone book is unordered (by phone number anyway).

So to find a name given the phone number (reverse lookup):

• Best Case: O(1);

• Expected Case: O(n) (for 500,000); and

• Worst Case: O(n) (for 1,000,000).

The Travelling Salesman

This is a famous problem with many analogues in science and engineering (optimisation) and deserves a

mention. In this problem you have N towns. Each of those towns is linked to 1 or more other towns by a

road of a certain distance. The Travelling Salesman problem is to find the shortest tour that visits every

town.

Sounds simple? Think again.

If you have 3 towns A, B and C with roads between all pairs then you could go:

• A → B → C

• A → C → B

• B → C → A

• B → A → C

• C → A → B

• C → B → A

Well actually there’s less than that because some of these are equivalent (A → B → C and C → B → A are

equivalent, for example, because they use the same roads, just in reverse).

In actuality there are 3 possibilities.

• Take this to 4 towns and you have (iirc) 12 possibilities.

• With 5 it’s 60.

• 6 becomes 360.

This is a function of a mathematical operation called a factorial. Basically:

• 5! = 5 × 4 × 3 × 2 × 1 = 120

• 6! = 6 × 5 × 4 × 3 × 2 × 1 = 720

• 7! = 7 × 6 × 5 × 4 × 3 × 2 × 1 = 5040

• …

• 25! = 25 × 24 × … × 2 × 1 = 15,511,210,043,330,985,984,000,000

• …

• 50! = 50 × 49 × … × 2 × 1 = 3.04140932 × 1064

So the Big-O of the Travelling Salesman problem is O(n!) or factorial or combinatorial complexity.

By the time you get to 150 towns, it would take the fastest computer known today about 5 x 10275

years

to calculate the optimal route.

Polynomial Time

Any algorithm that has a complexity of O(na

) is said to have polynomial complexity or is solvable in

polynomial time.

O(n), O(n2

) etc are all polynomial time. Some problems cannot be solved in polynomial time. Certain things

are used in the world because of this. Public Key Cryptography is a prime example. It is computationally

hard to find two prime factors of a very large number. If it wasn’t, we couldn’t use the public key systems

we use.

Additional Background Material

Iterative solution of a set of linear equations

Jacobi Method

Given a square system of n linear equations:

where:

Then A can be decomposed into a diagonal component D, and the remainder R:

The solution is then obtained iteratively via

The element-based formula is thus:

The computation of xi

(k+1) requires each element in x

(k) except itself. Unlike the Gauss–Seidel method, we

can’t overwrite xi

(k) with xi

(k+1), as that value will be needed by the rest of the computation. The minimum

amount of storage is two vectors of size n.

Convergence

The method is guaranteed to converge if the matrix A is strictly or irreducibly diagonally dominant. Strict

row diagonal dominance means that for each row, the absolute value of the diagonal term is greater than

the sum of absolute values of other terms:

The Jacobi method sometimes converges even if these conditions are not satisfied.

Example

A linear system of the form with initial estimate is given by

We use the equation , described above, to estimate . First, we rewrite the

equation in a more convenient form , where and

. Note that where and are the strictly lower and upper parts of . From the

known values

we determine as

Further, C is found as

With T and C calculated, we estimate as :

The next iteration yields

This process is repeated until convergence (i.e., until is small). The solution after 25

iterations is

Another example

Suppose we are given the following linear system:

Suppose we choose (0, 0, 0, 0) as the initial approximation, then the first approximate solution is given by

Using the approximations obtained, the iterative procedure is repeated until the desired accuracy has been

reached. The following are the approximated solutions after five iterations.

The exact solution of the system is (1, 2, −1, 1).

Gauss –Siedel Method

The Gauss–Seidel method is an iterative technique for solving a square system of n linear equations with

unknown x:

.

It is defined by the iteration

where the matrix A is decomposed into a lower triangular component , and a strictly upper triangular

component U: .

[2]

In more detail, write out A, x and b in their components:

Then the decomposition of A into its lower triangular component and its strictly upper triangular

component is given by:

The system of linear equations may be rewritten as:

The Gauss–Seidel method now solves the left hand side of this expression for x, using previous value for x

on the right hand side. By taking advantage of the triangular form of , the elements of x

(k+1) can be

computed sequentially using forward substitution:

Consider row i:

When working with iteration k, at this stage the x’s 1 up to i-1 have been calculated and those after i not

yet calculated.

Or:

And:

In general:

The procedure is generally continued until the changes made by an iteration are below some tolerance,

such as a sufficiently small residual.

Discussion

The element-wise formula for the Gauss–Seidel method is extremely similar to that of the Jacobi method.

The computation of xi

(k+1) uses only the elements of x

(k+1) that have already been computed, and only the

elements of x

(k)

that have not yet to be advanced to iteration k+1. This means that, unlike the Jacobi

method, only one storage vector is required as elements can be overwritten as they are computed, which

can be advantageous for very large problems.

However, unlike the Jacobi method, the computations for each element cannot be done in parallel.

Furthermore, the values at each iteration are dependent on the order of the original equations.

Convergence

The convergence properties of the Gauss–Seidel method are dependent on the matrix A. Namely, the

procedure is known to converge if either:

• A is symmetric positive-definite, or

• A is strictly or irreducibly diagonally dominant.

The Gauss–Seidel method sometimes converges even if these conditions are not satisfied.

Algorithm

Since elements can be overwritten as they are computed in this algorithm, only one storage vector is

needed.

Examples

Suppose given k equations where xn are vectors of these equations and starting point x0. From the first

equation solve for x1 in terms of For the next equations substitute the previous

values of the xs.

To make it clear let’s consider an example.

Solving for , , and gives:

Suppose we choose (0, 0, 0, 0) as the initial approximation, then the first approximate solution is given by

Using the approximations obtained, the iterative procedure is repeated until the desired accuracy has been

reached. The following are the approximated solutions after four iterations.

The exact solution of the system is (1, 2, −1, 1).

Gaussian elimination

Gaussian elimination is an algorithm for solving systems of linear equations. It is usually understood as a

sequence of operations performed on the associated matrix of coefficients. This method can also be used

to find the rank of a matrix, to calculate the determinant of a matrix, and to calculate the inverse of an

invertible square matrix.

To perform row reduction on a matrix, one uses a sequence of elementary row operations to modify the

matrix until the lower left-hand corner of the matrix is filled with zeros, as much as possible. There are

three types of elementary row operations: 1) Swapping two rows, 2) Multiplying a row by a non-zero

number, 3) Adding a multiple of one row to another row. Using these operations, a matrix can always be

transformed into an upper triangular matrix, and in fact one that is in row echelon form. Once all of the

leading coefficients (the left-most non-zero entry in each row) are 1, and in every column containing a

leading coefficient has zeros elsewhere, the matrix is said to be in reduced row echelon form. This final

form is unique; in other words, it is independent of the sequence of row operations used. For example, in

the following sequence of row operations (where multiple elementary operations might be done at each

step), the third and fourth matrices are the ones in row echelon form, and the final matrix is the unique

reduced row echelon form.

Using row operations to convert a matrix into reduced row echelon form is sometimes called Gauss–Jordan

elimination.

Definitions and example of algorithm

The process of row reduction makes use of elementary row operations, and can be divided into two parts.

The first part (sometimes called Forward Elimination) reduces a given system to row echelon form, from

which one can tell whether there are no solutions, a unique solution, or infinitely many solutions. The

second part (sometimes called back substitution) continues to use row operations until the solution is

found; in other words, it puts the matrix into reduced row echelon form.

Row operations

There are three types of elementary row operations which may be performed on the rows of a matrix:

Type 1: Swap the positions of two rows.

Type 2: Multiply a row by a nonzero scalar.

Type 3: Add to one row a scalar multiple of another.

If the matrix is associated to a system of linear equations, then these operations do not change the solution

set. Therefore, if one’s goal is to solve a system of linear equations, then using these row operations could

make the problem easier.

Echelon form

For each row in a matrix, if the row does not consist of only zeros, then the left-most non-zero entry is

called the leading coefficient (or pivot) of that row. So if two leading coefficients are in the same column,

then a row operation of type 3 (see above) could be used to make one of those coefficients zero. Then by

using the row swapping operation, one can always order the rows so that for every non-zero row, the

leading coefficient is to the right of the leading coefficient of the row above. If this is the case, then matrix

is said to be in row echelon form. So the lower left part of the matrix contains only zeros, and all of the

zero rows are below the non-zero rows. The word “echelon” is used here because one can roughly think of

the rows being ranked by their size, with the largest being at the top and the smallest being at the bottom.

For example, the following matrix is in row echelon form, and its leading coefficients are shown in red.

It is in echelon form because the zero row is at the bottom, and the leading coefficient of the second row

(in the third column), is to the right of the leading coefficient of the first row (in the second column).

A matrix is said to be in reduced row echelon form if furthermore all of the leading coefficients are equal

to 1 (which can be achieved by using the elementary row operation of type 2), and in every column

containing a leading coefficient, all of the other entries in that column are zero (which can be achieved by

using elementary row operations of type 3).

Example of the algorithm

Suppose the goal is to find and describe the set of solutions to the following system of linear equations:

The table below is the row reduction process applied simultaneously to the system of equations, and its

associated augmented matrix. In practice, one does not usually deal with the systems in terms of equations

but instead makes use of the augmented matrix, which is more suitable for computer manipulations. The

row reduction procedure may be summarized as follows: eliminate x from all equations below , and

then eliminate y from all equations below . This will put the system into triangular form. Then, using

back-substitution, each unknown can be solved for.

System of equations Row operations Augmented matrix

The matrix is now in echelon form (also called triangular form)

The second column describes which row operations have just been performed. So for the first step, the x is

eliminated from by adding to . Next x is eliminated from by adding to . These row

operations are labelled in the table as

Once y is also eliminated from the third row, the result is a system of linear equations in triangular form,

and so the first part of the algorithm is complete. From a computational point of view, it is faster to solve

the variables in reverse order, a process known as back-substitution. One sees the solution is z = -1, y = 3,

and x = 2. So there is a unique solution to the original system of equations.

Instead of stopping once the matrix is in echelon form, one could continue until the matrix is in reduced

row echelon form, as it is done in the table. The process of row reducing until the matrix is reduced is

sometimes referred to as Gauss-Jordan elimination, to distinguish it from stopping after reaching echelon

form.

Applications

The historically first application of the row reduction method is for solving systems of linear equations.

Here are some other important applications of the algorithm.

Computing determinants

To explain how Gaussian elimination allows to compute the determinant of a square matrix, we have to

recall how the elementary row operations change the determinant:

• Swapping two rows multiplies the determinant by -1

• Multiplying a row by a nonzero scalar multiplies the determinant by the same scalar

• Adding to one row a scalar multiple of another does not change the determinant.

If the Gaussian elimination applied to a square matrix A produces a row echelon matrix B, let d be the

product of the scalars by which the determinant has been multiplied, using above rules. Then the

determinant of A is the quotient by d of the product of the elements of the diagonal of B.

It should be emphasized that, for a n×n matrix, this method needs only O(n3

) arithmetic operations, while

the elementary methods, usually taught in elementary courses, need O(2n

) or O(n!) operations. Even on the

fastest computers, the elementary methods are impracticable for n ≥ 21.

Finding the inverse of a matrix

A variant of Gaussian elimination called Gauss–Jordan elimination can be used for finding the inverse of a

matrix, if it exists. If A is a n by n square matrix, then one can use row reduction to compute its inverse

matrix, if it exists. First, the n by n identity matrix is augmented to the right of A, forming a n by 2n block

matrix [A | I]. Now through application of elementary row operations, find the reduced echelon form of

this n by 2n matrix. The matrix A is invertible if and only if the left block can be reduced to the identity

matrix I; in this case the right block of the final matrix is A−1. If the algorithm is unable to reduce the left

block to I, then A is not invertible.

For example, consider the following matrix

To find the inverse of this matrix, one takes the following matrix augmented by the identity, and row

reduces it as a 3 by 6 matrix:

By performing row operations, one can check that the reduced row echelon form of this augmented matrix

is:

The matrix on the left is the identity, which shows A is invertible. The 3 by 3 matrix on the right, B, is the

inverse of A. This procedure for finding the inverse works for square matrices of any size.

Computational efficiency

The number of arithmetic operations required to perform row reduction is one way of measuring the

algorithm’s computational efficiency. For example, to solve a system of n equations for n unknowns by

performing row operations on the matrix until it is in echelon form, and then solving for each unknown in

reverse order, requires n(n-1) / 2 divisions, (2n3 + 3n2 − 5n)/6 multiplications, and (2n3 + 3n2 − 5n)/6

subtractions, for a total of approximately 2n3 / 3 operations. Thus it has arithmetic complexity of O(n3

); see

Big O notation. This arithmetic complexity is a good measure of the time needed for the whole

computation when the time for each arithmetic operation is approximately constant. This is the case when

the coefficients are represented by floating point numbers or when they belong to a finite field. If the

coefficients are integers or rational numbers exactly represented, the intermediate entries can grow

exponentially large, so the bit complexity is exponential.[8] However, there is a variant of Gaussian

elimination, called Bareiss algorithm that avoids this exponential growth of the intermediate entries, and,

with the same arithmetic complexity of O(n3

), has a bit complexity of O(n5

).

This algorithm can be used on a computer for systems with thousands of equations and unknowns.

However, the cost becomes prohibitive for systems with millions of equations. These large systems are

generally solved using iterative methods. Specific methods exist for systems whose coefficients follow a

regular pattern.

To put an n by n matrix into reduced echelon form by row operations, one needs arithmetic operations;

which is approximately 50% more computation steps.

One possible problem is numerical instability, caused by the possibility of dividing by very small numbers.

If, for example, the leading coefficient of one of the rows is very close to zero, then to row reduce the

matrix one would need to divide by that number so the leading coefficient is 1. This means any error that

existed for the number which was close to zero would be amplified. Gaussian elimination is numerically

stable for diagonally dominant or positive-definite matrices. For general matrices, Gaussian elimination is

usually considered to be stable, when using partial pivoting, even though there are examples of stable

matrices for which it is unstable.

Pseudocode

As explained above, Gaussian elimination writes a given m × n matrix A uniquely as a product of an

invertible m × m matrix S and a row-echelon matrix T. Here, S is the product of the matrices corresponding

to the row operations performed.

The formal algorithm to compute from follows. We write for the entry in row , column in

matrix with 1 being the first index. The transformation is performed in place, meaning that the original

matrix is lost and successively replaced by .

for k = 1 … m:

Find pivot for column k:

i_max := argmax (i = k … m, abs(A[i, k]))

if A[i_max, k] = 0

error “Matrix is singular!”

swap rows(k, i_max)

Do for all rows below pivot:

for i = k + 1 … m:

Do for all remaining elements in current row:

for j = k + 1 … n:

A[i, j] := A[i, j] – A[k, j] * (A[i, k] / A[k, k])

Fill lower triangular matrix with zeros:

A[i, k] := 0

This algorithm differs slightly from the one discussed earlier, because before eliminating a variable, it first

exchanges rows to move the entry with the largest absolute value to the pivot position. Such partial

pivoting improves the numerical stability of the algorithm; some other variants are used.

Upon completion of this procedure the augmented matrix will be in row-echelon form and may be solved

by back-substitution.

With modern computers, Gaussian elimination is not always the fastest algorithm to compute the row

echelon form of matrix.

Sparse matrices

In numerical analysis, a sparse matrix is a matrix in which most of the elements are zero. By contrast, if

most of the elements are nonzero, then the matrix is considered dense. The fraction of zero elements

(non-zero elements) in a matrix is called the sparsity (density).

Conceptually, sparsity corresponds to systems which are loosely coupled. Consider a line of balls connected

by springs from one to the next: this is a sparse system as only adjacent balls are coupled. By contrast, if

the same line of balls had springs connecting each ball to all other balls, the system would be correspond to

a dense matrix. The concept of sparsity is useful in combinatorics and application areas such as network

theory, which have a low density of significant data or connections.

Large sparse matrices often appear in scientific or engineering applications when solving partial differential

equations.

When storing and manipulating sparse matrices on a computer, it is beneficial and often necessary to use

specialized algorithms and data structures that take advantage of the sparse structure of the matrix.

Operations using standard dense-matrix structures and algorithms are slow and inefficient when applied to

large sparse matrices as processing and memory are wasted on the zeroes. Sparse data is by nature more

easily compressed and thus require significantly less storage. Some very large sparse matrices are

infeasible to manipulate using standard dense-matrix algorithms.

Tridiagonal matrices (Special Cases)

A tridiagonal matrix is a matrix that has nonzero elements only on the main diagonal, the first diagonal

below this, and the first diagonal above the main diagonal.

For example, the following matrix is tridiagonal:

Example of sparse matrix

The above sparse matrix contains only

9 nonzero elements of the 35, with 26

of those elements as zero.

The determinant of a tridiagonal matrix is given by a continuant of its elements.

Properties

Many linear algebra algorithms require significantly less computational effort when applied to diagonal

matrices, and this improvement often carries over to tridiagonal matrices as well.

Determinant

The determinant of a tridiagonal matrix A of order n can be computed from a three-term recurrence

relation. Write f1 = |a1| = a1 and

The sequence (fi) is called the continuant and satisfies the recurrence relation

with initial values f0 = 1 and f-1 = 0. The cost of computing the determinant of a tridiagonal matrix using this

formula is linear in n, while the cost is cubic for a general matrix.

Inversion

The inverse of a non-singular tridiagonal matrix T

is given by

where the θi satisfy the recurrence relation

with initial conditions θ0 = 1, θ1 = a1 and the ϕi satisfy

with initial conditions ϕn+1 = 1 and ϕn = an.

Solution of linear system

A system of equations A x = b for can be solved by an efficient form of Gaussian elimination when A is

tridiagonal called tridiagonal matrix algorithm, requiring O(n) operations.

Tridiagonal matrix algorithm

The tridiagonal matrix algorithm, is a simplified form of Gaussian elimination that can be used to solve

tridiagonal systems of equations. A tridiagonal system for n unknowns may be written as

where and .

For such systems, the solution can be obtained in operations instead of required by Gaussian

elimination. A first sweep eliminates the ‘s, and then an (abbreviated) backward substitution produces

the solution. Examples of such matrices commonly arise from the discretization of 1D Poisson equation

(e.g., the 1D diffusion problem) and natural cubic spline interpolation; similar systems of matrices arise in

tight binding physics or nearest neighbor effects models.

Method

The forward sweep consists of modifying the coefficients as follows, denoting the new modified

coefficients with primes:

and

The solution is then obtained by back substitution:

Derivation

The derivation of the tridiagonal matrix algorithm involves manually performing some specialized Gaussian

elimination in a generic manner.

Suppose that the unknowns are , and that the equations to be solved are:

Consider modifying the second ( ) equation with the first equation as follows:

which would give:

and the effect is that has been eliminated from the second equation. Using a similar tactic with the

modified second equation on the third equation yields:

This time was eliminated. If this procedure is repeated until the row; the (modified) equation will

involve only one unknown, . This may be solved for and then used to solve the equation, and

so on until all of the unknowns are solved for.

Clearly, the coefficients on the modified equations get more and more complicated if stated explicitly. By

examining the procedure, the modified coefficients (notated with tildes) may instead be defined

recursively:

To further hasten the solution process, may be divided out (if there’s no division by zero risk), the newer

modified coefficients notated with a prime will be:

This gives the following system with the same unknowns and coefficients defined in terms of the original

ones above:

The last equation involves only one unknown. Solving it in turn reduces the next last equation to one

unknown, so that this backward substitution can be used to find all of the unknowns: