The FP instructions:
fild -> load integer
fist -> store integer
fistp -> store integer and pop fp stack
work with *signed* integers; there is no instruction for unsigned integers.
Only fistp can handle 64-bit integers; fist is limited to 32-bit integers at most.
A number such as 9.3e+18 is too big for a signed 64-bit integer, but fits perfectly in an unsigned 64-bit integer. Unfortunately the Floating Point Unit doesn't know that unsigned integers exist, so if the number won't fit in a signed integer the store overflows and produces the 'integer indefinite' value (0x8000000000000000).
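A minimal demonstration (note that the out-of-range signed conversion is formally undefined behavior in C; the 0x8000... result is what the x87 fistp actually produces):

#include <stdio.h>

int main(void)
{
    double d = 9.3e+18;
    /* out of range for a signed 64-bit integer: the x87 stores
       the 'integer indefinite' value 0x8000000000000000 */
    printf("as signed:   %lld\n", (long long)d);
    /* in range for unsigned: prints 9300000000000000000 */
    printf("as unsigned: %llu\n", (unsigned long long)d);
    return 0;
}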
An optimizing compiler removes instructions that it can detect at compile time as 'will never be executed', so if you want a literal translation of your code you have to select 'no optimizations' in the compiler flags. Pelles C is smart enough to detect such occurrences.
Now the memory layout of a double can be described as follows:
#pragma pack(1)
typedef struct
{
unsigned long long int mantissa:52; //Mantissa or significand
unsigned long long int exponent:11; //exponent biased by 1023
unsigned long long int sign:1; //sign of mantissa
} DOUBLE_FMT, *LP_DOUBLE_FMT;
#pragma pack()
The number represented is composed of an implicit 1 before the binary point, followed by the fractional part, called the mantissa or significand, held in 52 bits. The whole significand is 53 bits long counting the implicit 1.
The exponent is the power of 2 by which we have to multiply the mantissa to obtain our number. The exponent can be positive or negative, to express a number > 1 or < 1 respectively. However, the IEEE standard doesn't use a 2's complement representation for the exponent, but biasing: the bias value (1023) represents 0, exponents less than 1023 represent numbers < 1 and exponents greater than 1023 represent numbers > 1. Some exponent values (0x000 and 0x7FF) have special meanings; refer to the IEEE-754 standard.
So our number can be represented as:
number = (-1)^sign * 1.mantissa * 2^(exponent - 1023)
This could seem complicated, but math operations in base 2 are very simple for the machine: e.g. multiplying the mantissa by the power of 2 of the exponent simply requires shifting the mantissa by the exponent value, in a direction depending on the exponent's sign. Shift left for positive exponents, shift right for negative ones.
The sign is one bit wide, so it can hold two values, 1 or 0. Because any number raised to the power 0 gives 1, and gives itself when raised to the power 1, our number changes sign if the sign bit is 1 and stays unchanged if it is 0 ((-1)^0 = 1 => 1 * 1.mantissa..., (-1)^1 = -1 => -1 * 1.mantissa...).
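To see the three fields in practice, here is a small sketch that decomposes a sample value through the DOUBLE_FMT structure defined above (accessing a double through a bitfield struct is implementation-defined, but behaves as expected with Pelles C and the other common x86 compilers):

#include <stdio.h>

int main(void)
{
    double d = 6.5;   /* 6.5 = (-1)^0 * 1.101b * 2^2 */
    LP_DOUBLE_FMT p = (LP_DOUBLE_FMT)&d;
    printf("sign     = %u\n", (unsigned)p->sign);           /* 0 */
    printf("exponent = %u biased, %d unbiased\n",
           (unsigned)p->exponent, (int)p->exponent - 1023); /* 1025, 2 */
    printf("mantissa = 0x%013llX\n",
           (unsigned long long)p->mantissa);                /* 0xA000000000000 */
    return 0;
}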
A software routine converting a double to an unsigned 64-bit integer could be coded as:
//MyDouble is the double to convert
double MyDouble = 9.3e+18;
//Cast the double to the structure describing the format
LP_DOUBLE_FMT pDouble = (LP_DOUBLE_FMT)&MyDouble;
//Get the mantissa promoted to integer (no longer a fraction)
//and add the implicit integer part (the 53rd bit)
unsigned long long ui64 = pDouble->mantissa | 0x10000000000000ULL;
//Compute the shift amount: remove the bias and account for
//the 52 fractional bits just promoted to integer
int exponent = (int)pDouble->exponent - 1023 - 52;
//Adjust the result by the exponent
if (exponent > 0)
    ui64 <<= exponent;
else if (exponent < 0)
    ui64 >>= -exponent;
//Adjust for negative numbers
if (pDouble->sign)
    ui64 = -ui64;
This code performs the same operations as Timo's sample taken from Jochen, which is somewhat poorly coded and documented (oh yes, black-magik coding...).
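For completeness, here is the same logic wrapped in a callable function with the edge cases guarded (the name double2ull and the range guards are my additions, not part of the original sample):

unsigned long long double2ull(double d)
{
    LP_DOUBLE_FMT p = (LP_DOUBLE_FMT)&d;
    int exponent = (int)p->exponent - 1023 - 52;

    //Zero and denormals have no implicit 1 and truncate to 0 anyway
    if (p->exponent == 0)
        return 0;
    //Shifting by 64 or more is undefined in C: values below 1 truncate to 0
    if (exponent <= -64)
        return 0;
    //Inf, NaN and values >= 2^64 can't fit: saturate (an arbitrary choice)
    if (exponent > 11)
        return 0xFFFFFFFFFFFFFFFFULL;

    unsigned long long ui64 = p->mantissa | 0x10000000000000ULL;
    if (exponent > 0)
        ui64 <<= exponent;
    else if (exponent < 0)
        ui64 >>= -exponent;
    return p->sign ? -ui64 : ui64;
}

//double2ull(9.3e+18) returns 9300000000000000000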
One last consideration is about the approximation we get using floating point numbers. As seen, the mantissa holds only 53 bits, so only numbers that fit in 53 bits are an exact representation of the number; any greater value is just an approximation (the missing bits will be replaced by zeroes):
//Largest integer up to which a double represents every integer exactly
//(53 bits = 11111111111111111111111111111111111111111111111111111b)
#define MAXDOUBLEEXACT 9007199254740991LL
If your number requires, say, 54 bits, in a double the last bit will have no meaning, and you will see that the value can no longer change by units, but by the power of 2 of the missing low bits. I.e. with 54 bits we get only even numbers (steps of 2^1 = 2), with 55 bits numbers change in steps of 4 (2^2), with 56 bits in steps of 8 (2^3) and so on....
So running a counter in a double is a very bad idea (at least for numbers bigger than MAXDOUBLEEXACT); better use an integer....
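A quick check of this granularity (the outputs are exact, not library-dependent, since only integers are involved):

#include <stdio.h>

int main(void)
{
    double d = 9007199254740992.0;   /* 2^53: needs 54 bits as an integer */
    printf("%.0f\n", d);             /* 9007199254740992 */
    printf("%.0f\n", d + 1.0);       /* still 9007199254740992: the step is 2 now */
    printf("%.0f\n", d + 2.0);       /* 9007199254740994 */
    return 0;
}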
P.S.
To forestall any comment about the fact that adding even small values (1.0 or 10.0 or the like) to a value >= MAXDOUBLEEXACT still seems to give consistent results, I have to mention the FPU's internal data handling and rounding. The FPU generally uses extended double internally, to allow better precision in floating point calculations.
Extended double is 10 bytes long: the mantissa is 64 bits including an explicit integer part (integer part 1 bit, fractional part 63 bits), the exponent is 15 bits and the sign 1 bit.
The instructions that load and store floats convert between the internal format and the storage format required (float, double, long double...).
So, performing calculations in extended double, the result is still reasonably accurate because the representation limits have been moved up.....
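You can probe these limits from C through <float.h> (LDBL_MANT_DIG is 64 only where long double maps to the x87 extended format; some compilers, e.g. Microsoft's, map long double to plain double):

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("DBL_MANT_DIG  = %d\n", DBL_MANT_DIG);   /* 53 */
    printf("LDBL_MANT_DIG = %d\n", LDBL_MANT_DIG);  /* 64 with x87 extended */
    printf("sizeof(long double) = %u\n", (unsigned)sizeof(long double));
    return 0;
}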
Rounding was introduced in floating point because not all numbers are exactly representable in base 2; in some cases the conversion generates a non-terminating fraction (to get an idea think of 20/6 = 3.33333333333333....), and in these cases the floating point number is an approximation. The purpose of rounding is to get as close as possible to the real value, nudging the last significant figures toward it.
Consider that the internal representation of 20.1 is something like 20.0999999999999999999999; adding a small rounding value such as 0.0000000000000000000001 gives back 20.1 as required.
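You can make the approximation visible by printing a double 20.1 with more digits than it can honestly carry (the exact digits printed depend on the C library):

#include <stdio.h>

int main(void)
{
    /* 20.1 has no exact binary representation; the nearest
       double is slightly off, e.g. 20.10000000000000142... */
    printf("%.20f\n", 20.1);
    return 0;
}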