[Networkit] NetworKit 3.2: call for participation and release date

Florian Weber uagws at student.kit.edu
Tue May 27 21:30:58 CEST 2014


> A factor 2 speedup would of course be very nice. Nonetheless, we should be extra careful in testing the correctness of the parser, it’s a critical component.

I just added the current version to the new branch “fast_number_parsers”
in the file “/src/cpp/auxiliary/NumberParsing.h”. In addition to that I
also added some tests.

I would be happy if some people would take a look at this and maybe
added a few other tests to “./test/AuxGTest.cpp“.


Usage
-----

These functions work on arbitrary input-iterators over chars and return
a tuple of the value and an iterator to the end of the whitespace-region
after the number:

std::string str = "  123   456   ";
unsigned value;

auto it = str.begin();
auto end = str.end();

std::tie(value, it) = Aux::Parsing::strTo<unsigned>(it, end);
assert(value = 123);
assert(*it == '4');

std::tie(value, it) = Aux::Parsing::strTo<unsigned>(it, end);
assert(value = 456);
assert(it == end);

Some Notes
----------

* The functions swallow all whitespace before and after the number
* All integer-types are supported (std::is_integral<T> is required)
* All floating-point types that have a mantissa which can reasonably be
  stored in std::uintmax_t are supported (on x64-linux, libstdc++ this
  means float and double but not long double)

Important Limitations
---------------------

* In case of signed numbers, parsing their smallest value won't work:
  It is undefined behavior to parse INT_MIN as normal integer. (There
  are no such limitations for positive and unsigned integers.)

  The reason behind this is that negative numbers are parsed as positive
  ones and subtracted from 0 after that.

  Should this be totaly unacceptable, it would also be possible to
  implement this in a different way.

* Parsing floating-point-numbers may exhibit of-by-one-errors. For
  instance:
  “-784008854951861199831040.” should be parsed as
  -7.840088549518612e+23 but will be parsed as -7.8400885495186107e+23

  While this occurs quite often, I never found a case where the result
  is more than one fp-number of and strongly suspect rounding-errors.

  Note that the currently used myStrtod has the same problem (even
  though somewhat rarer, but it definitely occurs).

  Personally I think that everyone who thinks exploiting FP-numbers to
  the last bit will have other problems, so I don't consider this too
  much of a problem (especially since I haven't come across this problem
  so far when parsing strings that have less than the maximum number of
  digits.

Safety-checks
-------------

The integer-parser protects itself against overflow-errors with
assertions. OTOH this makes the release-version pretty fast, but still
enables debugging. The downside is that there is no safety in release-mode.

The float-parser uses the int-parser to parse possible exponents, but
aside from that ships around the issue completely by making sure at
compile-time that the mantissa can be safed completely in the used
integer and drops all unneeded characters.


Performance
-----------

The integer parser is clearly faster than myStrtoul if used in
release-mode. The factor depends on the used compiler.

Testcase: Parsing 100000 random numbers of type size_t, that were
converted with std::to_string and then joined with spaces

GCC (typical result):

stringstream:   1035358683812910607 in 25902324ns
strTo<T>:       1035358683812910607 in 7687193ns
myStrtoul:      1035358683812910607 in 10153870ns

Runtime of strTo<T> compared to myStrtoul: 75,7%

Clang (typical result):

stringstream:   3444364311182819528 in 27020813ns
strTo<T>:       3444364311182819528 in 7150124ns
myStrtoul:      3444364311182819528 in 14022166ns

Runtime of strTo<T> compared to myStrtoul: 50,1%

(The middle number is the xored value of all parsed numbers to prevent
unwanted optimizations; measurements taken with
std::high_precession_clock on x64 linux with libstdc++ and clang 3.4)

Concerning floats: I didn't manage to beat myStrtod clearly for this
challenge, but consider the difference to be acceptable:

Input: 100000 numbers between -10000 and 10000, generated with
std::to_string(); (no exponential notation)

(middle number = sum of all values)

GCC:
stringstream:   3.79426e+06 in 41032370ns
strTo<T>:       3.79426e+06 in 4352989ns
myStrtod:       3.79426e+06 in 3952708ns

Runtime of strTo<T> compared to myStrtod: 110,1%

Clang:
stringstream:   2.15003e+06 in 40050832ns
strTo<T>:       2.15003e+06 in 4270670ns
myStrtod:       2.15003e+06 in 5066246ns

Runtime of strTo<T> compared to myStrtod: 84.3%

Same but with exponential notation:

GCC:
stringstream:   -2.85461e+06 in 36774181ns
strTo<T>:       -2.85461e+06 in 4135008ns
myStrtod:       -2.85461e+06 in 3763526ns

Runtime of strTo<T> compared to myStrtod: 109,9%

Clang:
stringstream:   1.48174e+06 in 35890827ns
strTo<T>:       1.48174e+06 in 4312484ns
myStrtod:       1.48174e+06 in 4874552ns

Runtime of strTo<T> compared to myStrtod: 88,5%

Decreasing the size absolute value of the numbers seems to favor myStrtod.



Future Directions
-----------------

One possible addition would be to add a policy-argument (template) that
allows setting the error-checks. This would allow to enable
always-active overflow-checks where this is acceptable but keep the
critical uses fast.


Summary
-------

* high-level, very general interface
* fast for integers, acceptable for floats, which is OK since we
  probably parse more integers than floats
  (Also: Once clang gets openmp-support, we might want to support it)
* IMHO easier to understand code
* const-correct (unlike myStrtoX)
* the interface is easy to use right and hard to use wrong
* There are corner-cases, but they are known and documented (here)



Regards,
Florian


-------------- n�chster Teil --------------
Ein Dateianhang mit Bin�rdaten wurde abgetrennt...
Dateiname   : signature.asc
Dateityp    : application/pgp-signature
Dateigr��e  : 901 bytes
Beschreibung: OpenPGP digital signature
URL         : <https://lists.ira.uni-karlsruhe.de/mailman/private/networkit/attachments/20140527/06d42222/attachment.sig>


More information about the NetworKit mailing list