In my Node.js vs. C++ blog post I promised to add the source code to GitHub. It is available now at https://github.com/kubimtk/parseGutenberg
Today my first Apple app is available in the App Store. It is difficult to find within the first 24 hours.
The app is available for free at the moment and there is no means implemented to earn money.
While waiting for paid projects I am learning new things and keeping my fingers moving. Some time ago I learned Node.js from the book
In chapter 5 you learn how to generate a bulk import file (JSON format) for Elasticsearch from Project Gutenberg. You do this by parsing each of the more than 58,000 RDF files you can download from http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2, extracting the Gutenberg ID, the book’s title, the list of authors and the list of subjects, and writing out an index line followed by the book info as a JSON line. This takes quite a while to process, and I wanted to get fluent in writing test-driven C++ code using JetBrains' Rider and to try out a new C++ parser, in this case CMarkup. The Node.js version uses Cheerio, which finds the elements you are looking for with CSS selectors. Sometime in the future I will implement a SAX-based C++ XML parser and a CSS selector solution.
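To make the output format concrete, here is a minimal C++ sketch of how one parsed book could be turned into the two lines of the bulk file: an index (action) line followed by the document line. This is not the code from my repository. The BookInfo struct, the JSON field names and the "pg" prefix of the document ID are assumptions for illustration, and the XML parsing itself is left out.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type; the field names are assumptions for this sketch.
struct BookInfo {
    int id;
    std::string title;
    std::vector<std::string> authors;
    std::vector<std::string> subjects;
};

// Escape the characters that must not appear unescaped inside a JSON string.
std::string jsonEscape(const std::string& in) {
    static const char* hex = "0123456789abcdef";
    std::string out;
    for (unsigned char c : in) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {            // remaining control characters as \u00XX
                    out += "\\u00";
                    out += hex[c >> 4];
                    out += hex[c & 0x0F];
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Render a list of strings as a JSON array of string literals.
std::string jsonArray(const std::vector<std::string>& items) {
    std::string out = "[";
    for (std::size_t i = 0; i < items.size(); ++i) {
        if (i > 0) out += ",";
        out += "\"" + jsonEscape(items[i]) + "\"";
    }
    return out + "]";
}

// Build the two newline-terminated lines for one book:
// first the index line, then the book info as one JSON line.
// The "pg" ID prefix is an assumption, not something Elasticsearch requires.
std::string toBulkLines(const BookInfo& b) {
    const std::string id = std::to_string(b.id);
    std::string out = "{\"index\":{\"_id\":\"pg" + id + "\"}}\n";
    out += "{\"id\":" + id
         + ",\"title\":\"" + jsonEscape(b.title) + "\""
         + ",\"authors\":" + jsonArray(b.authors)
         + ",\"subjects\":" + jsonArray(b.subjects) + "}\n";
    return out;
}
```

A program would call toBulkLines() once per parsed RDF file and write the result to standard output, so the whole run can be redirected into a single bulk file.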
Including my tests the C++ source code has 608 lines in total:
% wc $(find inc src tests | egrep ".cpp$|.h$" | egrep -v "googletest|Mark")
      30      63     633 inc/Book.h
      73     323    2604 inc/KKKLogger.h
      40      67     668 inc/GutenbergParser.h
     118     281    3762 src/GutenbergParser.cpp
      60     163    1490 src/Book.cpp
      88     263    2593 src/main.cpp
      88     203    2544 tests/GutenbergParser_Tests.cpp
      24     125    1231 tests/TestFilePaths.h
      87     190    2865 tests/Book_Tests.cpp
     608    1678   18390 total
In Node.js we have only 33 lines in total:
% wc rdf-to-bulk.js lib/parse-rdf.js
      20      34     440 rdf-to-bulk.js
      13      27     497 lib/parse-rdf.js
      33      61     937 total
The C++ version has some more functionality (selective logs, different output options), but that accounts for only about 100 lines, while the tests are about 200 lines, so in effect you have:
- 300 lines for C++
- 33 lines for Node.js
The C++ version is clearly faster than Node.js 8: about 2.3 times faster in wall-clock time (1:01 vs. 2:22) and about 4 times faster in user CPU time (27 s vs. 119 s). Runtime in my bash environment with Node.js 8:
$ node --version
v8.0.0
$ time node rdf-to-bulk.js /Users/kubi/NodeJsProgs/data/cache/epub >bulk_node
real 2m21.932s
user 1m58.511s
sys 0m18.657s
Runtime in C++:
time ./cmake-build-debug/parseGutenberg -bulk /Users/kubi/NodeJsProgs/data/cache/epub > bulk_cpp
./cmake-build-debug/parseGutenberg -bulk > bulk_cpp 27,24s user 9,69s system 60% cpu 1:00,97 total
Later I installed Node.js fresh from Homebrew (after switching to csh as my standard shell it was no longer available there). That gave me Node.js 14.5, which is much faster:
% node
Welcome to Node.js v14.5.0.
% time node rdf-to-bulk.js /Users/kubi/NodeJsProgs/data/cache/epub >bulk_node.V14.5.0
node rdf-to-bulk.js /Users/kubi/NodeJsProgs/data/cache/epub > 94,72s user 21,25s system 116% cpu 1:39,59 total
The runtime varied by about ±3 seconds when I ran the programs several times within the last two hours. I have MySQL, Jenkins, Docker and several other servers running on my MacBook Pro all the time, but in top only MySQL shows up from time to time with a CPU usage of 0.2% (the others may hide under kernel_task at around 2.6%), and the load average was between 1.3 and 1.79.
With the newest Node.js environment the run times of the C++ and the Node.js versions differ by only about 60% (1:00 vs. 1:40), while you need about 9 times more lines of code in C++. The line counts are not really comparable, though, because the CMarkup implementation works differently from a CSS selector approach. More evaluation is needed. For example, I did not switch on the optimization flag for the C++ compilation. I would also have to use a CSS selector approach in C++ and reduce the code to a minimalistic version without logs and error handling, but then the line comparison becomes unrealistic. The code in the Node.js book has this minimalistic style in order to concentrate on the important learning units. I will add the C++ project to my kubimtk GitHub account in the near future.
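The ratios can be recomputed from the wall-clock times printed by time in the runs above. The following is only a small sketch of that arithmetic; the constant and function names are made up for this example, and the line counts are the rounded 300 and 33 from the list above.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Wall-clock times in seconds, copied from the `time` output of the three runs.
constexpr double kNode8Seconds  = 2 * 60 + 21.932;  // "real 2m21.932s"
constexpr double kNode14Seconds = 1 * 60 + 39.59;   // "1:39,59 total"
constexpr double kCppSeconds    = 1 * 60 + 0.97;    // "1:00,97 total"

// How many times larger `a` is than `b`.
double ratio(double a, double b) {
    assert(b > 0.0);
    return a / b;
}

// One-line summary of the runtime and code size ratios.
std::string summary() {
    char buf[160];
    std::snprintf(buf, sizeof buf,
                  "Node.js 8 / C++: %.2fx, Node.js 14.5 / C++: %.2fx, "
                  "lines C++ / Node.js: %.1fx",
                  ratio(kNode8Seconds, kCppSeconds),
                  ratio(kNode14Seconds, kCppSeconds),
                  ratio(300.0, 33.0));
    return buf;
}
```

Printing summary() gives a factor of roughly 2.3 for Node.js 8 and roughly 1.6 for Node.js 14.5 against the C++ debug build, and roughly 9 for the lines of code.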