
Searching in text/Unicode(?) files is way too slow for software development projects.

Posted: Sat Oct 12, 2024 2:42 pm
by Skybuck
Example:

Try downloading this GitHub repo:

https://github.com/vitelabs/vite-wallet

Also make sure to download the embedded repo:

https://github.com/vitelabs/vite-web-wallet

As indicated by the instructions.

(Optionally, try building and running it on Windows; there's some kind of "promises" issue.)

Searching 70,000 files for "promises" in "*.js" takes too long in my opinion: 360,000 milliseconds, basically 360 seconds, the last time I tried it, which was a few minutes ago, on a SanDisk Extreme portable SSD that I bought for 200 euros believing it would speed up my search.
(To get to 70,000 files it may be necessary to run npm install first, to install all the dependency files.)

I could have tried an internal SSD or RAM, which in hindsight I should have done first.

It turned out the hard disk is even faster than this external SSD, maybe because the source code was slightly updated and now contains more source, but still.

My preliminary conclusion for now is that it's a single-core bottleneck... Unicode searching in general is already slow, though I'm not sure if these source files are stored in Unicode/UTF-8 format. The SQL Server guys have also noticed this: ASCII vs. Unicode can be 8 times faster.

Anyway, I notice TextPad is only using one core.

So my enhancement suggestion is: implement multi-threaded searching.
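
To make the suggestion concrete, here is a minimal sketch in Python of what I mean (purely illustrative; this is not how TextPad works internally, and the "promises"/*.js values are just my example from above). It splits the file list over a pool of worker processes so the matching itself runs on all cores:

# Sketch of a multi-core "find in files": one worker pool, files handed
# out in batches. Assumes UTF-8-ish text; undecodable bytes are replaced.
import glob
from concurrent.futures import ProcessPoolExecutor

def search_file(path, needle="promises"):
    hits = []
    try:
        with open(path, encoding="utf-8", errors="replace") as f:
            for number, line in enumerate(f, start=1):
                if needle in line:
                    hits.append((path, number, line.rstrip()))
    except OSError:
        pass  # unreadable file: skip it
    return hits

if __name__ == "__main__":  # required for process pools on Windows
    files = glob.glob("**/*.js", recursive=True)
    # Processes rather than threads, so the matching is truly parallel.
    with ProcessPoolExecutor() as pool:
        for hits in pool.map(search_file, files, chunksize=256):
            for path, number, line in hits:
                print(f"{path}({number}): {line}")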

This vitelabs code base is a nice example of a real-world performance bottleneck, and it sucks.

Bye for now,
Skybuck.

Re: Searching in text/Unicode(?) files is way too slow for software development projects.

Posted: Sun Oct 13, 2024 12:22 pm
by AmigoJack
That implies that disk access could be parallelized, too - which is either physically not possible, or limited by the OS's API, which needs file objects (including folders) to be enumerated one by one. Judging by your previous posts, you're quick with false assumptions.

Which other program can search your 70k files faster while also supporting/recognizing multiple text encodings? And which achieves this with a multi-core approach? Have the "SQL Server guys" also told you that UTF-32 would be as fast as ASCII?

Re: Searching in text/Unicode(?) files is way too slow for software development projects.

Posted: Sun Oct 13, 2024 12:50 pm
by Skybuck
YouTube video playlist of unpacking, connecting and using this device:

https://youtube.com/playlist?list=PL0HG ... 3PXDGmG8fy

^ Showing the CPU bottleneck: only one core in use.

Well, you're quick to blame the user.

As a matter of fact, I did try another search program: findstr, a utility that comes with Windows (I found it while googling for a better search program).

This findstr found files much faster... however, it does not produce nice output like TextPad does; TextPad shows the matching line of text, which gives an idea of what kind of line it is in.

It is unlikely that UTF-32 would be as fast as ASCII, simply because it would consume four times the amount of L1 data cache space and thus make worse use of the L1 data cache. However, they did mention that UTF-32 brings back some performance.
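
The footprint difference is easy to check, for what it's worth (a quick Python illustration, using the search term from above; pure ASCII text, no BOMs):

s = "promises"
print(len(s.encode("ascii")))      # 8 bytes
print(len(s.encode("utf-16-le")))  # 16 bytes
print(len(s.encode("utf-32-le")))  # 32 bytes, four times the ASCII size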

Also, I am not yet sure what kind of encoding is used; the bottleneck seems to be the CPU/one core, so that is a starting point for improving performance.

I would assume Windows 11 allows opening multiple files and acquiring folder information quite fast... I would be surprised if multi-threading did not improve upon it...

I tried to answer some of your questions, but this is "research" in progress, and I hope to work it out with TextPad... for now a lazy approach is taken, to slowly get to the bottom of this! At least thanks for listening.

Re: Searching in text/Unicode(?) files is way too slow for software development projects.

Posted: Sun Oct 13, 2024 5:23 pm
by AmigoJack
FINDSTR is at least 25 years old and has its bugs and weird behavior. Most of all, it won't support anything with a NUL byte, which rules out UTF-16, UTF-32 and others, while TextPad can handle them. Its regex engine can't be trusted either. It may be faster, but it's much more unreliable - that's your choice to make. If you check its parameters, you may want to use /N to print a line number with each match. It's made for the command line and is a nice tool to have there, though.
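
For example, a recursive, case-insensitive literal search with line numbers would look like this (all documented FINDSTR switches: /S recurses into subfolders, /I ignores case, /N prints line numbers, /C: takes a literal search string):

findstr /S /I /N /C:"promises" *.js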

On your 64-bit platform, memory is read in blocks of 8 bytes - that means two UTF-32 code points at once. Likewise, on 32-bit platforms 4 bytes are read at once (a DWORD). No slowdown here. And TextPad isn't so naive as to read e.g. a 12 GiB text file fully into memory before searching in it, so the amount of memory consumed is no critical point either (try that with FINDSTR). UTF-16 and UTF-8 have to be parsed, since the number of bytes per code point varies - it's natural that this needs more processing (time).
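
To illustrate that parsing cost (a rough sketch of the branching any variable-width decoder needs, not TextPad's actual code): with UTF-8 the lead byte alone decides how many bytes belong to the code point, so a scanner cannot just advance by a fixed stride the way it can with UTF-32:

# Length of a UTF-8 sequence, determined from its lead byte.
def utf8_sequence_length(lead: int) -> int:
    if lead < 0x80:   # 0xxxxxxx: ASCII, 1 byte
        return 1
    if lead < 0xC0:   # 10xxxxxx: continuation byte, invalid as lead
        raise ValueError("continuation byte")
    if lead < 0xE0:   # 110xxxxx: 2-byte sequence
        return 2
    if lead < 0xF0:   # 1110xxxx: 3-byte sequence
        return 3
    return 4          # 11110xxx: 4-byte sequence

print(utf8_sequence_length("A".encode("utf-8")[0]))  # 1
print(utf8_sequence_length("€".encode("utf-8")[0]))  # 3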

Windows 11 does not "open" multiple files in parallel (nor did any older version) - it's still a preemptive kernel which, at some point, has to do things in a row. You just get the impression of "multiple", but if we hooked that process to insert delays into those actions, you would see all those files "open" one after another, not at once.