Stable Sorting Algorithm Needed.

General questions about using TextPad

Moderators: AmigoJack, bbadmin, helios, Bob Hansen, MudGuard

Post Reply
Lostgallifreyan
Posts: 25
Joined: Mon Feb 14, 2005 10:51 am

Stable Sorting Algorithm Needed.

Post by Lostgallifreyan »

Please could some certainty be added to this sorting, so it is unconditionally stable?

I'm new here, but I have searched.. :)
Used terms: sorting, stable, unstable, algorithm... in combination and singly, with few results that help beyond some confirmation in one post that someone else has seen what I saw.

Here's a basic test to show it:

Code: Select all

000      Test
001      Test
002      Test
003      Test
004      Test
005      Test
006      Test
007      Test
008      Test
009      Test
010      Test

If you block select the word Test in lines 3 through 7, the lines will be rotated so that the last selected line becomes first, with the rest staying in order but shifting. Reverse sort order reverses the rotation. Selecting lines and using a key from 10, 4 chars length, does the same, and switching numeric, case, char code switches and such does not usually make a difference.

I say usually, because in one case I saw it act stably, refusing to change any order during one test, but I could not figure out why it appeared to work correctly. There's only a small set of options to try, yet I could not replicate this. That seems to me to confirm a possible instability, but maybe I missed some trick that would guarantee stable sorting in the absence of difference in selected block or keyed column.

All help welcome, I need an answer for this.

And one offtopic request, made repeatedly over many years. BINARY EDIT! :twisted:
Please. I renew the request. This is the biggest achilles heel in TextPad, that awful business of look and don't touch, when it comes to binaries..



Edit: Just tried UltraEdit, using keyed column ascending sort on the word Test in lines 3-7 as before. Some consolation for Helios here:

Code: Select all

001      Test
002      Test
004      Test
005      Test
003      Test
006      Test
007      Test
008      Test
009      Test
010      Test

It's worse. Instead of 73456, their instability gets 45367. Either way though, this sorting instability is a bane.

More consolation for Helios: UltraEdit would NOT let me use Ctrl Z to undo the sort to try again (and undo was also disabled if I switched to binary edit mode just to see what happened if I typed a new value). Another test inserted a blank line, I have NO idea how or why. :twisted: And it lost the selection field after sorting so I had to reselect... It's crap like that that makes me feel at home with TextPad. It might not be as powerful, on the face of it, but the ease and consistency makes it far more powerful to work with, as it doesn't need so much fiddling around.
Last edited by Lostgallifreyan on Mon Feb 14, 2005 1:46 pm, edited 1 time in total.
User avatar
bbadmin
Site Admin
Posts: 878
Joined: Mon Feb 17, 2003 8:54 pm
Contact:

Post by bbadmin »

TextPad uses the Quicksort algorithm, as implemented by Microsoft in their C runtime library. This turns out to be stable when there are more than 16 (as I recall) records to sort, but that's just an implementation peculliarity. An alternative would have been to use std::stable_sort, from the C++ runtime library, but we chose not to, because it is significantly slower. We'll consider treating 16 records or less as a special case in a future release.

Keith MacDonald
Helios Software Solutions
Lostgallifreyan
Posts: 25
Joined: Mon Feb 14, 2005 10:51 am

Post by Lostgallifreyan »

Thanks for the answer. :) Got to say though, that with over 16 it wasn't stable either.. I'd been sorting a forum archive of 80555 posts, reformatted as single lines, sorting by timestamp. Those few posts that arrived in the same second, would often be transposed. I'd tested this purely because I had reason to suspect an unstable sort in a previous effort of this kind using TextPad. Unless this is fixed, restoring by subject sorting is going to be impossible, as each thread chunk will have some very messed up post ordering.


Point taken about speed, but this was so fast (Even with 80555 lines on a 500 MHz machine), that a slower speed is not going to be intolerable even at ten to 50 times slower, and the stability gain would more than offset this.

I'm going to solve this by using something I wrote in Lua, for a log sorting with an array system I can adapt to keyed column sorting, with a fast stable sort routine I found and tested. I'm pretty sure that writing your own function will be an ideal answer, as stability above all is important. It's worth the wait, and I guess a switch to choose slow and stable over fast for general use will please everyone.


Edit:
Not sure if this will help with any coding you do but here's the basis of a Lua routine for a stable sort:
(Adapted originally from a JavaScript version I found).

Code: Select all

for X=1,N-1 do
    for Y=1,N-X do
        if LIST[Y]>LIST[Y+1] then  LIST[Y],LIST[Y+1]=LIST[Y+1],LIST[Y]  end
        -- comment: if substring value is more than that in next line, swap lines.
    end
end
With time proportional to N*N/2 where N=number of lines, it's going to chug a bit with tens of thousands, but a major task that need be done only once is reason to want this ability.

Lua is built in ANSI C, and even writing this function rather than relying on a Lua existing function, this is fast. Writing it in C++ might be similarly fast, but I don't know, as I don't use it..
User avatar
Drxenos
Posts: 209
Joined: Mon Jul 07, 2003 8:38 pm

Post by Drxenos »

I'd rather them not use an O(n^2) sort at all. There are quicker stable sorts. merge sort is stable and is O(n log n). It would probably be best for them to add an option to the preferences to enable or disable stable sorting.
Lostgallifreyan
Posts: 25
Joined: Mon Feb 14, 2005 10:51 am

Post by Lostgallifreyan »

Fine by me.
I agree, and already asked here for a switch... My LuaScript swap sort took 4 hours! (The lines that had to be swapped were huge, a forum with many very large posts, so the moving probably took more time than the testing). Still, I was asleep when it did it, and it only had to be done once.

We do need a stable sort though, and I rechecked what I said earlier. The current sort in TextPad is UNStable even when using more than 16 lines. Some sample data from the archive I'm sorting:
http://www.lostgallifreyan.pwp.blueyond ... sorted.zip (466 KB file.)
80555 lines, each with a 5 digit index followed by the timestamps. Copy the unzipped file, sort the copy by timestamps, compare with original... Many instances of equal timestamps will be seen with swapped indexes.
User avatar
MudGuard
Posts: 1295
Joined: Sun Mar 02, 2003 10:15 pm
Location: Munich, Germany
Contact:

Post by MudGuard »

So you want to sort by columns 10 to 13 (that's where "test" is).

Maybe you could use
columns 10 to 13 as first key for the sort and
columns 0 to 9 as second key for the sort?

(but maybe your example is too simple)
Lostgallifreyan
Posts: 25
Joined: Mon Feb 14, 2005 10:51 am

Post by Lostgallifreyan »

If the search algorithm is unstable, that won't cure the problem, because the first sort was by subject (ignoring occurences of 'Re: '), then once each subject chunk was isolated (with a blank line; another reason I chose to script my own fix) I would want the timestamps in right order. If the search was stable, they already would be. Conversely, if it wasn't, there's no guarantee that sorting by timestamps within each chunk could restore it if it was the same process that cause the bother to start with.

Can't help wondering what I did wrong though. Nothing wrong with the concept, using linked lists in seperate tables, And no code in tight loops when it should be outside... but Lua wouldn't take four hours if my code was as tight as it ought to be. Still, it worked. TextPad would sort it within seconds!! But that's a tad beside the point if it's not stable, it would take hours of manual work to fix what TextPad's sort did to it... Got to be something in that difference of timescale that would be viable. And like I said, UltraEdit was worse than TextPad, at least in that first tiny example.

I've read a bit on merge sorting today, and I'm going to set up a test, but given the remaining code having to be as it is, merge sort might still take over half the time (two hours). One year of extremely busy forum archive without post indexes is a big lump of text data to work on..
User avatar
bbadmin
Site Admin
Posts: 878
Joined: Mon Feb 17, 2003 8:54 pm
Contact:

Post by bbadmin »

If you have shell access to a system running Linux, its sort command should have a --stable parameter. I'm sure that will do the job in an acceptable time!

Keith MacDonald
Helios Software Solutions
Last edited by bbadmin on Wed Feb 16, 2005 12:41 pm, edited 1 time in total.
Lostgallifreyan
Posts: 25
Joined: Mon Feb 14, 2005 10:51 am

Post by Lostgallifreyan »

bbadmin wrote:If you have access to a system running Linux, its sort command should have a --stable parameter. I'm sure that will do the job in an acceptable time!

Keith MacDonald
Helios Software Solutions
I wish :)
I never learned beyond Windows (got hardware needs that nothing else supports).. I think there were some DOS ports though, so I'll look into it. The time mine took was acceptable, as a one-off, but I'm hooked on learning more about sorting now, and Lua. But I really hope that TextPad will have this too. And binary edit. It's the speed and strength of this app that makes me so eager to use it as a one-stop solution to so many things.
User avatar
Drxenos
Posts: 209
Joined: Mon Jul 07, 2003 8:38 pm

Post by Drxenos »

MudGuard has a valid point (which I kick myself for not seeing!). Sort stability is only important when the sort keys are identical. If you modify your sort to use the numbers as the first key, and make sure to do a numerical sort, you should get what you want.

The Admin also has a good idea. You may not have access to a Unix or Linux machine, but Unix tools (like sort) are freely available on the Web in pre-compiled versions for Windows. I have "sort", et. al., on my machine.
Lostgallifreyan
Posts: 25
Joined: Mon Feb 14, 2005 10:51 am

Post by Lostgallifreyan »

Like I said, there are several cases were both the subject, and the timestamp were identical. (It was a very busy forum.)
In cases like that, a stable sort is mandatory.. It takes the guessing out of it. There's a lot to think through already, so a stable sort is a fundamental method to rely on.

(I suppose a workround would be to run a temporary index column down the left margin before sorting, as I did for that sample download, but in a huge file it's wise to avoid anything that makes it bigger. Also, that doesn't avoid the time-consuming check and repair if the sort wasn't stable.)

I will look at UNIX ports though, I think I found some once, but rarely found I needed them. I took (slowly) to Lua instead, as it's an excellent way to make custom solutions to stuff..
User avatar
Drxenos
Posts: 209
Joined: Mon Jul 07, 2003 8:38 pm

Post by Drxenos »

Lostgallifreyan wrote:Can't help wondering what I did wrong though. Nothing wrong with the concept, using linked lists in seperate tables, And no code in tight loops when it should be outside... but Lua wouldn't take four hours if my code was as tight as it ought to be. Still, it worked. TextPad would sort it within seconds!! But that's a tad beside the point if it's not stable, it would take hours of manual work to fix what TextPad's sort did to it... Got to be something in that difference of timescale that would be viable. And like I said, UltraEdit was worse than TextPad, at least in that first tiny example.

I've read a bit on merge sorting today, and I'm going to set up a test, but given the remaining code having to be as it is, merge sort might still take over half the time (two hours). One year of extremely busy forum archive without post indexes is a big lump of text data to work on..
I don't think you did anything wrong. Code correctness should come before optimization. The problem is that the sort you are using runs in quadratic time. As your number of things to sort increases, your time to sorts them grows very, very fast.

Merge sort was just an example. It probably is not the best for your needs. Yes is an n*log(n) sort, which--all things being equal--should be faster than an n^2 sort. But merge sort tends to be a memory hog. All those allocations may eat any saving so get from using a "better" alogorithm.

You also need to keep in mind you are using strings as keys, and thus are doing string comparisions. If your comparision function is poorly implemented and does a simple linear compare, you are may actually be running an algorithm bounded by O(n^3). Yuck!

Also, are you sorting a file in memory or on the disk. This too will affect your time and the type of alogrithm you should choose.

I would do some research on sorting algorithms, and run some tests to determine which bests fits your data. You just have to remember that the algorithm and its implementation will have a bigger impact on time that your choice in language and simple optimizations such as how "tight" your loops are. For a large data set, a linear algorithm written in an interpreted language on a slow machine, will blow away a n^3 one written in assembly on a "super computer." We are talking the difference between seconds, and centuries!
tim
Posts: 1
Joined: Thu Feb 17, 2005 4:36 pm

Post by tim »

Ahha, sort routines!

Have to deal with that sort of thing all the time.

Under MVS and z/OS using DFSORT there is the EQUALS option.
This means that records with equal sort keys will appear in the output file in the same order as the input file.
This is in fact the default setting.

When I need to sort small amounts of data using REXX (or CLIST) I tend to use an implementation of the SELECT sort.
This has lots of compares but not many actual data moves.

This is a REXX example I have typed from memory (I'm not at work ATM) so forgive me if I get it wrong:

Code: Select all

/* data. is the array to be sorted */
/* data.0 has the total array size */
do a = 1 to (data.0-1)
  c = a
  do b = (a+1) to data.0
     if data.b < data.c then c = b /* find lowest key */
  end
  If c <> a then do /* move lowest key up */
      temp = data.a
      data.a = data.c
      data.c = temp
    end
end
Now thinking about it, this routine probably isn't stable in the way you need it.
The record sequence number would need to be added to the key to make a stable sort.
Lostgallifreyan
Posts: 25
Joined: Mon Feb 14, 2005 10:51 am

Post by Lostgallifreyan »

ty Drxenos, I like that post. :)
What I'm working on is a Lua thing with merge-sort, and swap-sort once a threshold of sub table size is reached. It's simple, and offers the chance to choose a scalable way to reduce memory use at the expense of slower performance. (And it might also solve the inefficiency that the lower levels of splitting in merge-sort causes, as at some point a simple swap sort will become the more efficient method).

I'll try to take it further, as I think Lua can offer in-place table handling to allow all merge-sort with better use of memory. A lot of work, but I can't think of a better way to learn about sorting, and once done, it will be a stock tool for me. I'll post the result once it's worth doing, in case it helps anyone. Doubtful, as that wheel is well-invented, but it might have something, as most people make these things in their own way too.

Tim, I've been thinking of that point about compares vs data moves too, specially in the merge-sort when it's actually merging. Not sure that it will speed it up much though. Ultimately, the same number of elements must be tested, and the same number of elements moved.. I decided that by testing each first element rather than looking down the list before moving, allowed neater, thus faster code. (I couldn't find a built-in Lua function for bulk element moves anyway.)
Lostgallifreyan
Posts: 25
Joined: Mon Feb 14, 2005 10:51 am

Post by Lostgallifreyan »

Keith, I've decided to mess with sorting stuff outside of this issue now, as my interest is hooked, but I've found the answer to the problem, one I (and others) already mentioned, but I didn't see the significance until my efforts gave me insights into (lack of) ease and (lack of) speed. (Lua was very slow even on merge sort, it took the expected half-time, and I'm not sure if it offers a way to reduce this, data moving was the main bottleneck.)

The answer is this:
Run a temporary index column down the left side, starting from 1 x 10 to the power of the number of digits in the total line count, then sort the column I want as a key, and the index column as a second key.

If TextPad could be coded to do this automatically, hidden, when the 'stable sorting' switch was on, and then strip the indexes before rendering the result, the stable sort would be acheieved, and far faster than any standard stable sort could acheive it, and it can use the existing sort code.

While a macro can be easily written for this one, I think it's so neat and fast an answer that it would be a great hardcoded amendment to TextPad.
Post Reply