How to efficiently sort Python strings containing floating-point numbers AND a leading + or - sign at multiple and varying locations

Previous solutions focus on strings where numbers are separated from letters by a dash (see "sorting strings containing numbers and letters") or by another consistent delimiter such as '_' (see "python sort strings with leading numbers alphabetically"). In those cases the numbers usually sit at the same position relative to the letters. These are relatively easy lists, such as

l=['101-8', '101-8A', '101-9', '102-1', '103-4', '103-4B', '101-10', '101-11','103-10'] 

or

l=['10_file','11_file','1_file','20_file','21_file','2_file']

I need to sort something like:

listfromhell=['a_+10.9.mrc','a_-10.0.mrc','a_-12.0_b.mrc','az_x_y_+60.13_a.hdf','bc_ab_+15.0_rst.mrc']

The sorting needs to be based on the number that follows the - or + signs (including the signs).

Thus, the correct sorting for the list above would be:

listfromhell=['a_-12.0_b.mrc','a_-10.0.mrc','a_+10.9.mrc','bc_ab_+15.0_rst.mrc','az_x_y_+60.13_a.hdf']

A one-liner similar to what has been previously proposed for easier lists works nicely IF the floating-point number used for the sorting (with its preceding + or - sign) always occurs at the same location. Here "location" means the index of the sorting element in the list obtained by splitting each string at some consistent delimiter.

For example, a list like this:

nicelist=['a_b_-12.0_d.mrc','a_r_+10.9_t_z_y.mrc','c_a_-10.0.mrc','bc_ab_+15.0_rst.mrc','az_x_+60.13_a.mrc']

Would be easily sorted with:

sorted(nicelist, key=lambda s: float(s.split("_")[2].replace('.mrc','')))

because the floating number always occurs at index ‘2’ after splitting each string using the consistent delimiter '_'
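To make the limitation concrete, here is a minimal sketch of the fixed-index approach. The helper name fixed_index_key is hypothetical and only wraps the one-liner above. It sorts nicelist as expected, but fails on listfromhell, where the number is not always at index 2.

```python
nicelist = ['a_b_-12.0_d.mrc', 'a_r_+10.9_t_z_y.mrc', 'c_a_-10.0.mrc',
            'bc_ab_+15.0_rst.mrc', 'az_x_+60.13_a.mrc']
listfromhell = ['a_+10.9.mrc', 'a_-10.0.mrc', 'a_-12.0_b.mrc',
                'az_x_y_+60.13_a.hdf', 'bc_ab_+15.0_rst.mrc']

def fixed_index_key(s):
    # Hypothetical helper: assumes the number sits at index 2 after splitting on '_'.
    return float(s.split('_')[2].replace('.mrc', ''))

# Works because every element of nicelist has its number at index 2.
sorted_nice = sorted(nicelist, key=fixed_index_key)
print(sorted_nice)

# Fails on listfromhell: 'a_+10.9.mrc' has no index 2 after splitting.
try:
    sorted(listfromhell, key=fixed_index_key)
    fixed_index_works = True
except (IndexError, ValueError):
    fixed_index_works = False
print(fixed_index_works)
```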

How can a similarly simple solution be implemented when the index at which the sorting element occurs (2 in nicelist) is not known a priori?

And there are multiple increasingly complex cases to this question, such as when the floating point number occurs at random locations, when there are no consistent delimiters, and when there are confounding '+' and '-' signs at other places in addition to preceding the floating point number, as well as confounding digits that are not part of the floating point number. E.g.,

listfromhellandthensome=['a5-_-12.0b.mrc','a+101.9-.mrc','-a11_-10.0.mrc','b-c_ab_+15.0_rs+t.mrc','a + z_-x_y_+6.10334_a4.mrc']

Basically, the ultimate task would be to find an elegant solution (a one-liner would be amazing) to sort a list of strings in which each element contains a single floating-point number of unknown length and sign (positive or negative). That number can occur at any arbitrary position within the string, with no known consistent delimiters.

Thank you for your ideas!

Answer

You just need to extract the float/int from each string, along with the sign (+ or -) and then pass that extracted part into the float() function and sort.

So the regex I came up with (regex101) is:

(\+|-)\d+(\.\d+)?

So we check that the float/int is preceded by a + or a -, then match as many digits as possible up to the decimal point (.), and then as many decimal digits as possible after it, but only if there is a decimal point. This last part ("only if there is") is achieved simply with a ?, meaning 0 or 1 occurrences.
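As a quick check of what the pattern picks out, the sketch below runs it over the hardest example list from the question. It assumes the pattern is written with its backslashes intact, i.e. (\+|-)\d+(\.\d+)?, and the name SIGNED_NUMBER is only illustrative.

```python
import re

# Assumed pattern: a sign, one or more digits, then an optional decimal part.
SIGNED_NUMBER = re.compile(r'(\+|-)\d+(\.\d+)?')

samples = ['a5-_-12.0b.mrc', 'a+101.9-.mrc', '-a11_-10.0.mrc',
           'b-c_ab_+15.0_rs+t.mrc', 'a + z_-x_y_+6.10334_a4.mrc']

# search() returns the first position where the whole pattern matches, so stray
# signs not followed by a digit and digits without a sign are skipped.
matches = [SIGNED_NUMBER.search(s).group() for s in samples]
print(matches)  # ['-12.0', '+101.9', '-10.0', '+15.0', '+6.10334']
```

Note that float() accepts a leading + as well as a leading -, so each matched string can be converted directly.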

So now to apply this to Python, with your list, l, and having already run import re, you can sort it with this one line:

l.sort(key=lambda s: float(re.search(r'(\+|-)\d+(\.\d+)?', s).group()))

which, for the last example, gives l as:

['a5-_-12.0b.mrc', '-a11_-10.0.mrc', 'a + z_-x_y_+6.10334_a4.mrc', 'b-c_ab_+15.0_rs+t.mrc', 'a+101.9-.mrc']

which I believe to be correct!


And for the listfromhell example, this achieves the expected output of:

['a_-12.0_b.mrc', 'a_-10.0.mrc', 'a_+10.9.mrc', 'bc_ab_+15.0_rst.mrc', 'az_x_y_+60.13_a.hdf']
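One caveat: re.search returns None when a string contains no signed number, so calling .group() on the result raises an AttributeError. If that can happen in your data, a slightly longer key function may be safer. The sketch below is one possible variant, not part of the one-liner above. The name signed_number_key and the choice to sort non-matching strings last are assumptions.

```python
import re

# Same idea as the pattern above, written with a character class and a non-capturing group.
SIGNED_NUMBER = re.compile(r'[+-]\d+(?:\.\d+)?')

def signed_number_key(s):
    match = SIGNED_NUMBER.search(s)
    # Assumption: strings without a signed number are placed at the end.
    return float(match.group()) if match else float('inf')

names = ['a_+10.9.mrc', 'no_number_here.mrc', 'a_-12.0_b.mrc']
ordered = sorted(names, key=signed_number_key)
print(ordered)  # ['a_-12.0_b.mrc', 'a_+10.9.mrc', 'no_number_here.mrc']
```

Using sorted() here returns a new list and leaves the original untouched, whereas l.sort() sorts in place.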
