Ticket #266 (closed defect: fixed)

Opened 7 years ago

Last modified 6 years ago

Backward search doesn't find underlined text

Reported by: egmont Owned by:
Priority: minor Milestone: 4.7
Component: mc-core Version: 4.6.2
Keywords: Cc:
Blocked By: Blocking:
Branch state: Votes for changeset:

Description

If you view a manual page in mc's built-in viewer, and do a backward search (F7) for some underlined (a.k.a. red) text, it won't match.

Reason: backward search is implemented by reversing both the haystack and the needle and then calling normal (forward) search. Underlined text in man pages are implemented by an underscore, followed by a backspace, followed by the desired letter. mc's icase_search() ignores the character before the backspace to be able to search in man pages. Guess what happens however if you reverse an underscored letter in a manpage...

Obvious consequence: As the Summary shows, backward search doesn't find the text you're looking for.

Funny consequence: If you do a backward search for some underscores, it finds every red text.

Change History

comment:1 Changed 7 years ago by egmont

I was thinking on approaches to fix this issue, as well as these other forthcoming issues:

  • basic utf8 support (included in the utf8 patch)
  • case insensitive search in utf8 (tricky, not included in the current utf8 patch)
  • combination with manpage formats (that is, to be hardcore, have a utf8 manpage with backspace notation for bold/underlined, and do case insensitive forward/backward search in this). Even displaying a utf8 manpage is not implemented in mc + the utf8 patch, but I have a patch for this. However, my patch breaks when it comes to searching highlighted text; searching simple text is case sensitive for accented letters; and probably horribly breaks for backward search (never tried).

Basically there are two approaches for text search. Summarizing them:

  • The "always fast" one: Build up a state machine first, based on the needle string. Then just go through the whole input once.
  • The simple one: See whether it matches from the given offset. If not, try from the next offset.

Let's see these in detail.

The "always fast" approach, with the state machine:

The advantage is: no matter what the needle is and how large it is, the file has to be processed only once, constant amount of memory is needed and runtime is always linear to the length of the file.

Hard to implement. By each and every newly added feature it becomes even more complicated. Man page format, utf8, case insensitiveness, backwards option, and all combinations thereof make it extremely complicated.

As long as the only "extra" option was utf8, backward could be implemented by reversing both haystack and needle. However, as soon as either manpage format, or utf8 case insensitive search is a requirement, this no longer works, since the state machine would have to be different anyway for the reversed string. And there's no point in reversing the whole haystack and needle in either cases: it's way easier and faster to simply walk downwards in memory.

The simple solution:

The fear of the simple solution could be that under some circumstances it might be slow. E.g. if both the needle and haystack are a series of letter A's then it will require H*N time (H and N stand for haystack and needle size). This is worst case in algorithmical sense, but not a typical use case.

Note that with typical user-entered search strings, the run-time is still linear (to the filesize), does not grow as you enter longer string. Even if the needle is a random string, the average runtime is linear.

So in real life usage I think that this algorithm is absolutely perfect for us. If we were about to write software that was potentially DoS-able (such as a public web-based searching library), this would be an issue. Within mc, where the user types the short query string, this algorithm is perfect.

Okay, let's suppose we have manpage format, utf8, case insensitiveness and all combination of them implemented in forward only mode with this algorithm. And we were about to add backward searching capability. Reversing the haystack and needle raises endless numbers of headaches (different approach for manpage format, different tricks for utf8, no out-of-the-box utf8 library usable anymore etc.) And what would we gain with it? Absolutely nothing! Why?

Because what's a "backward search" from the user's point of view, can simply be implemented as a loop of forward matches. The inner loop, which is a kind of clever strcmp (with features for manpage format, utf8, case insensitiveness and what not) can still go forward in memory. Only one minor thing has to be changed: the outer loop, which tries out the different matching positions, has to go downwards. Absolutely trivial change.

Conclusion:

We don't need any super clever algorithm. The simplest way of implementing search is absolutely perfect.

Reversing strings is a completely wrong idea. We should get rid of it, and simply change the outer loop of _icase_search() to go downwards (and of course adjust parameters and such). (Note that _icase_search() is currently implemented as a single loop, which sometimes resets the loop variable to a previous value, but logically what it does is way easier to imagine as two nested loops.) This would automatically fix the original bug of this ticket.

Once it's done (shouldn't be a hard change), I see two independent issues that would still need to be solved, both are for the utf8 patch only: support case insensitive utf8 search, and support search in utf8 manpage format. Forward mode only, luckily. I might implement these on a random rainy Sunday afternoon...

comment:2 Changed 7 years ago by slavazanko

  • Status changed from new to closed
  • Resolution set to fixed

Fixed in #265

Note: See TracTickets for help on using tickets.