this post was submitted on 06 Sep 2023
3 points (100.0% liked)

Hacker News

3871 readers

This community serves to share top posts on Hacker News with the wider fediverse.

Rules

  0. Keep it legal

  1. Keep it civil and SFW
  2. Keep it safe for members of marginalised groups

founded 1 year ago

There is a discussion on Hacker News, but feel free to comment here as well.

top 6 comments
[–] lvxferre@lemmy.ml 2 points 1 year ago* (last edited 1 year ago)

I like the idea. I don't like the specific implementation.

He overengineered the problem, wanted a magic solution, and predictably didn't find one.

OK this is getting tedious and I'll never get it all. My next attempt was to switch to fuzzier string matching via Levenshtein distance. We can compare how closely the address strings match. And if they're lexicographically close, we can assume they match

No, you can't assume. Hell breaks loose when you pretend to know what you don't, i.e. when you assume.

Whoa! I simply told the LLM the logic I wanted. It took fewer than 5 minutes to write and got 90%+ accuracy against our test suite on the first try!

90% accuracy can be great or awful depending on your goals, but at no point does he mention the scale of the problem, or how costly false positives/negatives would be.

I replaced 50 lines of code with a single LLM prompt

That's fucking dumb. Use both.

Here's what I think would be a better approach, if accuracy is a concern.

Conceptually (inside your head!), split all pairs of addresses into four categories:

  • dunno - your program didn't test them yet.
  • same - your program tested them and determined them to be the same address.
  • different - your program tested them and determined them to be different addresses.
  • shit - your program tested them and gave up.

All pairs start in the "dunno" category. The job of the program is to accurately move as many of them as possible to the categories "same" and "different", and as few of them as possible to the "shit" category.
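The bookkeeping for that is nothing fancy; a minimal Python sketch of the four categories:

```python
from enum import Enum

class Verdict(Enum):
    DUNNO = "dunno"          # not tested yet
    SAME = "same"            # tested: same address
    DIFFERENT = "different"  # tested: different addresses
    SHIT = "shit"            # tested: gave up
```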

Based on that, here's what I would do (rough code sketch after the list).

  1. [Sanitisation start] Unless the program ignores case, convert everything to lowercase.
  2. Replace common abbreviations with the respective words, or vice versa. Common ones only; don't try to catch 'em all. Stick to St, Ter, Cir, Way, Pl, Blvd. If any of those strings is followed by a dot, remove the dot; and if it is not followed by a comma, add one. (Yes, you'll need something a bit more complex for Saint vs. Street, but that's fine.)
  3. Check if the address is properly formatted. It should contain a number, 1+ words, a comma, 1+ words, a comma, one 2-letter word, and a number. If it is, go to 5. If it is not, go to 4.
  4. Add some low-hanging approaches to safely fix the address. But don't go overboard; if you still can't fix it, move the pair of addresses that includes that address to the "shit" category. [Sanitisation ends]
  5. If the string between the first and second commas (the city) is different, or if the string before the first comma starts with a different number, then move the pair to the "different" category.
  6. Else, if the whole strings are identical, then move the pair to the "same" category.
  7. Else, move the pair to the "shit" category.
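Roughly like this, in Python. The abbreviation table, the format regex, and the single "low-hanging fix" in step 4 are illustrative placeholders, not a real sanitiser; Saint vs. Street and the comma handling are left out for brevity.

```python
import re

ABBREV = {"st": "street", "ter": "terrace", "cir": "circle",
          "pl": "place", "blvd": "boulevard"}

# number, 1+ words, comma, 1+ words, comma, one 2-letter word, number
FORMAT = re.compile(r"^\d+ [a-z ]+, [a-z ]+, [a-z]{2} \d+$")

def sanitise(addr):
    addr = addr.lower().strip()                            # step 1
    addr = re.sub(r"\b(st|ter|cir|pl|blvd)\b\.?",
                  lambda m: ABBREV[m.group(1)], addr)      # step 2
    if not FORMAT.match(addr):                             # step 3
        addr = re.sub(r"\s+", " ", addr)                   # step 4: one cheap fix
    return addr if FORMAT.match(addr) else None            # still broken: give up

def compare(a, b):
    a, b = sanitise(a), sanitise(b)
    if a is None or b is None:
        return "shit"                                      # step 4: gave up
    if a.split(",")[1].strip() != b.split(",")[1].strip(): # step 5: city differs
        return "different"
    if a.split()[0] != b.split()[0]:                       # step 5: number differs
        return "different"
    return "same" if a == b else "shit"                    # steps 6 and 7
```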

Now run the program on a sizeable number of address pairs, and check how many of them ended up in the "shit" category. Now use your judgment:

  • Is there some underlying pattern among a lot of those pairs in the "shit" category? If yes, can I easily fix step #4 to address them?
  • Based on the scale of my project, is it fine to manually review those pairs?

Now let's say that you already fixed what you could reasonably fix, and manual review is out of the question. Now plug in the chatbot.

Why am I suggesting that? Because the chatbot will sometimes output garbage, even for pairs that a simple routine would be able to accurately label "same" or "different". So by using both, you're increasing the accuracy of the whole testing routine. "90%+" might look like "wow such good very accuracy", but it's still one error in every 10 pairs, and that's a fucking lot.
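The plumbing for that is trivial. A sketch, where llm_same_address is a made-up name standing in for whatever prompt wrapper you'd actually use:

```python
def llm_same_address(a, b):
    """Hypothetical stand-in for an LLM prompt wrapper; not a real API."""
    raise NotImplementedError

def classify(a, b):
    verdict = compare(a, b)   # the cheap deterministic routine above
    if verdict != "shit":
        return verdict        # the rules already settled it; no LLM call
    # Only the leftovers hit the slower, fallible chatbot; its verdicts
    # are still worth spot-checking.
    return "same" if llm_same_address(a, b) else "different"
```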

And that exemplifies better how you're supposed to use LLMs. (Or text generators in general.) You should see them as yet another tool at your disposal, not as a replacement for your current tools.

[–] superfes@beehaw.org 1 points 1 year ago (1 children)

Levenshtein distance would work here without feeding an AI... this seems silly.

[–] lvxferre@lemmy.ml 2 points 1 year ago* (last edited 1 year ago) (1 children)

He tried it, in a rather dumb way, comparing whole strings; e.g. "123 Main St, Brooklyn, NY 11217" vs. "124 Main St, Brooklyn, NY 11217".

It's silly because his whole approach to the problem was assumptive. It's fine to say "I don't know", or to code a program that does so. And yet he's trying to dichotomise the program's output into "same" vs. "different".

[–] superfes@beehaw.org 2 points 1 year ago (1 children)

I've never done Levenshtein on numbers; it seems like a silly thing to do.

Somehow I had skipped over that part of the text, thanks.

[–] lvxferre@lemmy.ml 1 points 1 year ago* (last edited 1 year ago)

Yup - it's stupid. The catch is that the article is yet another example of people hyping generative bots and trying to "sell" the idea as the solution for everything and then some; and one way to do that is to make the alternatives look worse than they are, for example by using the other tools at your disposal incorrectly.

Even then I wouldn't use fuzzy string matching here; it's bound to introduce more false positives than it's worth, such as "Ant Street" and "Aunt Street" matching (Levenshtein distance = 1). In those cases it's simply better to say "dunno".
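Easy to sanity-check with a plain dynamic-programming edit distance (quick Python sketch, no libraries needed); it also shows why the article's whole-string comparison falls over on the 123 vs. 124 example upthread:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("ant street", "aunt street"))          # 1, yet different streets
print(levenshtein("123 main st, brooklyn, ny 11217",
                  "124 main st, brooklyn, ny 11217"))    # also 1, different address
```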

[–] thesmokingman@programming.dev 1 points 1 year ago

Talk about throwing KISS out the window. There are much simpler APIs (some even free, I think) that solve this exact problem. Did the dev not do any discovery on the problem first?