I have set my post languages to Persian and English and deselected Undetermined.

But I still get posts in other languages, and when I open the settings again I see that the Undetermined option is selected again. It seems that I can’t disable it. Any idea why?

  • رضا@lemmy.world (OP) · English · ↑ 2 · 14 days ago

    For the life of me I can’t get it to stay unselected. Try it for yourself: on the web page, open Settings, deselect Undetermined, select a language (for example English), and press Save at the bottom. Come back to the settings page and Undetermined is selected again. English and any other languages you selected are saved correctly, but Undetermined can’t be disabled.

    It pollutes my feed with German, French, and other languages that I don’t understand.

    • tal@lemmy.today · English · ↑ 3 · edited · 13 days ago

      Honestly, it might be better to change how the feature works: instead of humans selecting the language, have either the instance or the client try to infer it and do the filtering there. A huge amount of the content that I want to see doesn’t have the language explicitly marked. Heck, the comment I responded to isn’t marked as English.

      There’s some Linux utility or library that does statistical guessing of language based on characters seen. Probably also more sophisticated stuff out there. Lemme see if I can dig it up.

      hunts around a bit

      Well, this isn’t it, but here’s a Python module. On Debian trixie:

      $ sudo apt install python3-venv
      $ mkdir langtest
      $ cd langtest
      $ python3 -m venv venv
      $ . venv/bin/activate
      $ pip install langdetect
      $ python -q
      >>> import langdetect
      >>> langdetect.detect_langs('رضا')
      [ar:0.9999953370247615]
      

      So it’d be 99.999% confident that your username is Arabic. Something like PieFed or Lemmy or a client could make use of that. Maybe add some heuristics that default to the language of the parent comment or post, or to the community’s usual language, since very short comment texts might be unclear or ambiguous.
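
      A rough sketch of that fallback idea, not anything Lemmy or PieFed actually does. The `infer_language` helper, the `parent_lang` argument, and both thresholds are made up for illustration. Only `langdetect.detect_langs`, `DetectorFactory.seed`, and `LangDetectException` come from the langdetect package. The detector is passed in as a function so the logic can be tried without langdetect installed.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical thresholds, picked only for illustration.
MIN_CHARS = 20         # shorter texts are treated as too ambiguous to trust
MIN_CONFIDENCE = 0.90  # below this, fall back to the parent's language

Guess = Tuple[str, float]
Detector = Callable[[str], List[Guess]]


def langdetect_detector(text: str) -> List[Guess]:
    """Adapter around langdetect.detect_langs (needs `pip install langdetect`)."""
    # Imported lazily so the rest of the sketch works without the package.
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded
    try:
        return [(g.lang, g.prob) for g in detect_langs(text)]
    except LangDetectException:
        # Raised when the text has no usable features (e.g. only digits).
        return []


def infer_language(
    text: str,
    parent_lang: Optional[str] = None,
    detect: Detector = langdetect_detector,
) -> Optional[str]:
    """Guess a language code, falling back to the parent's language."""
    text = text.strip()
    if len(text) >= MIN_CHARS:
        guesses = detect(text)
        if guesses:
            lang, prob = max(guesses, key=lambda g: g[1])
            if prob >= MIN_CONFIDENCE:
                return lang
    # Too short, nothing detected, or not confident enough.
    return parent_lang
```

      With langdetect installed, `infer_language(comment_text, parent_lang="en")` would use the real detector. A one-word reply would just inherit `"en"` from its parent.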

      That’s not perfect, because sometimes people will quote stuff in other languages or something like that, but I’d wager that it’d be more accurate than manually-tagged stuff.