I have set my post languages to Persian and English and deselected Undetermined.

But I still get posts in other languages, and when I open the settings again I see that the Undetermined option is selected once more. It seems that I can’t disable it. Any idea why?

  • tal@lemmy.today
    English
    13 days ago

    Honestly, it might be better to change the feature from how it works today, where humans select the language type, to do something like having either the instance or client try to infer the language type and do the filtering there. I can tell you that a huge amount of the content that I want to see doesn’t have people explicitly marking the language. Heck, the comment I responded to isn’t marked as English.

    There’s some Linux utility or library that does statistical guessing of language based on characters seen. Probably also more sophisticated stuff out there. Lemme see if I can dig it up.

    hunts around a bit

    Well, this isn’t it, but here’s a Python module. On Debian trixie:

    $ sudo apt install python3-venv
    $ mkdir langtest
    $ cd langtest
    $ python3 -m venv venv
    $ . venv/bin/activate
    $ pip install langdetect
    $ python -q
    >>> import langdetect
    >>> langdetect.detect_langs('رضا')
    [ar:0.9999953370247615]
    

    So it’d be 99.999% confident that your username is Arabic. Something like PieFed or Lemmy or a client could make use of that. It would probably want a few heuristics on top, such as defaulting to the language of the parent comment, the post, or the community’s most common language, since very short comment texts might be unclear or ambiguous.
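
Roughly what I have in mind, as an untested-in-production sketch rather than anything Lemmy or PieFed actually does: trust the detector only when the text is long enough and the top guess is confident, otherwise fall back to the parent's language. The function names `infer_language` and `langdetect_guesses` and the `min_chars` and `min_prob` thresholds are made up for illustration; only `detect_langs`, `DetectorFactory.seed`, and `LangDetectException` come from langdetect itself.

```python
from typing import Callable, List, Optional, Tuple

Guess = Tuple[str, float]  # (language code, probability)


def langdetect_guesses(text: str) -> List[Guess]:
    """Adapter around the third-party langdetect package (imported lazily)."""
    from langdetect import DetectorFactory, detect_langs
    from langdetect.lang_detect_exception import LangDetectException

    # langdetect is randomized; fixing the seed makes results repeatable.
    DetectorFactory.seed = 0
    try:
        return [(g.lang, g.prob) for g in detect_langs(text)]
    except LangDetectException:
        # Raised when the text has no usable features (empty, digits only, ...).
        return []


def infer_language(
    text: str,
    parent_lang: Optional[str] = None,
    detector: Callable[[str], List[Guess]] = langdetect_guesses,
    min_chars: int = 20,   # hypothetical threshold, would need tuning
    min_prob: float = 0.9,  # hypothetical threshold, would need tuning
) -> Optional[str]:
    """Guess a comment's language, falling back to the parent's language."""
    stripped = text.strip()

    # Very short texts are ambiguous, so inherit from the parent if there is one.
    if len(stripped) < min_chars and parent_lang is not None:
        return parent_lang

    guesses = detector(stripped)
    if guesses:
        lang, prob = max(guesses, key=lambda g: g[1])
        if prob >= min_prob:
            return lang

    # Low confidence or no guess at all: fall back (may be None).
    return parent_lang
```

With this, a three-character reply under an English post would just be treated as English, and the detector only gets the final say on longer text. The detector is passed in as a parameter so a server or client could swap in something better than langdetect without touching the fallback logic.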

    That’s not perfect, because sometimes people will quote stuff in other languages or something like that, but I’d wager that it’d be more accurate than manually-tagged stuff.