office-gobmx/lingucomponent
Mike Kaganski 9c14ec81b6 tdf#164006: Only use original word's positions, ignore extra encoded length
The encoding of the string passed to Hunspell/hyphen service depends on the
encoding of the dictionary itself. When the usual UTF-8 encoding is used,
the resulting octet string may be longer than the original UTF-16 code unit
count. In that case, the buffer receiving the positions will be
correspondingly longer. But on return, the buffer will only contain data
in positions corresponding to the characters, not to the encoded code units
(it is unclear if we even need to pass a buffer that large). So just as the
following loop only iterates up to nWord's length, the calculation of the
hyphen count must use its length, too, not the length of encWord.
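
The length mismatch described above can be illustrated with a minimal, standalone sketch. It is not the code from this commit: the helper names `utf8Length` and `countHyphens` are hypothetical, and the convention that an odd buffer value marks a hyphenation point is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Number of UTF-8 octets needed for a UTF-16 string (assumes well-formed input).
std::size_t utf8Length(const std::u16string& word)
{
    std::size_t octets = 0;
    for (std::size_t i = 0; i < word.size(); ++i)
    {
        const char16_t c = word[i];
        if (c < 0x80)
            octets += 1;
        else if (c < 0x800)
            octets += 2;
        else if (c >= 0xD800 && c <= 0xDBFF)
        {
            octets += 4; // a surrogate pair encodes one code point in 4 octets
            ++i;         // skip the low surrogate
        }
        else
            octets += 3;
    }
    return octets;
}

// Count hyphenation points in the first wordLen slots only.
// Assumption for this sketch: an odd value marks a hyphenation point.
int countHyphens(const std::vector<char>& positions, std::size_t wordLen)
{
    assert(wordLen <= positions.size());
    int count = 0;
    for (std::size_t i = 0; i < wordLen; ++i)
        if (positions[i] & 1)
            ++count;
    return count;
}
```

For a word with non-ASCII letters, `utf8Length(word)` exceeds `word.size()`, so a buffer sized for the encoded word has trailing slots that do not correspond to any character. Passing `word.size()` to `countHyphens` keeps those slots out of the count.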

I suspect that the use of UTF-16 code units as hyphen positions is wrong;
it will break on SMP characters encoded as surrogate pairs. The proper
approach would be to iterate over code points. However, I don't have data
to test with, so let it be TODO/LATER.
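
As a rough illustration of what iterating over code points could look like, here is a hedged sketch that maps a code point index to a UTF-16 code unit offset. The function names are hypothetical and do not come from this repository. Whether the hyphenation service reports positions per code point is the assumption stated in the message above; it is not verified here.

```cpp
#include <cstddef>
#include <string>

bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool isLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// True if a well-formed surrogate pair starts at index i.
bool isPairAt(const std::u16string& s, std::size_t i)
{
    return isHighSurrogate(s[i]) && i + 1 < s.size() && isLowSurrogate(s[i + 1]);
}

// Number of code points in a UTF-16 string; a surrogate pair counts once.
std::size_t codePointCount(const std::u16string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i, ++count)
        if (isPairAt(s, i))
            ++i; // skip the low surrogate
    return count;
}

// UTF-16 code unit offset at which the code point with index cpIndex starts.
// Returns s.size() if cpIndex is past the end.
std::size_t codePointToUnitOffset(const std::u16string& s, std::size_t cpIndex)
{
    std::size_t i = 0;
    while (cpIndex > 0 && i < s.size())
    {
        i += isPairAt(s, i) ? 2 : 1;
        --cpIndex;
    }
    return i;
}
```

With a mapping like this, a position reported in code points could be translated to a code unit offset in the original word before it is used as a hyphen position.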

Change-Id: Ieed5e696e03cb22e3b48fabc14537372bbe74363
Reviewed-on: https://gerrit.libreoffice.org/c/core/+/177077
Reviewed-by: Mike Kaganski <mike.kaganski@collabora.com>
Tested-by: Jenkins
2024-11-23 10:03:41 +01:00
config Load the locales from config file for languagetool 2023-10-22 19:02:06 +02:00
source tdf#164006: Only use original word's positions, ignore extra encoded length 2024-11-23 10:03:41 +01:00
IwyuFilter_lingucomponent.yaml
Library_guesslang.mk
Library_hyphen.mk
Library_LanguageTool.mk languagetool show cURL errors with the interaction handler 2024-06-28 14:24:22 +02:00
Library_lnth.mk
Library_MacOSXSpell.mk
Library_numbertext.mk
Library_spell.mk
Makefile
Module_lingucomponent.mk Fix --disable-curl build 2023-09-14 12:55:33 +02:00
README.md
StaticLibrary_ulingu.mk

Linguistics Components

lingucomponent contains the spell checker, hyphenator, thesaurus, etc.