92b7e0fd66
...as seen during a --with-lang=ALL build with ASan on Linux:

> [XHC] nlpsolver ja
> =================================================================
> ==51396==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62100000ed00 at pc 0x7fe425640f53 bp 0x7ffd6a0cc900 sp 0x7ffd6a0cc8f8
> READ of size 4 at 0x62100000ed00 thread T0
> #0 in lucene::analysis::cjk::CJKTokenizer::next(lucene::analysis::Token*) at workdir/UnpackedTarball/clucene/src/contribs-lib/CLucene/analysis/cjk/CJKAnalyzer.cpp:70:19
> #1 in lucene::index::DocumentsWriter::ThreadState::FieldData::invertField(lucene::document::Field*, lucene::analysis::Analyzer*, int) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriterThreadState.cpp:901:32
> #2 in lucene::index::DocumentsWriter::ThreadState::FieldData::processField(lucene::analysis::Analyzer*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriterThreadState.cpp:798:9
> #3 in lucene::index::DocumentsWriter::ThreadState::processDocument(lucene::analysis::Analyzer*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriterThreadState.cpp:557:24
> #4 in lucene::index::DocumentsWriter::updateDocument(lucene::document::Document*, lucene::analysis::Analyzer*, lucene::index::Term*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriter.cpp:946:16
> #5 in lucene::index::DocumentsWriter::addDocument(lucene::document::Document*, lucene::analysis::Analyzer*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/DocumentsWriter.cpp:930:10
> #6 in lucene::index::IndexWriter::addDocument(lucene::document::Document*, lucene::analysis::Analyzer*) at workdir/UnpackedTarball/clucene/src/core/CLucene/index/IndexWriter.cpp:681:28
> #7 in HelpIndexer::indexDocuments() at helpcompiler/source/HelpIndexer.cxx:66:20
> #8 in main at helpcompiler/source/HelpIndexer_main.cxx:79:22
> 0x62100000ed00 is located 0 bytes to the right of 4096-byte region [0x62100000dd00,0x62100000ed00)
> allocated by thread T0 here:
> #0 in realloc at /data/sbergman/github.com/llvm/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3
> #1 in lucene::util::StreamBuffer<wchar_t>::setSize(int) at workdir/UnpackedTarball/clucene/src/core/CLucene/util/_streambuffer.h:114:17
> #2 in lucene::util::StreamBuffer<wchar_t>::makeSpace(int) at workdir/UnpackedTarball/clucene/src/core/CLucene/util/_streambuffer.h:150:5
> #3 in lucene::util::BufferedStreamImpl<wchar_t>::setMinBufSize(int) at workdir/UnpackedTarball/clucene/src/core/CLucene/util/_bufferedstream.h:69:16
> #4 in lucene::util::SimpleInputStreamReader::Internal::JStreamsBuffer::JStreamsBuffer(lucene::util::CLStream<signed char>*, int) at workdir/UnpackedTarball/clucene/src/core/CLucene/util/Reader.cpp:375:6

Note that this is not a proper fix, which would need to properly detect surrogate pairs split across buffer boundaries. But for one, the comment says "however, gunichartables doesn't seem to classify any of the surrogates as alpha, so they are skipped anyway", and for another, the behavior until now was to replace the high surrogate with something that was likely garbage and leave the low surrogate at the start of the next buffer (if any) alone, so leaving both surrogates alone is likely at least no worse behavior.

Change-Id: Ib6f6f1bc20ef8efe0418bf2e715783c8555068de
Reviewed-on: https://gerrit.libreoffice.org/c/core/+/92792
Tested-by: Jenkins
Reviewed-by: Stephan Bergmann <sbergman@redhat.com>
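
As an illustration of the bounds check described in the note above, here is a minimal, self-contained sketch. It is not the CLucene code: `nextCodePoint` is a hypothetical helper, and its parameters `buf`, `len`, and `index` only loosely mirror `ioBuffer`, `dataLen`, and `bufferIndex`. The sketch also assumes UTF-16 code units in `char16_t` and tests specifically for a high surrogate (0xD800 to 0xDBFF), whereas the patched line keeps the original `c >= 0xd800 || c <= 0xdfff` test, which holds for every value of `c`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of CLucene): reads one code point from a
// buffer of UTF-16 code units and advances `index` past what it consumed.
// Precondition: index < len.
std::uint32_t nextCodePoint(const char16_t* buf, std::size_t len, std::size_t& index) {
    assert(index < len);
    std::uint32_t c = buf[index++];
    // Look ahead for a low surrogate only if c is a high surrogate AND
    // another code unit is actually available in this buffer. Without the
    // `index != len` check, the read below would go one element past the end.
    if (c >= 0xD800 && c <= 0xDBFF && index != len) {
        std::uint32_t c2 = buf[index];
        if (c2 >= 0xDC00 && c2 <= 0xDFFF) {
            ++index;
            // Standard UTF-16 decoding of a surrogate pair.
            c = 0x10000 + ((c - 0xD800) << 10) + (c2 - 0xDC00);
        }
    }
    // A high surrogate at the very end of the buffer is returned unchanged.
    return c;
}
```

With this shape, a high surrogate that happens to be the last unit in the buffer is passed through untouched rather than combined with whatever lies beyond the allocation, which matches the "leaving both surrogates alone" behavior the note describes. Detecting a pair split across two buffer fills would still need extra state carried between refills, which this sketch does not attempt.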
11 lines
555 B
Diff
--- src/contribs-lib/CLucene/analysis/cjk/CJKAnalyzer.cpp
+++ src/contribs-lib/CLucene/analysis/cjk/CJKAnalyzer.cpp
@@ -66,7 +66,7 @@
//ucs4(c variable). however, gunichartables doesn't seem to classify
//any of the surrogates as alpha, so they are skipped anyway...
//so for now we just convert to ucs4 so that we dont corrupt the input.
- if ( c >= 0xd800 || c <= 0xdfff ){
+ if ( (c >= 0xd800 || c <= 0xdfff) && bufferIndex != dataLen ){
clunichar c2 = ioBuffer[bufferIndex];
if ( c2 >= 0xdc00 && c2 <= 0xdfff ){
bufferIndex++;