"...git@developer.sourcefind.cn:chenpangpang/open-webui.git" did not exist on "a128dfa06b39cc62f63ba138902823786a74993c"
Unverified commit 60005f46, authored by jeonsworld, committed by GitHub

Update pregenerate_training_data.py

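randint includes its upper bound, so it can return rand_end. When it does, searchsorted returns a sampled_doc_index equal to current_idx, meaning the current document is sampled as its own "random" document. Passing rand_end-1 as the upper bound prevents this.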

example:
cumsum_max = {int64} 30
doc_cumsum = {ndarray} [ 5  7 11 19 30]
doc_lengths = {list} <class 'list'>: [5, 2, 4, 8, 11]
With current_idx = 1:
rand_start = 7
rand_end = 35
sentence_index = randint(7, 35) % cumsum_max
If randint returns 35, sentence_index becomes 35 % 30 = 5.
With sentence_index = 5, np.searchsorted returns 1, which equals current_idx.
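
The example above can be checked exhaustively rather than by random draws. The following is a minimal sketch, not the code from pregenerate_training_data.py: it reuses the values from the example, and reachable_docs with its end_offset parameter is a hypothetical helper introduced only for illustration. It assumes randint is Python's random.randint, which includes both endpoints.

```python
import numpy as np

# Values taken from the example in the commit message.
doc_lengths = [5, 2, 4, 8, 11]
doc_cumsum = np.cumsum(doc_lengths)   # [ 5  7 11 19 30]
cumsum_max = int(doc_cumsum[-1])      # 30


def reachable_docs(current_idx, end_offset):
    """Hypothetical helper: every doc index that some randint draw can select.

    end_offset=0 models randint(rand_start, rand_end),
    end_offset=-1 models randint(rand_start, rand_end - 1).
    """
    rand_start = int(doc_cumsum[current_idx])
    rand_end = rand_start + cumsum_max - doc_lengths[current_idx]
    upper = rand_end + end_offset
    # random.randint(a, b) can return b, so enumerate a..b inclusive.
    draws = np.arange(rand_start, upper + 1) % cumsum_max
    return set(np.searchsorted(doc_cumsum, draws, side='right').tolist())


before = reachable_docs(1, 0)
after = reachable_docs(1, -1)
print(sorted(before))  # [0, 1, 2, 3, 4]  (document 1 can sample itself)
print(sorted(after))   # [0, 2, 3, 4]
```

For these lengths, the same check with end_offset=-1 excludes current_idx for every document, not only for current_idx = 1.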
parent ec5c1d61
@@ -49,7 +49,7 @@ class DocumentDatabase:
                 self._precalculate_doc_weights()
             rand_start = self.doc_cumsum[current_idx]
             rand_end = rand_start + self.cumsum_max - self.doc_lengths[current_idx]
-            sentence_index = randint(rand_start, rand_end) % self.cumsum_max
+            sentence_index = randint(rand_start, rand_end-1) % self.cumsum_max
             sampled_doc_index = np.searchsorted(self.doc_cumsum, sentence_index, side='right')
         else:
             # If we don't use sentence weighting, then every doc has an equal chance to be chosen
......