Update README.md

Commit 44275ae9, authored Apr 13, 2023 by Stella Biderman, committed by GitHub Apr 13, 2023

Showing with 2 additions and 0 deletions

--- a/README.md
+++ b/README.md
@@ -106,6 +106,8 @@ When reporting eval harness results, please also report the version of each task

 ## Test Set Decontamination

+To address concerns about train / test contamination, we provide utilities for comparing results on a benchmark using only the data points not found in the model training set. Unfortunately, outside of models trained on the Pile and C4, it's very rare that people who train models disclose the contents of the training data. However, this utility can be useful to evaluate models you have trained on private data, provided you are willing to pre-compute the necessary indices. We provide computed indices for 13-gram exact match deduplication against the Pile, and plan to add additional precomputed dataset indices in the future (including C4 and min-hash LSH deduplication).
+
 For details on text decontamination, see the [decontamination guide](./docs/decontamination.md).

 Note that the directory provided to the `--decontamination_ngrams_path` argument should contain the ngram files and info.json. See the above guide for ngram generation for the Pile; the same procedure could be adapted for other training sets.
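
For readers unfamiliar with n-gram exact match decontamination, the idea described in the added paragraph can be sketched roughly as follows. This is a minimal illustration, not the harness's actual implementation: the function names (`word_ngrams`, `build_index`, `is_contaminated`, `split_clean`) and the simple regex tokenization are assumptions made for the example, and a real index over the Pile would be stored on disk rather than in an in-memory set.

```python
import re


def word_ngrams(text, n=13):
    """Return the set of lowercase word n-grams in `text` (hypothetical helper)."""
    tokens = re.findall(r"\w+", text.lower())
    # Texts shorter than n words yield an empty set.
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_index(training_docs, n=13):
    """Collect every n-gram seen in the training documents."""
    index = set()
    for doc in training_docs:
        index |= word_ngrams(doc, n)
    return index


def is_contaminated(example, index, n=13):
    """An example is flagged if it shares at least one n-gram with the index."""
    return not word_ngrams(example, n).isdisjoint(index)


def split_clean(examples, index, n=13):
    """Keep only the benchmark examples with no n-gram overlap."""
    return [ex for ex in examples if not is_contaminated(ex, index, n)]
```

Results can then be reported twice, once on the full benchmark and once on the subset returned by `split_clean`, so the two scores can be compared. The default of 13 matches the 13-gram setting mentioned above; a smaller `n` flags more examples at the cost of more false positives.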