"Filter data using scores from the LangkitScorer. Uses Langkit to extract 11 types of text statistics for evaluating text structure complexity and readability.\n"
"Input Parameters:\n"
"- min_scores: Dictionary of minimum thresholds for each metric, containing 11 language statistics\n"
"- max_scores: Dictionary of maximum thresholds for each metric, containing 11 language statistics\n"
"- metrics_to_keep: List of evaluation metrics to keep\n\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only texts with all metrics within specified ranges\n"
"- List containing label field names for each metric"
"Filter data using FastText language identification model. Downloads and loads pre-trained FastText language identification model to check if text language is in allowed list.\n"
"Input Parameters:\n"
"- allowed_languages: List of allowed language labels\n"
"- model_cache_dir: Model cache directory path\n\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only texts with language in allowed list\n"
"Filter data using scores from the LexicalDiversityScorer. Measure lexical diversity using MTLD (Moving-Average Type-Token Ratio) and HDD (Hypergeometric Distribution Diversity) methods; higher scores indicate more diverse vocabulary usage.\n"
"Input Parameters:\n"
"- min_scores: Dictionary of minimum thresholds for each metric, containing 'mtld' and 'hdd'\n"
"- max_scores: Dictionary of maximum thresholds for each metric, containing 'mtld' and 'hdd'\n\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only texts with lexical diversity within specified range\n"
"- List containing label field names for each metric"
Operator for filtering text by language using an LLM.
The allowed_languages argument is a list of permitted languages, specified as ISO 639-1 two-letter codes (for example, 'en' for English or 'zh' for Chinese).
"Efficient near-duplicate detection using MinHash and LSH (Locality-Sensitive Hashing). Converts texts to MinHash signatures and uses LSH to quickly find similar texts, enabling near-deduplication for large-scale datasets.\n"
"Input Parameters:\n"
"- num_perm: Number of permutations for generating MinHash signatures\n"
"- threshold: Similarity threshold above which texts are considered duplicates\n"
"- use_n_gram: Whether to use n-gram tokenization\n"
"- ngram: n value for n-gram\n\n"
"Output Parameters:\n"
"- Deduplicated DataFrame containing only unique texts\n"
"- List containing deduplication label field name"
"Filter data using scores from the NgramScorer. Evaluate text redundancy via n-gram repetition ratio; higher score means lower repetition and less text redundancy.\n"
"Input Parameters:\n"
"- min_score: Minimum n-gram score threshold\n"
"- max_score: Maximum n-gram score threshold\n"
"- ngrams: n value for n-gram\n\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only texts with n-gram score within specified range\n"
"Detect similar text using n-gram technology and hashing algorithm for near deduplication. Splits text into multiple n-gram segments, computes hash values for each segment, and judges text similarity by comparing hash set similarity.\n"
"Input Parameters:\n"
"- n_gram: Number of segments to split text into\n"
"- hash_func: Hash function type, supporting 'md5', 'sha256', and 'xxh3'\n"
"- diff_size: Hash set difference threshold below which texts are considered similar\n\n"
"Output Parameters:\n"
"- Deduplicated DataFrame containing only unique texts\n"
"- List containing deduplication label field name"
"Filter data using scores from the PresidioScorer. Detect personally identifiable information (PII) entities in text using Microsoft Presidio model and return the count of detected PII items.\n"
"Supports recognition of multiple sensitive information types including names, emails, phone numbers, and IDs for data privacy protection and compliance checks.\n"
"Input Parameters:\n"
"- min_score: Minimum PII count threshold for retaining samples, default is 0\n"
"- max_score: Maximum PII count threshold for retaining samples, default is 5\n"
"- lang: Text language, default is 'en'\n"
"- device: Model running device, default is 'cuda'\n"
"- model_cache_dir: Model cache directory, default is './dataflow_cache'\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only samples with PII count within [min_score, max_score] range\n"
"- List containing output field name for subsequent operator reference"
)
else:
return"Filter data based on PII detection results using Microsoft Presidio model."
"This operator verifies if the ratio of stop words in text is above threshold, using NLTK tokenizer for word splitting and stop word identification.\n"
"Initialization Parameters:\n"
"- threshold: Stop word ratio threshold (no default, required)\n"
"- use_tokenizer: Whether to use NLTK tokenizer (no default, required)\n"
"Run Parameters:\n"
"- storage: DataFlowStorage object\n"
"- input_key: Input text field name\n"
"- output_key: Output label field name, default is 'stop_word_filter_label'\n"
"Returns:\n"
"- List containing output_key"
)
else:
return"StopWordFilter verifies stop word ratio using NLTK tokenization with configurable threshold."
"This operator checks if the ratio of unique words in text meets threshold, calculating ratio of unique word count to total word count using set operations.\n"
"Initialization Parameters:\n"
"- threshold: Minimum unique word ratio threshold, default is 0.1\n"
"Run Parameters:\n"
"- storage: DataFlowStorage object\n"
"- input_key: Input text field name\n"
"- output_key: Output label field name, default is 'unique_words_filter'\n"
"Returns:\n"
"- List containing output_key"
)
else:
return"UniqueWordsFilter checks unique word ratio using set operations and threshold comparison."
"Identify semantically duplicate text using BERT embeddings for near deduplication. Calculate cosine similarity between text embedding vectors to detect semantically similar texts and retain unique samples.\n"
"Supports multiple field combinations as deduplication criteria, effectively removing duplicate data with similar content but different expressions to improve dataset diversity.\n"
"Input Parameters:\n"
"- eps: Similarity threshold, smaller values allow lower similarity, default is 0.05 (cosine similarity > 0.95 is considered duplicate)\n"
"- model_name: Pretrained model name, default is 'sentence-transformers/all-MiniLM-L6-v2'\n"
"- model_cache_dir: Model cache directory, default is './dataflow_cache'\n"
"- device: Model running device, default is 'cuda'\n"
"- input_keys: List of multiple input field names, alternative to input_key\n"
"- input_key: Single input field name, alternative to input_keys\n"
"- output_key: Deduplication result field name, default is 'minhash_deduplicated_label'\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only semantically unique samples (samples marked as 1)\n"
"- List containing deduplication result field name for subsequent operator reference"
)
else:
return"Near deduplication by identifying semantically similar content using BERT embeddings."
"Detect similar text via SimHash algorithm and Hamming distance for near deduplication. Convert text to fixed-length fingerprints and determine text similarity by calculating Hamming distance between fingerprints.\n"
"Faster than semantic deduplication, suitable for fast deduplication preprocessing of large-scale datasets, especially for detecting character-level similar texts.\n"
"Input Parameters:\n"
"- fingerprint_size: Fingerprint length, default is 64 bits\n"
"- bound: Similarity threshold, smaller values allow lower similarity, default is 0.1 (similarity > 0.9 is considered duplicate)\n"
"- input_keys: List of multiple input field names, alternative to input_key\n"
"- input_key: Single input field name, alternative to input_keys\n"
"- output_key: Deduplication result field name, default is 'minhash_deduplicated_label'\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only unique samples with similarity below threshold (samples marked as 1)\n"
"- List containing deduplication result field name for subsequent operator reference"
)
else:
return"Near deduplication by detecting text similarity using SimHash algorithm and Hamming distance."