"This operator generates natural language questions for Text2SQL tasks if the natural language question is empty. Multiple candidate questions are generated to ensure correctness.\n\n"
"Input parameters:\n"
"- input_sql_key: The name of the input SQL column\n"
"- input_db_id_key: The name of the database ID column\n\n"
"Output parameters:\n"
"- output_question_key: The name of the output question column"
assert self.num_labels == self.model.config.num_labels, f"Number of labels ({self.num_labels}) does not match model config ({self.model.config.num_labels})"
"Evaluate the educational value of text using the Fineweb-Edu classifier. This classifier uses a pre-trained sequence classification model "
"to assess text and returns a score between 0 and 1, where higher scores indicate greater educational value. Suitable for filtering educational content.\n"
"Input parameters:\n"
"- text: Text string to be evaluated\n"
"Output parameters:\n"
"- float: Educational value score between 0 and 1, higher values indicate greater educational value"
"Evaluate multiple meta attributes of text using LLM, including Text Structure, Diversity & Complexity, Fluency & Understandability, Safety, Educational Value, and Content Accuracy & Effectiveness.\n"
"- dimensions: List of evaluation dimensions, each dimension corresponding to a dictionary containing dimension_name, description, and example field:\n"
" * dimension_name: Name of the dimension\n"
" * description: Description of the dimension\n"
" * example_list: List containing example texts and scores\n"
"- input_key: Field name for input text\n"
"Output Parameters:\n"
"- DataFrame containing scores for 6 evaluation dimensions with columns: Text Structure, Diversity & Complexity, Fluency & Understandability, Safety, Educational Value, Content Accuracy & Effectiveness"
)
else:
return"Evaluate multiple meta attributes of text using LLM."
"Text quality scorer trained on BGE model and GPT pairwise comparison data, supporting bilingual input. Evaluate text through single-sample assessment, "
"returning a quality score between 0 and 1, where higher scores indicate better text quality. Models include English version (zks2856/PairQual-Scorer-en) and Chinese version (zks2856/PairQual-Scorer-zh).\n"
"Input parameters:\n"
"- text: Text string to be evaluated\n"
"- lang: Language type, optional 'en' or 'zh'\n"
"Output parameters:\n"
"- float: Quality score between 0 and 1, higher values indicate better quality"
"Evaluate text quality across four dimensions using the Qurating model (princeton-nlp/QuRater-1.3B): writing_style, required_expertise, "
"facts_and_trivia, and educational_value. Each dimension returns a score between 0 and 1, providing a comprehensive assessment of overall text quality.\n"
"Input parameters:\n"
"- text: Text string to be evaluated\n"
"- labels: List of evaluation dimensions, default ['writing_style', 'required_expertise', 'facts_and_trivia', 'educational_value']\n"
"Output parameters:\n"
"- dict: Dictionary containing scores for each dimension, with keys as dimension names and values as scores between 0 and 1"
)
def _score_func(self, sample):
    """Process a single sample and return the score."""
    batch_dict = {'text': [sample]}  # Wrap the sample in a list for batch processing
"Assess the educational value of text using a FastText classifier (kenhktsui/llm-data-textbook-quality-fasttext-classifer-v2), categorizing text into Low, Mid, and High levels, "
"mapped to scores of 1.0, 3.0, and 5.0 respectively. Suitable for filtering high-quality text content suitable as teaching materials.\n"
"Input parameters:\n"
"- text: Text string to be evaluated\n"
"Output parameters:\n"
"- float: Educational value score, possible values 1.0 (Low), 3.0 (Mid), 5.0 (High)"
"Filter data using scores from the PairQualScorer. Bilingual text quality scorer trained on GPT pairwise comparison annotations using BGE model; higher scores indicate better quality.\n"
"Input Parameters:\n"
"- min_score: Minimum quality score threshold\n"
"- max_score: Maximum quality score threshold\n"
"- model_cache_dir: Model cache directory path\n"
"- lang: Text language type\n\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only texts with quality score within specified range\n"
"Filter data using scores from the PerplexityScorer. Uses Huggingface model to calculate text perplexity; lower scores indicate better fluency and understandability.\n"
"Input Parameters:\n"
"- min_score: Minimum perplexity threshold\n"
"- max_score: Maximum perplexity threshold\n"
"- model_name: Huggingface model path or name\n"
"- device: Model device\n\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only texts with perplexity within specified range\n"
"Filter data using scores from the QuratingScorer. Evaluate text quality across four dimensions using Qurating model: writing style, required expertise, facts and trivia content, and educational value.\n"
"Each dimension is scored from 0-9, providing comprehensive quality assessment for filtering high-quality educational or knowledge-based content.\n"
"Input Parameters:\n"
"- min_scores: Minimum score thresholds for each dimension, default is {'writing_style':0,'required_expertise':0,'facts_and_trivia':0,'educational_value':0}\n"
"- max_scores: Maximum score thresholds for each dimension, default is {'writing_style':9,'required_expertise':9,'facts_and_trivia':9,'educational_value':9}\n"
"- map_batch_size: Mapping batch size, default is 512\n"
"- num_workers: Number of data loading workers, default is 1\n"
"- device_batch_size: Device batch size, default is 16\n"
"- device: Model running device, default is 'cuda'\n"
"- labels: List of evaluation dimensions, default is ['writing_style', 'required_expertise', 'facts_and_trivia', 'educational_value']\n"
"- model_cache_dir: Model cache directory, default is './dataflow_cache'\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only samples with all dimension scores within corresponding threshold ranges\n"
"- List containing field names of each dimension's filtering results for subsequent operator reference"
)
else:
return"Filter data based on multi-dimensional quality assessment using Qurating model."
"Filter data using scores from the TextbookScorer. Assess educational value of text using FastText classifier to determine if text is suitable as educational material.\n"
"Classifier is trained to identify text with educational significance, clear structure, and accurate knowledge, suitable for building educational datasets.\n"
"Input Parameters:\n"
"- min_score: Minimum educational value score threshold for retaining samples, default is 0.99\n"
"- max_score: Maximum educational value score threshold for retaining samples, default is 1.0\n"
"- model_cache_dir: Model cache directory, default is './dataflow_cache'\n"
"- input_key: Input text field name\n"
"- output_key: Educational value score field name, default is 'TextbookScore'\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only samples with educational value scores within [min_score, max_score] range\n"
"- List containing educational value score field name for subsequent operator reference"
)
else:
return"Filter data based on educational value assessment using FastText textbook classifier."