return"Measure instruction complexity using Llama-based Deita model."
def infer_complexity(self, input_text):
    complexity_template = ("You are a helpful assistant. Please identify the complexity score of the following user query. \n##Query: {instruction}\n##Complexity: ")
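# A minimal sketch (not the actual operator code) of how the complexity template above
# could be turned into a numeric score with a causal LM. The model name and the
# weighting over the score tokens "1".."6" are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def _sketch_deita_complexity(instruction, model_name="hkust-nlp/deita-complexity-scorer"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    template = ("You are a helpful assistant. Please identify the complexity score of the "
                "following user query. \n##Query: {instruction}\n##Complexity: ")
    inputs = tokenizer(template.format(instruction=instruction), return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Restrict the next-token distribution to the digit tokens "1".."6" and take the
    # probability-weighted mean as the expected complexity score.
    score_token_ids = [tokenizer.convert_tokens_to_ids(str(i)) for i in range(1, 7)]
    probs = torch.softmax(next_token_logits[score_token_ids], dim=-1)
    return float((probs * torch.arange(1, 7, dtype=torch.float)).sum())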
"Evaluate instruction content diversity and intention tags using the Instag scorer. Generate relevant tags by analyzing instruction text, "
"with more tags indicating greater content diversity, while returning detailed explanations of tags. Implemented based on OFA-Sys/InsTagger model.\n"
"Input parameters:\n"
"- query: Instruction text to be evaluated\n"
"Output parameters:\n"
"- int: Number of tags (content diversity indicator)\n"
"- list: List of dictionaries containing tags and explanations"
)
def make_prompt(self, query):
    prompt = f"Please identify tags of user intentions in the following user query and provide an explanation for each tag. Please respond in the JSON format {{\"tag\": str, \"explanation\": str}}.\nUser query: {query}"
    messages = [("user", prompt), ("Assistant", None)]
    seps = [" ", "</s>"]
    ret = "system: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions." + seps[0]
    for i, (role, message) in enumerate(messages):
        if message:
            ret += role + ": " + message + seps[i % 2]
        else:
            ret += role + ":"
    return ret
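# A minimal sketch (illustrative, not the operator's exact parsing code) of turning the
# model's JSON-formatted generation back into the outputs described above: the number
# of tags and a list of {"tag": ..., "explanation": ...} dictionaries.
import json

def _sketch_parse_instag_output(generated_text):
    try:
        tags = json.loads(generated_text)
    except json.JSONDecodeError:
        return 0, []
    if isinstance(tags, dict):  # a single tag may come back as one JSON object
        tags = [tags]
    tags = [t for t in tags if isinstance(t, dict) and "tag" in t]
    return len(tags), tags

# Example: _sketch_parse_instag_output('[{"tag": "math", "explanation": "asks for a proof"}]')
# returns (1, [{"tag": "math", "explanation": "asks for a proof"}]).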
def inference_batch(self, queries):
    """Process batch of queries using either local model or API."""
"Score text quality using a reward model trained on human preference data (OpenAssistant/reward-model-deberta-v3-large-v2), where higher scores indicate better quality. "
"The model takes instruction-response text pairs as input and outputs a reward score between 0 and 1, reflecting human preference judgments on text quality.\n"
"Input parameters:\n"
"- instruction: Instruction text string\n"
"- output: Response text string\n"
"Output parameters:\n"
"- float: Reward score between 0 and 1, higher values indicate better quality"
"Evaluate the follow difficulty of instructions using the Superfiltering method, which calculates the ratio of conditional perplexity to independent perplexity based on the GPT-2 model. "
"Higher scores indicate greater difficulty in following the instruction. This method assesses instruction clarity and follow difficulty by comparing response perplexity under instruction conditions with independent response perplexity.\n"
return"Filter operator based on InstagScorer. Uses pre-trained Instag model to analyze instructions, returning the number of tags to evaluate content diversity. Parameters include model cache directory (model_cache_dir), computing device (device), and maximum new tokens (max_new_tokens). Filter range is controlled by min_score and max_score parameters, with more tags indicating greater content diversity."
"Filter data using scores from the RMScorer. Quality scoring using reward model trained with human preference data, where higher scores indicate better quality.\n"
"Reward model evaluates human preference metrics such as relevance, helpfulness, and harmlessness, useful for filtering high-quality text aligned with human values.\n"
"Input Parameters:\n"
"- min_score: Minimum reward score threshold for retaining samples, default is 0.2\n"
"- max_score: Maximum reward score threshold for retaining samples, default is 0.8\n"
"- device: Model running device, default is 'cuda'\n"
"- model_cache_dir: Model cache directory, default is './dataflow_cache'\n"
"- input_instruction_key: Instruction field name, default is 'instruction'\n"
"- input_output_key: Output field name, default is 'output'\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only samples with reward scores within [min_score, max_score] range\n"
"- List containing reward score field name for subsequent operator reference"
)
else:
return"Filter data based on quality scores from human preference-trained reward model."
"Filter out low-quality data using the Superfiltering scorer. Evaluate instruction following difficulty by calculating perplexity ratio with GPT-2 model; lower ratios indicate instructions are easier for models to understand and execute.\n"
"Suitable for selecting instruction data appropriate for specific model capabilities, improving model training efficiency and effectiveness.\n"
"Input Parameters:\n"
"- min_score: Minimum score threshold for retaining samples, default is 0.0\n"
"- max_score: Maximum score threshold for retaining samples, default is 1.0\n"
"- device: Model running device, default is 'cuda'\n"
"- model_cache_dir: Model cache directory, default is './dataflow_cache'\n"
"- max_length: Maximum text length, default is 512\n"
"- input_instruction_key: Instruction field name, default is 'instruction'\n"
"- input_input_key: Input field name, default is 'input'\n"
"- input_output_key: Output field name, default is 'output'\n"
"- output_key: Filter result score field name, default is 'SuperfilteringScore'\n"
"Output Parameters:\n"
"- Filtered DataFrame containing only samples with scores within [min_score, max_score] range\n"
"- List containing filter result score field name for subsequent operator reference"
)
else:
return"Filter low-quality data using perplexity ratio calculated with GPT-2 model."
"Filter data using scores from the TreeinstructScore. Measure instruction complexity by the number of nodes in the generated syntax tree; more nodes indicate more complex instructions.\n"
"Suitable for selecting instruction data within specific complexity ranges, balancing dataset difficulty distribution and optimizing model training effectiveness.\n"
"Input Parameters:\n"
"- min_score: Minimum syntax tree node count threshold for retaining samples, default is 7\n"
"- max_score: Maximum syntax tree node count threshold for retaining samples, default is 100\n"
# Based on the existing topics, it is recommended to keep num_samples below 5000. Otherwise, consider adding your own topics to dataflow.prompts.general_text.CondorPrompt to increase data richness.
"Two-stage generation of SFT-style data from scratch based on predefined knowledge tree tags (for over 5000 samples, consider increasing the number of tags). \n"
"First stage generates questions of varying difficulty levels, second stage generates answers for each question.\n"