"Apply comprehensive document-level quality filtering rules using scores from CodeDocumentQualitySampleEvaluator to remove low-quality code and text samples.\n\n"
"Evaluation Metrics:\n"
"- Content length: character/word/line count range checks\n"
"- Repetition patterns: duplicate line ratio, 2-10gram repetition ratios\n"
"- Character composition: curly bracket ratio, all-caps word ratio\n"
"- Text entropy: unigram entropy checks\n"
"- Comprehensive document quality score: 0-1, 1 means passes all quality checks\n\n"
"Input Parameters:\n"
"- input_key: Input field name (requires 'text', 'filename', 'language' columns)\n"
"- output_key: Output label field name (default: 'doc_quality_filter_label')\n"
self.num_dialogs_per_intent = num_dialogs_per_intent  # With the topic_dict in the existing prompt, keep this value below 1000 (enough to generate 9000 dialogues). To go beyond that, add more entries to topic_dict in dataflow.prompts.general_text.ConsistentChatPrompt to increase data richness.
"Two-stage generation of multi-turn dialogue data from scratch based on predefined topics and human intents (for over 9000 samples, consider increasing the number of tags).\n"
"Generate new or alternative scenarios based on the original scenario using LLM service. The original content is rewritten or reimagined to create a different version.\n"