zhougaofeng / internlm2-math-7B
Commit d04b4aee, authored Jun 11, 2024 by zhougaofeng
Upload New File
Parent: 97390a14
Pipeline #1154 canceled
Showing 1 changed file with 100 additions and 0 deletions
src/llmfactory/hparams/data_args.py
0 → 100644
from dataclasses import dataclass, field
from typing import Literal, Optional


@dataclass
class DataArguments:
    r"""
    Arguments pertaining to what data we are going to input our model for training and evaluation.
    """

    template: Optional[str] = field(
        default=None,
        metadata={"help": "Which template to use for constructing prompts in training and inference."},
    )
    dataset: Optional[str] = field(
        default=None,
        metadata={"help": "The name of provided dataset(s) to use. Use commas to separate multiple datasets."},
    )
    dataset_dir: str = field(
        default="data",
        metadata={"help": "Path to the folder containing the datasets."},
    )
    split: str = field(
        default="train",
        metadata={"help": "Which dataset split to use for training and evaluation."},
    )
    cutoff_len: int = field(
        default=1024,
        metadata={"help": "The cutoff length of the tokenized inputs in the dataset."},
    )
    reserved_label_len: int = field(
        default=1,
        metadata={"help": "The minimum cutoff length reserved for the tokenized labels in the dataset."},
    )
    train_on_prompt: bool = field(
        default=False,
        metadata={"help": "Whether to disable the mask on the prompt or not."},
    )
    streaming: bool = field(
        default=False,
        metadata={"help": "Enable dataset streaming."},
    )
    buffer_size: int = field(
        default=16384,
        metadata={"help": "Size of the buffer to randomly sample examples from in dataset streaming."},
    )
    mix_strategy: Literal["concat", "interleave_under", "interleave_over"] = field(
        default="concat",
        metadata={"help": "Strategy to use in dataset mixing (concat/interleave) (undersampling/oversampling)."},
    )
    interleave_probs: Optional[str] = field(
        default=None,
        metadata={"help": "Probabilities to sample data from datasets. Use commas to separate multiple datasets."},
    )
    overwrite_cache: bool = field(
        default=False,
        metadata={"help": "Overwrite the cached training and evaluation sets."},
    )
    preprocessing_num_workers: Optional[int] = field(
        default=None,
        metadata={"help": "The number of processes to use for the pre-processing."},
    )
    max_samples: Optional[int] = field(
        default=None,
        metadata={"help": "For debugging purposes, truncate the number of examples for each dataset."},
    )
    eval_num_beams: Optional[int] = field(
        default=None,
        metadata={"help": "Number of beams to use for evaluation. This argument will be passed to `model.generate`"},
    )
    ignore_pad_token_for_loss: bool = field(
        default=True,
        metadata={
            "help": "Whether or not to ignore the tokens corresponding to padded labels in the loss computation."
        },
    )
    val_size: float = field(
        default=0.0,
        metadata={"help": "Size of the development set, should be an integer or a float in range `[0,1)`."},
    )
    packing: Optional[bool] = field(
        default=None,
        metadata={
            "help": "Whether or not to pack the sequences in training. Will automatically enable in pre-training."
        },
    )
    tokenized_path: Optional[str] = field(
        default=None,
        metadata={"help": "Path to save or load the tokenized datasets."},
    )

    def __post_init__(self):
        if self.reserved_label_len >= self.cutoff_len:
            raise ValueError("`reserved_label_len` must be smaller than `cutoff_len`.")

        if self.streaming and self.val_size > 1e-6 and self.val_size < 1:
            raise ValueError("Streaming mode should have an integer val size.")

        if self.streaming and self.max_samples is not None:
            raise ValueError("`max_samples` is incompatible with `streaming`.")
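The sketch below illustrates how the three `__post_init__` checks in this commit behave for a few option combinations. It is not part of the commit: `MiniDataArguments` is a hypothetical, trimmed-down stand-in that copies only the fields the checks read, and `split_names` and `try_build` are hypothetical helpers. How the project actually consumes the comma-separated `dataset` string is not shown in this file, so the splitting logic is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MiniDataArguments:
    """Hypothetical stand-in for DataArguments; only the validated fields are kept."""

    dataset: Optional[str] = None
    cutoff_len: int = 1024
    reserved_label_len: int = 1
    streaming: bool = False
    val_size: float = 0.0
    max_samples: Optional[int] = None

    def __post_init__(self):
        # The same three checks as the commit's __post_init__.
        if self.reserved_label_len >= self.cutoff_len:
            raise ValueError("`reserved_label_len` must be smaller than `cutoff_len`.")
        if self.streaming and self.val_size > 1e-6 and self.val_size < 1:
            raise ValueError("Streaming mode should have an integer val size.")
        if self.streaming and self.max_samples is not None:
            raise ValueError("`max_samples` is incompatible with `streaming`.")


def split_names(value: Optional[str]) -> List[str]:
    """Hypothetical helper: split a comma-separated option into trimmed names."""
    if not value:
        return []
    return [name.strip() for name in value.split(",") if name.strip()]


def try_build(**kwargs) -> str:
    """Return "ok", or the validation message raised for the given options."""
    try:
        MiniDataArguments(**kwargs)
        return "ok"
    except ValueError as err:
        return str(err)


if __name__ == "__main__":
    print(try_build())                                # defaults pass
    print(try_build(streaming=True, val_size=0.1))    # fractional val_size is rejected
    print(try_build(streaming=True, val_size=100))    # an absolute count passes
    print(try_build(streaming=True, max_samples=10))  # rejected
    print(split_names("set_a, set_b"))                # placeholder dataset names
```

Because the checks run in `__post_init__`, an invalid combination fails when the object is constructed rather than later in the data pipeline. Note that the second check only rejects fractional `val_size` values while streaming; a value of 1 or more passes it.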