OpenDAS / Megatron-LM
Commit d0878333 (unverified)
Authored May 11, 2019 by Raul Puri; committed by GitHub, May 11, 2019

added missing presplit_sentences_json.py

Parent: 66719e97
Changes: 1 changed file with 27 additions and 0 deletions

scripts/presplit_sentences_json.py (new file, mode 0 → 100644)
"""
Usage:
python scripts/presplit_sentences_json.py <original loose json file> <output loose json file>
"""

import sys
import json

import nltk

nltk.download('punkt')

input_file = sys.argv[1]
output_file = sys.argv[2]

line_seperator = "\n"

with open(input_file, 'r') as ifile:
    with open(output_file, "w") as ofile:
        for doc in ifile.readlines():
            parsed = json.loads(doc)
            sent_list = []
            for line in parsed['text'].split('\n'):
                if line != '\n':
                    sent_list.extend(nltk.tokenize.sent_tokenize(line))
            parsed['text'] = line_seperator.join(sent_list)
            ofile.write(json.dumps(parsed)+'\n')
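
For readers who want to experiment with the same transformation outside the script, the following is a minimal sketch and not part of the commit. It assumes the loose JSON input has one JSON object per line with a 'text' field, as the script does. The names presplit_text, presplit_lines, and simple_sentence_split are hypothetical, and the regex splitter is only a stand-in so the example runs without NLTK. One observation motivates the blank-line check: after split('\n') no piece can equal '\n', so the comparison in the committed script can never be false and blank lines reach the tokenizer too; the sketch skips them explicitly.

```python
import json
import re
from typing import Callable, Iterable, Iterator, List


def simple_sentence_split(text: str) -> List[str]:
    """Stand-in splitter (an assumption, not NLTK): break after '.', '!' or '?'
    when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def presplit_text(
    text: str, split: Callable[[str], List[str]] = simple_sentence_split
) -> str:
    """Return the text with one sentence per line, skipping blank lines."""
    sentences: List[str] = []
    for line in text.split("\n"):
        if line.strip():  # explicit blank-line check
            sentences.extend(split(line))
    return "\n".join(sentences)


def presplit_lines(
    lines: Iterable[str],
    split: Callable[[str], List[str]] = simple_sentence_split,
    key: str = "text",
) -> Iterator[str]:
    """Rewrite each loose JSON line so that record[key] holds one sentence per line."""
    for raw in lines:
        if not raw.strip():
            continue  # tolerate empty lines in the input file
        record = json.loads(raw)
        record[key] = presplit_text(record[key], split)
        yield json.dumps(record) + "\n"
```

To reproduce the script's tokenization, pass nltk.tokenize.sent_tokenize as the split argument after calling nltk.download('punkt'), and write the yielded lines to the output file. Keeping the splitter as a parameter makes the per-record logic testable without downloading tokenizer data.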