likorn / estonian-lstm
Commit 7a49bc01 authored Jan 05, 2019 by Paktalin
Preprocessed forms
parent 266b978c
Showing 2 changed files with 9 additions and 8 deletions
encoded_forms.csv
preprocessing.py
encoded_forms.csv
0 → 100644
This source diff could not be displayed because it is too large.
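Since the file itself is not shown, here is a minimal sketch of how its contents might be read back. It assumes the layout implied by the `np.savetxt` call in preprocessing.py in this commit: one sentence per row, integer codes separated by `~`. The helper name `load_encoded_forms` and the tiny sample file are hypothetical and stand in for the real data.

```python
import csv
import os
import tempfile


def load_encoded_forms(path, delimiter="~"):
    """Read a delimiter-separated file of integer codes into a list of rows."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        # skip empty lines, convert every cell to int
        return [[int(cell) for cell in row] for row in reader if row]


# Hypothetical stand-in for encoded_forms.csv: two sentences, padded with zeros.
sample_path = os.path.join(tempfile.mkdtemp(), "encoded_forms_sample.csv")
with open(sample_path, "w") as handle:
    handle.write("1~2~1~0\n2~3~0~0\n")

rows = load_encoded_forms(sample_path)
```

With NumPy available, `np.loadtxt(path, delimiter="~", dtype=int)` should give the same data as an array.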
preprocessing.py
 from estnltk import Text
 import numpy as np
 from keras.preprocessing.text import text_to_word_sequence
+from tqdm import tqdm
 # the maximum length of a sentence
 maxlen = 70
...
...
@@ -9,20 +10,20 @@ articles = Text(open('articles.txt', encoding='utf-8').read())
 # transform to an array of sentences
 sentences = articles.sentence_texts
-N = 10
 # create an empty dict to store forms like {form: code}
 dict_forms = {}
 # initialize a prefilled with zeros numpy array
-values = np.zeros((N, maxlen), dtype=int)
-for i in range(N):
+encoded_forms = np.zeros((len(sentences), maxlen), dtype=int)
+# loop over all sentences showing a loading bar
+for i in tqdm(range(len(sentences))):
     # split the sentence into a list of lowercase words
     sentences[i] = text_to_word_sequence(sentences[i])
     # loop over the words in the current sentence
-    for j in range(len(sentences[i])):
+    for j in range(len(sentences[i][:maxlen])):
         form = Text(sentences[i][j]).forms[0]
-        # add the unseen form to the dictionary
+        # add the unseen form to the dictionary increasing its code value by one
         if form not in dict_forms:
             dict_forms[form] = len(dict_forms) + 1
         # set the form's code to the current form
-        values[i, j] = dict_forms[form]
-print(values)
-\ No newline at end of file
+        encoded_forms[i, j] = dict_forms[form]
+np.savetxt("encoded_forms.csv", encoded_forms, delimiter="~", fmt='%i')
+\ No newline at end of file
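The encoding step in this diff (assign each unseen form the next integer code, truncate to `maxlen`, leave the rest of the row as zeros) can be sketched without estnltk, Keras, or NumPy. This is only an illustration under stated assumptions: `encode_sentences` and `get_form` are hypothetical names, whitespace splitting stands in for `text_to_word_sequence`, and the default `get_form` returns the word itself rather than the morphological form that `Text(word).forms[0]` produces in the commit.

```python
def encode_sentences(sentences, maxlen, get_form=lambda word: word):
    """Map each token's form to an integer code and zero-pad every row to maxlen."""
    form_codes = {}  # {form: code}; code 0 is left for padding
    encoded = [[0] * maxlen for _ in sentences]
    for i, sentence in enumerate(sentences):
        # lowercase, split on whitespace, keep at most maxlen tokens
        words = sentence.lower().split()[:maxlen]
        for j, word in enumerate(words):
            form = get_form(word)  # stand-in for the estnltk form lookup
            if form not in form_codes:
                # unseen form: give it the next free code, starting at 1
                form_codes[form] = len(form_codes) + 1
            encoded[i][j] = form_codes[form]
    return encoded, form_codes


encoded, form_codes = encode_sentences(["a b a", "b c"], maxlen=4)
```

Starting the codes at 1 keeps 0 free, so padded positions cannot be confused with a real form.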