The results and ramblings of research

phewww!

Tokenizing Text for GATE


After much headache, I have noticed that GATE isn't really helping with my named-entity (NE) tasks on social media text. I looked into the outputs and found that the tokenization is probably one of the reasons why I am unable to extract entities correctly. For example, text such as:

ABC/ CD is tokenized as WORD{ABC/} SPACETOKEN{ } WORD{CD}. Notice that the punctuation “/” has been rolled into the word itself. So I made a change to the default tokenizer so that such punctuation is emitted as a SpaceToken of kind “other”. To do this, open “$GATE_HOME/plugins/ANNIE/tokeniser/DefaultTokeniser.rules” and add the following line after the block “#whitespace#”:

(OTHER_PUNCTUATION) >SpaceToken;kind=other;

Then, in the block “#punctuation#”, change
(CONNECTOR_PUNCTUATION|OTHER_PUNCTUATION)>Token;kind=punctuation;
to
(CONNECTOR_PUNCTUATION)>Token;kind=punctuation;
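
If you would rather script the two edits above than make them by hand, here is a minimal sketch in Python. It works on the text of the rules file and assumes the file contains the “#whitespace#” marker and the punctuation rule exactly as quoted in this post; the function name `patch_tokeniser_rules` is my own and not part of GATE. It treats the first blank line or “#” line after the marker as the end of the whitespace block, which is a guess about the file layout.

```python
# Sketch only: patches the rules text as described in the post.
# Assumes the marker and rule strings below match the file contents.
WHITESPACE_MARKER = "#whitespace#"
OLD_PUNCT = "(CONNECTOR_PUNCTUATION|OTHER_PUNCTUATION)>Token;kind=punctuation;"
NEW_PUNCT = "(CONNECTOR_PUNCTUATION)>Token;kind=punctuation;"
NEW_SPACE_RULE = "(OTHER_PUNCTUATION) >SpaceToken;kind=other;"


def patch_tokeniser_rules(rules_text):
    """Return rules_text with the SpaceToken rule added and the
    punctuation rule narrowed to CONNECTOR_PUNCTUATION only."""
    out = []
    in_whitespace_block = False
    inserted = False
    for line in rules_text.splitlines():
        stripped = line.strip()
        if stripped == NEW_SPACE_RULE:
            inserted = True  # file was already patched; do not add it twice
        # End of the whitespace block (assumed): blank line or next "#" line.
        if in_whitespace_block and not inserted and (
            stripped == "" or stripped.startswith("#")
        ):
            out.append(NEW_SPACE_RULE)
            inserted = True
            in_whitespace_block = False
        if stripped == WHITESPACE_MARKER:
            in_whitespace_block = True
        elif stripped.replace(" ", "") == OLD_PUNCT:
            line = NEW_PUNCT  # drop OTHER_PUNCTUATION from the Token rule
        out.append(line)
    # The whitespace block ran to the end of the file.
    if in_whitespace_block and not inserted:
        out.append(NEW_SPACE_RULE)
    return "\n".join(out) + "\n"
```

To use it, read DefaultTokeniser.rules into a string, pass it through `patch_tokeniser_rules`, and write the result back, keeping a backup of the original file first. Running it on an already patched file should leave the file unchanged.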

And now GATE tokenizes the string as:
WORD{ABC} SPACETOKEN{/} SPACETOKEN{ } WORD{CD}
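
The class names used in the rules (OTHER_PUNCTUATION, SPACE_SEPARATOR, CONNECTOR_PUNCTUATION) are the Java names for the Unicode general categories Po, Zs and Pc, which is why “/” is caught by the new rule. The toy below is not GATE's tokeniser; it is only a rough sketch in Python that mimics the modified rules for letters, spaces and those two punctuation classes, so you can check which class a character falls into. The function name `sketch_tokenize` and the tuple layout are my own assumptions.

```python
import unicodedata


def sketch_tokenize(text):
    """Toy imitation of the modified rules: returns (type, kind, string)
    tuples. Only letters, Zs, Po and Pc are handled specially."""
    tokens = []
    word = []
    for ch in text:
        category = unicodedata.category(ch)
        if category.startswith("L"):  # any letter extends the current word
            word.append(ch)
            continue
        if word:  # a non-letter closes the current word
            tokens.append(("Token", "word", "".join(word)))
            word = []
        if category == "Po":    # OTHER_PUNCTUATION, e.g. "/"
            tokens.append(("SpaceToken", "other", ch))
        elif category == "Zs":  # SPACE_SEPARATOR
            tokens.append(("SpaceToken", "space", ch))
        elif category == "Pc":  # CONNECTOR_PUNCTUATION, e.g. "_"
            tokens.append(("Token", "punctuation", ch))
        else:                   # everything else is out of scope here
            tokens.append(("Token", "unhandled", ch))
    if word:
        tokens.append(("Token", "word", "".join(word)))
    return tokens


print(sketch_tokenize("ABC/ CD"))
```

For “ABC/ CD” this yields a word, a SpaceToken of kind “other” for the slash, a SpaceToken of kind “space”, and a second word, matching the output shown above.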


Written by anujjaiswal

April 19, 2011 at 9:15 pm

Posted in NLP
