The results and ramblings of research


Tokenizing Text for GATE


After much headache, I have noticed that GATE isn't really helping with my named entity (NE) tasks on social media. I looked into the outputs and found that tokenization is probably one of the reasons I am unable to extract entities correctly. First, text such as:

ABC/ CD is tokenized as WORD{ABC/} SPACETOKEN{ } WORD{CD}. Notice that the punctuation “/” has been rolled into the word itself. So I made some changes to the default tokenizer and added a SpaceToken of kind “other”. To do this, open “$GATE_HOME/plugins/ANNIE/tokeniser/DefaultTokeniser.rules” and add the following after the block “#whitespace#”:

(OTHER_PUNCTUATION) >SpaceToken;kind=other;
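
To see why a rule keyed on OTHER_PUNCTUATION catches the “/”, here is a small Python sketch. It is not GATE's tokenizer and does not reproduce GATE's output; it only assumes that the rule names correspond to Unicode general categories (OTHER_PUNCTUATION being “Po”), and the function name sketch_tokenize and the token labels are made up for illustration.

```python
import unicodedata


def sketch_tokenize(text):
    """Illustrative only: split text into (kind, string) pairs by Unicode category.

    Runs of letters are merged into one "word" token; every other character
    becomes its own token. The labels are hypothetical, not GATE annotation types.
    """
    tokens = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith("L"):      # any letter category (Lu, Ll, ...)
            kind = "word"
        elif cat == "Zs":            # space separator
            kind = "space"
        elif cat == "Po":            # "other punctuation", e.g. "/"
            kind = "other_punctuation"
        else:
            kind = "other"

        # Extend the previous token only when both it and this char are letters.
        if kind == "word" and tokens and tokens[-1][0] == "word":
            tokens[-1] = ("word", tokens[-1][1] + ch)
        else:
            tokens.append((kind, ch))
    return tokens


if __name__ == "__main__":
    for kind, value in sketch_tokenize("ABC/ CD"):
        print(kind, repr(value))
```

In this sketch the “/” comes out as its own token instead of being glued onto ABC, which is the kind of split the rule change is after.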

And change the following in the block #punctuation#

And now GATE tokenizes the string as


Written by anujjaiswal

April 19, 2011 at 9:15 pm

Posted in NLP
