Tokenizing Text for GATE

After much headache, I have noticed that GATE isn't really helping with my named-entity (NE) tasks on social media. I looked into the outputs and found that tokenization is probably one of the reasons I am unable to extract entities correctly. First, text such as:

ABC/ CD is tokenized as WORD{ABC/} SPACETOKEN{ } WORD{CD}. Notice that the punctuation “/” has been rolled into the word itself. So I made some changes to the default tokenizer and added a SpaceToken of kind “other”. To do this, open “$GATE_HOME/plugins/ANNIE/tokeniser/DefaultTokeniser.rules” and add the following after the block “#whitespace#”:

(OTHER_PUNCTUATION) >SpaceToken;kind=other;
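To make the intent of this rule a bit more concrete, here is a rough Python sketch of the behaviour I am after. It is not GATE code and does not reproduce the tokeniser's actual matching logic; the function name `sketch_tokenize` and the tuple layout are my own inventions for illustration. The only assumption carried over from GATE is that OTHER_PUNCTUATION corresponds to the Unicode general category "Po" and ordinary spaces to "Zs".

```python
import unicodedata


def sketch_tokenize(text):
    """Split text into (annotation type, kind, string) triples.

    Hypothetical illustration only: characters in Unicode category Po
    (other punctuation) become a SpaceToken with kind "other", spaces (Zs)
    become a SpaceToken with kind "space", and everything else is
    accumulated into word tokens.
    """
    tokens = []
    word = ""
    for ch in text:
        category = unicodedata.category(ch)
        if category in ("Po", "Zs"):
            # Flush the word collected so far before emitting the separator.
            if word:
                tokens.append(("Token", "word", word))
                word = ""
            kind = "other" if category == "Po" else "space"
            tokens.append(("SpaceToken", kind, ch))
        else:
            word += ch
    # Flush any trailing word.
    if word:
        tokens.append(("Token", "word", word))
    return tokens


if __name__ == "__main__":
    for token in sketch_tokenize("ABC/ CD"):
        print(token)
```

In this sketch the “/” is split off from “ABC” and emitted as its own separator instead of being glued onto the word, which is the effect the rule above is meant to achieve. Whether GATE produces exactly the same annotations depends on the rest of the rules file.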

And change the following in the block #punctuation#

And now GATE tokenizes the string as


Written by anujjaiswal

April 19, 2011 at 9:15 pm

Posted in NLP

