
Removing Out of Memory Errors in GATE


On running a batch ANNIE job to extract named entities, I ran into OutOfMemory errors after about 10-15K documents. The problem turned out to be that GATE resources were being reloaded over and over instead of being loaded once, statically. My solution is to create a GATE application singleton and load all processing resources statically. To do so I created the following two classes:

  • The first class, called GATEExtractor, initializes GATE once as a singleton (the original listing was missing the class declaration and imports, and called Gate's static methods through an instance):
import gate.Gate;
import gate.util.GateException;
import java.io.File;
import java.io.IOException;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class GATEExtractor {
private static GATEExtractor instance = null;
private static Logger logger = Logger.getRootLogger();
protected GATEExtractor() throws GateException, IOException{
logger.info("Initializing GATE");
logger.setLevel(Level.FATAL);
Gate.init(); // Gate's methods are static; no instance is needed
logger.info("Loading ANNIE Plugin");
File gateHome = Gate.getGateHome();
File pluginsHome = new File(gateHome, "plugins");
Gate.getCreoleRegister().registerDirectories(new File(pluginsHome, "ANNIE").toURI().toURL());
Gate.getCreoleRegister().registerDirectories(new File(pluginsHome, "Tools").toURI().toURL());
Gate.getCreoleRegister().registerDirectories(new File(pluginsHome, "Annotation_Merging").toURI().toURL());
logger.info("Done Initializing GATE");
}
public static synchronized GATEExtractor getInstance() throws GateException, IOException{
if(instance==null){
logger.info("Creating Gate Loader");
instance = new GATEExtractor();
}
logger.info("Returning Gate Loader Instance");
return instance;
}
}
  • And the second class, which initializes ANNIE and lets me send documents for processing:
import gate.*;
import gate.creole.*;
import gate.util.GateException;
import org.apache.log4j.Logger;

public class ANNIEExtractor {
private static RealtimeCorpusController annieController;
private static Logger logger = Logger.getRootLogger();
private static ProcessingResource annotpr = null;
private static ProcessingResource tokeniser =null;
private static ProcessingResource bgazetteer =null;
private static ProcessingResource split = null;
private static ProcessingResource postagger = null;
private static ProcessingResource transducer = null;
private static ProcessingResource orthoMatcher = null;
public void initAnnie() throws GateException {
logger.info("Initialising ANNIE...");
annieController = (RealtimeCorpusController)
Factory.createResource("gate.creole.RealtimeCorpusController", Factory.newFeatureMap(), Factory.newFeatureMap(), "ANNIE_" + Gate.genSym());
annotpr = (ProcessingResource)Factory.createResource("gate.creole.annotdelete.AnnotationDeletePR", Factory.newFeatureMap());
annieController.add(annotpr);
tokeniser = (ProcessingResource)Factory.createResource("gate.creole.tokeniser.DefaultTokeniser", Factory.newFeatureMap());
annieController.add(tokeniser);
FeatureMap gmap = Factory.newFeatureMap();
gmap.put("wholeWordsOnly", true);
gmap.put("longestMatchOnly", true);
bgazetteer = (ProcessingResource)Factory.createResource("gate.creole.gazetteer.DefaultGazetteer", gmap);
annieController.add(bgazetteer);
split = (ProcessingResource)Factory.createResource("gate.creole.splitter.SentenceSplitter",Factory.newFeatureMap());
annieController.add(split);
postagger = (ProcessingResource)Factory.createResource("gate.creole.POSTagger",Factory.newFeatureMap());
annieController.add(postagger);
ProcessingResource morpho = (ProcessingResource)Factory.createResource("gate.creole.morph.Morph",Factory.newFeatureMap());
annieController.add(morpho);
transducer = (ProcessingResource)Factory.createResource("gate.creole.ANNIETransducer",Factory.newFeatureMap());
annieController.add(transducer);
orthoMatcher = (ProcessingResource)Factory.createResource("gate.creole.orthomatcher.OrthoMatcher", Factory.newFeatureMap());
annieController.add(orthoMatcher);
logger.info("...ANNIE loaded");
} // initAnnie()
/**
* Set the current corpus (Run execute following this)
* @param corpus Corpus to be processed
*/
public void setCorpus(Corpus corpus) {
annieController.setCorpus(corpus);
} // setCorpus
/**
* Run ANNIE
*/
public void execute() throws GateException {
annieController.execute();
} // execute()
/**
* Unloads Resources
*/
public void cleanUp(){
Corpus corp = annieController.getCorpus();
// Remove documents from the front of the corpus so none are skipped,
// then unload and delete each one to free its memory.
while(!corp.isEmpty()){
Document doc = (Document) corp.remove(0);
corp.unloadDocument(doc);
Factory.deleteResource(doc);
}
}
}

Now I can run batch processing without OutOfMemory errors. To use the ANNIEExtractor, first initialize it, set the corpus, and then execute. To get named entities you will need to implement the code for extracting annotations.
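The annotation-extraction step mentioned above is left to the reader in this post; a minimal sketch might look like the following. This is an illustration, not code from the post: it assumes the standard ANNIE annotation types ("Person", "Location", "Organization") in the default annotation set.

```java
// Sketch: printing named entities from a document after execute() has run.
// Assumes the default ANNIE annotation set and its standard entity types.
import gate.Annotation;
import gate.AnnotationSet;
import gate.Document;
import gate.util.InvalidOffsetException;

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class AnnotationPrinter {
    public static void printEntities(Document doc) throws InvalidOffsetException {
        Set<String> types = new HashSet<String>(
                Arrays.asList("Person", "Location", "Organization"));
        // Pull only the entity annotations we care about from the default set.
        AnnotationSet entities = doc.getAnnotations().get(types);
        for (Annotation a : entities) {
            // Recover the text span the annotation covers.
            String text = doc.getContent().getContent(
                    a.getStartNode().getOffset(),
                    a.getEndNode().getOffset()).toString();
            System.out.println(a.getType() + ": " + text);
        }
    }
}
```

Call printEntities(doc) on each document after execute() and before cleanUp(), since cleanUp() deletes the documents.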


Written by anujjaiswal

June 1, 2011 at 6:49 pm

Posted in GATE, NLP

11 Responses


  1. Hi,

    I’m seeing a similar out of memory problem as you describe after about 300K files. Two things:
    1. I’m loading my gate application using an application.xgapp file exported from gate developer. Not sure how that would make things different in your code.
    2. I’m not understanding your use of the Factory.deleteResource(corp); in the for loop that’s iterating over the corpus.

    Bill Gosse

    October 21, 2011 at 6:25 pm

    • Hi Bill,
      1. The xgapp file follows a similar process of loading processing resources. I prefer using the code I have since I can add or remove resources depending on my requirements. This helps GATE execution times, especially if I don't load very memory- or computation-intensive resources.
      2. I think you are talking about:
      public void cleanUp(){
      Corpus corp= annieController.getCorpus();
      if(!corp.isEmpty()){
      for(int i=0;i<corp.size();i++){
      Document doc1 = (Document)corp.remove(i);
      corp.unloadDocument(doc1);
      Factory.deleteResource(corp);
      Factory.deleteResource(doc1);
      }
      }
      }

      This piece of code removes all documents from GATE, thereby freeing RAM. I essentially run it after every 100 processed documents.
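      The every-100-documents pattern described here could be sketched roughly as follows. This is an illustrative driver, not code from the thread: GATEExtractor and ANNIEExtractor are the classes from the post, and loadNextDocument() is a hypothetical helper standing in for whatever input source you use.

```java
// Sketch of a batch driver: run ANNIE over documents and call cleanUp()
// every BATCH_SIZE documents so memory use stays flat.
import gate.Corpus;
import gate.Document;
import gate.Factory;

public class BatchDriver {
    private static final int BATCH_SIZE = 100;

    public static void run() throws Exception {
        GATEExtractor.getInstance();          // one-time static GATE init
        ANNIEExtractor annie = new ANNIEExtractor();
        annie.initAnnie();

        Corpus corpus = Factory.newCorpus("batch");
        Document doc;
        while ((doc = loadNextDocument()) != null) {
            corpus.add(doc);
            if (corpus.size() == BATCH_SIZE) {
                annie.setCorpus(corpus);
                annie.execute();
                // ... extract annotations here ...
                annie.cleanUp();              // unloads and deletes the batch
            }
        }
        if (!corpus.isEmpty()) {              // flush the final partial batch
            annie.setCorpus(corpus);
            annie.execute();
            annie.cleanUp();
        }
    }

    // Hypothetical input source; replace with real document loading.
    private static Document loadNextDocument() { return null; }
}
```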

      I have daemon code that runs the two classes, and it has been working fine for the last 30 days.

      anujjaiswal

      October 22, 2011 at 12:15 am

      • When you say that using the xgapp follows a similar loading process, do you mean it loads the resources statically like you are doing?

        Bill Gosse

        October 23, 2011 at 3:35 am

      • Hi Bill

        I am unsure of that. Based on their documentation for StandAloneAnnie.java (http://gate.ac.uk/wiki/code-repository/src/sheffield/examples/StandAloneAnnie.java), I would assume they don't. Based on my experience, I have learnt that:
        1) Loading PRs statically is obviously the right way, since you only load resources once.
        2) Unloading documents as soon as you are done with them (or after a limited set of docs) helps, because processed text is marked for garbage collection and doesn't use memory.

        If you are still encountering out of memory errors after unloading documents, and you are sure of the PRs you need, I would suggest you try loading them statically. Memory leaks (I know Java people don't like to call them that) suck in Java. Best of luck.

        Regards
        AJ

        anujjaiswal

        October 23, 2011 at 9:21 am

  2. Are you doing anything special with your JVM command-line arguments to optimize it for GATE?

    Bill Gosse

    October 27, 2011 at 11:08 pm

  3. Not in my current daemon. However, in a previous version, whose input was extremely large documents, I used the -Xmx memory setting.

    anujjaiswal

    October 27, 2011 at 11:11 pm

    • Btw Bill did any of this help?

      anujjaiswal

      November 10, 2011 at 2:59 am

      • Yes, I actually got my stuff to run without running out of memory.

        My GATE project turned out to be very successful.
        In a corpus of 2.5 million IT related articles we were able to identify 700,000 business contacts.

        Thanks again for all your help.

        Bill

        January 7, 2012 at 2:23 am

      • Bill,

        That is great to hear. It turns out GATE works really well once you figure out all its weird complexities. Just FYI, some further ideas to improve performance:
        1) GATE works great in a MapReduce/Hadoop environment as well. I have used it on over 50M short documents in under 8 hours of computing time. You can easily set up GATE using the setup() function in MapReduce.
        2) Sometimes it turns out that some corpora have their own consistent tokenization characteristics that are not captured by the default GATE tokenizer. For example, I use GATE to process tweets, where a large number of characters ('_', '@', etc.) are present. GATE's default tokenizer will tokenize a string such as @user_name into two strings, '@user' and 'name'. However, such behavior will inherently reduce NER performance, since '@user_name' should be tokenized as one string. Thus, modifying tokenizer rules is something that should be explored to improve NER performance. Simple example: https://anujjaiswal.wordpress.com/2011/04/19/tokenizing-text-for-gate/
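        The per-mapper initialization mentioned in point 1 could be sketched roughly like this. This is a hedged illustration, not code from this thread: the mapper class, key/value types, and what happens inside map() are assumptions; only the "initialize GATE once in setup()" idea comes from the comment above.

```java
// Sketch: initialize GATE and ANNIE once per mapper JVM via Hadoop's
// setup(), instead of once per record. GATEExtractor and ANNIEExtractor
// are the classes from the post.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import gate.util.GateException;

public class AnnieMapper extends Mapper<LongWritable, Text, Text, Text> {
    private ANNIEExtractor annie;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        try {
            GATEExtractor.getInstance();   // one-time static GATE init per JVM
            annie = new ANNIEExtractor();
            annie.initAnnie();
        } catch (GateException e) {
            throw new IOException("GATE initialization failed", e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Illustrative: build a one-document corpus from value, run
        // annie.execute(), emit extracted entities, then annie.cleanUp().
    }
}
```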

        Let me know, if you need simple code examples for these two scenarios.

        Cheers,
        AJ

        anujjaiswal

        January 7, 2012 at 3:06 am

  4. Hello Anuj,
    I am trying to setup GATE for Map/Reduce environment.
    I am stuck in deciding the input format for the Job.
    My input is a GATE-annotated XML file. So do I need to write my own custom GATECorpusInputFormat so that each mapper get a gate corpus?

    Bolaka

    June 29, 2012 at 5:31 am

