• Home
  • About
  • OpenMRS
  • Proposal
  • Timeline

OpenMRS: Record Linkage Project

Google Summer of Code™ 2007






Code Spotlight: Analyzing token frequencies

Jul 11th, 2007 by sarp

I’d like to read a text file and store the frequency of each word in it into a database. Memory is fast, database is slow. At the two extremes, we have:1) I don’t use any memory, for each word I read, I query the database to learn it’s frequency, and update it in the database.Problem: Very Inefficient2) Everything is stored in the memory, I keep a hashtable in which I store frequencies of words. After reading the whole file, I write all entries to the database.Problem: Very fast, but I may have so many words that they won’t fit into memory.Solution:I keep part of the words in a hashtable, for those entries that are not in the hashtable, I query the database.SubProblem: How do I decide if I’ll put a word in the hashtable or not? I’d like to store the most frequent words in the hashtable so that I will minimize the number of queries to the database. But I don’t know frequencies of words in advance, they change as I read the file. My hashtable has to organize itself as I read the text file.Solution: If the next word I read is not in the hashtable, and is more frequent than the *least* frequent word in the hashtable, I put the new word into the table and remove the least frequent one. The reasoning is that if a word has high frequency, it is likely that I will see more of the same word.SubProblem: I need a way to efficiently determine the least frequent word in the hash table. Further more, after I replace the least frequent one, which word will be the least frequent one now? What about next round?Solution: Along with the hashtable, I also keep a data structure called PriorityQueue. If implemented using a heap, it provides insert, remove and findMin operations in O(logN) time.Here are some experimental results for analyzing one column of 10,000 records:

  • Without a lookup table, only using database: 16 seconds
  • With a lookup table of 250 records (covers 60% of data): 8 seconds
  • With a lookup table of 500 records (covers 80% of data): 5 seconds

Any ideas or comments on this approach will be appreciated :)

Posted in code reviews, open-source, Patient Matching, SoC, OpenMRS, Intern, Public | 2 Comments

2 Responses to “Code Spotlight: Analyzing token frequencies”

  1. on 13 Jul 2007 at 1:24 am1Darius Jazayeri

    Hi Sarp,

    That’s probably a conversation worth having more broadly.

    For example when I was working on cohort builder I did some back-of-the-envelope math and decided that I’d be taking up 5MB of RAM, and that’s okay because it’s important enough.

    Yesterday Christian and Justin and I we were discussing the Find a Patient search widget, and whether we might want to build a list of tokens for that. We decided that was also important enough. :-)

    We probably need a less ad-hoc process for deciding “this feature is worth X megs of RAM”.

    -Darius

  2. on 20 Jul 2007 at 1:24 am2OpenMRS: Record Linkage Project » Blog Archive » Midterm update

    […] order to introduce weight scaling, we first analyze data sources (could be database or character delimited file) that will be used in linkage for token […]

  • Email Updates

    To receive email updates on new posts, click here.

  • About Me

    I am 22 years old, recently graduated from Computer Science and Engineering program of Sabanci University in Istanbul, TURKEY.

  • Recent Posts

    • Performance Test: Analyzing token frequencies
    • Midterm update
    • Code Spotlight: Analyzing token frequencies
    • Phase3 Complete
    • Phase2 Complete
    • Phase1 Complete
  • Archives

    • July 2007 (5)
    • June 2007 (4)
    • May 2007 (3)
    • April 2007 (1)
  • Categories

    • book (1)
    • code reviews (2)
    • Intern (8)
    • open-source (4)
    • OpenMRS (8)
    • Patient Matching (8)
    • Private (7)
    • Public (6)
    • SoC (8)
  • Pages

    • About
    • OpenMRS
    • Proposal
    • Timeline
Powered by Wordpress |
Feed on
Posts
Comments