
OpenMRS: Record Linkage Project

Google Summer of Code™ 2007






Performance Test: Analyzing token frequencies

Jul 23rd, 2007 by sarp

For analyzing token frequencies, I have implemented two different designs.

Design A: For different data sources, create specific data readers that convert your data into Records. For frequency analysis, operate on the abstraction of Record.

Advantage:

  • Easy to maintain, implement once and for all
  • Reduces complexity of the project

Disadvantage:

  • Poor performance
  • High memory requirement: since the analysis runs record by record instead of column by column, you need to keep frequency data for ALL columns at the same time
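
To make Design A concrete, here is a minimal sketch in Python (not the project's actual code). The `Record` type, `read_delimited` and `analyze_records` are hypothetical names chosen for illustration. It shows why the memory cost appears: a single pass over Records has to keep one counter per column alive until the pass ends.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

# Hypothetical stand-in for the Record abstraction:
# a mapping from column name to the token found in that column.
Record = Dict[str, str]


def read_delimited(lines: Iterable[str], columns: List[str], sep: str = "|") -> Iterator[Record]:
    """Hypothetical data reader: converts delimited lines into Records."""
    for line in lines:
        yield dict(zip(columns, line.rstrip("\n").split(sep)))


def analyze_records(records: Iterable[Record]) -> Dict[str, Counter]:
    """Source-independent analyzer that works only on Records.

    Every Record carries all columns, so a Counter for each column
    must stay in memory for the whole pass.
    """
    frequencies: Dict[str, Counter] = {}
    for record in records:
        for column, token in record.items():
            frequencies.setdefault(column, Counter())[token] += 1
    return frequencies


lines = ["john|smith", "mary|smith", "john|doe"]
freqs = analyze_records(read_delimited(lines, ["first", "last"]))
print(freqs["first"])
print(freqs["last"])
```

Supporting a new data source in this design only requires another reader that yields Records; `analyze_records` stays untouched.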

Design B: For each type of data source (database, character delimited file etc.) we implement a specific analyzer in addition to a data reader.

Advantage:

  • Better performance for some data sources, since you are operating on a lower level of abstraction
  • Possibility of doing analysis column by column, the way it is supposed to be

Disadvantage:

  • Each data source needs a different analyzer, increases complexity
  • Could be tiresome to implement
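
As an illustration of Design B, the sketch below is a source-specific analyzer for a character delimited file, again in Python and again with hypothetical names (`analyze_delimited_column`, the `|` separator, the sample data). It skips the Record abstraction, reads raw rows, and counts one column per pass, so only one counter exists at a time.

```python
import csv
import io
from collections import Counter
from typing import Callable, TextIO


def analyze_delimited_column(open_source: Callable[[], TextIO], column_index: int, sep: str = "|") -> Counter:
    """Hypothetical analyzer specific to character delimited sources.

    Reads the raw rows directly and counts the tokens of a single column.
    """
    counts: Counter = Counter()
    with open_source() as source:
        for row in csv.reader(source, delimiter=sep):
            if row:  # skip blank lines
                counts[row[column_index]] += 1
    return counts


data = "john|smith\nmary|smith\njohn|doe\n"

# Column by column: each column gets its own pass over the source.
for index in range(2):
    print(index, analyze_delimited_column(lambda: io.StringIO(data), index))
```

The trade-off is visible here as well: this function knows about separators and column positions, so a database or any other source would need its own analyzer.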

The performance difference becomes apparent when analyzing databases. Instead of loading our data into memory, processing it and writing it back, we can just ask MySQL to tell us the token frequencies with a query, and store this information for later use. Below is the result for a data source of 5000 records (fileA_5000) consisting of 19 columns. Token frequency analysis of one column takes:

Design B: 1843 milliseconds
Design A: 4035 milliseconds

Design A does not improve much even if we reduce the column count to one. This could mean that the execution time consists mostly of overhead from iterating over all records and the numerous function calls that brings.

Last word:

Even though there is a significant performance difference for database data, I wouldn’t argue passionately for design B. This decision depends on how many different data sources we are planning to support, their nature (whether a performance gain is possible or not) and whether execution time stays within reasonable limits in design A or not.
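
For the database case discussed above, asking the database for token frequencies can be a single `GROUP BY` query per column. The sketch below is an assumption-laden illustration, not the query used in the test: it runs against an in-memory SQLite database as a stand-in for MySQL so that it is self-contained, and the `patients` table, its columns and the `column_token_frequencies` helper are all hypothetical.

```python
import sqlite3
from typing import Dict


def column_token_frequencies(conn: sqlite3.Connection, table: str, column: str) -> Dict[str, int]:
    """Let the database count the tokens of one column with GROUP BY.

    Identifiers cannot be bound as query parameters, so the table and
    column names are assumed to come from trusted configuration.
    """
    query = f'SELECT "{column}", COUNT(*) FROM "{table}" GROUP BY "{column}"'
    return {token: count for token, count in conn.execute(query)}


# Hypothetical sample data standing in for a real patient table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (first TEXT, last TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [("john", "smith"), ("mary", "smith"), ("john", "doe")],
)

print(column_token_frequencies(conn, "patients", "last"))
```

With MySQL the query shape would be the same, although identifiers are usually quoted with backticks there. The counting happens inside the database engine, so no records need to be loaded into the application first.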


2 Responses to “Performance Test: Analyzing token frequencies”

  1. on 10 Oct 2007 at 11:53 pm, Andre

    Hello. I’ve been working with Record Linkage for some time and I’m building a framework (which I’ve already been using).
    Maybe we could exchange ideas, and maybe merging the projects could be a good idea too.

    Hope you answer this soon

    André

  2. on 11 Oct 2007 at 4:41 am, sarp

    It would be interesting to see what collaborations are possible. Please contact Shaun Grannis, who is leading this project. You can find his contact details here:
    http://www.regenstrief.org/bio/full?member=sgrannis


  • About Me

    I am 22 years old and recently graduated from the Computer Science and Engineering program of Sabanci University in Istanbul, Turkey.
