OpenMRS: Record Linkage Project

Google Summer of Code™ 2007

Agenda for the week

May 30th, 2007 by sarp

After receiving a great guideline from James on how to implement weight scaling, I examined part of the existing code today. In the analytic phase, for each field, we need to calculate:

(1) number of unique values
(2) frequency of each unique value
(3) total records

These calculated values should be stored in the database for future analysis.

My idea for the data structure to hold these values, ScaleWeightData, is:

- an integer array indexed by column id, one entry per field, for (1) and (3)
- a hash table indexed by token that contains the K most frequent tokens for (2), with database lookups for tokens that are not found in the hash table (for large data sets)

A few questions as usual:

1) Scalability: What magnitude of scalability are we trying to achieve? Will scalability be a major design concern right from the beginning, or are we trying to make it work first and worry about scalability later? Should I worry about a real system with millions of patient records that may not fit into memory?

2) Null values: How do they affect our statistical analysis? Do we ignore them when calculating the total number of records?
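
To make the ScaleWeightData idea above a bit more concrete, here is a minimal sketch in Python (the project itself is Java, so this is only an illustration). The method names, the in-memory dictionary standing in for the database table, and the choice to count every value as-is (the null question is still open) are all assumptions of mine, not existing code.

```python
from collections import Counter


class ScaleWeightData:
    """Per-field statistics for weight scaling (illustrative sketch only)."""

    def __init__(self, records, num_columns, k):
        self.unique_counts = []   # (1) number of unique values, indexed by column id
        self.total_records = []   # (3) total records, indexed by column id
        self.top_k = []           # (2) hash table of the K most frequent tokens per column
        self._store = []          # hypothetical stand-in for the database table

        for col in range(num_columns):
            counts = Counter(record[col] for record in records)
            self.unique_counts.append(len(counts))
            self.total_records.append(sum(counts.values()))
            self.top_k.append(dict(counts.most_common(k)))
            self._store.append(dict(counts))

    def frequency(self, col, token):
        """Look in the hash table first, then fall back to the 'database'."""
        cached = self.top_k[col].get(token)
        if cached is not None:
            return cached
        return self._store[col].get(token, 0)


records = [
    ("smith", "john"),
    ("smith", "mary"),
    ("jones", "john"),
    ("smith", "ann"),
]
data = ScaleWeightData(records, num_columns=2, k=1)
print(data.unique_counts, data.total_records, data.frequency(0, "jones"))
```

With k=1 only "smith" is kept in memory for the first column, so the lookup for "jones" goes through the fallback path. In a real system that fallback would be a database query rather than a dictionary.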


2 Responses to “Agenda for the week”

  1. Shaun Grannis, on 30 May 2007 at 2:32 am

    Response to question 1) Scalability: Given that it is still uncertain how end-users will specifically implement scaling, we’ll need flexibility to limit the number of entries in the data structure. Options that immediately come to mind include:
    a) Store all token frequencies
    b) Limit by percent of total unique tokens, eg, “store the 10% most frequent tokens”, or “store the 10% least frequent tokens”
    c) Limit by absolute number of tokens, eg, “store 1,000 most frequent tokens”, or “store 3,000 least frequent tokens”
    d) Conditional on token frequency: store all tokens with frequency above ‘n’, eg, ‘store all tokens with frequency above 100’, or ‘store all tokens with frequency below 100’
    e) May need to be able to combine the above criteria, eg, ‘store top 1,000 tokens with frequency greater than 2,000’

    Will get to 2) later… Thanks!
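
The retention options a) to e) in the response above could be sketched as a single filter function. This is only an illustration in Python: the function name `select_tokens` and all of its parameters are hypothetical, and "above"/"below" are read here as strict comparisons, which is an assumption.

```python
from collections import Counter


def select_tokens(counts, top_n=None, top_percent=None,
                  min_freq=None, max_freq=None, least_frequent=False):
    """Return the token frequencies to keep, as a dict.

    With no limits given, everything is kept (option a).
    top_percent is a percentage of the total number of unique tokens (option b).
    top_n is an absolute cap (option c).
    min_freq / max_freq keep frequencies strictly above / below a bound (option d).
    The limits can be combined (option e).
    """
    # Option d: filter on frequency first.
    items = [
        (token, freq) for token, freq in counts.items()
        if (min_freq is None or freq > min_freq)
        and (max_freq is None or freq < max_freq)
    ]

    # Order by frequency; ties are broken by token so the result is deterministic.
    if least_frequent:
        items.sort(key=lambda tf: (tf[1], tf[0]))
    else:
        items.sort(key=lambda tf: (-tf[1], tf[0]))

    # Options b and c: cap the number of entries.
    limit = len(items)
    if top_percent is not None:
        limit = min(limit, int(len(counts) * top_percent / 100))
    if top_n is not None:
        limit = min(limit, top_n)
    return dict(items[:limit])


# Ten tokens t0..t9 with frequencies 10..1.
counts = Counter({f"t{i}": 10 - i for i in range(10)})
print(select_tokens(counts, top_percent=10))        # the 10% most frequent tokens
print(select_tokens(counts, top_n=2, min_freq=5))   # top 2 tokens with frequency above 5
```

Keeping the criteria as independent optional parameters makes the combined case e) fall out naturally, since each limit just narrows the result further.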

  2. Shaun Grannis, on 30 May 2007 at 7:14 am

    Regarding NULLs:
    For frequency scaling, I anticipate that we will ignore nulls in most cases. However, because there is sure to be variation in how other users want to treat nulls, we want to capture sufficient information on null frequencies for those who may want to use them. That information includes: a) total number of non-null tokens, b) total number of null tokens, c) total number of unique non-null tokens.

    From the OpenMRS matching design pages, there should be the following options pertaining to nulls contained in the config file (talk to James to see “if” and “how” these are implemented):

    - A flag indicating whether to use null tokens when scaling agreement weight based on term frequency (default: no)
    - A flag indicating how to establish agreement among fields when one or both fields are null (eg, apply disagreement weight, apply agreement weight, or apply zero weight) (default: apply zero weight)
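
As a rough illustration of the null handling described in the second response, the sketch below collects the three null-related counts and applies the two flags. It is written in Python with hypothetical names (`null_stats`, `scaling_total`, `field_weight`, `use_nulls`, `null_policy`); whether and how the real config file implements these options is not confirmed here. It also assumes nulls are represented as `None`.

```python
def null_stats(values):
    """Collect the null-related counts for one field."""
    non_null = [v for v in values if v is not None]
    return {
        "non_null_tokens": len(non_null),              # a) total non-null tokens
        "null_tokens": len(values) - len(non_null),    # b) total null tokens
        "unique_non_null_tokens": len(set(non_null)),  # c) unique non-null tokens
    }


def scaling_total(stats, use_nulls=False):
    """Total token count used for frequency scaling (nulls ignored by default)."""
    total = stats["non_null_tokens"]
    if use_nulls:
        total += stats["null_tokens"]
    return total


def field_weight(a, b, agree_w, disagree_w, null_policy="zero"):
    """Weight for one field pair; null_policy is 'zero', 'agree' or 'disagree'."""
    if a is None or b is None:
        if null_policy == "zero":
            return 0.0
        if null_policy == "agree":
            return agree_w
        if null_policy == "disagree":
            return disagree_w
        raise ValueError(f"unknown null_policy: {null_policy!r}")
    return agree_w if a == b else disagree_w


stats = null_stats(["smith", None, "jones", "smith", None])
print(stats, scaling_total(stats), field_weight("smith", None, 2.0, -1.0))
```

The defaults mirror the ones listed above: nulls are left out of the scaling total, and a null on either side contributes zero weight.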

  • About Me

    I am 22 years old and recently graduated from the Computer Science and Engineering program of Sabanci University in Istanbul, Turkey.
