Agenda for the week
May 30th, 2007 by sarp
After receiving a great guideline on how to implement weight scaling from James, I examined part of the existing code today. In the analytic phase, for each field, we need to calculate:
(1) number of unique values
(2) frequency of each unique value
(3) total records
These calculated values should be stored in the database for future analysis.
My idea for the data structure to hold these values, ScaleWeightData, is:
- integer array indexed by column id for each field regarding (1) and (3)
- hash table indexed by token that contains most frequent K records for (2), with database lookups for tokens that are not found in the hash table (for large data sets)
A few questions as usual:
1) Scalability: What magnitude of scalability are we trying to achieve? Will scalability be a major design concern right from the beginning or are we trying to make it work first, and later worry about scalability? Should I worry about a real system with millions of patient records that may not fit into memory?
2) Null values: How do they affect our statistical analysis? Do we ignore them when calculating total number of records?
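To make the proposed structure concrete, here is a minimal Python sketch of ScaleWeightData as described in the post. Everything beyond the three statistics and the top-K-plus-lookup idea is an assumption for illustration: the `analyze` function, the `db_lookup` callable (standing in for a real database query), the use of dicts keyed by field name rather than integer arrays indexed by column id, and the choice to count null records in the total while leaving them out of the frequencies (the post leaves that question open).

```python
from collections import Counter


class ScaleWeightData:
    """Hypothetical holder for the per-field statistics listed in the post."""

    def __init__(self, fields, top_k, db_lookup):
        self.top_k = top_k
        # db_lookup(field, token) -> int stands in for a database query.
        self.db_lookup = db_lookup
        self.unique_values = {f: 0 for f in fields}    # (1)
        self.top_frequencies = {f: {} for f in fields}  # (2), top K only
        self.total_records = {f: 0 for f in fields}    # (3)

    def frequency(self, field, token):
        # Try the in-memory hash table first, then fall back to the lookup.
        cached = self.top_frequencies[field].get(token)
        if cached is not None:
            return cached
        return self.db_lookup(field, token)


def analyze(records, fields, top_k):
    """Single pass over records (dicts of field -> token, None = null)."""
    # In this sketch the full counts stay in memory and play the role
    # of the database; a real system would persist them instead.
    full = {f: Counter() for f in fields}
    data = ScaleWeightData(fields, top_k, lambda f, t: full[f].get(t, 0))
    for record in records:
        for f in fields:
            data.total_records[f] += 1  # assumption: nulls count as records
            token = record.get(f)
            if token is not None:       # assumption: nulls are not tokens
                full[f][token] += 1
    for f in fields:
        data.unique_values[f] = len(full[f])
        data.top_frequencies[f] = dict(full[f].most_common(top_k))
    return data
```

With `top_k` small, only the most frequent tokens stay in the hash table and rarer tokens go through the lookup, which is the trade-off the scalability question is about.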
2 Responses to “Agenda for the week”
Response to question 1) Scalability: Given that it is still uncertain how end-users will specifically implement scaling, we’ll need flexibility to limit the number of entries in the data structure. Options that immediately come to mind include:
a) Store all token frequencies
b) Limit by percent of total unique tokens, e.g., “store the 10% most frequent tokens” or “store the 10% least frequent tokens”
c) Limit by absolute number of tokens, e.g., “store the 1,000 most frequent tokens” or “store the 3,000 least frequent tokens”
d) Conditional on token frequency: store all tokens with frequency above (or below) a threshold n, e.g., “store all tokens with frequency above 100” or “store all tokens with frequency below 100”
e) We may need to be able to combine the above criteria, e.g., “store the top 1,000 tokens with frequency greater than 2,000”
Will get to 2) later… Thanks!
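Options a) through e) above could be expressed as one filter over a token frequency table. The following is only a sketch under stated assumptions: the function name `select_tokens` and all of its parameters are hypothetical, "above" and "below" are read as strict comparisons, the percent limit is taken against the total number of unique tokens, and ties in frequency are broken alphabetically to keep the result deterministic.

```python
import math


def select_tokens(freqs, percent=None, limit=None,
                  min_freq=None, max_freq=None, most_frequent=True):
    """Return the subset of freqs (token -> count) that should be stored.

    With no arguments every token is kept (option a). The other
    arguments map to options b) through d) and can be combined (e).
    """
    # d) frequency condition: strictly above min_freq, strictly below max_freq
    items = [(token, n) for token, n in freqs.items()
             if (min_freq is None or n > min_freq)
             and (max_freq is None or n < max_freq)]

    # Order by frequency, most or least frequent first; ties by token.
    if most_frequent:
        items.sort(key=lambda tn: (-tn[1], tn[0]))
    else:
        items.sort(key=lambda tn: (tn[1], tn[0]))

    keep = len(items)
    if percent is not None:
        # b) percent of the total number of unique tokens
        keep = min(keep, math.ceil(len(freqs) * percent / 100))
    if limit is not None:
        # c) absolute number of tokens
        keep = min(keep, limit)
    return dict(items[:keep])
```

For example, `select_tokens(freqs, limit=1000, min_freq=2000)` would correspond to the combined criterion in e).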
Regarding NULLs:
For frequency scaling, I anticipate that we will ignore nulls in most cases. However, because there is sure to be variation in how other users want to treat nulls, we want to capture sufficient information on null frequencies for those who may want to use them. That information includes: a) total number of non-null tokens, b) total number of null tokens, c) total number of unique non-null tokens.
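As a small illustration of the three null-related counts, the sketch below computes them for one field. The function name `null_statistics` and the dictionary keys are made up for this example, and it assumes a null is represented as `None` (empty strings are not treated as null here).

```python
def null_statistics(tokens):
    """Compute null-related counts for one field's list of tokens."""
    non_null = [t for t in tokens if t is not None]
    return {
        "non_null_tokens": len(non_null),                # a)
        "null_tokens": len(tokens) - len(non_null),      # b)
        "unique_non_null_tokens": len(set(non_null)),    # c)
    }
```

Storing all three lets a later analysis decide whether to include nulls in the denominator without rescanning the data.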
From the OpenMRS matching design pages, there should be the following options pertaining to nulls contained in the config file (talk to James to see “if” and “how” these are implemented):
- A flag indicating whether to use null tokens when scaling agreement weight based on term frequency (default: no)
- A flag indicating how to establish agreement among fields when one or both fields are null (e.g., apply disagreement weight, apply agreement weight, or apply zero weight) (default: apply zero weight)
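One way to picture these two flags is the sketch below. It is not the actual OpenMRS config format, which would need to be confirmed with James: the names `NullConfig`, `NullAgreement`, and `field_weight` are hypothetical, and exact string equality is used as a stand-in for whatever comparison the matcher really performs.

```python
from dataclasses import dataclass
from enum import Enum


class NullAgreement(Enum):
    DISAGREEMENT = "apply disagreement weight"
    AGREEMENT = "apply agreement weight"
    ZERO = "apply zero weight"


@dataclass
class NullConfig:
    # Whether null tokens take part in term-frequency scaling (default: no).
    # Carried here for completeness; field_weight below does not scale.
    use_null_tokens_when_scaling: bool = False
    # How to score a field when one or both values are null.
    null_agreement: NullAgreement = NullAgreement.ZERO


def field_weight(a, b, agreement_weight, disagreement_weight, config=None):
    """Pick the weight for one field comparison, honoring the null flag."""
    config = config or NullConfig()
    if a is None or b is None:
        if config.null_agreement is NullAgreement.AGREEMENT:
            return agreement_weight
        if config.null_agreement is NullAgreement.DISAGREEMENT:
            return disagreement_weight
        return 0.0
    # Assumption: exact equality decides agreement in this sketch.
    return agreement_weight if a == b else disagreement_weight
```

With the defaults, a comparison involving a null contributes nothing to the total score, matching the "apply zero weight" default listed above.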