<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>OpenMRS: Record Linkage Project</title>
	<atom:link href="http://soc.sarpcentel.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://soc.sarpcentel.com</link>
	<description>Google Summer of Code™ 2007</description>
	<pubDate>Fri, 11 Jan 2008 14:16:18 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6</generator>
	<language>en</language>
			<item>
		<title>Performance Test: Analyzing token frequencies</title>
		<link>http://soc.sarpcentel.com/2007/07/23/performance-test-analyzing-token-frequencies/</link>
		<comments>http://soc.sarpcentel.com/2007/07/23/performance-test-analyzing-token-frequencies/#comments</comments>
		<pubDate>Sun, 22 Jul 2007 22:16:32 +0000</pubDate>
		<dc:creator>sarp</dc:creator>
		
		<category><![CDATA[]]></category>

		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/07/23/performance-test-analyzing-token-frequencies/</guid>
		<description><![CDATA[For analyzing token frequencies, I have implemented two different designs. Design A: For different data sources, create specific data readers that convert your data into Records. For frequency analysis, operate on the abstraction of Record. Advantage:

Easy to maintain, implement once and for all
Reduces complexity of the project

Disadvantage:

Poor performance
High memory requirement, since you are doing analysis record by [...]]]></description>
			<content:encoded><![CDATA[<p>For analyzing token frequencies, I have implemented two different designs.</p>
<p><strong>Design A:</strong> For different data sources, create specific data readers that convert your data into Records. For frequency analysis, operate on the abstraction of Record.</p>
<p><u>Advantage:</u>
<ul>
<li>Easy to maintain, implement once and for all</li>
<li>Reduces complexity of the project</li>
</ul>
<p><u>Disadvantage:</u>
<ul>
<li>Poor performance</li>
<li>High memory requirement: because the analysis proceeds record by record instead of column by column, frequency data for ALL columns has to be kept at the same time</li>
</ul>
<p><strong>Design B:</strong> For each type of data source (database, character-delimited file, etc.) we implement a specific analyzer in addition to a data reader.</p>
<p><u>Advantage:</u>
<ul>
<li>Better performance for some data sources, since you are operating on a lower level of abstraction</li>
<li>Possibility of doing analysis column by column, the way it is supposed to be</li>
</ul>
<p><u>Disadvantage:</u>
<ul>
<li>Each data source needs a different analyzer, which increases complexity</li>
<li>Could be tiresome to implement</li>
</ul>
<p>The performance difference becomes apparent when analyzing databases. Instead of loading our data into memory, processing it and writing it back, we can just ask MySQL for the token frequencies with a query, and store this information for later use. Below is the result for a data source of 5000 records (fileA_5000) consisting of 19 columns. Token frequency analysis of one column takes:
<ul>
<li><u>Design B:</u> <strong>1843</strong> milliseconds</li>
<li><u>Design A:</u> <strong>4035</strong> milliseconds</li>
</ul>
<p>Design A does not improve much even if we reduce the column count to one. This could mean that the execution time consists mostly of overhead from iterating over all records and the numerous function calls that brings.</p>
<p><u>Last word:</u> Even though there is a significant performance difference for database data, I wouldn&#8217;t argue passionately for Design B. This decision depends on how many different data sources we are planning to support, their nature (whether a performance gain is possible or not) and whether execution time stays within reasonable limits in Design A.</p>
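<p>A minimal sketch of the kind of query Design B can hand to the database so that counting happens on the database side. The class name, the method <code>buildFrequencyQuery</code>, and the table and column names are hypothetical and are not taken from the linkage code; the result would be run through an ordinary JDBC statement.</p>

```java
import java.util.regex.Pattern;

/** Sketch only: builds an aggregate query that lets the database count tokens. */
final class FrequencyQuerySketch {
    // Table and column names cannot be bound as JDBC parameters,
    // so this sketch only accepts plain identifiers.
    private static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z0-9_]*");

    /** Returns a query yielding one (token, frequency) row per distinct value of the column. */
    static String buildFrequencyQuery(String table, String column) {
        if (!IDENTIFIER.matcher(table).matches() || !IDENTIFIER.matcher(column).matches()) {
            throw new IllegalArgumentException("unsafe identifier");
        }
        return "SELECT " + column + " AS token, COUNT(*) AS frequency"
                + " FROM " + table
                + " GROUP BY " + column;
    }

    public static void main(String[] args) {
        // Hypothetical table and column names, for illustration only.
        System.out.println(buildFrequencyQuery("patients", "last_name"));
    }
}
```

<p>The point of the sketch is that the per-record loop of Design A disappears: one query per column replaces one pass over every Record.</p>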
]]></content:encoded>
			<wfw:commentRss>http://soc.sarpcentel.com/2007/07/23/performance-test-analyzing-token-frequencies/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Midterm update</title>
		<link>http://soc.sarpcentel.com/2007/07/20/midterm-update/</link>
		<comments>http://soc.sarpcentel.com/2007/07/20/midterm-update/#comments</comments>
		<pubDate>Thu, 19 Jul 2007 20:24:29 +0000</pubDate>
		<dc:creator>sarp</dc:creator>
		
		<category><![CDATA[]]></category>

		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/07/20/midterm-update/</guid>
		<description><![CDATA[As many of you know, we are working on a patient matching module for OpenMRS that will allow users to identify records that belong to the same patient among different data sources. In the first part of SoC, I&#8217;ve completed adding weight scaling functionality to the existing record linkage framework. Matching records are determined by assigning a [...]]]></description>
			<content:encoded><![CDATA[<p>As many of you know, we are working on a patient matching module for OpenMRS that will allow users to identify records that belong to the same patient among different data sources.</p>
<p>In the first part of SoC, I&#8217;ve completed adding weight scaling functionality to the existing record linkage framework.</p>
<p>Matching records are determined by assigning a score to each possible record pair. Weight scaling improves the accuracy of patient matching because fields that match on a common value, for instance James for first name, are scaled down and contribute less to the overall score for the given pair.</p>
<p>In order to introduce weight scaling, we first <a href="http://soc.sarpcentel.com/2007/07/11/code-spotlight-analyzing-token-frequencies">analyze</a> the data sources that will be used in linkage (a database or a character-delimited file) for token frequencies. We store this data in a relational database and use it later when calculating scores for possible pairs. We have the ability to use different lookup tables for token frequencies (top N most/least frequent tokens, top N% most/least frequent tokens and frequencies above/below N).</p>
<p>There are other possible improvements for scoring, so we&#8217;re currently refactoring the framework to make it easier to adjust matching scores.</p>
]]></content:encoded>
			<wfw:commentRss>http://soc.sarpcentel.com/2007/07/20/midterm-update/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Code Spotlight: Analyzing token frequencies</title>
		<link>http://soc.sarpcentel.com/2007/07/11/code-spotlight-analyzing-token-frequencies/</link>
		<comments>http://soc.sarpcentel.com/2007/07/11/code-spotlight-analyzing-token-frequencies/#comments</comments>
		<pubDate>Tue, 10 Jul 2007 19:48:52 +0000</pubDate>
		<dc:creator>sarp</dc:creator>
		
		<category><![CDATA[]]></category>

		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/07/11/code-spotlight-analyzing-token-frequencies/</guid>
		<description><![CDATA[I&#8217;d like to read a text file and store the frequency of each word in it into a database. Memory is fast, database is slow. At the two extremes, we have: 1) I don&#8217;t use any memory, for each word I read, I query the database to learn its frequency, and update it in the database. Problem: Very Inefficient 2) Everything [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;d like to read a text file and store the frequency of each word in it into a database. Memory is fast, database is slow. At the two extremes, we have:</p>
<p><strong>1)</strong> I don&#8217;t use any memory: for each word I read, I query the database to learn its frequency, and update it in the database.<br />
<u>Problem:</u> Very inefficient.</p>
<p><strong>2)</strong> Everything is stored in memory: I keep a hashtable in which I store frequencies of words. After reading the whole file, I write all entries to the database.<br />
<u>Problem:</u> Very fast, but I may have so many words that they won&#8217;t fit into memory.</p>
<p><strong>Solution:</strong> I keep part of the words in a hashtable; for those entries that are not in the hashtable, I query the database.</p>
<p><u>SubProblem:</u> How do I decide whether to put a word in the hashtable or not? I&#8217;d like to store the most frequent words in the hashtable so that I minimize the number of queries to the database. But I don&#8217;t know the frequencies of words in advance; they change as I read the file. My hashtable has to organize itself as I read the text file.</p>
<p><u>Solution:</u> If the next word I read is not in the hashtable, and is more frequent than the *least* frequent word in the hashtable, I put the new word into the table and remove the least frequent one. The reasoning is that if a word has high frequency, it is likely that I will see more of the same word.</p>
<p><u>SubProblem:</u> I need a way to efficiently determine the least frequent word in the hashtable. Furthermore, after I replace the least frequent one, which word will be the least frequent one now? What about the next round?</p>
<p><u>Solution:</u> Along with the hashtable, I also keep a data structure called PriorityQueue. If implemented using a heap, it provides insert and remove-minimum operations in O(log N) time, and findMin in constant time.</p>
<p>Here are some experimental results for analyzing one column of 10,000 records:
<ul>
<li>Without a lookup table, only using database: <strong>16 seconds</strong></li>
<li>With a lookup table of 250 records (covers 60% of data): <strong>8 seconds</strong></li>
<li>With a lookup table of 500 records (covers 80% of data): <strong>5 seconds</strong></li>
</ul>
<p>Any ideas or comments on this approach will be appreciated <img src='http://soc.sarpcentel.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /></p>
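<p>A minimal sketch of the self-organizing lookup table described above, under a few assumptions: the class and method names are hypothetical, and a plain <code>Map</code> stands in for the database table so the example runs on its own. Note that <code>java.util.PriorityQueue.remove(Object)</code> is linear in the queue size, so updating a cached count here is not O(log N); an indexed heap would be needed for that bound.</p>

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Sketch only: bounded in-memory frequency table backed by a slower store. */
final class TokenFrequencyCache {
    /** Heap entry pairing a token with its running count. */
    private static final class Entry {
        final String token;
        int count;
        Entry(String token, int count) { this.token = token; this.count = count; }
    }

    private final int capacity;
    private final Map<String, Entry> cache = new HashMap<>();
    // Min-heap ordered by count, so peek() is the least frequent cached token.
    private final PriorityQueue<Entry> heap =
            new PriorityQueue<>(Comparator.comparingInt((Entry e) -> e.count));
    private final Map<String, Integer> store; // stand-in for the database table
    private int storeLookups = 0;

    TokenFrequencyCache(int capacity, Map<String, Integer> store) {
        if (capacity < 1) throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
        this.store = store;
    }

    /** Counts one occurrence of the token. */
    void add(String token) {
        Entry cached = cache.get(token);
        if (cached != null) {
            heap.remove(cached); // re-insert so the heap order reflects the new count
            cached.count++;
            heap.add(cached);
            return;
        }
        storeLookups++; // cache miss: this would be a database query
        int count = store.getOrDefault(token, 0) + 1;
        if (cache.size() < capacity) {
            insert(token, count);
        } else if (count > heap.peek().count) {
            Entry evicted = heap.poll(); // least frequent cached token goes back to the store
            cache.remove(evicted.token);
            store.put(evicted.token, evicted.count);
            insert(token, count);
        } else {
            store.put(token, count);
        }
    }

    private void insert(String token, int count) {
        Entry e = new Entry(token, count);
        cache.put(token, e);
        heap.add(e);
    }

    /** Writes all cached counts to the store once the whole input has been read. */
    void flush() {
        for (Entry e : cache.values()) store.put(e.token, e.count);
    }

    int storeLookups() { return storeLookups; }
}
```

<p>Usage would be one <code>add</code> call per word read from the file, followed by a single <code>flush</code>; <code>storeLookups()</code> gives a rough idea of how many database queries the cache did not avoid.</p>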
]]></content:encoded>
			<wfw:commentRss>http://soc.sarpcentel.com/2007/07/11/code-spotlight-analyzing-token-frequencies/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Phase3 Complete</title>
		<link>http://soc.sarpcentel.com/2007/07/11/phase3-complete/</link>
		<comments>http://soc.sarpcentel.com/2007/07/11/phase3-complete/#comments</comments>
		<pubDate>Tue, 10 Jul 2007 19:48:04 +0000</pubDate>
		<dc:creator>sarp</dc:creator>
		
		<category><![CDATA[]]></category>

		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/07/11/phase3-complete/</guid>
		<description><![CDATA[Now that Phase3 is ready for review, we have added weight scaling functionality to the patient matching module. In fact, I have both versions of weight scaling implemented (Framework A and B). This week, I will be running performance tests to determine if there is a noteworthy performance difference between these two approaches.
3. Runtime Component, operational:  [...]]]></description>
			<content:encoded><![CDATA[<p>Now that Phase3 is ready for review, we have added weight scaling functionality to the patient matching module. In fact, I have both versions of weight scaling implemented (Framework A and B). This week, I will be running performance tests to determine if there is a noteworthy performance difference between these two approaches.</p>
<blockquote><p><strong>3.</strong> <strong>Runtime Component, operational:</strong> Modify the ScorePair method to incorporate frequency scaling. This process should be performed incrementally, in two phases.</p>
<p>The first phase will hard code the frequency scaling equation into the existing ScorePair method. Once the entire linkage process (from analytics to operational phase) has been tested and successfully implements frequency scaling as a prototype, we will proceed to phase 2.</p>
<p>In the second phase ScorePair will be re-factored to accommodate a framework that accepts future modifications to linkage scores established by the Fellegi-Sunter model. These modifications include the frequency scaling, and will also include modifying the agreement weight based on the degree of string similarity as established by various string comparators.</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://soc.sarpcentel.com/2007/07/11/phase3-complete/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Phase2 Complete</title>
		<link>http://soc.sarpcentel.com/2007/07/02/phase2-complete/</link>
		<comments>http://soc.sarpcentel.com/2007/07/02/phase2-complete/#comments</comments>
		<pubDate>Mon, 02 Jul 2007 13:40:10 +0000</pubDate>
		<dc:creator>sarp</dc:creator>
		
		<category><![CDATA[]]></category>

		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/07/02/phase2-complete/</guid>
		<description><![CDATA[I refactored the code for Phase1 according to the feedback I received, and completed Phase 2 (Here is the changeset). I will make a more detailed post tomorrow (before the phone call). The next step (changing the formula for weight scaling) should be fairly easy, after I figure out where scores are calculated in the existing code. Testing and [...]]]></description>
			<content:encoded><![CDATA[<p>I refactored the code for Phase1 according to the feedback I received, and completed Phase 2 (<em>Here is the </em><a href="http://dev.openmrs.org/changeset/1959"><em>changeset</em></a>). I will make a more detailed post tomorrow (before the phone call). The next step (changing the formula for weight scaling) should be fairly easy, after I figure out where scores are calculated in the existing code. Testing and commenting the code should not take too long either, since I know that the code is functional; I just have to check for boundary cases etc.</p>
<blockquote><p><strong>Runtime Component, start-up:</strong> Implement functionality instantiating a data structure that provides fast lookup of individual token frequencies. This data structure will likely be a hash table, where the key is the token value (eg, last name of &#8220;SMITH&#8221;) and the value is the token frequency (eg, 2,102). This data will be loaded from the persistent data structure created in task 1(e).</p>
<p>Because the primary performance constraint for weight-based frequency scaling will be the lookup, we will need to be able to configure the number of elements loaded into the hash table. For example, it is likely that some fields will have hundreds of thousands of unique tokens (eg, name fields), while others will have on the order of 10 or 20 (middle initial, month of birth).</p>
<p>Also, weight scaling can be used to either increase or decrease individual field weights. If an individual token frequency is less than the average frequency it will be increased, if it is above the average frequency it will be decreased. Consequently, there needs to be some ability to configure the total number of tokens loaded into the lookup structure for each field.</p>
<p>a. Implement functionality to load top ‘N’ most/least frequent tokens from the persistent data structure, where top, bottom, and ‘N’ are specified in the configuration file. If</p>
<p>b. Other (future) options may include top or bottom N%, frequencies above or below N.</p>
<p>He</p></blockquote>
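<p>A minimal sketch of task (a), loading the top N most or least frequent tokens into a hash table. It assumes the persisted frequencies have already been read into a <code>Map</code>; the class name, the method <code>loadTopN</code> and its parameters are hypothetical, and ties are broken alphabetically only to make the result deterministic.</p>

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

/** Sketch only: picks the N most or least frequent tokens for the lookup table. */
final class TopNLoader {
    /**
     * @param all          every token frequency for one field (stand-in for the persistent table)
     * @param n            how many tokens to load, as it would be set in the configuration file
     * @param mostFrequent true for the top N, false for the bottom N
     */
    static Map<String, Integer> loadTopN(Map<String, Integer> all, int n, boolean mostFrequent) {
        if (n < 0) throw new IllegalArgumentException("n must not be negative");
        Comparator<Map.Entry<String, Integer>> order =
                Comparator.comparingInt((Map.Entry<String, Integer> e) -> e.getValue());
        if (mostFrequent) {
            order = order.reversed();
        }
        // Tie-break on the token itself so equal frequencies give a stable result.
        order = order.thenComparing((Map.Entry<String, Integer> e) -> e.getKey());

        Map<String, Integer> lookup = new HashMap<>();
        all.entrySet().stream()
                .sorted(order)
                .limit(n)
                .forEach(e -> lookup.put(e.getKey(), e.getValue()));
        return lookup;
    }
}
```

<p>With a real database the sorting and limiting could instead be pushed into the query, so that only N rows are ever transferred.</p>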
]]></content:encoded>
			<wfw:commentRss>http://soc.sarpcentel.com/2007/07/02/phase2-complete/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Phase1 Complete</title>
		<link>http://soc.sarpcentel.com/2007/06/25/phase1-complete/</link>
		<comments>http://soc.sarpcentel.com/2007/06/25/phase1-complete/#comments</comments>
		<pubDate>Mon, 25 Jun 2007 15:56:35 +0000</pubDate>
		<dc:creator>sarp</dc:creator>
		
		<category><![CDATA[]]></category>

		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/06/25/phase1-complete/</guid>
		<description><![CDATA[I completed Phase 1, the Analytic Component. It includes:
Implement an analytic object that performs the following tasks: a. Read linkage configuration file and determine which fields/columns need to be scaled. b. Connect to data source containing individual tokens. Assume that the data source is either a flat file such as a CSV file or a relational database. c. Add new information [...]]]></description>
			<content:encoded><![CDATA[<p>I completed Phase 1, the <strong>Analytic Component</strong>. It includes:</p>
<blockquote><p>Implement an analytic object that performs the following tasks:</p>
<p><strong>a.</strong> Read linkage configuration file and determine which fields/columns need to be scaled.</p>
<p><strong>b.</strong> Connect to data source containing individual tokens. Assume that the data source is either a flat file such as a CSV file or a relational database.</p>
<p><strong>c.</strong> Add new information to configuration file that indicates location of token frequencies</p>
<p><strong>d.</strong> Count frequencies of individual tokens.</p>
<p><strong>e.</strong> Store token frequency results in persistent structure (eg, a relational database table). In order to access the token frequency data at runtime, the frequency tables need to be identified in the configuration file. Thus, will need to develop a programmatic scheme to identify each token frequency table associated with a given data source, eg:</p></blockquote>
<p><u>What has changed in the source code?</u></p>
<p>Testing code has been moved into a new package called <em>org.openmrs.testing</em>.</p>
<p><u><em>org.regenstrief.linkage.analysis</em></u></p>
<ul>
<li>A new abstract class for analyzers called <strong>DataSourceAnalyzer</strong></li>
<li>Two classes that extend it are <strong>CharDelimFileAnalyzer</strong> and <strong>DataBaseAnalyzer</strong></li>
<li>These classes contain a CharDelimFileReader/DataBaseReader to go over/query records</li>
</ul>
<p>In this schema, existing classes such as <em>DataSourceAnalysis</em>, <em>Analyzer</em> and <em>ScaleWeightAnalyzer</em> are not used. The schema provided by existing classes was more general, and if we can find a way to fit my classes into it, I&#8217;m willing to refactor the code. Otherwise, I&#8217;ll delete them.</p>
<p><u><em>org.regenstrief.linkage.db</em></u></p>
<p>Analyzers contain a <strong>LinkDBManager</strong> to insert token frequencies into the database. I&#8217;m not sure if this was the most suitable class for adding this code.</p>
<p><em><u>org.regenstrief.linkage.io</u></em></p>
<p><strong>DataSourceReader</strong> has a new parameter to determine if it will be used for analysis or reading. This change was necessary because in reading, some columns are excluded and blocking is done to make the source ready for linkage. However, we don&#8217;t want these when analyzing the data.</p>
<p><u><em>org.regenstrief.linkage.util</em></u></p>
<ul>
<li><strong>LinkDataSource:</strong> Added a variable to store a unique identifier for each linkdatasource (it is used in storing token frequencies)</li>
<li><strong>RecMatchConfig:</strong> Added a LinkDBManager to create a connection</li>
<li><strong>XMLTranslator:</strong> Modified to include the id of the linkdatasource and a new parameter for storing token frequencies</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://soc.sarpcentel.com/2007/06/25/phase1-complete/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Database schema for analysis results</title>
		<link>http://soc.sarpcentel.com/2007/06/17/database-schema-for-analysis-results/</link>
		<comments>http://soc.sarpcentel.com/2007/06/17/database-schema-for-analysis-results/#comments</comments>
		<pubDate>Sat, 16 Jun 2007 21:36:27 +0000</pubDate>
		<dc:creator>sarp</dc:creator>
		
		<category><![CDATA[]]></category>

		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/06/17/database-schema-for-analysis-results/</guid>
		<description><![CDATA[Here is a draft on how to store analytical phase results in a relational database table. I created this diagram using DBDesigner 4, the same tool used to create OpenMRS Data Model. Right click here and choose &#8220;Save target as&#8221; if you&#8217;d like to load this schema in DBDesigner and modify it. Datasource_analysis table mimics LinkDataSource class in org.regenstrief.linkage.io package. I am imagining a [...]]]></description>
			<content:encoded><![CDATA[<p>Here is a draft of how to store analytical phase results in a relational database table. I created this diagram using <a href="http://213.115.162.124/external/DBDesigner4/DBDesigner4.0.5.6_Setup.exe">DBDesigner 4</a>, the same tool used to create the OpenMRS Data Model. Right click <a href="http://soc.sarpcentel.com/wp-content/uploads/analysis.xml">here</a> and choose &#8220;Save target as&#8221; if you&#8217;d like to load this schema in DBDesigner and modify it.</p>
<p><img src="http://soc.sarpcentel.com/wp-content/uploads/2007/06/model.png" alt="model.png" /></p>
<p>The Datasource_analysis table mimics the <strong>LinkDataSource</strong> class in the <strong>org.regenstrief.linkage.io</strong> package. I am imagining a GUI in OpenMRS where the user manually chooses among existing data sources, or adds a new data source for which the <strong>datasource_id</strong> is automatically assigned by the database.</p>
<p>The Field table contains <strong>changed</strong> and <strong>data_changed</strong> attributes to determine how fresh the statistics are.</p>
]]></content:encoded>
			<wfw:commentRss>http://soc.sarpcentel.com/2007/06/17/database-schema-for-analysis-results/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Explaining grandma record linkage</title>
		<link>http://soc.sarpcentel.com/2007/06/17/explaining-grandma-record-linkage/</link>
		<comments>http://soc.sarpcentel.com/2007/06/17/explaining-grandma-record-linkage/#comments</comments>
		<pubDate>Sat, 16 Jun 2007 20:45:19 +0000</pubDate>
		<dc:creator>sarp</dc:creator>
		
		<category><![CDATA[]]></category>

		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/06/17/explaining-grandma-record-linkage/</guid>
		<description><![CDATA[I was so eager to start working on this project that I forgot to introduce the problem that I will be working on during this summer. Record linkage, roughly, is the task of identifying pieces of scattered information that refer to the same thing. Patient matching is a specific application, in which we try to identify records that [...]]]></description>
			<content:encoded><![CDATA[<p>I was so eager to start working on this project that I forgot to introduce the problem that I will be working on during this summer.</p>
<p>Record linkage, roughly, is <em>the task of identifying pieces of scattered information that refer to the same thing.</em></p>
<p>Patient matching is a specific application, in which we try to identify records that belong to the same patient among different data sources. These sources can range from patient data collected at different hospitals to external information from governmental institutions, such as the death master file.</p>
<p>One of the interesting and challenging aspects of this project is dealing with erroneous data, for instance when your name is misspelled or your birth date is entered incorrectly. These kinds of things often happen in reality, and we can account for them by using flexible distance metrics and statistical models.</p>
<p>Why, then, is record linkage important and what are the benefits?</p>
<p>Well, we are living in an exciting period of globalization, where computers and the internet make world-wide collaboration easy and necessary. Patient linkage and data aggregation techniques will allow medical institutions to store their own data, yet at the same time work together with others to offer better treatment to patients.</p>
<p>For instance, patients often forget their test results at home, or old tests eventually get lost. Imagine that all your medical records are stored in digital format, and when you go to Hospital A, a doctor there can examine your tomogram taken 4 years ago at Hospital B where your name was misspelled by the clerk <img src='http://soc.sarpcentel.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /></p>
<p>I hope that record linkage functionality will be a step forward in increasing collaboration between OpenMRS implementers.</p>
]]></content:encoded>
			<wfw:commentRss>http://soc.sarpcentel.com/2007/06/17/explaining-grandma-record-linkage/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Code reviews</title>
		<link>http://soc.sarpcentel.com/2007/06/08/code-reviews/</link>
		<comments>http://soc.sarpcentel.com/2007/06/08/code-reviews/#comments</comments>
		<pubDate>Thu, 07 Jun 2007 21:35:38 +0000</pubDate>
		<dc:creator>sarp</dc:creator>
		
		<category><![CDATA[]]></category>

		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/06/08/code-reviews/</guid>
		<description><![CDATA[Google was cool enough to send out Karl Fogel&#8217;s Producing Open Source Software book as a surprise gift to all Summer of Code &#8216;07 participants. I have already read that book at our local library after getting accepted, therefore I was hoping to receive another book, Open Sources 2.0: The Continuing Evolution  Nevertheless, the [...]]]></description>
			<content:encoded><![CDATA[<p>Google was cool enough to send out Karl Fogel&#8217;s <a href="http://www.producingoss.com" target="_blank">Producing Open Source Software</a> book as a surprise gift to all <em>Summer of Code &#8216;07</em> participants. I have already read that book at our local library after getting accepted, therefore I was hoping to receive another book, <a href="http://www.amazon.com/Open-Sources-2-0-Continuing-Evolution/dp/0596008023/ref=sr_1_1/104-6654402-4575137?ie=UTF8&amp;s=books&amp;qid=1181249433&amp;sr=8-1" target="_blank">Open Sources 2.0: The Continuing Evolution</a> <img src='http://soc.sarpcentel.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> Nevertheless, the book we received is more suitable in terms of the goals of this program, so there is nothing to complain about :)</p>
<p>Karl Fogel was one of the developers of Subversion, and a past Google employee. In his book, he shares his experience about different aspects of open-source software development, from setting up the technical infrastructure to licensing and copyright issues. This book is worth reading because it contains specific examples and good advice.</p>
<p>Fogel mentions a developer named Greg Stein, who decided to set an example by reviewing every line of every single commit that went into the code repository and eventually started a tradition of code reviews among developers of the Subversion project.</p>
<p>For a young developer like me, code reviews would be a great way to get useful feedback and improve my coding abilities. Fogel argues that one could contribute as much to the project by reviewing others&#8217; changes as by writing new code, because bugs and non-optimal coding practices that would otherwise go unnoticed are caught on the fly.</p>
<p>I agree with him, and would love to receive some reviews after my first few commits <img src='http://soc.sarpcentel.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /></p>
]]></content:encoded>
			<wfw:commentRss>http://soc.sarpcentel.com/2007/06/08/code-reviews/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Agenda for the week</title>
		<link>http://soc.sarpcentel.com/2007/05/30/agenda-for-the-week/</link>
		<comments>http://soc.sarpcentel.com/2007/05/30/agenda-for-the-week/#comments</comments>
		<pubDate>Tue, 29 May 2007 20:30:19 +0000</pubDate>
		<dc:creator>sarp</dc:creator>
		
		<category><![CDATA[]]></category>

		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/05/30/agenda-for-the-week/</guid>
		<description><![CDATA[After receiving a great guideline on how to implement weight scaling from James, I examined part of the existing code today. In the analytic phase, for each field, we need to calculate: (1) number of unique values (2) frequency of each unique value (3) total records. These calculated values should be stored in the database for future analysis. My idea [...]]]></description>
			<content:encoded><![CDATA[<p>After receiving a great guideline on how to implement weight scaling from James, I examined part of the existing code today. In the analytic phase, <em>for each field</em>, we need to calculate:</p>
<p>(1) number of unique values<br />
(2) frequency of each unique value<br />
(3) total records</p>
<p><u>These calculated values should be stored in the database for future analysis</u>.</p>
<p>My idea for the data structure to hold these values, ScaleWeightData, is:</p>
<ul>
<li>integer array indexed by column id for each field regarding (1) and (3)</li>
<li>hash table indexed by token that contains most frequent K records for (2), with database lookups for tokens that are not found in the hash table (<em>for large data sets</em>)</li>
</ul>
<p>A few questions as usual:</p>
<p>1) <strong>Scalability:</strong> What magnitude of scalability are we trying to achieve? Will scalability be a major design concern right from the beginning or are we trying to <em>make it work</em> first, and later worry about scalability? Should I worry about a real system with millions of patient records that may not fit into memory?</p>
<p>2) <strong>Null values:</strong> How do they affect our statistical analysis? Do we ignore them when calculating total number of records?</p>
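<p>A minimal sketch of computing the three per-field values for one column held in memory. The class name and the <code>analyze</code> method are hypothetical and this is not the ScaleWeightData design itself. Skipping null values is an assumption made only so the example runs; it is one possible answer to the second question above, not a decision.</p>

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: the three statistics for a single field, computed in one pass. */
final class FieldStatsSketch {
    final Map<String, Integer> frequencies = new HashMap<>(); // (2) frequency of each unique value
    int totalRecords = 0;                                     // (3) total records counted

    /** (1) number of unique values. */
    int uniqueValues() {
        return frequencies.size();
    }

    static FieldStatsSketch analyze(Iterable<String> columnValues) {
        FieldStatsSketch stats = new FieldStatsSketch();
        for (String value : columnValues) {
            if (value == null) {
                continue; // assumption: nulls are ignored in every statistic
            }
            stats.totalRecords++;
            stats.frequencies.merge(value, 1, Integer::sum);
        }
        return stats;
    }
}
```

<p>For data that does not fit into memory, the frequency map would have to be replaced by the bounded hash table with database lookups mentioned in the list above.</p>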
]]></content:encoded>
			<wfw:commentRss>http://soc.sarpcentel.com/2007/05/30/agenda-for-the-week/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
