<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments for OpenMRS: Record Linkage Project</title>
	<atom:link href="http://soc.sarpcentel.com/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://soc.sarpcentel.com</link>
	<description>Google Summer of Code™ 2007</description>
	<pubDate>Fri, 29 Aug 2008 05:46:55 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6</generator>
		<item>
		<title>Comment on Performance Test: Analyzing token frequencies by sarp</title>
		<link>http://soc.sarpcentel.com/2007/07/23/performance-test-analyzing-token-frequencies/#comment-440</link>
		<dc:creator>sarp</dc:creator>
		<pubDate>Wed, 10 Oct 2007 23:41:28 +0000</pubDate>
		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/07/23/performance-test-analyzing-token-frequencies/#comment-440</guid>
		<description>It would be interesting to see what collaborations are possible. Please contact Shaun Grannis, who is leading this project. You can find his contact details here:
http://www.regenstrief.org/bio/full?member=sgrannis</description>
		<content:encoded><![CDATA[<p>It would be interesting to see what collaborations are possible. Please contact Shaun Grannis, who is leading this project. You can find his contact details here:<br />
<a href="http://www.regenstrief.org/bio/full?member=sgrannis" rel="nofollow">http://www.regenstrief.org/bio/full?member=sgrannis</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Performance Test: Analyzing token frequencies by Andre</title>
		<link>http://soc.sarpcentel.com/2007/07/23/performance-test-analyzing-token-frequencies/#comment-437</link>
		<dc:creator>Andre</dc:creator>
		<pubDate>Wed, 10 Oct 2007 18:53:26 +0000</pubDate>
		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/07/23/performance-test-analyzing-token-frequencies/#comment-437</guid>
		<description>Hello. I've been working on record linkage for some time and I'm building a framework (which I've already been using).
Maybe we could exchange ideas; merging the projects could be a good idea too.

Hope you answer this soon

André</description>
		<content:encoded><![CDATA[<p>Hello. I&#8217;ve been working on record linkage for some time and I&#8217;m building a framework (which I&#8217;ve already been using).<br />
Maybe we could exchange ideas; merging the projects could be a good idea too.</p>
<p>Hope you answer this soon</p>
<p>André</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Code Spotlight: Analyzing token frequencies by OpenMRS: Record Linkage Project &#187; Blog Archive &#187; Midterm update</title>
		<link>http://soc.sarpcentel.com/2007/07/11/code-spotlight-analyzing-token-frequencies/#comment-12</link>
		<dc:creator>OpenMRS: Record Linkage Project &#187; Blog Archive &#187; Midterm update</dc:creator>
		<pubDate>Thu, 19 Jul 2007 20:24:34 +0000</pubDate>
		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/07/11/code-spotlight-analyzing-token-frequencies/#comment-12</guid>
		<description>[...] order to introduce weight scaling, we first analyze data sources (could be database or character delimited file) that will be used in linkage for token [...]</description>
		<content:encoded><![CDATA[<p>[...] order to introduce weight scaling, we first analyze data sources (could be database or character delimited file) that will be used in linkage for token [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Code Spotlight: Analyzing token frequencies by Darius Jazayeri</title>
		<link>http://soc.sarpcentel.com/2007/07/11/code-spotlight-analyzing-token-frequencies/#comment-9</link>
		<dc:creator>Darius Jazayeri</dc:creator>
		<pubDate>Thu, 12 Jul 2007 20:24:23 +0000</pubDate>
		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/07/11/code-spotlight-analyzing-token-frequencies/#comment-9</guid>
		<description>Hi Sarp,

That's probably a conversation worth having more broadly.

For example when I was working on cohort builder I did some back-of-the-envelope math and decided that I'd be taking up 5MB of RAM, and that's okay because it's important enough.

Yesterday Christian, Justin, and I were discussing the Find a Patient search widget, and whether we might want to build a list of tokens for that. We decided that was also important enough. :-) 

We probably need a less ad-hoc process for deciding "this feature is worth X megs of RAM".

-Darius</description>
		<content:encoded><![CDATA[<p>Hi Sarp,</p>
<p>That&#8217;s probably a conversation worth having more broadly.</p>
<p>For example when I was working on cohort builder I did some back-of-the-envelope math and decided that I&#8217;d be taking up 5MB of RAM, and that&#8217;s okay because it&#8217;s important enough.</p>
<p>Yesterday Christian, Justin, and I were discussing the Find a Patient search widget, and whether we might want to build a list of tokens for that. We decided that was also important enough. <img src='http://soc.sarpcentel.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>We probably need a less ad-hoc process for deciding &#8220;this feature is worth X megs of RAM&#8221;.</p>
<p>-Darius</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Database schema for analysis results by sarp</title>
		<link>http://soc.sarpcentel.com/2007/06/17/database-schema-for-analysis-results/#comment-8</link>
		<dc:creator>sarp</dc:creator>
		<pubDate>Wed, 20 Jun 2007 08:05:55 +0000</pubDate>
		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/06/17/database-schema-for-analysis-results/#comment-8</guid>
		<description>&lt;p&gt;1. access refers to the delimiter character in flat files; for databases it could refer to username-password information, etc. James would know better because it is a variable he used while designing the LinkDataSource class.&lt;/p&gt;
&lt;p&gt;2. The motivation for these fields was that since analytic phase may take a long time for large data sources, we may want to use old statistics in the linkage process to skip the analytic phase. &lt;/p&gt;
&lt;p&gt;changed field will be either 0 or 1, depending on whether any update/insert/delete operations were done to the database after our statistical analysis. This field would need to be maintained by functions outside of patient linkage module that make changes to the database.&lt;/p&gt;
&lt;p&gt;date_changed stores the last date on which changes to the database were made. It would be good to know how fresh the calculated statistics are, to decide whether to run the analytic phase again or not.&lt;/p&gt;
&lt;p&gt;3. Thanks, I'll add that&lt;/p&gt;
</description>
		<content:encoded><![CDATA[<p>1. access refers to the delimiter character in flat files; for databases it could refer to username-password information, etc. James would know better because it is a variable he used while designing the LinkDataSource class.</p>
<p>2. The motivation for these fields was that since analytic phase may take a long time for large data sources, we may want to use old statistics in the linkage process to skip the analytic phase. </p>
<p>changed field will be either 0 or 1, depending on whether any update/insert/delete operations were done to the database after our statistical analysis. This field would need to be maintained by functions outside of patient linkage module that make changes to the database.</p>
<p>date_changed stores the last date on which changes to the database were made. It would be good to know how fresh the calculated statistics are, to decide whether to run the analytic phase again or not.</p>
<p>3. Thanks, I&#8217;ll add that</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Database schema for analysis results by Shaun Grannis</title>
		<link>http://soc.sarpcentel.com/2007/06/17/database-schema-for-analysis-results/#comment-7</link>
		<dc:creator>Shaun Grannis</dc:creator>
		<pubDate>Tue, 19 Jun 2007 15:21:44 +0000</pubDate>
		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/06/17/database-schema-for-analysis-results/#comment-7</guid>
		<description>Sarp, a great start to the data model! A couple of questions:

1. In the patient_matching_datasource_analysis table, there is a field labeled "access", can you describe its use?

2. In the patient_matching_field table, can you describe how the "date_changed" and "changed" fields will be used? I'm not certain how the "changed" field will be used

3. A given field (such as last name) will have an overall entropy.  Additionally, each individual token possesses an individual entropy.  I would suggest adding an "entropy" field to the patientmatching_token table. Calculating the entropy of individual tokens is not top priority, but I anticipate potentially using those values in the future.</description>
		<content:encoded><![CDATA[<p>Sarp, a great start to the data model! A couple of questions:</p>
<p>1. In the patient_matching_datasource_analysis table, there is a field labeled &#8220;access&#8221;, can you describe its use?</p>
<p>2. In the patient_matching_field table, can you describe how the &#8220;date_changed&#8221; and &#8220;changed&#8221; fields will be used? I&#8217;m not certain how the &#8220;changed&#8221; field will be used</p>
<p>3. A given field (such as last name) will have an overall entropy.  Additionally, each individual token possesses an individual entropy.  I would suggest adding an &#8220;entropy&#8221; field to the patientmatching_token table. Calculating the entropy of individual tokens is not top priority, but I anticipate potentially using those values in the future.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Agenda for the week by Shaun Grannis</title>
		<link>http://soc.sarpcentel.com/2007/05/30/agenda-for-the-week/#comment-6</link>
		<dc:creator>Shaun Grannis</dc:creator>
		<pubDate>Wed, 30 May 2007 02:14:58 +0000</pubDate>
		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/05/30/agenda-for-the-week/#comment-6</guid>
		<description>Regarding NULLs: 
For frequency scaling, I anticipate that we will ignore nulls in most cases. However, because there is sure to be variation in how other users want to treat nulls, we want to capture sufficient information on null frequencies for those who may want to use them. That information includes: a) total number of non-null tokens, b) total number of null tokens, c) total number of unique non-null tokens.

From the OpenMRS matching design pages, there should be the following options pertaining to nulls contained in the config file (talk to James to see "if" and "how" these are implemented): 

- A flag indicating whether to use null tokens when scaling agreement weight based on term frequency (default-no)
- A flag indicating how to establish agreement among fields when one or both fields are null (eg, apply disagreement weight, apply agreement weight, or apply zero weight) (default-apply zero weight)</description>
		<content:encoded><![CDATA[<p>Regarding NULLs:<br />
For frequency scaling, I anticipate that we will ignore nulls in most cases. However, because there is sure to be variation in how other users want to treat nulls, we want to capture sufficient information on null frequencies for those who may want to use them. That information includes: a) total number of non-null tokens, b) total number of null tokens, c) total number of unique non-null tokens.</p>
<p>From the OpenMRS matching design pages, there should be the following options pertaining to nulls contained in the config file (talk to James to see &#8220;if&#8221; and &#8220;how&#8221; these are implemented): </p>
<p>- A flag indicating whether to use null tokens when scaling agreement weight based on term frequency (default-no)<br />
- A flag indicating how to establish agreement among fields when one or both fields are null (eg, apply disagreement weight, apply agreement weight, or apply zero weight) (default-apply zero weight)</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Agenda for the week by Shaun Grannis</title>
		<link>http://soc.sarpcentel.com/2007/05/30/agenda-for-the-week/#comment-5</link>
		<dc:creator>Shaun Grannis</dc:creator>
		<pubDate>Tue, 29 May 2007 21:32:29 +0000</pubDate>
		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/05/30/agenda-for-the-week/#comment-5</guid>
		<description>Response to question 1) Scalability: Given that it is still uncertain how end-users will specifically implement scaling, we'll need flexibility to limit the number of entries in the data structure. Options that immediately come to mind include:
a) Store all token frequencies
b) Limit by percent of total unique tokens, eg, "store the 10% most frequent tokens", or "store the 10% least frequent tokens"
c) Limit by absolute number of tokens, eg, "store 1,000 most frequent tokens", or "store 3,000 least frequent tokens"
d) conditional on token frequency: store all tokens with frequency above 'n', eg 'store all tokens with frequency above 100', or 'store all tokens with frequency below 100'
e) may need to be able to combine the above criteria, eg 'store top 1,000 tokens with frequency greater than 2,000'
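
A minimal sketch of how options a) to e) might be combined in one selection routine. The function name select_tokens and its parameters (top_n, top_percent, min_freq, most_frequent) are hypothetical and not part of the linkage module; only the "frequency above n" form of option d) is shown.

```python
from collections import Counter


def select_tokens(freqs, top_n=None, top_percent=None, min_freq=None,
                  most_frequent=True):
    """Return the (token, count) pairs that would be stored.

    With no limits given, every token is kept (option a).
    """
    # Option d: keep only tokens whose frequency is strictly above min_freq.
    items = [(t, c) for t, c in freqs.items()
             if min_freq is None or c > min_freq]
    # Order by count; ties are broken alphabetically so results are stable.
    items.sort(key=lambda tc: (-tc[1], tc[0]) if most_frequent
               else (tc[1], tc[0]))
    limit = len(items)
    if top_percent is not None:
        # Option b: a percentage of the total number of unique tokens.
        limit = min(limit, int(len(freqs) * top_percent / 100))
    if top_n is not None:
        # Option c: an absolute number of tokens.
        limit = min(limit, top_n)
    # Option e: all limits that were supplied apply together.
    return items[:limit]


# Illustrative data only.
example = Counter(["SMITH", "SMITH", "SMITH", "GRANNIS", "GRANNIS", "CENTEL"])
two_most_frequent = select_tokens(example, top_n=2)
rarest_above_one = select_tokens(example, top_n=1, min_freq=1,
                                 most_frequent=False)
```

Here two_most_frequent holds SMITH and GRANNIS, and rarest_above_one holds only GRANNIS, because CENTEL is filtered out by the frequency threshold before the least frequent remaining token is taken.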

Will get to 2) later... Thanks!</description>
		<content:encoded><![CDATA[<p>Response to question 1) Scalability: Given that it is still uncertain how end-users will specifically implement scaling, we&#8217;ll need flexibility to limit the number of entries in the data structure. Options that immediately come to mind include:<br />
a) Store all token frequencies<br />
b) Limit by percent of total unique tokens, eg, &#8220;store the 10% most frequent tokens&#8221;, or &#8220;store the 10% least frequent tokens&#8221;<br />
c) Limit by absolute number of tokens, eg, &#8220;store 1,000 most frequent tokens&#8221;, or &#8220;store 3,000 least frequent tokens&#8221;<br />
d) conditional on token frequency: store all tokens with frequency above &#8216;n&#8217;, eg &#8216;store all tokens with frequency above 100&#8217;, or &#8216;store all tokens with frequency below 100&#8217;<br />
e) may need to be able to combine the above criteria, eg &#8216;store top 1,000 tokens with frequency greater than 2,000&#8217;</p>
<p>Will get to 2) later&#8230; Thanks!</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Connecting the dots by Shaun Grannis</title>
		<link>http://soc.sarpcentel.com/2007/05/21/connecting-the-dots/#comment-4</link>
		<dc:creator>Shaun Grannis</dc:creator>
		<pubDate>Sat, 26 May 2007 04:22:38 +0000</pubDate>
		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/05/21/connecting-the-dots/#comment-4</guid>
		<description>In response to comment C) above: string comparators in fact can produce a value between 0 and 1. In practice, if the comparator value exceeds a pre-specified threshold, then the fields are considered to agree, and $latex \gamma_{k}$ for that field is set to 1. Conversely, if the comparator value is less than a pre-specified threshold, then the fields are considered to disagree, and $latex \gamma_{k}$ for that field is set to 0.

Another interesting (and unproven) approach is to make the $latex m_{k}$ and $latex u_{k}$ rates conditional on the string comparator score. For instance, if we have an arbitrary comparator that produced values  (x) between 0 and 1, we can divide these values into 2 ranges:
range a: $latex 0.50\leq{ x}\leq0.75$
range b: $latex 0.75\textless{ x}\leq1.00$

Then for each field there can be 2 likelihood scores depending on the value of the string comparator. For example, if the string comparator returned a score of 0.78, then the agreement weight would be $latex \frac{m_{b}}{u_{b}}$, the disagreement weight would be $latex \frac{1-m_{a}-m_{b}}{1-u_{a}-u_{b}}$</description>
		<content:encoded><![CDATA[<p>In response to comment C) above: string comparators in fact can produce a value between 0 and 1. In practice, if the comparator value exceeds a pre-specified threshold, then the fields are considered to agree, and <img src="http://l.wordpress.com/latex.php?latex=%5Cgamma_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="\gamma_{k}" /> for that field is set to 1. Conversely, if the comparator value is less than a pre-specified threshold, then the fields are considered to disagree, and <img src="http://l.wordpress.com/latex.php?latex=%5Cgamma_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="\gamma_{k}" /> for that field is set to 0.</p>
<p>Another interesting (and unproven) approach is to make the <img src="http://l.wordpress.com/latex.php?latex=m_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="m_{k}" /> and <img src="http://l.wordpress.com/latex.php?latex=u_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="u_{k}" /> rates conditional on the string comparator score. For instance, if we have an arbitrary comparator that produced values  (x) between 0 and 1, we can divide these values into 2 ranges:<br />
range a: <img src="http://l.wordpress.com/latex.php?latex=0.50%5Cleq%7B+x%7D%5Cleq0.75&bg=ffffff&fg=000000&s=0" class="tex" alt="0.50\leq{ x}\leq0.75" /><br />
range b: <img src="http://l.wordpress.com/latex.php?latex=0.75%5Ctextless%7B+x%7D%5Cleq1.00&bg=ffffff&fg=000000&s=0" class="tex" alt="0.75\textless{ x}\leq1.00" /></p>
<p>Then for each field there can be 2 likelihood scores depending on the value of the string comparator. For example, if the string comparator returned a score of 0.78, then the agreement weight would be <img src="http://l.wordpress.com/latex.php?latex=%5Cfrac%7Bm_%7Bb%7D%7D%7Bu_%7Bb%7D%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="\frac{m_{b}}{u_{b}}" />, the disagreement weight would be <img src="http://l.wordpress.com/latex.php?latex=%5Cfrac%7B1-m_%7Ba%7D-m_%7Bb%7D%7D%7B1-u_%7Ba%7D-u_%7Bb%7D%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="\frac{1-m_{a}-m_{b}}{1-u_{a}-u_{b}}" /></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Connecting the dots by Shaun Grannis</title>
		<link>http://soc.sarpcentel.com/2007/05/21/connecting-the-dots/#comment-3</link>
		<dc:creator>Shaun Grannis</dc:creator>
		<pubDate>Fri, 25 May 2007 19:29:21 +0000</pubDate>
		<guid isPermaLink="false">http://soc.sarpcentel.com/2007/05/21/connecting-the-dots/#comment-3</guid>
		<description>Sarp,

Great work above! I'll try to cover several issues in this response. In this reply, the term "token" refers to a specific instance in a given field. For example, tokens for the Last_Name field might be 'SMITH', 'GRANNIS', 'CENTEL', etc. Tokens for Zip_code may be '46224', '02139', '44235', etc.

Not sure I follow the equation you list above, which I repeat below as equation (1). Specifically, which are the token frequencies, and which are the weights? 

$latex (1) \hspace{20pt}\sum_{k=1}^n w_{k} \cdot \log(\frac{m_{k}}{u_{k}}) ^ {w^{j}_{k}\gamma^{j}_{k}} \cdot \log(\frac{1-m_{k}}{1-u_{k}}) ^ {({1-w^{j}_{k}\cdot\gamma^{j}_{k}})}$

The Fellegi-Sunter (F-S) "weight" for fields that *agree* is:
$latex (2) \hspace{20pt}agree\hspace{5pt}weight = \log(\frac{m_{k}}{u_{k}})$
And the Fellegi-Sunter "weight" for fields that *disagree* is:
$latex (3) \hspace{20pt}disagree\hspace{5pt}weight = \log(\frac{1-m_{k}}{1-u_{k}})$
Where:
$latex m_{k}$ is the agreement rate (frequency) among *true* matches
$latex u_{k}$ is the agreement rate (frequency) among *false* matches, i.e. non-matching record pairs

... So it seems that the F-S field weights are included twice in equation (1)?

We propose the following general scaling factor (s) for each field, which decreases weights for frequently occurring tokens while increasing them for rarely occurring tokens:

$latex (4) \hspace{20pt}s_{k} = \sqrt{\frac{T_{k}}{(Q_{k} \cdot I_{k})}}$
Where:
$latex T_{k}$ is the total number of tokens for the field (constant for the field)
$latex Q_{k}$ is the total number of *unique* tokens for the field (constant for the field)
$latex I_{k}$ is the specific frequency of the current Individual token (varies for each token)

Since $latex T_{k}$ and $latex Q_{k}$ are constant for the field, equation (4) can be re-written as:
$latex (5) \hspace{20pt}s_{k} = \sqrt{\frac{A_{k}}{I_{k}}}$
Where:
$latex A_{k} = \frac{T_{k}}{Q_{k}}$, and is constant for the field. It represents the average token frequency for the field.

The F-S log-likelihood equation is:
$latex (6) \hspace{20pt}likelihood=\sum_{k=1}^n\log(\frac{m_{k}}{u_{k}}) ^{\gamma_{k}} \cdot \log(\frac{1-m_{k}}{1-u_{k}}) ^{(1-\gamma_{k})}$

Where:
$latex m_{k}$ is the agreement rate among *true* matches
$latex u_{k}$ is the agreement rate among *false* matches, i.e. non-matching record pairs
$latex \gamma_{k}$ is the binary agreement status for the current field (eg, 1=agree, 0=disagree)

Incorporating the scaling factor from equation (5) into equation (6), we get the following modified F-S equation:
$latex (7) \hspace{20pt}likelihood=\sum_{k=1}^n\log(s_{k}\cdot\frac{m_{k}}{u_{k}}) ^{\gamma_{k}} \cdot \log(\frac{1-m_{k}}{1-u_{k}}) ^{(1-\gamma_{k})}$

So equation (7) is largely the goal. Please let me know your thoughts.
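
A minimal sketch of the scaling factor in equations (4) and (5) and the scaled agreement term of equation (7). The function names are hypothetical, the m and u values are made up for illustration, and nulls are assumed to have been removed from the token list beforehand.

```python
import math
from collections import Counter


def scaling_factors(tokens):
    """Map each token of one field to s_k = sqrt(A_k / I_k), equation (5)."""
    counts = Counter(tokens)            # I_k for every distinct token
    if not counts:
        return {}
    total = sum(counts.values())        # T_k: total number of tokens
    unique = len(counts)                # Q_k: number of unique tokens
    average = total / unique            # A_k = T_k / Q_k
    return {token: math.sqrt(average / count)
            for token, count in counts.items()}


def scaled_agree_weight(m, u, s):
    """Agreement term of equation (7): log(s_k * m_k / u_k)."""
    return math.log(s * m / u)


# Illustrative last names: T_k = 6, Q_k = 3, so A_k = 2.
last_names = ["SMITH"] * 4 + ["GRANNIS", "CENTEL"]
factors = scaling_factors(last_names)
common = scaled_agree_weight(0.9, 0.1, factors["SMITH"])    # s is below 1
rare = scaled_agree_weight(0.9, 0.1, factors["GRANNIS"])    # s is above 1
```

With this data the frequent token SMITH gets a factor of sqrt(2/4), which lowers its agreement weight, while GRANNIS gets sqrt(2/1), which raises it. A token whose frequency equals the field average keeps the unscaled weight of equation (2).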

Thanks!</description>
		<content:encoded><![CDATA[<p>Sarp,</p>
<p>Great work above! I&#8217;ll try to cover several issues in this response. In this reply, the term &#8220;token&#8221; refers to a specific instance in a given field. For example, tokens for the Last_Name field might be &#8216;SMITH&#8217;, &#8216;GRANNIS&#8217;, &#8216;CENTEL&#8217;, etc. Tokens for Zip_code may be &#8216;46224&#8217;, &#8216;02139&#8217;, &#8216;44235&#8217;, etc.</p>
<p>Not sure I follow the equation you list above, which I repeat below as equation (1). Specifically, which are the token frequencies, and which are the weights? </p>
<p><img src="http://l.wordpress.com/latex.php?latex=%281%29+%5Chspace%7B20pt%7D%5Csum_%7Bk%3D1%7D%5En+w_%7Bk%7D+%5Ccdot+%5Clog%28%5Cfrac%7Bm_%7Bk%7D%7D%7Bu_%7Bk%7D%7D%29+%5E+%7Bw%5E%7Bj%7D_%7Bk%7D%5Cgamma%5E%7Bj%7D_%7Bk%7D%7D+%5Ccdot+%5Clog%28%5Cfrac%7B1-m_%7Bk%7D%7D%7B1-u_%7Bk%7D%7D%29+%5E+%7B%28%7B1-w%5E%7Bj%7D_%7Bk%7D%5Ccdot%5Cgamma%5E%7Bj%7D_%7Bk%7D%7D%29%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="(1) \hspace{20pt}\sum_{k=1}^n w_{k} \cdot \log(\frac{m_{k}}{u_{k}}) ^ {w^{j}_{k}\gamma^{j}_{k}} \cdot \log(\frac{1-m_{k}}{1-u_{k}}) ^ {({1-w^{j}_{k}\cdot\gamma^{j}_{k}})}" /></p>
<p>The Fellegi-Sunter (F-S) &#8220;weight&#8221; for fields that *agree* is:<br />
<img src="http://l.wordpress.com/latex.php?latex=%282%29+%5Chspace%7B20pt%7Dagree%5Chspace%7B5pt%7Dweight+%3D+%5Clog%28%5Cfrac%7Bm_%7Bk%7D%7D%7Bu_%7Bk%7D%7D%29&bg=ffffff&fg=000000&s=0" class="tex" alt="(2) \hspace{20pt}agree\hspace{5pt}weight = \log(\frac{m_{k}}{u_{k}})" /><br />
And the Fellegi-Sunter &#8220;weight&#8221; for fields that *disagree* is:<br />
<img src="http://l.wordpress.com/latex.php?latex=%283%29+%5Chspace%7B20pt%7Ddisagree%5Chspace%7B5pt%7Dweight+%3D+%5Clog%28%5Cfrac%7B1-m_%7Bk%7D%7D%7B1-u_%7Bk%7D%7D%29&bg=ffffff&fg=000000&s=0" class="tex" alt="(3) \hspace{20pt}disagree\hspace{5pt}weight = \log(\frac{1-m_{k}}{1-u_{k}})" /><br />
Where:<br />
<img src="http://l.wordpress.com/latex.php?latex=m_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="m_{k}" /> is the agreement rate (frequency) among *true* matches<br />
<img src="http://l.wordpress.com/latex.php?latex=u_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="u_{k}" /> is the agreement rate (frequency) among *false* matches, i.e. non-matching record pairs</p>
<p>&#8230; So it seems that the F-S field weights are included twice in equation (1)?</p>
<p>We propose the following general scaling factor (s) for each field, which decreases weights for frequently occurring tokens while increasing them for rarely occurring tokens:</p>
<p><img src="http://l.wordpress.com/latex.php?latex=%284%29+%5Chspace%7B20pt%7Ds_%7Bk%7D+%3D+%5Csqrt%7B%5Cfrac%7BT_%7Bk%7D%7D%7B%28Q_%7Bk%7D+%5Ccdot+I_%7Bk%7D%29%7D%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="(4) \hspace{20pt}s_{k} = \sqrt{\frac{T_{k}}{(Q_{k} \cdot I_{k})}}" /><br />
Where:<br />
<img src="http://l.wordpress.com/latex.php?latex=T_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="T_{k}" /> is the total number of tokens for the field (constant for the field)<br />
<img src="http://l.wordpress.com/latex.php?latex=Q_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="Q_{k}" /> is the total number of *unique* tokens for the field (constant for the field)<br />
<img src="http://l.wordpress.com/latex.php?latex=I_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="I_{k}" /> is the specific frequency of the current Individual token (varies for each token)</p>
<p>Since <img src="http://l.wordpress.com/latex.php?latex=T_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="T_{k}" /> and <img src="http://l.wordpress.com/latex.php?latex=Q_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="Q_{k}" /> are constant for the field, equation (4) can be re-written as:<br />
<img src="http://l.wordpress.com/latex.php?latex=%285%29+%5Chspace%7B20pt%7Ds_%7Bk%7D+%3D+%5Csqrt%7B%5Cfrac%7BA_%7Bk%7D%7D%7BI_%7Bk%7D%7D%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="(5) \hspace{20pt}s_{k} = \sqrt{\frac{A_{k}}{I_{k}}}" /><br />
Where:<br />
<img src="http://l.wordpress.com/latex.php?latex=A_%7Bk%7D+%3D+%5Cfrac%7BT_%7Bk%7D%7D%7BQ_%7Bk%7D%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="A_{k} = \frac{T_{k}}{Q_{k}}" />, and is constant for the field. It represents the average token frequency for the field.</p>
<p>The F-S log-likelihood equation is:<br />
<img src="http://l.wordpress.com/latex.php?latex=%286%29+%5Chspace%7B20pt%7Dlikelihood%3D%5Csum_%7Bk%3D1%7D%5En%5Clog%28%5Cfrac%7Bm_%7Bk%7D%7D%7Bu_%7Bk%7D%7D%29+%5E%7B%5Cgamma_%7Bk%7D%7D+%5Ccdot+%5Clog%28%5Cfrac%7B1-m_%7Bk%7D%7D%7B1-u_%7Bk%7D%7D%29+%5E%7B%281-%5Cgamma_%7Bk%7D%29%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="(6) \hspace{20pt}likelihood=\sum_{k=1}^n\log(\frac{m_{k}}{u_{k}}) ^{\gamma_{k}} \cdot \log(\frac{1-m_{k}}{1-u_{k}}) ^{(1-\gamma_{k})}" /></p>
<p>Where:<br />
<img src="http://l.wordpress.com/latex.php?latex=m_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="m_{k}" /> is the agreement rate among *true* matches<br />
<img src="http://l.wordpress.com/latex.php?latex=u_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="u_{k}" /> is the agreement rate among *false* matches, i.e. non-matching record pairs<br />
<img src="http://l.wordpress.com/latex.php?latex=%5Cgamma_%7Bk%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="\gamma_{k}" /> is the binary agreement status for the current field (eg, 1=agree, 0=disagree)</p>
<p>Incorporating the scaling factor from equation (5) into equation (6), we get the following modified F-S equation:<br />
<img src="http://l.wordpress.com/latex.php?latex=%287%29+%5Chspace%7B20pt%7Dlikelihood%3D%5Csum_%7Bk%3D1%7D%5En%5Clog%28s_%7Bk%7D%5Ccdot%5Cfrac%7Bm_%7Bk%7D%7D%7Bu_%7Bk%7D%7D%29+%5E%7B%5Cgamma_%7Bk%7D%7D+%5Ccdot+%5Clog%28%5Cfrac%7B1-m_%7Bk%7D%7D%7B1-u_%7Bk%7D%7D%29+%5E%7B%281-%5Cgamma_%7Bk%7D%29%7D&bg=ffffff&fg=000000&s=0" class="tex" alt="(7) \hspace{20pt}likelihood=\sum_{k=1}^n\log(s_{k}\cdot\frac{m_{k}}{u_{k}}) ^{\gamma_{k}} \cdot \log(\frac{1-m_{k}}{1-u_{k}}) ^{(1-\gamma_{k})}" /></p>
<p>So equation (7) is largely the goal. Please let me know your thoughts.</p>
<p>Thanks!</p>
]]></content:encoded>
	</item>
</channel>
</rss>
