• Home
  • About
  • OpenMRS
  • Proposal
  • Timeline

OpenMRS: Record Linkage Project

Google Summer of Code™ 2007






Connecting the dots

May 21st, 2007 by sarp

After reading the Fellegi-Sunter paper, I’d like to map the ideas described in words to formulas used in the original paper:

(1) FS generates a likelihood score based on agreement pattern among corresponding fields from 2 records. The higher the likelihood score, the more likely two records represent a match, rather than simple random agreement among fields.

Since we expect more fields to agree on matches, the ratio will have high values for matches, and low values for unmatched records.

(2) Digging deeper, for a given record pair, the score is calculated by multiplying the *agreement* likelihood weight for each  set of corresponding fields that agree (eg, both last names are ‘JONES’) and multiplying by the *disagreement* weight for corresponding fields that disagree (eg, dates of birth disagree).

Score =

(3) Conditional independence is assumed when we multiply each field’s full weight.


(4) There are a few benefits to assuming conditional independence. First, we can make some reasonable estimates regarding the true-positive and true negative rates for various likelihood scores. These measures help us set a score threshold for true- and false-links.

I’d like to get the terminology right with true-links and false-links: Let’s say we pair all records together. If two records in the pair belong to the same person in reality, we call it a true-link. Those records that are paired together, but correspond to different people in reality, are called false-links. In the FS paper, these correspond to:A \bowtie B = \{(a,b); a\in A, b\in B)\}True-links:M = \left\{ (a,b); a=b, a \in A, b \in B \right\}False-links:U = \left\{(a,b); a \neq b, a \in A, b \in B \right\}

(5) Second, assuming conditional independence allows us to associate monotonically increasing scores with increased true positive rates. Simply stated, higher likelihood scores can be considered to be more likely true matches.

w^{k}(\gamma_{k})=\log m(\gamma^{k})-\log u (\gamma^{k})w(\gamma)=w^{1}+w^{2}+\cdot\cdot\cdot+w^{k}So what we would like to do is to enhance the scoring function, so that it takes into account the frequency of the values being compared:Some questions that came up to my mind:A) Please comment on the mapping between ideas and formulas, and point out any misunderstandings I have.B) In the formula, we assume each identifier to have equal weight. However, in reality, some identifiers can provide us more information than others (Shannon’s entropy). For instance, if we have records from a neighbourhood, their zip codes will be mostly same and it won’t be of much use. This suggests a proposed scaling like this:However, isn’t this kind of scaling already inherit in the formula? In the extreme case where all the values of a field are the same, m/u will be 1 and the logarithm of it will be 0, meaning that it won’t have an effect on the score. (assuming that EM provides good estimates)C) String comparison functions do not have to produce binary results, right? They don’t have to say either these two field match or do not match. They can give any value between 0…1

Posted in | 2 Comments

2 Responses to “Connecting the dots”

  1. on 26 May 2007 at 12:29 am1Shaun Grannis

    Sarp,

    Great work above! I’ll try to cover several issues in this response. In this reply, the term “token” refers to a specific instance in a given field. For example, tokens for the Last_Name field might be ‘SMITH’, ‘GRANNIS’, ‘CENTEL’, etc. Tokens for Zip_code may be ‘46224′, ‘02139′, ‘44235′, etc.

    Not sure I follow the equation you list above, which I repeat below as equation (1). Specifically, which are the token frequencies, and which are the weights?

    (1) \hspace{20pt}\sum_{k=1}^n w_{k} \cdot \log(\frac{m_{k}}{u_{k}}) ^ {w^{j}_{k}\gamma^{j}_{k}} \cdot \log(\frac{1-m_{k}}{1-u_{k}}) ^ {({1-w^{j}_{k}\cdot\gamma^{j}_{k}})}

    The Felligi-Sunter (F-S) “weight” for fields that *agree* is:
    (2) \hspace{20pt}agree\hspace{5pt}weight = \log(\frac{m_{k}}{u_{k}})
    And the Felligi-Sunter “weight” for fields that *disagree* is:
    (3) \hspace{20pt}disagree\hspace{5pt}weight = \log(\frac{1-m_{k}}{1-u_{k}})
    Where:
    m_{k} is the agreement rate (frequency) among *true* matches
    u_{k} is the agreement rate (frequency) among *false* matches

    … So it seems that the F-S field weights are included twice in equation (1)?

    We propose the following general scaling factor (s) for each field, which decreases wieghts for frequently occuring tokens, while:

    (4) \hspace{20pt}s_{k} = \sqrt{\frac{T_{k}}{(Q_{k} \cdot I_{k})}}
    Where:
    T_{k} is the total number of tokens for the field (constant for the field)
    Q_{k} is the total number of *unique* tokens for the field (constant for the field)
    I_{k} is the specific frequency of the current Individual token (varies for each token)

    Since T_{k} and Q_{k} are constant for the field, equation (4) can be re-written as:
    (5) \hspace{20pt}s_{k} = \sqrt{\frac{A_{k}}{I_{k}}}
    Where:
    A_{k} = \frac{T_{k}}{Q_{k}}, and is constant for the field. It represents the average token frequency for the field.

    The F-S log-likelihood equation is:
    (6) \hspace{20pt}likelihood=\sum_{k=1}^n\log(\frac{m_{k}}{u_{k}}) ^{\gamma_{k}} \cdot \log(\frac{1-m_{k}}{1-u_{k}}) ^{(1-\gamma_{k})}

    Where:
    m_{k} is the agreement rate among *true* matches
    u_{k} is the agreement rate among *false* matches
    \gamma_{k} is the binary agreement status for the current field (eg, 1=agree, 0=disagree)

    Incorporating the scaling factor from equation (5) into equation (6), we get the following modified F-S equation:
    (7) \hspace{20pt}likelihood=\sum_{k=1}^n\log(s_{k}\cdot\frac{m_{k}}{u_{k}}) ^{\gamma_{k}} \cdot \log(\frac{1-m_{k}}{1-u_{k}}) ^{(1-\gamma_{k})}

    So equation (7) is largely the goal. Please let me know your thoughts.

    Thanks!

  2. on 26 May 2007 at 9:22 am2Shaun Grannis

    In response to comment C) above: string comparators in fact can produce a value between 0 and 1. In practice, if the comparator value exceeds a pre-specified threshold, then the fields are considered to agree, and \gamma_{k} for that field is set to 1. Conversely, if the comparator value is less than a pre-specified threshold, then the fields are considered to disagree, and \gamma_{k} for that field is set to 0.

    Another interesting (and unproven) approach is to make the m_{k} and u_{k} rates conditional on the string comparator score. For instance, if we have an arbitrary comparator that produced values (x) between 0 and 1, we can divide these values into 2 ranges:
    range a: 0.50\leq{ x}\leq0.75
    range b: 0.75\textless{ x}\leq1.00

    Then for each field there can be 2 likelihood scores depending on the value of the string comparator. For example, if the string comparator returned a score of 0.78, then the agreement weight would be \frac{m_{b}}{u_{b}}, the disagreement weight would be \frac{1-m_{a}-m_{b}}{1-u_{a}-u_{b}}

  • Email Updates

    To receive email updates on new posts, click here.

  • About Me

    I am 22 years old, recently graduated from Computer Science and Engineering program of Sabanci University in Istanbul, TURKEY.

  • Recent Posts

    • Performance Test: Analyzing token frequencies
    • Midterm update
    • Code Spotlight: Analyzing token frequencies
    • Phase3 Complete
    • Phase2 Complete
    • Phase1 Complete
  • Archives

    • July 2007 (5)
    • June 2007 (4)
    • May 2007 (3)
    • April 2007 (1)
  • Categories

    • (6)
    • (7)
    • (8)
    • (8)
    • (8)
    • (8)
    • (1)
    • (2)
    • (4)
  • Pages

    • About
    • OpenMRS
    • Proposal
    • Timeline
Powered by Wordpress |
Feed on
Posts
Comments