Connecting the dots

After reading the Fellegi-Sunter paper, I’d like to map the ideas described in words to formulas used in the original paper:

(1) FS generates a likelihood score based on the agreement pattern among corresponding fields from two records. The higher the likelihood score, the more likely it is that the two records represent a match, rather than simple random agreement among fields.

Since we expect more fields to agree on matches, the likelihood ratio will have high values for matched records and low values for unmatched records.

(2) Digging deeper, for a given record pair, the score is calculated by multiplying the *agreement* likelihood weight for each set of corresponding fields that agree (e.g., both last names are ‘JONES’) and multiplying by the *disagreement* weight for each set of corresponding fields that disagree (e.g., dates of birth disagree).

Score = \prod_{k=1}^{K} \frac{m(\gamma^{k})}{u(\gamma^{k})}

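(3) Conditional independence is assumed when we multiply the fields’ weights together: given whether the pair is a true match or not, agreement on one field is treated as independent of agreement on any other field.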
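To make (2) and (3) concrete for myself, here is a minimal Python sketch of the multiplication. The field names and the m- and u-probabilities are made-up values for illustration only (they are not from the FS paper), and I am assuming simple binary agree/disagree comparisons, so the agreement weight is m/u and the disagreement weight is (1-m)/(1-u).

```python
from math import prod

# Hypothetical per-field probabilities (illustrative values, not estimates):
#   m = P(field agrees | the pair is a true match)
#   u = P(field agrees | the pair is not a match)
M_PROBS = {"last_name": 0.95, "first_name": 0.90, "dob": 0.98}
U_PROBS = {"last_name": 0.01, "first_name": 0.05, "dob": 0.002}


def field_weight(field, agrees):
    """Likelihood ratio contributed by one field."""
    m, u = M_PROBS[field], U_PROBS[field]
    # Agreement weight is m/u; disagreement weight is (1-m)/(1-u).
    return m / u if agrees else (1 - m) / (1 - u)


def likelihood_score(agreement):
    """Multiply the per-field weights (this is the independence assumption)."""
    return prod(field_weight(f, a) for f, a in agreement.items())


# Last name and first name agree, date of birth disagrees.
pattern = {"last_name": True, "first_name": True, "dob": False}
score = likelihood_score(pattern)

all_agree = likelihood_score({f: True for f in M_PROBS})
all_disagree = likelihood_score({f: False for f in M_PROBS})
print(score, all_agree, all_disagree)
```

With these numbers, a pair that agrees on everything scores far above a pair that disagrees on everything, and the mixed pattern lands in between.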


(4) There are a few benefits to assuming conditional independence. First, we can make some reasonable estimates of the true-positive and true-negative rates for various likelihood scores. These measures help us set a score threshold that separates true-links from false-links.

I’d like to get the terminology right with true-links and false-links: Let’s say we pair all records together. If two records in the pair belong to the same person in reality, we call it a true-link. Those records that are paired together, but correspond to different people in reality, are called false-links. In the FS paper, these correspond to:

All pairs: A \bowtie B = \{(a,b); a\in A, b\in B\}

True-links: M = \left\{ (a,b); a=b, a \in A, b \in B \right\}

False-links: U = \left\{(a,b); a \neq b, a \in A, b \in B \right\}
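A tiny Python sketch of these three sets, just to check my reading of the definitions. The records and the `person_id` field are hypothetical: in a real linkage problem the true identity is exactly what we cannot observe, so this only works on toy data where I assign it by hand.

```python
from itertools import product

# Hypothetical toy files. Each record is (record_id, person_id), where
# person_id stands in for the real-world identity behind the record.
A = [("a1", "alice"), ("a2", "bob")]
B = [("b1", "alice"), ("b2", "carol"), ("b3", "bob")]

# Every record of A paired with every record of B.
all_pairs = set(product(A, B))

# True-links: both records belong to the same person.
M = {(a, b) for a, b in all_pairs if a[1] == b[1]}

# False-links: the records belong to different people.
U = {(a, b) for a, b in all_pairs if a[1] != b[1]}

print(len(all_pairs), len(M), len(U))
```

M and U are disjoint and together cover every pair, which is the partition the paper starts from.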

(5) Second, assuming conditional independence allows us to associate monotonically increasing scores with increased true-positive rates. Simply stated, pairs with higher likelihood scores are more likely to be true matches.

w^{k}(\gamma^{k})=\log m(\gamma^{k})-\log u(\gamma^{k})

w(\gamma)=w^{1}+w^{2}+\cdots+w^{K}

So what we would like to do is enhance the scoring function so that it takes into account the frequency of the values being compared.

Some questions that came to mind:

A) Please comment on the mapping between ideas and formulas, and point out any misunderstandings I have.

B) In the formula, we assume each identifier to have equal weight. However, in reality, some identifiers can provide us more information than others (Shannon’s entropy). For instance, if we have records from a single neighbourhood, their zip codes will be mostly the same and the field won’t be of much use. This suggests some kind of scaling of the weights. However, isn’t this kind of scaling already inherent in the formula? In the extreme case where all the values of a field are the same, m/u will be 1 and its logarithm will be 0, meaning that the field won’t have an effect on the score (assuming that EM provides good estimates).

C) String comparison functions do not have to produce binary results, right? They don’t have to say that two fields either match or do not match. They can give any value between 0 and 1.
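To convince myself about question B, here is a small Python sketch of the log weights. The m and u values are invented for illustration, and I use the natural logarithm since the formula above does not fix a base. It only covers the agreement weight of each field.

```python
from math import log


def agreement_weight(m, u):
    """Log weight of one agreeing field: log m - log u."""
    return log(m) - log(u)


def total_weight(fields):
    """Sum of per-field log weights (independence assumed)."""
    return sum(agreement_weight(m, u) for m, u in fields)


# Hypothetical values: a zip code field in a single-neighbourhood data set
# agrees almost always, whether or not the pair is a true match, so m == u.
zip_weight = agreement_weight(0.99, 0.99)

# A more informative field: last name agrees on most true matches but
# rarely by chance.
last_name_weight = agreement_weight(0.95, 0.01)

total = total_weight([(0.99, 0.99), (0.95, 0.01)])
print(zip_weight, last_name_weight, total)
```

When m equals u the weight is exactly zero, so the uninformative field drops out of the sum without any extra scaling, which is the point the question is making.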

Background check

In Overview of Record Linkage and Current Research Directions, Winkler states that the basic ideas in the Fellegi-Sunter model are based on statistical concepts such as odds ratios, hypothesis testing, and relative frequency.

Luckily I had a lot of exposure to hypothesis testing and relative frequency in the statistical modeling course this semester. Apparently the rest is basic probability:

Odds: tells how much more likely it is that the event A occurs than that it does not occur. An odds ratio is then the ratio of two such odds.

\frac{P(A)}{P(A^{C})} = \frac{P(A)}{1-P(A)}
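As a quick sanity check of the formula, a minimal Python sketch (the function name `odds` is my own choice):

```python
def odds(p):
    """Odds of an event with probability p: P(A) / (1 - P(A))."""
    if not 0 <= p < 1:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1 - p)


# An event with probability 0.5 is as likely to happen as not.
even = odds(0.5)

# An event with probability 0.8 is about four times as likely to occur
# as not to occur.
likely = odds(0.8)
print(even, likely)
```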

Information about OpenMRS

Our world continues to be ravaged by a pandemic of epic proportions, as over 40 million people are infected with or dying from HIV/AIDS — most (up to 95%) are in developing countries. Prevention and treatment of HIV/AIDS on this scale requires efficient information management, which is critical as HIV/AIDS care must increasingly be entrusted to less skilled providers. Whether for lack of time, developers, or money, most HIV/AIDS programs in developing countries manage their information with simple spreadsheets or small, poorly designed databases…if anything at all. To help them, we need to find a way not only to improve management tools, but also to reduce unnecessary, duplicative efforts.

As a response to these challenges, OpenMRS formed in 2004 as an open source medical record system framework for developing countries, a tide which raises all ships. OpenMRS is a multi-institution, nonprofit collaborative led by Regenstrief Institute, Inc., a world-renowned leader in medical informatics research, and Partners In Health, a Boston-based philanthropic organization with a focus on improving the lives of underprivileged people worldwide through health care service and advocacy. These teams nurture a growing worldwide network of individuals and organizations all focused on creating medical record systems and a corresponding implementation network to allow system development self-reliance within resource-constrained environments.

To date, OpenMRS has been implemented in several African countries, including South Africa, Kenya, Rwanda, Lesotho, Uganda, and Tanzania. This work is supported in part by organizations such as the World Health Organization (WHO), the Centers for Disease Control (CDC), the Rockefeller Foundation, and the President’s Emergency Plan for AIDS Relief (PEPFAR).
