### Revision for “Statistics for distributions” created on March 23, 2018 @ 20:59:08

Title | Statistics for distributions |
---|---|

Content | <p>The first question to ask is "Distribution of ''what''?"</p>
<p>Let's say that you want to look at the distribution of the surname SMITHBLOGGINS. Do you want the distribution of the '''number''' of SMITHBLOGGINSs or of the '''proportion of the population''' who are called SMITHBLOGGINS?</p>
<p>Depending upon the source of information you are using, either quantity could be the easier one to find. If this is the wrong quantity for you then either you will need to put in a lot of extra work or you will need to use a different source. Do not let the source determine the question; let the question determine the source.</p>
<p>The '''number''' of SMITHBLOGGINS in an area will normally depend upon the population: we could expect to find more SMITHBLOGGINS in London than in the middle of the Sahara desert. In Britain, since the major cities of London, Birmingham, Coventry, Manchester, Liverpool and Glasgow lie approximately along a line going from SE to NW, counting numbers will tend to show a SE/NW trend for most names. This difficulty, which distorts – sometimes very badly – any genealogical distributions, does not exist for proportions.</p>
<p>There are too many difficulties when it comes to interpreting distributions of '''numbers''', even though they might be very easy to produce and give pretty maps. From the genealogical point of view, the '''proportion''' is the more important quantity.</p>
<p>To find the proportion of the population of an area who are SMITHBLOGGINS, all that is needed is a source for which the following two quantities are known or can be estimated:</p>
<p>(a) The total number of entries<br />
(b) The number of SMITHBLOGGINS entries.</p>
<p>Dividing (b) by (a) then gives the approximate proportion of SMITHBLOGGINSs in that area. Since the result will be an extremely small fraction, it is usual to quote results "the other way up", that is (a) divided by (b), as "1 in ...", e.g. "1 in 10,000". There is no need for great precision here: if the arithmetic happens to work out at, say, 1 in 9876, then it is pointless using that number – just call it 1 in 10,000.</p>
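<p>The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name <code>one_in</code>, the choice of rounding to one significant figure, and the entry counts in the example are assumptions made here, not part of any particular source.</p>

```python
from math import floor, log10


def one_in(total_entries: int, surname_entries: int, sig_figs: int = 1) -> int:
    """Return N so that the surname accounts for roughly '1 in N' entries.

    total_entries   -- quantity (a), the total number of entries in the source
    surname_entries -- quantity (b), the number of entries for the surname
    sig_figs        -- how many significant figures to keep (assumed default: 1)
    """
    if surname_entries <= 0 or total_entries < surname_entries:
        raise ValueError("need 0 < surname_entries <= total_entries")
    ratio = total_entries / surname_entries  # (a) divided by (b)
    # Round the ratio to the requested number of significant figures.
    magnitude = floor(log10(ratio))
    factor = 10 ** (magnitude - sig_figs + 1)
    return int(round(ratio / factor) * factor)


# Hypothetical counts: 987,600 entries in total, 100 of them SMITHBLOGGINS.
n = one_in(987_600, 100)
print(f"1 in {n:,}")  # prints "1 in 10,000"
```

<p>Rounding to one significant figure reflects the point that a figure such as 1 in 9876 claims more precision than the source can support.</p>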
<p>There is one other property needed of the source: the proportion of SMITHBLOGGINSs in the source should be the same (at least, approximately) as in the population as a whole. If this is not the case, then the source is ''biased'', either against or in favour of SMITHBLOGGINS. For example, telephone directory entries might be biased against the poorer members of society, since telephones cost money which poor people might not be able to afford. So are people called SMITHBLOGGINS known to be significantly poorer than the population as a whole? If so, then telephone directories will be biased against SMITHBLOGGINS and, depending upon how serious the bias is, it might not be appropriate to use telephone directories to map the distribution of SMITHBLOGGINS. Research has in fact shown that telephone entries today show a bias the other way, because richer people tend to go ex-directory and so are not shown in the figures. This bias is greater in south-eastern Britain than in other parts of the country.</p>
<p>The question of bias often confuses people and takes us into the philosophy of statistics. You do not have to know that there is no bias. That is, you do not have to set about proving that there is no bias before you can use the source. It is simply that you shouldn't be aware that there is a bias for or against your surname; ignorance on your part is fine. You don't have to worry about whether the SMITHBLOGGINSs are significantly poorer, on average, than the rest of the population (and so are likely to be significantly under-represented in telephone directories); you may proceed on the assumption that they're not, unless you happen to know otherwise.</p>
<p>In practice, quantity (a) – the total number of entries in the source – can be difficult to find, in which case it will be necessary to estimate it. If the source is divided into a known number of relatively small sections, then this is best done by counting the average number of entries in a random sample of sections (as many as you can manage before boredom sets in) and then multiplying by the number of sections. For example, staying with telephone directories, we could take a "section" as being a page: count the average number of entries per page in a random sample of pages and then multiply by the number of pages in the directory.</p>
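<p>The page-sampling estimate can be sketched as follows. The function names, the 500-page directory and the page counts are all hypothetical; the counting itself still has to be done by hand, and the code only picks the pages at random and does the multiplication.</p>

```python
import random
from statistics import mean


def pick_sample_pages(n_pages: int, sample_size: int, seed=None) -> list:
    """Choose which pages to count, at random and without repeats."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, n_pages + 1), sample_size))


def estimate_total_entries(counts_on_sampled_pages, n_pages: int) -> float:
    """Average entries per sampled page, multiplied by the number of pages."""
    return mean(counts_on_sampled_pages) * n_pages


# Hypothetical directory of 500 pages; we decide to count 5 of them.
pages_to_count = pick_sample_pages(n_pages=500, sample_size=5, seed=1)
print("Count the entries on pages:", pages_to_count)

# Hypothetical hand counts for those 5 pages.
hand_counts = [310, 295, 302, 288, 305]
print("Estimated total entries:", estimate_total_entries(hand_counts, 500))
```

<p>Choosing the pages with a random number generator, rather than by opening the directory "at random", avoids favouring the pages where the book naturally falls open.</p>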
<p>Another way of finding the proportion of SMITHBLOGGINS is by taking advantage of someone else's work on another surname: let's say, on the name BLOGGINSSMITH. If the BLOGGINSSMITH researcher has said that the 20 BLOGGINSSMITH entries in a particular source represent 6% of the total number of entries, and you know that the same source contains 30 SMITHBLOGGINS entries, then it isn't hard to see that SMITHBLOGGINS represents 9% of the entries in that source. As a variation, the BLOGGINSSMITH researcher might not have said how many BLOGGINSSMITH entries there are, but might have given merely the percentage; in which case there is nothing to stop you from counting the number of BLOGGINSSMITHs for yourself, and thus finding out that there are 20. Alternatively, of course, you could just ask.</p>
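<p>The scaling step in the BLOGGINSSMITH example is simple proportion, sketched here with the figures from the text (20 entries at 6%, and 30 entries for your own name). The function names are made up for this illustration.</p>

```python
def percent_from_reference(ref_count: int, ref_percent: float, own_count: int) -> float:
    """Scale another researcher's percentage by the ratio of the two counts.

    ref_count   -- entries for the reference surname (e.g. 20 BLOGGINSSMITHs)
    ref_percent -- the percentage those entries are said to represent (e.g. 6)
    own_count   -- entries for your own surname in the SAME source (e.g. 30)
    """
    if ref_count <= 0:
        raise ValueError("ref_count must be positive")
    return ref_percent * own_count / ref_count


def implied_total_entries(ref_count: int, ref_percent: float) -> float:
    """Total number of entries implied by the reference figures."""
    return ref_count / (ref_percent / 100)


print(percent_from_reference(20, 6, 30))   # prints 9.0
print(round(implied_total_entries(20, 6)))  # about 333 entries in the source
```

<p>The calculation only holds if both counts come from the same source, since it relies on the total number of entries being identical for the two surnames.</p>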
<p>Early systems with relevance include the <a href="/wiki/guild-wiki/analyse/statistics-for-distributions/smallshaw-and-banwell-methods/" target="_blank">Smallshaw and Banwell methods</a>.</p>
<p>See also the article <a href="/members/pdfs/keith_percy.pdf" target="_blank">Methods of estimating surname frequency</a> by Keith Percy in <em>JOONS</em> of January 1998.</p>
<p>Surname frequency information is available at the following:</p>
<ul>
<li><a href="http://www.britishsurnames.co.uk/" target="_blank">British surnames</a></li>
<li><a href="http://www.census.gov/genealogy/www/freqnames.html" target="_blank">US census data</a> (broken URL)</li>
<li><a href="http://worldnames.publicprofiler.org" target="_blank">World Names Public Profiler</a></li>
</ul> |

Excerpt |