Skip to content

Probabilities in Proofreading

Suppose you write a program and you send the source code to two of your friends, LaTeX Formula and LaTeX Formula. Your two friends read the code and when they finish, A errors are detected by LaTeX Formula, B errors are detected by LaTeX Formula, and C errors are detected by both. So, in total, A+B-C errors are detected and can now be eliminated. We wish to estimate the number of errors that remain unnoticed and uncorrected.

The original version of this problem concerns manuscripts and proofreaders, instead of source code and programmers. It was posed and solved by George Polya and published in 1976 on The American Mathematical Monthly under the name of Probabilities in Proofreading. Because the problem is interesting and Polya’s solution is short and elegant, I have decided to record and share it. Also, since code sharing and reading is a frequent activity in the software development world, estimating the desired value can be helpful for some readers of this blog.

Estimating the number of unnoticed errors

Let E be the number of all errors, noticed and unnoticed, in the source code. Our goal is to estimate the value of E-(A+B-C). Let p be the probability that friend LaTeX Formula notices any given error and q the analogous probability for friend LaTeX Formula. The expected number of errors that may be detected by LaTeX Formula is LaTeX Formula and by LaTeX Formula is LaTeX Formula. Assuming that these probabilities are independent, the expected number of errors that may be mutually detected by both friends is LaTeX Formula.

Because we are interested in an estimate, we can safely assume that the expected numbers are approximately equal to the number of errors detected, that is, LaTeX Formula, LaTeX Formula, and LaTeX Formula. (We use the notation LaTeX Formula to denote that two numbers are approximately equal.)

We now have all the ingredients to conclude the solution. Recall that our goal is to estimate the value of E-(A+B-C). We calculate:

LaTeX Formula

This is the desired estimate!

Related Articles:

Making ThinkFinger Faster

Some time ago, I have installed and configured ThinkFinger, a driver for the fingerprint-readers found in most IBM/Lenovo Thinkpads, including my Lenovo X61 Tablet. Although it worked well, it was very slow! I had to wait 5 to 10 seconds for the prompt to show, then a bit more before I could swipe my finger or write my password, and then a bit more for the password to be accepted.

To make my world a better place to live, I have downloaded the development version of the driver and tried to figure out why it was being so slow. It turns out that in the library that is used to communicate with the fingerprint-reader, there was a call to the function usb_bulk_read that was taking too long. I have tried to find some official documentation about this function, but all I could find was how not to document a function (seriously, what is the role of the parameter timeout? and is it measured in seconds? milliseconds?).

Anyway, I have changed the value of the USB_TIMEOUT constant from 5000 to 5 and the fingerprint-reader is much, much better! In detail, here is the patch:

Index: libthinkfinger/libthinkfinger.c
===================================================================
--- libthinkfinger/libthinkfinger.c (revision 118)
+++ libthinkfinger/libthinkfinger.c (working copy)
@@ -35,7 +35,7 @@

#define USB_VENDOR_ID 0x0483
#define USB_PRODUCT_ID 0x2016
-#define USB_TIMEOUT 5000
+#define USB_TIMEOUT 5
#define USB_WR_EP 0x02
#define USB_RD_EP 0x81
#define DEFAULT_BULK_SIZE 0x40

I am not sure why this works and what are the implications, but all the tests I did were satisfactory. If you know more about this problem, please let me know! I would be more than happy to understand why the USB_TIMEOUT constant seems to be acting as a time-wait delay, rather than as a timeout in the following call (line 324, libthinkfinger/libthinkfinger.c):

usb_retval = usb_bulk_read (tf->usb_dev_handle, USB_RD_EP, bytes, size, USB_TIMEOUT);

(I think it is strange that I haven’t found anyone complaining about the same problem. Nevertheless, I’ve decided to publish this post anyway, in case someone runs into the same problem.)

Update (2009/07/31): I’ve upload a patched SRC RPM: thinkfinger-0.3-9.jff.src.rpm. Please note that I have added the patch to Fedora’s SRC RPM.

Sir Tony Hoare on the future of Computing Science

Sir Tony Hoare on the future of Computing Science (CACM, March 2009):

I expect the future to be as wonderful as the past has been. There’s still an enormous amount of interesting work to do. As far as the fundamental science is concerned, we still certainly do not know how to prove programs correct. We need a lot of steady progress in this area, which one can foresee, and a lot of breakthroughs where people suddenly find there’s a simple way to do something that everybody hitherto has thought to be far too difficult.

A Calculational Proof of the Handshaking Lemma

In graph theory, the degree of a vertex A, d.A, is the number of edges incident with the vertex A, counting loops twice. So, considering graph 0 below, we have d.A=3, d.B=3, d.C=1, d.D=3, and d.E=2.

Example of an undirected graph with five nodes

Graph 0: Example of an undirected graph with five nodes

A well-known property is that every undirected graph contains an even number of vertices with odd degree. The result first appeared in Euler’s 1736 paper on the Seven Bridges of Königsberg and is also known as the handshaking lemma (that’s because another way of formulating the property is that the number of people that have shaken hands an odd number of times is even).

As we can easily verify, graph 0 satisfies this property. There are four vertices with odd degree (A,B,C, and D), and 4, of course, is an even number.

Although the proof of this property is simple, I have never seen it proved in a calculational and goal-oriented way. My aim with this post is to show you a development of a goal-oriented proof.
(Continued)

Multiples in the Fibonacci series

I found the following problem on K. Rustan M. Leino’s puzzles page:

[Carroll Morgan told me this puzzle.]

Prove that for any positive K, every Kth number in the Fibonacci sequence is a multiple of the Kth number in the Fibonacci sequence.

More formally, for any natural number n, let F(n) denote Fibonacci number n. That is, F(0) = 0, F(1) = 1, and F(n+2) = F(n+1) + F(n). Prove that for any positive K and natural n, F(n*K) is a multiple of F(K).

This problem caught my attention, because it looks like a good example for using a result that I have derived last year. My result gives a reasonable sufficient condition for showing that a function distributes over the greatest common divisor and shows that the Fibonacci function satisfies the condition.

In fact, using the property that the Fibonacci function distributes over the greatest common divisor, we can solve this problem very easily. Using LaTeX Formula to denote the Fibonacci number LaTeX Formula, LaTeX Formula to denote the greatest common divisor of LaTeX Formula and LaTeX Formula, and LaTeX Formula to denote the division relation, a possible proof is:

LaTeX Formula

The crucial step is clearly the one where we apply the distributivity property. Distributivity properties are very important, because they allow us to rewrite expressions in a way that prioritizes the function that has the most relevant properties. In the example above we could not simplify LaTeX Formula nor LaTeX Formula, but applying the distributivity property prioritised the LaTeX Formula operator — and we know how to simplify LaTeX Formula. Furthermore, in practice, distributivity properties reduce to simple syntactic manipulations, thus reducing the introduction of error and simplifying the verification of our arguments.

(Now that I think about it, perhaps it would be a good idea to write a note on distributivity properties, summarizing their importance and their relation with symbol dynamics.)

If you have any corrections, questions, or alternative proofs, please leave a comment!

New domain name (joaoff.com)

Some people were complaining about the domain joaoferreira.org, because it was a bit long and they never got the number of r’s in Ferreira right. From today, they can’t complain anymore!

The new and official URL of this website is now shorter and r’s-free: joaoff.com .

If you don’t like it and prefer the old one, please let me know!

A reward check from Donald Knuth

The other day I went to my pigeon-hole to collect my snail mail, and I had a letter from Donald E. Knuth, Professor Emeritus of the Art of Computer Programming!

Cover of the letter that Knuth sent to Joao

Inside, there was a check for a correction I sent him some months ago. In fact, it was not really a correction; it was more like a comment. And it was so obvious (he even said that) that he just sent $0.32, instead of the usual $2.56. But hey, who cares? I’ve got Knuth’s autograph now :-)

A reward check from Donald Knuth

Perhaps I should set as one of my goals to find a proper error, so that I can receive a $2.56 check :) By the way, the errata of the Concrete Mathematics is available online and this particular omission is documented as follows:

page 338, line 2 from the bottom
change “for $z$” to “for $z$ and multiplying by $a$”

Calculational proofs are usually direct

jd2718 asked in his blog if anyone knew a direct proof of the irrationality of LaTeX Formula  . In this post I present a proof that, even if some don’t consider it direct, is a nice example of the effectiveness of calculational proof. But first, there are two concepts that need to be clarified: direct proof and irrational number.

Direct proofs

The concept of direct proof can vary slightly from person to person. For instance, Wikipedia defines it as:

In mathematics and logic, a direct proof is a way of showing the truth or falsehood of a given statement by a straightforward combination of established facts, usually existing lemmas and theorems, without making any further assumptions.

Alternatively, in Larry W. Cusick’s website we can read:

A direct poof [sic] should be thought of as a flow of implications beginning with “P” and ending with “Q”.

P -> … -> Q

Most proofs are (and should be) direct proofs. Always try direct proof first, unless you have a good reason not to.

I consider the wording ‘without making any further assumptions‘ in the first definition ambiguous and I don’t understand why the second definition only applies to implications. But anyway, with these definitions in mind, a direct proof for the irrationality of LaTeX Formula can be something like:

LaTeX Formula

Or, alternatively, we can also use a proof of the following shape:

LaTeX Formula

Irrational numbers

An irrational number is a real number that can’t be expressed as a simple fraction. Therefore, the number LaTeX Formula is irrational because for all integers m and n, with n non-negative, we have that:

LaTeX Formula

A direct proof for the irrationality of LaTeX Formula

Now that we have clarified the concepts above, we prove that LaTeX Formula is irrational. For all integers m and n, with n non-negative, we have:

LaTeX Formula

Note that, unlike traditional proofs, we don’t assume that m and n are co-prime, nor that LaTeX Formula is a rational. We essentially derive the boolean value of the expression LaTeX Formula

If you have any suggestions or corrections, please leave a comment. I’d be more than happy to hear from you.

Note: I learnt the contrapositive of this proof from Roland Backhouse (page 38, Program Construction — Calculating Implementations from Specifications).

In Defense Of Computer Science

Very good post from Paper Trail:

So why study computer science? The job prospects at the end are usually pretty good – because, if nothing else, you can become a pretty good software developer fairly quickly – but they are unknown. My argument is that the study of computer science is enough of an incentive enough to make it worthwhile.

Related Articles:

The programmers of tomorrow

A recent article written by Dr. Robert B.K. Dewar and Dr. Edmond Schonberg (both from AdaCore Inc.) is generating some discussion on the state of Computer Science (CS) education in the United States. In “Computer Science Education: Where Are the Software Engineers of Tomorrow?“, Dewar and Schonberg claim that U.S. universities are training unqualified and easily replaceable programmers.

“It is our view that Computer Science (CS) education is neglecting basic skills, in particular in the areas of programming and formal methods. We consider that the general adoption of Java as a first programming language is in part responsible for this decline. We examine briefly the set of programming skills that should be part of every software professional’s repertoire.”

The comment about Java’s adoption annoyed some Java aficionados, but in a recent interview, Robert Dewar adds that the problem goes far beyond the choice of Java as the first programming language. The real problem is that CS programs are being dumbed down, so that they become more accessible and popular. In result, they “are not rigorous enough and don’t promote in-depth thinking and problem solving”.

“A lot of it is, ‘Let’s make this all more fun.’ You know, ‘Math is not fun, let’s reduce math requirements. Algorithms are not fun, let’s get rid of them. Ewww – graphic libraries, they’re fun. Let’s have people mess with libraries. And [forget] all this business about ‘command line’ – we’ll have people use nice visual interfaces where they can point and click and do fancy graphic stuff and have fun.”

Although the paper is concerned with the American reality, I believe we have the same problem in Europe — at least, and as far as I know, in the UK and in Portugal. However, in my opinion, the problem starts before university. The maths’s programs in secondary schools are also being simplified (or dumbed down, if you prefer) and many important concepts, like logic and proofs, are being ignored.

In result, first-year students usually have a poor background on maths and problem solving. In fact, most of them have never seen a proof and don’t even understand the importance of mathematical reasoning. With poor reasoning abilities, they become intellectually less curious, accepting things as they are presented, and they have tremendous difficulties creating new algorithms, or convincing someone that their own algorithms are correct.

Moreover, once they are in the university, one of two things happens:

  1. they are not taught explicitly how to solve problems or how to derive algorithms from their formal specifications (this is the most common case);
  2. or they are taught the above skills but their poor background doesn’t allow them to fully appreciate these subjects.

(Continued)

Related Articles: