In studies of bird song, experimental designs and statistical analyses are often inappropriate for the hypotheses being tested. Frequently, only one playback or tutor stimulus is used to test a number of subjects, and the repeated samples of that single replicate are then analysed statistically as though the samples themselves were replicates. Proper experimental design demands that both the population of subjects and the 'population' of song stimuli be sampled adequately. Several designs commonly used for playback or tutoring experiments are illustrated, and changes in design are suggested that would increase the number of independent samples and would thus improve the reliability and external validity (i.e. generalizability) of the experimental work.
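The distinction between repeated samples and independent replicates can be made concrete with a minimal sketch. The function name, the bird and song labels, and the two designs below are hypothetical illustrations, not designs taken from the paper; the sketch only counts how many distinct subjects and distinct stimuli a design actually samples.

```python
from typing import Dict, Iterable, Tuple


def independent_sample_sizes(trials: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count trials, distinct subjects, and distinct stimuli in a design.

    Each trial is a (subject, stimulus) pair. The number of distinct
    stimuli, not the number of trials, is the sample size available for
    generalizing to the 'population' of song stimuli.
    """
    trials = list(trials)
    subjects = {subject for subject, _ in trials}
    stimuli = {stimulus for _, stimulus in trials}
    return {
        "trials": len(trials),
        "subjects": len(subjects),
        "stimuli": len(stimuli),
    }


# Hypothetical design A: ten birds all hear the same recording.
single_stimulus = [(f"bird_{i}", "song_1") for i in range(10)]

# Hypothetical design B: each of ten birds hears a different recording.
many_stimuli = [(f"bird_{i}", f"song_{i}") for i in range(10)]

print(independent_sample_sizes(single_stimulus))
print(independent_sample_sizes(many_stimuli))
```

Both hypothetical designs contain ten trials and ten subjects, but the first samples the stimulus population only once, so treating its ten trials as ten replicates of the stimulus class would overstate the evidence.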