Andy's observations as he continues to attempt to know all that is .NET...

Wednesday, October 01, 2008

Random Sampling extension method

Had reason recently to select a random sample of data from a stream of elements.  The amount of samples I needed to take was finite but what was unknown was the number of elements in the input stream.  I managed to find various implementations of an IEnumerable extension method on the net, but all the ones I found would cache the entire stream before selecting the sample.  This clearly isn't scalable as the input stream increases in size.

After a look around I found this blog article that describes how to implement Reservoir sampling without the need to keep all the items.  This then allowed me to write a Random Sample extension method to IEnumerable<T> to do what I needed.

public static IEnumerable<T> RandomSample<T>(this IEnumerable<T> rows, int nSamples)
      {
          Random rnd = new Random();

          T[] sample = new T[nSamples];

          int nSamplesTaken = 0;

          foreach (T item in rows)
          {
              if (nSamplesTaken < sample.Length)
              {
                  sample[nSamplesTaken] = item;
              }
              else
              {                  
                  // As the amount of samples increases the probability
                  // of including a value gets less..due to the fact
                  // that it has a greater chance of surviving if it gets
                  // placed into the sample over earlier selections


                  if (rnd.Next(nSamplesTaken) < nSamples)
                  {
                      sample[rnd.Next(nSamples)] = item;
                  }
              }

              nSamplesTaken++;
          }

          if (nSamplesTaken >= nSamples)
          {
              return sample;
          }
          else
          {
              return sample.Take(nSamplesTaken);
          }
      }

1 comment:

D. Patrick Caldwell said...

Hey, I enjoyed your blog post. I've been writing a lot of extension methods lately myself. Sometimes you really have to be thankful to be a C# developer, you know?

Also, I was wondering if you've tried any syntax highlighters? I use one from google code and I have a pretty easy one liner you can add to your template. If you'd like to use it, feel free.

About Me

My photo
Im a freelance consultant for .NET based technology. My last real job, was at Cisco System were I was a lead architect for Cisco's identity solutions. I arrived at Cisco via aquisition and prior to that worked in small startups. The startup culture is what appeals to me, and thats why I finally left Cisco after seven years.....I now filll my time through a combination of consultancy and teaching for Developmentor...and working on insane startups that nobody with an ounce of sense would look twice at...