Jack of All Trades

Friday, August 26, 2011

Single Pass Random Sampling

I needed to create a random sample from a file containing several million lines. The script below extracts random lines from a file in a single pass, without knowing the line count in advance, while guaranteeing that every line has the same probability n/N of ending up in the sample (where n is the sample size and N is the number of lines in the original file). This technique is commonly known as reservoir sampling:

 
from contextlib import closing
from optparse import OptionParser
import random
import sys

def parse_options(argv, **defaults):
    options = OptionParser()
    options.add_option('-n', "--sample-size",
                       action="store",
                       type="int",
                       dest="sample_size",
                       default=defaults.get('sample_size', 100),
                       metavar='SIZE')
    options.add_option('-o', "--output",
                       action="store",
                       dest="output",
                       default='-',
                       metavar='FILE')

    return options.parse_args(argv)


def sample(sample_size, items):
    # Reservoir sampling: keep the first sample_size items, then let each
    # later item replace a random slot with probability sample_size/(count+1).
    results = []
    with closing(items):
        for count, item in enumerate(items):
            if len(results) < sample_size:
                results.append(item)
            else:
                # count is 0-based, so count + 1 items have been seen so far.
                replace_index = random.randrange(count + 1)
                if replace_index < sample_size:
                    results[replace_index] = item

    return results


def main(argv):
    options, file_paths = parse_options(argv, sample_size=10, file_path=r'c:/temp/branches')
    sample_size = options.sample_size
    output = options.output
    
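    # argv[0] is the script name, so the first real argument is file_paths[1].
    if len(file_paths) > 1:
        source = open(file_paths[1])
    else:
        source = sys.stdin

    # File objects iterate line by line; xreadlines() is gone in Python 3.
    results = sample(sample_size, source)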

    if output == '-':
        out = sys.stdout
    else:
        out = open(output, "w")

    with closing(out):
        for line in results:
            out.write(line)


random.seed()

if __name__ == "__main__":
    main(sys.argv)
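
As a sanity check on the single-pass idea, here is a minimal Python 3 sketch (separate from the script above) that runs a reservoir sampler many times over a small input and counts how often each item lands in the sample. With sample size n and N items, each fraction should come out close to n/N. The names reservoir_sample and inclusion_frequencies, the fixed seed, and the trial count are all assumptions made for illustration.

```python
import random
from collections import Counter


def reservoir_sample(sample_size, items, rng=random):
    """Return up to sample_size items chosen uniformly from an iterable in one pass."""
    results = []
    for count, item in enumerate(items):
        if len(results) < sample_size:
            results.append(item)
        else:
            # count + 1 items seen so far; keep this one with
            # probability sample_size / (count + 1).
            slot = rng.randrange(count + 1)
            if slot < sample_size:
                results[slot] = item
    return results


def inclusion_frequencies(sample_size, population_size, trials, seed=0):
    """Fraction of trials in which each item ended up in the sample."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(trials):
        hits.update(reservoir_sample(sample_size, range(population_size), rng))
    return [hits[i] / trials for i in range(population_size)]


if __name__ == "__main__":
    # Every printed value should be near 3/10.
    for freq in inclusion_frequencies(3, 10, 20000):
        print(round(freq, 3))
```

If the fractions for the early items come out noticeably different from those for the late items, the replacement probability is off by one somewhere.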

Thursday, August 25, 2011

Week Sequence in Postgres

Here's how to generate a list of "week of" dates (the Monday of each week) for the current week and the 12 weeks before it using Postgres SQL:


select date_trunc('week', current_date)::date - s.t as "weekOf"
  from generate_series(0, 7*12, 7) as s(t)
  order by "weekOf" asc;

Result:

5/30/2011
6/6/2011
6/13/2011
6/20/2011
6/27/2011
7/4/2011
7/11/2011
7/18/2011
7/25/2011
8/1/2011
8/8/2011
8/15/2011
8/22/2011
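
For comparison, the same sequence can be computed outside the database. This is a rough Python sketch, not part of the original query; the function name week_of_dates and its parameters are made up for illustration. It assumes weeks start on Monday, which is what date_trunc('week', ...) uses.

```python
from datetime import date, timedelta


def week_of_dates(today, weeks_back=12):
    """Mondays of the current week and the weeks_back weeks before it, oldest first."""
    # weekday() is 0 for Monday, so this steps back to the start of the week.
    monday = today - timedelta(days=today.weekday())
    return [monday - timedelta(weeks=i) for i in range(weeks_back, -1, -1)]


if __name__ == "__main__":
    for d in week_of_dates(date(2011, 8, 25)):
        print("%d/%d/%d" % (d.month, d.day, d.year))
```

Note that the list has weeks_back + 1 entries, just as generate_series(0, 7*12, 7) yields 13 rows.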

Wednesday, December 31, 2008

SOA Dependencies

What happens when there are existing applications that use a service and the service needs to change to support new functionality? There are two scenarios:

1. All existing applications need to be updated to use the latest version of the service.

2. All existing applications keep using the previous version of the service and the new application gets to use the latest version.

There are pros and cons to both.

With approach #1, extra time and effort are needed to update existing implementations and to retest applications to make sure the new version doesn't break anything. Automated regression testing is key here. If a team has good coverage in automated tests, retesting can be done fast.

Approach #2 would let a developer worry only about the new application, with the existing applications continuing to use the previous version of the service. This approach, though straightforward and simple on the surface, is laden with hidden problems of version conflicts. Let's say we have Service A packaged as version 1.0 (a-1.0.jar) and version 2.0 (a-2.0.jar). Even though the two versions are packaged independently of each other, they still cannot both be used from within the same client application (for example, a web application war deployment) if the changed classes have the same names. If you deploy both a-1.0.jar and a-2.0.jar, you may not get any immediate indication of a problem. But for any given class name, only one of the two jars will actually be used, and you may get strange linkage errors or inconsistent and unexpected results later.
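
The "only one of the two will actually be used" behavior is easier to see in a small experiment. The sketch below is only an analogy, written in Python rather than Java, and Java class loading differs in its details: two temporary directories stand in for a-1.0.jar and a-2.0.jar, each holding a module with the same name, and Python's import search path stands in for the class path. The module name service_a and both helper functions are hypothetical.

```python
import importlib
import os
import sys
import tempfile


def write_module(directory, version):
    # Both "releases" define the same module name, like same-named classes in two jars.
    with open(os.path.join(directory, "service_a.py"), "w") as f:
        f.write("VERSION = %r\n" % version)


def resolve_version(search_dirs):
    """Return the VERSION that gets loaded when search_dirs are searched in order."""
    saved_path = list(sys.path)
    sys.modules.pop("service_a", None)
    sys.path[:0] = search_dirs
    importlib.invalidate_caches()
    try:
        return importlib.import_module("service_a").VERSION
    finally:
        # Restore the interpreter state so repeated calls are independent.
        sys.path[:] = saved_path
        sys.modules.pop("service_a", None)


with tempfile.TemporaryDirectory() as v1, tempfile.TemporaryDirectory() as v2:
    write_module(v1, "1.0")
    write_module(v2, "2.0")
    first = resolve_version([v1, v2])
    second = resolve_version([v2, v1])
    print(first, second)
```

Whichever copy is found first wins and the other is silently ignored, with no warning that two versions were present.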

Approach #2 would obviously be superior to #1 if only we could find a way to resolve conflicts and make sure that the old code runs against v1.0 of the service and the new code runs against v2.0 of the service. What's the answer?

I believe relief will come from the OSGi Alliance. I've heard a lot about it (Eclipse is based on OSGi), but only now am I finally facing a situation, similar to the one described above, that requires me to take a closer look at what OSGi is really promising. Over the next few weeks, I'll be playing with various OSGi implementations and APIs. I'll report my findings here.

Tuesday, November 20, 2007

Ruby - Part 2

I haven't dropped Ruby and have been playing with it since my last post. I'm not enthused, but neither am I ready to give up. My problems might simply be the learning pains of a new language, environment, and paradigm, or Ruby may simply not be the best tool for this particular project.

I've run into two problems so far:

Background processing: besides visiting the page with our production server's status myself (the page is generated by RoR), I would also like to put a few monitors in place. These monitors should run in the background and fire notifications when server stats get out of whack (e.g. when Tomcat's number of current connections approaches the peak). There doesn't seem to be a unified way of doing this with RoR. I've seen solutions using half-assed hacks that look like Ruby ports of cron. But there is nothing that looks finished, polished, and usable.

Preferences: I want to be able to persist some runtime parameters between application runs. My options are a database or the file system. Neither looks appealing. Dragging the RDBMS baggage around just so that I can save a handful of parameters seems excessive. On the other hand, I also hate it when web apps muck with the file system, because setting up and maintaining such an app is a hassle (setting up directory permissions, etc.). What I'd like is a simple Preferences API similar to what Java has.

Thursday, November 15, 2007

Trying Ruby

I want to create a web page for monitoring the health of our production server. The page would gather data from different places and display them in a dashboard kinda format. I didn't really want to use Java for this and instead decided to give Ruby on Rails a try.

Getting Ruby on Rails running on Windows in a Microsoft shop isn't trivial. Ruby (RubyGems) is really not designed for a Windows-based corporate environment. I immediately ran into the firewall problem: we are running an MS firewall that uses NTLM authentication, and Ruby doesn't support that out of the box.

I spent a few hours browsing around, reading blogs and articles, trying various things. I finally ran across a post pointing to a gem that interfaces with the native NTLM library: rubysspi-1.0.4-i386-mswin32.gem. After mucking about with it for a little while and finally RTFMing, I managed to get it to work and got RoR downloaded and installed.

Wednesday, October 31, 2007

Adapting Code Documentation Practices

Today I received a decree from above to start spot checking code for proper documentation, and if I don't deem the documentation effort to be adequate, I need to tell QA not to accept the project.

Implementing this decree will be tricky socially, technically, and logistically. On the social front, it will be difficult to gain the necessary buy-in from other developers, who may perceive spot checks as too Big Brotherly and invasive. Technically, documenting every function is wasteful, so we'll need to define exactly which functions and classes are most in need of javadocs. And finally, logistically, we need to find answers to questions like "How do we enforce these rules?", "How do we plug this new requirement into our existing processes?", and "How do we measure our progress and the level of compliance?"

Documentation for documentation's sake is pointless and wasteful. I believe the best approach to satisfying all three areas is to look at the Agile Manifesto and start implementing the process with a single goal in mind: to maximize each and every principle of agile development. Doing so will help us focus our reasons (we may end up finding that there are none), define criteria for gathering performance metrics, and most importantly, provide the grounds for a successful buy-in from management and developers.

Tuesday, October 30, 2007

System Reliability

As they say, if you haven't seen it before, it's new to you. So today I learned something amazing, shocking (it was shocking to me), and completely mundane for those who deal with providing services day in and day out. System reliability, which is related to system uptime, is defined in terms of 9's: 99%, 99.9%, 99.99%, etc. Seems like splitting hairs at first. Big deal, whether it's 99% or 99.99%. Well, it is. For a system running 24/7, 99% uptime implies about 88 hours of downtime per year, while 99.99% implies about 53 minutes a year. Big efen difference!
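
The arithmetic behind those numbers is short enough to script. The sketch below is a minimal Python version, assuming a 365-day year (8,760 hours); the function name downtime_hours_per_year is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(uptime_percent):
    """Hours per year a 24/7 system may be down at the given uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


if __name__ == "__main__":
    for uptime in (99.0, 99.9, 99.99):
        hours = downtime_hours_per_year(uptime)
        print("%.2f%% uptime: %.1f hours (%.0f minutes) of downtime per year"
              % (uptime, hours, hours * 60))
```

Each extra 9 cuts the allowed downtime by a factor of ten.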