how CJK is your page?

i made a python app that scans the content of a web page to see how CJK it is, based on the characters used. it only counts characters and can't tell the languages apart, so a high score could mean Japanese, Chinese etc. here it is:

# -*- coding: utf-8 -*-

from sys import argv
from urllib2 import build_opener
from HTMLParser import HTMLParser

class jaHTMLParser(HTMLParser):

    ja = nonja = 0
    encoding = "utf-8"

    def handle_starttag(self, tag, attrs):
        for attr in attrs:
            if tag == 'meta' and attr[0] == "content" and attr[1].find("charset=") != -1:
                self.encoding = attr[1].split("charset=")[1]

    def handle_data(self, data):
        if data in ("/*", "*/") or data.isspace(): return
        uni = data.decode(self.encoding)
        for c in uni:
            u8 = c.encode("utf-8")
            if u8 >= '⺀' and u8 <= '𯨟': self.ja += 1 # very approximate
            else: self.nonja += 1

    def unknown_decl(self, data): pass # CDATA is not an error!

opener = build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')] # avoid 403 forbidden

reader = jaHTMLParser()
reader.CDATA_CONTENT_ELEMENTS = [] # don't treat any CDATA as textual content
reader.feed(opener.open(argv[1]).read())
reader.close()
print "%d%%" % (float(100 * reader.ja) / float(reader.nonja + reader.ja))
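the counting step on its own, as a rough python 3 sketch that works on text you have already decoded. the names `CJK_RANGES`, `is_cjk` and `cjk_percent` are made up for illustration, and the narrower code point ranges are an assumption of this sketch (the script above uses one wide span from ⺀ to 𯨟). it also skips all whitespace, which the script above does not, so the two will not give identical numbers.

```python
# rough sketch: what percentage of the non-space characters are CJK?
# the ranges below are an assumption, still very approximate
CJK_RANGES = (
    (0x2E80, 0x9FFF),    # CJK radicals up to the unified ideographs (kana included)
    (0xF900, 0xFAFF),    # CJK compatibility ideographs
    (0x20000, 0x2FA1F),  # supplementary ideographs
)

def is_cjk(ch):
    """True if the single character ch falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)

def cjk_percent(text):
    """Percentage of non-whitespace characters in text that count as CJK."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0  # nothing to count, avoid dividing by zero
    cjk = sum(1 for c in chars if is_cjk(c))
    return 100.0 * cjk / len(chars)

if __name__ == "__main__":
    print("%d%%" % cjk_percent("日本語 abc"))
```

working on code points instead of utf-8 bytes means there is no encode step per character, and the empty-page case gives 0 instead of a division error.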

🙂


barcoded

well, i’ve scanned over 200 barcodes now. but what have i learned?

  • you can get away with a lot of noise in the picture.
  • all those silly things that we try at supermarkets to get the scan to work are actually very sensible.
  • with my dodgy setup it’s only very slightly quicker than typing. but lots more geeky fun.
  • shiny things are bad.

Achieve! – Week 45

  • reached 1000th word in Japanese vocabulary quiz. it was “recently”. (今度)
  • cycled again .. twice! – it’s been a while
  • put grapher on hold to focus on table merging
  • cooked food from actual ingredients
  • chopped up parts for another monitor stand and built it
  • tried switching off fans – got freaked and turned them back on
  • finished a book
  • manually removed remaining spam from a few years of emails
  • upgraded athlon PC to F8 – it’s not very happy about that!
  • wrote new OS GML loader
  • joined freecycle and passed on a telly
  • tidied up slightly
