fleshyorgans

Journal of a software engineer with a romantic heart

Ruby and processing character sets

Posted on July 19, 2007

I’m posting this because I had a hard time finding resources on how to convert between character encodings, such as from Windows-1252 (cp1252) to UTF-8 or ISO-8859-1. It took me about a day to discover these libraries. Maybe I wasn’t searching for the right keywords; dunno.


require 'iconv'
require 'charguess'

# Reads an arbitrary file and attempts to write a UTF-8 copy to a new file.
# Sort of quick and dirty.
# You should note that Iconv.conv will barf if you specify
# a source encoding that doesn't match the actual source encoding.
def poop_out_utf8(infile, outfile)
  if infile == outfile
    puts 'Source and destination are the same, silly!'
    return
  end
  out_file = File.new(outfile, 'w')
  begin
    File.open(infile, 'r') do |file|
      file.each_line do |line|
        # Guess the encoding of each line, then convert that line to UTF-8
        out_file.puts Iconv.conv('UTF-8', CharGuess::guess(line), line)
      end
    end
  rescue StandardError => e
    puts e
  ensure
    out_file.close
  end
end
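
For anyone on a newer Ruby: iconv was dropped from the standard library in Ruby 2.0, and strings now carry their own encoding. Here is a rough sketch of the same conversion using the built-in String#encode. It assumes you already know the source encoding, since core Ruby has no equivalent of CharGuess. The names to_utf8 and convert_file_to_utf8 are made up for this example and do not come from any library.

```ruby
# Hypothetical helper (not from the original post): convert raw bytes from a
# known source encoding to UTF-8 with Ruby's built-in String#encode.
def to_utf8(bytes, source_encoding)
  # dup so the caller's string is not relabelled in place
  bytes.dup.force_encoding(source_encoding).encode('UTF-8')
end

# Hypothetical file-level wrapper: same idea as poop_out_utf8, but the source
# encoding is passed in by the caller instead of being guessed.
def convert_file_to_utf8(infile, outfile, source_encoding)
  raise ArgumentError, 'Source and destination are the same' if infile == outfile

  File.open(outfile, 'w:UTF-8') do |out|
    # Binary read mode keeps the bytes untouched until we convert them.
    File.foreach(infile, mode: 'rb') do |line|
      out.write to_utf8(line, source_encoding)
    end
  end
rescue EncodingError => e
  # Raised when the bytes do not fit the encoding we claimed they were in.
  warn "Conversion failed: #{e.message}"
end

# 0xE9 is "é" in Windows-1252
puts to_utf8("caf\xE9".b, 'Windows-1252') # prints "café"
```

Reading line by line like this only suits ASCII-compatible source encodings such as Windows-1252 or ISO-8859-1. Something like UTF-16 would need the whole file read and converted in one go.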

You should have iconv already. You’ll most likely need to install CharGuess yourself. Grab the C library here, and the Ruby binding here.

Assuming you’ve extracted the C library to /home/user/libcharguess/ (and done the usual ./configure, make, sudo make install routine), install the Ruby charguess bindings with:

ruby extconf.rb --with-charguess-include=/home/user/libcharguess/cpp/
make
make install
