Ruby and processing character sets
Posted on July 19, 2007
I’m posting this because it was hard for me to find resources on converting between character encodings, such as from Windows-1252 (cp1252) to UTF-8 or ISO-8859-1. It took me about a day to discover these libraries. Maybe I wasn’t searching for the right keywords; dunno.
require 'iconv'
require 'charguess'

# Reads an arbitrary file and attempts to write a UTF-8 copy to a new file.
# Sort of quick and dirty.
# Note that Iconv.conv will barf if you specify a source encoding
# that doesn't match the actual source encoding.
def poop_out_utf8(infile, outfile)
  if infile == outfile
    puts 'Source and destination are the same, silly!'
    return
  end

  out_file = File.new(outfile, 'w')
  begin
    File.open(infile, 'r') do |file|
      file.each_line do |line|
        # Guess the encoding of each line, then convert that line to UTF-8
        out_file.puts Iconv.conv('UTF-8', CharGuess::guess(line), line)
      end
    end
  rescue StandardError => e
    # Conversion and I/O errors land here; print them and carry on
    puts e
  ensure
    out_file.close
  end
end
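The comment in the code warns that the conversion fails when the declared source encoding doesn’t match the actual bytes. Here is a minimal sketch of that failure mode. It assumes Ruby 1.9 or later and uses the built-in String#encode instead of iconv, so it runs without extra libraries. The to_utf8 helper is a hypothetical name for illustration, not part of iconv or CharGuess.

```ruby
# Hypothetical helper (not from iconv or CharGuess): convert raw bytes from a
# declared source encoding to UTF-8 with Ruby's built-in String#encode.
# Returns nil when the bytes are not valid in the declared source encoding.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode('UTF-8')
rescue EncodingError => e
  warn "Could not convert from #{source_encoding}: #{e.message}"
  nil
end

cp1252_bytes = "caf\xE9".b # "café" as Windows-1252 bytes

# Correct source encoding: the conversion succeeds.
puts to_utf8(cp1252_bytes, 'Windows-1252')

# Wrong source encoding: 0xE9 is not a valid US-ASCII byte, so the
# conversion raises and the helper returns nil.
p to_utf8(cp1252_bytes, 'US-ASCII')
```

Rescuing EncodingError covers both invalid byte sequences and characters with no mapping in the target encoding, which is roughly the same class of problem the iconv version trips over.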
You should have iconv already. You’ll most likely need to install CharGuess yourself. Grab the C library here, and the Ruby binding here.
Assuming you’ve extracted the C library to /home/user/libcharguess/ (and run ./configure, make, and sudo make install), install the Ruby charguess bindings with:
ruby extconf.rb --with-charguess-include=/home/user/libcharguess/cpp/
make
make install
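Once the build finishes, a quick way to check that both libraries load is something like the sketch below. library_available? is a hypothetical helper written for this check; it only wraps require and reports whether it raised LoadError.

```ruby
# Hypothetical helper: returns true if the named library can be required,
# false if Ruby cannot find or load it.
def library_available?(name)
  require name
  true
rescue LoadError
  false
end

# Report the status of the two libraries used in this post.
%w[iconv charguess].each do |lib|
  status = library_available?(lib) ? 'found' : 'missing'
  puts "#{lib}: #{status}"
end
```

If charguess shows up as missing, the extension most likely didn’t get installed where this Ruby looks for it.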
