Mixed unicode and ASCII

Mar 21, 2009; 13:25

Doug McNutt

I have long been frustrated by BBEdit's refusal to edit files that,
for good reason, have different line end characters throughout - like
BBEdit Worksheet files for instance.

Now I have another similar problem. Adobe has changed the format of
its .fdf (form data) files. For almost a decade I have been filling
out US tax forms by creating fdf files with Excel macros. I could
download the PDF blanks from the US IRS web site and Acrobat would
politely load the data from the fdf files.

The new format mixes not line ends but UTF-16 and ASCII encoding in
the same file! Needless to say, BBEdit doesn't handle it well.

Here's what the first few lines look like when opened with BBEdit:

%FDF-1.2
%Â’Â“Â¦"
1 0 obj
<<
/FDF
<<
/Fields [
<<
/V (Âœ l i n e 1 4)
/T (Âœ f 1 _ 0 5 8 \( 0 \))
>>
<<
/V (Âœ l i n e 1 5)
/T (Âœ f 1 _ 0 6 0 \( 0 \))
>>
<<
/V (Âœ l i n e 6)
/T (Âœ f 1 _ 0 4 2 \( 0 \))

Here's the hex dump of the same first few lines, produced by BBEdit:

0000: 25 46 44 46 2D 31 2E 32 0A 25 E2 E3 CF D3 0A 31 %FDF-1.2.%Â’Â“Â¦"..1
0010: 20 30 20 6F 62 6A 20 0A 3C 3C 0A 2F 46 44 46 20 0 obj .<<./FDF
0020: 0A 3C 3C 0A 2F 46 69 65 6C 64 73 20 5B 0A 3C 3C .<<./Fields [.<<
0030: 0A 2F 56 20 28 FE FF 00 6C 00 69 00 6E 00 65 00 ./V (Âœ .l.i.n.e.
0040: 20 00 31 00 34 29 0A 2F 54 20 28 FE FF 00 66 00 .1.4)./T (Âœ .f.
0050: 31 00 5F 00 30 00 35 00 38 00 5C 28 00 30 00 5C 1._.0.5.8.\(.0.\
0060: 29 29 0A 3E 3E 20 0A 3C 3C 0A 2F 56 20 28 FE FF )).>> .<<./V (Âœ
0070: 00 6C 00 69 00 6E 00 65 00 20 00 31 00 35 29 0A .l.i.n.e. .1.5).
0080: 2F 54 20 28 FE FF 00 66 00 31 00 5F 00 30 00 36 /T (Âœ .f.1._.0.6
0090: 00 30 00 5C 28 00 30 00 5C 29 29 0A 3E 3E 20 0A .0.\(.0.\)).>> .
00A0: 3C 3C 0A 2F 56 20 28 FE FF 00 6C 00 69 00 6E 00 <<./V (Âœ .l.i.n.

It appears that the parentheses that are not escaped designate blocks
that are encoded as UTF-16. They begin with an FEFF code point, which
is surely a byte order mark. After that there are 16-bit entries, the
first byte of which is a null in every file I have looked at. The
escaped parentheses are there because the author of the PDF used
parentheses in his definitions of the form names. Note though that
the backslash escape character is preceded by a null but the
parenthesis following it is not.
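
One reading that fits the hex dump: each string is encoded as big-endian UTF-16 with a leading FE FF mark, and the backslash is then inserted at the byte level in front of any 28, 29 or 5C byte. Under that reading the null in front of the backslash is really the high byte of the parenthesis. A minimal Python sketch of that assumption follows; the function name pdf_utf16_string is made up for illustration, and it does not deal with other bytes that may need escaping, such as 0D.

```python
def pdf_utf16_string(text: str) -> bytes:
    """Build a parenthesized string holding BOM-prefixed UTF-16BE bytes.

    Assumption: escaping happens on the encoded bytes, not on the characters.
    """
    raw = b"\xfe\xff" + text.encode("utf-16-be")
    escaped = bytearray()
    for byte in raw:
        if byte in b"()\\":        # 0x28, 0x29 or 0x5C anywhere in the byte stream
            escaped.append(0x5C)   # single-byte backslash, no null of its own
        escaped.append(byte)
    return b"(" + bytes(escaped) + b")"


# Hex of the result, for comparison against the dump at offsets 004B-0062.
print(pdf_utf16_string("f1_058(0)").hex(" ").upper())
```

If the printed bytes agree with the dump, the odd-looking "null before the backslash" is explained without any special rule for parentheses.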

So my question is: is there any way I can make use of BBEdit to
post-process the plain ASCII files produced by my Excel macros and
create a version with the mixed ASCII and UTF-16? An AppleScript would
be an easy way to go and I care not a whit about speed. Can I tell
BBEdit to change from UTF-16 to ASCII and back as it writes a file?
BBEdit uses UTF-16 internally for everything. When it reads my fdf file
does it convert 0066 (an f) to 0000 0066? Or does it leave the 16
bits alone by effectively ignoring the null character in the file?
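
What BBEdit does internally is not something this sketch can answer, but outside the editor the bytes can be read back unambiguously: drop each escaping backslash, then treat every remaining pair of bytes after FE FF as one 16-bit unit, so 00 66 is a single f. The Python below is a minimal sketch only. The name decode_pdf_string is hypothetical, other escape sequences such as \n or octal codes are not handled, and Latin-1 stands in for whatever single-byte encoding applies when there is no FE FF mark.

```python
def decode_pdf_string(body: bytes) -> str:
    """Decode the bytes found between the outer parentheses of one string."""
    unescaped = bytearray()
    i = 0
    while i < len(body):
        if body[i] == 0x5C and i + 1 < len(body):
            i += 1                      # skip the backslash, keep the byte it protects
        unescaped.append(body[i])
        i += 1
    if unescaped[:2] == b"\xfe\xff":    # byte order mark: the rest is UTF-16BE
        return unescaped[2:].decode("utf-16-be")
    return unescaped.decode("latin-1")  # assumption: fallback for unmarked strings


# Bytes of the first /T string in the hex dump, without the outer parentheses.
field = bytes.fromhex("FE FF 00 66 00 31 00 5F 00 30 00 35 00 38 00 5C 28 00 30 00 5C 29")
print(decode_pdf_string(field))
```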

I have looked at reworking my VBA code and that will be a pain. There
seems to be no way to handle nulls inside of a worksheet cell. Perl
will probably handle the task and I have started on that, but Perl's "use
unicode" options are not helpful. There is also UNIX sed, which might
work with a bunch of successive substitutions. Any other ideas?
This is a once-a-year project and I really don't want to use C for it.
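
For what it is worth, the post-processing step could also be sketched in Python instead of Perl or sed. The code below works on raw bytes, finds the parenthesized strings that follow /V and /T, and rewrites each one as FE FF plus big-endian UTF-16 with byte-level backslash escapes. Everything here is an assumption drawn from the hex dump above: that the ASCII input writes its strings as "/V (...)" and "/T (...)", that the only escapes in it are \(, \) and \\, and that Acrobat accepts the result. The names convert_fdf and pdf_utf16_string are made up for illustration.

```python
import re

# "/V " or "/T " followed by one parenthesized string with optional backslash escapes.
STRING_AFTER_KEY = re.compile(rb"(/(?:V|T) )\(((?:\\.|[^\\()])*)\)")


def pdf_utf16_string(text: str) -> bytes:
    """Encode text as FE FF + UTF-16BE and escape ( ) \\ at the byte level."""
    raw = b"\xfe\xff" + text.encode("utf-16-be")
    escaped = bytearray()
    for byte in raw:
        if byte in b"()\\":
            escaped.append(0x5C)
        escaped.append(byte)
    return b"(" + bytes(escaped) + b")"


def convert_fdf(ascii_fdf: bytes) -> bytes:
    """Rewrite every /V and /T string in a plain-ASCII FDF file's bytes."""
    def rewrite(match):
        key, body = match.group(1), match.group(2)
        # Undo the ASCII-level escapes before re-encoding the text.
        plain = re.sub(rb"\\(.)", rb"\1", body).decode("ascii")
        return key + pdf_utf16_string(plain)
    return STRING_AFTER_KEY.sub(rewrite, ascii_fdf)


sample = b"/V (line 14)\n/T (f1_058\\(0\\))\n"
print(convert_fdf(sample).hex(" ").upper())
```

To use it on a real file, read the input with open(path, "rb"), pass the bytes to convert_fdf, and write the result with open(path, "wb"), so that no text encoding or line-end translation gets in the way. Everything outside the matched strings, including the %FDF header lines, passes through unchanged.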

--
-> Stocks are getting pelloreid <-
