
Geeklog Forums

Replacing curly quotes in Windows Western (CP-1252) encoding



SynrG

Sometimes a user submits an article in Windows Western (CP-1252) encoding, which is not an Internet standard. It looks OK in some browsers but crappy in others: the CP-1252-specific characters, such as curly quotes, the ellipsis, and the em-dash, show up as a question mark or a "missing Unicode character" glyph, or are dropped altogether, depending on the browser. These characters are easy to identify because they occupy byte values 0x80 to 0x9F, which are unprintable control-character positions in Latin-1 and in the matching range of Unicode. Cleaning them up should therefore be a simple matter of substituting the proper HTML entity for each "bad" character (&#8212; for an em-dash, etc.).
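To make the substitution concrete, here is a minimal sketch in Python (not the PHP a Geeklog integration would need, and not the "quoter" program itself). It assumes the submitted text was decoded as Latin-1, so the stray CP-1252 bytes arrive as code points U+0080 to U+009F. The function name cp1252_to_entities is made up for illustration.

```python
def cp1252_to_entities(text: str) -> str:
    """Replace stray CP-1252 characters with numeric HTML entities.

    Assumption: `text` was decoded as Latin-1, so CP-1252 bytes
    0x80-0x9F show up as the control code points U+0080-U+009F.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if 0x80 <= code <= 0x9F:
            try:
                # Reinterpret the byte as CP-1252 to find the intended character.
                intended = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # A few positions (0x81, 0x8D, 0x8F, 0x90, 0x9D) are
                # unassigned in CP-1252; leave those untouched.
                out.append(ch)
                continue
            out.append("&#%d;" % ord(intended))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    raw = b"\x93Hi\x94 \x97 ok\x85"          # curly quotes, em-dash, ellipsis
    print(cp1252_to_entities(raw.decode("latin-1")))
```

Because the output contains only ASCII entities for the affected characters, running the filter a second time on already-clean text leaves it unchanged.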

Therefore, I would like all submissions to be put through a CP-1252 to HTML character entity filter. I found one written in C, called "quoter", and as a proof of principle I hacked it to handle a few extra characters that appear in our data. I ran my "broken" documents through it by pulling the text fields from the database by hand, putting them through my modified "quoter", and saving them back to the database. Of course, rewriting it in PHP and integrating it with the Geeklog text editor (or is there a "file save" layer beneath the editor?) would probably be better. To fix old broken documents, a user would just need to edit and save each one so that it goes through the filter. All future submissions would be stored properly in the database because they would pass through this filter before being saved.
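The by-hand round trip through the database could also be scripted. The sketch below is only an illustration in Python, using an in-memory SQLite database as a stand-in. The table articles(id, body) and the names clean and fix_stored_articles are hypothetical and do not reflect Geeklog's real schema or API. It assumes the stored text holds the stray characters as code points U+0080 to U+009F.

```python
import re
import sqlite3

# Code points that correspond to the CP-1252-specific byte range 0x80-0x9F.
_C1_RANGE = re.compile("[\x80-\x9f]")


def _to_entity(match):
    """Turn one stray character into a numeric HTML entity."""
    ch = match.group()
    try:
        return "&#%d;" % ord(bytes([ord(ch)]).decode("cp1252"))
    except UnicodeDecodeError:
        return ch  # position unassigned in CP-1252; leave as-is


def clean(text):
    return _C1_RANGE.sub(_to_entity, text)


def fix_stored_articles(conn):
    """Re-filter every row of a hypothetical articles(id, body) table.

    Returns the number of rows that were actually changed.
    """
    rows = conn.execute("SELECT id, body FROM articles").fetchall()
    changed = 0
    for row_id, body in rows:
        cleaned = clean(body)
        if cleaned != body:
            conn.execute(
                "UPDATE articles SET body = ? WHERE id = ?", (cleaned, row_id)
            )
            changed += 1
    conn.commit()
    return changed


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute("INSERT INTO articles (body) VALUES (?)", ("It\x92s broken",))
    conn.execute("INSERT INTO articles (body) VALUES (?)", ("Already fine",))
    print(fix_stored_articles(conn), "row(s) updated")
```

Calling the same clean function on the save path would cover new submissions, and a second batch run would find nothing left to change.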

Before I bother going away and coding this myself — it would be my first PHP project, though I am well-versed in many other languages — I would like to know if anyone has encountered this problem and solved it already on their own, or if you have suggestions as to how to best integrate this with Geeklog.

Thanks,
Ben