Discussion:
Best Way to Handle Inserting HTML Data
Matthew Woodward
2010-11-17 00:46:33 UTC
Permalink
I have a SQL Server database I'm converting to CouchDB. I wrote a script
that pulls data form SQL Server, denormalizes into the JSON structure I want
to have in CouchDB, and then makes an HTTP call to insert each record.

The issue I'm running into is some of the fields in SQL Server have HTML in
them, and I'm not sure the best way to handle this. It's easy enough to
brute-force escape all the HTML tags, quotes, etc. but then the data won't
display as HTML when I want to display it on a web page.

The more I thought about it, I think the only character that will really
cause issues with the JSON (other than high-ASCII MS Word junk, which I'm
stripping out/converting) is ". Here's an example:

myJSONString = '{"foo":"bar", "baz":"<a href="http://foo.com">Here's a
hyperlink</a>"}';

So if I don't do any sort of escaping the first " in the href will cause the
JSON to be invalid. But whereas I can change " to &quot; in regular text
(e.g. "Here's a quote" can become &quot;Here's a quote&quot; and still be
fine), that won't really work with hyperlinks since <a href=&quot;
http://foo.com&quot;> won't work properly I don't believe, or at any rate
isn't all that desirable.

If I escape the " with \ (using the example above, this becomes <a href=\"
http://foo.com\">) that lets me get the document into Couch successfully,
but I have a feeling that isn't going to work if I just want to display that
field as HTML on a web page (though I have yet to experiment with that to
see what happens when it's rendered).

Anyway, I've been going around in circles about how best to handle this and
since I'm new to Couch, I figured I'd ask to see what the conventional
wisdom/"best practice" is for something like this. I'm happy to go a
different route completely (write JSON to a file first and load using curl,
whatever works ...) if my current approach isn't the best way of going about
this.

Thanks!
--
Matthew Woodward
matt-***@public.gmane.org
http://blog.mattwoodward.com
identi.ca / Twitter: @mpwoodward

Please do not send me proprietary file formats such as Word, PowerPoint,
etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html
Mike Miller
2010-11-17 01:07:50 UTC
Permalink
Can you just save the HTML as an attachment? -Mike
Post by Matthew Woodward
I have a SQL Server database I'm converting to CouchDB. I wrote a script
that pulls data form SQL Server, denormalizes into the JSON
structure I want
to have in CouchDB, and then makes an HTTP call to insert each record.
The issue I'm running into is some of the fields in SQL Server have HTML in
them, and I'm not sure the best way to handle this. It's easy enough to
brute-force escape all the HTML tags, quotes, etc. but then the data won't
display as HTML when I want to display it on a web page.
The more I thought about it, I think the only character that will really
cause issues with the JSON (other than high-ASCII MS Word junk, which I'm
myJSONString = '{"foo":"bar", "baz":"<a href="http://foo.com">Here's a
hyperlink</a>"}';
So if I don't do any sort of escaping the first " in the href will cause the
JSON to be invalid. But whereas I can change " to &quot; in regular text
(e.g. "Here's a quote" can become &quot;Here's a quote&quot; and still be
fine), that won't really work with hyperlinks since <a href=&quot;
http://foo.com&quot;> won't work properly I don't believe, or at any rate
isn't all that desirable.
If I escape the " with \ (using the example above, this becomes <a href=\"
http://foo.com\">) that lets me get the document into Couch
successfully,
but I have a feeling that isn't going to work if I just want to display that
field as HTML on a web page (though I have yet to experiment with that to
see what happens when it's rendered).
Anyway, I've been going around in circles about how best to handle this and
since I'm new to Couch, I figured I'd ask to see what the conventional
wisdom/"best practice" is for something like this. I'm happy to go a
different route completely (write JSON to a file first and load using curl,
whatever works ...) if my current approach isn't the best way of going about
this.
Thanks!
--
Matthew Woodward
http://blog.mattwoodward.com
Please do not send me proprietary file formats such as Word,
PowerPoint,
etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html
Matthew Woodward
2010-11-17 01:27:46 UTC
Permalink
Post by Mike Miller
Can you just save the HTML as an attachment? -Mike
Well ... I thought about that too. Here's the issue.

Basically the database is of an event with some high-level details (name of
event, date/time, location, etc.), and an array of attendees. The attendee
JSON object has name, title, etc. and then an attendee statement, which is
where the HTML comes in since these statements can be either hyperlinks or
full-blown HTML documents right in the database.

Right now I have the event as my document in my CouchDB database, and then I
have an array of attendees. So if I have 10 attendees each with a statement,
then if I do attachments I'd have to figure out how to map the statement
back to the attendee within my attendee array. So I considered doing
attachments but couldn't figure out how to handle knowing which attachment
went with which attendee.
--
Matthew Woodward
matt-***@public.gmane.org
http://blog.mattwoodward.com
identi.ca / Twitter: @mpwoodward

Please do not send me proprietary file formats such as Word, PowerPoint,
etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html
Jonathan Stott
2010-11-17 01:35:56 UTC
Permalink
I suggest not doing JSON (de)serialization on your own.

Make your script do all the denormalization, and convert the result to
a hash/dictionary. Almost all languages have a JSON encoder that will
do the correct escaping for you. POST it to couch DB. On the way out,
JSON decode the result, which will unescape things appropriately, and
give you back a dictionary/hash.

Regards
Jon
Post by Matthew Woodward
Can you just save the HTML as an attachment?  -Mike
Well ... I thought about that too. Here's the issue.
Basically the database is of an event with some high-level details (name of
event, date/time, location, etc.), and an array of attendees. The attendee
JSON object has name, title, etc. and then an attendee statement, which is
where the HTML comes in since these statements can be either hyperlinks or
full-blown HTML documents right in the database.
Right now I have the event as my document in my CouchDB database, and then I
have an array of attendees. So if I have 10 attendees each with a statement,
then if I do attachments I'd have to figure out how to map the statement
back to the attendee within my attendee array. So I considered doing
attachments but couldn't figure out how to handle knowing which attachment
went with which attendee.
--
Matthew Woodward
http://blog.mattwoodward.com
Please do not send me proprietary file formats such as Word, PowerPoint,
etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html
Matthew Woodward
2010-11-17 02:03:16 UTC
Permalink
Post by Jonathan Stott
I suggest not doing JSON (de)serialization on your own.
Make your script do all the denormalization, and convert the result to
a hash/dictionary. Almost all languages have a JSON encoder that will
do the correct escaping for you. POST it to couch DB. On the way out,
JSON decode the result, which will unescape things appropriately, and
give you back a dictionary/hash.
Thanks! I'll give that approach a shot and see where I get.
--
Matthew Woodward
matt-***@public.gmane.org
http://blog.mattwoodward.com
identi.ca / Twitter: @mpwoodward

Please do not send me proprietary file formats such as Word, PowerPoint,
etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html
Matthew Woodward
2010-11-17 04:07:41 UTC
Permalink
Post by Jonathan Stott
I suggest not doing JSON (de)serialization on your own.
Well, huge duh on my part. ;-) Not sure why I was making it so much harder
than it needed to be. I was already taking the JSON and deserializing into
native objects, so I'm not sure why it didn't dawn on my to go the other
direction for my inserts.

Thanks for the shove in the right direction!

Matt
--
Matthew Woodward
matt-***@public.gmane.org
http://blog.mattwoodward.com
identi.ca / Twitter: @mpwoodward

Please do not send me proprietary file formats such as Word, PowerPoint,
etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html
Rafał Pocztarski
2010-11-17 04:05:25 UTC
Permalink
Post by Matthew Woodward
myJSONString = '{"foo":"bar", "baz":"<a href="http://foo.com">Here's a
hyperlink</a>"}';
So if I don't do any sort of escaping the first " in the href will cause the
JSON to be invalid. But whereas I can change " to &quot; in regular text
(e.g. "Here's a quote" can become &quot;Here's a quote&quot; and still be
fine), that won't really work with hyperlinks since <a href=&quot;
http://foo.com&quot;> won't work properly I don't believe, or at any rate
isn't all that desirable.
If I escape the " with \ (using the example above, this becomes <a href=\"
http://foo.com\">) that lets me get the document into Couch successfully,
but I have a feeling that isn't going to work if I just want to display that
field as HTML on a web page (though I have yet to experiment with that to
see what happens when it's rendered).
I'll just add to what others have already said that no matter what,
why and how you escape, whether manually or automatically, what should
happen is to always escape once what gets in and then always unescape
once what gets out. (I say "once" because it's easy to escape
something twice and unescape once or vice versa and have a problem.)

I say it because you worry that escaped strings might not work in HTML
but it would be true only if you forget to unescape them when you get
them out and you have to do it anyway when you decode JSON.

The correct way to store any text in JSON is to always escape the
double quote, backslash and control characters (" as \", \ as \\ etc.)
and the correct way to decode strings in JSON is to unescape
everything that starts with backslash - see http://json.org/

It's easy to get it wrong so that's why I second Jonathan's advice to
use a ready JSON encoder/decoder library and not worry about it
yourself.

By the way, I'm new on this list so hello to everyone.

Best regards,
Rafał Pocztarski.
Matthew Woodward
2010-11-17 04:08:52 UTC
Permalink
Post by Rafał Pocztarski
It's easy to get it wrong so that's why I second Jonathan's advice to
use a ready JSON encoder/decoder library and not worry about it
yourself.
By the way, I'm new on this list so hello to everyone.
Thanks Rafał (and hello!)--this approach is working like a charm.
--
Matthew Woodward
matt-***@public.gmane.org
http://blog.mattwoodward.com
identi.ca / Twitter: @mpwoodward

Please do not send me proprietary file formats such as Word, PowerPoint,
etc. as attachments.
http://www.gnu.org/philosophy/no-word-attachments.html
Loading...