For Each, Replace
May 23, 2008 – 4:55 amMy greatest weakness as a programmer (other than impatience) is how little I know about relational databases. I know how it happened—I spent my twenties working in high-performance scientific computing, before very large scientific databases became common, and there was always somebody else to worry about that side of things for the projects I was on in my thirties—but it trips me up with ever-increasing frequency.
So, having told my students many times not to be ashamed of not knowing things, I figure I should set a good example and parade my own ignorance in public. The entries in this blog are stored in a PostgreSQL database as chunks of text. I’ve discovered that different entries represent characters in different ways: the long dash, for example, is usually a triple dash “—”, but is also sometimes represented as either a two-byte or three-byte sequence (depending, I presume, on what machine I was on the day I wrote the entry—I sometimes compose in a regular editor, then paste the entry into the blog). I’d like to clean up these inconsistencies, which means doing search/replace on approximately 1450 database entries. As I see it, my options are:
- Write a shell script that uses the command-line PostgreSQL client to pull entries in turn, run them through sed(1), and push the results back into the database.
- Write a Python script that connects to the database, pulls the entries, does string substitution, and pushes the results.
- Learn a little more SQL and do this with a query of some kind.
#3 is where I’m stuck. I know how to find the offending entries: “select entry from table where entry like chars”. I even think I know how to do string substitution. My question is, how do I push the results back into the table? I clearly want to replace each entry with the update of itself; is there a way to do this, or do I insert a new bunch of rows into the table, then delete the old rows? How? And why?
11 Responses to “For Each, Replace”
Hey Greg,
What about this:
update
set =
where ;
The idea is that the where condition is the same as your select statement to get the records with the chars and the substitution function is whatever bit of character replacing code you already figured out.
update is used to update records, it works like a select and replace command.
Anyway, HTH. Good luck!
Krys
By Krys Wilken on May 23, 2008
Update myEntries set text = replace(myEntries,’–’,'—’);
I believe.
Add a where clause if you’re paranoid
See http://www.postgresql.org/docs/8.1/static/functions-string.html for more details
By Flo on May 23, 2008
I think you are out tricking yourself; just go with option 2. I think it would be easier to write/debug/manage in both the long and short term. (Oh, and when you are done you can give it to me since I need to fix things on my blog — moved from mysql4 to 5 and a bunch of character encoding got messed up).
Warning, pseudocode written with a headcold coming..
d = your_database_connection
offending = [[’–’, ‘—’]]
for o in offending:
c = d.cursor()
c.execute(”select id, post from posts where data like ‘%%s%’” % o[0])
for r in c.fetchall():
r.replace(o[0], o[1])
try:
u = d.cursor()
u.execute(”update …”)
d.commit()
else:
d.rollback()
finally:
d.close()
c.close()
Or something like that.
-adam
By Adam Goucher on May 23, 2008
Use the UPDATE statement. You can use a qualification that matches several rows.
UPDATE table set col = strsubst(col) where col like chars.
By Daniel Dittmar on May 23, 2008
update table set entry = “goodchars” where entry = “badchars”
By Peter Herndon on May 23, 2008
Something like..
update table set entry = replace(entry, “A”, “B”) where entry like “%A%”;
.. should to the trick, replacing all occurrences of A with B (untested ;).
You might want to check out the PostgreSQL documentation, relevant pages include:
http://www.postgresql.org/docs/8.3/interactive/functions-string.html
http://www.postgresql.org/docs/8.3/interactive/sql-update.html
Good luck.
By Arjan on May 23, 2008
OK, so what I didn’t know was that SQL will implicitly do a for-each on substitution as well. I thought that if I wrote “update table set col=fxn(…);” that SQL would pick one of the function results and assign that values to all rows, rather than doing the match-up. Thanks, everyone.
By Greg Wilson on May 23, 2008
There is no step two. The key thing about relational databases (CJ Date-esque complaints about how SQL != “relational” aside), is that they are based on set theory, and that SQL itself is declarative. So, you just grab the set of all entries from your table that match your criteria, and basically just tell the database to set the field to whatever you want.
If your entry field is not a simple field of limited characters, if it is a free-text field, and you are wanting to change some of the characters in the field but not all, you are much better off doing this in Python. At least in my experience. Writing complex pl/pgsql is a waste of time on an application-specific database — do it in your programming language. Writing complex pl/pgsql is *necessary*, on the other hand, if your database is central, and multiple apps talk to it. At that point, you need to have your business logic embedded in the database via triggers and stored procedures, so that it applies to all connections, regardless of the app connecting to it.
But that’s likely not something you need to worry about. That sort of thing is less prevalent nowadays, when MySQL and/or PostgreSQL are available on every system. It’s more of an enterprise situation, where you have a really big Oracle or DB2 database serving as a central data repository for the entire organizaiton.
In your case, and in many cases, deep knowledge of the procedural variations of the embedded language of your database is not an efficient allocation of time. Instead, understand the relational concepts. In short, read C.J. Date. “Introduction to Database Systems” is a really good place to start. Then learn how your database of choice varies from the ideal, learn the implications of choosing one field type over another, that sort of thing. And don’t be afraid to do your data processing in Python. I find it loads easier to do things like your example, where you’re working with a large text field and replacing parts of the text, in Python instead.
By Peter Herndon on May 23, 2008
I wouldn’t feel *too* bad. I still use nano instead of vim.
By David Cooper on May 23, 2008
You can go the sql route and still do it in pure python with sqlalchemy. It’s a brilliant library and you can even wire it up to your existing database and introspect the structure. Then you can operate on your blog posts like regular old python objects — no dbapi nonsense even.
It doesn’t keep you from having to understand set theory, but it does shield you from the nastiness of sql (unless you’re in to that kind of thing).
By Dean Landolt on May 23, 2008
Since you have more of a background in math than in relational databases, you might enjoy the book “Applied Mathematics for Database Professionals” by Lex de Haan and Toon Koppelaars. It’s a mathematical view of relational databases that you might find easier to absorb than other perspectives.
By John on May 26, 2008