View Full Version : Special Characters / Foreign Languages


David Eagar
12-24-2008, 04:19 PM
To anybody's knowledge, does having special characters associated with foreign languages in the data lead to potential problems with database performance?

My issue is not with having these characters in field names, purely in the data itself. I have seen lots of threads on how to remove them etc etc, but nothing that I can find on is it good/bad practice to be doing this

Thanks for any advice
Merry Xmas

Banana
12-24-2008, 04:42 PM
In a RDBMS solution, we can specify character sets and collation (sorting order for a given character set). But in Access, I believe, assumes the character sets and collation based on Windows' Language and Regional Settings. At least, I have not seen a setting that allowed me to choose a character set/collation on a per-database setting or even on application level.

That said, my best advice would be to experiment. Access defaults to unicode character set so I don't think there'll be problems with using 2-bytes sets. If you find any problem with this, try using RDBMS as a backend and see how it handles it.

I also wonder if Jet has option to control the character set programmatically. A quick look and I can see there's CollatingOrder property. The help also gives a hint:
The CollatingOrder property setting corresponds to the locale argument of the CreateDatabase method when the database was created or the CompactDatabase method when the database was most recently compacted.

Maybe it'll help you get further. I would be interested in knowing how it turns out.

David Eagar
12-24-2008, 04:49 PM
At the moment, I am migrating the data in - and it all goes in and looks fine, but at this stage I haven't got my database to do much, other than a few queries & forms to display the data.

I am still thinking of a useful purpose for the datbase, but that is another challenge.

I just had a thought that all these characters, while looking impressive, may well be slowly clutching their fingers around my throat to strike sometime in the future.

Banana
12-24-2008, 07:19 PM
What I probably would want to test is how you would get results if you were to query upon a text comparison or maybe order by and see if something unexpected comes out.

For example, in default sorting order (e.g. General), Mueller != Müeller. So if you wanted to query for all clients named Mueller, do you want to see those who are Müeller as well or not? Can you trust your users to be consistent in entering Müeller as Müeller and not Mueller?

Looking at the list of CollatingOrder, I don't see that you have much control as I would see in MySQL (and I'm sure this is case with MS-SQL; I just haven't looked at MS-SQL's lists carefully yet). You can choose between different regions but no specifications for case sensitive or case insenstive and whether Mueller should be same as Müeller.

If you feel this will be an issue, then I would want to look at using a RDBMS backend that can provide greater control over the character sets and collation and you can then query the RDBMS backend and let it do the comparisons in way you expect it to come out. In this scenario, you would only need to ensure that all Access needs is to display the results received from the RDBMS correctly and ensure that the parameter inputs for the query is exactly as what you expect it to be, and that RDBMS will read it as such.

I hope that helps some...

wazz
12-25-2008, 12:21 AM
also check out the Customer Phone List form and macro in Northwind. quite excellent.

David Eagar
12-25-2008, 10:07 AM
I think I'm going to have to come up with some creative answers for this to get off the ground.

A query search for Hüffler not surprisingly, found it
a search for Huffler did not
a like Huffler also did not

my intention was to access this data via combo boxes so it may not be a problem, however......

Banana
12-25-2008, 12:06 PM
Be sure to look at other Collations. It's possible that one of those (I'm going to go out on a limb and guess Swedish languages or German) may allow you to say that Huffler is same as Hüffler. If this fails then the best advice would be to do all querying against against a RDBMS that can provide more control over the character sets and collation being use and thus return the expected results (e.g. both Huffler and Hüffler gets returned whether you specify Huffler or Hüffler).

Best of luck.

David Eagar
01-01-2009, 09:08 PM
For those thinking of following this example, my advice is:
DON'T - it isn't worth the extra effort..

2 major issues have been
1. If the data is coming from several sources, there is a good chance that the use of the characters won't be the same and when it comes time to line up the text, it DOESN'T (and seeing it's a foreign language, you can spend endless hours fun trying to get it all lined up)

2. And more importantly, there is a good chance of corrupting the database. I had a good living case of it when importing a large slab of jibberish and the autonumber id suddenly jumped to 6358474147 and no more work was possible with that database.

The good news was that by some miracle, I had backed up that morning, so no drama. Thanks to most of the stuff picked up from this forum, I knew the steps to extract myself from the doo doo and was up & running in 15 minutes.

Banana
01-01-2009, 09:30 PM
Is this even the case with a backend that's capable of supporting character sets/collations?

David Eagar
01-01-2009, 09:51 PM
Is this even the case with a backend that's capable of supporting character sets/collations?

I'm not that advanced, but I think the real problem was that the data itself was corrupted in Excel first - it was trying to put
© and a few other wierd & wonderful characters in as a town name. It was going OK up till then, except for the time it took to tidy up problem 1.

At the end of the day, is worth all the time & effort? Where was my benefit?

Banana
01-01-2009, 10:08 PM
Quite understandable.

It was interesting problem, though, as I've wondered about what would happen if we needed to use Access in international setting?


I clean forgot about my World database so I looked into this. It has cities from all over world. For example, in MySQL, I have a city named:
Sétif

In Access it's:
SÈtif

Not exactly same.

It's not just that. Here's one of more egregious mismatch:
In MySQL:
Béjaïa
In Access:
BÈjaÔa

I guess we'd have to do something entirely different if we're going to go further on this project. Or not take up on that project at all. :D

David Eagar
01-01-2009, 10:15 PM
Quite understandable.

It was interesting problem, though, as I've wondered about what would happen if we needed to use Access in international setting?


I clean forgot about my World database so I looked into this. It has cities from all over world. For example, in MySQL, I have a city named:


In Access it's:


Not exactly same.

It's not just that. Here's one of more egregious mismatch:
In MySQL:

In Access:


I guess we'd have to do something entirely different if we're going to go further on this project. Or not take up on that project at all. :D


It seems we had the same idea! I've got no idea why I'm doing this, other than to improve my skills at handling large amounts of data efficiently (I discovered combo boxes can't handle 130,000 records) so it has still been a worthwhile exercise.

I'll keep plowing on - it is probably worthwhile if it can help it to become a 'global' application - See Site Suggestions, Group Projects: Static data like countries/states/towns/postcodes doesn't 'belong' to anybody (although you'd be surprised how many sites want to sell me this data). If memebers can slowly build up this sort of stuff, it should be free to members to make use of, rather than all individually spending days/months/years reinventing the same wheel

gemma-the-husky
01-04-2009, 10:51 AM
i suspect the reason for most text tidying is to eliminate single and double quote characters (possible commas also) as they can interfere with normal access functions - I doubt if there is too much need to eliminate other characters.

I have no experience, but possibly the slight display differences between accented characters is most likely to be connected with regional character sets?

David Eagar
01-04-2009, 11:03 AM
i suspect the reason for most text tidying is to eliminate single and double quote characters (possible commas also) as they can interfere with normal access functions - I doubt if there is too much need to eliminate other characters.

I have no experience, but possibly the slight display differences between accented characters is most likely to be connected with regional character sets?

I guess what I'm learning the most is - time versus benefit. With what I'm working on, the time needed to sort it all out will far outweigh any benefit I might have got from using the special characters

Banana
01-04-2009, 11:05 AM
Gemma, please explain what you mean by 'eliminating single and double quote characters', as I've not observed this.

I can do this for example:

SELECT "Joe's ""Good"" Cars";
SELECT 'Joe''s "Good" Cars';

Both will return:
Joe's "Good" Cars

So it's persevered. Furthermore, This doubling up isn't needed if you enter this directly in table or in a control bound to the field.

gemma-the-husky
01-05-2009, 04:12 AM
all i meant was that possibly single and double quotes and commas embedded with in strings could interfere with tasks such as producing csv files, and sql strings, and that often string manipulation to massage such characters was carried out, so that such things worked correctly - but this doesnt affect every unusual character