Match based on multiple values within one field.

Hoenen · Sep 23, 2014

For my study on academic research I need to match patents that refer to academic research as prior work with the actual prior work.

I have two tables (see attached images below).

One regarding AcademicPublications (AP), which is neatly organized with title, year, journal, volume, pages, first author, etc... 480,000 rows

One regarding Patentswhere all this information is hidden within one field, in the most messy way possible... for instance, a field could have:

Sugita et al, "Nonsurgical Implantation of a Vascular Ring Prosthesis Using Thermal Shape Memory Ti/Ni Alloy (Nitionl Wire)," Trans. Amer. Soc. Artif. Intern. Organs, vol. 23, pp. 30-34.

or

Willingham et al., Cell 13, 501-507 (1978).

Or many other ways.

I want to create a new table that is set up like this:

Patents.PatentNumber | AP.ID | Patents.Reference | AP.Title | AP.Year | AP.Volume | AP.PageStart

The question is: How do I match different fields from one table on one field of another and make it return another field (the ID)? Some references are too horrible to match, but I need as many as I can get.

I can imagine two queries would give me the bulk:

A match in [Title] AND [Year]
A match on ([SourceTitle] OR [AbbreviatedSourceTitle] ) AND [Volume] AND [Year] AND [PageStart]

I understand that I have to make use of the Like "*"&[value]&"*", but how do I make it return the matching ID?

JHB · Sep 23, 2014

Could you show us some result you want to get, (+ from where you get it), it should be from the table/data you have shown in the pictures.

Hoenen · Sep 23, 2014

JHB said:
Could you show us some result you want to get, (+ from where you get it), it should be from the table/data you have shown in the pictures.

I've manually looked up some matches, in attachment 1 you can find what I would see as ideal outcome table. In attachment 2 i explain it further:

The columns with a red frame are from my AcademicPublications table, the columns with a blue frame are from my Patents table.

You can see that these examples all match in Author, title, year.

But you must imagine that the Patents.Reference column is the most messy thing! Half of the fields do not include a title (for instance: Petty et al., Biophys. J. 47, 837-840 (1985). which is Author(Petty), AbbreviatedJournal(Biophys. J.), Volume (47), PageStart (837), Year(1985)). Many have author missing. I guess the fields which are most prevalent are Year, Journal/AbbreviatedJournal, Volume(which can be just a single digit) and PageStart.

So in the end I will have to play around with several queries and different sets of booleans. I need some direction to how I can start.

spikepl · Sep 23, 2014

One method would be to methodically (!) progress through the info available in your messy basket and split it out into your desired columns, in the same record. All depends on how many of those things you have. A human eye can quickly detemrine if youi made a mess out of it or not - faster than the machien, provided a step-by-step method is used.

SO extract the stuff preceding a comma and containing "et al." into an Authors column, and so on

Hoenen · Sep 23, 2014

I have been able to extract Author and title (split left and mid on " (chr34)). Both are rather messy but I'd guess that 90% is correct. The issue however is that too many observations do not have a title and FirstAuthor on itself is not good enough if its not missing.

The most unique and occuring set of values would be (Abbreviated)Journal, Year, Volume, PageStart. There can be only 1 article with this mix.

If we take the following observation :

Smith, A. et al., J. Am. Chem. Soc, 101:141-165 (1979).

[AbbreviatedJournal] = J. Am. Chem.
[Volume] = 101
[PageStart] = 141
[Year] = 1979

When volume or pagestart are in the single and double digits it will create problems (Like "*10*").... Maybe I should take your advice and check how many proper values I get when I split after "vol. " and before "-" (to get PageStart which is often given as 123-456). EDIT: from a sample of 300k rows, 72k have the Volume information in the following sequence "Vol. [volume]," which I've been able to extract into a new column.

Another way is to accept that these errors will occur and spend a few days manually checking every produced match and eliminating the ones that are not right.

JHB · Sep 23, 2014

spikepl had a good point about the human eye!
How many records do you have?
Am I correct, AcademicPublications (AP) and Patents are 2 different tables with complete different information, nothing is common?
The title seems easy to get it is between 2 ("") - "The title is here", and also the start page seems easy to find it is just after the 2 (pp.).
I would have 2 fields with the references, I showing the whole reference and one in which a found information is taken out, which at the very best will be empty and if not, you'll be able to see what miss in your query/queries. Then it could be some "hand" work, or a way to improve the query.
I would not use queries but some VBA code, but if you are not familiar with code, then ...!

Hoenen · Sep 23, 2014

JHB said:
spikepl had a good point about the human eye!
How many records do you have?
Am I correct, AcademicPublications (AP) and Patents are 2 different tables with complete different information, nothing is common?
The title seems easy to get it is between 2 ("") - "The title is here", and also the start page seems easy to find it is just after the 2 (pp.).
I would have 2 fields with the references, I showing the whole reference and one in which a found information is taken out, which at the very best will be empty and if not, you'll be able to see what miss in your query/queries. Then it could be some "hand" work, or a way to improve the query.

Thanks for your reply.

AcademicPublications is around 480k and Patents will be 1000k. There is no common field sadly.

I liked the idea of extracting the information into new columns first, as much as possible and started on it.

Overal, I'd say there's 3 dominant ways of writing the reference

1

Mandenius et al., Anal. Biochem., 137:106-114 (1984). 42
LaCelle, Blood Cells, 12:179-189 (1986).
Kricka et al., Clin Chem., 26:741-744 (1980).
Stange et al., Biomaterials, 9:3-6 (1988).
Shoji, et al., Sensors and Actuators, 15:101-107 (1988).
Sato, et al., Sensors and Actuators, A21-A23:948-951 (1990). 47
Sankolli et al., J. Imun. Methods, 104:191-194 (1987).
Rosenberg et al., Clin. Chem., 31:1444-1448 (1985).
Rosenberg et al., Clin. Chem., 30:1462-1466 (1984).
Parce et al., Science, 24:243-247 (1988).
Nakamura et al., Anal. Chem., 63:268-272 (1991).

Which is loosely [Author] et al., [AbbreviatedJournal], [Volume],[PageStart],.....([Year])

2
Reference
" Josephs et al., ""Transforming Potential of Human C-sis Nucleotide Sequences Encoding Platelet Derived Growth Factor,"" Science 225:636-639, 1984. 13 "
" Robbins et al., ""Structural and Immunological Similarities Between Simian Sarcoma Virus Gene Product(s) and Human Platelet-Derived Growth Factor,"" Nature 305:605-608, 1983. 24 "
" Owen et al., ""Simian Sarcoma Virus-Transformed Cells Secrete a Mitogen Identical to Platelet-Derived Growth Factor"", Science 225:54, 1984. "
" Johnsson et al., ""The C-Sis Gene Encodes a Precursor of the B Chain of Platelet-Derived Growth Factor,"" EMBO J. 3:921-928, 1984. 26 "
" Deuel et al., ""Human Platelet-Derived Growth Factor,"" J. Biol. Chem. 256:8896-8899, 1981. 15 "

[Author] et al., "[Title]", (sometimes the comma is before the ") [Journal] [PageStart]...., [year].

3
Reference
Pollefeys, M. and L. V. Cool, From images to 3D models, Communications of the ACM, pp. 50-55, vol. 45, No. 7.
Nistér, D., and H. Stewénius, Scalable recognition with a vocabulary tree, IEEE Conf. on Computer Vision and Pattern Recognition, Jun. 2006, pp. 2161-2168, vol. 2.
Mikolajczyk, K., and C. Schmid, A performance evaluation of local descriptors, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2003, pp. 257-263, Madison, Wisconsin. 11
Mikolajczyk, K., C. Schmid, Indexing based on scale invariant interest points, Proc. 8th Int'l. Conf. Comput. Vis., 2001, pp. 525-531, Vancouver, Canada. 5
Lowe, D., Distinctive image features from scale invariant keypoints, Int. J. of Computer Vision, 2004, pp. 91-110, vol. 60, No. 2. 178
Laptev, I., and T. Lindeberg, Interest point detection and scale selection in space-time, Lecture Notes in Computer Science, 2003, vol. 2695, pp. 372-387, L.D. Griffin and M Lillholm, ed., Scale-Space03, Springer Verlag, Berlin, pp. 372-387.

[Author], [Title], Journal], Vol. [Volume], [Year], pp. [PageStart]

I'll work on splitting them into 3 different tables based on reference style and then try to split them up in columns more neatly...

I'll be away from my computer for a few hours now so I will not be able to respond quickly, I apologize.

spikepl · Sep 23, 2014

Now for a complete exercise in masochism, there is a thing called RegEx -regular expressions. Using those those you could match every pattern that you enoucnter one at a time. But, there is a learning curve.

If you only have 1000 records, I'd suggest to proceed as discussed - do the best you can trying to extract values for one column at a time, and then inspect the results.

Besides, do not do everything at once - break it down. You can do quadruple digits first, then triple digits, then double digits, then single digits (in this sequence - the inverse sequence will get you into deep doo doo -think why!?)

JHB · Sep 23, 2014

I think it was not only 1000 records but 1000k records.

Hoenen · Sep 24, 2014

spikepl said:
Now for a complete exercise in masochism, there is a thing called RegEx -regular expressions. Using those those you could match every pattern that you enoucnter one at a time. But, there is a learning curve.

If you only have 1000 records, I'd suggest to proceed as discussed - do the best you can trying to extract values for one column at a time, and then inspect the results.

Besides, do not do everything at once - break it down. You can do quadruple digits first, then triple digits, then double digits, then single digits (in this sequence - the inverse sequence will get you into deep doo doo -think why!?)

Like JHB says, 1000k

I dont need to match them all, but I need a good sample size of matches (near 50k)with as little selection bias as possible.

I will continue the struggle, but I have a question..

How would a query look that selects t1.ID if t1.x, t1.y and t1.z match t2.x, t2.y and t2.z ? None of these tables being unique. Or is easier to then combine the xyz tables into 1 (pretty unique) string?

spikepl · Sep 24, 2014

1000k? oops!

Do not worry about the query - sort out your data first. Since your data is not uniform, you won't be able to do everything in one shot (or one query) anyways, but you'll have to process the records you can in one round, mark the ones processed , and then start on a new round.

With 1000K records I'd look into RegEx to identify each data pattern uniquely, since visual inspection is tough when so many records are involved. Google vba regex and judge for yourself.

Match based on multiple values within one field.

Hoenen

New member

Attachments

JHB

Have been here a while

Hoenen

New member

Attachments

spikepl

Eledittingent Beliped

Hoenen

New member

JHB

Have been here a while

Hoenen

New member

spikepl

Eledittingent Beliped

JHB

Have been here a while

Hoenen

New member

spikepl

Eledittingent Beliped

Similar threads

Users who are viewing this thread