Match based on multiple values within one field.

Hoenen

New member
Local time
Today, 10:21
Joined
Aug 8, 2014
Messages
7
For my study on academic research I need to match patents that refer to academic research as prior work with the actual prior work.

I have two tables (see attached images below).

One regarding AcademicPublications (AP), which is neatly organized with title, year, journal, volume, pages, first author, etc... 480,000 rows

One regarding Patentswhere all this information is hidden within one field, in the most messy way possible... for instance, a field could have:

Sugita et al, "Nonsurgical Implantation of a Vascular Ring Prosthesis Using Thermal Shape Memory Ti/Ni Alloy (Nitionl Wire)," Trans. Amer. Soc. Artif. Intern. Organs, vol. 23, pp. 30-34.

or

Willingham et al., Cell 13, 501-507 (1978).

Or many other ways.

I want to create a new table that is set up like this:

Patents.PatentNumber | AP.ID | Patents.Reference | AP.Title | AP.Year | AP.Volume | AP.PageStart

The question is: How do I match different fields from one table on one field of another and make it return another field (the ID)? Some references are too horrible to match, but I need as many as I can get.

I can imagine two queries would give me the bulk:

A match in [Title] AND [Year]
A match on ([SourceTitle] OR [AbbreviatedSourceTitle] ) AND [Volume] AND [Year] AND [PageStart]

I understand that I have to make use of the Like "*"&[value]&"*", but how do I make it return the matching ID?
 

Attachments

  • AcademicResearch.jpg
    AcademicResearch.jpg
    107.3 KB · Views: 121
  • patent.jpg
    patent.jpg
    104.4 KB · Views: 121
Last edited:
Could you show us some result you want to get, (+ from where you get it), it should be from the table/data you have shown in the pictures.
 
Could you show us some result you want to get, (+ from where you get it), it should be from the table/data you have shown in the pictures.

I've manually looked up some matches, in attachment 1 you can find what I would see as ideal outcome table. In attachment 2 i explain it further:

The columns with a red frame are from my AcademicPublications table, the columns with a blue frame are from my Patents table.

You can see that these examples all match in Author, title, year.

But you must imagine that the Patents.Reference column is the most messy thing! Half of the fields do not include a title (for instance: Petty et al., Biophys. J. 47, 837-840 (1985). which is Author(Petty), AbbreviatedJournal(Biophys. J.), Volume (47), PageStart (837), Year(1985)). Many have author missing. I guess the fields which are most prevalent are Year, Journal/AbbreviatedJournal, Volume(which can be just a single digit) and PageStart.

So in the end I will have to play around with several queries and different sets of booleans. I need some direction to how I can start.
 

Attachments

  • end result1.png
    end result1.png
    31.1 KB · Views: 101
  • end result2.png
    end result2.png
    31.9 KB · Views: 97
One method would be to methodically (!) progress through the info available in your messy basket and split it out into your desired columns, in the same record. All depends on how many of those things you have. A human eye can quickly detemrine if youi made a mess out of it or not - faster than the machien, provided a step-by-step method is used.

SO extract the stuff preceding a comma and containing "et al." into an Authors column, and so on
 
I have been able to extract Author and title (split left and mid on " (chr34)). Both are rather messy but I'd guess that 90% is correct. The issue however is that too many observations do not have a title and FirstAuthor on itself is not good enough if its not missing.

The most unique and occuring set of values would be (Abbreviated)Journal, Year, Volume, PageStart. There can be only 1 article with this mix.

If we take the following observation :

Smith, A. et al., J. Am. Chem. Soc, 101:141-165 (1979).

[AbbreviatedJournal] = J. Am. Chem.
[Volume] = 101
[PageStart] = 141
[Year] = 1979

When volume or pagestart are in the single and double digits it will create problems (Like "*10*").... Maybe I should take your advice and check how many proper values I get when I split after "vol. " and before "-" (to get PageStart which is often given as 123-456). EDIT: from a sample of 300k rows, 72k have the Volume information in the following sequence "Vol. [volume]," which I've been able to extract into a new column.

Another way is to accept that these errors will occur and spend a few days manually checking every produced match and eliminating the ones that are not right.
 
Last edited:
spikepl had a good point about the human eye!
How many records do you have?
Am I correct, AcademicPublications (AP) and Patents are 2 different tables with complete different information, nothing is common?
The title seems easy to get it is between 2 ("") - "The title is here", and also the start page seems easy to find it is just after the 2 (pp.).
I would have 2 fields with the references, I showing the whole reference and one in which a found information is taken out, which at the very best will be empty and if not, you'll be able to see what miss in your query/queries. Then it could be some "hand" work, or a way to improve the query.
I would not use queries but some VBA code, but if you are not familiar with code, then ...!
 
spikepl had a good point about the human eye!
How many records do you have?
Am I correct, AcademicPublications (AP) and Patents are 2 different tables with complete different information, nothing is common?
The title seems easy to get it is between 2 ("") - "The title is here", and also the start page seems easy to find it is just after the 2 (pp.).
I would have 2 fields with the references, I showing the whole reference and one in which a found information is taken out, which at the very best will be empty and if not, you'll be able to see what miss in your query/queries. Then it could be some "hand" work, or a way to improve the query.

Thanks for your reply.

AcademicPublications is around 480k and Patents will be 1000k. There is no common field sadly.

I liked the idea of extracting the information into new columns first, as much as possible and started on it.

Overal, I'd say there's 3 dominant ways of writing the reference

1

Mandenius et al., Anal. Biochem., 137:106-114 (1984). 42
LaCelle, Blood Cells, 12:179-189 (1986).
Kricka et al., Clin Chem., 26:741-744 (1980).
Stange et al., Biomaterials, 9:3-6 (1988).
Shoji, et al., Sensors and Actuators, 15:101-107 (1988).
Sato, et al., Sensors and Actuators, A21-A23:948-951 (1990). 47
Sankolli et al., J. Imun. Methods, 104:191-194 (1987).
Rosenberg et al., Clin. Chem., 31:1444-1448 (1985).
Rosenberg et al., Clin. Chem., 30:1462-1466 (1984).
Parce et al., Science, 24:243-247 (1988).
Nakamura et al., Anal. Chem., 63:268-272 (1991).

Which is loosely [Author] et al., [AbbreviatedJournal], [Volume],[PageStart],.....([Year])

2
Reference
" Josephs et al., ""Transforming Potential of Human C-sis Nucleotide Sequences Encoding Platelet Derived Growth Factor,"" Science 225:636-639, 1984. 13 "
" Robbins et al., ""Structural and Immunological Similarities Between Simian Sarcoma Virus Gene Product(s) and Human Platelet-Derived Growth Factor,"" Nature 305:605-608, 1983. 24 "
" Owen et al., ""Simian Sarcoma Virus-Transformed Cells Secrete a Mitogen Identical to Platelet-Derived Growth Factor"", Science 225:54, 1984. "
" Johnsson et al., ""The C-Sis Gene Encodes a Precursor of the B Chain of Platelet-Derived Growth Factor,"" EMBO J. 3:921-928, 1984. 26 "
" Deuel et al., ""Human Platelet-Derived Growth Factor,"" J. Biol. Chem. 256:8896-8899, 1981. 15 "

[Author] et al., "[Title]", (sometimes the comma is before the ") [Journal] [PageStart]...., [year].


3
Reference
Pollefeys, M. and L. V. Cool, From images to 3D models, Communications of the ACM, pp. 50-55, vol. 45, No. 7.
Nistér, D., and H. Stewénius, Scalable recognition with a vocabulary tree, IEEE Conf. on Computer Vision and Pattern Recognition, Jun. 2006, pp. 2161-2168, vol. 2.
Mikolajczyk, K., and C. Schmid, A performance evaluation of local descriptors, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2003, pp. 257-263, Madison, Wisconsin. 11
Mikolajczyk, K., C. Schmid, Indexing based on scale invariant interest points, Proc. 8th Int'l. Conf. Comput. Vis., 2001, pp. 525-531, Vancouver, Canada. 5
Lowe, D., Distinctive image features from scale invariant keypoints, Int. J. of Computer Vision, 2004, pp. 91-110, vol. 60, No. 2. 178
Laptev, I., and T. Lindeberg, Interest point detection and scale selection in space-time, Lecture Notes in Computer Science, 2003, vol. 2695, pp. 372-387, L.D. Griffin and M Lillholm, ed., Scale-Space03, Springer Verlag, Berlin, pp. 372-387.

[Author], [Title], Journal], Vol. [Volume], [Year], pp. [PageStart]

I'll work on splitting them into 3 different tables based on reference style and then try to split them up in columns more neatly...

I'll be away from my computer for a few hours now so I will not be able to respond quickly, I apologize.
 
Now for a complete exercise in masochism, there is a thing called RegEx -regular expressions. Using those those you could match every pattern that you enoucnter one at a time. But, there is a learning curve.

If you only have 1000 records, I'd suggest to proceed as discussed - do the best you can trying to extract values for one column at a time, and then inspect the results.

Besides, do not do everything at once - break it down. You can do quadruple digits first, then triple digits, then double digits, then single digits (in this sequence - the inverse sequence will get you into deep doo doo -think why!?) :D
 
I think it was not only 1000 records but 1000k records. :)
 
Now for a complete exercise in masochism, there is a thing called RegEx -regular expressions. Using those those you could match every pattern that you enoucnter one at a time. But, there is a learning curve.

If you only have 1000 records, I'd suggest to proceed as discussed - do the best you can trying to extract values for one column at a time, and then inspect the results.

Besides, do not do everything at once - break it down. You can do quadruple digits first, then triple digits, then double digits, then single digits (in this sequence - the inverse sequence will get you into deep doo doo -think why!?) :D

Like JHB says, 1000k :rolleyes: I dont need to match them all, but I need a good sample size of matches (near 50k)with as little selection bias as possible.

I will continue the struggle, but I have a question..

How would a query look that selects t1.ID if t1.x, t1.y and t1.z match t2.x, t2.y and t2.z ? None of these tables being unique. Or is easier to then combine the xyz tables into 1 (pretty unique) string?
 
Last edited:
1000k? oops! :D

Do not worry about the query - sort out your data first. Since your data is not uniform, you won't be able to do everything in one shot (or one query) anyways, but you'll have to process the records you can in one round, mark the ones processed , and then start on a new round.

With 1000K records I'd look into RegEx to identify each data pattern uniquely, since visual inspection is tough when so many records are involved. Google vba regex and judge for yourself.
 

Users who are viewing this thread

Back
Top Bottom