prompt_author: Will Weaver, Kendall Fitzgerald
prompt_author_institution: University of Michigan, Field Museum of Natural History
prompt_name: FMNH_mammals_test6
prompt_version: v-6
prompt_description: Prompt developed by the University of Michigan. Adapted from SLTPvM.
  SLTPvB prompts all have standardized column headers (fields) that were chosen due
  to their reliability and prevalence in herbarium records. All field descriptions
  are based on the official Darwin Core guidelines. SLTPvB_long - The most verbose
  prompt option. Descriptions closely follow DwC guides. Detailed rules for the LLM
  to follow. Works best with double or triple OCR to increase attention back to the
  OCR (select 'use both OCR models' or 'handwritten + printed' along with trOCR).
  SLTPvB_medium - Shorter version of _long. SLTPvB_short - The least verbose possible
  prompt while still providing rules and DwC descriptions.
LLM: General Purpose
instructions: 1. Refactor the unstructured OCR text into a dictionary based on the
  JSON structure outlined below. 2. Map the unstructured OCR text to the appropriate
  JSON key and populate the field given the user-defined rules. 3. JSON key values
  are permitted to remain empty strings if the corresponding information is not found
  in the unstructured OCR text. 4. Duplicate dictionary fields are not allowed. 5.
  Ensure all JSON keys are in camel case. 6. Ensure new JSON field values follow sentence
  case capitalization. 7. Ensure all key-value pairs in the JSON dictionary strictly
  adhere to the format and data types specified in the template. 8. Ensure output
  JSON string is valid JSON format. It should not have trailing commas or unquoted
  keys. 9. Only return a JSON dictionary represented as a string. You should not explain
  your answer.
json_formatting_instructions: This section provides rules for formatting each JSON
  value organized by the JSON key.
rules:
  catalogNumber: Barcode identifier, typically a number with at least 6 digits, but
    fewer than 30 digits.
  scientificName: The scientific name of the taxon including genus, specific epithet,
    and any lower classifications. Occasionally, the genus or specific epithet will
    be crossed out with pen or pencil and the correct genus or specific epithet name will
    be written above it. In this case, use the text written above the crossed-out
    text.
  genus: Taxonomic determination to genus. Genus must be capitalized. If genus is
    not present use the taxonomic family name followed by the word 'indet'. Occasionally,
    the genus name will be crossed out with pen or pencil and the correct genus name
    will be written above it. In this case, use the name written above the crossed
    out name.
  specificEpithet: The name of the species epithet of the scientificName. Only include
    the species epithet. Occasionally, the specific epithet name will be crossed out
    with pen or pencil and the correct specific epithet name will be written above
    it. In this case, use the name written above the crossed out name.
  speciesNameAuthorship: The authorship information for the scientificName formatted
    according to the conventions of the applicable Darwin Core nomenclatural code.
  collectedBy: A comma separated list of names of people, groups, or organizations
    responsible for observing, recording, collecting, or presenting the original specimen.
    The primary collector or observer should be listed first.
  collectorNumber: An identifier given to the occurrence at the time it was recorded,
    the specimen collector's number. It is often written vertically on the edge of
    the paper tag, with a line separating it from other information. It is often written
    in the y-axis orientation while the rest of the numbers, data and text are written
    in the x-axis orientation. It is sometimes written next to the sex symbol or next
    to the collector name or initials.
  identifiedBy: A comma separated list of names of people, groups, or organizations
    who assigned the taxon to the subject organism. This is not the specimen collector.
  verbatimCollectionDate: The verbatim original representation of the date and time
    information for when the specimen was collected. Date of collection exactly as
    it appears on the label. Do not change the format or correct typos.
  collectionDate: Date the specimen was collected formatted as year-month-day, YYYY-MM-DD.
    If specific components of the date are unknown, they should be replaced with zeros.
    Use 0000-00-00 if the entire date is unknown, YYYY-00-00 if only the year is known,
    and YYYY-MM-00 if year and month are known but day is not.
  collectionDateEnd: If a range of collection dates is provided, this is the later
    end date while collectionDate is the beginning date. Use the same formatting as
    for collectionDate.
  occurrenceRemarks: Verbatim text describing the specimen's geographic location. Text
    describing the appearance of the specimen. A statement about the presence or absence
    of a taxon at the collection location. Text describing the significance of the
    specimen, such as a specific expedition or notable collection. Description of
    mammal features such as size, color, wellbeing, molting pattern, smell and any
    other distinguishing morphological or physiological characteristics.
  habitat: Verbatim category or description of the habitat in which the specimen collection
    event occurred.
  country: The name of the country or major administrative unit in which the specimen
    was originally collected.
  stateProvince: The name of the next smaller administrative region than country (state,
    province, canton, department, region, etc.) in which the specimen was originally
    collected.
  county: The full, unabbreviated name of the next smaller administrative region than
    stateProvince (county, shire, department, parish etc.) in which the specimen was
    originally collected.
  locality: Description of geographic location, landscape, landmarks, regional features,
    nearby places, municipality, city, or any contextual information aiding in pinpointing
    the exact origin or location of the specimen.
  verbatimCoordinates: Verbatim location coordinates as they appear on the label.
    Do not convert formats. Possible coordinate types include [Lat, Long, UTM, TRS].
  decimalLatitude: Latitude decimal coordinate. Correct and convert the verbatim location
    coordinates to conform with the decimal degrees GPS coordinate format.
  decimalLongitude: Longitude decimal coordinate. Correct and convert the verbatim
    location coordinates to conform with the decimal degrees GPS coordinate format.
  elevationUnits: Use m if the final elevation is reported in meters. Use ft if the
    final elevation is in feet. Units should match elevation.
  measurementsTL: The total length of the animal from snout to tip of the tail. This
    is usually a 3 digit number. It is the first number in a string of 3 or 4 measurement
    numbers that are usually separated by dashes, commas or spaces or are sometimes
    written vertically in the same order. This total length measurement will be the
    largest number in the series of 3 or 4 measurement numbers.
  measurementsTV: The length of the tail vertebrae of the animal from the first tail
    vertebra to the last tail vertebra. This is usually a number with 1 to 3 digits.
    It is the second number in a string of 3 or 4 measurement
    numbers that are usually separated by dashes, commas or spaces or are sometimes
    written vertically in the same order.
  measurementsHF: The length of the hindfoot of the animal with claw (H.F. cu) from
    the ankle to the tip of the longest claw. This is usually a number with 2 or 3
    digits. It is the third number in a string of 3 or 4
    measurement numbers that are usually separated by dashes, commas or spaces or
    are sometimes written vertically in the same order.
  measurementsEAR: The length of the ear of the animal. This is usually a 1 to 3 digit
    number. It is usually the fourth number in a string of 3 or 4 measurement numbers
    that are usually separated by dashes, commas or spaces or are sometimes written
    vertically in the same order.
  measurementsWEIGHT: The weight of the animal. This is usually a 1 to 3 digit number.
    It is sometimes preceded by an equal sign and/or followed by the letter g, which
    stands for the unit of grams. It is sometimes followed or preceded by the letters
    lbs for the unit of pounds.
  catalogNumberFMNH: Barcode identifier, typically a number with at least 3 digits,
    but fewer than 8 digits. It is typically preceded by or near the words Field Museum,
    FM, FMNH, or CNMH.
  collectionMethod: Mammals are sometimes intentionally caught by collectors, brought
    to collectors as roadkill or brought to collectors after being killed as pests.
    Text description may include description of how the animal was killed, for example
    as roadkill or in a trap or by a hunter. Record that information verbatim here.
  measurementsTLunits: Use mm if the Total Length is recorded in millimeters. Use
    in if the Total Length is recorded in inches. Units should match measurementsTVunits
    and measurementsHFunits and measurementsEARunits.
  measurementsTVunits: Use mm if the Tail Length is recorded in millimeters. Use in
    if the Tail Length is recorded in inches. Units should match measurementsTLunits
    and measurementsHFunits and measurementsEARunits.
  measurementsHFunits: Use mm if the hindfoot length is recorded in millimeters. Use
    in if the hindfoot length is recorded in inches. Units should match measurementsTVunits
    and measurementsTLunits and measurementsEARunits.
  measurementsEARunits: Use mm if the ear length is recorded in millimeters. Use in
    if the ear length is recorded in inches. Units should match measurementsTVunits
    and measurementsTLunits and measurementsHFunits.
  measurementsWEIGHTunits: Use g if the weight is recorded in grams. Use lbs
    if the weight is recorded in pounds.
  elevation: Elevation or altitude in meters or feet.
mapping:
  TAXONOMY:
  - catalogNumber
  - scientificName
  - genus
  - specificEpithet
  - speciesNameAuthorship
  - collectedBy
  - collectorNumber
  - identifiedBy
  - catalogNumberFMNH
  GEOGRAPHY:
  - country
  - stateProvince
  - county
  - locality
  - verbatimCoordinates
  - decimalLatitude
  - decimalLongitude
  - elevationUnits
  - elevation
  COLLECTING:
  - verbatimCollectionDate
  - collectionDate
  - collectionDateEnd
  - habitat
  - occurrenceRemarks
  - collectionMethod
  LOCALITY: []
  MISC:
  - measurementsTL
  - measurementsTV
  - measurementsEAR
  - measurementsHF
  - measurementsWEIGHT
  - measurementsTLunits
  - measurementsTVunits
  - measurementsHFunits
  - measurementsEARunits
  - measurementsWEIGHTunits