Anusha-TS committed on
Commit 3271057 · verified · 1 Parent(s): 35e4975

Create app.py

Files changed (1)
  1. app.py +124 -0
app.py ADDED
@@ -0,0 +1,124 @@
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ Instructions = """
+ ### Task
+ Generate a SQL query to answer [QUESTION]{question}[/QUESTION]
+
+ ### Database Schema
+ The query must run on the database whose schema is given below as JSON, where each key is a table name and each value is that table's CREATE statement:
+ {schema}
+
+ ### Metadata for Schema
+ Use the detailed table and column descriptions below to understand how the tables are linked. The JSON holds an array of tables; each table has a prompt to consider when generating the query and a detailed description of each column.
+ {Metadata}
+
+ ### Further context
+
+ - Everything is an entity: cities, counties, and districts are all entities. Each entity has an entity ID and an entity type.
+ - The entity type for a county is County, and the entity type ID is the same for all counties: 15. Only the entity ID varies between counties.
+ - Every county contains several entity types, and each entity type contains several entities.
+ - Example: Ada is a county, so Ada's entity type is County, its entity type ID is 15, and its entity ID is 518. Ada has different districts, which form entity types.
+   Fire district, water district, and abatement district are all different entity types in Ada county. The water district entity type ID is 3 and is the same across all entities of that type.
+   Each entity type has different entities, such as 'Boise Warm Springs Water District', which is an entity of entity type 'water district'.
+ - When the question is about extracting the budget of something, first identify the entities in the question, such as the county and the entity type, then match them against the corresponding entries.
+ - You can get the list of all counties by running 'select EntityName from ods.Entity where EntityTypeID=15'
+ - Use only the tables given in the schema, and use the metadata to understand the context.
+ - Do not create your own columns or tables. Use only those provided in the schema and in the metadata.
+ - Entities belonging to a particular county can be obtained by querying the ods.EntityCounties table with the appropriate countyid.
+ - If you cannot answer the question based on the information available, respond with "I don't know".
+
+ ### Answer
+ Given the database schema, here is the SQL query that answers [QUESTION]{question}[/QUESTION]
+ [SQL]
+ """
+
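+ def format_prompt(question, schema, metadata, Instructions):
+     """
+     Fills the instruction template with the question, schema, and metadata.
+     """
+     import json  # local import so this function does not depend on import order
+
+     def as_text(value):
+         # Parsed JSON is serialized back to JSON text; strings pass through
+         return value if isinstance(value, str) else json.dumps(value, indent=2)
+
+     # str.format fills the template's {question}, {schema}, and {Metadata}
+     # placeholders; interpolating the template into an f-string as a value
+     # would leave them in the prompt as literal text.
+     return Instructions.format(
+         question=question,
+         schema=as_text(schema),
+         Metadata=as_text(metadata),
+     )
+
+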
+ def load_model():
+     """
+     Loads the SQL generation model and tokenizer from Hugging Face.
+     """
+     model_name = "premai-io/prem-1B-SQL"
+     tokenizer = AutoTokenizer.from_pretrained(model_name)
+     model = AutoModelForCausalLM.from_pretrained(model_name)
+     return tokenizer, model
+
+
+ # Generate SQL from a question
+ def generate_sql(question, prompt_inputs, tokenizer, model):
+     """
+     Generates an SQL query for the question from the schema and metadata.
+
+     Parameters:
+     - question: Natural language question
+     - prompt_inputs: Dict with "schema", "metadata", and "instructions" entries
+     - tokenizer: Tokenizer instance
+     - model: Pre-trained SQL generation model
+
+     Returns:
+     - Generated SQL query as a string
+     """
+     # Format the prompt
+     prompt = format_prompt(
+         question,
+         prompt_inputs["schema"],
+         prompt_inputs["metadata"],
+         prompt_inputs["instructions"],
+     )
+
+     # Tokenize the input and move it to the model's device
+     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+
+     # Generate SQL
+     outputs = model.generate(
+         **inputs,
+         max_new_tokens=128,
+         # temperature=0.1,
+         # do_sample=False,
+     )
+
+     # Decode only the newly generated tokens; outputs[0] also contains the prompt
+     prompt_length = inputs["input_ids"].shape[1]
+     return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
+
+
+ tokenizer, model = load_model()
+
+ import json
+
+ prompt_inputs = {
+     "schema": "",
+     "metadata": "",
+     "instructions": Instructions,
+ }
+ with open('table_create.json', 'r') as file:
+     prompt_inputs["schema"] = json.load(file)
+
+ with open('tables_metadata.json', 'r') as file:
+     prompt_inputs["metadata"] = json.load(file)
+
+
+ question = "Get list of all available distinct entity types with their entity type id"
+ # Generate SQL
+ generated_sql = generate_sql(question, prompt_inputs, tokenizer, model)
+ print("\nGenerated SQL:")
+ print(generated_sql)
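
A note on the prompt template in app.py: `Instructions` carries `{question}`, `{schema}`, and `{Metadata}` placeholders, and those are only expanded when the template itself is formatted. The sketch below is one possible way to fill such a template, not necessarily what the commit intends. `TEMPLATE` is a shortened stand-in for `Instructions`, `fill_template` is a hypothetical helper, and the example schema and metadata are made up for illustration (the `EntityID` column in particular is an assumption).

```python
import json

# Hypothetical, shortened stand-in for the Instructions template in app.py.
TEMPLATE = """### Task
Generate a SQL query to answer [QUESTION]{question}[/QUESTION]

### Database Schema
{schema}

### Metadata for Schema
{Metadata}

### Answer
[SQL]
"""


def fill_template(template, question, schema, metadata):
    """Fill the placeholders, serializing non-string inputs as indented JSON."""

    def as_text(value):
        return value if isinstance(value, str) else json.dumps(value, indent=2)

    # Substituted values are not re-parsed, so braces in the JSON are safe.
    return template.format(
        question=question,
        schema=as_text(schema),
        Metadata=as_text(metadata),
    )


# Made-up schema and metadata, for illustration only.
example_schema = {
    "ods.Entity": "CREATE TABLE ods.Entity "
    "(EntityID INT, EntityName VARCHAR(100), EntityTypeID INT)"
}
example_metadata = {"tables": [{"name": "ods.Entity", "prompt": "One row per entity."}]}

prompt = fill_template(TEMPLATE, "List all counties", example_schema, example_metadata)
print(prompt)
```

By contrast, embedding the template as a value inside an f-string (`f"{TEMPLATE}"`) copies it verbatim, so the model would see the literal text `{question}` rather than the question.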