A resource descriptor whose fields have format set to None passes validate, but fails with dump_to_sql:
{'name': 'srm_services', 'path': 'res_1.csv', 'profile': 'tabular-data-resource', 'schema': {'fields': [{'name': 'id', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'source', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'name', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'payment_required', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'urls', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'description', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'details', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'status', 'type': 'string', 'format': None, 'constraints': {}}, {'name': 'payment_details', 'type': 'string', 'format': None, 'constraints': {}}], 'missingValues': ['']}}
Solution: set format to "default"
{'name': 'srm_services', 'path': 'res_1.csv', 'profile': 'tabular-data-resource', 'schema': {'fields': [{'name': 'id', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'source', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'name', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'payment_required', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'urls', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'description', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'details', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'status', 'type': 'string', 'format': 'default', 'constraints': {}}, {'name': 'payment_details', 'type': 'string', 'format': 'default', 'constraints': {}}], 'missingValues': ['']}}
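Until the root cause is fixed, a quick workaround is to normalize the descriptor before handing it to dump_to_sql. A minimal sketch in plain Python (the helper name is made up):

def normalize_field_formats(descriptor):
    # Replace format: None with the spec's default value on every field,
    # so that dump_to_sql gets a descriptor it accepts.
    for field in descriptor.get('schema', {}).get('fields', []):
        if field.get('format') is None:
            field['format'] = 'default'
    return descriptor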
Does it work with constraints.unique?
Investigate: some fields are being read as any and not string, for no clear reason. This breaks downstream in the pipeline (Elasticsearch doesn't handle any).
Investigate: hit some exceptions on rows where an array field is None. Can't really deal with this upstream in the current case (there is no type data from Airtable). Not sure this is the actual error; it just looked that way from inspecting the data, since the actual exception message was more cryptic.
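If the problem really is None values in array fields, a possible stopgap is a row processor that fills them in before anything downstream sees them. A sketch, hard-coding the two jsonb columns from the srm_services table shown further down:

ARRAY_FIELDS = ['situations', 'responses']  # the array-typed columns in this dataset

def fill_empty_arrays(row):
    # dataflows row processor: mutates the row in place, replacing None with []
    for name in ARRAY_FIELDS:
        if row.get(name) is None:
            row[name] = []

Adding fill_empty_arrays as a step right after the load should be enough.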
It would be really useful to have a flag on Flow that, on any exception in the flow, drops you into a debugging REPL, maybe with some handy variables set up to inspect the current state. This would be more convenient than on_error on DataStreamProcessor, because many processors in a flow only implement part of that interface as plain functions and don't have on_error. Plus, you set up the desired behaviour once, on the flow, for development.
Related to the above: the same or similar machinery, but inserted as a processor at any point in the flow, so not triggered by an exception, to allow simple inspection of the current state at that point in the pipeline. A rough sketch of both ideas follows.
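Neither of these exists as far as I know, but they can be approximated today. A minimal sketch, assuming the dataflows convention that a one-argument generator function acts as a rows processor (all names here are made up):

import pdb
import dataflows as DF

def inspect_rows(rows):
    # Drop into a REPL on the first row; `row` and `rows` are in scope for inspection.
    for i, row in enumerate(rows):
        if i == 0:
            pdb.set_trace()
        yield row

def repl_on_error(rows):
    # Rough stand-in for the proposed flag: any exception raised upstream while
    # producing rows drops into a post-mortem debugger before re-raising.
    try:
        for row in rows:
            yield row
    except Exception:
        pdb.post_mortem()
        raise

DF.Flow(
    DF.load('location/datapackage.json'),
    inspect_rows,    # inspect state at this point in the pipeline
    repl_on_error,
    DF.dump_to_path('other_location'),
).process()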
You have a database called, for example, srm, and want to load a table from it:
load(DUMP_DB, table="srm_services")
One would expect a resource in the package called srm_services (the table > resource mapping is conceptually correct), but the resource gets called srm, which is conceptually the name of the package (collection of tables > collection of resources, i.e. the package).
psql srm
srm=# \d+ srm_services
Table "public.srm_services"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
------------------+-------+-----------+----------+---------+----------+--------------+-------------
__airtable_id | text | | | | extended | |
id | text | | | | extended | |
source | text | | | | extended | |
name | text | | | | extended | |
payment_required | text | | | | extended | |
urls | text | | | | extended | |
description | text | | | | extended | |
details | text | | | | extended | |
situations | jsonb | | | | extended | |
status | text | | | | extended | |
payment_details | text | | | | extended | |
responses | jsonb | | | | extended | |
Access method: heap
but
Flow(
load(DUMP_DB, table="srm_services"),
)
generates the following descriptor:
{'profile': 'data-package', 'resources': [{'path': 'srm.sql', 'profile': 'tabular-data-resource', 'name': 'srm', 'schema': {'fields': [{'name': '__airtable_id', 'type': 'string', 'format': 'default'}, {'name': 'id', 'type': 'string', 'format': 'default'}, {'name': 'source', 'type': 'string', 'format': 'default'}, {'name': 'name', 'type': 'string', 'format': 'default'}, {'name': 'payment_required', 'type': 'string', 'format': 'default'}, {'name': 'urls', 'type': 'string', 'format': 'default'}, {'name': 'description', 'type': 'any', 'format': 'default'}, {'name': 'details', 'type': 'any', 'format': 'default'}, {'name': 'situations', 'type': 'any', 'format': 'default'}, {'name': 'status', 'type': 'string', 'format': 'default'}, {'name': 'payment_details', 'type': 'any', 'format': 'default'}, {'name': 'responses', 'type': 'any', 'format': 'default'}], 'missingValues': ['']}, 'format': 'sql'}]}
I guess that without inspecting the data the converter can't know what to do with a JSONB field. So what probably needs to happen is that tableschema_sql should map array columns to Postgres array fields (which would require that all items in an array are of the same type).
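In the meantime, both the resource name and the any columns can be patched up in the flow itself. A sketch using update_resource and set_type; the target types are my guesses from the psql output above, and the output path is made up:

import dataflows as DF

DF.Flow(
    DF.load(DUMP_DB, table='srm_services'),
    # rename the resource from the database name to the table name
    DF.update_resource('srm', name='srm_services', path='srm_services.csv'),
    # coerce the columns that were inferred as `any` back to explicit types
    DF.set_type('description', type='string'),
    DF.set_type('details', type='string'),
    DF.set_type('payment_details', type='string'),
    DF.set_type('situations', type='array'),
    DF.set_type('responses', type='array'),
    DF.dump_to_path('srm_dump'),
).process()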
Because, if it is not present in the schema, the default is set to [] ... but Elasticsearch doesn't have a concept of a primary key.
Here is a dumped DP:
location/
location/datapackage.json
location/responses.csv
location/situations.csv
location/services.csv
location/organizations.csv
location/branches.csv
location/locations.csv
But the following flow:
DF.Flow(
DF.load('location/datapackage.json'),
DF.dump_to_path('other_location'),
).process()
produces:
other_location/
other_location/organizations.csv
other_location/responses.csv
other_location/situations.csv
!!!!
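Until the cause is clear, one workaround is to load each resource explicitly instead of going through the datapackage. A sketch, with the resource names taken from the listing above:

import dataflows as DF

DF.Flow(
    *(DF.load(f'location/{name}.csv', name=name)
      for name in ('responses', 'situations', 'services',
                   'organizations', 'branches', 'locations')),
    DF.dump_to_path('other_location'),
).process()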
DF.map_rows
DF.dereference
join on array fields (?)
dataspot/dataflows-ckan#1