Commit 1f4099d

update in ms excel compatible formats documentation (#20)
1 parent dedd807 commit 1f4099d

4 files changed

Lines changed: 89 additions & 24 deletions

src/SUMMARY.md

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 - [NDJSON](config/dataset-formats/ndjson.md)
 - [Delta Lake](config/dataset-formats/delta.md)
 - [Arrow](config/dataset-formats/arrow.md)
-- [Xlsx](config/dataset-formats/xlsx.md)
+- [MS Excel compatible formats](config/dataset-formats/excel.md)
 - [Blob store](./config/blob-store.md)
 - [Databases](./config/databases.md)
 - [Postgres wire protocol](postgres.md)
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
# MS Excel compatible formats

ROAPI supports loading several Microsoft Excel compatible formats: xls, xlsx, xlsb, and ods.

## Configuration
To load MS Excel compatible files, specify the table configuration as follows:
```yaml
tables:
  - name: "<table name>"
    uri: "<files path>"
    option:
      format: "<file format>"
      sheet_name: "Sheet1"
      rows_range_start: 2
      rows_range_end: 5
      columns_range_start: 1
      columns_range_end: 6
      schema_inference_lines: 3
```
* **format** - the name of the file format. Currently supported formats:
    * xls (Microsoft Excel 5.0/95 Workbook)
    * xlsx (Excel Workbook)
    * xlsb (Excel Binary Workbook)
    * ods (OpenDocument Spreadsheet)
* **sheet_name** - the name of the sheet that contains the table data. Most files name their first sheet `Sheet1` by default; change `sheet_name` if your sheet uses a different name.
![xlsx_sheet_name](../../images/xlsx_sheet_name.png)
If no `sheet_name` is specified, ROAPI uses the first sheet.
* **Table range options**
    * **rows_range_start** - the first row of the table, which contains the column names. By default, `rows_range_start` is 0 (the first row in the spreadsheet).
    * **rows_range_end** - the last row of the table. By default, ROAPI reads all rows.
    * **columns_range_start** - the first column of the table. By default, `columns_range_start` is 0 (the first column in the spreadsheet).
    * **columns_range_end** - the last column of the table. By default, ROAPI reads all columns.

For example, to take only selected data:
![spread_sheet_range](../../images/spread_sheet_range.png)
the config file looks like:
```yaml
tables:
  - name: "<table name>"
    uri: "<files path>"
    option:
      format: "<file format>"
      sheet_name: "Sheet1"
      rows_range_start: 1
      rows_range_end: 4
      columns_range_start: 1
      columns_range_end: 3
```
* **schema_inference_lines** - the number of rows (inside the table range) to use for schema inference. This number includes the row with column names: for example, `schema_inference_lines: 3` means ROAPI uses the first row to infer column names and the next 2 rows to infer column types. If this option is not specified, ROAPI reads all rows to infer column data types.

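To make the table range options concrete, here is a minimal Python sketch of how they could carve a table out of a sheet. It is not ROAPI's implementation: it assumes, from the descriptions above, that indices are zero-based and that both end bounds are inclusive. The function `select_table_range` and the sample `sheet` are hypothetical names used only for illustration.

```python
from typing import List, Optional

Row = List[object]


def select_table_range(
    sheet: List[Row],
    rows_range_start: int = 0,
    rows_range_end: Optional[int] = None,
    columns_range_start: int = 0,
    columns_range_end: Optional[int] = None,
) -> List[Row]:
    """Return the sub-grid bounded by zero-based, inclusive indices (assumed semantics)."""
    # A missing end bound means "read through the last row / column".
    last_row = len(sheet) - 1 if rows_range_end is None else rows_range_end
    selected = sheet[rows_range_start : last_row + 1]
    if columns_range_end is None:
        return [row[columns_range_start:] for row in selected]
    return [row[columns_range_start : columns_range_end + 1] for row in selected]


# Hypothetical sheet: the table of interest occupies rows 1-4 and columns 1-3.
sheet = [
    ["", "", "", "", ""],
    ["", "id", "name", "price", ""],
    ["", 1, "apple", 0.5, ""],
    ["", 2, "pear", 0.75, ""],
    ["", 3, "plum", 1.25, ""],
    ["", "", "", "", "note"],
]

table = select_table_range(sheet, 1, 4, 1, 3)
header, data = table[0], table[1:]
print(header)  # ['id', 'name', 'price']
print(data)    # the three data rows below the header
```

With these bounds the first selected row supplies the column names and the remaining rows supply the data, which matches the role `rows_range_start` plays in the option list.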
## Schema inference
ROAPI can infer the schema of the data automatically. The first row of the data range holds the column names. After reading the column names, ROAPI infers data types by scanning either all remaining rows or the limited number of rows specified by the `schema_inference_lines` option.
If a column contains more than one data type (for example, float and int), ROAPI uses the Utf8 data type.
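
The inference rules described here can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ROAPI's actual code: `infer_type` and `infer_schema` are hypothetical helpers, only three cell types are modelled, and falling back to Utf8 when no data rows are sampled is a choice made for this sketch.

```python
from typing import List, Optional, Tuple


def infer_type(value: object) -> str:
    """Map one cell to a type name (simplified to three types for this sketch)."""
    if isinstance(value, int):
        return "Int64"
    if isinstance(value, float):
        return "Float64"
    return "Utf8"


def infer_schema(
    table: List[List[object]],
    schema_inference_lines: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Infer (column name, type) pairs from a table whose first row is the header."""
    header, data = table[0], table[1:]
    if schema_inference_lines is not None:
        # The limit counts the header row, so N lines leave N - 1 rows for types.
        data = data[: max(schema_inference_lines - 1, 0)]
    schema = []
    for index, name in enumerate(header):
        seen = {infer_type(row[index]) for row in data}
        # Exactly one observed type is kept; mixed types fall back to Utf8.
        # (An empty sample also yields Utf8, which is this sketch's assumption.)
        schema.append((str(name), seen.pop() if len(seen) == 1 else "Utf8"))
    return schema


table = [
    ["id", "price", "label"],
    [1, 0.5, "a"],
    [2, 1, "b"],  # an int in a float column makes "price" mixed
    [3, 2.5, "c"],
]

print(infer_schema(table))
# [('id', 'Int64'), ('price', 'Utf8'), ('label', 'Utf8')]
print(infer_schema(table, schema_inference_lines=2))
# [('id', 'Int64'), ('price', 'Float64'), ('label', 'Utf8')]
```

The second call shows the trade-off of a small `schema_inference_lines`: sampling fewer rows is cheaper, but a type conflict further down the column goes unnoticed during inference.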

You can also specify the schema in the configuration file. This skips schema inference from the data, so the table loads faster.

```yaml
tables:
  - name: "excel_table"
    uri: "path/to/file.xlsx"
    option:
      format: "xlsx"
    schema:
      columns:
        - name: "int_column"
          data_type: "Int64"
          nullable: true
        - name: "string_column"
          data_type: "Utf8"
          nullable: true
        - name: "float_column"
          data_type: "Float64"
          nullable: true
        - name: "datetime_column"
          data_type: !Timestamp [Seconds, null]
          nullable: true
        - name: "duration_column"
          data_type: !Duration Second
          nullable: true
        - name: "date32_column"
          data_type: Date32
          nullable: true
        - name: "date64_column"
          data_type: Date64
          nullable: true
        - name: "null_column"
          data_type: Null
          nullable: true
```

src/config/dataset-formats/xlsx.md

Lines changed: 0 additions & 23 deletions
This file was deleted.

src/images/spread_sheet_range.png

31.4 KB