Reading
This section walks through the various inputs/options supported by CSV.File/CSV.read, with notes about compatibility with the other reading functionality (CSV.Rows, CSV.Chunks, etc.).
input
A required argument for reading. Input data should be ASCII or UTF-8 encoded text; for other text encodings, use the StringEncodings.jl package to convert to UTF-8.
Any delimited input is ultimately converted to a byte buffer (Vector{UInt8}) for parsing/processing, so with that in mind, let’s look at the various supported input types:
- File name as a String or FilePath; parsing will call Mmap.mmap(string(file)) to get a byte buffer to the file data. For gzip compressed inputs, like file.gz, the CodecZlib.jl package will be used to decompress the data to a temporary file first, then mmapped to a byte buffer. Decompression can also be done in memory by passing buffer_in_memory=true. Note that only gzip-compressed data is automatically decompressed; for other forms of compressed data, seek out the appropriate package to decompress and pass an IO or Vector{UInt8} of decompressed data as input.
- Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}}: if you already have a byte buffer from wherever, you can pass it in directly. If you have a csv-formatted string, you can pass it like CSV.File(IOBuffer(str)).
- IO or Cmd: you can pass an IO or Cmd directly, which will be consumed into a temporary file, then mmapped as a byte vector; to avoid a temp file and instead buffer data in memory, pass buffer_in_memory=true.
- For files from the web, you can call HTTP.get(url).body to request the file, then access the data as a Vector{UInt8} from the body field, which can be passed directly for parsing. For Julia 1.6+, you can also use the Downloads stdlib, like Downloads.download(url), which can be passed to parsing.
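For example, here are a few ways to hand input to parsing (a minimal sketch; data.csv is an illustrative file name):
using CSV
f = CSV.File("data.csv")                            # file name on disk
f = CSV.File(IOBuffer("a,b\n1,2\n"))                # csv-formatted string
bytes = read("data.csv")                            # Vector{UInt8} byte buffer
f = CSV.File(bytes)
f = CSV.File(`cat data.csv`; buffer_in_memory=true) # Cmd, buffered in memory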
header
The header keyword argument controls how column names are treated when processing files. By default, it is assumed that the column names are the first row/line of the input, i.e. header=1. Alternative valid arguments for header include:
- Integer, e.g. header=2: provide the row number as an Integer where the column names can be found
- Bool, e.g. header=false: no column names exist in the data; column names will be auto-generated depending on the # of columns, like Column1, Column2, etc.
- Vector{String} or Vector{Symbol}: manually provide column names as strings or symbols; should match the # of columns in the data. A copy of the Vector will be made and converted to Vector{Symbol}
- AbstractVector{<:Integer}: in rare cases, there may be multi-row headers; by passing a collection of row numbers, each row will be parsed and the values for each row will be concatenated to form the final column names
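As a quick sketch of each form, using small in-memory inputs:
using CSV
f = CSV.File(IOBuffer("exported data\na,b\n1,2\n"); header=2)  # names on row 2
f = CSV.File(IOBuffer("1,2\n3,4\n"); header=false)             # auto-generated Column1, Column2
f = CSV.File(IOBuffer("1,2\n3,4\n"); header=[:x, :y])          # manual names
f = CSV.File(IOBuffer("a,b\nx,y\n1,2\n"); header=[1, 2])       # rows 1 and 2 concatenated into names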
normalizenames
Controls whether column names will be "normalized" to valid Julia identifiers. By default, this is false. If normalizenames=true, then column names with spaces, or that start with numbers, will be adjusted with underscores to become valid Julia identifiers. This is useful when you want to access columns via dot-access or getproperty, like file.col1. The identifier that comes after the . must be valid, so spaces or identifiers starting with numbers aren’t allowed.
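For example (a minimal sketch):
using CSV
f = CSV.File(IOBuffer("my col,other col\n1,2\n"); normalizenames=true)
f.my_col  # "my col" was normalized to the valid identifier my_col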
skipto
An Integer can be provided that specifies the row number where the data is located. By default, the row immediately following the header row is assumed to be the start of data. If header=false, or column names are provided manually as Vector{String} or Vector{Symbol}, the data is assumed to start on row 1, i.e. skipto=1.
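For example, to skip a units row sitting between the header and the data (a minimal sketch):
using CSV
# names on row 1, data starting on row 3; row 2 (units) is skipped
f = CSV.File(IOBuffer("a,b\nm/s,kg\n1,2\n"); header=1, skipto=3)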
footerskip
An Integer argument specifying the number of rows to ignore at the end of a file. This works by the parser starting at the end of the file and parsing in reverse until footerskip # of rows have been parsed, then parsing the entire file, stopping at the newly adjusted "end of file".
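For example, to ignore a trailing totals row (a minimal sketch):
using CSV
f = CSV.File(IOBuffer("a,b\n1,2\n3,4\ntotals,6\n"); footerskip=1)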
transpose
If transpose=true is passed, data will be read "transposed", so each row will be parsed as a column, and each column in the data will be returned as a row. Useful when data is extremely wide (many columns), but you want to process it in a "long" format (many rows). Note that multithreaded parsing is not supported when parsing is transposed.
comment
A String argument that, when encountered at the start of a row while parsing, will cause the row to be skipped. When providing header, skipto, or footerskip arguments, it should be noted that commented rows, while ignored, still count as "rows" when skipping to a specific row. In this way, you can visually identify, for example, that column names are on row 6, and pass header=6, even if row 5 is a commented row and will be ignored.
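For example (a minimal sketch):
using CSV
# rows beginning with "#" are skipped; with the default header=1, the header
# row becomes the first non-commented row
f = CSV.File(IOBuffer("# sensor export\na,b\n1,2\n"); comment="#")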
ignoreemptyrows
This argument specifies whether "empty rows", where consecutive newlines are parsed, should be ignored or not. By default, they are. If ignoreemptyrows=false, then for an empty row, all existing columns will have missing assigned to their value for that row. Similar to commented rows, empty rows also still count as "rows" when any of the header, skipto, or footerskip arguments are provided.
select / drop
Arguments that control which columns from the input data will actually be parsed and available after processing. select controls which columns will be accessible after parsing while drop controls which columns to ignore. Either argument can be provided as a vector of Integer, String, or Symbol, specifying the column numbers or names to include/exclude. A vector of Bool matching the number of columns in the input data can also be provided, where each element specifies whether the corresponding column should be included/excluded. Finally, these arguments can also be given as boolean functions, of the form (i, name) -> Bool, where each column number and name will be given as arguments and the result of the function will determine if the column will be included/excluded.
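For example, each of these keeps columns a and c from a 3-column input (a minimal sketch):
using CSV
csv = "a,b,c\n1,2,3\n"
f = CSV.File(IOBuffer(csv); select=[1, 3])
f = CSV.File(IOBuffer(csv); select=[:a, :c])
f = CSV.File(IOBuffer(csv); drop=["b"])
f = CSV.File(IOBuffer(csv); select=[true, false, true])
f = CSV.File(IOBuffer(csv); drop=(i, name) -> i == 2)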
limit
An Integer argument to specify the number of rows that should be read from the data. Can be used in conjunction with skipto to read contiguous chunks of a file. Note that with multithreaded parsing (when the data is deemed large enough), it can be difficult for parsing to determine the exact # of rows to limit to, so it may or may not return exactly limit number of rows. To ensure an exact limit on larger files, also pass ntasks=1 to force single-threaded parsing.
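For example, a sketch of reading one contiguous chunk (data.csv is an illustrative file name):
using CSV
# read 100 rows starting at row 100; ntasks=1 guarantees the exact limit
f = CSV.File("data.csv"; skipto=100, limit=100, ntasks=1)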
ntasks
Note: not applicable to CSV.Rows.
For large enough data inputs, ntasks controls the number of multithreaded tasks used to concurrently parse the data. By default, it uses Threads.nthreads(), which is the number of threads the julia process was started with, either via julia -t N or the JULIA_NUM_THREADS environment variable. To avoid multithreaded parsing, even on large files, pass ntasks=1. This argument is only applicable to CSV.File, not CSV.Rows. For CSV.Chunks, it controls the total number of chunk iterations a large file will be split up into for parsing.
rows_to_check
Note: not applicable to CSV.Rows.
When the input data is large enough, parsing will attempt to "chunk" up the data for multithreaded tasks to parse concurrently. The data is split into ntasks even chunks, then each parsing task attempts to identify the correct start of the first row of its chunk. Once the start of the chunk's first row is found, each parser will check rows_to_check number of rows to ensure the expected number of columns are present.
source
Note: only applicable to a vector of inputs passed to CSV.File.
A Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from which each value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the number of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
missingstring
Argument to control how missing values are handled while parsing input data. The default is missingstring="", which means two consecutive delimiters, like ,,, will result in a cell being set as a missing value. Otherwise, you can pass a single string to use as a "sentinel", like missingstring="NA", or a vector of strings, where each will be checked for when parsing, like missingstring=["NA", "NAN", "NULL"], and if any match, the cell will be set to missing. By passing missingstring=nothing, no missing values will be checked for while parsing.
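For example (a minimal sketch):
using CSV
f = CSV.File(IOBuffer("a,b\n1,\n"))  # empty field parses as missing by default
f = CSV.File(IOBuffer("a,b\n1,NA\n2,NULL\n"); missingstring=["NA", "NAN", "NULL"])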
delim
A Char or String argument that parsing looks for in the data input that separates distinct columns on each row. If no argument is provided (the default), parsing will try to detect the most consistent delimiter on the first 10 rows of the input, falling back to a single comma (,) if no other delimiter can be detected consistently.
ignorerepeated
A Bool argument, default false, that, if set to true, will cause parsing to ignore any number of consecutive delimiters between columns. This option can often be used to accurately parse fixed-width data inputs, where columns are delimited with a fixed number of delimiters, or a row is fixed-width and columns may have a variable number of delimiters between them based on the length of cell values.
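For example, a sketch of parsing space-padded, fixed-width-style input:
using CSV
data = "col1  col2  col3\n1     2     3\n"
f = CSV.File(IOBuffer(data); delim=' ', ignorerepeated=true)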
quoted
A Bool argument that controls whether parsing will check for opening/closing quote characters at the start/end of cells. Default true. If you happen to know a file has no quoted cells, it can simplify parsing to pass quoted=false, so parsing avoids treating the quotechar or openquotechar/closequotechar arguments specially.
quotechar / openquotechar / closequotechar
An ASCII Char argument (or arguments if both openquotechar and closequotechar are provided) that parsing uses to handle "quoted" cells. If a cell string value contains the delim argument, or a newline, it should start and end with quotechar, or start with openquotechar and end with closequotechar so parsing knows to treat the delim or newline as part of the cell value instead of as significant parsing characters. If the quotechar or closequotechar characters also need to appear in the cell value, they should be properly escaped via the escapechar argument.
escapechar
An ASCII Char argument that parsing uses when parsing quoted cells and the quotechar or closequotechar characters appear in a cell string value. If the escapechar character is encountered inside a quoted cell, it will be "skipped", and the following character will not be checked for parsing significance, but just treated as another character in the value of the cell. Note the escapechar is not included in the value of the cell, but is ignored completely.
dateformat
A String or AbstractDict argument that controls how parsing detects datetime values in the data input. As a single String (or DateFormat) argument, the same format will be applied to all columns in the file. For columns without type information provided otherwise, parsing will use the provided format string to check if the cell is parseable and if so, will attempt to parse the entire column as the datetime type (Time, Date, or DateTime). By default, if no dateformat argument is explicitly provided, parsing will try to detect any of Time, Date, or DateTime types following the standard Dates.ISOTimeFormat, Dates.ISODateFormat, or Dates.ISODateTimeFormat formats, respectively. If a datetime type is provided for a column, (see the types argument), then the dateformat format string needs to match the format of values in that column, otherwise, a warning will be emitted and the value will be replaced with a missing value (this behavior is also configurable via the strict and silencewarnings arguments). If an AbstractDict is provided, different dateformat strings can be provided for specific columns; the provided dict can map either an Integer for column number or a String, Symbol or Regex for column name to the dateformat string that should be used for that column. Columns not mapped in the dict argument will use the default format strings mentioned above.
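For example (a minimal sketch):
using CSV
# one format applied to any column detected as a datetime type
f = CSV.File(IOBuffer("t\n01/02/2021\n"); dateformat="dd/mm/yyyy")
# per-column formats; unmapped columns fall back to the ISO defaults
f = CSV.File(IOBuffer("d1,d2\n01/02/2021,2021-02-01\n"); dateformat=Dict("d1" => "dd/mm/yyyy"))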
decimal
An ASCII Char argument that is used when parsing float values and indicates where the fractional portion of the float value begins. For example, in the truncated value of pi, 3.14, the '.' character separates the 3 and 14, whereas in 3,14 (common European notation), the ',' character separates the fractional portion. By default, decimal='.'.
groupmark / thousands separator
A "groupmark" is a symbol that separates groups of digits so that it easier for humans to read a number. Thousands separators are a common example of groupmarks. The argument groupmark, if provided, must be an ASCII Char which will be ignored during parsing when it occurs between two digits on the left hand side of the decimal. e.g the groupmark in the integer 1,729 is ',' and the groupmark for the US social security number 875-39-3196 is -. By default, groupmark=nothing which indicates that there are no stray characters separating digits.
truestrings / falsestrings
These arguments can be provided as Vector{String} to specify custom values that should be treated as the Bool true/false values for all the columns of a data input. By default, ["true", "True", "TRUE", "T", "1"] string values are used to detect true values, and ["false", "False", "FALSE", "F", "0"] string values are used to detect false values. Note that even though "1" and "0" can be used to parse true/false values, in terms of auto detecting column types, those values will be parsed as Int64 first, instead of Bool. To instead parse those values as Bools for a column, you can manually provide that column's type as Bool (see the types argument).
types
Argument to control the types of columns that get parsed in the data input. Can be provided as a single Type, an AbstractVector of types, an AbstractDict, or a function.
- If a single type is provided, like types=Float64, then all columns in the data input will be parsed as Float64. If a column's value isn't a valid Float64 value, then a warning will be emitted, unless silencewarnings=true is passed, in which case no warning will be printed. However, if strict=true is passed, then an error will be thrown instead, regardless of the silencewarnings argument.
- If an AbstractVector{Type} is provided, then the length of the vector should match the number of columns in the data input, and each element gives the type of the corresponding column in order.
- If an AbstractDict, then specific columns can have their column type specified, with the key of the dict being an Integer for column number, a String or Symbol for column name, or a Regex matching column names, and the dict value being the column type. Unspecified columns will have their column type auto-detected while parsing.
- If a function, then it should be of the form (i, name) -> Union{T, Nothing}, and will be applied to each detected column during initial parsing. Returning nothing from the function will result in the column's type being automatically detected during parsing.
By default types=nothing, which means all column types in the data input will be detected while parsing. Note that it isn’t necessary to pass types=Union{Float64, Missing} if the data input contains missing values. Parsing will detect missing values if present, and promote any manually provided column types from the singular (Float64) to the missing equivalent (Union{Float64, Missing}) automatically. Standard types will be auto-detected in the following order when not otherwise specified: Int64, Float64, Date, DateTime, Time, Bool, String.
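For example (a minimal sketch):
using CSV, Dates
csv = "a,b,c\n1,2.5,2021-01-01\n"
f = CSV.File(IOBuffer(csv); types=Dict(:a => Float64))     # force only column a
f = CSV.File(IOBuffer(csv); types=[Int64, Float64, Date])  # one type per column
f = CSV.File(IOBuffer(csv); types=(i, name) -> i == 1 ? Int8 : nothing)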
Non-standard types can be provided, like Dec64 from the DecFP.jl package, but must support the Base.tryparse(T, str) function for parsing a value from a string. This allows, for example, easily defining a custom type, like struct Float64Array; values::Vector{Float64}; end, as long as a corresponding Base.tryparse definition is defined, like Base.tryparse(::Type{Float64Array}, str) = Float64Array(map(x -> parse(Float64, x), split(str, ';'))), where a single cell in the data input is like 1.23;4.56;7.89.
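Putting that custom-type example together as a runnable sketch:
using CSV
struct Float64Array
    values::Vector{Float64}
end
# for brevity this throws on malformed input rather than returning nothing
Base.tryparse(::Type{Float64Array}, str) =
    Float64Array(map(x -> parse(Float64, x), split(str, ';')))
f = CSV.File(IOBuffer("arr\n1.23;4.56;7.89\n"); types=Float64Array)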
Note that the default stringtype can be overridden by providing a column’s type manually, like CSV.File(source; types=Dict(1 => String), stringtype=PosLenString), where the first column will be parsed as a String, while any other string columns will have the PosLenString type.
typemap
An AbstractDict{Type, Type} argument that allows replacing a non-String standard type with another type when a column’s type is auto-detected. Most commonly, this would be used to force all numeric columns to be Float64, like typemap=IdDict(Int64 => Float64), which would cause any columns detected as Int64 to be parsed as Float64 instead. Another common case would be wanting all columns of a specific type to be parsed as strings instead, like typemap=IdDict(Date => String), which will cause any columns detected as Date to be parsed as String instead.
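For example (a minimal sketch):
using CSV
csv = "id,score\n1,2\n3,4\n"
# both columns would be detected as Int64; parse them as Float64 instead
f = CSV.File(IOBuffer(csv); typemap=IdDict(Int64 => Float64))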
pool
Argument that controls whether columns will be returned as PooledArrays. Can be provided as a Bool, Float64, Tuple{Float64, Int}, vector, dict, or a function of the form (i, name) -> Union{Bool, Real, Tuple{Float64, Int}, Nothing}. As a Bool, controls absolutely whether a column will be pooled or not; if passed as a single Bool argument like pool=true, then all string columns will be pooled, regardless of cardinality. When passed as a Float64, the value should be between 0.0 and 1.0 to indicate the threshold under which the % of unique values found in the column will result in the column being pooled. For example, if pool=0.1, then all string columns with a unique value % less than 10% will be returned as PooledArray, while other string columns will be normal string vectors. If pool is provided as a tuple, like (0.2, 500), the first tuple element is the same as a single Float64 value, which represents the % cardinality allowed. The second tuple element is an upper limit on the # of unique values allowed to pool the column. So the example pool=(0.2, 500) means: if a string column has 500 or fewer unique values and the # of unique values is less than 20% of the total # of values, it will be pooled; otherwise, it won't. As mentioned, when the pool argument is a single Bool, Real, or Tuple{Float64, Int}, only string columns will be considered for pooling. When a vector or dict is provided, the pooling for any column can be provided as a Bool, Float64, or Tuple{Float64, Int}. Similar to the types argument, providing a vector to pool should have an element for each column in the data input, while a dict argument can map column number/name to Bool, Float64, or Tuple{Float64, Int} for specific columns. Unspecified columns will not be pooled when the argument is a dict.
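For example (a minimal sketch):
using CSV
csv = "x,y\na,1\nb,2\na,3\n"
f = CSV.File(IOBuffer(csv); pool=true)             # pool every string column
f = CSV.File(IOBuffer(csv); pool=0.5)              # pool when < 50% of values are unique
f = CSV.File(IOBuffer(csv); pool=(0.2, 500))       # the default: % threshold plus unique-value cap
f = CSV.File(IOBuffer(csv); pool=Dict(:x => true)) # pool only column x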
downcast
A Bool argument that controls whether columns with detected Integer types will be "shrunk" to the smallest possible integer type. The argument is false by default. Only applies to auto-detected column types; i.e. if a column type is provided manually as Int64, it will not be shrunk. Useful for shrinking the overall memory footprint of parsed data, though care should be taken when processing the results, as Julia by default has integer overflow (wraparound) behavior, which is increasingly likely the smaller the integer type.
stringtype
An argument that controls the precise type of string columns. Supported values are InlineString (the default), PosLenString, or String. The various string types are aimed at being mostly transparent to most users. In certain workflows, however, it can be advantageous to be more specific. Here’s a quick rundown of the possible options:
- InlineString: a set of fixed-width, stack-allocated primitive types. Can take memory pressure off the GC because they aren't reference types/on the heap. For very large files with string columns that have a fairly low variance in string length, this can provide much better GC interaction than String. When string length has a high variance, it can lead to lots of "wasted space", since an entire column will be promoted to the smallest InlineString type that fits the longest string value. For small strings, that can mean a lot of wasted space when they're promoted to a high fixed-width.
- PosLenString: results in columns returned as PosLenStringVector (or ChainedVector{PosLenStringVector} for the multithreaded case), which holds a reference to the original input data, and acts as one large "view" vector into the original data where each cell begins/ends. Can result in the smallest memory footprint for string columns. PosLenStringVector, however, does not support traditional mutable operations like regular Vectors, like push!, append!, or deleteat!.
- String: each string must be heap-allocated, which can result in higher GC pressure in very large files. But columns are returned as normal Vector{String} (or ChainedVector{Vector{String}}), which can be processed normally, including any mutating operations.
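For example (a minimal sketch):
using CSV
csv = "name\nalice\nbob\n"
f = CSV.File(IOBuffer(csv))                           # InlineString columns (default)
f = CSV.File(IOBuffer(csv); stringtype=String)        # plain Vector{String}
f = CSV.File(IOBuffer(csv); stringtype=PosLenString)  # lazy views into the input buffer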
strict / silencewarnings / maxwarnings
Arguments that control error behavior when invalid values are encountered while parsing. Only applicable when types are provided manually by the user via the types argument. If a column type is manually provided, but an invalid value is encountered, the default behavior is to set the value for that cell to missing, emit a warning (i.e. silencewarnings=false and strict=false), but only up to 100 total warnings and then they’ll be silenced (i.e. maxwarnings=100). If strict=true, then invalid values will result in an error being thrown instead of any warnings emitted.
API Reference
CSV.read — Function
CSV.read(source, sink::T; kwargs...) => T
Reads and parses a delimited file or files, materializing directly using the sink function. Allows avoiding excessive copies of columns for certain sinks like DataFrame.
Example
julia> using CSV, DataFrames
julia> path = tempname();
julia> write(path, "a,b,c\n1,2,3");
julia> CSV.read(path, DataFrame)
1×3 DataFrame
Row │ a b c
│ Int64 Int64 Int64
─────┼─────────────────────
1 │ 1 2 3
julia> CSV.read(path, DataFrame; header=false)
2×3 DataFrame
Row │ Column1 Column2 Column3
│ String1 String1 String1
─────┼───────────────────────────
1 │ a b c
2 │ 1 2 3
Arguments
File layout options:
- header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
- normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
- skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
- footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip; they are completely ignored by the parser
- transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
- comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
- ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
- select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
- drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
- limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
- buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
- ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. the JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
- rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows
- source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from which each value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the number of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
Parsing options:
- missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
- delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
- ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
- quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
- quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
- escapechar='"': the Char used to escape quote characters in a quoted field
- dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String, to the format string for that column.
- decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
- groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousand separators (1,000.00).
- truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via the types keyword argument
- stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
Column Type Options:
- types: a single Type, an AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String, to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
- typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
- pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
- downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
- stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
- strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
- silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
- maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
- debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
- validate::Bool=true: whether or not to validate that columns specified in the types, dateformat, and pool keywords are actually found in the data. If false, no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
Iteration options:
- reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only the currently iterated row is "valid")
CSV.File — Type
CSV.File(input; kwargs...) => CSV.File
Read a UTF-8 CSV input and return a CSV.File object, which is like a lightweight table/dataframe, allowing dot-access to columns and iterating rows. Satisfies the Tables.jl interface, so can be passed to any valid sink, yet to avoid unnecessary copies of data, use CSV.read(input, sink; kwargs...) instead if the CSV.File intermediate object isn’t needed.
The input argument can be one of:
- a filename given as a String or FilePaths.jl type
- a Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer
- a CodeUnits object, which wraps a String, like codeunits(str)
- a csv-formatted string can also be passed like IOBuffer(str)
- a Cmd or other IO
- a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing
- a Vector of any of the above, which will parse and vertically concatenate each source, returning a single, "long" CSV.File
To read a csv file from a url, use the Downloads.jl stdlib or HTTP.jl package, where the resulting downloaded tempfile or HTTP.Response body can be passed like:
using Downloads, CSV
f = CSV.File(Downloads.download(url))
# or
using HTTP, CSV
f = CSV.File(HTTP.get(url).body)
Opens the file or files and uses passed arguments to detect the number of columns and column types, unless column types are provided manually via the types keyword argument. Note that passing column types manually can slightly increase performance for each column type provided (column types can be given as a Vector for all columns, or specified per column via name or index in a Dict).
When a Vector of inputs is provided, the column names and types of each separate file/input must match in order to be vertically concatenated. Separate threads are used to parse each input, with each input itself parsed on a single thread. The results of all threads are then vertically concatenated using ChainedVectors to lazily concatenate each thread's columns.
For text encodings other than UTF-8, load the StringEncodings.jl package and call e.g. CSV.File(open(read, input, enc"ISO-8859-1")).
The returned CSV.File object supports the Tables.jl interface and can iterate CSV.Rows. CSV.Row supports propertynames and getproperty to access individual row values. CSV.File also supports entire column access like a DataFrame via direct property access on the file object, like f = CSV.File(file); f.col1. Or by getindex access with column names, like f[:col1] or f["col1"]. The returned columns are AbstractArray subtypes, including: SentinelVector (for integers), regular Vector, PooledVector for pooled columns, MissingVector for columns of all missing values, PosLenStringVector when stringtype=PosLenString is passed, and ChainedVector will chain one of the previous array types together for data inputs that use multiple threads to parse (each thread parses a single "chain" of the input). Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:
for row in CSV.File(file)
println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
By supporting the Tables.jl interface, a CSV.File can also be a table input to any other table sink function. Like:
# materialize a csv file as a DataFrame, copying columns from CSV.File
df = CSV.File(file) |> DataFrame
# to avoid making a copy of parsed columns, use CSV.read
df = CSV.read(file, DataFrame)
# load a csv file directly into an sqlite database table
db = SQLite.DB()
tbl = CSV.File(file) |> SQLite.load!(db, "sqlite_table")
Arguments
File layout options:
- header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
- normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
- skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
- footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip; they are completely ignored by the parser
- transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
- comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
- ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
- select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
- drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
- limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
- buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
- ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. the JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
- rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows
- source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from which each value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the number of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
Parsing options:
- missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
- delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
- ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
- quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
- quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
- escapechar='"': the Char used to escape quote characters in a quoted field
- dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String, to the format string for that column.
- decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
- groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousand separators (1,000.00).
- truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via the types keyword argument
- stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
Column Type Options:
- types: a single Type, an AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map column index Integer, or name Symbol or String, to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
- typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. Dict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
- pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, the proportion of unique values below which String columns should be pooled (meaning that if the # of unique strings in a column is under 25%, pool=0.25, it will be pooled). If provided as a Tuple{Float64, Int} like (0.2, 500), it represents the percent cardinality threshold as the 1st tuple element (0.2), and an upper limit for the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns where the dict key is given as column index Integer, or column name as Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
- downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type like Int8, Int16, Int32, etc.
- stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
- strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
- silencewarnings::Bool=false: if strict=false, whether invalid value warnings should be silenced
- maxwarnings::Int=100: if more than maxwarnings number of warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
- debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
- validate::Bool=true: whether or not to validate that columns specified in the types, dateformat, and pool keywords are actually found in the data. If false, no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
Iteration options:
- reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option if doing collect(CSV.Rows(file)) because only the currently iterated row is "valid")
CSV.Chunks — Type
CSV.Chunks(source; ntasks::Integer=Threads.nthreads(), kwargs...) => CSV.Chunks
Returns a file "chunk" iterator. Accepts all the same inputs and keyword arguments as CSV.File, see those docs for explanations of each keyword argument.
The ntasks keyword argument specifies how many chunks a file should be split up into, defaulting to the # of threads available to Julia (i.e. JULIA_NUM_THREADS environment variable) or 8 if Julia is run single-threaded.
Each iteration of CSV.Chunks produces the next chunk of a file as a CSV.File. While initial file metadata detection is done only once (to determine # of columns, column names, etc), each iteration does independent type inference on columns. This is significant as different chunks may end up with different column types than previous chunks as new values are encountered in the file. Note that, as with CSV.File, types may be passed manually via the type or types keyword arguments.
This functionality is new and thus considered experimental; please open an issue if you run into any problems/bugs.
Arguments
File layout options:
- header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match # of columns in dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
- normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
- skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
- footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip; they are completely ignored by the parser
- transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
- comment::String: string that will cause rows that begin with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
- ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
- select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection or for which the selector function returns true will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
- drop: inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection or for which the drop function returns true will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
- limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
- buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file.
- ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. the JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used for smaller files by default (< 5_000 cells)
- rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows that are checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be higher (10, 30, etc.) to ensure parsing correctly finds these rows
- source: [only applicable for vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the column name that will be added to the parsed columns; the values of the column will be the input "name" (usually file name) of the input from which each value was parsed. As a Pair, the 2nd part of the pair should be a Vector of values matching the number of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
Parsing options:
- missingstring: either a nothing, String, or Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
- delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
- ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
- quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
- quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicate a quoted field which may contain textual delimiters or newline characters
- escapechar='"': the Char used to escape quote characters in a quoted field
- dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string to indicate how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings to indicate how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map column index Int, or name Symbol or String, to the format string for that column.
- decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', or 3,14 uses a comma ','
- groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousand separators (1,000.00).
- truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to Int64 column type unless explicitly requested to be Bool via the types keyword argument
- stripwhitespace=false: if true, leading and trailing whitespace are stripped from string values, including column names
Column Type Options:
- types: a single Type, an AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map a column index Integer, or a name Symbol or String, to the type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
- typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. IdDict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
- pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, a Real gives the proportion of unique values below which String columns should be pooled (e.g. with pool=0.25, a column will be pooled if fewer than 25% of its values are unique). If provided as a Tuple{Float64, Int} like (0.2, 500), the 1st element is the cardinality threshold as a percentage (0.2) and the 2nd is an upper limit on the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be a Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns, where the dict key is a column index Integer, or a column name Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
- downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type, like Int8, Int16, Int32, etc.
- stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; the default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
- strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
- silencewarnings::Bool=false: if strict=false, whether invalid-value warnings should be silenced
- maxwarnings::Int=100: if more than maxwarnings warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
- debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
- validate::Bool=true: whether or not to validate that columns specified in the types, dateformat, and pool keywords are actually found in the data. If false, no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
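For instance, a minimal sketch (column names invented) that pins the types of two columns and forces pooling of a known low-cardinality column:
using CSV

data = IOBuffer("id,category,score\n1,a,3.5\n2,b,4.0")
f = CSV.File(data;
    types = Dict(:id => Int32, :score => Float64),  # fix types for specific columns
    pool = Dict(:category => true))                 # always pool the :category column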
Iteration options:
- reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option when doing collect(CSV.Rows(file)), because only the currently iterated row is "valid")
CSV.Rows — Type
CSV.Rows(source; kwargs...) => CSV.Rows
Read a csv input returning a CSV.Rows object.
The input argument can be one of:
- a filename given as a String or FilePaths.jl type
- a Vector{UInt8} or SubArray{UInt8, 1, Vector{UInt8}} byte buffer
- a CodeUnits object, which wraps a String, like codeunits(str)
- a csv-formatted string, which can also be passed like IOBuffer(str)
- a Cmd or other IO
- a gzipped file (or gzipped data in any of the above), which will automatically be decompressed for parsing
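A brief sketch (the file name is hypothetical) of a few equivalent ways to construct a CSV.Rows iterator:
using CSV

rows = CSV.Rows("data.csv")                  # from a file on disk
rows = CSV.Rows(codeunits("a,b\n1,2\n3,4"))  # from an in-memory string
rows = CSV.Rows(IOBuffer("a,b\n1,2\n3,4"))   # from an IO object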
To read a csv file from a url, use the HTTP.jl package, where the HTTP.Response body can be passed like:
f = CSV.Rows(HTTP.get(url).body)
For other IO or Cmd inputs, you can pass them like: f = CSV.Rows(read(obj)).
While similar to CSV.File, CSV.Rows provides a slightly different interface; the tradeoffs include:
- Very minimal memory footprint; while iterating, only the current row's values are buffered
- Only provides row access via iteration; to access columns, one can stream the rows into a table type
- Performs no type inference; each column/cell is essentially treated as Union{String, Missing}, and users can utilize the performant Parsers.parse(T, str) to convert values to a more specific type if needed, or pass types upon construction using the type or types keyword arguments
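As a short sketch of converting values after iteration (the column name a comes from the sample data):
using CSV, Parsers

for row in CSV.Rows(IOBuffer("a,b\n1,2\n3,4"))
    # each cell is returned as a string; convert on demand
    x = Parsers.parse(Int, row.a)
    println(x + 1)
end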
Opens the file and uses passed arguments to detect the number of columns, but not column types (column types default to String unless otherwise manually provided). The returned CSV.Rows object supports the Tables.jl interface and can iterate rows. Each row object supports propertynames, getproperty, and getindex to access individual row values. Note that duplicate column names will be detected and adjusted to ensure uniqueness (duplicate column name a will become a_1). For example, one could iterate over a csv file with column names a, b, and c by doing:
for row in CSV.Rows(file)
println("a=$(row.a), b=$(row.b), c=$(row.c)")
end
Arguments
File layout options:
- header=1: how column names should be determined; if given as an Integer, indicates the row to parse for column names; as an AbstractVector{<:Integer}, indicates a set of rows to be concatenated together as column names; Vector{Symbol} or Vector{String} give column names explicitly (should match the # of columns in the dataset); if a dataset doesn't have column names, either provide them as a Vector, or set header=0 or header=false and column names will be auto-generated (Column1, Column2, etc.). Note that if a row number header and comment or ignoreemptyrows are provided, the header row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header row will actually be the next non-commented row.
- normalizenames::Bool=false: whether column names should be "normalized" into valid Julia identifier symbols; useful when using the tbl.col1 getproperty syntax or iterating rows and accessing column values of a row via getproperty (e.g. row.col1)
- skipto::Integer: specifies the row where the data starts in the csv file; by default, the next row after the header row(s) is used. If header=0, then the 1st row is assumed to be the start of data; providing a skipto argument does not affect the header argument. Note that if a row number skipto and comment or ignoreemptyrows are provided, the data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the data row will actually be the next non-commented row.
- footerskip::Integer: number of rows at the end of a file to skip parsing. Do note that commented rows (see the comment keyword argument) do not count towards the row number provided for footerskip; they are completely ignored by the parser
- transpose::Bool: read a csv file "transposed", i.e. each column is parsed as a row
- comment::String: a string that will cause rows beginning with it to be skipped while parsing. Note that if a row number header or skipto and comment are provided, the header/data row will be the first non-commented/non-empty row after the row number, meaning if the provided row number is a commented row, the header/data row will actually be the next non-commented row.
- ignoreemptyrows::Bool=true: whether empty rows in a file should be ignored (if false, each column will be assigned missing for that empty row)
- select: an AbstractVector of Integer, Symbol, String, or Bool, or a "selector" function of the form (i, name) -> keep::Bool; only columns in the collection, or for which the selector function returns true, will be parsed and accessible in the resulting CSV.File. Invalid values in select are ignored.
- drop: the inverse of select; an AbstractVector of Integer, Symbol, String, or Bool, or a "drop" function of the form (i, name) -> drop::Bool; columns in the collection, or for which the drop function returns true, will be ignored in the resulting CSV.File. Invalid values in drop are ignored.
- limit: an Integer to indicate a limited number of rows to parse in a csv file; use in combination with skipto to read a specific, contiguous chunk within a file; note that for large files when multiple threads are used for parsing, the limit argument may not result in an exact # of rows parsed; use ntasks=1 to ensure an exact limit if necessary
- buffer_in_memory: a Bool, default false, which controls whether a Cmd, IO, or gzipped source will be read/decompressed in memory vs. using a temporary file
- ntasks::Integer=Threads.nthreads(): [not applicable to CSV.Rows] for multithreaded parsed files, this controls the number of tasks spawned to read a file in concurrent chunks; defaults to the # of threads Julia was started with (i.e. the JULIA_NUM_THREADS environment variable or julia -t N); setting ntasks=1 will avoid any calls to Threads.@spawn and just read the file serially on the main thread; a single thread will also be used by default for smaller files (< 5_000 cells)
- rows_to_check::Integer=30: [not applicable to CSV.Rows] a multithreaded parsed file will be split up into ntasks # of equal chunks; rows_to_check controls the # of rows checked to ensure parsing correctly found valid rows; for certain files with very large quoted text fields, rows_to_check may need to be set higher to ensure parsing correctly finds these rows
- source: [only applicable for a vector of inputs to CSV.File] a Symbol, String, or Pair of Symbol or String to Vector. As a single Symbol or String, provides the name of a column that will be added to the parsed columns; the values of the column will be the input "name" (usually the file name) of the input from which each row was parsed. As a Pair, the 2nd element should be a Vector of values matching the # of inputs, where each value will be used instead of the input name for that input's values in the auto-added column.
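As a hedged sketch (column names invented), selecting a subset of columns and limiting how many rows are parsed:
using CSV

data = IOBuffer("a,b,c\n1,2,3\n4,5,6\n7,8,9")
# parse only columns a and c, and stop after 2 data rows
f = CSV.File(data; select = [:a, :c], limit = 2)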
Parsing options:
- missingstring: either nothing, a String, or a Vector{String} to use as sentinel values that will be parsed as missing; if nothing is passed, no sentinel/missing values will be parsed; by default, missingstring="", which means only an empty field (two consecutive delimiters) is considered missing
- delim=',': a Char or String that indicates how columns are delimited in a file; if no argument is provided, parsing will try to detect the most consistent delimiter on the first 10 rows of the file
- ignorerepeated::Bool=false: whether repeated (consecutive/sequential) delimiters should be ignored while parsing; useful for fixed-width files with delimiter padding between cells
- quoted::Bool=true: whether parsing should check for quotechar at the start/end of cells
- quotechar='"', openquotechar, closequotechar: a Char (or different start and end characters) that indicates a quoted field that may contain textual delimiters or newline characters
- escapechar='"': the Char used to escape quote characters in a quoted field
- dateformat::Union{String, Dates.DateFormat, Nothing, AbstractDict}: a date format string indicating how Date/DateTime columns are formatted for the entire file; if given as an AbstractDict, date format strings indicating how the Date/DateTime columns corresponding to the keys are formatted. The Dict can map a column index Int, or a name Symbol or String, to the format string for that column.
- decimal='.': a Char indicating how decimals are separated in floats, i.e. 3.14 uses '.', while 3,14 uses a comma ','
- groupmark=nothing: optionally specify a single-byte character denoting the number grouping mark; this allows parsing of numbers that have, e.g., thousands separators (1,000.00)
- truestrings, falsestrings: Vector{String}s that indicate how true or false values are represented; by default, "true", "True", "TRUE", "T", "1" are used to detect true and "false", "False", "FALSE", "F", "0" are used to detect false; note that columns with only 1 and 0 values will default to the Int64 column type unless explicitly requested to be Bool via the types keyword argument
- stripwhitespace=false: if true, leading and trailing whitespace is stripped from string values, including column names
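As an illustrative sketch, custom true/false representations can be supplied like:
using CSV

data = IOBuffer("flag\nyes\nno")
f = CSV.File(data; truestrings = ["yes"], falsestrings = ["no"], types = Bool)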
Column Type Options:
- types: a single Type, an AbstractVector or AbstractDict of types, or a function of the form (i, name) -> Union{T, Nothing} to be used for column types; if a single Type is provided, all columns will be parsed with that single type; an AbstractDict can map a column index Integer, or a name Symbol or String, to the type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64, and Dict("column1"=>Float64) will set column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header. If a function is provided, it takes a column index and name as arguments, and should return the desired column type for the column, or nothing to signal the column's type should be detected while parsing.
- typemap::IdDict{Type, Type}: a mapping of a type that should be replaced in every instance with another type, i.e. IdDict(Float64=>String) would change every detected Float64 column to be parsed as String; only "standard" types are allowed to be mapped to another type, i.e. Int64, Float64, Date, DateTime, Time, and Bool. If a column of one of those types is "detected", it will be mapped to the specified type.
- pool::Union{Bool, Real, AbstractVector, AbstractDict, Function, Tuple{Float64, Int}}=(0.2, 500): [not supported by CSV.Rows] controls whether columns will be built as PooledArray; if true, all columns detected as String will be pooled; alternatively, a Real gives the proportion of unique values below which String columns should be pooled (e.g. with pool=0.25, a column will be pooled if fewer than 25% of its values are unique). If provided as a Tuple{Float64, Int} like (0.2, 500), the 1st element is the cardinality threshold as a percentage (0.2) and the 2nd is an upper limit on the # of unique values (500), under which the column will be pooled; this is the default (pool=(0.2, 500)). If an AbstractVector, each element should be a Bool, Real, or Tuple{Float64, Int} and the # of elements should match the # of columns in the dataset; if an AbstractDict, a Bool, Real, or Tuple{Float64, Int} value can be provided for individual columns, where the dict key is a column index Integer, or a column name Symbol or String. If a function is provided, it should take a column index and name as 2 arguments, and return a Bool, Real, Tuple{Float64, Int}, or nothing for each column.
- downcast::Bool=false: controls whether columns detected as Int64 will be "downcast" to the smallest possible integer type, like Int8, Int16, Int32, etc.
- stringtype=InlineStrings.InlineString: controls how detected string columns will ultimately be returned; the default is InlineString, which stores string data in a fixed-size primitive type that helps avoid excessive heap memory usage; if a column has values longer than 32 bytes, it will default to String. If String is passed, all string columns will just be normal String values. If PosLenString is passed, string columns will be returned as PosLenStringVector, which is a special "lazy" AbstractVector that acts as a "view" into the original file data. This can lead to the most efficient parsing times, but note that the "view" nature of PosLenStringVector makes it read-only, so operations like push!, append!, or setindex! are not supported. It also keeps a reference to the entire input dataset source, so trying to modify or delete the underlying file, for example, may fail
- strict::Bool=false: whether invalid values should throw a parsing error or be replaced with missing
- silencewarnings::Bool=false: if strict=false, whether invalid-value warnings should be silenced
- maxwarnings::Int=100: if more than maxwarnings warnings are printed while parsing, further warnings will be silenced by default; for multithreaded parsing, each parsing task will print up to maxwarnings
- debug::Bool=false: passing true will result in many informational prints while a dataset is parsed; can be useful when reporting issues or figuring out what is going on internally while a dataset is parsed
- validate::Bool=true: whether or not to validate that columns specified in the types, dateformat, and pool keywords are actually found in the data. If false, no validation is done, meaning no error will be thrown if types/dateformat/pool specify settings for columns not actually found in the data.
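For example, a minimal sketch mapping every detected Int64 column to be parsed as Float64 instead:
using CSV

data = IOBuffer("x,y\n1,2\n3,4")
f = CSV.File(data; typemap = IdDict(Int64 => Float64))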
Iteration options:
- reusebuffer=false: [only supported by CSV.Rows] while iterating, whether a single row buffer should be allocated and reused on each iteration; only use if each row will be iterated once and not re-used (e.g. it's not safe to use this option when doing collect(CSV.Rows(file)), because only the currently iterated row is "valid")
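A brief sketch of reusebuffer in action; because the single buffer is reused, each row is only valid within its own iteration:
using CSV

# safe: each row is fully processed before the next iteration begins
for row in CSV.Rows(IOBuffer("a,b\n1,2\n3,4"); reusebuffer = true)
    println(row.a)
end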
Utilities
CSV.detect — Function
CSV.detect(str::String)
Use the same logic used by CSV.File to detect column types, to parse a value from a plain string. This can be useful in conjunction with the CSV.Rows type, which returns each cell of a file as a String. The order of types attempted is: Int, Float64, Date, DateTime, Bool, and if all fail, the input String is returned. No errors are thrown. For advanced usage, you can pass your own Parsers.Options type as a keyword argument option=ops for sentinel value detection.
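A quick sketch of the detection order:
using CSV

CSV.detect("101")        # 101 (Int64)
CSV.detect("3.14")       # 3.14 (Float64)
CSV.detect("2021-01-01") # Dates.Date("2021-01-01")
CSV.detect("hello")      # "hello" (no other type matched)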
Common terms
Standard types
The types that are detected by default when column types are not otherwise provided by the user. They include: Int64, Float64, Date, DateTime, Time, Bool, and String.
Newlines
For all parsing functionality, newlines are detected/parsed automatically, regardless of whether they're present in the data as a single newline character ('\n'), a single return character ('\r'), or a full CRLF sequence ("\r\n").
Cardinality
Refers to the ratio of unique values to the total number of values in a column. Columns with "low cardinality" have a low % of unique values; put another way, the column contains only a few unique values, each repeated many times. Columns with "high cardinality" have a high % of unique values relative to the total number of values. Think of these as "id-like" columns, where each (or almost each) value is a unique identifier with no (or few) repeated values.