Python source

Source code for the llms_txt Python module, which contains helpers to create and use llms.txt files.

Introduction

The llms.txt spec covers files located at the path /llms.txt of a website (or, optionally, in a subpath). llms-sample.txt is a simple example. A file following the spec contains the following sections as markdown, in this order:

- An H1 with the name of the project or site. This is the only required section.
- A blockquote with a short summary of the project, containing key information necessary for understanding the rest of the file.
- Zero or more markdown sections (e.g. paragraphs, lists, etc.) of any type except headings, containing more detailed information about the project and how to interpret the provided files.
- Zero or more markdown sections delimited by H2 headers, containing “file lists” of URLs where further detail is available.
  - Each “file list” is a markdown list, containing a required markdown hyperlink [name](url), then optionally a : and notes about the file.
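To make the ordering concrete, here is a minimal sketch that assembles a file with each of the parts listed above. The project name, summary, notes, and URL are made-up placeholders and are not taken from the spec or from llms-sample.txt.

```python
# Hypothetical content: every name and URL below is a placeholder.
minimal_llms_txt = "\n".join([
    "# Example Project",                          # H1: the only required part
    "",
    "> A one-sentence summary of the project.",   # optional blockquote summary
    "",
    "Free-form notes that contain no headings.",  # optional detail sections
    "",
    "## Docs",                                    # an H2 starts a file list
    "",
    "- [Quick start](https://example.com/quickstart.md): Short overview",
])
print(minimal_llms_txt)
```

Only the first line is mandatory; everything after it can be omitted without leaving the format described above.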

Here’s the start of a sample llms.txt file we’ll use for testing:

from pathlib import Path

samp = Path('llms-sample.txt').read_text()
print(samp[:480])

# FastHTML

> FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore's `FT` "FastTags" into a library for creating server-rendered hypermedia applications.

Remember:

- Use `serve()` for running uvicorn (`if __name__ == "__main__"` is not needed since it's automatic)
- When a title is needed with a response, use `Titled`; note that that already wraps children in `Container`, and already includes both the meta title as well as the H1 element

Reading

We’ll implement parse_llms_file to pull out the sections of llms.txt into a simple data structure.


search

 search (pat, txt, flags=0)

Dictionary of matched groups in pat within txt


named_re

 named_re (nm, pat)

Pattern to match pat in a named capture group


opt_re

 opt_re (s)

Pattern to optionally match s
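The three helpers are documented here only by signature and docstring. The following is a minimal sketch of how they could be written so that they match those descriptions; the module's actual source may differ in detail.

```python
import re

def opt_re(s):
    "Wrap `s` in a non-capturing group and make the whole group optional"
    return f'(?:{s})?'

def named_re(nm, pat):
    "Wrap `pat` in a capture group named `nm`"
    return f'(?P<{nm}>{pat})'

def search(pat, txt, flags=0):
    "Named groups of the first match of `pat` in `txt`, or None if no match"
    m = re.search(pat, txt, flags=flags)
    return m.groupdict() if m else None

year = named_re('year', r'\d{4}')
print(year)                               # (?P<year>\d{4})
print(search(year, 'Released in 2024'))   # {'year': '2024'}
```

Because each helper returns a plain pattern string, the pieces can be concatenated with f-strings and tested one at a time. Returning None when nothing matches is an assumption of this sketch.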

We’ll work “outside in” so we can test the innermost matches as we go.

Parse links

link = '- [FastHTML quick start](https://fastht.ml/docs/tutorials/quickstart_for_web_devs.html.md): A brief overview of FastHTML features'

Parse the first part of a link, the title, into a dict. Searching samp returns the first link in the sample file:

title = named_re('title', r'[^\]]+')
pat =  fr'-\s*\[{title}\]'
search(pat, samp)

{'title': 'internal docs - ed'}

Next, extend the pattern to capture the URL inside the parentheses.

url = named_re('url', r'[^\)]+')
pat += fr'\({url}\)'
search(pat, samp)

{'title': 'internal docs - ed', 'url': 'https://llmstxt.org/ed.html'}

Finally, capture the description that follows the colon. It is optional, so that part of the pattern is wrapped in opt_re.

desc = named_re('desc', r'.*')
pat += opt_re(fr':\s*{desc}')
search(pat, link)

{'title': 'FastHTML quick start',
 'url': 'https://fastht.ml/docs/tutorials/quickstart_for_web_devs.html.md',
 'desc': 'A brief overview of FastHTML features'}

Combine those pieces into a function, parse_link(txt):


parse_link

 parse_link (txt)

Parse a link section from llms.txt

parse_link(link)

{'title': 'FastHTML quick start',
 'url': 'https://fastht.ml/docs/tutorials/quickstart_for_web_devs.html.md',
 'desc': 'A brief overview of FastHTML features'}

parse_link('-[foo](http://foo)')

{'title': 'foo', 'url': 'http://foo', 'desc': None}
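For reference, the three pattern fragments built up above can be written as one regular expression. This is a stdlib-only sketch of a function with the same behaviour on the two inputs shown; it is not necessarily how the module implements parse_link, and returning None for a non-matching line is an assumption.

```python
import re

def parse_link(txt):
    "Parse one `- [title](url): desc` line into a dict (sketch)"
    pat = (r'-\s*\[(?P<title>[^\]]+)\]'   # "- [title]"
           r'\((?P<url>[^\)]+)\)'         # "(url)"
           r'(?::\s*(?P<desc>.*))?')      # optional ": desc"
    m = re.search(pat, txt)
    return m.groupdict() if m else None

print(parse_link('- [foo2](http://foo2): stuff'))
print(parse_link('-[foo](http://foo)'))
```

The colon inside `http://` does not confuse the description group, because the url group consumes everything up to the closing parenthesis first.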

Parse sections

sections = '''First bit.

## S1

-[foo](http://foo)
- [foo2](http://foo2): stuff

## S2

- [foo3](http://foo3)'''

import re

# The capture group keeps each heading text in the split result
start,*rest = re.split(r'^##\s*(.*?$)', sections, flags=re.MULTILINE)
start

'First bit.\n\n'

rest

['S1',
 '\n\n-[foo](http://foo)\n- [foo2](http://foo2): stuff\n\n',
 'S2',
 '\n\n- [foo3](http://foo3)']

Concisely create a dict from the pairs in rest.

d = dict(chunked(rest, 2))
d

{'S1': '\n\n-[foo](http://foo)\n- [foo2](http://foo2): stuff\n\n',
 'S2': '\n\n- [foo3](http://foo3)'}
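If the chunked helper is not available, the same pairing can be done with slicing and zip. The sketch below repeats the split step so that it runs on its own, and relies on the fact that rest alternates heading, body, heading, body.

```python
import re

sections = '''First bit.

## S1

-[foo](http://foo)
- [foo2](http://foo2): stuff

## S2

- [foo3](http://foo3)'''

start, *rest = re.split(r'^##\s*(.*?$)', sections, flags=re.MULTILINE)
# Even indexes of `rest` are headings, odd indexes are the section bodies
d = dict(zip(rest[::2], rest[1::2]))
print(list(d))   # ['S1', 'S2']
```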

links = d['S1']
links.strip()

'-[foo](http://foo)\n- [foo2](http://foo2): stuff'

Parse links into a list of links. There can be multiple newlines between them.

_parse_links(links)

[{'title': 'foo', 'url': 'http://foo', 'desc': None},
 {'title': 'foo2', 'url': 'http://foo2', 'desc': 'stuff'}]
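One way to get this result is to split the section body on runs of newlines and parse each line that looks like a link. The function name parse_links_sketch is hypothetical, and silently skipping lines that do not match is a choice made for this sketch rather than documented behaviour of _parse_links.

```python
import re

LINK_PAT = re.compile(
    r'-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^\)]+)\)(?::\s*(?P<desc>.*))?')

def parse_links_sketch(links):
    "Split `links` on one or more newlines and parse every matching line"
    lines = re.split(r'\n+', links.strip())
    # Lines that do not contain a link (including an empty body) are skipped
    return [m.groupdict() for line in lines if (m := LINK_PAT.search(line))]

links = '\n\n-[foo](http://foo)\n\n\n- [foo2](http://foo2): stuff\n\n'
print(parse_links_sketch(links))
```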

Create a function that uses the above steps to parse an llms.txt file into start and a dict whose keys are the section names (as in d) and whose values are the parsed lists of links.

start, sects = _parse_llms(samp)
start

'# FastHTML\n\n> FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.\n\nRemember:\n\n- Use `serve()` for running uvicorn (`if __name__ == "__main__"` is not needed since it\'s automatic)\n- When a title is needed with a response, use `Titled`; note that that already wraps children in `Container`, and already includes both the meta title as well as the H1 element.'

title = named_re('title', r'.+?$')
summ = named_re('summary', '.+?$')
summ_pat = opt_re(fr"^>\s*{summ}$")
info = named_re('info', '.*')

pat = fr'^#\s*{title}\n+{summ_pat}\n+{info}'
search(pat, start, (re.MULTILINE|re.DOTALL))

{'title': 'FastHTML',
 'summary': 'FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.',
 'info': 'Remember:\n\n- Use `serve()` for running uvicorn (`if __name__ == "__main__"` is not needed since it\'s automatic)\n- When a title is needed with a response, use `Titled`; note that that already wraps children in `Container`, and already includes both the meta title as well as the H1 element.'}
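The summary part of this pattern is optional, which is easiest to see by running the expanded expression on a header with and without a blockquote. The sketch below writes the pattern out by hand; the two sample headers are made up for illustration.

```python
import re

HEADER_PAT = re.compile(
    r'^#\s*(?P<title>.+?$)\n+'            # H1 title, up to the end of its line
    r'(?:^>\s*(?P<summary>.+?$)$)?\n+'    # optional blockquote summary
    r'(?P<info>.*)',                      # everything that remains
    re.MULTILINE | re.DOTALL)

with_summary = '# Demo\n\n> A short summary.\n\nSome notes.'
no_summary = '# Demo\n\nSome notes.'

print(HEADER_PAT.search(with_summary).groupdict())
print(HEADER_PAT.search(no_summary).groupdict())
```

re.MULTILINE lets ^ and $ anchor at each line, and re.DOTALL lets the info group run across line breaks to the end of the text. When the blockquote is missing, summary is None and the notes still land in info.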

Let’s finish it off!


parse_llms_file

 parse_llms_file (txt)

Parse llms.txt file contents in txt to an AttrDict

llmsd = parse_llms_file(samp)
llmsd.summary

'FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore\'s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.'

llmsd.sections.Examples

(#1) [{'title': 'Todo list application', 'url': 'https://raw.githubusercontent.com/AnswerDotAI/fasthtml/main/examples/adv_app.py', 'desc': 'Detailed walk-thru of a complete CRUD app in FastHTML showing idiomatic use of FastHTML and HTMX patterns.'}]
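Putting the steps together, a compact stdlib-only sketch of the whole parse might look like the following. It is an approximation for illustration: parse_llms_sketch is a hypothetical name, SimpleNamespace stands in for the AttrDict that parse_llms_file returns (so sections is a plain dict here), and the sample text is made up.

```python
import re
from types import SimpleNamespace

LINK = r'-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^\)]+)\)(?::\s*(?P<desc>.*))?'
HEAD = r'^#\s*(?P<title>.+?$)\n+(?:^>\s*(?P<summary>.+?$)$)?\n+(?P<info>.*)'

def parse_llms_sketch(txt):
    "Parse llms.txt contents into title, summary, info and sections (sketch)"
    # Split off everything before the first H2, keeping the heading texts
    start, *rest = re.split(r'^##\s*(.*?$)', txt, flags=re.MULTILINE)
    # Pair each heading with its body and parse every link found in the body
    sections = {name: [m.groupdict() for m in re.finditer(LINK, body)]
                for name, body in zip(rest[::2], rest[1::2])}
    head = re.search(HEAD, start.strip(), re.MULTILINE | re.DOTALL).groupdict()
    return SimpleNamespace(**head, sections=sections)

doc = parse_llms_sketch('''# Demo

> A short summary.

Some notes.

## Docs

- [Guide](https://example.com/guide.md): Start here
- [API](https://example.com/api.md)
''')
print(doc.title, '|', doc.summary)
print(doc.sections['Docs'][1])
```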

XML conversion

For some LLMs such as Claude, XML format is preferred, so we’ll provide a function to create that format.


get_doc_content

 get_doc_content (url)

Fetch content from local file if in nbdev repo.


mk_ctx

 mk_ctx (d, optional=True, n_workers=None)

Create a Project with a Section for each H2 part in d, optionally skipping the ‘optional’ section.

ctx = mk_ctx(llmsd)
print(to_xml(ctx, do_escape=False)[:260]+'...')

<project title="FastHTML" summary='FastHTML is a python library which brings together Starlette, Uvicorn, HTMX, and fastcore&#39;s `FT` "FastTags" into a library for creating server-rendered hypermedia applications.'>Remember:

- Use `serve()` for running uvic...
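To illustrate the shape of this output without the library, here is a rough sketch that builds a similar tree with the standard library's xml.etree.ElementTree, swapped in for the to_xml helper used above. Several things are assumptions: the function name to_project_xml, the `doc` tag for each linked file, and the `contents` mapping, which stands in for text that has already been fetched. Unlike the do_escape=False call above, ElementTree escapes special characters.

```python
import xml.etree.ElementTree as ET

def to_project_xml(parsed, contents):
    """Sketch: `parsed` is a plain dict shaped like the parse result;
    `contents` maps each URL to already-fetched text (no network access)."""
    proj = ET.Element('project', title=parsed['title'], summary=parsed['summary'])
    proj.text = parsed['info']
    for name, links in parsed['sections'].items():
        # One element per H2 section; names with spaces would not be valid tags
        sect = ET.SubElement(proj, name.lower())
        for link in links:
            doc = ET.SubElement(sect, 'doc', title=link['title'])  # tag name assumed
            doc.text = contents[link['url']]
    return ET.tostring(proj, encoding='unicode')

parsed = {'title': 'Demo', 'summary': 'A short summary.', 'info': 'Some notes.',
          'sections': {'Docs': [{'title': 'Guide',
                                 'url': 'https://example.com/guide.md',
                                 'desc': 'Start here'}]}}
contents = {'https://example.com/guide.md': 'Guide text'}
xml_out = to_project_xml(parsed, contents)
print(xml_out)
```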


get_sizes

 get_sizes (ctx)

Get the size of each section of the LLM context

get_sizes(ctx)

{'docs': {'internal docs - ed': 34464,
  'FastHTML quick start': 27383,
  'HTMX reference': 26812,
  'Starlette quick guide': 7936},
 'examples': {'Todo list application': 18558},
 'optional': {'Starlette full documentation': 48331}}
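The nested result above can be reproduced on plain data. The sketch below assumes that "size" means the number of characters in each document's text, and uses a hypothetical sizes_sketch function over a made-up dict of section name to {title: content}, rather than the real ctx object.

```python
def sizes_sketch(sections):
    "Map section name -> {document title: number of characters in its text}"
    return {name: {title: len(text) for title, text in docs.items()}
            for name, docs in sections.items()}

# Made-up contents standing in for fetched documents
fetched = {'docs': {'Guide': 'Guide text', 'API': 'API reference text'},
           'optional': {'Changelog': 'v1.0'}}
print(sizes_sketch(fetched))
# {'docs': {'Guide': 10, 'API': 18}, 'optional': {'Changelog': 4}}
```

A breakdown like this helps when deciding whether the 'optional' section is worth including in a limited context window.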

Path('../fasthtml.md').write_text(to_xml(ctx, do_escape=False))


create_ctx

 create_ctx (txt, optional=False, n_workers=None)

A Project with a Section for each H2 part in txt, optionally skipping the ‘optional’ section.


llms_txt2ctx

 llms_txt2ctx (fname:str, optional:bool_arg=False, n_workers:int=None,
               save_nbdev_fname:str=None)

Print a Project with a Section for each H2 part in file read from fname, optionally skipping the ‘optional’ section.

| Parameter | Type | Default | Details |
|---|---|---|---|
| fname | str | | File name to read |
| optional | bool_arg | False | Include ‘optional’ section? |
| n_workers | int | None | Number of threads to use for parallel downloading |
| save_nbdev_fname | str | None | Save output to nbdev `{docs_path}` instead of emitting to stdout |

!llms_txt2ctx llms-sample.txt > ../fasthtml.md